Monday, April 03, 2006

The Delta Web

1) Overview

I’ve started a new project called “the delta web”. The idea is to define a markup language called "delta" for describing changes to Web-based documents such as Atom feeds. Delta lets you capture the notion of “change over time” or “work done” to content on the Web.

The motivation for delta is that there’s no simple way to describe how things change on the Web. I know that’s sort of a generalization. However, people create documents and publish them to URI’s, and then, when they change these documents, they usually just overwrite the originals, or create new versions with new URI’s. To see what’s changed, others have to compare versions to find “diffs”, or else look for time stamps within documents to get hints on when individual entities were added or removed.

Delta provides a new way for people to create standalone documents that explicitly describe changes to Web documents. It’s different from Microsoft's approach called SSE (Simple Shared Extensions to RSS), in that delta captures and stores change descriptions in their own documents, rather than being extensions to existing Web documents. More on how delta works later…

Simplicity is a key characteristic of delta. Delta is a markup language for describing changes. It is not a protocol or API, or a set of rules describing what to do with delta documents. Certainly, the delta web project will strive to explore different usage scenarios, but it's not about defining how people should collaborate in creating, or otherwise using delta documents.

One of the practical applications of delta is describing changes to feeds. People can create delta documents to describe changes to their feeds; so that others can see what changed explicitly. People can also create chains of deltas, each referencing the preceding delta, or they can stick to their habits of creating versions of feed documents, and augmenting this practice by using deltas to describe the changes to each version.


2) Core Concepts

Let’s say an application makes a set of changes to a Web-based XML document. Normally, such an application would store the result of these changes as a new version of the document, and notify others of the changes. Delta makes it possible for such an application to record the descriptions of the changes as they occur, and save them in the form of a delta document.

Some might say that delta provides redundant information, given the presence of old and new versions of the changed document. This isn’t quite true. A delta document describes when and how each of the changes was made. With a delta, an application can reapply changes, or a subset of changes, or can combine the elements of multiple deltas to create new deltas. This isn’t possible if all an application has to work with are discrete versions of documents made at different points in time.



A delta document referencing a start document.
Using an update mechanism, one can derive an end document using a delta.


In order to get things started, I've created a "strawman" XML schema for delta, and have provided an example (see below). A delta document contains a sequence of “add” and “remove” operations, in addition to a reference to a “start” or “end” document. Each delta operation has a precise date/time stamp, as well as an XPath reference to a location in the document, and an ability to hold a copy or reference to the value in question. An application can process a delta document by applying the set of changes in “forward” manner to a referenced “start” document, or in “reverse” manner to a referenced “end” document.

Since the application of delta operations changes the state of the referenced document, each operation is understood to be meaningful in the context of all preceding delta operations having been applied. This is important because, for example, a change operation may add an XML element to a document, and the next operation may reference a portion of this just-added element. Without having processed the sequence of change ops prior to a particular operation, the context, or state, of the document is not necessarily valid for that operation.

An update mechanism can apply a set of changes specified in a delta document to the start/end document, to produce a new updated document. People have the freedom of creating an updated document, and then discarding one or both of the original and the delta document. However, this results in a loss of information, namely the work that produced the updated document. Even if the original document and updated document are saved, and the delta document is discarded, the set of changes cannot be accurately derived, since there are many ways in which a set of changes can produce the same result. The only way to preserve the history of changes is by keeping the delta documents, and the original documents that they reference. Keep in mind that there are potentially many ways of preserving delta documents. One extreme approach is to keep a single monster-sized delta document versus a sequence of delta and intermediate documents. This raises the classic trade off between processing versus memory resources.

Some advantages of the delta markup language are as follows:

  1. Able to non-intrusively describe changes to any XML document, including those governed by schemas that do not allow extensions,

  2. Allows changes to be described without bloating the original data, or corrupting the context of the original data by inter-mixing change data with the underlying data,

  3. Is flexible enough to allow users to decide if and when they want to store deltas or process deltas and store document versions instead.


Some of the things that delta explicitly does not address:

  1. The rules for how one merges two delta documents that reference the same start/end document,

  2. The rules for how one applies delta operations in one delta to another delta that references a different start/end document,

  3. The rules for determining if conflicts exist among deltas, as well as how conflicts are to be resolved.


3) Example

The example involves an Atom feed located at a fictitious web site at the URI: http://www.somewhere.com/atom1.xml. For now, we’ll restrict the discussion to “forward” processing on “start” documents, rather than “reverse” processing on “end” documents.


<?xml version="1.0" encoding="utf-8"?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en-US'>
<title>Example Feed</title>
<subtitle>Insert witty or insightful remark here</subtitle>
<link href='http://example.org/' />
<updated>2003-12-13T18:30:02Z</updated>
<author>
<name>John Doe</name>
</author>
<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
<entry>
<title>Atom-Powered Robots Run Amok</title>
<link href='http://example.org/2003/12/13/atom03' />
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2003-12-13T18:30:02Z</updated>
<summary>Some text.</summary>
</entry>
</feed>


Document: atom1.xml


The change in question is simple. We want to add an <entry> element and change the <updated> element. The delta document representing these changes appears as follows, and we’ll give it the fictitious URI: http://www.somewhere.com/delta1.xml.


<?xml version="1.0" encoding="utf-8"?>
<delta xmlns='http://www.delta.org/2006/Delta' version='0.1'
xmlns:xhtml='http://www.w3.org/1999/xhtml'
xmlns:atom='http://www.w3.org/2005/Atom'>
<updated>2006-03-31T11:42:55-05:00</updated>
<start>http://www.somewhere.com/atom1.xml</start>
<operations>
<add id='1'>
<date>2006-03-31T11:42:54-05:00</date>
<path>//atom:feed/atom:entry[2]</path>
<value>
<atom:entry>
<atom:id>tag:intertwingly.net,2004:2180</atom:id>
<atom:link rel='alternate' />
<atom:title>Bridge Crossing Puzzle</atom:title>
<atom:summary>My daughter was given a puzzle.</atom:summary>
<atom:content type='xhtml'>
<xhtml:div>
<xhtml:p>My daughter was given a puzzle as extra credit.</xhtml:p>
</xhtml:div>
</atom:content>
<atom:updated>2006-03-31T11:42:54-05:00</atom:updated>
</atom:entry>
</value>
</add>
<remove id='2'>
<date>2006-03-31T11:42:53-05:00</date>
<path>//atom:feed/atom:updated</path>
</remove>
<add id='3'>
<date>2006-03-31T11:42:54-05:00</date>
<path>//atom:feed/*[4]</path>
<value>
<atom:updated>2006-03-31T11:42:54-05:00</atom:updated>
</value>
</add>
</operations>
</delta>


Document: delta1.xml


The <delta> element contains an <operations> element that contains a sequence of <add> and <remove> elements. These <add> and <remove> elements specify operations that are to be applied to the “start” document, as identified by the <start> element. The <start> element must have a valid URI format, but it can also be missing, indicating that the operations are to be applied to an empty document. The order of the sequence of <add> and <remove> elements corresponds to the order of operations. Each <add> and <remove> element has an “id” attribute that must be unique within the context of the parent <operations> element. The value of an id attribute does not necessarily have to be a valid URI, nor does it have to be globally unique by way of a mechanism such as “tag” URI format (RFC 4151), for example. This is because an id is valid only in the context of the sequence of operations specified by the enclosing delta.

For “remove” operations, we need to reference the path to the instance of the object in the start document that's being removed. The term object refers to element and/or attribute, because initially we'll focus on XML representations of start documents. For an element being removed, the path is an identifier to the instance of the element, and for an attribute, the path is an identifier specifying both the name of the attribute, and a particular element instance. It’s possible to specify a path that resolves to multiple object instances, in which case, all of them are operated upon. The language for specifying paths to elements and attributes is XPath, although the specification could easily allow other languages.

When an element is removed, the portion from, including the start tag, up to, including the end tag, is removed. For an attribute, the name and value are removed from the element. If a user wants to remove just the value of an attribute or change its value to a new value, this is accomplished by first removing the attribute, and then adding it with a null or different value. The goal is to keep the nature of “remove” operations consistent according to the KISS principle. Here, removal corresponds to complete eradication of the data, including its metadata.

For “add” operations, we need to reference the location in the start document where an object instance is to be added. For element locations, this is the path to an element instance, in addition to the position within that element’s child elements, if the element supports multiple values. The following illustrates a reference to the first <entry> element in a <feed> element, and furthermore, the second position among the <entry> element's children: /feed/entry[1]/*[2].

For attribute locations, we reference the name of the attribute in a specific instance of an element. Finally, for add operations, we include the value that is added, as the content of a <value> element. The value can be specified explicitly, or alternatively, as a reference to a block of data that resides in an external document, which is identified by a URI and a path to the object’s instance. For now, we'll focus on specifying values explicitly, in-line, as either a list of one or more elements, or as a text value suitable for XML attributes.

Both <add> and <remove> elements have a child <date> element specified in RFC 3339 format. Dates are not critical in determining the order in which to process a sequence of operations, but they are critical when comparing the operations in multiple delta documents that reference the same start document.

The following illustrates the new document that is produced by applying the changes specified by our example delta document.


<?xml version="1.0" encoding="utf-8"?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en-US'>
<title>Example Feed</title>
<subtitle>Insert witty or insightful remark here</subtitle>
<link href='http://example.org/' />
<updated>2006-03-31T11:42:54-05:00</updated>
<author>
<name>John Doe</name>
</author>
<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
<entry>
<title>Atom-Powered Robots Run Amok</title>
<link href='http://example.org/2003/12/13/atom03' />
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2006-03-31T11:42:54-05:00</updated>
<summary>Some text.</summary>
</entry>
<entry>
<id>tag:intertwingly.net,2004:2180</id>
<link href='http://www.intertwingly.net/blog/2006/03/31/Bridge-Crossing-Puzzle' rel='alternate' />
<title>Bridge Crossing Puzzle</title>
<summary>My daughter was given a puzzle.</summary>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<p>My daughter was given a puzzle as extra credit.</p>
</div>
</content>
<updated>2006-03-31T11:42:54-05:00</updated>
</entry>
</feed>


Document: atom2.xml


4) Collaboration

The simplest form of collaboration with delta documents involves merging two delta documents that reference the same start document. A merging mechanism needs to perform a chronology-based interleaving of the operations in each delta to form a single delta. During this process, the mechanism must detect if each operation has become invalid due to the merge. This happens if an operation’s path into the start document cannot be resolved, assuming that preceding delta operations have been applied to the start document. Finally, this merge mechanism must assign new ids to the merged operation elements. If the mechanism detects an invalid operation, it should throw an exception.



Merging Delta Documents that reference the same Start Document



The next level of collaboration involves applying operations from one delta document to another, in the case where the two start documents are different. Here, the challenge is to decipher the meaning of each “to-be-applied” operation in the context of the new start document. This is different from merging, where the start documents are the same. Each operation must first be examined in the context of its own start document, and then the merge mechanism must determine if it’s possible to create a new operation in the new delta document that has an equivalent meaning.

It’s possible for an operation in the first delta document to have a reference to a start document object that makes no sense in the context of the new start document. If this is the case, then the merge mechanism should throw an exception.



Applying a set delta operations from one delta to another, with different start documents


5) Additional Functionality

So far, we’ve only allowed delta documents to reference start documents. Next, we introduce “extension” references, which allow a delta document to extend, by continuation, the operations of another delta document. A delta document that specifies an extension does not need to specify a start document, since the start document is the same as the one specified in the referenced delta.





A second delta document extends a first delta document


An <extends> element specifies a URI that references a delta document that’s being extended. The operations in the extending delta are assumed to follow those in the delta being extended. Following is an example of a delta document that extends the delta shown earlier for our Atom feed start document.


<?xml version="1.0"?>
<delta xmlns='http://www.delta.org/2006/Delta' version='0.1'>
<updated>2006-04-01T11:42:55-05:00</updated>
<extends>http://www.somewhere.com/delta1.xml</extends>
<operations></operations>
</delta>



A delta document that extends another delta document.


Earlier, we saw that the ids on operations must be unique only in the context of the <operations> element of a delta. This means that ids on operations in an extension delta can potentially be the same as ids in the referenced delta document. This is OK, because the notion of “id” in a global sense is the combination of the URI of each delta document, followed by the local id used by the operation. The URI makes these operation ids unique in the global context.

A delta document followed by a chain of extension delta documents is logically equivalent to a single delta document. In other words, one could process the chain of deltas, including the root delta that points to the start document, and produce a single delta document that includes all of the operations from the individual extension deltas. This means that the merge and apply mechanisms described in the previous section apply to a chain of extension deltas in the same manner as they apply to a single delta document.


6) For Further Thought

6.1) Linked Operations

For now, delta supports only the two primitive operations, “add” and “remove”, in order to satisfy the KISS principle. It’s possible to consider expanding upon these operations in order to support “copy/paste” and “cut/paste”.

In order to specify a “copy/paste” operation, a user would effectively need the ability to specify an “add” operation that would include an additional reference to a “from” location in the start document. This location would be a place where the data would be copied from, as opposed to the “add” operation getting its data from its <value> element.

In order to specify a “cut/paste” operation, a user would need the ability to use a pair of “remove” and “add” operations, whereby the “add” operation would reference a “from” location that coincided with the path specified in the “remove” operation. In other words, the data being removed via the “remove” operation would be the same data that’s added by the “add” operation, but in a different place.

Having a “from” location property on an “add” operation would also enable a set of “add” operations to put copies of the same block of data in multiple places. A first “add” operation would place the data at a location specified by its <path> element, and subsequent “add” operations would reference the location of the added data as the “from” source for its own data. Alternatively, one could use a single “add” operation with an XPath value that resolved to multiple locations.

6.2) Cross-Delta Operations

With the linking capability specified in the previous section, by way of the proposed “from” reference, it’s conceivable that a delta operation could contain a “from” reference to a location outside the start document. This would allow a delta operation to perform a “copy” from a document other than the start document, followed by a “paste” into the start document.

6.3) Nested Operations

The concept of operation nesting would involve allowing a new <operation> element to enclose a list of child operations. This would allow processors to identify groups of related operations by a common id. For example, one could enclose a set of <add> elements with an <operation> element, and give the <operation>element an “id” attribute.

6.4) Multiple Deltas in a Delta Document

There might be times when a single delta document would benefit by being able to contain multiple <delta> elements. This would allow a single delta document to specify changes for a number of start documents, since each <delta> element would be free to specify its own start document. In order to enable this, we would need to define a <deltaset> element that would enclose some number of <delta> elements, and reside at the root of a delta document. This has ramifications on referencing that would need to be addressed.

6.5) Roll-forward and Roll-back Processing

As we mentioned previously, delta supports both roll-forward and roll-back processing. This section describes the requirements to support these scenarios in more detail.

By supporting the <value> element in the “remove” operation, it would be possible to add support for “roll-back” processing in addition to the “roll-forward” processing we’ve already discussed. In order to perform “roll-back” processing, there would have to be an additional <end> element as well as the <start> element, in order to hold a URI reference to the final finished, or “end-state” document. The idea would be that the delta operations could be processed in the forward chronological order against the referenced start document (as we’ve discussed), or they could be processed in reverse chronological order against the end document, to produce the start document. With this dual processing mechanism in place, one could create a “one sided” delta document that would either reference only a start or end document, or a “two sided” delta document that would reference both a start and end document.

The reason we’d need to store a value in the <value> element for “remove” operations, is that in order to perform the opposite of the "remove" operation against the end document, the processor would have to be able to get the removed element from the delta "remove" operation and stick it back into the end document. With one-sided delta documents, either all the “add” or “remove” operations would all have to store values. In the two-sided case, both the “add” and “remove” operations would have to store values.

7) References

I used a number of references to research this project. The following list is far from complete, but indicates some of the areas that I looked into.

  1. Simple Sharing Extensions for RSS and OPML
    http://msdn.microsoft.com/xml/rss/sse/

  2. Atom Publishing Protocol
    http://atompub.org/2005/07/11/draft-ietf-atompub-format-10.html

  3. Code Merging Algorithms – RevCtrl WiKi
    http://revctrl.org/FrontPage

  4. Criss-Cross Merge Issues
    http://www.gelato.unsw.edu.au/archives/git/0504/2279.html

  5. Delta: an ontology for the distribution of differences between RDF graphs
    http://www.w3.org/DesignIssues/Diff

  6. DeltaXML – Diff/Merge for XML Documents
    http://www.deltaxml.com/products/core/index.html

  7. Wikipedia change visualization applications – Jon Udel commentary
    http://weblog.infoworld.com/udell/2005/06/24.html

  8. Phil Stanhope – Adesso Systems – Some observations on SSE
    http://componentry.com/blogs/phil/archives/000164.html


This page is powered by Blogger. Isn't yours?