Notes On Versioning

The following essay is the main part of a letter sent to Luc Beaudoin, Jan 04. It presents a scheme for converting an arbitrary database to a completely general versioning database.

A quick addendum to our lengthy iChat conversation. The context that I would like to set is that we are representing some complex state (a "document" or "kit") in a relational database (e.g. postgres). The goal is to transform the usual snapshot representation into a format that supports versioning - and we want the result to be reasonably efficient.

We will call these two representations

  1. the snapshot-representation
  2. the versioning-representation

and we are interested in promoting [1] to [2].

The first observation is about the semantics of modification: deletion, update and insertion. In the snapshot-representation we accomplish change by editing our one and only version - irreversibly. By contrast, in the versioning-representation we will never lose any data. We will be able to recover every version of the document ever checked-in.

To do this we are definitely going to have to maintain copies of records. And that means being able to relate all the copies via, say, an ancestral-line id. (The ancestral-line is a kind of record-id shared by all the copies of a record.) To distinguish the different copies we will need something like a revision number. However, the way we will do this is to relate each copy to the SYNC operation that created it - in a minute or so it will become obvious why this is more efficient.

All changes are going to be associated with a synchronization event. We shall keep track of every synchronization in a table rather like an audit trail. Each synchronization will record the time, the user and motivational comments for the changes they made. So when we want to record the cause of a change we can simply refer to an entry in this table.

    sync_event
        sync_id : long        # primary key
        when : date
        who : user_id
        comment : text
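
For concreteness, here is one way this table might be rendered in Postgres. The exact types are my assumptions, and I have renamed the when and who columns because WHEN is a reserved word in SQL:

    -- One possible rendering of sync_event; types are assumptions.
    CREATE TABLE sync_event (
        sync_id    bigint PRIMARY KEY,
        sync_when  timestamp NOT NULL,   -- "when" in the sketch above
        sync_who   bigint NOT NULL,      -- "who", i.e. a user_id
        comment    text
    );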

[N.B. managing unique IDs becomes a pain as soon as we go into a distributed development situation. The best way to sort this out is to cut off the problem at source and allocate a unique number generator for each user. I assume this. I will explain in another letter how to accomplish this effect.]
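
By way of illustration only - this is not necessarily the scheme I will describe in that letter - one simple possibility is to reserve the high bits of each long for the user and the low bits for a counter private to that user, so two generators can never collide:

    -- Illustrative only: user 42 draws IDs from a private sequence
    -- and stamps them with their user number in the high bits.
    CREATE SEQUENCE ids_user_42;
    SELECT (42::bigint << 40) | nextval('ids_user_42');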

For the sake of efficiency, we will separate off the most recent version of a record from all of its ancestors. This saves us from having to revise the primary key and other unique constraints. However, it does mean we will have two essentially identical tables.

For example, if we have a table such as

    person
        name : text
        age : int

then it becomes two tables:
    person_current
        ancestral_line : long
        sync_id : long
        name : text
        age : int
    person_shadows
        ancestral_line : long
        sync_id : long
        name : text
        age : int

N.B. the latter table must have heavily revised unique constraints (especially the primary key).
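
In Postgres, with assumed types, the pair might be declared like this. Note how the primary key of the shadows widens to the (ancestral_line, sync_id) pair:

    CREATE TABLE person_current (
        ancestral_line  bigint PRIMARY KEY,
        sync_id         bigint NOT NULL REFERENCES sync_event,
        name            text,
        age             int
    );

    CREATE TABLE person_shadows (
        ancestral_line  bigint NOT NULL,
        sync_id         bigint NOT NULL REFERENCES sync_event,
        name            text,
        age             int,
        PRIMARY KEY (ancestral_line, sync_id)
    );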

Thus far, we have enough wiggle room to represent a linear stream of updates (i.e. no branches). However, when we add in branching we will need more than _one_ current table. Let's put that to one side for now and go through the modification operators in this basic situation.

Deletion is implemented by copying the current record to the shadows and then deleting it. Note that this is slightly unsatisfactory in the sense that the deletion is not represented explicitly in our database - there are other schemes that fix this, but this scheme is more elegant. Update is slightly more complicated: we copy the current record to the shadows (of course) and then overwrite the data, setting the sync_id to that of the current synchronization. Insertion is the simplest of all - we generate a new record with a brand new ancestral_line, linked to the sync_id of the current synchronization.
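
In SQL the three operators come out roughly as follows. The :parameters stand for values supplied by the application, with :sync being the sync_id of the current synchronization:

    -- Deletion: copy the current record into the shadows, then remove it.
    INSERT INTO person_shadows
        SELECT * FROM person_current WHERE ancestral_line = :line;
    DELETE FROM person_current WHERE ancestral_line = :line;

    -- Update: shadow the old copy, then overwrite the data and stamp
    -- the record with the current synchronization.
    INSERT INTO person_shadows
        SELECT * FROM person_current WHERE ancestral_line = :line;
    UPDATE person_current
        SET name = :name, age = :age, sync_id = :sync
        WHERE ancestral_line = :line;

    -- Insertion: a brand new ancestral line, stamped with the current sync.
    INSERT INTO person_current (ancestral_line, sync_id, name, age)
        VALUES (:new_line, :sync, :name, :age);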

At this point we have a nice linear model of evolution. We operate on our data almost exactly as before and, almost magically, we establish a trail from which an arbitrary version can be reconstructed. Furthermore, because you are not obliged to refer to the shadows during a synchronization, it is easy to see that in any particular application you don't actually have to implement _any_ shadow tables unless you want to provide rollback to earlier sync states. (Of course, that does mean that your sync conflict messages are less informative than they could be. But they are not reduced to idiocy.)

Before moving on to branches, let's look at the different requirements of the client and the server. The server has the responsibility of maintaining the entire state - so it needs to retain everything. By contrast, the client only needs to check out a snapshot and sync, so the client has no need for shadow tables. That's good news! We didn't really want to download the entire state PLUS its history!

Now let's add in branches. When we create a new branch we are effectively taking a complete copy of the entire model and allowing it to evolve independently. There is a very simple and direct way of achieving that effect: we add the branch id to every one of our tables so far.

In the case of our person table, this simple-minded approach would look like this:

    person_current_all_branches
        branch_id : long
        ancestral_line : long
        sync_id : long
        name : text
        age : int
    person_shadows_all_branches
        branch_id : long
        ancestral_line : long
        sync_id : long
        name : text
        age : int

Note that we have to widen the unique constraints in both tables, again. However, when a new branch is created, it is possible to share information between branches - which is essential to maintain any semblance of efficiency.

But if we were hoping to do any queries over our table, we have lost any hope of being efficient. Not only have our primary keys all become composite but the sharing of branch information has destroyed our ability to use their indices. (Because now we want to search for records with different branch_ids.) We can recover some of that, at a reasonable cost, by saying that creating a branch makes copies of every current entry of the parent branch. The cost is roughly equivalent to storing a snapshot of the entire document, of course, and I suggest this is typically well worth it.
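
The branch-creating copy is then a single statement, sketched here with :parent and :child standing for the two branch ids:

    -- Create branch :child by copying every current record of :parent,
    -- so the child can thereafter be queried by its own branch_id.
    INSERT INTO person_current_all_branches
        (branch_id, ancestral_line, sync_id, name, age)
    SELECT :child, ancestral_line, sync_id, name, age
    FROM person_current_all_branches
    WHERE branch_id = :parent;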

Fortunately, we know that in practice we have relatively few active branches in progress at the same time. So we can afford to represent the heads of development separately. If we suppose we want to support 3 active branches, then we would have, for our person table, 5 versions of the same table. Three would provide efficient access to the heads of the three mainlines of development - the other two would hold the current and historical state of every branch.

    person_active_branch_1   
        branch_id : long        # shared by all records in this table
        ancestral_line : long   # unique in this table
        sync_id : long
        name : text
        age : int    
    person_active_branch_2
        branch_id : long        # shared by all records in this table
        ancestral_line : long   # unique in this table
        sync_id : long
        name : text
        age : int
    person_active_branch_3
        branch_id : long        # shared by all records in this table
        ancestral_line : long   # unique in this table
        sync_id : long
        name : text
        age : int
    person_current_all_branches
        branch_id : long        # } this pair is
        ancestral_line : long   # } unique
        sync_id : long
        name : text
        age : int
    person_shadows_all_branches
        branch_id : long        # } this 
        ancestral_line : long   # } triple
        sync_id : long          # } is unique
        name : text
        age : int

It is a bit of a shame that we need to limit ourselves to a set number of active lines. But that's just the active sets that are "efficient" for querying. The records in person_active_branch_X are going to have a duplicate in person_current_all_branches. So the 3 active tables are really cached representations of a few chosen branches, and the whole thing boils down to cache management.
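
Filling the cache is a single copy. Here is a sketch, with :branch standing for the branch being promoted to active status:

    -- Nominating a branch as "active" just fills a cache: copy its
    -- head out of the shared current table into its own table.
    INSERT INTO person_active_branch_1
    SELECT branch_id, ancestral_line, sync_id, name, age
    FROM person_current_all_branches
    WHERE branch_id = :branch;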

This might seem to be an awesome expansion but, as you can see, there is plenty of flexibility in this scheme. You don't have to have the shadow set. You don't have to have the active set(s). And the clients only need the original person table! And by demanding that creating branches makes a copy of all current entries, it is not even necessary to have any active sets.

If we revisit the delete, update, insert operators in this new scheme we see that they get slightly slower when there are active sets. They have to manipulate both the active set and the current set (i.e. it is a write-thru cache). But querying of an active set is every bit as fast as it was *and* the SELECT queries don't even have to be modified.
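
For instance, an update written through the active table for branch 1 might look like this; the shadow copy is taken exactly as before:

    -- Write-thru: the cache and the backing table are kept in step.
    UPDATE person_active_branch_1
        SET name = :name, age = :age, sync_id = :sync
        WHERE ancestral_line = :line;
    UPDATE person_current_all_branches
        SET name = :name, age = :age, sync_id = :sync
        WHERE branch_id = :branch AND ancestral_line = :line;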

Synchronization is potentially quite a complicated dance. However, assuming a user checks out the head of an active branch for editing, the synchronization boils down to comparing the sync_ids of all checked-out records with the matching records in the updated head of the active branch. If there is no difference then the record has not been changed by either party. If there is a difference then the client must refer to the sync_id of the record as it was checked out to discover who is responsible for the change - which can be either party or both. Furthermore, we have to handle inserts and deletes (i.e. records that have been added to or deleted from the checked-out set or the active head).
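
As a sketch, the comparison can be written as one query against a hypothetical client-side table, checkout, holding the records as they stood at check-out time:

    -- Classify each record by comparing the head against the check-out.
    SELECT COALESCE(head.ancestral_line, co.ancestral_line) AS line,
           CASE WHEN co.ancestral_line IS NULL   THEN 'inserted at head'
                WHEN head.ancestral_line IS NULL THEN 'deleted at head'
                WHEN head.sync_id <> co.sync_id  THEN 'changed'
                ELSE 'unchanged'
           END AS status
    FROM person_active_branch_1 AS head
    FULL OUTER JOIN checkout AS co
        ON head.ancestral_line = co.ancestral_line;

The client's own edits are classified the same way, by comparing its working set against the same check-out copy.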

When we detect a change, we must refer to our synchronization policy to determine how to handle it: to automatically resolve it or to request user assistance. The latter is called a synchronization-conflict. Naturally, this is where all the serious problems start - when you involve a human being you must try to rise to their level of cognition. This is the UI problem.

If you are in a very simple domain, you may be able to translate a record-level conflict into an intelligible request for help. But in more complex domains, you may be into a very difficult area of trying to reconstruct the big picture. When this happens, a good, simple option is to take precise control over the order in which tables are searched for conflicts and make sure that "big" conflicts are detected before "little" conflicts.

Failing that, you must assert "big picture" propositions that make interpretation of the conflict straightforward. And failing that, you must allow users to start up new branches.