Once of the concepts I found difficult initially when looking at non-relational systems is the concept of the vector clock. Some databases – like Cassandra - use timestamps to work out which is the “latest” transaction. If there are two conflicting modifications to a column value, the one with the highest timestamp will be considered the most recent and the most correct.
Other Dynamo systems use a more complex mechanism known as a vector clock. The vector clock has the advantage of not requiring clock synchronization across all nodes, and helps us identify transactions that might be in conflict.
Despite its name, the vector clock does not include any timestamps. Rather it is composed of a set of counters. These counters are incremented when operations complete, in a similar way to the traditional System Change Number pattern that we are familiar with from relational systems like Oracle. The set contains one counter for each node in the cluster. Whenever an operation occurs on a node, that node will increment its own counter within its vector clock. Whenever a node transmits an operation to another node it will include its vector clock within the request. The transmitted vector clock will include the highest counter for the transmitting node as well is the highest counters from other nodes that the transmitting node has ever seen.
When a node receives possibly conflicting updates from other nodes, it can compare the vector clocks to determine the relative sequencing of the requests. There is a defined set of vector clock operations that can tell if:
- The two vector clocks come from nodes that are completely in sync
- One node is “out of date” with respect of the other node
- The clocks are “concurrent” in that each node has some information that is more up to date than the other node. In this case we can’t choose which update is truly the more correct.
Vector clocks are notoriously difficult to understand, though the underlying algorithm is really quite simple. The diagram below shows an example of three vector clocks incrementing across three nodes. The algorithm is somewhat simplified to improve clarity
In the example the vector clocks start out set to 0 for all nodes (1). Updates to nodes from external clients caused the nodes to increment their own element of the vector clock (2). When these changes are propagated to other nodes, the receiving node updates its vector clock and merges the vector clocks from the other nodes (3). Event (H) occurs when node 2 receives the vector clock (F) from node 1 and (G) from node 3 (4). Each of these vector clocks contain elements higher than the other - vector clock F has the higher value for node 1, while vector clock G has the higher value for node 3. There is no way for node 2 to be sure which of the two vector clocks represent the most up-to-date data - each of the sending nodes “knows” something that the other node does not, and consequently it’s not clear which of the two nodes “knows” best.
For those of us from the world of strictly consistent databases like Oracle, think of the vector clock as a set of System Change Numbers from each system. We examine the SCNs from each node to see if there are nodes that might not have seen all the changes that have been recorded on another node.
The Vector clock in above us that Version G and Version F are conflicting – each contains information from unique updates that could both contain important information. What then, is the system to do? Here are some of the options:
- Revert to last write wins: two updates are unlikely to have occurred at the exact same nanosecond, so one will have a higher timestamp value. We could decide that the highest timestamp “wins”.
- Keep both copies, and require that the application or the user resolve the conflict.
- Somehow merge the data. This is the approach taken by the original Dynamo which managed Amazon’s shopping cart. If there are two conflicting shopping carts they are merged and the worst that can happen (from Amazon’s point of view) is that you buy some things twice. Another merge can occur with things like counters: rather than having one counter increment overwrite another, we can deduce that both operations wanted to increment the counter and increment it twice. A special class of data types: Conflict-Free Replicated Data Type (CRDT) exist that allow these sort of merges to be predefined.
There are advocates for the vector clock – such as the architects of Riak - , and advocates for the timestamp system used in Cassandra. Neither party disagree about the concrete implications of the two approaches: they differ on the desirability of the consequences. Last Write Wins represents a simpler model for the application developer and administrator, Vector clocks allow for conflicts to be identified but which must then be resolved. In a later post I’ll give an example of how you programmatically resolve conflicts in Riak.