Monday
Jul042016

Join performance in MongoDB 3.2 using $lookup

One of the key tenets of MongoDB schema design is to account for the absence of server-side joins.  Data is joined all the time inside application code, of course, but traditionally there's been no way to perform joins within the server itself. 

This changed in 3.2 with the introduction of the $lookup operator within the aggregation framework.  $lookup performs the equivalent of a left outer join: it retrieves matching documents from another collection, and returns null data if no match is found.

Here's an example using the MongoDB version of the Sakila dataset that I converted from MySQL back in this post.

   1: db.films.aggregate([
   2:     {$match:{"Actors.First name" : "CHRISTIAN",
   3:              "Actors.Last name" : "GABLE"}},
   4:     {$lookup: {
   5:         from: "customers",
   6:         as: "customerData",
   7:         localField: "_id",
   8:         foreignField: "Rentals.filmId"
   9:     }},
  10:     {$unwind: "$customerData"},
  11:     {$project:{"Title":1,
  12:                "FirstName":"$customerData.First Name",
  13:                "LastName" :"$customerData.Last Name"}},
  14: ])

 

What we're doing here is finding all customers who have ever hired a film starring "Christian Gable".  We start by finding those films in the films collection (lines 2-3), then use $lookup to retrieve customer data (lines 4-9).  Films embeds actors in the "Actors" array; the customers collection embeds the films that have been hired in the "Rentals" array. 

The result of the join contains, for each film, all the customers who have borrowed it, returned as an array, so we use the $unwind operator to "flatten" that array out (line 10).  The resulting output looks like this:

{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "SUSAN", "LastName" : "WILSON" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "REBECCA", "LastName" : "SCOTT" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "DEBRA", "LastName" : "NELSON" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "MARIE", "LastName" : "TURNER" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "TINA", "LastName" : "SIMMONS" }

One thing we need to be careful about here is join performance.  The $lookup stage is going to be executed once for each document returned by our $match condition.  There is - as far as I know - no equivalent of a hash or sort-merge join operation possible here, so we need to make sure that we've used an index.  Unfortunately, the explain() command doesn't help us.  It tells us only whether we used an index to perform the initial $match, but doesn't show us whether we used an index within the $lookup.
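To see why this matters, here's a sketch - in plain JavaScript, not MongoDB internals - of the difference between the nested-loop behaviour of an unindexed $lookup and an indexed probe.  The collection contents and field names below are made up for illustration:

```javascript
// Unindexed: a full scan of `foreign` for every local document -> O(n * m).
// This is analogous to $lookup scanning the foreign collection per document.
function lookupUnindexed(local, foreign, localField, foreignField) {
  return local.map(doc => ({
    ...doc,
    joined: foreign.filter(f => f[foreignField] === doc[localField]),
  }));
}

// Indexed: build a Map once (analogous to a B-tree index on the
// foreignField), then each lookup is a cheap probe -> O(n + m).
function lookupIndexed(local, foreign, localField, foreignField) {
  const index = new Map();
  for (const f of foreign) {
    const key = f[foreignField];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(f);
  }
  return local.map(doc => ({
    ...doc,
    joined: index.get(doc[localField]) || [],
  }));
}

// Both produce the same join result; only the cost model differs.
const films = [{ _id: 1, Title: "ACADEMY DINOSAUR" }];
const customers = [
  { name: "SUSAN", filmId: 1 },
  { name: "REBECCA", filmId: 1 },
  { name: "OTHER", filmId: 2 },
];
console.log(JSON.stringify(lookupIndexed(films, customers, "_id", "filmId")));
```

The point of the sketch is simply that without an index, the cost of the join grows with the product of the two collection sizes.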

Here's the explain output from the operation above (TL;DR: it's long, and says nothing about index usage inside the $lookup):

   1: > db.films.explain().aggregate([
   2: ...     {$match:{"Actors.First name" : "CHRISTIAN",
   3: ...              "Actors.Last name" : "GABLE"}},
   4: ...     {$lookup: {
   5: ...         from: "customers",
   6: ...         as: "customerData",
   7: ...         localField: "_id",
   8: ...         foreignField: "Rentals.filmId"
   9: ...     }},
  10: ...     {$unwind: "$customerData"},
  11: ...     {$project:{"Title":1,
  12: ...                "FirstName":"$customerData.First Name",
  13: ...                "LastName" :"$customerData.Last Name"}},
  14: ...
  15: ... ])
  16: {
  17:         "waitedMS" : NumberLong(0),
  18:         "stages" : [
  19:                 {
  20:                         "$cursor" : {
  21:                                 "query" : {
  22:                                         "Actors.First name" : "CHRISTIAN",
  23:                                         "Actors.Last name" : "GABLE"
  24:                                 },
  25:                                 "fields" : {
  26:                                         "Title" : 1,
  27:                                         "customerData.First Name" : 1,
  28:                                         "customerData.Last Name" : 1,
  29:                                         "_id" : 1
  30:                                 },
  31:                                 "queryPlanner" : {
  32:                                         "plannerVersion" : 1,
  33:                                         "namespace" : "sakila.films",
  34:                                         "indexFilterSet" : false,
  35:                                         "parsedQuery" : {
  36:                                                 "$and" : [
  37:                                                         {
  38:                                                                 "Actors.First name" : {
  39:                                                                         "$eq" : "CHRISTIAN"
  40:                                                                 }
  41:                                                         },
  42:                                                         {
  43:                                                                 "Actors.Last name" : {
  44:                                                                         "$eq" : "GABLE"
  45:                                                                 }
  46:                                                         }
  47:                                                 ]
  48:                                         },
  49:                                         "winningPlan" : {
  50:                                                 "stage" : "COLLSCAN",
  51:                                                 "filter" : {
  52:                                                         "$and" : [
  53:                                                                 {
  54:                                                                         "Actors.First name" : {
  55:                                                                                 "$eq" : "CHRISTIAN"
  56:                                                                         }
  57:                                                                 },
  58:                                                                 {
  59:                                                                         "Actors.Last name" : {
  60:                                                                                 "$eq" : "GABLE"
  61:                                                                         }
  62:                                                                 }
  63:                                                         ]
  64:                                                 },
  65:                                                 "direction" : "forward"
  66:                                         },
  67:                                         "rejectedPlans" : [ ]
  68:                                 }
  69:                         }
  70:                 },
  71:                 {
  72:                         "$lookup" : {
  73:                                 "from" : "customers",
  74:                                 "as" : "customerData",
  75:                                 "localField" : "_id",
  76:                                 "foreignField" : "Rentals.filmId",
  77:                                 "unwinding" : {
  78:                                         "preserveNullAndEmptyArrays" : false
  79:                                 }
  80:                         }
  81:                 },
  82:                 {
  83:                         "$project" : {
  84:                                 "Title" : true,
  85:                                 "FirstName" : "$customerData.First Name",
  86:                                 "LastName" : "$customerData.Last Name"
  87:                         }
  88:                 }
  89:         ],
  90:         "ok" : 1
  91: }

However, we can see the queries created by the $lookup stage if we enable profiling (for example, with db.setProfilingLevel(2)).  With profiling turned on, we can see that a full collection scan of customers has been generated for every film document that has been joined:

[Image: profiler output showing a collection scan of customers for each joined film]

These "nested" collection scans are bad news.  Below are the results of a benchmark in which I joined two collections using $lookup, with and without an index.  As you can see, the unindexed $lookup degrades steeply as the number of documents to be joined increases. The solution is obvious:

 Always create an index on the foreignField attributes in a $lookup, unless the collections are of trivial size. 


[Image: benchmark of $lookup join performance, indexed vs. unindexed]
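Following that advice for the example above means indexing the foreignField in the customers collection; something like this (a sketch, assuming the Sakila schema from the earlier listing):

```javascript
// Index the attribute used as the foreignField in the $lookup, so each
// per-document lookup becomes an index probe rather than a collection scan.
db.customers.createIndex({ "Rentals.filmId": 1 })
```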

The MongoDB company is putting a lot of new features into the aggregation framework:  they clearly intend to create a very powerful and flexible capability that matches, and maybe even exceeds, what can be done with SQL.  Indeed, the aggregation framework seems poised to become a dataflow language similar to Pig.  Anyone wanting to do any serious work in MongoDB should make sure they are very comfortable with aggregate.  And if you use $lookup to perform joins in aggregate, make sure there is an index on the foreignField attribute.

 

Wednesday
Jun292016

Good bye Quest!

You may have read that Francisco Partners and Elliott Management have entered into an agreement to acquire the Dell Software Group - largely composed of the Quest Software company that Dell bought in 2012.   I've worked at Quest since 1998, but alas I will not be participating in this next stage of the Quest journey.
Although the timing of the announcement was influenced by the logistics of this sale, it is actually a decision I came to some time ago, well before this latest saga commenced.  I've been through an amazing 18-year journey with Quest - I've seen the IPO, the attempted privatisation of Quest, the purchase by Dell, and Dell's transition to a private company, as well as having participated in many other major company events.  In that time I've been involved in building software products like Spotlight that have been wildly successful, and I've had the honour of leading a team of hundreds of fantastic developers across the world.  It's been a tremendous experience, but after spending almost one third of my life with Quest I feel that I'd like to work closer to home (in Melbourne), more directly with technology, and with smaller teams. 
The details of my next endeavours are still a work in progress, but I can tell you that I'll be working closely with fellow Quest alumnus Mark Gurry at MGA.  
Quest has been by far the most important professional relationship of my life, and I owe so much to the company and its founders.  I've made friends and established collegial relationships that will last the rest of my life.   I'm going to miss all my friends at Quest and will always look back fondly on my time there.  That having been said, I'm incredibly excited to be changing gears, and very happy that I'll be spending less time as a guest of United Airlines :-)
Wednesday
Jan132016

Next Generation Databases

My latest book Next Generation Databases is now available to purchase!   You can buy it from Amazon here, or directly from Apress here.  The e-book versions are not quite ready, but if you prefer the print version you're good to go.

I wrote this book as an attempt to share what I’ve learned about non-relational databases in the last decade and position these in the context of the relational database landscape that I’ve worked in all my professional life.  

The book is divided into two sections:  the first section explains the market and technology drivers that led to the end of complete "one size fits all" relational dominance, and describes each of the major new database technologies.   These first 7 chapters are:

  • Three Database Revolutions
  • Google, Big Data, and Hadoop  
  • Sharding, Amazon, and the Birth of NoSQL
  • Document Databases
  • Tables are Not Your Friends: Graph Databases
  • Column Databases
  • The End of Disk? SSD and In-Memory Databases 

The second half of the book covers the “gory details” of the internals of the major new database technologies.  We look at how databases like MongoDB, Cassandra, HBase, Riak and others implement clustering and replication, locking and consistency management, logical and physical storage models and the languages and APIs provided.  These chapters are: 

  • Distributed Database Patterns 
  • Consistency Models 
  • Data Models and Storage 
  • Languages and Programming Interfaces

The final chapter speculates on how databases might develop in the future.  Spoiler alert: I think the explosion of new database technologies over the last few years is going to be followed by a consolidation phase, but there are some potentially disruptive technologies on the horizon, such as universal memory, blockchain and even quantum computing. 

The relational database is a triumph of software engineering and has been the basis for most of my career.  But the times they are a changing and speaking personally I’ve really enjoyed learning about these new technologies.  I learned a lot more about the internals of the newer database architectures while writing the book and I’m feeling pretty happy with the end result.   As always I’m anxious to engage with readers and find out what you guys think!

Saturday
Dec122015

Blockchain and databases of the future

You would have to have been living under a rock for the past few years not to have heard of Bitcoin.    Bitcoin is an electronic cryptocurrency which can be used like cash in many web transactions.  At the time of writing there are about 15 million Bitcoins in circulation, trading at approximately $USD 360 each, for a total value of about $USD 5.3 billion.

Bitcoin combines peer-to-peer technology and public-key cryptography.  The owner of a bitcoin can use a private key to assert ownership and authorize transactions; others can use the public key to validate the transaction.  As in other peer-to-peer systems such as BitTorrent, there is no central server which maintains Bitcoin transactions; rather, there is a distributed public ledger called the Blockchain.   I wrote a short article on blockchain here.

The implications of cryptocurrencies are way beyond our scope here, but there are definite implications for database technologies in the Blockchain concept.  Blockchain replaces the trusted third party that must normally mediate any transfer of funds.   Rather than a centralized database that records transactions and authenticates each party, Blockchain allows transactions and identity to be validated by consensus within the Blockchain network: each transaction is confirmed by public-key based authentication from multiple nodes before being concluded.   Blockchain could be as disruptive to our traditional notions of banking and non-monetary transactions as peer-to-peer systems like Napster were to the music business. 

The Blockchain underlying Bitcoin is public, but there can be private (or permissioned) Blockchains which are "invitation only".  Whether private or public, Blockchains arguably represent a new sort of shared distributed database.  Like systems based on the Dynamo model, the data in the Blockchain is distributed redundantly across a large number of hosts. However, the Blockchain represents a complete paradigm shift in how permissions are managed within the database. In an existing database system, the database owner has absolute control over the data held in the database. In a Blockchain system, however, ownership is maintained by the creator of the data. 

Consider a database that maintains a social network like Facebook: although the application is programmed to allow only you to modify your own posts or personal details, the reality is that the Facebook company actually has total control over your online data. They can – if they wish – remove your posts, censor your posts or even modify your posts if they really wanted to. In a Blockchain based database you would retain total ownership of your posts and it would be impossible for any other entity to modify them.

Applications based on Blockchain have the potential to disrupt a very wide range of social and economic activities.  Transfers of money, property, management of global identity (passports, birth certificates), voting, permits, wills, health data, and a multitude of other transactional data could be regulated in the future by Blockchains.   The databases that currently maintain records of these types of transactions may become obsolete.

Most database owners will probably want to maintain control of the data in the database, and therefore it’s unlikely that Blockchain will completely transform database technology in the short term.  However it does seem likely that database systems will implement Blockchain based authentication and authorization protocols for specific application scenarios. Furthermore, it seems likely that formal database systems built upon a Blockchain foundation will soon emerge.

Monday
Oct122015

Vector clocks

 

One of the concepts I found difficult initially when looking at non-relational systems is the concept of the vector clock.  Some databases - like Cassandra - use timestamps to work out which is the "latest" transaction. If there are two conflicting modifications to a column value, the one with the highest timestamp will be considered the most recent and the most correct.

Other Dynamo systems use a more complex mechanism known as a vector clock. The vector clock has the advantage of not requiring clock synchronization across all nodes, and helps us identify transactions that might be in conflict.

Despite its name, the vector clock does not include any timestamps. Rather, it is composed of a set of counters. These counters are incremented when operations complete, in a similar way to the traditional System Change Number pattern that we are familiar with from relational systems like Oracle. The set contains one counter for each node in the cluster. Whenever an operation occurs on a node, that node will increment its own counter within its vector clock. Whenever a node transmits an operation to another node, it will include its vector clock within the request. The transmitted vector clock will include the highest counter for the transmitting node, as well as the highest counters from other nodes that the transmitting node has ever seen.

When a node receives possibly conflicting updates from other nodes, it can compare the vector clocks to determine the relative sequencing of the requests. There is a defined set of vector clock operations that can tell if:

  • The two vector clocks come from nodes that are completely in sync
  • One node is “out of date” with respect of the other node
  • The clocks are “concurrent” in that each node has some information that is more up to date than the other node. In this case we can’t choose which update is truly the more correct.
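A minimal sketch of that comparison in plain JavaScript (the node names and clock layout here are illustrative, not any particular database's implementation):

```javascript
// Each vector clock is an object mapping node name -> counter.
// Compare element-wise: if each clock has some counter ahead of the
// other's, the updates are concurrent and potentially in conflict.
function compareClocks(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false, bAhead = false;
  for (const node of nodes) {
    const ca = a[node] || 0, cb = b[node] || 0;
    if (ca > cb) aAhead = true;
    if (cb > ca) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // conflicting updates
  if (aAhead) return "a-newer";              // b is out of date
  if (bAhead) return "b-newer";              // a is out of date
  return "in-sync";                          // identical histories
}

// Mirrors the diagram below: F has the higher counter for node 1, G has
// the higher counter for node 3, so neither strictly dominates the other.
console.log(compareClocks({ node1: 2, node2: 1 }, { node2: 1, node3: 2 }));
// "concurrent"
```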

Vector clocks are notoriously difficult to understand, though the underlying algorithm is really quite simple. The diagram below shows an example of three vector clocks incrementing across three nodes. The algorithm is somewhat simplified to improve clarity.

[Figure: three vector clocks incrementing across three nodes]

In the example the vector clocks start out set to 0 for all nodes (1). Updates to nodes from external clients cause the nodes to increment their own element of the vector clock (2). When these changes are propagated to other nodes, the receiving node updates its vector clock and merges the vector clocks from the other nodes (3). Event (H) occurs when node 2 receives the vector clock (F) from node 1 and (G) from node 3 (4). Each of these vector clocks contains elements higher than the other - vector clock F has the higher value for node 1, while vector clock G has the higher value for node 3. There is no way for node 2 to be sure which of the two vector clocks represents the most up-to-date data - each of the sending nodes "knows" something that the other node does not, and consequently it's not clear which of the two nodes "knows" best.

For those of us from the world of strictly consistent databases like Oracle, think of the vector clock as a set of System Change Numbers from each system.  We examine the SCNs from each node to see if there are nodes that might not have seen all the changes that have been recorded on another node.

The vector clock example above tells us that version G and version F are conflicting - each contains information from unique updates, and both could contain important information. What, then, is the system to do? Here are some of the options:

  • Revert to last write wins: two updates are unlikely to have occurred at the exact same nanosecond, so one will have a higher timestamp value. We could decide that the highest timestamp “wins”.
  • Keep both copies, and require that the application or the user resolve the conflict.
  • Somehow merge the data. This is the approach taken by the original Dynamo which managed Amazon’s shopping cart. If there are two conflicting shopping carts they are merged and the worst that can happen (from Amazon’s point of view) is that you buy some things twice. Another merge can occur with things like counters: rather than having one counter increment overwrite another, we can deduce that both operations wanted to increment the counter and increment it twice. A special class of data types: Conflict-Free Replicated Data Type (CRDT) exist that allow these sort of merges to be predefined.
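As a sketch of the merge approach, here's a toy grow-only counter (a G-Counter, the simplest CRDT) in plain JavaScript; the replica and node names are made up for illustration:

```javascript
// Each replica only increments its own slot in the counter object, so a
// merge can take the per-node maximum and no increment is ever lost.
function increment(counter, node, by = 1) {
  return { ...counter, [node]: (counter[node] || 0) + by };
}

// Merge two replicas by taking the element-wise max of their counters.
function merge(a, b) {
  const merged = { ...a };
  for (const [node, count] of Object.entries(b)) {
    merged[node] = Math.max(merged[node] || 0, count);
  }
  return merged;
}

// The observed value of the counter is the sum across all nodes.
function value(counter) {
  return Object.values(counter).reduce((sum, c) => sum + c, 0);
}

// Two replicas diverge, then merge: both increments survive, rather than
// one overwriting the other.
const replicaA = increment({}, "node1");  // { node1: 1 }
const replicaB = increment({}, "node2");  // { node2: 1 }
const merged = merge(replicaA, replicaB); // { node1: 1, node2: 1 }
console.log(value(merged)); // 2
```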

There are advocates for the vector clock - such as the architects of Riak - and advocates for the timestamp system used in Cassandra. Neither party disagrees about the concrete implications of the two approaches: they differ on the desirability of the consequences. Last write wins represents a simpler model for the application developer and administrator; vector clocks allow conflicts to be identified, but those conflicts must then be resolved.   In a later post I'll give an example of how you can programmatically resolve conflicts in Riak.