Graph Lookup in MongoDB 3.3

Specialized graph databases such as Neo4J specialize in traversing graphs of relationships – such as those you might find in a social network.  Many non-graph databases have been incorporating Graph Compute Engines to perform similar tasks.  In the MongoDB 3.3 release, we now have the ability to perform simple graph traversal using the $graphLookup aggregation framework function.  This will become a production feature in the 3.4 release.

The new feature is documented in MongoDB Jira SERVER-23725.  The basic syntax is shown here:

   1: {$graphLookup:
   2:         from: <name of collection to look up into>,
   3:         startWith: <expression>,
   4:         connectFromField: <name of field in document from “from”>,
   5:         connectToField: <name of field in document from “from”>,
   6:         as: <name of field in output document>,
   7:         maxDepth: <optional - non-negative integer>,
   8:         depthField: <optional - name of field in output
   9:  documents>
  10:     }

I started playing with this capability originally using the POKEC dataset which represents data from a real social network in Slovakia.  The relationship file soc-pokec-relationships.txt.gz  contains the social network for about 1.2 million people.  I loaded it into Mongo using this perl script.   The following pipeline did the trick:

   1: gzip -dc ~/Downloads/soc-pokec-relationships.txt |perl|mongoimport -d GraphTest -c socialGraph --drop

Now we have a collection with records like this:

   1: > db.socialGraph.findOne()
   2: {
   3:     "_id" : ObjectId("57b841b02e2a30792c8bb6bd"),
   4:     "person" : 1327456,
   5:     "name" : "User# 1327456",
   6:     "friends" : [
   7:         427220,
   8:         488072,
   9:         975403,
  10:         1322901,
  11:         1343431,
  12:         51639,
  13:         54468,
  14:         802341
  15:     ]
  16: }
We can expand the social network for a single person using a syntax like this:
   1: db.socialGraph.aggregate([
   2:     {
   3:         $match: {person:1476767}
   4:     },
   5:     {
   6:         $graphLookup: {
   7:             from: "socialGraph",
   8:             startWith: [1476767],
   9:             connectFromField: "friends",
  10:             connectToField: "person",
  11:             as: "socialNetwork",
  12:             maxDepth:2,
  13:             depthField:"depth"
  14:         }
  15:     },
  16:     {
  17:        $project: {_id:0,name:1,"Network":"$",
  18:                                  "Depth":"$socialNetwork.depth" }
  19:     },
  20:     {$unwind: {"Network"}}
  21: ])

What we are doing here is starting with person 1476767, then following the elements of the friends array out to two levels – i.e.: to “friends of friends”.

Increasing the maxdepth exponentially increases the amount of data we have to cope with.  This is the notorious “seven degrees of separation” effect – most people in a social network are linked by 6-7 hops, so once we get past that we are effectively traversing the entire set.   Unfortunately, this meant that traversing more than 3 deep caused me to run out of memory:

   1: assert: command failed: {
   2:     "ok" : 0,
   3:     "errmsg" : "$graphLookup reached maximum memory consumption",
   4:     "code" : 40099
   5: } : aggregate failed

The graph lookup can only consume at most 100MB of memory, and currently doesn't spill to disk, even if theallowDiskUse : true clause is specified within the aggregation arguments.   SERVER-23980 is open to correct this but it doesn't appear to have been scheduled yet. 

So I tried building a “flatter” network so that I wouldn’t run out of memory.  This JavaScript builds the network and this Javascript runs some performance tests.  I tried expanding the network both with and without a supporting index on the connectToField (person) in this case.  Here’s the results (note the logarithmic scale):


For shallow networks,  having an index on the connectToField makes an enormous difference.  But as the depth increases, the index performance advantage decreases and eventually performance matches that of the unindexed case.   In this example data that just happens to be at the “7 degrees of separation” but it will clearly depend on the nature of the data.

The $graphLookup operator is a very powerful addition to the MongoDB aggregation framework and continues the trend of providing richer query capabilities within the server.  Mastering the aggregation framework is clearly a high priority for anyone wanting to exploit the full power of MongoDB


Join performance in MongoDB 3.2 using $lookup

One of the key tenants of MongoDB schema design is to account for the absence of server-side joins.  Data is joined all the time inside of application code of course, but traditionally there’s been no way to perform joins within the server itself. 

This changed in 3.2 with the introduction of the $lookup operator within the aggregation framework.  $lookup performs the equivalent of a left outer join – eg: it retrieves matching data from another document and returns null data if no match is found.

Here’s an example using the MongoDB version of the Sakila dataset that I converted from MySQL back in this post

   1: db.films.aggregate([
   2:     {$match:{"Actors.First name" : "CHRISTIAN",
   3:              "Actors.Last name" : "GABLE"}},
   4:     {$lookup: {
   5:         from: "customers",
   6:         as: "customerData",
   7:         localField: "_id",
   8:         foreignField: "Rentals.filmId"
   9:     }},
  10:     {$unwind: "$customerData"},
  11:     {$project:{"Title":1,
  12:                "FirstName":"$customerData.First Name",
  13:                "LastName" :"$customerData.Last Name"}},
  14: ])


What we’re doing here is finding all customers who have ever hired a film staring “Christian Gable”;  We start by finding those films in the films collection (lines 2-3), then use $lookup to retrieve customer data (lines 4-9).  Films embeds actors in the “Actors” array;  the customers collection embeds films that have been hired in the "Rentals" array. 

The result of the join contains all the customers who have borrowed the movie returned as an array, so we use the $unwind operator to “flatten” them out (line 10).  The resulting output looks like this:

{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "SUSAN", "LastName" : "WILSON" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "REBECCA", "LastName" : "SCOTT" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "DEBRA", "LastName" : "NELSON" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "MARIE", "LastName" : "TURNER" }
{ "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "TINA", "LastName" : "SIMMONS" }

One thing that we need to be careful here is with join performance.  The $lookup function is going to be executed once for each document returned by our $match condition.  There is - AFAIK - no equivalent of a hash or sort merge join operation possible here, so we need to make sure that we've used an index.  Unfortunately, the explain() command doesn’t help us.  It tells us only if we have used an index to perform the initial $match, but doesn't show us if we used an index within the $lookup

Here's the explain output from the operation above (TL;DR):

   1: > db.films.explain().aggregate([
   2: ... {$match:{"Actors.First name" : "CHRISTIAN",
   3: ...          "Actors.Last name" : "GABLE"}},
   4: ...     {$lookup: {
   5: ... from: "customers",
   6: ... as: "customerData",
   7: ... localField: "_id",
   8: ... foreignField: "Rentals.filmId"
   9: ... }},
  10: ... {$unwind: "$customerData"},
  11: ... {$project:{"Title":1,
  12: ...            "FirstName":"$customerData.First Name",
  13: ...            "LastName" :"$customerData.Last Name"}},
  14: ...
  15: ... ])
  16: {
  17:         "waitedMS" : NumberLong(0),
  18:         "stages" : [
  19:                 {
  20:                         "$cursor" : {
  21:                                 "query" : {
  22:                                         "Actors.First name" : "CHRISTIAN",
  23:                                         "Actors.Last name" : "GABLE"
  24:                                 },
  25:                                 "fields" : {
  26:                                         "Title" : 1,
  27:                                         "customerData.First Name" : 1,
  28:                                         "customerData.Last Name" : 1,
  29:                                         "_id" : 1
  30:                                 },
  31:                                 "queryPlanner" : {
  32:                                         "plannerVersion" : 1,
  33:                                         "namespace" : "sakila.films",
  34:                                         "indexFilterSet" : false,
  35:                                         "parsedQuery" : {
  36:                                                 "$and" : [
  37:                                                         {
  38:                                                                 "Actors.First name" : {
  39:                                                                         "$eq" : "CHRISTIAN"
  40:                                                                 }
  41:                                                         },
  42:                                                         {
  43:                                                                 "Actors.Last name" : {
  44:                                                                         "$eq" : "GABLE"
  45:                                                                 }
  46:                                                         }
  47:                                                 ]
  48:                                         },
  49:                                         "winningPlan" : {
  50:                                                 "stage" : "COLLSCAN",
  51:                                                 "filter" : {
  52:                                                         "$and" : [
  53:                                                                 {
  54:                                                                         "Actors.First name" : {
  55:                                                                                 "$eq" : "CHRISTIAN"
  56:                                                                         }
  57:                                                                 },
  58:                                                                 {
  59:                                                                         "Actors.Last name" : {
  60:                                                                                 "$eq" : "GABLE"
  61:                                                                         }
  62:                                                                 }
  63:                                                         ]
  64:                                                 },
  65:                                                 "direction" : "forward"
  66:                                         },
  67:                                         "rejectedPlans" : [ ]
  68:                                 }
  69:                         }
  70:                 },
  71:                 {
  72:                         "$lookup" : {
  73:                                 "from" : "customers",
  74:                                 "as" : "customerData",
  75:                                 "localField" : "_id",
  76:                                 "foreignField" : "Rentals.filmId",
  77:                                 "unwinding" : {
  78:                                         "preserveNullAndEmptyArrays" : false
  79:                                 }
  80:                         }
  81:                 },
  82:                 {
  83:                         "$project" : {
  84:                                 "Title" : true,
  85:                                 "FirstName" : "$customerData.First Name",
  86:                                 "LastName" : "$customerData.Last Name"
  87:                         }
  88:                 }
  89:         ],
  90:         "ok" : 1
  91: }

However, we can see the queries created by the $lookup function if we enable profiling.  For instance if we turn profiling on can see a full collection scan of customers has have been generated for every film document that has been joined:


These “nested” collection scans are bad news.  Below is the results of a benchmark in which I joined two collections using $lookup with and without an index.  As you can see, the unindexed $lookup degrades steeply as the number of rows to be joined increases. The solution is obvious:

 Always create an index on the foreignField attributes in a $lookup, unless the collections are of trivial size. 


The MongoDB company is putting a lot of new features into the aggregation framework:  they clearly intend to create a very powerful and flexible capability that matches and maybe even exceeds what can be done with SQL.  Indeed,  the aggregation framework seems poised to become a dataflow language similar to Pig.  Anyone wanting to do any serious work in MongoDB should make sure they are very comfortable with aggregate.  If you use $lookup to perform joins in aggregate, make sure there is an index on the ForiegnField attribute.



Good bye Quest!

You may have read that Francisco Partners and Elliott Management have entered into an agreement to  Acquire the Dell Software Group – largely composed of the Quest software company bought from Dell in 2012 .   I’ve worked at Quest since 1998, but alas I will not be participating in this next stage of the Quest journey.
Although the timing of the announcement was influenced by the logistics of this sale, it is actually a decision I came to some time ago, well before this latest saga commenced.  I’ve been though an amazing 18 year journey with Quest – I’ve seen IPO, attempted privatisation of Quest, purchase by Dell, transition of Dell to private company as well as having participated in many other major company events.  In that time I’ve been involved in building software products like Spotlight that have been wildly successful and had the honour of leading a team of hundreds of fantastic developers across the world.  It’s been a tremendous experience, but after spending almost one third of my life with Quest I feel that I’d like to work closer to home (in Melbourne), work more directly with technology and with smaller teams. 
The details of my next endeavours are somewhat works in progress, but I can tell you that I’ll be working closely with fellow Quest Alumni Mark Gurry at MGA.  
Quest has been by far the most important professional relationship of my life and I owe so much to the company and it’s founders.  I’ve made friends and established collegial relationships that will last the rest of my life.   I’m going to miss all my friends at Quest and always look back fondly on my time there.  That having been said, I’m incredibly excited to be changing gears and very happy that I’ll be spending less times as the guest of United airlines :-)

Next Generation Databases

dbtngMy latest book Next Generation Databases is now available to purchase!   You can buy it from Amazon here, or directly from Apress here.  The e-book versions are not quite ready but if you prefer the print version you’re good to go.

I wrote this book as an attempt to share what I’ve learned about non-relational databases in the last decade and position these in the context of the relational database landscape that I’ve worked in all my professional life.  

The book is divided into two sections:  the first section explains the market and technology drivers that lead to the end of complete “one size fits all” relational dominance and describes each of the major new database technologies.   These first 7 chapters are:

  • Three Database Revolutions
  • Google, Big Data, and Hadoop  
  • Sharding, Amazon, and the Birth of NoSQL
  • Document Databases
  • Tables are Not Your Friends: Graph Databases
  • Column Databases
  • The End of Disk? SSD and In-Memory Databases 

The second half of the book covers the “gory details” of the internals of the major new database technologies.  We look at how databases like MongoDB, Cassandra, HBase, Riak and others implement clustering and replication, locking and consistency management, logical and physical storage models and the languages and APIs provided.  These chapters are: 

  • Distributed Database Patterns 
  • Consistency Models 
  • Data Models and Storage 
  • Languages and Programming Interfaces

The final chapter speculates on how databases might develop in the future.  Spoiler alert: I think the explosion of new database technologies over the last few years is going to be followed by a consolidation phase, but there’s some potentially disruptive technologies on the horizon such as universal memory, blockchain and even quantum computing. 

The relational database is a triumph of software engineering and has been the basis for most of my career.  But the times they are a changing and speaking personally I’ve really enjoyed learning about these new technologies.  I learned a lot more about the internals of the newer database architectures while writing the book and I’m feeling pretty happy with the end result.   As always I’m anxious to engage with readers and find out what you guys think!


Blockchain and databases of the future

You would have to have been living under a rock for the past few years not to have heard of Bitcoin.    Bitcoin is an electronic cryptocurrency which can be used like cash in many web transactions.  At time of writing there are about 15 million Bitcoins in circulation, trading at approximately $USD 360 each for a total value of about $USD 5.3 billion.

Bitcoin combines peer-to-peer technology and public key cryptography.  The owner of a bitcoin can use a private key to assert ownership and authorize transactions – others can use the public key to validate the transaction.  As in other peer to peer systems such as Bittorrent, there is no central server which maintains Bitcoin transactions – rather there is a distributed public ledger called the Blockchain.   I wrote a short article on blockchain here

The implications of cryptocurrencies is way beyond our scope here, but there are definite implications for database technologies in the Blockchain concept.  Blockchain replaces the trusted third party that must normally mediate any transfer of funds.   Rather than a centralized database that records transactions and authenticates each party, Blockchain allows transactions and identity to be validated by consensus with the Blockchain network – each transaction is confirmed by public-key based authentication from multiple nodes before being concluded.   Blockchain could be as disruptive to our traditional notions of banking and non-monetary transactions as peer-to-peer systems like napster were to the music business. 

The Blockchain underlying Bitcoin is public, but there can be private (or permissioned) Blockchains which are “invitation only”.  Whether Private or public, Blockchains arguably represent a new sort of shared distributed database.  Like systems based on the Dynamo model, the data in the block chain is distributed redundantly across a large number of hosts. However, the Blockchain represents a complete paradigm shift in how permissions are managed within the database. In an existing database system, the database owner has absolute control over the data held in the database. However in a Blockchain system, ownership is maintained by the creator of the data. 

Consider a database that maintains a social network like Facebook: although the application is programmed to allow only you to modify your own posts or personal details, the reality is that the Facebook company actually has total control over your online data. They can – if they wish – remove your posts, censor your posts or even modify your posts if they really wanted to. In a Blockchain based database you would retain total ownership of your posts and it would be impossible for any other entity to modify them.

Applications based on Blockchain have the potential to disrupt a very wide range of social and economic activities.  Transfers of money, property, management of global identity (passports, birth certificates), voting, permits, wills, health data, and a multitude of other transactional data could be regulated in the future by Blockchains.   The databases that currently maintain records of these types of transactions may become obsolete.

Most database owners will probably want to maintain control of the data in the database, and therefore it’s unlikely that Blockchain will completely transform database technology in the short term.  However it does seem likely that database systems will implement Blockchain based authentication and authorization protocols for specific application scenarios. Furthermore, it seems likely that formal database systems built upon a Blockchain foundation will soon emerge.