« Bulk inserts in MongoDB | Main | Join performance in MongoDB 3.2 using $lookup »

Graph Lookup in MongoDB 3.3

Specialized graph databases such as Neo4J specialize in traversing graphs of relationships – such as those you might find in a social network.  Many non-graph databases have been incorporating Graph Compute Engines to perform similar tasks.  In the MongoDB 3.3 release, we now have the ability to perform simple graph traversal using the $graphLookup aggregation framework function.  This will become a production feature in the 3.4 release.

The new feature is documented in MongoDB Jira SERVER-23725.  The basic syntax is shown here:

   1: {$graphLookup:
   2:         from: <name of collection to look up into>,
   3:         startWith: <expression>,
   4:         connectFromField: <name of field in document from “from”>,
   5:         connectToField: <name of field in document from “from”>,
   6:         as: <name of field in output document>,
   7:         maxDepth: <optional - non-negative integer>,
   8:         depthField: <optional - name of field in output
   9:  documents>
  10:     }

I started playing with this capability originally using the POKEC dataset which represents data from a real social network in Slovakia.  The relationship file soc-pokec-relationships.txt.gz  contains the social network for about 1.2 million people.  I loaded it into Mongo using this perl script.   The following pipeline did the trick:

   1: gzip -dc ~/Downloads/soc-pokec-relationships.txt |perl loadit.pl|mongoimport -d GraphTest -c socialGraph --drop

Now we have a collection with records like this:

   1: > db.socialGraph.findOne()
   2: {
   3:     "_id" : ObjectId("57b841b02e2a30792c8bb6bd"),
   4:     "person" : 1327456,
   5:     "name" : "User# 1327456",
   6:     "friends" : [
   7:         427220,
   8:         488072,
   9:         975403,
  10:         1322901,
  11:         1343431,
  12:         51639,
  13:         54468,
  14:         802341
  15:     ]
  16: }
We can expand the social network for a single person using a syntax like this:
   1: db.socialGraph.aggregate([
   2:     {
   3:         $match: {person:1476767}
   4:     },
   5:     {
   6:         $graphLookup: {
   7:             from: "socialGraph",
   8:             startWith: [1476767],
   9:             connectFromField: "friends",
  10:             connectToField: "person",
  11:             as: "socialNetwork",
  12:             maxDepth:2,
  13:             depthField:"depth"
  14:         }
  15:     },
  16:     {
  17:        $project: {_id:0,name:1,"Network":"$socialNetwork.name",
  18:                                  "Depth":"$socialNetwork.depth" }
  19:     },
  20:     {$unwind: {"Network"}}
  21: ])

What we are doing here is starting with person 1476767, then following the elements of the friends array out to two levels – i.e.: to “friends of friends”.

Increasing the maxdepth exponentially increases the amount of data we have to cope with.  This is the notorious “seven degrees of separation” effect – most people in a social network are linked by 6-7 hops, so once we get past that we are effectively traversing the entire set.   Unfortunately, this meant that traversing more than 3 deep caused me to run out of memory:

   1: assert: command failed: {
   2:     "ok" : 0,
   3:     "errmsg" : "$graphLookup reached maximum memory consumption",
   4:     "code" : 40099
   5: } : aggregate failed

The graph lookup can only consume at most 100MB of memory, and currently doesn't spill to disk, even if theallowDiskUse : true clause is specified within the aggregation arguments.   SERVER-23980 is open to correct this but it doesn't appear to have been scheduled yet. 

So I tried building a “flatter” network so that I wouldn’t run out of memory.  This JavaScript builds the network and this Javascript runs some performance tests.  I tried expanding the network both with and without a supporting index on the connectToField (person) in this case.  Here’s the results (note the logarithmic scale):


For shallow networks,  having an index on the connectToField makes an enormous difference.  But as the depth increases, the index performance advantage decreases and eventually performance matches that of the unindexed case.   In this example data that just happens to be at the “7 degrees of separation” but it will clearly depend on the nature of the data.

The $graphLookup operator is a very powerful addition to the MongoDB aggregation framework and continues the trend of providing richer query capabilities within the server.  Mastering the aggregation framework is clearly a high priority for anyone wanting to exploit the full power of MongoDB

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>