Wednesday, Nov 30, 2016

Optimizing the order of MongoDB aggregation steps

MongoDB does have a query optimizer, and in most cases it's effective at picking the best of multiple possible plans.  However, it's worth remembering that in the case of the aggregate function, the sequence in which the various steps are executed is completely under your control.  The optimizer won't reorder steps into the optimal sequence to get you out of trouble.
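You can always check what the server will actually execute by asking for an explain plan rather than running the pipeline.  Here's a minimal sketch, using a couple of stages from the example below, that simply prints the plan:

    // Passing {explain: true} returns the plan instead of executing the pipeline.
    // The output shows the stages in the order the server will actually run them.
    var plan = db.orders.aggregate(
          [ {$match:   {orderStatus: "C"}},
            {$project: {CustId: 1, lineItems: 1}} ],
          {explain: true});
    printjson(plan);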


Optimizing the order of steps comes down mainly to reducing the amount of data in the pipeline as early as possible – this reduces the amount of work that has to be done by each successive step.  The corollary is that steps that perform a lot of work on data should be placed after any filter steps.
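For instance, a $match that can be applied to the raw documents should appear before expensive stages such as $unwind or $group rather than after them.  A minimal sketch of the idea, using the same field names as the example below:

    // Filtering first means $unwind and $group only see the documents we care about.
    db.orders.aggregate([
          {$match:  {orderStatus: "C"}},      // cheap filter applied first
          {$unwind: "$lineItems"},            // unwinds far fewer documents
          {$group:  {_id: "$lineItems.prodId",
                     total: {$sum: "$lineItems.Cost"}}}
    ]);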

Nowhere is this more important than in $lookup steps.  Since $lookup steps perform a separate collection lookup – hopefully using an index – we should make sure we delay them until the data has been filtered down as far as possible.   Consider this aggregation function, which generates a “top 10” list of product purchases by customer:

   1: var output=db.orders.aggregate([
   2:       {$sample:{size:sampleSize}},
   3:       {$match:{orderStatus:"C"}},
   4:       {$project:{CustId:1,lineItems:1}},
   5:       {$unwind:"$lineItems"},
   6:       {$group:{_id:{ CustId:"$CustId",ProdId:"$lineItems.prodId"},
   7:                 "prodCount":{$sum:"$lineItems.prodCount"},
   8:                 "prodCost":{$sum:"$lineItems.Cost"}}},
   9:       {$sort:{prodCost:-1}},
  10:       {$limit:10},
  11:       {$lookup:{
  12:                    from: "customers",
  13:                      as: "c",
  14:              localField: "_id.CustId",
  15:            foreignField: "_id"
  16:       }},
  17:       {$lookup:{
  18:                    from: "products",
  19:                      as: "p",
  20:              localField: "_id.ProdId",
  21:            foreignField: "_id"
  22:       }},
  23:       {$unwind:"$p"},{$unwind:"$c"}, //Get rid of single element arrays
  24:       {$project:{"Customer":"$c.CustomerName","Product":"$p.ProductName",
  25:        prodCount:1,prodCost:1,_id:0}}
  26:     ]);

Lines 11-22 perform lookups on the customers and products collections to get customer and product names.

We could have done these lookups much earlier in the pipeline.  For instance, this code returns exactly the same results, but performs the lookups a little earlier in the sequence:

   1: var output=db.orders.aggregate([
   2:       {$sample:{size:sampleSize}},
   3:       {$match:{orderStatus:"C"}},
   4:       {$project:{CustId:1,lineItems:1}},
   5:       {$unwind:"$lineItems"},
   6:       {$group:{_id:{ CustId:"$CustId",ProdId:"$lineItems.prodId"},
   7:                 "prodCount":{$sum:"$lineItems.prodCount"},
   8:                 "prodCost":{$sum:"$lineItems.Cost"}}},
   9:       {$lookup:{
  10:                    from: "customers",
  11:                      as: "c",
  12:              localField: "_id.CustId",
  13:            foreignField: "_id"
  14:       }},
  15:       {$lookup:{
  16:                    from: "products",
  17:                      as: "p",
  18:              localField: "_id.ProdId",
  19:            foreignField: "_id"
  20:       }},
  21:       {$sort:{prodCost:-1}},
  22:       {$limit:10},
  23:       {$unwind:"$p"},{$unwind:"$c"}, //Get rid of single element arrays
  24:       {$project:{"Customer":"$c.CustomerName","Product":"$p.ProductName",
  25:        prodCount:1,prodCost:1,_id:0}}
  26:     ]);

The difference in performance is striking.  By moving the $lookup steps a few lines earlier, we have created a much less scalable solution:

[Chart: performance comparison of the two pipeline orderings]

When the $lookups come before the $limit step, we have to perform as many lookups as there are matching rows.  When we move them after the $limit, we only have to perform 10.  It’s an obvious but important optimization.
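If you want to verify the effect on your own data, a crude but effective approach is to time the two variants from the shell.  A minimal sketch, assuming the two pipelines above are stored in the (hypothetical) variables lookupBeforeLimit and lookupAfterLimit:

    // Rough shell-side timing; itcount() iterates the cursor fully so the
    // whole pipeline actually executes.  Variable names are illustrative only.
    var t0 = new Date();
    db.orders.aggregate(lookupBeforeLimit).itcount();   // $lookup before $limit
    print("lookup before limit: " + (new Date() - t0) + " ms");

    var t1 = new Date();
    db.orders.aggregate(lookupAfterLimit).itcount();    // $lookup after $limit
    print("lookup after limit:  " + (new Date() - t1) + " ms");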

The aggregation framework is similar in nature to Pig (see this post).  Both provide a procedural way of processing data which is philosophically different from the declarative approach we have become familiar with in the SQL world.  The main thing to remember is that you are in control of the execution plan in an aggregation pipeline.  As the Pig programmers like to say, “it uses the query optimizer between your ears”!
