Entries in MongoDB (8)

Tuesday, Sep 5, 2017

Announcing dbKoda 0.7

0.7.0 is the second public release of dbKoda and our first post-MVP release. With the MVP (Minimum Viable Product) we definitely nailed the "M" criterion, and in this release we're pushing harder on the "V" side of the equation.

As with 0.6, dbKoda is a free, open source, Vegan product made by groovy people in Melbourne, Australia. It's licensed under the AGPL 3.0.

As well as many bug fixes, brand new bugs and performance improvements, we added the following features:

Aggregation Builder

At this year's MongoDB conference, we spotted one of the MongoDB engineers wearing an "Aggregate() is the new Find()" T-shirt. It's funny because it's true: almost every non-trivial MongoDB data retrieval operation requires an aggregation pipeline. Features such as joins and graph lookups can only be done through the aggregation framework, and under-the-hood features such as the BI connector depend on aggregation as well.
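For example, a join between two collections can only be expressed as a $lookup stage; there is no equivalent in find(). A minimal shell sketch (the orders/customers collections and field names here are hypothetical, not taken from dbKoda itself):

    // Join each completed order to its customer document - something find() alone cannot do
    db.orders.aggregate([
      {$match: {orderStatus: "C"}},        // filter first, so fewer documents are joined
      {$lookup: {
           from: "customers",              // collection to join to
           localField: "custId",           // field in orders (hypothetical)
           foreignField: "_id",            // field in customers
           as: "customer"}},
      {$unwind: "$customer"}               // flatten the single-element array
    ]);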

As anybody who has ever written an aggregation framework pipeline knows, the process is tedious and error prone - matching braces and getting the syntax exactly right is difficult. So in dbKoda 0.7 our aggregation builder allows you to drag and drop pipeline elements and use fill-in-the-blank forms to construct complex pipelines. It's amazing how quickly you can build up a complex pipeline using the builder - a video on our YouTube channel shows me building a non-trivial pipeline in under 60 seconds, so try it out!


Storage Drilldown 

MongoDB can tell you how much space is used up in databases, collections and indexes, but it is not so good at breaking down space within a document. Because MongoDB's document model supports nested arrays of documents, it's often the space used within the documents of a collection that is the most important thing to identify. For instance, a typical cause of space blowouts in MongoDB is an unbounded array of nested documents.
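Those top-level numbers are easy enough to get from the shell; a quick sketch (the collection name is hypothetical):

    // Per-collection statistics: data size, on-disk storage size and total index size
    var stats = db.orders.stats();
    print(stats.size, stats.storageSize, stats.totalIndexSize);

    // Database-wide totals
    printjson(db.stats());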

dbKoda's storage drilldown breaks down space used within databases, collections and indexes, and shows you how storage is used within a collection. It does this in an intuitive graphical presentation that allows you to drill in and out of nested documents.

 

SSH tunneling connections

We were all horrified at the explosion of ransomware attacks on MongoDB databases early in 2017. The root cause of the security vulnerabilities in these databases was the failure to correctly create authenticated users, but it is also true that you take your life in your hands whenever you expose a database port to the public Internet. For this reason it's often best practice to leave database ports open only within a walled garden. If you want to perform day-to-day administration using tools such as dbKoda, you can use SSH tunneling to establish a connection.

This FAQ entry describes SSH tunnelling and how it is used in dbKoda. Put simply, you can now specify an intermediate host which offers you SSH connectivity and use that host to forward  database requests to the secured MongoDB server.
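Outside of dbKoda, the same pattern can be set up by hand with standard SSH local port forwarding; a sketch (host names and ports are hypothetical):

    # Forward local port 27018, via a jump host, to the MongoDB server inside the walled garden
    ssh -N -L 27018:mongo.internal:27017 user@jumphost.example.com

    # The remote database then appears to be local
    mongo --host localhost --port 27018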

Enhanced JSON viewer

Complex JSON documents can be difficult to read. By default dbKoda will display JSON output as it would appear in the MongoDB shell - we aspire to complete shell compatibility, after all! In 0.7, we offer an enhanced JSON viewer that allows you to examine JSON documents at multiple levels of detail, letting you expand and collapse subdocuments and long strings. This facility is available wherever JSON output is displayed in the product, by right-clicking and choosing "Enhanced JSON output".

 

Export/Import

0.7 allows you to load or unload data to or from your MongoDB server. This facility provides GUI access to the `mongodump`, `mongorestore`, `mongoexport` and `mongoimport` commands. 
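The command-line equivalents look roughly like this (database, collection and path names are hypothetical):

    # Binary dump and restore of a database
    mongodump --db mydb --out /backups
    mongorestore --db mydb /backups/mydb

    # JSON export and import of a single collection
    mongoexport --db mydb --collection orders --out orders.json
    mongoimport --db mydb --collection orders --file orders.json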

 

Enhanced performance on Windows

In 0.6, there were some performance issues when very large amounts of data were displayed in our output panel. We worked hard to resolve these and now believe that performance on Windows will match performance on Linux and Mac.

Summing up

We added a lot of cool functionality in this release - we hope you'll try it out. Download dbKoda from www.dbkoda.com and let us know what you think at our support site.

Monday, Jul 17, 2017

Announcing dbKoda!

I'm very excited to announce the release of dbKoda - a next generation database development and administration tool now available for MongoDB.

Those who've been following me know that I've been working with databases since the early Mesozoic period and I've worked in database tooling for almost two decades.

Working with next generation databases like MongoDB has been a lot of fun, but did make me realise how much need there is for a strong tooling ecosystem around these new databases.  I like to think that I made significant contributions to tooling for relational databases and had a strong desire to build something for post-relational systems.

Consequently, late last year I founded the company Southbank Software and this week we launched our first product - dbKoda (www.dbKoda.com).

dbKoda is a modern, open source database development tool. The first release targets MongoDB. It is a 100% JavaScript application which runs on Linux, Mac or Windows. It features a rich editing environment with syntax highlighting, code completion and formatting. It also offers easy graphical access to common MongoDB administration and configuration tasks.


I'm really excited about dbKoda - I hope that it will become the foundation for a product family that will support modern database development across a wide range of environments.   And working closely with the small team of brilliant dbKoda developers has been an absolute privilege.

Check out the dbKoda website and download dbKoda here. You can also check out an introductory video on dbKoda. Please also follow dbKoda on https://twitter.com/db_Koda.


Wednesday, Nov 30, 2016

Optimizing the order of MongoDB aggregation steps

MongoDB does have a query optimizer, and in most cases it's effective at picking the best of multiple possible plans. However, it's worth remembering that in the case of the aggregate function, the sequence in which the various steps are executed is completely under your control. The optimizer won't reorder steps into the optimal sequence to get you out of trouble.


Optimizing the order of steps comes down mainly to reducing the amount of data in the pipeline as early as possible – this reduces the amount of work that has to be done by each successive step. The corollary is that steps that perform a lot of work on the data should be placed after any filtering steps.

Nowhere is this more important than in $lookup steps. Since $lookup steps perform a separate collection lookup for each input document – hopefully using an index – we should make sure we delay them until the data has been filtered down as far as possible. Consider this aggregation, which generates a "top 10" list of product purchases by customer:

   1: var output=db.orders.aggregate([
   2:       {$sample:{size:sampleSize}},
   3:       {$match:{orderStatus:"C"}},
   4:       {$project:{CustId:1,lineItems:1}},
   5:       {$unwind:"$lineItems"},
   6:       {$group:{_id:{ CustId:"$CustId",ProdId:"$lineItems.prodId"},
   7:                 "prodCount":{$sum:"$lineItems.prodCount"},
   8:                 "prodCost":{$sum:"$lineItems.Cost"}}},
   9:       {$sort:{prodCost:-1}},
  10:       {$limit:10},
  11:       {$lookup:{
  12:                    from: "customers",
  13:                      as: "c",
  14:              localField: "_id.CustId",
  15:            foreignField: "_id"
  16:       }},
  17:       {$lookup:{
  18:                    from: "products",
  19:                      as: "p",
  20:              localField: "_id.ProdId",
  21:            foreignField: "_id"
  22:       }},
  23:       {$unwind:"$p"},{$unwind:"$c"}, //Get rid of single element arrays
  24:       {$project:{"Customer":"$c.CustomerName","Product":"$p.ProductName",
  25:        prodCount:1,prodCost:1,_id:0}}
  26:     ]);

Lines 11-22 perform lookups on the customers and products collections to get customer and product names.

We could have done these lookups much earlier in the pipeline.  So for instance, this code returns the exact same results, but does the lookup a little earlier in the sequence:

   1: var output=db.orders.aggregate([
   2:       {$sample:{size:sampleSize}},
   3:       {$match:{orderStatus:"C"}},
   4:       {$project:{CustId:1,lineItems:1}},
   5:       {$unwind:"$lineItems"},
   6:       {$group:{_id:{ CustId:"$CustId",ProdId:"$lineItems.prodId"},
   7:                 "prodCount":{$sum:"$lineItems.prodCount"},
   8:                 "prodCost":{$sum:"$lineItems.Cost"}}},
   9:       {$lookup:{
  10:                    from: "customers",
  11:                      as: "c",
  12:              localField: "_id.CustId",
  13:            foreignField: "_id"
  14:       }},
  15:       {$lookup:{
  16:                    from: "products",
  17:                      as: "p",
  18:              localField: "_id.ProdId",
  19:            foreignField: "_id"
  20:       }},
  21:       {$sort:{prodCost:-1}},
  22:       {$limit:10},
  23:       {$unwind:"$p"},{$unwind:"$c"}, //Get rid of single element arrays
  24:       {$project:{"Customer":"$c.CustomerName","Product":"$p.ProductName",
  25:        prodCount:1,prodCost:1,_id:0}}
  26:     ]);

The difference in performance is striking.  By moving the $lookup a few lines earlier, we have created a much less scalable solution:

[Chart: elapsed time for the two pipeline orderings as the number of matching orders grows]

When the $lookups come before the $limit step, we have to perform as many lookups as there are matching rows. When we move them after the $limit, we only have to perform 10. It's an obvious but important optimization.

The aggregation framework is similar in nature to Pig (see this post). Both provide a procedural way of processing data which is philosophically different from what we have become familiar with in the SQL world. The main thing to remember is that you are in control of the execution plan in an aggregation pipeline. As the Pig programmers like to say, "it uses the query optimizer between your ears"!

Monday, Nov 7, 2016

Bulk inserts in MongoDB

Like most database systems, MongoDB provides API calls that allow multiple documents to be inserted in a single operation. I've written about similar interfaces in Oracle in the past – for instance in this post.

Array/bulk interfaces improve database performance markedly by reducing the number of round trips between the client and the database – dramatically. To realize how fundamental an optimization this is, imagine you have to take a group of people across a river. You have a boat that can carry 100 people at a time, but for some reason you are only taking one person across in each trip – not smart, right? Failing to take advantage of array inserts is very similar: you are sending network packets that could carry hundreds of documents with only a single document each.

Coding bulk inserts in MongoDB is a little more work, but far from rocket science.  The exact syntax varies depending on the language.  Here we’ll look at a little bit of JavaScript code. 

 

   1: if (orderedFlag==1) 
   2:   bulk=db.bulkTest.initializeOrderedBulkOp();
   3: else 
   4:   bulk=db.bulkTest.initializeUnorderedBulkOp(); 
   5:  
   6: for (i=1;i<=NumberOfDocuments;i++) {
   7:   //Insert a row into the bulk batch
   8:   var doc={_id:i,i:i,zz:zz};
   9:   bulk.insert(doc);
  10:   // Execute the batch if batchsize reached
  11:   if (i%batchSize==0) {
  12:     bulk.execute();
  13:     if (orderedFlag==1)
  14:       bulk=db.bulkTest.initializeOrderedBulkOp();
  15:     else
  16:       bulk=db.bulkTest.initializeUnorderedBulkOp();
  17:   }
  18: }
  19: if (NumberOfDocuments%batchSize!=0) bulk.execute(); // flush any leftover documents

On lines 2 or 4 we initialize a bulk object for the "bulkTest" collection. There are two ways to do this – we can create it ordered or unordered. Ordered guarantees that the documents are inserted in the order they are presented to the bulk object. Otherwise, MongoDB can optimize the inserts into multiple streams which may not insert in order.

On line 9 we add documents to the “bulk” object.  When we hit an appropriate batch size (line 11), we execute the batch (line 12) and reinitialize the bulk object (lines 14 or 16).  We do a further execute at the end (line 19) to make sure all documents are inserted. 
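As an aside, recent shells and drivers also offer insertMany(), which takes an array of documents and sends them in batches for you; a minimal sketch of the same idea:

    // Build up an array of documents and insert them in a single call
    var docs = [];
    for (var i = 1; i <= 1000; i++) {
      docs.push({_id: i, i: i});
    }
    db.bulkTest.insertMany(docs, {ordered: false});   // unordered, like initializeUnorderedBulkOp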

I inserted 100,000 documents into a collection on my laptop, using various batch sizes (i.e., the number of documents inserted between execute() calls). I tried both ordered and unordered bulk operations. The results are charted below:

[Chart: elapsed time to insert 100,000 documents at various batch sizes, ordered vs. unordered]

The results are pretty clear – inserting in batches improves performance dramatically. Initially, every increase in batch size reduces elapsed time, but eventually the improvement levels off. I believe MongoDB transparently limits batches to 1,000 operations anyway, but even before that point the chances are your network packets will be full and you won't see any further reduction in elapsed time by increasing the batch size. To use the analogy above – the rowboat is full!

For my example, there was no real difference between ordered and unordered bulk operations, but this might reflect a limitation of my laptop. Something to play with next time….

When inserting multiple documents into a MongoDB collection you should generally take advantage of the massive performance advantages offered by the bulk operations interface.

Wednesday, Aug 24, 2016

Graph Lookup in MongoDB 3.3

Dedicated graph databases such as Neo4j specialize in traversing graphs of relationships – such as those you might find in a social network. Many non-graph databases have been incorporating graph compute engines to perform similar tasks. In the MongoDB 3.3 release, we now have the ability to perform simple graph traversal using the $graphLookup aggregation framework stage. This will become a production feature in the 3.4 release.

The new feature is documented in MongoDB Jira SERVER-23725.  The basic syntax is shown here:

   1: {$graphLookup: {
   2:         from: <name of collection to look up into>,
   3:         startWith: <expression>,
   4:         connectFromField: <name of field in documents from "from">,
   5:         connectToField: <name of field in documents from "from">,
   6:         as: <name of field in output documents>,
   7:         maxDepth: <optional - non-negative integer>,
   8:         depthField: <optional - name of field in output documents>
   9: }}

I started playing with this capability using the POKEC dataset, which represents data from a real social network in Slovakia. The relationship file soc-pokec-relationships.txt.gz contains the social network for about 1.2 million people. I loaded it into MongoDB using this Perl script. The following pipeline did the trick:

   1: gzip -dc ~/Downloads/soc-pokec-relationships.txt |perl loadit.pl|mongoimport -d GraphTest -c socialGraph --drop

Now we have a collection with records like this:

   1: > db.socialGraph.findOne()
   2: {
   3:     "_id" : ObjectId("57b841b02e2a30792c8bb6bd"),
   4:     "person" : 1327456,
   5:     "name" : "User# 1327456",
   6:     "friends" : [
   7:         427220,
   8:         488072,
   9:         975403,
  10:         1322901,
  11:         1343431,
  12:         51639,
  13:         54468,
  14:         802341
  15:     ]
  16: }

We can expand the social network for a single person using a syntax like this:

   1: db.socialGraph.aggregate([
   2:     {
   3:         $match: {person:1476767}
   4:     },
   5:     {
   6:         $graphLookup: {
   7:             from: "socialGraph",
   8:             startWith: [1476767],
   9:             connectFromField: "friends",
  10:             connectToField: "person",
  11:             as: "socialNetwork",
  12:             maxDepth:2,
  13:             depthField:"depth"
  14:         }
  15:     },
  16:     {
  17:        $project: {_id:0,name:1,"Network":"$socialNetwork.name",
  18:                                  "Depth":"$socialNetwork.depth" }
  19:     },
  20:     {$unwind: "$Network"}
  21: ])

What we are doing here is starting with person 1476767, then following the elements of the friends array out to two levels – i.e.: to “friends of friends”.

Increasing the maxDepth exponentially increases the amount of data we have to cope with. This is the notorious "seven degrees of separation" effect – most people in a social network are linked by 6-7 hops, so once we get past that we are effectively traversing the entire set. Unfortunately, this meant that traversing more than 3 levels deep caused me to run out of memory:

   1: assert: command failed: {
   2:     "ok" : 0,
   3:     "errmsg" : "$graphLookup reached maximum memory consumption",
   4:     "code" : 40099
   5: } : aggregate failed

The graph lookup can consume at most 100MB of memory and currently doesn't spill to disk, even if the allowDiskUse: true clause is specified within the aggregation arguments. SERVER-23980 is open to correct this, but it doesn't appear to have been scheduled yet.
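For reference, allowDiskUse is passed as an option to aggregate() and does let other memory-hungry stages such as $sort spill to disk - just not $graphLookup, as noted above. A sketch:

    // allowDiskUse helps blocking stages like $sort, but (as of 3.3) not $graphLookup
    db.socialGraph.aggregate(
       [ /* pipeline as above */ ],
       {allowDiskUse: true}
    );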

So I tried building a "flatter" network so that I wouldn't run out of memory. This JavaScript builds the network and this JavaScript runs some performance tests. I tried expanding the network both with and without a supporting index on the connectToField (person, in this case). Here are the results (note the logarithmic scale):

[Chart: graph traversal elapsed time by depth, with and without an index on the connectToField (logarithmic scale)]

For shallow networks, having an index on the connectToField makes an enormous difference. But as the depth increases, the index performance advantage diminishes and eventually performance matches that of the unindexed case. In this example data, that happens at about the "7 degrees of separation" point, but it will clearly depend on the nature of the data.
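For reference, the supporting index used above is just an ordinary single-field index on the connectToField:

    // Each recursive $graphLookup probe becomes an index lookup rather than a collection scan
    db.socialGraph.createIndex({person: 1});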

The $graphLookup operator is a very powerful addition to the MongoDB aggregation framework and continues the trend of providing richer query capabilities within the server. Mastering the aggregation framework is clearly a high priority for anyone wanting to exploit the full power of MongoDB.