Like most database systems, MongoDB provides API calls that allow multiple documents to be inserted in a single operation. I’ve written about similar interfaces in Oracle in the past – for instance in this post.
Array/bulk interfaces improve database performance markedly by dramatically reducing the number of round trips between the client and the database. To realize how fundamental an optimization this is, consider that you have a bunch of people that you are going to take across a river. You have a boat that can take 100 people at a time, but for some reason you are only taking one person across on each trip. Not smart, right? Failing to take advantage of array inserts is very similar: you are essentially sending network packets that could carry hundreds of documents with only a single document in each.
On lines 2 or 4 we initialize a bulk object for the “bulkTest” collection. There are two ways to do this: we can create it ordered or unordered. Ordered guarantees that the documents are inserted in the order they are presented to the bulk object. Otherwise, MongoDB can optimize the inserts into multiple streams, which may not insert in order.
On line 9 we add documents to the “bulk” object. When we hit an appropriate batch size (line 11), we execute the batch (line 12) and reinitialize the bulk object (lines 14 or 16). We do a further execute at the end (line 19) to make sure all documents are inserted.
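The pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the original listing, so the line numbers mentioned in the text do not map onto it. The helper name `bulkInsertInBatches` and the in-memory `makeFakeCollection` stand-in are assumptions made up for this example; only `initializeOrderedBulkOp()`, `initializeUnorderedBulkOp()`, `insert()` and `execute()` are the shell's Bulk API calls.

```javascript
// Hypothetical helper (not from the original listing): insert `docs` into
// `collection` through the Bulk API, executing every `batchSize` documents.
// Returns the number of execute() calls, i.e. the number of round trips.
function bulkInsertInBatches(collection, docs, batchSize, ordered) {
  // Create an ordered or unordered bulk object for the collection.
  const newBulk = () => ordered
    ? collection.initializeOrderedBulkOp()
    : collection.initializeUnorderedBulkOp();

  let bulk = newBulk();
  let pending = 0;      // documents queued since the last execute()
  let executeCalls = 0;

  for (const doc of docs) {
    bulk.insert(doc);
    pending++;
    if (pending === batchSize) {
      bulk.execute();   // send the full batch
      executeCalls++;
      bulk = newBulk(); // reinitialize the bulk object for the next batch
      pending = 0;
    }
  }

  // Flush the remainder. Skip this when nothing is queued, because the
  // shell rejects execute() on an empty bulk object.
  if (pending > 0) {
    bulk.execute();
    executeCalls++;
  }
  return executeCalls;
}

// In-memory stand-in for a collection (an assumption for this sketch) so the
// example runs without a MongoDB server.
function makeFakeCollection() {
  const stored = [];
  const makeBulk = () => {
    const queued = [];
    return {
      insert: (doc) => { queued.push(doc); },
      execute: () => { stored.push(...queued); return { nInserted: queued.length }; },
    };
  };
  return {
    stored,
    initializeOrderedBulkOp: makeBulk,
    initializeUnorderedBulkOp: makeBulk,
  };
}

const fake = makeFakeCollection();
const docs = [];
for (let i = 0; i < 2500; i++) docs.push({ n: i });

const calls = bulkInsertInBatches(fake, docs, 1000, true);
console.log(calls, fake.stored.length); // 3 2500
```

Against a real server in the mongo shell, the call would look something like `bulkInsertInBatches(db.bulkTest, docs, 1000, false)`. In the Node.js driver `execute()` returns a promise, so the helper would need to be made `async` and await each call.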
I inserted 100,000 documents into a collection on my laptop, using various “batch” sizes (i.e., the number of documents inserted between execute() calls). I tried both ordered and unordered bulk operations. The results are charted below:
The results are pretty clear: inserting in batches improves performance dramatically. Initially, every increase in batch size reduces elapsed time, but eventually the improvement levels off. I believe MongoDB transparently limits batches to 1000 per operation anyway, but even before then, the chances are your network packets will be filled up and you won’t see any reduction in elapsed time by increasing the batch size. To use the analogy above: the rowboat is full!
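Part of the levelling-off is plain arithmetic, which the following sketch illustrates. It only counts client-side execute() calls for 100,000 documents and does not measure elapsed time. The batch sizes listed are illustrative and not necessarily the ones I tested, and `roundTrips` is a name made up for this example.

```javascript
// Number of execute() calls needed to insert totalDocs documents when
// batchSize documents are queued between calls.
function roundTrips(totalDocs, batchSize) {
  return Math.ceil(totalDocs / batchSize);
}

const totalDocs = 100000;
for (const batchSize of [1, 10, 100, 1000, 10000]) {
  const trips = roundTrips(totalDocs, batchSize);
  console.log(`batch size ${batchSize}: ${trips} execute() calls`);
}
```

Going from a batch size of 1 to 10 removes 90,000 calls, while going from 100 to 1,000 removes only 900, so each further increase has less left to save. If the server splits large batches internally, the real number of network operations at the larger sizes will be higher than this count suggests.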
For my example, there was no real difference between ordered and unordered bulk operations, but this might reflect a limitation of my laptop. Something to play with next time…