Traackr recently open sourced Mongo Follower, our easy-to-use harness for synchronizing your application with a Mongo database. It provides a simple interface for efficiently exporting all documents from a Mongo collection, seamlessly followed by an indefinite stream of Mongo oplog events. Your position in the oplog is saved to disk, allowing for clean restarts with no data loss.

At Traackr, we use MongoDB to store social data and keep that data synchronized with a separate analytics database. Over the years, we’ve used various tools to keep these two data stores synchronized — MongoRiver, MongoConnector, Mongolastic, scripts, cron jobs and a lot of carefully crafted code. More and more, we’ve realized that using the Mongo oplog to synchronize this data and keep our applications decoupled makes things easier to implement and less error-prone. After creating several internal apps that use the oplog, we decided to build a generic library to make building and maintaining those applications even easier. We’re calling it Mongo Follower.

How does it work?

A producer thread pulls data from Mongo, and a dispatch thread asynchronously sends that data to a special listener object. Processing happens in two phases: export and oplog. The export phase is only needed for the initial run. A Mongo query requests all documents from your collection, which are then sent to the ‘exportDocument’ callback. Once all documents are exported, the oplog phase begins.
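The producer/dispatch hand-off described above can be sketched with a standard `BlockingQueue`. This is an illustrative stand-in, not the library’s actual classes: the producer thread stands in for the Mongo query, and the dispatcher stands in for the thread that invokes the listener callbacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative two-thread hand-off: a producer reads "documents" onto a queue,
// while a dispatch thread drains the queue and hands each one to a callback.
class ProducerDispatchSketch {
    static final String POISON = "__done__"; // sentinel marking end of the export

    static List<String> run(List<String> documents) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> received = new ArrayList<>();

        Thread producer = new Thread(() -> {
            for (String doc : documents) queue.add(doc); // stands in for the Mongo query
            queue.add(POISON);
        });

        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();
                    if (doc.equals(POISON)) break;
                    received.add(doc); // stands in for the exportDocument callback
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        dispatcher.start();
        try {
            producer.join();
            dispatcher.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return received;
    }
}
```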

The Mongo oplog acts like any other collection in MongoDB, so setting up the oplog tail is very similar to the initial export. The first difference is that the cursor is configured with cursorType(CursorType.TailableAwait), which keeps the cursor open even after the last record has been read, since more records become available whenever the collection changes. The query is also configured to skip already-processed entries after a restart, using a timestamp file that is loaded from disk on startup and automatically updated at regular intervals. Oplog documents have a specific format indicating an insert, delete or update; Mongo Follower parses this format and calls the corresponding insert, delete and update callbacks to make the data easier to consume.
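To make the oplog format concrete: each oplog entry carries an `op` field (`"i"` for insert, `"u"` for update, `"d"` for delete, plus no-ops and commands), the affected document in `o`, and, for updates, the match criteria in `o2`. The sketch below shows how that field can be routed to typed callbacks; the callback interface here is a stand-in, not the library’s actual API.

```java
import java.util.Map;

// Routes a (simplified) oplog entry to typed callbacks based on its "op" field.
// Real oplog entries are BSON documents; plain Maps stand in for them here.
class OplogDispatchSketch {
    interface Callbacks {
        void insert(Map<String, Object> doc);
        void update(Map<String, Object> criteria, Map<String, Object> change);
        void delete(Map<String, Object> doc);
    }

    @SuppressWarnings("unchecked")
    static void dispatch(Map<String, Object> entry, Callbacks callbacks) {
        String op = (String) entry.get("op");
        Map<String, Object> o = (Map<String, Object>) entry.get("o");
        switch (op) {
            case "i": callbacks.insert(o); break;
            case "u": callbacks.update((Map<String, Object>) entry.get("o2"), o); break;
            case "d": callbacks.delete(o); break;
            default:  break; // no-ops ("n") and commands ("c") are ignored in this sketch
        }
    }
}
```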

How do you use it?

Mongo Follower is currently available on GitHub and Maven Central and can be included in your project like any other Maven library:
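For example, a dependency entry along these lines goes in your `pom.xml` (the coordinates and version shown are illustrative — check Maven Central for the current values):

```xml
<dependency>
  <!-- Illustrative coordinates; confirm groupId/artifactId/version on Maven Central -->
  <groupId>com.traackr</groupId>
  <artifactId>mongo-follower</artifactId>
  <version>1.0.0</version>
</dependency>
```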

Once the library has been included, you can implement the MongoEventListener interface. An instance of this listener will be provided to the Mongo Follower process:
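A listener implementation might look roughly like this. The interface below is a stand-in that mirrors the callbacks described earlier (exportDocument plus insert, update and delete); the real `MongoEventListener` lives in the library, and its exact signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for the library's MongoEventListener: one callback per event type.
// The implementation just records each event so the flow is easy to follow.
class ListenerSketch {
    interface EventListener {
        void exportDocument(Map<String, Object> document); // initial export phase
        void insert(Map<String, Object> document);         // oplog insert
        void update(Map<String, Object> document);         // oplog update
        void delete(Map<String, Object> document);         // oplog delete
    }

    static class LoggingListener implements EventListener {
        final List<String> events = new ArrayList<>();

        public void exportDocument(Map<String, Object> d) { events.add("export:" + d.get("_id")); }
        public void insert(Map<String, Object> d)         { events.add("insert:" + d.get("_id")); }
        public void update(Map<String, Object> d)         { events.add("update:" + d.get("_id")); }
        public void delete(Map<String, Object> d)         { events.add("delete:" + d.get("_id")); }
    }
}
```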

To start Mongo Follower, we’ve created a Runner utility that can be used in a couple of ways. The easiest is to pass in a FollowerConfig object:
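Conceptually, the configuration ties together a connection string, the collection to follow, and the timestamp file used for restarts. The class below is a hypothetical stand-in to show that shape — the real `FollowerConfig` and `Runner` are defined by the library, and their fields and methods may differ (see the GitHub README for actual usage).

```java
import java.nio.file.Path;

// Hypothetical stand-in for a follower configuration: not the library's API,
// just an illustration of what the wiring carries.
class FollowerConfigSketch {
    final String mongoUri;         // connection string for the source database
    final String database;
    final String collection;
    final Path oplogTimestampFile; // where the current oplog position is persisted

    FollowerConfigSketch(String mongoUri, String database, String collection, Path oplogTimestampFile) {
        this.mongoUri = mongoUri;
        this.database = database;
        this.collection = collection;
        this.oplogTimestampFile = oplogTimestampFile;
    }

    String describe() {
        return "follow " + database + "." + collection + " via " + mongoUri
                + ", resume from " + oplogTimestampFile;
    }
}
```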

That’s all there is to it! Additional details can be found on GitHub.

What’s next?

We’re excited to try out a more reactive approach to dealing with our data. We can decouple creation and setup code when adding special documents, re-index documents to Elasticsearch whenever there are changes, optimize performance by observing usage patterns, or even test out alternative storage backends for some of our larger collections.

There are a handful of improvements we’d like to make, specifically:

  • Multiple listeners
  • Tailing and exporting multiple databases simultaneously
  • Built-in metrics
  • Cluster / Shard management

Even without those features, we’re eager to continue leveraging the Mongo oplog to make our lives easier.