Mongo Follower, a MongoDB export and oplog event generator for Java

Traackr recently open sourced Mongo Follower, our easy-to-use harness for synchronizing your application with a Mongo database. It provides a simple interface to efficiently export all documents from a Mongo collection, seamlessly followed by an indefinite stream of Mongo oplog events. Your location in the oplog is saved to disk, allowing for clean restarts with no data loss.

At Traackr, we use MongoDB to store social data and keep that data synchronized with a separate analytics database. Over the years, we've used various tools to keep these two data stores synchronized -- MongoRiver, MongoConnector, Mongolastic, scripts, cron jobs and a lot of carefully crafted code. More and more, we've realized that utilizing the Mongo oplog to synchronize this data and keep our applications decoupled makes things easier to implement and less error prone. After creating several internal apps which use the oplog, we decided to build a generic library to make building and maintaining those applications even easier. We're calling it Mongo Follower.

How does it work?

A producer thread pulls data from Mongo, and a dispatch thread asynchronously sends that data to a special listener object. Processing happens in two phases: export and oplog. The export phase is only needed for the initial run. A Mongo query requests all documents from your collection, which are then sent to the 'exportDocument' callback. Once all documents are exported, the oplog phase begins.
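
Conceptually, the plumbing is the classic producer/consumer pattern: one thread pulls records, a queue buffers them, and another thread hands each record to the listener. The sketch below is our own simplified illustration of that pattern; the class and listener names are made up and are not Mongo Follower's internals:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Simplified illustration of the producer/dispatcher pattern described above.
    // These names are hypothetical; Mongo Follower's real classes differ.
    public class ProducerDispatcherSketch {

        interface RecordListener {           // stand-in for the listener object
            void onRecord(String record);
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
            RecordListener listener = record -> System.out.println("received: " + record);

            // Producer: pulls data (here, fake documents) and enqueues it.
            Thread producer = new Thread(() -> {
                for (int i = 0; i < 5; i++) {
                    try {
                        queue.put("document-" + i);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });

            // Dispatcher: drains the queue and invokes the listener callback.
            Thread dispatcher = new Thread(() -> {
                try {
                    while (true) {
                        listener.onRecord(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.start();
            dispatcher.start();
            producer.join();
            Thread.sleep(500);       // give the dispatcher time to drain the queue
            dispatcher.interrupt();  // stop the (otherwise endless) dispatch loop
        }
    }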

The Mongo oplog acts like any other collection in MongoDB, so setting up the oplog tail is very similar to the initial export. The first difference is that the cursor is configured with cursorType(CursorType.TailableAwait), which means it remains open even after reading the last record, since more records become available whenever the collection changes. The query is also configured to skip the portion of the oplog that has already been processed. This is done with a timestamp file that is loaded from disk on startup and automatically updated at regular intervals, which is what allows clean restarts. Oplog documents have a special format indicating an insert, delete or update; Mongo Follower parses this format and calls the corresponding insert, delete and update callbacks to make the data easier to consume.
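
For a concrete picture, here is roughly what a tailable oplog cursor looks like with the MongoDB Java driver. This is an illustration rather than Mongo Follower's actual code, and the starting timestamp is a hard-coded placeholder for the value that would normally be loaded from the timestamp file:

    import com.mongodb.CursorType;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.model.Filters;
    import org.bson.BsonTimestamp;
    import org.bson.Document;

    public class OplogTailSketch {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");

            // The oplog lives in the "local" database on a replica set member.
            MongoCollection<Document> oplog =
                    client.getDatabase("local").getCollection("oplog.rs");

            // Placeholder: Mongo Follower loads this from its timestamp file on disk.
            BsonTimestamp lastSeen = new BsonTimestamp(0, 0);

            MongoCursor<Document> cursor = oplog
                    .find(Filters.gt("ts", lastSeen))
                    .cursorType(CursorType.TailableAwait) // stay open and wait for new entries
                    .iterator();

            while (cursor.hasNext()) {
                Document entry = cursor.next();
                String op = entry.getString("op"); // "i" = insert, "u" = update, "d" = delete
                System.out.println(op + " on " + entry.getString("ns"));
            }
        }
    }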

How do you use it?

Mongo Follower is currently available on GitHub and Maven Central and can be included in your project like any other Maven library.
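
For example, with Maven the dependency looks something like this; the coordinates and version below are illustrative placeholders, so check the project's README or Maven Central for the exact groupId, artifactId and current version:

    <!-- Illustrative coordinates; verify the real ones on Maven Central -->
    <dependency>
      <groupId>com.traackr</groupId>
      <artifactId>mongo-follower</artifactId>
      <version>1.0.0</version>
    </dependency>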

Once the library has been included, you can implement the MongoEventListener interface. An instance of this listener will be provided to the Mongo Follower process:
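
The sketch below shows the general shape of such a listener. The callback names (exportDocument, insert, update and delete) come straight from the description above, but the exact MongoEventListener method signatures are defined by the library, so treat the parameter types here as assumptions and consult the README for the real interface:

    import org.bson.Document;

    // Sketch of a listener with the callbacks described in this post. The real
    // MongoEventListener interface lives in the mongo-follower library and its
    // exact method signatures may differ; the parameter types here are assumptions.
    public class LoggingListenerSketch {

        // Called once per document during the initial export phase.
        public void exportDocument(Document doc) {
            System.out.println("export: " + doc.toJson());
        }

        // Called for each insert observed in the oplog.
        public void insert(Document doc) {
            System.out.println("insert: " + doc.toJson());
        }

        // Called for each update observed in the oplog.
        public void update(Document doc) {
            System.out.println("update: " + doc.toJson());
        }

        // Called for each delete observed in the oplog.
        public void delete(Document doc) {
            System.out.println("delete: " + doc.toJson());
        }
    }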

To start Mongo Follower, we've created a Runner utility, which can be used in a couple of ways. The easiest is to pass in a FollowerConfig object.
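
Roughly, the wiring looks like the sketch below; the builder option names and the Runner entry point here are our assumptions to show the shape, and the real FollowerConfig fields and Runner usage are documented in the README:

    // Hypothetical wiring -- the option names below are assumptions, not the
    // library's actual API; see the mongo-follower README for the real one.
    FollowerConfig config = FollowerConfig.builder()
            .listener(new LoggingListenerSketch())           // your MongoEventListener
            .mongoConnectionString("mongodb://localhost:27017")
            .mongoDatabase("social")
            .mongoCollection("profiles")
            .oplogFile("/data/follower/oplog.timestamp")     // resume point saved here
            .build();

    Runner.run(config);  // export phase first, then tails the oplog indefinitely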

That's all there is to it! Additional details can be found on GitHub.

What's next?

We're excited to try out a more reactive approach to dealing with our data. We can decouple creation and setup code when adding special documents, re-index documents to Elasticsearch whenever there are changes, optimize performance by observing usage patterns, or even test out alternative storage backends for some of our larger collections.

There are a handful of improvements we'd like to make, specifically:

  • Multiple listeners
  • Tailing and exporting multiple databases simultaneously
  • Built-in metrics
  • Cluster / Shard management

Even without those features, we're eager to continue leveraging the Mongo oplog to make our lives easier.


Git Driven Release Management

Like many engineering teams, here at Traackr we like to streamline our workflow to spend as much time as possible building cool features. We also like to release early and often. And our product team likes being able to answer pesky questions like "has feature X been deployed?" or "there was a regression with feature Y, what release was that included in?". At times, these requirements can be a give and take. Using an issue tracker helps answer questions about when things were released, but navigating an application like Jira is a context switch. Plus, it's easy to click the wrong buttons. Releasing often means these context switches and mistakes happen frequently. Can we do better?

Some of our teams release more often than others.

We think so, which led us to look at what we like about our tooling. One great thing about using Jira and Bitbucket is the way they seamlessly link commits to issues when you include a ticket ID in your commit message. Because of this, we've standardized on a "<Ticket ID> <Message>" commit message format and enforce it with a commit hook. This got us thinking... if we can link Jira tickets to commit messages, can we also map our git tags to Jira fix versions? It turns out the answer is yes!
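
A commit-msg hook for this only needs a few lines. The sketch below is a generic illustration of such a hook, not our exact one, and it uses a simplified version of the ticket-ID pattern discussed later in this post:

    #!/usr/bin/env python
    """commit-msg hook sketch: reject commits whose message lacks a ticket ID.

    Generic illustration only, not Traackr's actual hook.
    """
    import re
    import sys

    # The first line must look like "PROJ-123 Fix the thing".
    PATTERN = re.compile(r"^[A-Z]+-\d+[ :-].+")


    def main(message_file):
        with open(message_file) as handle:
            first_line = handle.readline().strip()
        if not PATTERN.match(first_line):
            sys.stderr.write("Commit message must look like '<Ticket ID> <Message>'\n")
            return 1
        return 0


    if __name__ == "__main__":
        # git passes the path to the commit message file as the first argument
        sys.exit(main(sys.argv[1]))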

Out of the box integration is nice, but sometimes you need to do it yourself.

The prototype

There were two main problems to solve: how do you identify the range of commits associated with a release, and how do you interact with Jira to tag those issues with a new fix version? Enter release_fix_versioner.py.

At Traackr, we use a git-flow tagging and branching strategy, which is very convenient because every release already has a tag to reference. By providing our script with a new git tag, we can grab all commits between that tag and the previous one to represent the commits that make up the release. Given a startTag and an endTag, git has a handy command, "git log --format=%s startTag...endTag", that prints the commit messages in an easy-to-parse format. From there, it's a simple matter of parsing those messages to come up with a unique set of issue IDs. To support a variety of commit formats across different teams in our organization, the parsing is done with a regular expression that uses named groups to identify the key and message: "(?P<key>[\w]*-[\d]*)[ :-](?P<value>.*)".
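
Sketched in Python, that part of the script looks roughly like the following. The function names are ours for illustration; the git command and regular expression are exactly the ones above:

    import re
    import subprocess

    # Named groups identify the ticket key and the commit message.
    COMMIT_PATTERN = re.compile(r"(?P<key>[\w]*-[\d]*)[ :-](?P<value>.*)")


    def commit_messages(start_tag, end_tag):
        """Return the subject line of every commit between the two tags."""
        output = subprocess.check_output(
            ["git", "log", "--format=%s", "%s...%s" % (start_tag, end_tag)])
        return output.decode("utf-8").splitlines()


    def issue_keys(start_tag, end_tag):
        """Parse the commit subjects into a unique set of Jira issue keys."""
        keys = set()
        for message in commit_messages(start_tag, end_tag):
            match = COMMIT_PATTERN.match(message)
            if match:
                keys.add(match.group("key"))
        return keys


    if __name__ == "__main__":
        print(sorted(issue_keys("1.0.0", "1.1.0")))  # example tags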

Interacting with Jira turns out to be simple as well; the API is extensive and does just about anything you could ask for. To start with, we decided to add some validation to each ticket before tagging it with a fix version. We check things like making sure the work is finished by verifying the ticket's status equals "Done". This method is really all there is to it.
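
Reconstructed as a sketch with the Python jira client library, a validation method along these lines does the job (the real method lives in release_fix_versioner.py on GitHub and may differ in the details):

    from jira import JIRA  # pip install jira


    def validate_issue(jira_client, issue_key):
        """Make sure the ticket is finished before we touch its fix versions."""
        issue = jira_client.issue(issue_key)
        status = issue.fields.status.name
        if status != "Done":
            print("WARNING: %s is '%s', expected 'Done'" % (issue_key, status))
            return False
        return True


    # Example usage; the URL and credentials are placeholders.
    # client = JIRA("https://yourcompany.atlassian.net", basic_auth=("user", "api-token"))
    # validate_issue(client, "PROJ-123")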

After creating two more similar methods, one to create a new fix version and one to add that fix version to each ticket, we've got ourselves a prototype.
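
The full prototype is on GitHub; sketched with the Python jira client library, those two extra steps look roughly like this (again, the names and details here are illustrative):

    def create_fix_version(jira_client, project_key, version_name):
        """Create the Jira fix version for this release if it doesn't exist yet."""
        existing = [v.name for v in jira_client.project_versions(project_key)]
        if version_name not in existing:
            jira_client.create_version(name=version_name, project=project_key)


    def add_fix_version(jira_client, issue_key, version_name):
        """Append the new fix version to a ticket, keeping any existing ones."""
        issue = jira_client.issue(issue_key)
        versions = [{"name": v.name} for v in issue.fields.fixVersions]
        versions.append({"name": version_name})
        issue.update(fields={"fixVersions": versions})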

Next Steps

Now that we have this script, we can start using it to simplify manual issue management. If it works out, we'll include it as an automated step in our one-button deploy job, and we'll never have to worry about Jira being out of date again. This could even be used to generate release notes for our public-facing applications.

All the code for this hackweek project is available on GitHub. Give it a try and let us know how it went!