Athena vs Q: Battle of the Data Parsers

What are we fighting for?

Here at Traackr, we pull together data from a variety of sources and visualize it in various forms -- bar charts, timelines, etc. Typically, this data comes from APIs in the form of JSON or XML, formats most developers and their tools are well equipped to handle. Now, our team has been tasked with a new data source in the format of... CSV... Artisanally crafted and delivered by our data team... Oh, joy... We will be getting these on a regular basis, and they are not small files.

The APIs we use typically have an interface to search through the data and return what we need. CSV files are just lumps of text formatted into columns. What we need is some way to easily query that data. We could roll our own solution to parse the files and store them away in MySQL, MongoDB, etc. However, someone must have invented this wheel already. Surely, there must be a solution we could leverage rather than building our own process and infrastructure.

Here comes a new challenger...

Traackr uses a variety of Amazon services, and Amazon's services are legion. I began searching through their list and found an interesting item... Amazon Athena. The one line service description was "Query data in S3 using SQL". Intriguing... Let's read on...

" with a variety of standard data formats, including CSV..." What?! You're telling me our data team could dump the CSV files into a bucket, and we could query it using a language we are very familiar with. That sounds too good to be true! I must play with this... Now...

A quick chat with the keeper of our Amazon access got the ball rolling to be able to access Athena and start tinkering. While discussing the use case with them, they mentioned they use a similar tool locally called q. Excellent! I now have two combatants to duke it out in the arena and see what solution might best fit our needs. The initial battle is over; time to break down the fight blow by blow.

Ease and cost of setup

Setting up q is a straightforward process of installing the package through your operating system's package manager. I use OS X myself, making it a quick call to brew. It is also a free, open source tool, so no out-of-pocket expenses are incurred. After installation, I am ready to run queries against CSV files on my drive.

If you have ever worked with any Amazon service, you'll know setting something up is never quite that simple. Athena needs a bucket in S3 to house the CSV files. Right... Not a hugely time-consuming task, but that storage will add to our monthly costs. Bucket created... Some folders under the bucket created to separate the CSV files by type... CSV files uploaded... Now what?

Next, you have to create a database in Athena. There is a decent UI for doing this, so not a huge time hurdle there. A database has to have some tables to query; Athena creates tables by defining which S3 bucket holds the source data and what data types and names you want for each column in the data. Some of the non-CSV data sources have ways to auto-discover and build the tables for you.

Unfortunately, there was no auto-generation I could find for CSV files. The UI for building the tables manually is decent but a bit clunky. However, once you have built one table, you can use the UI to issue a command to get the CREATE TABLE syntax used to create it. Armed with this, you can easily modify the query for other tables and issue it directly. With my tables built, what then will Amazon charge me for using this service? Turns out the pricing is US$5 per terabyte of data scanned. Amazon has recommendations on how you can partition and compress your data to reduce the amount of scanning. Our CSV files are large but not large enough to likely push us over that price point too often. Time will tell...
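For reference, the DDL for a CSV-backed table looks something like this; the bucket path here is hypothetical, and the table mirrors the users examples used later in this post:

```sql
-- Athena tables are always EXTERNAL; the data itself stays in S3.
CREATE EXTERNAL TABLE users (
  first_name string,
  last_name string,
  location string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://our-athena-bucket/users/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

The skip.header.line.count property tells Athena to ignore the header row in each file rather than return it as data.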

There is one additional cost to Athena that is not readily apparent though. When you query data, Athena creates and stores two files -- one is a CSV file with the results of the query and the other is a file with metadata about the query. These files are stored in another S3 bucket (you can configure where that bucket lives or just have it use an auto-generated bucket) meaning you will be paying storage fees beyond the fees for storing your source data. We'll likely throw a caching layer in front of our use of Athena if we move forward with it, as we do with other API sources. Still, something we need to keep in mind.

Winner: q takes the opening round for being quick to install and completely free.

Current Score: q 1 - Athena 0

Show me how you set a table!

We've already shown how to build a table in Athena. The nice bit is adding data to that table is as easy as uploading a new CSV file into the S3 bucket. That's it! Any query you issue against that table is going to pull from all files in the bucket. No need for any "query against these things" logic beyond the standard "issue a query against this table" SQL syntax.

q, however, is a command line tool requiring you to somehow tell it what files to operate on. This is done in the FROM portion of the query you issue:

SELECT * FROM /path/to/file1.csv+/path/to/file2.csv+... WHERE...

Hmm... It would be easy enough to build a string in the proper format by globbing all the files out of a directory. However, this syntax raises a question. The list of files will do nothing but grow. How long before we reach some odd limit on the length of the command you can issue? That's the sort of question that tends to be answered around 2AM when your most important customer in Europe needs the data this would provide ASAP. We could build in logic to issue multiple commands and then massage the data together. Ick... Sounds like a big ball of complex mud filled with nails...
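As a sketch of that globbing approach (the directory and file names here are made up), building the plus-separated list is only a few lines of shell:

```shell
# Hypothetical example: join every CSV in a directory with '+'
# so q will treat them as a single table.
mkdir -p /tmp/q-glob-demo
touch /tmp/q-glob-demo/part1.csv /tmp/q-glob-demo/part2.csv

files=$(printf '%s+' /tmp/q-glob-demo/*.csv)
files=${files%+}   # drop the trailing '+'
echo "$files"

# The query itself would then be (q must be installed):
# q -H -d ',' "SELECT first_name, last_name FROM $files WHERE location = 'Asgard'"
```

The command-length concern remains, of course; the glob only hides the growth, it does not stop it.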

Winner: Athena evens things up by having tables grow simply by uploading a new file.

Current Score: q 1 - Athena 1

I have questions... How do I get my answers?

Both Athena and q boast the ability to query the CSV files using SQL, and they both do exactly that. I can issue this SQL against either and get the same data back:

SELECT first_name, last_name, location
FROM users
WHERE first_name = 'Odin'

The only initial difference is that the table name for q is a file path. q supports SQLite syntax. Athena is based on Presto but does not support all of Presto's features. However, the list of what is supported seems sufficient for our needs.

One item did stand out though. I can use LIMIT and OFFSET with q to get pages of data. Athena supports LIMIT but not OFFSET. After some research, this appears to be a consequence of the distributed architecture of the Presto engine: results are streamed back from the workers in a single pass, so there is no cheap way to skip ahead to an offset. You get back all the data in one go.
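To make the difference concrete, a paging query in this style runs fine under q's SQLite engine but is rejected by Athena (at least as of this writing):

```sql
SELECT first_name, last_name, location
FROM users
ORDER BY last_name
LIMIT 25 OFFSET 50
```

With Athena, the equivalent would be fetching everything and paging in the application layer.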

Winner: Too close to call... Both satisfy the SQL needs to pull the data from the files. The loss of OFFSET in Athena is not a significant blow. Call it a draw.

Current Score: q 2 - Athena 2

I want to change the design of my table

We all know data formats never, ever change. #sarcasm If the data were to change, how easy is it to alter your tables accordingly? q has a straightforward answer. It simply doesn't care. It reads the header row from your CSV file and uses that information for column names. Just adjust the column names in your query and you are good to go.

Athena handles this differently. We had to explicitly define the table structure in Athena. Each column in the table maps to a column in the CSV file in order. e.g., Column 0 in the table maps to column 0 in the CSV file, column 1 in the table to column 1 in the file, etc. Any alteration in the file or alteration you wish to make to the table itself -- new columns added to the file, a column you want to rename in the table -- requires dropping the table and recreating it.

Recall the Athena UI has a way to get the CREATE TABLE syntax used to build a table. You would need to get that statement, alter it, drop the existing table, and then issue the new command. The upside is you can drop the table entirely without worrying about the data. It is stored in S3, and the table is "repopulated" once you create it again.

What happens though if your CSV files have no header row defining the column names? Athena has a slight edge here given it doesn't need a header row. You define that column X in the file is column Y in your table with the name you choose. q can handle not having a header row, but the syntax for column references becomes numeric:

SELECT c1, c2, c3
FROM users
WHERE c2 = 'Odinson'

Winner: Again, too close to call... You are going to have to update something with either tool when the table structure changes. Neither the drop / recreate pattern in Athena nor q's numeric pattern for handling no header row in the data seem to sway the judges either way. Another draw...

Current Score: q 3 - Athena 3


Just how fast are these tools?

It is difficult to make a true comparison here. I do not have enough files to work with currently to say how they perform at scale. I also have yet to throw the actual SQL statements we would need (they are still being forged in the dark places) at both tools. q certainly feels slightly faster, but nothing about Athena's performance has given me concerns yet.

Winner: Not enough data to go on. No points awarded.

Current Score: q 3 - Athena 3

How would I integrate them?

The final and most important round... If I were to use this tool, how would I go about integrating it into our stack and workflow? Either tool requires some interface for our data team to upload new files. They don't care where the files end up; they just need to ensure they are available for use. The difference between having a web form dump those files into S3 or a disk elsewhere is marginal.

q is a command line tool, so it would not be terribly difficult to issue a system call and get the results back. The results returned though are just the lines directly from the files. We would have to do some additional parsing after the fact to break things down into named bits. Hrm... That means carrying the logic of the row result structure around and needing to alter it if the data format ever changes. Combine that with the concern over having to find and concatenate file paths and the length of the resulting command line... I think q may be on the ropes.

Amazon provides an SDK for most of the frequently used programming languages, and the ones I would need have support for using Athena. Query results are just an API call away via the SDK, and they come back in an array of data. The format of that array is interesting.

Aws\Result Object
(
    [data:Aws\Result:private] => Array
        (
            [UpdateCount] => 0
            [ResultSet] => Array
                (
                    [Rows] => Array
                        (
                            [0] => Array
                                (
                                    [Data] => Array
                                        (
                                            [0] => Array
                                                (
                                                    [VarCharValue] => first_name
                                                )
                                            [1] => Array
                                                (
                                                    [VarCharValue] => last_name
                                                )
                                            [2] => Array
                                                (
                                                    [VarCharValue] => location
                                                )
                                        )
                                )
                            [1] => Array
                                (
                                    [Data] => Array
                                        (
                                            [0] => Array
                                                (
                                                    [VarCharValue] => Thor
                                                )
                                            [1] => Array
                                                (
                                                    [VarCharValue] => Odinson
                                                )
                                            [2] => Array
                                                (
                                                    [VarCharValue] => Asgard
                                                )
                                        )
                                )
                        )
                )
        )
)
The first row in the data set is essentially the table definition, with subsequent rows being the actual data. While we don't get back the data in a typical SQL-esque fashion -- where each entry in the array would be an associative array keyed by column name -- we could easily build a parser / formatter that takes that first row of data and uses it to return a similar structure. That is one bit of coding that can handle any data changes down the line.
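That parser / formatter can be quite small. Here is a sketch; the function name is mine, and $rows mirrors the ResultSet => Rows portion of the SDK response above:

```php
<?php
// Turn Athena's row format into associative arrays keyed by column
// name, using the first row as the header.
function formatAthenaRows(array $rows): array
{
    $extract = function ($cell) {
        return isset($cell['VarCharValue']) ? $cell['VarCharValue'] : null;
    };

    // The first row holds the column names.
    $header = array_map($extract, $rows[0]['Data']);

    $formatted = [];
    foreach (array_slice($rows, 1) as $row) {
        $formatted[] = array_combine($header, array_map($extract, $row['Data']));
    }

    return $formatted;
}

// $rows would come from the SDK call, e.g. $result['ResultSet']['Rows']:
$rows = [
    ['Data' => [['VarCharValue' => 'first_name'], ['VarCharValue' => 'last_name'], ['VarCharValue' => 'location']]],
    ['Data' => [['VarCharValue' => 'Thor'], ['VarCharValue' => 'Odinson'], ['VarCharValue' => 'Asgard']]],
];

print_r(formatAthenaRows($rows));
// [['first_name' => 'Thor', 'last_name' => 'Odinson', 'location' => 'Asgard']]
```

Because the column names are read from the first row at runtime, this piece of code keeps working when the table definition changes.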

Winner: Athena lands a solid blow to q's gut by not needing to worry about having to update code any time the data format changes.

Current Score: Athena 4 - q 3

Here is your winner...

Your winner... Athena!

Athena wins with a decisive knock out blow in the final round. While you do have some operating expenses in its use of S3 and scanning the data, Athena provides everything we need to process CSV data and serve it up to our customers. Our developers can query the data in SQL like they would with any other data source through the SDK API calls. Given it can parse other file formats, this is definitely a tool I can see reaching for more frequently.

Book Review: SQL Antipatterns

SQL is an interesting yet temperamental tool in the programmer's kit. You can use it to find amazing insights in your data, but you can also watch a poorly formed query grind your application to a halt. SQL Antipatterns wants to help you avoid the latter by identifying common pitfalls when designing and using databases.

The book identifies twenty-four antipatterns and breaks each down using the same formula -- an objective one is trying to achieve, the antipattern typically used as a solution, how to spot this antipattern, the legitimate uses of the antipattern, and alternative solutions. The author introduces each antipattern with a brief example scenario followed by a summary of the objective people are typically trying to achieve.

Next, examples of the antipattern are given with an explanation of why you would not want to do this. Beyond the examples, the author provides other ways to recognize the antipattern; most of these are questions you are likely to ask or hear colleagues ask. If you hear or ask "How do I do X?", that would be a good opportunity to pull this book off the shelf, give it a quick skim, and possibly avoid a pitfall.

However, the author recognizes there are legitimate uses of each antipattern and presents them before the solutions for avoiding it. This allows you to decide if this is a pitfall to avoid or a necessary evil in your application. Finally, the appropriate solutions are presented with attention given to how they might differ between database vendors.

Easy to read and digest, this is a worthy reference book to add to your shelf. As you extend or update a database and its related application code, you will find yourself reaching for it to reduce the likelihood of being paged at night by a misbehaving database.


Fresh Baked Roles with Zend Permissions RBAC

As our Traackr application has grown, so has our need to restrict access to features or pieces of features. Recently, our search feature added new filters to use. However, some of these filters are only meant to be available to enterprise customers. This was not a new problem for us; our analytics feature added a new report type, one only enterprise customers should have.

Feature flags were our initial solution to both efforts. This worked well enough for our analytics use case. Our customer success team attaches an account to the feature flag, and the account can view the new report. Cracks started showing in this approach when applied to our search effort.

Let's say we added four new filters to search. That is four new feature flags to create, not a terrible burden from a coding standpoint. But our poor customer success team... They now have to go account by account and flip four new switches to make sure the account has access to the search filters granted to them by their subscription. Yuck...

Role-based Access Control to the Rescue!

During one of our hack weeks, I had tinkered with the Zend Permissions RBAC package without a solid use case in mind. It was just an interesting concept that crossed my path at the time. Now, it seemed a perfect use case for it was in front of me. Before we dive into details, we should first answer an important question.

What is role-based access control (RBAC)?

A standard access control list (ACL) focuses on objects and who can act on them. Your computer's file system is an example of an ACL. Using a Linux file system, each file and directory is an object. The ACL for a file contains three entities -- the owner of the object, the group associated with the object, and everyone else. A feature in our application would be an object, and the feature flag ended up being the ACL attached to it.

The statement an ACL makes is "This object is associated with these people, and each person can access the object in specific ways". Phrased in the context of our application, "this feature can be accessed by the following accounts."

How does RBAC differ then? I'll let the package's documentation make the initial introduction:

RBAC differs from access control lists (ACL) by putting the emphasis on roles and their permissions rather than objects (resources).

RBAC makes the statement "These people are associated with these roles. These roles grant them these permissions." Again, in the context of our application, the statement becomes "This account is associated with this role. This role allows them access to these features." This concept has an immediate benefit for us. Accounts can be granted a specific role. As we add a new feature, we update the role to have new permissions for the feature, and every account is automatically able to use the feature. No longer do we need to edit accounts individually.

Baking up our roles

Let's start building this out! We first need to define our roles and the permissions associated with them. Here is a stripped down version of our class to manage that:
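A minimal sketch of that class, assuming the Rbac and Role classes from Zend Permissions RBAC; the class name and role names here are illustrative:

```php
<?php
use Zend\Permissions\Rbac\Rbac;
use Zend\Permissions\Rbac\Role;

class SubscriptionRoles
{
    public static function enterprise(): Rbac
    {
        $rbac = new Rbac();

        // Parent role representing just the subscription.
        $subscription = new Role('enterprise.subscription');

        // Child "Search" role carrying the feature permissions.
        $search = new Role('search');
        $search->addPermission('search');
        $search->addPermission('search.topics');
        $search->addPermission('search.brands');
        $search->addPermission('search.locations');

        $subscription->addChild($search);
        $rbac->addRole($subscription);

        return $rbac;
    }
}

// A role inherits the permissions of its children, so
// isGranted('enterprise.subscription', 'search.topics') succeeds.
```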

Our "Enterprise Subscription Package" starts with a parent role representing just the subscription for the account. It contains no feature specific permissions and would contain more general permissions instead -- number of users allowed in the account, number of concurrent logins, quotas, etc. A child "Search" role is added to the subscription role. Within the "Search" role we grant permission (search) to use the search feature and grant permission to use the topic (search.topics), brand (search.brands), and location (search.locations) filters. The dot notation is our own convention for role and permissions names. Names can be any string, but this gives us a visual hierarchy of the pieces of a feature.

May I?

When a customer logs into our site, we check the subscription attached to their account and attach the appropriate Role object to their session. Our search page must now check they can access the page, and if so, what filters they should see.
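Those checks are one-liners against the Rbac instance. A hypothetical sketch of the search page guard, where $rbac and $role come out of the session and ForbiddenException stands in for whatever your framework throws:

```php
<?php
// Can they reach the search page at all?
if (!$rbac->isGranted($role, 'search')) {
    throw new ForbiddenException('Search is not part of this subscription.');
}

// Which filters should the page render?
$visibleFilters = array_filter(
    ['topics', 'brands', 'locations'],
    function ($filter) use ($rbac, $role) {
        return $rbac->isGranted($role, 'search.' . $filter);
    }
);
```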

That smells good...

The sample checks above may not seem much different than what you would do with feature flags. We are just replacing "are they on this feature flag" with "do they have this role and permission". The benefit here is in the ease of maintenance, especially for managing accounts and features. Using RBAC, a new feature is added to a role, that role is attached to a subscription, and that role addition / subscription update is shipped with the feature. Every account tied to that subscription has immediate access to the feature. Our customer success team no longer has to take time to turn the feature on for each account.

RBAC has also made it easy to extend the system even further. As an example, we built an override and exclusion system on top of it. If a customer abuses the location search feature, we can shut it off just for them. Need an incentive for a customer to upgrade? We can override the permissions to allow them access to a feature for a short time.

We will be migrating our existing feature-flagged items to RBAC. Using it has also sparked ideas for improving other areas of our code, particularly user-based permissions. Our code and process have been much more manageable since we baked RBAC into them. But don't take my word for it; give the recipe a try for yourself.

Yoda conditionals... Change my mind I should?

I had another blog topic in mind but ran across the article "The Mistakes I Made As a Beginner Programmer". The entire article is worth reading, but this item leaped off the page at me:

... avoid Yoda conditions and assignments within conditionals.

I have encountered this statement in other articles and in the coding standards for various projects. Reduced readability is the criticism typically levied against the pattern, though they seldom elaborate on why readability suffers. I picked up the habit of using Yoda conditionals from a former mentor, but it has never been a practice I strongly advocated using. Perhaps, it is time to reexamine if there is any value in using it at all.

What exactly is a "Yoda conditional"? It is a coding practice where you always phrase a conditional statement with the "constant" portion first: rather than writing if ($order == 66), a Yoda conditional reverses the expression to if (66 == $order). The goal of the practice is to avoid errors where you unintentionally perform variable assignment inside the conditional rather than making an actual comparison.
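Sketched out in PHP, the two styles and the bug they guard against look like this:

```php
<?php
$order = 66;

if ($order == 66) {      // the usual phrasing
    $usual = true;
}

if (66 == $order) {      // the Yoda phrasing
    $yoda = true;
}

// The typo the Yoda style guards against: a single '=' assigns
// instead of comparing, and the assignment evaluates to 66 (truthy),
// so this branch always runs.
$order = 0;
if ($order = 66) {
    $oops = true;
}

// Reversed, the same typo will not even parse:
// if (66 = $order) { ... }  // syntax error
```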

An accidental single equals sign, as in if ($order = 66), will always pass because variable assignment results in a truth-y value; your clone troopers will exterminate the Jedi on every order given. Having Yoda on the field though causes an immediate disturbance in the Force: if (66 = $order) is an assignment to a constant, which fails the moment the code is parsed. Hidden this subtle bug is not.

There is a clear benefit here, but is it enough to trump the benefits of readable code? Is there truly a problem with it being less readable? On the surface, there appears to be no issue. You see the "if" and know some sort of comparison will be made. What does it matter if 66 is listed first?

Our brains do not read code the way a machine does. It makes no difference to the machine what the order is. However, say the conditional out loud. "If sixty-six equals the order variable." "If the order variable equals sixty-six." The latter falls into more familiar patterns of speech. Reading one slightly awkward statement may not be a big deal, but try reading more and more of them while debugging a large code base. It will tax your mental capacity.

My development environment also includes a bevy of code sniffers and linters. They constantly remind me to use the triple equals for equality checks along with warnings when assigning variables inside a conditional. I am less likely to fat finger away an equal sign and end up with an assignment rather than a conditional check.

So... Hrm... Yeah, I think I finally agree; the benefit of using Yoda conditionals is not sufficient when weighed against code readability. I would much rather have my code be easy for people to read and understand. No longer use this pattern will I. Wait... Apologies... I will no longer use this pattern. There, that is much better.

Type Safe REST Client with Kotlin and Retrofit2

This article was originally posted on Will's Blog.

If you regularly work with REST endpoints, you're probably used to a lot of tedium. Between carefully constructing the parameters, encoding data in just the right format, efficiently sending the requests, parsing the results, and handling errors at every step along the way, there is a lot of room to make mistakes.

For simple APIs, it often makes sense to manually format the request and send it out. Checking a status code and maybe mapping results to an object aren't that big a deal for a simple API. But what if there are 20 different endpoints? Or you need to manage authorization headers? Eventually, it makes sense to start looking for a wrapper library, or maybe making your own wrapper layer.

That's what this post is about.

For JVM developers, pairing Kotlin, Retrofit2, and Gson makes it easy to bang out a type-safe client wrapper.

Let us jump right into the code:

// Create data classes for serializing inputs and outputs.
data class Results<out T>(
    val next: String?,
    val previous: String?,
    val results: List<T>
)

data class CryptoQuote (
    val ask_price: Double,
    val bid_price: Double,
    val mark_price: Double,
    val high_price: Double,
    val low_price: Double,
    val open_price: Double,
    val symbol: String,
    val id: String,
    val volume: Double
)

// Bounds, Intervals, and Spans are enum types.
data class Historical (
    val data_points: List<DataPoint>,
    val bounds: Bounds,
    val interval: Intervals,
    val span: Spans,
    val symbol: String,
    val id: String,
    val open_price: Double?,
    val open_time: Date?,
    val previous_close_price: Double?,
    val previous_close_time: Date?
) {
  // Nested data object used by Historical
  data class DataPoint(
      val begins_at: Date,
      val open_price: Double,
      val close_price: Double,
      val high_price: Double,
      val low_price: Double,
      val volume: Double,
      val session: String,
      val interpolated: Boolean
  )
}

// Define interface using Retrofit annotations and data objects.
// NOTE: the @GET paths below are the unofficial Robinhood endpoints.
interface RobinhoodCryptoQuotesApi {
  /**
   * @param ids comma separated list of ids
   */
  @GET("marketdata/forex/quotes/")
  fun getQuoteIds(
    @Query("ids") ids: String
  ): Call<Results<CryptoQuote>>

  /**
   * @param symbols comma separated list of symbols
   */
  @GET("marketdata/forex/quotes/")
  fun getQuoteSymbols(
    @Query("symbols") symbols: String
  ): Call<Results<CryptoQuote>>

  /**
   * @param symbol id or symbol for a pairing.
   */
  @GET("marketdata/forex/quotes/{symbol}/")
  fun getQuote(
    @Path("symbol") symbol: String
  ): Call<CryptoQuote>

  @GET("marketdata/forex/historicals/{id-pair}/")
  fun getHistoricals(
      @Path("id-pair") idPair: String,
      @Query("bounds") bounds: Bounds,
      @Query("spans") spans: Spans,
      @Query("interval") interval: Intervals
  ): Call<Historical>
}

This is a complete listing of the Robinhood market data endpoints. If you aren't familiar with Kotlin, this code is essentially declaring an interface and some POJOs. By sprinkling in some special annotations used by Retrofit2, it can be converted into a concrete object for accessing the endpoints. Errors are even handled by the Call objects which wrap your results.

To go from endpoint definition to concrete object, we need to invoke the Retrofit2 builder. While Retrofit2 doesn't really support anything besides OkHttp for the HTTP client, there are hooks to adapt the serialization and call adapter layers. Here you can see I configure OkHttp to insert a token header and add Gson and enum adapters for serialization.

  // Inject the authentication header.
  private class TokenInterceptor(private val token: String) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
      val newRequest = chain.request().newBuilder()
          .addHeader("Authorization", "Token $token")
          .build()
      return chain.proceed(newRequest)
    }
  }

  // Build a token aware HttpClient.
  val tokenHttpClient = OkHttpClient.Builder()
      .addInterceptor(TokenInterceptor(token))
      .build()

  // Build the api. (gson is a Gson instance with the enum adapters
  // registered; the base URL is the unofficial Robinhood API root.)
  val api = Retrofit.Builder()
      .baseUrl("https://api.robinhood.com/")
      .client(tokenHttpClient)
      .addConverterFactory(GsonConverterFactory.create(gson))
      .build()
      .create(RobinhoodCryptoQuotesApi::class.java)

  // Make a request (synchronously, for the sake of the example).
  val response = api.getHistoricals(
      idPair = "3d961844-d360-45fc-989b-f6fca761d511",
      spans = Spans.DAY,
      bounds = Bounds.TWENTY_FOUR_SEVEN,
      interval = Intervals.DAY
  ).execute()

  // Print out some data.
  for (data in response.body()!!.data_points) {
    println(data)
  }

And that's all! In the final code, there was some additional work to make an OAuthTokenInterceptor, but all of that complexity is constrained to the constructor. Defining the endpoints is a matter of typing out the data classes for serialization, naming your methods, and applying the Retrofit2 annotations.

All in all, Retrofit is a very concise way to define a client library in a type-safe way. The code described in this article can be found on GitHub.

A pipeline for cleaning up technical debt?

Traackr's customer facing application is built on top of CakePHP and has accumulated its fair share of technical debt over time. One particular piece of debt weighed on my mind. Our application fetches a core set of data from an API and then augments it with additional data our customers have entered. A single method handles augmenting the initial data through a series of transformations. This beast of a method cannot be tested. There are far too many pieces going on to write maintainable tests. It was long overdue for a refactor, and there was a particular pattern and accompanying package I wanted to try out.

A Pipeline for Success?

I decided to use the Pipeline library from The League of Extraordinary Packages for a few reasons. First, the pattern lends itself to writing code that does one thing and one thing very well. Here is a sample pipe:
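A pipe can be any PHP callable; we use small invokable classes. The sketch below is illustrative (the class, model, and method names are made up):

```php
<?php
// Do one thing: attach customer notes to each row of the payload.
class AugmentWithNotes
{
    private $noteModel;

    public function __construct($noteModel)
    {
        $this->noteModel = $noteModel;
    }

    public function __invoke($payload)
    {
        foreach ($payload as &$row) {
            $row['notes'] = $this->noteModel->notesFor($row['id']);
        }
        unset($row);

        return $payload;
    }
}
```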

A pipe is invoked and passed a single parameter, a payload of data. You act upon that payload and then return it for the next pipe to use or to be used as the final output of the entire pipeline. Given you can chain any number of pipes together, your logic can be focused on performing a single task. This in turn leads to being able to write and maintain unit tests for a pipe.

Second, it introduces a level of reuse the beast method could not. Each pipe is a reusable element we can chain together in any fashion. If we only needed to augment the data with a small subset of our customer data, this could easily be done by putting together a different pipeline. For example, here is our entire augmentation pipeline:
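Here is a reconstruction of what that pipeline looks like; the stage class names are illustrative, while $this->InfData and $this->TagUtil are the CakePHP model and component classes handed into each pipe's constructor:

```php
<?php
use League\Pipeline\Pipeline;

$pipeline = (new Pipeline)
    ->pipe(new AugmentWithNotes($this->InfData))
    ->pipe(new AugmentWithProjects($this->InfData))
    ->pipe(new AugmentWithTags($this->TagUtil))
    ->pipe(new AugmentWithLists($this->InfData));

$augmented = $pipeline->process($initialData);
```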

Just need to augment the data with notes and projects? Simple:
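Again illustrative, with hypothetical stage classes:

```php
<?php
use League\Pipeline\Pipeline;

$pipeline = (new Pipeline)
    ->pipe(new AugmentWithNotes($this->InfData))
    ->pipe(new AugmentWithProjects($this->InfData));

$augmented = $pipeline->process($initialData);
```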

Small, easy to read classes... Classes that are easy to test and maintain... Easy to reuse items in a flexible way... Yes, please!

How's It Flowin'?

Excitement must be tempered with reality though. Yes, the beast may be bloated, but it does run very fast. Would this pipeline run just as fast? The initial results were... not promising. The augmentation was not horribly slow, but it was noticeably slower. Timing logic revealed the beast ran under half a second on average, while the pipeline was running a full second on average.

Why is this approach slower? I cannot provide concrete answers, but I have a theory. Notice the pipes in our full pipeline take in a number of arguments like $this->InfData or $this->TagUtil. These are CakePHP component and model classes. My best guess is there is some overhead in having these items passed around into the pipe classes. The beast does not suffer from this overhead because the framework has already bootstrapped those classes and made them available.

So... Now what?

While the execution time of the pipeline makes using it impractical, it was worth the effort to go through the exercise. Yes, we may not be able to use these pipe classes, but they and their tests can easily be ported back into CakePHP component methods. We will still get the benefit of small, easy to test, and reusable units of code.

This library may not see use in conjunction with our CakePHP code base, but it is certainly a tool I will reach for outside of that context. More importantly, using it reinforces the mindset of writing small, testable code. And that, more than any library or tool, is the best pipeline towards reducing your technical debt.

Mongo Follower, a MongoDB export and oplog event generator for Java

Traackr recently open sourced Mongo Follower, our easy-to-use harness for synchronizing your application with a MongoDB database. It provides a simple interface that efficiently exports all documents from a Mongo collection, seamlessly followed by an indefinite stream of Mongo oplog events. Your location in the oplog is saved to disk, allowing for clean restarts with no data loss.

At Traackr, we use MongoDB to store social data and keep that data synchronized with a separate analytics database. Over the years, we've used various tools to keep these two data stores synchronized -- MongoRiver, MongoConnector, Mongolastic, scripts, cron jobs and a lot of carefully crafted code. More and more, we've realized that utilizing the Mongo oplog to synchronize this data and keep our applications decoupled makes things easier to implement and less error prone. After creating several internal apps which use the oplog, we decided to build a generic library to make building and maintaining those applications even easier. We're calling it Mongo Follower.

How does it work?

A producer thread pulls data from Mongo, and a dispatch thread asynchronously sends that data to a special listener object. Processing happens in two phases: export and oplog. The export phase is only needed for the initial run. A Mongo query requests all documents from your collection, which are then sent to the 'exportDocument' callback. Once all documents are exported, the oplog phase begins.

The Mongo oplog acts like any other collection in MongoDB, so setting up the oplog tail is very similar to the initial export. The first difference is that the cursor is configured with cursorType(CursorType.TailableAwait): the cursor remains open even after reading the last record, because more records become available whenever the collection changes. The query is also configured to skip the portion of the oplog that was already processed in case of a restart. This is done using a timestamp file loaded from disk, which is automatically updated at regular intervals. Oplog documents have a special format indicating an insert, delete, or update; Mongo Follower parses this format and calls the corresponding insert, delete, and update callbacks to make the data easier to consume.
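To make the callback split concrete, here is a toy dispatcher (sketched in JavaScript for brevity; the real library is Java, and the listener method names here are illustrative). Oplog entries carry an `op` field ('i', 'u', 'd'), the document or change in `o`, and, for updates, the match criteria in `o2`:

```javascript
// Route a raw oplog entry to typed callbacks, mirroring the
// insert/update/delete split described above (names illustrative).
function dispatchOplogEntry(entry, listener) {
  switch (entry.op) {
    case 'i': // insert: `o` is the full new document
      listener.insert(entry.o);
      break;
    case 'u': // update: `o2` identifies the document, `o` holds the change
      listener.update(entry.o2, entry.o);
      break;
    case 'd': // delete: `o` contains the _id of the removed document
      listener.delete(entry.o);
      break;
    default:  // ignore no-ops ('n') and commands ('c')
      break;
  }
}
```

The same shape works for the export phase: every exported document is simply funneled to a single export callback before the oplog phase begins.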

How do you use it?

Mongo Follower is currently available on GitHub and Maven Central and can be included in your project like any other Maven dependency.

Once the library has been included, implement the MongoEventListener interface. An instance of this listener will be provided to the Mongo Follower process.

To start Mongo Follower, we've created a Runner utility, which can be used in a couple of ways. The easiest is to pass in a FollowerConfig object.

That's all there is to it! Additional details can be found on GitHub.

What's next?

We're excited to try out a more reactive approach to dealing with our data. We can decouple creation and setup code when adding special documents, re-index documents to Elasticsearch whenever there are changes, optimize performance by observing usage patterns, or even test out alternative storage backends for some of our larger collections.

There are a handful of improvements we'd like to make, specifically:

  • Multiple listeners
  • Tailing and exporting multiple databases simultaneously
  • Built-in metrics
  • Cluster / Shard management

Even without those features, we're eager to continue leveraging the Mongo oplog to make our lives easier.

Adventures in Two Factor Authentication

Off we go!

Another hack week was upon us, and I had a definite idea in mind. I wanted to add two factor authentication to our web application. Having enabled it on web sites I use regularly, I was curious to see how difficult it was to implement and learn about the best practices associated with it. A bare bones implementation turned out to be remarkably trivial.

Our application runs on PHP, and it took minimal Google-fu to determine that the RobThree/TwoFactorAuth library was the best option to try first -- well maintained, great documentation. One quick

composer require robthree/twofactorauth

to pull the library in, and it was right into setting it up. First hurdle reached! Our authentication code needed some tender loving refactoring to support logins both with and without two factor authentication enabled. Ah... the satisfaction of cleaning up code, including squashing a bug that has long irritated you...

Once the authentication code was cleaned up, actually adding two factor authentication was trivial. It took more time to refactor our existing code than it did to use the library to generate the shared secret, display the QR code for scanning, and verify the entered token against the shared secret.

Which thirds to invite to the party?

By default, the PHP library reaches out to an external Google service to generate the QR code. However, you can use a local library to generate it if sending the shared secret over the wire is a security concern. Hrm... Should I be concerned about that? Should I use an external third party service for generating QR codes, or should I compose another third party library into our application?

I did a bit of research and found no compelling argument either way. Thinking it through: let's say I use an external third party and send the shared secret to them. What is the attack surface? They could sniff and record the secret, but they get no other details. How would they tie it back to an actual username? They would need access to our database or server to make that association, and if they have that, the game is already up; they likely don't need the secret anyway.

What other drawbacks might there be then? There are two that spring to mind. First is the need to make an HTTP request to the service to get the QR code back. You are now dependent on the service being up and the network being fast and stable between you and the service. Second is the (likely) closed nature of the service; you have no way of vetting the code to make sure there are no shenanigans behind the curtain.

Given all of that, I stuck with the default Google service. Google knows just a little bit about keeping a service up and running, and their QR code service has likely been battle tested as much if not more than any other third party library I could use locally. However, the RobThree documentation provides an easy example of using a local library should the need or desire to switch arise.

Drifting away...

It was demo day; time to practice before presenting my work to the rest of the team! I'll just plug in my username and password... Type in the token from my device, and... Wait... Why am I not logged in? What did I break with my last commit?! All my work is for naught! I'm a frauuuud!

OK... OK... Calm down... There must be a logical explanation. What would make a token work yesterday but not today? This is a time based protocol... My phone is set up to automatically keep its time correct. Let's check the time on my server... AHA! My local Vagrant instance is not running any time synchronization and has decided to drift away by about an hour. One quick NTP sync to fix that, and... We're back in business! Whew... Crisis averted; the demo proceeds without issue.

I've learned an important lesson about TOTP. Make sure your server is automatically keeping the time correct!

Dude... Where's my phone?

If you are like me, you have used two factor authentication on other sites and have come across "recovery codes". These are meant to be written down and stored offline; if you were to ever lose the device that generated the token for the site, you could enter one of these codes to complete authentication and establish the shared secret on a new device. While I had an idea in mind of how this could be implemented, I wanted to see if there were any established best practices people follow. How many codes should be generated? How should you generate each code? What is the best way to securely store them?

After some digging, I found... nothing. I've seen sites provide anywhere from five to ten codes, with the codes varying in length. I have yet to implement this feature but am leaning toward generating at least five codes. However I end up generating them, I will store them exactly like passwords -- as salted hashes.

I need more...

One other flow I have encountered with other two factor enabled sites is the need to input multiple tokens if the server seems to think your token is out of sync. Again, I could find no information on best practices. How many tries should you give the user before entering this flow? How many tokens should you ask for? Does this really add any security value? For now, I am unlikely to implement this flow. While some big names use it, I'm not seeing much benefit from this flow versus just allowing the user to keep entering a single token.

That's all folks!

The RobThree library made it easy to add two factor authentication to our application. Follow the examples in the documentation, and you'll be up and running in no time. However, there seem to be no best practices (none I could easily surface at least) around the items like recovery codes and "out of sync... give us more than one token to proceed". Did I miss anything? Have any best practices from your own implementations? Sound off in the comments and let me know.

Staatus: Exploring the Slack API

Back in April, Slack released a cool new feature that allows you to set a status message on your account. Some of the examples they provide are:

  • In a meeting
  • Commuting
  • Out sick
  • Vacationing
  • Working remotely

"This is awesome!" I thought. "Now I can do what I did on AIM in 1997!"

Slack did sprinkle this status in various places in the UI, but it's still a pain to go searching for it, especially if you haven't chatted with a person in a while. Plus, there doesn't seem to be an easy way to see a compiled status for a group (e.g. @engineering-team).

Enter: Staatus

I built Staatus on a PHP framework called Silex. It's a microframework based on Symfony that provides the guts for building simple single-file apps. Here's an example:

require_once __DIR__.'/../vendor/autoload.php';

$app = new Silex\Application();

$app->get('/hello/{name}', function ($name) use ($app) {
    return 'Hello '.$app->escape($name);
});

$app->run();

Staatus allows you to see a compiled status for an individual, a particular team, or everyone in a channel. The code is over on our GitHub.

I have to hand it to the devs at Slack. The API was super easy to work with. I made use of a handful of REST commands; with those, I was able to display not only status emojis/messages but also a blue diamond when a person is online.

The Code

Coding this up was very straightforward. I'd say the only quirk I ran into is that the calls can take too long, and Slack does not like that: if your plugin doesn't return within a few seconds, you get an error message. My plugin is hosted on Heroku, so between the lag of Heroku waking up instances and the time it takes to run all the API calls (which can be many, especially when Staatus is run against a team name), I needed a way to return a delayed response. Normally, in a single-threaded language like PHP, that's not easy. Most frameworks follow a very "serial" way of generating a response:

  1. A request comes in (possibly with parameters).
  2. A response is generated.
  3. The response is returned and the request dies.

So if step 2 takes too long, how was I going to work around Slack's response time limit? The answer lies in Silex's $app->finish() method, which is documented as:

A finish application middleware allows you to execute tasks after the Response has been sent to the client (like sending emails or logging)

Here's how I used it:

$process = function (Request $request) use ($app) {
    // first, validate token
    $token = $request->get('token');
    if (empty($token) || $token !== VERIFY_TOKEN) {
        $app->abort(403, "Invalid token.");
    }

    $response = [
        'response_type' => 'ephemeral',
        'text' => 'Gathering staatus...'
    ];

    // just return; the rest of the processing will happen in finish()
    return $app->json($response);
};

$app->get('/', $process); // for testing
$app->post('/', $process);

$app->finish(function (Request $request, Response $response) use ($app) {
    // ... the slow Slack API calls happen here, after the response is sent ...
});

As you can see, my $process closure returns immediately, and I do the rest of the processing in the finish() callback. This allows me to take as long as I'd like (give or take) to return a response back to Slack using a one-time response URL they provide. (Technically, you can't take as long as you'd like: Slack has a hard limit of "up to 5 times within 30 minutes"):

$response = [
    'response_type' => 'ephemeral',
    'text' => $text
];

$client2 = new Client(['base_uri' => $responseUrl]);
$response2 = $client2->post('', [
    'json' => $response,
    'verify' => false
]);
And that's all, folks. So, go ahead and download Staatus, deploy it on Heroku, and you'll be partying like it's 1997 all over again. BOOYAH!

Round-robin Cross Subdomain AJAX Requests

Well, that's a mouthful! But what is it, and what is it useful for? If you have done any web development in the last 10 years, you have most likely used AJAX to make your web pages -- or, even more likely, your web-based app -- more responsive and dynamic. You probably also know that most browsers limit the number of concurrent AJAX requests that can be made back to the origin server (i.e. the same server the page is served from). Usually that limit is 6 to 8 concurrent requests (sometimes a bit more), depending on the browser.


This might be fine for what you are trying to do, but if your page has a lot of dynamic components, or if you are developing a single page web app, it can make the initial loading of your page and all its content a bit slower. After your first 6 (or 8) AJAX requests have fired, any subsequent AJAX request to that same origin server is queued and has to wait. Obviously your AJAX requests should be fairly fast, but even if each one takes just 50 milliseconds, any request after the initial 6 or 8 has to wait 50 milliseconds in the queue before executing: 50 milliseconds waiting + 50 milliseconds executing = 100 milliseconds total. These short waiting times add up pretty quickly.
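The arithmetic above can be captured in a back-of-the-envelope model (plain JavaScript; the numbers are illustrative):

```javascript
// Naive model: n identical requests of `latencyMs` each, funneled through
// a browser that allows `limit` parallel connections to one host.
function totalTimeMs(n, latencyMs, limit) {
  const waves = Math.ceil(n / limit); // requests run in batches of `limit`
  return waves * latencyMs;
}

console.log(totalTimeMs(12, 50, 6));  // 100 -> two 50 ms waves
console.log(totalTimeMs(12, 50, 12)); // 50  -> doubling the hosts halves the wait
```

It ignores network variance, but it shows why raising the effective connection budget pays off on request-heavy pages.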


The solution? Make these requests look like they are going to different origin servers so that the browser will execute them in parallel, even though they are, in fact, going to the same server. This allows more AJAX requests to run in parallel and therefore speeds up your site/app. Here I will show you how to accomplish this using jQuery (but the same strategy can be used with other frameworks).

Cross subdomain AJAX requests with a single server

Basic idea

The basic idea is pretty simple. Say you are serving your pages from www.example.com. You will want to send some of the AJAX requests through webapp1.example.com, some through webapp2.example.com, some through webapp3.example.com... etc. You get the idea. And we will set it up so that all these hosts are, in fact, the same server.

What I will show you here is how to do this transparently so you don’t have to manually set up each AJAX request and decide where it should go. Our little trick will randomly round-robin through all the available hosts. This will work even if your application runs on a single server and the best part is that if you need to scale and upgrade to multiple redundant servers you will have little if anything to change.


Easy enough, right? Well, of course, there is just one more little thing. When you start making AJAX requests to a host different from the original host, you open the door to security vulnerabilities. I won't go into the specific details of the potential security issues, but suffice it to say that your browser won't let you make these cross-domain (cross-subdomain, in our case) requests by default. Luckily, CORS comes to the rescue. CORS is a rule-based mechanism that allows for secure cross-domain requests. I will show you how to set up the proper CORS headers so everything works seamlessly.

The setup

First things first - you will have to create DNS entries for all these subdomain hosts. Because we are doing this with a single server, you need to create DNS entries for webapp1.example.com through webapp3.example.com that all resolve to the same host as www.example.com (they can be A or CNAME records). All set? Moving on.

The code

We need to accomplish two things:

  1. (Almost) Randomly round-robin the AJAX request to the various hosts we have just defined
  2. Modify AJAX requests so they can work across subdomains with CORS

Round-robin your AJAX requests

Here we are going to leverage the excellent jQuery library, so I'm assuming you are using jQuery for all your AJAX requests.


The following JavaScript code should be executed as soon as your document is ready. It uses jQuery's global AJAX settings to hijack all AJAX requests made with jQuery. Once we identify an AJAX request going to the origin server, we rewrite it right before it hits the wire (in beforeSend()).
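Here is a sketch of what that hook can look like, using jQuery's $.ajaxSetup() (the host names are placeholders; the typeof guard simply keeps the pure rewriting logic usable outside a browser):

```javascript
// Aliases that all resolve to the same server (placeholder names).
var AJAX_HOSTS = ['webapp1.example.com', 'webapp2.example.com', 'webapp3.example.com'];

// Swap the origin host for a randomly chosen alias; leaves other URLs alone.
function rewriteHost(url, originHost, hosts) {
  var alias = hosts[Math.floor(Math.random() * hosts.length)];
  return url.replace(originHost, alias);
}

// Wire the rewrite into every jQuery AJAX request (browser only).
if (typeof jQuery !== 'undefined') {
  jQuery.ajaxSetup({
    xhrFields: { withCredentials: true }, // keep session cookies flowing
    beforeSend: function (jqXHR, settings) {
      if (settings.url.indexOf(window.location.host) !== -1) {
        settings.crossDomain = true; // these are now CORS requests
        settings.url = rewriteHost(settings.url, window.location.host, AJAX_HOSTS);
      }
    }
  });
}
```

Relative URLs that never mention the origin host pass through untouched, so only absolute same-origin requests get spread across the aliases.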


Almost there. Now we need to make sure your server can handle these requests when they come via one of the aliases we defined.

Enable CORS for your AJAX requests

Before we look at the CORS headers needed for these cross-domain AJAX requests to work, you need to understand how they differ from regular AJAX requests. Because of CORS, your browser needs to ensure the target server will accept and respond to cross-domain AJAX requests. It does this by issuing OPTIONS requests, known as 'preflight' requests. These are very similar to GET or POST requests, except that the server is not required to send anything in the body of the response; the HTTP headers are the only thing that matters in these requests. This diagram illustrates the difference:

So you need to make sure your server returns the proper headers. Remember: in the current setup, www.example.com and webapp[1-3] are the same server. There are many ways to do this; here is one that leverages Apache .htaccess config:
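A sketch of such an .htaccess (assuming mod_headers and mod_setenvif are enabled, and using the placeholder example.com domain from above):

```apache
# Echo back the Origin header only when it matches one of our own hosts.
SetEnvIf Origin "^https?://(www|webapp[1-3])\.example\.com$" CORS_ORIGIN=$0
Header set Access-Control-Allow-Origin "%{CORS_ORIGIN}e" env=CORS_ORIGIN
Header set Access-Control-Allow-Credentials "true" env=CORS_ORIGIN
Header set Access-Control-Allow-Methods "GET, POST, OPTIONS" env=CORS_ORIGIN
Header set Access-Control-Allow-Headers "X-Requested-With, Content-Type" env=CORS_ORIGIN
```

Echoing back the matched Origin (instead of hardcoding one hostname) lets a single rule cover all the aliases.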

Because we allow credentials to be passed in these cross-domain requests, the allowed-origin header ('Access-Control-Allow-Origin') must specify a full hostname. You cannot use '*'; this is a requirement of the CORS specification.


An important note here: because these OPTIONS calls are made before each AJAX request, you want to make sure they are super fast (i.e. a few milliseconds). Since the only thing that matters is the headers (see above), I recommend serving a static empty file for these requests. Here is another .htaccess config that will do the trick (make sure options.html exists and is empty):
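One way to do that with mod_rewrite (again a sketch; it assumes the empty options.html sits at the web root):

```apache
# Short-circuit CORS preflights: serve a static empty page for any OPTIONS call.
RewriteEngine On
RewriteCond %{REQUEST_METHOD} OPTIONS
RewriteRule ^(.*)$ /options.html [L]
```

The [L] flag stops further rewriting, so preflights never reach your application code.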

And that is it! Pop open your JavaScript console and watch your AJAX requests being routed through the various aliases and executed in parallel.

Be warned

So what trade-offs are you making? Remember, this is engineering; there are always trade-offs! All of a sudden you might get twice (or more) as many requests hitting your server in parallel. Make sure it can handle the load.

What’s next?

There are probably a few things you can improve on. I will mention two we use at Traackr but will leave their details as an exercise for the reader:

  • You will probably want to avoid hardcoding your list of servers in the script.
  • Some corporate firewalls do not allow OPTIONS requests. This can cause this entire approach to fail. The script we use at Traackr will actually detect errors in OPTIONS requests and will fall back to regular AJAX requests.

Use a load balancer and let it scale

What can you do if/when you need to scale? One easy approach that doesn't require you to change much is to put 2 or more servers behind a load balancer (perhaps an Elastic Load Balancer on AWS). Then update your DNS so that all your webapp[1-3] URLs point to your load balancer.


What magic happens then? Browsers will make more parallel requests, just as we have described all along. But now your load balancer will round-robin each of these requests to the many servers running behind it. Magically, you are spreading the load (and the love).


Thank you to the entire awesome Traackr engineering team for the help with this post. We are Traackr, the global and ultimate Influencer Management Platform. Everything you need to discover your influencers, manage key relationships, and measure their impact on your business. Check our blog. We are hiring.