What to Expect When You’re Expecting Failure

Background

Instrumental is a key piece of infrastructure for many businesses, including Instrumental!  We put significant effort into making sure that Instrumental customers can rely on us to be accurate, available, and consistent, but no system is perfect.  There are two key components of our approach to reliability:

  • Make it hard to do the wrong thing
  • Assume that everything is going to fail

Example Incident

With that approach in mind, let’s talk about what happened on the 16th of November. We’d been testing a different configuration for our middle storage layer for some time, and were ready to switch all traffic to the new layer. For an idea of the scale we’re looking at, here’s the average workload our various layers deal with:

  • Outermost (raw incoming data): 100,000s of events per second
  • Middle layer (aggregated, heavily queried): 10,000s of events per second
  • Long term storage (heavily aggregated, occasionally queried): 100s of events per second

Unfortunately, when we deployed the new middle layer, we failed to deploy the updated configuration to all components of the system. This was a manual error on our part. Not realizing our error, we then proceeded to remove the old layer. Because parts of the system were unaware of the new middle layer, and the old layer was gone, we immediately lost the ability to write data through the middle of the pipeline.

Here’s what the incident looked like, in terms of batches processed per minute at the outer layer:

Incident Overview

The error was easy to fix, but unfortunately, it was also too easy to make.  Any data loss, even on the order of a few seconds, is too much. We failed to live up to our principle of making it hard to do the wrong thing.

Fortunately, because we assume any part of the system could fail at any time, the servers at the outer layer detected the problem and responded by simply queueing incoming data to local storage. When the system returned to proper functioning, the queued data was processed, so little or no data was lost. We also take the precaution of archiving those files, so that if data is not properly processed, any period of time can be restored from raw data. It is nearly impossible for data sent to Instrumental to be permanently lost – even a total outage (like what briefly happened on the 16th) manifests as a delay rather than a drop.
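
To make that concrete, here’s a minimal sketch of the spool-and-replay idea in Ruby. This is not our actual collector code; MiddleLayerClient and SPOOL_DIR are placeholder names, but the idea is the same: if a downstream write fails, persist the batch to local disk, then replay the spooled files in order once the pipeline is healthy again.

require "fileutils"
require "securerandom"

SPOOL_DIR = "/var/spool/collector" # hypothetical location for queued batches

def handle_batch(batch)
  MiddleLayerClient.write(batch) # placeholder for the real middle-layer write
rescue StandardError
  # The middle layer is unreachable or misconfigured: queue the batch locally
  # so it can be replayed (and archived) once the pipeline recovers.
  FileUtils.mkdir_p(SPOOL_DIR)
  filename = "#{Time.now.to_i}-#{SecureRandom.hex(4)}.batch"
  File.write(File.join(SPOOL_DIR, filename), batch)
end

def replay_spooled_batches
  Dir.glob(File.join(SPOOL_DIR, "*.batch")).sort.each do |path|
    MiddleLayerClient.write(File.read(path))
    File.delete(path) # or move it to an archive for raw-data restores
  end
end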

Speaking of delays, here’s what that recovery looked like on an Instrumental graph.  The scale is how many seconds behind ‘live’ various parts of Instrumental were during the incident:

Latency during recovery

Future Proofing

So, how do we make a mistake like this hard to make in the future?  We already have a component in our deploy process to make sure that the branch being deployed matches what’s in source control, and that it contains all of the commits from the master branch.  In effect, we made deploying a wrong or old branch impossible.  This should be straightforward in any modern deploy system – we use some Capistrano tasks that look like this:

task :ensure_remote_matches_local do
  local_name = current_branch
  remote_name = "origin/#{local_name}"
  # Refuse to deploy if the local branch has drifted from its remote counterpart.
  diff = `git diff --shortstat #{remote_name}..#{local_name}`.chomp
  if !diff.empty?
    raise "Local and remote branch #{local_name} are different: #{diff}"
  end
end


task :ensure_origin_master_is_merged do
  # Refuse to deploy a branch that is missing commits from origin/master.
  if !system("git branch --merged HEAD -r | grep 'origin/master'")
    raise "origin/master must be merged into the current branch before deploying"
  end
end
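
Depending on your Capistrano version and namespacing, tasks like these get wired into the deploy as before-hooks, roughly like so (adjust the task names to your own setup):

before "deploy", "ensure_remote_matches_local"
before "deploy", "ensure_origin_master_is_merged"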

We’ve added a little bit of code to make sure that all of the servers in the deploy group are also in our pipeline configuration. If you try to deploy a configuration that is missing servers, the deploy script will detect that and tell you exactly which pieces are wrong. Simple changes like this, which protect you from doing the wrong thing during future upgrades or incidents, are a rare opportunity to give future-you a gift instead of a headache.
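
Our check is specific to our own pipeline configuration format, but a rough Capistrano-style sketch of the idea might look like this; config/pipeline.yml and the Capistrano 3 roles(:all) helper are assumptions, so adapt them to however your deploy group and pipeline are described:

require "yaml"

task :ensure_pipeline_config_is_complete do
  # Servers Capistrano is about to deploy to (Capistrano 3 style).
  deploy_hosts = roles(:all).map(&:hostname)
  # Servers named in the (hypothetical) pipeline configuration file.
  pipeline_hosts = YAML.load_file("config/pipeline.yml").fetch("servers")
  missing = deploy_hosts - pipeline_hosts
  unless missing.empty?
    raise "Pipeline configuration is missing servers: #{missing.join(', ')}"
  end
end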

To summarize: an incorrect database configuration caused data to back up at the outermost layer, just as it is supposed to in an emergency. Once the configuration issue was corrected, the system quickly processed its backlog and returned to normal functioning with essentially no permanent loss of data. We were forced to temporarily disable alerts while the backlog was being processed. Improvements to our deploy scripts will make this sort of misconfiguration impossible in the future. Never stop looking for opportunities to build resilience into your system.

Thanks for reading, and feel free to tweet at @Instrumental or contact support@instrumentalapp.com if you have any questions!
