Unfortunately, this past Friday (April 19th), Instrumental stopped displaying up-to-the-minute data for a period approaching 15 hours. Astute observers of the app and our Twitter account (@instrumentalapp) have likely noticed that this is the worst of a series of graph delays that have occurred over the past month.
I’ll give a technical post-mortem later in this message, but first I want to underline the most important points about these delays:
- Your data may not appear quickly, but it’s NEVER lost.
- The changes we made to fix last Friday’s problems should ensure we don’t see problems like those of the past month.
- We’re going to do a better job communicating service information to you.
We’re currently working on a series of changes that should ensure we don’t experience delays of this magnitude again, and that you never lack up-to-date information. We’ll also be working on better ways to communicate essential information about Instrumental’s current status, so that you never have to wonder whether your app is down or whether Instrumental’s data is delayed.
We hope to have this past month’s unpredictable data delays behind us soon. Internally, we still have more work to do to ensure an even more reliable future, but we’ve removed the weak parts of our infrastructure that were the primary cause of the delays.
If you have questions about what happened, or concerns you’d like to address, please email us at firstname.lastname@example.org. We want to make sure you understand exactly what happened, and know that you can expect more reliable behavior in the future.
April 19th Technical Post-Mortem
Background: Instrumental’s data is persisted in two places:
- A MongoDB replica set previously hosted on Linode
- A series of flat files archived in S3
Writing to either of these two data stores is done by queueing the write documents (collections of metric updates), then having separate processes commit them to the database or S3. Typically, Mongo writes using our particular configuration have taken roughly 10ms, with little variation. Beginning March 22nd, 2013, we saw that number start to vary wildly, with batches of 100 writes taking anywhere from 1s to 10s.
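The write path described above can be sketched roughly as follows. This is a simplified illustration, not our actual code: the names, the in-process queue standing in for SQS, and the batch size of 100 are all assumptions for the example.

```python
import time
from queue import Queue

# Illustrative stand-in for our real durable queue (SQS in production).
write_queue = Queue()

def enqueue_document(metric_updates):
    """Producer side: bundle metric updates into a write document and queue it."""
    write_queue.put({"updates": metric_updates, "queued_at": time.time()})

def commit_batch(datastore, batch_size=100):
    """Consumer side: a separate process drains the queue and commits a batch.

    Returns the commit latency, which is the number we monitor:
    roughly 10ms in normal operation, 1-10s per 100 writes when degraded.
    """
    batch = []
    while len(batch) < batch_size and not write_queue.empty():
        batch.append(write_queue.get())
    if not batch:
        return 0.0
    start = time.time()
    # e.g. a Mongo bulk insert, or an archive write to S3
    datastore.write_many([doc["updates"] for doc in batch])
    return time.time() - start
```

Because the producer only ever touches the queue, slow or failed commits delay graphs but never drop incoming data.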
Description: Around 6 AM EST, while one of the read members was performing an overnight compaction, our Mongo replica set encountered an unknown event that caused our primary member to fail. The remaining members of the set failed to elect a new primary, causing all Mongo writes to fail. By 8 AM EST, our incoming SQS queue had grown to 20,000 documents (each usually containing 100-2000 writes to the Mongo database). At that point, a replica member was forced to become primary so the Mongo database could begin accepting writes again. It was then that we observed our MongoDB write times had begun to vary wildly, and we could not predict when the entire queue would be consumed by the new primary.
Suspecting that the write variability might be caused by overactive shared tenants on the Linode instance that had assumed primary responsibilities, we rebuilt the previous primary member and had it rejoin the replica set around 1 PM EST. While we did observe slightly faster write times during this period, which let us begin reducing overall queue size, we still could not accurately predict when the queue would be fully consumed and the service brought up to date.
Due to the increasing variability we were seeing on Linode (and a host of other small complaints and limitations), we had already begun a slow migration to AWS that included moving queue responsibilities to SNS and SQS, and an eventual primary data store switch to DynamoDB. While the DynamoDB migration was not yet ready for production, we had ported enough of our infrastructure automation scripts to EC2 that we had the tools to spin up a new Mongo replica set there, using EBS provisioned IOPS for more consistent behavior. This new replica set was spun up around 5 PM EST, and over the course of the evening was replicating from the Linode hosts with roughly 1-5s of latency.
While this new set was being brought up, the Linode hosts eventually caught up with the queue and were displaying up-to-date data around 8 PM EST. On Saturday, we migrated our database writing processes to AWS and switched primary member responsibility, after which we observed lower overall system latency and more consistent behavior.
What Caused The Problem: We’re not certain of the original event that caused the primary member to fail. We do know that the compaction command on the read member of the database failed, and we suspect the primary might have failed while the compacting member was unavailable, but that is only supposition on our part (and even if it did occur, Mongo’s election behavior should have handled it). That is only part of the picture, however: the delay would not have been nearly as severe were it not for the high IO variability we saw on our Linode hosts. We suspect that the advent of Linode’s “next gen” hardware and network may be partly to blame, as its introduction roughly correlates with our IO increases; given our small sample size, though, we cannot make that claim with certainty.
What We Did Wrong: We had observed the increasing IO lag, and had already begun migrating our system to DynamoDB. This decision (to migrate to AWS and a new datastore in one swoop) had too many dependencies to happen quickly, and put us in a more dangerous period in which our ability to recover from a large pending queue was significantly reduced. We should have made the transition more piecemeal, and migrated our database servers to AWS earlier. Additionally, we failed to communicate with our customers regularly, which created more uncertainty about graph data; lacking an accurate estimate of when graphs would catch up, we should have posted updates more frequently.
What We Did Right: When we realized we were unable to accurately estimate when the system would be up to date, we immediately began migrating to EC2, and we followed through on the migration to the point that the overall system is now more reliable than it was last week. We now have tools in place that allow for better growth and let us handle unexpected events better than we could on Linode. Other systems will be migrated from Linode over the next week, after which the system will become even more performant.
Additionally, the queueing behavior of the system (and the persistent, decoupled attributes which SQS adds) ensured that we were still receiving and archiving customer data even though our primary data store was hours out of date.
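The guarantee described above can be shown with a toy durable queue. This is purely illustrative: in production the durable-queue role is played by SQS, and the class below is an assumption made for the example.

```python
from collections import deque

class DurableQueue:
    """Toy stand-in for SQS: messages remain queued until a consumer
    successfully commits them, so a datastore outage cannot lose data."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # Producers keep accepting customer data regardless of datastore health.
        self._messages.append(message)

    def drain(self, datastore_up):
        """Attempt to consume the backlog; commits only when the datastore is up."""
        committed = []
        if datastore_up:
            while self._messages:
                committed.append(self._messages.popleft())
        return committed

queue = DurableQueue()
# Customer data keeps arriving while the primary datastore is down...
for i in range(3):
    queue.send(f"metric_update_{i}")
assert queue.drain(datastore_up=False) == []  # nothing committed while down...
assert len(queue._messages) == 3              # ...but nothing lost either
# Once the datastore recovers, the backlog drains and graphs catch up.
assert len(queue.drain(datastore_up=True)) == 3
```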
What Happens Next: Over the next week, most Instrumental servers will begin migrating to AWS. Internally, we will be discussing whether to tune our AWS Mongo replica set for better growth or to continue apace with our DynamoDB migration. We will also continue discussing the implementation of an in-memory store to supplement our persistent store, which would reduce visible system latency and allow us to better handle persistence lag. Also, we’ll begin planning an in-app method of communicating service status to our customers, so that extraordinary events like Friday’s are visible and up to date.
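One way such an in-memory supplement could work is a merged read path: serve persisted points from the durable store and fill the gap from a recent in-memory buffer. The sketch below is a hypothetical design, not a description of anything we have built; all names are illustrative.

```python
class SupplementedStore:
    """Hypothetical sketch: combine a (possibly lagging) persistent store
    with an in-memory buffer of recent points so reads stay current."""

    def __init__(self, persistent_store):
        self.persistent = persistent_store
        self.recent = {}  # metric name -> list of (timestamp, value)

    def write(self, metric, timestamp, value):
        # Every write lands in memory immediately...
        self.recent.setdefault(metric, []).append((timestamp, value))
        # ...and is also queued for (possibly lagging) persistence.
        self.persistent.enqueue(metric, timestamp, value)

    def read(self, metric, since):
        # Take whatever the persistent store has so far...
        points = self.persistent.read(metric, since)
        persisted_until = max((t for t, _ in points), default=since)
        # ...and fill the remaining gap from the in-memory buffer,
        # so persistence lag never hides recent data from graphs.
        points += [(t, v) for t, v in self.recent.get(metric, [])
                   if t > persisted_until]
        return points
```

The trade-off is memory for freshness: the buffer only needs to cover the worst expected persistence lag, after which points can be evicted once they are confirmed durable.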
Thanks, Instrumental customers, for your patience last Friday. Those of you who came into the Campfire chatroom or talked to us on Twitter were uniformly awesome during an extremely stressful period, and we hope this post helps explain why we couldn’t keep your dashboards and graphs up to date on Friday.