INSTRUMENTAL BLOG

Instrumental is about to get a brand new look! For the last few months we’ve been hard at work on a more modern, intuitive, and easier-to-use UI. We’ll be rolling it out to users over the next few weeks, but we wanted to give you a sneak peek now.

New Time Selector

We merged the resolution and duration dropdowns into a single global control. Now the time selector is always in the same place, whether you’re looking at a group of graphs or focusing in on just one.

Updated time selector

Reorganization: Sidebar

We’ve reorganized and simplified most of the interface, but let’s focus on the sidebar. All of your projects are now listed in the sidebar instead of hidden behind a dropdown. For the project you’re currently viewing, you can see all graph groups, access the metric explorer, and manage alerts.

Project sidebar

Improved Graph View

We spent a lot of time improving the interface for creating and viewing individual graphs. You spend most of your time there, and we wanted to make sure we got it right. The biggest change you’ll notice (aside from the new color scheme, of course) is the expression list, which is more prominent and easier to use. We also merged graph view settings, export and delete into a single dropdown. Graph options are there when you need them, ignorable when you don’t.

Graph settings dropdown

New expression list

…and more!

There’s plenty more in the new Instrumental that we haven’t talked about: our new lighter color scheme, better typography, improved usability throughout, and even a few Easter eggs.

We can’t wait to get the new Instrumental into your hands, so keep an eye out for a notification to opt-in during the beta period!

We use Instrumental to track all kinds of things. Many of them are about server and app performance, like latency, queue response time, and disk utilization. But we also have graphs for other kinds of monitoring. This series explores some of our more interesting graphs.

We acquired Gauges in January. Since then, we’ve reworked the server architecture, moved data centers, increased performance, updated the interface for retina displays, rethought our marketing…and that’s just what we’ve shipped so far! We’ve got plenty more in the works, of course. But we ran into a problem a while ago as we prioritized some of our bigger feature ideas: we realized we had a gap in our understanding of how our users actually use Gauges. We’ve sent surveys and talked to many of our users directly, but that only goes so far. What people do and what they say are often different.

So, we decided to instrument feature usage. Here’s how we did it. The Gauges interface effectively has 3 layers of navigation. First is the gauge (site), then the gauge-specific sections of traffic data, then (in some cases) a subdivision of that data.

Levels of navigation in Gauges

We added some JavaScript[1] code that tracks when the latter two of those change. We namespace the metrics with the word ‘feature’ and separate the second and third levels with dots, which gives us metrics like:

feature.traffic.past_day
feature.technology.platforms
feature.referrers
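
To make the mechanics concrete, here’s a simplified sketch of what that server-side increment could look like with the instrumental_agent Ruby gem (see the footnote below). The endpoint, parameter names, and API key placeholder are illustrative, not our exact code:

require 'sinatra'
require 'instrumental_agent'

I = Instrumental::Agent.new('YOUR_API_KEY')

# Hypothetical endpoint the tracking JavaScript hits when navigation changes.
post '/track_feature' do
  section     = params[:section]      # e.g. "traffic" or "referrers"
  subdivision = params[:subdivision]  # e.g. "past_day"; may be nil

  # Namespace with "feature" and join the remaining levels with dots.
  I.increment(['feature', section, subdivision].compact.join('.'))
  204
end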

When we increment those in Instrumental, we can use the following query to graph how much each feature is used as a percentage of total feature usage (ts_sum adds together every series matching feature.?, giving us the total):

(feature.? * 100) / ts_sum(feature.?)

That gives us this awesome graph (when viewed at a 1 hour resolution, and with the step graph style):

Tracking application feature usage with Instrumental

Now we can see exactly how much each feature gets used. Even better, when we add features, we can see how much they’re used compared to the existing features! We’ve already learned some interesting things in the short time we’ve been tracking feature usage in Gauges.

After we let this graph simmer for a couple of days, we learned something that surprised us: AirTraffic isn’t used as much as we thought.

Some people use it, but not many. AirTraffic shows pins dropping on a map of the world as new visitors reach your sites, in real time. It’s pretty cool to watch, and we use it to see the impact of big announcements - but our customers just don’t rely on it the way we assumed they did.

Gauges AirTraffic feature

We’ve got some ideas about why our customers aren’t using AirTraffic, and what we can do about it. Now that we’re measuring feature usage this way, we can visualize how our changes affect how features are used in Gauges.

[1] The JavaScript actually hits our web server, which makes the request to Instrumental. That way, we use our existing setup using the Ruby Gem.

We’ve added two new features to make the query language even better - a new wildcard operator that doesn’t match metric children, and a method for adding a constant value to any graph.

1) ? wildcard operator.

This wildcard operator is basically the same as *, except that it will never match a dot. That makes it easy to match top-level metric names without also matching their children.

For example, here are a few metrics we use to track new user signups. These are simple increments that keep track of where a user came from:

signup.organic            # this user came out of nowhere
signup.newsletter         # a new subscriber to our newsletter
signup.paid               # any paid signup
signup.paid.google_ads    # a paid signup from google ads
signup.paid.bing_ads      # a paid signup from bing ads
signup.paid.facebook_ads  # a paid signup from facebook ads

Using our * wildcard to graph these metrics would include the paid breakdowns, throwing off our numbers and making the graph harder to understand. Until now, the fix was to graph each metric individually and avoid signup.paid.* entirely.

Using this new wildcard, you can graph:

signup.? 

This will match signup.paid, but not signup.paid.google_ads. This little guy’s already making our graphs cleaner and easier to understand.

2) constant(x)

Setting a constant value in Instrumental

Sometimes you want to just show a reference value on a graph. This is super useful if you have an alert and you want to see the threshold alongside the graph.

Here we’ve set a constant value of 175,000,000 on a graph tracking collector metrics received in Instrumental. This is a pretty high constant, but we do receive a ton of metrics.

Application metrics received in Instrumental
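
If you want to recreate something like this, graph a metric and a constant as two expressions on the same graph. The metric name here is just a stand-in:

your.metric.name
constant(175000000)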

We’re excited to bring this improvement to the query language and make it even easier to monitor your application!

Our new graph grouping feature is live! We’re super excited about this update - groups have been one of our most requested features, and it’s already making our lives easier. Let’s take a look at how you can use it!

Creating a new group from your dashboard is easy:

Creating a new group of graphs

After creating a new group and giving it a useful name, you can move a graph to your group like so:

Adding a graph to a group

Graphs added to a group will disappear from your default dashboard view. You can manage graphs inside a group the same way you’ve managed them from your default dashboard: repositioning graphs by dragging and dropping them into place, renaming graphs, deleting graphs and so on.

You can also remove a graph from a group. Any graphs removed from a group are sent back to your default dashboard view. This won’t actually delete your graphs!

Managing graphs inside of a group

What about creating a temporary group of graphs? Maybe you’re testing an infrastructure change and you don’t need those server monitoring graphs hanging around forever. Never fear! Deleting a group is also really easy, and will also send any graphs in that group back to your default dashboard view:

Deleting a group of graphs

We’ve got dozens of graphs for monitoring application and server performance in Instrumental. Now we’ve also created two groups of related graphs: one for tracking metric collection, and another for monitoring performance after deploying new code to production.

Instrumental is all about customized application monitoring, and grouping makes it easier for you to keep an eye on the metrics that matter to your app.

Your application is going to break. Maybe not today, maybe not tomorrow, but it will happen. Data centers fail, worker boxes get saturated and disks run out of storage. It happens to the best applications.

You’ve probably got a good handle on catastrophic server failure. Smart developers keep a watchful eye on their infrastructure, and we know you’re smart. Monitoring server performance is basically a no-brainer for seasoned web developers.

What about when your app isn’t strictly broken, but isn’t performing in a way your users expect? Your servers appear to be fine - no alerts have been triggered, and your house isn’t burning down as far as you can tell. But your users aren’t getting the response they’ve come to expect from your app.

If a user tries to use your application and does not get the expected response, your service might as well be down, as far as that user is concerned. Are you watching these errors as closely as you’re watching your server performance?

We designed Instrumental for application monitoring at every level, and this includes tracking errors exposed to our users. Sure, some of these are unavoidable - minor hiccups in response times, or a user entering bad data, or any of a thousand tiny problems that arise in application development.

We’re not super interested in these incidental errors - we’re more concerned with trends and recurring problems, especially errors exposed to our users. So we make graphs to track multiple kinds of user-facing errors. A sample graph looks like this:

Sample graph tracking user facing errors

We’ll give this graph a clever name (something catchy like “This graph should be at 0”) and then add metrics for failures exposed to our users. This can be any number of things: malformed data, bad requests, any time a user hits a 404 or 500, slow web requests, and so on and so forth. We’re tracking any instance of the application performing in a way the user does not expect.
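
As one example of how a graph like this can get its data, here’s a small Rack middleware sketch that counts 4xx and 5xx responses with the instrumental_agent gem. The metric names (and the middleware itself) are illustrative - track whatever counts as a user-facing failure in your app:

require 'instrumental_agent'

# Counts user-facing HTTP failures as responses leave the app.
class UserFacingErrorCounter
  def initialize(app, agent)
    @app   = app
    @agent = agent
  end

  def call(env)
    status, headers, body = @app.call(env)
    @agent.increment('errors.user_facing.not_found')    if status.to_i == 404
    @agent.increment('errors.user_facing.server_error') if status.to_i >= 500
    [status, headers, body]
  end
end

# In config.ru:
# use UserFacingErrorCounter, Instrumental::Agent.new('YOUR_API_KEY')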

This graph’s not staying steady at 0, but the errors we’re tracking are all resolving pretty quickly. We’d love for this graph to always be at 0, and for our users to never experience any problems, ever. But that’s not super realistic. Intermittent errors like these aren’t as troubling as prolonged, increasing failure. For example, if this clever graph looked more like this, we’d have a much different problem on our hands:

A sample graph showing serious application problems

That blue line is moving up and to the right, which tells us that something bad is happening, and it’s getting worse. Graphing errors is useful not just for noticing when things go sideways (often before your users are aware) but for quickly spotting trends. Remember - this graph is tracking user-facing errors, so our users are aware of this problem. Not the best situation, but at least we’ve got an idea of where to look!

Better still, we’ve got a good idea of when we started seeing this error, how long it’s been going on, and how bad the problem is.

The actual errors and methods of tracking them will vary between applications, but there are two things we want to do: keep an eye on user-facing errors, and make sure those metrics are measured in a way that makes sense on the same graph. For example, if a low trend is good for one metric, we don’t want to add it to a graph with metrics where a high trend is good. Nobody wants to waste time interpreting data when there are problems to solve!

If we see one or two 404s being returned, we’ll investigate the issue but we aren’t super worried. If that number spikes to 100, establishing a trend line that doesn’t resolve itself, then we’re more worried.

Application monitoring at the user level isn’t that different from server monitoring, really - we just keep an eye on problems that are exposed to our users, rather than less visible infrastructure problems. We never, ever want our first notification of a user-facing problem to be an angry email from a paying customer, asking us why the service is down.

The service may not actually be down, as far as we can tell. All of our server monitoring graphs could be performing just as expected. But if our service isn’t working for our users, the end result is the same - a paid service that isn’t delivering value. Talk about a nightmare scenario!

Luckily, Instrumental’s query language makes it easy to track everything that happens in our application, and we can easily create graphs with the exact metrics we want to measure. Having similar types of errors in a single graph makes monitoring our applications easy, and tracking user-facing problems leads to happier customers.

We’ve done a lot of infrastructure work on Instrumental over the last few weeks, with two main goals: making our data collection and writing process more resilient, and making it easier to scale. As an application monitoring service, we process billions of data points every single day, and we would love to write more data in less time.

We’ve been using MongoDB for a while now, and TokuMX sounded pretty great. TokuMX is designed for big data applications, and at billions of writes per day, we definitely think Instrumental qualifies.

Toku’s marketing page promises a ton of things that sound way better than standard MongoDB - most notably, 20x faster performance (without tuning, no less!) and a 90% reduction in database size.

Sure, let’s try this thing out!

But first, let’s test our assumptions. We don’t actually want to push Toku into production before we’re reasonably sure it will outperform our current implementation of MongoDB.

An important component of successful application monitoring is using data to make better decisions. We’ve already got a graph that tracks how fast our MongoDB install is writing to disk. This gives us a ton of useful historical data, so we can measure Toku’s performance against Mongo’s performance, and prove whether Toku is actually an improvement over our MongoDB implementation.

Side note: we’re not exactly running a standard MongoDB instance here. Our dev team is very familiar with Mongo, and we’ve squeezed quite a bit of extra performance out of it over the years. Still, Toku seems worth trying out. To the graphs!

Here’s the graph we use to track database write speed for Instrumental, measured in Mongo updates per second and graphed in 5-minute increments. On this graph, higher is better, and if we’re seeing a lot of big valleys, we know we’ve got a problem writing data somewhere.

MongoDB updates per second
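
For anyone who wants to build a similar graph, here’s a rough sketch of one way to collect the number: sample serverStatus opcounters with the Ruby mongo driver and gauge the delta. This is only an illustration of the idea (the metric name is made up), not necessarily how we collect it:

require 'mongo'
require 'instrumental_agent'

I = Instrumental::Agent.new('YOUR_API_KEY')
client = Mongo::Client.new(['localhost:27017'], database: 'admin')

# opcounters.update is a cumulative count of update operations.
read_updates = lambda do
  client.database.command(serverStatus: 1).documents.first['opcounters']['update']
end

last = read_updates.call
loop do
  sleep 60
  current = read_updates.call
  # The delta over the interval gives us updates per second.
  I.gauge('mongo.updates_per_second', (current - last) / 60.0)
  last = current
end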

I mentioned we’re not running a completely standard install of MongoDB, so we aren’t really expecting a 20x improvement. That would be AWESOME, but it’s not realistic. Still, Toku is specifically designed to improve performance for high-scale applications, so we should see SOME bump. Let’s take a look!

Based on our current graph, we do expect to see some boost in performance. We’re assuming the Toku graph will look similar to our current Mongo graph, just better. Let’s find out!

We dropped in Toku on one of our database boxes, and let it percolate for a day or so to get an idea of what performance would look like. I should note we’re running Mongo and Toku on identical hardware here. This would be a pretty flawed test if we weren’t!

Here’s what this graph looks like, once we start charting Toku performance:

MongoDB updates versus Toku updates

Sure, sometimes Toku performed better than MongoDB, but do you see how spiky those write speeds are? Way spikier than we’d like, even when compared to how fast MongoDB can write to disk. And Toku’s average write speed ended up lower overall: a single Mongo process could sustain around 6,000 upserts per second.

At first, Toku performed very well - we were seeing speeds of around 10,000 upserts per second - but that slowly dwindled to somewhere between 3,000 and 6,000 upserts per second. After letting this run for a while, we were seeing roughly a 20% decline in write speeds.

We tried different compression settings - zlib, fastlz, and no compression. Nothing we tried had much impact on upsert speed, although it could be that the effects were delayed, since the reindex commands returned nearly instantaneously. After a day of testing and performance monitoring, it became clear Toku wasn’t right for Instrumental.

This isn’t to say we don’t like Toku - we were drawn in by the performance boosts we thought we could realize, and it was a lot of fun to work with. Admin tasks in particular felt fast and responsive.

Still, our assumptions around possible improvements were absolutely worth testing, and we’re glad we had Instrumental to help us realize Toku wasn’t right for us.

Once upon a time, automated testing was not a popular idea. It was too expensive. It was too time-consuming. At best, it was a nice-to-have.

The prevailing idea was that if you were a good and careful software developer, regressions weren’t a problem. When a regression did happen (rarely, of course), the good and careful software developer that you are would carefully consider why it had happened and make a correction to prevent it from happening again. This now seems crazy to most developers.

Unit testing is now accepted as a best practice. Done correctly, it saves time, money, and development effort. It helps developers make good software faster.

Application monitoring is moving along a similar path and for similar reasons. Unfortunately, application monitoring is still in that nice-to-have phase.

If you haven’t thought of application monitoring as being very similar to automated testing - and just as important as automated testing - I’m asking you to do it now. Here are just a few of their similarities:

1) It’s a process, not a product.

Application monitoring is production-level, real-time testing. Tools like Instrumental exist to make it easy, but there’s some work only you can do.

Imagine installing some testing software, providing it with minimal configuration, and then just running that before putting code into production. Could that be helpful? Maybe. Is it as good as writing your own tests? No. Obviously not.

Application monitoring is not something you can install once and have the problem stay solved. Every piece of software is different. It has unique code and unique hotspots. Similar to testing, every time the code is changed it’s worth taking a minute to consider what should be monitored in production. If you don’t, your monitoring tool is delivering about as much value as an install-and-run test suite.

2) Quality is better than quantity.

Imagine a 10-line application with 30,000 unit tests. You’re either about to experience the most sophisticated 10 lines of software ever written or someone decided that every time you changed a character you needed to see 1000 failing tests.

Like testing, application monitoring isn’t about having as much of it as you can. There’s no award for having the most measurement. In fact, it’s counter-productive to have too much. Having more things flash red doesn’t help you solve a problem faster. It’s about getting the right information to help solve the problem.

When it comes to your production environment, there’s no substitute for seeing exactly what you need to know. What code is having a problem? What, exactly, is the problem? Thoughtful, customized measurement is critical.

3) It tells you what’s wrong before anyone else.

In this case “anyone else” is “the customer”. Application monitoring is about spotting problems before they happen. It’s about being proactive. In the worst case, it’s about figuring out what’s wrong as fast as possible. Like a good unit test, good application monitoring points to exactly what’s wrong so you can start fixing it immediately.

If you’ve worked on a project where the tests are brittle, wrong, or misleading, you know exactly how valuable it is to have tests that break at exactly the right time and provide a clear message of what is wrong. Good application monitoring provides that same value in production.

4) It saves you time, money, and sanity.

Good testing is about increased development speed through increased confidence. The more sure you are you’ve done the right thing, the faster you can go. Application monitoring does the same thing in production. It’s about knowing that every part of your software is doing the right thing every moment of every day. Guessing about the production status of any part of your application, hand checking it occasionally, or using some high-level off-the-shelf solution is a recipe for missing out on the critical details that are specific to your application.

5) It’s about admitting that you don’t always know what’s going to happen.

Application monitoring is not a replacement for testing. Automated testing is incredibly valuable. Even the smartest and most careful developer can’t always predict every little nuance of the code they’re working on. Tests help prevent those mistakes from seeing the light of day.

Application monitoring works the same way. A lot can happen in production that even an excellent developer can’t predict or can’t (easily) replicate. Customers do things you never considered. There’s a weird interaction with a new version of a browser. There’s a difference between the development environment and production that no one ever noticed before. A famous person tweeted about your app and your customer base tripled overnight.

Adding carefully thought-out custom monitoring can help safeguard against the things you can’t predict.


The case is clear: application monitoring is not a nice-to-have. It’s time for software developers to start making application monitoring part of their standard practice.

Are you excited about using Instrumental for monitoring your .NET application, but less excited about using the Instrumental backend for StatsD? Good news!

One of our users from Bloomerang wrote an awesome C# agent library, and we’re excited to share it with you guys:

https://github.com/ralph-bloomerang/Instrumental.NET

Implementation is pretty easy, too. Here’s some sample code:

using Instrumental.NET;

var agent = new Agent("your api key here");

agent.Increment("myapp.logins");                                        // count an event
agent.Gauge("myapp.server1.free_ram", 1234567890);                      // record a point-in-time value
agent.Time("myapp.expensive_operation", () => LongRunningOperation());  // time how long the operation takes
agent.Notice("Server maintenance window", 3600);                        // drop a note on your graphs for a given duration

BOOM! StatsD is no longer required for using Instrumental to monitor your .NET application.

Do you remember that time one user with a rogue script saturated your workers, causing a ton of problems for your system (and your other users!) without ever realizing it?

We ran into a similar issue a while back with DocRaptor. One of our users was generating way more documents than normal - his test document creation alone accounted for 75% of all documents being generated!

This problem was easy to spot, as we’d built a graph around enqueued documents by user id. We’ve used Instrumental for application monitoring for years now, and this graph is usually pretty boring. We like this graph to be boring, because it means nobody’s abusing our system.

Once in a while, though, it gets pretty exciting. Here’s what our enqueued documents graph looked like when we spotted the problem:

Monitoring application load with Instrumental

Pretty gross, right? Luckily we were able to catch this bug before it actually became a problem and a quick chat with the user resolved the issue.

This user had added DocRaptor to several applications and part of his test suite for one implementation had a bug that generated far too many test documents every time he ran the suite. Let me clarify: his test suite didn’t stop generating documents. You can see why we found this slightly concerning.

Creating graphs to monitor application load with Instrumental is easy. We’ve used the series_top_n function to track users who have enqueued more than 25 documents at once. The syntax looks like this:

series_top_n(amount, metric_pattern, ...)

And here’s how we’ve implemented this function to track users who have enqueued documents in increments greater than 25:

series_top_n(25, document_enqueued.user*)

We’re using this method to monitor application load because we only care about large numbers of enqueued documents per user. The counter is incremented on each document creation request, and then we can graph that data to see anything interesting.
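
The instrumentation side is just a per-user increment wherever a document gets enqueued. Here’s a tiny sketch using the instrumental_agent gem (the job class and method here are hypothetical):

require 'instrumental_agent'

I = Instrumental::Agent.new('YOUR_API_KEY')

# Wherever a document creation request is accepted:
def enqueue_document(user, document_params)
  DocumentJob.enqueue(user.id, document_params)     # hypothetical background job
  I.increment("document_enqueued.user.#{user.id}")  # feeds document_enqueued.user*
end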

Thanks to Instrumental, we were able to give this user enough information to quickly debug and solve the issue on his end.

It’s been a crazy week for security, and we’re barely halfway through yet!

News about the Heartbleed OpenSSL security flaw broke Monday night. This bug allows anyone to read the memory of systems using vulnerable versions of the OpenSSL software, which means certain private information is no longer as private as you might like.

We’re not just talking credit card information (though this is certainly the most troubling for many people) but email addresses, passwords, and anything stored in memory. It’s kind of a huge problem.

Amazon moved quickly and began rolling out patched ELBs shortly after the Heartbleed vulnerability was discovered, and our servers have been upgraded as of this morning.

Has your site been affected? Here’s a great post about Github’s response to Heartbleed, as well as advice for making sure your account remains secure.

Github user titanous wrote a neat tool to determine whether a site has been affected by Heartbleed.

We’re glad we were able to get Instrumental patched so quickly, but this vulnerability will definitely have far-reaching consequences for a ton of businesses.

How did Heartbleed affect your business? What steps have you guys taken to fix the problem?