Your dashboard is where you’ll spend most of your time using Instrumental. You’ll make graphs to monitor application performance in real time, keep an eye on server load, track user churn, and all other aspects of your application that matter to your business.

We spend a lot of time thinking about how we can make Instrumental’s dashboard better for you, and we recently made some updates to graph performance and sharing.

Improvements to Graph URLs

The URL for any graph you create now saves all the options you’ve selected. You can copy this URL and send it to any member of your team, and the selected time frame, duration, and any display options will be included. If you’ve created a stacked graph with no events, anyone you share the graph with will see the same stacked graph with no events.

Here’s an example of how you might use this in production.

Let’s assume you’ve noticed an uptick in alerts over the last three days. You have some idea of where the issue is coming from, and you want to investigate the problem with another member of your team using a shared graph.

You create a temporary graph to investigate the problem, setting the default time period and resolution to the last 3 days, in 20-minute increments. Once you’re confident you’ve found the correct time frame and have a graph that illustrates the issue, you can copy the URL and share it with the members of your team.

Previously, any graphs you shared would save some of your options, but not all. It was impossible to predict how the graph would look once it was shared.

The URL will now save all the options you used to create the graph, including timeframe, notification visibility, and all other options on the graph. Anyone you share the URL with will see exactly what you’re seeing.

Permanently Updating Graphs

Second, we tracked down a bug that caused graphs to stop updating. Several customers display their application dashboards on wall-mounted monitors, and they noticed some of their graphs stopped updating automatically when the browser tab was left open for a very long time.

We suspected an issue with the underlying graph refresh code, and that’s where we started looking. Sure enough, we found a bug: if a graph refresh attempt failed, the code would simply stop trying to refresh that graph.

We’ve updated this behavior, and now the graph refresh code will keep retrying even if its initial attempt fails. This change should make life easier for everyone keeping an eye on application performance with Instrumental, even if you aren’t using an entire wall of monitors to track your app.
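As a minimal sketch of the corrected behavior (the refresh block and retry cap here are hypothetical stand-ins for the real browser-side graph code, which keeps retrying indefinitely):

```ruby
# One failed refresh no longer halts all future refreshes; the code
# simply tries again until the refresh succeeds.
def keep_refreshing(max_attempts: 10, &refresh)
  attempts = 0
  begin
    attempts += 1
    refresh.call
  rescue StandardError
    # The old code effectively gave up here after the first failure.
    retry if attempts < max_attempts
  end
  attempts
end

calls = 0
keep_refreshing { calls += 1; raise "transient failure" if calls < 3 }
# the refresh succeeds on the third attempt
```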

Timely and relevant alerts are a key component of effective application monitoring, and we’ve done a lot of work recently to improve our alerts. Let’s dig in and talk about the improvements we’ve made!

The first thing you’re likely to notice is the improved alerts configuration page.

From here, you’ll see a list of all your alerts. You can click through on any of your alerts, view the related graph, and get a complete historical view of the alert itself. Instead of just showing the initial cause for an alert, this page now lists all causes of an alert firing, and the graph shown links to the time range for the alert.

We’ve updated all links to alert graphs, including embedded graphs sent out via email. Any graphs sent out with an alert email will include a relevant time range for the alert. Even if you come back days later and view the email again, the graph will show the time range for the alert.

All alerts now include a triggering metric. Now you’ll always know what happened to trigger the alert, and where you should start looking to fix the problem.

Causes for alerts will now be constantly updated. If an alert is triggered by one part of your application breaking and remains open thanks to some other part of your application breaking, you’ll see relevant messages from each event.

Bug Fixes

We’ve squashed a couple of bugs: one makes it faster to build new graphs using metric autocompletion, and the other improves resolution selection on graph views.

Metric Autocompletion

Previously, trying to autocomplete metric names would result in the entire statement being replaced when you hit tab. Now you can tab your way to happiness without losing the statement you’ve built:

This fix makes it a bit faster to build a new graph, or to add a complicated query string to an existing one. It should make your life a little easier.

Updated Resolutions

We’ve made a change to our API to better handle resolution and duration selection.

Previously, if you requested an unsupported duration or resolution, the API would default to the smallest available option. Now Instrumental will automatically display the option closest to your selection.
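The “closest option” behavior can be sketched like this; the set of supported resolutions below is hypothetical, not Instrumental’s actual list:

```ruby
# Snap an arbitrary requested resolution to the nearest supported one,
# instead of defaulting to the smallest available option.
SUPPORTED_RESOLUTIONS = [60, 300, 1800, 3600, 86_400] # seconds; illustrative

def closest_resolution(requested)
  SUPPORTED_RESOLUTIONS.min_by { |r| (r - requested).abs }
end

closest_resolution(600) # a request for 10 minutes snaps to 5 minutes (300)
```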

Deciding what to measure is hard, and even daunting at first. There’s a ton of code in your project and you don’t want to just slap a gaggle of useless metrics in there. Your measurement should mean something, dammit! On the other hand, it would be nice if it didn’t take forever to get started :)

Don’t worry - getting started doesn’t have to take forever. Your code base is a great place to get inspiration for measurement, so let’s take a look at the usual suspects!

1) Your test suite

You’ve already thought long and hard about what’s worth protecting in your application (or someone has, hopefully). It’s impossible for me to guess the unique and beautiful snowflakes that make up your test suite, but I’m sure you know them well. Look for the code with lots of tests.

Bonus - this is a great opportunity to deal with Heisenbugs and code that has a history of acting differently in production. Nothing makes dealing with hard-to-solve problems easier than a huge body of relevant data.

2) Your “features”

Add a metric for each of the features in your application. On your first pass, keep the features high-level: “created a game”, “skipped turn”, etc. An unexpected change in feature usage can be a good indicator of subtle (or not so subtle) problems. It’s easy to add more detailed monitoring as you figure out which features are the best indicators of application performance.
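Here’s a hypothetical sketch of what high-level feature metrics look like. In a real app you’d call your monitoring agent’s increment method (e.g. via the Instrumental Ruby gem); a simple in-memory counter stands in here so the example is self-contained:

```ruby
# Count feature usage under a shared metric namespace.
class FeatureCounter
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def increment(metric)
    @counts[metric] += 1 # real code would send this to your monitoring service
  end
end

features = FeatureCounter.new
features.increment("feature.created_game")
features.increment("feature.created_game")
features.increment("feature.skipped_turn")
```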

3) Your code coverage

The inverse of your code coverage, actually. The uncovered code in your application is the interesting part. There’s a high likelihood that the code that isn’t covered by your automated tests falls into one of two categories:

A) Code that’s really inconvenient to test in development

B) Defensive code to protect against worst case scenarios

It’s often much easier to instrument these with your application monitoring tool than to try and figure out a test. While you’re instrumenting these, you should also set up alerts. It’s code that you’re pretty sure works. If it doesn’t, you want to know as soon as possible.
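One way to instrument a defensive code path looks something like this; the metric name and recorder lambda are invented for illustration (with the Instrumental Ruby gem, the recorder would be an agent increment call):

```ruby
# Fire a metric when the worst-case fallback actually runs, so an alert
# can tell you the moment the "code you're pretty sure works" doesn't.
def with_instrumented_fallback(metric, recorder, fallback)
  yield
rescue StandardError
  recorder.call(metric) # e.g. agent.increment(metric)
  fallback
end

fired = []
recorder = ->(m) { fired << m }
result = with_instrumented_fallback("fallback.rate_limit", recorder, :degraded) do
  raise "upstream timeout" # the worst case this defensive path guards against
end
```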

There are many more things that are worth instrumenting, but these three things will have you off to a good start. Remember: good application monitoring is a process, not a product. You can’t think of everything up front. If you start with your best ideas and add to it over time, your application monitoring will grow in value just like a good test suite.

Instrumental is about to get a brand new look! For the last few months we’ve been hard at work on a modern, intuitive, and easier UI. We’ll be rolling it out to users over the next few weeks, but we wanted to give you a sneak peek now.

New Time Selector

We merged the resolution and duration dropdowns into a single global control. Now the time selector is always in the same place, whether you’re looking at a group of graphs or focusing on just one.

Updated time selector

Reorganization: Sidebar

We’ve reorganized and simplified most of the interface, but let’s focus on the sidebar. All of your projects are now listed in the sidebar instead of hidden behind a dropdown. For the project you’re currently viewing, you can see all graph groups, access the metric explorer, and manage alerts.

Project sidebar

Improved Graph View

We spent a lot of time improving the interface for creating and viewing individual graphs. You spend most of your time there, and we wanted to make sure we got it right. The biggest change you’ll notice (aside from the new color scheme, of course) is the expression list, which is more prominent and easier to use. We also merged graph view settings, export and delete into a single dropdown. Graph options are there when you need them, ignorable when you don’t.

Graph settings dropdown

New expression list

…and more!

There’s plenty more in the new Instrumental that we haven’t talked about: our new lighter color scheme, better typography, improved usability throughout, and even a few Easter eggs.

We can’t wait to get the new Instrumental into your hands, so keep an eye out for a notification to opt-in during the beta period!

We use Instrumental to track all kinds of things. Many of them are about server and app performance, like latency, queue response time, and disk utilization. But we also have graphs for other kinds of monitoring. This series explores some of our more interesting graphs.

We acquired Gauges in January. Since then, we’ve reworked the server architecture, moved data centers, increased performance, updated the interface for retina displays, rethought our marketing…and that’s just what we’ve shipped so far! We’ve got plenty more in the works, of course. But we ran into a problem a while ago as we prioritized some of our bigger feature ideas: we realized we had a gap in our understanding of how our users actually use Gauges. We’ve sent surveys and talked to many of our users directly, but that only goes so far. What people do and what they say are often different.

So, we decided to instrument feature usage. Here’s how we did it. The Gauges interface effectively has 3 layers of navigation. First is the gauge (site), then the gauge-specific sections of traffic data, then (in some cases) a subdivision of that data.

Levels of navigation in Gauges

We added some JavaScript[1] code that tracks when the second two of those change. We namespace them under the word ‘feature’, with dots separating the second and third levels, which gives us metrics like:
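As a purely hypothetical illustration (the real Gauges metric names may differ), that scheme produces names shaped like:

feature.content          # second level: the Content section
feature.content.pages    # third level: a subdivision of Content
feature.air_traffic      # second level: the AirTraffic view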


When we increment those in Instrumental, we can use the following query to see a graph of how much each feature is used as a percentage of total feature usage:

(feature.? * 100) / ts_sum(feature.?)

That gives us this awesome graph (when viewed at a 1 hour resolution, and with the step graph style):

Tracking application feature usage with Instrumental

Now we can see exactly how much each feature gets used. Even better, when we add features, we can see how much they’re used compared to the existing features! We’ve already learned some interesting things in the short time we’ve been tracking feature usage in Gauges.
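The arithmetic behind that percentage query can be sketched in Ruby; the counts here are invented for illustration:

```ruby
# Sketch of (feature.? * 100) / ts_sum(feature.?): each feature's count
# expressed as a percentage of the total across all features.
counts = {
  "feature.content"     => 40,
  "feature.referrers"   => 35,
  "feature.air_traffic" => 5,
}
total = counts.values.sum
percentages = counts.transform_values { |v| v * 100.0 / total }
# with these numbers, feature.content accounts for 50.0% of usage
```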

After we let this graph simmer for a couple of days, we learned something that surprised us: AirTraffic isn’t used as much as we thought.

Some people use it, but not many. That surprised us. AirTraffic shows pins dropping on a map of the world as new visitors reach your sites, in real time. It’s pretty cool to watch, and we use it to see the impact of big announcements. Turns out our customers don’t use it as much as we thought.

Gauges AirTraffic feature

We’ve got some ideas about why our customers aren’t using AirTraffic, and what we can do about it. Now that we’re measuring feature usage this way, we can visualize how our changes affect how features are used in Gauges.

[1] The JavaScript actually hits our web server, which makes the request to Instrumental. That way, we can reuse our existing setup built on the Ruby gem.

We’ve added two new features to make the query language even better - a new wildcard operator that doesn’t match metric children, and a method for adding a constant value to any graph.

1) ? wildcard operator

This wildcard operator is basically the same as * except that it will never match a dot. This makes it easy to match top-level metric names without matching their children.

For example, here are a few metrics we use to track new user signups. These are simple increments that keep track of where a user came from:

signup                    # this user came out of nowhere
signup.newsletter         # a new subscriber to our newsletter
signup.paid               # any paid signup
signup.paid.google_ads    # a paid signup from google ads
signup.paid.bing_ads      # a paid signup from bing ads
signup.paid.facebook_ads  # a paid signup from facebook ads

Using our * wildcard to graph these metrics would include these paid breakdowns, throwing off our numbers and making that graph harder to understand. The fix was to graph each metric individually, avoiding signup.paid.*.

Using this new wildcard, you can graph:

signup.?

This will match signup.paid, but not signup.paid.google_ads. This little guy’s already making our graphs cleaner and easier to understand.
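Here’s a rough Ruby sketch of how the two wildcards differ; this isn’t Instrumental’s implementation, just an illustration of the matching rules:

```ruby
# '*' can match anything, including dots; '?' matches anything except a dot.
def wildcard_to_regex(pattern)
  body = Regexp.escape(pattern).gsub('\*', '.+').gsub('\?', '[^.]+')
  /\A#{body}\z/
end

wildcard_to_regex("signup.?").match?("signup.paid")            # true
wildcard_to_regex("signup.?").match?("signup.paid.google_ads") # false
wildcard_to_regex("signup.*").match?("signup.paid.google_ads") # true
```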

2) constant(x)

Setting a constant value in Instrumental

Sometimes you want to just show a reference value on a graph. This is super useful if you have an alert and you want to see the threshold alongside the graph.

Here we’ve set a constant value of 175,000,000 on a graph tracking collector metrics received in Instrumental. This is a pretty high constant, but we do receive a ton of metrics.

Application metrics received in Instrumental
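As a sketch, a graph like that might pair a metric with the constant (the metric name here is hypothetical, not our actual one):

collector.metrics_received
constant(175000000)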

We’re excited to bring this improvement to the query language and make it even easier to monitor your application!

Our new graph grouping feature is live! We’re super excited about this update - groups have been one of our most requested features, and it’s already making our lives easier. Let’s take a look at how you can use it!

Creating a new group from your dashboard is easy:

Creating a new group of graphs

After creating a new group and giving it a useful name, you can move a graph to your group like so:

Adding a graph to a group

Graphs added to a group will disappear from your default dashboard view. You can manage graphs inside a group the same way you’ve managed them from your default dashboard: repositioning graphs by dragging and dropping them into place, renaming graphs, deleting graphs and so on.

You can also remove a graph from a group. Any graphs removed from a group are sent back to your default dashboard view. This won’t actually delete your graphs!

Managing graphs inside of a group

What about creating a temporary group of graphs? Maybe you’re testing an infrastructure change and you don’t need those server monitoring graphs hanging around forever. Never fear! Deleting a group is also really easy, and will also send any graphs in that group back to your default dashboard view:

Deleting a group of graphs

We’ve got dozens of graphs for monitoring application and server performance in Instrumental. Now we’ve also created two groups of related graphs: one for tracking metric collection, and another for monitoring performance after deploying new code to production.

Instrumental is all about customized application monitoring, and grouping makes it easier for you to keep an eye on the metrics that matter to your app.

Your application is going to break. Maybe not today, maybe not tomorrow, but it will happen. Data centers fail, worker boxes get saturated and disks run out of storage. It happens to the best applications.

You’ve probably got a good handle on catastrophic server failure. Smart developers keep a watchful eye on their infrastructure, and we know you’re smart. Monitoring server performance is basically a no-brainer for seasoned web developers.

What about when your app isn’t strictly broken, but isn’t performing in a way your users expect? Your servers appear to be fine - no alerts have been triggered, and your house isn’t burning down as far as you can tell. But your users aren’t getting the response they’ve come to expect from your app.

If a user tries to use your application and does not get the expected response, your service might as well be down, as far as that user is concerned. Are you watching these errors as closely as you’re watching your server performance?

We designed Instrumental for application monitoring at every level, and this includes tracking errors exposed to our users. Sure, some of these are unavoidable - minor hiccups in response times, or a user entering bad data, or any of a thousand tiny problems that arise in application development.

We’re not super interested in these incidental errors - we’re more concerned about trends and recurring problems. We’re especially concerned with errors exposed to our users, so we make graphs to track multiple kinds of user-facing errors. A sample graph looks like this:

Sample graph tracking user facing errors

We’ll give this graph a clever name (something catchy like “This graph should be at 0”) and then add metrics for failures exposed to our users. This can be any number of things: malformed data, bad requests, any time a user hits a 404 or 500, slow web requests, and so on and so forth. We’re tracking any instance of the application performing in a way the user does not expect.
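Here’s a hypothetical sketch of how requests might be classified into those metrics; the metric names and the slow-request threshold are invented for illustration:

```ruby
# Classify a response into a user-facing error metric, or nil if it's fine.
SLOW_REQUEST_SECONDS = 2.0 # illustrative threshold

def user_facing_error_metric(status, duration_seconds)
  return "errors.http.#{status}" if status >= 400       # 404s, 500s, etc.
  return "errors.slow_request" if duration_seconds > SLOW_REQUEST_SECONDS
  nil # a fast, successful response adds nothing to the graph
end

user_facing_error_metric(404, 0.1) # → "errors.http.404"
```

Each non-nil result would be incremented on the graph, so the line only moves when users actually see a problem.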

This graph’s not staying steady at 0, but the errors we’re tracking are all resolving pretty quickly. We’d love for this graph to always be at 0, and for our users to never experience any problems, ever. But that’s not super realistic. Intermittent errors like these aren’t as troubling as prolonged, increasing failure. For example, if this clever graph looked more like this, we’d have a much different problem on our hands:

A sample graph showing serious application problems

That blue line is moving up and to the right, which tells us that something bad is happening, and it’s getting worse. Graphing errors is useful not just for noticing when things go sideways (often before your users are aware) but for quickly spotting trends. Remember - this graph is tracking user-facing errors, so our users are aware of this problem. Not the best situation, but at least we’ve got an idea of where to look!

Better still, we’ve got a good idea of when we started seeing this error, how long it’s been going on, and how bad the problem is.

The actual errors and methods of tracking them will vary between applications, but there are two things we want to do: keep an eye on user-facing errors, and make sure those metrics are measured in a way that makes sense on the same graph. For example, if a low trend is good for one metric, we don’t want to add it to a graph with metrics where a high trend is good. Nobody wants to waste time interpreting data when there are problems to solve!

If we see one or two 404s being returned, we’ll investigate the issue but we aren’t super worried. If that number spikes to 100, establishing a trend line that doesn’t resolve itself, then we’re more worried.

Application monitoring at the user level isn’t that different from server monitoring, really - we just keep an eye on problems that are exposed to our users, rather than less visible infrastructure problems. We never, ever want our first notification of a user-facing problem to be an angry email from a paying customer, asking us why the service is down.

The service may not actually be down, as far as we can tell. All of our server monitoring graphs could be performing just as expected. But if our service isn’t working for our users, the end result is the same - a paid service that isn’t delivering value. Talk about a nightmare scenario!

Luckily, Instrumental’s query language makes it easy to track everything that happens in our application, and we can easily create graphs with the exact metrics we want to measure. Having similar types of errors in a single graph makes monitoring our applications easy, and tracking user-facing problems leads to happier customers.

We’ve done a lot of infrastructure work on Instrumental over the last few weeks. Our main goals with this infrastructure work were making Instrumental’s data collection and writing process more resilient and easier to scale. As an application monitoring service, we process billions of data points every single day, and we would love to write more data in less time.

We’ve been using MongoDB for a while now, and TokuMX sounded pretty great. TokuMX is designed for big data applications, and at billions of writes per day, we definitely think Instrumental qualifies.

Toku’s marketing page promises a ton of things that sound way better than standard MongoDB - most notably, 20x faster performance (without tuning, no less!) and a 90% reduction in database size.

Sure, let’s try this thing out!

But first, let’s test our assumptions. We don’t actually want to push Toku into production before we’re reasonably sure it will outperform our current implementation of MongoDB.

An important component of successful application monitoring is using data to make better decisions. We’ve already got a graph that tracks how fast our MongoDB install is writing to disk. This gives us a ton of useful historical data, so we can measure Toku’s performance against Mongo’s performance, and prove whether Toku is actually an improvement over our MongoDB implementation.

Side note: we’re not exactly running a standard MongoDB instance here. Our dev team is very familiar with Mongo, and we’ve squeezed quite a bit of extra performance out of it over the years. Still, Toku seems worth trying out. To the graphs!

Here’s the graph we use to track database write speed for Instrumental, measured in Mongo updates per second, and we’re graphing this number in 5-minute increments. On this graph, higher is better, and if we’re seeing a lot of big valleys, we know we’ve got a problem writing data somewhere.

MongoDB updates per second

I mentioned we’re not really running a standard install of MongoDB, so we aren’t really expecting 20x improvement. That would be AWESOME, but it’s not realistic. Still, Toku’s specifically designed to improve performance for high scale applications, so we should see SOME bump. Let’s take a look!

Based on our current graph, we do expect to see some boost in performance. We’re assuming the Toku graph will look similar to our current Mongo graph, just better. Let’s find out!

We dropped in Toku on one of our database boxes, and let it percolate for a day or so to get an idea of what performance would look like. I should note we’re running Mongo and Toku on identical hardware here. This would be a pretty flawed test if we weren’t!

Here’s what this graph looks like, once we start charting Toku performance:

MongoDB updates versus Toku updates

Sure, sometimes Toku performed better than MongoDB, but do you see how spiky those write speeds are? Way spikier than we’d like, even compared to how fast MongoDB can write to disk. Toku’s total average write speeds were also lower; for reference, a single Mongo process could perform around 6,000 upserts per second.

At first, Toku performed very well - we were seeing speeds of around 10,000 upserts per second, but this slowly dwindled to roughly 3,000 to 6,000 upserts per second. After letting this run for a while, we were seeing roughly a 20% decline in write speeds.

We tried different compression settings - zlib, fastlz, and no compression. Nothing we tried had much impact on upsert speed, although the effects may have been delayed, since the reindex commands returned near-instantaneously. After a day of testing and performance monitoring, it became clear Toku wasn’t right for Instrumental.

This isn’t to say we don’t like Toku - we were drawn in by the performance boosts we thought we could realize, and it was a lot of fun to work with. Admin work was a lot of fun, and we found it to be very fast and responsive.

Still, our assumptions around possible improvements were absolutely worth testing, and we’re glad we had Instrumental to help us realize Toku wasn’t right for us.

Once upon a time, automated testing was not a popular idea. It was too expensive. It was too time-consuming. At best, it was a nice-to-have.

The prevailing idea was that if you were a good and careful software developer, regressions weren’t a problem. When a regression did happen (rarely, of course), the good and careful software developer that you are would carefully consider why it had happened and make a correction to prevent it from happening again. This now seems crazy to most developers.

Unit testing is now accepted as a best practice. Done correctly, it saves time, money, and development effort. It helps developers make good software faster.

Application monitoring is moving along a similar path and for similar reasons. Unfortunately, application monitoring is still in that nice-to-have phase.

If you haven’t thought of application monitoring as being very similar to automated testing - and just as important as automated testing - I’m asking you to do it now. Here are just a few of their similarities:

1) It’s a process, not a product.

Application monitoring is production-level, real-time testing. Tools like Instrumental exist to make it easy, but there’s some work only you can do.

Imagine installing some testing software, providing it with minimal configuration, and then just running that before putting code into production. Could that be helpful? Maybe. Is it as good as writing your own tests? No. Obviously not.

Application monitoring is not something you can install once and have the problem stay solved. Every piece of software is different. It has unique code and unique hotspots. Similar to testing, every time the code is changed it’s worth taking a minute to consider what should be monitored in production. If you’re not, your monitoring tool is delivering about as much value as an install-and-run test suite.

2) Quality is better than quantity.

Imagine a 10-line application with 30,000 unit tests. You’re either about to experience the most sophisticated 10 lines of software ever written or someone decided that every time you changed a character you needed to see 1000 failing tests.

Like testing, application monitoring isn’t about having as much of it as you can. There’s no award for having the most measurement. In fact, it’s counter-productive to have too much. Having more things flash red doesn’t help you solve a problem faster. It’s about getting the right information to help solve the problem.

When it comes to your production environment, there’s no substitute for seeing exactly what you need to know. What code is having a problem? What, exactly, is the problem? Thoughtful, customized measurement is critical.

3) It tells you what’s wrong before anyone else.

In this case “anyone else” is “the customer”. Application monitoring is about spotting problems before they happen. It’s about being proactive. In the worst case, it’s about figuring out what’s wrong as fast as possible. Like a good unit test, good application monitoring points to exactly what’s wrong so you can start fixing it immediately.

If you’ve worked on a project where the tests are brittle, wrong, or misleading, you know exactly how valuable it is to have tests that break at exactly the right time and provide a clear message of what is wrong. Good application monitoring provides that same value in production.

4) It saves you time, money, and sanity.

Good testing is about increased development speed through increased confidence. The more sure you are you’ve done the right thing, the faster you can go. Application monitoring does the same thing in production. It’s about knowing that every part of your software is doing the right thing every moment of every day. Guessing about the production status of any part of your application, hand checking it occasionally, or using some high-level off-the-shelf solution is a recipe for missing out on the critical details that are specific to your application.

5) It’s about admitting that you don’t always know what’s going to happen.

Application monitoring is not a replacement for testing. Automated testing is incredibly valuable. Even the smartest and most careful developer can’t always predict every little nuance of the code they’re working on. Tests help prevent those mistakes from seeing the light of day.

Application monitoring works the same way. A lot can happen in production that even an excellent developer can’t predict or can’t (easily) replicate. Customers do things you never considered. There’s a weird interaction with a new version of a browser. There’s a difference between the development environment and production that no one ever noticed before. A famous person tweeted about your app and your customer base tripled overnight.

Adding carefully thought-out custom monitoring can help safeguard against the things you can’t predict.

The case is clear: application monitoring is not a nice-to-have. It’s time for software developers to start making application monitoring part of their standard practice.