Your application is going to break. Maybe not today, maybe not tomorrow, but it will happen. Data centers fail, worker boxes get saturated, and disks run out of storage. It happens to the best applications.
You’ve probably got a good handle on catastrophic server failure. Smart developers keep a watchful eye on their infrastructure, and we know you’re smart. Monitoring server performance is basically a no-brainer for seasoned web developers.
What about when your app isn’t strictly broken, but isn’t performing in a way your users expect? Your servers appear to be fine – no alerts have been triggered, and your house isn’t burning down as far as you can tell. But your users aren’t getting the response they’ve come to expect from your app.
If a user tries to use your application and does not get the expected response, your service might as well be down, as far as that user is concerned. Are you watching these errors as closely as you’re watching your server performance?
We designed Instrumental for application monitoring at every level, and this includes tracking errors exposed to our users. Sure, some of these are unavoidable – minor hiccups in response times, or a user entering bad data, or any of a thousand tiny problems that arise in application development.
We’re not super interested in these incidental errors – we’re more concerned about trends and recurring problems. We’re especially concerned with errors exposed to our users, so we make graphs to track multiple kinds of user-facing errors. A sample graph looks like this:
We’ll give this graph a clever name (something catchy like “This graph should be at 0”) and then add metrics for failures exposed to our users. This can be any number of things: malformed data, bad requests, any time a user hits a 404 or 500, slow web requests, and so on. We’re tracking any instance of the application performing in a way the user does not expect.
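To make that concrete, here’s a minimal sketch of what recording those failures might look like. The metric names, the slow-request threshold, and the in-memory counter standing in for a metrics agent are all our own assumptions for illustration; a real agent would ship each increment to the monitoring service instead of counting locally.

```python
from collections import Counter

# Hypothetical stand-in for a metrics agent: each increment would
# normally be sent to the monitoring service, not stored locally.
error_counts = Counter()

def track_error(metric_name):
    """Increment a counter for one kind of user-facing failure."""
    error_counts[metric_name] += 1

def record_response(status_code, response_time_ms, slow_threshold_ms=1000):
    """Record every way a response can disappoint a user."""
    if status_code == 404:
        track_error("errors.not_found")
    elif status_code >= 500:
        track_error("errors.server_error")
    if response_time_ms > slow_threshold_ms:
        track_error("errors.slow_request")

# Simulated traffic: one 404, one 500, one slow-but-successful request.
record_response(404, 120)
record_response(500, 80)
record_response(200, 2500)

print(error_counts["errors.not_found"])     # 1
print(error_counts["errors.server_error"])  # 1
print(error_counts["errors.slow_request"])  # 1
```

Each of those metric names can then be added as a separate series on the same graph, so every flavor of user-facing failure shows up in one place.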
This graph’s not staying steady at 0, but the errors we’re tracking are all resolving pretty quickly. We’d love for this graph to always be at 0, and for our users to never experience any problems, ever. But that’s not super realistic. Intermittent errors like these aren’t as troubling as prolonged, increasing failure. For example, if this clever graph looked more like this, we’d have a much different problem on our hands:
That blue line is moving up and to the right, which tells us that something bad is happening, and it’s getting worse. Graphing errors is useful not just for noticing when things go sideways (often before your users are aware) but for quickly spotting trends. Remember – this graph is tracking user-facing errors, so our users are aware of this problem. Not the best situation, but at least we’ve got an idea of where to look!
Better still, we’ve got a good idea of when we started seeing this error, how long it’s been going on, and how bad the problem is.
The actual errors and methods of tracking them will vary between applications, but there are two things we want to do: keep an eye on user-facing errors, and make sure those metrics are measured in a way that makes sense on the same graph. For example, if a low trend is good for one metric, we don’t want to add it to a graph with metrics where a high trend is good. Nobody wants to waste time interpreting data when there are problems to solve!
If we see one or two 404s being returned, we’ll investigate the issue, but we aren’t super worried. If that number spikes to 100 and establishes a trend line that doesn’t resolve itself, we’re much more worried.
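That distinction between a blip and a sustained trend can be sketched as a simple check. The window size and error threshold below are arbitrary assumptions, not a prescription; the point is that a brief spike that resolves itself is treated differently from error counts that stay elevated and keep climbing.

```python
def is_sustained_trend(counts, window=3, threshold=10):
    """Flag a problem when error counts keep rising across the last
    `window` intervals and end above `threshold`, instead of resolving.
    `counts` is a list of per-interval error totals, oldest first."""
    if len(counts) < window:
        return False
    recent = counts[-window:]
    rising = all(b >= a for a, b in zip(recent, recent[1:]))
    return rising and recent[-1] >= threshold

# A brief blip that resolves itself: not alarming.
print(is_sustained_trend([0, 2, 0, 1, 0]))       # False
# Errors climbing steadily past the threshold: time to worry.
print(is_sustained_trend([1, 5, 20, 60, 100]))   # True
```

In practice an alerting system would run a check like this against the graphed metric, but the same intuition applies when eyeballing the chart: up and to the right, with no recovery, is the shape to worry about.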
Application monitoring at the user level isn’t that different from server monitoring, really – we just keep an eye on problems that are exposed to our users, rather than less visible infrastructure problems. We never, ever want our first notification of a user-facing problem to be an angry email from a paying customer, asking us why the service is down.
The service may not actually be down, as far as we can tell. All of our server monitoring graphs could be performing just as expected. But if our service isn’t working for our users, the end result is the same – a paid service that isn’t delivering value. Talk about a nightmare scenario!
Luckily, Instrumental’s query language makes it easy to track everything that happens in our application, and we can easily create graphs with the exact metrics we want to measure. Having similar types of errors in a single graph makes monitoring our applications easy, and tracking user-facing problems leads to happier customers.