We recently launched our CloudWatch integration, which brings your CloudWatch data into the same monitoring environment as your application metrics, service metrics, and uptime monitoring. Most of our infrastructure is in AWS, so this is a great feature for us; hopefully you’ll like it too!
While developing the CloudWatch integration, we found that AWS Lambda was a great fit for a specific engineering challenge.¹ To quote an earlier post on the Expected Behavior blog about adding Lambda to an existing application:
What we really needed was a system to work many thousands of jobs concurrently, but one that only costs us money when we’re actually using it. Essentially, the workload is “here are 10,000 I/O intensive things that all need to happen at the same time, then do nothing for awhile”.
Our use of Lambda is very targeted and minimal: we have one function that is directly invoked via the AWS SDK from our Rails application.
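Our invocation actually happens from Rails via the Ruby AWS SDK; since the Lambda function itself runs on Node, here's a sketch of the same direct-invoke call shape using the AWS SDK for JavaScript. The function name and payload fields are hypothetical, not our actual values:

```javascript
// Build the parameters for a direct, asynchronous ("Event") invocation.
// The function name and payload shape here are hypothetical.
function buildInvokeParams(customerId) {
  return {
    FunctionName: 'cloudwatch-collector', // hypothetical function name
    InvocationType: 'Event',              // fire-and-forget; 'RequestResponse' would wait
    Payload: JSON.stringify({ customerId: customerId }),
  };
}

// The actual call, using the v2 AWS SDK for JavaScript:
function invokeCollector(customerId) {
  const AWS = require('aws-sdk');
  const lambda = new AWS.Lambda();
  return lambda.invoke(buildInvokeParams(customerId)).promise();
}
```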
That small footprint doesn’t necessarily mean a smaller monitoring need, though. CloudWatch provides a great baseline level of metrics, but like all of our features, we really need custom application metrics to fully monitor the functionality and performance of the integration.
Instrumental’s CloudWatch integration captures every Lambda metric provided by CloudWatch. You can see the full list in our docs. However, the ones that matter most to us for this feature are Errors, Invocations, and Duration.
We’ve made the following graphs with those metrics:
This graph recreates the availability graph found in the monitoring tab of the Lambda interface. It gives us both the raw count of errors and the percentage of requests represented by those errors. This helps us get a handle on the overall level of availability of the integration.
This is simply a raw count of function invocations for this feature. It gives us a great view of current health (it should be above 0!), flags a potential runaway state, and shows the long-term growth of the feature. We also used this metric to make an alert so that we’ll know if things aren’t going well.
This graph includes the average duration of the Lambda function plus a constant line at our configured timeout for the function. This helps us know how close we are, on average, to hitting our timeout.
If you’re looking to do some Lambda monitoring in your own project, we’ve made a premade Instrumental dashboard for Lambda that provides a great starting point.
The basic CloudWatch metrics give us a great bird’s-eye view of how things are working, but there’s quite a bit more that we want to know.
The Lambda function in question is responsible for fetching metric data from the CloudWatch API and then sending it to the Instrumental data collector. Things we want to know that we can’t get from CloudWatch:
- Whether the API calls to fetch metrics failed or succeeded
- How long it takes to fetch metric data, for each customer and in aggregate
- Whether metrics were successfully submitted to the Instrumental collector
- How many metrics were submitted, both for each customer and in aggregate
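Collecting that data can be sketched as a small timing wrapper inside the Lambda function. Here, `record` is a hypothetical stand-in for whatever call actually ships the metric (the Instrumental agent in our case), and the metric names are illustrative:

```javascript
// Wrap an async operation, recording its duration plus success/error counts.
// `record(name, value)` is a hypothetical sink standing in for the real agent.
async function timed(record, name, fn) {
  const start = Date.now();
  try {
    const result = await fn();
    record(name + '.success', 1);
    return result;
  } catch (err) {
    record(name + '.error', 1);
    throw err;
  } finally {
    record(name + '.duration', Date.now() - start);
  }
}

// Usage sketch: time a CloudWatch fetch, in aggregate or per customer, e.g.
// timed(record, 'cloudwatch.fetch', () => cloudwatch.getMetricData(params).promise());
```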
Gathering this data allows us to make several important graphs. Here are two interesting ones:
This shows how much time we’re spending waiting on CloudWatch to respond. If we’re waiting too long on CloudWatch, we may hit our timeout and not complete the work. We can use that info to tune various things, like reducing the number of metrics fetched per invocation.
CloudWatch provides most metrics at 1-minute resolution (period of 60, in CloudWatch lingo), but some at 5-minute resolution. This graph shows the total number of metrics we’re collecting plus the split between the two resolutions and the overall invocation count (a native CloudWatch metric). This gives us a number of things: an idea of growth over time, an idea of what healthy functioning of the system looks like (great for detecting anomalies), and an understanding of the ratio of metrics collected across both resolutions.
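Producing the per-resolution split is a simple grouping step. A minimal sketch, assuming each collected metric descriptor carries a `period` field of 60 or 300 seconds (the field name is an assumption, not our actual schema):

```javascript
// Count collected metrics by CloudWatch resolution.
// Assumes each descriptor has a `period` field: 60 (1-minute) or 300 (5-minute).
function countByResolution(metrics) {
  const counts = { oneMinute: 0, fiveMinute: 0 };
  for (const m of metrics) {
    if (m.period === 60) counts.oneMinute += 1;
    else if (m.period === 300) counts.fiveMinute += 1;
  }
  return counts;
}
```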
Beyond those graphs, we also gain the ability to see how many metrics we’re collecting per customer and how long those API calls took. That’s really useful for troubleshooting, customer support, and account management. We can’t show you those graphs, though. :)
Though not exactly “with Instrumental,” we do sometimes use CloudWatch logs to help us investigate potential issues. Since console.log calls end up in the CloudWatch log stream, it’s easy to log useful debugging information for later review. We specifically log that each step in the process has started and whether the step succeeded or failed, while also including IDs for the account in question. One thing to note: we’ve found searching CloudWatch logs to be painfully slow and error-prone. For that reason, it’s never our first choice for investigating incidents.
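That step logging can be sketched as a small helper. The step names and field layout below are illustrative, not our exact format:

```javascript
// Emit one structured line per pipeline step; in Lambda, console.log
// output lands in the function's CloudWatch log stream.
function formatStep(step, status, accountId) {
  return JSON.stringify({ step: step, status: status, account_id: accountId });
}

function logStep(step, status, accountId) {
  console.log(formatStep(step, status, accountId));
}

// Usage sketch:
// logStep('fetch_metrics', 'started', 1234);
// logStep('fetch_metrics', 'succeeded', 1234);
```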
Monitoring In Action
During the private beta of the integration, the CloudWatch-based metrics showed an increase in function timeouts. Our initial thought was that we were waiting on API responses for longer than we’d anticipated. The fix seemed simple: make fewer API calls per function invocation. But fewer calls per invocation means a less performant integration, which is not great. We needed more information before committing to that tradeoff!
We dug into the per-customer custom metrics and the aggregate timing information for API calls, and it turned out our assumption was wrong: most of the time wasn’t spent waiting on API responses, it was spent sending the data to Instrumental. More custom metrics indicated that the time went to waiting for the initial TCP connection to open. We solved the problem by forcing the agent to connect right at the beginning of the invocation, so the time to open the connection overlapped with the API requests.
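The fix can be sketched as: start the agent's connection first, do the API work, and only await the connection when it's time to send. The `agent` and `fetchMetrics` objects below are hypothetical stand-ins, not our actual code:

```javascript
// Open the agent's TCP connection in parallel with the CloudWatch fetch,
// instead of paying for the connect after the fetch completes.
async function handler(agent, fetchMetrics) {
  const connecting = agent.connect(); // kick off the handshake; don't await yet
  const data = await fetchMetrics();  // API round-trips overlap the TCP connect
  await connecting;                   // by now the connection is usually ready
  return agent.send(data);
}
```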
This is an excellent example of the value gained from combining CloudWatch-based metrics and custom application metrics in the same platform. With all of the data easily available, we were able to quickly detect and fix our performance problem.
Hopefully, this post provides a solid starting point for getting complete visibility into your Lambda function’s performance!
¹ This also meant we could monitor our CloudWatch integration using our CloudWatch integration. So meta! There are certainly downsides to this. Namely, if the integration is having problems, so too may our monitoring. In this case, our application metrics follow a different path into our system, meaning we’re not entirely reliant on the CloudWatch metrics for monitoring. We also have other tools to investigate a total loss of functionality (e.g., logs).