Many developers analyze background job queues too simply: queues are either too full (bad) or mostly empty (good). This is a dangerous perspective, especially at scale. To get the full picture of your queue’s health, you need to measure both queue depth and job age.
Queue depth is the common metric for measuring a queue’s health. Does your queue have 100 or 100,000 pending jobs? You can probably handle 100 jobs without a problem, but at some point, too many jobs can bring any system down. But relying solely on depth doesn’t give us a clear picture of a queue’s health.
In terms of your users’ experience, job age is just as important as depth. Modern servers can hold millions of pending jobs without crashing, and for some workloads that’s fine. If an overfull queue won’t bring down your application, what value does monitoring depth provide, other than as a crude and misleading estimate of queue age?
Consider this scenario:
- Queue A has 100,000 jobs in line, and the next job is 30 seconds old
- Queue B has 1,000 waiting jobs, and the next job is 36 hours old
You might have a problem with Queue A, but you definitely have a problem with Queue B. Maybe Queue B’s jobs are silently failing and the queue is never getting processed, or maybe your workers have been programmed to only select from Queue A until it’s empty.
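The scenario above can be made concrete with a toy sketch: two in-memory queues where the deeper one is healthy and the shallow one is stuck. The `Job` struct and the ages are invented for illustration; the point is that depth and oldest-job age are independent signals.

```ruby
# Toy illustration: depth alone misleads. Queue A is deep but fresh;
# Queue B is shallow but stale. (Job is a stand-in for a real job record.)
Job = Struct.new(:enqueued_at)
now = Time.now

queue_a = Array.new(100_000) { Job.new(now - 30) }        # deep, but only 30s old
queue_b = Array.new(1_000)   { Job.new(now - 36 * 3600) } # shallow, but 36h old

def health(queue, now)
  { depth: queue.size, oldest_age_s: now - queue.first.enqueued_at }
end

puts health(queue_a, now)  # deep but fresh  -> probably fine
puts health(queue_b, now)  # shallow but stale -> definitely broken
```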
Monitoring queue age provides a clearer indicator of application issues than relying upon queue depth. With Instrumental, a simple line of code in your workers to report queue age allows you to monitor your largest concern: are your jobs actually being executed in a timely fashion?
Here’s a basic example in Ruby that measures both job age and execution time:
```ruby
job = Queue1.pop
I = Instrumental::Agent.new(project_token)

# how long are jobs in the queue?
# (subtracting two Times already yields seconds as a Float)
I.gauge("queues.job_name.time_in_queue", Time.now - job.enqueued_at)

# how long does it take to execute the jobs?
I.time("queues.job_name.time_to_execute") do
  job.execute
end
```
This code relies on a certain minimum rate of job completion, and it breaks down if a queue gets stuck: a stuck queue pops no jobs, so it reports no ages. Because of stuck jobs and other potential issues, you should also consider out-of-band monitoring of job age, and use the above code just for measuring execution time. You can use something like instrumental_tools plugin scripts to regularly peek at the first job in the queue and report its age — the script would look very similar to the code above.
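Such an out-of-band check might look like the following sketch. `PendingJob` and `enqueued_at` are illustrative stand-ins for whatever your queue library exposes; the key detail is that the script peeks at the head of the queue without popping it, so it keeps reporting even when workers are stuck.

```ruby
require "time"

# Stand-in for a real pending-job record (assumed shape, not a real API).
PendingJob = Struct.new(:name, :enqueued_at)

def oldest_job_age(queue, now: Time.now)
  head = queue.first       # peek, don't pop -- the workers still own the job
  return 0.0 if head.nil?  # an empty queue has no backlog
  now - head.enqueued_at   # Time subtraction yields seconds as a Float
end

# Example: the head job was enqueued 90 seconds ago.
now   = Time.now
queue = [PendingJob.new("resize_image", now - 90),
         PendingJob.new("send_email",   now - 10)]
age = oldest_job_age(queue, now: now)
# I.gauge("queues.oldest_job_age", age)  # then report via Instrumental
puts age  # => 90.0
```

Run from cron or a plugin runner, this keeps age data flowing even when no jobs are completing.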
Lastly, you must also monitor the hardware used by your queue management system. The basics are good enough to start with: memory, CPU, and disk usage. If your queue server goes down, any other queue metrics are fairly unimportant until it’s back up. Ideally, your server monitoring can be easily referenced against your queue metrics.
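As one hypothetical example of folding a host metric into the same dashboard, a plugin script could parse available memory and report it as a gauge. The metric name and the `/proc/meminfo` source are assumptions (Linux-specific), not part of Instrumental itself.

```ruby
# Parse MemAvailable (in kB) out of /proc/meminfo-style text.
def mem_available_kb(meminfo)
  meminfo[/^MemAvailable:\s+(\d+) kB/, 1]&.to_i
end

sample = "MemTotal:       16000000 kB\nMemAvailable:    8000000 kB\n"
puts mem_available_kb(sample)  # => 8000000

# In a plugin script on the queue server:
# I.gauge("queue_server.mem_available_kb", mem_available_kb(File.read("/proc/meminfo")))
```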
We’ve learned the age-vs-depth lesson the hard way, and we think it highlights the importance of the kind of custom application monitoring Instrumental provides. Let us know if you have any questions at @Instrumental or firstname.lastname@example.org.