Latency Metrics

Durations critical to the application's success.

DB Operation Time

The amount of time that has passed between when the application code issued a query to the database and when it received a result.
Database drivers typically provide a callback mechanism that lets you "hook" into query execution. Otherwise, measure wherever your code interacts directly with the database.
Database health, schema and query efficiency, system health
MySQL Insert Query Time
Postgres Select Query Time
Mongo Upsert Document Time
mysql.query.insert
postgres.query.select
mongo.query.upsert
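
When a driver offers no query hook, the timing can be taken around the call site instead. The following is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in database, and the `timed` helper, the `recorded` list, and the `sqlite.query.*` metric names are hypothetical placeholders for whatever metrics client and naming you actually use.

```python
import sqlite3
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real system would forward these
# (name, seconds) pairs to its metrics client instead.
recorded = []

@contextmanager
def timed(metric_name):
    """Record the wall-clock duration of the wrapped block, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the query raises, so failures are still timed.
        recorded.append((metric_name, time.perf_counter() - start))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

with timed("sqlite.query.insert"):
    conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    conn.commit()

with timed("sqlite.query.select"):
    rows = conn.execute("SELECT name FROM users").fetchall()
```

Timing in a `finally` clause means slow queries that end in an error are measured too, which is often when the number matters most.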

External Service Response Time

The amount of time required to interact with an external service (Facebook, Twitter, Twilio, Mandrill, etc.)
At the points of interaction with those services; "find your friends", "send an email", "page this number", etc.
External system reliability, third party induced bottlenecks
Facebook API Friend Fetch Time
Twilio Phone Call Time
Twitter Make Post Time
api.facebook.friends.get
api.twilio.call
api.twitter.post
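
Because interactions with an external service tend to be concentrated in a few functions, a decorator is one convenient way to time them. This is a sketch only: `timed_call`, the `timings` list, and `fetch_friends` are hypothetical names, and the function body sleeps briefly in place of a real HTTP request.

```python
import functools
import time

# Hypothetical sink for (metric name, seconds) pairs.
timings = []

def timed_call(metric_name):
    """Decorator that records how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                timings.append((metric_name, time.perf_counter() - start))
        return wrapper
    return decorator

@timed_call("api.facebook.friends.get")
def fetch_friends(user_id):
    # Stand-in for a real network call to the third-party service.
    time.sleep(0.01)
    return ["friend-1", "friend-2"]

friends = fetch_friends(42)
```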

Full Request Fulfillment Time

How long it takes your service to fulfill one user's request. This measurement represents the total user-perceived time for a piece of data to be sent to your system and acknowledged.
At the edge of user interaction with your service. For a website, this would generally be in the browser; for an API service where you're not able to capture at the call site, you could capture this at the forward edge of your service (load balancer, Apache / Nginx).
System health, quality of service fulfillment
Overall System Latency
service.latency

Queue Processing Time

The amount of time that a particular item (job, data to be written, etc.) has spent in a system queue.
The queue itself, if it supports such instrumentation; alternatively, measure it at the output / queue processor point.
Worker efficiency, queue capacities
Time to Process an Email Job from the queue
Time to Convert an MPEG from the incoming movie queue
Time load balancer waits to connect to application server
queue.jobs.email.process
queue.incoming.mpegs.convert
queue.web.connections
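
If the queue itself exposes no timing, measuring at the queue processor can be done by stamping each item on the way in. The sketch below assumes a single-process `queue.Queue`; `enqueue` and `dequeue` are hypothetical wrappers. For a queue shared across machines, a wall-clock timestamp would be needed instead, since monotonic clocks are not comparable between hosts.

```python
import queue
import time

jobs = queue.Queue()

def enqueue(payload):
    # Stamp each item with the time it entered the queue.
    jobs.put((time.monotonic(), payload))

def dequeue():
    """Return the next payload and how long it waited in the queue, in seconds."""
    enqueued_at, payload = jobs.get()
    waited = time.monotonic() - enqueued_at
    return payload, waited

enqueue("welcome-email")
time.sleep(0.01)  # simulate the worker being busy elsewhere
payload, waited = dequeue()
```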

Time to Receive Message

The amount of time that passed between the system issuing a message (email, notification, etc.) and when the recipient acknowledged its receipt.
If you control both the sender and receiver (e.g., iOS apps), you can measure this at the receiver. Alternatively, you can send test messages through this pipeline to a receiver you control to estimate overall time to receive a message.
Minimum expected user response time to communication, message subsystem efficiency, third party reliability
Time until a sent email was received
Time to receive an iOS Notification
Time to receive an Android Notification
notifications.email.receive_time
notifications.ios.receive_time
notifications.android.receive_time

Time to Write to Disk

The amount of time required to write some payload to disk (including fsync / confirmed write time).
After the fsync / flush / disk confirmation call
Contention for the disk resource, disk performance over time
Time to write an SQLite cache db
Time to write a dictionary file
cache.sqlite.write_time
dictionary.write_time
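
To include the confirmed write in the measurement, the timer should stop only after the fsync call returns. A minimal sketch, assuming a local file and a hypothetical helper named `timed_durable_write`:

```python
import os
import tempfile
import time

def timed_durable_write(path, payload):
    """Write payload to path and return the seconds taken, including fsync."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to commit the data to the device
        elapsed = time.perf_counter() - start  # stop after the fsync returns
    return elapsed

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dictionary.bin")
    elapsed = timed_durable_write(path, b"x" * 4096)
    size = os.path.getsize(path)
```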

Lock Acquisition Time

How long a given process will wait until it is able to acquire a lock on a shared resource
Around the locking call (whether that be a local lock, e.g. @synchronized, or a distributed lock, e.g. Zookeeper)
Lock efficiency, potential gains of work splitting, effects of contention
Time to acquire a lock on a user's Mailbox
lock.user_mailbox.acquire_time
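
Measuring around the locking call can look like the following sketch, which uses a local `threading.Lock` in place of a distributed lock. The names `mailbox_lock`, `acquire_timed`, and `holder` are hypothetical; a second thread holds the lock briefly so that the measured wait is non-trivial.

```python
import threading
import time

mailbox_lock = threading.Lock()
holder_has_lock = threading.Event()

def acquire_timed(lock):
    """Block until the lock is acquired and return the seconds spent waiting."""
    start = time.perf_counter()
    lock.acquire()
    return time.perf_counter() - start

def holder():
    # Simulates another worker holding the mailbox lock for a while.
    with mailbox_lock:
        holder_has_lock.set()
        time.sleep(0.05)

t = threading.Thread(target=holder)
t.start()
holder_has_lock.wait()           # make sure the lock is contended
wait = acquire_timed(mailbox_lock)
mailbox_lock.release()
t.join()
```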

Deploy Time

The amount of time between issuing a new deploy and when the deploy is successfully running in production
Deploy scripts (Capistrano, Chef) or CI server
Minimum crisis response time
Time to deploy
deploy.time

Library Interaction Time

The amount of time required to interact with an external library
Calls to external libraries (GZip, FFMPEG, PhantomJS)
Library efficiency, cost of external framework integration
Time to compress a payload as GZip
Time to stitch together multiple images using ImageMagick
gzip_compress.time
imagemagick.stitch.time

Time To Load

The amount of time to go from a cold state (unloaded) to a fully responsive state
Web pages can measure this via DOM events (window onload, DOMContentLoaded); native applications can use view lifecycle methods to capture these points.
Overhead of included resources and initialization
Time until the window onload event occurred
Time until the DOMContentLoaded event occurred
Time until an iOS view's viewDidLoad message was received
page.load_time
page.ready_time
main_view.view_did_load.time

Object Age

The age of a given piece of data in the system. Usually such objects are ephemeral, such as a session, message or job.
Typically requires calling into your queue system (for queue messages) or into the relevant cache or DB
Efficiency of objects being processed or cleaned up
Age of the "next" job in a queue
Age of the oldest active session
Average age of current open sessions
queue.jobs.email.age
sessions.oldest.age
sessions.average.age
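
Given the creation timestamps of the objects of interest, the oldest and average ages reduce to simple arithmetic. This sketch assumes the timestamps have already been fetched from the session store; `session_ages` and its inputs are hypothetical.

```python
def session_ages(created_at_times, now):
    """Return oldest and average session age, in the same unit as the inputs."""
    if not created_at_times:
        return {"sessions.oldest.age": 0.0, "sessions.average.age": 0.0}
    ages = [now - created for created in created_at_times]
    return {
        "sessions.oldest.age": max(ages),
        "sessions.average.age": sum(ages) / len(ages),
    }

# Three sessions created 600, 300 and 60 seconds before "now".
metrics = session_ages([400.0, 700.0, 940.0], now=1000.0)
```

With these inputs the oldest age is 600 seconds and the average is 320 seconds.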

Capacity Metrics

Exhaustible resources whose depletion will likely have a significant effect on the application.

Disk Space

The amount of space currently being used by a disk
The machine being measured, or a pre-existing external observer
Current system capacity, time until upgrade or replacement
Total megabytes used on device xvda1
mail_server.disk.xvda1.mb_used
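
When measuring on the machine itself, the standard library is often enough. A small sketch using `shutil.disk_usage`; the helper name `disk_mb_used` and the choice of the current directory as the path are assumptions for illustration.

```python
import shutil

def disk_mb_used(path):
    """Return whole megabytes used on the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used // (1024 * 1024)

mb_used = disk_mb_used(".")
```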

Memory

The amount of memory currently in use by a system or process
The machine being measured, or a pre-existing external observer
System capacity, application efficiency and health
Total amount of memory in use on app server 001
app_server-001.memory.in_use

CPU

The amount of system and user CPU currently in use by a system or process
The machine being measured, or a pre-existing external observer
System load, application efficiency
CPU in use on the job_worker server
job_worker.cpu.in_use

Load

The aggregate load measurement of a system
The machine being measured, or a pre-existing external observer
System load, application health
Total load on server db-master
db-master.load

Cache in Use

The total amount of cache space currently in use by application caches
The server or system responsible for managing the cache
Cache efficiency, system capacity
Total amount of cache memory used on memcache-001
memcache-001.cache_size.used

Connections

The total number of connections to the application or application subsystem (database, cache, etc.)
The server or system being connected to
Concurrent users, contention for resources
Number of outstanding SSL connections on a load balancer
Number of connections to a game server's lobby
loadbalancer.ssl.connections
game-server.lobby.connections

Items in Queue

The number of outstanding items in an application queue that have yet to be consumed
The server or system responsible for managing the queue
Queue processor efficiency, time for the system to process large workloads, maintenance window times
Size of incoming log queue
Size of export jobs queue
queue.incoming-logs.size
queue.jobs.big-exports.size

Disk and Network IO

The number of bytes read and written on a given disk volume or network interface
The server or system performing the IO
Max throughput bottlenecks, compression efficiency, unexpected service interactions
Number of bytes written to disk on volume ebs01
Number of bytes read on network interface eth1
disk.ebs01.usage.written
netif.eth1.usage.read

Licenses

The number of software licenses currently in use by users of the system
In the application, at moments where a given interaction would consume a license
License overages, unused license capacity
Number of licenses currently in use
licenses.used

Active Processes

The number of active processes currently performing a task
At the moment when a given process becomes active or inactive
Contention causes, load spikes
Total number of processes active on worker-001
worker-001.processes.active

Process Active / Idle Time

The time your application processes spend active (processing a task) or idle (waiting for one)
In your application or worker, as part of request or job lifecycle
System capacity and health, considering your system as a single, very parallel computer
Amount of time worker was idle since last job finished (at job start)
Amount of time worker was active since job started (at job end)
job_worker.idle_time
job_worker.active_time
or, on a per-machine level:
worker-001.job_worker.idle_time
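
Idle time is reported at job start and active time at job end, so a worker only needs to remember two timestamps. In this sketch `WorkerTimer` is a hypothetical helper, and a fake clock is injected so the arithmetic is easy to follow; a real worker would use the default `time.monotonic`.

```python
import time

class WorkerTimer:
    """Tracks idle time (reported at job start) and active time (at job end)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_finished = clock()
        self._started = None

    def job_started(self):
        now = self._clock()
        self._started = now
        return now - self._last_finished   # job_worker.idle_time

    def job_finished(self):
        now = self._clock()
        self._last_finished = now
        return now - self._started         # job_worker.active_time

# Fake clock: timer created at t=0, job starts at t=2, job ends at t=5.
ticks = iter([0.0, 2.0, 5.0])
timer = WorkerTimer(clock=lambda: next(ticks))
idle = timer.job_started()
active = timer.job_finished()
```

Here the worker was idle for 2 seconds and active for 3 seconds.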

Interaction Metrics

Momentary events that were caused by a user interacting with your system.

Login, Logout

The moment when a user has logged into or out of the system
The application code that authorizes a login or logout
Interaction spikes, effective site availability
A login occurred
A logout occurred
app.logins
app.logouts

Feature Usage

The use of a particular application feature
At the moment that feature is invoked
Adoption rates, deprecation targets
A report was exported as a PDF
A friend was suggested to another user
app.reports.export.pdf
app.friends.suggest

Client State

The characteristics of the client interacting with your application (browser version, origin network, supported features)
When the client interacts with your system
User capabilities
A user's browser indicates it supports WebWorkers
A user is browsing with IE8
client.capabilities.supports_webworkers
client.browser.ie8

API Call

A call is made to a given API or third party service
When you interact with that service
Third party reliance, usage spikes
A user sent a fax using Phaxio
A user caused an S3 List Objects API call to be made
api.phaxio.fax
api.aws.s3.list_objects

Errors

An unexpected error occurred in the application
When the error occurs
Application reliability
An ObjectBeingEdited exception occurred
A NullPointerException occurred
errors.object_being_edited_exception
errors.null_pointer_exception
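
Metric names like these can be derived from the exception class rather than written by hand. The following sketch assumes a hypothetical `error_metric_name` helper and an `ObjectBeingEditedException` class defined only for illustration. The simple regex splits before every capital letter, so acronym-heavy names (for example `IOError`) would come out as `i_o_error` and may need special handling.

```python
import re

class ObjectBeingEditedException(Exception):
    """Hypothetical application error used for illustration."""

def error_metric_name(exc):
    """Map an exception instance to a name like 'errors.key_error'."""
    class_name = type(exc).__name__
    # Insert an underscore before each capital letter except the first.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return "errors." + snake

try:
    raise ObjectBeingEditedException("document 7 is locked")
except Exception as exc:
    metric = error_metric_name(exc)
```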

Duration of Use

The amount of time a user has occupied a given resource (viewed a page, played a game, edited an email)
Around the interaction
Quality of service, feature usage
The user reviewed changes for a certain number of seconds
A user edited an email for a certain number of seconds
document.review_changes.view_duration
email.edit.duration

Revenue Change

The moment when a charge or refund has been successfully made on behalf of the application (invoice created, refund made)
When the application observes the event, either via your payment processor or through an application event
Revenue growth, payment system reliability
Stripe reported an invoice was paid
An in-app purchase was made for a coin pack
A refund was issued for a shipment
stripe.invoice_paid
ios.iap.coin_pack.purchased
shipments.refund

System Inspection

Specific information about your system that is interesting, but which may not be operationally significant on a minute-to-minute basis.

Path Versioning

A version metric to indicate which code path a given user or event is using, to aid in transitioning from older to newer versions of an application piece.
Where the code differs from version to version
New feature uptake, migration progress
A new version of a user record involving a live migration
A new version of an API is launched but requires customer migration
app.user_record_saved.v1 / app.user_record_saved.v2
api.user.get.v1 / api.user.get.v2
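
Counting events per code path makes migration progress a simple ratio. In this sketch the `schema_version` field, the `save_user_record` function, and the in-memory `Counter` are assumptions standing in for a real record format and metrics client.

```python
from collections import Counter

# Hypothetical in-memory counters keyed by metric name.
counts = Counter()

def save_user_record(record):
    """Count which code path handled the save and return its version label."""
    # Records without a schema_version are treated as the old (v1) format.
    version = "v2" if record.get("schema_version", 1) >= 2 else "v1"
    counts[f"app.user_record_saved.{version}"] += 1
    return version

def migration_progress():
    """Fraction of saves that went through the v2 path."""
    v1 = counts["app.user_record_saved.v1"]
    v2 = counts["app.user_record_saved.v2"]
    total = v1 + v2
    return v2 / total if total else 0.0

for record in [{"schema_version": 1}, {"schema_version": 2},
               {"schema_version": 2}, {}]:
    save_user_record(record)

progress = migration_progress()
```

Two of the four saves used the v2 path, so the reported progress is 0.5.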

Object Count

A metric periodically sampled to track the scale of one or another part of the application and underlying data.
In a monitoring process
Application growth, feature usage
A count of the total documents in the system
A count of the total number of active users
A count of the total number of user-created widgets
app.documents.count
app.users.active.count
app.widgets.count