Metrics For Your Web Application's Dashboards

Mar 2022

Whenever I create a dashboard for an application, I generally look to the same handful of metrics. They’re the ones I always use to orient myself quickly when PagerDuty fires. They give me the grand overview, and then I’ll know what logging queries to start writing, code to look at, box to SSH into, or mitigation to activate. The same metrics tell me during the day whether the system is ok, and I use them to do napkin math on e.g. capacity planning and imminent bottlenecks:

Web Backend (e.g. Django, Node, Rails, Go, ..)
- Response Time p50, p90, p99, sum, avg †
- Throughput by HTTP status †
- Worker Utilization ¹
- Request Queuing Time ²
- Service calls †
  - Database(s), caches, internal services, third-party APIs, ..
  - Enqueued jobs are important!
  - Circuit Breaker tripping † /min
  - Errors, throughput, latency p50, p90, p99
- Throttling †
- Cache hits and misses % †
- CPU and Memory Utilization
- Exception counts † /min

Job Backend (e.g. Sidekiq, Celery, Bull, ..)
- Job Execution Time p50, p90, p99, sum, avg †
- Throughput by Job Status {error, success, retry} †
- Worker Utilization ³
- Time in Queue † ⁴
- Queue Sizes † ⁵
  - Don’t forget scheduled jobs and retries!
- Service calls p50, p90, p99, count, by type †
- Throttling †
- CPU and Memory Utilization
- Exception counts † /min

† Metrics where you need the ability to slice by endpoint or job, tenant_id, app_id, worker_id, zone, hostname, and queue (for jobs). This is essential for figuring out whether a single endpoint, tenant, or app is causing problems.
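To illustrate why slicing matters, here is a minimal sketch that groups a few made-up request records by one dimension at a time and reports each value's share of the errors. The records and the `share_by` helper are hypothetical; in practice your APM or logging system does this grouping for you.

```python
from collections import Counter

# Hypothetical request records; in practice these come from your metrics or logs.
requests = [
    {"endpoint": "/search", "tenant_id": "t1", "status": 500},
    {"endpoint": "/search", "tenant_id": "t1", "status": 500},
    {"endpoint": "/search", "tenant_id": "t2", "status": 200},
    {"endpoint": "/cart", "tenant_id": "t1", "status": 500},
    {"endpoint": "/cart", "tenant_id": "t3", "status": 200},
]


def share_by(records, dimension):
    """Return {value: fraction of records} for one dimension, largest first."""
    counts = Counter(r[dimension] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}


# Slice only the failing requests, first by tenant, then by endpoint.
errors = [r for r in requests if r["status"] >= 500]
errors_by_tenant = share_by(errors, "tenant_id")
errors_by_endpoint = share_by(errors, "endpoint")

print(errors_by_tenant)
print(errors_by_endpoint)
```

In this toy data the errors are spread over two endpoints but come from a single tenant, which is the kind of thing an unsliced error rate would hide.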

You can likely cobble a workable chunk of this together from your existing service provider and APM. The value of this list is knowing which metrics to pay attention to, and which key ones you’re missing. The holy grail is one dashboard for web, and one for jobs. The more incidents you have, the more problematic it becomes that you need to visit a dozen URLs to get the metrics you need.

If you have little of this and need somewhere to start, start with logs. They’re the lowest common denominator, and if you’re productive in a good logging system, that will get you very far. You can build all these dashboards with logs alone. Jumping into the detailed logs is usually the next step you take during an incident if it’s not immediately clear what to do from the metrics.
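As a rough sketch of building a metric from logs alone, the snippet below derives response time p50, p90 and p99 per endpoint from JSON log lines. The log format, field names (`endpoint`, `duration_ms`) and sample values are assumptions for illustration, and the nearest-rank percentile is just one simple definition; a real logging system would compute this with its own query language.

```python
import json
import math
from collections import defaultdict

# Hypothetical log lines, one JSON object per request.
durations = [100, 110, 120, 130, 140, 150, 160, 170, 180, 900]
log_lines = [
    json.dumps({"endpoint": "/search", "status": 200, "duration_ms": d})
    for d in durations
]
log_lines.append(json.dumps({"endpoint": "/cart", "status": 200, "duration_ms": 40}))


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of values at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_percentiles(lines, percentiles=(50, 90, 99)):
    """Group request durations by endpoint and summarize each group."""
    by_endpoint = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        by_endpoint[event["endpoint"]].append(event["duration_ms"])
    return {
        endpoint: {f"p{p}": percentile(sorted(values), p) for p in percentiles}
        for endpoint, values in by_endpoint.items()
    }


summary = latency_percentiles(log_lines)
print(summary)
```

The single 900 ms request barely moves the p50 but dominates the p99, which is why the lists above ask for several percentiles rather than only an average.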

Use the canonical log line pattern (see figure below), and resist emitting random logs throughout the request, as this makes analysis difficult. A canonical log line is a single log line emitted at the end of the request with everything that happened during the request. This makes querying the logs bliss.
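A minimal sketch of the pattern, assuming JSON output to stdout: fields are collected on one object while the request runs, and a single line is written at the end. The `CanonicalLogLine` class, the `handle_request` stand-in and the field names are hypothetical and not tied to any framework.

```python
import json
import time


class CanonicalLogLine:
    """Collects fields during a request and emits them as one JSON line at the end."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.started = time.monotonic()

    def set(self, **fields):
        self.fields.update(fields)

    def increment(self, key, amount=1):
        self.fields[key] = self.fields.get(key, 0) + amount

    def emit(self, write=print):
        # Duration is measured from construction to emit, in milliseconds.
        elapsed_ms = (time.monotonic() - self.started) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 1)
        line = json.dumps(self.fields, sort_keys=True)
        write(line)
        return line


def handle_request(log):
    # Stand-in for real request handling: record facts instead of logging them.
    log.set(tenant_id="t1")
    log.increment("db_queries")
    log.increment("db_queries")
    log.increment("cache_hits")
    log.set(status=200)


log = CanonicalLogLine(method="GET", endpoint="/search")
try:
    handle_request(log)
finally:
    # Emit in `finally` so a line is written even if the handler raises.
    line = log.emit()
```

Because every request produces exactly one line with the same fields, queries such as "p99 duration by endpoint" or "errors by tenant" become simple filters and group-bys over those fields.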