Dashboards & Alerting
Design dashboards that answer the right questions, and build alerting that pages humans only when human judgment is actually needed.
Dashboard Design Principles
Well-designed dashboards follow established methodologies rather than being built ad-hoc. Three frameworks dominate the industry.
USE Method — Infrastructure Focus
Developed by Brendan Gregg, USE stands for Utilization, Saturation, and Errors. Apply it to every hardware and OS resource.
- Utilization: Average time the resource was busy. Example: CPU busy 72% of the time.
- Saturation: Degree to which the resource has extra work queued. Example: CPU run queue length > 1.
- Errors: Count of error events. Example: NIC packet drops, disk I/O errors.
Best for: Kubernetes node dashboards, database server dashboards, network device monitoring.
# USE Method PromQL examples
# CPU Utilization (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU Saturation (PSI: time tasks stalled waiting for CPU; requires a kernel with pressure stall information)
rate(node_pressure_cpu_waiting_seconds_total[5m])
# Memory Saturation (page fault rate as proxy for memory pressure)
rate(node_vmstat_pgmajfault[5m])
# Disk Utilization (%)
rate(node_disk_io_time_seconds_total{device="sda"}[5m]) * 100
# Network Errors (per second)
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
RED Method — Service / Microservice Focus
Developed by Tom Wilkie, RED focuses on the user-visible behavior of services. It stands for Rate, Errors, and Duration.
- Rate: Number of requests per second the service is handling.
- Errors: Number of failed requests per second (typically 5xx; include 4xx only where client errors indicate a real problem).
- Duration: Distribution of time each request takes (p50, p95, p99).
Best for: Per-service dashboards in a microservices architecture. Every service gets the same three panels for instant cross-service comparison.
# RED Method PromQL examples (assuming standard OTel/Prometheus metrics)
# Rate: requests per second by service and status
sum by (service_name) (rate(http_server_request_duration_seconds_count[5m]))
# Error rate: percentage of 5xx responses
sum by (service_name) (
rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m])
) /
sum by (service_name) (
rate(http_server_request_duration_seconds_count[5m])
) * 100
# Duration: p99 latency by service
histogram_quantile(0.99, sum by (service_name, le) (
rate(http_server_request_duration_seconds_bucket[5m])
))
Four Golden Signals — SRE Focus
From the Google SRE Book, the Four Golden Signals are the most critical user-facing metrics. They cover the same ground as RED and add Saturation as a fourth signal.
- Latency: Time to serve a request. Distinguish success latency from error latency (slow errors mask real problems).
- Traffic: Demand on the system (req/s, transactions/s, queries/s).
- Errors: Rate of failing requests, both explicit (5xx) and implicit (200 with wrong data).
- Saturation: How full the service is. Predict performance degradation before it happens (queue depth, thread pool exhaustion, connection pool usage).
Best for: Executive dashboards, SLO tracking, incident command dashboards during an outage.
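The RED queries above already cover latency, traffic, and errors; Saturation needs its own queries. A minimal sketch follows, with the caveat that the exact metric names depend on what cAdvisor and your runtime actually expose (the connection pool metrics use a hypothetical Micrometer/HikariCP-style naming):
# Four Golden Signals: Saturation examples
# Container CPU saturation: fraction of CFS periods that were throttled
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])
# Connection pool saturation (hypothetical HikariCP/Micrometer metric names)
hikaricp_connections_active / hikaricp_connections_max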
Dashboard Layout Best Practices
- Top row: overview stat panels — total requests, error %, SLO burn rate, uptime. Status at a glance.
- Second row: time-series graphs — rate, error rate, p99 latency over the last 1h/6h/24h.
- Third row: resource breakdown — per-pod CPU/memory, per-endpoint latency heatmap.
- Bottom rows: detailed drill-down — database query times, dependency health, log panel.
- Use template variables — $namespace, $service, $environment — so one dashboard serves all services.
- Add deployment annotations — mark every deploy on all time-series graphs so latency changes are immediately correlated to code changes (see the example query after this list).
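One common source for deployment annotations is kube-state-metrics: an annotation query that fires whenever a Deployment's observed generation changes marks each rollout on the graph. A sketch, assuming kube-state-metrics is scraped and the dashboard defines a $namespace template variable:
# Grafana annotation query (Prometheus datasource): one marker per rollout
changes(kube_deployment_status_observed_generation{namespace="$namespace"}[2m]) > 0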
Grafana Provisioning via YAML
Dashboard and datasource configuration should be version-controlled and provisioned automatically — never configured manually through the UI.
Datasource Provisioning
# /etc/grafana/provisioning/datasources/observability.yaml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus-server.monitoring.svc.cluster.local:9090
uid: prometheus
isDefault: true
jsonData:
httpMethod: POST
manageAlerts: true
alertmanagerUid: alertmanager
prometheusType: Prometheus
prometheusVersion: 2.50.0
- name: Loki
type: loki
access: proxy
url: http://loki-gateway.monitoring.svc.cluster.local:80
uid: loki
jsonData:
maxLines: 1000
derivedFields:
- name: TraceID
matcherRegex: '"trace_id":\s*"(\w+)"'
url: "${__value.raw}"
datasourceUid: tempo
urlDisplayLabel: View in Tempo
- name: Tempo
type: tempo
access: proxy
url: http://tempo-query-frontend.monitoring.svc.cluster.local:3100
uid: tempo
jsonData:
tracesToLogsV2:
datasourceUid: loki
filterByTraceID: true
customQuery: true
query: '{service_name="${__span.tags["service.name"]}"} | json | trace_id="${__trace.traceId}"'
serviceMap:
datasourceUid: prometheus
nodeGraph:
enabled: true
- name: AlertManager
type: alertmanager
access: proxy
url: http://alertmanager.monitoring.svc.cluster.local:9093
uid: alertmanager
jsonData:
implementation: prometheus
Dashboard Provisioning
# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
- name: Platform Dashboards
orgId: 1
type: file
disableDeletion: true # prevent manual deletion via UI
updateIntervalSeconds: 30 # reload from disk every 30s
allowUiUpdates: false # changes must go through Git
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true # folder = directory name
# Grafana Helm values — mount dashboards from ConfigMap
grafana.ini:
analytics:
check_for_updates: false
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: default
orgId: 1
type: file
disableDeletion: true
options:
path: /var/lib/grafana/dashboards/default
dashboards:
default:
kubernetes-cluster:
gnetId: 15661 # Kubernetes cluster dashboard from grafana.com
revision: 1
datasource: Prometheus
node-exporter:
gnetId: 1860
revision: 37
datasource: Prometheus
sidecar:
dashboards:
enabled: true
label: grafana_dashboard # auto-load ConfigMaps with this label
labelValue: "1"
searchNamespace: ALL
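With the sidecar enabled, any ConfigMap carrying the grafana_dashboard label is loaded automatically. A minimal sketch follows; the dashboard JSON here is a placeholder and in practice comes from the version-controlled dashboard files:
# payment-service-dashboard.yaml: ConfigMap picked up by the Grafana dashboard sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # must match sidecar.dashboards.label / labelValue
data:
  payment-service.json: |
    { "title": "Payment Service", "uid": "payment-service", "panels": [] }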
AlertManager Configuration
AlertManager handles deduplication, grouping, silencing, and routing of alerts from Prometheus. Proper configuration is critical for effective on-call operations.
Core AlertManager Config
# alertmanager.yaml
global:
resolve_timeout: 5m
slack_api_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"
pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
# Default grouping: alerts from the same alertname+cluster+service
# fire as a single notification
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s # wait 30s before sending first notification (batch)
group_interval: 5m # send updates every 5 minutes
repeat_interval: 4h # re-notify if still firing after 4h
receiver: 'slack-default'
routes:
# Critical P1 alerts — page immediately via PagerDuty
- matchers:
- severity = "critical"
receiver: pagerduty-p1
group_wait: 0s # page immediately, no batching
repeat_interval: 30m # re-page every 30m until resolved
# SLO burn rate alerts — PagerDuty with 5-min grouping
- matchers:
- alertname =~ "SLOBurnRate.*"
receiver: pagerduty-slo
group_by: ['alertname', 'service', 'slo_name']
group_wait: 30s
repeat_interval: 1h
# Warning-level alerts — Slack only, no page
- matchers:
- severity = "warning"
receiver: slack-warnings
group_wait: 2m
repeat_interval: 8h
# Security alerts — dedicated security Slack + PagerDuty security team
- matchers:
- team = "security"
receiver: security-channel
group_wait: 0s
repeat_interval: 15m
inhibit_rules:
# If a critical alert is firing, suppress the corresponding warning
- source_matchers:
- severity = "critical"
target_matchers:
- severity = "warning"
equal: ['alertname', 'cluster', 'service']
receivers:
- name: 'slack-default'
slack_configs:
- channel: '#platform-alerts'
send_resolved: true
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
- name: 'slack-warnings'
slack_configs:
- channel: '#platform-warnings'
send_resolved: true
- name: 'pagerduty-p1'
pagerduty_configs:
- routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY' # use routing_key_file with a mounted secret in production
severity: 'critical'
description: '{{ .CommonAnnotations.summary }}'
details:
runbook: '{{ .CommonAnnotations.runbook_url }}'
dashboard: '{{ .CommonAnnotations.dashboard_url }}'
- name: 'pagerduty-slo'
pagerduty_configs:
- routing_key: 'YOUR_PAGERDUTY_SLO_INTEGRATION_KEY'
severity: 'error'
description: 'SLO burn rate alert: {{ .CommonAnnotations.summary }}'
- name: 'security-channel'
slack_configs:
- channel: '#security-alerts'
send_resolved: true
pagerduty_configs:
- routing_key: 'YOUR_SECURITY_PAGERDUTY_KEY'
severity: 'critical'
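The Slack receivers above call slack.title and slack.text, which must be defined in a file under /etc/alertmanager/templates/. A minimal sketch of those definitions (the layout is a local convention, not a fixed schema):
{{/* /etc/alertmanager/templates/slack.tmpl */}}
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ range .Alerts }}
*Summary:* {{ .Annotations.summary }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}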
SLO-Based Alerting with Multiburn Rate
SLO-based alerting moves away from arbitrary threshold alerts toward alerts that are directly tied to user impact. The multiburn rate approach (from the Google SRE Workbook) pages when the error budget is being consumed too quickly.
Error Budget Recap
For a 99.9% availability SLO with a 30-day window:
- Allowed downtime: 43.8 minutes per 30 days
- Allowed error rate: 0.1% of requests
- 1x burn rate = consuming budget at exactly the SLO rate (uses 100% in 30 days)
- 14x burn rate = consuming budget 14x faster (uses 100% in ~2 days)
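The arithmetic behind those numbers, using the same 99.9% SLO:
# burn_rate          = observed_error_ratio / error_budget
# time_to_exhaustion = 30 days / burn_rate
# 0.1% errors -> 0.001 / 0.001 = 1x  -> budget lasts exactly 30 days
# 1.4% errors -> 0.014 / 0.001 = 14x -> 30 / 14 ≈ 2.1 days
# 0.6% errors -> 0.006 / 0.001 = 6x  -> 30 / 6  = 5 days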
Multiburn Rate Alert Rules (PrometheusRule)
# slo-alerts.yaml — PrometheusRule for Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: payment-service-slos
namespace: monitoring
labels:
prometheus: kube-prometheus
role: alert-rules
spec:
groups:
- name: payment-service.slo.rules
interval: 30s
rules:
# ── Recording Rules ────────────────────────────────────────
# 5-minute error ratio
- record: job:http_requests:error_ratio5m
expr: |
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service",
http_response_status_code=~"5.."
}[5m]))
/
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service"
}[5m]))
# 30-minute error ratio
- record: job:http_requests:error_ratio30m
expr: |
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service",
http_response_status_code=~"5.."
}[30m]))
/
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service"
}[30m]))
# 1-hour error ratio
- record: job:http_requests:error_ratio1h
expr: |
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service",
http_response_status_code=~"5.."
}[1h]))
/
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service"
}[1h]))
# 6-hour error ratio
- record: job:http_requests:error_ratio6h
expr: |
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service",
http_response_status_code=~"5.."
}[6h]))
/
sum by (job) (rate(http_server_request_duration_seconds_count{
job="payment-service"
}[6h]))
# ── Alert Rules ────────────────────────────────────────────
# P1: >14x burn rate over 5m AND 30m windows → budget gone in <2 days
- alert: PaymentSLOBurnRateCritical
expr: |
job:http_requests:error_ratio5m{job="payment-service"} > (14 * 0.001)
and
job:http_requests:error_ratio30m{job="payment-service"} > (14 * 0.001)
for: 2m
labels:
severity: critical
team: platform
slo_name: payment-availability
annotations:
summary: "Payment service SLO critical burn rate (14x)"
description: >
Error rate is {{ $value | humanizePercentage }} over the last 5m (and above threshold over 30m).
At this rate the 30-day error budget will be exhausted in less than 2 days.
Current 5m error ratio: {{ $value | humanize }} against an SLO threshold of 0.001.
runbook_url: "https://wiki.internal.example.com/runbooks/payment-slo-burn"
dashboard_url: "https://grafana.internal.example.com/d/payment-slo/payment-service-slo"
# P2: >6x burn rate over 30m AND 6h windows → budget gone in <5 days
- alert: PaymentSLOBurnRateHigh
expr: |
job:http_requests:error_ratio30m{job="payment-service"} > (6 * 0.001)
and
job:http_requests:error_ratio6h{job="payment-service"} > (6 * 0.001)
for: 15m
labels:
severity: warning
team: platform
slo_name: payment-availability
annotations:
summary: "Payment service SLO elevated burn rate (6x)"
description: >
Error rate is {{ $value | humanizePercentage }} over 30m and 6h windows.
Error budget will be exhausted in approximately 5 days if trend continues.
runbook_url: "https://wiki.internal.example.com/runbooks/payment-slo-burn"
# P3: >3x burn rate over 6h → degraded but not urgent
- alert: PaymentSLOBurnRateSlow
expr: |
job:http_requests:error_ratio6h{job="payment-service"} > (3 * 0.001)
for: 60m
labels:
severity: info
team: platform
slo_name: payment-availability
annotations:
summary: "Payment service SLO slow burn (3x) — create ticket"
description: >
Error rate has been elevated for 1 hour at {{ $value | humanizePercentage }}.
Error budget will be exhausted in approximately 10 days. Create a ticket.
runbook_url: "https://wiki.internal.example.com/runbooks/payment-slo-burn"
Loki Log-Based Alerting
Not all problems surface in Prometheus metrics. Loki alert rules use LogQL to alert on log patterns — critical for detecting application errors that don't yet have corresponding metrics.
# loki-rules.yaml — Loki alerting rules (PrometheusRule format via Loki Ruler)
groups:
- name: application-log-alerts
rules:
# Alert on elevated error log rate
- alert: HighApplicationErrorRate
expr: |
sum by (service_name) (
rate({namespace="production"} | json | level="ERROR" [5m])
) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High error log rate in {{ $labels.service_name }}"
description: "{{ $labels.service_name }} is logging {{ $value | humanize }} errors/s"
runbook_url: "https://wiki.internal.example.com/runbooks/high-error-logs"
# Alert on panic/fatal log lines — page immediately
- alert: ApplicationPanic
expr: |
count_over_time(
{namespace="production"} |= "panic" [2m]
) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Application panic detected in production"
description: "A panic was logged in the last 2 minutes. Immediate investigation required."
# Alert on authentication failures exceeding brute-force threshold
- alert: AuthBruteForceAttempt
expr: |
sum by (client_ip) (
rate(
{service_name="auth-service"} | json
| message="authentication failed" [5m]
)
) > 20
for: 2m
labels:
severity: critical
team: security
annotations:
summary: "Possible brute-force attack from {{ $labels.client_ip }}"
description: "{{ $value | humanize }} failed auth attempts/s from {{ $labels.client_ip }}"
# Alert on database connection pool exhaustion via log pattern
- alert: DatabaseConnectionPoolExhausted
expr: |
count_over_time(
{namespace="production"}
|= "connection pool exhausted" [5m]
) > 5
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection pool exhausted"
description: "Connection pool exhaustion logged 5+ times in 5 minutes."
runbook_url: "https://wiki.internal.example.com/runbooks/db-pool-exhausted"
PagerDuty & OpsGenie Integration
Production on-call operations require tight integration between AlertManager and incident management platforms.
PagerDuty Integration
# alertmanager-secrets.yaml (sealed with Sealed Secrets or External Secrets Operator)
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-pagerduty
namespace: monitoring
type: Opaque
stringData:
routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY_HERE"
---
# alertmanager.yaml — PagerDuty receiver with full context
receivers:
- name: pagerduty-platform
pagerduty_configs:
- routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key
send_resolved: true
severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}error{{ end }}'
description: '{{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }} alert(s) firing'
alertname: '{{ .CommonLabels.alertname }}'
cluster: '{{ .CommonLabels.cluster }}'
service: '{{ .CommonLabels.service }}'
runbook: '{{ .CommonAnnotations.runbook_url }}'
dashboard: '{{ .CommonAnnotations.dashboard_url }}'
links:
- href: '{{ .CommonAnnotations.runbook_url }}'
text: 'Runbook'
- href: '{{ .CommonAnnotations.dashboard_url }}'
text: 'Dashboard'
OpsGenie Integration
# alertmanager.yaml — OpsGenie receiver
receivers:
- name: opsgenie-platform
opsgenie_configs:
- api_key_file: /etc/alertmanager/secrets/opsgenie-api-key
send_resolved: true
message: '{{ .CommonAnnotations.summary }}'
description: '{{ .CommonAnnotations.description }}'
priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else if eq .CommonLabels.severity "warning" }}P2{{ else }}P3{{ end }}'
tags: >-
cluster={{ .CommonLabels.cluster }},
service={{ .CommonLabels.service }},
env={{ .CommonLabels.environment }}
details:
runbook: '{{ .CommonAnnotations.runbook_url }}'
dashboard: '{{ .CommonAnnotations.dashboard_url }}'
firing_alerts: '{{ .Alerts.Firing | len }}'
responders:
- name: platform-oncall
type: team
Runbook Links in Alerts
Every actionable alert must link to a runbook — a documented procedure telling the on-call engineer exactly what to investigate and how to resolve the issue.
Runbook Standards
A good runbook answers five questions:
- What does this alert mean? Plain-English description of the condition.
- What is the user impact? How are end users affected right now?
- How do I diagnose it? Step-by-step investigation with exact commands.
- How do I mitigate it? Rollback procedure, feature flag disable, scale-out command.
- Who do I escalate to? Named individuals or teams with contact info.
# PrometheusRule — alert with full runbook annotations
- alert: PaymentServiceHighErrorRate
expr: |
sum(rate(http_server_request_duration_seconds_count{
job="payment-service",
http_response_status_code=~"5.."
}[5m])) /
sum(rate(http_server_request_duration_seconds_count{
job="payment-service"
}[5m])) > 0.05
for: 5m
labels:
severity: critical
team: platform
service: payment-service
annotations:
summary: "Payment service error rate > 5% for 5 minutes"
description: >
Payment service is returning {{ $value | humanizePercentage }} errors.
This is above the 5% threshold. Users cannot complete purchases.
Affects: all checkout flows. Estimated revenue impact: HIGH.
runbook_url: "https://wiki.internal.example.com/runbooks/payment-high-error-rate"
dashboard_url: "https://grafana.internal.example.com/d/payment/payment-service?var-env=production"
logs_url: "https://grafana.internal.example.com/explore?orgId=1&left=%5B%22now-1h%22%2C%22now%22%2C%22Loki%22%2C%7B%22expr%22%3A%22%7Bservice_name%3D%5C%22payment-service%5C%22%7D+%7C+json+%7C+level%3D%5C%22ERROR%5C%22%22%7D%5D"
trace_url: "https://grafana.internal.example.com/explore?datasource=tempo"
The logs_url annotation is a pre-built Grafana Explore deep link that encodes the time range (now-1h to now), the correct datasource, and a relevant query. An engineer receiving a PagerDuty alert should be able to click one link and immediately see the relevant logs — no manual query writing required.
On-Call Escalation Design
Technology alone does not make on-call sustainable. The escalation structure, rotation design, and cultural norms are equally important.
Escalation Tier Design
| Tier | Who | Response Time | Triggers |
|---|---|---|---|
| L1 (Primary) | On-call engineer (rotating weekly) | 5 minutes | All P1/P2 alerts; acknowledge or escalate within 5 min |
| L2 (Secondary) | Team lead / senior engineer | 15 minutes | L1 unacknowledged for 5 min, or L1 escalates manually |
| L3 (Incident Commander) | Engineering manager | 30 minutes | L2 unacknowledged for 10 min, or incident affects >10% users |
| L4 (Executive) | VP Engineering / CTO | 60 minutes | Complete service outage >15 minutes or data breach suspected |
Rotation Design Principles
- Weekly rotations — shorter rotations (daily) increase context-switching cost; longer (monthly) cause burnout.
- Follow-the-sun for global teams — hand off between time zones so no one receives alerts at 3 AM. Requires at least 3 geographic regions; see the routing sketch after this list.
- Shadow rotations for new engineers — pair new team members with an experienced oncall for 2–4 weeks before solo shifts.
- Oncall load limit — no engineer should receive more than 5 actionable pages per shift on average. More than that = alert quality problem, not a staffing problem.
- Post-incident reviews — conduct a blameless postmortem for every P1 incident within 48 hours. Track action items in the ticket system.
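Alertmanager can encode the hand-off directly: critical alerts are routed to a regional receiver only during that region's working hours. A sketch assuming Alertmanager v0.25+ (for time_intervals with location) and a hypothetical pagerduty-emea receiver:
# Follow-the-sun routing sketch: EMEA on-call is paged only during EMEA hours
time_intervals:
  - name: emea-business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '07:00'
            end_time: '15:00'
        location: 'Europe/London'
route:
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-emea              # hypothetical regional receiver
      active_time_intervals: ['emea-business-hours']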
Alert Quality Metrics
Track these metrics in your weekly engineering review to continuously improve alert quality:
- Actionability rate: % of alerts that required human action (target: >80%). If lower, delete or automate resolution.
- Mean time to acknowledge (MTTA): Target <5 minutes for P1. Track trends over time.
- False positive rate: Alerts that fired but required no action. Target: <10%. High false positives erode trust.
- Alert-to-ticket ratio: What % of P3/warning alerts become actionable tickets within 48h? If low, delete the alert.
- Oncall interruptions per week: Count of out-of-hours pages. Target: <2 per shift per week for sustainability.
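Alert volume and notification health can be derived from Alertmanager's own metrics; actionability and false-positive rates still require human tagging in the incident tracker. A sketch using metrics Alertmanager exposes by default:
# Pages sent via PagerDuty over the last 7 days (proxy for on-call interruptions)
sum(increase(alertmanager_notifications_total{integration="pagerduty"}[7d]))
# Notification failure rate by integration (a silently broken receiver is worse than a noisy one)
sum by (integration) (rate(alertmanager_notifications_failed_total[1h]))
/ sum by (integration) (rate(alertmanager_notifications_total[1h]))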