Dashboards & Alerting

Design dashboards that answer the right questions, and build alerting that pages humans only when human judgment is actually needed.

Guiding Principle: Every alert that fires should be actionable and urgent. If an alert fires and the on-call engineer's first thought is "I'll deal with this in the morning," the alert should either be deleted or downgraded to a ticket. Alert fatigue kills on-call culture.

Dashboard Design Principles

Well-designed dashboards follow established methodologies rather than being built ad-hoc. Three frameworks dominate the industry.

USE Method — Infrastructure Focus

Developed by Brendan Gregg, USE stands for Utilization, Saturation, and Errors. Apply it to every hardware and OS resource.

  • Utilization: Average time the resource was busy. Example: CPU busy 72% of the time.
  • Saturation: Degree to which the resource has extra work queued. Example: CPU run queue length exceeding the number of CPUs.
  • Errors: Count of error events. Example: NIC packet drops, disk I/O errors.

Best for: Kubernetes node dashboards, database server dashboards, network device monitoring.

# USE Method PromQL examples

# CPU Utilization (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU Saturation (run queue length — node_pressure_cpu_waiting_seconds on modern kernels)
rate(node_pressure_cpu_waiting_seconds_total[5m])

# Memory Saturation (page fault rate as proxy for memory pressure)
rate(node_vmstat_pgmajfault[5m])

# Disk Utilization (%)
rate(node_disk_io_time_seconds_total{device="sda"}[5m]) * 100

# Network Errors (per second)
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])

RED Method — Service / Microservice Focus

Developed by Tom Wilkie, RED focuses on the user-visible behavior of services. It stands for Rate, Errors, and Duration.

  • Rate: Number of requests per second the service is handling.
  • Errors: Number of failed requests per second (typically 5xx; decide explicitly whether 4xx count as failures for your service).
  • Duration: Distribution of time each request takes (p50, p95, p99).

Best for: Per-service dashboards in a microservices architecture. Every service gets the same three panels for instant cross-service comparison.

# RED Method PromQL examples (assuming standard OTel/Prometheus metrics)

# Rate: requests per second by service
sum by (service_name) (rate(http_server_request_duration_seconds_count[5m]))

# Error rate: percentage of 5xx responses
sum by (service_name) (
  rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m])
) /
sum by (service_name) (
  rate(http_server_request_duration_seconds_count[5m])
) * 100

# Duration: p99 latency by service
histogram_quantile(0.99, sum by (service_name, le) (
  rate(http_server_request_duration_seconds_bucket[5m])
))

Four Golden Signals — SRE Focus

From the Google SRE Book, the Four Golden Signals are the most critical user-facing metrics. They cover the same ground as RED and add Saturation as the fourth signal.

  • Latency: Time to serve a request. Track the latency of successful requests separately from the latency of errors; fast-failing errors can make overall latency look healthy.
  • Traffic: Demand on the system (req/s, transactions/s, queries/s).
  • Errors: Rate of failing requests, both explicit (5xx) and implicit (200 with wrong data).
  • Saturation: How full the service is. Predict performance degradation before it happens (queue depth, thread pool exhaustion, connection pool usage).

Best for: Executive dashboards, SLO tracking, incident command dashboards during an outage.
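
Latency, traffic, and errors map directly onto the RED queries above; saturation usually needs its own panels. Two sketches, assuming cAdvisor and kube-state-metrics are being scraped (those metric names are assumptions about your cluster exporters, not part of this guide's service metrics):

# Golden Signals: Saturation examples (assumes cAdvisor + kube-state-metrics)

# Pod memory saturation: working set as a percentage of the configured limit
100 * sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})

# CPU throttling: fraction of CFS periods in which the container was throttled
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])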

Dashboard Layout Best Practices

  • Top row: overview stat panels — total requests, error %, SLO burn rate, uptime. Status at a glance.
  • Second row: time-series graphs — rate, error rate, p99 latency over the last 1h/6h/24h.
  • Third row: resource breakdown — per-pod CPU/memory, per-endpoint latency heatmap.
  • Bottom rows: detailed drill-down — database query times, dependency health, log panel.
  • Use template variables ($namespace, $service, $environment) so one dashboard serves all services; see the example queries after this list.
  • Add deployment annotations — mark every deploy on all time-series graphs so latency changes are immediately correlated to code changes.
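
Both of the last two points can be driven from Prometheus. A sketch of the variable and annotation queries, assuming kube-state-metrics is scraped (the deploy marker here treats a change in a Deployment's observed generation as a deployment event):

# Template variable queries (Grafana "query" variables)
label_values(kube_pod_info, namespace)                               # $namespace
label_values(kube_service_info{namespace="$namespace"}, service)     # $service, scoped to $namespace

# Deployment annotation query: mark deploys on every time-series graph
changes(kube_deployment_status_observed_generation{namespace=~"$namespace"}[2m]) > 0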

Grafana Provisioning via YAML

Dashboard and datasource configuration should be version-controlled and provisioned automatically — never configured manually through the UI.

Datasource Provisioning

# /etc/grafana/provisioning/datasources/observability.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc.cluster.local:9090
    uid: prometheus
    isDefault: true
    jsonData:
      httpMethod: POST
      manageAlerts: true
      alertmanagerUid: alertmanager
      prometheusType: Prometheus
      prometheusVersion: 2.50.0

  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.monitoring.svc.cluster.local:80
    uid: loki
    jsonData:
      maxLines: 1000
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":\s*"(\w+)"'
          url: "${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: View in Tempo

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo-query-frontend.monitoring.svc.cluster.local:3100
    uid: tempo
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        customQuery: true
        query: '{service_name="${__span.tags["service.name"]}"} | json | trace_id="${__trace.traceId}"'
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true

  - name: AlertManager
    type: alertmanager
    access: proxy
    url: http://alertmanager.monitoring.svc.cluster.local:9093
    uid: alertmanager
    jsonData:
      implementation: prometheus

Dashboard Provisioning

# /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1

providers:
  - name: Platform Dashboards
    orgId: 1
    type: file
    disableDeletion: true        # prevent manual deletion via UI
    updateIntervalSeconds: 30    # reload from disk every 30s
    allowUiUpdates: false        # changes must go through Git
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true   # folder = directory name

# Grafana Helm values — mount dashboards from ConfigMap
grafana.ini:
  analytics:
    check_for_updates: false

dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        type: file
        disableDeletion: true
        options:
          path: /var/lib/grafana/dashboards/default

dashboards:
  default:
    kubernetes-cluster:
      gnetId: 15661     # Kubernetes cluster dashboard from grafana.com
      revision: 1
      datasource: Prometheus
    node-exporter:
      gnetId: 1860
      revision: 37
      datasource: Prometheus

sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard    # auto-load ConfigMaps with this label
    labelValue: "1"
    searchNamespace: ALL
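
With the sidecar enabled, any team can ship a dashboard by committing a ConfigMap like the one below (a sketch; the name and the dashboard JSON payload are placeholders for real, Git-managed content):

# dashboard-configmap.yaml: auto-loaded by the Grafana sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"        # must match sidecar.dashboards.label / labelValue
data:
  payment-service.json: |-
    { "title": "Payment Service", "uid": "payment-service", "panels": [] }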

AlertManager Configuration

AlertManager handles deduplication, grouping, silencing, and routing of alerts from Prometheus. Proper configuration is critical for effective on-call operations.

Core AlertManager Config

# alertmanager.yaml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # Default grouping: alerts from the same alertname+cluster+service
  # fire as a single notification
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s         # wait 30s before sending first notification (batch)
  group_interval: 5m      # send updates every 5 minutes
  repeat_interval: 4h     # re-notify if still firing after 4h
  receiver: 'slack-default'

  routes:
    # Critical P1 alerts — page immediately via PagerDuty
    - matchers:
        - severity = "critical"
      receiver: pagerduty-p1
      group_wait: 0s          # page immediately, no batching
      repeat_interval: 30m    # re-page every 30m until resolved

    # SLO burn rate alerts — PagerDuty with 5-min grouping
    - matchers:
        - alertname =~ "SLOBurnRate.*"
      receiver: pagerduty-slo
      group_by: ['alertname', 'service', 'slo_name']
      group_wait: 30s
      repeat_interval: 1h

    # Warning-level alerts — Slack only, no page
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      group_wait: 2m
      repeat_interval: 8h

    # Security alerts — dedicated security Slack + PagerDuty security team
    - matchers:
        - team = "security"
      receiver: security-channel
      group_wait: 0s
      repeat_interval: 15m

inhibit_rules:
  # If a critical alert is firing, suppress the corresponding warning
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#platform-alerts'
        send_resolved: true
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#platform-warnings'
        send_resolved: true

  - name: 'pagerduty-p1'
    pagerduty_configs:
      - routing_key: 'YOUR_P1_PAGERDUTY_INTEGRATION_KEY'   # use routing_key_file with a mounted secret in production
        severity: 'critical'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'pagerduty-slo'
    pagerduty_configs:
      - routing_key: 'YOUR_SLO_PAGERDUTY_INTEGRATION_KEY'   # use routing_key_file in production
        severity: 'error'
        description: 'SLO burn rate alert: {{ .CommonAnnotations.summary }}'

  - name: 'security-channel'
    slack_configs:
      - channel: '#security-alerts'
        send_resolved: true
    pagerduty_configs:
      - routing_key: 'YOUR_SECURITY_PAGERDUTY_KEY'
        severity: 'critical'
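
The slack.title and slack.text templates referenced by the default receiver live in the templates directory declared at the top of the file. A minimal sketch (the template names match the receiver config above; the fields shown are assumptions, adjust to taste):

# /etc/alertmanager/templates/slack.tmpl
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Summary:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}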

SLO-Based Alerting with Multiburn Rate

SLO-based alerting moves away from arbitrary threshold alerts toward alerts that are directly tied to user impact. The multiburn rate approach (from the Google SRE Workbook) pages when the error budget is being consumed too quickly.

Error Budget Recap

For a 99.9% availability SLO with a 30-day window:

  • Allowed downtime: 43.2 minutes per 30 days
  • Allowed error rate: 0.1% of requests
  • 1x burn rate = consuming budget at exactly the SLO rate (uses 100% in 30 days)
  • 14x burn rate = consuming budget 14x faster (uses 100% in ~2 days)
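
The thresholds in the rules below fall straight out of this arithmetic:

# Burn-rate arithmetic for a 99.9% SLO over a 30-day window
# Error budget            = 1 - 0.999  = 0.001 (0.1% of requests)
# 14x burn-rate threshold = 14 * 0.001 = 0.014 (1.4% errors) -> budget gone in 30/14 ≈ 2.1 days
# 6x burn-rate threshold  = 6 * 0.001  = 0.006 (0.6% errors) -> budget gone in 30/6  = 5 days
# 3x burn-rate threshold  = 3 * 0.001  = 0.003 (0.3% errors) -> budget gone in 30/3  = 10 days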

Multiburn Rate Alert Rules (PrometheusRule)

# slo-alerts.yaml — PrometheusRule for Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slos
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payment-service.slo.rules
      interval: 30s
      rules:
        # ── Recording Rules ────────────────────────────────────────
        # 5-minute error ratio
        - record: job:http_requests:error_ratio5m
          expr: |
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service",
              http_response_status_code=~"5.."
            }[5m]))
            /
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service"
            }[5m]))

        # 30-minute error ratio
        - record: job:http_requests:error_ratio30m
          expr: |
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service",
              http_response_status_code=~"5.."
            }[30m]))
            /
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service"
            }[30m]))

        # 1-hour error ratio
        - record: job:http_requests:error_ratio1h
          expr: |
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service",
              http_response_status_code=~"5.."
            }[1h]))
            /
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service"
            }[1h]))

        # 6-hour error ratio
        - record: job:http_requests:error_ratio6h
          expr: |
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service",
              http_response_status_code=~"5.."
            }[6h]))
            /
            sum by (job) (rate(http_server_request_duration_seconds_count{
              job="payment-service"
            }[6h]))

        # ── Alert Rules ────────────────────────────────────────────
        # P1: >14x burn rate over 5m AND 30m windows → budget gone in <2 days
        - alert: PaymentSLOBurnRateCritical
          expr: |
            job:http_requests:error_ratio5m{job="payment-service"}  > (14 * 0.001)
            and
            job:http_requests:error_ratio30m{job="payment-service"} > (14 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: platform
            slo_name: payment-availability
          annotations:
            summary: "Payment service SLO critical burn rate (14x)"
            description: >
              Error rate is {{ $value | humanizePercentage }} over the last 5m and 30m.
              At this rate the 30-day error budget will be exhausted in less than 2 days.
              The critical threshold is 14x the 0.1% budget, i.e. a 1.4% error rate.
            runbook_url: "https://wiki.internal.example.com/runbooks/payment-slo-burn"
            dashboard_url: "https://grafana.internal.example.com/d/payment-slo/payment-service-slo"

        # P2: >6x burn rate over 30m AND 6h windows → budget gone in <5 days
        - alert: PaymentSLOBurnRateHigh
          expr: |
            job:http_requests:error_ratio30m{job="payment-service"} > (6 * 0.001)
            and
            job:http_requests:error_ratio6h{job="payment-service"}  > (6 * 0.001)
          for: 15m
          labels:
            severity: warning
            team: platform
            slo_name: payment-availability
          annotations:
            summary: "Payment service SLO elevated burn rate (6x)"
            description: >
              Error rate is {{ $value | humanizePercentage }} over 30m and 6h windows.
              Error budget will be exhausted in approximately 5 days if trend continues.
            runbook_url: "https://wiki.internal.example.com/runbooks/payment-slo-burn"

        # P3: >3x burn rate over 6h → degraded but not urgent
        - alert: PaymentSLOBurnRateSlow
          expr: |
            job:http_requests:error_ratio6h{job="payment-service"} > (3 * 0.001)
          for: 60m
          labels:
            severity: info
            team: platform
            slo_name: payment-availability
          annotations:
            summary: "Payment service SLO slow burn (3x) — create ticket"
            description: >
              Error rate has been elevated for 1 hour at {{ $value | humanizePercentage }}.
              Error budget will be exhausted in approximately 10 days. Create a ticket.
            runbook_url: "https://wiki.internal.example.com/runbooks/payment-slo-burn"

Multiburn Rate Logic: The key insight is using two windows for each alert (e.g., 5m AND 30m). The short window detects fast-moving incidents quickly. The long window confirms the signal is sustained and not a brief blip. Both must be above the threshold to fire, reducing false positives dramatically.

Loki Log-Based Alerting

Not all problems surface in Prometheus metrics. Loki alert rules use LogQL to alert on log patterns — critical for detecting application errors that don't yet have corresponding metrics.

# loki-rules.yaml — Loki alerting rules (PrometheusRule format via Loki Ruler)
groups:
  - name: application-log-alerts
    rules:
      # Alert on elevated error log rate
      - alert: HighApplicationErrorRate
        expr: |
          sum by (service_name) (
            rate({namespace="production"} | json | level="ERROR" [5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate in {{ $labels.service_name }}"
          description: "{{ $labels.service_name }} is logging {{ $value | humanize }} errors/s"
          runbook_url: "https://wiki.internal.example.com/runbooks/high-error-logs"

      # Alert on panic/fatal log lines — page immediately
      - alert: ApplicationPanic
        expr: |
          count_over_time(
            {namespace="production"} |= "panic" [2m]
          ) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Application panic detected in production"
          description: "A panic was logged in the last 2 minutes. Immediate investigation required."

      # Alert on authentication failures exceeding brute-force threshold
      - alert: AuthBruteForceAttempt
        expr: |
          sum by (client_ip) (
            rate(
              {service_name="auth-service"} | json
              | message="authentication failed" [5m]
            )
          ) > 20
        for: 2m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Possible brute-force attack from {{ $labels.client_ip }}"
          description: "{{ $value | humanize }} failed auth attempts/s from {{ $labels.client_ip }}"

      # Alert on database connection pool exhaustion via log pattern
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          count_over_time(
            {namespace="production"}
            |= "connection pool exhausted" [5m]
          ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool exhausted"
          description: "Connection pool exhaustion logged 5+ times in 5 minutes."
          runbook_url: "https://wiki.internal.example.com/runbooks/db-pool-exhausted"
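
For these rules to fire, the Loki ruler must be enabled, given a location to load rule files from, and pointed at AlertManager. A minimal sketch of the relevant Loki configuration (directory paths are assumptions; the AlertManager URL matches the in-cluster service used earlier):

# loki config fragment: enable the ruler and point it at AlertManager
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules        # rule files live under <directory>/<tenant>/
  rule_path: /loki/rules-temp       # scratch space used during rule evaluation
  alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
  enable_api: true                  # allow rules to be managed via the ruler API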

PagerDuty & OpsGenie Integration

Production on-call operations require tight integration between AlertManager and incident management platforms.

PagerDuty Integration

# alertmanager-secrets.yaml (sealed with Sealed Secrets or External Secrets Operator)
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-pagerduty
  namespace: monitoring
type: Opaque
stringData:
  routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY_HERE"

---
# alertmanager.yaml — PagerDuty receiver with full context
receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key
        send_resolved: true
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}error{{ end }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }} alert(s) firing'
          alertname: '{{ .CommonLabels.alertname }}'
          cluster: '{{ .CommonLabels.cluster }}'
          service: '{{ .CommonLabels.service }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'
        links:
          - href: '{{ .CommonAnnotations.runbook_url }}'
            text: 'Runbook'
          - href: '{{ .CommonAnnotations.dashboard_url }}'
            text: 'Dashboard'
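
How the routing-key file ends up on disk depends on how AlertManager is deployed. With the prometheus-operator, one option is the Alertmanager CR's secrets field, which mounts each listed Secret under /etc/alertmanager/secrets/<secret-name>/ (a sketch; adjust routing_key_file in the receiver to match the resulting path):

# alertmanager-cr.yaml: mount the PagerDuty Secret into the AlertManager pods
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  secrets:
    - alertmanager-pagerduty    # from alertmanager-secrets.yaml above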

OpsGenie Integration

# alertmanager.yaml — OpsGenie receiver
receivers:
  - name: opsgenie-platform
    opsgenie_configs:
      - api_key_file: /etc/alertmanager/secrets/opsgenie-api-key
        send_resolved: true
        message: '{{ .CommonAnnotations.summary }}'
        description: '{{ .CommonAnnotations.description }}'
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else if eq .CommonLabels.severity "warning" }}P2{{ else }}P3{{ end }}'
        tags: >-
          cluster={{ .CommonLabels.cluster }},
          service={{ .CommonLabels.service }},
          env={{ .CommonLabels.environment }}
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'
          firing_alerts: '{{ .Alerts.Firing | len }}'
        responders:
          - name: platform-oncall
            type: team

Runbook Links in Alerts

Every actionable alert must link to a runbook — a documented procedure telling the on-call engineer exactly what to investigate and how to resolve the issue.

Runbook Standards

A good runbook answers five questions:

  1. What does this alert mean? Plain-English description of the condition.
  2. What is the user impact? How are end users affected right now?
  3. How do I diagnose it? Step-by-step investigation with exact commands.
  4. How do I mitigate it? Rollback procedure, feature flag disable, scale-out command.
  5. Who do I escalate to? Named individuals or teams with contact info.

# PrometheusRule — alert with full runbook annotations
- alert: PaymentServiceHighErrorRate
  expr: |
    sum(rate(http_server_request_duration_seconds_count{
      job="payment-service",
      http_response_status_code=~"5.."
    }[5m])) /
    sum(rate(http_server_request_duration_seconds_count{
      job="payment-service"
    }[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
    team: platform
    service: payment-service
  annotations:
    summary: "Payment service error rate > 5% for 5 minutes"
    description: >
      Payment service is returning {{ $value | humanizePercentage }} errors.
      This is above the 5% threshold. Users cannot complete purchases.
      Affects: all checkout flows. Estimated revenue impact: HIGH.
    runbook_url: "https://wiki.internal.example.com/runbooks/payment-high-error-rate"
    dashboard_url: "https://grafana.internal.example.com/d/payment/payment-service?var-env=production"
    logs_url: "https://grafana.internal.example.com/explore?orgId=1&left=%5B%22now-1h%22%2C%22now%22%2C%22Loki%22%2C%7B%22expr%22%3A%22%7Bservice_name%3D%5C%22payment-service%5C%22%7D+%7C+json+%7C+level%3D%5C%22ERROR%5C%22%22%7D%5D"
    trace_url: "https://grafana.internal.example.com/explore?datasource=tempo"

Deep Links: Include pre-filtered deep links to Grafana Explore in alert annotations. Pre-populate the time range (now-1h to now), the correct datasource, and a relevant query. An engineer receiving a PagerDuty alert should be able to click one link and immediately see the relevant logs — no manual query writing required.

On-Call Escalation Design

Technology alone does not make on-call sustainable. The escalation structure, rotation design, and cultural norms are equally important.

Escalation Tier Design

  • L1 (Primary): on-call engineer, rotating weekly. Response time: 5 minutes. Triggers: all P1/P2 alerts; acknowledge or escalate within 5 minutes.
  • L2 (Secondary): team lead / senior engineer. Response time: 15 minutes. Triggers: L1 unacknowledged for 5 minutes, or L1 escalates manually.
  • L3 (Incident Commander): engineering manager. Response time: 30 minutes. Triggers: L2 unacknowledged for 10 minutes, or incident affects >10% of users.
  • L4 (Executive): VP Engineering / CTO. Response time: 60 minutes. Triggers: complete service outage >15 minutes, or suspected data breach.

Rotation Design Principles

  • Weekly rotations — shorter rotations (daily) increase context-switching cost; longer (monthly) cause burnout.
  • Follow-the-sun for global teams — hand off between time zones so no one receives alerts at 3 AM. Requires at least 3 geographic regions; see the AlertManager time-interval sketch after this list.
  • Shadow rotations for new engineers — pair new team members with an experienced oncall for 2–4 weeks before solo shifts.
  • Oncall load limit — no engineer should receive more than 5 actionable pages per shift on average. More than that = alert quality problem, not a staffing problem.
  • Post-incident reviews — conduct a blameless postmortem for every P1 incident within 48 hours. Track action items in the ticket system.
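
The follow-the-sun hand-off can be partly enforced in AlertManager itself with time intervals, so a regional receiver only receives pages during its working hours. A sketch (the receiver name and the time window are assumptions; active_time_intervals and time-zone locations require a recent AlertManager, 0.25+):

# alertmanager.yaml fragment: region-aware routing with time intervals
time_intervals:
  - name: emea-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '08:00'
            end_time: '17:00'
        location: 'Europe/Berlin'

route:
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-emea              # hypothetical regional receiver
      active_time_intervals: ['emea-hours']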

Alert Quality Metrics

Track these metrics in your weekly engineering review to continuously improve alert quality:

  • Actionability rate: % of alerts that required human action (target: >80%). If lower, delete or automate resolution.
  • Mean time to acknowledge (MTTA): Target <5 minutes for P1. Track trends over time.
  • False positive rate: Alerts that fired but required no action. Target: <10%. High false positives erode trust.
  • Alert-to-ticket ratio: What % of P3/warning alerts become actionable tickets within 48h? If low, delete the alert.
  • Oncall interruptions per week: Count of out-of-hours pages. Target: <2 per shift per week for sustainability.
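
A couple of these can be tracked from the monitoring stack itself. A rough sketch (alertmanager_notifications_total and the synthetic ALERTS series are standard, but sample counts are only a proxy for distinct incidents):

# Pages sent to PagerDuty over the last 7 days (proxy for oncall interruptions)
sum(increase(alertmanager_notifications_total{integration="pagerduty"}[7d]))

# Noisiest alerts over the last 7 days: review candidates for deletion or automation
sort_desc(
  sum by (alertname, severity) (count_over_time(ALERTS{alertstate="firing"}[7d]))
)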

Alerting Maturity Checklist: Every alert in your system should pass this test: (1) Is it actionable? (2) Does it link to a runbook? (3) Does it link to a dashboard? (4) Is the severity label correct? (5) Is there an inhibit rule to prevent duplicates during cascading failures? (6) Has it been reviewed in the last 90 days? If any answer is "no," the alert needs attention before it pages a human at 3 AM.