Observability
A comprehensive guide to understanding system behavior in production — moving beyond reactive monitoring toward a truly observable infrastructure.
Observability vs. Monitoring
These two terms are frequently conflated, but they represent fundamentally different philosophies about how you interact with a production system.
Monitoring
Monitoring is the practice of collecting and evaluating predefined metrics and thresholds. You know in advance what you are looking for.
- Dashboard of known KPIs
- Alerting on CPU > 80%
- Checking uptime every 30 seconds
- Answers: "Is X broken?"
Limitation: You can only detect failures you anticipated.
Observability
Observability enables ad-hoc exploration of system behavior using high-cardinality, high-dimensionality telemetry data.
- Slice metrics by any attribute
- Correlate logs, traces, metrics
- Ask: "Why is user X experiencing latency?"
- Answers: "What is broken and why?"
Strength: Understand unknown-unknown failure modes.
The Three Pillars of Observability
The three pillars — Metrics, Logs, and Traces — form the foundational telemetry data types that together enable full system observability.
Pillar 1: Metrics
Metrics are numeric time-series measurements aggregated over time. They are highly efficient to store and query, making them ideal for dashboards and alerting at scale.
Examples:
- `http_requests_total{method="GET", status="200"}` — request counter
- `node_cpu_seconds_total` — cumulative CPU time counter
- `http_request_duration_seconds` — latency histogram (p50, p95, p99)
- `go_goroutines` — runtime state gauge
Tools: Prometheus, InfluxDB, Datadog Metrics, CloudWatch, VictoriaMetrics
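To make the histogram example concrete, here is a minimal hand-rolled sketch (a toy model, not the Prometheus client library) of how cumulative "le" buckets record latencies and how a quantile like p95 can be estimated from the bucket counts:

```python
import bisect

# Cumulative histogram buckets, Prometheus-style: each "le" bucket counts
# observations less than or equal to its upper bound (in seconds).
BUCKET_BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # the +Inf bucket is implicit

class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * (len(BUCKET_BOUNDS) + 1)  # last slot = +Inf
        self.total = 0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bucket whose upper bound >= seconds
        self.counts[bisect.bisect_left(BUCKET_BOUNDS, seconds)] += 1
        self.total += 1

    def quantile(self, q: float) -> float:
        # Walk cumulative counts until we pass the q-th observation and
        # return that bucket's upper bound. This over-estimates, much like
        # PromQL's histogram_quantile (which additionally interpolates).
        target = q * self.total
        cumulative = 0
        for bound, count in zip(BUCKET_BOUNDS, self.counts):
            cumulative += count
            if cumulative >= target:
                return bound
        return float("inf")

h = LatencyHistogram()
for ms in [20, 30, 40, 45, 60, 80, 120, 300, 900, 2000]:
    h.observe(ms / 1000)
print(h.quantile(0.95))  # p95 lands in the 2.5s bucket
```

Note what the histogram gives up: it stores only seven integers regardless of traffic volume, which is why metrics are cheap at scale, but individual request latencies are unrecoverable.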
Pillar 2: Logs
Logs are timestamped, immutable records of discrete events. They capture the rich context of what happened at a specific point in time and are essential for root cause analysis.
Examples:
{
"timestamp": "2026-03-28T08:42:11Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "4bf92f3577b34da6",
"span_id": "00f067aa0ba902b7",
"user_id": "usr_9f2a1b",
"message": "Payment gateway timeout after 5000ms",
"gateway": "stripe",
"amount_cents": 4999
}
Tools: Loki, Elasticsearch, Splunk, Datadog Logs, CloudWatch Logs
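A log line like the example above can be produced with nothing but the standard library. Here is a minimal sketch of a JSON formatter (the field names mirror the example; the service name and context keys are illustrative, not any particular library's schema):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    # Structured context keys we promote from `extra={...}` into the output
    CONTEXT_KEYS = ("trace_id", "span_id", "user_id", "gateway", "amount_cents")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-service",  # assumed service name
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment gateway timeout after 5000ms",
    extra={"trace_id": "4bf92f3577b34da6", "gateway": "stripe", "amount_cents": 4999},
)
```

Passing `trace_id` through `extra` is what makes trace-to-log correlation possible later: a log backend can filter on that field to show every line emitted during one request.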
Pillar 3: Traces
Distributed traces track a single request as it propagates across multiple services. Each unit of work is a "span"; spans are linked into a "trace" via a shared TraceID.
Example trace for a checkout request:
TraceID: 4bf92f3577b34da6
│
├── [api-gateway] checkout POST /api/v1/order 0ms → 320ms
│ ├── [auth-service] ValidateJWT 5ms → 18ms
│ ├── [cart-service] GetCart(user_id) 20ms → 45ms
│ ├── [inventory-svc] ReserveItems([sku_101]) 47ms → 110ms
│ └── [payment-svc] ChargeCard(stripe) 112ms → 315ms ← SLOW
│ └── [stripe-api] POST /v1/charges 115ms → 313ms
Tools: Jaeger, Grafana Tempo, Zipkin, Datadog APM, AWS X-Ray
Note that the TraceID above matches the trace_id embedded in the earlier log example — that correlation is what lets you jump from a slow span directly to the logs it produced.
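The parent/child structure in the diagram can be illustrated with a tiny hand-rolled sketch (a toy model to show the data shape, not the OpenTelemetry API): every span records its name, parent, and timing, and all spans share one trace ID.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A toy span: one timed unit of work within a trace."""
    name: str
    trace_id: str
    parent: Optional["Span"] = None
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()

class Tracer:
    """Collects spans that share a single trace_id."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex[:16]
        self.spans: list[Span] = []

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        span = Span(name=name, trace_id=self.trace_id, parent=parent)
        self.spans.append(span)
        return span

tracer = Tracer()
root = tracer.start_span("checkout POST /api/v1/order")
auth = tracer.start_span("ValidateJWT", parent=root)
auth.finish()
payment = tracer.start_span("ChargeCard", parent=root)
payment.finish()
root.finish()

for span in tracer.spans:
    indent = "  " if span.parent else ""
    print(f"{indent}[{span.trace_id}] {span.name}")
```

In a real system the hard part is propagating the trace ID across process boundaries (HTTP headers, message metadata) — which is exactly what the W3C `traceparent` header and OpenTelemetry context propagation handle for you.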
OpenTelemetry (OTel)
OpenTelemetry is a CNCF project that provides a vendor-neutral, standardized framework for generating, collecting, and exporting telemetry data (metrics, logs, and traces).
Why OpenTelemetry?
Before OTel, every observability vendor had its own SDK, agent, and wire protocol. Switching from Datadog to Jaeger meant rewriting instrumentation code. OpenTelemetry solves this with a single API layer.
- Single API & SDK — one instrumentation surface for all signals
- Auto-instrumentation — instrument popular frameworks (Flask, gRPC, Spring) with zero code changes via agents
- Pluggable exporters — send to Jaeger, Prometheus, Datadog, or any OTLP-compatible backend
- Collector — a standalone proxy/processor that receives, transforms, and exports telemetry
OTel Architecture
┌─────────────────────────────────────────────────────────┐
│ Your Application │
│ ┌─────────────────────────────────────────────────┐ │
│ │ OTel SDK (auto-instrumented or manual spans) │ │
│ └────────────────────┬────────────────────────────┘ │
└───────────────────────┼─────────────────────────────────┘
│ OTLP (gRPC / HTTP)
▼
┌─────────────────────────┐
│ OTel Collector │
│ ┌──────────────────┐ │
│ │ Receivers │ │ ← OTLP, Jaeger, Zipkin, Prometheus
│ │ Processors │ │ ← batch, filter, sample, enrich
│ │ Exporters │ │ ← Jaeger, Tempo, Prometheus, Loki
│ └──────────────────┘ │
└────────────┬────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
Jaeger Prometheus Loki
(traces) (metrics) (logs)
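A minimal Collector configuration matching the diagram might look like the following (the endpoints and exporter choices are illustrative — check your backend's documentation for exact exporter settings):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:                # buffer and batch telemetry before export
    timeout: 5s

exporters:
  otlp/jaeger:          # traces to Jaeger, which accepts OTLP natively
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:           # expose metrics on a scrape endpoint for Prometheus
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Each pipeline wires one set of receivers through processors to exporters, so the same Collector can fan telemetry out to several backends without touching application code.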
Auto-Instrumentation Concept (Python)
# Install OTel packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Auto-instrument a Flask app — zero code changes required
opentelemetry-instrument \
--traces_exporter otlp \
--metrics_exporter otlp \
--logs_exporter otlp \
--exporter_otlp_endpoint http://otel-collector:4317 \
python app.py
MELT Framework
MELT is an expansion of the three pillars that formally includes Events as a distinct signal type, giving a more complete picture of observable data.
M — Metrics
Aggregated numeric time-series. Low storage cost, fast to query. Best for trends and alerting thresholds.
E — Events
Discrete occurrences with rich context: deployments, feature flag changes, circuit breaker trips, config changes. Events act as "change markers" on your dashboards.
L — Logs
Timestamped, structured event records. High storage cost, highest detail. Best for root cause analysis and audit trails.
T — Traces
Request-scoped, cross-service execution paths. Moderate storage cost. Best for diagnosing latency and inter-service dependencies.
Observability Maturity Model
Organizations typically progress through three stages of observability maturity. Understanding where your team is helps prioritize investment.
Stage 1 Reactive
Characteristics: Alerts fire after users report problems. Teams SSH into servers to read logs. Dashboards exist but are rarely used proactively. Every outage involves manual log tailing and guesswork.
Signals: Basic uptime checks, server-level CPU/memory metrics, unstructured logs.
Goal: Move from unstructured logs to structured JSON logging; centralize logs into a searchable platform (Loki or ELK).
Stage 2 Proactive
Characteristics: Teams can answer "what is broken?" without user reports. SLOs are defined and tracked. Distributed tracing exists. Runbooks are linked to alerts. On-call rotations are structured.
Signals: Application-level metrics (RED method), distributed traces, structured logs with trace correlation, SLO burn-rate alerts.
Goal: Close the correlation gap between signals; implement trace-to-log and metric-to-trace linking.
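The RED method mentioned above (Rate, Errors, Duration) can be sketched by aggregating per-request records — a toy computation for illustration; in practice these signals come from a metrics backend such as Prometheus:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int

def red_summary(requests: list[Request], window_seconds: float) -> dict:
    """Compute the three RED signals over one window of requests."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95: the value below which ~95% of observations fall
    p95 = durations[max(0, int(len(durations) * 0.95) - 1)]
    return {
        "rate_rps": len(requests) / window_seconds,  # R: requests per second
        "error_ratio": errors / len(requests),       # E: fraction of failures
        "duration_p95_ms": p95,                      # D: latency distribution
    }

reqs = [Request(d, s) for d, s in [
    (12, 200), (15, 200), (18, 200), (22, 200), (25, 200),
    (30, 200), (35, 200), (40, 500), (55, 200), (400, 200),
]]
print(red_summary(reqs, window_seconds=10.0))
```

RED deliberately tracks symptoms users feel (slow or failing requests) rather than causes (CPU, memory), which is what makes it a good basis for SLO alerting.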
Stage 3 Predictive
Characteristics: Teams can answer "what will break?" before it happens. Anomaly detection flags unusual patterns. Capacity planning uses historical telemetry. Chaos experiments validate observability coverage.
Signals: Full MELT correlation, ML-based anomaly detection, synthetic monitoring, continuous profiling (Pyroscope), business-level SLIs.
Goal: Instrument business metrics alongside infrastructure metrics; build feedback loops from observability data into platform decisions.
Tool Landscape
The modern observability ecosystem is rich. Below is a curated overview of the most widely adopted open-source and commercial tools, organized by primary function.
| Tool | Category | Primary Function | Deployment |
|---|---|---|---|
| Prometheus | Metrics | Time-series database with pull-based scraping and PromQL query language | Self-hosted / Managed |
| Grafana | Visualization | Multi-source dashboard platform; integrates with Prometheus, Loki, Tempo, and 100+ datasources | Self-hosted / Cloud |
| Jaeger | Tracing | CNCF distributed tracing system; Cassandra/Elasticsearch backend; native OTLP support | Self-hosted / Kubernetes |
| Grafana Tempo | Tracing | Cost-efficient trace backend using object storage (S3/GCS); integrates natively with Grafana | Self-hosted / Grafana Cloud |
| Loki | Logs | Log aggregation system inspired by Prometheus; indexes labels, not full text; uses LogQL | Self-hosted / Grafana Cloud |
| OpenTelemetry | Instrumentation | Vendor-neutral SDK + Collector for generating and routing MELT telemetry | SDK (in-app) + Collector |
| Datadog | Full-stack (Commercial) | Unified SaaS platform for metrics, logs, traces, synthetics, RUM, and security monitoring | SaaS / Agent-based |
| New Relic | Full-stack (Commercial) | SaaS observability platform with APM, infrastructure monitoring, browser monitoring, and NRQL | SaaS / Agent-based |
Explore Further
Dive deeper into specific observability disciplines:
Distributed Tracing
OpenTelemetry SDK setup, Jaeger and Tempo deployment on Kubernetes, sampling strategies, trace-to-log correlation, and common anti-patterns.
Dashboards & Alerting
Grafana dashboard design with the USE/RED/Four Golden Signals methodologies, Alertmanager configuration, SLO-based multi-window burn-rate alerting, and on-call escalation design.