Observability

A comprehensive guide to understanding system behavior in production — moving beyond reactive monitoring toward a truly observable infrastructure.

Core Idea: Observability is the ability to understand the internal state of a system from its external outputs. If you can ask any question about your system without deploying new code, you have achieved observability.

Observability vs. Monitoring

These two terms are frequently conflated, but they represent fundamentally different philosophies about how you interact with a production system.

Monitoring

Monitoring is the practice of collecting and evaluating predefined metrics and thresholds. You know in advance what you are looking for.

  • Dashboard of known KPIs
  • Alerting on CPU > 80%
  • Checking uptime every 30 seconds
  • Answers: "Is X broken?"

Limitation: You can only detect failures you anticipated.

Observability

Observability enables ad-hoc exploration of system behavior using high-cardinality, high-dimensionality telemetry data.

  • Slice metrics by any attribute
  • Correlate logs, traces, metrics
  • Ask: "Why is user X experiencing latency?"
  • Answers: "What is broken and why?"

Strength: Understand unknown-unknown failure modes.

Common Misconception: Monitoring tells you that something is wrong. Observability tells you why it is wrong. A mature platform needs both — monitoring provides the alert, observability provides the investigation capability.

The Three Pillars of Observability

The three pillars — Metrics, Logs, and Traces — form the foundational telemetry data types that together enable full system observability.

Pillar 1: Metrics

Metrics are numeric time-series measurements aggregated over time. They are highly efficient to store and query, making them ideal for dashboards and alerting at scale.

Examples:

  • http_requests_total{method="GET", status="200"} — request counter
  • node_cpu_seconds_total — cumulative CPU time counter
  • http_request_duration_seconds — latency histogram (p50, p95, p99)
  • go_goroutines — runtime state gauge

Tools: Prometheus, InfluxDB, Datadog Metrics, CloudWatch, VictoriaMetrics
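
As a sketch of how such metrics are typically produced in application code, here is a minimal example using the Python prometheus_client library. The metric names mirror the list above; the port and the simulated workload are illustrative assumptions, not part of the original examples.

from prometheus_client import Counter, Histogram, start_http_server
import random, time

# Counter: monotonically increasing request count, labeled by method and status
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])

# Histogram: request latency; p50/p95/p99 are derived at query time from the buckets
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

start_http_server(8000)  # expose /metrics for Prometheus to scrape (port is arbitrary)

while True:
    with LATENCY.time():                       # record how long the "request" takes
        time.sleep(random.uniform(0.01, 0.2))  # simulate request work
    REQUESTS.labels(method="GET", status="200").inc()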

Pillar 2: Logs

Logs are timestamped, immutable records of discrete events. They capture the rich context of what happened at a specific point in time and are essential for root cause analysis.

Examples:

{
  "timestamp": "2026-03-28T08:42:11Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "4bf92f3577b34da6",
  "span_id": "00f067aa0ba902b7",
  "user_id": "usr_9f2a1b",
  "message": "Payment gateway timeout after 5000ms",
  "gateway": "stripe",
  "amount_cents": 4999
}

Tools: Loki, Elasticsearch, Splunk, Datadog Logs, CloudWatch Logs
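
For illustration, a minimal Python sketch that emits records in roughly this shape. The service name and trace_id are taken from the example above; the JsonFormatter class is a hypothetical helper written here, not a library API.

import json, logging, sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single structured JSON line."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via the `extra` argument
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment gateway timeout after 5000ms",
    extra={"fields": {"trace_id": "4bf92f3577b34da6", "gateway": "stripe"}},
)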

Pillar 3: Traces

Distributed traces track a single request as it propagates across multiple services. Each unit of work is a "span"; spans are linked into a "trace" via a shared TraceID.

Example trace for a checkout request:

TraceID: 4bf92f3577b34da6
│
├── [api-gateway]         checkout POST /api/v1/order   0ms → 320ms
│   ├── [auth-service]    ValidateJWT                   5ms → 18ms
│   ├── [cart-service]    GetCart(user_id)              20ms → 45ms
│   ├── [inventory-svc]   ReserveItems([sku_101])       47ms → 110ms
│   └── [payment-svc]     ChargeCard(stripe)            112ms → 315ms  ← SLOW
│       └── [stripe-api]  POST /v1/charges              115ms → 313ms

Tools: Jaeger, Grafana Tempo, Zipkin, Datadog APM, AWS X-Ray
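
As a rough sketch of how spans nest and share a trace ID, here is a minimal example using the OpenTelemetry Python SDK (introduced in the next section). The span names loosely mirror the trace above, and the console exporter is used only so the shared trace_id is visible locally.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout so the shared trace_id is easy to see
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")

# The parent span represents the gateway-level request ...
with tracer.start_as_current_span("POST /api/v1/order"):
    # ... and each child span automatically inherits its trace_id
    with tracer.start_as_current_span("ChargeCard") as span:
        ctx = span.get_span_context()
        print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x}")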

Pillar Correlation: The real power of observability comes from correlating across pillars. A Grafana panel showing a latency spike links to a Loki log query filtered by the same time window, which links to a Tempo trace via trace_id embedded in the log line.

OpenTelemetry (OTel)

OpenTelemetry is a CNCF project that provides a vendor-neutral, standardized framework for generating, collecting, and exporting telemetry data (metrics, logs, and traces).

Why OpenTelemetry?

Before OTel, every observability vendor had its own SDK, agent, and wire protocol. Switching from Datadog to Jaeger meant rewriting instrumentation code. OpenTelemetry solves this with a single API layer.

  • Single API & SDK — one instrumentation surface for all signals
  • Auto-instrumentation — instrument popular frameworks (Flask, gRPC, Spring) with zero code changes via agents
  • Pluggable exporters — send to Jaeger, Prometheus, Datadog, or any OTLP-compatible backend
  • Collector — a standalone proxy/processor that receives, transforms, and exports telemetry
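
To illustrate the pluggable-exporter point, a minimal sketch of a manual SDK bootstrap in Python: the tracing API stays the same and only the exporter determines the backend. The collector endpoint matches the one used in the auto-instrumentation example below; insecure=True is assumed for a local, non-TLS collector.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Same SDK and API regardless of backend; only the exporter is swapped
provider = TracerProvider(
    resource=Resource.create({"service.name": "payment-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-service")
with tracer.start_as_current_span("ChargeCard"):
    pass  # application work happens inside the span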

OTel Architecture

┌─────────────────────────────────────────────────────────┐
│                    Your Application                      │
│  ┌─────────────────────────────────────────────────┐    │
│  │  OTel SDK  (auto-instrumented or manual spans)  │    │
│  └────────────────────┬────────────────────────────┘    │
└───────────────────────┼─────────────────────────────────┘
                        │ OTLP (gRPC / HTTP)
                        ▼
          ┌─────────────────────────┐
          │   OTel Collector        │
          │  ┌──────────────────┐   │
          │  │ Receivers        │   │  ← OTLP, Jaeger, Zipkin, Prometheus
          │  │ Processors       │   │  ← batch, filter, sample, enrich
          │  │ Exporters        │   │  ← Jaeger, Tempo, Prometheus, Loki
          │  └──────────────────┘   │
          └────────────┬────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
       Jaeger       Prometheus    Loki
      (traces)      (metrics)    (logs)

Auto-Instrumentation Concept (Python)

# Install OTel packages
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Install instrumentation for libraries detected in the environment (e.g. Flask)
opentelemetry-bootstrap -a install

# Auto-instrument a Flask app — zero code changes required
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py

Collector as a Buffer: Always deploy the OTel Collector as an intermediary rather than sending directly from your app to the backend. The Collector provides batching, retry logic, and the freedom to change backends without touching application code.

MELT Framework

MELT is an expansion of the three pillars that formally includes Events as a distinct signal type, giving a more complete picture of observable data.

M — Metrics

Aggregated numeric time-series. Low storage cost, fast to query. Best for trends and alerting thresholds.

E — Events

Discrete occurrences with rich context: deployments, feature flag changes, circuit breaker trips, config changes. They act as "change markers" on your dashboards.

L — Logs

Timestamped, structured event records. High storage cost, highest detail. Best for root cause analysis and audit trails.

T — Traces

Request-scoped, cross-service execution paths. Moderate storage cost. Best for diagnosing latency and inter-service dependencies.

Observability Maturity Model

Organizations typically progress through three stages of observability maturity. Understanding where your team is helps prioritize investment.

Stage 1: Reactive

Characteristics: Alerts fire after users report problems. Teams SSH into servers to read logs. Dashboards exist but are rarely used proactively. Every outage involves manual log tailing and guesswork.

Signals: Basic uptime checks, server-level CPU/memory metrics, unstructured logs.

Goal: Move from unstructured logs to structured JSON logging; centralize logs into a searchable platform (Loki or ELK).

Stage 2: Proactive

Characteristics: Teams can answer "what is broken?" without user reports. SLOs are defined and tracked. Distributed tracing exists. Runbooks are linked to alerts. On-call rotations are structured.

Signals: Application-level metrics (RED method), distributed traces, structured logs with trace correlation, SLO burn-rate alerts.

Goal: Close the correlation gap between signals; implement trace-to-log and metric-to-trace linking.
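
For reference, the SLO burn-rate alerts mentioned above compare the observed error rate to the error budget implied by the SLO target. A small sketch of the arithmetic; the 99.9% target and 1% error rate are illustrative values.

# Burn rate = observed error rate / error budget (1 - SLO target).
# A burn rate of 1.0 consumes the budget exactly over the SLO window;
# 10.0 exhausts a 30-day budget in roughly 3 days.
def burn_rate(error_rate: float, slo_target: float) -> float:
    return error_rate / (1.0 - slo_target)

print(burn_rate(error_rate=0.01, slo_target=0.999))  # -> ~10.0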

Stage 3: Predictive

Characteristics: Teams can answer "what will break?" before it happens. Anomaly detection flags unusual patterns. Capacity planning uses historical telemetry. Chaos experiments validate observability coverage.

Signals: Full MELT correlation, ML-based anomaly detection, synthetic monitoring, continuous profiling (Pyroscope), business-level SLIs.

Goal: Instrument business metrics alongside infrastructure metrics; build feedback loops from observability data into platform decisions.

Tool Landscape

The modern observability ecosystem is rich. Below is a curated overview of the most widely adopted open-source and commercial tools, organized by primary function.

Tool | Category | Primary Function | Deployment
Prometheus | Metrics | Time-series database with pull-based scraping and PromQL query language | Self-hosted / Managed
Grafana | Visualization | Multi-source dashboard platform; integrates with Prometheus, Loki, Tempo, and 100+ datasources | Self-hosted / Cloud
Jaeger | Tracing | CNCF distributed tracing system; Cassandra/Elasticsearch backend; native OTLP support | Self-hosted / Kubernetes
Grafana Tempo | Tracing | Cost-efficient trace backend using object storage (S3/GCS); integrates natively with Grafana | Self-hosted / Grafana Cloud
Loki | Logs | Log aggregation system inspired by Prometheus; indexes labels, not full text; uses LogQL | Self-hosted / Grafana Cloud
OpenTelemetry | Instrumentation | Vendor-neutral SDK + Collector for generating and routing MELT telemetry | SDK (in-app) + Collector
Datadog | Full-stack (Commercial) | Unified SaaS platform for metrics, logs, traces, synthetics, RUM, and security monitoring | SaaS / Agent-based
New Relic | Full-stack (Commercial) | SaaS observability platform with APM, infrastructure monitoring, browser monitoring, and NRQL | SaaS / Agent-based

Open-Source Stack Recommendation: For a cost-effective, production-grade self-hosted stack, combine Prometheus + Grafana + Loki + Tempo + OTel Collector (the "PLGT" stack). This provides full MELT coverage at near-zero licensing cost, with Grafana serving as the single pane of glass.

Explore Further

Dive deeper into specific observability disciplines:

Distributed Tracing

OpenTelemetry SDK setup, Jaeger and Tempo deployment on Kubernetes, sampling strategies, trace-to-log correlation, and common anti-patterns.

Dashboards & Alerting

Grafana dashboard design with USE/RED/Four Golden Signals, AlertManager configuration, SLO-based multiburn alerting, and on-call escalation design.
