Distributed Tracing
Track the full journey of a request across every service, database call, and external dependency — transforming debugging in microservice architectures from guesswork to surgical precision.
Core Concepts
Understanding the vocabulary of distributed tracing is essential before deploying any tool.
Trace
A Trace represents the complete end-to-end journey of a single request through the entire distributed system. It is identified by a globally unique TraceID (typically a 128-bit random value) that is propagated across every service boundary via HTTP headers or message metadata.
Example TraceID: 4bf92f3577b34da6a3ce929d0e0e4736
Span
A Span is a named, timed operation representing a single unit of work within a trace. Every span has:
- SpanID — unique identifier for this span
- ParentSpanID — the SpanID of the calling span (empty for root spans)
- Operation name — human-readable label (e.g., HTTP GET /api/orders)
- Start time & duration — precise timing information
- Attributes/Tags — key-value metadata (e.g., http.status_code=200, db.type=postgresql)
- Events — timestamped annotations within a span (e.g., cache miss, retry attempt)
- Status — OK, ERROR, or UNSET
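A minimal sketch with the OpenTelemetry Python SDK shows how these fields map onto the span API (the tracer name, attributes, and event are illustrative):
# span_fields.py — illustrative sketch only
from opentelemetry import trace
tracer = trace.get_tracer("docs.example")
with tracer.start_as_current_span("HTTP GET /api/orders") as span:   # operation name
    span.set_attribute("http.status_code", 200)                      # attribute/tag
    span.set_attribute("db.type", "postgresql")
    span.add_event("cache miss", {"cache.key": "orders:42"})         # timestamped event
    span.set_status(trace.StatusCode.OK)                             # status: OK / ERROR / UNSET
# SpanID, ParentSpanID, start time, and duration are assigned automatically by the SDK.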
Parent-Child Relationships
Spans form a tree structure via parent-child relationships. The root span has no parent. Each downstream call creates a child span that references the parent's SpanID. This tree is what tracing UIs visualize as a "waterfall" or "flame graph".
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
│
└── [root] api-gateway: POST /checkout SpanID: a1b2c3d4 ParentID: —
├── auth-service: ValidateToken SpanID: e5f6a7b8 ParentID: a1b2c3d4
├── cart-service: GetCart SpanID: c9d0e1f2 ParentID: a1b2c3d4
│ └── redis: GET cart:usr_9f2a1b SpanID: a3b4c5d6 ParentID: c9d0e1f2
├── inventory-svc: ReserveItems SpanID: e7f8a9b0 ParentID: a1b2c3d4
│ └── postgres: UPDATE inventory SpanID: c1d2e3f4 ParentID: e7f8a9b0
└── payment-svc: ProcessPayment SpanID: a5b6c7d8 ParentID: a1b2c3d4
└── stripe-api: POST /v1/charges SpanID: e9f0a1b2 ParentID: a5b6c7d8
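In the OpenTelemetry SDKs this wiring happens automatically: a span started while another span is active becomes a child of that active span. A minimal Python sketch (operation names are illustrative):
# parent_child_example.py — illustrative sketch
from opentelemetry import trace
tracer = trace.get_tracer("checkout.example")
with tracer.start_as_current_span("POST /checkout"):                  # root span, no parent
    with tracer.start_as_current_span("cart-service.GetCart") as child:
        # child.parent holds the root span's context (TraceID + SpanID);
        # the tracing backend rebuilds the tree from these references.
        pass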
Baggage
Baggage is a mechanism for propagating arbitrary key-value metadata alongside a trace across service boundaries. Unlike span attributes (which are local to a span), baggage flows with every downstream request automatically.
Common use cases: propagating tenant ID, feature flag state, A/B test cohort, or user locale for context-aware debugging.
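A sketch of the OpenTelemetry Python baggage API (the tenant.id key, values, and call_downstream_service are illustrative): set an entry once and read it in any downstream service, because the propagator carries it alongside the traceparent header.
# baggage_example.py — illustrative sketch
from opentelemetry import baggage, context
# Upstream service: attach a baggage entry to the current context.
ctx = baggage.set_baggage("tenant.id", "acme-corp")
token = context.attach(ctx)       # make it current so outbound calls propagate it
try:
    call_downstream_service()     # hypothetical call; instrumented clients forward the baggage header
finally:
    context.detach(token)
# Downstream service, after the incoming request context has been extracted:
tenant = baggage.get_baggage("tenant.id")   # -> "acme-corp"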
Context Propagation
Traces only work if the TraceID and SpanID are propagated across every service boundary. OpenTelemetry uses the W3C TraceContext standard (traceparent header) for HTTP and message queue propagation.
# W3C traceparent header format
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# ^^ version ^^ trace-id (128-bit) ^^ parent-span-id ^^ flags (sampled)
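The header is rarely written by hand; the configured propagator produces and parses it. A quick way to see the format, assuming the Python SDK's tracer provider from the next section is already initialized:
# traceparent_demo.py — illustrative sketch
from opentelemetry import trace
from opentelemetry.propagate import inject
tracer = trace.get_tracer("propagation.example")
with tracer.start_as_current_span("outbound-call"):
    carrier = {}
    inject(carrier)   # the W3C propagator fills in traceparent (and tracestate, if set)
    print(carrier["traceparent"])
    # e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01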
OpenTelemetry Tracing Setup
Python: Manual + Auto Instrumentation
The following example shows both auto-instrumentation (via Flask middleware) and manual span creation for a payment service.
# requirements.txt
opentelemetry-sdk==1.24.0
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-instrumentation-flask==0.45b0
opentelemetry-instrumentation-requests==0.45b0
opentelemetry-instrumentation-sqlalchemy==0.45b0
# tracing.py — initialize OTel tracing for the service
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
def init_tracing(app, db_engine):
resource = Resource.create({
"service.name": "payment-service",
"service.version": "2.4.1",
"deployment.environment": "production",
"cloud.region": "ap-southeast-1",
})
exporter = OTLPSpanExporter(
endpoint="http://otel-collector:4317",
insecure=True,
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Auto-instrument Flask, outbound HTTP, and SQLAlchemy
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db_engine)
return trace.get_tracer("payment-service")
# app.py — manual span enrichment example
import stripe
from flask import Flask, request, jsonify
from opentelemetry import trace
from tracing import init_tracing
# db_engine (the SQLAlchemy engine) and stripe_client are assumed to be created
# elsewhere in the application.
app = Flask(__name__)
tracer = init_tracing(app, db_engine)
@app.route("/api/v1/charge", methods=["POST"])
def charge():
# The route span is auto-created by FlaskInstrumentor
# Add a child span for the Stripe API call with rich attributes
with tracer.start_as_current_span("stripe.charge") as span:
span.set_attribute("payment.gateway", "stripe")
span.set_attribute("payment.amount_cents", request.json["amount"])
span.set_attribute("payment.currency", request.json["currency"])
span.set_attribute("customer.id", request.json["customer_id"])
try:
result = stripe_client.charge(request.json)
span.set_attribute("payment.charge_id", result["id"])
span.set_status(trace.StatusCode.OK)
return jsonify({"charge_id": result["id"]}), 200
except stripe.StripeError as e:
span.record_exception(e)
span.set_status(trace.StatusCode.ERROR, str(e))
return jsonify({"error": str(e)}), 502
Go: OTel SDK with gRPC Service
// tracing.go
package tracing
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)
func InitTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
res := resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("order-service"),
semconv.ServiceVersion("1.3.0"),
attribute.String("deployment.environment", "production"),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
		// Head-based sampling: honor the parent's decision; for new traces, keep 10%.
		// (Policies such as "keep 100% of error traces" require tail-based sampling in the Collector.)
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1),
)),
)
otel.SetTracerProvider(tp)
return tp, nil
}
Jaeger on Kubernetes (Helm)
Jaeger is one of the most widely adopted open-source distributed tracing systems. For production, deploy it with the Jaeger Operator or the jaegertracing/jaeger Helm chart (separate collector and query components) backed by Elasticsearch; the all-in-one image is intended only for local development.
# jaeger-values.yaml — Helm chart: jaegertracing/jaeger
provisionDataStore:
cassandra: false
elasticsearch: false # use externally managed ES
storage:
type: elasticsearch
elasticsearch:
host: elasticsearch-master.monitoring.svc.cluster.local
port: 9200
user: jaeger
usePassword: true
existingSecret: jaeger-es-secret
existingSecretKey: password
indexPrefix: jaeger
nodesWanOnly: false
collector:
replicaCount: 2
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
# Accept OTLP gRPC, OTLP HTTP, and legacy Jaeger Thrift
extraEnv:
- name: COLLECTOR_OTLP_ENABLED
value: "true"
service:
otlp:
grpc:
port: 4317
http:
port: 4318
query:
replicaCount: 1
ingress:
enabled: true
ingressClassName: nginx
annotations:
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: jaeger-basic-auth
hosts:
- host: jaeger.internal.example.com
paths:
- path: /
pathType: Prefix
agent:
enabled: false # use OTel Collector instead
spark:
enabled: true # Spark job for service dependency graph
# Install Jaeger via Helm
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger \
--namespace monitoring \
--create-namespace \
--values jaeger-values.yaml \
--version 0.71.14
Grafana Tempo as Alternative
Grafana Tempo stores traces in object storage (S3, GCS, or Azure Blob), which makes it significantly more cost-efficient than Jaeger with Elasticsearch in high-volume environments. It integrates natively with Grafana and supports TraceQL for querying.
# tempo-values.yaml — Helm chart: grafana/tempo-distributed
tempo:
reportingEnabled: false
storage:
trace:
backend: s3
s3:
bucket: my-tempo-traces
endpoint: s3.ap-southeast-1.amazonaws.com
region: ap-southeast-1
# Use IRSA (IAM Roles for Service Accounts) — no static credentials
insecure: false
distributor:
replicas: 2
resources:
requests:
cpu: 300m
memory: 400Mi
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
jaeger:
protocols:
thrift_http:
endpoint: 0.0.0.0:14268
ingester:
replicas: 3
resources:
requests:
cpu: 500m
memory: 1Gi
config:
replication_factor: 3
compactor:
replicas: 1
config:
compaction:
block_retention: 720h # 30 days
querier:
replicas: 2
queryFrontend:
replicas: 1
metricsGenerator:
enabled: true
config:
processors:
- service-graphs
- span-metrics
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus:9090/api/v1/write
Grafana Datasource for Tempo
# grafana-datasources.yaml — provisioned datasource
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo-query-frontend.monitoring.svc.cluster.local:3100
uid: tempo
jsonData:
httpMethod: GET
tracesToLogsV2:
datasourceUid: loki
spanStartTimeShift: "-1m"
spanEndTimeShift: "1m"
filterByTraceID: true
filterBySpanID: false
customQuery: true
query: '{service_name="${__span.tags["service.name"]}"} | json | trace_id="${__trace.traceId}"'
serviceMap:
datasourceUid: prometheus
nodeGraph:
enabled: true
search:
hide: false
lokiSearch:
datasourceUid: loki
Sampling Strategies
Tracing 100% of requests is often impractical at high throughput (10k+ req/s). Sampling reduces storage costs while preserving investigative value.
Head-Based Sampling
The sampling decision is made at the start of a trace (at the root span), before any downstream spans are created. All child spans inherit the decision.
Pros: Low overhead, simple to implement, no buffering needed.
Cons: Cannot sample based on outcome (e.g., errors) because the decision is made before the request completes.
# OTel Collector: probabilistic head-based sampling (10%)
processors:
probabilistic_sampler:
hash_seed: 22
sampling_percentage: 10.0
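Head-based sampling can also be configured in the SDK rather than in the Collector; a sketch using the OpenTelemetry Python SDK (10% ratio, honoring the caller's decision):
# sdk_sampler.py — illustrative sketch
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Respect the parent's sampled flag; for new root traces, keep 10%.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
The same policy can be selected without code changes through the standard OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 environment variables.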
Rules such as "always keep errors, keep 5% of successes" cannot be expressed with head-based sampling, because the outcome is unknown at decision time; they require the tail_sampling processor described in the next subsection.
Tail-Based Sampling
The sampling decision is made after all spans in a trace are collected, allowing decisions based on the complete trace (e.g., sample all traces with errors or high latency).
Pros: Can guarantee all error traces are captured; much more intelligent filtering.
Cons: Requires buffering entire traces in memory before the decision; more complex to operate.
# OTel Collector: tail-based sampling configuration
processors:
tail_sampling:
decision_wait: 30s # wait 30s for all spans to arrive
num_traces: 50000 # hold up to 50k traces in memory
expected_new_traces_per_sec: 2000
policies:
# Always sample traces with errors
- name: errors-policy
type: status_code
status_code: {status_codes: [ERROR]}
# Always sample slow traces (>2s end-to-end)
- name: slow-traces-policy
type: latency
latency: {threshold_ms: 2000}
# Sample 5% of all other (healthy, fast) traces
- name: probabilistic-policy
type: probabilistic
probabilistic: {sampling_percentage: 5}
# Always sample traces for specific high-value users
- name: vip-user-policy
type: string_attribute
string_attribute:
key: customer.tier
values: [platinum, enterprise]
Trace-to-Log Correlation
The most powerful debugging workflow links distributed traces directly to log lines. This requires injecting trace_id and span_id into every log record.
Python: Inject TraceID into Structlog
# log_config.py — inject OTel context into every log record
import structlog
from opentelemetry import trace
def add_otel_context(logger, method, event_dict):
"""Structlog processor to inject trace context into log records."""
current_span = trace.get_current_span()
if current_span and current_span.is_recording():
ctx = current_span.get_span_context()
event_dict["trace_id"] = format(ctx.trace_id, "032x")
event_dict["span_id"] = format(ctx.span_id, "016x")
event_dict["trace_sampled"] = ctx.trace_flags.sampled
return event_dict
structlog.configure(
processors=[
add_otel_context,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
]
)
This produces structured log lines like:
{
"timestamp": "2026-03-28T09:15:42.183Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "a5b6c7d8e9f0a1b2",
"trace_sampled": true,
"message": "Stripe charge failed",
"stripe_error_code": "card_declined",
"customer_id": "usr_9f2a1b"
}
In Grafana, configure the Loki datasource's "Derived Fields" to automatically detect trace_id in log lines and create a clickable link that jumps directly to the trace in Tempo:
# Loki datasource — derived fields configuration (grafana provisioning)
jsonData:
derivedFields:
- name: TraceID
matcherRegex: '"trace_id":\s*"(\w+)"'
url: "${__value.raw}"
datasourceUid: tempo
urlDisplayLabel: View Trace in Tempo
Common Tracing Anti-Patterns
Even with the right tools, teams frequently fall into these tracing pitfalls.
Anti-Pattern 1: Missing Context Propagation
Problem: A service makes an async call (message queue, background job, cron) without propagating the trace context. The trace "breaks" at the async boundary, leaving orphan spans.
Fix: Inject the W3C traceparent header into all message headers (Kafka, RabbitMQ, SQS). On the consumer side, extract and restore the context before processing.
# Kafka producer — inject trace context into message headers
from opentelemetry.propagate import inject
headers = {}
inject(headers)  # adds the "traceparent" (and, if set, "tracestate") keys
# kafka-python expects header values as bytes, so encode them before sending
producer.send(
    "payment-events",
    value=payload,
    headers=[(k, v.encode("utf-8")) for k, v in headers.items()],
)
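On the consumer side the same propagation API is used in reverse. A sketch assuming kafka-python, where message headers arrive as a list of (key, bytes) tuples and handle_event is a hypothetical processing function:
# Kafka consumer — extract trace context from message headers
from opentelemetry import trace
from opentelemetry.propagate import extract
tracer = trace.get_tracer("payment-consumer")
for message in consumer:   # a KafkaConsumer subscribed to "payment-events"
    # Rebuild a plain dict carrier from the Kafka header tuples.
    carrier = {k: v.decode("utf-8") for k, v in (message.headers or [])}
    ctx = extract(carrier)   # restores the upstream trace context
    # Spans started with this context become children of the producer's span.
    with tracer.start_as_current_span("process payment-event", context=ctx):
        handle_event(message.value)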
Anti-Pattern 2: Overly Noisy Spans
Problem: Auto-instrumentation creates a span for every SQL query, Redis command, and HTTP call, resulting in traces with 500+ spans. The UI becomes unusable and storage costs spike.
Fix: Use the OTel Collector's filter processor to drop low-value spans. Set a minimum duration threshold to remove spans shorter than 1ms (e.g., trivial Redis pings).
processors:
filter/drop_health_checks:
error_mode: ignore
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/metrics"'
- 'attributes["http.target"] == "/readyz"'
  filter/drop_fast_db:
    error_mode: ignore
    traces:
      span:
        # OTTL spans have no "duration" field, so derive it from the timestamps:
        # drop DB spans shorter than 1 ms (1,000,000 ns)
        - 'attributes["db.system"] != nil and (end_time_unix_nano - start_time_unix_nano) < 1000000'
Anti-Pattern 3: Sampling Everything in Development, Nothing in Production
Problem: Teams set sampling to 100% in dev and then reduce to 0.1% in production to control costs. Critical production errors become invisible because the 0.1% never captures the specific failing request.
Fix: Use tail-based sampling in production with a policy that guarantees 100% capture of error and slow traces, and probabilistic sampling for the happy path (5–10%).
Anti-Pattern 4: Tracing Without Attribute Standards
Problem: Service A uses user_id, Service B uses userId, Service C uses customer.id. Cross-service queries in Jaeger/Tempo by user become impossible.
Fix: Establish and enforce a shared attribute schema based on OTel Semantic Conventions. Automate validation in CI pipelines.
# Standardized span attributes — enforce via shared library
SPAN_ATTR_USER_ID = "enduser.id"
SPAN_ATTR_TENANT_ID = "enduser.tenant.id"
SPAN_ATTR_HTTP_METHOD = "http.method" # semconv standard
SPAN_ATTR_HTTP_STATUS = "http.status_code" # semconv standard
SPAN_ATTR_DB_SYSTEM = "db.system" # semconv standard
SPAN_ATTR_DB_STATEMENT = "db.statement" # semconv standard
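One way to enforce the schema is to expose small helpers in the shared library so services never write attribute keys by hand; a sketch (the function name and usage are illustrative):
# Shared library helper — apply standardized end-user attributes
from opentelemetry import trace
def annotate_user(span, user_id, tenant_id=None):
    """Apply the standardized end-user attributes to a span."""
    span.set_attribute(SPAN_ATTR_USER_ID, user_id)
    if tenant_id is not None:
        span.set_attribute(SPAN_ATTR_TENANT_ID, tenant_id)
# Usage inside any service:
annotate_user(trace.get_current_span(), user_id="usr_9f2a1b", tenant_id="acme-corp")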
Anti-Pattern 5: No Trace Retention Policy
Problem: Traces accumulate indefinitely in Elasticsearch/S3. Storage costs grow unbounded and query performance degrades for older data.
Fix: Set explicit retention policies. A common tiering strategy: keep 7 days of full traces, 30 days of error traces only, 90 days of trace metadata (TraceID + root span) for audit purposes.
Every service should set the service.name and deployment.environment resource attributes, every log line should carry a trace_id field for correlation, and a tail-based sampler should capture 100% of error traces.