Distributed Tracing
Track the full journey of a request across every service, database call, and external dependency — transforming debugging in microservice architectures from guesswork to surgical precision.
Core Concepts
Understanding the vocabulary of distributed tracing is essential before deploying any tool.
Trace
A Trace represents the complete end-to-end journey of a single request through the entire distributed system. It is identified by a globally unique TraceID (typically a 128-bit random value) that is propagated across every service boundary via HTTP headers or message metadata.
Example TraceID: 4bf92f3577b34da6a3ce929d0e0e4736
Span
A Span is a named, timed operation representing a single unit of work within a trace. Every span has:
- SpanID — unique identifier for this span
- ParentSpanID — the SpanID of the calling span (empty for root spans)
- Operation name — human-readable label (e.g., HTTP GET /api/orders)
- Start time & duration — precise timing information
- Attributes/Tags — key-value metadata (e.g., http.status_code=200, db.type=postgresql)
- Events — timestamped annotations within a span (e.g., cache miss, retry attempt)
- Status — OK, ERROR, or UNSET
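A minimal sketch with the OpenTelemetry Python SDK shows how these fields map onto the span API (the tracer name, attributes, and event are illustrative):
# span_fields.py — illustrative sketch only
from opentelemetry import trace
tracer = trace.get_tracer("docs.example")
with tracer.start_as_current_span("HTTP GET /api/orders") as span:   # operation name
    span.set_attribute("http.status_code", 200)                      # attribute/tag
    span.set_attribute("db.type", "postgresql")
    span.add_event("cache miss", {"cache.key": "orders:42"})         # timestamped event
    span.set_status(trace.StatusCode.OK)                             # status: OK / ERROR / UNSET
# SpanID, ParentSpanID, start time, and duration are assigned automatically by the SDK.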
Parent-Child Relationships
Spans form a tree structure via parent-child relationships. The root span has no parent. Each downstream call creates a child span that references the parent's SpanID. This tree is what tracing UIs visualize as a "waterfall" or "flame graph".
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
│
└── [root] api-gateway: POST /checkout SpanID: a1b2c3d4 ParentID: —
├── auth-service: ValidateToken SpanID: e5f6a7b8 ParentID: a1b2c3d4
├── cart-service: GetCart SpanID: c9d0e1f2 ParentID: a1b2c3d4
│ └── redis: GET cart:usr_9f2a1b SpanID: a3b4c5d6 ParentID: c9d0e1f2
├── inventory-svc: ReserveItems SpanID: e7f8a9b0 ParentID: a1b2c3d4
│ └── postgres: UPDATE inventory SpanID: c1d2e3f4 ParentID: e7f8a9b0
└── payment-svc: ProcessPayment SpanID: a5b6c7d8 ParentID: a1b2c3d4
└── stripe-api: POST /v1/charges SpanID: e9f0a1b2 ParentID: a5b6c7d8
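In the OpenTelemetry SDKs this wiring happens automatically: a span started while another span is active becomes a child of that active span. A minimal Python sketch (operation names are illustrative):
# parent_child_example.py — illustrative sketch
from opentelemetry import trace
tracer = trace.get_tracer("checkout.example")
with tracer.start_as_current_span("POST /checkout"):                  # root span, no parent
    with tracer.start_as_current_span("cart-service.GetCart") as child:
        # child.parent holds the root span's context (TraceID + SpanID);
        # the tracing backend rebuilds the tree from these references.
        pass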
Baggage
Baggage is a mechanism for propagating arbitrary key-value metadata alongside a trace across service boundaries. Unlike span attributes (which are local to a span), baggage flows with every downstream request automatically.
Common use cases: propagating tenant ID, feature flag state, A/B test cohort, or user locale for context-aware debugging.
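A sketch of the OpenTelemetry Python baggage API (the tenant.id key, values, and call_downstream_service are illustrative): set an entry once and read it in any downstream service, because the propagator carries it alongside the traceparent header.
# baggage_example.py — illustrative sketch
from opentelemetry import baggage, context
# Upstream service: attach a baggage entry to the current context.
ctx = baggage.set_baggage("tenant.id", "acme-corp")
token = context.attach(ctx)       # make it current so outbound calls propagate it
try:
    call_downstream_service()     # hypothetical call; instrumented clients forward the baggage header
finally:
    context.detach(token)
# Downstream service, after the incoming request context has been extracted:
tenant = baggage.get_baggage("tenant.id")   # -> "acme-corp"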
Context Propagation
Traces only work if the TraceID and SpanID are propagated across every service boundary. OpenTelemetry uses the W3C TraceContext standard (traceparent header) for HTTP and message queue propagation.
# W3C traceparent header format
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# ^^ version ^^ trace-id (128-bit) ^^ parent-span-id ^^ flags (sampled)
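The header is rarely written by hand; the configured propagator produces and parses it. A quick way to see the format, assuming the Python SDK's tracer provider from the next section is already initialized:
# traceparent_demo.py — illustrative sketch
from opentelemetry import trace
from opentelemetry.propagate import inject
tracer = trace.get_tracer("propagation.example")
with tracer.start_as_current_span("outbound-call"):
    carrier = {}
    inject(carrier)   # the W3C propagator fills in traceparent (and tracestate, if set)
    print(carrier["traceparent"])
    # e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01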
OpenTelemetry Tracing Setup
Python: Manual + Auto Instrumentation
The following example shows both auto-instrumentation (via Flask middleware) and manual span creation for a payment service.
# requirements.txt
opentelemetry-sdk==1.24.0
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-instrumentation-flask==0.45b0
opentelemetry-instrumentation-requests==0.45b0
opentelemetry-instrumentation-sqlalchemy==0.45b0
# tracing.py — initialize OTel tracing for the service
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
def init_tracing(app, db_engine):
resource = Resource.create({
"service.name": "payment-service",
"service.version": "2.4.1",
"deployment.environment": "production",
"cloud.region": "ap-southeast-1",
})
exporter = OTLPSpanExporter(
endpoint="http://otel-collector:4317",
insecure=True,
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Auto-instrument Flask, outbound HTTP, and SQLAlchemy
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db_engine)
return trace.get_tracer("payment-service")
# app.py — manual span enrichment example
import stripe
from flask import Flask, request, jsonify
from opentelemetry import trace
from tracing import init_tracing
# db_engine (the SQLAlchemy engine) and stripe_client are assumed to be created
# elsewhere in the application.
app = Flask(__name__)
tracer = init_tracing(app, db_engine)
@app.route("/api/v1/charge", methods=["POST"])
def charge():
# The route span is auto-created by FlaskInstrumentor
# Add a child span for the Stripe API call with rich attributes
with tracer.start_as_current_span("stripe.charge") as span:
span.set_attribute("payment.gateway", "stripe")
span.set_attribute("payment.amount_cents", request.json["amount"])
span.set_attribute("payment.currency", request.json["currency"])
span.set_attribute("customer.id", request.json["customer_id"])
try:
result = stripe_client.charge(request.json)
span.set_attribute("payment.charge_id", result["id"])
span.set_status(trace.StatusCode.OK)
return jsonify({"charge_id": result["id"]}), 200
except stripe.StripeError as e:
span.record_exception(e)
span.set_status(trace.StatusCode.ERROR, str(e))
return jsonify({"error": str(e)}), 502
Go: OTel SDK with gRPC Service
// tracing.go
package tracing
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)
func InitTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
res := resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("order-service"),
semconv.ServiceVersion("1.3.0"),
attribute.String("deployment.environment", "production"),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
		// Head-based sampling: honor the parent's decision; for new traces, keep 10%.
		// (Policies such as "keep 100% of error traces" require tail-based sampling in the Collector.)
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1),
)),
)
otel.SetTracerProvider(tp)
return tp, nil
}
Jaeger on Kubernetes (Helm)
Jaeger is one of the most widely adopted open-source distributed tracing systems. For production, deploy it with the Jaeger Operator or the jaegertracing/jaeger Helm chart (separate collector and query components) backed by Elasticsearch; the all-in-one image is intended only for local development.
# jaeger-values.yaml — Helm chart: jaegertracing/jaeger
provisionDataStore:
cassandra: false
elasticsearch: false # use externally managed ES
storage:
type: elasticsearch
elasticsearch:
host: elasticsearch-master.monitoring.svc.cluster.local
port: 9200
user: jaeger
usePassword: true
existingSecret: jaeger-es-secret
existingSecretKey: password
indexPrefix: jaeger
nodesWanOnly: false
collector:
replicaCount: 2
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
# Accept OTLP gRPC, OTLP HTTP, and legacy Jaeger Thrift
extraEnv:
- name: COLLECTOR_OTLP_ENABLED
value: "true"
service:
otlp:
grpc:
port: 4317
http:
port: 4318
query:
replicaCount: 1
ingress:
enabled: true
ingressClassName: nginx
annotations:
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: jaeger-basic-auth
hosts:
- host: jaeger.internal.example.com
paths:
- path: /
pathType: Prefix
agent:
enabled: false # use OTel Collector instead
spark:
enabled: true # Spark job for service dependency graph
# Install Jaeger via Helm
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger \
--namespace monitoring \
--create-namespace \
--values jaeger-values.yaml \
--version 0.71.14
Grafana Tempo as Alternative
Grafana Tempo stores traces in object storage (S3, GCS, or Azure Blob), which makes it significantly more cost-efficient than Jaeger with Elasticsearch in high-volume environments. It integrates natively with Grafana and supports TraceQL for querying.
# tempo-values.yaml — Helm chart: grafana/tempo-distributed
tempo:
reportingEnabled: false
storage:
trace:
backend: s3
s3:
bucket: my-tempo-traces
endpoint: s3.ap-southeast-1.amazonaws.com
region: ap-southeast-1
# Use IRSA (IAM Roles for Service Accounts) — no static credentials
insecure: false
distributor:
replicas: 2
resources:
requests:
cpu: 300m
memory: 400Mi
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
jaeger:
protocols:
thrift_http:
endpoint: 0.0.0.0:14268
ingester:
replicas: 3
resources:
requests:
cpu: 500m
memory: 1Gi
config:
replication_factor: 3
compactor:
replicas: 1
config:
compaction:
block_retention: 720h # 30 days
querier:
replicas: 2
queryFrontend:
replicas: 1
metricsGenerator:
enabled: true
config:
processors:
- service-graphs
- span-metrics
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus:9090/api/v1/write
Grafana Datasource for Tempo
# grafana-datasources.yaml — provisioned datasource
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo-query-frontend.monitoring.svc.cluster.local:3100
uid: tempo
jsonData:
httpMethod: GET
tracesToLogsV2:
datasourceUid: loki
spanStartTimeShift: "-1m"
spanEndTimeShift: "1m"
filterByTraceID: true
filterBySpanID: false
customQuery: true
query: '{service_name="${__span.tags["service.name"]}"} | json | trace_id="${__trace.traceId}"'
serviceMap:
datasourceUid: prometheus
nodeGraph:
enabled: true
search:
hide: false
lokiSearch:
datasourceUid: loki
Sampling Strategies
Tracing 100% of requests is often impractical at high throughput (10k+ req/s). Sampling reduces storage costs while preserving investigative value.
Head-Based Sampling
The sampling decision is made at the start of a trace (at the root span), before any downstream spans are created. All child spans inherit the decision.
Pros: Low overhead, simple to implement, no buffering needed.
Cons: Cannot sample based on outcome (e.g., errors) because the decision is made before the request completes.
# OTel Collector: probabilistic head-based sampling (10%)
processors:
probabilistic_sampler:
hash_seed: 22
sampling_percentage: 10.0
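Head-based sampling can also be configured in the SDK rather than in the Collector; a sketch using the OpenTelemetry Python SDK (10% ratio, honoring the caller's decision):
# sdk_sampler.py — illustrative sketch
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Respect the parent's sampled flag; for new root traces, keep 10%.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
The same policy can be selected without code changes through the standard OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 environment variables.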
Rules such as "always keep errors, keep 5% of successes" cannot be expressed with head-based sampling, because the outcome is unknown at decision time; they require the tail_sampling processor described in the next subsection.
Tail-Based Sampling
The sampling decision is made after all spans in a trace are collected, allowing decisions based on the complete trace (e.g., sample all traces with errors or high latency).
Pros: Can guarantee all error traces are captured; much more intelligent filtering.
Cons: Requires buffering entire traces in memory before the decision; more complex to operate.
# OTel Collector: tail-based sampling configuration
processors:
tail_sampling:
decision_wait: 30s # wait 30s for all spans to arrive
num_traces: 50000 # hold up to 50k traces in memory
expected_new_traces_per_sec: 2000
policies:
# Always sample traces with errors
- name: errors-policy
type: status_code
status_code: {status_codes: [ERROR]}
# Always sample slow traces (>2s end-to-end)
- name: slow-traces-policy
type: latency
latency: {threshold_ms: 2000}
# Sample 5% of all other (healthy, fast) traces
- name: probabilistic-policy
type: probabilistic
probabilistic: {sampling_percentage: 5}
# Always sample traces for specific high-value users
- name: vip-user-policy
type: string_attribute
string_attribute:
key: customer.tier
values: [platinum, enterprise]
Trace-to-Log Correlation
The most powerful debugging workflow links distributed traces directly to log lines. This requires injecting trace_id and span_id into every log record.
Python: Inject TraceID into Structlog
# log_config.py — inject OTel context into every log record
import structlog
from opentelemetry import trace
def add_otel_context(logger, method, event_dict):
"""Structlog processor to inject trace context into log records."""
current_span = trace.get_current_span()
if current_span and current_span.is_recording():
ctx = current_span.get_span_context()
event_dict["trace_id"] = format(ctx.trace_id, "032x")
event_dict["span_id"] = format(ctx.span_id, "016x")
event_dict["trace_sampled"] = ctx.trace_flags.sampled
return event_dict
structlog.configure(
processors=[
add_otel_context,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
]
)
This produces structured log lines like:
{
"timestamp": "2026-03-28T09:15:42.183Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "a5b6c7d8e9f0a1b2",
"trace_sampled": true,
"message": "Stripe charge failed",
"stripe_error_code": "card_declined",
"customer_id": "usr_9f2a1b"
}
In Grafana, configure the Loki datasource's "Derived Fields" to automatically detect trace_id in log lines and create a clickable link that jumps directly to the trace in Tempo:
# Loki datasource — derived fields configuration (grafana provisioning)
jsonData:
derivedFields:
- name: TraceID
matcherRegex: '"trace_id":\s*"(\w+)"'
url: "${__value.raw}"
datasourceUid: tempo
urlDisplayLabel: View Trace in Tempo
Common Tracing Anti-Patterns
Even with the right tools, teams frequently fall into these tracing pitfalls.
Anti-Pattern 1: Missing Context Propagation
Problem: A service makes an async call (message queue, background job, cron) without propagating the trace context. The trace "breaks" at the async boundary, leaving orphan spans.
Fix: Inject the W3C traceparent header into all message headers (Kafka, RabbitMQ, SQS). On the consumer side, extract and restore the context before processing.
# Kafka producer — inject trace context into message headers
from opentelemetry.propagate import inject
headers = {}
inject(headers)  # adds the "traceparent" (and, if set, "tracestate") keys
# kafka-python expects header values as bytes, so encode them before sending
producer.send(
    "payment-events",
    value=payload,
    headers=[(k, v.encode("utf-8")) for k, v in headers.items()],
)
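On the consumer side the same propagation API is used in reverse. A sketch assuming kafka-python, where message headers arrive as a list of (key, bytes) tuples and handle_event is a hypothetical processing function:
# Kafka consumer — extract trace context from message headers
from opentelemetry import trace
from opentelemetry.propagate import extract
tracer = trace.get_tracer("payment-consumer")
for message in consumer:   # a KafkaConsumer subscribed to "payment-events"
    # Rebuild a plain dict carrier from the Kafka header tuples.
    carrier = {k: v.decode("utf-8") for k, v in (message.headers or [])}
    ctx = extract(carrier)   # restores the upstream trace context
    # Spans started with this context become children of the producer's span.
    with tracer.start_as_current_span("process payment-event", context=ctx):
        handle_event(message.value)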
Anti-Pattern 2: Overly Noisy Spans
Problem: Auto-instrumentation creates a span for every SQL query, Redis command, and HTTP call, resulting in traces with 500+ spans. The UI becomes unusable and storage costs spike.
Fix: Use the OTel Collector's filter processor to drop low-value spans. Set a minimum duration threshold to remove spans shorter than 1ms (e.g., trivial Redis pings).
processors:
filter/drop_health_checks:
error_mode: ignore
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/metrics"'
- 'attributes["http.target"] == "/readyz"'
  filter/drop_fast_db:
    error_mode: ignore
    traces:
      span:
        # OTTL spans have no "duration" field, so derive it from the timestamps:
        # drop DB spans shorter than 1 ms (1,000,000 ns)
        - 'attributes["db.system"] != nil and (end_time_unix_nano - start_time_unix_nano) < 1000000'
Anti-Pattern 3: Sampling Everything in Development, Nothing in Production
Problem: Teams set sampling to 100% in dev and then reduce to 0.1% in production to control costs. Critical production errors become invisible because the 0.1% never captures the specific failing request.
Fix: Use tail-based sampling in production with a policy that guarantees 100% capture of error and slow traces, and probabilistic sampling for the happy path (5–10%).
Anti-Pattern 4: Tracing Without Attribute Standards
Problem: Service A uses user_id, Service B uses userId, Service C uses customer.id. Cross-service queries in Jaeger/Tempo by user become impossible.
Fix: Establish and enforce a shared attribute schema based on OTel Semantic Conventions. Automate validation in CI pipelines.
# Standardized span attributes — enforce via shared library
SPAN_ATTR_USER_ID = "enduser.id"
SPAN_ATTR_TENANT_ID = "enduser.tenant.id"
SPAN_ATTR_HTTP_METHOD = "http.method" # semconv standard
SPAN_ATTR_HTTP_STATUS = "http.status_code" # semconv standard
SPAN_ATTR_DB_SYSTEM = "db.system" # semconv standard
SPAN_ATTR_DB_STATEMENT = "db.statement" # semconv standard
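One way to enforce the schema is to expose small helpers in the shared library so services never write attribute keys by hand; a sketch (the function name and usage are illustrative):
# Shared library helper — apply standardized end-user attributes
from opentelemetry import trace
def annotate_user(span, user_id, tenant_id=None):
    """Apply the standardized end-user attributes to a span."""
    span.set_attribute(SPAN_ATTR_USER_ID, user_id)
    if tenant_id is not None:
        span.set_attribute(SPAN_ATTR_TENANT_ID, tenant_id)
# Usage inside any service:
annotate_user(trace.get_current_span(), user_id="usr_9f2a1b", tenant_id="acme-corp")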
Anti-Pattern 5: No Trace Retention Policy
Problem: Traces accumulate indefinitely in Elasticsearch/S3. Storage costs grow unbounded and query performance degrades for older data.
Fix: Set explicit retention policies. A common tiering strategy: keep 7 days of full traces, 30 days of error traces only, 90 days of trace metadata (TraceID + root span) for audit purposes.
Every service should set the service.name and deployment.environment resource attributes, every log line should carry a trace_id field for correlation, and a tail-based sampler should capture 100% of error traces.