Monitoring & Logging
Observability is the ability to understand the internal state of a system by examining its outputs. Modern observability is built on three pillars: Metrics, Logs, and Traces.
The Three Pillars of Observability
Metrics
Numeric measurements over time — CPU usage, request rate, error rate, latency percentiles. Ideal for alerting and dashboards. Tools: Prometheus, Datadog, CloudWatch.
Logs
Timestamped records of discrete events — errors, requests, state changes. Ideal for debugging and auditing. Tools: ELK Stack, Loki, CloudWatch Logs.
Traces
End-to-end request flows across distributed services. Ideal for latency analysis and dependency mapping. Tools: Jaeger, Zipkin, Tempo, Datadog APM.
Prometheus + Grafana Stack
Deploy with Docker Compose
# docker-compose.monitoring.yml
version: "3.9"
services:
prometheus:
image: prom/prometheus:v2.51.0
container_name: prometheus
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules/:/etc/prometheus/rules/
- prometheus_data:/prometheus
command:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.retention.time=30d
- --web.enable-lifecycle
ports:
- "9090:9090"
restart: unless-stopped
grafana:
image: grafana/grafana:10.3.0
container_name: grafana
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning/:/etc/grafana/provisioning/
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
GF_USERS_ALLOW_SIGN_UP: "false"
ports:
- "3000:3000"
depends_on: [prometheus]
restart: unless-stopped
node-exporter:
image: prom/node-exporter:v1.7.0
container_name: node-exporter
pid: host
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --path.rootfs=/rootfs
- --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
ports:
- "9100:9100"
restart: unless-stopped
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
ports:
- "9093:9093"
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
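To bring the stack up, start the Compose file and confirm Prometheus is healthy and scraping its targets. This assumes the file is saved as docker-compose.monitoring.yml and GRAFANA_PASSWORD is exported in your shell (the value below is a placeholder).
# Start the monitoring stack
export GRAFANA_PASSWORD=changeme
docker compose -f docker-compose.monitoring.yml up -d
# Verify Prometheus is up and its scrape targets are healthy
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
# Grafana UI: http://localhost:3000 (admin / $GRAFANA_PASSWORD)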
Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate rules
external_labels:
cluster: production
region: ap-southeast-1
# Alerting
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
# Load alerting rules
rule_files:
- "rules/*.yml"
# Scrape configs
scrape_configs:
# Prometheus itself
- job_name: prometheus
static_configs:
- targets: ["localhost:9090"]
# Node exporter (host metrics)
- job_name: node
static_configs:
- targets: ["node-exporter:9100"]
# Application metrics
- job_name: myapp
static_configs:
- targets: ["myapp:8080"]
metrics_path: /metrics
scrape_interval: 10s
# Kubernetes pods (via service discovery)
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
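The myapp job above assumes the application itself serves Prometheus metrics on port 8080 at /metrics. A minimal sketch of what that looks like using the prometheus_client library; the handler and its simulated traffic are illustrative, but the metric names match those used in the alerting rules below.
# app_metrics.py - expose /metrics on :8080 for the myapp scrape job (sketch)
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

def handle_request(path):
    start = time.time()
    status = "500" if random.random() < 0.02 else "200"  # simulated outcome
    REQUESTS.labels(method="GET", path=path, status=status).inc()
    LATENCY.labels(path=path).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8080)  # serves /metrics, matching the scrape config above
    while True:
        handle_request("/api/orders")
        time.sleep(0.1)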
Alerting Rules
# prometheus/rules/application.yml
groups:
- name: application
interval: 30s
rules:
# High error rate
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High HTTP error rate"
description: "Error rate is {{ printf \"%.2f\" $value | humanizePercentage }} over the last 5 minutes."
runbook: "https://wiki.example.com/runbooks/high-error-rate"
# High latency
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "High request latency on {{ $labels.service }}"
description: "P95 latency is {{ $value | humanizeDuration }}."
# Pod not ready
- alert: PodNotReady
expr: |
kube_pod_status_ready{condition="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
# High memory usage
- alert: HighMemoryUsage
expr: |
(container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.85
for: 10m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.container }} memory usage > 85%"
- name: infrastructure
rules:
# Disk running out
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Only {{ $value | humanizePercentage }} disk space remaining on {{ $labels.mountpoint }}."
# Node down
- alert: NodeDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: "https://hooks.slack.com/services/xxx"
route:
group_by: [alertname, cluster, namespace]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: slack-default
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: pagerduty
continue: true
# All alerts → Slack
- match_re:
severity: "warning|critical"
receiver: slack-default
receivers:
- name: slack-default
slack_configs:
- channel: "#alerts"
title: "{{ .GroupLabels.alertname }}"
text: |
{{ range .Alerts }}
*{{ .Annotations.summary }}*
{{ .Annotations.description }}
{{ end }}
send_resolved: true
- name: pagerduty
pagerduty_configs:
- service_key: $PAGERDUTY_SERVICE_KEY # Alertmanager does not expand environment variables; substitute this value at deploy time (e.g. with envsubst)
description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
inhibit_rules:
# Suppress warning if critical is firing for the same alert
- source_match:
severity: critical
target_match:
severity: warning
equal: [alertname, namespace]
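amtool, which ships with Alertmanager, can lint this file and show which receivers a given set of labels would be routed to. The labels in the routing test below are illustrative.
# Lint the config and dry-run the routing tree
amtool check-config alertmanager/alertmanager.yml
amtool config routes test --config.file=alertmanager/alertmanager.yml severity=critical alertname=HighErrorRate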
Useful PromQL Queries
# Request rate (per second, averaged over 5 min)
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# P50/P95/P99 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# CPU usage per container
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, container)
# Memory usage (working set)
container_memory_working_set_bytes{container!=""}
# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])
# Deployment availability
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
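Ratio and quantile queries like these get expensive when dashboards re-evaluate them constantly; Prometheus recording rules precompute them on the evaluation interval. A sketch follows, where the rule names use the level:metric:operations naming convention and are otherwise an assumption.
# prometheus/rules/recording.yml (sketch)
groups:
  - name: recording
    rules:
      - record: job:http_requests:error_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))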
ELK Stack — Log Management
Deploy ELK with Docker Compose
# docker-compose.elk.yml
version: "3.9"
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
container_name: elasticsearch
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- "ES_JAVA_OPTS=-Xms1g -Xmx1g"
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
ulimits:
memlock:
soft: -1
hard: -1
healthcheck:
test: curl -s http://localhost:9200/_cluster/health | grep -qE '"status":"(green|yellow)"'
interval: 30s
timeout: 10s
retries: 5
logstash:
image: docker.elastic.co/logstash/logstash:8.13.0
container_name: logstash
volumes:
- ./logstash/pipeline:/usr/share/logstash/pipeline
- ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
ports:
- "5044:5044" # Beats input
- "5000:5000" # TCP input
depends_on:
elasticsearch:
condition: service_healthy
kibana:
image: docker.elastic.co/kibana/kibana:8.13.0
container_name: kibana
environment:
ELASTICSEARCH_HOSTS: http://elasticsearch:9200
ports:
- "5601:5601"
depends_on:
elasticsearch:
condition: service_healthy
filebeat:
image: docker.elastic.co/beats/filebeat:8.13.0
container_name: filebeat
user: root
volumes:
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
depends_on: [logstash]
volumes:
elasticsearch_data:
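Bringing the stack up works the same way; the Elasticsearch health check gates Logstash and Kibana, so give it a minute before querying.
# Start ELK and confirm the cluster and UIs are reachable
docker compose -f docker-compose.elk.yml up -d
curl -s http://localhost:9200/_cluster/health?pretty
curl -s 'http://localhost:9200/_cat/indices?v'   # log indices appear once events flow in
# Kibana UI: http://localhost:5601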
Logstash Pipeline
# logstash/pipeline/main.conf
input {
beats {
port => 5044
}
tcp {
port => 5000
codec => json_lines
}
}
filter {
# Parse JSON logs
if [message] =~ /^\{/ {
json {
source => "message"
target => "parsed"
}
}
# Parse nginx access logs
if [fields][type] == "nginx-access" {
grok {
match => {
"message" => '%{IPORHOST:client_ip} - %{DATA:user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status_code:int} %{NUMBER:bytes:int}'
}
}
date {
match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
}
# status_code is already an integer thanks to the :int suffix in the grok pattern above
if [status_code] >= 500 {
mutate { add_field => { "is_error" => "true" } }
}
}
# Drop health check logs to reduce noise
if [request] =~ "/health" {
drop {}
}
# Enrich with GeoIP
geoip {
source => "client_ip"
}
# Add environment tag
mutate {
add_field => { "environment" => "${ENVIRONMENT:production}" }
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{[fields][service]}-%{+YYYY.MM.dd}"
ilm_enabled => true
ilm_rollover_alias => "logs"
ilm_policy => "logs-policy"
}
}
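A quick way to exercise this pipeline is to push a JSON line into the TCP input on port 5000 and then search the resulting index; the service name in the payload is an assumption that feeds the index pattern above.
# Send a test event through the json_lines TCP input, then query Elasticsearch
echo '{"level":"info","msg":"pipeline smoke test","fields":{"service":"myapp"}}' | nc localhost 5000
curl -s 'http://localhost:9200/logs-myapp-*/_search?q=msg:smoke&pretty'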
Filebeat Configuration
# filebeat/filebeat.yml
filebeat.inputs:
# Docker container logs
- type: container
paths:
- /var/lib/docker/containers/*/*.log
processors:
- add_docker_metadata:
host: "unix:///var/run/docker.sock"
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
fields:
type: docker
fields_under_root: true
# Application log files
- type: log
paths:
- /var/log/myapp/*.log
fields:
service: myapp
type: application
multiline:
pattern: '^\d{4}-\d{2}-\d{2}'
negate: true
match: after
output.logstash:
hosts: ["logstash:5044"]
loadbalance: true
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
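Filebeat can check its own configuration and its connection to Logstash before any logs are shipped.
# Validate config and output connectivity from inside the running container
docker exec filebeat filebeat test config
docker exec filebeat filebeat test output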
Loki + Grafana (Lightweight Alternative)
# docker-compose.loki.yml
version: "3.9"
services:
loki:
image: grafana/loki:2.9.5
container_name: loki
ports:
- "3100:3100"
volumes:
- ./loki/loki-config.yml:/etc/loki/loki-config.yml
- loki_data:/loki
command: -config.file=/etc/loki/loki-config.yml
promtail:
image: grafana/promtail:2.9.5
container_name: promtail
volumes:
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock
- ./promtail/promtail-config.yml:/etc/promtail/config.yml
command: -config.file=/etc/promtail/config.yml
depends_on: [loki]
volumes:
loki_data:
# promtail/promtail-config.yml
server:
http_listen_port: 9080
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: [__meta_docker_container_name]
target_label: container
- source_labels: [__meta_docker_container_label_com_docker_compose_service]
target_label: service
pipeline_stages:
- json:
expressions:
level: level
msg: msg
- labels:
level:
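Once Loki is added as a Grafana data source, logs are queried with LogQL, which intentionally mirrors PromQL. A few typical queries against the labels produced by the relabeling above (myapp is the same illustrative service name used earlier):
# All logs for one compose service, filtered to lines containing "error"
{service="myapp"} |= "error"
# Parse JSON log lines and keep only level=error
{container="myapp"} | json | level="error"
# Log-derived metric: error lines per second over the last 5 minutes
sum(rate({service="myapp"} |= "error" [5m]))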
Structured Logging Best Practices
Application Logging Example (Python)
import structlog
import logging
import sys
# structlog's stdlib LoggerFactory routes output through the logging module, so it needs a handler and level
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)
# Configure structured logging
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.JSONRenderer(), # Output as JSON
],
logger_factory=structlog.stdlib.LoggerFactory(),
cache_logger_on_first_use=True,
)
logger = structlog.get_logger()
# Usage — structured, searchable logs
logger.info("request_received",
method="POST",
path="/api/orders",
user_id="usr_123",
trace_id="abc-456",
)
logger.error("database_error",
operation="insert",
table="orders",
error=str(exc),
duration_ms=42.5,
retry_count=3,
)
- Use structured JSON logs — machine-parseable and searchable
- Always include trace/correlation IDs to follow requests across services (see the contextvars sketch after this list)
- Log at appropriate levels: DEBUG (dev), INFO (business events), WARN (degraded), ERROR (failures)
- Never log sensitive data — passwords, tokens, PII
- Set log retention policies — 30 days hot, 90 days warm, archive after
- Use sampling for high-traffic debug logs to control volume and cost
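For the correlation-ID point above, structlog's contextvars support is one way to bind a per-request ID so every subsequent log line carries it automatically. This sketch assumes structlog.contextvars.merge_contextvars has been added to the processor list in the configuration shown earlier; the header name and handler shape are assumptions.
# correlation_id.py - bind a request-scoped trace_id to all log lines (sketch)
import uuid
import structlog

def handle_request(headers):
    # Reuse an upstream ID if the caller sent one, otherwise generate a fresh one
    trace_id = headers.get("x-request-id", str(uuid.uuid4()))
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(trace_id=trace_id)

    log = structlog.get_logger()
    log.info("request_received", path="/api/orders")   # trace_id is merged in automatically
    log.info("request_completed", status=200)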
SLI, SLO, and SLA
SLI — Service Level Indicator
A specific metric that measures service behavior. Examples: request success rate, latency P99, error rate, availability.
# SLI: availability over 30 days
sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
SLO — Service Level Objective
A target value for an SLI. Example: 99.9% of requests succeed over a 30-day rolling window. SLOs define your reliability budget.
SLA — Service Level Agreement
A contractual commitment to customers, usually with financial penalties for breach. The SLO is internal; the SLA is external. The SLA should always be less ambitious than the SLO, so the error budget is exhausted internally before contractual penalties come into play.
Error Budget
# Error budget = 1 - SLO target
# For 99.9% SLO over 30 days:
# Error budget = 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime
# Remaining error budget as a fraction (1.0 = untouched, 0 = exhausted)
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
) / (1 - 0.999) # 0.999 = SLO target
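Error budgets become actionable through burn-rate alerts: instead of alerting on a fixed error threshold, alert when the budget is being consumed too fast. A simplified single-window sketch for the 99.9% SLO follows; the 14.4x factor is the commonly used fast-burn threshold, which would exhaust the monthly budget in about two days.
# prometheus/rules/slo.yml (sketch)
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          > (14.4 * (1 - 0.999))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning 14.4x faster than the 99.9% SLO allows"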
Distributed Tracing with Jaeger
# Deploy Jaeger (all-in-one for dev/test)
docker run -d --name jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 6831:6831/udp \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one:1.55
# Instrument Python app with OpenTelemetry
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
# Create spans
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
span.set_attribute("user.id", user_id)
result = process_order(order_id)
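Without a Resource, spans appear in Jaeger under a default unknown_service name; setting service.name groups them per service (the name myapp is illustrative).
# Give the provider a service name so Jaeger groups spans correctly
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

provider = TracerProvider(
    resource=Resource.create({SERVICE_NAME: "myapp"})
)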
Next Steps
- CI/CD Pipelines — Automate builds and deployments
- DevOps Overview — Back to fundamentals