Monitoring & Logging

Observability is the ability to understand the internal state of a system by examining its outputs. Modern observability is built on three pillars: Metrics, Logs, and Traces.

The Three Pillars of Observability

Metrics

Numeric measurements over time — CPU usage, request rate, error rate, latency percentiles. Ideal for alerting and dashboards. Tools: Prometheus, Datadog, CloudWatch.

Logs

Timestamped records of discrete events — errors, requests, state changes. Ideal for debugging and auditing. Tools: ELK Stack, Loki, CloudWatch Logs.

Traces

End-to-end request flows across distributed services. Ideal for latency analysis and dependency mapping. Tools: Jaeger, Zipkin, Tempo, Datadog APM.

Prometheus + Grafana Stack

Deploy with Docker Compose

# docker-compose.monitoring.yml

version: "3.9"

services:

  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.3.0
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - "3000:3000"
    depends_on: [prometheus]
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    ports:
      - "9100:9100"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
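
The compose file mounts ./grafana/provisioning/, which Grafana reads at startup to auto-configure datasources and dashboards. A minimal datasource file might look like this (the exact path under provisioning/ is an illustrative choice):

# grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true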

Prometheus Configuration

# prometheus/prometheus.yml

global:
  scrape_interval: 15s        # How often to scrape targets
  evaluation_interval: 15s    # How often to evaluate rules
  external_labels:
    cluster: production
    region: ap-southeast-1

# Alerting
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# Load alerting rules
rule_files:
  - "rules/*.yml"

# Scrape configs
scrape_configs:

  # Prometheus itself
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporter (host metrics)
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]

  # Application metrics
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
    metrics_path: /metrics
    scrape_interval: 10s

  # Kubernetes pods (via service discovery)
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
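
The myapp job above expects the application to expose Prometheus metrics on /metrics. A minimal sketch using the official prometheus_client library (the metric names, labels, and simulated request loop are illustrative, not part of the original config):

# app_metrics.py — expose http_requests_total and a latency histogram on :8080
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["path"]
)

def handle_request(path: str) -> None:
    start = time.perf_counter()
    status = "200" if random.random() > 0.05 else "500"   # simulated outcome
    LATENCY.labels(path=path).observe(time.perf_counter() - start)
    REQUESTS.labels(method="GET", path=path, status=status).inc()

if __name__ == "__main__":
    start_http_server(8080)   # serves GET /metrics for the Prometheus scrape job
    while True:
        handle_request("/api/orders")
        time.sleep(0.1)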

Alerting Rules

# prometheus/rules/application.yml

groups:
  - name: application
    interval: 30s
    rules:

      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ printf \"%.2f\" $value | humanizePercentage }} over the last 5 minutes."
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value | humanizeDuration }}."

      # Pod not ready
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} memory usage > 85%"

  - name: infrastructure
    rules:

      # Disk running out
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining on {{ $labels.mountpoint }}."

      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

Alertmanager Configuration

# alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx"

route:
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-default

  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    # All alerts → Slack
    - match_re:
        severity: "warning|critical"
      receiver: slack-default

receivers:
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

  - name: pagerduty
    pagerduty_configs:
      - service_key: $PAGERDUTY_SERVICE_KEY
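        # Note: Alertmanager does not expand environment variables in this file;
        # substitute the real key at deploy time (e.g. with envsubst or a secrets template).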
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"

inhibit_rules:
  # Suppress warning if critical is firing for the same alert
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, namespace]
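
The routing tree and receiver definitions can be checked with amtool before deploying (the severity label in the routes test is just an example):

# Validate the Alertmanager configuration
amtool check-config alertmanager/alertmanager.yml

# Show which receiver an alert with a given label set would be routed to
amtool config routes test --config.file=alertmanager/alertmanager.yml severity=critical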

Useful PromQL Queries

# Request rate (per second, averaged over 5 min)
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P50/P95/P99 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# CPU usage per container
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, container)

# Memory usage (working set)
container_memory_working_set_bytes{container!=""}

# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])

# Deployment availability
kube_deployment_status_replicas_available / kube_deployment_spec_replicas

ELK Stack — Log Management

Deploy ELK with Docker Compose

# docker-compose.elk.yml

version: "3.9"

services:

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test: curl -s http://localhost:9200/_cluster/health | grep -qE '"status":"(green|yellow)"'
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.13.0
    container_name: logstash
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    ports:
      - "5044:5044"    # Beats input
      - "5000:5000"    # TCP input
    depends_on:
      elasticsearch:
        condition: service_healthy

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    container_name: kibana
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.13.0
    container_name: filebeat
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on: [logstash]

volumes:
  elasticsearch_data:

Logstash Pipeline

# logstash/pipeline/main.conf

input {
  beats {
    port => 5044
  }
  tcp {
    port => 5000
    codec => json_lines
  }
}

filter {
  # Parse JSON logs
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
  }

  # Parse nginx access logs
  if [fields][type] == "nginx-access" {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} - %{DATA:user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status_code:int} %{NUMBER:bytes:int}'
      }
    }
    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
    }
    # Flag server errors (status_code is already an integer via the grok pattern)
    if [status_code] >= 500 {
      mutate { add_field => { "is_error" => "true" } }
    }
  }

  # Drop health check logs to reduce noise
  if [request] =~ "/health" {
    drop {}
  }

  # Enrich with GeoIP
  geoip {
    source => "client_ip"
  }

  # Add environment tag
  mutate {
    add_field => { "environment" => "${ENVIRONMENT:production}" }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{[fields][service]}-%{+YYYY.MM.dd}"
    ilm_enabled => true
    ilm_rollover_alias => "logs"
    ilm_policy => "logs-policy"
  }
}
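
With the stack running, the TCP input can be smoke-tested by pushing one JSON line to port 5000 and checking that it lands in Elasticsearch. The example fields, the nc flag, and the use of jq are illustrative (nc flags vary between netcat implementations):

# Send one structured log line through the tcp/json_lines input
echo '{"fields":{"service":"myapp"},"level":"info","message":"hello from tcp"}' | nc -w1 localhost 5000

# Confirm a logs-* index received the document
curl -s 'http://localhost:9200/logs-*/_search?q=message:hello' | jq '.hits.total'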

Filebeat Configuration

# filebeat/filebeat.yml

filebeat.inputs:
  # Docker container logs
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata:
          host: "unix:///var/run/docker.sock"
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true
    fields:
      type: docker
    fields_under_root: true

  # Application log files
  - type: log
    paths:
      - /var/log/myapp/*.log
    fields:
      service: myapp
      type: application
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

output.logstash:
  hosts: ["logstash:5044"]
  loadbalance: true

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
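
Filebeat can verify its own configuration and its connectivity to Logstash from inside the container:

# Check config syntax and that the Logstash output is reachable
docker compose -f docker-compose.elk.yml exec filebeat filebeat test config
docker compose -f docker-compose.elk.yml exec filebeat filebeat test output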

Loki + Grafana (Lightweight Alternative)

# docker-compose.loki.yml

version: "3.9"

services:
  loki:
    image: grafana/loki:2.9.5
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml

  promtail:
    image: grafana/promtail:2.9.5
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    depends_on: [loki]

volumes:
  loki_data:

# promtail/promtail-config.yml

server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        target_label: container
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: service
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: msg
      - labels:
          level:
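
Once Promtail is shipping logs, they can be queried from Grafana's Explore view with LogQL. A few illustrative queries using the labels created by the relabel and pipeline stages above:

# All error-level lines for one compose service
{service="myapp", level="error"}

# Error log rate per service over the last 5 minutes
sum by (service) (rate({level="error"}[5m]))

# Full-text filter plus on-the-fly JSON parsing
{container="myapp"} |= "timeout" | json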

Structured Logging Best Practices

Application Logging Example (Python)

import structlog
import logging

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),  # Output as JSON
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

# Usage — structured, searchable logs
logger.info("request_received",
    method="POST",
    path="/api/orders",
    user_id="usr_123",
    trace_id="abc-456",
)

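# Example error log; assumes this runs inside an `except ... as exc:` block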
logger.error("database_error",
    operation="insert",
    table="orders",
    error=str(exc),
    duration_ms=42.5,
    retry_count=3,
)

✅ Logging Best Practices:
  • Use structured JSON logs — machine-parseable and searchable
  • Always include trace/correlation IDs to follow requests across services (see the sketch after this list)
  • Log at appropriate levels: DEBUG (dev), INFO (business events), WARN (degraded), ERROR (failures)
  • Never log sensitive data — passwords, tokens, PII
  • Set log retention policies — 30 days hot, 90 days warm, archive after
  • Use sampling for high-traffic debug logs to control volume and cost
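
One way to attach a correlation ID to every log line in a request scope is structlog's contextvars support. A minimal sketch; the header name and request-handling hook are illustrative and depend on your framework:

import uuid
import structlog

# merge_contextvars must be early in the processor chain so bound values appear in output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)

logger = structlog.get_logger()

def handle_request(headers: dict) -> None:
    # Reuse an upstream trace ID if present, otherwise mint a new one
    trace_id = headers.get("x-trace-id", str(uuid.uuid4()))
    structlog.contextvars.bind_contextvars(trace_id=trace_id)
    try:
        logger.info("request_received", path="/api/orders")  # trace_id is added automatically
    finally:
        structlog.contextvars.clear_contextvars()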

SLI, SLO, and SLA

SLI — Service Level Indicator

A specific metric that measures service behavior. Examples: request success rate, latency P99, error rate, availability.

# SLI: availability over 30 days
sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))

SLO — Service Level Objective

A target value for an SLI. Example: 99.9% of requests succeed over a 30-day rolling window. SLOs define your reliability budget.

SLA — Service Level Agreement

A contractual commitment to customers, usually with financial penalties for breach. The SLO is internal; the SLA is external and should always be less ambitious than the SLO. For example, a team might hold an internal SLO of 99.95% (about 21.6 minutes of downtime per 30 days) while the contractual SLA promises 99.9% (43.2 minutes).

Error Budget

# Error budget = 1 - SLO target
# For 99.9% SLO over 30 days:
# Error budget = 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime

# Track remaining error budget
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
) / (1 - 0.999)    # 0.999 = SLO target

💡 Tip: When your error budget is exhausted, stop new feature work and focus entirely on reliability. When the budget is healthy, invest in new features. Error budgets align business and engineering priorities.
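
The error-budget expressions above are often wired into burn-rate alerts. A sketch of a single fast-burn rule for the 99.9% SLO, following the multiwindow burn-rate approach from the Google SRE workbook (the file name and thresholds are commonly cited defaults, not part of the original config):

# prometheus/rules/slo.yml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate sustained for 1h consumes ~2% of a 30-day budget
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning too fast (1h window)"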

Distributed Tracing with Jaeger

# Deploy Jaeger (all-in-one for dev/test)
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 6831:6831/udp \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:1.55

# Instrument Python app with OpenTelemetry
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    result = process_order(order_id)
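
For a trace to span multiple services, the context must travel with outbound calls. A sketch using W3C trace-context propagation via opentelemetry.propagate (the inventory URL and the use of requests are illustrative):

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_inventory_service(order_id: str):
    with tracer.start_as_current_span("call_inventory") as span:
        span.set_attribute("order.id", order_id)
        headers = {}
        inject(headers)  # adds traceparent/tracestate headers for the downstream service
        return requests.get(
            f"http://inventory:8000/stock/{order_id}", headers=headers, timeout=5
        )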

Next Steps