Site Reliability Engineering
What is SRE?
Site Reliability Engineering (SRE) was created at Google in 2003 by Ben Treynor Sloss, who defined it as "what happens when you ask a software engineer to design an operations function." SRE is a concrete implementation of DevOps philosophy, applying software engineering practices to operations challenges.
The core insight of SRE is that traditional operations — run by teams focused on manual processes, change avoidance, and stability above all — does not scale. As systems grow in complexity, purely manual operation becomes a bottleneck. SRE addresses this by automating operational work, setting explicit reliability targets (SLOs), and using error budgets to keep development velocity and system stability in an explicit, healthy balance.
SRE vs DevOps
SRE and DevOps are complementary, not competing philosophies. DevOps describes the desired outcomes and culture; SRE provides a concrete, opinionated implementation of how to achieve those outcomes.
DevOps Philosophy
- Break down silos between Dev and Ops
- Shared ownership of production systems
- Continuous delivery and deployment
- Feedback loops and learning culture
- "You build it, you run it"
DevOps defines the "what" — the culture and goals
SRE Implementation
- SLOs and error budgets quantify reliability
- Toil reduction through automation
- Blameless post-mortems build learning culture
- Capacity planning and load testing
- On-call rotations with formal runbooks
SRE defines the "how" — specific practices and mechanisms
Core SRE Principles (Google SRE Book)
1. Embracing Risk
100% reliability is neither achievable nor desirable. Each additional nine of availability (99.9% → 99.99%) costs far more than the one before it in engineering time, architectural complexity, and restricted deployment velocity. SRE sets explicit reliability targets (SLOs) and accepts calculated risk within the error budget.
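The cost of each extra nine is easiest to see as allowed downtime per 30-day window:
# Allowed downtime per 30-day window (43,200 minutes) at each availability target
99%     → 43,200 × 0.01    = 432 minutes  (≈ 7.2 hours)
99.9%   → 43,200 × 0.001   = 43.2 minutes
99.99%  → 43,200 × 0.0001  = 4.32 minutes
99.999% → 43,200 × 0.00001 = 0.432 minutes (≈ 26 seconds)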
2. Service Level Objectives (SLOs)
SLOs are the heart of SRE. They define what "reliable enough" means for each service. Every reliability decision — whether to deploy a risky change, whether to add redundancy, whether to invest in automation — flows from the SLO and the error budget derived from it.
3. Eliminating Toil
Toil is manual, repetitive, automatable, devoid of long-term value, and scales linearly with service growth. SRE principle: no more than 50% of an SRE's time should be spent on toil. The rest must be engineering work — automation, tooling, process improvement.
4. Monitoring Distributed Systems
SRE uses structured monitoring with alerting on symptoms (user-visible impact) rather than causes (internal signals). The Four Golden Signals (latency, traffic, errors, saturation) provide a universal monitoring framework applicable to any service.
5. Automation
Automation is how SRE escapes the linear scaling trap. Every manual operational procedure is a candidate for automation. The hierarchy: (1) Don't do it, (2) Do it manually, (3) Automate it, (4) Self-healing systems that prevent the need for action.
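As a sketch of what moving up that hierarchy can look like, a well-understood, mechanical alert can be routed to an auto-remediation webhook instead of a human pager. The Alertmanager snippet below assumes a hypothetical internal remediation service and an illustrative alert name:
# Alertmanager routing: send one known-mechanical alert to auto-remediation
route:
  receiver: oncall-pager
  routes:
    - match:
        alertname: DiskWillFillInFourHours   # illustrative alert name
      receiver: auto-remediation
receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: auto-remediation
    webhook_configs:
      - url: "https://remediation.internal.example/api/v1/expand-disk"   # hypothetical endpoint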
6. Release Engineering
Reliable releases require automated build systems, hermetic builds, rigorous change management, staged rollouts with automatic rollback, and canary deployments. Release engineering is a first-class SRE concern.
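One way to gate a canary automatically is to compare its error rate against the stable fleet before promoting. A minimal PromQL sketch, assuming deployments carry a version label on the http_requests_total metric used elsewhere on this page:
# Canary gate: returns a value (i.e. fails the gate) only when the canary's
# error ratio over the last 10 minutes is more than twice the stable ratio
sum(rate(http_requests_total{job="my-service", version="canary", code=~"5.."}[10m]))
  / sum(rate(http_requests_total{job="my-service", version="canary"}[10m]))
> 2 * (
  sum(rate(http_requests_total{job="my-service", version="stable", code=~"5.."}[10m]))
    / sum(rate(http_requests_total{job="my-service", version="stable"}[10m]))
)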
7. Simplicity
Complexity is the enemy of reliability. Every additional component, integration, or feature adds failure modes. SREs actively push back on unnecessary complexity and advocate for the simplest architecture that meets reliability requirements.
SLI / SLO / SLA — Definitions
Service Level Indicator (SLI)
An SLI is a specific, measurable metric that quantifies one aspect of your service's reliability. It is best expressed as a ratio: the number of "good" events divided by the total number of valid events over a time window.
# SLI examples
Availability SLI = successful_requests / total_requests
Latency SLI = requests_completed_under_threshold / total_requests
Error Rate SLI = 1 - (failed_requests / total_requests)
Durability SLI = objects_readable_when_requested / total_objects
# Prometheus examples
# Availability SLI for HTTP service
(sum(rate(http_requests_total{job="my-service", code!~"5.."}[5m]))
/ sum(rate(http_requests_total{job="my-service"}[5m]))) * 100
# Latency SLI: % of requests completing in under 300ms
(sum(rate(http_request_duration_seconds_bucket{job="my-service", le="0.3"}[5m]))
/ sum(rate(http_request_duration_seconds_count{job="my-service"}[5m]))) * 100
Service Level Objective (SLO)
An SLO is the target value for an SLI over a rolling time window. It defines what "reliable enough" means for your service and its users. SLOs are internal targets — they exist to guide engineering decisions.
| Service Type | SLI | SLO Target | Window |
|---|---|---|---|
| Payment API | Success rate (non-5xx) | 99.95% | 30 days |
| Payment API | Latency (requests served < 500ms) | 99.9% | 30 days |
| User Auth Service | Success rate | 99.99% | 30 days |
| Search Service | Latency (requests served < 200ms) | 99.5% | 28 days |
| Object Storage | Durability (read success) | 99.9999% | 30 days |
| Notification Service | Delivery within 60s | 99.5% | 7 days |
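Targets like these only matter if attainment is measured continuously. One option, sketched here against the http_requests_total metric used above (the rule name is illustrative), is a Prometheus recording rule that precomputes the 30-day good-event ratio:
# Recording rule: 30-day availability ratio for my-service
groups:
  - name: slo-recording-rules
    rules:
      - record: job:slo_availability_ratio:rate30d
        expr: |
          sum(rate(http_requests_total{job="my-service", code!~"5.."}[30d]))
          / sum(rate(http_requests_total{job="my-service"}[30d]))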
Service Level Agreement (SLA)
An SLA is a contractual commitment to customers with defined consequences (financial credits, termination rights) if violated. SLAs are always looser than your internal SLOs — if your SLO is 99.9%, your SLA might be 99.5%. The gap between SLO and SLA is your safety margin.
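In downtime terms over a 30-day window, that margin is concrete:
# SLO 99.9% → 43,200 × 0.001 = 43.2 minutes of tolerated downtime (internal target)
# SLA 99.5% → 43,200 × 0.005 = 216 minutes before contractual penalties apply
# Safety margin ≈ 216 − 43.2 ≈ 173 minutes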
Error Budgets
The error budget is the most powerful concept in SRE. It is the maximum allowable unreliability for a service over a given time window — derived directly from the SLO.
# Error Budget Calculation
# SLO = 99.9% availability over 30 days
total_minutes = 30 * 24 * 60 = 43,200 minutes
error_budget = (1 - 0.999) × 43,200 = 43.2 minutes of allowed downtime
# Equivalently in requests:
# If your service handles 10M requests/month:
allowed_errors = 10,000,000 × 0.001 = 10,000 allowed failures
# Error Budget Burn Rate (alerting)
# If you burn your monthly budget in 1 hour, that's a SEV1
# Burn rate = (error rate / (1 - SLO))
# Example: SLO = 99.9%, current error rate = 5%
burn_rate = 0.05 / 0.001 = 50x (burning budget 50x faster than allowed)
# Time to exhaustion at current burn rate:
time_to_exhaustion = window / burn_rate = 30 days / 50 ≈ 14.4 hours → PAGE IMMEDIATELY
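In practice this is implemented as burn-rate alerting. The sketch below uses the widely cited 14.4x fast-burn threshold (a rate that consumes about 2% of a 30-day budget in one hour) and the metric names from the earlier examples; the short window keeps the alert from firing after the burn has already stopped:
# Fast-burn alert for a 99.9% SLO (allowed error rate = 0.001)
- alert: ErrorBudgetFastBurn
  expr: |
    (sum(rate(http_requests_total{job="my-service", code=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="my-service"}[1h]))) > (14.4 * 0.001)
    and
    (sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="my-service"}[5m]))) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "my-service is burning its 30-day error budget more than 14.4x too fast"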
Error Budget Policy
An error budget policy defines what happens when the budget is consumed. Without a policy, the error budget is just a number. With a policy, it governs real engineering decisions:
- Budget >50% remaining: Normal operations. Full deployment velocity. Feature work and reliability work proceed in parallel.
- Budget 25–50% remaining: Caution. Defer risky deployments. Prioritize reliability improvements. Begin root cause analysis of recent incidents.
- Budget <25% remaining: Feature freeze. No non-emergency deployments. All engineering capacity redirected to reliability. SRE team has veto power.
- Budget exhausted: Emergency mode. Rollback recent changes. All hands on reliability. Executive escalation. Incident review required before any deployment.
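To drive these thresholds from data rather than gut feel, the remaining budget fraction can be queried directly. This sketch assumes the job:slo_availability_ratio:rate30d recording rule from the SLO section and a 99.9% target:
# Fraction of the 30-day error budget remaining
# 1 = untouched, 0 = exhausted, negative = SLO already violated
(0.001 - (1 - job:slo_availability_ratio:rate30d)) / 0.001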
The Four Golden Signals
Defined in the Google SRE book, these four signals capture the most important dimensions of user-visible service health. If you can only monitor four things, monitor these:
1. Latency
The time it takes to service a request. Crucially, track latency for both successful and failed requests separately — fast failures can mask high error rates. Use percentiles (p50, p95, p99, p99.9), not averages.
# PromQL: p99 latency for successful requests
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{
job="my-service", code!~"5.."}[5m])) by (le)
)
2. Traffic
The amount of demand on your system, measured in requests per second, concurrent users, transactions per second, or queries per second. Traffic provides context for all other signals — a high error rate at 100 RPS is very different from the same rate at 10,000 RPS.
# PromQL: requests per second by service
sum(rate(http_requests_total{job="my-service"}[5m])) by (service, method)
# For message queues: messages processed per second
sum(rate(kafka_consumer_records_consumed_total{group="my-consumer"}[5m]))
3. Errors
The rate of requests that fail. Include both explicit failures (HTTP 5xx, gRPC error codes) and implicit failures (HTTP 200 with wrong content, requests that time out). Error rate is often the most direct signal of user impact.
# PromQL: error rate as percentage
(sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="my-service"}[5m]))) * 100
# Alert rule: error rate > 1% for 5 minutes
- alert: HighErrorRate
  expr: |
    (sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
    / sum by (service) (rate(http_requests_total[5m]))) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}"
4. Saturation
How "full" your service is. Saturation measures the most constrained resource — CPU, memory, disk I/O, network bandwidth, thread pool, connection pool. Saturation is a leading indicator: it predicts degradation before users feel it.
# PromQL: CPU saturation
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# Database connection pool saturation
hikaricp_connections_active / hikaricp_connections_max * 100
Toil — Definition and Measurement
Toil is the kind of work tied to running a production service that tends to be: manual, repetitive, automatable, tactical rather than strategic, devoid of enduring value, and that scales linearly with service growth.
Examples of Toil
- Manually restarting a service when it runs out of memory (should be auto-healed)
- Manually rotating credentials on a schedule (should be automated)
- Clicking through the console to provision new environments (should be Terraform)
- Copy-pasting incident data between tools (should be automated notification)
- Running the same runbook steps for the same alert every week (should be auto-remediation)
Toil Budget
Google's SRE principle: SREs should spend no more than 50% of their time on toil. Track it quarterly:
# Toil tracking — weekly time log template
# Each SRE tracks time in these categories:
| Category | Hours | % of Total |
|-----------------|-------|------------|
| Toil — On-call | 8 | 20% |
| Toil — Manual | 4 | 10% |
| Engineering | 16 | 40% |
| Project Work | 8 | 20% |
| Training/Docs | 4 | 10% |
| TOTAL | 40 | 100% |
# Toil: 12h / 40h = 30% ✓ Under 50% budget
# If toil > 50%: escalate to engineering manager, identify automation targets
# Track toil sources in a spreadsheet or Jira:
# - Alert name, frequency per week, time to resolve, automation potential (Y/N/Partial)
# - Prioritize high-frequency + high-time + automatable items
SRE Team Structure
Embedded vs Centralized SRE
| Model | Structure | Pros | Cons | Best for |
|---|---|---|---|---|
| Embedded | SREs sit within product teams | Deep product context; strong team ownership; faster feedback loop | Risk of "going native"; SREs may drift to feature work; no economies of scale in tooling | Large orgs with distinct services; mature engineering culture |
| Centralized | SRE team serves all product teams | Shared tooling; consistent standards; economies of scale; SRE identity preserved | Less product context; potential bottleneck; communication overhead | Smaller orgs; early-stage SRE programs; strong platform engineering |
| Hybrid | Central platform SRE + embedded reliability engineers | Best of both; central tooling + local ownership | Complex org design; coordination overhead | Mid-to-large orgs; recommended for most |
On-Call Rotation Design
- Minimum 8 engineers per rotation — ensures each person is on call no more than 1 week per 2 months
- Shadow rotations — new engineers shadow experienced on-callers for 4 weeks before taking primary
- Primary + Secondary + Escalation — three tiers. Secondary covers if primary is unreachable. Escalation goes to team lead or manager.
- Alert fatigue limit: No more than 2 significant alerts per on-call shift. More than that means your alerting needs tuning, not a stronger SRE.
- Compensation: On-call time is work time. Engineers should not be expected to resolve alerts at 2AM and be in a 9AM meeting. Follow-the-sun rotations across time zones reduce off-hours burden.
Next Steps
- Reliability Engineering — Reliability patterns, capacity planning, performance testing, auto-scaling, and SLO dashboards
- Incident Management — Severity levels, incident commander role, blameless post-mortems, PagerDuty integration
- Chaos Engineering — Chaos Mesh, AWS FIS, LitmusChaos, GameDay planning, building a chaos culture