Site Reliability Engineering
What is SRE?
Site Reliability Engineering (SRE) was created at Google in 2003 by Ben Treynor Sloss, who defined it as "what happens when you ask a software engineer to design an operations function." SRE is a concrete implementation of DevOps philosophy, applying software engineering practices to operations challenges.
The core insight of SRE is that traditional operations — run by teams focused on manual processes, change avoidance, and stability above all — does not scale. As systems grow in complexity, purely manual operation becomes a bottleneck. SRE addresses this by automating operational work, setting explicit reliability targets (SLOs), and using error budgets to keep development velocity and system stability in an explicit, healthy balance.
SRE vs DevOps
SRE and DevOps are complementary, not competing philosophies. DevOps describes the desired outcomes and culture; SRE provides a concrete, opinionated implementation of how to achieve those outcomes.
DevOps Philosophy
- Break down silos between Dev and Ops
- Shared ownership of production systems
- Continuous delivery and deployment
- Feedback loops and learning culture
- "You build it, you run it"
DevOps defines the "what" — the culture and goals
SRE Implementation
- SLOs and error budgets quantify reliability
- Toil reduction through automation
- Blameless post-mortems build learning culture
- Capacity planning and load testing
- On-call rotations with formal runbooks
SRE defines the "how" — specific practices and mechanisms
Core SRE Principles (Google SRE Book)
1. Embracing Risk
100% reliability is neither achievable nor desirable. Each additional nine of availability (99.9% → 99.99%) costs far more than the one before it in engineering time, architectural complexity, and restricted deployment velocity. SRE sets explicit reliability targets (SLOs) and accepts calculated risk within the error budget.
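The cost of each extra nine is easiest to see as allowed downtime per 30-day window:
# Allowed downtime per 30-day window (43,200 minutes) at each availability target
99%     → 43,200 × 0.01    = 432 minutes  (≈ 7.2 hours)
99.9%   → 43,200 × 0.001   = 43.2 minutes
99.99%  → 43,200 × 0.0001  = 4.32 minutes
99.999% → 43,200 × 0.00001 = 0.432 minutes (≈ 26 seconds)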
2. Service Level Objectives (SLOs)
SLOs are the heart of SRE. They define what "reliable enough" means for each service. Every reliability decision — whether to deploy a risky change, whether to add redundancy, whether to invest in automation — flows from the SLO and the error budget derived from it.
3. Eliminating Toil
Toil is manual, repetitive, automatable, devoid of long-term value, and scales linearly with service growth. SRE principle: no more than 50% of an SRE's time should be spent on toil. The rest must be engineering work — automation, tooling, process improvement.
4. Monitoring Distributed Systems
SRE uses structured monitoring with alerting on symptoms (user-visible impact) rather than causes (internal signals). The Four Golden Signals (latency, traffic, errors, saturation) provide a universal monitoring framework applicable to any service.
5. Automation
Automation is how SRE escapes the linear scaling trap. Every manual operational procedure is a candidate for automation. The hierarchy: (1) Don't do it, (2) Do it manually, (3) Automate it, (4) Self-healing systems that prevent the need for action.
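As a sketch of what moving up that hierarchy can look like, a well-understood, mechanical alert can be routed to an auto-remediation webhook instead of a human pager. The Alertmanager snippet below assumes a hypothetical internal remediation service and an illustrative alert name:
# Alertmanager routing: send one known-mechanical alert to auto-remediation
route:
  receiver: oncall-pager
  routes:
    - match:
        alertname: DiskWillFillInFourHours   # illustrative alert name
      receiver: auto-remediation
receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: auto-remediation
    webhook_configs:
      - url: "https://remediation.internal.example/api/v1/expand-disk"   # hypothetical endpoint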
6. Release Engineering
Reliable releases require automated build systems, hermetic builds, rigorous change management, staged rollouts with automatic rollback, and canary deployments. Release engineering is a first-class SRE concern.
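One way to gate a canary automatically is to compare its error rate against the stable fleet before promoting. A minimal PromQL sketch, assuming deployments carry a version label on the http_requests_total metric used elsewhere on this page:
# Canary gate: returns a value (i.e. fails the gate) only when the canary's
# error ratio over the last 10 minutes is more than twice the stable ratio
sum(rate(http_requests_total{job="my-service", version="canary", code=~"5.."}[10m]))
  / sum(rate(http_requests_total{job="my-service", version="canary"}[10m]))
> 2 * (
  sum(rate(http_requests_total{job="my-service", version="stable", code=~"5.."}[10m]))
    / sum(rate(http_requests_total{job="my-service", version="stable"}[10m]))
)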
7. Simplicity
Complexity is the enemy of reliability. Every additional component, integration, or feature adds failure modes. SREs actively push back on unnecessary complexity and advocate for the simplest architecture that meets reliability requirements.
SLI / SLO / SLA — Definitions
Service Level Indicator (SLI)
An SLI is a specific, measurable metric that quantifies one aspect of your service's reliability. It is best expressed as a ratio: the number of "good" events divided by the total number of valid events over a time window.
# SLI examples
Availability SLI = successful_requests / total_requests
Latency SLI = requests_completed_under_threshold / total_requests
Error Rate SLI = 1 - (failed_requests / total_requests)
Durability SLI = objects_readable_when_requested / total_objects
# Prometheus examples
# Availability SLI for HTTP service
(sum(rate(http_requests_total{job="my-service", code!~"5.."}[5m]))
/ sum(rate(http_requests_total{job="my-service"}[5m]))) * 100
# Latency SLI: % of requests completing in under 300ms
(sum(rate(http_request_duration_seconds_bucket{job="my-service", le="0.3"}[5m]))
/ sum(rate(http_request_duration_seconds_count{job="my-service"}[5m]))) * 100
Service Level Objective (SLO)
An SLO is the target value for an SLI over a rolling time window. It defines what "reliable enough" means for your service and its users. SLOs are internal targets — they exist to guide engineering decisions.
| Service Type | SLI | SLO Target | Window |
|---|---|---|---|
| Payment API | Success rate (non-5xx) | 99.95% | 30 days |
| Payment API | Latency (requests served < 500ms) | 99.9% | 30 days |
| User Auth Service | Success rate | 99.99% | 30 days |
| Search Service | Latency (requests served < 200ms) | 99.5% | 28 days |
| Object Storage | Durability (read success) | 99.9999% | 30 days |
| Notification Service | Delivery within 60s | 99.5% | 7 days |
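Targets like these only matter if attainment is measured continuously. One option, sketched here against the http_requests_total metric used above (the rule name is illustrative), is a Prometheus recording rule that precomputes the 30-day good-event ratio:
# Recording rule: 30-day availability ratio for my-service
groups:
  - name: slo-recording-rules
    rules:
      - record: job:slo_availability_ratio:rate30d
        expr: |
          sum(rate(http_requests_total{job="my-service", code!~"5.."}[30d]))
          / sum(rate(http_requests_total{job="my-service"}[30d]))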
Service Level Agreement (SLA)
An SLA is a contractual commitment to customers with defined consequences (financial credits, termination rights) if violated. SLAs are always looser than your internal SLOs — if your SLO is 99.9%, your SLA might be 99.5%. The gap between SLO and SLA is your safety margin.
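In downtime terms over a 30-day window, that margin is concrete:
# SLO 99.9% → 43,200 × 0.001 = 43.2 minutes of tolerated downtime (internal target)
# SLA 99.5% → 43,200 × 0.005 = 216 minutes before contractual penalties apply
# Safety margin ≈ 216 − 43.2 ≈ 173 minutes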
Error Budgets
The error budget is the most powerful concept in SRE. It is the maximum allowable unreliability for a service over a given time window — derived directly from the SLO.
# Error Budget Calculation
# SLO = 99.9% availability over 30 days
total_minutes = 30 * 24 * 60 = 43,200 minutes
error_budget = (1 - 0.999) × 43,200 = 43.2 minutes of allowed downtime
# Equivalently in requests:
# If your service handles 10M requests/month:
allowed_errors = 10,000,000 × 0.001 = 10,000 allowed failures
# Error Budget Burn Rate (alerting)
# If you burn your monthly budget in 1 hour, that's a SEV1
# Burn rate = (error rate / (1 - SLO))
# Example: SLO = 99.9%, current error rate = 5%
burn_rate = 0.05 / 0.001 = 50x (burning budget 50x faster than allowed)
# Time to exhaustion at current burn rate:
time_to_exhaustion = window / burn_rate = 30 days / 50 ≈ 14.4 hours → PAGE IMMEDIATELY
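In practice this is implemented as burn-rate alerting. The sketch below uses the widely cited 14.4x fast-burn threshold (a rate that consumes about 2% of a 30-day budget in one hour) and the metric names from the earlier examples; the short window keeps the alert from firing after the burn has already stopped:
# Fast-burn alert for a 99.9% SLO (allowed error rate = 0.001)
- alert: ErrorBudgetFastBurn
  expr: |
    (sum(rate(http_requests_total{job="my-service", code=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="my-service"}[1h]))) > (14.4 * 0.001)
    and
    (sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="my-service"}[5m]))) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "my-service is burning its 30-day error budget more than 14.4x too fast"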
Error Budget Policy
An error budget policy defines what happens when the budget is consumed. Without a policy, the error budget is just a number. With a policy, it governs real engineering decisions:
- Budget >50% remaining: Normal operations. Full deployment velocity. Feature work and reliability work proceed in parallel.
- Budget 25–50% remaining: Caution. Defer risky deployments. Prioritize reliability improvements. Begin root cause analysis of recent incidents.
- Budget <25% remaining: Feature freeze. No non-emergency deployments. All engineering capacity redirected to reliability. SRE team has veto power.
- Budget exhausted: Emergency mode. Rollback recent changes. All hands on reliability. Executive escalation. Incident review required before any deployment.
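To drive these thresholds from data rather than gut feel, the remaining budget fraction can be queried directly. This sketch assumes the job:slo_availability_ratio:rate30d recording rule from the SLO section and a 99.9% target:
# Fraction of the 30-day error budget remaining
# 1 = untouched, 0 = exhausted, negative = SLO already violated
(0.001 - (1 - job:slo_availability_ratio:rate30d)) / 0.001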
The Four Golden Signals
Defined in the Google SRE book, these four signals capture the most important dimensions of user-visible service health. If you can only monitor four things, monitor these:
1. Latency
The time it takes to service a request. Crucially, track latency for both successful and failed requests separately — fast failures can mask high error rates. Use percentiles (p50, p95, p99, p99.9), not averages.
# PromQL: p99 latency for successful requests
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{
job="my-service", code!~"5.."}[5m])) by (le)
)
2. Traffic
The amount of demand on your system, measured in requests per second, concurrent users, transactions per second, or queries per second. Traffic provides context for all other signals — a high error rate at 100 RPS is very different from the same rate at 10,000 RPS.
# PromQL: requests per second by service
sum(rate(http_requests_total{job="my-service"}[5m])) by (service, method)
# For message queues: messages processed per second
sum(rate(kafka_consumer_records_consumed_total{group="my-consumer"}[5m]))
3. Errors
The rate of requests that fail. Include both explicit failures (HTTP 5xx, gRPC error codes) and implicit failures (HTTP 200 with wrong content, requests that time out). Error rate is often the most direct signal of user impact.
# PromQL: error rate as percentage
(sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="my-service"}[5m]))) * 100
# Alert rule: error rate > 1% for 5 minutes
- alert: HighErrorRate
  expr: |
    (sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
    / sum by (service) (rate(http_requests_total[5m]))) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}"
4. Saturation
How "full" your service is. Saturation measures the most constrained resource — CPU, memory, disk I/O, network bandwidth, thread pool, connection pool. Saturation is a leading indicator: it predicts degradation before users feel it.
# PromQL: CPU saturation
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# Database connection pool saturation
hikaricp_connections_active / hikaricp_connections_max * 100
Toil — Definition and Measurement
Toil is the kind of work tied to running a production service that tends to be: manual, repetitive, automatable, tactical rather than strategic, devoid of enduring value, and that scales linearly with service growth.
Examples of Toil
- Manually restarting a service when it runs out of memory (should be auto-healed)
- Manually rotating credentials on a schedule (should be automated)
- Clicking through the console to provision new environments (should be Terraform)
- Copy-pasting incident data between tools (should be automated notification)
- Running the same runbook steps for the same alert every week (should be auto-remediation)
Toil Budget
Google's SRE principle: SREs should spend no more than 50% of their time on toil. Track it quarterly:
# Toil tracking — weekly time log template
# Each SRE tracks time in these categories:
| Category | Hours | % of Total |
|-----------------|-------|------------|
| Toil — On-call | 8 | 20% |
| Toil — Manual | 4 | 10% |
| Engineering | 16 | 40% |
| Project Work | 8 | 20% |
| Training/Docs | 4 | 10% |
| TOTAL | 40 | 100% |
# Toil: 12h / 40h = 30% ✓ Under 50% budget
# If toil > 50%: escalate to engineering manager, identify automation targets
# Track toil sources in a spreadsheet or Jira:
# - Alert name, frequency per week, time to resolve, automation potential (Y/N/Partial)
# - Prioritize high-frequency + high-time + automatable items
SRE Team Structure
Embedded vs Centralized SRE
| Model | Structure | Pros | Cons | Best for |
|---|---|---|---|---|
| Embedded | SREs sit within product teams | Deep product context; strong team ownership; faster feedback loop | Risk of "going native"; SREs may drift to feature work; no economies of scale in tooling | Large orgs with distinct services; mature engineering culture |
| Centralized | SRE team serves all product teams | Shared tooling; consistent standards; economies of scale; SRE identity preserved | Less product context; potential bottleneck; communication overhead | Smaller orgs; early-stage SRE programs; strong platform engineering |
| Hybrid | Central platform SRE + embedded reliability engineers | Best of both; central tooling + local ownership | Complex org design; coordination overhead | Mid-to-large orgs; recommended for most |
On-Call Rotation Design
- Minimum 8 engineers per rotation — ensures each person is on call no more than 1 week per 2 months
- Shadow rotations — new engineers shadow experienced on-callers for 4 weeks before taking primary
- Primary + Secondary + Escalation — three tiers. Secondary covers if primary is unreachable. Escalation goes to team lead or manager.
- Alert fatigue limit: No more than 2 significant alerts per on-call shift. More than that means your alerting needs tuning, not a stronger SRE.
- Compensation: On-call time is work time. Engineers should not be expected to resolve alerts at 2AM and be in a 9AM meeting. Follow-the-sun rotations across time zones reduce off-hours burden.
Next Steps
- Reliability Engineering — Reliability patterns, capacity planning, performance testing, auto-scaling, and SLO dashboards
- Incident Management — Severity levels, incident commander role, blameless post-mortems, PagerDuty integration
- Chaos Engineering — Chaos Mesh, AWS FIS, LitmusChaos, GameDay planning, building a chaos culture