ADR Examples
ADR: Production Database Selection
Context and Problem Statement
The platform requires a relational database to store user accounts, application state, transaction records, and configuration data. The team has been using SQLite for local development, which is not suitable for production workloads. As we approach the first production release, we need to select a production-grade relational database management system (RDBMS) that will serve as the canonical data store for the foreseeable future.
The selection must account for: the team's existing expertise, operational requirements (high availability, backup, point-in-time recovery), cloud deployment on both AWS and GCP, and the need to store semi-structured data (product metadata, user preferences) alongside strictly relational data. The system is expected to handle up to 10,000 requests/second at peak with a data volume of up to 5TB in the next three years.
Decision Drivers
- Must be open source with a permissive licence (no per-core commercial licensing cost)
- Must support ACID transactions and complex JOIN queries across normalised schemas
- Must have mature, production-tested high availability solutions (multi-node, automatic failover)
- Must support semi-structured data storage to avoid a secondary NoSQL database for most use cases
- Team has existing experience with the chosen system or a very similar one
- Must be available as a fully managed service on both AWS and GCP for future managed migration
Considered Options
| Option | Licence | JSONB Support | HA Solution | Managed Cloud |
|---|---|---|---|---|
| PostgreSQL 15 (chosen) | PostgreSQL Licence (permissive) | Excellent — first-class JSONB type with indexing | Patroni, Citus, Stolon | Amazon RDS, Aurora, Cloud SQL |
| MySQL 8 / MariaDB | GPL / LGPL (MariaDB) | Limited — JSON type, no GIN index equivalent | InnoDB Cluster, Galera | Amazon RDS, Cloud SQL |
| Amazon Aurora PostgreSQL | Proprietary (AWS-only) | Via PostgreSQL compatibility | Built-in Aurora HA | AWS only (not GCP) |
Decision Outcome
Chosen option: PostgreSQL 15, because it uniquely satisfies all decision drivers: it is open source, has first-class JSONB support with GIN indexing (eliminating the need for a separate document store for most use cases), has a mature and well-understood HA ecosystem (Patroni), and is available as a managed service on both AWS (RDS/Aurora) and GCP (Cloud SQL).
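The JSONB-with-GIN capability cited above can be illustrated with a short sketch; the table and column names here are illustrative, not taken from the actual schema:

```sql
-- Sketch: semi-structured metadata stored alongside relational columns
CREATE TABLE products (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sku         text NOT NULL UNIQUE,
    price_cents integer NOT NULL,
    metadata    jsonb NOT NULL DEFAULT '{}'
);

-- GIN index makes containment queries on the JSONB column efficient
CREATE INDEX products_metadata_gin ON products USING GIN (metadata);

-- Containment query (@>) served by the GIN index
SELECT sku, price_cents
FROM products
WHERE metadata @> '{"colour": "red", "tags": ["sale"]}';
```

This is the pattern that avoids a secondary document store: strictly relational columns keep constraints and JOINs, while the `jsonb` column absorbs schema-flexible attributes.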
Positive Consequences
- Single database technology for both relational and semi-structured data — eliminates a second datastore for the majority of use cases
- Rich extension ecosystem: PostGIS for geospatial data, pgvector for embedding similarity search, pg_cron for scheduled jobs
- Strong community and a 30-year track record of production stability
- No per-core commercial licensing fees — cost scales only with compute and storage
- Portable: the same database can run on-premises, on AWS, or on GCP without application changes
Negative Consequences
Operational complexity: the team must learn EXPLAIN ANALYZE, VACUUM tuning, and connection pool management with PgBouncer. Self-managed Patroni HA requires operational investment. This is mitigated by a plan to migrate to Amazon RDS or Cloud SQL (managed) within 12 months of initial deployment.
Implementation Notes
# High Availability: Deploy Patroni for self-managed HA
# Patroni manages leader election and automatic failover via etcd consensus
# Recommended patroni.yml snippet:
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
postgresql:
  listen: 0.0.0.0:5432
  connect_address: "${PATRONI_POSTGRESQL_CONNECT_ADDRESS}:5432"
  data_dir: /data/patroni
  parameters:
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    max_replication_slots: 5
    wal_log_hints: "on"
# Connection pooling: PgBouncer in transaction mode
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
ADR: Infrastructure as Code Tooling
Context and Problem Statement
Infrastructure has been provisioned manually through the AWS and GCP web consoles. This approach is not reproducible, not auditable, and extremely error-prone. Two incidents in the past quarter were caused by manual configuration drift between environments. As the team grows from 4 to 12 engineers over the next year, and as we add a second cloud provider (GCP), the manual approach becomes completely untenable.
We need an Infrastructure as Code (IaC) tool that allows infrastructure to be defined in version-controlled code, reviewed via pull requests, and applied automatically through a CI/CD pipeline. The solution must support both AWS and GCP, and must be adoptable by engineers without deep cloud provider API knowledge.
Decision Drivers
- Must support both AWS and GCP from a single codebase — we are multi-cloud and will remain so
- Declarative model preferred — define desired state, let the tool figure out the diff
- Remote state management with locking to enable team collaboration without conflicts
- Large ecosystem of community modules to avoid reinventing common patterns
- Must integrate with GitHub Actions CI/CD for automated plan and apply
- Team has prior experience with the tool or the learning curve is manageable within one sprint
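The CI/CD driver above could be satisfied with a plan-on-pull-request workflow along these lines; the workflow name, trigger paths, Terraform version, and directory layout are assumptions for illustration:

```yaml
# Sketch: run fmt and plan on every infrastructure pull request
name: terraform-plan
on:
  pull_request:
    paths: ["environments/**"]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.6"
      - name: Terraform fmt check
        run: terraform fmt -check -recursive
      - name: Terraform init and plan
        working-directory: environments/staging/us-east-1/vpc
        run: |
          terraform init -input=false
          terraform plan -input=false -no-color
```

A separate, manually approved workflow would run `terraform apply` after merge, so the plan output reviewed in the PR is the change that ships.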
Considered Options
| Option | Language | Multi-cloud | Declarative | Maturity |
|---|---|---|---|---|
| Terraform (OpenTofu) (chosen) | HCL (HashiCorp Configuration Language) | Excellent — 3000+ providers | Yes | Very high — 10+ years, industry standard |
| Pulumi | TypeScript, Python, Go, C# | Good — major providers covered | Yes — desired state declared in general-purpose languages | Medium — growing but smaller ecosystem |
| AWS CDK | TypeScript, Python, Java | AWS-only (cdk8s for K8s) | Imperative with L2 constructs | High for AWS; not applicable for GCP |
| Ansible | YAML (Jinja2 templates) | Good via cloud modules | Procedural — no state management | Very high — but better for config mgmt than IaC |
Decision Outcome
Chosen option: Terraform with remote state in S3 (AWS) / GCS (GCP), because it is the industry-standard declarative IaC tool with the widest multi-cloud provider support, a large ecosystem of community modules (Terraform Registry), and the largest pool of community knowledge and job-market expertise. The team has prior Terraform experience on AWS, reducing the learning curve.
Positive Consequences
- All infrastructure changes go through Git pull request review — full audit trail and peer review of every change
- Reproducible environments: staging can be created identically to production with a single variable override
- Remote state in S3 (with DynamoDB locking) or GCS (built-in locking) prevents concurrent apply conflicts
- Terraform Registry provides battle-tested modules for VPC, EKS, GKE, RDS — reducing custom code
- Community size means most problems have documented solutions and StackOverflow answers
Negative Consequences
HCL is a new language for part of the team and has its own idioms (for_each, dynamic blocks). State management requires discipline: the state file must never be edited manually, and state migrations require care. All engineers must be trained on terraform plan review before being granted apply permissions. Remote state backend must be bootstrapped manually before Terraform can manage itself.
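The HCL idioms mentioned above (for_each and dynamic blocks) can be sketched briefly; the resource names and values are illustrative only:

```hcl
# for_each creates one resource instance per element of a set
resource "aws_s3_bucket" "artifacts" {
  for_each = toset(["staging", "prod"])
  bucket   = "company-artifacts-${each.key}"
}

# dynamic blocks generate repeated nested blocks from a collection
resource "aws_security_group" "web" {
  name = "web"
  dynamic "ingress" {
    for_each = [80, 443]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}
```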
State Management Architecture
# Remote state backend configuration (AWS)
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/us-east-1/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789:key/xxxxxxxx"
    dynamodb_table = "terraform-state-lock"
  }
  required_providers {
    aws    = { source = "hashicorp/aws", version = "~> 5.0" }
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  required_version = ">= 1.6.0"
}
# Remote state backend configuration (GCP)
# backend-gcp.tf
terraform {
  backend "gcs" {
    bucket = "company-terraform-state-gcp"
    prefix = "prod/europe-west1/gke"
  }
}
# Workspace strategy for environment isolation
# Use separate state files per environment, not workspaces, for blast-radius isolation
# Structure:
# environments/
#   prod/
#     us-east-1/
#       vpc/  → terraform.tfstate in s3://.../prod/us-east-1/vpc/
#       eks/  → terraform.tfstate in s3://.../prod/us-east-1/eks/
#       rds/  → terraform.tfstate in s3://.../prod/us-east-1/rds/
#   staging/
#     us-east-1/
#       vpc/
#       eks/
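With state split per component, one component reads another's outputs via the terraform_remote_state data source. A minimal sketch (the output name private_subnet_ids is an assumption; the bucket and key follow the layout above):

```hcl
# Inside the eks component: read the vpc component's state
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "prod/us-east-1/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Example consumption: expose the VPC's private subnets for this component
output "subnet_ids" {
  value = data.terraform_remote_state.vpc.outputs.private_subnet_ids
}
```

This keeps the blast radius of an apply limited to one component while still allowing cross-component wiring.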
ADR: GitOps Deployment Model
Context and Problem Statement
Our current CI/CD pipeline (GitHub Actions) builds container images and pushes them to ECR, then directly applies Kubernetes manifests to the cluster using kubectl apply with a service account token embedded in GitHub Secrets. This model has several critical problems: the CI/CD pipeline has direct write access to the production cluster (a significant security risk), there is no reconciliation loop to detect and correct configuration drift, rollbacks require re-running the pipeline, and there is no single source of truth for what is actually deployed.
Three recent incidents were caused by partial CI/CD pipeline failures that left the cluster in an inconsistent state. We need a deployment model that is observable, self-healing, auditable, and follows security best practices (principle of least privilege — the cluster pulls from Git, rather than CI/CD pushing to the cluster).
Decision Drivers
- Eliminate direct kubectl access from CI/CD pipelines to production clusters
- Continuous reconciliation: automatically detect and correct configuration drift
- Git as the single source of truth for all deployed configuration
- Must support multi-cluster deployments (currently 2 clusters, expected to grow to 5+)
- Must support RBAC to allow different teams to manage their own applications without cluster-admin
- Web UI for visibility without requiring kubectl access for developers
Considered Options
| Option | UI | Multi-cluster | RBAC | App-of-Apps |
|---|---|---|---|---|
| ArgoCD (chosen) | Excellent — rich web UI with deployment diff view | Native — hub-and-spoke model | RBAC via OIDC (Okta, Dex) | Yes — App of Apps and ApplicationSets |
| Flux v2 | None built-in (CLI only, Weave GitOps UI is separate) | Yes — via multi-tenancy | Kubernetes RBAC native | Yes — Kustomization controller |
| Jenkins X | Basic web UI | Limited | Limited | Partial |
| Manual kubectl in CI | N/A | Via scripts | N/A | N/A |
Decision Outcome
Chosen option: ArgoCD, because it provides the best combination of a production-grade web UI (critical for developer visibility), native multi-cluster support via the hub-and-spoke model, OIDC-integrated RBAC, and the App of Apps / ApplicationSet patterns for managing many applications at scale. The ArgoCD community is large and the CNCF graduation status demonstrates production maturity.
Positive Consequences
- GitHub Actions no longer needs direct cluster credentials — it only pushes image tags to Git; ArgoCD pulls and applies
- Continuous reconciliation detects configuration drift within 3 minutes (default sync interval) and auto-corrects it
- All deployments are recorded in Git history — rollback is a git revert
- Developers can see the diff between desired state (Git) and live state (cluster) in the ArgoCD UI without kubectl access
- ApplicationSets allow templating of applications across environments and clusters — one template defines deployment across dev/staging/prod
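The "CI only pushes image tags to Git" flow could look like the following GitHub Actions step; the repository layout, overlay path, registry URL, and GITOPS_TOKEN secret are assumptions:

```yaml
# Sketch: update the image tag in gitops-config; ArgoCD syncs the change
- name: Bump image tag in gitops-config
  env:
    GITOPS_TOKEN: ${{ secrets.GITOPS_TOKEN }}
  run: |
    git clone "https://x-access-token:${GITOPS_TOKEN}@github.com/company/gitops-config.git"
    cd gitops-config/services/myapp/overlays/prod-us-east-1
    kustomize edit set image myapp=123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:${GITHUB_SHA}
    git config user.name "ci-bot"
    git config user.email "ci-bot@company.example"
    git commit -am "myapp: deploy ${GITHUB_SHA}"
    git push
```

Note the pipeline never touches the cluster: its only credential is a Git token scoped to the gitops-config repository.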
Negative Consequences
Deployments can no longer be performed with an ad-hoc kubectl apply. This introduces process discipline that some engineers will initially resist. ArgoCD itself must be running and healthy for deployments to function — this adds an operational dependency. ArgoCD itself must be bootstrapped manually (or via Terraform) the first time.
Architecture
# App of Apps pattern — root ArgoCD Application manages all other Applications
# File: argocd/root-app.yaml (in the gitops-config repository)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops-config.git
    targetRevision: main
    path: apps/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
---
# ApplicationSet for multi-environment deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://github.com/company/gitops-config.git
              revision: main
              directories:
                - path: services/*
          - list:
              elements:
                - cluster: prod-us-east-1
                  url: https://prod-us-east-1.example.com
                - cluster: staging-us-east-1
                  url: https://staging-us-east-1.example.com
  template:
    metadata:
      name: '{{path.basename}}-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/gitops-config.git
        targetRevision: main
        path: '{{path}}/overlays/{{cluster}}'
      destination:
        server: '{{url}}'
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
ADR: Secrets Management Platform
Context and Problem Statement
Application secrets (database passwords, API keys, TLS certificates, service account tokens) are currently stored in Kubernetes ConfigMaps and Secrets. While Kubernetes Secrets are base64-encoded (not encrypted at rest by default in our current setup), they are frequently synced to developer laptops, committed to Git accidentally, and shared in Slack messages. A security audit in Q4 2023 identified 12 instances of secrets committed to repositories, and a pentest found two Kubernetes Secrets containing production database credentials accessible to a compromised pod with default service account permissions.
We need a dedicated secrets management platform that provides: encryption at rest and in transit, audit logging of all secret access, dynamic secret generation (short-lived credentials), automatic rotation, and integration with our Kubernetes workloads without requiring developers to handle secret values directly.
Decision Drivers
- Cloud-agnostic — must work identically on AWS and GCP without vendor lock-in
- Dynamic secrets — ability to generate short-lived, auto-expiring database credentials to eliminate static long-lived passwords
- Full audit log of every secret read — required for PCI-DSS and SOC 2 compliance
- Kubernetes-native integration — secrets injected into pods without developer code changes
- PKI and certificate authority (CA) management — replaces manual certificate issuance
- Must not require application code changes to consume secrets
Considered Options
| Option | Cloud-Agnostic | Dynamic Secrets | K8s Integration | Self-Hosted |
|---|---|---|---|---|
| HashiCorp Vault (self-hosted on K8s) (chosen) | Yes — works everywhere | Yes — DB, AWS, GCP, PKI, SSH | Vault Agent Injector (sidecar) | Yes — full control |
| AWS Secrets Manager | No — AWS-only | Limited — RDS rotation only | External Secrets Operator | No — managed service |
| GCP Secret Manager | No — GCP-only | None built-in | External Secrets Operator | No — managed service |
| SOPS + Age/KMS | Yes | No | Manual / Helm secrets plugin | Yes — file-based |
Decision Outcome
Chosen option: HashiCorp Vault, self-hosted on Kubernetes using the official Helm chart, because it is the only option that satisfies all decision drivers simultaneously: full cloud-agnosticism, dynamic database secrets (eliminating static credentials), a complete PKI engine, a comprehensive audit log, and native Kubernetes integration via the Vault Agent Injector sidecar pattern — all without requiring application code changes.
Positive Consequences
- Dynamic database credentials: each pod gets a unique PostgreSQL username and password with a 1-hour TTL — even if credentials are captured, they expire within the hour
- Full audit log: every vault read secret/myapp/db-password is logged with the Kubernetes service account, pod name, and timestamp — meets PCI-DSS Requirement 10
- Vault Agent Injector: secrets are written to the pod's filesystem as files via an injected init container (and optional sidecar) — no changes to application code, no secrets in environment variables
- PKI engine replaces manual certificate management: Vault issues short-lived certificates signed by an internal CA, eliminating the risk of forgotten, non-rotated certificates
- Single secrets backend for all cloud providers and on-premises systems
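The dynamic database credentials described above are configured roughly as follows; the mount path, role name, and connection details are assumptions, and the TTLs match the 1-hour credential lifetime described above:

```shell
# Sketch: enable the database secrets engine and define a short-lived role
vault secrets enable database

# Register the PostgreSQL connection (Vault rotates its own admin password later)
vault write database/config/app-postgres \
    plugin_name=postgresql-database-plugin \
    allowed_roles="myapp-prod" \
    connection_url="postgresql://{{username}}:{{password}}@postgres:5432/app?sslmode=require" \
    username="vault-admin" \
    password="initial-password"

# Role: each read of database/creds/myapp-prod mints a fresh 1-hour user
vault write database/roles/myapp-prod \
    db_name=app-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
    default_ttl=1h \
    max_ttl=4h
```

Each `vault read database/creds/myapp-prod` then returns a unique, expiring username/password pair, which is what the Agent Injector annotations below consume.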
Deployment Architecture
# Deploy Vault on Kubernetes using official Helm chart
helm repo add hashicorp https://helm.releases.hashicorp.com
# values.yaml for production HA deployment
cat > vault-values.yaml <<'EOF'
server:
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: true
      config: |
        cluster_name = "production"
        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "https://vault-2.vault-internal:8200"
          }
        }
        seal "awskms" {
          region     = "us-east-1"
          kms_key_id = "arn:aws:kms:us-east-1:123456789:key/xxxxxxxx"
        }
        listener "tcp" {
          address       = "[::]:8200"
          tls_disable   = false
          tls_cert_file = "/vault/userconfig/vault-tls/tls.crt"
          tls_key_file  = "/vault/userconfig/vault-tls/tls.key"
        }
        api_addr = "https://vault.vault.svc.cluster.local:8200"
        cluster_addr = "https://POD_IP:8201"
  auditStorage:
    enabled: true
    size: 10Gi
injector:
  enabled: true
  replicas: 2
EOF
helm install vault hashicorp/vault \
    -n vault --create-namespace \
    -f vault-values.yaml
# Enable Kubernetes auth method
vault auth enable kubernetes
vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443"
# Example Vault Agent Injector annotations on a Pod
# The sidecar reads the Vault path and writes the secret to /vault/secrets/
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "myapp-prod"
    vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/myapp-prod"
    vault.hashicorp.com/agent-inject-template-db-creds: |
      {{- with secret "database/creds/myapp-prod" -}}
      DB_USERNAME={{ .Data.username }}
      DB_PASSWORD={{ .Data.password }}
      {{- end -}}