ADR Examples

Four complete example ADRs written in MADR format. These illustrate decisions commonly encountered in cloud-native infrastructure projects and demonstrate the depth of reasoning expected in each section.

ADR-0001: Use PostgreSQL as Primary Relational Database

Status: Accepted
Date: 2023-04-12
Author: Lê Bình Phương
File: ADR-0001-use-postgresql-as-primary-database.md

Context and Problem Statement

The platform requires a relational database to store user accounts, application state, transaction records, and configuration data. The team has been using SQLite for local development, which is not suitable for production workloads. As we approach the first production release, we need to select a production-grade relational database management system (RDBMS) that will serve as the canonical data store for the foreseeable future.

The selection must account for: the team's existing expertise, operational requirements (high availability, backup, point-in-time recovery), cloud deployment on both AWS and GCP, and the need to store semi-structured data (product metadata, user preferences) alongside strictly relational data. The system is expected to handle up to 10,000 requests/second at peak with a data volume of up to 5TB in the next three years.

Decision Drivers

  • Must be open source with a permissive licence (no per-core commercial licensing cost)
  • Must support ACID transactions and complex JOIN queries across normalised schemas
  • Must have mature, production-tested high availability solutions (multi-node, automatic failover)
  • Must support semi-structured data storage to avoid a secondary NoSQL database for most use cases
  • Team has existing experience with the chosen system or a very similar one
  • Must be available as a fully managed service on both AWS and GCP, so we can later migrate to managed hosting

Considered Options

  • PostgreSQL 15 (chosen). Licence: PostgreSQL Licence (permissive). JSONB support: excellent — first-class JSONB type with indexing. HA solution: Patroni, Citus, Stolon. Managed cloud: Amazon RDS, Aurora, Cloud SQL.
  • MySQL 8 / MariaDB. Licence: GPL / LGPL (MariaDB). JSONB support: limited — JSON type, no GIN index equivalent. HA solution: InnoDB Cluster, Galera. Managed cloud: Amazon RDS, Cloud SQL.
  • Amazon Aurora PostgreSQL. Licence: proprietary (AWS-only). JSONB support: via PostgreSQL compatibility. HA solution: built-in Aurora HA. Managed cloud: AWS only (not GCP).

Decision Outcome

Chosen option: PostgreSQL 15, because it uniquely satisfies all decision drivers: it is open source, has first-class JSONB support with GIN indexing (eliminating the need for a separate document store for most use cases), has a mature and well-understood HA ecosystem (Patroni), and is available as a managed service on both AWS (RDS/Aurora) and GCP (Cloud SQL).
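
A minimal illustration of the JSONB driver (table, database, and column names below are hypothetical, not part of the platform schema): a JSONB column holds the semi-structured metadata and a GIN index keeps containment queries fast.

# Hypothetical sketch: semi-structured metadata alongside relational columns
psql -d appdb <<'SQL'
CREATE TABLE products (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sku      text NOT NULL UNIQUE,
    metadata jsonb NOT NULL DEFAULT '{}'
);
-- GIN index makes containment queries (@>) efficient
CREATE INDEX products_metadata_gin ON products USING gin (metadata);
-- e.g. find all refurbished products without a separate document store
SELECT sku FROM products WHERE metadata @> '{"condition": "refurbished"}';
SQL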

Positive Consequences

  • Single database technology for both relational and semi-structured data — eliminates a second datastore for the majority of use cases
  • Rich extension ecosystem: PostGIS for geospatial data, pgvector for embedding similarity search, pg_cron for scheduled jobs
  • Strong community and a 30-year track record of production stability
  • No per-core commercial licensing fees — cost scales only with compute and storage
  • Portable: the same database can run on-premises, on AWS, or on GCP without application changes

Accepted Trade-offs: The team needs to build PostgreSQL-specific operational expertise: query optimisation with EXPLAIN ANALYZE, VACUUM tuning, and connection pool management with PgBouncer. Self-managed Patroni HA requires operational investment. This is mitigated by a plan to migrate to Amazon RDS or Cloud SQL (managed) within 12 months of initial deployment.
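
For the EXPLAIN ANALYZE workflow mentioned above, a minimal sketch (reusing the hypothetical products table from the previous example):

# Inspect the real execution plan, timings, and buffer usage for a query
psql -d appdb <<'SQL'
EXPLAIN (ANALYZE, BUFFERS)
SELECT sku FROM products WHERE metadata @> '{"condition": "refurbished"}';
SQL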

Implementation Notes

# High Availability: Deploy Patroni for self-managed HA
# Patroni manages leader election and automatic failover via etcd consensus

# Recommended patroni.yml snippet:
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: "${PATRONI_POSTGRESQL_CONNECT_ADDRESS}:5432"
  data_dir: /data/patroni
  parameters:
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    max_replication_slots: 5
    wal_log_hints: "on"

# Connection pooling: PgBouncer in transaction mode (pgbouncer.ini, separate file from patroni.yml)
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
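
Applications then connect to PgBouncer on port 6432 rather than to PostgreSQL directly on 5432; for example (host name and credentials are placeholders):

# Application connections go through the pooler, not straight to PostgreSQL
psql "host=pgbouncer.db.svc.cluster.local port=6432 dbname=appdb user=app_rw"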

ADR-0002: Use Terraform for All Cloud Infrastructure Provisioning

Status: Accepted
Date: 2023-06-03
Author: Lê Bình Phương
File: ADR-0002-use-terraform-for-infrastructure-provisioning.md

Context and Problem Statement

Infrastructure has been provisioned manually through the AWS and GCP web consoles. This approach is not reproducible, not auditable, and extremely error-prone. Two incidents in the past quarter were caused by manual configuration drift between environments. As the team grows from 4 to 12 engineers over the next year, and as we add a second cloud provider (GCP), the manual approach becomes completely untenable.

We need an Infrastructure as Code (IaC) tool that allows infrastructure to be defined in version-controlled code, reviewed via pull requests, and applied automatically through a CI/CD pipeline. The solution must support both AWS and GCP, and must be adoptable by engineers without deep cloud provider API knowledge.

Decision Drivers

  • Must support both AWS and GCP from a single codebase — we are multi-cloud and will remain so
  • Declarative model preferred — define desired state, let the tool figure out the diff
  • Remote state management with locking to enable team collaboration without conflicts
  • Large ecosystem of community modules to avoid reinventing common patterns
  • Must integrate with GitHub Actions CI/CD for automated plan and apply
  • Team has prior experience with the tool or the learning curve is manageable within one sprint

Considered Options

  • Terraform / OpenTofu (chosen). Language: HCL (HashiCorp Configuration Language). Multi-cloud: excellent — 3000+ providers. Declarative: yes. Maturity: very high — 10+ years, industry standard.
  • Pulumi. Language: TypeScript, Python, Go, C#. Multi-cloud: good — major providers covered. Declarative: yes (desired state expressed in imperative general-purpose languages). Maturity: medium — growing but smaller ecosystem.
  • AWS CDK. Language: TypeScript, Python, Java. Multi-cloud: AWS-only (cdk8s for K8s). Declarative: imperative code with L2 constructs, synthesised to declarative CloudFormation. Maturity: high for AWS; not applicable for GCP.
  • Ansible. Language: YAML (Jinja2 templates). Multi-cloud: good via cloud modules. Declarative: procedural — no state management. Maturity: very high — but better suited to configuration management than IaC.

Decision Outcome

Chosen option: Terraform with remote state in S3 (AWS) / GCS (GCP), because it is the industry-standard declarative IaC tool with the widest multi-cloud provider support, a large ecosystem of community modules (Terraform Registry), and the largest pool of community knowledge and job-market expertise. The team has prior Terraform experience on AWS, reducing the learning curve.

Positive Consequences

  • All infrastructure changes go through Git pull request review — full audit trail and peer review of every change
  • Reproducible environments: staging can be created identically to production with a single variable override
  • Remote state in S3/GCS with state locking (DynamoDB for the S3 backend; native locking in the GCS backend) prevents concurrent apply conflicts
  • Terraform Registry provides battle-tested modules for VPC, EKS, GKE, RDS — reducing custom code
  • Community size means most problems have documented solutions and StackOverflow answers

Accepted Trade-offs: HCL is not a full programming language — complex conditional logic and loops require non-obvious constructs (for_each, dynamic blocks). State management requires discipline: the state file must never be edited manually, and state migrations require care. All engineers must be trained on terraform plan review before being granted apply permissions. The remote state backend must be bootstrapped manually before Terraform can manage itself.
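
A minimal sketch of the review-gated workflow the training covers (standard Terraform CLI commands; the exact CI wiring is left open here):

# Save a plan, review it, and apply only that exact plan
terraform init
terraform plan -out=tfplan
terraform show -no-color tfplan   # rendered output is attached to the pull request for review
terraform apply tfplan            # runs only after the plan has been approved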

State Management Architecture

# Remote state backend configuration (AWS)
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/us-east-1/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789:key/xxxxxxxx"
    dynamodb_table = "terraform-state-lock"
  }
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  required_version = ">= 1.6.0"
}

# Remote state backend configuration (GCP)
# backend-gcp.tf
terraform {
  backend "gcs" {
    bucket = "company-terraform-state-gcp"
    prefix = "prod/europe-west1/gke"
  }
}

# Workspace strategy for environment isolation
# Use separate state files per environment, not workspaces, for blast-radius isolation
# Structure:
# environments/
#   prod/
#     us-east-1/
#       vpc/          → terraform.tfstate in s3://.../prod/us-east-1/vpc/
#       eks/          → terraform.tfstate in s3://.../prod/us-east-1/eks/
#       rds/          → terraform.tfstate in s3://.../prod/us-east-1/rds/
#   staging/
#     us-east-1/
#       vpc/
#       eks/

ADR-0003: Adopt GitOps with ArgoCD for Kubernetes Deployments

Status: Accepted
Date: 2023-09-18
Author: Lê Bình Phương
File: ADR-0003-adopt-gitops-argocd-kubernetes-deployments.md

Context and Problem Statement

Our current CI/CD pipeline (GitHub Actions) builds container images and pushes them to ECR, then directly applies Kubernetes manifests to the cluster using kubectl apply with a service account token embedded in GitHub Secrets. This model has several critical problems: the CI/CD pipeline has direct write access to the production cluster (a significant security risk), there is no reconciliation loop to detect and correct configuration drift, rollbacks require re-running the pipeline, and there is no single source of truth for what is actually deployed.

Three recent incidents were caused by partial CI/CD pipeline failures that left the cluster in an inconsistent state. We need a deployment model that is observable, self-healing, auditable, and follows security best practices (principle of least privilege — the cluster pulls from Git, rather than CI/CD pushing to the cluster).

Decision Drivers

  • Eliminate direct kubectl access from CI/CD pipelines to production clusters
  • Continuous reconciliation: automatically detect and correct configuration drift
  • Git as the single source of truth for all deployed configuration
  • Must support multi-cluster deployments (currently 2 clusters, expected to grow to 5+)
  • Must support RBAC to allow different teams to manage their own applications without cluster-admin
  • Web UI for visibility without requiring kubectl access for developers

Considered Options

  • ArgoCD (chosen). UI: excellent — rich web UI with deployment diff view. Multi-cluster: native — hub-and-spoke model. RBAC: via OIDC (Okta, Dex). App-of-Apps: yes — App of Apps and ApplicationSets.
  • Flux v2. UI: none built-in (CLI only; Weave GitOps UI is separate). Multi-cluster: yes — via multi-tenancy. RBAC: native Kubernetes RBAC. App-of-Apps: yes — Kustomization controller.
  • Jenkins X. UI: basic web UI. Multi-cluster: limited. RBAC: limited. App-of-Apps: partial.
  • Manual kubectl in CI. UI: n/a. Multi-cluster: via scripts. RBAC: n/a. App-of-Apps: n/a.

Decision Outcome

Chosen option: ArgoCD, because it provides the best combination of a production-grade web UI (critical for developer visibility), native multi-cluster support via the hub-and-spoke model, OIDC-integrated RBAC, and the App of Apps / ApplicationSet patterns for managing many applications at scale. The ArgoCD community is large and the CNCF graduation status demonstrates production maturity.

Positive Consequences

  • GitHub Actions no longer needs direct cluster credentials — it only pushes image tags to Git; ArgoCD pulls and applies
  • Continuous reconciliation detects configuration drift within 3 minutes (default sync interval) and auto-corrects it
  • All deployments are recorded in Git history — rollback is a git revert
  • Developers can see the diff between desired state (Git) and live state (cluster) in the ArgoCD UI without kubectl access
  • ApplicationSets allow templating of applications across environments and clusters — one template defines deployment across dev/staging/prod

Accepted Trade-offs: All Kubernetes manifests (or Helm charts / Kustomize overlays) must live in a Git repository accessible to ArgoCD — no more ad-hoc kubectl apply. This introduces process discipline that some engineers will initially resist. ArgoCD itself must be running and healthy for deployments to function, which adds an operational dependency, and it must be bootstrapped manually (or via Terraform) the first time.
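
Because ArgoCD must exist before it can manage anything, a minimal one-time bootstrap sketch (release name and chart values are illustrative assumptions): install ArgoCD with Helm, then apply the root Application shown in the Architecture section below, after which all further changes flow through Git.

# One-time bootstrap: install ArgoCD, then hand control to the root Application
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd --create-namespace
kubectl apply -n argocd -f argocd/root-app.yaml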

Architecture

# App of Apps pattern — root ArgoCD Application manages all other Applications
# File: argocd/root-app.yaml (in the gitops-config repository)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops-config.git
    targetRevision: main
    path: apps/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

---
# ApplicationSet for multi-environment deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
  - matrix:
      generators:
      - git:
          repoURL: https://github.com/company/gitops-config.git
          revision: main
          directories:
          - path: services/*
      - list:
          elements:
          - cluster: prod-us-east-1
            url: https://prod-us-east-1.example.com
          - cluster: staging-us-east-1
            url: https://staging-us-east-1.example.com
  template:
    metadata:
      name: '{{path.basename}}-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/gitops-config.git
        targetRevision: main
        path: '{{path}}/overlays/{{cluster}}'
      destination:
        server: '{{url}}'
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
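
On the CI side, "pushing an image tag to Git" from the consequences above reduces to editing the relevant Kustomize overlay and committing (paths, image name, and the GIT_SHA variable are hypothetical):

# GitHub Actions step after the image is pushed to ECR: bump the tag in the GitOps repo
cd gitops-config/services/myapp/overlays/prod-us-east-1
kustomize edit set image myapp=123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:${GIT_SHA}
git commit -am "deploy myapp ${GIT_SHA} to prod-us-east-1"
git push origin main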

ADR-0004: Use HashiCorp Vault for Secrets Management

Status: Accepted
Date: 2024-01-22
Author: Lê Bình Phương
File: ADR-0004-use-hashicorp-vault-for-secrets-management.md

Context and Problem Statement

Application secrets (database passwords, API keys, TLS certificates, service account tokens) are currently stored in Kubernetes ConfigMaps and Secrets. While Kubernetes Secrets are base64-encoded (not encrypted at rest by default in our current setup), they are frequently synced to developer laptops, committed to Git accidentally, and shared in Slack messages. A security audit in Q4 2023 identified 12 instances of secrets committed to repositories, and a pentest found two Kubernetes Secrets containing production database credentials accessible to a compromised pod with default service account permissions.

We need a dedicated secrets management platform that provides: encryption at rest and in transit, audit logging of all secret access, dynamic secret generation (short-lived credentials), automatic rotation, and integration with our Kubernetes workloads without requiring developers to handle secret values directly.

Decision Drivers

  • Cloud-agnostic — must work identically on AWS and GCP without vendor lock-in
  • Dynamic secrets — ability to generate short-lived, auto-expiring database credentials to eliminate static long-lived passwords
  • Full audit log of every secret read — required for PCI-DSS and SOC 2 compliance
  • Kubernetes-native integration — secrets injected into pods without developer code changes
  • PKI and certificate authority (CA) management — replaces manual certificate issuance
  • Must not require application code changes to consume secrets

Considered Options

  • HashiCorp Vault, self-hosted on K8s (chosen). Cloud-agnostic: yes — works everywhere. Dynamic secrets: yes — DB, AWS, GCP, PKI, SSH. K8s integration: Vault Agent Injector (sidecar). Self-hosted: yes — full control.
  • AWS Secrets Manager. Cloud-agnostic: no — AWS-only. Dynamic secrets: limited (managed rotation for RDS and a few other services; custom rotation via Lambda). K8s integration: External Secrets Operator. Self-hosted: no — managed service.
  • GCP Secret Manager. Cloud-agnostic: no — GCP-only. Dynamic secrets: none built-in. K8s integration: External Secrets Operator. Self-hosted: no — managed service.
  • SOPS + Age/KMS. Cloud-agnostic: yes. Dynamic secrets: no. K8s integration: manual / Helm secrets plugin. Self-hosted: yes — file-based.

Decision Outcome

Chosen option: HashiCorp Vault, self-hosted on Kubernetes using the official Helm chart, because it is the only option that satisfies all decision drivers simultaneously: full cloud-agnosticism, dynamic database secrets (eliminating static credentials), a complete PKI engine, a comprehensive audit log, and native Kubernetes integration via the Vault Agent Injector sidecar pattern — all without requiring application code changes.

Positive Consequences

  • Dynamic database credentials: each pod gets a unique PostgreSQL username and password with a 1-hour TTL — even if credentials are captured, they expire within the hour
  • Full audit log: every secret read (for example, vault read secret/myapp/db-password) is logged with the Kubernetes service account, pod name, and timestamp — meets PCI-DSS Requirement 10
  • Vault Agent Injector: secrets are rendered to files on the pod's filesystem by an injected init container and sidecar agent — no changes to application code, no secrets in environment variables
  • PKI engine replaces manual certificate management: Vault issues short-lived certificates signed by an internal CA, eliminating the risk of forgotten, non-rotated certificates
  • Single secrets backend for all cloud providers and on-premises systems

Accepted Trade-offs: Running Vault adds operational overhead — the Vault cluster (3-node HA with Raft storage) must be maintained, upgraded, and monitored. The Vault seal/unseal lifecycle requires careful key management (we use AWS KMS auto-unseal). Vault's own bootstrap secrets (recovery keys, initial root token) cannot be stored in Vault and need separate handling. Engineers must be trained on Vault policies, AppRole authentication, and the Kubernetes auth method. On-call engineers must be prepared to handle Vault availability incidents as a Tier 1 dependency.

Deployment Architecture

# Deploy Vault on Kubernetes using official Helm chart
helm repo add hashicorp https://helm.releases.hashicorp.com

# values.yaml for production HA deployment
cat > vault-values.yaml <<'EOF'
server:
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: true
      config: |
        cluster_name = "production"
        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "http://vault-0.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "http://vault-1.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "http://vault-2.vault-internal:8200"
          }
        }
        seal "awskms" {
          region     = "us-east-1"
          kms_key_id = "arn:aws:kms:us-east-1:123456789:key/xxxxxxxx"
        }
        listener "tcp" {
          tls_disable = false
          tls_cert_file = "/vault/userconfig/vault-tls/tls.crt"
          tls_key_file  = "/vault/userconfig/vault-tls/tls.key"
        }
        api_addr = "https://vault.vault.svc.cluster.local:8200"
        cluster_addr = "https://POD_IP:8201"
  auditStorage:
    enabled: true
    size: 10Gi
injector:
  enabled: true
  replicas: 2
EOF

helm install vault hashicorp/vault \
  -n vault --create-namespace \
  -f vault-values.yaml

# Enable Kubernetes auth method
vault auth enable kubernetes
vault write auth/kubernetes/config \
  kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443"
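
# Example follow-up configuration (connection URL, role names, and TTLs below are
# illustrative assumptions): enable the database secrets engine so pods receive the
# short-lived PostgreSQL credentials referenced by the annotations below, bind the
# application's Kubernetes service account to a Vault role, and turn on audit logging.
vault secrets enable database
vault write database/config/postgres-prod \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@postgres.example.com:5432/appdb" \
  allowed_roles="myapp-prod" \
  username="vault_admin" \
  password="CHANGE_ME"
vault write database/roles/myapp-prod \
  db_name=postgres-prod \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
  default_ttl=1h \
  max_ttl=4h

vault write auth/kubernetes/role/myapp-prod \
  bound_service_account_names=myapp \
  bound_service_account_namespaces=myapp-prod \
  policies=myapp-prod-read \
  ttl=1h

vault audit enable file file_path=/vault/audit/vault_audit.log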

# Example Vault Agent Injector annotations on a Pod
# The injected agent renders the secret to /vault/secrets/db-creds (the file name comes
# from the annotation suffix); these annotations go on the workload's pod template
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "myapp-prod"
    vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/myapp-prod"
    vault.hashicorp.com/agent-inject-template-db-creds: |
      {{- with secret "database/creds/myapp-prod" -}}
      DB_USERNAME={{ .Data.username }}
      DB_PASSWORD={{ .Data.password }}
      {{- end -}}
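
For verification during rollout, the rendered file can be inspected in a running pod (namespace and names are placeholders):

# Confirm the injected credentials file exists inside the application container
kubectl exec -n myapp-prod deploy/myapp -c myapp -- cat /vault/secrets/db-creds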

Compliance Outcome: With Vault in place, PCI-DSS Requirement 8 (unique credentials per system component) and Requirement 10 (full audit log of credential access) are satisfied by design. The security audit follow-up confirmed zero static database credentials remaining in Kubernetes Secrets after the migration.