ADR Examples
ADR: Production Database Selection
Context and Problem Statement
The platform requires a relational database to store user accounts, application state, transaction records, and configuration data. The team has been using SQLite for local development, which is not suitable for production workloads. As we approach the first production release, we need to select a production-grade relational database management system (RDBMS) that will serve as the canonical data store for the foreseeable future.
The selection must account for: the team's existing expertise, operational requirements (high availability, backup, point-in-time recovery), cloud deployment on both AWS and GCP, and the need to store semi-structured data (product metadata, user preferences) alongside strictly relational data. The system is expected to handle up to 10,000 requests/second at peak with a data volume of up to 5TB in the next three years.
Decision Drivers
- Must be open source with a permissive licence (no per-core commercial licensing cost)
- Must support ACID transactions and complex JOIN queries across normalised schemas
- Must have mature, production-tested high availability solutions (multi-node, automatic failover)
- Must support semi-structured data storage to avoid a secondary NoSQL database for most use cases
- Team has existing experience with the chosen system or a very similar one
- Must be available as a fully managed service on both AWS and GCP for future managed migration
Considered Options
| Option | Licence | JSONB Support | HA Solution | Managed Cloud |
|---|---|---|---|---|
| PostgreSQL 15 (chosen) | PostgreSQL Licence (permissive) | Excellent — first-class JSONB type with indexing | Patroni, Citus, Stolon | Amazon RDS, Aurora, Cloud SQL |
| MySQL 8 / MariaDB | GPL / LGPL (MariaDB) | Limited — JSON type, no GIN index equivalent | InnoDB Cluster, Galera | Amazon RDS, Cloud SQL |
| Amazon Aurora PostgreSQL | Proprietary (AWS-only) | Via PostgreSQL compatibility | Built-in Aurora HA | AWS only (not GCP) |
Decision Outcome
Chosen option: PostgreSQL 15, because it uniquely satisfies all decision drivers: it is open source, has first-class JSONB support with GIN indexing (eliminating the need for a separate document store for most use cases), has a mature and well-understood HA ecosystem (Patroni), and is available as a managed service on both AWS (RDS/Aurora) and GCP (Cloud SQL).
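The JSONB-with-GIN capability cited above can be illustrated with a short sketch; the table and column names here are illustrative, not taken from the actual schema:

```sql
-- Sketch: semi-structured metadata stored alongside relational columns
CREATE TABLE products (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sku         text NOT NULL UNIQUE,
    price_cents integer NOT NULL,
    metadata    jsonb NOT NULL DEFAULT '{}'
);

-- GIN index makes containment queries on the JSONB column efficient
CREATE INDEX products_metadata_gin ON products USING GIN (metadata);

-- Containment query (@>) served by the GIN index
SELECT sku, price_cents
FROM products
WHERE metadata @> '{"colour": "red", "tags": ["sale"]}';
```

This is the pattern that avoids a secondary document store: strictly relational columns keep constraints and JOINs, while the `jsonb` column absorbs schema-flexible attributes.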
Positive Consequences
- Single database technology for both relational and semi-structured data — eliminates a second datastore for the majority of use cases
- Rich extension ecosystem: PostGIS for geospatial data, pgvector for embedding similarity search, pg_cron for scheduled jobs
- Strong community and a 30-year track record of production stability
- No per-core commercial licensing fees — cost scales only with compute and storage
- Portable: the same database can run on-premises, on AWS, or on GCP without application changes
Negative Consequences
Operational complexity: the team must learn EXPLAIN ANALYZE, VACUUM tuning, and connection pool management with PgBouncer. Self-managed Patroni HA requires operational investment. This is mitigated by a plan to migrate to Amazon RDS or Cloud SQL (managed) within 12 months of initial deployment.
Implementation Notes
# High Availability: Deploy Patroni for self-managed HA
# Patroni manages leader election and automatic failover via etcd consensus
# Recommended patroni.yml snippet:
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
postgresql:
  listen: 0.0.0.0:5432
  connect_address: "${PATRONI_POSTGRESQL_CONNECT_ADDRESS}:5432"
  data_dir: /data/patroni
  parameters:
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    max_replication_slots: 5
    wal_log_hints: "on"
# Connection pooling: PgBouncer in transaction mode
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
ADR: Infrastructure as Code Tooling
Context and Problem Statement
Infrastructure has been provisioned manually through the AWS and GCP web consoles. This approach is not reproducible, not auditable, and extremely error-prone. Two incidents in the past quarter were caused by manual configuration drift between environments. As the team grows from 4 to 12 engineers over the next year, and as we add a second cloud provider (GCP), the manual approach becomes completely untenable.
We need an Infrastructure as Code (IaC) tool that allows infrastructure to be defined in version-controlled code, reviewed via pull requests, and applied automatically through a CI/CD pipeline. The solution must support both AWS and GCP, and must be adoptable by engineers without deep cloud provider API knowledge.
Decision Drivers
- Must support both AWS and GCP from a single codebase — we are multi-cloud and will remain so
- Declarative model preferred — define desired state, let the tool figure out the diff
- Remote state management with locking to enable team collaboration without conflicts
- Large ecosystem of community modules to avoid reinventing common patterns
- Must integrate with GitHub Actions CI/CD for automated plan and apply
- Team has prior experience with the tool or the learning curve is manageable within one sprint
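The CI/CD driver above could be satisfied with a plan-on-pull-request workflow along these lines; the workflow name, trigger paths, Terraform version, and directory layout are assumptions for illustration:

```yaml
# Sketch: run fmt and plan on every infrastructure pull request
name: terraform-plan
on:
  pull_request:
    paths: ["environments/**"]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.6"
      - name: Terraform fmt check
        run: terraform fmt -check -recursive
      - name: Terraform init and plan
        working-directory: environments/staging/us-east-1/vpc
        run: |
          terraform init -input=false
          terraform plan -input=false -no-color
```

A separate, manually approved workflow would run `terraform apply` after merge, so the plan output reviewed in the PR is the change that ships.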
Considered Options
| Option | Language | Multi-cloud | Declarative | Maturity |
|---|---|---|---|---|
| Terraform (OpenTofu) (chosen) | HCL (HashiCorp Configuration Language) | Excellent — 3000+ providers | Yes | Very high — 10+ years, industry standard |
| Pulumi | TypeScript, Python, Go, C# | Good — major providers covered | Yes — desired state declared in general-purpose languages | Medium — growing but smaller ecosystem |
| AWS CDK | TypeScript, Python, Java | AWS-only (cdk8s for K8s) | Imperative with L2 constructs | High for AWS; not applicable for GCP |
| Ansible | YAML (Jinja2 templates) | Good via cloud modules | Procedural — no state management | Very high — but better for config mgmt than IaC |
Decision Outcome
Chosen option: Terraform with remote state in S3 (AWS) / GCS (GCP), because it is the industry-standard declarative IaC tool with the widest multi-cloud provider support, a large ecosystem of community modules (Terraform Registry), and the largest pool of community knowledge and job-market expertise. The team has prior Terraform experience on AWS, reducing the learning curve.
Positive Consequences
- All infrastructure changes go through Git pull request review — full audit trail and peer review of every change
- Reproducible environments: staging can be created identically to production with a single variable override
- Remote state in S3 (with DynamoDB locking) or GCS (built-in locking) prevents concurrent apply conflicts
- Terraform Registry provides battle-tested modules for VPC, EKS, GKE, RDS — reducing custom code
- Community size means most problems have documented solutions and StackOverflow answers
Negative Consequences
HCL is a new language for part of the team and has its own idioms (for_each, dynamic blocks). State management requires discipline: the state file must never be edited manually, and state migrations require care. All engineers must be trained on terraform plan review before being granted apply permissions. Remote state backend must be bootstrapped manually before Terraform can manage itself.
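The HCL idioms mentioned above (for_each and dynamic blocks) can be sketched briefly; the resource names and values are illustrative only:

```hcl
# for_each creates one resource instance per element of a set
resource "aws_s3_bucket" "artifacts" {
  for_each = toset(["staging", "prod"])
  bucket   = "company-artifacts-${each.key}"
}

# dynamic blocks generate repeated nested blocks from a collection
resource "aws_security_group" "web" {
  name = "web"
  dynamic "ingress" {
    for_each = [80, 443]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}
```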
State Management Architecture
# Remote state backend configuration (AWS)
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/us-east-1/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789:key/xxxxxxxx"
    dynamodb_table = "terraform-state-lock"
  }
  required_providers {
    aws    = { source = "hashicorp/aws", version = "~> 5.0" }
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  required_version = ">= 1.6.0"
}
# Remote state backend configuration (GCP)
# backend-gcp.tf
terraform {
  backend "gcs" {
    bucket = "company-terraform-state-gcp"
    prefix = "prod/europe-west1/gke"
  }
}
# Workspace strategy for environment isolation
# Use separate state files per environment, not workspaces, for blast-radius isolation
# Structure:
# environments/
#   prod/
#     us-east-1/
#       vpc/  → terraform.tfstate in s3://.../prod/us-east-1/vpc/
#       eks/  → terraform.tfstate in s3://.../prod/us-east-1/eks/
#       rds/  → terraform.tfstate in s3://.../prod/us-east-1/rds/
#   staging/
#     us-east-1/
#       vpc/
#       eks/
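With state split per component, one component reads another's outputs via the terraform_remote_state data source. A minimal sketch (the output name private_subnet_ids is an assumption; the bucket and key follow the layout above):

```hcl
# Inside the eks component: read the vpc component's state
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "prod/us-east-1/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Example consumption: expose the VPC's private subnets for this component
output "subnet_ids" {
  value = data.terraform_remote_state.vpc.outputs.private_subnet_ids
}
```

This keeps the blast radius of an apply limited to one component while still allowing cross-component wiring.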
ADR: GitOps Deployment Model
Context and Problem Statement
Our current CI/CD pipeline (GitHub Actions) builds container images and pushes them to ECR, then directly applies Kubernetes manifests to the cluster using kubectl apply with a service account token embedded in GitHub Secrets. This model has several critical problems: the CI/CD pipeline has direct write access to the production cluster (a significant security risk), there is no reconciliation loop to detect and correct configuration drift, rollbacks require re-running the pipeline, and there is no single source of truth for what is actually deployed.
Three recent incidents were caused by partial CI/CD pipeline failures that left the cluster in an inconsistent state. We need a deployment model that is observable, self-healing, auditable, and follows security best practices (principle of least privilege — the cluster pulls from Git, rather than CI/CD pushing to the cluster).
Decision Drivers
- Eliminate direct kubectl access from CI/CD pipelines to production clusters
- Continuous reconciliation: automatically detect and correct configuration drift
- Git as the single source of truth for all deployed configuration
- Must support multi-cluster deployments (currently 2 clusters, expected to grow to 5+)
- Must support RBAC to allow different teams to manage their own applications without cluster-admin
- Web UI for visibility without requiring kubectl access for developers
Considered Options
| Option | UI | Multi-cluster | RBAC | App-of-Apps |
|---|---|---|---|---|
| ArgoCD (chosen) | Excellent — rich web UI with deployment diff view | Native — hub-and-spoke model | RBAC via OIDC (Okta, Dex) | Yes — App of Apps and ApplicationSets |
| Flux v2 | None built-in (CLI only, Weave GitOps UI is separate) | Yes — via multi-tenancy | Kubernetes RBAC native | Yes — Kustomization controller |
| Jenkins X | Basic web UI | Limited | Limited | Partial |
| Manual kubectl in CI | N/A | Via scripts | N/A | N/A |
Decision Outcome
Chosen option: ArgoCD, because it provides the best combination of a production-grade web UI (critical for developer visibility), native multi-cluster support via the hub-and-spoke model, OIDC-integrated RBAC, and the App of Apps / ApplicationSet patterns for managing many applications at scale. The ArgoCD community is large and the CNCF graduation status demonstrates production maturity.
Positive Consequences
- GitHub Actions no longer needs direct cluster credentials — it only pushes image tags to Git; ArgoCD pulls and applies
- Continuous reconciliation detects configuration drift within 3 minutes (default sync interval) and auto-corrects it
- All deployments are recorded in Git history — rollback is a git revert
- Developers can see the diff between desired state (Git) and live state (cluster) in the ArgoCD UI without kubectl access
- ApplicationSets allow templating of applications across environments and clusters — one template defines deployment across dev/staging/prod
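The "CI only pushes image tags to Git" flow could look like the following GitHub Actions step; the repository layout, overlay path, registry URL, and GITOPS_TOKEN secret are assumptions:

```yaml
# Sketch: update the image tag in gitops-config; ArgoCD syncs the change
- name: Bump image tag in gitops-config
  env:
    GITOPS_TOKEN: ${{ secrets.GITOPS_TOKEN }}
  run: |
    git clone "https://x-access-token:${GITOPS_TOKEN}@github.com/company/gitops-config.git"
    cd gitops-config/services/myapp/overlays/prod-us-east-1
    kustomize edit set image myapp=123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:${GITHUB_SHA}
    git config user.name "ci-bot"
    git config user.email "ci-bot@company.example"
    git commit -am "myapp: deploy ${GITHUB_SHA}"
    git push
```

Note the pipeline never touches the cluster: its only credential is a Git token scoped to the gitops-config repository.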
Negative Consequences
Deployments can no longer be performed with an ad-hoc kubectl apply. This introduces process discipline that some engineers will initially resist. ArgoCD itself must be running and healthy for deployments to function — this adds an operational dependency. ArgoCD itself must be bootstrapped manually (or via Terraform) the first time.
Architecture
# App of Apps pattern — root ArgoCD Application manages all other Applications
# File: argocd/root-app.yaml (in the gitops-config repository)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops-config.git
    targetRevision: main
    path: apps/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
---
# ApplicationSet for multi-environment deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://github.com/company/gitops-config.git
              revision: main
              directories:
                - path: services/*
          - list:
              elements:
                - cluster: prod-us-east-1
                  url: https://prod-us-east-1.example.com
                - cluster: staging-us-east-1
                  url: https://staging-us-east-1.example.com
  template:
    metadata:
      name: '{{path.basename}}-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/gitops-config.git
        targetRevision: main
        path: '{{path}}/overlays/{{cluster}}'
      destination:
        server: '{{url}}'
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
ADR: Secrets Management Platform
Context and Problem Statement
Application secrets (database passwords, API keys, TLS certificates, service account tokens) are currently stored in Kubernetes ConfigMaps and Secrets. While Kubernetes Secrets are base64-encoded (not encrypted at rest by default in our current setup), they are frequently synced to developer laptops, committed to Git accidentally, and shared in Slack messages. A security audit in Q4 2023 identified 12 instances of secrets committed to repositories, and a pentest found two Kubernetes Secrets containing production database credentials accessible to a compromised pod with default service account permissions.
We need a dedicated secrets management platform that provides: encryption at rest and in transit, audit logging of all secret access, dynamic secret generation (short-lived credentials), automatic rotation, and integration with our Kubernetes workloads without requiring developers to handle secret values directly.
Decision Drivers
- Cloud-agnostic — must work identically on AWS and GCP without vendor lock-in
- Dynamic secrets — ability to generate short-lived, auto-expiring database credentials to eliminate static long-lived passwords
- Full audit log of every secret read — required for PCI-DSS and SOC 2 compliance
- Kubernetes-native integration — secrets injected into pods without developer code changes
- PKI and certificate authority (CA) management — replaces manual certificate issuance
- Must not require application code changes to consume secrets
Considered Options
| Option | Cloud-Agnostic | Dynamic Secrets | K8s Integration | Self-Hosted |
|---|---|---|---|---|
| HashiCorp Vault (self-hosted on K8s) (chosen) | Yes — works everywhere | Yes — DB, AWS, GCP, PKI, SSH | Vault Agent Injector (sidecar) | Yes — full control |
| AWS Secrets Manager | No — AWS-only | Limited — RDS rotation only | External Secrets Operator | No — managed service |
| GCP Secret Manager | No — GCP-only | None built-in | External Secrets Operator | No — managed service |
| SOPS + Age/KMS | Yes | No | Manual / Helm secrets plugin | Yes — file-based |
Decision Outcome
Chosen option: HashiCorp Vault, self-hosted on Kubernetes using the official Helm chart, because it is the only option that satisfies all decision drivers simultaneously: full cloud-agnosticism, dynamic database secrets (eliminating static credentials), a complete PKI engine, a comprehensive audit log, and native Kubernetes integration via the Vault Agent Injector sidecar pattern — all without requiring application code changes.
Positive Consequences
- Dynamic database credentials: each pod gets a unique PostgreSQL username and password with a 1-hour TTL — even if credentials are captured, they expire within the hour
- Full audit log: every vault read secret/myapp/db-password is logged with the Kubernetes service account, pod name, and timestamp — meets PCI-DSS Requirement 10
- Vault Agent Injector: secrets are written to the pod's filesystem as files via an injected init container (and optional sidecar) — no changes to application code, no secrets in environment variables
- PKI engine replaces manual certificate management: Vault issues short-lived certificates signed by an internal CA, eliminating the risk of forgotten, non-rotated certificates
- Single secrets backend for all cloud providers and on-premises systems
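The dynamic database credentials described above are configured roughly as follows; the mount path, role name, and connection details are assumptions, and the TTLs match the 1-hour credential lifetime described above:

```shell
# Sketch: enable the database secrets engine and define a short-lived role
vault secrets enable database

# Register the PostgreSQL connection (Vault rotates its own admin password later)
vault write database/config/app-postgres \
    plugin_name=postgresql-database-plugin \
    allowed_roles="myapp-prod" \
    connection_url="postgresql://{{username}}:{{password}}@postgres:5432/app?sslmode=require" \
    username="vault-admin" \
    password="initial-password"

# Role: each read of database/creds/myapp-prod mints a fresh 1-hour user
vault write database/roles/myapp-prod \
    db_name=app-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
    default_ttl=1h \
    max_ttl=4h
```

Each `vault read database/creds/myapp-prod` then returns a unique, expiring username/password pair, which is what the Agent Injector annotations below consume.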
Deployment Architecture
# Deploy Vault on Kubernetes using official Helm chart
helm repo add hashicorp https://helm.releases.hashicorp.com
# values.yaml for production HA deployment
cat > vault-values.yaml <<'EOF'
server:
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: true
      config: |
        cluster_name = "production"
        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "https://vault-2.vault-internal:8200"
          }
        }
        seal "awskms" {
          region     = "us-east-1"
          kms_key_id = "arn:aws:kms:us-east-1:123456789:key/xxxxxxxx"
        }
        listener "tcp" {
          address       = "[::]:8200"
          tls_disable   = false
          tls_cert_file = "/vault/userconfig/vault-tls/tls.crt"
          tls_key_file  = "/vault/userconfig/vault-tls/tls.key"
        }
        api_addr = "https://vault.vault.svc.cluster.local:8200"
        cluster_addr = "https://POD_IP:8201"
  auditStorage:
    enabled: true
    size: 10Gi
injector:
  enabled: true
  replicas: 2
EOF
helm install vault hashicorp/vault \
    -n vault --create-namespace \
    -f vault-values.yaml
# Enable Kubernetes auth method
vault auth enable kubernetes
vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443"
# Example Vault Agent Injector annotations on a Pod
# The sidecar reads the Vault path and writes the secret to /vault/secrets/
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "myapp-prod"
    vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/myapp-prod"
    vault.hashicorp.com/agent-inject-template-db-creds: |
      {{- with secret "database/creds/myapp-prod" -}}
      DB_USERNAME={{ .Data.username }}
      DB_PASSWORD={{ .Data.password }}
      {{- end -}}