Database & Storage Overview

Choosing the right database engine and storage strategy is one of the most consequential architectural decisions you will make, and it underpins every other part of the system. This section covers database types, their trade-offs, high-availability design, and cloud-managed options.

Database Type Taxonomy

Relational (RDBMS)

Relational databases organise data into tables with predefined schemas, enforcing integrity through ACID transactions. They excel at complex queries, joins across normalised data, and strong consistency requirements. PostgreSQL, MySQL, and Oracle are the canonical examples. SQL is the universal query language.
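To make the ACID guarantee concrete, here is a minimal sketch using psycopg2 against a hypothetical accounts table (connection details are placeholders): both updates commit together or neither does.

```python
import psycopg2

# Connection details are placeholders; adjust for your environment.
conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")

try:
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            # Move funds atomically: either both updates apply or neither does.
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (100, 1),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (100, 2),
            )
finally:
    conn.close()
```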

NoSQL — Document Stores

Documents (typically JSON/BSON) are the unit of storage. Each document is self-contained and can have a flexible schema. MongoDB and CouchDB are representative. Document stores shine when entities naturally map to nested objects — product catalogs, user profiles, CMS content.
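As an illustrative sketch with pymongo (database, collection, and fields are hypothetical), a nested product document can be stored and queried without any predefined schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
catalog = client["shop"]["products"]               # hypothetical database/collection

# Each product is a self-contained document; nested fields need no schema change.
catalog.insert_one({
    "sku": "KB-100",
    "name": "Mechanical keyboard",
    "variants": [
        {"layout": "ISO", "price": 89.00},
        {"layout": "ANSI", "price": 85.00},
    ],
    "specs": {"switches": "brown", "wireless": True},
})

# Query into the nested structure directly, without joins.
for doc in catalog.find({"variants.layout": "ISO"}, {"name": 1, "variants.$": 1}):
    print(doc)
```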

NoSQL — Key-Value Stores

The simplest data model: every value is addressed by an opaque key. Redis, DynamoDB (in key-value mode), and Memcached fall here. Extremely fast for single-key lookups, session storage, and caching layers. Limited querying capability beyond key access.
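A minimal session-caching sketch with redis-py (key name and payload are hypothetical) shows the typical single-key write/read pattern with a TTL:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # placeholder connection

session_id = "sess:4f2a"                        # hypothetical session key
payload = {"user_id": 42, "roles": ["admin"]}

# Single-key write with a TTL: the typical session/cache pattern.
r.set(session_id, json.dumps(payload), ex=1800)  # expire after 30 minutes

# Single-key read is fast; there is no querying capability beyond the key.
raw = r.get(session_id)
print(json.loads(raw) if raw else "session expired")
```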

NoSQL — Wide-Column Stores

Data is stored in column families rather than rows, optimised for high-write workloads and time-series data. Apache Cassandra and HBase are the primary examples. Rows within a partition are sorted by a clustering key, enabling efficient range scans within a partition.
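As a hedged sketch with the DataStax Python driver (keyspace, table, and contact point are assumptions), the partition key groups a device's readings together and the clustering key keeps them sorted for efficient range scans:

```python
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("telemetry")

# Partition key = device_id, clustering key = reading_time (descending):
# all readings for one device live in one partition, sorted by time.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        device_id text,
        reading_time timestamp,
        temperature double,
        PRIMARY KEY (device_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

# Range scan within a single partition: efficient by design.
rows = session.execute(
    "SELECT reading_time, temperature FROM readings "
    "WHERE device_id = %s AND reading_time > %s",
    ("sensor-1", datetime(2024, 1, 1)),
)
```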

NoSQL — Graph Databases

Entities (nodes) and their relationships (edges) are first-class citizens. Neo4j and Amazon Neptune allow efficient traversal of deeply connected data — social networks, fraud detection, knowledge graphs, and recommendation engines where relationship depth matters.
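A small illustrative sketch with the official Neo4j Python driver (node labels, relationship type, and credentials are hypothetical) shows a multi-hop traversal that would require repeated self-joins in a relational model:

```python
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Relationships are stored explicitly, so traversals follow pointers
    # rather than joining tables.
    session.run(
        "MERGE (a:User {name: $a}) "
        "MERGE (b:User {name: $b}) "
        "MERGE (a)-[:FOLLOWS]->(b)",
        a="alice", b="bob",
    )

    # Follow the FOLLOWS relationship up to 3 hops out: a natural graph query.
    result = session.run(
        "MATCH (u:User {name: $name})-[:FOLLOWS*1..3]->(other) "
        "RETURN DISTINCT other.name AS suggestion",
        name="alice",
    )
    print([record["suggestion"] for record in result])

driver.close()
```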

NewSQL

NewSQL systems attempt to combine the horizontal scalability of NoSQL with the ACID guarantees of traditional RDBMS. CockroachDB, Google Spanner, TiDB, and YugabyteDB represent this category. They use distributed consensus algorithms (Raft, Paxos) to achieve globally consistent transactions at scale.
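Because CockroachDB speaks the PostgreSQL wire protocol, an ordinary PostgreSQL client library can run transactions against it. A minimal sketch with psycopg2 (connection details and the orders table are placeholders):

```python
import psycopg2

# CockroachDB exposes the PostgreSQL wire protocol; 26257 is its default port.
conn = psycopg2.connect(
    host="localhost", port=26257, dbname="app", user="root", sslmode="disable"
)

with conn:
    with conn.cursor() as cur:
        # The transaction is ACID even if the rows live on different nodes;
        # the database coordinates the commit via distributed consensus.
        cur.execute(
            "INSERT INTO orders (id, total) VALUES (gen_random_uuid(), %s)",
            (42.50,),
        )

conn.close()
```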

Database Comparison Matrix

| Database | Type | Consistency | Scale Model | Query Power | Best For |
|---|---|---|---|---|---|
| PostgreSQL | RDBMS | Strong (ACID) | Vertical + read replicas | Full SQL, JSON, FTS | Transactional apps, analytics, complex queries |
| MySQL | RDBMS | Strong (ACID) | Vertical + read replicas | Full SQL | Web apps, CMS, e-commerce |
| MongoDB | Document | Tunable (eventual by default) | Horizontal sharding | Rich queries, aggregation pipeline | Flexible schemas, catalogs, content |
| Redis | Key-value / in-memory | Eventual (async replication) | Cluster sharding | Key access, Lua scripts | Cache, sessions, pub/sub, leaderboards |
| Cassandra | Wide-column | Tunable (quorum) | Multi-master horizontal | CQL (partition-scoped) | IoT time-series, write-heavy, geo-distributed |
| DynamoDB | Key-value / document | Eventual / strong (opt-in) | Fully managed horizontal | PK + GSI/LSI queries | Serverless, unpredictable scale, AWS-native |

CAP Theorem

Proposed by Eric Brewer in 2000 and formally proved by Gilbert and Lynch in 2002, the CAP theorem states that a distributed data store can guarantee at most two of the following three properties simultaneously:

C — Consistency

Every read receives the most recent write or an error. All nodes see the same data at the same time.

A — Availability

Every request receives a (non-error) response, though it may not contain the most recent write.

P — Partition Tolerance

The system continues to operate despite an arbitrary number of messages being dropped between nodes.

In practice, network partitions are unavoidable in any distributed system, so P is non-negotiable. The real choice is between CP (consistency over availability) and AP (availability over consistency):

| Classification | Databases | Behaviour During Partition |
|---|---|---|
| CP | PostgreSQL, ZooKeeper, HBase, etcd, CockroachDB | Returns an error or times out rather than serving stale data |
| AP | Cassandra, DynamoDB (default), CouchDB, Riak | Serves potentially stale data to remain available; reconciles later |
| CA (single node) | Traditional RDBMS (single instance) | Not distributed, so partition tolerance does not apply |
PACELC Extension: PACELC extends CAP to address the latency/consistency trade-off that exists even when the system is running without partitions. If a Partition occurs, choose between Availability and Consistency; Else (during normal operation), choose between Latency and Consistency. DynamoDB is PA/EL; CockroachDB is PC/EC.
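The CP/AP distinction is visible directly in DynamoDB's read API: reads are eventually consistent by default, with strong consistency as an opt-in. A hedged boto3 sketch (table name and key schema are hypothetical):

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("sessions")   # hypothetical table with partition key "pk"

table.put_item(Item={"pk": "user#42", "last_seen": "2024-05-01T12:00:00Z"})

# Default read: eventually consistent (AP-style), may briefly return stale data.
eventual = table.get_item(Key={"pk": "user#42"})

# Opt-in strongly consistent read: never stale, consumes more read capacity,
# and can fail if the partition's leader replica is unreachable.
strong = table.get_item(Key={"pk": "user#42"}, ConsistentRead=True)
```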

When to Use Which Database

Strong transactional consistency required?

Use PostgreSQL or MySQL. If global distribution is needed, consider CockroachDB or Cloud Spanner.

Flexible or rapidly evolving schema?

Use MongoDB for document-oriented storage, or PostgreSQL JSONB if you still want SQL power alongside flexibility.

Sub-millisecond latency, caching, or session storage?

Use Redis. Consider Memcached for pure ephemeral caching where persistence is not needed.

Massive write throughput, time-series, or IoT telemetry?

Use Cassandra for general wide-column workloads. Consider InfluxDB or TimescaleDB for purpose-built time-series queries.

Serverless / unpredictable traffic spikes on AWS?

Use DynamoDB with on-demand capacity mode. Pair with DAX for microsecond read latency.

Highly connected data, recommendations, fraud graphs?

Use Neo4j (self-managed) or Amazon Neptune / Google Spanner Graph (managed).

OLAP / analytical queries over large datasets?

Use BigQuery, Redshift, Snowflake, or ClickHouse — purpose-built columnar engines. Avoid running heavy analytics on OLTP databases.

RTO & RPO — Driving HA Strategy

RTO — Recovery Time Objective

The maximum tolerable length of time that a system can be offline after a failure event before business impact becomes unacceptable. An RTO of 5 minutes means the service must be restored within 5 minutes of an outage.

Drives: Failover automation, standby replica topology, load balancer health checks.

RPO — Recovery Point Objective

The maximum acceptable amount of data loss measured in time. An RPO of 1 minute means you can tolerate losing at most 1 minute of committed transactions during a disaster.

Drives: Replication lag targets, WAL archival frequency, backup intervals.

| Tier | RTO Target | RPO Target | Typical Architecture |
|---|---|---|---|
| Tier 0 (Mission Critical) | < 1 min | 0 (zero data loss) | Synchronous replication, active-active multi-region, automatic failover |
| Tier 1 (Business Critical) | 1–15 min | < 1 min | Synchronous standby, Patroni/Aurora, continuous WAL archival |
| Tier 2 (Important) | 15–60 min | < 15 min | Asynchronous streaming replication, manual promotion, hourly snapshots |
| Tier 3 (Non-Critical) | > 1 hour | < 24 hours | Daily logical backups, restore from backup, acceptable data loss |
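Meeting an RPO target in practice means continuously measuring replication lag. A minimal monitoring sketch (assuming psycopg2, PostgreSQL 10+ streaming replication, and a hypothetical 60-second RPO) that reads replay lag from pg_stat_replication on the primary:

```python
import psycopg2

RPO_SECONDS = 60  # hypothetical target, roughly Tier 1

conn = psycopg2.connect("dbname=app user=monitor host=primary.example.internal")
with conn, conn.cursor() as cur:
    # pg_stat_replication reports per-standby lag on the primary (PostgreSQL 10+).
    cur.execute("""
        SELECT application_name,
               COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
        FROM pg_stat_replication
    """)
    for name, lag_seconds in cur.fetchall():
        status = "OK" if lag_seconds < RPO_SECONDS else "RPO AT RISK"
        print(f"{name}: {lag_seconds:.1f}s replay lag [{status}]")
conn.close()
```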

Backup Strategies

Logical vs Physical Backups

Logical Backup

Exports data as SQL statements or structured files (CSV, JSON). Tools: pg_dump, mysqldump, mongodump.

  • Portable across versions and platforms
  • Can restore individual tables or objects
  • Slower for large databases
  • Consistent point-in-time snapshot of data state
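A minimal scripted logical backup sketch (assuming pg_dump is on the PATH; host, user, and output path are placeholders). The custom format keeps the dump compressed and lets pg_restore recover individual tables later:

```python
import subprocess
from datetime import datetime, timezone

timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
outfile = f"/backups/app_{timestamp}.dump"   # placeholder path

# Custom-format dump: compressed, and selective restore is possible via pg_restore.
subprocess.run(
    [
        "pg_dump",
        "--host", "db.example.internal",   # placeholder host
        "--username", "backup",
        "--format", "custom",
        "--file", outfile,
        "app",                             # database name
    ],
    check=True,
)
print(f"logical backup written to {outfile}")
```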

Physical Backup

Copies raw data files from disk. Tools: pg_basebackup, filesystem snapshots, LVM, EBS snapshots.

  • Much faster for large databases
  • Requires same OS/architecture for restore
  • Basis for Point-in-Time Recovery (PITR)
  • Must be combined with WAL archival for PITR

Point-in-Time Recovery (PITR)

PITR allows you to restore a database to any specific moment in time, not just the last backup. The mechanism is a base backup combined with a continuous archive of Write-Ahead Log (WAL) segments. During recovery, the base backup is restored first, then WAL segments are replayed forward to the target timestamp. With continuous WAL archiving this keeps RPO very low, bounded by the delay between a transaction committing and its WAL reaching the archive.

Best practice: Combine physical base backups (daily) with continuous WAL archiving (to S3/GCS) to achieve PITR capability. Test restores regularly — a backup you have never tested is not a backup.
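One common way to implement continuous WAL archiving is to point archive_command at a small script that ships each completed segment to object storage. A hedged sketch using boto3 (bucket name and script path are assumptions), invoked by PostgreSQL as archive_command = 'python3 /opt/archive_wal.py %p %f':

```python
#!/usr/bin/env python3
# Invoked by PostgreSQL per WAL segment:
#   archive_command = 'python3 /opt/archive_wal.py %p %f'
#   %p = full path to the segment, %f = file name only.
import sys
import boto3

WAL_BUCKET = "example-wal-archive"   # placeholder bucket, ideally in another region

def main() -> int:
    wal_path, wal_name = sys.argv[1], sys.argv[2]
    s3 = boto3.client("s3")
    # The upload must succeed before exiting 0; PostgreSQL only recycles the
    # segment once archive_command reports success.
    s3.upload_file(wal_path, WAL_BUCKET, f"wal/{wal_name}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```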

3-2-1 Backup Rule

The 3-2-1 Rule

Maintain 3 copies of data, on 2 different media types, with 1 copy stored off-site. For cloud databases: primary instance + read replica (different AZ) + automated backup exported to object storage in a different region.
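Applied to that cloud example, the off-site copy can be a simple server-side replication of the exported backup object into a bucket in another region. An illustrative boto3 sketch (bucket names and object key are placeholders):

```python
import boto3

SOURCE_BUCKET = "app-backups-us-east-1"    # same-region backup exports (copy 2)
OFFSITE_BUCKET = "app-backups-eu-west-1"   # off-site copy in another region (copy 3)
KEY = "exports/app_20240501.dump"          # placeholder object key

# Server-side copy into the off-site bucket; no data flows through this client.
s3 = boto3.client("s3", region_name="eu-west-1")
s3.copy_object(
    Bucket=OFFSITE_BUCKET,
    Key=KEY,
    CopySource={"Bucket": SOURCE_BUCKET, "Key": KEY},
)
```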

Cloud Managed Database Options

| Service | Cloud | Engine | Key Differentiator |
|---|---|---|---|
| Amazon RDS | AWS | PostgreSQL, MySQL, MariaDB, Oracle, SQL Server | Broad engine support, automated patching, Multi-AZ standby |
| Amazon Aurora | AWS | MySQL-compatible, PostgreSQL-compatible | Distributed storage, up to 15 read replicas, Serverless v2 auto-scaling, Global Database |
| Cloud SQL | GCP | PostgreSQL, MySQL, SQL Server | Fully managed, integrated IAM auth, automatic storage increases |
| AlloyDB for PostgreSQL | GCP | PostgreSQL-compatible | 4× faster than standard PostgreSQL, columnar engine for HTAP, ML-driven autovacuum |
| Cloud Spanner | GCP | Proprietary (NewSQL) | Unlimited horizontal scale with external consistency (TrueTime), global ACID transactions |
| Azure Database for PostgreSQL | Azure | PostgreSQL | Flexible Server with zone-redundant HA, Citus extension for distributed tables |
| Azure Cosmos DB | Azure | Multi-model (APIs for MongoDB, Cassandra, SQL, Gremlin, Table) | Turnkey global distribution, five consistency levels, serverless |
Cloud DB cost watch: Managed databases typically cost 2–5× more than self-managed equivalents of the same capacity. Factor in storage IOPS charges, data transfer fees, Multi-AZ standby charges, and read replica instance costs. Use Reserved Instances or Committed Use Discounts for production workloads to cut costs by 30–60%.