Database & Storage Overview

Choosing the right database engine and storage strategy is one of the most consequential architectural decisions you will make, and it underpins every other part of the system. This section covers database types, their trade-offs, high-availability design, and cloud-managed options.

Database Type Taxonomy

Relational (RDBMS)

Relational databases organise data into tables with predefined schemas, enforcing integrity through ACID transactions. They excel at complex queries, joins across normalised data, and strong consistency requirements. PostgreSQL, MySQL, and Oracle are the canonical examples. SQL is the universal query language.
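To make the ACID guarantee concrete, here is a minimal sketch using psycopg2 against a hypothetical accounts table (connection details are placeholders): both updates commit together or neither does.

```python
import psycopg2

# Connection details are placeholders; adjust for your environment.
conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")

try:
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            # Move funds atomically: either both updates apply or neither does.
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (100, 1),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (100, 2),
            )
finally:
    conn.close()
```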

NoSQL — Document Stores

Documents (typically JSON/BSON) are the unit of storage. Each document is self-contained and can have a flexible schema. MongoDB and CouchDB are representative. Document stores shine when entities naturally map to nested objects — product catalogs, user profiles, CMS content.
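As an illustrative sketch with pymongo (database, collection, and fields are hypothetical), a nested product document can be stored and queried without any predefined schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
catalog = client["shop"]["products"]               # hypothetical database/collection

# Each product is a self-contained document; nested fields need no schema change.
catalog.insert_one({
    "sku": "KB-100",
    "name": "Mechanical keyboard",
    "variants": [
        {"layout": "ISO", "price": 89.00},
        {"layout": "ANSI", "price": 85.00},
    ],
    "specs": {"switches": "brown", "wireless": True},
})

# Query into the nested structure directly, without joins.
for doc in catalog.find({"variants.layout": "ISO"}, {"name": 1, "variants.$": 1}):
    print(doc)
```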

NoSQL — Key-Value Stores

The simplest data model: every value is addressed by an opaque key. Redis, DynamoDB (in key-value mode), and Memcached fall here. Extremely fast for single-key lookups, session storage, and caching layers. Limited querying capability beyond key access.
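A minimal session-caching sketch with redis-py (key name and payload are hypothetical) shows the typical single-key write/read pattern with a TTL:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # placeholder connection

session_id = "sess:4f2a"                        # hypothetical session key
payload = {"user_id": 42, "roles": ["admin"]}

# Single-key write with a TTL: the typical session/cache pattern.
r.set(session_id, json.dumps(payload), ex=1800)  # expire after 30 minutes

# Single-key read is fast; there is no querying capability beyond the key.
raw = r.get(session_id)
print(json.loads(raw) if raw else "session expired")
```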

NoSQL — Wide-Column Stores

Data is stored in column families rather than rows, optimised for high-write workloads and time-series data. Apache Cassandra and HBase are the primary examples. Rows within a partition are sorted by a clustering key, enabling efficient range scans within a partition.
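As a hedged sketch with the DataStax Python driver (keyspace, table, and contact point are assumptions), the partition key groups a device's readings together and the clustering key keeps them sorted for efficient range scans:

```python
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("telemetry")

# Partition key = device_id, clustering key = reading_time (descending):
# all readings for one device live in one partition, sorted by time.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        device_id text,
        reading_time timestamp,
        temperature double,
        PRIMARY KEY (device_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

# Range scan within a single partition: efficient by design.
rows = session.execute(
    "SELECT reading_time, temperature FROM readings "
    "WHERE device_id = %s AND reading_time > %s",
    ("sensor-1", datetime(2024, 1, 1)),
)
```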

NoSQL — Graph Databases

Entities (nodes) and their relationships (edges) are first-class citizens. Neo4j and Amazon Neptune allow efficient traversal of deeply connected data — social networks, fraud detection, knowledge graphs, and recommendation engines where relationship depth matters.
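A small illustrative sketch with the official Neo4j Python driver (node labels, relationship type, and credentials are hypothetical) shows a multi-hop traversal that would require repeated self-joins in a relational model:

```python
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Relationships are stored explicitly, so traversals follow pointers
    # rather than joining tables.
    session.run(
        "MERGE (a:User {name: $a}) "
        "MERGE (b:User {name: $b}) "
        "MERGE (a)-[:FOLLOWS]->(b)",
        a="alice", b="bob",
    )

    # Follow the FOLLOWS relationship up to 3 hops out: a natural graph query.
    result = session.run(
        "MATCH (u:User {name: $name})-[:FOLLOWS*1..3]->(other) "
        "RETURN DISTINCT other.name AS suggestion",
        name="alice",
    )
    print([record["suggestion"] for record in result])

driver.close()
```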

NewSQL

NewSQL systems attempt to combine the horizontal scalability of NoSQL with the ACID guarantees of traditional RDBMS. CockroachDB, Google Spanner, TiDB, and YugabyteDB represent this category. They use distributed consensus algorithms (Raft, Paxos) to achieve globally consistent transactions at scale.
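Because CockroachDB speaks the PostgreSQL wire protocol, an ordinary PostgreSQL client library can run transactions against it. A minimal sketch with psycopg2 (connection details and the orders table are placeholders):

```python
import psycopg2

# CockroachDB exposes the PostgreSQL wire protocol; 26257 is its default port.
conn = psycopg2.connect(
    host="localhost", port=26257, dbname="app", user="root", sslmode="disable"
)

with conn:
    with conn.cursor() as cur:
        # The transaction is ACID even if the rows live on different nodes;
        # the database coordinates the commit via distributed consensus.
        cur.execute(
            "INSERT INTO orders (id, total) VALUES (gen_random_uuid(), %s)",
            (42.50,),
        )

conn.close()
```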

Database Comparison Matrix

| Database | Type | Consistency | Scale Model | Query Power | Best For |
|---|---|---|---|---|---|
| PostgreSQL | RDBMS | Strong (ACID) | Vertical + read replicas | Full SQL, JSON, FTS | Transactional apps, analytics, complex queries |
| MySQL | RDBMS | Strong (ACID) | Vertical + read replicas | Full SQL | Web apps, CMS, e-commerce |
| MongoDB | Document | Tunable (eventual by default) | Horizontal sharding | Rich queries, aggregation pipeline | Flexible schemas, catalogs, content |
| Redis | Key-value / in-memory | Eventual (async replication) | Cluster sharding | Key access, Lua scripts | Cache, sessions, pub/sub, leaderboards |
| Cassandra | Wide-column | Tunable (quorum) | Multi-master horizontal | CQL (partition-scoped) | IoT time-series, write-heavy, geo-distributed |
| DynamoDB | Key-value / document | Eventual / strong (opt-in) | Fully managed horizontal | PK + GSI/LSI queries | Serverless, unpredictable scale, AWS-native |

CAP Theorem

Proposed by Eric Brewer in 2000 and formally proved by Gilbert and Lynch in 2002, the CAP theorem states that a distributed data store can guarantee at most two of the following three properties simultaneously:

C — Consistency

Every read receives the most recent write or an error. All nodes see the same data at the same time.

A — Availability

Every request receives a (non-error) response, though it may not contain the most recent write.

P — Partition Tolerance

The system continues to operate despite an arbitrary number of messages being dropped between nodes.

In practice, network partitions are unavoidable in any distributed system, so P is non-negotiable. The real choice is between CP (consistency over availability) and AP (availability over consistency):

| Classification | Databases | Behaviour During Partition |
|---|---|---|
| CP | PostgreSQL, ZooKeeper, HBase, etcd, CockroachDB | Returns an error or times out rather than serving stale data |
| AP | Cassandra, DynamoDB (default), CouchDB, Riak | Serves potentially stale data to remain available; reconciles later |
| CA (single node) | Traditional RDBMS (single instance) | Not distributed, so partition tolerance does not apply |
PACELC Extension: PACELC extends CAP to address the latency/consistency trade-off that exists even when the system is running without partitions. If a Partition occurs, choose between Availability and Consistency; Else (during normal operation), choose between Latency and Consistency. DynamoDB is PA/EL; CockroachDB is PC/EC.
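The CP/AP distinction is visible directly in DynamoDB's read API: reads are eventually consistent by default, with strong consistency as an opt-in. A hedged boto3 sketch (table name and key schema are hypothetical):

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("sessions")   # hypothetical table with partition key "pk"

table.put_item(Item={"pk": "user#42", "last_seen": "2024-05-01T12:00:00Z"})

# Default read: eventually consistent (AP-style), may briefly return stale data.
eventual = table.get_item(Key={"pk": "user#42"})

# Opt-in strongly consistent read: never stale, consumes more read capacity,
# and can fail if the partition's leader replica is unreachable.
strong = table.get_item(Key={"pk": "user#42"}, ConsistentRead=True)
```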

When to Use Which Database

Strong transactional consistency required?

Use PostgreSQL or MySQL. If global distribution is needed, consider CockroachDB or Cloud Spanner.

Flexible or rapidly evolving schema?

Use MongoDB for document-oriented storage, or PostgreSQL JSONB if you still want SQL power alongside flexibility.

Sub-millisecond latency, caching, or session storage?

Use Redis. Consider Memcached for pure ephemeral caching where persistence is not needed.

Massive write throughput, time-series, or IoT telemetry?

Use Cassandra for general wide-column workloads. Consider InfluxDB or TimescaleDB for purpose-built time-series queries.

Serverless / unpredictable traffic spikes on AWS?

Use DynamoDB with on-demand capacity mode. Pair with DAX for microsecond read latency.

Highly connected data, recommendations, fraud graphs?

Use Neo4j (self-managed) or Amazon Neptune / Google Spanner Graph (managed).

OLAP / analytical queries over large datasets?

Use BigQuery, Redshift, Snowflake, or ClickHouse — purpose-built columnar engines. Avoid running heavy analytics on OLTP databases.

RTO & RPO — Driving HA Strategy

RTO — Recovery Time Objective

The maximum tolerable length of time that a system can be offline after a failure event before business impact becomes unacceptable. An RTO of 5 minutes means the service must be restored within 5 minutes of an outage.

Drives: Failover automation, standby replica topology, load balancer health checks.

RPO — Recovery Point Objective

The maximum acceptable amount of data loss measured in time. An RPO of 1 minute means you can tolerate losing at most 1 minute of committed transactions during a disaster.

Drives: Replication lag targets, WAL archival frequency, backup intervals.

| Tier | RTO Target | RPO Target | Typical Architecture |
|---|---|---|---|
| Tier 0 (Mission Critical) | < 1 min | 0 (zero data loss) | Synchronous replication, active-active multi-region, automatic failover |
| Tier 1 (Business Critical) | 1–15 min | < 1 min | Synchronous standby, Patroni/Aurora, continuous WAL archival |
| Tier 2 (Important) | 15–60 min | < 15 min | Asynchronous streaming replication, manual promotion, hourly snapshots |
| Tier 3 (Non-Critical) | > 1 hour | < 24 hours | Daily logical backups, restore from backup, acceptable data loss |
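Meeting an RPO target in practice means continuously measuring replication lag. A minimal monitoring sketch (assuming psycopg2, PostgreSQL 10+ streaming replication, and a hypothetical 60-second RPO) that reads replay lag from pg_stat_replication on the primary:

```python
import psycopg2

RPO_SECONDS = 60  # hypothetical target, roughly Tier 1

conn = psycopg2.connect("dbname=app user=monitor host=primary.example.internal")
with conn, conn.cursor() as cur:
    # pg_stat_replication reports per-standby lag on the primary (PostgreSQL 10+).
    cur.execute("""
        SELECT application_name,
               COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
        FROM pg_stat_replication
    """)
    for name, lag_seconds in cur.fetchall():
        status = "OK" if lag_seconds < RPO_SECONDS else "RPO AT RISK"
        print(f"{name}: {lag_seconds:.1f}s replay lag [{status}]")
conn.close()
```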

Backup Strategies

Logical vs Physical Backups

Logical Backup

Exports data as SQL statements or structured files (CSV, JSON). Tools: pg_dump, mysqldump, mongodump.

  • Portable across versions and platforms
  • Can restore individual tables or objects
  • Slower for large databases
  • Consistent point-in-time snapshot of data state
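A minimal scripted logical backup sketch (assuming pg_dump is on the PATH; host, user, and output path are placeholders). The custom format keeps the dump compressed and lets pg_restore recover individual tables later:

```python
import subprocess
from datetime import datetime, timezone

timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
outfile = f"/backups/app_{timestamp}.dump"   # placeholder path

# Custom-format dump: compressed, and selective restore is possible via pg_restore.
subprocess.run(
    [
        "pg_dump",
        "--host", "db.example.internal",   # placeholder host
        "--username", "backup",
        "--format", "custom",
        "--file", outfile,
        "app",                             # database name
    ],
    check=True,
)
print(f"logical backup written to {outfile}")
```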

Physical Backup

Copies raw data files from disk. Tools: pg_basebackup, filesystem snapshots, LVM, EBS snapshots.

  • Much faster for large databases
  • Requires same OS/architecture for restore
  • Basis for Point-in-Time Recovery (PITR)
  • Must be combined with WAL archival for PITR

Point-in-Time Recovery (PITR)

PITR allows you to restore a database to any specific moment in time, not just the last backup. The mechanism is a base backup combined with a continuous archive of Write-Ahead Log (WAL) segments. During recovery, the base backup is restored first, then WAL segments are replayed forward to the target timestamp. With continuous WAL archiving this keeps RPO very low, bounded by the delay between a transaction committing and its WAL reaching the archive.

Best practice: Combine physical base backups (daily) with continuous WAL archiving (to S3/GCS) to achieve PITR capability. Test restores regularly — a backup you have never tested is not a backup.
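One common way to implement continuous WAL archiving is to point archive_command at a small script that ships each completed segment to object storage. A hedged sketch using boto3 (bucket name and script path are assumptions), invoked by PostgreSQL as archive_command = 'python3 /opt/archive_wal.py %p %f':

```python
#!/usr/bin/env python3
# Invoked by PostgreSQL per WAL segment:
#   archive_command = 'python3 /opt/archive_wal.py %p %f'
#   %p = full path to the segment, %f = file name only.
import sys
import boto3

WAL_BUCKET = "example-wal-archive"   # placeholder bucket, ideally in another region

def main() -> int:
    wal_path, wal_name = sys.argv[1], sys.argv[2]
    s3 = boto3.client("s3")
    # The upload must succeed before exiting 0; PostgreSQL only recycles the
    # segment once archive_command reports success.
    s3.upload_file(wal_path, WAL_BUCKET, f"wal/{wal_name}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```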

3-2-1 Backup Rule

The 3-2-1 Rule

Maintain 3 copies of data, on 2 different media types, with 1 copy stored off-site. For cloud databases: primary instance + read replica (different AZ) + automated backup exported to object storage in a different region.
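Applied to that cloud example, the off-site copy can be a simple server-side replication of the exported backup object into a bucket in another region. An illustrative boto3 sketch (bucket names and object key are placeholders):

```python
import boto3

SOURCE_BUCKET = "app-backups-us-east-1"    # same-region backup exports (copy 2)
OFFSITE_BUCKET = "app-backups-eu-west-1"   # off-site copy in another region (copy 3)
KEY = "exports/app_20240501.dump"          # placeholder object key

# Server-side copy into the off-site bucket; no data flows through this client.
s3 = boto3.client("s3", region_name="eu-west-1")
s3.copy_object(
    Bucket=OFFSITE_BUCKET,
    Key=KEY,
    CopySource={"Bucket": SOURCE_BUCKET, "Key": KEY},
)
```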

Cloud Managed Database Options

| Service | Cloud | Engine | Key Differentiator |
|---|---|---|---|
| Amazon RDS | AWS | PostgreSQL, MySQL, MariaDB, Oracle, SQL Server | Broad engine support, automated patching, Multi-AZ standby |
| Amazon Aurora | AWS | MySQL-compatible, PostgreSQL-compatible | Distributed storage, up to 15 read replicas, Serverless v2 auto-scaling, Global Database |
| Cloud SQL | GCP | PostgreSQL, MySQL, SQL Server | Fully managed, integrated IAM auth, automatic storage increases |
| AlloyDB for PostgreSQL | GCP | PostgreSQL-compatible | 4× faster than standard PostgreSQL, columnar engine for HTAP, ML-driven autovacuum |
| Cloud Spanner | GCP | Proprietary (NewSQL) | Unlimited horizontal scale with external consistency (TrueTime), global ACID transactions |
| Azure Database for PostgreSQL | Azure | PostgreSQL | Flexible Server with zone-redundant HA, Citus extension for distributed tables |
| Azure Cosmos DB | Azure | Multi-model (APIs for MongoDB, Cassandra, SQL, Gremlin, Table) | Turnkey global distribution, five consistency levels, serverless |
Cloud DB cost watch: Managed databases typically cost 2–5× more than self-managed equivalents of the same capacity. Factor in storage IOPS charges, data transfer fees, Multi-AZ standby charges, and read replica instance costs. Use Reserved Instances or Committed Use Discounts for production workloads to cut costs by 30–60%.