Redis & Caching

Redis is an in-memory data structure store used as a cache, message broker, and primary database. This guide covers data structures, persistence, cluster topologies, cache design patterns, eviction policies, Kubernetes deployment, and common production pitfalls.

Redis Data Structures

Type | Description | Key Commands | Use Cases
String | Binary-safe bytes up to 512 MB. Can hold integers, floats, or serialised objects. | SET, GET, INCR, EXPIRE, SETNX | Object cache, counters, distributed locks, rate limiting tokens, feature flags
Hash | Map of field-value pairs. Efficient for representing objects with multiple attributes. | HSET, HGET, HMGET, HGETALL, HINCRBY | User profile cache, session storage, entity attribute maps, shopping carts
List | Ordered linked list of strings. Efficient push/pop from both ends (O(1)). | LPUSH, RPUSH, LPOP, BRPOP, LRANGE | Message queues, task queues, activity feeds, recent items list
Set | Unordered collection of unique strings. O(1) membership testing. | SADD, SISMEMBER, SUNION, SINTER, SDIFF | Unique visitors, tags, friend lists, deduplication, online users
Sorted Set (ZSet) | Set where each member has a floating-point score. Members ordered by score. | ZADD, ZRANGE, ZRANGEBYSCORE, ZRANK, ZINCRBY | Leaderboards, priority queues, rate limiting (sliding window), scheduled jobs
Stream | Append-only log of entries with auto-generated IDs. Supports consumer groups. | XADD, XREAD, XREADGROUP, XACK, XLEN | Event sourcing, audit logs, IoT telemetry, lightweight message broker (Kafka-lite)
Bitmap | String type exposed with bit-level operations. Extremely space-efficient. | SETBIT, GETBIT, BITCOUNT, BITOP | Daily active users (1 bit per user per day), feature toggles, bloom filter approximation
HyperLogLog | Probabilistic cardinality estimation using ~12 KB regardless of set size. | PFADD, PFCOUNT, PFMERGE | Approximate unique visitor counting, distinct value estimation at scale
# --- String: distributed rate limiter (fixed window) ---
# Allow 100 requests per minute per user
INCR rate:user:42
EXPIRE rate:user:42 60
# Check: if value > 100, reject request
# NB: only set EXPIRE when INCR returns 1, or wrap both in MULTI/EXEC —
# a crash between the two commands leaves a counter key that never expires

# --- Sorted Set: real-time leaderboard ---
ZADD leaderboard:season1 1500 "player:alice"
ZADD leaderboard:season1 2200 "player:bob"
ZADD leaderboard:season1 1800 "player:carol"
ZREVRANGE leaderboard:season1 0 9 WITHSCORES   # Top 10 players
ZREVRANK leaderboard:season1 "player:alice"     # Alice's rank (0-based, highest score first)

# --- Stream: event ingestion ---
XADD events:orders * order_id 12345 customer_id 99 total 150.00 status created
XREAD COUNT 10 BLOCK 0 STREAMS events:orders $

# --- Consumer group for parallel processing ---
XGROUP CREATE events:orders processor-group $ MKSTREAM
XREADGROUP GROUP processor-group worker-1 COUNT 5 BLOCK 0 STREAMS events:orders >
XACK events:orders processor-group 1711612800000-0
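
The Sorted Set use cases above mention sliding-window rate limiting. The decision logic can be sketched in Python (function and variable names are illustrative), with comments mapping each step to its Redis command:

```python
def sliding_window_allow(timestamps, now, window_s, limit):
    """Decide whether a request arriving at `now` passes a sliding-window
    limit. In Redis, `timestamps` lives in a per-client sorted set scored
    by arrival time; each step maps to one command (see comments)."""
    # ZREMRANGEBYSCORE key 0 (now - window): drop entries outside the window
    live = [t for t in timestamps if t > now - window_s]
    # ZCARD key: count requests still inside the window
    if len(live) >= limit:
        return False, live          # over the limit: reject
    # ZADD key now <unique-member>: record this request
    # (plus EXPIRE key window_s so idle clients are garbage-collected)
    live.append(now)
    return True, live
```

With redis-py the same steps are typically pipelined, or wrapped in a Lua script, so the check-and-add runs atomically on the server.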

Persistence Modes

Feature | RDB (Redis Database Snapshot) | AOF (Append-Only File)
Mechanism | Forked child process periodically writes a full binary snapshot to disk | Every write command is appended to a log file; replayed on restart
Data Loss Risk | Up to the interval since the last snapshot (minutes of data) | At most ~1 second with appendfsync everysec
Recovery Speed | Fast — load binary snapshot directly | Slower — replay all commands (mitigated by AOF rewrite)
File Size | Compact binary format | Larger; grows over time until AOF rewrite compacts it
Performance Impact | Fork COW pause (can be seconds on large datasets with dirty pages) | Minimal with everysec; always mode has write latency impact
Best For | Disaster recovery, cold standby, analytics snapshots | Production workloads where more than a second of data loss is unacceptable
# redis.conf — Recommended hybrid persistence for production

# --- RDB Snapshots ---
save 3600 1      # Save if at least 1 key changed in 1 hour
save 300 100     # Save if at least 100 keys changed in 5 minutes
save 60 10000    # Save if at least 10000 keys changed in 1 minute
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis

# --- AOF ---
appendonly yes
appendfilename "appendonly.aof"
# appendfsync always      # Every write — safest, slowest
appendfsync everysec      # Every second — good balance (default recommendation)
# appendfsync no          # Let OS decide — fastest, most data loss risk

no-appendfsync-on-rewrite yes   # Skip fsync while a rewrite/BGSAVE child runs (avoids latency stalls; widens the data-loss window during rewrites)
auto-aof-rewrite-percentage 100 # Trigger rewrite when AOF grows 100% vs last rewrite
auto-aof-rewrite-min-size 128mb
aof-use-rdb-preamble yes        # Hybrid: start AOF with RDB snapshot for faster reload

Redis Cluster vs Sentinel

Redis Sentinel

Provides high availability for a single-shard Redis deployment. Sentinel processes monitor the master and replicas; if the master fails, Sentinels agree via quorum vote, promote a replica to master, and clients discover the new master by querying Sentinel (service discovery).

  • Min 3 Sentinel nodes for quorum
  • Data fits on a single instance
  • Clients must support Sentinel protocol
  • Automatic failover, typically seconds to under a minute (driven by down-after-milliseconds and failover-timeout)

Use when: dataset < single node RAM, simplicity preferred, vertical scaling is sufficient.

Redis Cluster

Provides both horizontal scaling (data sharding) and HA. Data is split into 16384 hash slots across multiple primary shards. Each shard has its own replica(s). Clients connect directly to any node; the cluster redirects with MOVED/ASK.

  • Min 6 nodes (3 primary + 3 replica)
  • Horizontal scale-out
  • Multi-key operations limited to same slot
  • No cross-slot transactions

Use when: dataset exceeds single node RAM, write throughput must scale horizontally.
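
The MOVED/ASK redirection above is driven by a fixed key-to-slot mapping: slot = CRC16(key) mod 16384, where an optional {hash tag} restricts hashing to a substring so related keys land in the same slot. A minimal Python sketch of the algorithm:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM (poly 0x1021, init 0) as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Map a key to one of the 16384 cluster hash slots.

    If the key contains a non-empty {hash tag}, only the tag is hashed,
    so e.g. {user:1000}.following and {user:1000}.followers share a slot.
    """
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end != -1 and end != start + 1:   # tag must be non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# hash_slot("foo") -> 12182, matching `CLUSTER KEYSLOT foo`
```

Placing related keys under a shared hash tag is how multi-key operations and transactions are kept cluster-safe despite the same-slot restriction.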

Cache Design Patterns

Cache-Aside (Lazy Loading)

The application checks the cache first. On a miss, it reads from the database, populates the cache, and returns the result. The application owns cache population logic.

Pros: Simple; cache only contains requested data; tolerates cache failures (falls back to DB).
Cons: Cold start penalty; possible stale reads between TTL expiry and next request.
Best for: Read-heavy workloads where not all data needs to be cached.
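
A minimal cache-aside read path in Python (a sketch: `cache` is assumed to expose redis-py style get/set, and `db_load` is a placeholder for your database read):

```python
import json

def get_with_cache_aside(cache, db_load, key, ttl_s=300):
    """Cache-aside: check cache, fall back to DB, populate cache on miss."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                    # cache hit
    value = db_load(key)                             # miss: read from the DB
    if value is not None:
        cache.set(key, json.dumps(value), ex=ttl_s)  # populate for next time
    return value
```

Wrapping the cache calls in a try/except that falls through to `db_load` gives the tolerate-cache-failure property noted above.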

Write-Through

Every write goes to the cache AND the database synchronously. Cache is always consistent with the database at the cost of write latency.

Pros: No stale cache; strong consistency.
Cons: Higher write latency; caches data that may never be read (cold data pollution).
Best for: Workloads where read-after-write consistency is critical (user account balance).

Write-Behind (Write-Back)

Writes go to cache immediately and are asynchronously flushed to the database in the background. The application gets low-latency writes.

Pros: Very low write latency; batch DB writes reduce load.
Cons: Risk of data loss if cache fails before flush; complexity in flush logic.
Best for: High-frequency writes where eventual durability is acceptable (counters, analytics).
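
The flush mechanics can be sketched as a queue drained in batches (`db_write_many` is a hypothetical bulk-write helper; in production `flush` would run on a background thread or timer):

```python
import queue

class WriteBehindCache:
    """Write-behind sketch: writes land in the cache immediately and are
    queued for a later batched flush to the database."""
    def __init__(self, db_write_many, batch_size=100):
        self.cache = {}
        self.pending = queue.Queue()
        self.db_write_many = db_write_many
        self.batch_size = batch_size

    def set(self, key, value):
        self.cache[key] = value          # fast path: cache only
        self.pending.put((key, value))   # durability is deferred

    def flush(self):
        """Drain up to batch_size pending writes into one bulk DB call."""
        batch = []
        while not self.pending.empty() and len(batch) < self.batch_size:
            batch.append(self.pending.get())
        if batch:
            self.db_write_many(batch)    # one bulk write instead of N
        return len(batch)
```

Anything still in `pending` when the process dies is lost, which is exactly the durability trade-off described above.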

Read-Through

The cache sits in front of the database. On a cache miss, the cache itself fetches the data from the database, stores it, and returns it to the application. The application only talks to the cache.

Pros: Transparent to application; clean separation of concerns.
Cons: Requires a cache layer that implements read-through — Redis has no native support, so in practice this means a wrapping library/proxy or a managed cache such as DAX.
Best for: Systems using managed caches like AWS DAX (DynamoDB Accelerator).

Cache Eviction Policies

Policy | Algorithm | Behaviour | Best For
noeviction | None | Return an error on writes when memory is full | Persistent data stores where data loss is unacceptable
allkeys-lru | Approximate LRU | Evict least recently used keys from all keys | General-purpose cache; you want Redis to evict the least popular items
volatile-lru | Approximate LRU | Evict LRU keys only among those with a TTL set | Mixed cache + persistent data; protect keys without TTL
allkeys-lfu | LFU (frequency) | Evict least frequently used keys from all keys | Cache with skewed access patterns; hot keys stay, cold keys evicted
volatile-lfu | LFU (frequency) | Evict LFU keys only among those with a TTL set | Mixed workload with TTL on cacheable data
allkeys-random | Random | Evict random keys from all keys | Uniform access distribution (rare in practice)
volatile-ttl | TTL-based | Evict keys with the shortest remaining TTL first | Time-boxed data where expiry-imminent items are least valuable
# redis.conf — memory and eviction configuration
maxmemory 12gb                    # Cap Redis at 12 GB (leave headroom for OS and fork)
maxmemory-policy allkeys-lfu      # Evict least-frequently-used keys when memory is full
maxmemory-samples 10              # Sample 10 keys when applying LRU/LFU (higher = more accurate, more CPU)

# LFU tuning (Redis 4+)
lfu-decay-time 1                  # Halve LFU counter every 1 minute of inactivity
lfu-log-factor 10                 # Higher = more granular frequency counting

Redis Cluster on Kubernetes (Bitnami Helm)

# values-redis-cluster.yaml — Bitnami Redis Cluster Helm values
# helm install redis-cluster bitnami/redis-cluster -f values-redis-cluster.yaml -n cache

cluster:
  enabled: true
  nodes: 6            # 3 primary shards + 3 replicas (1 replica per shard)
  replicas: 1

# Resource allocation per node
redis:
  resources:
    requests:
      memory: 4Gi
      cpu: 500m
    limits:
      memory: 6Gi
      cpu: 2000m

  # Redis configuration
  extraEnvVars: []
  configmap: |
    maxmemory 4gb
    maxmemory-policy allkeys-lfu
    appendonly yes
    appendfsync everysec
    no-appendfsync-on-rewrite yes
    tcp-keepalive 300
    timeout 0
    hz 20
    aof-use-rdb-preamble yes

# Persistence
persistence:
  enabled: true
  storageClass: "fast-ssd"
  size: 20Gi

# Password authentication
usePassword: true
existingSecret: redis-cluster-secret
existingSecretPasswordKey: redis-password

# Pod anti-affinity: spread nodes across availability zones
podAntiAffinityPreset: hard

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: redis-cluster

# Metrics exporter for Prometheus
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s

# Network policy
networkPolicy:
  enabled: true
  allowExternal: false    # Only allow access from same namespace

# Pod Disruption Budget
pdb:
  create: true
  minAvailable: 4         # At least 4 of 6 nodes must be available

Redis Sentinel Configuration

# sentinel.conf — 3-node Sentinel configuration

# Monitor the master named "mymaster" at 10.0.2.10:6379
# Quorum = 2 (at least 2 Sentinels must agree to initiate failover)
sentinel monitor mymaster 10.0.2.10 6379 2

# Authentication
sentinel auth-pass mymaster <master-password>   # Placeholder — use the master's requirepass value

# Failover timing
sentinel down-after-milliseconds mymaster 5000   # Mark master as subjectively down after 5s no response
sentinel failover-timeout mymaster 60000          # Max time for failover process: 60s
sentinel parallel-syncs mymaster 1               # Reconfigure replicas one at a time during failover

# Notification script (optional — trigger alerting on state changes)
# sentinel notification-script mymaster /opt/redis/notify.sh

# Sentinel bind address
bind 0.0.0.0
sentinel announce-ip 10.0.2.11     # Adjust per Sentinel node
sentinel announce-port 26379

Redis Sentinel on Kubernetes

# values-redis-sentinel.yaml — Bitnami Redis (Sentinel mode)
# helm install redis bitnami/redis -f values-redis-sentinel.yaml -n cache

architecture: replication

sentinel:
  enabled: true
  masterSet: mymaster
  quorum: 2
  downAfterMilliseconds: 5000
  failoverTimeout: 60000
  parallelSyncs: 1

replica:
  replicaCount: 3
  resources:
    requests:
      memory: 2Gi
      cpu: 250m
    limits:
      memory: 4Gi
      cpu: 1000m
  persistence:
    enabled: true
    storageClass: "fast-ssd"
    size: 10Gi

auth:
  enabled: true
  existingSecret: redis-sentinel-secret
  existingSecretPasswordKey: redis-password

# Expose a single ClusterIP Service that routes to current master
service:
  type: ClusterIP
  port: 6379
  sentinelPort: 26379

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: monitoring

podAntiAffinityPreset: hard

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: redis

Monitoring: Key Metrics

INFO Command

# Connect and inspect server stats
redis-cli -h redis-master -a $REDIS_PASSWORD INFO all

# Key sections to monitor:

# -- Memory --
# used_memory_human        — actual memory used by Redis data
# used_memory_rss_human    — memory as seen by OS (includes fragmentation)
# mem_fragmentation_ratio  — >1.5 indicates high fragmentation; consider MEMORY PURGE or restart
# maxmemory_human          — configured limit

# -- Stats --
# keyspace_hits            — cache hits
# keyspace_misses          — cache misses
# evicted_keys             — keys removed due to maxmemory policy
# expired_keys             — keys removed due to TTL expiry
# total_commands_processed — commands per second (use with delta)
# instantaneous_ops_per_sec — current ops/sec

# -- Replication --
# role                     — master or slave
# connected_slaves         — number of connected replicas
# master_repl_offset       — byte offset of the replication stream on the master
# slave_repl_offset        — replica's current offset (master minus replica = lag in bytes)
# master_last_io_seconds_ago — seconds since last replica I/O

# -- Clients --
# connected_clients        — current client count
# blocked_clients          — clients blocked in BRPOP/BLPOP/BLMOVE/WAIT calls
# maxclients               — configured limit

Critical Metrics to Alert On

# Prometheus alert rules (example — adjust thresholds for your workload)

# Cache hit rate < 80% (indicates cache is too small or keys are expiring too aggressively)
- alert: RedisCacheHitRateLow
  expr: |
    rate(redis_keyspace_hits_total[5m]) /
    (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.80
  for: 10m

# Memory usage > 90% of maxmemory
- alert: RedisMemoryHigh
  expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.90
  for: 5m

# Evictions occurring (should be 0 for persistent data)
- alert: RedisEvictingKeys
  expr: rate(redis_evicted_keys_total[5m]) > 0
  for: 2m

# Replication unhealthy: no connected replica, or replica more than 1 MiB behind
- alert: RedisReplicationLag
  expr: redis_connected_slaves < 1 or redis_master_repl_offset - redis_slave_repl_offset > 1048576
  for: 1m

# Too many blocked clients
- alert: RedisBlockedClients
  expr: redis_blocked_clients > 50
  for: 5m

Connection Patterns

Connection Pooling

Redis connections are cheap compared to PostgreSQL, but each idle connection still holds a file descriptor and server-side memory (~20 KB). For high-concurrency applications, use a connection pool in your client library rather than creating a new connection per request.

# Python — redis-py with connection pool
import os
import redis

pool = redis.ConnectionPool(
    host='redis-master',
    port=6379,
    password=os.environ['REDIS_PASSWORD'],
    max_connections=50,          # Max connections in pool
    socket_connect_timeout=2,    # Timeout for new connections
    socket_timeout=1,            # Timeout for read/write operations
    health_check_interval=30,    # Periodic PING to detect stale connections
    decode_responses=True
)

client = redis.Redis(connection_pool=pool)

# For cluster mode (redis-py >= 4.1 — RedisCluster manages its own per-node pools):
from redis.cluster import RedisCluster, ClusterNode

cluster_client = RedisCluster(
    startup_nodes=[
        ClusterNode("redis-cluster-0", 6379),
        ClusterNode("redis-cluster-1", 6379),
        ClusterNode("redis-cluster-2", 6379),
    ],
    password=os.environ['REDIS_PASSWORD'],
    max_connections=20,          # Forwarded to each node's connection pool
    decode_responses=True
)

Circuit Breaker Pattern

When Redis is unavailable, requests should fail fast rather than queuing indefinitely. Implement a circuit breaker that opens after N consecutive failures and automatically retries after a cooldown period.

# Using resilience4j (Java) or pybreaker (Python) — concept shown in pseudo-code

# Circuit Breaker states:
# CLOSED   — normal operation; requests pass through to Redis
# OPEN     — Redis unreachable; requests fail fast without hitting Redis
# HALF-OPEN — after cooldown, allow a limited number of probe requests; if they succeed → CLOSED, if they fail → OPEN again

# Configuration
circuit_breaker_config:
  failure_rate_threshold: 50      # Open circuit if >50% of last N calls fail
  slow_call_rate_threshold: 80    # Open if >80% of calls are slower than slow_call_duration_threshold
  slow_call_duration_threshold: 500ms
  minimum_number_of_calls: 10     # Min calls to evaluate failure rate
  wait_duration_in_open_state: 30s
  permitted_number_of_calls_in_half_open_state: 3

# Fallback: serve from database or return cached result from local in-process LRU
def get_user_profile(user_id):
    try:
        return circuit_breaker.call(redis_client.get, f"user:{user_id}")
    except (RedisError, CircuitBreakerOpenError):
        return database.get_user_profile(user_id)   # Fallback to primary DB
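
A minimal count-based breaker implementing the three states above (a sketch, not a substitute for resilience4j/pybreaker; the injectable clock is there for testability):

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; after `cooldown_s`
    the next call acts as a half-open probe."""
    def __init__(self, max_failures=5, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None    # None means CLOSED

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: HALF-OPEN — let this call probe Redis
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip to OPEN (fresh cooldown)
            raise
        self.failures = 0
        self.opened_at = None                   # probe succeeded: CLOSED
        return result
```

Wrap `redis_client` calls with `breaker.call(...)` and catch both the Redis error and the fail-fast error in the fallback path, as in the pseudo-code above.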

Common Production Pitfalls

Cache Stampede (Thundering Herd)

When a popular cached key expires, many concurrent requests simultaneously miss the cache and hit the database at the same time, causing a spike that can overload the database.

Solutions:

  • Mutex / distributed lock: Use SET key value NX PX 5000 (SETNX with TTL) so only one request rebuilds the cache; others wait and retry.
  • Probabilistic early expiration (PER): Before the key expires, proactively refresh it with a probability that increases as expiry approaches. Avoids synchronised expiry.
  • Jitter on TTL: Add random jitter to TTLs (e.g., TTL = base_ttl + random(0, base_ttl * 0.1)) to prevent multiple keys expiring simultaneously.
  • Background refresh: Return the stale value immediately while asynchronously refreshing in the background (stale-while-revalidate pattern).
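
The mutex and TTL-jitter ideas combine into a short sketch (assuming a redis-py style client for `cache`; `rebuild` is a placeholder for the expensive database query):

```python
import random
import time

def get_or_rebuild(cache, rebuild, key, ttl_s=300, lock_ttl_ms=5000):
    """Stampede-safe cache read: only the lock winner rebuilds the value."""
    value = cache.get(key)
    if value is not None:
        return value
    # SET lock:<key> 1 NX PX <lock_ttl_ms> — at most one winner per lock TTL
    if cache.set(f"lock:{key}", "1", nx=True, px=lock_ttl_ms):
        value = rebuild()
        # Jitter the TTL +/-10% so sibling keys don't expire in lockstep
        cache.set(key, value, ex=int(ttl_s * random.uniform(0.9, 1.1)))
        return value
    time.sleep(0.05)                    # lost the race: wait for the winner
    return cache.get(key) or rebuild()  # last resort: rebuild anyway
```

A production version would also DEL the lock after a successful rebuild and retry the wait loop a few times before falling back to the database.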

Hot Keys

A single Redis key receiving disproportionately high traffic (e.g., a globally shared counter or a viral product page) can saturate a single shard's CPU or network bandwidth.

Solutions:

  • Key sharding: Split a hot key into N sub-keys (counter:product:123:shard:{0..N}) and aggregate reads. Route writes to a random shard.
  • Local in-process cache: Cache hot values in application memory (e.g., Caffeine, Guava Cache) with a short TTL (1–5 seconds). Dramatically reduces Redis traffic for truly hot keys.
  • Read replicas: Route read traffic to replicas for hot read-only keys (Redis Cluster supports this with READONLY mode on replica nodes).
  • Hot key detection: run redis-cli --hotkeys (available since Redis 4; requires an LFU maxmemory-policy) to identify problematic keys.
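
Key sharding for a hot counter can be sketched as follows (N_SHARDS and the key layout are illustrative; `r` is assumed to be a redis-py style client):

```python
import random

N_SHARDS = 8  # tune to the key's write rate

def incr_sharded(r, key, amount=1):
    """Spread writes for a hot counter across N sub-keys; in a cluster,
    the sub-keys hash to different slots and therefore different shards."""
    shard = random.randrange(N_SHARDS)
    return r.incrby(f"{key}:shard:{shard}", amount)   # INCRBY one random shard

def get_sharded(r, key):
    """Read = sum over all shards (MGET keeps it to one round trip)."""
    values = r.mget([f"{key}:shard:{i}" for i in range(N_SHARDS)])
    return sum(int(v) for v in values if v is not None)
```

The trade-off: reads touch N keys instead of one, so this suits write-hot, read-warm counters.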

Large Keys / Big Values

Keys with very large values (e.g., a serialised 10 MB object) cause latency spikes because Redis is single-threaded for command execution. DEL or EXPIRE on a large key blocks the event loop.

Solutions:

  • Use UNLINK instead of DEL — UNLINK is asynchronous (non-blocking).
  • Scan for large keys: redis-cli --bigkeys or redis-cli --memkeys.
  • Split large objects into smaller chunks or use a CDN/object storage for large blobs.
  • Compress values before storing (snappy, zstd): trade CPU for memory.

Memory Fragmentation

Over time, after many allocations and deallocations of different sizes, Redis RSS memory (as seen by the OS) grows beyond used_memory. A fragmentation ratio above 1.5 wastes significant RAM.

Solutions:

  • Enable activedefrag yes in redis.conf (Redis 4+) for online defragmentation.
  • Schedule a rolling restart of Redis nodes during low-traffic windows.
  • Run MEMORY PURGE to return freed memory to the allocator (jemalloc).

Slow Commands Block Everything

Redis is single-threaded for commands. One slow Lua script, a KEYS command on a large keyspace, or a SORT on a large list will block all other commands. Never run KEYS * in production — use SCAN with a cursor instead. Avoid long-running Lua scripts; set lua-time-limit 5000 and monitor for script timeouts.