WAL + Quorum
How distributed databases stay durable and tune consistency with write/read quorums.
Distributed databases need to answer two hard questions for every write: "Will this survive a crash?" and "Which replicas must see it before we call it successful?" The Write-Ahead Log answers the durability question by recording changes before applying them. Quorum replication answers the consistency question by requiring enough replicas to acknowledge reads and writes.
The problem: crashes and replica disagreement
A single database can crash halfway through a write. A distributed database adds another failure mode: some replicas receive a write while others are slow, partitioned, or offline. If the system acknowledges too early, a crash can lose data. If reads ask the wrong replica, users can see stale data. WAL and quorum rules are the mechanical tools that make those trade-offs explicit.
Crash durability problem:
server receives write
server updates memory
server crashes before disk flush
after restart, the write is gone unless it was logged first
Replica consistency problem:
N = 3 replicas: A, B, C
write reaches A only
read asks B only
read returns old value unless quorum settings force overlapWrite-Ahead Log: durability before mutation
A write-ahead log is an append-only file of changes. The database writes the intended mutation to the log and flushes it to stable storage before updating its main data structures. If the process crashes after the log append but before the in-memory table or on-disk page is updated, restart can replay the log and finish the change.
write(key="user:42", value="active"):
1. append record to WAL:
{lsn: 981, op: "put", key: "user:42", value: "active"}
2. fsync WAL so the record survives power loss
3. apply change to memtable / buffer pool
4. acknowledge success to the client
crash recovery:
read last checkpoint
replay WAL records after that checkpoint
rebuild the latest committed state- Append is fast: sequential disk writes are much cheaper than random page updates.
- Fsync is the price of durability: waiting for the log to reach stable storage adds latency, but it defines what survives a crash.
- Log sequence number: each record gets an ordered LSN so recovery and replication know exactly how far they have progressed.
- Checkpointing: the database periodically writes a compact snapshot so recovery does not replay the entire log from the beginning of time.
Replication to N nodes
A distributed database stores each piece of data on N replicas. In systems inspired by Dynamo and Cassandra, clients or coordinators can choose how many replicas must acknowledge a write (W) and how many replicas are queried for a read (R). Those knobs create tunable consistency.
N = 3 replicas for key "cart:7": A, B, C
W = 2 write acknowledgements required
coordinator receives PUT cart:7 = "paid"
→ append to WAL on A
→ append to WAL on B
→ append to WAL on C
A ack arrives
B ack arrives
coordinator returns success to client because W=2
C may finish later, or repair may update it laterEach replica still uses its WAL locally. A write acknowledgement should mean the replica can recover that write after a crash, not merely that it placed the value in memory.
| Parameter | Meaning | Example impact |
|---|---|---|
| N | How many replicas store the data | N=3 survives one or more replica failures depending on settings |
| W | How many replicas must acknowledge a write | Higher W improves durability/consistency but adds latency |
| R | How many replicas are consulted for a read | Higher R improves freshness but adds latency |
| Coordinator | Node that contacts replicas for one request | Merges responses and resolves versions |
Quorum reads and writes: why W + R > N matters
The key quorum rule is simple but powerful: if W + R > N, then the set of replicas that acknowledged the latest write and the set of replicas consulted by a read must overlap on at least one node. That overlapping node has the latest acknowledged write, so the read can discover it.
N = 3 replicas
Strong read-after-write choice:
W = 2, R = 2
W + R = 4 > 3
any read set of 2 overlaps any successful write set of 2
Fast eventual choice:
W = 1, R = 1
W + R = 2 <= 3
write may land on A, read may ask B, so stale reads are possible
Read-heavy strong-ish choice:
W = 3, R = 1
W + R = 4 > 3
writes are slower, reads are fast and fresh after acknowledged writes| N | W | R | Behavior |
|---|---|---|---|
| 3 | 2 | 2 | Majority read and write; common balanced quorum |
| 3 | 3 | 1 | Slow writes, fast reads, overlap guaranteed |
| 3 | 1 | 3 | Fast writes, slow reads, overlap guaranteed |
| 3 | 1 | 1 | Lowest latency, eventual consistency |
| 5 | 3 | 3 | Majority quorum across more replicas |
Sloppy quorum and hinted handoff
Real clusters have node failures and network partitions. Strict quorum says the write must reach W of the correct N replicas for that key.Sloppy quorum relaxes this by allowing nearby healthy nodes to temporarily accept writes on behalf of unavailable replicas. The system stores a hint saying where the write really belongs and forwards it later. This improves availability, but it weakens the clean quorum-overlap story while the cluster is healing.
Key K should live on replicas A, B, C
Configured W = 2
B is down.
Coordinator writes to:
A (real replica)
D (temporary handoff node with hint: "deliver to B")
Client receives success because two nodes stored the write.
Later:
B recovers
D sends the hinted write to B
B catches up and D can delete the hint| Mode | Availability | Consistency caveat |
|---|---|---|
| Strict quorum | Lower during replica outages | Overlap math is easier to reason about |
| Sloppy quorum | Higher during outages | A later read of original replicas may miss handoff data until repair |
| Hinted handoff | Helps failed replicas catch up | Hints can be delayed, lost after retention, or conflict with newer writes |
Read repair, anti-entropy, and convergence
Replicas drift. Some writes arrive late, some nodes are down, and some reads intentionally use low R for speed. Distributed databases therefore need background mechanisms that make replicas converge over time.
- Read repair: when a read contacts multiple replicas and sees stale values, the coordinator returns the newest value to the client and writes that value back to stale replicas.
- Anti-entropy repair: background jobs compare replica data ranges, often using Merkle trees or checksums, and stream missing or outdated rows to peers.
- Hinted handoff replay: temporary nodes deliver stored hints to recovered replicas.
- Conflict resolution: concurrent writes may need last-write-wins, vector clocks, application merge logic, or lightweight transactions depending on the database.
read key K with R=2 from replicas A and B
A returns value="paid", version=12
B returns value="pending", version=9
coordinator:
chooses version 12 for the client
asynchronously sends value="paid", version=12 to B
next read from B is no longer staleTunable consistency in Dynamo and Cassandra-style systems
Dynamo-style systems and Cassandra expose consistency as a dial. For a shopping cart, low-latency writes with occasional merge conflict may be acceptable. For an account balance, you might choose stronger quorum settings or a different database model. The system lets you trade latency, availability, and freshness per operation.
| Workload | Possible setting | Reason |
|---|---|---|
| User presence | W=1, R=1 | Freshness is nice, but low latency matters more |
| Product catalog | W=2, R=1 or cached reads | Writes are rare; stale reads are tolerable briefly |
| Shopping cart | Quorum or application merge | Availability matters, conflicts can be resolved |
| Financial ledger | Strong quorum or single-leader consensus | Stale or conflicting reads are unacceptable |
Gotchas
- Latency tails: higher W or R waits for more replicas, so the slowest contacted replica affects request latency.
- Clock-based conflicts: last-write-wins can lose updates if clocks skew or writes are concurrent.
- Durability assumptions: an acknowledgement should mean the replica logged the write durably; memory-only acknowledgements are weaker.
- Operational repair: low consistency settings demand strong monitoring for hinted handoff backlog, repair progress, and replica lag.
- A write-ahead log appends and fsyncs changes before applying them, so a replica can recover committed writes after a crash.
- Replication stores each key on N nodes; W controls write acknowledgements and R controls how many replicas a read consults.
- When W + R > N, every successful read set overlaps every successful write set, enabling strong read-after-write behavior when versions are compared correctly.
- Sloppy quorum and hinted handoff improve availability during failures but temporarily weaken simple quorum guarantees until repair completes.
- Read repair, anti-entropy, hinted handoff, and conflict resolution are the maintenance loops that make Dynamo/Cassandra-style tunable consistency converge.
2 + 2 > 3. At least one replica read has the latest acknowledged write. With W=1 and R=1, the write may land on A while the read asks B, so stale data is possible.Mark it complete to track your progress through the workbook.