Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.
Loading diagram…
# Distributed Key-Value Store Design
## Requirements
The system should support:
* PUT(key, value)
* GET(key)
* DELETE(key)
### Non-Functional Requirements
* High availability
* Horizontal scalability
* Configurable replication factor (RF)
* Strong consistency or eventual consistency depending on configuration
* Durable writes
* Automatic failover
* Fault tolerance against node failures
* Support for hotspot detection and mitigation
---
# High-Level Architecture
The system consists of two major components:
### 1. Coordinator Cluster (Control Plane)
Responsible for:
* Cluster membership management
* Partition ownership metadata
* Consistent hashing ring management
* Replica placement decisions
* Failure detection
* Leader promotion
* Rebalancing
The coordinator cluster does not store user data.
The coordinator cluster maintains metadata such as:
```text
Partition -> Primary Node
Partition -> Replica Nodes
Node Health
Replication Factor
Hash Ranges
```
To avoid split-brain scenarios, coordinators form a Raft cluster.
Only the Raft leader can modify metadata.
All coordinator state is replicated through the Raft log.
---
### 2. Data Nodes (Data Plane)
Data nodes are responsible for:
* Persisting key-value data
* Maintaining replication logs
* Serving reads and writes
* Replicating updates to replicas
Each partition has:
```text
1 Primary
RF-1 Replicas
```
The replication factor is configurable.
Examples:
```text
RF = 2
Primary + 1 Replica
RF = 3
Primary + 2 Replicas
RF = 5
Primary + 4 Replicas
```
The architecture does not assume a fixed replication factor.
---
# Data Distribution
Keys are distributed using Consistent Hashing.
Assume:
```text
5 Data Nodes
4 Virtual Nodes Per Physical Node
```
This creates:
```text
20 Virtual Nodes
```
on the hash ring.
For every key:
```text
hash(key)
```
is computed.
The first virtual node clockwise on the ring becomes the owner of the partition.
The coordinator stores the mapping:
```text
Hash Range -> Primary Node
Hash Range -> Replica Set
```
which is used for request routing.
---
# Request Routing
Initially, clients fetch topology metadata from the coordinator cluster.
The metadata contains:
```text
Hash Ranges
Primary Ownership
Replica Ownership
```
The client caches this information locally.
After metadata is obtained:
```text
Client -> Data Node
```
direct communication is used.
This removes the coordinator from the critical data path.
The coordinator remains responsible only for:
* topology updates
* failover
* rebalancing
This cleanly separates:
```text
Control Plane
```
from
```text
Data Plane
```
and allows the system to continue serving traffic even if the coordinator cluster is temporarily unavailable.
---
# Durable Writes
When a write arrives:
```text
PUT(key, value)
```
the request is routed to the partition primary.
The primary first appends the operation to its Write-Ahead Log (WAL).
The WAL is flushed to disk before proceeding.
Only after local durability is guaranteed does replication begin.
The write is replicated to the configured replica set.
Each replica:
1. Receives the replication entry.
2. Appends it to its own WAL.
3. Flushes the WAL to disk.
4. Sends an acknowledgement.
The primary waits according to the configured consistency level.
Example:
```text
RF = 3
Write Quorum = 2
```
The write is considered committed only after two replicas have durably persisted the operation.
Only then does the primary acknowledge success back to the client.
This guarantees durability.
If the primary crashes before quorum acknowledgement:
```text
No Commit Response
```
is returned.
The client can safely retry.
Because PUT operations are idempotent, repeated retries produce the same final state.
---
# Consistency Models
## Eventual Consistency
For low latency:
```text
W = 1
R = 1
```
Writes may be acknowledged after a single durable copy.
Reads can be served from any replica.
Stale reads are acceptable.
---
## Strong Consistency
Strong consistency requires both write and read guarantees.
A write is committed only after quorum acknowledgement.
Example:
```text
RF = 3
W = 2
```
For reads, arbitrary replica reads are not allowed.
The system supports one of the following strategies:
### Leader Reads
```text
Read -> Primary
```
The primary always serves the latest committed value.
### Quorum Reads
```text
RF = 3
W = 2
R = 2
```
Since:
```text
R + W > N
```
every read quorum overlaps with the latest successful write quorum.
This guarantees the latest committed version is observed.
### Lease / Version Validation
Replicas maintain commit indexes.
A replica can only serve a strong read if it can prove it is caught up to the latest committed version.
Otherwise the request is redirected to the primary.
Therefore strong consistency is guaranteed through an explicitly enforced read strategy rather than arbitrary replica reads.
---
# Failure Handling
## Replica Failure
The coordinator continuously monitors:
* Primary nodes
* Replica nodes
including:
```text
Node Liveness
Replication Lag
Last Commit Index
```
If a replica fails:
1. Coordinator marks replica unhealthy.
2. New replica placement is selected.
3. A replacement replica is provisioned.
4. Data is copied from an up-to-date replica.
5. The replica rejoins the replication group.
During this period, the system continues operating with reduced redundancy.
---
## Primary Failure
If a primary becomes unavailable:
1. Coordinator detects failure.
2. Replication status is inspected.
3. The most up-to-date replica is selected.
4. Replica is promoted to primary.
5. Metadata is updated through Raft.
6. Clients refresh topology information.
Temporary write unavailability during failover is acceptable.
Read traffic may continue from healthy replicas depending on the configured consistency level.
Because replica lag is tracked continuously, stale replicas are never promoted.
This prevents acknowledged writes from being lost.
---
# Coordinator Failure
Coordinators are replicated using Raft.
If a coordinator crashes:
* In-flight requests handled by that coordinator may fail.
* Other coordinators continue serving metadata.
* Raft elects a new leader if necessary.
Since clients cache topology information and communicate directly with data nodes, normal reads and writes continue even while coordinator failover occurs.
---
# Capacity Planning
Assume:
```text
Logical Data Size = 1 TB
Replication Factor = 3
```
Total storage required:
```text
1 TB × 3 = 3 TB
```
Assume:
```text
20K Operations / Second
5 Data Nodes
```
Average load:
```text
20K / 5 = 4K OPS per node
```
The system should additionally be sized for:
```text
2× peak traffic
```
and for temporary failover scenarios.
---
# Hot Key Handling
A single logical key cannot be transparently split into:
```text
ABC_1
ABC_2
ABC_3
```
because that breaks normal key-value semantics.
Instead, hot-key mitigation depends on workload type.
For read-heavy hotspots:
* Increase replica count
* Load balance reads across replicas
* Cache popular keys
For write-heavy hotspots such as counters:
* Use counter striping
* Use logical sharding
* Aggregate results asynchronously
This preserves normal KV-store behavior while improving scalability.
---
# Summary
The final design provides:
* Consistent hashing partitioning
* Configurable replication factor
* Durable WAL-based writes
* Strong consistency through leader or quorum reads
* Eventual consistency through replica reads
* Raft-backed coordinator metadata
* Automatic failover
* Replica lag aware promotions
* Client-side metadata caching
* Control plane / data plane separation
* Hotspot mitigation strategies
* Horizontal scalability
Want this kind of feedback on your own design?
Draw your architecture for Key-Value Store and get an instant hire/no-hire signal from 6 specialized AI reviewers — free to start.