Challenge Drills Library Drawing Guide Learn AI Setup Guide Support About

Key-Value Store — system design by AgileViper46

Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

Loading diagram…

# Distributed Key-Value Store Design ## Requirements The system should support: * PUT(key, value) * GET(key) * DELETE(key) ### Non-Functional Requirements * High availability * Horizontal scalability * Configurable replication factor (RF) * Strong consistency or eventual consistency depending on configuration * Durable writes * Automatic failover * Fault tolerance against node failures * Support for hotspot detection and mitigation --- # High-Level Architecture The system consists of two major components: ### 1. Coordinator Cluster (Control Plane) Responsible for: * Cluster membership management * Partition ownership metadata * Consistent hashing ring management * Replica placement decisions * Failure detection * Leader promotion * Rebalancing The coordinator cluster does not store user data. The coordinator cluster maintains metadata such as: ```text Partition -> Primary Node Partition -> Replica Nodes Node Health Replication Factor Hash Ranges ``` To avoid split-brain scenarios, coordinators form a Raft cluster. Only the Raft leader can modify metadata. All coordinator state is replicated through the Raft log. --- ### 2. Data Nodes (Data Plane) Data nodes are responsible for: * Persisting key-value data * Maintaining replication logs * Serving reads and writes * Replicating updates to replicas Each partition has: ```text 1 Primary RF-1 Replicas ``` The replication factor is configurable. Examples: ```text RF = 2 Primary + 1 Replica RF = 3 Primary + 2 Replicas RF = 5 Primary + 4 Replicas ``` The architecture does not assume a fixed replication factor. --- # Data Distribution Keys are distributed using Consistent Hashing. Assume: ```text 5 Data Nodes 4 Virtual Nodes Per Physical Node ``` This creates: ```text 20 Virtual Nodes ``` on the hash ring. For every key: ```text hash(key) ``` is computed. The first virtual node clockwise on the ring becomes the owner of the partition. The coordinator stores the mapping: ```text Hash Range -> Primary Node Hash Range -> Replica Set ``` which is used for request routing. --- # Request Routing Initially, clients fetch topology metadata from the coordinator cluster. The metadata contains: ```text Hash Ranges Primary Ownership Replica Ownership ``` The client caches this information locally. After metadata is obtained: ```text Client -> Data Node ``` direct communication is used. This removes the coordinator from the critical data path. The coordinator remains responsible only for: * topology updates * failover * rebalancing This cleanly separates: ```text Control Plane ``` from ```text Data Plane ``` and allows the system to continue serving traffic even if the coordinator cluster is temporarily unavailable. --- # Durable Writes When a write arrives: ```text PUT(key, value) ``` the request is routed to the partition primary. The primary first appends the operation to its Write-Ahead Log (WAL). The WAL is flushed to disk before proceeding. Only after local durability is guaranteed does replication begin. The write is replicated to the configured replica set. Each replica: 1. Receives the replication entry. 2. Appends it to its own WAL. 3. Flushes the WAL to disk. 4. Sends an acknowledgement. The primary waits according to the configured consistency level. Example: ```text RF = 3 Write Quorum = 2 ``` The write is considered committed only after two replicas have durably persisted the operation. Only then does the primary acknowledge success back to the client. This guarantees durability. If the primary crashes before quorum acknowledgement: ```text No Commit Response ``` is returned. The client can safely retry. Because PUT operations are idempotent, repeated retries produce the same final state. --- # Consistency Models ## Eventual Consistency For low latency: ```text W = 1 R = 1 ``` Writes may be acknowledged after a single durable copy. Reads can be served from any replica. Stale reads are acceptable. --- ## Strong Consistency Strong consistency requires both write and read guarantees. A write is committed only after quorum acknowledgement. Example: ```text RF = 3 W = 2 ``` For reads, arbitrary replica reads are not allowed. The system supports one of the following strategies: ### Leader Reads ```text Read -> Primary ``` The primary always serves the latest committed value. ### Quorum Reads ```text RF = 3 W = 2 R = 2 ``` Since: ```text R + W > N ``` every read quorum overlaps with the latest successful write quorum. This guarantees the latest committed version is observed. ### Lease / Version Validation Replicas maintain commit indexes. A replica can only serve a strong read if it can prove it is caught up to the latest committed version. Otherwise the request is redirected to the primary. Therefore strong consistency is guaranteed through an explicitly enforced read strategy rather than arbitrary replica reads. --- # Failure Handling ## Replica Failure The coordinator continuously monitors: * Primary nodes * Replica nodes including: ```text Node Liveness Replication Lag Last Commit Index ``` If a replica fails: 1. Coordinator marks replica unhealthy. 2. New replica placement is selected. 3. A replacement replica is provisioned. 4. Data is copied from an up-to-date replica. 5. The replica rejoins the replication group. During this period, the system continues operating with reduced redundancy. --- ## Primary Failure If a primary becomes unavailable: 1. Coordinator detects failure. 2. Replication status is inspected. 3. The most up-to-date replica is selected. 4. Replica is promoted to primary. 5. Metadata is updated through Raft. 6. Clients refresh topology information. Temporary write unavailability during failover is acceptable. Read traffic may continue from healthy replicas depending on the configured consistency level. Because replica lag is tracked continuously, stale replicas are never promoted. This prevents acknowledged writes from being lost. --- # Coordinator Failure Coordinators are replicated using Raft. If a coordinator crashes: * In-flight requests handled by that coordinator may fail. * Other coordinators continue serving metadata. * Raft elects a new leader if necessary. Since clients cache topology information and communicate directly with data nodes, normal reads and writes continue even while coordinator failover occurs. --- # Capacity Planning Assume: ```text Logical Data Size = 1 TB Replication Factor = 3 ``` Total storage required: ```text 1 TB × 3 = 3 TB ``` Assume: ```text 20K Operations / Second 5 Data Nodes ``` Average load: ```text 20K / 5 = 4K OPS per node ``` The system should additionally be sized for: ```text 2× peak traffic ``` and for temporary failover scenarios. --- # Hot Key Handling A single logical key cannot be transparently split into: ```text ABC_1 ABC_2 ABC_3 ``` because that breaks normal key-value semantics. Instead, hot-key mitigation depends on workload type. For read-heavy hotspots: * Increase replica count * Load balance reads across replicas * Cache popular keys For write-heavy hotspots such as counters: * Use counter striping * Use logical sharding * Aggregate results asynchronously This preserves normal KV-store behavior while improving scalability. --- # Summary The final design provides: * Consistent hashing partitioning * Configurable replication factor * Durable WAL-based writes * Strong consistency through leader or quorum reads * Eventual consistency through replica reads * Raft-backed coordinator metadata * Automatic failover * Replica lag aware promotions * Client-side metadata caching * Control plane / data plane separation * Hotspot mitigation strategies * Horizontal scalability

Hire SignalHire

Across NFR, API, Entities, and HLD, the candidate demonstrates strong distributed-systems fundamentals and a coherent scalable design that meets the stated requirements. The weaknesses are mostly around explicitness and production hardening of client/data-model contracts rather than core architectural misunderstanding.

⭐ Excellent

Consistency modes are explicitly tied to read/write behavior

The candidate does more than say 'strong or eventual consistency' — they explain the concrete mechanisms for each mode (W=1/R=1 for eventual, quorum or leader reads for strong, and why R+W>N matters). That shows they understand what would break if the wrong read path were allowed under strong consistency.

✅ Good

Availability trade-off is acknowledged during failover

The design explicitly states that temporary write unavailability during primary failover is acceptable and that read behavior depends on the configured consistency level. This is a good NFR discussion because it recognizes availability is not absolute and varies by consistency choice.

✅ Good

Capacity numbers are connected to stated assumptions

The candidate ties the NFRs back to the given assumptions by translating 1 TB logical data into 3 TB physical storage at RF=3 and distributing 20K ops/sec across 5 nodes, while also noting headroom for peak traffic and failover.

warning

Latency target is stated but not defended against the chosen durability path

Have you considered what happens to the 'ideally under 20ms' target when every write requires WAL flushes and, in strong mode, quorum replication plus disk flush on replicas? The design names a latency goal, but it does not separate read vs write latency targets or explain whether 20ms is expected for eventual mode only, strong mode as well, or under what network/failure conditions.

warning

Availability target is qualitative rather than measurable

What happens when you need to reason about whether the system is meeting its availability goal during coordinator failover, primary promotion, or replica rebuilds? The design says 'high availability' and discusses mechanisms, but without a concrete target or allowed write unavailability window, it is hard to judge whether the failover behavior is acceptable under the stated NFRs.

info

Consistency choice could be tied more explicitly to client-visible guarantees

You could improve this by stating what guarantee each mode gives the client in plain terms — for example, whether read-your-writes, monotonic reads, or latest committed read is expected in each configuration. That would make the availability/consistency trade-off easier to evaluate against the requirement that the model is selected before runtime.

info

Throughput sizing would be stronger with mode-specific assumptions

You could improve this by connecting the 20K ops/sec target to the consistency modes. In eventual mode, one durable copy per write has a very different cost than quorum writes with multiple fsyncs. Right now the throughput number is connected to node count, but not to the selected consistency level that materially changes achievable capacity.

✅ Good

Core KV nouns are identified

For a distributed key-value store, the primary domain objects are indeed the user-facing key and value, with user/client as the actor initiating operations. The candidate kept the model aligned to the stated PUT/GET/DELETE flow instead of introducing unrelated product entities.

✅ Good

Storage ownership relationship is implied through partition mapping

The explanation makes it clear that a key maps to a hash range/partition, and that each partition has one primary plus replicas. Even though this is described in system terms rather than as a formal ER model, it shows the candidate understands that data ownership and replication are relationships that matter for the core flow.

warning

Value is modeled too loosely for updates and reads

Have you considered what happens when the same key is overwritten multiple times or read during failover? With only a bare 'Value' entity, the design never makes the stored record/version explicit. In practice the core entity is closer to a key-record or key-version, because replication, quorum reads, commit index checks, and delete semantics all operate on versions of a value, not just a single timeless value.

warning

Delete path is missing a first-class domain concept

What happens when DELETE(key) races with replication or a stale replica later serves the key? The happy path includes deletes, but the entity model does not capture tombstones or delete markers as part of the stored data model. For a distributed KV store, deletion is usually not 'absence' immediately; it is replicated state that must participate in reads, anti-entropy, and failover.

warning

Relationships between client, key, partition, and replicas are not stated cleanly

Have you considered making the core relationships explicit? Right now they are scattered across the explanation: a client issues operations on a key, a key belongs to one partition/hash range, and a partition is owned by one primary and N replicas. Those relationships are present implicitly, but at Senior level I would expect them to be called out directly so the happy path can be traced without inferring the data model from infrastructure text.

✅ Good

Capacity includes replication overhead

The candidate converts the stated 1 TB logical dataset into 3 TB physical storage using RF=3. That shows the right capacity-planning instinct: infrastructure sizing should account for redundancy, not just logical data size.

✅ Good

Traffic is translated into per-node load

They take the 20K ops/sec assumption and map it onto 5 data nodes to get roughly 4K ops/sec per node. This is the right methodology for checking whether the architecture can scale horizontally as nodes are added.

✅ Good

Peak and failover headroom is acknowledged

They explicitly call out sizing for 2x peak traffic and temporary failover scenarios instead of planning only for steady-state average load. That is an important senior-level capacity consideration.

warning

No throughput breakdown by read/write mix or replication traffic

Have you considered what happens if a large fraction of the 20K ops/sec are writes? With RF=3, each logical write fans out into multiple durable WAL appends and replica network transfers, so backend disk and network load can be much higher than the simple 20K/5 split suggests. You could strengthen this by estimating read vs write mix and then translating writes into replication IOPS and bandwidth.

warning

Storage sizing stops at raw replicated data

Have you considered what happens once WALs, compaction/repair copies, metadata, and rebalancing overhead are included? A system sized only for 3 TB usable replicated data can run out of space during recovery or node replacement. You could improve this by adding operational headroom and temporary duplication during repair/rebalance.

warning

Failover capacity is mentioned but not validated

What happens when one data node fails and the remaining nodes absorb its traffic while also rebuilding replicas? The design says to size for peak and failover, but it never checks whether 4 remaining nodes can handle the shifted request rate plus recovery traffic. You could improve this by doing a simple N-1 capacity check.

info

Component choices are not tied back to the stated scale

You could improve this by explaining why 5 data nodes, WAL-based local disks, and the chosen replication approach are sufficient for 20K ops/sec and 1 TB. The current numbers are directionally fine, but the justification from workload to hardware footprint is still fairly thin.

✅ Good

Core KV operations are covered cleanly

The API exposes the three required operations directly as GET, PUT, and DELETE on /keys/:id, so the basic key-value workflow is usable without extra indirection.

✅ Good

Protocol choice matches the use case

Using simple HTTP-style request/response APIs is appropriate for a basic distributed key-value store. The verbs map naturally to read, upsert, and delete semantics, which keeps the client contract easy to understand.

⭐ Excellent

Explanation addresses retry safety for writes

The candidate explicitly calls out that a client may see no commit response if the primary fails before quorum acknowledgement and that PUT is safe to retry because it is idempotent. That shows awareness of a real distributed-systems API edge case rather than only listing happy-path routes.

warning

Read consistency is configurable but not exposed in the API

Have you considered how a client asks for strong versus eventual reads and writes? The explanation discusses leader reads, quorum reads, and different W/R settings, but the route contract does not show whether this is per-request, per-client, or cluster-wide. Without an explicit API mechanism such as headers or query parameters, clients cannot reliably choose the consistency behavior they need.

warning

Error contract is underspecified for common KV outcomes

What does the client see when a key is missing, a delete targets a non-existent key, the request hits a stale node after failover, or quorum cannot be reached? Without clear status codes and a stable error shape, clients will struggle to distinguish not-found from retryable routing or availability failures.

warning

Routing and topology-refresh behavior is missing from the client-facing API

The design relies on clients caching topology and talking directly to data nodes, but what happens when the client sends a request using stale metadata after rebalancing or primary promotion? The API should define whether the node returns a redirect, an ownership error with the new node/range, or a generic retry response; otherwise failover behavior is ambiguous from the client's perspective.

info

Versioning or conditional writes would strengthen overwrite semantics

You could improve this by exposing an optional version/ETag or compare-and-set style precondition. Right now PUT appears to be blind overwrite, so concurrent writers can silently clobber each other even though the storage layer tracks commit order.

⭐ Excellent

Clear control-plane / data-plane separation

The design keeps the coordinator cluster out of the steady-state read/write path by having clients cache topology and talk directly to data nodes. That is a strong architectural choice because it removes a central bottleneck, lets data traffic continue during coordinator failover, and shows the candidate thought about scaling the hot path separately from metadata management.

⭐ Excellent

Failure handling is tied to replication state

Primary failover is not described as a naive 'pick any replica'; the candidate explicitly uses replication lag / commit index to choose the most up-to-date replica and backs coordinator metadata with Raft. That is a strong design decision because it addresses the real failure mode of promoting a stale replica and losing acknowledged writes.

✅ Good

End-to-end request flow is coherent

The design traces a full path: client fetches topology, routes by consistent hashing to the primary, primary writes to WAL, replicates to replicas, waits based on configured consistency, and then responds. The components are logically connected and the flows cover the required GET/PUT/DELETE operations.

✅ Good

Coordinator is not a single point of failure

Using a coordinator cluster with Raft for membership and partition metadata is a solid availability choice for the control plane. It shows awareness that metadata management itself needs consensus and failover, not just the storage nodes.

warning

Client routing during topology changes is underspecified

Have you considered what happens when a client uses cached partition metadata right after a primary failover or rebalance? The request may go to the old owner and fail or hit a stale replica. You could strengthen the design by defining how nodes reject misrouted requests, return updated ownership/epoch information, and how clients refresh and retry safely.

warning

Rebalancing path could become the first scalability bottleneck

Have you considered what happens when nodes are added, removed, or replaced while the system is serving 20K ops/sec? The coordinator can update ownership metadata, but the design does not spell out how partition data is streamed, throttled, and cut over without saturating disks/network or causing long periods of degraded redundancy. A concrete background rebalancing flow with rate limits and ownership epochs would make this more production-ready.

warning

Health-check cluster appears redundant with coordinator failure detection

What happens when the health-check cluster and coordinator disagree about node liveness? In the diagram both seem involved in node health, but the explanation assigns failure detection to the coordinator/Raft control plane. That split can create conflicting sources of truth and accidental failovers unless one system is clearly authoritative. You could improve this by collapsing health checking into the coordinator path or explicitly defining the contract between them.

info

Caching is mentioned but not placed in the concrete architecture

You call out caching for hot keys, which is sensible, but it is not clear whether this is client-side cache, node-local cache, or a distributed cache tier. Since low latency is a stated goal, you could improve the HLD by showing where hot-read caching lives and how invalidation/versioning works under strong vs eventual consistency.

info

Write path may be latency-bound by synchronous disk flush on every replica

What happens to tail latency when every write requires WAL fsync on the primary and quorum replicas before ack? The design is correct for durability, but at the stated sub-20ms target this can become the limiting factor depending on storage media. You could strengthen the trade-off discussion by calling out batching/group commit or SSD assumptions so the durability model and latency target are reconciled architecturally.

Want this kind of feedback on your own design?

Draw your architecture for Key-Value Store and get an instant hire/no-hire signal from 6 specialized AI reviewers — free to start.

Get your free review See more Key-Value Store designs