Challenge Drills Library Drawing Guide Learn AI Setup Guide Support About

Key-Value Store — system design by AgileViper46

Strong Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

Loading diagram…

# Distributed Key-Value Store Design ## Requirements The system should support: * PUT(key, value) * GET(key) * DELETE(key) ### Non-Functional Requirements * High availability * Horizontal scalability * Configurable replication factor (RF) * Strong consistency or eventual consistency depending on configuration * Durable writes * Automatic failover * Fault tolerance against node failures * Support for hotspot detection and mitigation --- # High-Level Architecture The system consists of two major components: ### 1. Coordinator Cluster (Control Plane) Responsible for: * Cluster membership management * Partition ownership metadata * Consistent hashing ring management * Replica placement decisions * Failure detection * Leader promotion * Rebalancing The coordinator cluster does not store user data. The coordinator cluster maintains metadata such as: ```text Partition -> Primary Node Partition -> Replica Nodes Node Health Replication Factor Hash Ranges ``` To avoid split-brain scenarios, coordinators form a Raft cluster. Only the Raft leader can modify metadata. All coordinator state is replicated through the Raft log. --- ### 2. Data Nodes (Data Plane) Data nodes are responsible for: * Persisting key-value data * Maintaining replication logs * Serving reads and writes * Replicating updates to replicas Each partition has: ```text 1 Primary RF-1 Replicas ``` The replication factor is configurable. Examples: ```text RF = 2 Primary + 1 Replica RF = 3 Primary + 2 Replicas RF = 5 Primary + 4 Replicas ``` The architecture does not assume a fixed replication factor. --- # Data Distribution Keys are distributed using Consistent Hashing. Assume: ```text 5 Data Nodes 4 Virtual Nodes Per Physical Node ``` This creates: ```text 20 Virtual Nodes ``` on the hash ring. For every key: ```text hash(key) ``` is computed. The first virtual node clockwise on the ring becomes the owner of the partition. The coordinator stores the mapping: ```text Hash Range -> Primary Node Hash Range -> Replica Set ``` which is used for request routing. --- # Request Routing Initially, clients fetch topology metadata from the coordinator cluster. The metadata contains: ```text Hash Ranges Primary Ownership Replica Ownership ``` The client caches this information locally. After metadata is obtained: ```text Client -> Data Node ``` direct communication is used. This removes the coordinator from the critical data path. The coordinator remains responsible only for: * topology updates * failover * rebalancing This cleanly separates: ```text Control Plane ``` from ```text Data Plane ``` and allows the system to continue serving traffic even if the coordinator cluster is temporarily unavailable. --- # Durable Writes When a write arrives: ```text PUT(key, value) ``` the request is routed to the partition primary. The primary first appends the operation to its Write-Ahead Log (WAL). The WAL is flushed to disk before proceeding. Only after local durability is guaranteed does replication begin. The write is replicated to the configured replica set. Each replica: 1. Receives the replication entry. 2. Appends it to its own WAL. 3. Flushes the WAL to disk. 4. Sends an acknowledgement. The primary waits according to the configured consistency level. Example: ```text RF = 3 Write Quorum = 2 ``` The write is considered committed only after two replicas have durably persisted the operation. Only then does the primary acknowledge success back to the client. This guarantees durability. If the primary crashes before quorum acknowledgement: ```text No Commit Response ``` is returned. The client can safely retry. Because PUT operations are idempotent, repeated retries produce the same final state. --- # Consistency Models ## Eventual Consistency For low latency: ```text W = 1 R = 1 ``` Writes may be acknowledged after a single durable copy. Reads can be served from any replica. Stale reads are acceptable. --- ## Strong Consistency Strong consistency requires both write and read guarantees. A write is committed only after quorum acknowledgement. Example: ```text RF = 3 W = 2 ``` For reads, arbitrary replica reads are not allowed. The system supports one of the following strategies: ### Leader Reads ```text Read -> Primary ``` The primary always serves the latest committed value. ### Quorum Reads ```text RF = 3 W = 2 R = 2 ``` Since: ```text R + W > N ``` every read quorum overlaps with the latest successful write quorum. This guarantees the latest committed version is observed. ### Lease / Version Validation Replicas maintain commit indexes. A replica can only serve a strong read if it can prove it is caught up to the latest committed version. Otherwise the request is redirected to the primary. Therefore strong consistency is guaranteed through an explicitly enforced read strategy rather than arbitrary replica reads. --- # Failure Handling ## Replica Failure The coordinator continuously monitors: * Primary nodes * Replica nodes including: ```text Node Liveness Replication Lag Last Commit Index ``` If a replica fails: 1. Coordinator marks replica unhealthy. 2. New replica placement is selected. 3. A replacement replica is provisioned. 4. Data is copied from an up-to-date replica. 5. The replica rejoins the replication group. During this period, the system continues operating with reduced redundancy. --- ## Primary Failure If a primary becomes unavailable: 1. Coordinator detects failure. 2. Replication status is inspected. 3. The most up-to-date replica is selected. 4. Replica is promoted to primary. 5. Metadata is updated through Raft. 6. Clients refresh topology information. Temporary write unavailability during failover is acceptable. Read traffic may continue from healthy replicas depending on the configured consistency level. Because replica lag is tracked continuously, stale replicas are never promoted. This prevents acknowledged writes from being lost. --- # Coordinator Failure Coordinators are replicated using Raft. If a coordinator crashes: * In-flight requests handled by that coordinator may fail. * Other coordinators continue serving metadata. * Raft elects a new leader if necessary. Since clients cache topology information and communicate directly with data nodes, normal reads and writes continue even while coordinator failover occurs. --- # Capacity Planning Assume: ```text Logical Data Size = 1 TB Replication Factor = 3 ``` Total storage required: ```text 1 TB × 3 = 3 TB ``` Assume: ```text 20K Operations / Second 5 Data Nodes ``` Average load: ```text 20K / 5 = 4K OPS per node ``` The system should additionally be sized for: ```text 2× peak traffic ``` and for temporary failover scenarios. --- # Hot Key Handling A single logical key cannot be transparently split into: ```text ABC_1 ABC_2 ABC_3 ``` because that breaks normal key-value semantics. Instead, hot-key mitigation depends on workload type. For read-heavy hotspots: * Increase replica count * Load balance reads across replicas * Cache popular keys For write-heavy hotspots such as counters: * Use counter striping * Use logical sharding * Aggregate results asynchronously This preserves normal KV-store behavior while improving scalability. --- # Summary The final design provides: * Consistent hashing partitioning * Configurable replication factor * Durable WAL-based writes * Strong consistency through leader or quorum reads * Eventual consistency through replica reads * Raft-backed coordinator metadata * Automatic failover * Replica lag aware promotions * Client-side metadata caching * Control plane / data plane separation * Hotspot mitigation strategies * Horizontal scalability

Hire SignalHire

Across NFR, API, Entities, and HLD, the design demonstrates strong distributed systems fundamentals and a mostly coherent end-to-end architecture that satisfies the stated requirements. The weaknesses are real but are primarily around explicit contracts and operational detail rather than fundamental architectural misunderstanding.

⭐ Excellent

Consistency model is explicitly tied to read/write behavior

The design does more than name 'strong' and 'eventual' consistency; it explains what changes operationally in each mode using W/R quorum choices, leader reads, and replica-read constraints. That shows the candidate understands what would break if arbitrary replica reads were allowed under strong consistency.

✅ Good

Availability trade-off is acknowledged as configuration-dependent

The NFRs and explanation consistently reflect that availability depends on the selected consistency mode. Calling out temporary write unavailability during failover for stronger guarantees is a realistic trade-off rather than claiming both maximum availability and strong consistency at once.

✅ Good

Capacity numbers are connected to the stated assumptions

The candidate ties the design back to the given scale by sizing for 1 TB logical data, multiplying by replication factor for durable storage, and distributing 20K OPS/sec across 5 nodes with additional headroom for peak traffic and failover. The math is simple but grounded in the stated assumptions rather than floating targets.

warning

Latency target is stated but not justified per consistency mode

You mention 'ideally under 100ms', but what happens to that target when strong consistency requires WAL fsync plus quorum replication across replicas? Without separating latency expectations for eventual vs strong mode, the target is hard to defend and may be missed under normal cross-node or failover conditions. You could improve this by defining mode-specific latency goals, such as a tighter target for eventual reads/writes and a looser but explicit target for quorum-based operations.

warning

Availability is described qualitatively, not as a concrete target

The design says availability depends on the chosen consistency model, but what happens during coordinator leader election, primary failover, or replica rebuilds? Without an explicit availability objective or acceptable failover window, it is difficult to judge whether the design meets the NFRs or whether temporary write unavailability is within bounds. You could improve this by stating concrete expectations like tolerated failover duration or service availability by mode.

info

Scalability numbers stop at average load

The capacity section connects to 20K OPS/sec and 1 TB, which is good, but have you considered whether the 100ms latency goal still holds during the scenarios you explicitly mention, such as 2x peak traffic, node loss, or replica catch-up? Average per-node OPS alone does not show that the NFRs hold under degraded conditions. You could strengthen this by tying the throughput assumption to headroom and degraded-mode performance expectations.

✅ Good

Core storage nouns are identified

The design clearly centers on the key domain concepts needed for the API surface: key and value, with user included as the actor invoking operations. For a key-value store, this captures the basic object being stored and manipulated.

✅ Good

Replication group relationships are described in the walkthrough

Even though the entity list is minimal, the explanation does define important relationships around the happy path: a hash range/partition maps to one primary and RF-1 replicas, and keys are routed to an owning partition. That gives enough relationship structure to understand how stored data is organized across nodes.

warning

Key and value are modeled as separate core entities without a clear relationship

Have you considered whether 'Value' is really an independent entity here? In the happy path, the system stores a key-value record, but the design never states whether this is a 1:1 mapping, whether DELETE removes the whole record, or how overwrites relate to prior values. Modeling the stored object as a single KV entry (or explicitly stating Key -> current Value) would make the domain much clearer.

warning

Partition or shard is missing from the core entity set

What happens when you try to trace the main flow from PUT(key, value) to durable replicated storage? The explanation relies heavily on hash ranges, partition ownership, primaries, and replica sets, but 'partition' is not listed as a core entity. At this scale and with configurable replication/consistency, partition is part of the core data model because keys belong to partitions and partitions own replication relationships.

warning

Replication relationship is only implicit

Have you considered making the relationship between a stored record and its replicas explicit? The walkthrough talks about primary and replica nodes per partition, but the entity section does not connect Key/Value to those replication groups. Without that relationship, it is harder to reason about where a record lives and how strong vs eventual consistency applies to the same logical item.

info

User is not central to the storage domain

You could improve this by focusing the entity list on the storage model rather than the caller. For this problem, entities like KV entry, partition/hash range, and replica group are more central than 'User', since the functional requirements are about storing and replicating data rather than user management.

✅ Good

Includes replicated storage overhead in sizing

The candidate at least converts the 1 TB logical dataset into 3 TB physical storage for RF=3, which shows the right instinct that capacity must account for replication rather than just raw user data.

✅ Good

Translates total OPS into per-node load

Breaking 20K OPS/sec across 5 data nodes to get roughly 4K OPS/sec per node is a reasonable first-order calculation. It creates a usable bridge from the stated workload to infrastructure sizing instead of leaving the node count ungrounded.

✅ Good

Mentions peak and failover headroom

Calling out 2x peak traffic and temporary failover scenarios is the right capacity-planning mindset. It shows awareness that average steady-state numbers are not enough when sizing a distributed storage system.

warning

Write amplification from replication is not carried through the OPS math

Have you considered what happens to node load when writes must be forwarded to replicas and durably flushed on each copy? The 20K OPS/sec is treated like evenly distributed client traffic, but with RF=3 the internal replication traffic and disk work can be materially higher, especially for PUT/DELETE-heavy workloads. You could improve this by separating read QPS from write QPS and estimating client ops versus internal replicated ops.

warning

No storage growth beyond raw replicated bytes

Have you considered what happens once WALs, compaction/LSM overhead, tombstones from DELETE, metadata, and rebalancing copies accumulate? A KV store with durable writes usually needs meaningfully more than logical_data × RF. Without that buffer, the cluster can run out of disk during compaction or node replacement even if the raw dataset fits. You could improve this by adding a storage headroom factor for WAL and compaction overhead plus spare capacity for rebuilds.

warning

Failover capacity is mentioned but not validated numerically

What happens if one of the 5 data nodes fails while the system is already near the stated target? The design says to size for failover, but the math stops at 4K OPS/node average and does not show whether the remaining 4 nodes can absorb traffic and replica rebuild load. You could improve this by checking N-1 operation explicitly, for example showing post-failure per-node OPS and whether disk/network headroom still holds.

info

Bandwidth and disk throughput are missing from the chain

You could strengthen the capacity story by carrying the reasoning one step further from OPS into bytes/sec and fsync pressure. For a durable replicated KV store, network replication bandwidth and storage IOPS/throughput are often the real bottlenecks, not just request count.

info

Component count is only loosely justified

You could improve this by explaining why 5 data nodes is the right starting point for 1 TB and 20K OPS/sec rather than just assuming it. Even a rough statement like 'chosen to keep steady-state utilization below X and survive one node loss' would make the infrastructure choice feel grounded in the workload.

✅ Good

Core KV operations are covered cleanly

The API exposes the required GET, PUT, and DELETE operations directly on the key resource, so the basic functional flow is usable without unnecessary ceremony.

✅ Good

Resource shape is simple and appropriate

Using /keys/:id keeps the surface area small and resource-oriented for a key-value store. For this problem, that is a reasonable REST-style mapping of the primary entity.

⭐ Excellent

Explanation addresses retry safety for writes

The walkthrough explicitly calls out that a client may not receive a commit response if the primary crashes before quorum acknowledgement and that PUT can be safely retried because it is idempotent. That shows good thinking about client-visible behavior during partial failures.

warning

Consistency and replication are required but not exposed in the API contract

The requirements say consistency mode and replication factor are configurable, but the route definitions do not show how a client or operator interacts with that configuration. What endpoint or request parameter tells the system whether a read should be strong vs eventual, or what RF applies? If this is cluster-level static configuration, expose that clearly through an admin/config API or document that these data APIs operate against a preconfigured cluster mode.

warning

GET response shape is too underspecified for strong vs eventual reads

What does the client see on a GET when the system cannot satisfy a strong read during failover or replica lag? Returning only 'Value' leaves out important semantics like whether the read was served strongly, whether it was redirected, or whether it failed because consistency could not be met. A small response/error contract or headers indicating consistency mode and version would make client behavior much clearer.

warning

Error handling is missing for common KV scenarios

What happens when a key does not exist, a write times out before quorum, or a delete races with failover? The API section does not define status codes or error shapes, so clients cannot distinguish retryable failures from permanent ones. At this level, I would expect at least basic behavior such as 404 for missing keys, 409/503-style responses for consistency or quorum failures, and a structured error body with retry guidance.

info

PUT semantics could be clarified for create vs overwrite

Right now PUT /keys/:id with a value implies upsert, which is fine, but clients may still need to know whether the key was newly created or overwritten. You could improve this by documenting the behavior and returning distinct status codes or metadata so clients can reason about conditional retries and idempotent updates.

⭐ Excellent

Clear control-plane and data-plane separation

Using the coordinator cluster only for membership, partition ownership, failover, and metadata while letting clients talk directly to data nodes is a strong architectural choice. It removes the coordinator from the hot path, reduces a central bottleneck, and lets reads/writes continue during coordinator failover as long as topology is cached.

⭐ Excellent

Failure-aware replication and promotion logic

The design goes beyond 'replicate to replicas' and explains how promotions happen: tracking replica lag/commit index, promoting the most up-to-date replica, and persisting coordinator metadata via Raft. That shows good thinking about acknowledged-write safety during primary failure.

✅ Good

End-to-end write path is coherent

The request flow is complete: client fetches topology, routes to partition primary, primary writes WAL, replicas durably persist, quorum acknowledgements determine commit, and then the client gets a response. This satisfies the durability and configurable consistency requirements in a logically connected way.

✅ Good

Client-side topology caching reduces routing overhead

Having clients cache hash-range ownership metadata is a practical scalability decision for 20K OPS/sec. It avoids sending every operation through a proxy/router tier and keeps request routing simple once membership is stable.

warning

Fixed primary-to-replica layout conflicts with configurable replication factor

The diagram shows each data node connected to exactly two replicas, which effectively hardcodes RF=3. Have you considered what happens when the configured replication factor is 2, 4, or 5 as required? The current topology drawing does not show how replica placement generalizes, so the architecture looks more static than the explanation claims. A better HLD would model replica groups or partition replica sets abstractly instead of one fixed pair per node.

warning

Client routing during failover is underspecified

What happens when a primary fails but clients still have stale topology cached and continue sending writes to the old owner? Without a clear redirect, epoch/version check, or retry flow tied to coordinator metadata versions, clients can see write failures or bounce between nodes during failover. The design would be stronger if data nodes rejected requests with a newer topology version hint so clients can refresh deterministically.

warning

Rebalancing and replica rebuild can saturate the cluster

Have you considered what happens when a node dies or a replacement replica is provisioned and large partitions must be copied? At 1 TB total data, background streaming for re-replication and rebalancing can compete with foreground GET/PUT/DELETE traffic and become the first bottleneck under failure. The HLD mentions rebalancing, but not throttling, snapshot transfer, or incremental catch-up mechanisms to keep recovery from overwhelming the serving path.

info

Read-path caching is mentioned but not placed in the architecture

You could improve this by explicitly showing where hot-key caching lives and under which consistency mode it is safe. Since GET is part of the core workload and hot reads are called out in the explanation, the HLD would be stronger if it showed whether caching is client-side, replica-side, or an external cache, and how strong-consistency reads bypass or validate cache entries.

info

Health-check cluster appears separate from coordinator responsibilities

The diagram includes both a coordination cluster and a health-check cluster, but the explanation assigns failure detection to the coordinator. You could improve this by clarifying whether the health-check cluster is a real separate subsystem or just part of the coordinator. As drawn, it risks becoming an orphaned or duplicated control-plane function.

Want this kind of feedback on your own design?

Draw your architecture for Key-Value Store and get an instant hire/no-hire signal from 6 specialized AI reviewers — free to start.

Get your free review See more Key-Value Store designs