Challenge Drills Library Drawing Guide Learn AI Setup Guide Support About

Distributed Cache — system design by AgileViper46

Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

Loading diagram…

Hire SignalLean Hire

The candidate demonstrates a good senior-level foundation with a workable distributed cache architecture and sensible core data structures, but the design stops short of being fully production-ready at the stated scale. The biggest concerns are around failure semantics, routing authority during reconfiguration, and incomplete capacity/API rigor, which are important gaps for a senior system design interview.

✅ Good

Concrete latency and availability targets

The NFRs name explicit goals for availability (>99.99%) and cache-side lookup latency (p99 < 1ms), which is much better than vague statements like 'highly available' or 'low latency.' These targets are actionable and can drive design trade-offs.

✅ Good

Consistency model is explicitly chosen and justified

Calling out eventual consistency with a leader-replica model and stating that the cache is not the source of truth shows awareness of the trade-off: lower latency and simpler replication at the cost of stale reads and some durability loss.

✅ Good

Scale target ties back to stated assumptions

The scalability requirement explicitly references the assumed 100k RPS workload instead of introducing unrelated numbers, which keeps the NFRs grounded in the problem statement.

warning

Availability target is not connected to failure scenarios

Have you considered what happens when the primary cache node fails or a replica is lagging? Stating >99.99% availability is good, but without clarifying whether reads can continue from replicas, how failover affects SLA, or whether writes are rejected during leader election, the availability target is not yet defensible.

warning

Eventual consistency impact on TTL and deletes is not called out

Have you considered what happens if a client deletes a key or its TTL expires on the leader, but a replica still serves the old value? For a cache, eventual consistency may be acceptable, but TTL and delete semantics are especially visible to clients, so the design should state whether stale reads after expiry/deletion are acceptable and for how long.

warning

Latency target is isolated from workload shape

You set p99 < 1ms from the cache perspective, but have you considered whether that target applies equally to GET, SET, DELETE, TTL updates, and LRU eviction under load? The number is concrete, but it is not yet tied to the 1 TB dataset, replication overhead, or eviction/expiration work that can affect tail latency.

info

Scalability target could be made more defensible with read/write assumptions

You could improve this by connecting the 100k RPS target to an expected read/write mix and replication behavior. For example, if writes are a small fraction, eventual consistency is easier to justify; if writes are heavy, replica lag and leader pressure become much more important to the NFR story.

warning

TTL is modeled as a standalone noun instead of part of the key entry

Have you considered what happens when a key is updated, deleted, or expires? With only separate nouns 'Key', 'Value', and 'TTL', the core flow is underspecified: TTL is really metadata on a key-value entry, not an independent domain object in this design. A clearer entity model would be a cache entry/item that owns key, value, expiration metadata, and last-access state so set/get/delete/expire all operate on one consistent record.

warning

LRU-related state is missing from the entity model

What happens when the cache reaches capacity and needs to evict the least recently used item? The functional requirements explicitly require LRU eviction, but the entities do not include any concept of recency or an eviction-tracked entry. Without modeling access-order state as part of the stored item or eviction structure, the happy path is incomplete because the system cannot determine which key to evict.

info

Relationships are only partially defined

You could improve this by making the relationships explicit: a cache entry has exactly one key and one value, and optionally one expiration timestamp/TTL; the eviction policy tracks many entries in access order. Right now '1 key: 1 value, TTL? (optional)' is a start, but it does not fully describe how expiration and eviction connect to stored data.

warning

Capacity math stops at RAM and per-node RPS

Have you considered the full chain from the stated assumptions to infrastructure sizing? Right now the calculation jumps from 1 TB to node count and then divides 100k RPS across primaries, but it does not show how key/value size, metadata overhead, TTL bookkeeping, LRU tracking, replication traffic, or headroom affect the usable capacity. For an in-memory store, those overheads can materially change whether 20 primary nodes is enough. You could improve this by showing a simple end-to-end estimate: raw data size -> in-memory overhead -> effective usable memory per node -> primary shard count -> replicas -> network/load headroom.

warning

No read/write mix or replication impact in the throughput estimate

What happens if the 100k RPS includes a meaningful write rate? The current 5k RPS/node assumes an even split across 20 primaries, but writes also generate replica traffic and can be more expensive because TTL and LRU metadata must be updated. Without separating reads from writes, the per-node load may be understated. You could improve this by estimating a plausible read/write mix and then translating that into primary request load plus replication load.

info

Missing bandwidth and rebalance considerations

You could improve this by checking whether the architecture still fits during replication, failover, and node replacement. For example, what happens when a node fails and its replica is promoted while data is re-replicated elsewhere? The steady-state node count may be fine, but recovery traffic and temporary hot spots can dominate capacity planning for a 1 TB in-memory system.

info

Component sizing is not justified beyond a single server template

You could improve this by explaining why 64 GB / 50 GB usable is the right node shape for this scale. The current answer gives one server size but does not connect that choice to operational goals like headroom for eviction behavior, failover tolerance, or shard count. Even a brief rationale such as 'smaller nodes reduce rebalance blast radius' or 'larger nodes reduce shard count at this load' would make the capacity plan feel more deliberate.

warning

Set API does not clearly support the TTL requirement

The functional requirement says TTL is configurable, but the write API only shows an absolute `expiry` timestamp. Have you considered what the client does when it wants a relative TTL like 'expire in 5 minutes'? Requiring clients to compute absolute expiry times introduces clock-skew issues across clients and servers. You could improve this by accepting a TTL field such as `ttlSeconds` and defining whether the server also returns the computed expiry.

warning

Write semantics are ambiguous for create vs update

What happens when `POST /keys/:id` is called for an existing key? The route lists both `201 Created` and `200 OK`, but the behavior is not defined. At 100k RPS, clients need predictable semantics for retries and overwrites. You could improve this by making the operation explicit: either use `PUT /keys/{id}` for upsert, or document whether duplicate creates overwrite, fail with `409`, or refresh TTL.

warning

GET response shape is underspecified

What exactly does the client receive from `GET /keys/:id`? Returning just 'Value' is not enough to make the API reliably usable. Have you considered whether the client also needs metadata such as key, expiry/remaining TTL, or whether the key was already expired? A defined JSON response shape makes versioning and error handling much cleaner.

warning

Error handling is incomplete

What happens when the client sends an invalid body, an invalid datetime, an oversized value, or requests a missing/expired key? The API only lists a few success codes plus `404` for delete. Without consistent status codes and an error body shape, clients cannot distinguish retryable failures from bad requests. You could improve this by defining `400`/`422` for validation errors, `404` for missing or expired keys, `5xx` for server failures, and a standard error payload.

info

HTTP verb choice could be cleaner

Have you considered using `PUT /keys/{id}` instead of `POST /keys/:id`? Since the client supplies the key identifier and the operation is effectively create-or-replace, `PUT` communicates idempotent semantics more clearly and makes retries safer after timeouts.

info

Delete behavior for repeated requests is not fully defined

What happens when a client retries `DELETE /keys/:id` after a timeout and the first delete may already have succeeded? Returning `404` is acceptable, but the contract should explicitly say whether delete is treated as idempotent and how clients should interpret repeated deletes for missing or already-expired keys.

info

No API surface for observing expiration state

TTL is part of the functional requirements, but the read path does not indicate whether clients can inspect expiry metadata. You could improve this by returning expiry or remaining TTL in `GET` responses, or by documenting that TTL is write-only if that is the intended contract.

✅ Good

End-to-end sharded cache flow is defined

The design does work end-to-end for the stated requirements: the client fetches slot metadata, hashes the key, routes directly to the owning primary, and the node handles GET/SET/DELETE with TTL checks and LRU eviction locally. That is a coherent flow for a distributed in-memory KV cache.

✅ Good

Local data structure choice supports low-latency LRU operations

Using a hash map plus doubly linked list is the standard O(1) approach for GET updates and LRU eviction. The candidate also explicitly handles the expired-but-not-yet-cleaned case on reads, which shows awareness of the interaction between TTL and periodic cleanup.

✅ Good

Client-side partition routing reduces an extra hop

Having the client hash keys and talk directly to the owning shard is a sensible choice for the stated low-latency goal. The redirect-on-wrong-node behavior and proactive slot refresh are also practical touches for cluster reconfiguration.

✅ Good

Failure handling is considered at the shard level

The design does not stop at a single-node cache: it includes primary-replica replication, replica promotion, and slot reassignment when a shard is lost. That is the right architectural direction for availability at this scale.

warning

Replication semantics are too vague for failover correctness

What happens when a primary acknowledges a write to the client and crashes before the replica has applied it? The design says the primary 'commits' and then sends on the replication stream, which implies acknowledged writes can be lost on failover. Since eventual consistency is allowed this trade-off may be acceptable, but the HLD should explicitly define whether writes are acked before or after replication and what stale/lost-write behavior clients should expect during promotion.

warning

Cluster membership and slot movement during failures can cause inconsistent routing

Have you considered what happens when gossip has not converged yet and different clients or nodes disagree on slot ownership? During primary loss or complete node failover, some clients may still route to the old owner while others route to the new one, which can lead to temporary write divergence or failed requests. You could improve this by defining a clearer source of truth for slot ownership during failover, or at least a versioned config/epoch so stale primaries cannot continue serving writes.

warning

Single-threaded shard nodes are an obvious throughput bottleneck risk

At the stated scale, what happens if traffic is skewed and one hot shard receives far more than the average 5k RPS? A single-threaded in-memory node may be fine on average, but hot keys or uneven slot distribution can make one shard the first thing to saturate while the rest of the cluster is idle. You could improve this by calling out hot-key mitigation, virtual shards, or a strategy for rebalancing overloaded slots.

info

Cleanup jobs are modeled as separate workers but their ownership is unclear

The design shows multiple cleanup workers, but it is not fully clear whether each worker is colocated with one shard, whether they are external processes, and how they avoid interfering with normal request handling. Since expiration cleanup touches the same in-memory structures as GET/SET/LRU updates, you could strengthen the design by explaining concurrency/serialization with the cache node and whether cleanup is incremental to avoid latency spikes.

info

Replica read/write role is not fully constrained after failover

What prevents an old primary from coming back and serving stale traffic after a replica has already been promoted? The design mentions promotion and rejoin sync, but not how split-brain is avoided. You could improve this by defining leader epochs or fencing so only one primary for a slot can accept writes at a time.

info

Applications service appears mostly bypassed in the main data path

The main request flow is client directly to cache node, while the Applications box is only connected to the client and not clearly part of the KV operations. That makes it feel partially orphaned in the HLD. You could improve this by clarifying whether Applications is just a sample consumer of the cache or whether it has an actual control-plane role.

Want this kind of feedback on your own design?

Draw your architecture for Distributed Cache and get an instant hire/no-hire signal from 6 specialized AI reviewers — free to start.

Get your free review See more Distributed Cache designs