DrawLint.ai

Rate Limiter — system design by AgileViper46


Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.


Hire Signal: Lean Hire

The candidate demonstrates good architectural instincts and chose sensible core building blocks for the problem, but the design lacks enough depth on failure modes, consistency boundaries, and capacity reasoning to be a confident senior-level hire. This is above average and workable, but not yet fully convincing at the expected level of scalability and operational completeness.

✅ Good

Core quality attributes are explicitly identified

The section clearly calls out availability, latency, scalability, and the fail-open/fail-close behavior. For a rate limiter, these are the right non-functional dimensions to surface because they directly affect whether the limiter protects downstream systems without becoming a bottleneck itself.

✅ Good

Availability versus consistency trade-off is stated

Saying availability matters more than consistency is a sensible starting point for a distributed rate limiter. It shows awareness that occasional limit inaccuracies may be preferable to turning the limiter into a single point of failure.

✅ Good

Targets are tied to the stated scale assumptions

The NFRs reference the given workload of 50K RPS and 100K unique clients rather than using abstract numbers in isolation. That is the right way to frame non-functional goals in an interview.

warning

Consistency model is not defined beyond a preference for availability

Have you considered what happens when rate-limit state is replicated or partitioned across nodes? Saying 'availability >> consistency' is not enough by itself: is the limiter allowed to over-admit briefly, under-admit briefly, or must decisions for a single user be linearizable? Without stating the acceptable inconsistency window, the runtime behavior under node failure or cross-node races is ambiguous.

warning

Latency target is not broken down by decision path

Have you considered what happens when the request needs both a limit check and a runtime config lookup for an admin-updated algorithm? A p99 < 100ms target is reasonable, but it is too coarse unless you define whether it covers the end-to-end latency added by the limiter or only the internal decision time, and whether the target still holds during config propagation or backend degradation.

warning

Availability target is not connected to failure scenarios

What happens when the backing store for counters or configuration is unavailable? You mention 99.99% availability and configurable fail-open/fail-close, but the NFRs do not spell out which mode applies in which scenario or what error budget is acceptable for each. For example, fail-open protects availability but can violate enforcement; fail-close preserves enforcement but can reject healthy traffic.

info

Scalability target would be stronger with per-key skew assumptions

You could improve this by stating whether the 50K RPS is evenly spread across the 100K clients or whether hot keys are expected. For a rate limiter, skew matters more than total RPS because a few abusive clients can create concentrated write contention and change the consistency and latency requirements.

info

Runtime configurability needs an NFR around propagation delay

You could improve this by defining how quickly admin changes must take effect system-wide. Since changing the algorithm and logic at runtime is a functional requirement, the non-functional side should say whether updates must be visible immediately, within seconds, or eventually, because that choice directly drives the acceptable consistency model for configuration.

✅ Good

Core nouns for rate limiting are identified

The design names the main domain concepts the system revolves around: Client, Requests, Rules, and Limits. For this problem, that covers the basic rate-limiting flow of identifying a caller, matching applicable policy, and tracking usage against a limit.

✅ Good

Client abstraction supports multiple identity types

Modeling Client with a type such as API key, IP, userId, or session token is a solid choice because it keeps the domain flexible as different rate-limit dimensions are introduced without redefining the core entity.

warning

Relationships between rules, clients, and counters are underspecified

Have you considered how a request resolves from Client to the applicable Rule and then to the specific Limit being consumed? Right now it is unclear whether Limits are per client, per client+API, or per client+rule. Without that relationship, the system can apply the wrong quota or merge unrelated traffic into the same bucket.

warning

Requests is too vague as a core entity for the happy path

What happens when the same client calls multiple APIs with different policies? A generic Requests entity does not show the domain key used for enforcement, such as an API/resource identifier tied to the request. Without explicitly connecting request context to Rules, per-API rate limiting from the requirements is not fully represented.

info

Separate policy definition from runtime usage state

You could improve this by making explicit the distinction between Rule as admin-configured policy and Limit or Counter as runtime state. That makes the update path clearer when admins change algorithms or thresholds at runtime, and avoids conflating configuration with the mutable usage bucket.
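
As a rough illustration of that split, the sketch below models the policy as an immutable record and the usage state as a separate mutable one, joined only by a counter key. The field names (algorithm, limit, window_sec) and the key layout are assumptions for illustration, not part of the submitted design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Admin-configured policy. Immutable; replaced as a whole on update."""
    rule_id: str
    algorithm: str   # hypothetical values, e.g. "fixed_window"
    limit: int
    window_sec: int


@dataclass
class Counter:
    """Runtime usage state. Mutable; lives in the counter store."""
    key: str
    count: int = 0
    window_start: int = 0


def counter_key(rule: Rule, client_type: str, client_id: str) -> str:
    # Assumed scoping: one bucket per (rule, client). This is the relationship
    # the design would need to state explicitly.
    return f"rl:{rule.rule_id}:{client_type}:{client_id}"


rule = Rule("search-api", "fixed_window", limit=100, window_sec=60)
counter = Counter(key=counter_key(rule, "api_key", "abc"))
print(counter.key)
```

Scoping the key by rule and client also answers the earlier question about whether limits are per client or per client+rule.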

warning

Capacity math stops at a single Redis memory estimate

Have you considered the full chain from the stated assumptions to infrastructure sizing? You estimated Redis memory for counters, but there is no back-of-envelope path from 50K RPS and 100K unique clients to expected reads/writes per request, network throughput, peak concurrency, or how many application instances are needed. At senior level, I would expect at least a rough DAU/client -> request rate -> Redis ops/sec -> memory/bandwidth chain.
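
A chain like that can be only a few lines of arithmetic. The sketch below starts from the stated 50K RPS and 100K clients; the peak multiplier, the number of Redis commands per request, and the bytes per round trip are illustrative assumptions, not figures from the design.

```python
# Back-of-envelope sizing. Only AVG_RPS and UNIQUE_CLIENTS come from the design.
AVG_RPS = 50_000
UNIQUE_CLIENTS = 100_000
PEAK_MULTIPLIER = 3          # assumed peak-to-average ratio
REDIS_CMDS_PER_REQUEST = 1   # assumed: one script call per decision
BYTES_PER_ROUND_TRIP = 200   # assumed request + reply size on the wire


def size_workload(avg_rps, peak_multiplier, cmds_per_request, bytes_per_round_trip):
    """Walk request rate -> Redis ops/sec -> network bandwidth."""
    peak_rps = avg_rps * peak_multiplier
    redis_ops_per_sec = peak_rps * cmds_per_request
    bandwidth_mbps = redis_ops_per_sec * bytes_per_round_trip * 8 / 1_000_000
    return {
        "peak_rps": peak_rps,
        "redis_ops_per_sec": redis_ops_per_sec,
        "redis_bandwidth_mbps": bandwidth_mbps,
    }


estimate = size_workload(
    AVG_RPS, PEAK_MULTIPLIER, REDIS_CMDS_PER_REQUEST, BYTES_PER_ROUND_TRIP
)
avg_rps_per_client = AVG_RPS / UNIQUE_CLIENTS
print(estimate, avg_rps_per_client)
```

The exact numbers matter less than showing which assumption each infrastructure figure depends on.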

warning

Redis sizing is not justified against peak load

What happens when traffic spikes above the average 50K RPS or when rate limiting requires multiple Redis operations per request? Saying '2 Redis servers at 25K RPS each' is not enough to show the system is comfortable at this load, because there is no reasoning about per-node throughput, headroom, replication overhead, or whether the chosen algorithm needs 1, 2, or more commands per request. Without that, the node count feels arbitrary.

warning

No capacity impact for runtime config changes

Have you considered what happens when admins change rate-limit logic or algorithm at runtime? That requirement can introduce config fanout, cache invalidation, and potentially a surge of misses or recomputation across the fleet. The capacity section does not estimate how often config is read, whether it is cached, or what load hits the backing store when rules change.

info

Memory estimate needs connection to algorithm choice

You could improve this by tying the 10 counters per user assumption to the actual rate-limiting algorithm and API shape. Different algorithms have very different storage footprints: fixed window may need one counter, sliding window log may need many timestamps, token bucket may need token state plus refill metadata. Right now the memory estimate is plausible, but it is not justified by the chosen approach.
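
To show how much the algorithm changes the estimate, here is a hedged sketch that keeps the design's 10 counters per user but varies the per-key footprint. The byte sizes and the per-window limit of 100 are rough assumptions for illustration, not measured Redis figures.

```python
UNIQUE_CLIENTS = 100_000
KEYS_PER_CLIENT = 10       # the design's "10 counters per user" assumption
KEY_OVERHEAD_BYTES = 100   # assumed per-key overhead (key name + bookkeeping)


def bytes_per_key(algorithm, limit_per_window=100):
    if algorithm == "fixed_window":
        return KEY_OVERHEAD_BYTES + 8                     # one integer counter
    if algorithm == "token_bucket":
        return KEY_OVERHEAD_BYTES + 16                    # tokens + last refill time
    if algorithm == "sliding_window_log":
        # Worst case: one 8-byte timestamp per admitted request in the window.
        return KEY_OVERHEAD_BYTES + 8 * limit_per_window
    raise ValueError(f"unknown algorithm: {algorithm}")


def total_mb(algorithm, limit_per_window=100):
    keys = UNIQUE_CLIENTS * KEYS_PER_CLIENT
    return keys * bytes_per_key(algorithm, limit_per_window) / 1_000_000


for name in ("fixed_window", "token_bucket", "sliding_window_log"):
    print(name, total_mb(name), "MB")
```

Under these assumptions the sliding window log needs several times the memory of the other two, which is why the estimate should name the algorithm.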

info

Add headroom and failure-scenario sizing

You could strengthen this by asking: what happens if one Redis node is unavailable or traffic becomes uneven? With only '2 Redis servers at 25K RPS each,' there is no explanation of failover capacity. A stronger answer would show that the remaining capacity can absorb a node loss or that there is enough buffer to handle bursts without immediately saturating Redis.
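
A failover check of this kind could look like the sketch below. The per-node capacity of 100K ops/sec and the 50% utilization target are assumptions chosen for illustration, not benchmarks.

```python
import math


def utilization_after_node_loss(total_ops, node_count, per_node_capacity, failed=1):
    """Fraction of surviving capacity in use once `failed` nodes are gone."""
    surviving = node_count - failed
    if surviving <= 0:
        return float("inf")
    return total_ops / (surviving * per_node_capacity)


def min_nodes(total_ops, per_node_capacity, target_utilization=0.5, tolerated_failures=1):
    """Nodes needed to stay under the target even after the tolerated failures."""
    needed = math.ceil(total_ops / (per_node_capacity * target_utilization))
    return needed + tolerated_failures


PER_NODE_CAPACITY = 100_000  # assumed ops/sec a single node can sustain

print(utilization_after_node_loss(50_000, 2, PER_NODE_CAPACITY))
print(min_nodes(50_000, PER_NODE_CAPACITY))
print(min_nodes(150_000, PER_NODE_CAPACITY))  # assumed 3x peak
```

With these assumed numbers, two nodes survive one loss at average load, but an assumed 3x peak would call for more.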

✅ Good

Admin rule management covers runtime updates

The admin APIs include create, read, update, and delete for rate-limit rules, which directly supports the requirement that admins can change rate-limit logic at runtime.

✅ Good

Standard rate-limit headers are explicitly surfaced

The rateLimit response calls out X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After, which makes the client contract around throttling behavior clear and aligns with the functional requirement.

warning

gRPC response mixes HTTP semantics without a clear contract

Have you considered what the client actually receives on the gRPC rateLimit call? The design says the RPC returns '200 ok or 429' plus headers, but in gRPC those are not modeled the same way as REST responses. Without defining whether throttling is represented as a normal response payload, gRPC status, or response metadata/trailers, different clients may handle denials inconsistently. You could improve this by defining an explicit protobuf response shape such as {allowed, limit, remaining, resetAt, retryAfterSec} and, if needed, mapping that to HTTP headers only at an API gateway boundary.
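
One possible shape for that contract is sketched below as a plain Python record standing in for the protobuf message, with the HTTP mapping done in a separate function that would live at the gateway. The field names follow the suggestion above; the function name to_http is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitDecision:
    """Transport-neutral decision, standing in for a protobuf response."""
    allowed: bool
    limit: int
    remaining: int
    reset_at: int          # Unix seconds when the current window resets
    retry_after_sec: int   # 0 when the request is allowed


def to_http(decision: RateLimitDecision):
    """Gateway-side mapping from the decision to an HTTP status and headers."""
    status = 200 if decision.allowed else 429
    headers = {
        "X-RateLimit-Limit": str(decision.limit),
        "X-RateLimit-Remaining": str(decision.remaining),
        "X-RateLimit-Reset": str(decision.reset_at),
    }
    if not decision.allowed:
        headers["Retry-After"] = str(decision.retry_after_sec)
    return status, headers


denied = RateLimitDecision(False, 100, 0, 1_700_000_060, 30)
print(to_http(denied))
```

Keeping the decision in the payload means a denial is a successful RPC with allowed set to false, so gRPC clients do not have to interpret HTTP status codes.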

warning

Core decision API is underspecified for rule selection

Have you considered how the rateLimit API determines which rule applies when multiple dimensions matter? The request only says clientId and requestInfo(url, endpoint etc), but the rule object includes method and path. Without a precise request schema for method, normalized path or route key, and possibly API identifier, the service contract is ambiguous and clients may send inconsistent values that lead to incorrect enforcement. You could improve this by defining exact request fields and matching semantics.
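
For example, exact fields and matching semantics might look like the sketch below. It assumes exact matching on method and normalized path; the normalization rules (drop the query string and any trailing slash) are an illustrative choice, and the design may want route templates or prefix matching instead.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    method: str
    path: str


@dataclass(frozen=True)
class RateLimitRequest:
    client_id: str
    method: str
    path: str


def normalize_path(path: str) -> str:
    """Drop the query string and any trailing slash so callers agree on the key."""
    path = path.split("?", 1)[0]
    return path.rstrip("/") or "/"


def match_rule(request: RateLimitRequest, rules: Iterable[Rule]) -> Optional[Rule]:
    """Return the first rule whose method and normalized path match exactly."""
    key = (request.method.upper(), normalize_path(request.path))
    for rule in rules:
        if (rule.method.upper(), normalize_path(rule.path)) == key:
            return rule
    return None


rules = [Rule("search-api", "GET", "/search")]
hit = match_rule(RateLimitRequest("client-1", "get", "/search/?q=x"), rules)
print(hit)
```

Normalizing on the server side means two clients that send slightly different paths for the same route still consume the same bucket.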

warning

No clear error contract for admin APIs

What happens when an admin submits an invalid rule, references a missing rule-id, or tries to switch to an unsupported algorithm? The routes are listed, but there is no status-code or error-body contract. Without this, clients cannot reliably distinguish validation failures from transient server errors or know whether a retry is safe. You could improve this by specifying standard responses like 400 for invalid rule definitions, 404 for unknown rule IDs, 409 for conflicting updates, and a consistent error payload.
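
A minimal sketch of such a contract for the update route follows. The error codes, the supported algorithm names, and the version field used to detect conflicting updates are all assumptions introduced for illustration.

```python
SUPPORTED_ALGORITHMS = {"fixed_window", "sliding_window", "token_bucket"}  # assumed


def error(status, code, message):
    """One error payload shape for every failure."""
    return status, {"error": {"code": code, "message": message}}


def update_rule(store, rule_id, body, expected_version):
    """Handle PUT /rules/{rule-id} against an in-memory dict of rules."""
    if rule_id not in store:
        return error(404, "RULE_NOT_FOUND", f"no rule with id {rule_id!r}")
    if body.get("algorithm") not in SUPPORTED_ALGORITHMS:
        return error(400, "UNSUPPORTED_ALGORITHM", "algorithm is not supported")
    limit = body.get("limit")
    if type(limit) is not int or limit <= 0:
        return error(400, "INVALID_LIMIT", "limit must be a positive integer")
    if store[rule_id]["version"] != expected_version:
        # Another admin changed the rule first; the caller should re-read it.
        return error(409, "VERSION_CONFLICT", "rule was modified concurrently")
    store[rule_id] = {**body, "version": expected_version + 1}
    return 200, store[rule_id]


store = {"search-api": {"algorithm": "fixed_window", "limit": 100, "version": 1}}
print(update_rule(store, "search-api", {"algorithm": "token_bucket", "limit": 50}, 1))
```

With this shape a client can treat 4xx responses as non-retryable without parsing free-form messages.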

info

List/read APIs are incomplete for operational use

How does an admin discover existing rules or audit current configuration at runtime? You have GET /rules/{rule-id}, but no GET /rules collection endpoint. Since runtime rule management is a functional requirement, operators will likely need to list rules, filter by path or algorithm, and page through results if the rule set grows. You could improve this by adding GET /rules with pagination and filtering.
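
A collection endpoint along those lines could be backed by something like the sketch below. It assumes cursor pagination keyed on ruleId; the parameter names and the nextCursor field are hypothetical.

```python
def list_rules(rules, algorithm=None, path_prefix=None, page_size=50, cursor=None):
    """Filter, sort by ruleId, and return one page plus a cursor for the next."""
    matching = sorted(
        (
            r for r in rules
            if (algorithm is None or r["algorithm"] == algorithm)
            and (path_prefix is None or r["path"].startswith(path_prefix))
        ),
        key=lambda r: r["ruleId"],
    )
    if cursor is not None:
        # Resume strictly after the last ruleId the caller has already seen.
        matching = [r for r in matching if r["ruleId"] > cursor]
    page = matching[:page_size]
    next_cursor = page[-1]["ruleId"] if len(matching) > page_size else None
    return {"items": page, "nextCursor": next_cursor}


rules = [
    {"ruleId": "search-api", "algorithm": "fixed_window", "path": "/search"},
    {"ruleId": "login-api", "algorithm": "token_bucket", "path": "/login"},
    {"ruleId": "orders-api", "algorithm": "fixed_window", "path": "/orders"},
]
first = list_rules(rules, page_size=2)
print([r["ruleId"] for r in first["items"]], first["nextCursor"])
```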

info

Resource naming is slightly inconsistent

Have you considered separating the resource identifier from the business key? The route uses /rules/{rule-id}, while the example object also contains ruleId='search-api' plus method/path fields. If ruleId is really the primary identifier, that is fine, but if rules are naturally keyed by method+path, the API should make that explicit. Tightening this contract would reduce ambiguity around updates and deletes.

info

Delete route appears malformed

The admin API lists 'DELET /rules/{rule-id}', which looks like a typo. If this is just notation, no issue, but in an API review I would push for precise verb definitions because clients and generated SDKs depend on them. You could improve this by explicitly defining DELETE semantics and expected responses such as 204 on success.

⭐ Excellent

Hot path separates config from counters

Using etcd for rule distribution and Redis for request-time counter evaluation is a strong design choice. It keeps the hot path off the config store, matches the availability and latency goals, and shows good understanding that configuration reads and quota mutations have very different access patterns.

✅ Good

Atomic quota evaluation in Redis

Running the rate-limit algorithm through Redis Lua scripts is a good way to keep counter updates and allow/deny decisions atomic. That avoids race conditions at 50K RPS where multiple requests for the same client could otherwise overshoot the limit.
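
The property the script buys is that refill, check, and decrement happen as one indivisible step. The sketch below illustrates that same logic as an in-process token bucket guarded by a lock. It is an analogue for illustration only, not the candidate's Lua script, and the class and parameter names are hypothetical.

```python
import threading


class TokenBucket:
    """In-process analogue of an atomic check-and-update on one counter key."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last_refill = now
        self._lock = threading.Lock()

    def allow(self, now, cost=1):
        # Everything inside the lock corresponds to what would run as one script:
        # no other request can observe the bucket between refill and decrement.
        with self._lock:
            elapsed = max(0.0, now - self.last_refill)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
            self.last_refill = max(self.last_refill, now)
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False


bucket = TokenBucket(capacity=2, refill_per_sec=1.0, now=0.0)
print([bucket.allow(now=0.0) for _ in range(3)])
```

If the read and the decrement were separate steps, two concurrent requests could both see one remaining token and both be admitted.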

✅ Good

Runtime rule updates propagated by watch

The admin flow through Admin service -> etcd -> watcher updates in the rate limiter is a good fit for the requirement that admins can change logic at runtime. It avoids polling and keeps rule changes reasonably fresh across instances.

warning

Redis appears to be the primary bottleneck and possible SPOF

What happens when one Redis node fails or gets overloaded? The design says two Redis servers serving ~25K RPS each, but it does not explain sharding, replication, or failover behavior. Without a clear partitioning and redundancy model, one hot shard or one node loss could either drop capacity sharply or make rate-limit decisions unavailable for part of the traffic.

warning

Rule changes may become inconsistent across rate limiter instances

Have you considered what happens if one rate limiter instance misses an etcd watch event, restarts, or lags behind others during a rule update? Some instances could enforce old limits while others enforce new ones. For a runtime-configurable limiter, you would want a clear resync/versioning strategy so instances can detect stale local state and reload rules safely.
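
One way to sketch such a strategy is a local cache that tracks a rule-set sequence number and reloads a full snapshot when it sees a gap. The sketch assumes a hypothetical sequence number that increases by exactly one per rule change; a store-wide revision counter may not be contiguous for a single key prefix, so it could not be used for gap detection as-is. The load_snapshot callable is also hypothetical.

```python
class RuleCache:
    """Local rule cache that detects missed updates and resyncs."""

    def __init__(self, load_snapshot):
        # load_snapshot: hypothetical callable returning (sequence, rules_dict).
        self._load_snapshot = load_snapshot
        self.sequence, self.rules = load_snapshot()

    def apply_event(self, sequence, rule_id, rule):
        """Apply one watch event; `rule` is None for a delete."""
        if sequence <= self.sequence:
            return "ignored"      # duplicate or already-applied event
        if sequence != self.sequence + 1:
            # Gap: at least one event was missed, so reload everything.
            self.sequence, self.rules = self._load_snapshot()
            return "resynced"
        if rule is None:
            self.rules.pop(rule_id, None)
        else:
            self.rules[rule_id] = rule
        self.sequence = sequence
        return "applied"


source = {"sequence": 5, "rules": {"search-api": {"limit": 100}}}
cache = RuleCache(lambda: (source["sequence"], dict(source["rules"])))
print(cache.apply_event(6, "login-api", {"limit": 10}))
```

The same snapshot load covers restarts, since a fresh instance starts from the current sequence rather than replaying events.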

warning

Client metadata lookup path is under-specified for the hot path

What happens when the rate limiter needs client properties like free/premium and the Client Metadata Cache misses? The diagram implies a read-through/write-through path to Postgres, but if request-time decisions fall back to Postgres, latency and availability could degrade quickly under load. This path needs a clear strategy for cache warmup, TTLs, and behavior on metadata-store failures.
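
A strategy for that path might resemble the sketch below: a TTL cache that serves a stale entry, or a default tier, when the metadata store fails instead of blocking the request. The TTL value, the default tier of "free", and the fetch callable are assumptions for illustration.

```python
class ClientTierCache:
    """TTL cache for client tier with a defined fallback on store failure."""

    def __init__(self, fetch, ttl_sec=300, default_tier="free"):
        self._fetch = fetch            # hypothetical loader, e.g. a Postgres query
        self.ttl_sec = ttl_sec
        self.default_tier = default_tier
        self._entries = {}             # client_id -> (tier, fetched_at)

    def get(self, client_id, now):
        entry = self._entries.get(client_id)
        if entry is not None and now - entry[1] < self.ttl_sec:
            return entry[0]            # fresh hit: no store access on the hot path
        try:
            tier = self._fetch(client_id)
        except Exception:
            # Store unavailable: prefer stale data, else the most restrictive tier.
            return entry[0] if entry is not None else self.default_tier
        self._entries[client_id] = (tier, now)
        return tier


cache = ClientTierCache(lambda client_id: "premium", ttl_sec=60)
print(cache.get("client-1", now=0))
```

Whether the fallback should be the most restrictive tier or the most permissive one is the same fail-open versus fail-close decision, applied to metadata.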

info

In-memory cache is not tied to a concrete request flow

You could improve this by being explicit about what the local in-memory cache stores and how it is invalidated. Right now it seems intended for rules, but the flow does not show whether it is authoritative for rule lookup, whether it caches client metadata too, or how stale entries are handled after admin updates.

warning

Fail-open and fail-close behavior is mentioned but not fully designed

Have you considered what happens when Redis is partially degraded, timing out, or returning intermittent errors? The design says the gateway or rate limiter can take the default call, but without clear timeout budgets and fallback ownership, requests may hang or different instances may make inconsistent decisions. At 99.99% availability, the failure path needs to be as explicit as the success path.
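
An explicit failure path can be quite small. The sketch below assumes the counter check is a callable that accepts a timeout budget and raises TimeoutError or ConnectionError when the store is degraded; the 20 ms budget, the mode names, and the function names are hypothetical.

```python
def decide(check_counter, failure_mode="open", timeout_sec=0.02):
    """Return a decision, falling back to the configured mode on store failure.

    check_counter: hypothetical callable taking a timeout in seconds and
    returning True (allow) or False (deny), or raising on failure.
    """
    try:
        return {"allowed": check_counter(timeout_sec), "degraded": False}
    except (TimeoutError, ConnectionError):
        # One place owns the fallback, so every instance behaves the same way.
        return {"allowed": failure_mode == "open", "degraded": True}


def timing_out(timeout_sec):
    raise TimeoutError("counter store did not answer in time")


print(decide(timing_out, failure_mode="open"))
print(decide(timing_out, failure_mode="closed"))
```

The degraded flag gives operators a signal to alert on, so fail-open periods are visible rather than silent.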

info

Admin write path lacks persistence story for recovery

You could improve this by clarifying the source of truth between etcd and Postgres for rules. The current flow writes rules to etcd, while Postgres is present but not clearly used for rule persistence. If etcd state is lost or rebuilt, the system needs a deterministic way to restore rule definitions and algorithm settings.
