Reviewed by 6 specialized AI reviewers. The full per-section feedback follows.
The candidate demonstrates strong system design fundamentals, good architectural decomposition, and appropriate scaling patterns for the stated scope. The design is clearly above the senior baseline, but a thin failure-mode story, limited sizing rigor, and a few contract and modeling gaps keep it short of a strong hire.
Covers the core NFR dimensions
The section explicitly addresses availability, latency, and consistency, which are the key non-functional dimensions expected here. This shows clear awareness that matchmaking correctness and user experience depend on more than just functional behavior.
Consistency is differentiated by operation type
Choosing strong consistency for match assignment while allowing eventual consistency for wait-time estimates is a solid tradeoff. It protects the critical invariant that a player should only be assigned once, while avoiding over-engineering for an estimate that can tolerate staleness.
Latency target is measurable
Using a concrete target like p95 < 100ms is much better than a vague 'low latency' statement. It gives a clear SLO that can be validated and tied to the expected enqueue/dequeue path.
NFRs are tied to the stated scale
The latency discussion references the stated assumption of 10K concurrent gamers and translates that into roughly 67 QPS, which is the right way to justify that the target is realistic for this interview scope.
Availability target is stated but not scoped
99.9% availability is a reasonable target, but it is unclear whether this applies to the enqueue API, match assignment path, or wait-time estimate endpoint separately. For senior-level NFRs, define the scope of the SLA more precisely, e.g. '99.9% monthly availability for enqueue/dequeue and estimate read APIs.'
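As an illustration of why per-scope SLAs help, each 99.9% target implies its own error budget. A minimal sketch, assuming a 30-day month; the scope names are hypothetical labels rather than names from the design:

```python
# Allowed downtime implied by an availability SLO over a window.
# Assumption: a 30-day month; scope names below are hypothetical labels.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability: float,
                            window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the minutes of unavailability the SLO tolerates in the window."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return window_minutes * (1.0 - availability)


# One SLO per user-facing path, so each can be monitored and alerted on separately.
SLO_BY_SCOPE = {
    "enqueue_dequeue_api": 0.999,
    "estimate_read_api": 0.999,
}

for scope, target in SLO_BY_SCOPE.items():
    print(f"{scope}: {downtime_budget_minutes(target):.1f} min/month")
```

At 99.9%, each scope tolerates about 43.2 minutes of unavailability per 30-day month, which gives a concrete number to alert against.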
Latency target is incomplete for the user-visible estimate flow
The p95 < 100ms target is only attached to enqueue/dequeue. Since showing estimated wait time is an explicit functional requirement, the design should also define a measurable latency target for estimate retrieval or push updates, such as p95 < 200ms for estimate reads or update delivery.
Freshness target could be expressed more precisely
Saying estimates refresh every 5s is useful, but senior-level NFRs are stronger when framed as an observable bound, such as 'estimate staleness <= 5s under normal operation.' That makes it easier to monitor and verify.
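Framed as an observable bound, freshness can be tracked as an SLI: the fraction of served estimates no older than 5s. A minimal sketch, assuming each served estimate can be paired with the time it was computed (the function name and sample format are hypothetical):

```python
from typing import Sequence, Tuple

MAX_STALENESS_S = 5.0  # bound taken from the design's 5 s refresh cadence


def freshness_sli(samples: Sequence[Tuple[float, float]],
                  bound_s: float = MAX_STALENESS_S) -> float:
    """samples: (computed_at, served_at) pairs in epoch seconds.

    Returns the fraction of served estimates whose age is within bound_s.
    """
    if not samples:
        return 1.0  # no traffic in the window: treat the objective as met
    fresh = sum(1 for computed, served in samples if served - computed <= bound_s)
    return fresh / len(samples)
```

An objective such as "freshness_sli >= 0.99 over a 5-minute window" (the 0.99 is only an example) would then be directly alertable.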
Core matchmaking nouns are identified
The design lists the main domain entities for the stated flow: Player, QueueEntry, Match, GameSession, and EloBucket. These cover player enrollment, grouping for matchmaking, and handoff to a game session.
Relationships are explicitly defined
The submission does not just name entities; it also specifies key cardinalities such as Player to QueueEntry, QueueEntry to EloBucket, Match to Player, and Match to GameSession. That is the right level of modeling for a senior-level core-entities section.
Estimated wait time concept is not modeled
One of the functional requirements is showing estimated wait time, but there is no entity or clearly modeled domain concept that owns or derives this information. You do not need full field-level detail, but the model should include a concept such as QueueState, QueueStats, or WaitTimeEstimate tied to queue population or EloBucket so the requirement has a clear place in the domain.
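For example, a QueueStats concept could hang off each EloBucket and own the estimate. A minimal sketch, assuming a naive Little's-law estimate (W = L / λ, with λ taken as players leaving the bucket per second); all class and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EloBucket:
    game_mode: str
    region: str
    elo_min: int
    elo_max: int  # exclusive upper bound


@dataclass
class QueueStats:
    """Hypothetical owner of the 'estimated wait time' requirement, one per EloBucket."""
    bucket: EloBucket
    queued_players: int              # L: players currently waiting in the bucket
    recent_match_rate_per_s: float   # matches formed per second in this bucket
    players_per_match: int

    def estimated_wait_s(self) -> float:
        """Little's law W = L / lambda, where lambda = players removed per second."""
        drain_rate = self.recent_match_rate_per_s * self.players_per_match
        if drain_rate <= 0:
            return float("inf")  # no recent matches: no meaningful estimate
        return self.queued_players / drain_rate
```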
Queue relationship to Match is missing
The core flow is queue entry to matchmaking to match creation, but the relationships stop short of connecting queued players or queue entries to the resulting Match. Without that linkage, the transition from waiting state to matched state is under-modeled. Add a relationship showing how a Match is formed from QueueEntries, directly or indirectly.
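One way to close the gap is a Match that records the QueueEntries it consumed, with each entry pointing back at its Match. A minimal sketch with hypothetical names, assuming entries move from queued to matched exactly once:

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueEntry:
    entry_id: str
    player_id: str
    status: str = "queued"           # becomes "matched" once consumed
    match_id: Optional[str] = None   # back-reference set when matched


@dataclass
class Match:
    match_id: str
    entry_ids: List[str]             # Match 1..N QueueEntry: the entries it consumed
    game_session_id: Optional[str] = None


def form_match(entries: List[QueueEntry]) -> Match:
    """Create a Match from queued entries and link each entry back to it."""
    if any(e.status != "queued" for e in entries):
        raise ValueError("all entries must still be queued")
    match = Match(match_id=str(uuid.uuid4()), entry_ids=[e.entry_id for e in entries])
    for e in entries:
        e.status = "matched"
        e.match_id = match.match_id
    return match
```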
Methodical queue and throughput estimates
The calculation chain from 10K concurrent users to 2K active searches, then to enqueue/dequeue QPS and match rate, is clear and internally consistent. This is the right style of reasoning for sizing a matchmaking system.
Covers multiple resource dimensions
The sizing goes beyond QPS and includes in-memory queue size, WebSocket connection fanout, event throughput, database writes, and wait-time update traffic. That breadth shows good capacity thinking rather than stopping at a single request-rate estimate.
Match-rate math is inconsistent for multiplayer matching
The design states ~67 players/sec and then ~33 matches/sec, which implicitly assumes 2 players per match. In a game like CSGO or Valorant, a standard match consumes 10 players (5v5). If this system is matching 10 players per game, 67 players/sec yields only ~6-7 matches/sec. This matters because downstream sizing for match events, DB writes, and queue drain rate depends on matches/sec, not players/sec. Fix by explicitly defining players per match and carrying that through all calculations.
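The correction is a one-line calculation, but carrying the players-per-match parameter explicitly keeps every downstream number honest. A minimal sketch (the function name is hypothetical; 67 players/sec is the design's figure):

```python
def matches_per_sec(players_per_sec: float, players_per_match: int) -> float:
    """Matches formed per second when each match consumes players_per_match players."""
    if players_per_match < 2:
        raise ValueError("a match needs at least 2 players")
    return players_per_sec / players_per_match


PLAYERS_PER_SEC = 67  # dequeue rate quoted in the design

print(f"2 per match : {matches_per_sec(PLAYERS_PER_SEC, 2):.1f} matches/s")   # the design's ~33
print(f"5v5         : {matches_per_sec(PLAYERS_PER_SEC, 10):.1f} matches/s")  # CSGO/Valorant-style
```

With 2 players per match this reproduces the design's ~33 matches/sec; with 10 it gives ~6.7, and match-event, DB-write, and game-server-startup estimates should be scaled from that figure.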
No peak/headroom assumptions
All numbers appear to be average steady-state values. Senior-level capacity planning should include burst assumptions and safety margin, especially for queue spikes when many players finish games around the same time. Add peak QPS estimates (for example 2-5x average), then verify Redis, WebSocket servers, Kafka, and Postgres still have comfortable headroom.
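A headroom check across the suggested 2-5x burst range could look like the following minimal sketch. Only the 67 QPS average comes from the design; the capacity figure and the 30% minimum-headroom threshold are hypothetical assumptions:

```python
from typing import List, Tuple

AVG_ENQUEUE_QPS = 67        # average from the design
BURST_FACTORS = (2, 3, 5)   # the 2-5x range suggested above


def headroom(peak_qps: float, capacity_qps: float) -> float:
    """Fraction of capacity left unused at peak (negative means overloaded)."""
    return 1.0 - peak_qps / capacity_qps


def check_peaks(avg_qps: float, capacity_qps: float,
                min_headroom: float = 0.3) -> List[Tuple[int, float, bool]]:
    """For each burst factor, return (factor, peak_qps, meets_min_headroom)."""
    return [(f, avg_qps * f, headroom(avg_qps * f, capacity_qps) >= min_headroom)
            for f in BURST_FACTORS]


# Hypothetical capacity of the component being checked (e.g. one matcher tier).
for factor, peak, ok in check_peaks(AVG_ENQUEUE_QPS, capacity_qps=400):
    print(f"{factor}x -> {peak} QPS, headroom ok: {ok}")
```

Against the hypothetical 400 QPS capacity, the 2x and 3x peaks (134 and 201 QPS) leave comfortable headroom, while the 5x peak (335 QPS) leaves only about 16%, flagging that component for more capacity.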
Component sizing is asserted rather than justified
Statements like 'single Postgres instance easily handled' and '4 WS servers' are plausible at this scale, but they are not backed by per-node capacity assumptions. To make this senior-level, state expected limits per server/instance (connections per WS node, writes/sec for Postgres, memory/CPU for Redis) and show why the chosen counts are sufficient.
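Turning per-node limits into instance counts is a short formula: ceil(peak load / (per-node capacity × target utilization)), floored at a redundancy minimum. A minimal sketch; every capacity and utilization value here is a hypothetical assumption:

```python
import math


def nodes_needed(peak_load: float, per_node_capacity: float,
                 target_utilization: float = 0.6, min_nodes: int = 2) -> int:
    """Smallest node count keeping each node at or below target_utilization,
    never fewer than min_nodes (so a single node failure is survivable)."""
    if per_node_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("invalid capacity or utilization")
    usable_per_node = per_node_capacity * target_utilization
    return max(min_nodes, math.ceil(peak_load / usable_per_node))
```

For example, assuming one WebSocket per concurrent gamer, 10,000 peak connections at a hypothetical 5,000 connections per WS node and 60% target utilization gives ceil(10,000 / 3,000) = 4 nodes, consistent with the design's "4 WS servers" once the per-node assumption is stated. The same function applies to Postgres writes/sec or Redis memory with their own per-instance limits.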
Storage estimate is too narrow
The Redis sorted set estimate for 2K active searches is useful, but persistent storage sizing is missing. Even if only matchmaking is in scope, the design mentions storing game records, so it should estimate daily record volume and retention to validate Postgres storage growth. A simple DAU/session assumption leading to rows/day and GB/month would complete the capacity picture.
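A simple storage estimate multiplies daily volume by row size and retention. A minimal sketch in which every input (matches per day, row sizes, retention) is a hypothetical assumption, and index, WAL, and replica overhead are ignored:

```python
from typing import Tuple


def storage_estimate_gb(matches_per_day: float, bytes_per_match_row: int,
                        players_per_match: int, bytes_per_player_row: int,
                        retention_days: int) -> Tuple[float, float]:
    """Return (GB/day, GB retained) for one match row plus one result row per player.

    Uses decimal GB (1e9 bytes); excludes index, WAL, and replica overhead.
    """
    bytes_per_match = bytes_per_match_row + players_per_match * bytes_per_player_row
    per_day_gb = matches_per_day * bytes_per_match / 1e9
    return per_day_gb, per_day_gb * retention_days


per_day, retained = storage_estimate_gb(500_000, 500, 10, 200, 90)
print(f"{per_day:.2f} GB/day, {retained:.1f} GB retained")
```

With 500,000 matches/day, 500-byte match rows, ten 200-byte player rows per match, and 90-day retention, this gives 1.25 GB/day (5.5M rows/day) and about 112.5 GB retained before indexes and replicas.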
Core matchmaking lifecycle is covered
The REST API cleanly supports the main required actions for this scope: create a matchmaking entry, fetch its current status including estimated wait time, and cancel it. That maps well to the stated functional requirements without unnecessary surface area.
Resource-oriented REST design
Using /matchmaking-entries as the primary resource with POST, GET by id, and DELETE by id is a solid REST pattern. The URLs are noun-based and the operations are intuitive for clients integrating with the service.
WebSocket message types are structured and consistent
The WebSocket protocol uses explicit type fields and payload objects, which makes client/server handling straightforward and extensible. Separate message types for status updates, match_found, and error are appropriate for a real-time matchmaking flow.
Appropriate use of HTTP verbs and status codes
POST for enqueue, GET for status lookup, and DELETE for cancellation are the correct verb choices. Returning 201 for creation and 204 for deletion also follows standard HTTP semantics.
Missing request contract for creating a matchmaking entry
The POST route shows only the response, but not the request body needed to actually match players by elo and related criteria. Since matchmaking is based on elo and wait time, the API should define the create payload clearly, e.g. playerId and rating/queue attributes, so clients know what data is required to place a player into the correct queue.
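A request contract with server-side validation might look like the following minimal sketch. The field names (playerId, gameMode, region, elo), the allowed modes, and the ELO range are hypothetical; the design does not specify them:

```python
from dataclasses import dataclass
from typing import Any, Dict

ALLOWED_MODES = {"ranked_5v5", "unranked_5v5"}  # hypothetical queue names


@dataclass(frozen=True)
class CreateMatchmakingEntryRequest:
    player_id: str
    game_mode: str
    region: str
    elo: int

    @staticmethod
    def parse(body: Dict[str, Any]) -> "CreateMatchmakingEntryRequest":
        """Validate a decoded JSON body; raise ValueError (-> HTTP 400) on bad input."""
        try:
            req = CreateMatchmakingEntryRequest(
                player_id=str(body["playerId"]),
                game_mode=str(body["gameMode"]),
                region=str(body["region"]),
                elo=int(body["elo"]),
            )
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"invalid request body: {exc!r}") from exc
        if req.game_mode not in ALLOWED_MODES:
            raise ValueError(f"unknown gameMode {req.game_mode!r}")
        if not 0 <= req.elo <= 5000:  # hypothetical valid range
            raise ValueError("elo out of range")
        return req
```

A design choice worth stating either way: many systems look up the rating server-side from playerId rather than trusting a client-supplied value.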
No REST error status coverage
The REST routes list only success responses. A senior-level API design should also specify common failure cases such as 400 for invalid input, 404 for unknown entryId, and 409 if a player already has an active matchmaking entry. This makes client behavior predictable and avoids ambiguous failures.
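The failure cases above can be made concrete by showing which status each path returns. A minimal in-memory sketch (class and method names are hypothetical) mapping invalid input to 400, an unknown entryId to 404, and a duplicate active entry to 409:

```python
import uuid
from typing import Dict, Optional, Tuple


class MatchmakingEntries:
    """In-memory stand-in showing which HTTP status each outcome maps to."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}           # entry_id -> player_id
        self._active_by_player: Dict[str, str] = {}  # player_id -> entry_id

    def create(self, player_id: Optional[str]) -> Tuple[int, Optional[str]]:
        if not player_id:
            return 400, None                                   # invalid input
        if player_id in self._active_by_player:
            return 409, self._active_by_player[player_id]      # already queued
        entry_id = str(uuid.uuid4())
        self._entries[entry_id] = player_id
        self._active_by_player[player_id] = entry_id
        return 201, entry_id

    def get(self, entry_id: str) -> Tuple[int, Optional[str]]:
        if entry_id not in self._entries:
            return 404, None                                   # unknown entryId
        return 200, self._entries[entry_id]

    def delete(self, entry_id: str) -> Tuple[int, None]:
        player_id = self._entries.pop(entry_id, None)
        if player_id is None:
            return 404, None
        del self._active_by_player[player_id]
        return 204, None
```

Returning the existing entryId with the 409 lets a client that retried a timed-out POST resume instead of failing. Whether DELETE on an already-removed entry returns 404 or an idempotent 204 is a choice worth documenting explicitly.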
WebSocket lifecycle is underspecified
The design includes a subscribe message, but does not define how subscription failures, invalid entry ownership, or terminal states are handled over the socket. Add clear behavior for cases like subscribing to a nonexistent entry, duplicate subscriptions, and whether the server closes the stream or sends a final status after match_found/cancel.
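The subscribe edge cases could be pinned down in a handler like this minimal sketch. The message type names (status_update, error), the error codes, and the class are hypothetical; it assumes duplicate subscribes are idempotent no-ops and that a terminal state produces one final message, after which the subscription is dropped:

```python
from typing import Dict, List, Set

TERMINAL = {"matched", "cancelled", "expired"}


class SubscriptionHub:
    """Hypothetical per-connection subscribe handling; messages are plain dicts."""

    def __init__(self, owners: Dict[str, str], statuses: Dict[str, str]) -> None:
        self.owners = owners                  # entry_id -> player_id
        self.statuses = statuses              # entry_id -> current status
        self.subs: Dict[str, Set[str]] = {}   # connection_id -> subscribed entry_ids

    def subscribe(self, conn_id: str, player_id: str, entry_id: str) -> List[dict]:
        if entry_id not in self.owners:
            return [{"type": "error", "payload": {"code": "ENTRY_NOT_FOUND"}}]
        if self.owners[entry_id] != player_id:
            return [{"type": "error", "payload": {"code": "FORBIDDEN"}}]
        subs = self.subs.setdefault(conn_id, set())
        if entry_id in subs:
            return []  # duplicate subscribe: idempotent no-op
        status = self.statuses[entry_id]
        if status in TERMINAL:
            # Already finished: send one final status and keep no subscription.
            return [{"type": "status_update",
                     "payload": {"entryId": entry_id, "status": status, "final": True}}]
        subs.add(entry_id)
        return [{"type": "status_update",
                 "payload": {"entryId": entry_id, "status": status}}]

    def on_terminal(self, entry_id: str, status: str) -> List[str]:
        """Record a terminal transition; return connections to send a final
        message to, and unsubscribe them from the entry."""
        self.statuses[entry_id] = status
        notified = [c for c, s in self.subs.items() if entry_id in s]
        for c in notified:
            self.subs[c].discard(entry_id)
        return notified
```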
Status representation could be more explicit
GET /matchmaking-entries/{uuid} returns a status field, but the allowed values are not defined. Enumerating states such as queued, matched, cancelled, and expired would make both the REST and WebSocket contracts easier to implement consistently.
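A minimal sketch of the enumerated states plus an explicit transition table, assuming only queued entries can change state (names hypothetical):

```python
from enum import Enum


class EntryStatus(str, Enum):
    QUEUED = "queued"
    MATCHED = "matched"
    CANCELLED = "cancelled"
    EXPIRED = "expired"


# Only QUEUED may transition; the other three states are terminal.
ALLOWED = {
    EntryStatus.QUEUED: {EntryStatus.MATCHED, EntryStatus.CANCELLED, EntryStatus.EXPIRED},
    EntryStatus.MATCHED: set(),
    EntryStatus.CANCELLED: set(),
    EntryStatus.EXPIRED: set(),
}


def transition(current: EntryStatus, new: EntryStatus) -> EntryStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Enforcing transitions in one place also settles the cancel-versus-match race: whichever transition is applied first wins, and the loser sees an illegal transition that the API can surface, for example as 409.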
Well-structured end-to-end matchmaking flow
The design covers the full lifecycle from queue submission, bucketed matching, match creation, event publication, game server startup, and notifying players over existing WebSocket connections. This is a complete HLD for the stated matchmaking requirements.
Good sharded matcher strategy on Redis buckets
Partitioning players by game type, region, and ELO bucket in Redis sorted sets, then assigning matcher ownership via leases with TTL, is a strong scaling pattern for 10K concurrent gamers. The Lua-scripted atomic selection also shows awareness of race conditions between matcher replicas.
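The bucket key and the atomic take-N step can be illustrated without Redis. A minimal in-memory sketch, where a lock stands in for the Lua script's atomicity and ordering by enqueue time stands in for the sorted-set score; the key format and 200-point bucket width are hypothetical:

```python
import bisect
import threading
from typing import Dict, List, Tuple

BUCKET_WIDTH = 200  # hypothetical ELO bucket width


def bucket_key(game_type: str, region: str, elo: int) -> str:
    """Partition key: game type + region + ELO bucket, e.g. 'mm:ranked:eu:1400'."""
    return f"mm:{game_type}:{region}:{(elo // BUCKET_WIDTH) * BUCKET_WIDTH}"


class InMemoryBuckets:
    """Stand-in for Redis sorted sets scored by enqueue time. The lock plays the
    role of the Lua script: the size check and the removal happen atomically."""

    def __init__(self) -> None:
        self._zsets: Dict[str, List[Tuple[float, str]]] = {}
        self._lock = threading.Lock()

    def enqueue(self, key: str, player_id: str, enqueued_at: float) -> None:
        with self._lock:
            bisect.insort(self._zsets.setdefault(key, []), (enqueued_at, player_id))

    def take_if_at_least(self, key: str, n: int) -> List[str]:
        """Atomically pop the n longest-waiting players, or nothing if fewer than n."""
        with self._lock:
            zset = self._zsets.get(key, [])
            if len(zset) < n:
                return []
            taken, self._zsets[key] = zset[:n], zset[n:]
            return [player for _, player in taken]
```

In Redis itself, the same check-then-pop would run inside a Lua script (for example ZCARD, then ZPOPMIN with a count) so that two matcher replicas cannot claim the same players.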
Estimated wait time path is explicitly designed
The design does not treat wait time as an afterthought. It includes a dedicated wait service, precomputed heuristics per bucket, short-lived WS-side caching, and material-change-based updates, which is a practical approach for serving frequent wait-time reads efficiently.
Durable eventing for match creation
Using Postgres plus an outbox/CDC worker before publishing to Kafka is a solid reliability pattern. It reduces the risk of losing match-created events and is appropriate for coordinating downstream game server startup.
Basic redundancy is present across major services
The design includes multiple WS servers, replicated matchmaking services, replicated matcher and wait services, Redis cluster, and managed Postgres with read replicas and failover. That is a reasonable baseline for avoiding obvious single-instance failures.
WebSocket routing and notification path are internally inconsistent
The diagram shows both direct WS-server consumption from Redis Streams and a separate path where WS servers send through the L4 load balancer to reach users. In practice, once the client has an established socket, the WS server should write directly to that connection; the load balancer is only for connection establishment. Clean up the flow and make the clientId->serverId ownership model explicit so notifications are routed deterministically.
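Making the ownership model explicit could look like the following minimal sketch: a registry maps clientId to the WS server holding the socket, and notifications go only to that server. The class and function names are hypothetical, and a production registry would more likely live in Redis with a TTL:

```python
from typing import Callable, Dict, Optional


class ConnectionRegistry:
    """Hypothetical clientId -> serverId ownership map."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def register(self, client_id: str, server_id: str) -> None:
        # Last writer wins: a reconnect to a different WS server moves ownership.
        self._owner[client_id] = server_id

    def unregister(self, client_id: str, server_id: str) -> None:
        # Only the current owner may remove the mapping, so a late disconnect
        # from the old server cannot erase a newer reconnect.
        if self._owner.get(client_id) == server_id:
            del self._owner[client_id]

    def owner_of(self, client_id: str) -> Optional[str]:
        return self._owner.get(client_id)


def route_notification(registry: ConnectionRegistry, client_id: str, message: dict,
                       send_to_server: Callable[[str, str, dict], None]) -> bool:
    """Deliver to the single owning WS server, which writes on its existing socket.
    The L4 load balancer is not on this path."""
    server = registry.owner_of(client_id)
    if server is None:
        return False  # client offline; it can fetch status via GET on reconnect
    send_to_server(server, client_id, message)
    return True
```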
Redis is carrying core queue state without a clear failover/rebuild story
The active matchmaking queue, bucket ownership leases, and some client routing metadata appear to live primarily in Redis. If Redis data is lost or partially unavailable, active queued players may be dropped or duplicated. For a senior-level design, add a clear persistence/recovery plan: durable Redis configuration, replayable source of truth, or periodic checkpointing so the queue can be reconstructed after failure.
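If queue changes are also recorded in a durable, ordered log (for example, a Kafka topic or a Postgres table), the Redis queue can be rebuilt by replay. A minimal sketch with hypothetical event kinds, assuming events carry a monotonically increasing sequence number and may be delivered more than once:

```python
from typing import Dict, Iterable, Tuple

# Event = (seq, kind, entry_id, payload); the kinds are hypothetical names.
Event = Tuple[int, str, str, dict]


def rebuild_queue(events: Iterable[Event]) -> Dict[str, dict]:
    """Replay an ordered event log to reconstruct the active queue entries.
    An enqueued entry stays active until a matched/cancelled/expired event removes it."""
    active: Dict[str, dict] = {}
    last_seq = None
    for seq, kind, entry_id, payload in sorted(events, key=lambda e: e[0]):
        if seq == last_seq:
            continue  # duplicate delivery of the same event
        last_seq = seq
        if kind == "enqueued":
            active[entry_id] = payload
        elif kind in ("matched", "cancelled", "expired"):
            active.pop(entry_id, None)
    return active
```

After a rebuild, entries would be re-inserted into their bucket sorted sets scored by the original enqueue time, so players keep their place in line.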
No clear backpressure or overload handling on hot buckets
The design mentions dynamic bucket splitting and lease rebalancing, which is good, but it does not explain what happens when a region/ELO bucket becomes much hotter than others or when WS/update traffic spikes. Add explicit controls such as rate limits, bounded stream consumer lag, degraded wait-time refresh frequency, and autoscaling triggers on bucket depth or matcher latency.
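Two of those controls are small enough to sketch: a token-bucket rate limiter for enqueue traffic and a refresh interval that stretches as stream consumer lag grows. A minimal sketch in which the rates, the 1,000-message lag threshold, and the 30 s ceiling are hypothetical assumptions (the 5 s base comes from the design's refresh cadence):

```python
import time
from typing import Callable


class TokenBucket:
    """Classic token-bucket limiter: refills at rate_per_s up to capacity.
    Could guard enqueue per player or per hot ELO bucket (hypothetical placement)."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller can reject with 429 or shed the request


def wait_refresh_interval_s(consumer_lag: int, base_s: float = 5.0,
                            lag_threshold: int = 1000, max_s: float = 30.0) -> float:
    """Degrade wait-time push frequency proportionally once lag exceeds the threshold."""
    if consumer_lag <= lag_threshold:
        return base_s
    return min(max_s, base_s * consumer_lag / lag_threshold)
```

Autoscaling triggers on bucket depth or matcher latency would sit alongside these, typically in the orchestrator's configuration rather than in application code.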
Caching strategy is good for wait estimates but limited elsewhere
Wait-time caching is well thought out, but the design could be clearer about what reads should hit Postgres replicas versus Redis. Since this system is mostly matchmaking, keeping operational reads off Postgres would improve resilience. Explicitly state that queue depth, bucket heuristics, and reconnect lookups are served from Redis, while Postgres is reserved for durable match/game records.
Some components are weakly integrated in the diagram
A few nodes look more like notes than first-class components, such as the standalone 'Replica' services and the separate 'Primary + 2 read replicas' database node. This makes the architecture harder to reason about and creates apparent orphan/duplicate elements. Consolidate these into the Postgres component and show only meaningful runtime services and their actual traffic paths.