DrawLint.ai

Multiplayer Online Game Matchmaking — system design by AgileViper46

Strong Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.


Hire Signal: Hire

The candidate demonstrates strong system design fundamentals, good architectural decomposition, and appropriate scaling patterns for the stated scope. The design is clearly above baseline senior quality, but the failure-mode story, some sizing rigor, and a few contract/modeling gaps keep it short of a strong-hire.

✅ Good

Covers the core NFR dimensions

The section explicitly addresses availability, latency, and consistency, which are the key non-functional dimensions expected here. This shows clear awareness that matchmaking correctness and user experience depend on more than just functional behavior.

✅ Good

Consistency is differentiated by operation type

Choosing strong consistency for match assignment while allowing eventual consistency for wait-time estimates is a solid tradeoff. It protects the critical invariant that a player should only be assigned once, while avoiding over-engineering for an estimate that can tolerate staleness.

✅ Good

Latency target is measurable

Using a concrete target like p95 < 100ms is much better than a vague 'low latency' statement. It gives a clear SLO that can be validated and tied to the expected enqueue/dequeue path.

✅ Good

NFRs are tied to the stated scale

The latency discussion references the stated assumption of 10K concurrent gamers and translates that into roughly 67 QPS, which is the right way to justify that the target is realistic for this interview scope.

warning

Availability target is stated but not scoped

99.9% availability is a reasonable target, but it is unclear whether this applies to the enqueue API, match assignment path, or wait-time estimate endpoint separately. For senior-level NFRs, define the scope of the SLA more precisely, e.g. '99.9% monthly availability for enqueue/dequeue and estimate read APIs.'
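
One way to make the scope concrete is to state the downtime budget each API is allowed. The following is a minimal sketch, assuming a 30-day month; the endpoint names and the idea of giving every endpoint the same 99.9% target are illustrative assumptions, not part of the submission.

```python
# Downtime budget implied by an availability target.
# Assumptions: a 30-day month; endpoint names are hypothetical.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability: float,
                            window_minutes: int = MINUTES_PER_MONTH) -> float:
    """Return the downtime (in minutes) permitted by an availability target."""
    return (1.0 - availability) * window_minutes


slo_by_endpoint = {
    "enqueue/dequeue": 0.999,
    "estimate read": 0.999,
}

for endpoint, target in slo_by_endpoint.items():
    print(f"{endpoint}: {downtime_budget_minutes(target):.1f} min/month")
```

Under these assumptions each scoped API gets roughly 43 minutes of downtime per month, which is a figure that can be monitored per endpoint.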

warning

Latency target is incomplete for the user-visible estimate flow

The p95 < 100ms target is only attached to enqueue/dequeue. Since showing estimated wait time is an explicit functional requirement, the design should also define a measurable latency target for estimate retrieval or push updates, such as p95 < 200ms for estimate reads or update delivery.

info

Freshness target could be expressed more precisely

Saying estimates refresh every 5s is useful, but senior-level NFRs are stronger when framed as an observable bound, such as 'estimate staleness <= 5s under normal operation.' That makes it easier to monitor and verify.

✅ Good

Core matchmaking nouns are identified

The design lists the main domain entities for the stated flow: Player, QueueEntry, Match, GameSession, and EloBucket. These cover player enrollment, grouping for matchmaking, and handoff to a game session.

✅ Good

Relationships are explicitly defined

The submission does not just name entities; it also specifies key cardinalities such as Player to QueueEntry, QueueEntry to EloBucket, Match to Player, and Match to GameSession. That is the right level of modeling for a Senior core-entities section.

warning

Estimated wait time concept is not modeled

One of the functional requirements is showing estimated wait time, but there is no entity or clearly modeled domain concept that owns or derives this information. You do not need full field-level detail, but the model should include a concept such as QueueState, QueueStats, or WaitTimeEstimate tied to queue population or EloBucket so the requirement has a clear place in the domain.

warning

Queue relationship to Match is missing

The core flow is queue entry to matchmaking to match creation, but the relationships stop short of connecting queued players or queue entries to the resulting Match. Without that linkage, the transition from waiting state to matched state is under-modeled. Add a relationship showing how a Match is formed from QueueEntries, directly or indirectly.
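
The two modeling gaps above could be closed with something like the sketch below. The field names, the `WaitTimeEstimate` type, and the `form_match` helper are hypothetical; only the entity names Player, QueueEntry, Match, and EloBucket come from the submission.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueueEntry:
    entry_id: str
    player_id: str
    elo_bucket: str
    match_id: Optional[str] = None  # set once the entry is consumed by a Match


@dataclass
class WaitTimeEstimate:
    # Hypothetical concept that owns the "estimated wait time" requirement.
    elo_bucket: str
    queue_depth: int
    estimated_wait_sec: float


@dataclass
class Match:
    match_id: str
    entry_ids: List[str] = field(default_factory=list)  # Match 1..N QueueEntry


def form_match(match_id: str, entries: List[QueueEntry]) -> Match:
    """Link queue entries to a new match, refusing already-matched entries."""
    for entry in entries:
        if entry.match_id is not None:
            raise ValueError(f"entry {entry.entry_id} is already matched")
    for entry in entries:
        entry.match_id = match_id
    return Match(match_id=match_id, entry_ids=[e.entry_id for e in entries])
```

Recording `match_id` on the entry makes the waiting-to-matched transition explicit and gives the single-assignment invariant a place to be checked.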

✅ Good

Methodical queue and throughput estimates

The calculation chain from 10K concurrent users to 2K active searches, then to enqueue/dequeue QPS and match rate, is clear and internally consistent. This is the right style of reasoning for sizing a matchmaking system.

✅ Good

Covers multiple resource dimensions

The sizing goes beyond QPS and includes in-memory queue size, WebSocket connection fanout, event throughput, database writes, and wait-time update traffic. That breadth shows good capacity thinking rather than stopping at a single request-rate estimate.

warning

Match-rate math is inconsistent for multiplayer matching

The design states ~67 players/sec and then ~33 matches/sec, which implies 2 players per match. A game like CS:GO or Valorant typically consumes far more than 2 players per match. If this system is matching 10 players per game, 67 players/sec would be only ~6-7 matches/sec. This matters because downstream sizing for match events, DB writes, and queue drain rate depends on matches/sec rather than players/sec. Fix by explicitly defining players per match and carrying that value through all calculations.
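
The arithmetic can be carried through with a small helper. This is a sketch only: the 67 players/sec figure is from the design, while the players-per-match values are assumptions to be replaced with the game's actual lobby size.

```python
# Convert a player dequeue rate into a match rate.
# 67 players/sec is from the design; players-per-match values are assumptions.
def matches_per_sec(players_per_sec: float, players_per_match: int) -> float:
    if players_per_match < 2:
        raise ValueError("a match needs at least two players")
    return players_per_sec / players_per_match


PLAYERS_PER_SEC = 67

for size in (2, 10):
    rate = matches_per_sec(PLAYERS_PER_SEC, size)
    print(f"{size} players/match -> {rate:.1f} matches/sec")
```

With 2 players per match this gives about 33.5 matches/sec; with 10 it gives about 6.7, a fivefold difference in everything sized from match rate.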

warning

No peak/headroom assumptions

All numbers appear to be average steady-state values. Senior-level capacity planning should include burst assumptions and safety margin, especially for queue spikes when many players finish games around the same time. Add peak QPS estimates (for example 2-5x average), then verify Redis, WebSocket servers, Kafka, and Postgres still have comfortable headroom.
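
A peak check might look like the sketch below. The 67 QPS average is from the design; the burst multipliers follow the 2-5x suggestion, and `ASSUMED_CAPACITY_QPS` is a placeholder that should be replaced with a measured figure per component.

```python
# Peak-load check against an assumed component capacity.
AVG_QPS = 67  # from the design


def peak_qps(avg_qps: float, burst_multiplier: float) -> float:
    return avg_qps * burst_multiplier


def headroom(capacity_qps: float, load_qps: float) -> float:
    """Fraction of capacity still unused at the given load."""
    return 1.0 - load_qps / capacity_qps


ASSUMED_CAPACITY_QPS = 1_000  # hypothetical placeholder, not a benchmark

for mult in (2, 5):
    load = peak_qps(AVG_QPS, mult)
    spare = headroom(ASSUMED_CAPACITY_QPS, load)
    print(f"{mult}x burst: {load:.0f} QPS, headroom {spare:.0%}")
```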

warning

Component sizing is asserted rather than justified

Statements like 'single Postgres instance easily handled' and '4 WS servers' are plausible at this scale, but they are not backed by per-node capacity assumptions. To make this senior-level, state expected limits per server/instance (connections per WS node, writes/sec for Postgres, memory/CPU for Redis) and show why the chosen counts are sufficient.
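
A per-node justification can be as simple as the sketch below. The 10K connection count is from the stated scale; the per-node connection limit, target utilization, and spare count are illustrative assumptions that would need to come from load tests.

```python
import math


def nodes_needed(total_load: float,
                 per_node_capacity: float,
                 target_utilization: float = 0.6,
                 spares: int = 1) -> int:
    """Nodes required to carry total_load at target utilization, plus spares."""
    usable_per_node = per_node_capacity * target_utilization
    return math.ceil(total_load / usable_per_node) + spares


# 10K concurrent connections (from the design), 5K connections per WS node
# (assumed), run at 60% utilization with one spare node for failover.
print(nodes_needed(10_000, 5_000))
```

Stating the inputs this way lets a reviewer see whether a count such as "4 WS servers" includes failover capacity or only covers steady-state load.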

info

Storage estimate is too narrow

The Redis sorted set estimate for 2K active searches is useful, but persistent storage sizing is missing. Even if only matchmaking is in scope, the design mentions storing game records, so it should estimate daily record volume and retention to validate Postgres storage growth. A simple DAU/session assumption leading to rows/day and GB/month would complete the capacity picture.
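
A rough storage-growth estimate could be sketched as follows. Every input here is a hypothetical placeholder (matches per day, rows per match, and row size are not given in the submission); the point is the shape of the calculation.

```python
def monthly_storage_gb(matches_per_day: int,
                       rows_per_match: int,
                       bytes_per_row: int,
                       days: int = 30) -> float:
    """Approximate table growth per month in GB (10^9 bytes), excluding indexes."""
    return matches_per_day * rows_per_match * bytes_per_row * days / 1e9


# Assumed: 500K matches/day, 1 match row + 10 participant rows, ~200 bytes/row.
print(f"{monthly_storage_gb(500_000, 11, 200):.1f} GB/month")
```

Combined with a retention period, this yields the steady-state Postgres footprint the sizing section should state.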

✅ Good

Core matchmaking lifecycle is covered

The REST API cleanly supports the main required actions for this scope: create a matchmaking entry, fetch its current status including estimated wait time, and cancel it. That maps well to the stated functional requirements without unnecessary surface area.

✅ Good

Resource-oriented REST design

Using /matchmaking-entries as the primary resource with POST, GET by id, and DELETE by id is a solid REST pattern. The URLs are noun-based and the operations are intuitive for clients integrating with the service.

✅ Good

WebSocket message types are structured and consistent

The WebSocket protocol uses explicit type fields and payload objects, which makes client/server handling straightforward and extensible. Separate message types for status updates, match_found, and error are appropriate for a real-time matchmaking flow.

✅ Good

Appropriate use of HTTP verbs and status codes

POST for enqueue, GET for status lookup, and DELETE for cancellation are the correct verb choices. Returning 201 for creation and 204 for deletion also follows standard HTTP semantics.

warning

Missing request contract for creating a matchmaking entry

The POST route shows only the response, but not the request body needed to actually match players by elo and related criteria. Since matchmaking is based on elo and wait time, the API should define the create payload clearly, e.g. playerId and rating/queue attributes, so clients know what data is required to place a player into the correct queue.
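
A create payload might be defined along these lines. The field names (`playerId`, `gameType`, `region`, `elo`) are assumptions chosen to match the bucketing dimensions mentioned elsewhere in the design, not a contract taken from the submission.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("playerId", "gameType", "region", "elo")


@dataclass(frozen=True)
class CreateEntryRequest:
    player_id: str
    game_type: str
    region: str
    elo: int


def parse_create_request(body: dict) -> CreateEntryRequest:
    """Validate a POST /matchmaking-entries body; raise ValueError if invalid."""
    missing = [name for name in REQUIRED_FIELDS if name not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    elo = body["elo"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(elo, bool) or not isinstance(elo, int) or elo < 0:
        raise ValueError("elo must be a non-negative integer")
    return CreateEntryRequest(
        player_id=str(body["playerId"]),
        game_type=str(body["gameType"]),
        region=str(body["region"]),
        elo=elo,
    )
```

In practice the rating would more likely be looked up server-side than trusted from the client; the sketch only shows what a documented request contract looks like.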

warning

No REST error status coverage

The REST routes list only success responses. A senior-level API design should also specify common failure cases such as 400 for invalid input, 404 for unknown entryId, and 409 if a player already has an active matchmaking entry. This makes client behavior predictable and avoids ambiguous failures.
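
The failure cases named above can be illustrated with an in-memory stand-in for the entry store. This is a sketch under stated assumptions: the `EntryStore` class and the response bodies are hypothetical, and only the status codes (201, 204, 400, 404, 409) come from the review.

```python
from typing import Dict, Tuple


class EntryStore:
    """In-memory stand-in for the matchmaking-entry store (hypothetical)."""

    def __init__(self) -> None:
        self._player_by_entry: Dict[str, str] = {}
        self._entry_by_player: Dict[str, str] = {}

    def create(self, entry_id: str, body: dict) -> Tuple[int, dict]:
        player_id = body.get("playerId")
        if not isinstance(player_id, str) or not player_id:
            return 400, {"error": "playerId is required"}
        if player_id in self._entry_by_player:
            return 409, {"error": "player already has an active entry"}
        self._player_by_entry[entry_id] = player_id
        self._entry_by_player[player_id] = entry_id
        return 201, {"entryId": entry_id, "status": "queued"}

    def get(self, entry_id: str) -> Tuple[int, dict]:
        if entry_id not in self._player_by_entry:
            return 404, {"error": "unknown entryId"}
        return 200, {"entryId": entry_id, "status": "queued"}

    def cancel(self, entry_id: str) -> Tuple[int, dict]:
        player_id = self._player_by_entry.pop(entry_id, None)
        if player_id is None:
            return 404, {"error": "unknown entryId"}
        del self._entry_by_player[player_id]
        return 204, {}
```

Returning 409 for a duplicate enqueue also protects the invariant that a player holds at most one active entry.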

warning

WebSocket lifecycle is underspecified

The design includes a subscribe message, but does not define how subscription failures, invalid entry ownership, or terminal states are handled over the socket. Add clear behavior for cases like subscribing to a nonexistent entry, duplicate subscriptions, and whether the server closes the stream or sends a final status after match_found/cancel.

info

Status representation could be more explicit

GET /matchmaking-entries/{uuid} returns a status field, but the allowed values are not defined. Enumerating states such as queued, matched, cancelled, and expired would make both the REST and WebSocket contracts easier to implement consistently.
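
An explicit status set could be sketched as below. The four state names are the ones suggested above; the transition rule (only `queued` may move, and only to a terminal state) is an assumption about the intended lifecycle.

```python
from enum import Enum


class EntryStatus(str, Enum):
    QUEUED = "queued"
    MATCHED = "matched"
    CANCELLED = "cancelled"
    EXPIRED = "expired"


# States after which no further updates should be sent for an entry.
TERMINAL_STATUSES = frozenset(
    {EntryStatus.MATCHED, EntryStatus.CANCELLED, EntryStatus.EXPIRED}
)


def can_transition(current: EntryStatus, nxt: EntryStatus) -> bool:
    """Assumed rule: only a queued entry may change, and only to a terminal state."""
    return current is EntryStatus.QUEUED and nxt in TERMINAL_STATUSES
```

Sharing one enum between the REST response and the WebSocket status message keeps the two contracts from drifting apart, and the terminal set gives the socket a clear point at which to send a final message.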

⭐ Excellent

Well-structured end-to-end matchmaking flow

The design covers the full lifecycle from queue submission, bucketed matching, match creation, event publication, game server startup, and notifying players over existing WebSocket connections. This is a complete HLD for the stated matchmaking requirements.

⭐ Excellent

Good sharded matcher strategy on Redis buckets

Partitioning players by game type, region, and ELO bucket in Redis sorted sets, then assigning matcher ownership via leases with TTL, is a strong scaling pattern for 10K concurrent gamers. The Lua-scripted atomic selection also shows awareness of race conditions between matcher replicas.
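
The bucketing and atomic-selection idea can be illustrated without Redis. The sketch below is an in-memory stand-in, not the candidate's implementation: the key format, the 100-point bucket width, and scoring by enqueue time are all assumptions, and a lock plays the role that a Lua script would play in Redis.

```python
import threading
from typing import List, Tuple


def bucket_key(game_type: str, region: str, elo: int, bucket_width: int = 100) -> str:
    """Derive a bucket key from game type, region, and the ELO band's lower bound."""
    low = (elo // bucket_width) * bucket_width
    return f"mm:{game_type}:{region}:{low}"


class BucketQueue:
    """Stand-in for one Redis sorted set, scored by enqueue time (assumed)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: List[Tuple[float, str]] = []  # (enqueued_at, player_id)

    def add(self, player_id: str, enqueued_at: float) -> None:
        with self._lock:
            self._entries.append((enqueued_at, player_id))
            self._entries.sort()

    def take_oldest(self, n: int) -> List[str]:
        """Atomically remove the n longest-waiting players, or none if fewer exist."""
        with self._lock:
            if len(self._entries) < n:
                return []
            taken, self._entries = self._entries[:n], self._entries[n:]
            return [player_id for _, player_id in taken]
```

The all-or-nothing behavior of `take_oldest` is the property that matters: two matcher replicas can never each pull part of the same lobby.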

✅ Good

Estimated wait time path is explicitly designed

The design does not treat wait time as an afterthought. It includes a dedicated wait service, precomputed heuristics per bucket, short-lived WS-side caching, and material-change-based updates, which is a practical approach for serving frequent wait-time reads efficiently.
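
A material-change check might look like the following sketch. The thresholds (5 seconds absolute, 20% relative) are illustrative assumptions; the submission does not state what it treats as a material change.

```python
from typing import Optional


def should_push(last_sent_sec: Optional[float],
                new_sec: float,
                min_abs_delta_sec: float = 5.0,
                min_rel_delta: float = 0.2) -> bool:
    """Push an update only if the estimate moved by a material amount.

    A change is material when it exceeds both the absolute threshold and
    the relative threshold (as a fraction of the last value sent).
    """
    if last_sent_sec is None:
        return True  # nothing sent yet, so the first estimate always goes out
    delta = abs(new_sec - last_sent_sec)
    return delta >= max(min_abs_delta_sec, min_rel_delta * last_sent_sec)
```

For example, with a last-sent estimate of 30 seconds, a new estimate of 32 seconds is suppressed while 40 seconds is pushed.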

✅ Good

Durable eventing for match creation

Using Postgres plus an outbox/CDC worker before publishing to Kafka is a solid reliability pattern. It reduces the risk of losing match-created events and is appropriate for coordinating downstream game server startup.
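
The outbox pattern can be shown in miniature with SQLite standing in for Postgres and a callback standing in for the Kafka producer. The table layout, topic name, and payload shape are assumptions made for illustration.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE matches (id TEXT PRIMARY KEY, players TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
"""


def create_match(conn: sqlite3.Connection, match_id: str, player_ids: list) -> None:
    """Write the match row and its event in one transaction."""
    with conn:  # commits both inserts, or rolls both back on error
        conn.execute(
            "INSERT INTO matches (id, players) VALUES (?, ?)",
            (match_id, json.dumps(player_ids)),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("match-created", json.dumps({"matchId": match_id, "players": player_ids})),
        )


def drain_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish pending events in order; mark each as published afterwards."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # may repeat after a crash: at-least-once
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because publishing happens before the row is marked, delivery is at-least-once, so the game-server startup consumer should be idempotent on `matchId`.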

✅ Good

Basic redundancy is present across major services

The design includes multiple WS servers, replicated matchmaking services, replicated matcher and wait services, Redis cluster, and managed Postgres with read replicas and failover. That is a reasonable baseline for avoiding obvious single-instance failures.

warning

WebSocket routing and notification path are internally inconsistent

The diagram shows both direct WS-server consumption from Redis Streams and a separate path where WS servers send through the L4 load balancer to reach users. In practice, once the client has an established socket, the WS server should write directly to that connection; the load balancer is only for connection establishment. Clean up the flow and make the clientId->serverId ownership model explicit so notifications are routed deterministically.
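
An explicit ownership model could be as small as the sketch below. The `ConnectionRegistry` class and its method names are hypothetical; in the design this mapping would presumably live in Redis rather than in process memory.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """Maps each clientId to the WS server that currently holds its socket."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def register(self, client_id: str, server_id: str) -> None:
        # The most recent connection wins, which covers reconnects.
        self._owner[client_id] = server_id

    def unregister(self, client_id: str, server_id: str) -> None:
        # Ignore stale disconnects from a server that no longer owns the client.
        if self._owner.get(client_id) == server_id:
            del self._owner[client_id]

    def owner_of(self, client_id: str) -> Optional[str]:
        return self._owner.get(client_id)
```

A notification is then routed to `owner_of(client_id)`, and that server writes straight to the open socket without passing through the load balancer.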

warning

Redis is carrying core queue state without a clear failover/rebuild story

The active matchmaking queue, bucket ownership leases, and some client routing metadata appear to live primarily in Redis. If Redis data is lost or partially unavailable, active queued players may be dropped or duplicated. For a senior-level design, add a clear persistence/recovery plan: durable Redis configuration, replayable source of truth, or periodic checkpointing so the queue can be reconstructed after failure.

warning

No clear backpressure or overload handling on hot buckets

The design mentions dynamic bucket splitting and lease rebalancing, which is good, but it does not explain what happens when a region/ELO bucket becomes much hotter than others or when WS/update traffic spikes. Add explicit controls such as rate limits, bounded stream consumer lag, degraded wait-time refresh frequency, and autoscaling triggers on bucket depth or matcher latency.
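
One of the suggested controls, degrading the wait-time refresh frequency under load, might be sketched as follows. The 5-second base interval is from the design; the lag threshold and the 30-second ceiling are assumed values.

```python
def refresh_interval_sec(consumer_lag: int,
                         base_interval: float = 5.0,
                         lag_threshold: int = 1_000,
                         max_interval: float = 30.0) -> float:
    """Stretch the wait-time refresh interval as stream consumer lag grows.

    At or below the threshold the base interval applies; above it the
    interval scales linearly with lag, capped at max_interval.
    """
    if consumer_lag <= lag_threshold:
        return base_interval
    return min(max_interval, base_interval * consumer_lag / lag_threshold)
```

Capping the interval keeps estimate staleness bounded even under sustained overload, which makes the degraded mode something the NFRs can state explicitly.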

info

Caching strategy is good for wait estimates but limited elsewhere

Wait-time caching is well thought out, but the design could be clearer about what reads should hit Postgres replicas versus Redis. Since this system is mostly matchmaking, keeping operational reads off Postgres would improve resilience. Explicitly state that queue depth, bucket heuristics, and reconnect lookups are served from Redis, while Postgres is reserved for durable match/game records.

info

Some components are weakly integrated in the diagram

A few nodes look more like notes than first-class components, such as the standalone 'Replica' services and the separate 'Primary + 2 read replicas' database node. This makes the architecture harder to reason about and creates apparent orphan/duplicate elements. Consolidate these into the Postgres component and show only meaningful runtime services and their actual traffic paths.
