Reviewed by 6 specialized AI reviewers.
There is meaningful system design thinking here and several good architectural instincts, but the missing core delivery/sync flows and major capacity errors are significant correctness issues for a mid-level candidate. This is close to workable, but not yet strong enough because some required flows are not fully designed.
Core NFR dimensions are explicitly identified
The design clearly calls out availability, latency, consistency/ordering, scale, and delivery guarantees. For a chat system, these are the right non-functional dimensions to surface early and they align well with the stated functional requirements.
Consistency target is appropriately weaker than strong consistency
Stating 'availability >> consistency' together with causal ordering shows a reasonable understanding that chat systems usually do not need global strong consistency, but do need a stronger model than plain eventual consistency so conversations feel correct to users.
Scale estimate is quantified
Converting 40B messages/day into an approximate request rate demonstrates useful capacity thinking. Even if the exact assumptions behind the number are not fully expanded, quantifying expected load is a solid NFR practice.
Latency and availability are not measurable
The NFRs mention 'low latency' and prioritize availability, but they do not define concrete targets such as p95/p99 send-to-deliver latency or an uptime/SLA objective. Without measurable numbers, it is hard to evaluate whether the design meets the requirements. Add explicit targets, for example p95 message delivery latency and a service availability goal.
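To illustrate what a measurable latency target could look like, here is a minimal sketch that checks observed send-to-deliver latencies against percentile targets. The target values in `SLO_MS` and the helper names are assumptions made for this example; the design under review states no numbers.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical targets in milliseconds, chosen only for illustration.
SLO_MS = {95: 500, 99: 1000}

def check_slo(latencies_ms):
    """Return {percentile: (observed_ms, target_ms, met)} for each target."""
    report = {}
    for pct, target in SLO_MS.items():
        observed = percentile(latencies_ms, pct)
        report[pct] = (observed, target, observed <= target)
    return report
```

With targets written down this way, a load test can report whether p95 and p99 were met instead of asserting "low latency".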
Guaranteed delivery is underspecified
Saying 'guaranteed delivery' is too absolute for a distributed chat system unless the exact semantics are defined. It is unclear whether this means at-least-once delivery, exactly-once user-visible delivery, durable acceptance by the server, or delivery within the 30-day offline window. Define the guarantee precisely and describe how duplicates, retries, and acknowledgments are handled.
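One way to make the guarantee precise is at-least-once delivery with recipient-side deduplication, so retries are safe and the user sees each message once. The sketch below is a toy in-memory model under that assumption; `Inbox`, `send_with_retry`, and the `drop_first_acks` knob are hypothetical names for illustration, not part of the reviewed design.

```python
class Inbox:
    """Recipient-side store that tolerates redelivery of the same message."""

    def __init__(self):
        self.seen = set()      # message ids already applied
        self.messages = []     # user-visible messages, each exactly once

    def deliver(self, message_id, body):
        """Apply a delivery and always ack, so the sender can stop retrying."""
        if message_id not in self.seen:
            self.seen.add(message_id)
            self.messages.append(body)
        return ("ACK", message_id)


def send_with_retry(inbox, message_id, body, drop_first_acks=0, max_attempts=5):
    """Resend until an ack is observed; drop_first_acks simulates lost acks."""
    for attempt in range(1, max_attempts + 1):
        ack = inbox.deliver(message_id, body)
        ack_arrived = attempt > drop_first_acks
        if ack_arrived and ack == ("ACK", message_id):
            return attempt
    raise TimeoutError("no ack after %d attempts" % max_attempts)
```

Here the message is delivered several times on the wire when acks are lost, but the stable `message_id` keeps the user-visible result to a single copy.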
Causal ordering scope is unclear
Causal ordering is a reasonable target, but the design does not specify where it applies: per conversation, per sender, across devices, or across all participants in a group chat. For correctness, the scope should be explicit because global causal ordering is much harder than per-chat ordering. Clarify the ordering contract expected by clients.
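If the contract is scoped per conversation, one common approach is a server-assigned sequence number per chat, with clients releasing messages only in sequence order. The sketch below assumes that approach; `ChatSequencer` and `OrderedReceiver` are hypothetical names, and a real system would persist the counters.

```python
from collections import defaultdict


class ChatSequencer:
    """Server side: monotonically increasing sequence number per chat."""

    def __init__(self):
        self._last = defaultdict(int)

    def assign(self, chat_id):
        self._last[chat_id] += 1
        return self._last[chat_id]


class OrderedReceiver:
    """Client side: releases one chat's messages strictly in sequence order."""

    def __init__(self):
        self.expected = 1
        self.pending = {}  # seq -> body, held until the gap before it is filled

    def receive(self, seq, body):
        if seq < self.expected:
            return []  # duplicate of something already released
        self.pending[seq] = body
        released = []
        while self.expected in self.pending:
            released.append(self.pending.pop(self.expected))
            self.expected += 1
        return released
```

Because every participant applies the same per-chat order, a reply cannot appear before the message it answers within that chat. Ordering across different chats is deliberately left undefined in this sketch.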
Storage estimate does not clearly account for retention and media
The data estimate mentions approximately 32 TB, but the functional requirements include offline retention for up to 30 days and media messaging. For NFR completeness, separate text-message storage from media storage and show whether the estimate is per day or total retained footprint over the retention window.
Covers the main domain nouns
The design lists the core entities needed for the stated requirements: User for participants, Message for exchanged content, and Group/Chat for conversation contexts. This is a solid starting set for supporting both 1:1 and group messaging.
Entity boundaries between Chat and Group are unclear
It is not clear how Chat and Group relate to each other. For these requirements, the design should make the relationship explicit—for example, Chat as the conversation container with types like 1:1 or group, or Group as a specialized kind of chat. Without that, the model is ambiguous for representing both direct and group conversations consistently.
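As one possible resolution, Chat can be the single conversation container with a type field, so a group is just a chat of type group. The sketch below assumes that modeling choice; the field names are hypothetical, and the 100-member cap comes from the stated group-size requirement.

```python
from dataclasses import dataclass
from enum import Enum


class ChatType(Enum):
    DIRECT = "direct"
    GROUP = "group"


MAX_GROUP_SIZE = 100  # from the requirement of group chats up to 100 users


@dataclass
class Chat:
    chat_id: str
    chat_type: ChatType
    participant_ids: list
    name: str = ""

    def __post_init__(self):
        count = len(set(self.participant_ids))
        if self.chat_type is ChatType.DIRECT and count != 2:
            raise ValueError("a direct chat has exactly two participants")
        if self.chat_type is ChatType.GROUP and count > MAX_GROUP_SIZE:
            raise ValueError("a group chat has at most %d participants" % MAX_GROUP_SIZE)
```

With this shape, messages reference a `chat_id` only, and the same send and delivery paths serve both 1:1 and group conversations.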
Missing media as a first-class domain concept
Media sending/receiving is a stated requirement, but there is no entity representing media/attachment content. Add a Media or Attachment entity, or explicitly state that Message can contain media as a distinct content type, so the domain model clearly covers this requirement.
Includes core traffic and storage estimates
The candidate does provide the key baseline numbers expected in a capacity section for this problem: messages per day, a rough requests-per-second estimate, and daily storage estimates for text and media. That shows the right instinct to translate DAU into throughput and storage.
RPS estimate understates effective load
40B messages/day is about 463K messages/second on average, so the stated 462K RPS is roughly right for average ingest but covers nothing beyond it. Each message implies a send write plus one or more delivery/read operations, so the effective system load is higher than the stated number. Recompute from first principles and clearly distinguish average QPS from peak QPS, ideally applying a peak factor (for example 3x-5x) so capacity is sized for realistic bursts.
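The average-versus-peak arithmetic can be sketched as follows. The 40B messages/day figure and the 3x-5x peak factors come from the text above; the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(events_per_day):
    """Average events per second over a full day."""
    return events_per_day / SECONDS_PER_DAY


def peak_qps(events_per_day, peak_factor):
    """Average rate scaled by an assumed peak-to-average factor."""
    return average_qps(events_per_day) * peak_factor


MESSAGES_PER_DAY = 40_000_000_000

avg = average_qps(MESSAGES_PER_DAY)       # about 463K messages/second
peak_low = peak_qps(MESSAGES_PER_DAY, 3)  # about 1.39M messages/second
peak_high = peak_qps(MESSAGES_PER_DAY, 5) # about 2.31M messages/second
```

These figures cover message ingress only; delivery and read traffic come on top of them.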
Media storage math is off by three orders of magnitude
The estimate says 4B attached messages/day with 5 MB average media. That works out to 4B × 5 MB = 20 PB/day of media, not 20 TB/day. This is a major arithmetic error and materially changes storage planning. Fix by multiplying 4B × 5 MB carefully, then include the 30-day offline retention requirement to estimate total hot/warm storage needed.
30-day retention is not reflected in total storage sizing
The requirements explicitly say offline users can receive messages sent while offline for up to 30 days, but the calculation only gives per-day storage. Capacity planning should extend daily text and media volumes into retained storage over the retention window, including replication overhead if applicable. Add 30-day retained text/media totals so the design can be evaluated against the requirement.
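A short worked calculation for the media footprint, using decimal units. The 4B attachments/day, 5 MB average size, and 30-day window come from the review; the replication factor of 3 is an assumption added for illustration, and the retained total is an upper bound that assumes nothing is deleted early.

```python
MB = 10**6
PB = 10**15

ATTACHMENTS_PER_DAY = 4_000_000_000
AVG_MEDIA_BYTES = 5 * MB
RETENTION_DAYS = 30
REPLICATION_FACTOR = 3  # assumption, not stated in the design


def retained_bytes(daily_bytes, retention_days, replication_factor=1):
    """Upper-bound footprint if every day's data is kept for the full window."""
    return daily_bytes * retention_days * replication_factor


media_bytes_per_day = ATTACHMENTS_PER_DAY * AVG_MEDIA_BYTES          # 20 PB/day
media_retained = retained_bytes(media_bytes_per_day, RETENTION_DAYS)  # 600 PB
media_replicated = retained_bytes(
    media_bytes_per_day, RETENTION_DAYS, REPLICATION_FACTOR
)                                                                     # 1,800 PB
```

The same `retained_bytes` helper can be applied to the daily text volume once it is clear whether the text estimate is per day or total.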
Peak assumptions and group fanout impact are missing
At 1B DAU and support for group chats up to 100 users, average message ingress alone is not enough to size the system. Group messages can create much higher delivery fanout than 1:1 chat, and real systems must handle peak traffic above the daily average. Add a simple peak multiplier and a rough split between 1:1 and group traffic to show the numbers are in the right ballpark.
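A rough fanout estimate can be sketched like this. The 20% group share and average group size of 20 are placeholder assumptions chosen only to show the shape of the calculation; only the 100-member cap and the roughly 463K messages/second average come from the review.

```python
def delivery_rate(ingress_per_sec, group_share, avg_group_size):
    """Deliveries per second: 1 per direct message, (size - 1) per group message."""
    direct = ingress_per_sec * (1 - group_share)
    group = ingress_per_sec * group_share * (avg_group_size - 1)
    return direct + group


INGRESS_PER_SEC = 463_000  # average ingress derived from 40B messages/day

# Assumed traffic mix: 20% of messages go to groups averaging 20 members.
typical = delivery_rate(INGRESS_PER_SEC, group_share=0.2, avg_group_size=20)

# Worst case for sizing: every message goes to a full 100-member group.
worst = delivery_rate(INGRESS_PER_SEC, group_share=1.0, avg_group_size=100)
```

Even under the modest assumed mix, deliveries come out at several times the ingress rate, which is why average ingress alone undersizes the delivery path.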
Core chat actions are represented
The API includes routes/messages for creating chats, sending messages, receiving new messages, and modifying chat membership/name, which maps reasonably well to the core functional requirements for 1:1 and group chat.
Push-style delivery is included
Using server-to-client messages like chatUpdate and newMessage is a sensible pattern for chat because clients need near-real-time updates rather than relying only on polling.
Offline message retrieval API is missing
One of the functional requirements is that users can receive messages sent while offline for up to 30 days, but there is no endpoint/message flow for reconnect sync, fetching message history, or resuming from a last-seen cursor/message ID. Add a read API such as getMessages(chatId, cursor) or a reconnect sync message that returns missed messages since the last acknowledged message.
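A minimal sketch of the suggested read API, assuming each message in a chat carries an increasing sequence number that doubles as the cursor. The in-memory `store` stands in for the real message table, and `get_messages` and `sync_chat` are hypothetical names.

```python
def get_messages(store, chat_id, cursor=0, limit=50):
    """Return (page, next_cursor) with messages whose seq is greater than cursor.

    store maps chat_id -> list of (seq, body) tuples sorted by seq.
    """
    page = [m for m in store.get(chat_id, []) if m[0] > cursor][:limit]
    next_cursor = page[-1][0] if page else cursor
    return page, next_cursor


def sync_chat(store, chat_id, last_acked_seq, limit=50):
    """On reconnect, page through everything missed since the last acked seq."""
    missed = []
    cursor = last_acked_seq
    while True:
        page, cursor = get_messages(store, chat_id, cursor, limit)
        if not page:
            return missed, cursor
        missed.extend(page)
```

The client persists the returned cursor after applying each page, so an interrupted sync resumes where it stopped rather than starting over.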
Protocol shape is inconsistent and underspecified
The design mixes request/response arrows and push messages, but does not clearly define whether this is REST, WebSocket, or a hybrid. For a chat system, that ambiguity matters because connection lifecycle, delivery acknowledgements, and retry behavior depend on the protocol. Clarify the transport and structure: for REST use explicit HTTP methods and resource paths; for WebSocket define message types, payload schemas, and ack semantics consistently.
Primary entity read/list operations are incomplete
There are create and modify operations for chats and send for messages, but no way to fetch chat details, list a user's chats, or read message history. CRUD does not need to be exhaustive for every possible object, but the primary entities here are chats and messages, and clients need read operations to function correctly. Add endpoints/messages for listing chats and reading messages for a chat.
modifyChat request is missing target chat identifier
The modifyChat payload includes participants and name but does not specify which chat should be modified. Without a chatId in the request path or body, the operation is ambiguous. Include chatId explicitly, for example modifyChat(chatId, participants, name) or PATCH /chats/{chatId}.
Message delivery state is too vague
The sendMessage response only returns SUCCESS or FAILURE, and inbound messages are acknowledged with RECEIVED, but there is no messageId, timestamp, or delivery token. This makes deduplication, retries, ordering, and offline sync much harder. Return a server-assigned messageId and timestamp on send, and include messageId in newMessage and ack flows.
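The richer send response could look like the sketch below, which assumes the client supplies its own `client_message_id` so that a retried send returns the original acknowledgement instead of creating a second message. All type and field names here are illustrative, not the design's actual API.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SendAck:
    status: str
    message_id: str           # server-assigned, reused in newMessage and ack flows
    client_message_id: str    # echoed back so the client can match the request
    server_timestamp_ms: int


class MessageService:
    """Toy in-memory service; a real one would persist before acknowledging."""

    def __init__(self):
        self._acks = {}  # (sender_id, client_message_id) -> SendAck

    def send_message(self, sender_id, client_message_id, chat_id, body):
        key = (sender_id, client_message_id)
        if key in self._acks:
            return self._acks[key]  # retry of an already accepted send
        ack = SendAck(
            status="SUCCESS",
            message_id=uuid.uuid4().hex,
            client_message_id=client_message_id,
            server_timestamp_ms=int(time.time() * 1000),
        )
        self._acks[key] = ack
        return ack
```

Returning the same `message_id` on retry is what lets downstream delivery and offline sync deduplicate reliably.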
Media attachment contract needs more structure
Attachments are included, which is good for the media requirement, but the API does not define whether attachments are raw uploads, references, or pre-uploaded media IDs/URLs. A clearer contract would reduce ambiguity: e.g. upload media first and send attachment metadata in sendMessage, or define attachment fields such as type, size, and mediaId.
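One possible shape for the attachment contract is upload-first: the client uploads media separately and `sendMessage` carries only a reference. The sketch below assumes that flow; the allowed types, the size limit, and the field names are hypothetical.

```python
from dataclasses import asdict, dataclass

ALLOWED_MEDIA_TYPES = {"image", "video", "audio", "document"}  # illustrative set
MAX_MEDIA_BYTES = 100 * 1024 * 1024                            # illustrative limit


@dataclass(frozen=True)
class AttachmentRef:
    media_id: str      # id returned by a prior upload step
    media_type: str
    size_bytes: int

    def validate(self):
        if self.media_type not in ALLOWED_MEDIA_TYPES:
            raise ValueError("unsupported media type: %s" % self.media_type)
        if not 0 < self.size_bytes <= MAX_MEDIA_BYTES:
            raise ValueError("attachment size out of range")


def build_send_message(chat_id, text, attachments):
    """Build a sendMessage payload that references media by id only."""
    for attachment in attachments:
        attachment.validate()
    return {
        "type": "sendMessage",
        "chatId": chat_id,
        "text": text,
        "attachments": [asdict(a) for a in attachments],
    }
```

Keeping raw bytes out of the message payload means the chat path stays small and fast regardless of media size.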
Separation of realtime, messaging, and media paths
The design splits WebSocket connection handling, message processing, and media processing into separate components. That is a solid high-level pattern for chat systems because it prevents long-lived connection management from being coupled to heavier message persistence or media workflows.
Asynchronous media pipeline
Using object storage plus Kafka and encoding workers is a good architectural choice for media messages. It keeps large file handling off the synchronous chat path and allows uploads, transcoding, and thumbnail generation to scale independently.
Offline delivery considered in the architecture
The design includes persistent storage in DynamoDB and push notifications through APNs/FCM for offline users. That aligns well with the requirement to receive messages while offline and shows awareness that delivery cannot depend only on active WebSocket connections.
End-to-end message delivery path is incomplete
The diagram does not show how a user message actually reaches the messaging servers from the client in the normal chat flow. The user connects to the load balancer and WebSocket servers, but there is no connection from WebSocket servers to messaging servers, so the core send/receive path is broken at the HLD level. Add an explicit path where WebSocket servers forward inbound messages to messaging servers and receive delivery events back for fanout to connected recipients.
Offline retrieval flow for 30-day stored messages is missing
The design stores messages and has an inbox model, but it does not show how a reconnecting client fetches missed messages from storage. Push notifications alone do not satisfy the requirement to receive messages sent while offline. Add a clear sync path on reconnect: client to WebSocket/API layer, then messaging service reads inbox/message history from DynamoDB and delivers missed messages, with acknowledgement or cursor tracking.
Group chat fanout architecture is underspecified for 100-member groups at large scale
The design has chat metadata and messaging servers, but it does not show how participant membership is resolved and how fanout is performed for group messages. At 1B DAU, group delivery needs an explicit pattern such as reading participant lists from chat metadata, writing per-user inbox entries, and routing online recipients via connection lookup. Make the fanout path explicit so the design clearly supports both 1:1 and group chat.
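The fanout pattern described above can be sketched as one function: resolve members, write a per-user inbox entry, then split recipients by whether a live connection exists. The dictionaries below stand in for chat metadata, the inbox table, and the connection lookup; all names are assumptions for illustration.

```python
def fan_out(message, members_by_chat, connections, inboxes):
    """Write one inbox entry per recipient and split them by online status.

    members_by_chat: chat_id -> list of user ids (chat metadata)
    connections:     user_id -> WebSocket server id for online users
    inboxes:         user_id -> list of message ids (durable per-user inbox)
    Returns (online, offline): online is a list of (user_id, server_id) pairs
    to route in real time, offline is a list of user ids for push notification.
    """
    recipients = [
        user_id
        for user_id in members_by_chat[message["chat_id"]]
        if user_id != message["sender_id"]
    ]
    online, offline = [], []
    for user_id in recipients:
        # Persist first, so a reconnecting client can always sync the message.
        inboxes.setdefault(user_id, []).append(message["message_id"])
        server_id = connections.get(user_id)
        if server_id is not None:
            online.append((user_id, server_id))
        else:
            offline.append(user_id)
    return online, offline
```

A 1:1 chat is the same path with a single recipient, so no separate delivery flow is needed.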
Redis Pub/Sub is a weak backbone for durable cross-service delivery
Redis Pub/Sub is transient and does not provide durable replay, which makes it risky if it is being used for important message or media state propagation between services. For a system at this scale, use Kafka or another durable log for critical inter-service events, and reserve Redis for ephemeral signaling or caching.
Several components are weakly connected or appear orphaned
The management server, service registry, Redis cache, and etcd-based WebSocket manager are present, but their operational role in the main request path is not clearly connected. For example, the management server is only linked to the load balancer and database, and Redis cache has only one inbound edge. Tighten the diagram by showing exactly what each component does in routing, presence, discovery, or metadata lookup, or remove components that are not part of the end-to-end flow.
Media upload path bypasses the main API layer
The user is shown sending directly to a compressor and then to blob storage, which makes the control flow harder to reason about and may complicate auth and metadata creation. A cleaner HLD is to have the client obtain an upload token or presigned URL from the media service, upload directly to object storage, and then let the media service finalize metadata after processing.
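To show the control flow of the upload-token idea without depending on any cloud provider, here is a sketch that uses an HMAC-signed token as a stand-in for a presigned URL. This is not any object store's actual API; the secret, the token layout, and the function names are all assumptions, and ids are assumed not to contain colons.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical; a real service would load a managed key


def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_upload_token(user_id, media_id, expires_at):
    """Media service: authorize one upload of media_id until expires_at (epoch s)."""
    payload = "%s:%s:%d" % (user_id, media_id, expires_at)
    return "%s:%s" % (payload, _sign(payload))


def verify_upload_token(token, now):
    """Storage edge: return the grant if the token is authentic and unexpired."""
    payload, _, signature = token.rpartition(":")
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return None
    user_id, media_id, expires_at = payload.split(":")
    if now > int(expires_at):
        return None
    return {"user_id": user_id, "media_id": media_id}
```

The point of the pattern is that authentication and metadata creation happen in the media service when the token is issued, while the large upload itself goes straight to storage.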