WhatsApp / Messaging — system design by GentleBear93

Lean No Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

Hire Signal: Lean No Hire

There is meaningful system design thinking here and several good architectural instincts, but the missing core delivery/sync flows and major capacity errors are significant correctness issues for a mid-level candidate. This is close to workable, but not yet strong enough because some required flows are not fully designed.

✅ Good

Core NFR dimensions are explicitly identified

The design clearly calls out availability, latency, consistency/ordering, scale, and delivery guarantees. For a chat system, these are the right non-functional dimensions to surface early and they align well with the stated functional requirements.

✅ Good

Consistency target is appropriately weaker than strong consistency

Stating 'availability >> consistency' together with causal ordering shows a reasonable understanding that chat systems usually do not need global strong consistency, but do need a stronger model than plain eventual consistency so conversations feel correct to users.

✅ Good

Scale estimate is quantified

Converting 40B messages/day into an approximate request rate demonstrates useful capacity thinking. Even if the exact assumptions behind the number are not fully expanded, quantifying expected load is a solid NFR practice.

warning

Latency and availability are not measurable

The NFRs mention 'low latency' and prioritize availability, but they do not define concrete targets such as p95/p99 send-to-deliver latency or an uptime/SLA objective. Without measurable numbers, it is hard to evaluate whether the design meets the requirements. Add explicit targets, for example p95 message delivery latency and a service availability goal.
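As a rough illustration of what a measurable latency target could look like, the sketch below computes a nearest-rank percentile over send-to-deliver samples and checks it against a target. The 500 ms target and the sample values are hypothetical and are not taken from the design.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


# Hypothetical target and measurements, for illustration only.
P95_TARGET_MS = 500
samples = [120, 80, 95, 300, 450, 110, 90, 700, 130, 105]

p95 = percentile(samples, 95)
meets_target = p95 <= P95_TARGET_MS
```

With a stated target, a reviewer can check a design claim such as "p95 delivery under 500 ms" against numbers rather than against the phrase "low latency".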

warning

Guaranteed delivery is underspecified

Saying 'guaranteed delivery' is too absolute for a distributed chat system unless the exact semantics are defined. It is unclear whether this means at-least-once delivery, exactly-once user-visible delivery, durable acceptance by the server, or delivery within the 30-day offline window. Define the guarantee precisely and describe how duplicates, retries, and acknowledgments are handled.
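One way to make the guarantee concrete is at-least-once delivery with receiver-side deduplication. The sketch below is a minimal in-memory model of that idea; `Receiver`, `send_with_retry`, and the `drop_first_n_acks` parameter (which simulates lost acknowledgments) are hypothetical names, not part of the candidate's design.

```python
class Receiver:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()
        self.inbox = []

    def deliver(self, message_id, body):
        # Always acknowledge so the sender can stop retrying,
        # but only apply messages that have not been seen before.
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.inbox.append(body)
        return "ACK"


def send_with_retry(receiver, message_id, body, drop_first_n_acks=0, max_attempts=5):
    """Resend until an ack arrives; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        ack = receiver.deliver(message_id, body)
        # Simulate the first N acks being lost on the network.
        if ack == "ACK" and attempt > drop_first_n_acks:
            return attempt
    raise TimeoutError("no ack received")


receiver = Receiver()
attempts = send_with_retry(receiver, "m1", "hi", drop_first_n_acks=2)
```

Here the message is transmitted three times but appears once in the inbox, which is the "at-least-once transport, exactly-once user-visible" semantics the design would need to state explicitly.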

warning

Causal ordering scope is unclear

Causal ordering is a reasonable target, but the design does not specify where it applies: per conversation, per sender, across devices, or across all participants in a group chat. For correctness, the scope should be explicit because global causal ordering is much harder than per-chat ordering. Clarify the ordering contract expected by clients.
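If the contract is scoped per conversation, one possible approach is a server-assigned, per-chat sequence number with a small reorder buffer on the client. The sketch below assumes a single sequencer per chat; `ChatSequencer` and `OrderedChatView` are illustrative names only.

```python
from collections import defaultdict


class ChatSequencer:
    """Assigns a monotonically increasing sequence number per chat."""

    def __init__(self):
        self._last = defaultdict(int)

    def assign(self, chat_id):
        self._last[chat_id] += 1
        return self._last[chat_id]


class OrderedChatView:
    """Client-side buffer that releases one chat's messages in sequence order."""

    def __init__(self):
        self.expected = 1
        self.pending = {}
        self.displayed = []

    def receive(self, seq, body):
        self.pending[seq] = body
        # Release every message that is now contiguous with what was shown.
        while self.expected in self.pending:
            self.displayed.append(self.pending.pop(self.expected))
            self.expected += 1


sequencer = ChatSequencer()
view = OrderedChatView()
view.receive(2, "b")  # arrives early, held back
view.receive(1, "a")  # fills the gap, both are released in order
```

Ordering here is defined only within a chat; nothing is promised across chats, which keeps the contract far cheaper than global causal ordering.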

info

Storage estimate does not clearly account for retention and media

The data estimate mentions approximately 32 TB, but the functional requirements include offline retention for up to 30 days and media messaging. For NFR completeness, separate text-message storage from media storage and show whether the estimate is per day or total retained footprint over the retention window.

✅ Good

Covers the main domain nouns

The design lists the core entities needed for the stated requirements: User for participants, Message for exchanged content, and Group/Chat for conversation contexts. This is a solid starting set for supporting both 1:1 and group messaging.

warning

Entity boundaries between Chat and Group are unclear

It is not clear how Chat and Group relate to each other. For these requirements, the design should make the relationship explicit—for example, Chat as the conversation container with types like 1:1 or group, or Group as a specialized kind of chat. Without that, the model is ambiguous for representing both direct and group conversations consistently.
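A minimal sketch of the first option, Chat as the single conversation container with a type field, might look like this. The field names are assumptions; the 100-member limit comes from the stated group size requirement.

```python
from dataclasses import dataclass
from enum import Enum


class ChatType(Enum):
    DIRECT = "direct"
    GROUP = "group"


MAX_GROUP_SIZE = 100  # group size limit from the requirements


@dataclass
class Chat:
    chat_id: str
    chat_type: ChatType
    participant_ids: list
    name: str = ""

    def __post_init__(self):
        count = len(set(self.participant_ids))
        if self.chat_type is ChatType.DIRECT and count != 2:
            raise ValueError("a direct chat has exactly two participants")
        if self.chat_type is ChatType.GROUP and not 2 <= count <= MAX_GROUP_SIZE:
            raise ValueError("group size out of range")


direct = Chat("c1", ChatType.DIRECT, ["u1", "u2"])
group = Chat("c2", ChatType.GROUP, ["u1", "u2", "u3"], name="Weekend plans")
```

With this shape, Message only ever references a `chat_id`, so 1:1 and group conversations share one storage and delivery path.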

warning

Missing media as a first-class domain concept

Media sending/receiving is a stated requirement, but there is no entity representing media/attachment content. Add a Media or Attachment entity, or explicitly state that Message can contain media as a distinct content type, so the domain model clearly covers this requirement.

✅ Good

Includes core traffic and storage estimates

The candidate does provide the key baseline numbers expected in a capacity section for this problem: messages per day, a rough requests-per-second estimate, and daily storage estimates for text and media. That shows the right instinct to translate DAU into throughput and storage.

warning

RPS estimate is too low for the stated message volume

40B messages/day converts to about 463K messages/second on average (40 × 10^9 / 86,400 s), so the stated 462K RPS covers only average message ingress. If each message implies both a send write and one or more delivery/read operations, the effective system load is higher than the stated number. Recompute from first principles and clearly distinguish average QPS from peak QPS, ideally applying a peak factor (for example 3x-5x) so capacity is sized for realistic bursts.
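The recomputation can be sketched in a few lines. The message volume and the 3x-5x peak range come from the review; the variable names are illustrative.

```python
MESSAGES_PER_DAY = 40_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average ingress: messages accepted per second.
avg_msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY  # about 463K

# Peak ingress under the 3x-5x burst factors suggested above.
PEAK_FACTORS = (3, 5)
peak_msgs_per_sec = [avg_msgs_per_sec * factor for factor in PEAK_FACTORS]
# roughly 1.39M to 2.31M messages/second
```

These figures cover message ingress only; delivery and read operations add to the total request rate.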

warning

Media storage math is off by three orders of magnitude

The estimate says 4B attached messages/day with 5 MB average media, which is about 20 PB/day of media (4 × 10^9 × 5 MB = 2 × 10^10 MB), not 20 TB/day. This is a major arithmetic error and materially changes storage planning. Redo the 4B × 5 MB multiplication with explicit units, then include the 30-day offline retention requirement to estimate total hot/warm storage needed.
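Carrying the units explicitly makes the error easy to spot. The sketch below uses decimal units (1 MB = 10^6 bytes), which is an assumption about how the estimate was intended.

```python
ATTACHED_MESSAGES_PER_DAY = 4_000_000_000
AVG_MEDIA_BYTES = 5 * 10**6  # 5 MB, decimal units assumed

media_bytes_per_day = ATTACHED_MESSAGES_PER_DAY * AVG_MEDIA_BYTES

media_tb_per_day = media_bytes_per_day / 10**12  # 20,000 TB
media_pb_per_day = media_bytes_per_day / 10**15  # 20 PB
```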

warning

30-day retention is not reflected in total storage sizing

The requirements explicitly say offline users can receive messages sent while offline for up to 30 days, but the calculation only gives per-day storage. Capacity planning should extend daily text and media volumes into retained storage over the retention window, including replication overhead if applicable. Add 30-day retained text/media totals so the design can be evaluated against the requirement.
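A rough upper bound on the retained footprint can be sketched as daily volume times retention days times replication. The replication factor of 3 is an assumption, as is treating the 32 TB text figure as a per-day value; the 20 PB/day media figure is the corrected number from 4B × 5 MB.

```python
RETENTION_DAYS = 30  # offline window from the requirements


def retained(daily_volume, days=RETENTION_DAYS, replication_factor=1):
    """Upper bound on stored volume if everything is kept for the full window."""
    return daily_volume * days * replication_factor


media_pb = retained(20)                                  # 600 PB
media_pb_replicated = retained(20, replication_factor=3)  # 1,800 PB (assumed 3x)
text_tb = retained(32)                                   # 960 TB, if 32 TB is per day
```

Deleting messages once every recipient has acknowledged them would bring the real footprint well below this bound, which is worth stating in the design.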

info

Peak assumptions and group fanout impact are missing

At 1B DAU and support for group chats up to 100 users, average message ingress alone is not enough to size the system. Group messages can create much higher delivery fanout than 1:1 chat, and real systems must handle peak traffic above the daily average. Add a simple peak multiplier and a rough split between 1:1 and group traffic to show the numbers are in the right ballpark.
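A back-of-the-envelope fanout estimate could look like the following. The 20% group share and the average group size of 20 are made-up assumptions to show the shape of the calculation; only the message volume and the 100-member cap come from the requirements.

```python
AVG_MSGS_PER_SEC = 40_000_000_000 / 86_400  # about 463K ingress

GROUP_SHARE = 0.2      # assumed fraction of messages sent to groups
AVG_GROUP_SIZE = 20    # assumed average group size
MAX_GROUP_SIZE = 100   # limit from the requirements


def deliveries_per_message(group_share, group_size):
    """Expected recipient deliveries per sent message."""
    one_to_one = (1 - group_share) * 1
    group = group_share * (group_size - 1)  # everyone except the sender
    return one_to_one + group


avg_fanout = deliveries_per_message(GROUP_SHARE, AVG_GROUP_SIZE)      # about 4.6
worst_fanout = deliveries_per_message(GROUP_SHARE, MAX_GROUP_SIZE)    # about 20.6
avg_deliveries_per_sec = AVG_MSGS_PER_SEC * avg_fanout                # about 2.1M
```

Even under these modest assumptions, delivery load is several times the ingress rate, before any peak multiplier is applied.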

✅ Good

Core chat actions are represented

The API includes routes/messages for creating chats, sending messages, receiving new messages, and modifying chat membership/name, which maps reasonably well to the core functional requirements for 1:1 and group chat.

✅ Good

Push-style delivery is included

Using server-to-client messages like chatUpdate and newMessage is a sensible pattern for chat because clients need near-real-time updates rather than relying only on polling.

critical

Offline message retrieval API is missing

One of the functional requirements is that users can receive messages sent while offline for up to 30 days, but there is no endpoint/message flow for reconnect sync, fetching message history, or resuming from a last-seen cursor/message ID. Add a read API such as getMessages(chatId, cursor) or a reconnect sync message that returns missed messages since the last acknowledged message.
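A cursor-based read could be sketched as below. This is an in-memory stand-in, assuming messages in each chat are stored in ascending `seq` order; the `seq` field, the response shape, and the page size are hypothetical, and 30-day expiry is not modeled.

```python
def get_messages(store, chat_id, cursor=0, limit=50):
    """Return messages in chat_id with seq > cursor, oldest first, plus the next cursor."""
    missed = [m for m in store.get(chat_id, []) if m["seq"] > cursor]
    page = missed[:limit]
    next_cursor = page[-1]["seq"] if page else cursor
    return {
        "messages": page,
        "next_cursor": next_cursor,
        "has_more": len(missed) > len(page),
    }


# A reconnecting client that last acknowledged seq 2 pages through what it missed.
store = {"c1": [{"seq": i, "body": f"m{i}"} for i in range(1, 6)]}
first_page = get_messages(store, "c1", cursor=2, limit=2)
second_page = get_messages(store, "c1", cursor=first_page["next_cursor"], limit=2)
```

The client persists `next_cursor` after each page, so a dropped connection mid-sync resumes without gaps or full re-downloads.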

warning

Protocol shape is inconsistent and underspecified

The design mixes request/response arrows and push messages, but does not clearly define whether this is REST, WebSocket, or a hybrid. For a chat system, that ambiguity matters because connection lifecycle, delivery acknowledgements, and retry behavior depend on the protocol. Clarify the transport and structure: for REST use explicit HTTP methods and resource paths; for WebSocket define message types, payload schemas, and ack semantics consistently.

warning

Primary entity read/list operations are incomplete

There are create and modify operations for chats and send for messages, but no way to fetch chat details, list a user's chats, or read message history. CRUD does not need to be exhaustive for every possible object, but the primary entities here are chats and messages, and clients need read operations to function correctly. Add endpoints/messages for listing chats and reading messages for a chat.

warning

modifyChat request is missing target chat identifier

The modifyChat payload includes participants and name but does not specify which chat should be modified. Without a chatId in the request path or body, the operation is ambiguous. Include chatId explicitly, for example modifyChat(chatId, participants, name) or PATCH /chats/{chatId}.

warning

Message delivery state is too vague

The sendMessage response only returns SUCCESS or FAILURE, and inbound messages are acknowledged with RECEIVED, but there is no messageId, timestamp, or delivery token. This makes deduplication, retries, ordering, and offline sync much harder. Return a server-assigned messageId and timestamp on send, and include messageId in newMessage and ack flows.
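A richer send response might look like the sketch below. `MessageService` and the client-generated `client_msg_id` idempotency key are assumptions added for illustration; the design itself only returns SUCCESS or FAILURE.

```python
import itertools
import time


class MessageService:
    """In-memory stand-in for the send path; not a real storage layer."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._accepted = {}  # (sender_id, client_msg_id) -> response

    def send_message(self, chat_id, sender_id, body, client_msg_id):
        key = (sender_id, client_msg_id)
        if key in self._accepted:
            # Retry of an already accepted send: return the original result.
            return self._accepted[key]
        response = {
            "status": "SUCCESS",
            "messageId": next(self._ids),  # server-assigned
            "timestamp": time.time(),      # server-assigned
            "chatId": chat_id,
            "body": body,
        }
        self._accepted[key] = response
        return response


service = MessageService()
first = service.send_message("c1", "u1", "hi", client_msg_id="k1")
retry = service.send_message("c1", "u1", "hi", client_msg_id="k1")
other = service.send_message("c1", "u1", "again", client_msg_id="k2")
```

Because the retry returns the same `messageId`, the client can resend after a timeout without creating a duplicate, and the same ID can be carried in newMessage and ack flows.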

info

Media attachment contract needs more structure

Attachments are included, which is good for the media requirement, but the API does not define whether attachments are raw uploads, references, or pre-uploaded media IDs/URLs. A clearer contract would reduce ambiguity: e.g. upload media first and send attachment metadata in sendMessage, or define attachment fields such as type, size, and mediaId.
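The "upload first, then reference" option could be expressed as a small typed contract. The field names below (`media_id`, `media_type`, `size_bytes`) and the payload shape are illustrative assumptions.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Attachment:
    media_id: str    # ID returned by a prior upload step
    media_type: str  # for example "image/jpeg"
    size_bytes: int


def build_send_message(chat_id, text, attachments):
    """Build a sendMessage payload that references pre-uploaded media."""
    return {
        "type": "sendMessage",
        "chatId": chat_id,
        "text": text,
        "attachments": [asdict(a) for a in attachments],
    }


payload = build_send_message(
    "c1", "look at this", [Attachment("media-123", "image/jpeg", 2_400_000)]
)
```

Keeping raw bytes out of sendMessage means the chat path only ever carries small metadata records.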

✅ Good

Separation of realtime, messaging, and media paths

The design splits WebSocket connection handling, message processing, and media processing into separate components. That is a solid high-level pattern for chat systems because it prevents long-lived connection management from being coupled to heavier message persistence or media workflows.

✅ Good

Asynchronous media pipeline

Using object storage plus Kafka and encoding workers is a good architectural choice for media messages. It keeps large file handling off the synchronous chat path and allows uploads, transcoding, and thumbnail generation to scale independently.

✅ Good

Offline delivery considered in the architecture

The design includes persistent storage in DynamoDB and push notifications through APNs/FCM for offline users. That aligns well with the requirement to receive messages while offline and shows awareness that delivery cannot depend only on active WebSocket connections.

critical

End-to-end message delivery path is incomplete

The diagram does not show how a user message actually reaches the messaging servers from the client in the normal chat flow. The user connects to the load balancer and WebSocket servers, but there is no connection from WebSocket servers to messaging servers, so the core send/receive path is broken at the HLD level. Add an explicit path where WebSocket servers forward inbound messages to messaging servers and receive delivery events back for fanout to connected recipients.

critical

Offline retrieval flow for 30-day stored messages is missing

The design stores messages and has an inbox model, but it does not show how a reconnecting client fetches missed messages from storage. Push notifications alone do not satisfy the requirement to receive messages sent while offline. Add a clear sync path on reconnect: client to WebSocket/API layer, then messaging service reads inbox/message history from DynamoDB and delivers missed messages, with acknowledgement or cursor tracking.

warning

Group chat fanout architecture is underspecified for 100-member groups at large scale

The design has chat metadata and messaging servers, but it does not show how participant membership is resolved and how fanout is performed for group messages. At 1B DAU, group delivery needs an explicit pattern such as reading participant lists from chat metadata, writing per-user inbox entries, and routing online recipients via connection lookup. Make the fanout path explicit so the design clearly supports both 1:1 and group chat.
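The fanout pattern described above can be sketched with plain dictionaries standing in for chat metadata, the connection lookup, and per-user inboxes. All names are hypothetical, and a real system would batch these writes.

```python
def fan_out(message, chat_members, connections, inboxes):
    """Write a per-user inbox entry, then split recipients into online and offline."""
    pushed, offline = [], []
    for user_id in chat_members[message["chatId"]]:
        if user_id == message["senderId"]:
            continue  # the sender does not receive their own message
        # Durable inbox entry first, so delivery does not depend on a live socket.
        inboxes.setdefault(user_id, []).append(message["messageId"])
        if connections.get(user_id):
            pushed.append(user_id)   # route via the user's WebSocket server
        else:
            offline.append(user_id)  # candidate for a push notification
    return pushed, offline


chat_members = {"c1": ["u1", "u2", "u3"]}
connections = {"u2": "ws-7"}  # user -> WebSocket server holding the connection
inboxes = {}
message = {"chatId": "c1", "senderId": "u1", "messageId": 42}

pushed, offline = fan_out(message, chat_members, connections, inboxes)
```

The same function handles a 1:1 chat as a group of two, which is one argument for a single Chat abstraction.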

warning

Redis Pub/Sub is a weak backbone for durable cross-service delivery

Redis Pub/Sub is transient and does not provide durable replay, which makes it risky if it is being used for important message or media state propagation between services. For a system at this scale, use Kafka or another durable log for critical inter-service events, and reserve Redis for ephemeral signaling or caching.
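The difference between transient pub/sub and a durable log can be shown with a toy model. The two classes below are simplified stand-ins and do not use the Redis or Kafka client APIs.

```python
class TransientPubSub:
    """Events go only to subscribers connected at publish time."""

    def __init__(self):
        self.subscribers = []

    def publish(self, event):
        for deliver in self.subscribers:
            deliver(event)


class DurableLog:
    """Events are appended and can be re-read from any offset."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # offset of the appended event

    def read_from(self, offset):
        return self.events[offset:]


pubsub, log = TransientPubSub(), DurableLog()

# The event is published while the consumer is down.
pubsub.publish("media-ready:m1")
log.append("media-ready:m1")

# The consumer comes back and subscribes or replays.
received_via_pubsub = []
pubsub.subscribers.append(received_via_pubsub.append)
received_via_log = log.read_from(0)
```

The late consumer sees nothing from the pub/sub channel but recovers the event from the log, which is why critical inter-service events belong on the durable path.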

warning

Several components are weakly connected or appear orphaned

The management server, service registry, Redis cache, and ETCD-based WebSocket manager are present, but their operational role in the main request path is not clearly connected. For example, the management server is only linked to the load balancer and database, and Redis cache has only one inbound edge. Tighten the diagram by showing exactly what each component does in routing, presence, discovery, or metadata lookup, or remove components that are not part of the end-to-end flow.

info

Media upload path bypasses the main API layer

The user is shown sending directly to a compressor and then to blob storage, which makes the control flow harder to reason about and may complicate auth and metadata creation. A cleaner HLD is to have the client obtain an upload token or presigned URL from the media service, upload directly to object storage, and then let the media service finalize metadata after processing.
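The upload-token idea can be sketched with an HMAC-signed token using only the standard library. This is a simplified stand-in for a presigned URL, not the mechanism of any particular object store; the secret, the token layout, and the function names are all assumptions.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical; a real service would load this securely


def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_upload_token(user_id, media_id, expires_at):
    """Media service: authorize one upload until expires_at (epoch seconds)."""
    payload = f"{user_id}:{media_id}:{expires_at}"
    return f"{payload}:{_sign(payload)}"


def verify_upload_token(token, now):
    """Storage edge: accept the upload only if the token is authentic and unexpired."""
    payload, _, signature = token.rpartition(":")
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return False
    expires_at = int(payload.rsplit(":", 1)[1])
    return now < expires_at


token = issue_upload_token("u1", "media-123", expires_at=1_000)
accepted = verify_upload_token(token, now=999)
expired = verify_upload_token(token, now=1_000)
tampered = verify_upload_token(token.replace("u1", "u2", 1), now=999)
```

Auth and metadata creation happen when the token is issued, so the large upload itself never passes through the API servers.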
