DrawLint.ai

WhatsApp / Messaging — system design by AgileViper46

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

Failure handling:

1. What if the WebSocket connection is broken? Ideally the client initiates a reconnection and gets connected to the same WS server. If the WS server itself crashes, the connection is also broken and the client is responsible for opening a new connection; in that case it gets connected to a new server, and the new mapping is recorded in Redis. There is a small window in which the WS server crashes while processing an event; in that case the client retries.

2. How do we ensure that messages are delivered? The incoming message is first stored durably in the incoming queue. Only once that write is acknowledged is the single tick shown on the client side. This ensures durable delivery. For notifying the recipient of a message there are two paths: a live notification path if the user is online, otherwise a push notification. When the user comes back online, the client fetches the messages via the sync service directly from the DB.

3. How do we handle the fan-out for hot groups? Messages in Kafka are keyed by channelId to maintain ordering in the system. To tackle the hot-partition issue, a fanout worker can fan messages out to lightweight Redis streams, and the notification service can pick them up from there. This can be an internal component of the notification system.

Hire Signal: Lean Hire

This is a credible senior-level design with strong architectural instincts, a sensible real-time messaging backbone, and concrete thinking about durability and scale. The main gaps are not in the overall shape of the system but in the missing precision around correctness contracts, APIs, and failure/scaling edge cases that become critical at the stated scale.

✅ Good

Clear prioritization of availability over consistency

The NFRs explicitly state that availability matters more than consistency and that eventual consistency is acceptable for message delivery. That is an appropriate quality trade-off for a large-scale chat system, because temporary lag in delivery/read state is usually preferable to rejecting sends during partial failures.

✅ Good

Durability tied to client acknowledgment

The explanation connects durability to user-visible behavior: the client only shows the sent tick after the message is durably stored in the incoming queue. That is a concrete NFR-driven contract and shows the candidate is thinking about what 'acknowledged' means under failure.

✅ Good

Scale assumption is explicitly reflected in the NFRs

The candidate does not leave scalability abstract; they restate the 1B DAU and roughly 100 messages/day assumption in the NFRs. That is important because it anchors the design discussion to the stated interview scale rather than generic messaging-system claims.

warning

Consistency model is only partially specified

You say eventual consistency is fine for message delivery, but what happens when sent ticks, read receipts, and message ordering disagree across devices? Those user-facing semantics are part of the functional requirements, and the design should explicitly say which operations are eventually consistent versus which need stronger guarantees per conversation or per message ID. You could improve this by defining the consistency contract separately for message persistence, per-channel ordering, sent acknowledgment, and read-receipt propagation.

warning

Latency target is not broken down by path

The p95 < 100ms target is good to state, but what happens when the message goes through durable enqueue, fanout, websocket delivery, or media handling? Without separating the target for text send/ack, online delivery, offline sync, and media upload/download, it is hard to tell whether the number is defensible or where budget is being spent. You could improve this by assigning latency budgets to each critical path and clarifying whether 100ms applies to send acknowledgment, recipient delivery, or both.

warning

Durability target lacks failure-scope definition

You state that once a message is sent and acknowledged it should be durable, but what happens if a broker node fails right after ack, or an AZ goes down? 'Durable' is meaningful only if you define the failure model under which the ack remains safe. You could improve this by stating the durability boundary explicitly, for example durable across node failure, rack/AZ failure, or region failure before the client sees the sent tick.

warning

Scalability numbers do not translate into concrete NFR pressure

The assumptions mention 1B DAU and about 100 messages/day, but what happens at peak hour, for hot groups of 200 users, or during reconnect storms after websocket/server failures? The scale number is present, but the NFR section does not convert it into peak throughput, concurrent connections, or fanout pressure, so the targets still float somewhat in isolation. You could improve this by tying the NFRs to peak QPS, concurrent websocket counts, and worst-case group fanout assumptions.
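One way to make that conversion concrete is a few lines of arithmetic. The sketch below starts from the stated 1B DAU and 100 messages/day. The 3x peak-to-average ratio is an illustrative assumption, not a figure from the design, and the fan-out bound simply follows from the 200-member group limit.

```python
SECONDS_PER_DAY = 24 * 60 * 60

DAU = 1_000_000_000            # from the stated assumptions
MSGS_PER_USER_PER_DAY = 100    # from the stated assumptions
PEAK_MULTIPLIER = 3.0          # ASSUMPTION: illustrative peak-to-average ratio
MAX_GROUP_SIZE = 200           # group limit mentioned in the review


def average_send_rate(dau, msgs_per_user_per_day):
    """Average messages sent per second across the whole day."""
    return dau * msgs_per_user_per_day / SECONDS_PER_DAY


avg_send_qps = average_send_rate(DAU, MSGS_PER_USER_PER_DAY)
peak_send_qps = avg_send_qps * PEAK_MULTIPLIER
# One send into a full group produces a delivery to every other member.
worst_case_fanout = MAX_GROUP_SIZE - 1

print(f"average sends/sec: {avg_send_qps:,.0f}")
print(f"peak sends/sec (assumed {PEAK_MULTIPLIER}x): {peak_send_qps:,.0f}")
print(f"worst-case deliveries per group send: {worst_case_fanout}")
```

Stating the NFRs against numbers like these (rather than DAU alone) makes it possible to check each target against a specific load.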

info

Media optimization is stated but not measurable

Video/photo optimized distribution based on screen/network is a useful quality goal, but as written it is not an evaluable NFR. You could improve this by turning it into measurable objectives such as startup latency, adaptive bitrate behavior, thumbnail load time, or bandwidth reduction targets under poor networks.

✅ Good

Core messaging nouns are mostly identified

The design names the main domain concepts for the happy path: User, Message, Channel, and Group. That is enough to express direct messaging, group messaging, and message ownership at a high level.

✅ Good

Message is linked to a conversation container

Using channelId on Message is a reasonable way to associate each message with its conversation, which gives a clean parent-child relationship for both 1:1 and group flows.

warning

Channel vs Group relationship is unclear

Have you considered whether Group is a special kind of Channel or a completely separate entity? Right now both Channel and Group have participants, which creates overlap and ambiguity. What happens when a group message is sent—does Message.channelId point to Channel, Group, or both? You could improve this by making Group either metadata on top of Channel or by clearly separating direct-conversation and group-conversation entities.

warning

Read receipt state is not modeled as an entity or relationship

Have you considered what happens when you need sent tick and read receipts for both 1:1 and groups? A Message alone is not enough to represent per-recipient delivery/read state, especially for groups up to 200 members. Without a MessageReceipt, MessageStatus, or equivalent relationship between Message and User, the system has no clear domain model for who has received or read a message.
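A minimal sketch of what such a relationship could look like is shown below. The names `MessageReceipt`, `ReceiptState`, and `sender_tick` are hypothetical, and the rule that the sender's tick reflects the least-advanced recipient is an assumption about the desired group semantics, not something the design states.

```python
from dataclasses import dataclass
from enum import IntEnum


class ReceiptState(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3


@dataclass
class MessageReceipt:
    """One row per (message, recipient): the Message-to-User relationship."""
    message_id: str
    user_id: str
    state: ReceiptState = ReceiptState.SENT

    def advance(self, new_state):
        # States only move forward, so late or duplicate events are harmless.
        if new_state > self.state:
            self.state = new_state


def sender_tick(receipts):
    """ASSUMPTION: the sender sees the state reached by *all* recipients."""
    return min((r.state for r in receipts), default=ReceiptState.SENT)


receipts = [MessageReceipt("m1", "alice"), MessageReceipt("m1", "bob")]
receipts[0].advance(ReceiptState.READ)
print(sender_tick(receipts).name)  # bob has not received it yet
```

With up to 200 members, this is at most 200 small rows per group message, which also gives a natural place to answer "who has read this?".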

warning

Unread retention requirement is not reflected in the entities

Have you considered how the 30-day retention for unread messages is represented in the data model? The current entities do not capture unread state per user, so it is unclear how the system would know which messages must be retained for 30 days versus which can be cleaned up earlier. A per-user per-message or per-user per-channel cursor/status model would make this relationship explicit.
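As an illustration of the cursor option, the sketch below keeps one read cursor per user per channel and derives the retention decision from it. It assumes messages carry a per-channel sequence number, and it reads the requirement as "keep a message while any member still has it unread, up to 30 days"; both are assumptions, and `ReadCursor` and `must_retain` are hypothetical names.

```python
from dataclasses import dataclass

UNREAD_RETENTION_DAYS = 30  # from the stated retention requirement


@dataclass
class ReadCursor:
    """Highest per-channel sequence number this user has read."""
    user_id: str
    channel_id: str
    last_read_seq: int = 0


def is_unread_for(cursor, message_seq):
    return message_seq > cursor.last_read_seq


def must_retain(message_seq, age_days, cursors):
    """Retain while the window is open and at least one member has it unread."""
    if age_days >= UNREAD_RETENTION_DAYS:
        return False
    return any(is_unread_for(c, message_seq) for c in cursors)


cursors = [ReadCursor("alice", "c1", 10), ReadCursor("bob", "c1", 5)]
print(must_retain(7, age_days=1, cursors=cursors))   # bob has not read seq 7
print(must_retain(4, age_days=1, cursors=cursors))   # everyone is past seq 4
```

A cursor costs one row per membership rather than one per message per member, which matters at this message volume.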

info

Participant membership is only implied, not explicitly related

You could improve this by making membership a first-class relationship rather than just a participants list on Channel/Group. For senior-level reasoning, it helps to be explicit about how User connects to Channel or Group, since membership drives delivery, read receipts, and group administration.

info

Media attachment modeling could be clearer

You could improve this by clarifying whether blobId represents an attachment entity or just a field on Message. Since messages can contain text, image, or video, an explicit attachment relationship would make the content model cleaner, especially if a message can evolve beyond exactly one blob or one text field.

✅ Good

Covers the main capacity dimensions

The candidate does more than just quote DAU: they translate 1B DAU into messages/day, average and peak ingress, fanout-adjusted egress, concurrent connections, storage/day, network throughput, and queue partition count. That end-to-end sizing chain is the right methodology for a system at this scale.

✅ Good

Peak and fanout are explicitly modeled

Using a peak multiplier on ingress and a separate fanout multiplier for delivery shows awareness that chat capacity is driven by burstiness and recipient expansion, not just average writes. That is an important scaling consideration for messaging systems.

✅ Good

Hot partition risk is acknowledged in the explanation

The explanation recognizes that ordering by channelId can create hot partitions for large groups and proposes an additional fanout stage to spread downstream load. That shows the candidate is thinking about how traffic shape affects throughput, not just raw totals.

warning

Storage sizing stops at raw daily ingest

Have you considered what happens when unread messages accumulate across the full 30-day retention window? The calculation gives 100TB/day and 300TB/day with replication, but without rolling that into retained storage, indexes, and media storage separation, it is hard to tell whether the persistence layer is sized for steady-state rather than just one day of traffic.
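Rolling the daily figures forward is simple arithmetic. The sketch below takes the 100TB/day and 300TB/day numbers from the calculation and multiplies them across the 30-day window. The `unread_fraction` parameter is a hypothetical knob (the design gives no value for it), so the default of 1.0 yields an upper bound, and indexes and media are deliberately left out.

```python
RAW_TB_PER_DAY = 100          # from the candidate's calculation
REPLICATED_TB_PER_DAY = 300   # from the candidate's calculation
RETENTION_DAYS = 30           # unread-message retention window
TB_PER_PB = 1000


def retained_tb(tb_per_day, retention_days, unread_fraction=1.0):
    """Steady-state retained volume; unread_fraction=1.0 is the worst case."""
    return tb_per_day * retention_days * unread_fraction


raw_pb = retained_tb(RAW_TB_PER_DAY, RETENTION_DAYS) / TB_PER_PB
replicated_pb = retained_tb(REPLICATED_TB_PER_DAY, RETENTION_DAYS) / TB_PER_PB

print(f"worst-case retained, raw: {raw_pb} PB")
print(f"worst-case retained, replicated: {replicated_pb} PB")
```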

warning

Media traffic is not reflected in the capacity numbers

What happens when a meaningful fraction of messages contain images or video, which are explicitly in scope? The model assumes an average message of 1KB, which is fine for text metadata, but it does not show separate capacity treatment for media object storage, upload/download bandwidth, or CDN/offload. At this scale, media usually dominates storage and network even if message metadata does not.

warning

Gateway sizing lacks bandwidth and state assumptions

Have you considered what happens if 100K WebSocket connections per server is achievable for idle sockets but not for active chat traffic with heartbeats, read receipts, and fanout bursts? The 1000-gateway estimate may be directionally fine, but the reasoning should tie concurrent connections to per-node CPU, memory, and network limits so the server count is justified by load rather than just socket count.

info

Queue partition estimate needs a clearer throughput basis

You could improve this by explaining what '10K per partition' means in terms of producer throughput, consumer throughput, replication overhead, and whether it applies to ingress only or fanout traffic as well. The partition count is in the right spirit, but the justification would be stronger if it connected queue throughput to the actual read/write path being sized.
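To show how much the answer depends on that interpretation, the sketch below treats "10K per partition" as 10,000 messages per second per partition (an assumption, since the design does not say) and sizes the partition count for ingress only versus fan-out traffic. The fan-out multiplier of 5 is purely illustrative.

```python
import math

AVG_INGRESS_MSGS_PER_SEC = 1_157_407   # 1B DAU * 100 msgs/day / 86,400 s
PER_PARTITION_MSGS_PER_SEC = 10_000    # ASSUMPTION: reading of "10K per partition"
FANOUT_MULTIPLIER = 5                  # ASSUMPTION: illustrative only


def partitions_needed(msgs_per_sec, per_partition=PER_PARTITION_MSGS_PER_SEC):
    """Minimum partitions so no partition exceeds its assumed throughput."""
    return math.ceil(msgs_per_sec / per_partition)


ingress_partitions = partitions_needed(AVG_INGRESS_MSGS_PER_SEC)
fanout_partitions = partitions_needed(AVG_INGRESS_MSGS_PER_SEC * FANOUT_MULTIPLIER)

print(f"ingress only: {ingress_partitions} partitions")
print(f"with assumed fan-out: {fanout_partitions} partitions")
```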

info

Replication and cross-service amplification could be carried further

You could improve this by extending the same methodology used for storage replication to network and backend traffic. For example, Kafka replication, DB replicas, retries, read-receipt writes, and sync reads after reconnect all add meaningful load at this scale, and calling out those multipliers would make the infrastructure sizing more convincing.

✅ Good

Protocol choice matches real-time messaging

Using a WebSocket-first API is appropriate for chat because it supports low-latency bidirectional delivery for new messages, receipts, and presence-style events without forcing the client into constant polling.

✅ Good

Core real-time event flow is sketched end-to-end

The design covers the main interactive path through named client and server events such as send_message, read_receipt, new_message, message_delivered, and message_read, which is enough to express the basic messaging and receipt flow.

✅ Good

Reconnect and retry behavior is at least considered

The explanation explicitly discusses broken WebSocket connections, reconnecting to a new server, and client retries after server failure. That shows awareness that the API contract must survive disconnects rather than assuming a perfect long-lived socket.

warning

Missing API coverage for conversation lifecycle and offline sync

How does a client actually use this system for the full functional flow beyond the live socket? The requirements include 1:1 and group messaging, but the routes do not show how to create a group, list a user's conversations, fetch message history, or sync unread messages after reconnect. The explanation says the user 'fetches the messages via sync services directly from the DB', but there is no client-facing API for that sync path.

warning

WebSocket message contracts are too vague for correctness at scale

What exactly is inside send_message, acknowledgements, or read_receipt? Without a clear event schema including message_id/client_generated_id, conversation_id, sender_id, media metadata, timestamps, and receipt scope, it is hard to reason about deduplication, ordering, retries, or how group read receipts are represented for up to 200 members.
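For illustration, a `send_message` payload covering those fields might look like the sketch below. The field names (`client_msg_id`, `conversation_id`, `client_ts_ms`, and so on) are hypothetical and not taken from the design; the point is only that each concern the reviewer lists has an explicit home in the schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class MediaRef:
    """Reference to already-uploaded media; binaries never travel in the event."""
    media_id: str
    mime_type: str
    size_bytes: int


@dataclass
class SendMessage:
    conversation_id: str
    sender_id: str
    text: Optional[str] = None
    media: Optional[MediaRef] = None
    # Client-generated ID: lets the server dedupe retried sends.
    client_msg_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    client_ts_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    type: str = "send_message"

    def validate(self):
        if not self.text and self.media is None:
            raise ValueError("message needs text or media")

    def to_json(self):
        self.validate()
        return json.dumps(asdict(self))


event = SendMessage(conversation_id="c1", sender_id="u1", text="hello")
print(event.to_json())
```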

warning

Ack semantics are ambiguous

There are multiple server events listed—message_received, message_delivered, message_read—and the explanation also says the client gets a tick only after durable queueing. What happens when the client retries after a disconnect and the server had already persisted the message? Without a clear distinction between accepted/persisted, delivered-to-recipient, and read-by-recipient, clients can show the wrong state or create duplicates.

warning

No explicit idempotency or retry contract

Have you considered what happens if the client sends send_message, the socket drops before the acknowledgement arrives, and the client retries? The explanation says the client will retry, but the API does not define an idempotency key or client message ID that lets the server safely dedupe retried sends.
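A minimal sketch of such a contract is shown below, assuming the client attaches a client-generated message ID to every send. The `Ingress` class and its in-memory dictionary are stand-ins for the real ingress service and a shared dedupe store; a production version would also need an expiry policy for old keys.

```python
import itertools


class Ingress:
    """In-memory stand-in for the ingress path in front of the durable queue."""

    def __init__(self):
        self._acked = {}                 # (sender_id, client_msg_id) -> server_msg_id
        self._next_id = itertools.count(1)
        self.queue = []                  # stand-in for the incoming queue

    def send_message(self, sender_id, client_msg_id, body):
        key = (sender_id, client_msg_id)
        if key in self._acked:
            # Retried send: replay the original ack, do not enqueue again.
            return self._acked[key]
        server_msg_id = next(self._next_id)
        self.queue.append((server_msg_id, sender_id, body))
        self._acked[key] = server_msg_id
        return server_msg_id


ingress = Ingress()
first = ingress.send_message("u1", "abc-123", "hello")
retry = ingress.send_message("u1", "abc-123", "hello")  # socket dropped, client retried
print(first == retry, len(ingress.queue))
```

Because the retry returns the same server ID, the client can safely show a single sent tick regardless of how many times it resent.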

warning

Error handling is not defined for client-visible failures

What does the client see when a send fails because the payload is too large, media type is unsupported, the user is not a member of the group, or the conversation does not exist? The design names happy-path events but does not define error event types, error codes, or which failures are retryable versus terminal.
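One possible shape for this is an error event that echoes the client's message ID and carries a code the client can branch on. In the sketch below the first four codes mirror the failure cases listed above; `TEMPORARILY_UNAVAILABLE` is a hypothetical addition to show a retryable case, and all names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCode(Enum):
    PAYLOAD_TOO_LARGE = "payload_too_large"
    UNSUPPORTED_MEDIA_TYPE = "unsupported_media_type"
    NOT_A_MEMBER = "not_a_member"
    CONVERSATION_NOT_FOUND = "conversation_not_found"
    TEMPORARILY_UNAVAILABLE = "temporarily_unavailable"  # ASSUMPTION: example retryable code


# Only transient failures are worth retrying; the rest need user action.
RETRYABLE = frozenset({ErrorCode.TEMPORARILY_UNAVAILABLE})


@dataclass(frozen=True)
class ErrorEvent:
    client_msg_id: str   # ties the error back to the send that caused it
    code: ErrorCode
    detail: str = ""

    @property
    def retryable(self):
        return self.code in RETRYABLE


err = ErrorEvent("abc-123", ErrorCode.NOT_A_MEMBER, "user is not in this group")
print(err.code.value, "retryable" if err.retryable else "terminal")
```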

warning

Media upload/download API is missing

Messages can contain image or video, but the API only shows send_message over WebSocket. Are clients expected to stream large binaries through the socket, or upload media separately and send a reference? Without an explicit media upload/download contract, the multimedia requirement is not really usable through the API.

info

Define connection lifecycle events more explicitly

You could improve this by specifying handshake/authentication, reconnect token or session resume behavior, heartbeat/ping-pong, and how the client asks for missed events after reconnect. That would make the WebSocket protocol much more robust under mobile network churn.
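The "ask for missed events" part can be sketched as a resume call keyed on the last sequence number the client saw. The example below assumes the server keeps a per-user ordered event log; `SessionLog` and `resume` are hypothetical names, and the in-memory list stands in for whatever store backs the sync service.

```python
class SessionLog:
    """Per-user ordered event log (in-memory stand-in for the sync store)."""

    def __init__(self):
        self._events = []  # list of (seq, payload), seq strictly increasing

    def append(self, payload):
        seq = len(self._events) + 1
        self._events.append((seq, payload))
        return seq

    def resume(self, last_seen_seq):
        """Events the client missed while disconnected, in order."""
        return [(seq, p) for seq, p in self._events if seq > last_seen_seq]


log = SessionLog()
for text in ["new_message: hi", "message_read: m1", "new_message: are you there?"]:
    log.append(text)

# The client reconnects and reports it last processed seq 1.
for seq, payload in log.resume(last_seen_seq=1):
    print(seq, payload)
```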

✅ Good

End-to-end messaging flow is mostly coherent

The design traces a plausible path from WebSocket ingress to durable queueing, message persistence, fanout, online delivery, offline push, and later sync. For the stated requirements, that shows the candidate is thinking beyond just send-message and includes delivery/read receipt and offline recovery paths.

✅ Good

Ordering-aware partitioning choices

Partitioning incoming traffic by channelId and outgoing traffic by recipientId is a thoughtful trade-off. It preserves per-conversation ordering on write while also making per-user delivery streams easier to reason about on the fanout side.

⭐ Excellent

Durable ack boundary is explicitly defined

The explanation makes a concrete architectural choice that the client only gets the sent tick after the message is durably written to the incoming queue. That is a strong design decision because it ties user-visible acknowledgement to a real durability boundary instead of a best-effort in-memory accept.

✅ Good

Separate sync path for offline users

Using a dedicated sync service to fetch pending messages when a user reconnects is a sensible way to avoid overloading the live delivery path and gives the system a recovery mechanism when push notifications or live fanout are missed.

critical

Hot group partitions will become a bottleneck

What happens when a 200-member group becomes extremely active? Because ingress is partitioned by channelId, all writes for that group land on one Kafka partition and one ordered processing lane. At the stated scale, a few hot groups can create partition hotspots, increased lag, and delayed delivery for that conversation. You mention a possible Redis-stream fanout mitigation in the explanation, but it is only for downstream fanout; the write path for that channel is still serialized. You should explain how you would isolate or split hot channels while preserving per-channel ordering guarantees.

warning

Presence store is on the critical delivery path

Have you considered what happens if Redis presence is unavailable or stale? Notification routing depends on looking up which WebSocket server owns a user connection. If Redis is down or contains stale mappings after a server crash, online users may be treated as offline or notifications may be sent to the wrong server. Since this sits directly in the delivery path, you should describe replication/failover and how stale connection ownership is cleaned up.
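One common way to bound staleness is to give each presence entry a time-to-live that the owning WebSocket server refreshes on heartbeat, so mappings left behind by a crashed server expire on their own. The sketch below simulates this in memory; in Redis the equivalent would be a key written with an expiry (for example `SET` with `EX`). The 30-second TTL and all names are assumptions.

```python
import time


class PresenceStore:
    """In-memory stand-in for a presence store with per-entry expiry."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds          # ASSUMPTION: illustrative TTL
        self._clock = clock
        self._entries = {}               # user_id -> (server_id, expires_at)

    def heartbeat(self, user_id, server_id):
        """Called by the owning WS server; refreshes the mapping's expiry."""
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]   # stale: the owner stopped heartbeating
            return None
        return server_id


def route(store, user_id):
    """Live path if a fresh mapping exists, otherwise fall back to push."""
    server_id = store.lookup(user_id)
    return ("live", server_id) if server_id else ("push", None)


fake_now = [0.0]
store = PresenceStore(ttl_seconds=30.0, clock=lambda: fake_now[0])
store.heartbeat("alice", "ws-1")
print(route(store, "alice"))
fake_now[0] = 31.0                       # ws-1 crashed and stopped heartbeating
print(route(store, "alice"))
```

The same fallback also covers the case where the presence store itself is unreachable: treating a failed lookup as "offline" degrades to push plus sync rather than dropping the notification.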

warning

WebSocket server failure handling relies heavily on client retry

What happens when a WebSocket server dies after accepting a client event but before the event is durably enqueued? The explanation says the client retries, but without a clear idempotency strategy this can produce duplicate messages or ambiguous sent state. At this scale, server crashes during in-flight sends are normal, so the design should make the retry boundary explicit with client-generated message IDs or dedupe on ingress.

warning

Metadata store may become a scaling choke point

Have you considered the load on Postgres for channel metadata, group membership, and receipt-related metadata at 1B DAU? The design uses Postgres in multiple paths, including message server metadata writes and notification/receipt processing. Even if message bodies are in Cassandra, centralizing high-cardinality metadata and fanout-related state in a relational store can become the first scaling bottleneck unless you clearly scope what lives there and how it is partitioned or cached.

warning

Outbox flow is not fully clear under failure

What happens if the outbox worker crashes after publishing to Kafka but before marking work complete, or if Cassandra/Postgres and Kafka become temporarily inconsistent? You mention at-least-once delivery, which is good, but that means duplicates are expected. The design should explicitly state where deduplication happens for messages and receipts so that retries do not create duplicate notifications or duplicate read/delivery events.

info

Some components appear redundant or weakly integrated

You could improve the HLD by tightening a few ambiguous boxes. For example, both 'Cassandra (message store)' and 'read replicated cassandra with sharding on channelId' are shown, and there are duplicate notification service nodes. That makes it harder to reason about the actual production topology and failure domains. Consolidating these into a clearer primary/replica or logical-service view would make the architecture easier to validate.

info

Media pipeline is disconnected from message send semantics

You could improve this by making the interaction between media upload and message creation explicit. Right now users upload to compressor/blob/encoding/CDN, and media service updates Cassandra, but the HLD does not clearly show when a chat message referencing that media becomes sendable or what happens if encoding is delayed. A clearer two-phase flow would help: upload returns a blob/media ID, then the message references that ID, and delivery semantics for 'media not ready yet' are defined.
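The two-phase flow could be sketched as follows. The policy shown (a message may be sent once the upload has produced a media ID, but is only delivered once encoding marks the media ready) is one possible choice and an assumption here, as are the `MediaService` and `MediaStatus` names.

```python
import itertools
from enum import Enum


class MediaStatus(Enum):
    UPLOADED = "uploaded"   # bytes stored, encoding still pending
    READY = "ready"         # encoded variants available for delivery


class MediaService:
    """In-memory stand-in for the upload and encoding pipeline."""

    def __init__(self):
        self._status = {}
        self._ids = itertools.count(1)

    def upload(self, data):
        """Phase 1: store the blob and hand back an ID the message can reference."""
        media_id = f"media-{next(self._ids)}"
        self._status[media_id] = MediaStatus.UPLOADED
        return media_id

    def mark_ready(self, media_id):
        self._status[media_id] = MediaStatus.READY

    def status(self, media_id):
        return self._status.get(media_id)


def can_send(media, media_id):
    """Phase 2 precondition: the referenced media must exist."""
    return media.status(media_id) is not None


def can_deliver(media, media_id):
    """ASSUMPTION: recipients only receive the message once encoding is done."""
    return media.status(media_id) is MediaStatus.READY


media = MediaService()
media_id = media.upload(b"...video bytes...")
print(can_send(media, media_id), can_deliver(media, media_id))
media.mark_ready(media_id)
print(can_send(media, media_id), can_deliver(media, media_id))
```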
