Notification System — system design by AgileViper46

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

For the design of the notification system, we are going to have the following components. We have an event source, which generates the events for notifications. All events go from the event source to the API gateway plus load balancer layer, which communicates with our notification service. The notification service stores these notifications in PostgreSQL, which has a user table, channel table, campaign table, notification table, and so on. When a notification is added to PostgreSQL, the notification service also adds a corresponding row to the notification outbox table. Following the outbox pattern, the CDC worker durably adds this notification to an incoming Kafka queue.

We will have fan-out workers that take events from this incoming queue and put them into the SMS queue, the push notification queue, or the email queue as appropriate. The fan-out workers are also connected to a Redis cache to get the campaign-to-user mapping: for a notification that arrives for a particular campaign, they figure out all the users of that campaign and fan out to the proper queues. If there is a hot notification, where a campaign contains more than, let's say, 30,000 users, we will not fan out in that case; we will send a single event and mark the type inside the event as a hot event type. Hot event types are handled by our outgoing servers in a different manner.

Whenever an outgoing server pops an event from the SMS queue, APN queue, or email queue, it first adds an entry to the notification send table recording that the event was received. The notification send table tracks a state per notification, such as received, sent, and acknowledged. The server then communicates with the proper channel to send the notification. This allows us to deduplicate events and make sure we send only one notification per user. If an outgoing server crashes midway, the next server can take up that message again, see from the database the state where the previous one left off, and resume from there.

The outgoing servers are also connected to the rate limiting system, which is based on user preferences, and they make sure to adhere to the rate limits set by the user. For transactional notifications, we bypass the rate limiting system and still send the notification.

For a hot notification, the outgoing servers receive only a single event, and it is the responsibility of the outgoing server to fan out at processing time. Thus we are not fanning out millions of events for a single notification; instead we send multiple notifications for a single event. This hybrid strategy makes sure that we are not bloating our queues.

If a notification cannot be delivered successfully, it is moved to a DLQ, the dead letter queue. A dead letter queue processor tries to send the notification one more time, and if it still fails, we store the notification as a failure.

Apart from this, we also have a configuration service, which handles all configuration changes for the user and stores that configuration in an etcd cluster. The etcd cluster is connected to our rate limiting system via watch notifications. This makes sure that our configurations are updated in real time and the rate limiting system can use the most recent configuration.

One thing to note here is that the notification service, the configuration service, and every other server in the system are scaled horizontally and can process in parallel. The Redis cache is a Redis cluster with primary-replica standby, so that whenever a primary fails, the replica can take over; it is also sharded by key space. PostgreSQL is mainly the write path for notifications. It handles mostly writes, but it is also read for the campaign-to-user mapping and the hot notification data. We would have sufficient read replicas, and reads would go via the read replicas in a primary-secondary standby model. Currently the scale is not so large that a single PostgreSQL instance cannot handle the writes, but if required we can shard PostgreSQL by notification ID, user ID, or channel ID, and then run multiple PostgreSQL shards on a managed cluster.

Now let's consider the failure and resiliency scenarios. What happens when one of the service providers is down? The outgoing servers try to send a notification but find that, let's say, the SMS service is down. In that case they do not move the notification directly to the DLQ; they retry with exponential backoff before sending it to the DLQ. The DLQ processor also retries the notification after a certain time, say 10 minutes, half an hour, or one hour, based on configuration, and if the request still fails we mark it as a failure. This protects us against temporary failures of the service or the underlying server going down, instead of directly marking retriable notifications as failures.

The second thing to note is that the CDC workers are horizontally scaled and durably add the notification to the incoming queue. They remove a notification from the outbox table only after they have successfully committed that message to Kafka. This makes sure that the notification is always sent to the Kafka queue.

If the configuration service goes down, it only affects the rate limiting part and the configuration stored for the user. The user might receive a few new notifications for something they have just unsubscribed from, but it would not bring down the notification system.

For hot notification processing, the diagram shows one message consumed by one outgoing worker, but it won't work that way. The fan-out worker breaks the hot notification into multiple notification packets, or events, of a bucket size, and based on the buckets, multiple outgoing servers process that hot notification at the same time. It would be catastrophic, or CPU bound, for a single CPU to send all the notifications for a hot notification, so it is bucketed into different events and multiple parallel systems work on it.

Some considerations on this system: if the user opts out of a particular campaign, the single source of truth is the fan-out workers. If the notification was already sent on by the fan-out workers, that event will be delivered.
The rate limiting system can also be used for secondary verification to make sure we are not sending a suppressed notification. The decision is made by the fan-out worker during notification outbound: it checks whether the user is subscribed, and if so, it sends. For this system, sending one or two more notifications even after an unsubscribe is not a big issue.

Now, the deduplication path under retries. There might be a case where the outgoing worker marks a notification as received in the DB and calls the service provider, the service provider sends the notification, and the worker crashes before updating the notification as sent. To solve this race, the outgoing worker also creates an idempotency key and sends that same idempotency key to the service provider. It is then the service provider's responsibility: if it sees the same idempotency key from an outgoing worker that is retrying the message, it simply replies that the message was already sent, or returns its status. Thus when a new outgoing worker takes the place of a crashed one, it does not duplicate the message.

In this system the outgoing workers are shown as a cluster of four, but they are horizontally scaled based on requirements, and we can use CPU or memory metrics to auto-scale the outgoing workers.

We can also create a separate state management layer and apply a CDC pattern to this state as well: the state management layer first takes the events and writes them to the database, and then the outbound workers consume the events via CDC. That way we decouple state management from the outbound system.

Hire Signal: Lean Hire

The candidate demonstrates strong architectural instincts for durability, async processing, failure handling, and channel isolation, which are important senior-level signals. However, there are meaningful gaps in core requirement handling details—especially first-class modeling and enforcement of per-channel preferences, realistic deduplication guarantees, and API/status clarity—so this is not a clean hire.

⭐ Excellent

Availability is explicitly prioritized over consistency

The candidate clearly states the trade-off and carries it through the explanation: configuration service failure should not bring down notification delivery, retries are used for transient provider failures, and asynchronous queues/outbox decouple producers from downstream senders. That shows they understand which inconsistencies are acceptable in order to keep the system delivering.

✅ Good

Concrete latency target is tied to notification priority

The design acknowledges a specific delivery objective for high-priority notifications and supports it with asynchronous processing, channel-specific queues, and parallel outbound workers. Even though the exact budget breakdown is not quantified, the target is at least concrete and influences the architecture.

✅ Good

Consistency trade-off is discussed with real consequences

The explanation does not treat consistency as abstract. It explicitly calls out cases like recent unsubscribe changes not taking effect immediately and accepts a small amount of stale preference data in exchange for system availability. That is the kind of justification expected at senior level.

warning

Targets are not fully defensible against the stated assumptions

You mention 99.99% availability at normal load and high-priority delivery within 30 seconds, but what happens during the stated 500/sec peak or when a provider is degraded? The design does not translate those NFRs into queue backlog limits, retry budgets, or per-stage latency budgets, so it is hard to tell whether the 30-second target still holds under peak traffic or partial outages.

warning

Deduplication consistency depends on external provider behavior

What happens when a provider does not honor your idempotency key, or acknowledges ambiguously after a timeout? The system could still send duplicates even though 'send only once' is listed as an NFR. For this requirement, you should be explicit about the consistency guarantee you can actually provide per channel—exactly-once internally, but at-least-once or best-effort dedupe externally unless the provider supports idempotency.
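
One way to picture the internal half of that guarantee is a deterministic idempotency key plus a send-state check before each provider call. The sketch below is illustrative only: `SendLog` is an in-memory stand-in for the notification send table, and `idempotency_key`, `deliver`, and `provider_send` are hypothetical names, not part of the candidate's design.

```python
import hashlib
from typing import Callable, Optional


def idempotency_key(notification_id: str, user_id: str, channel: str) -> str:
    """Derive the same key for every attempt at the same delivery."""
    raw = f"{notification_id}:{user_id}:{channel}".encode()
    return hashlib.sha256(raw).hexdigest()


class SendLog:
    """In-memory stand-in (assumption) for the notification send table."""

    def __init__(self) -> None:
        self._state: dict[str, str] = {}

    def mark(self, key: str, state: str) -> None:
        self._state[key] = state

    def state(self, key: str) -> Optional[str]:
        return self._state.get(key)


def deliver(log: SendLog, provider_send: Callable[[str], None],
            notification_id: str, user_id: str, channel: str) -> str:
    key = idempotency_key(notification_id, user_id, channel)
    if log.state(key) == "sent":
        return "skipped"          # internal dedupe: already confirmed as sent
    log.mark(key, "received")     # first attempt, or a resume after a crash
    provider_send(key)            # a provider that ignores the key may still duplicate
    log.mark(key, "sent")
    return "sent"
```

A crash between `provider_send` and the final `mark` leaves the row in `received`, so the next worker resends with the same key. Whether the user sees a duplicate then depends entirely on the provider, which is why the external guarantee is at-least-once.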

info

Rate limiting is stated as an NFR but not quantified

You could improve this by defining what 'prevent spam' means in measurable terms, such as per-user per-channel limits over a time window and whether transactional notifications bypass those limits. Without concrete thresholds, the NFR does not meaningfully drive design or capacity decisions.
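
As a rough illustration of what a measurable limit could look like, here is a minimal per-user, per-channel sliding-window limiter with the transactional bypass from the explanation. The class name, the window length, and the limit values are assumptions chosen for the example, not figures from the design.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_per_window sends per (user, channel) per window."""

    def __init__(self, max_per_window: int, window_seconds: float) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        # (user_id, channel) -> timestamps of sends inside the current window
        self._sent: dict[tuple[str, str], deque] = defaultdict(deque)

    def allow(self, user_id: str, channel: str, now: float,
              transactional: bool = False) -> bool:
        if transactional:
            return True  # transactional notifications bypass the limit
        sent = self._sent[(user_id, channel)]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_per_window:
            return False
        sent.append(now)
        return True
```

With `SlidingWindowLimiter(2, 60.0)`, a third marketing SMS to the same user inside one minute is rejected, while email to that user and any transactional message still go through.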

info

Numbers only partially connect to the assumptions

You reference that a single PostgreSQL instance is enough at current scale and that services scale horizontally, which is directionally fine for 1M/day and 500/sec peak. But the NFR section would be stronger if it explicitly connected those assumptions to expected queue depth, retry volume, and how much headroom is needed to preserve 99.99% availability and 30-second delivery under peak load.

✅ Good

Core nouns for the main flow are mostly identified

The design names the main business objects involved in sending notifications: User, Channel, Notification, and Campaign. These are relevant to the stated flow of targeting users, choosing a delivery medium, and tracking a notification request through the system.

✅ Good

Per-delivery state is modeled separately

In the explanation, the candidate introduces a notification send table with states like received, sent, and acknowledged. That is a useful domain distinction from the higher-level Notification entity because retries and deduplication usually operate at the individual delivery-attempt or per-recipient delivery level, not just at the campaign or template level.

warning

User preferences are not modeled as a first-class entity

Have you considered what happens when a user opts out of SMS but still allows email and push? The requirements explicitly depend on honoring per-channel preferences, but the entity list only has Users and Channels, not a clear Preference or UserChannelPreference relationship. Without modeling that join explicitly, it is hard to reason about where opt-out state lives and how it is enforced consistently.

warning

Recipient-level delivery entity is missing from the core model

What happens when one notification fans out to many users and each user can succeed, fail, or retry independently across channels? A single Notification entity is not enough to represent that happy path. The explanation hints at a notification send table, but that core entity is missing from the declared model. You would want a clear per-user, per-channel delivery record so retries and final status are attached to the right unit of work.

warning

Relationships between entities are left implicit

Have you considered making the cardinalities explicit? For example: is Campaign -> Notifications one-to-many, Campaign -> Users many-to-many, User -> Channel many-to-many through preferences, and Notification -> DeliveryAttempts one-to-many? The explanation references campaign-to-user mapping and per-channel sends, but the relationships are not clearly defined, which makes the data model harder to validate against the required flow.
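
A minimal sketch of how those relationships might be written down follows. The entity and field names here (`UserChannelPreference`, `Delivery`, `campaign_id`, the `FAILED` state) are assumptions for illustration, not the candidate's declared schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"


class DeliveryState(Enum):
    RECEIVED = "received"
    SENT = "sent"
    ACKNOWLEDGED = "acknowledged"
    FAILED = "failed"  # assumed terminal state after the DLQ retry


@dataclass(frozen=True)
class UserChannelPreference:
    """Join entity: User many-to-many Channel, carrying the opt-in flag."""
    user_id: str
    channel: Channel
    opted_in: bool


@dataclass(frozen=True)
class Notification:
    notification_id: str
    content: str
    campaign_id: Optional[str] = None  # None means a standalone notification


@dataclass
class Delivery:
    """One row per notification, recipient, and channel: the unit of retry."""
    notification_id: str
    user_id: str
    channel: Channel
    state: DeliveryState = DeliveryState.RECEIVED
    attempts: int = 0


def allowed_channels(prefs: list[UserChannelPreference],
                     user_id: str) -> set[Channel]:
    """Channels the given user has opted in to."""
    return {p.channel for p in prefs if p.user_id == user_id and p.opted_in}
```

Writing the join entity explicitly makes it clear where opt-out state lives, and the `Delivery` record gives retries and final status a single owner.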

info

Separate campaign targeting from notification content

You could improve this by clarifying whether Campaign is required for every notification or only for bulk sends. The functional requirements are about sending notifications generally, so it would help to show whether a Notification can exist independently of a Campaign, while Campaign simply groups or targets many notifications.

✅ Good

Basic scale numbers are present and internally consistent

The candidate anchors the design on the stated load with 1M notifications/day, 50/sec average, and 500/sec peak, then uses those numbers to justify a queue-based, horizontally scaled worker model. That is the right starting point for capacity reasoning at this scale.

✅ Good

Touches multiple capacity dimensions beyond QPS

The write-up does not stop at request rate; it also estimates storage growth for notification records, preference lookup read rate, and idempotency cache size. That shows an attempt to reason from traffic into database and cache footprint rather than only naming scalable components.

warning

Peak-path math stops too early

Have you considered what happens at 500 notifications/sec once each notification fans out into channel-specific work, retries, and provider calls? The current numbers stay at the top-line notification rate, but worker count, queue throughput, and downstream bandwidth depend on expanded message volume, not just the initial ingest rate.
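
The expansion can be estimated with a few lines of arithmetic. In the sketch below only the 500/sec peak comes from the stated load; the channels per notification, the retry fraction, and the per-worker throughput are placeholder assumptions to be replaced with measured values.

```python
import math


def expanded_rate(ingest_per_sec: float, channels_per_notification: float,
                  retry_fraction: float) -> float:
    """Delivery attempts per second after channel fan-out and retries."""
    first_attempts = ingest_per_sec * channels_per_notification
    return first_attempts * (1 + retry_fraction)


def workers_needed(attempts_per_sec: float,
                   sends_per_worker_per_sec: float) -> int:
    """Outbound workers required to keep up, rounded up."""
    return math.ceil(attempts_per_sec / sends_per_worker_per_sec)


# Assumed inputs: 2 channels per notification, 25% of sends retried once,
# and 50 provider calls per second per worker.
peak_attempts = expanded_rate(500, 2, 0.25)
outbound_workers = workers_needed(peak_attempts, 50)
```

Under these assumptions the 500/sec ingest becomes 1,250 delivery attempts per second and calls for 25 outbound workers, which is the kind of figure the capacity section would ideally state.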

warning

Storage estimate is not tied to the actual data model

Have you considered what happens to the 500 MB/day estimate when you store multiple state transitions per send, retries, DLQ entries, and per-channel delivery records? With notification tables plus send-state history, the annual footprint could grow materially beyond the stated ballpark unless you show what a 'record' includes.
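
To show how quickly the estimate moves, here is a small calculation. The 1M notifications/day is the stated load, and 500 bytes per row is what the 500 MB/day figure implies at that load; the deliveries per notification and rows per delivery are assumptions.

```python
def daily_storage_bytes(notifications_per_day: int,
                        deliveries_per_notification: int,
                        rows_per_delivery: int,
                        bytes_per_row: int) -> int:
    """Bytes written per day once per-delivery state rows are counted."""
    return (notifications_per_day * deliveries_per_notification
            * rows_per_delivery * bytes_per_row)


# Baseline: one 500-byte row per notification.
baseline = daily_storage_bytes(1_000_000, 1, 1, 500)

# Assumed: 2 channel deliveries per notification and 3 state rows each
# (for example received, sent, acknowledged).
with_state_history = daily_storage_bytes(1_000_000, 2, 3, 500)
```

With those assumed multipliers the daily volume is six times the baseline, before replicas, indexes, or DLQ entries are counted.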

warning

Cache sizing lacks retention and memory assumptions

Have you considered what happens to the Redis footprint for 7 million idempotency keys once you include TTL, replication, and per-key overhead? The idea is reasonable, but without even a rough memory estimate it is hard to tell whether this comfortably fits in one shard or needs a larger cluster.
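
A rough memory estimate only needs a handful of inputs. In this sketch the 7 million keys come from the write-up; the key size, value size, per-key overhead, and number of copies are assumptions, and real Redis overhead varies with encoding and version.

```python
def redis_footprint_bytes(keys: int, key_bytes: int, value_bytes: int,
                          overhead_bytes: int, copies: int) -> int:
    """Approximate memory for a keyspace, summed across all copies."""
    return keys * (key_bytes + value_bytes + overhead_bytes) * copies


# Assumed: 64-byte hex key, 16-byte state value, 64 bytes of per-key
# overhead, and 2 copies (one primary plus one replica).
total_bytes = redis_footprint_bytes(7_000_000, 64, 16, 64, 2)
total_gb = total_bytes / 1_000_000_000
```

Under these assumptions the idempotency keys take roughly 2 GB across primary and replica, which would suggest a single shard is enough; different per-key overhead could change that conclusion.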

info

Quantify infrastructure from the load assumptions

You could improve this by carrying the numbers one step further: estimate queue ingress/egress at peak, rough worker throughput per instance, and resulting instance counts for fan-out and outbound senders. Senior-level capacity planning is stronger when it connects user load to concrete infrastructure sizing, even approximately.

info

Include replication and retry amplification in the ballpark

You could improve this by explicitly folding in common multipliers such as database replicas, Kafka replication factor, and retry rate. The current approach is directionally fine, but these multipliers are often what determine whether the chosen storage and queue tiers remain comfortably sized.

✅ Good

Core write path for creating notifications is present

The design does include a concrete notification creation endpoint with the minimum fields needed to trigger delivery, so the primary producer flow is at least represented in the API surface.

✅ Good

Preference management is exposed as user-scoped routes

Using /me/preferences-style endpoints is a reasonable API shape for end-user notification settings because it avoids leaking user IDs into the client contract and keeps preference updates scoped to the authenticated caller.

warning

Notification API does not clearly support all required channels and preference checks

Have you considered how a client actually requests delivery across push, email, and SMS while honoring per-channel opt-out? The POST /notifications body only has a single channelId and messageContent, so it is unclear whether one request can target multiple channels, whether channel-specific payloads are supported, or how the API expresses 'send on allowed channels only'. Without that contract, the core requirement is only partially usable through the API.
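
One possible request shape is sketched below: a per-channel payload map that the service filters against the user's opt-ins. `NotificationRequest`, `payloads`, and `plan_deliveries` are hypothetical names for illustration, not the candidate's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRequest:
    """Hypothetical POST /notifications body: one payload per channel."""
    user_id: str
    payloads: dict[str, str]  # channel name -> channel-specific content


def plan_deliveries(request: NotificationRequest,
                    opted_in_channels: set[str]) -> list[tuple[str, str]]:
    """Keep only the requested channels the user currently allows."""
    return [
        (channel, content)
        for channel, content in request.payloads.items()
        if channel in opted_in_channels
    ]
```

A single request can then target push, SMS, and email at once, and "send on allowed channels only" becomes the default behavior rather than something each client has to implement.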

warning

Retry behavior is implemented internally but not visible in the API contract

What does the client see after creating a notification if delivery later fails and is retried asynchronously? The explanation describes retries and DLQ processing, but the API routes do not expose a notification status/read endpoint or any delivery state resource. Without a way to fetch status, clients cannot tell whether a request is queued, retrying, delivered, or permanently failed.

warning

Resource design is inconsistent and mixes actions with unclear entities

Have you considered tightening the resource model? POST /notifications creates notifications, but campaign management uses GET /campaings/{id} and POST /campaings/{id}, which is an unusual update shape for REST. If POST on an existing resource means update, clients will have to guess whether the operation is create, mutate, or trigger something. A cleaner split like POST /campaigns and PUT/PATCH /campaigns/{id} would make the contract less ambiguous.

warning

Basic CRUD coverage for primary entities is thin

What happens when a client needs to inspect or modify an existing notification or campaign beyond the single GET and POST shown here? For a senior-level API, I would expect at least the basic lifecycle operations for the entities you expose. Right now campaigns have read and a nonstandard write route, preferences have read/update, but notifications only have create with no read/status path.

warning

Error contract and retry guidance are missing

What does the client receive on invalid campaignId, unsupported channelId, opted-out users, duplicate submissions, or provider-side temporary failures? The routes list no status codes, no error shape, and no indication of which failures are retryable by the caller versus handled asynchronously by the system. Without that, clients cannot build reliable retry behavior and may accidentally duplicate sends.
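
An error contract can be as small as a table of codes with a retryable flag. Every code and status below is an assumption made for this sketch; the design itself lists none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiError:
    status: int       # HTTP status code returned to the caller
    code: str         # stable machine-readable error code
    retryable: bool   # True if the caller may safely retry the request


# Hypothetical error catalogue for POST /notifications.
ERRORS: dict[str, ApiError] = {
    e.code: e
    for e in (
        ApiError(422, "invalid_campaign", False),
        ApiError(422, "unsupported_channel", False),
        ApiError(409, "duplicate_submission", False),
        ApiError(503, "temporarily_unavailable", True),
    )
}


def should_client_retry(code: str) -> bool:
    """Tell the caller whether retrying this error is safe."""
    return ERRORS[code].retryable
```

Provider-side failures would not appear here at all if the API accepts the request and retries asynchronously; they would surface through a delivery status resource instead.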

info

List endpoints would need pagination if added or expanded

There are no obvious list endpoints here, so pagination is not a current gap. But if campaign or notification listing is expected later, you could improve the API by using cursor-based pagination from the start rather than offset-based scans.
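
For reference, cursor-based pagination over an ordered ID column can be sketched as follows. The in-memory list stands in for an indexed query such as `WHERE id > cursor ORDER BY id LIMIT n`, and `page_after` is a hypothetical helper name.

```python
from typing import Optional


def page_after(sorted_ids: list[int], cursor: Optional[int],
               limit: int) -> tuple[list[int], Optional[int]]:
    """Return one page of IDs after the cursor, plus the next cursor."""
    remaining = [i for i in sorted_ids if cursor is None or i > cursor]
    page = remaining[:limit]
    # Hand back a cursor only when more rows exist beyond this page.
    next_cursor = page[-1] if len(remaining) > limit else None
    return page, next_cursor
```

Because the cursor is the last ID seen rather than an offset, rows inserted or deleted between calls do not shift or repeat results.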

⭐ Excellent

Outbox plus CDC for durable ingestion

Using Postgres as the write path with an outbox table and CDC into the incoming queue is a strong reliability choice. It gives a clear end-to-end path from API request to async processing and avoids losing notifications between DB commit and queue publish.
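
The core of the pattern fits in a short sketch. This example uses SQLite as a stand-in for Postgres and a polling relay in place of a real CDC worker, both simplifications made for illustration; the table and function names are assumptions.

```python
import sqlite3
from typing import Callable


def init(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE notification (id TEXT PRIMARY KEY, body TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, notification_id TEXT)"
    )


def create_notification(conn: sqlite3.Connection, nid: str, body: str) -> None:
    # One transaction: both rows commit together or neither does.
    with conn:
        conn.execute("INSERT INTO notification VALUES (?, ?)", (nid, body))
        conn.execute("INSERT INTO outbox (notification_id) VALUES (?)", (nid,))


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str], None]) -> int:
    """Publish pending outbox rows in order; delete each only after publish."""
    rows = conn.execute(
        "SELECT seq, notification_id FROM outbox ORDER BY seq"
    ).fetchall()
    for seq, nid in rows:
        publish(nid)  # if this raises, the row stays and is retried later
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)
```

A crash after `publish` but before the delete means the same row is published again on the next pass, so this gives at-least-once delivery into the queue and downstream consumers still need to deduplicate.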

✅ Good

Channel-specific queues isolate downstream failures

Splitting fanout into SMS, push, and email queues is a good architectural decision because one slow or failing provider does not have to block the other channels. That matches the availability-first requirement well.

✅ Good

Candidate considered hot campaign fanout pressure

The explanation shows awareness that very large campaigns can overwhelm queues if every recipient is expanded immediately. The idea of bucketing hot notifications for parallel processing is a thoughtful scalability trade-off for the stated peak load.
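
The bucketing idea can be sketched as below. The 30,000-recipient threshold is the figure from the explanation; the bucket size of 1,000 and the event dictionary shape are assumptions made for the example.

```python
HOT_THRESHOLD = 30_000  # from the explanation: larger campaigns are "hot"


def bucket_recipients(user_ids: list[str],
                      bucket_size: int) -> list[list[str]]:
    """Split recipients into consecutive buckets of at most bucket_size."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [user_ids[i:i + bucket_size]
            for i in range(0, len(user_ids), bucket_size)]


def fan_out(campaign_id: str, user_ids: list[str],
            bucket_size: int = 1_000) -> list[dict]:
    """Emit one event per user for normal campaigns, one per bucket for hot ones."""
    if len(user_ids) <= HOT_THRESHOLD:
        return [{"type": "single", "campaign": campaign_id, "users": [u]}
                for u in user_ids]
    return [{"type": "hot", "campaign": campaign_id, "users": bucket}
            for bucket in bucket_recipients(user_ids, bucket_size)]
```

A campaign with 30,001 recipients produces 31 hot events instead of 30,001 queue messages, and each bucket can be picked up by a different outgoing worker.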

✅ Good

Failure handling is modeled explicitly

The design includes retries, DLQs, and a DLQ processor, so failed deliveries have a concrete path instead of being dropped silently. That makes the retry requirement visible in the architecture.

warning

Preference enforcement is split across multiple places

Have you considered what happens when user preferences change while a notification is moving through the pipeline? The explanation says fanout is the source of truth, but outgoing also consults rate limiting/configuration as a secondary check. With decisions split between fanout, Redis, and rate limiting, different workers can make different send/suppress decisions and you can still deliver after opt-out. A cleaner design would define one authoritative suppression check at send time, with a consistent cache invalidation/update path.

warning

Dedup depends on provider idempotency that may not exist

What happens when an outgoing worker sends to SMTP/SMS/APN, the provider accepts it, and the worker crashes before updating state? The current recovery story relies heavily on passing an idempotency key to the provider, but many email/SMS providers do not guarantee exactly-once semantics on your key. In that case retries can produce duplicate user-visible notifications. You should call out that the system is at-least-once externally and use internal dedup keys plus provider-specific reconciliation where available.

warning

Outgoing worker is doing too many responsibilities

Have you considered what happens under a hot campaign when outgoing workers must do provider calls, state transitions, rate-limit checks, cache reads, and possibly additional fanout bucketing? This makes outgoing the most likely bottleneck and couples throughput to external provider latency. Separating expansion/scheduling from provider delivery workers would make scaling and failure isolation cleaner.

warning

Single database appears on multiple critical paths

What happens when Postgres is slow or unavailable? The design uses it for initial notification writes, outbox storage, notification-send state, failure records, and possibly campaign/user mapping reads. Even at moderate scale this creates a central dependency where one DB incident can stall ingestion and recovery together. Read replicas help reads, but they do not remove the write-path bottleneck or SPOF risk unless failover and role separation are explicit.

info

Some components are logically under-connected in the diagram

You could improve the HLD by making the end-to-end flow more explicit for push delivery and preference lookup. For example, Firebase/APN is drawn but not connected from outgoing, and the second Redis cache is referenced in annotations/explanation more than in the flow. Tightening those arrows would make it clearer that the design fully completes each channel path.

info

Retry path would benefit from clearer backoff ownership

You could improve this by showing where exponential backoff lives before DLQ. Right now retries are described in the explanation, but the diagram mostly shows queue to DLQ to processor. Making the retry scheduler or delayed-queue mechanism explicit would better demonstrate that transient provider failures do not immediately flood DLQ or block workers.
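
A minimal sketch of worker-owned backoff that hands off to the DLQ only after retries are exhausted is shown below. The base delay, the cap, and the use of `ConnectionError` as the transient failure are assumptions; jitter is left out so the delays stay deterministic, although production systems usually add it.

```python
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential delay for a zero-based attempt number, capped."""
    return min(cap, base * 2 ** attempt)


def send_with_retry(send: Callable[[], None], max_attempts: int,
                    sleep: Callable[[float], None],
                    base: float = 1.0, cap: float = 60.0) -> str:
    """Try to send; back off between attempts; report 'dlq' when exhausted."""
    for attempt in range(max_attempts):
        try:
            send()
            return "sent"
        except ConnectionError:
            # Wait before the next attempt, but not after the final one.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt, base, cap))
    return "dlq"
```

Passing `sleep` in as a parameter lets the same logic run against a delayed queue or scheduler rather than blocking a worker thread, which is the ownership question the diagram leaves open.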
