Notification System — system design by AgileViper46
Hire
Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.
Loading diagram…
For the design of notification system We are going to have this following components We have an event source which would be generating the events for notifications all the events would be going from the event source to the api gateway plus the load balancer layer which would be communicating with our notification service The notification service is going to store this notifications in our postgre sql The Postwash SQL will be having user table channel table campaign table notification table etc When a notification is added into the posterior sql the notification service will also make sure that it adds that particular row in the outbox notification channel Where the outbox pattern the CDC worker will durably add this notification into an incoming Kafka queue We will have a fan out workers taking the events from this incoming queue and putting it accordingly into the different SMS queue push notification queue or the email queue the fan out worker are also connected via Redis cache to get the information about the campaign to the user's mapping basically the notification which is coming for a particular campaign it will figure out all the users for that particular campaign and fan out those to the proper queue We would also make sure here that if there is some hot notification or wear a campaign contains more than lets say 30,000 users we will not fan out in that case and we will just send out a single node single event and the type inside the event we can say that it's a hot event type The hot event types would Handled by our outgoing servers in a different manner Now whenever this event comes from the smsq ap and Q Email queue to the outgoing servers it will first store an entry into the notification send table database The notification send database will have different state per notifications like received sent acknowledged etc Whenever the outgoing server pops an event from the particular queue it will add an entry into the notification sent table that the event Was received Then it will communicate with the proper channel to send out the proper notification This allows us to deduplicate the events and make sure that only we are sending one notification per user If the outgoing server crashes in between the next server can take up that message again but it can already see the state from the database where it left and it can resume from there The outgoing server system is also connected with the rate limiting system based on the user preferences and it will make sure that it occurs to the rate limiting suggested by the user In case of transactional type of notification we would bypass the rate limiting system and we would still send the notification For the hot notification The outgoing servers are going to receive only a single event and it would be the responsibility of the outgoing server to fan out During the processing time Thus we are not finding out millions of events for a single notification but we are sending out multiple notifications for a single event this hybrid strategy will make sure that we are not plotting our queues In case of notification is not able to successfully delivered that would be moved to ADLQ which is the dead letter queue We would have a dead letter queue processor which would try to send this notification one more time and if it is still not able to do it then we would store those notification as a failures Apart from this we also have a configuration service which would be handling all the configuration changes for the user and it would be storing that configuration in the process In a red city cluster and Those hcd cluster would be connected to our rate limiting system via the not watch notifications This will make sure that our configurations are get updated in the real time and the rate limiting system can avail the recent configurations One thing to note here is that notification servers configuration services or every other server in the system would be scaled horizontally and it can do parallel processing the Redis cache is a redis cluster with the primary replica standover so that whenever a primary fails the replica can take over during the failover scenarios as well as it is going to To be sharded by the key space Our postgresql Would be mainly the right path for the notifications it is only only going to handle most of the rights but it is also going to be read for the campaign to the user mapping or the hot notification notification datas we would have sufficient read replicas and the read would go via the read replicas in a primary secondary replica standby model Currently the system scale is not that much that up a single process sql cannot write that information but if required we can short the posts sql based on the notification id or the user id channel id depending and then we can have multiple sharded postures sql over a managed cluster Now let's consider this of the failure and resiliency scenarios what happens when one of the service provider is down the outgoing servers are going to try to send out a notification but they will find that let's say the sms notification service is down in that case they would not just directly move this to a dlq but it would drive with an exponential back off before retrying before sending it is notification to the dlq now the dlq processor would also retry this notification out of certain time like let us say 10 minutes or an half an hour or one hour based on the configuration and if still the request is failing then we would mark it as a failure so this would prevent us from the temporary failures of the service Oregon the underlying server going down instead of directly marking the retrieval notifications as failures Second thing to note here is that the CDC workers Are going to be horizontally scaled and whenever and they are going durably going to add the notification into the incoming queue they will only remove Notification from the incoming from the outbox table after they have successfully committed that message in the Kafka this makes sure that the notification is always sent to the Kafka queue The configuration service if it goes down it only depends or it would only affect the rate limiting part and the configuration for the user store maybe the user might receive certain number of new notifications for which they have just unsubscribed but it would not bring down the notification system For the hot notification processing here in the system it is just shown that there would be a 1 message and the one outgoing worker would be using it but it wont be that way the fan out worker would be breaking that vote notification into multiple notification packets or events and bucketing size and based on the bucket multiple outgoing servers would be processing that what notification at the same time it would be very catastrophic or a cpu bound for one single cpu to send out all the notifications for a hot notification it would be bucketed into different events and then multiple parallel systems would be working on it Certain considerations on this particular system If the user opt out for a particular campaign The single source of truth would be the fan out workers If the notification was already sent by the fan out workers then that particular event would be delivered. The rate limiting system can also be used for secondary verification to make sure we are not sending a suppperesed notificaiton The decision lies during the notification outbound by the fan out worker whether that user is subscribed or not and if it sees subscribed then it would send for this system if we are sending one or two more notification even after unsubscribe is not a big deal of an issue What a deduplication path under retries there might be a case where the outgoing worker marks a particular notification as received in the db calls the service provider the service provider sends the notification before updating the notification as sent it crashes Now to solve the ratio the outgoing worker will also create an item potency key and it would be sending that same item potency key to the service provider as well if the service this would be service providers responsibility if it sees the same item potency key for which the outgoing worker is retrying or sending the message then it would simply say that it was been already sent or the status of it thus a new outgoing outgoing worker when it takes place of a crashed outgoing worker it would not deduct the message Intersystem the outgoing workers are soon as a cluster of four but it is horizontally scaled based on the requirements and we can have cpu bound or the memory matrix to perform auto scaling on the outgoing workers outbound workers bas We can also create a separate state management layer And perform ACDC pattern on this state as well where the state management layer first takes the events write that into the database and then the outbound workers uses the cdc method to use the events from that so we can decouple the state management from the outbound system working perspective
Want this kind of feedback on your own design?
Draw your architecture for Notification System and get an instant hire/no-hire signal from 6 specialized AI reviewers — free to start.