Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.
All requests enter through an API gateway and load balancer that handle rate limiting, authentication, and load balancing, then route by URL to the right service.

The Tweet Service owns the tweet lifecycle. For a tweet with media, it generates a presigned blob URL and stores only metadata in Postgres while the client uploads the bytes directly to blob storage, which keeps binaries out of SQL. A text-only tweet is written straight to Postgres and its id is returned. Reads are served from a Redis tweet cache, read-through to Postgres on a miss, and batched with a multi-get when a feed is materialized. Tweet ids are Snowflake ids: they are time-ordered by construction, so the id itself encodes chronological order and no explicit timestamp is stored. The Follow Service handles follow/unfollow and counters, the Like Service handles like events, and the Feed Service generates the timeline using a hybrid of fan-out-on-write and pull-at-read.

Storage: Postgres holds the source-of-truth metadata. It is horizontally sharded with region failover; writes go to the primary and reads go to replicas. Cassandra is the high-volume timeline/event store, partitioned by key. Redis clusters (primary-replica failover) cache tweets and feeds and store follow relationships plus aggregate counters. Blob storage (Azure/S3) holds media behind a CDN.

Write path: writes use the transactional outbox pattern. Each tweet/follow write commits an outbox row in the same Postgres transaction; a CDC worker tails the outbox and publishes to Kafka, so no committed event is lost. The Tweet Fan-out Service writes each new tweet into every follower's Cassandra feed channel (fan-out-on-write; the feed is partitioned by user_id and ordered by tweet_id). The Follow Normalizer writes each follow edge in both directions (A→B and B→A), so the relationship can be looked up from either side.

Counters: like/follow events flow from Kafka to Flink, which maintains the aggregates that are stored in the Redis counter cache. The likes table is partitioned by user_id and ordered by tweet_id, with a unique (user_id, tweet_id) constraint.
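To make the "time-ordered by construction" claim concrete, here is a minimal sketch of a Snowflake-style id generator. The bit layout (41 bits of milliseconds, 10 bits of worker id, 12 bits of sequence), the epoch constant, and the names `SnowflakeGenerator` and `created_at_ms` are assumptions for illustration; the design above does not specify them. The sketch is single-threaded, so a real generator would also need a lock.

```python
import time

# Assumed layout: [41 bits ms since epoch][10 bits worker id][12 bits sequence].
EPOCH_MS = 1288834974657  # assumed custom epoch; any fixed value works
WORKER_BITS = 10
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock  # injectable so the generator can be tested
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            # Refuse to emit ids that would sort before earlier ones.
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.seq = (self.seq + 1) & MAX_SEQ
            if self.seq == 0:
                # Sequence exhausted for this millisecond: wait for the next one.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.seq = 0
        self.last_ms = now
        timestamp = now - EPOCH_MS
        return (timestamp << (WORKER_BITS + SEQ_BITS)) | (self.worker_id << SEQ_BITS) | self.seq


def created_at_ms(tweet_id):
    """Recover the creation time from the id, so no timestamp column is needed."""
    return (tweet_id >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS
```

Because the timestamp occupies the high bits, sorting by id is sorting by creation time, which is what lets the feed be ordered by tweet_id alone.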
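The transactional outbox on the write path can be sketched as follows. This is an illustration only: it uses SQLite from the Python standard library as a stand-in for Postgres, a polling relay as a stand-in for the CDC worker, and a caller-supplied `publish` callback as a stand-in for the Kafka producer. The table layout and the names `create_tweet` and `relay_once` are hypothetical.

```python
import json
import sqlite3


def init_db(conn):
    conn.executescript("""
        CREATE TABLE tweets (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL,
            body TEXT NOT NULL
        );
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT NOT NULL UNIQUE,   -- idempotency key carried by the event
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_tweet(conn, tweet_id, author_id, body):
    # One transaction: the tweet row and its outbox row commit together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO tweets (id, author_id, body) VALUES (?, ?, ?)",
            (tweet_id, author_id, body),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (
                f"tweet-created-{tweet_id}",
                "tweets",
                json.dumps({"tweet_id": tweet_id, "author_id": author_id}),
            ),
        )


def relay_once(conn, publish):
    """Publish pending outbox rows in commit order; returns how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_id, topic, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, topic, payload in rows:
        # If publish raises, the row stays unpublished and is retried later,
        # which is why delivery is at-least-once rather than exactly-once.
        publish(topic, event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

A crash between `publish` and the `UPDATE` re-sends the event on the next pass, so consumers still need the event id to drop duplicates.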
Since delivery is at-least-once, each event carries an eventId idempotency key and Flink drops duplicates before aggregating. Dedup state lives in keyed RocksDB state and is bounded by a watermark window. Flink's RocksDB state and snapshots also hold the counters durably, so they survive a Redis failure.

Fan-out breaks for celebrities (an account with 200M followers cannot take 200M writes per tweet), so fan-out is skipped for them and the Feed Service is hybrid: read the precomputed feed from Cassandra, find which celebrities the user follows (the list of accounts they follow is cached in Redis and read in one shot), fetch those celebrities' recent posts, merge, and serve. The hot tweet is written once and pulled at read time. The same merge gives read-your-own-writes: because fan-out is async, the author's own recent tweets are merged in at read time.

The celebrity case is a read hot-key problem, not a data-volume problem (~10 tiny tweets read by 200M users), so we use hot-key replication: N copies of the same celebrity feed (celeb:123:copy0..199), with each reader hashing to one copy, plus a local in-process cache and CDN/edge caching for viral tweets. Staleness is acceptable at ~10 writes/day. Celebrity timelines are cached so the merge does not fan out into many Cassandra reads.

Deletes use tombstones rather than row rewrites across millions of timelines: the tweet is marked deleted and dropped at read time when the feed is materialized (the page over-fetches so that filtered tombstones don't shrink it).

Feed latency is dominated by the merge, so the feed is delivered progressively: serve the first 50, keep the next 50 ready, and precompute the next page when the user scrolls to ~75.

Scaling/reliability: every service scales horizontally and independently. Cassandra timeline partition explosion is avoided by bucketing timelines by time (e.g. per month) so that no single partition grows without bound. A thundering herd on a hot tweet is handled with request coalescing (concurrent identical reads collapse into one origin fetch) and cache-stampede protection on expiry.
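The hybrid read path (precomputed feed, pulled celebrity timelines, the author's own tweets, tombstone filtering) amounts to a k-way merge over id lists that are already sorted newest-first. A minimal sketch, assuming each source has been fetched as a list of Snowflake ids in descending order and that deleted ids are available as a set; the function name `merge_feed` and its parameters are hypothetical:

```python
import heapq


def merge_feed(precomputed, celebrity_timelines, own_recent, deleted_ids, page_size):
    """Merge newest-first id lists into one page, skipping duplicates and tombstones.

    precomputed:          ids from the fanned-out feed
    celebrity_timelines:  one id list per followed celebrity (pulled at read time)
    own_recent:           the viewer's own recent tweets (read-your-own-writes)
    deleted_ids:          set of tombstoned tweet ids
    """
    sources = [precomputed, own_recent, *celebrity_timelines]
    # heapq.merge is lazy, so it only consumes as many ids as the page needs.
    merged = heapq.merge(*sources, reverse=True)

    page = []
    last_seen = None
    for tweet_id in merged:
        if tweet_id == last_seen:
            continue  # same tweet arrived from two sources
        last_seen = tweet_id
        if tweet_id in deleted_ids:
            continue  # tombstone: drop at read time, keep filling the page
        page.append(tweet_id)
        if len(page) == page_size:
            break
    return page


page = merge_feed(
    precomputed=[90, 70, 50, 30],
    celebrity_timelines=[[95, 60], [80]],
    own_recent=[85, 70],
    deleted_ids={80, 50},
    page_size=5,
)
print(page)  # [95, 90, 85, 70, 60]
```

Because the loop keeps pulling until the page is full, tombstoned entries do not shrink the page as long as each source was over-fetched by a small margin.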
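Hot-key replication for a celebrity feed can be sketched as a pure key-selection function: writers update every copy, and each reader is hashed to exactly one. The key format follows the celeb:123:copy0..199 naming above; the choice of CRC32 as the hash and the helper names `replica_key` and `all_replica_keys` are assumptions for illustration.

```python
import zlib

COPIES = 200  # number of replicas per celebrity feed, matching copy0..199


def replica_key(celebrity_id, reader_id, copies=COPIES):
    """Cache key a given reader should hit; stable across requests for that reader."""
    # CRC32 is used here only because it is deterministic across processes,
    # unlike Python's built-in hash() for strings.
    slot = zlib.crc32(str(reader_id).encode("utf-8")) % copies
    return f"celeb:{celebrity_id}:copy{slot}"


def all_replica_keys(celebrity_id, copies=COPIES):
    """Every key a writer must update when the celebrity posts."""
    return [f"celeb:{celebrity_id}:copy{i}" for i in range(copies)]
```

The trade-off is write amplification: each celebrity post becomes 200 cache writes. That is cheap at around 10 posts per day and spreads what would otherwise be a single hot key across many cache nodes.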
ML-relevance ranking is added as a layer, not a replacement: Cassandra can't sort by a per-viewer score, so the precomputed timeline becomes the candidate set and an ML ranking generator re-scores and re-sorts it. The pipeline is retrieval → ranking → serve. Consistency is deliberately relaxed, since a feed is a best-effort ranking, not a ledger. To stay within budget we don't score everything inline: we retrieve the top-N most recent candidates and score only those, using cheap features such as recency and like velocity (the change in likes over the last hour), which are already computed in Flink and served from the counter store.

Hydration: Cassandra returns ordered ids, which are resolved with one batched multi-get to Redis and a single bulk Postgres read for the misses. That is 2 hops, not N.

Follow authority: Postgres is the source of truth; Cassandra and Redis are derived asynchronously. The feed tolerates seconds of staleness, which is not a correctness bug.

CDC: a partitioned Kafka consumer group with offset checkpointing. It scales out and resumes after a failure, and the outbox prevents loss.

Hot partitions: heavy users are read-heavy (celebrities are pull-merged, not fanned out), so their feeds are cached with hot-key replication, and time-bucketing spreads the load.
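The retrieve-then-score step can be illustrated with a deliberately simple scorer. This is not the ML ranker itself: the linear formula, the half-life, and the weight are placeholder assumptions standing in for a learned model, and `rank`, `score`, and `like_velocity_of` are hypothetical names. It only shows the shape of the pipeline: cap the candidate set at top-N, then score with the two cheap features named above.

```python
import math


def score(age_seconds, like_velocity, half_life_s=6 * 3600, velocity_weight=0.5):
    """Placeholder score: exponential recency decay plus damped like velocity."""
    recency = 0.5 ** (age_seconds / half_life_s)
    return recency + velocity_weight * math.log1p(like_velocity)


def rank(candidates, now_ms, like_velocity_of, top_n=500):
    """Re-rank the newest top_n candidates.

    candidates:        list of (tweet_id, created_at_ms), newest first
    like_velocity_of:  callable tweet_id -> likes gained in the last hour
                       (in the design above this comes from the counter store)
    """
    pool = candidates[:top_n]  # retrieval: only the most recent N are scored
    scored = [
        (score((now_ms - created_ms) / 1000, like_velocity_of(tweet_id)), tweet_id)
        for tweet_id, created_ms in pool
    ]
    # Stable sort: ties keep their original newest-first order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet_id for _, tweet_id in scored]
```

With these placeholder weights, a 12-hour-old tweet gaining 1000 likes per hour outranks a brand-new tweet with none, while tweets with no like activity stay in chronological order.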
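The "2 hops, not N" hydration step can be sketched as below. The three callables are hypothetical stand-ins: `cache_mget` for a Redis multi-get, `db_fetch_many` for a single bulk Postgres query such as a `WHERE id = ANY(...)` lookup, and `cache_mset` for the cache back-fill.

```python
def hydrate(tweet_ids, cache_mget, db_fetch_many, cache_mset):
    """Resolve ordered tweet ids to tweets in at most two round trips.

    cache_mget(ids)     -> list aligned with ids, None for a miss
    db_fetch_many(ids)  -> dict {id: tweet} for the ids that exist
    cache_mset(mapping) -> back-fills the cache with {id: tweet}
    """
    # Hop 1: one batched cache read for the whole page.
    cached = cache_mget(tweet_ids)
    found = {tid: tweet for tid, tweet in zip(tweet_ids, cached) if tweet is not None}

    missing = [tid for tid in tweet_ids if tid not in found]
    if missing:
        # Hop 2: one bulk database read covering every miss.
        rows = db_fetch_many(missing)
        if rows:
            cache_mset(rows)
        found.update(rows)

    # Preserve the feed order returned by the timeline store; ids that exist
    # in neither store (for example, hard-deleted tweets) are dropped.
    return [found[tid] for tid in tweet_ids if tid in found]
```

The number of round trips stays at two regardless of page size, which is what keeps hydration latency flat as pages grow.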