DrawLint.ai

Twitter / Social Feed — system design by AgileViper46

Strong Hire

Reviewed by 6 specialized AI reviewers. The full per-section feedback is below.

All requests enter via an API gateway + load balancer that handles rate limiting, auth, and load balancing, then routes by URL to the right service.

The Tweet Service owns the tweet lifecycle. On create with media it generates a presigned blob URL and stores only metadata in Postgres while the client uploads bytes directly to blob storage, keeping binaries out of SQL; a text-only tweet is written straight to Postgres and its id returned. Reads are served from a Redis tweet cache, read-through to Postgres on miss, and batched via multi-get when a feed is materialized. Tweet ids are Snowflake ids: time-ordered by construction, so the id encodes chronological order and no explicit timestamp is stored.

The Follow Service handles follow/unfollow and counters; the Like Service handles like events; the Feed Service generates the timeline and is hybrid.

Storage: Postgres holds source-of-truth metadata, horizontally sharded with region failover, writes to primary and reads from replicas. Cassandra is the high-volume timeline/event store, partitioned and sharded by key. Redis clusters (primary-replica failover) cache tweets/feeds and store follow relationships plus aggregate counters. Blob (Azure/S3) holds media behind a CDN.

Write path: uses the transactional outbox pattern. Each tweet/follow write commits an outbox row in the same Postgres transaction; a CDC worker tails the outbox and publishes to Kafka, so no event is lost. The Tweet Fan-out Service writes each new tweet into every follower's Cassandra feed channel (fan-out-on-write; feed partitioned by user_id, ordered by tweet_id). The Follow Normalizer writes both directions (A→B, B→A).

Counters: like/follow events flow Kafka→Flink, which maintains aggregates stored in the Redis counter cache. The Likes table is partitioned by user_id, ordered by tweet_id, with a unique (user_id, tweet_id) constraint.
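
The claim that Snowflake ids make a separate timestamp unnecessary can be illustrated with a minimal sketch. The bit layout (41-bit millisecond timestamp, 10-bit worker id, 12-bit sequence) and the `EPOCH_MS` value are assumptions for illustration; the walkthrough does not specify them.

```python
# Assumed layout: [41 bits ms since epoch][10 bits worker][12 bits sequence].
EPOCH_MS = 1_288_834_974_657  # assumption: any fixed instant in the past works
WORKER_BITS = 10
SEQ_BITS = 12


def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack the fields so that integer order follows creation-time order."""
    if timestamp_ms < EPOCH_MS:
        raise ValueError("timestamp precedes the epoch")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQ_BITS):
        raise ValueError("sequence out of range")
    elapsed = timestamp_ms - EPOCH_MS
    return (elapsed << (WORKER_BITS + SEQ_BITS)) | (worker_id << SEQ_BITS) | sequence


def timestamp_of(tweet_id: int) -> int:
    """Recover the creation time (ms) from the id alone."""
    return (tweet_id >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS
```

Because the timestamp occupies the high bits, sorting by id sorts by creation time to millisecond precision, regardless of which worker issued the id.
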
Since delivery is at-least-once, each event carries an eventId idempotency key and Flink drops duplicates before aggregating; dedup state is bounded in keyed RocksDB state over a watermark window. Flink's RocksDB also durably stores counters/snapshots if Redis dies.

Fan-out breaks for celebrities (200M followers can't take 200M writes per tweet), so fan-out is skipped for them and the Feed Service is hybrid: read the precomputed feed from Cassandra, find which celebrities the user follows (following list cached in Redis, one shot), fetch their recent posts, merge, and serve, writing the hot tweet once and pulling at read time. The same merge gives read-your-own-write: the author's own recent tweets are merged in at read time since fan-out is async.

The celebrity case is a read hot-key problem, not data volume (~10 tiny tweets read by 200M), so we use hot-key replication, with N copies of the same celebrity feed (celeb:123:copy0..199) and each reader hashing to a copy, plus local in-process cache and CDN/edge for viral tweets; staleness is fine at ~10 writes/day. Celebrity timelines are cached so the merge doesn't fan out to many Cassandra reads.

Deletes use tombstones, not row rewrites across millions of timelines: the tweet is marked deleted and dropped at read time when the feed is materialized (the page over-fetches so filtered tombstones don't shrink it).

Feed latency is dominated by the merge; the feed is delivered progressively: serve the first 50, keep the next 50 ready, and precompute the next page when the user scrolls to ~75.

Scaling/reliability: every service scales horizontally and independently. Cassandra timeline partition explosion is avoided by bucketing timelines by time (e.g. per month) so no single partition grows unbounded. A thundering herd on a hot tweet is handled with request coalescing (collapse concurrent identical reads into one origin fetch) and cache-stampede protection on expiry.
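
The eventId deduplication described in the walkthrough can be sketched in plain Python. This is not Flink code: `WindowedDeduper` and `count_likes` are hypothetical names, a dict stands in for keyed RocksDB state, and dropping events older than the window is an assumed policy.

```python
from collections import Counter


class WindowedDeduper:
    """Remember event ids only while they are inside the watermark window."""

    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.watermark = 0
        self._seen = {}  # event_id -> event_time_ms

    def accept(self, event_id: str, event_time_ms: int) -> bool:
        """Return True the first time an in-window event id is seen."""
        self.watermark = max(self.watermark, event_time_ms)
        cutoff = self.watermark - self.window_ms
        # Evict expired ids so state stays bounded (linear scan: sketch only).
        for stale in [e for e, t in self._seen.items() if t < cutoff]:
            del self._seen[stale]
        if event_time_ms < cutoff:
            return False  # assumed policy: too late to dedup safely, so drop
        if event_id in self._seen:
            return False  # duplicate delivery
        self._seen[event_id] = event_time_ms
        return True

    def state_size(self) -> int:
        return len(self._seen)


def count_likes(events, window_ms: int) -> Counter:
    """events: iterable of (event_id, tweet_id, event_time_ms)."""
    deduper = WindowedDeduper(window_ms)
    counts = Counter()
    for event_id, tweet_id, event_time_ms in events:
        if deduper.accept(event_id, event_time_ms):
            counts[tweet_id] += 1
    return counts
```

The trade-off is visible in `accept`: a duplicate that arrives after its id has been evicted cannot be recognised, so the window has to be sized against the worst replay delay the pipeline is expected to see.
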
ML-relevance ranking is added as a layer, not a replacement: Cassandra can't sort by a per-viewer score, so the precomputed timeline becomes the candidate set and an ML ranking generator re-scores and re-sorts it. Consistency is deliberately relaxed (a feed is a best-effort ranking, not a ledger), and the pipeline is retrieval→ranking→serve. To stay in budget we don't score everything inline: retrieve top-N recent candidates and score only those using cheap features like recency and like velocity (delta likes in the last hour), already computed in Flink and served from the counter store.

Hydration: Cassandra returns ordered ids; one batched multi-get to Redis, then a single bulk Postgres read for misses. 2 hops, not N.

Follow authority: Postgres is source of truth; Cassandra/Redis are async-derived. The feed tolerates seconds of staleness, which is not a correctness bug.

CDC: partitioned Kafka consumer group with offset checkpointing; it scales out and resumes on failure, and the outbox prevents loss.

Hot partitions: heavy users are read-heavy (celebrities are pull-merged, not fanned out); caching with hot-key replication plus time-bucketing spreads the load.
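
The retrieval→ranking step in the walkthrough can be sketched as follows. The scoring formula, the half-life, and the weight are assumptions chosen for illustration; the walkthrough only names recency and like velocity as features.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tweet_id: int          # Snowflake id: larger means newer
    age_seconds: float
    likes_last_hour: int   # assumed to come from the counter store


def score(c: Candidate, half_life_s: float = 6 * 3600,
          velocity_weight: float = 0.5) -> float:
    """Cheap score: exponential recency decay plus damped like velocity."""
    recency = 0.5 ** (c.age_seconds / half_life_s)
    velocity = math.log1p(c.likes_last_hour)
    return recency + velocity_weight * velocity


def rank(candidates, top_n: int, page_size: int):
    """Retrieve the top_n most recent candidates, then re-sort them by score."""
    recent = sorted(candidates, key=lambda c: c.tweet_id, reverse=True)[:top_n]
    return sorted(recent, key=score, reverse=True)[:page_size]
```

Capping retrieval at `top_n` before scoring is what keeps the ranking cost independent of how long the precomputed timeline is.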

Hire Signal: Hire

The candidate demonstrates strong architectural instincts, good scalability awareness, and concrete mechanisms for the hardest parts of the problem, especially feed generation and async propagation. The concerns are real but are mostly around operational completeness and API polish rather than fundamental design weakness.

⭐ Excellent

Clear availability-over-consistency stance with concrete scope

The candidate explicitly states that availability is prioritized over consistency for most of the system and then ties that to concrete behavior: feeds tolerate seconds of staleness, follow data in Redis/Cassandra is async-derived, and read-your-own-write is selectively repaired at read time. That shows they are not just naming CAP trade-offs, but applying them where correctness requirements are softer.

⭐ Excellent

Consistency model is differentiated by data type

This design does a good job separating source-of-truth writes from derived views. Postgres is treated as authoritative for tweets and follows, while feeds and counters are eventually consistent. That is the right level of nuance for a social timeline system because it preserves correctness for core mutations while allowing scalable, low-latency feed serving.

✅ Good

Latency target is connected to feed-serving strategy

The p95 < 500ms goal is not left floating. The candidate explains how it is supported through precomputed feeds, hybrid fan-out for celebrities, batched hydration, progressive delivery, caching, and request coalescing. Even without exact timing budgets, the latency objective is clearly driving design choices.

✅ Good

Scale assumptions are reflected in the NFR reasoning

The candidate uses the stated scale assumptions to justify hybrid fan-out, celebrity pull-merge, hot-key replication, and partition bucketing. That shows the scalability requirement is grounded in the 1B-user / 100M-DAU context rather than being treated as a generic slogan.

⚠️ Warning

Availability target is stated, but failure-budget implications are not fully spelled out

Have you considered what happens when a dependency like Redis, Kafka/CDC, or a regional Postgres primary is degraded for several minutes? You say 99.99% availability, but the NFR section does not define which user actions must still succeed under those failures and which can degrade gracefully. Without that boundary, the target is hard to defend operationally. You could improve this by stating per-path expectations such as 'tweet create remains available via primary DB, feed may serve stale cache, likes may be accepted and reflected later.'

⚠️ Warning

Consistency guarantees are described qualitatively, but not bounded where UX depends on them

Have you considered what happens if eventual consistency stretches from seconds to minutes during backlog or replay? For example, follow/unfollow and like state affect personalization and user trust. The design says staleness is acceptable, but it does not define an expected propagation window or which actions need stronger guarantees, such as whether a user should immediately stop seeing an unfollowed account. You could strengthen this by attaching bounded freshness targets to derived views.

ℹ️ Info

Latency target would be more defensible with a simple budget breakdown

You could improve this by turning p95 < 500ms into a rough hop budget: cache hit path, Cassandra read, Redis multi-get, Postgres miss fill, ranking time, and merge time. Right now the mechanisms are sensible, but the target is still somewhat qualitative because there is no explicit end-to-end budget showing why 500ms is achievable at the stated DAU.

✅ Good

Core nouns for the main product flows are present

The design identifies the main domain concepts needed for the stated requirements: User, Tweet, Follow, Like, and Feed. That covers posting, following, liking, and timeline generation without drifting into out-of-scope features.

✅ Good

Follow relationship is modeled in both directions

The explanation makes the follow graph explicit by storing both A→B and B→A views. That shows awareness that the system needs more than just a raw edge table: one direction supports 'who I follow' and the other supports fan-out and follower-based access patterns.

✅ Good

Feed is treated as a derived entity rather than source of truth

The candidate distinguishes durable source entities like Tweet and Follow from the derived Feed view. That is a sensible domain boundary for this problem because the personalized timeline is computed from follow relationships and tweets rather than authored directly.

⚠️ Warning

Relationship cardinalities are only implied, not clearly stated

Have you considered making the entity relationships explicit? For example, User↔User through Follow is many-to-many, User→Tweet is one-to-many, and User↔Tweet through Like is many-to-many. Without stating those relationships clearly, it is harder to reason about ownership, deduplication, and what the source of truth is for each interaction.

⚠️ Warning

Feed membership is underspecified as a domain relationship

What happens when you need to explain exactly how a tweet appears in a user's timeline? Right now Feed is named as an entity, but the relationship between Feed, User, and Tweet is not clearly defined: is Feed a per-user collection of tweet references, a ranked view, or a materialized inbox? Making that relationship explicit would tighten the happy path from follow graph to timeline.

ℹ️ Info

You could separate source entities from derived views more explicitly

You could improve this by calling out that Follow and Like are edge entities, Tweet is authored content, User is the principal, and Feed is a derived projection over Tweet + Follow. That framing would make the model cleaner and show stronger command of which entities require correctness guarantees versus which can be rebuilt.

⭐ Excellent

Capacity reasoning connects product scale to traffic shape

The candidate gives concrete scale anchors for users, posts, storage growth, network, and request rates, then ties those numbers to the architecture choices. In particular, calling out that raw post-write QPS is low while fan-out amplification is the real scaling problem shows good capacity intuition for a social feed system.

⭐ Excellent

Component choices are justified by the actual bottlenecks

The explanation distinguishes source-of-truth metadata in Postgres from high-volume timeline storage in Cassandra, uses blob storage plus CDN for media bandwidth, and Kafka for absorbing asynchronous fan-out. Those choices are motivated by the calculated workload shape rather than picked generically.

✅ Good

Celebrity traffic is treated as a separate capacity class

Recognizing that fan-out-on-write breaks for very high follower counts and switching those accounts to pull-at-read with hot-key replication is a strong scale-aware trade-off. It shows the candidate is thinking about skew and not just average-case throughput.
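
The hot-key replication this card praises can be sketched as a key-selection function. The key format follows the walkthrough's `celeb:123:copy0..199` example; the function names and the choice of CRC32 as the hash are assumptions.

```python
import zlib


def celebrity_feed_key(celebrity_id: int, reader_id: int, copies: int = 200) -> str:
    """Pick one of `copies` replicated cache keys for this reader.

    CRC32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings, which is randomised per process.
    """
    copy = zlib.crc32(str(reader_id).encode("ascii")) % copies
    return f"celeb:{celebrity_id}:copy{copy}"


def all_copy_keys(celebrity_id: int, copies: int = 200) -> list:
    """Keys a writer must refresh when the celebrity posts or deletes."""
    return [f"celeb:{celebrity_id}:copy{i}" for i in range(copies)]
```

Reads spread across the copies while each celebrity write touches every copy, which is only cheap because, as the walkthrough notes, celebrities write rarely.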

⚠️ Warning

Peak load and headroom are not sized explicitly

The QPS numbers are daily averages. What happens during diurnal peaks, breaking-news spikes, or a viral celebrity post? Without converting average traffic into peak QPS and peak bandwidth, it is hard to tell whether the Redis, Cassandra, Kafka, and feed-merge path have enough headroom.

⚠️ Warning

Follow graph and timeline storage growth are missing from the capacity model

The storage section covers tweets and media, but the dominant data volume in this design may be the follow graph plus precomputed feed entries in Cassandra. Have you considered how many follow edges exist at 1B users, how large Redis would get if it stores follow relationships, and how much timeline fan-out data accumulates per user over retention?

⚠️ Warning

Replication and multi-copy overhead are not reflected in the numbers

The raw storage estimates are reasonable, but the deployed footprint will be materially larger once you include database replication, Cassandra replication factor, Redis replicas, Kafka retention, and the hot-key replicated celebrity feeds. What happens to your capacity plan when those multipliers are applied?

ℹ️ Info

Fan-out throughput could be quantified more concretely

You could improve this by turning 'fan-out feed writes: very high' into a rough envelope, for example average followers per posting user and a separate worst-case path for non-celeb accounts. That would make it easier to sanity-check Kafka throughput, Cassandra write rate, and backlog recovery after failures.

✅ Good

Core user flows are covered by concrete endpoints

The routes let a client perform the required actions end-to-end: create a tweet, fetch a tweet, fetch a personalized feed, follow/unfollow, and like. That is enough to exercise the stated functional requirements through the API without inventing extra features.

✅ Good

Feed API uses cursor-based pagination

Using a cursor on the timeline endpoint is the right shape for a large, append-heavy feed. At this scale, cursor pagination is much safer than offset pagination because it avoids deep scans and unstable page boundaries as new tweets arrive.

✅ Good

Media upload flow acknowledges direct-to-blob upload

The create-tweet response returning a presigned media URL shows awareness that media bytes should not traverse the tweet service. That is a practical API choice for large-scale upload handling.

⚠️ Warning

Tweet creation flow is ambiguous for media uploads

What happens when the client calls POST /tweets with media and receives both a tweet_id and a presigned URL, but the blob upload later fails or is never completed? The API currently makes it unclear whether the tweet is already published, whether it should remain text-only, or whether a second finalize call is required. Without an explicit two-step contract such as create-draft -> upload -> finalize, clients can easily create dangling or partially visible tweets.
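
One possible shape for the create-draft -> upload -> finalize contract is a small state machine. This is a sketch only: the `TweetDrafts` class, the state names, and the draft TTL are hypothetical and not part of the candidate's API.

```python
from enum import Enum


class TweetState(Enum):
    DRAFT = "draft"          # created, media not confirmed; hidden from feeds
    PUBLISHED = "published"  # visible and eligible for fan-out
    EXPIRED = "expired"      # upload never completed; safe to garbage-collect


class TweetDrafts:
    def __init__(self, draft_ttl_s: float):
        self.draft_ttl_s = draft_ttl_s
        self._tweets = {}  # tweet_id -> {"state": TweetState, "created_at": float}

    def create(self, tweet_id: int, has_media: bool, now: float) -> TweetState:
        """Text-only tweets publish immediately; media tweets start as drafts."""
        state = TweetState.DRAFT if has_media else TweetState.PUBLISHED
        self._tweets[tweet_id] = {"state": state, "created_at": now}
        return state

    def finalize(self, tweet_id: int, media_uploaded: bool) -> TweetState:
        """Idempotent: publishes a draft once the blob is confirmed present."""
        tweet = self._tweets[tweet_id]
        if tweet["state"] is TweetState.DRAFT and media_uploaded:
            tweet["state"] = TweetState.PUBLISHED
        return tweet["state"]

    def expire_stale(self, now: float) -> list:
        """Mark drafts whose upload never finished; returns the expired ids."""
        expired = []
        for tweet_id, tweet in self._tweets.items():
            too_old = now - tweet["created_at"] > self.draft_ttl_s
            if tweet["state"] is TweetState.DRAFT and too_old:
                tweet["state"] = TweetState.EXPIRED
                expired.append(tweet_id)
        return expired
```

With this contract a failed upload leaves a hidden draft rather than a partially visible tweet, and a retried finalize call is harmless.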

⚠️ Warning

Like API mixes actions in a non-idempotent POST

Have you considered what happens if the client retries POST /likes/{tweet_id} after a timeout? Because the body carries type: like|dislike, the semantics are muddy: is dislike an unlike, a downvote, or a separate reaction? For this requirement the system only needs like, so a cleaner contract would be PUT /likes/{tweet_id} to like and DELETE /likes/{tweet_id} to unlike, which gives clearer idempotency and retry behavior.
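
The retry behaviour of the suggested PUT/DELETE contract can be sketched with an in-memory store. `LikeStore` is a hypothetical stand-in; the set of (user_id, tweet_id) pairs mirrors the unique constraint on the Likes table described in the walkthrough.

```python
class LikeStore:
    """In-memory stand-in for the Likes table."""

    def __init__(self):
        self._likes = set()  # {(user_id, tweet_id)}: at most one like per pair

    def put_like(self, user_id: int, tweet_id: int) -> dict:
        """PUT /likes/{tweet_id}: repeating the call leaves state unchanged."""
        self._likes.add((user_id, tweet_id))
        return self._state(user_id, tweet_id)

    def delete_like(self, user_id: int, tweet_id: int) -> dict:
        """DELETE /likes/{tweet_id}: succeeds even if no like exists."""
        self._likes.discard((user_id, tweet_id))
        return self._state(user_id, tweet_id)

    def _state(self, user_id: int, tweet_id: int) -> dict:
        # Linear scan for the count: acceptable in a sketch, not in production.
        count = sum(1 for _, t in self._likes if t == tweet_id)
        return {
            "tweet_id": tweet_id,
            "liked": (user_id, tweet_id) in self._likes,
            "like_count": count,
        }
```

A client that times out and retries `put_like` gets the same response as the first attempt, so it never needs to guess whether the original request landed.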

⚠️ Warning

Error contract and retry guidance are missing

What does the client see when follow already exists, a tweet is deleted, the cursor is invalid, or rate limiting/auth blocks the request at the gateway? The routes do not define status codes or a consistent error body, so clients cannot reliably distinguish retryable failures from permanent ones. At senior level, I would expect at least a clear pattern such as 400/401/403/404/409/429/5xx plus an error code and retry hint.
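
A consistent error body of the kind asked for here might look like the sketch below. The field names, the error codes, and the set of statuses treated as retryable are all assumptions, not part of the candidate's API.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Assumed policy: throttling and server-side failures are safe to retry.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


@dataclass(frozen=True)
class ApiError:
    status: int                          # HTTP status code
    code: str                            # stable machine-readable error code
    message: str                         # human-readable explanation
    retry_after_s: Optional[int] = None  # hint for 429/503 responses

    @property
    def retryable(self) -> bool:
        return self.status in RETRYABLE_STATUSES

    def to_body(self) -> dict:
        """Serialise to the JSON-compatible body returned to the client."""
        body = asdict(self)
        body["retryable"] = self.retryable
        return body
```

For example, `ApiError(409, "already_following", "Follow already exists")` tells the client the request is permanent and must not be retried, while a 429 carries a `retry_after_s` hint.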

ℹ️ Info

Feed resource shape could be cleaner

You could improve this by making the personalized timeline implicit from auth rather than passing user={user_id} on GET /tweets/feed. If the caller can request arbitrary user ids here, it blurs whether this is a home timeline or a public user timeline. A cleaner split is GET /feed for the authenticated user's home feed and a separate endpoint if user-profile tweets are ever needed.

ℹ️ Info

Cursor parameter appears underspecified

You could strengthen the feed API by defining what the cursor encodes and how clients should handle expiration or tampering. At this scale, opaque cursors are fine, but the contract should say whether they are stable snapshots or best-effort continuation tokens, and what error is returned when a cursor is invalid.
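
One way to make a cursor opaque and tamper-evident is to sign it. The sketch below assumes the cursor only needs to carry the last tweet id seen; `SECRET`, the payload shape, and the `InvalidCursor` error are hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical: a real service would load this securely
SIG_LEN = 32             # length of a SHA-256 digest in bytes


class InvalidCursor(ValueError):
    """Raised for malformed or tampered cursors (e.g. mapped to HTTP 400)."""


def encode_cursor(last_tweet_id: int) -> str:
    payload = json.dumps({"last": last_tweet_id}, separators=(",", ":")).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode("ascii")


def decode_cursor(cursor: str) -> int:
    try:
        raw = base64.urlsafe_b64decode(cursor.encode("utf-8"))
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        raise InvalidCursor("cursor is not valid base64") from exc
    sig, payload = raw[:SIG_LEN], raw[SIG_LEN:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise InvalidCursor("cursor signature mismatch")
    return json.loads(payload)["last"]
```

A cursor built this way is a best-effort continuation token rather than a stable snapshot: it says where to resume, not what the feed looked like when the previous page was served.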

ℹ️ Info

Basic follow/like readbacks are not exposed

You could improve client usability by returning the resulting state from mutation endpoints, for example whether the follow now exists or the current liked state/count. The core requirements are technically covered, but without stateful responses clients may need extra round trips or guess whether a retry succeeded after a timeout.

⭐ Excellent

Thoughtful hybrid feed architecture

The candidate explicitly avoids pure fan-out-on-write for celebrity accounts and switches to pull-at-read with merge logic. That shows strong design judgment for the stated scale, because the first obvious failure mode in a Twitter-like system is exploding write amplification on high-follower accounts.
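
The read-time merge behind this hybrid design can be sketched with a k-way merge over id lists that are already sorted newest-first. The function name and parameters are hypothetical; tweet ids are assumed to be time-ordered, as the walkthrough states.

```python
import heapq


def merge_feed(precomputed, celebrity_feeds, own_recent, deleted, limit):
    """Merge newest-first id lists into one page.

    precomputed     : ids fanned out to this user's timeline
    celebrity_feeds : one newest-first list per followed celebrity
    own_recent      : the viewer's own recent ids (read-your-own-write)
    deleted         : set of tombstoned ids to drop at read time
    """
    sources = [precomputed, own_recent, *celebrity_feeds]
    page = []
    last = None
    # heapq.merge with reverse=True expects each input sorted descending.
    for tweet_id in heapq.merge(*sources, reverse=True):
        if tweet_id == last:
            continue  # the same id arrived from two sources
        last = tweet_id
        if tweet_id in deleted:
            continue  # tombstone: filtered here, never rewritten in timelines
        page.append(tweet_id)
        if len(page) == limit:
            break
    return page
```

Because the merge is lazy, it stops as soon as the page is full, so the cost tracks the page size rather than the combined length of the inputs.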

⭐ Excellent

Reliable async write propagation via outbox and Kafka

Using a transactional outbox in Postgres, then CDC into Kafka before fan-out, is a strong end-to-end design choice. It addresses the classic failure scenario where the tweet is committed but the feed event is lost, and it cleanly decouples user-facing write latency from heavy downstream fan-out work.
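
The outbox guarantee can be demonstrated end to end with a small sketch. SQLite stands in for Postgres and a polling relay stands in for the CDC worker; the table and function names are assumptions.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE tweets (
            id INTEGER PRIMARY KEY, author_id INTEGER NOT NULL, body TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT UNIQUE NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0);
    """)
    return db


def create_tweet(db, tweet_id: int, author_id: int, body: str) -> None:
    """Write the tweet and its event in ONE transaction: both or neither."""
    payload = json.dumps({"tweet_id": tweet_id, "author_id": author_id})
    with db:  # commits on success, rolls back on any exception
        db.execute("INSERT INTO tweets (id, author_id, body) VALUES (?, ?, ?)",
                   (tweet_id, author_id, body))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                   (f"tweet-created:{tweet_id}", payload))


def relay_once(db, publish) -> int:
    """Publish pending outbox rows in order; returns how many were sent.

    A crash between publish() and the UPDATE re-sends the row on the next
    run, which is why delivery is at-least-once and consumers dedup on
    event_id.
    """
    rows = db.execute(
        "SELECT seq, event_id, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq").fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

In use, `publish` would wrap a Kafka producer call; here any callable taking `(event_id, payload)` works.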

✅ Good

Read path is optimized around feed hydration

The design separates feed index storage from tweet hydration and uses Redis multi-get plus bulk DB reads on misses. That is a good high-level pattern for keeping feed reads fast while avoiding N+1 lookups.
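
The two-hop hydration pattern can be sketched as follows. A dict stands in for the Redis tweet cache and `db_bulk_read` for a single bulk Postgres query; both names are hypothetical.

```python
def hydrate(ordered_ids, cache: dict, db_bulk_read):
    """Turn ordered tweet ids into tweet bodies in at most two lookups.

    cache        : stand-in for Redis; the comprehension models one multi-get
    db_bulk_read : callable taking a list of ids and returning {id: tweet},
                   modelling one bulk read (e.g. WHERE id IN (...))
    """
    # Hop 1: one batched cache lookup for every id on the page.
    found = {tid: cache[tid] for tid in ordered_ids if tid in cache}
    missing = [tid for tid in ordered_ids if tid not in found]
    if missing:
        # Hop 2: a single bulk read for all misses, then backfill the cache.
        fetched = db_bulk_read(missing)
        cache.update(fetched)
        found.update(fetched)
    # Preserve feed order; ids found in neither store are skipped.
    return [found[tid] for tid in ordered_ids if tid in found]
```

The number of round trips stays at two no matter how many ids are on the page, which is the property the walkthrough summarises as "2 hops, not N".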

✅ Good

Hot-key mitigation is called out explicitly

The explanation goes beyond generic caching and discusses request coalescing, cache-stampede protection, and replicated celebrity cache keys. That shows awareness of where the real bottlenecks appear in a social feed system.
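
Request coalescing, one of the mitigations named here, can be sketched with a small single-flight helper. The `SingleFlight` class is hypothetical and uses only the standard library; a production version would also need timeouts.

```python
import threading


class SingleFlight:
    """Collapse concurrent fetches of the same key into one origin call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> call record shared by leader and followers

    def do(self, key, fetch):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = call
        if leader:
            try:
                call["value"] = fetch()  # the only origin fetch for this burst
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call["done"].set()       # wake every waiting follower
        else:
            call["done"].wait()          # reuse the leader's result
        if call["error"] is not None:
            raise call["error"]
        return call["value"]
```

Only the first caller for a key reaches the origin; callers that arrive while the fetch is in flight block and share its result, which is what bounds the load from a thundering herd on one hot tweet.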

⚠️ Warning

Follow write path and follow event propagation are inconsistent

Have you considered what happens when a user follows or unfollows someone? In the diagram, Follow Service writes follow info to Postgres and Kafka feeds a Follow normalizer, but there is no clear path showing how follow events get into Kafka other than an outbox annotation. If that propagation is delayed or broken, the personalized timeline can remain wrong for a long time because fan-out targets and celebrity merge inputs depend on derived follow state.

⚠️ Warning

Feed correctness depends on multiple async stores with no clear reconciliation path

What happens when Cassandra feed entries, Redis follow cache, and Postgres source-of-truth disagree after partial failures or lag? The design intentionally accepts staleness, which is fine, but at this scale you still need a clear recovery story for rebuilding a user's feed slice or re-deriving follow state when consumers fall behind or bad data is written.

⚠️ Warning

Single-region failure handling is asserted but not reflected in the flow

Have you considered what happens if the Postgres primary, a Redis primary, or a Cassandra node/region fails during active traffic? The explanation mentions sharding and region failover, but the HLD does not show how services fail over, whether Kafka is multi-AZ/region, or how feed reads continue when one backing store is degraded. For a 99.99% availability target, these failure paths are important architectural decisions, not implementation details.

ℹ️ Info

Some components are under-connected in the drawn design

You could improve this by making the end-to-end role of Flink + Kafka, replicas, and the follow normalizer more explicit on the diagram. Right now they are explainable from the walkthrough, but visually they look partially orphaned or loosely attached, which makes it harder to verify the complete request and event flows under failure.

ℹ️ Info

Cache strategy is strong for reads but invalidation paths are not visible

You could strengthen the design by showing what happens to tweet/feed cache entries on like count changes, deletes, and unfollows. Without a visible invalidation or TTL strategy, the likely production behavior is stale counters or stale feed composition persisting longer than intended.
