DrawLint.ai

News Aggregator - system design by AgileViper46

Lean Hire

Reviewed by 6 specialized AI reviewers. The full per-section feedback follows.

Hire Signal: Lean Hire

This is a solid mid-level attempt with the correct major building blocks and generally sound architectural instincts, but it falls short on fundamentals that matter for this problem: capacity sizing, precise pagination consistency, and a clean end-to-end design. I would lean hire because the core approach is reasonable and the gaps are fixable, but I would want stronger rigor in validating scale and operational correctness.

✅ Good

Latency target is explicitly stated

The design includes a concrete latency goal of under 200ms for feed loads, which is important for evaluating whether the system meets user-facing performance expectations.

✅ Good

Availability prioritized over consistency

Stating that showing slightly stale data is preferable to showing nothing is a sensible consistency tradeoff for a news feed product, especially under large-scale read traffic.

✅ Good

Scalability expectation matches stated assumptions

Calling out the need to scale to the assumed 500M DAU spike shows awareness that the non-functional design must be evaluated against very large traffic levels.

warning

Availability is not quantified

The requirement says availability is more important than consistency, but it does not define an uptime target such as 99.9% or 99.99%. Without a measurable SLA/SLO, it is hard to judge whether the design meets the expected reliability. Add a concrete availability target for the feed read path.
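
To make such a target concrete, an availability percentage can be converted into an allowed-downtime budget. The sketch below is illustrative arithmetic only; the 30-day window and the function name `downtime_budget_minutes` are assumptions for this example, not part of the reviewed design.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare the two example targets mentioned in the feedback.
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Stating the target this way lets the feed read path be judged against a number rather than a preference.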

warning

Consistency model is too vague for pagination

Saying availability is preferred over consistency is directionally correct, but the system also requires infinite scroll with consistent pagination. That needs a more precise consistency statement, such as snapshot/cursor-based pagination over an immutable ordering window or eventual consistency for newly ingested articles while preserving stable page boundaries for a session.

info

Scalability is stated as a goal rather than an NFR metric

High scalability to 500M DAU is relevant, but it would be stronger if translated into measurable expectations such as peak QPS, regional traffic distribution, or ingestion/read amplification. That makes the NFR actionable and easier to validate.

✅ Good

Includes core user and content nouns

User and Article are both relevant top-level entities for serving a regional news feed, and Publisher is also an important domain noun for representing RSS content sources.

warning

Regional feed entity is missing

The requirements explicitly center on users viewing a regional feed of aggregated articles, but there is no entity capturing Region or Feed. Without a noun representing the regional grouping, the core domain model does not clearly express how articles are organized for retrieval. Add a Region (or RegionalFeed) entity to show how articles are associated to a user's requested geography.
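
One possible shape for the missing entity is sketched below. All field names are hypothetical, and the choice to tag each article with a `region_id` directly (rather than deriving it from the publisher) is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    region_id: str  # hypothetical identifier, e.g. "us-west"
    name: str


@dataclass(frozen=True)
class Publisher:
    publisher_id: str
    name: str
    rss_url: str


@dataclass(frozen=True)
class Article:
    article_id: str
    publisher_id: str
    region_id: str  # assumption: the link that makes regional retrieval explicit
    title: str
    publish_time_ms: int


def articles_for_region(articles: list[Article], region: Region) -> list[Article]:
    """Return the region's articles, newest first."""
    matching = [a for a in articles if a.region_id == region.region_id]
    return sorted(matching, key=lambda a: (a.publish_time_ms, a.article_id), reverse=True)
```

With a Region noun in the model, the feed query becomes a direct expression of the requirement: articles for a region, ordered by recency.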

info

Publisher naming is inconsistent

The list uses 'Publishers' while the other entities are singular nouns. Core entities should be listed as singular domain nouns for clarity and consistency. Rename this to 'Publisher'.

critical

No basic capacity numbers provided

The capacity estimation section is empty, so there are no estimates for DAU-to-concurrency, read QPS, write/ingestion QPS, storage growth, or bandwidth. For this problem, the candidate should at minimum translate the stated 100M DAU (with 500M spike) into rough feed-read traffic and RSS ingestion volume so the rest of the design can be sized.

critical

Scale is not validated against the stated assumptions

With 100M DAU and spikes to 500M, ballpark sizing matters a lot. Without even rough calculations, there is no way to tell whether the proposed system can support regional feed reads, infinite scroll pagination, or article aggregation from thousands of publishers. Add back-of-the-envelope estimates for peak QPS, cache hit assumptions, per-region fanout/read distribution, and storage/index growth to show the design is in the right ballpark.
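
A back-of-the-envelope estimate of this kind can be sketched in a few lines. Only the 100M and 500M DAU figures come from the design; the 10 feed loads per user per day, the 3x peak factor, the 90% cache hit ratio, the 10,000 publishers, and the 300-second poll interval are all placeholder assumptions that the candidate would need to justify.

```python
SECONDS_PER_DAY = 86_400


def feed_read_qps(dau: int, loads_per_user_per_day: float, peak_factor: float) -> dict:
    """Average and peak feed-read QPS for a given daily active user count."""
    average = dau * loads_per_user_per_day / SECONDS_PER_DAY
    return {"average_qps": average, "peak_qps": average * peak_factor}


def origin_qps(peak_qps: float, cache_hit_ratio: float) -> float:
    """Reads that miss the cache and reach the feed datastore."""
    return peak_qps * (1 - cache_hit_ratio)


def ingestion_fetches_per_second(publishers: int, poll_interval_s: float) -> float:
    """Feed fetches per second if every publisher is polled once per interval."""
    return publishers / poll_interval_s


if __name__ == "__main__":
    for dau in (100_000_000, 500_000_000):  # stated baseline and spike
        est = feed_read_qps(dau, loads_per_user_per_day=10, peak_factor=3)  # assumed inputs
        misses = origin_qps(est["peak_qps"], cache_hit_ratio=0.9)  # assumed hit ratio
        print(f"DAU={dau:,}: avg={est['average_qps']:,.0f} peak={est['peak_qps']:,.0f} "
              f"datastore={misses:,.0f} QPS")
    print(f"ingestion: {ingestion_fetches_per_second(10_000, 300):.1f} fetches/s")
```

Under these assumed inputs, 100M DAU works out to roughly 11,600 average and 34,700 peak feed reads per second. The exact numbers matter less than showing that the read path, not ingestion, dominates the sizing.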

✅ Good

Core feed retrieval endpoint is present

The design includes a GET /feed endpoint with region, limit, and cursor parameters, which directly maps to the main user-facing requirements: viewing a regional feed and supporting infinite scroll.

✅ Good

Cursor-based pagination fits infinite scroll

Using a cursor instead of offset is a solid API choice for large, continuously updated feeds because it helps maintain more consistent pagination as new articles arrive.
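
The difference can be shown with a small in-memory simulation. This is a toy sketch, not the candidate's implementation: articles are modeled as hypothetical `(publish_time, article_id)` tuples, and a new article arrives between the first and second page requests.

```python
def page_by_offset(articles, offset, limit):
    """Offset pagination: skip a fixed number of rows in the current ordering."""
    return sorted(articles, reverse=True)[offset:offset + limit]


def page_by_cursor(articles, limit, after=None):
    """Cursor pagination: return rows strictly older than the last one seen."""
    ordered = sorted(articles, reverse=True)
    if after is not None:
        ordered = [a for a in ordered if a < after]
    return ordered[:limit]


feed = [(100, "a"), (90, "b"), (80, "c"), (70, "d")]
first_page = page_by_cursor(feed, limit=2)

# A new article is ingested while the user is still scrolling.
feed.append((110, "e"))

offset_second = page_by_offset(feed, offset=2, limit=2)
cursor_second = page_by_cursor(feed, limit=2, after=first_page[-1])

# Offset pagination repeats an article from page one; the cursor does not.
print("offset repeats:", [a for a in offset_second if a in first_page])
print("cursor repeats:", [a for a in cursor_second if a in first_page])
```

The cursor version stays stable because it filters relative to the last item the client saw rather than to a position that shifts when new rows are inserted at the top.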

warning

Response shape is underspecified for pagination

Returning only Articles[] is not enough to support reliable infinite scrolling. The response should include pagination metadata such as nextCursor and optionally hasMore so clients know how to request the next page consistently.
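
A response envelope along these lines might look like the sketch below. The field names `nextCursor` and `hasMore` follow the suggestion above; fetching `limit + 1` rows to detect whether more pages exist is a common technique assumed here, and the plain-text cursor format is a placeholder rather than a recommended encoding.

```python
from typing import Optional


def build_feed_response(rows: list[dict], limit: int) -> dict:
    """Build a page from up to limit + 1 rows that are already sorted newest first.

    The extra row is only used to detect whether another page exists.
    """
    has_more = len(rows) > limit
    page = rows[:limit]
    next_cursor: Optional[str] = None
    if has_more and page:
        last = page[-1]
        # Placeholder cursor; a real API would return an opaque token.
        next_cursor = f"{last['publishTime']}:{last['articleId']}"
    return {"articles": page, "nextCursor": next_cursor, "hasMore": has_more}


if __name__ == "__main__":
    fetched = [
        {"articleId": "a3", "publishTime": 300, "title": "Third"},
        {"articleId": "a2", "publishTime": 200, "title": "Second"},
        {"articleId": "a1", "publishTime": 100, "title": "First"},
    ]
    print(build_feed_response(fetched, limit=2))
```

Clients then stop requesting pages when `hasMore` is false and otherwise pass `nextCursor` back unchanged.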

warning

Cursor semantics are not defined

The route includes a cursor parameter, but the API does not specify what the cursor represents or how it preserves ordering. Define it as an opaque token derived from the feed sort key (for example publish time plus tie-breaker article ID) so pagination remains stable across requests.
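
One way to define such a token is to serialize the sort key and base64-encode it, as in this minimal sketch. The payload keys `t` and `id` and the millisecond timestamp are assumptions for illustration; the design itself does not specify a format.

```python
import base64
import json


def encode_cursor(publish_time_ms: int, article_id: str) -> str:
    """Pack the feed sort key (publish time plus tie-breaker ID) into an opaque token."""
    payload = json.dumps({"t": publish_time_ms, "id": article_id}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str) -> tuple[int, str]:
    """Recover the sort key from a token produced by encode_cursor."""
    payload = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return payload["t"], payload["id"]


if __name__ == "__main__":
    token = encode_cursor(1_700_000_000_000, "article-42")
    print(token)
    print(decode_cursor(token))
```

Keeping the token opaque means the server can later change the sort key or add a version field without breaking clients. Note that base64 is an encoding, not a protection: a production API would also validate or sign the token.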

info

Primary article read API is minimal

For the stated requirements, a feed endpoint is sufficient, but CRUD coverage for the primary entity is only partial because there is no article read-by-id route. Adding something like GET /articles/{articleId} would make the API more complete if clients need to open a specific article from the feed.

✅ Good

Reasonable separation of ingest and read paths

The design separates article collection/scheduling from the user-facing read service, which is a solid high-level pattern for this problem. It reduces coupling between bursty publisher ingestion and high-QPS feed reads, making the system easier to scale independently.

✅ Good

Search-oriented store fits regional feed queries

Using Elasticsearch for article documents with fields like timestamp and region is a sensible choice for serving regional feeds sorted by recency. It aligns well with the functional requirement to aggregate articles and retrieve them efficiently by region.

✅ Good

Caching and CDN improve read scalability

Adding Redis for hot regional content and a CDN for thumbnails is a practical architectural choice for the stated DAU assumptions. This helps offload repeated feed and media access from the core services and storage systems.

critical

Pagination design is missing for infinite scroll consistency

The functional requirement explicitly asks for infinite scroll with consistent pagination, but the HLD only shows GET /feed?region and a cache of the last 5 minutes. There is no architectural mechanism for stable page boundaries, such as cursor-based pagination using a composite sort key like (publishTime, articleId) and search_after or equivalent. Without that, users will see duplicates or skipped articles as new content arrives. Add a cursor-based read flow and ensure both Elasticsearch queries and cache keys are built around that cursor.
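
As a rough illustration of the read flow, the request body for an Elasticsearch query using `search_after` could be built as follows. The field names `region`, `publish_time`, and `article_id` are hypothetical, and the sketch assumes `region` and `article_id` are mapped as keyword fields so they can be filtered and sorted on. It only constructs the body and does not call any client.

```python
from typing import Optional


def build_feed_query(region: str, limit: int,
                     after: Optional[tuple[int, str]] = None) -> dict:
    """Build a search body sorted by (publish_time, article_id), newest first.

    `after` is the sort key of the last article on the previous page.
    """
    body = {
        "size": limit,
        "query": {"term": {"region": region}},
        # The tie-breaker on article_id makes the ordering total and stable.
        "sort": [{"publish_time": "desc"}, {"article_id": "desc"}],
    }
    if after is not None:
        # Values must line up one-to-one with the sort clauses above.
        body["search_after"] = [after[0], after[1]]
    return body


if __name__ == "__main__":
    print(build_feed_query("us-west", 20))
    print(build_feed_query("us-west", 20, after=(1_700_000_000_000, "article-42")))
```

The first page omits `search_after`; every later page passes the sort values of the last article returned, which is exactly the information a feed cursor needs to carry.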

warning

Several components and flows are orphaned or unclear

There are multiple unnamed nodes and blank-to-blank connections, plus CDC links where the source or destination is missing. That makes the end-to-end architecture hard to validate and suggests some components are not logically connected. Clean up the diagram so every queue, CDC stage, and service has explicit producers and consumers, and remove placeholder nodes that do not participate in the flow.

warning

Ingestion pipeline is not fully connected end-to-end

The design includes Scheduling Service, Kafka, Collection Service, web crawling, RSS feeds, websites, and a webhook path, but the actual control/data flow between them is incomplete. For example, Scheduling Service is not shown dispatching work to Kafka or collectors, and web crawling is not shown writing normalized articles to storage or Elasticsearch. To make the design correct end-to-end, explicitly connect scheduler -> queue -> collectors/crawlers -> normalization/dedup -> Elasticsearch/S3.
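
The connected flow can be sketched in memory to show what each hop is responsible for. Everything here is a stand-in: `queue.Queue` plays the role of Kafka, a dict plays the role of Elasticsearch, `fetch` is a hypothetical callable that returns raw feed items, and deduplication by URL hash is one assumed strategy among several.

```python
import hashlib
from queue import Queue
from typing import Callable, Iterable

SYNC_INTERVAL_S = 300  # assumed equal to the 5-minute scan interval in the design


def schedule(publishers: list[dict], now: float, jobs: Queue) -> None:
    """Scheduler: enqueue every publisher that is due for a sync."""
    for publisher in publishers:
        if now - publisher["last_synced"] >= SYNC_INTERVAL_S:
            jobs.put(publisher["id"])


def normalize(raw: dict, publisher_id: str) -> dict:
    """Normalization: map a raw feed item to one document shape with a dedup key."""
    url = raw["link"].strip()
    return {
        "article_id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "publisher_id": publisher_id,
        "title": raw["title"].strip(),
        "url": url,
    }


def collect(jobs: Queue, fetch: Callable[[str], Iterable[dict]], index: dict) -> int:
    """Collector: drain the queue, normalize items, and write each article once."""
    written = 0
    while not jobs.empty():
        publisher_id = jobs.get()
        for raw in fetch(publisher_id):
            doc = normalize(raw, publisher_id)
            if doc["article_id"] not in index:  # dedup on the URL hash
                index[doc["article_id"]] = doc
                written += 1
    return written
```

Drawing the diagram with the same explicit producer and consumer for every stage would resolve most of the orphaned-node concerns.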

warning

Single metadata database appears to be a bottleneck

Postgres is shown storing publishers and users, and the scheduler scans it for unsynced publishers every 5 minutes. Under thousands of publishers and very large user scale, a single primary database for mixed metadata and scheduling state can become an operational bottleneck or single point of failure. A better HLD would isolate publisher scheduling metadata from user data and show read replicas/partitioning or another scalable metadata strategy.

warning

Read-path redundancy is underspecified for the stated scale

The diagram shows duplicate read services and some Redis replicas, but Elasticsearch redundancy, shard strategy, and multi-node deployment are not shown. For 100M DAU with spikes to 500M, the feed-serving datastore must be explicitly distributed and fault tolerant. Add an Elasticsearch cluster with primary/replica shards and show how read services load balance across it.

info

CDC-to-cache flow needs clearer purpose

Using CDC to refresh Redis can be useful, but here the cache is described as 'articles per region for last 5 minutes,' which may not align with cursor-based feed access beyond the hottest window. Clarify whether CDC is warming only the first page per region or maintaining multiple cursor windows. If multiple windows are not required, consider limiting cache scope to hot first-page results and letting deeper pages query Elasticsearch directly.
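
A first-page-only cache policy might look like the following sketch. The dict stands in for Redis, `query_index` is a hypothetical callable standing in for the Elasticsearch read, and the key format is an assumption. Expiry and CDC-driven refresh are deliberately left out.

```python
from typing import Callable, Optional


def cache_key(region: str, cursor: Optional[str]) -> Optional[str]:
    """Only requests without a cursor (the first page) are cacheable here."""
    if cursor is None:
        return f"feed:{region}:first-page"
    return None


def get_feed(region: str, cursor: Optional[str], cache: dict,
             query_index: Callable[[str, Optional[str]], dict]) -> dict:
    """Serve the first page from cache when possible; deeper pages hit the index."""
    key = cache_key(region, cursor)
    if key is not None and key in cache:
        return cache[key]
    page = query_index(region, cursor)
    if key is not None:
        cache[key] = page
    return page
```

This keeps the cache small and easy to invalidate per region, at the cost of sending every deep-scroll request to the datastore.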
