
Food Delivery System — system design by AgileViper46

Strong Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.


The user starts by searching for a product along with their current latitude and longitude. The Search Service first performs a geospatial lookup in Elasticsearch to identify nearby distribution centers (DCs). Since a simple radius search does not guarantee delivery within one hour, the service then determines the actual travel time from each candidate DC to the user's location. To avoid making expensive Google Maps API calls for every request, travel times are cached in Redis using geohash (or H3 cell) based keys. If a cache entry exists for the user's geohash region, the cached delivery times are used. Otherwise, Google Maps Distance Matrix is called, the results are stored in Redis with a short TTL, and then returned. Any DC whose delivery time exceeds one hour is filtered out.

Once the system has identified all DCs capable of delivering within one hour, it performs a product availability lookup. Inventory data is replicated from PostgreSQL into Elasticsearch through a CDC pipeline (for example, Debezium → Kafka → Elasticsearch). This allows low-latency searches without putting load on the transactional database. Elasticsearch returns products that are available in the filtered DCs. We accept that search results may be slightly stale because the final inventory validation happens during order placement.

When a user places an order, the Order Service receives the request and calls the Inventory Service to validate stock availability. If sufficient inventory exists for all requested items, the Inventory Service starts a database transaction. Within the transaction, inventory is reserved by creating records in an Inventory Hold table, available inventory is reduced, the order record is created, and an Outbox event is written. All of these operations are committed atomically to guarantee consistency and prevent overselling. As soon as the transaction commits successfully, the Order Service immediately returns a success response to the user. The user does not wait for downstream processing such as payment.

The Outbox table is monitored through CDC, which publishes events to Kafka. Downstream services such as Payment Service consume these events asynchronously. Once payment processing completes, the Payment Service publishes a success or failure event, and the Order Service updates the order status accordingly. Users can retrieve the latest state through the order history APIs.

If payment fails or times out, a background cleanup job scans for expired inventory holds. For each expired hold, inventory is released back to the available pool, the hold record is removed, and the order is marked as failed. This ensures reserved inventory does not remain locked indefinitely.

To handle client retries and network failures safely, order creation is protected using idempotency keys. If the same request is received multiple times, the system returns the previously created order instead of creating duplicate reservations or duplicate orders.

Overall, the design uses PostgreSQL as the source of truth for orders and inventory, Elasticsearch for low-latency product and geo searches, Redis for caching travel-time calculations, Kafka for asynchronous event propagation, CDC for keeping search indexes synchronized, and inventory reservations to provide strong consistency during ordering while still achieving fast search performance.
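
The expired-hold cleanup described above could look roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table layout, the `expires_at` column, the status strings, and the `release_expired_holds` helper are all assumptions made for illustration, not the candidate's actual schema.

```python
import sqlite3

# Hypothetical schema; the design names these concepts but not their columns.
SCHEMA = """
CREATE TABLE inventory (product_id TEXT, dc_id TEXT, available INTEGER,
                        PRIMARY KEY (product_id, dc_id));
CREATE TABLE inventory_hold (order_id TEXT, product_id TEXT, dc_id TEXT,
                             qty INTEGER, expires_at INTEGER);
CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT);
"""


def release_expired_holds(db, now):
    """Return held stock, fail the affected orders, and delete the expired holds."""
    with db:  # one transaction: commits on success, rolls back on any exception
        expired = db.execute(
            "SELECT order_id, product_id, dc_id, qty FROM inventory_hold "
            "WHERE expires_at <= ?", (now,)).fetchall()
        for order_id, product_id, dc_id, qty in expired:
            db.execute(
                "UPDATE inventory SET available = available + ? "
                "WHERE product_id = ? AND dc_id = ?", (qty, product_id, dc_id))
            # Only orders still awaiting payment are failed; the guard avoids
            # overwriting a status that a payment event already changed.
            db.execute(
                "UPDATE orders SET status = 'FAILED' "
                "WHERE order_id = ? AND status = 'PENDING_PAYMENT'", (order_id,))
        db.execute("DELETE FROM inventory_hold WHERE expires_at <= ?", (now,))
    return len(expired)
```

Running the release, the status change, and the delete in one transaction means a crash halfway through cannot return stock twice on the next scan.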

Hire Signal: Hire

Across NFR, API, and HLD, the design shows strong senior-level judgment on consistency boundaries, transactional correctness, retries, and read/write separation. It is not a top-tier answer because the failure-mode story, hot-row contention handling, and some domain/API modeling details are underdeveloped, but the core architecture is sound and clearly above the bar.

⭐ Excellent

Consistency model is explicitly chosen and enforced on the write path

The candidate does more than say 'strong consistency'—they explain what must be strongly consistent and how: inventory validation, reservation, order creation, and outbox write happen in one database transaction. That directly ties the consistency requirement to the oversell failure mode and shows they understand where eventual consistency is acceptable versus where it is not.
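
To make the single-transaction write path concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table names, the status string, the event payload, and the `make_db` / `place_order` helpers are assumptions for illustration; the submission does not specify a schema.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical minimal schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE inventory (product_id TEXT, dc_id TEXT, available INTEGER,
                                PRIMARY KEY (product_id, dc_id));
        CREATE TABLE inventory_hold (order_id TEXT, product_id TEXT,
                                     dc_id TEXT, qty INTEGER);
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT);
        CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
    """)
    return db


def place_order(db, order_id, dc_id, items):
    """Reserve every item, create the order and the outbox event, or change nothing."""
    try:
        with db:  # commits on success, rolls back if anything raises
            for product_id, qty in items:
                # Conditional decrement: the row only changes if enough stock exists.
                cur = db.execute(
                    "UPDATE inventory SET available = available - ? "
                    "WHERE product_id = ? AND dc_id = ? AND available >= ?",
                    (qty, product_id, dc_id, qty))
                if cur.rowcount != 1:
                    raise ValueError(f"insufficient stock for {product_id}")
                db.execute("INSERT INTO inventory_hold VALUES (?, ?, ?, ?)",
                           (order_id, product_id, dc_id, qty))
            db.execute("INSERT INTO orders VALUES (?, 'PENDING_PAYMENT')", (order_id,))
            db.execute("INSERT INTO outbox (payload) VALUES (?)",
                       (json.dumps({"type": "OrderCreated", "order_id": order_id}),))
        return True
    except ValueError:
        return False
```

If any line item cannot be reserved, the rollback also undoes the decrements already applied to earlier items, so a multi-item order is reserved entirely or not at all.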

✅ Good

Latency targets are connected to concrete read-path optimizations

The p95 < 100ms search goals are supported by specific mechanisms: Elasticsearch for geo and inventory search, Redis caching for travel-time lookups, and acceptance of slightly stale search results with final validation at order time. That is a reasonable NFR-driven separation of fast reads from transactional writes.

✅ Good

Scalability target is at least anchored to stated traffic assumptions

The candidate references 5K concurrent orders and a 5:1 read/write pattern, then uses asynchronous propagation and cached/indexed read paths to keep the transactional system focused on order placement. The numbers are not left completely floating; they influence the split between serving reads and protecting the write path.

warning

Availability is not called out as an explicit quality target

You mention latency, consistency, and scale, but what happens when Elasticsearch, Redis, Google Maps, or Kafka is degraded? Without an explicit availability objective or degraded-mode expectation, it is hard to judge whether the system should fail closed, serve stale results, or partially operate during dependency outages.

warning

Scalability numbers are incomplete for the hottest path

You state 5K concurrent orders and roughly 25K reads, but have you considered what that means for the transactional inventory reservation path specifically? Strong consistency is fine, but the design should connect the concurrency assumption to expected contention on popular SKUs or distribution centers; otherwise the write path may meet correctness but miss throughput under bursty demand.

info

Search staleness is justified, but the tolerance is not quantified

You clearly chose eventual consistency for search and strong consistency for ordering, which is good. You could improve this by stating the acceptable freshness window for CDC-driven Elasticsearch updates and cache TTLs, so the trade-off is measurable rather than implicit.

info

Latency targets would be stronger with end-to-end scope

The p95 < 100ms targets are useful, but have you considered whether they apply to just internal search service processing or the full user-visible request including geo lookup, cache miss handling, and distance API dependency? Defining the measurement boundary would make the NFRs more defensible.

✅ Good

Core nouns for the main flow are identified

The design names the main domain entities needed for the stated requirements: User, Distribution Center, Product, Inventory, and Order. These cover the primary customer flows of searching deliverable items, placing a multi-item order, and viewing order history.

✅ Good

Inventory hold is introduced as a meaningful domain concept

In the explanation, the candidate adds an Inventory Hold concept during checkout. That is a useful domain entity for this problem because it explains how stock is reserved between search and final order completion, and it makes the ordering lifecycle more coherent than treating inventory as a single mutable number.

warning

Order-to-product relationship is not modeled explicitly

Have you considered what happens when one order contains multiple items? The requirements explicitly require multi-item orders, but the entities stop at Order and Product without introducing an OrderItem or line-item relationship. Without that join entity, the many-to-many relationship between orders and products is left implicit, and it becomes unclear how quantities, per-item fulfillment, and order history are represented.

warning

Inventory relationships are underspecified

Have you considered how Inventory connects to both Product and Distribution Center? Availability by location depends on inventory being scoped to a specific product at a specific DC, but that relationship is not stated explicitly. If Inventory is not clearly modeled as the intersection of Product and Distribution Center, the happy path for 'what items are available from which nearby DCs' becomes ambiguous.

info

Make user-to-order and order lifecycle relationships explicit

You could improve this by stating the key relationships directly: one User has many Orders, one Order has many OrderItems, one Product appears in many OrderItems, and one Distribution Center has many Inventory records. The explanation implies some of this, but writing it down would make the domain model much clearer and easier to validate.
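
One way to write those relationships down is a small set of typed records. The sketch below is illustrative only: the field names, the default status string, and the `index_inventory` helper are assumptions, not part of the submitted design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Inventory:
    """Stock of one Product at one Distribution Center (the intersection entity)."""
    product_id: str
    dc_id: str
    available: int


@dataclass(frozen=True)
class OrderItem:
    """Join between Order and Product, carrying the per-line quantity."""
    product_id: str
    quantity: int


@dataclass
class Order:
    """One User has many Orders; one Order has many OrderItems."""
    order_id: str
    user_id: str
    status: str = "PENDING_PAYMENT"
    items: list[OrderItem] = field(default_factory=list)


def index_inventory(records):
    """Key inventory by (product_id, dc_id) so availability is always DC-scoped."""
    return {(r.product_id, r.dc_id): r for r in records}
```

Keying inventory by the `(product_id, dc_id)` pair makes it impossible to ask "is this product available?" without also saying where.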

warning

Write QPS is derived from concurrency, not arrival rate

Have you considered what happens if 5K concurrent orders does not mean 5K write QPS? Concurrency and QPS are different dimensions: 5K in-flight orders could correspond to far lower or higher request throughput depending on latency. If the write baseline is off, every downstream estimate for database capacity, Kafka throughput, and Elasticsearch update volume will also be off. You could improve this by converting 100K orders/day into average and peak order creation QPS, then using the 5K concurrent figure to reason about connection pools and in-flight work separately.
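
The relationship between the two dimensions is Little's Law: in-flight work equals arrival rate times time in system. A short worked sketch follows; the 200 ms request latency is an assumed figure chosen only to make the arithmetic concrete.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(events_per_day):
    """Flat daily average arrival rate."""
    return events_per_day / SECONDS_PER_DAY


def in_flight(arrival_qps, latency_s):
    """Little's Law: concurrent requests = arrival rate * time in system."""
    return arrival_qps * latency_s


def implied_qps(concurrency, latency_s):
    """The arrival rate needed to sustain a given concurrency at a given latency."""
    return concurrency / latency_s


if __name__ == "__main__":
    avg = average_qps(100_000)                  # about 1.16 orders/s
    print(f"average order QPS: {avg:.2f}")
    print(f"in-flight at 200 ms: {in_flight(avg, 0.2):.2f}")
    print(f"QPS implied by 5K concurrent at 200 ms: {implied_qps(5_000, 0.2):.0f}")
```

Under that assumed latency, 100K orders/day averages a little over one order per second, while sustaining 5K in-flight orders would require roughly 25,000 orders per second. The two stated numbers therefore describe very different load levels and should be reconciled.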

warning

Peak traffic and burst assumptions are missing

Have you considered what happens during lunch/dinner spikes or regional bursts? 100K orders/day averages to a low steady-state rate, but this kind of system is usually highly bursty. Without a peak multiplier or busy-hour estimate, it is hard to tell whether PostgreSQL, Redis, Elasticsearch, and Kafka are sized for real production load or just daily averages. You could improve this by stating a peak-hour fraction or a simple burst factor and sizing hot-path components against that.
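
A busy-hour estimate can be as simple as the sketch below. The 20% peak-hour share is an assumption for illustration, not a figure from the submission; the 5:1 read/write ratio is the one the candidate stated.

```python
def peak_hour_qps(events_per_day, peak_hour_fraction):
    """Average rate inside the busiest hour, given its share of daily volume."""
    return events_per_day * peak_hour_fraction / 3600


def burst_factor(peak_hour_fraction):
    """Busiest-hour rate relative to a flat day (where each hour carries 1/24)."""
    return peak_hour_fraction * 24


if __name__ == "__main__":
    share = 0.20  # assumed: 20% of daily orders land in the dinner hour
    writes = peak_hour_qps(100_000, share)
    print(f"peak order QPS: {writes:.2f}")             # about 5.56
    print(f"peak read QPS at 5:1: {writes * 5:.2f}")   # about 27.78
    print(f"burst factor: {burst_factor(share):.1f}x") # 4.8x
```

Sizing against the peak-hour figure, plus some headroom for sub-hour spikes, gives a more defensible baseline than the daily average.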

warning

No end-to-end capacity chain from traffic to storage and bandwidth

Have you considered what happens when order history, inventory CDC, and search indexing grow over time? The section gives only read/write QPS, but for a senior-level capacity discussion I would expect at least rough storage growth for orders, order items, inventory holds, Kafka retention, Elasticsearch index size, and Redis cache footprint. Without that chain, it is difficult to judge whether the chosen infrastructure still fits after months of growth. You could improve this by adding back-of-envelope estimates for data per order, retention period, replication overhead, and CDC/event volume.
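
A back-of-envelope estimate of that kind might look like the following sketch. Every input except the 100K orders/day figure is a placeholder assumption (row sizes, items per order, retention, replication factor) that the candidate would replace with their own numbers.

```python
def retained_order_storage_gb(orders_per_day, items_per_order, order_row_bytes,
                              item_row_bytes, retention_days, replication_factor):
    """Rough storage for orders plus their line items over the retention window."""
    bytes_per_order = order_row_bytes + items_per_order * item_row_bytes
    total_bytes = (orders_per_day * bytes_per_order
                   * retention_days * replication_factor)
    return total_bytes / 1e9  # decimal gigabytes


if __name__ == "__main__":
    gb = retained_order_storage_gb(
        orders_per_day=100_000,
        items_per_order=3,        # assumed average basket size
        order_row_bytes=500,      # assumed, including index overhead
        item_row_bytes=100,       # assumed
        retention_days=365,       # assumed retention
        replication_factor=3,     # assumed primary plus two replicas
    )
    print(f"about {gb:.1f} GB per year of retained orders")
```

With these placeholder inputs the total is under 100 GB per year, which would suggest order storage is not the limiting factor. The same function shape can be reused for Kafka retention or Elasticsearch index size with different inputs.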

warning

Read QPS is not tied to the actual access patterns in the design

Have you considered what happens when search traffic dominates order traffic? The design includes product availability search, geospatial DC lookup, travel-time cache lookups, order placement, and order history reads, but the single 5:1 read/write ratio does not explain how much load lands on Elasticsearch versus Redis versus PostgreSQL. If search is much hotter than ordering, Elasticsearch and Redis may be the real bottlenecks even if order writes are modest. You could improve this by breaking read traffic into major flows and mapping each to the serving system.

info

Component choices are not justified by the stated scale

You could improve this by explaining why Kafka, CDC, Elasticsearch, and Redis are warranted for 100K orders/day and 100K catalog items under these assumptions. The architecture may still be valid, but in a capacity review I want to hear whether these are chosen because of expected search fanout, decoupled indexing throughput, cache hit rate, or future growth rather than as default components.

⭐ Excellent

Idempotent order creation is explicitly handled

The explanation calls out idempotency keys for order creation and explains the retry behavior: if the client resends the same create request after a timeout or network failure, the API returns the previously created order instead of creating duplicate orders or reservations. That is exactly the kind of failure-mode thinking expected at senior level for a POST order API.
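
The retry behavior can be sketched in a few lines. This is an in-memory illustration only: the `OrderApi` class, the payload fingerprinting, and the conflict exception are hypothetical, and a real service would persist the key in the same database transaction as the order, typically behind a unique constraint.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same idempotency key was reused with a different request body."""


class OrderApi:
    def __init__(self):
        self._by_key = {}  # idempotency key -> (payload fingerprint, order)
        self._next_id = 1

    def create_order(self, idempotency_key, payload):
        # Fingerprint the body so a reused key with a different payload is detected.
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()
        seen = self._by_key.get(idempotency_key)
        if seen is not None:
            if seen[0] != fingerprint:
                raise IdempotencyConflict(idempotency_key)
            return seen[1]  # retry: return the order created the first time
        order = {"order_id": f"ord-{self._next_id}",
                 "status": "PENDING_PAYMENT", **payload}
        self._next_id += 1
        self._by_key[idempotency_key] = (fingerprint, order)
        return order
```

A client that times out and resends with the same key gets the original order back, so no second reservation is made.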

✅ Good

Core customer flows are mostly covered by the routes

The API set supports the main functional requirements: searching products by location, checking inventory, creating a multi-item order, listing a user's orders, and fetching a specific order or status. A client can reasonably complete the end-to-end customer flow through these endpoints.

✅ Good

Order status is separated from order creation latency

The explanation makes it clear that order creation returns immediately while downstream processing happens asynchronously, and the client can observe progress through order history/status APIs. That is a sensible protocol contract for a system where fulfillment-related work continues after the initial request.

warning

Search and availability contract is split awkwardly across two APIs

Have you considered what the client experience looks like when it has to call `GET /products?...` and then separately call `GET /inventories/{product_id}?shops=...` for each product? For a search page with many results, this turns into an N+1 API pattern and makes it unclear which endpoint is the source of truth for 'deliverable within 1 hour'. You could improve this by returning deliverable availability in the search response itself, or by offering a batch availability endpoint for multiple product IDs.

warning

List orders endpoint has no pagination or filtering story

What happens when a long-time customer has hundreds or thousands of past orders and calls `GET /me/Orders`? Without pagination, cursors, or at least limit/offset parameters, the response can become large and slow, and clients have no clean way to incrementally load history. You could improve this by adding cursor-based pagination and optional filters such as date range or status.
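
A minimal sketch of cursor-based pagination over an order list, newest first. The cursor encoding, the `created_at` / `order_id` field names, and the response shape are assumptions for illustration; in a real service the filter would be a keyset `WHERE` clause in the database rather than an in-memory scan.

```python
import base64
import json


def encode_cursor(created_at, order_id):
    """Opaque cursor pointing at the last item the client has seen."""
    raw = json.dumps([created_at, order_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor):
    created_at, order_id = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return created_at, order_id


def list_orders(orders, limit=20, cursor=None):
    """Return one page of orders, newest first, plus a cursor for the next page."""
    ordered = sorted(orders, key=lambda o: (o["created_at"], o["order_id"]),
                     reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Keep only items strictly older than the last one already returned.
        ordered = [o for o in ordered
                   if (o["created_at"], o["order_id"]) < boundary]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = (encode_cursor(page[-1]["created_at"], page[-1]["order_id"])
                   if has_more else None)
    return {"items": page, "next_cursor": next_cursor}
```

Including `order_id` as a tie-breaker keeps paging stable when two orders share the same timestamp.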

warning

Error behavior is not defined for key edge cases

What does the client see when some requested items are out of stock, the delivery location is not serviceable within one hour, the idempotency key conflicts with a different payload, or the order is still processing? Right now the routes are listed, but there is no status-code/error-shape contract, so clients cannot reliably distinguish retryable failures from user-correctable ones. You could improve this by defining a consistent error body plus expected codes like 400/404/409/422/429/503 and retry guidance.

info

Resource design could be cleaner and more consistent

You could improve this by making the REST surface more resource-oriented and consistent in naming/casing. Examples: `/orders` instead of `/ordres`, `/me/orders` instead of `/me/Orders`, and possibly exposing order status as a field on `GET /me/orders/{id}` unless there is a strong reason for a separate status resource. Cleaner resource design makes the API easier to learn and less error-prone for clients.

info

Create-order request shape is underspecified for multi-item orders

What happens if the client sends mismatched `products` and `items` arrays, duplicate product IDs, or invalid quantities? The current body sketch suggests separate lists, which is ambiguous for a multi-item order API. You could improve this by using an explicit line-item structure such as `items: [{product_id, quantity}]`, which makes validation and partial-stock error reporting much clearer.
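
With an explicit line-item list, validation becomes a single pass over the items. The sketch below assumes the `items: [{product_id, quantity}]` shape suggested above; the error message wording and the `validate_items` helper are illustrative.

```python
def validate_items(items):
    """Return a list of error strings; an empty list means the payload is valid."""
    if not isinstance(items, list) or not items:
        return ["items must be a non-empty list"]
    errors = []
    seen = set()
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"items[{i}] must be an object")
            continue
        product_id = item.get("product_id")
        quantity = item.get("quantity")
        if not isinstance(product_id, str) or not product_id:
            errors.append(f"items[{i}].product_id is required")
        elif product_id in seen:
            errors.append(f"items[{i}].product_id duplicates an earlier line")
        else:
            seen.add(product_id)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
            errors.append(f"items[{i}].quantity must be a positive integer")
    return errors
```

Because each error names the offending index, the same structure can later carry per-item out-of-stock results.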

⭐ Excellent

Clear split between search and transactional paths

Using Elasticsearch for low-latency geo/product discovery while keeping PostgreSQL as the source of truth for orders and inventory is a strong architectural choice. It shows the candidate understands the trade-off between fast, slightly stale reads and strongly consistent writes, which fits the stated requirements well.

⭐ Excellent

Order flow accounts for consistency and retries

The explanation adds a concrete transactional reservation flow with inventory holds, order creation, and outbox writes committed atomically. Combined with idempotency keys, this is a thoughtful design for preventing oversell and duplicate orders under retries and concurrent ordering.

✅ Good

Expensive delivery-time computation is cached

Caching Google Maps travel-time lookups in Redis by geohash/H3 region is a good optimization for the hot search path. It directly addresses the latency target and avoids repeatedly calling an external dependency for nearby users.
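
A rough sketch of that caching scheme follows. The geohash encoder is the standard bit-interleaving algorithm; everything else is assumed for illustration: an in-memory dict stands in for Redis, `fetch_minutes` is a hypothetical callable standing in for the Distance Matrix call, and the key format, precision of 5, and 300-second TTL are placeholder choices.

```python
import time

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=5):
    """Encode a coordinate as a geohash string (longitude bit first)."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


class TravelTimeCache:
    def __init__(self, fetch_minutes, ttl_s=300, clock=time.monotonic):
        self._fetch = fetch_minutes  # hypothetical stand-in for the Maps call
        self._ttl = ttl_s
        self._clock = clock
        self._store = {}             # key -> (expires_at, minutes)

    def minutes(self, dc_id, lat, lon):
        key = f"tt:{dc_id}:{geohash(lat, lon)}"
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # cache hit: nearby users share the entry
        value = self._fetch(dc_id, lat, lon)
        self._store[key] = (now + self._ttl, value)
        return value
```

The precision controls the trade-off: coarser cells raise the hit rate but let one cached travel time represent users who are farther apart, which matters near the one-hour boundary.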

✅ Good

Async downstream processing keeps order placement responsive

Publishing order events through CDC/Kafka and handling payment/status updates asynchronously is a good use of decoupling. It prevents the user-facing order API from blocking on slower downstream work.

critical

Search path depends on synchronous Google Maps calls on cache miss

What happens when Redis misses spike or Google Maps is slow/rate-limited? The product search path now waits on an external API before it can decide 1-hour deliverability, so p95 latency can easily blow past the 100ms target and search availability becomes coupled to a third-party service. You could improve this by precomputing delivery zones/isochrones per DC, warming the cache for hot regions, and defining a degraded fallback when Maps is unavailable.
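
One possible shape for such a degraded fallback is sketched below: if the live lookup fails, estimate travel time from straight-line distance at a deliberately pessimistic speed. The `travel_minutes` helper, the `fetch_minutes` callable, and the 20 km/h fallback speed are assumptions for illustration, not a recommendation from the submission.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def travel_minutes(dc, user, fetch_minutes, fallback_kmh=20.0):
    """Return (minutes, source). Falls back to an estimate if the live call fails.

    dc and user are (lat, lon) tuples; fetch_minutes is a hypothetical callable
    wrapping the external travel-time API.
    """
    try:
        return fetch_minutes(dc, user), "live"
    except Exception:
        # Broad catch is deliberate in this sketch: any failure of the external
        # dependency should degrade rather than fail the search request.
        km = haversine_km(dc[0], dc[1], user[0], user[1])
        return km / fallback_kmh * 60, "estimated"
```

Returning the source alongside the value lets the caller decide whether an estimated time is good enough to show an item as deliverable or whether to hide borderline DCs while degraded.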

warning

Inventory service is described in the flow but missing from the actual HLD

The end-to-end ordering story relies on the Order Service calling an Inventory Service for validation and reservation, but that component is not present in the diagram or connections. Have you considered where the reservation logic actually lives and how it scales under 5K concurrent orders? If it is inside Order Service, show that explicitly; if separate, add it and its DB interaction so the write path is unambiguous.

warning

Postgres is the main bottleneck for concurrent reservation-heavy writes

Have you considered what happens when many users order the same hot items at once? The design pushes all inventory decrements, hold creation, order writes, and status updates through a single PostgreSQL primary. At the stated concurrency, hot rows for popular product/DC combinations can become lock contention points and the primary becomes the first scaling limit. You could improve this by calling out partitioning/sharding strategy for inventory, careful row-level locking semantics, and how hot-item contention is handled.

warning

Single points of failure are not addressed for core stateful components

What happens if the PostgreSQL primary, Redis cluster coordinator, Kafka broker set, or Elasticsearch node handling queries fails? The diagram shows replicas for reads, but failover/write continuity for the primary transactional path is not described. Without a concrete HA story, order placement can stop entirely on a primary failure. You could improve this by specifying leader failover for Postgres, multi-node Kafka/ES deployments, and how services reconnect and recover.

info

Some drawn components are weakly integrated into the visible flow

The CDC workers and Payment Service are explained, but the diagram mixes direct Postgres-to-CDC and replica-to-Elasticsearch update paths in a way that is hard to trace. You could improve this by making one consistent event/indexing flow explicit so there are no ambiguous or partially orphaned components in the HLD.
