DrawLint.ai

Ticket Booking System — system design by AgileViper46

Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.


# Ticket Booking Flow

## 1. User Initiates Booking

The user selects seat A1 and submits a booking request.

```http
POST /bookings
{ "eventId": "E1", "seatIds": ["A1"] }
```

The client includes an idempotency key so that retries caused by network failures do not create duplicate bookings.

---

## 2. Seat Hold Acquisition

The Booking Service first attempts to acquire a temporary hold on the seat using Redis.

```redis
SET seat:A1 booking123 NX EX 300
```

Where:

* NX ensures only one user can acquire the hold.
* EX 300 sets a 5-minute expiration.

If another user attempts to reserve the same seat concurrently, the command fails and the booking request is rejected.

---

## 3. Persist Booking and Seat Hold

Redis is not treated as the source of truth. After successfully acquiring the Redis hold, the Booking Service updates PostgreSQL in a transaction.

### Booking Table

```text
booking_id
user_id
event_id
status
created_at
```

Status:

```text
PENDING_PAYMENT
CONFIRMED
EXPIRED
FAILED
```

### Seat Table

```text
seat_id
event_id
status
booking_id
hold_expires_at
```

Status:

```text
AVAILABLE
HELD
BOOKED
```

The seat is updated as:

```text
Seat A1
status = HELD
booking_id = booking123
hold_expires_at = now() + 5 minutes
```

The Booking Service creates:

```text
Booking(status=PENDING_PAYMENT)
```

and writes a:

```text
BookingCreated
```

event to the Outbox table within the same transaction. This guarantees atomicity between database updates and event publication.

---

## 4. CDC Publishes Booking Event

A CDC process reads the Outbox table and publishes:

```text
BookingCreated
```

to Kafka.

```text
Booking Service
↓
Outbox
↓
CDC
↓
Kafka
```

---

## 5. Payment Service Creates Payment Record

The Payment Service consumes the BookingCreated event. Before calling the payment gateway, it creates a row in its own Payments database.

### Payments Table

```text
payment_id
booking_id
status
payment_url
gateway_reference
created_at
```

Status:

```text
CREATING_SESSION
SESSION_CREATED
PAYMENT_SUCCESS
PAYMENT_FAILED
REFUNDED
```

The booking_id is unique to prevent duplicate payment sessions during retries.

---

## 6. Create Payment Session

The Payment Service creates a payment session with the external payment provider using an idempotency key.

```text
Payment Service
↓
Payment Gateway
```

The gateway returns:

```text
gateway_reference
payment_url
```

The Payment Service updates its database:

```text
status = SESSION_CREATED
payment_url = ...
gateway_reference = ...
```

If the Payment Service crashes after creating the session but before saving the response, retries are protected using the gateway idempotency key and the unique booking_id constraint.

---

## 7. Client Polls Booking Status

The Booking Service immediately returns:

```json
{ "bookingId": "booking123", "status": "PENDING_PAYMENT" }
```

The client periodically polls:

```http
GET /bookings/{bookingId}
```

The Booking Service checks payment status through a service API or synchronized payment state and returns:

```json
{ "status": "PAYMENT_READY", "paymentUrl": "https://gateway.com/pay/xyz" }
```

The client is redirected to the payment gateway.

---

## 8. User Completes Payment

The user completes payment on the external payment gateway.

---

## 9. Payment Confirmation

The payment gateway sends a webhook.

```text
Payment Gateway
↓
Load Balancer
↓
Any Payment Service Instance
```

The webhook is not tied to the instance that created the payment session. Since payment state is persisted in the Payments database, any healthy Payment Service instance can process the webhook.

The Payment Service updates:

```text
status = PAYMENT_SUCCESS
```

and publishes:

```text
PaymentSucceeded
```

to Kafka. Webhook processing is idempotent because gateways may retry webhooks multiple times.

---

## 10. Booking Service Confirms Booking

The Booking Service consumes:

```text
PaymentSucceeded
```

Before marking the booking successful, it validates ownership of the seat.

Expected state:

```text
Seat A1
status = HELD
booking_id = booking123
```

Atomic transition:

```text
Seat: HELD → BOOKED
Booking: PENDING_PAYMENT → CONFIRMED
```

Only the booking that owns the seat hold can perform this transition.

---

## 11. Hold Expiration Handling

A background job continuously scans:

```sql
status = 'HELD' AND hold_expires_at < NOW()
```

Expired holds are released.

```text
Seat: HELD → AVAILABLE
Booking: PENDING_PAYMENT → EXPIRED
```

Redis keys naturally expire via TTL. PostgreSQL remains the source of truth.

---

## 12. Redis Failure Recovery

If Redis crashes:

```text
seat:A1 key disappears
```

Seat ownership is not lost because PostgreSQL contains:

```text
status = HELD
booking_id = booking123
hold_expires_at
```

Redis can be rebuilt from PostgreSQL by loading all active holds:

```sql
status='HELD' AND hold_expires_at > NOW()
```

This prevents inventory corruption due to cache loss.

---

## 13. Late Payment Scenario

Consider:

```text
12:00 Seat Held
12:05 Hold Expires
12:05 Seat Released
12:05 Another User Books Seat
12:06 Original Payment Completes
```

The Booking Service validates the seat state before confirming.

Expected:

```text
status = HELD
booking_id = booking123
```

If validation fails:

```text
Seat no longer belongs to booking123
```

the booking is rejected. The system then initiates a compensation workflow:

```text
PaymentSucceeded
↓
BookingConfirmationFailed
↓
RefundRequested
↓
Payment Service
↓
Gateway Refund
```

The user is refunded and double booking is prevented.

---

## Virtual Waiting Room for Hot Events

For extremely high-demand events (e.g., IPL Finals, World Cup Finals, major concerts), allowing all users to directly access the booking system can overwhelm downstream services such as Redis, PostgreSQL, and the Booking Service.

To prevent this, a Virtual Queue Service is placed in front of the booking flow. When ticket sales open, users are assigned a queue position and stored in a Redis-backed waiting room. An admission controller continuously allows only a limited number of users (e.g., 5,000–10,000 concurrent users) into the booking system based on current capacity. Admitted users receive a short-lived booking token that must be presented when accessing seat maps or creating bookings.

This ensures that the booking system processes a controlled amount of traffic, prevents thundering herd problems during flash sales, and significantly improves system stability while maintaining a fair ordering mechanism for users.

---

## Event Search

For event search, we use Elasticsearch, which works well for both location-based and text-based search. Event data is generally write-once, read-many and does not change very often. Changes can therefore be captured via CDC and propagated to Elasticsearch, while search queries are served from Elasticsearch.
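Returning to the booking flow, the ownership check from sections 10, 11, and 13 can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Seat` class and the `hold`, `release_if_expired`, `confirm`, and `late_payment_demo` names are hypothetical, time is passed in as plain seconds, and a real implementation would perform these transitions as conditional updates inside a PostgreSQL transaction rather than on Python objects.

```python
from dataclasses import dataclass
from typing import Optional

HOLD_SECONDS = 300  # mirrors the 5-minute hold (EX 300) from section 2


@dataclass
class Seat:
    seat_id: str
    status: str = "AVAILABLE"  # AVAILABLE | HELD | BOOKED
    booking_id: Optional[str] = None
    hold_expires_at: Optional[float] = None


def hold(seat: Seat, booking_id: str, now: float) -> bool:
    """Acquire a hold only if the seat is free (same idea as SET ... NX)."""
    if seat.status != "AVAILABLE":
        return False
    seat.status = "HELD"
    seat.booking_id = booking_id
    seat.hold_expires_at = now + HOLD_SECONDS
    return True


def release_if_expired(seat: Seat, now: float) -> bool:
    """Background-job step: HELD -> AVAILABLE once the hold has lapsed."""
    expired = (
        seat.status == "HELD"
        and seat.hold_expires_at is not None
        and seat.hold_expires_at < now
    )
    if expired:
        seat.status, seat.booking_id, seat.hold_expires_at = "AVAILABLE", None, None
    return expired


def confirm(seat: Seat, booking_id: str) -> str:
    """On PaymentSucceeded: confirm only if this booking still owns the hold."""
    if seat.status == "HELD" and seat.booking_id == booking_id:
        seat.status = "BOOKED"
        seat.hold_expires_at = None
        return "CONFIRMED"
    # Ownership was lost, so the compensation (refund) workflow takes over.
    return "REFUND_REQUESTED"


def late_payment_demo() -> tuple:
    """Replay the section 13 timeline with two competing bookings."""
    seat = Seat("A1")
    hold(seat, "booking123", now=0)
    release_if_expired(seat, now=301)      # original hold lapses
    hold(seat, "booking456", now=301)      # another user takes the seat
    late = confirm(seat, "booking123")     # original payment arrives late
    on_time = confirm(seat, "booking456")
    return late, on_time, seat.status
```

In this model the late payment for `booking123` is routed to a refund while `booking456` is confirmed, which is the outcome section 13 describes.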

Hire Signal: Hire

The candidate demonstrates strong distributed systems judgment on the hardest part of the problem—booking correctness under concurrency and failure—and shows senior-level awareness of real production failure modes. The concerns are meaningful, especially around HA and scale translation, but they are more about completeness and operational rigor than fundamental architectural misunderstanding.

⭐ Excellent

Consistency requirement is explicitly tied to booking correctness

The candidate does more than say 'strong consistency'—they explain what breaks if it is wrong: double reservations and duplicate payments. The booking flow enforces ownership checks before HELD → BOOKED transitions, uses idempotency keys, and validates late-payment scenarios with compensation. That shows the consistency requirement is driving concrete correctness behavior rather than being a vague label.

✅ Good

Availability is differentiated by subsystem

The NFRs separate high availability expectations for search and seat availability from the stricter correctness needs of booking. That is a sensible quality-attribute split for a ticketing system, where search can prioritize uptime and latency while booking must prioritize correctness.

✅ Good

Latency target for search is concrete and supported by workload shape

The candidate gives a measurable target for search latency (p95 < 200ms) and aligns it with an Elasticsearch-based read-optimized path for text and location queries. They also note the write-light, read-heavy nature of event data, which is the right rationale for a search index.

warning

Availability target is stated but not translated into failure expectations

You mention 99.99% availability for search and seat availability, but what happens when Redis, PostgreSQL, Kafka, CDC, or Elasticsearch is degraded? The design explains some recovery behavior, especially for Redis, but the NFR section does not define which user journeys must remain available during partial failures and which can degrade. You could improve this by mapping the 99.99% target to concrete failure scenarios such as 'search remains available during indexing lag' or 'booking rejects new holds but never oversells during DB failover.'

warning

Consistency model is clear for booking but not for search and booking history

The design correctly chooses strong consistency for booking, but what happens if a user books a ticket and immediately checks booking history, or if event updates lag into Elasticsearch? The system appears to use asynchronous propagation in multiple places, so the consistency expectations for read paths are important. You could improve this by explicitly stating that booking confirmation and ticket durability are strongly consistent in the transactional store, while search indexing is eventually consistent and acceptable under the stated requirements.

warning

NFR numbers are not connected to the stated scale assumptions

The targets are concrete, but they float somewhat independently from the assumptions of 1B users, 10M DAU, 1M events, and 100K bookings/day. For example, p95 < 200ms for search may be reasonable, but what query volume or concurrency is that target meant to hold under? Likewise, 99.99% availability for seat availability matters most during concentrated on-sale spikes, not average daily booking volume. You could strengthen this by tying each target to expected traffic patterns, especially burst behavior for hot events versus the relatively modest average booking rate.

info

Durability requirement would be stronger with an explicit recovery objective

You call out durability for tickets, which is the right quality attribute, but what happens after a database failure or regional outage? Without an explicit durability expectation, it is hard to judge whether the chosen mechanisms are sufficient. You could improve this by stating a concrete durability goal such as no confirmed ticket loss after acknowledged write, and optionally an RPO/RTO expectation for booking records.

✅ Good

Core booking nouns are identified

The design names the main domain entities needed for the stated flow: User, Event, Booking, Venue, and Seat. That covers the primary concepts behind searching events, selecting inventory, and storing a user's booking history.

✅ Good

Booking-to-seat relationship is made concrete

In the explanation, the candidate goes beyond just listing entities and shows how Booking and Seat connect through booking_id, seat status, and ownership checks during confirmation. That makes the reservation flow understandable at the domain level.

✅ Good

Event and venue are separated

Keeping Venue distinct from Event is a sensible domain model because location-based search and event discovery often depend on venue metadata, while bookings are tied to a specific event occurrence.

warning

Have you considered whether Seat belongs to Venue or to a specific Event inventory?

The explanation uses a Seat table with event_id, which suggests seats are event-scoped, but Venue is also listed as a core entity. What happens when the same physical venue hosts many events over time? Without clearly defining whether Seat is a reusable venue seat and EventSeat is the per-event inventory, the relationship is ambiguous and can lead to confusing ownership and availability semantics.
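One possible way to make the distinction concrete is sketched below. It assumes a reusable `VenueSeat` for the physical seat and a per-event `EventSeat` inventory row; both class names and the `build_event_inventory` helper are hypothetical and not part of the submitted design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VenueSeat:
    """Physical seat: defined once per venue and reused across events."""
    venue_id: str
    seat_label: str


@dataclass
class EventSeat:
    """Per-event inventory row: carries status and ownership for one event."""
    event_id: str
    seat_label: str
    status: str = "AVAILABLE"  # AVAILABLE | HELD | BOOKED
    booking_id: Optional[str] = None


def build_event_inventory(event_id: str, venue_seats: List[VenueSeat]) -> List[EventSeat]:
    """Create fresh, independent inventory for an event from the venue layout."""
    return [EventSeat(event_id, seat.seat_label) for seat in venue_seats]
```

With this split, holding seat A1 for one event leaves A1 available for every other event at the same venue.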

warning

Booked history implies a user-to-booking-to-event relationship, but it is only partially spelled out

You mention booking history as a requirement and the booking table includes user_id and event_id, which is good, but the entity relationships are not explicitly described. Have you considered making the cardinalities clear: one User has many Bookings, one Event has many Bookings, and a Booking may contain one or many Seats? At Senior level, I would expect those relationships to be stated so the happy path is unambiguous.

info

You could strengthen the model by clarifying multi-seat bookings

The API accepts seatIds as an array, but the entity description does not show whether this is modeled as Booking 1:N Seats or through a join entity such as BookingSeat. Making that relationship explicit would better align the domain model with the booking flow and avoid ambiguity around how multiple seats are attached to one booking.
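A minimal sketch of the join-entity option follows. It assumes an all-or-nothing policy (the booking gets every requested seat or none), which the design does not state; the `BookingSeat` class and `attach_seats` function are hypothetical names, and the seat table is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BookingSeat:
    """Join row linking one booking to one seat."""
    booking_id: str
    seat_id: str


def attach_seats(
    booking_id: str, seat_ids: List[str], seat_status: Dict[str, str]
) -> List[BookingSeat]:
    """Hold every requested seat or none (assumed all-or-nothing policy)."""
    unique_ids = list(dict.fromkeys(seat_ids))  # drop duplicates, keep order
    if any(seat_status.get(seat_id) != "AVAILABLE" for seat_id in unique_ids):
        return []  # at least one seat is taken or unknown: change nothing
    for seat_id in unique_ids:
        seat_status[seat_id] = "HELD"
    return [BookingSeat(booking_id, seat_id) for seat_id in unique_ids]
```

An empty result would map to a conflict response for the whole request, so the client never ends up with a partial hold.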

✅ Good

Basic scale estimates are present

The candidate does provide core back-of-the-envelope numbers for storage and traffic: event corpus size, user data size, booking write volume, search QPS, and peak multipliers. For a capacity discussion, this is the right starting point instead of jumping straight to infrastructure.

✅ Good

Recognizes hotspot traffic is different from average traffic

They explicitly call out that bookings for popular events can spike far above the daily average and mention a virtual waiting room to cap concurrency. That shows awareness that ticketing systems are dominated by bursty, skewed demand rather than uniform load.

warning

Capacity chain stops before infrastructure sizing

You estimated QPS and some storage, but what happens when you translate that into actual system load on Elasticsearch, Redis, PostgreSQL, and Kafka? Without connecting DAU → peak QPS → per-component throughput/storage/replication, it is hard to tell whether the proposed architecture can actually sustain the stated assumptions.

warning

Peak search and booking numbers are internally inconsistent

Have you considered reconciling the different peak assumptions? Search is derived as ~5.5K QPS peak from 10 searches per DAU, but later 'Peak read - 115k read/sec' appears without a clear derivation, and booking writes jump from ~1.15/sec average to 120/sec peak. If these are hotspot assumptions, call them out separately; otherwise the sizing logic becomes hard to trust.
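The gap becomes visible when the implied peak multipliers are computed from the stated inputs. The sketch below uses only the figures quoted above; comparing the 115k read/sec figure against average search QPS is an assumption, since the design does not say which reads that number covers.

```python
SECONDS_PER_DAY = 86_400

# Stated inputs from the design.
dau = 10_000_000
searches_per_dau = 10
bookings_per_day = 100_000

# Averages derived from the stated inputs.
avg_search_qps = dau * searches_per_dau / SECONDS_PER_DAY   # ~1,157 QPS
avg_booking_wps = bookings_per_day / SECONDS_PER_DAY        # ~1.16 writes/sec

# Stated peaks from the design.
stated_peak_search_qps = 5_500
stated_peak_reads = 115_000
stated_peak_booking_wps = 120

# Implied peak-to-average multipliers.
search_peak_multiplier = stated_peak_search_qps / avg_search_qps     # ~4.75x
read_peak_multiplier = stated_peak_reads / avg_search_qps            # ~99x
booking_peak_multiplier = stated_peak_booking_wps / avg_booking_wps  # ~104x
```

A roughly 5x search peak next to roughly 100x read and booking peaks suggests two different assumptions (ordinary daily peak versus hot-event burst) that should be labeled separately.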

warning

Storage estimates miss growth and replication overhead

What happens once you account for history retention, indexes, replicas, and search copies? The raw event data may be ~1GB, but Elasticsearch indexing overhead and replication can multiply that. Booking history is also presented as 50MB/day, but under the requirement to show booked-event history, you should think about multi-year accumulation rather than only one day of writes.
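As a rough illustration of the accumulation, the sketch below projects the stated 50MB/day of booking writes over several years. The 5-year retention window and the 3x overhead factor for indexes and replicas are hypothetical placeholders, not figures from the design, and sizes use decimal units.

```python
def accumulated_gb(mb_per_day: float, years: float, overhead_factor: float = 1.0) -> float:
    """Project stored volume in GB (decimal) for a daily write volume in MB."""
    return mb_per_day * 365 * years * overhead_factor / 1000


# 50 MB/day is the stated booking write volume; the rest are assumptions.
raw_5y = accumulated_gb(50, 5)                                # 91.25 GB
with_overhead_5y = accumulated_gb(50, 5, overhead_factor=3.0)  # 273.75 GB
```

Even with generous overhead the total stays modest, so the point is less about raw size and more about stating the retention window the history requirement implies.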

warning

User data estimate is not reasoned through

The user table is listed as 1B users at '100B' and then calculated as 1TB using 1KB, which suggests the estimate is not internally consistent. Even if exact math is not the goal, what happens when this kind of mismatch propagates into shard counts, index sizes, and backup planning? Tightening the methodology would make the rest of the capacity plan more credible.
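The mismatch is a factor of ten, as the following sketch shows. It assumes '100B' was meant as 100 bytes per user record, which is an interpretation rather than something the design states, and it uses decimal units.

```python
USERS = 1_000_000_000  # 1B users, as stated


def total_bytes(rows: int, bytes_per_row: int) -> int:
    """Raw table size before indexes, replicas, or backups."""
    return rows * bytes_per_row


at_100_bytes = total_bytes(USERS, 100)   # 100 GB if each record is 100 bytes
at_1_kb = total_bytes(USERS, 1_000)      # 1 TB if each record is 1 KB
```

Picking one per-record size and carrying it through consistently would remove the ambiguity.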

info

Justify component choices with the stated scale

You could improve this by tying technology choices to the load profile: for example, why Elasticsearch is needed for 1M events and location/text search, why Redis is appropriate for short-lived seat holds under burst traffic, and whether Kafka volume is high enough to require it versus a simpler queue. The choices may be fine, but the scale-based justification is mostly implicit.

⭐ Excellent

Idempotent booking creation is explicitly considered

The candidate calls out an idempotency key on POST /bookings for client retries. That is an important API-level decision for a booking flow where network retries could otherwise create duplicate reservations or duplicate payment sessions.

✅ Good

Core user flows are mostly covered by the routes

The API includes search, event detail, seat map lookup, booking creation, booking lookup, and user booking history, which is enough to exercise the main functional requirements end to end.

✅ Good

Large search results are handled with a cursor

GET /events uses limit and cursor, which is a sensible choice for potentially large event search result sets and avoids forcing offset pagination on a high-read search path.

warning

Booking creation response contract is underspecified

What does the client see when POST /bookings succeeds versus when a seat is already held, partially unavailable, or the booking is accepted but payment session creation is still pending? The explanation mentions returning bookingId and PENDING_PAYMENT, but the route section does not define the response shape or status behavior. In production, clients need a clear contract for states like accepted, conflict, expired hold, and retryable failure.

warning

Error model and retry guidance are missing

Have you considered what happens when the client gets a timeout on POST /bookings or GET /bookings/{id} while downstream payment/session creation is still in flight? Without defined status codes and error shapes—such as 409 for seat conflict, 404 for unknown booking, 429 for waiting-room throttling, and guidance on when to retry versus stop—the client behavior will be inconsistent and duplicate traffic is likely.
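One way to pin this down is a small, shared error catalogue. The sketch below uses the 409, 404, and 429 codes mentioned above; the 503 mapping for downstream timeouts, the error code strings, and the `retryable` flag are assumptions added for illustration.

```python
from typing import NamedTuple


class ApiError(NamedTuple):
    status: int
    code: str
    retryable: bool


# 409/404/429 follow the suggestion above; 503 and the code names are assumed.
ERRORS = {
    "seat_conflict": ApiError(409, "SEAT_ALREADY_HELD", False),
    "unknown_booking": ApiError(404, "BOOKING_NOT_FOUND", False),
    "waiting_room_throttled": ApiError(429, "WAITING_ROOM_THROTTLED", True),
    "downstream_timeout": ApiError(503, "TEMPORARILY_UNAVAILABLE", True),
}


def to_response(outcome: str) -> dict:
    """Build a uniform error payload so clients can decide whether to retry."""
    err = ERRORS[outcome]
    return {
        "status": err.status,
        "body": {"error": err.code, "retryable": err.retryable},
    }
```

A client that sees `retryable: false` on a 409 knows to pick another seat instead of resubmitting the same request.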

warning

Mixed pagination styles create an inconsistent API

GET /events uses cursor pagination, but GET /users/{id}/bookings uses page=1. At this scale, booking history can grow large for some users, and offset/page pagination becomes less stable under inserts. You should think through whether history should also use a cursor so clients get a consistent and scalable pagination model.
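A keyset-style cursor for booking history could look like the sketch below. It assumes bookings are ordered newest first by `(created_at, booking_id)` with ISO-8601 timestamp strings; the `encode_cursor`, `decode_cursor`, and `page` names are hypothetical, and the in-memory list stands in for an indexed database query.

```python
import base64
import json
from typing import List, Optional, Tuple


def encode_cursor(created_at: str, booking_id: str) -> str:
    """Pack the last-seen sort key into an opaque, URL-safe cursor."""
    raw = json.dumps([created_at, booking_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> Tuple[str, str]:
    created_at, booking_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, booking_id


def page(
    bookings: List[dict], limit: int, cursor: Optional[str] = None
) -> Tuple[List[dict], Optional[str]]:
    """Return up to `limit` bookings older than the cursor, newest first."""
    ordered = sorted(
        bookings, key=lambda b: (b["created_at"], b["booking_id"]), reverse=True
    )
    if cursor is not None:
        last_seen = decode_cursor(cursor)
        ordered = [
            b for b in ordered if (b["created_at"], b["booking_id"]) < last_seen
        ]
    items = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = (
        encode_cursor(items[-1]["created_at"], items[-1]["booking_id"])
        if has_more
        else None
    )
    return items, next_cursor
```

Because the cursor encodes a position in the sort order rather than an offset, a booking created between two page requests does not shift or duplicate later pages.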

warning

User-scoped booking history API raises ownership questions

What happens if a client calls GET /users/{id}/bookings for another user's id? The route shape implies arbitrary user-id access unless auth rules are enforced elsewhere. For a user history endpoint, a safer contract is often a caller-scoped resource like GET /me/bookings or an explicit statement that the server ignores mismatched user ids and derives identity from auth.

info

Seat hold token is not reflected in the public API

The explanation introduces a short-lived booking token for hot events and a hold lifecycle in the backend, but the API routes do not show how the client obtains or presents that token. You could improve this by making the contract explicit—for example, returning a hold token from seat-map/hold acquisition and requiring it on POST /bookings—so the concurrency control is visible and enforceable at the API boundary.
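As one possible shape for such a token, the sketch below signs the user, event, and expiry with an HMAC so the Booking Service can verify it without a lookup. Everything here is an assumption for illustration: the `issue_token` and `verify_token` names, the pipe-delimited payload (which requires ids without a `|` character), the 300-second default lifetime, and the hard-coded demo secret.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-secret"  # hypothetical; a real key would come from a secret store


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(
    user_id: str, event_id: str, ttl_seconds: int = 300, now: Optional[float] = None
) -> str:
    """Issue a short-lived token bound to one user and one event."""
    issued_at = time.time() if now is None else now
    payload = f"{user_id}|{event_id}|{int(issued_at + ttl_seconds)}"
    return base64.urlsafe_b64encode(f"{payload}|{_sign(payload)}".encode()).decode()


def verify_token(
    token: str, user_id: str, event_id: str, now: Optional[float] = None
) -> bool:
    """Accept only an unexpired, untampered token for this user and event."""
    try:
        decoded = base64.urlsafe_b64decode(token.encode()).decode()
        tok_user, tok_event, expires_at, signature = decoded.split("|")
        expiry = int(expires_at)
    except (ValueError, UnicodeDecodeError):
        return False  # malformed token
    expected = _sign(f"{tok_user}|{tok_event}|{expires_at}")
    current = time.time() if now is None else now
    return (
        hmac.compare_digest(signature.encode(), expected.encode())
        and tok_user == user_id
        and tok_event == event_id
        and current < expiry
    )
```

Requiring this token on POST /bookings and on seat-map reads would make the admission decision enforceable at the API boundary.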

info

Search filter semantics could be made clearer

GET /events supports location and keyword, which matches the requirements, but the contract does not say whether both filters are optional, how location is represented, or what happens when neither is provided. You could improve this by documenting the expected query combinations and validation behavior so clients know how to form valid searches.

⭐ Excellent

Booking flow has a clear consistency model

The design shows a thoughtful split between fast temporary coordination in Redis and PostgreSQL as the source of truth. The candidate explicitly handles seat holds, transactional booking persistence, outbox publication, and final HELD→BOOKED transition, which is a strong end-to-end approach for preventing double reservation under concurrent booking attempts.

⭐ Excellent

Failure scenarios around payment and retries are well considered

The explanation goes beyond the diagram and addresses realistic production failures: duplicate client retries via idempotency keys, payment-service crash after session creation, webhook retries, and late payment after hold expiry with refund compensation. That shows strong design thinking about distributed workflow failure modes.

✅ Good

Search path is separated from transactional booking path

Using Elasticsearch for text and location-based event search while keeping booking state in PostgreSQL is a good architectural separation. It protects the transactional store from high read traffic and fits the stated low-latency search requirement.

✅ Good

Hot-event admission control is a sensible scalability guardrail

Introducing a waiting room in front of booking for flash-sale scenarios is a good senior-level choice. It acknowledges that the average booking QPS is low but contention for a single popular event can spike sharply, and it protects Redis, PostgreSQL, and Booking Service from a thundering herd.

warning

Search read path does not show caching for hot traffic

Have you considered what happens when many users repeatedly hit the same popular event pages, seat maps, or identical search queries? At the stated scale, Elasticsearch and the Event Service may become the first bottleneck on hot reads because Redis is only shown on the booking/waiting path. You could improve this by caching hot event details, venue data, seat-map snapshots, and common search results with short TTLs.
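A short-TTL read-through cache of the kind suggested here can be sketched in a few lines. The `TTLCache` class, the injectable clock, and the key naming are hypothetical; in the described architecture this role would more likely be played by Redis or a CDN than by an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Read-through cache that serves a value until its TTL elapses."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the backend call
        value = loader()   # miss or expired: hit the backend once
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a TTL of a few seconds, repeated requests for the same hot event page collapse into one backend read per TTL window, at the cost of briefly stale data.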

warning

Seat-hold expiry cleanup may become a bottleneck under bursty events

What happens during a major on-sale when a very large number of holds expire around the same time? The explanation mentions a background scan for expired HELD rows, but a naive periodic scan on the seats table can become expensive and delay seat release. You could improve this by partitioning holds by event/time, using an indexed expiry queue, or driving expiration from delayed events rather than broad table scans.
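The indexed expiry queue idea can be sketched with a min-heap ordered by expiry time, so the cleanup job only touches holds that are actually due. The `HoldExpiryQueue` class and the `current_owner` map (held seat id to booking id) are hypothetical stand-ins for an indexed table or a delayed-message queue.

```python
import heapq
from typing import Dict, List, Tuple


class HoldExpiryQueue:
    """Min-heap of holds ordered by expiry, avoiding full-table scans."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str, str]] = []  # (expires_at, seat_id, booking_id)

    def add(self, seat_id: str, booking_id: str, expires_at: float) -> None:
        heapq.heappush(self._heap, (expires_at, seat_id, booking_id))

    def pop_expired(self, now: float, current_owner: Dict[str, str]) -> List[str]:
        """Release seats whose hold lapsed and is still owned by that booking."""
        released = []
        while self._heap and self._heap[0][0] < now:
            _, seat_id, booking_id = heapq.heappop(self._heap)
            # Skip stale entries: the seat was booked or re-held in the meantime.
            if current_owner.get(seat_id) == booking_id:
                del current_owner[seat_id]
                released.append(seat_id)
        return released
```

Each cleanup pass costs time proportional to the number of due holds rather than the size of the seat table, and the ownership check keeps a stale queue entry from releasing a seat that has since been confirmed.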

warning

Single-database transactional core is the main scaling risk

Have you considered what happens if one extremely popular event drives concentrated writes on the same seat inventory in PostgreSQL? Even if average booking volume is modest, the first thing that breaks in practice is often the primary database due to lock contention and hot rows for seat state transitions. You could improve this by calling out partitioning/sharding strategy by event or venue, plus careful transaction boundaries and indexing for seat ownership checks.

warning

High availability of critical stateful components is not made explicit

What happens if the PostgreSQL primary, Redis node, Kafka broker, or Elasticsearch node fails? The explanation does cover Redis recovery logically, but the HLD does not show replicas/failover topology for the core stateful systems that sit on the critical path. Without explicit HA, booking or search can stall on a single node failure. You could improve this by showing primary-replica/failover for Postgres, Redis HA, Kafka replication, and Elasticsearch multi-node deployment.

info

Waiting room integration is only partially connected to the booking flow

The waiting service is a good addition, but in the current HLD it is not fully tied into seat-map reads and booking authorization. You could strengthen the design by showing how the booking token is validated by the Booking Service/Event Service so the queue actually gates downstream load rather than existing as a side path.

info

Some components are loosely represented in the end-to-end flow

The 'Replicas' node and the service box containing rate limiting/auth/load balancing are connected, but their role in the request lifecycle is not fully clear from the diagram. You could improve the HLD by making the read/write split and request routing more explicit so there are no ambiguous components in the critical path.
