YouTube / Video Streaming: system design by AgileViper46
Reviewed by 6 specialized AI reviewers.
Author's explanation
How the author described their design
Video Service:
The video service manages video metadata. When a user wants to stream a video, the client sends the request to this service, which first tries to fetch the video metadata from the Redis cache. On a cache miss, it populates the cache from the Postgres replica.
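A minimal sketch of the cache-aside read described above. Plain dicts stand in for Redis and the Postgres replica, and the class name `VideoMetadataReader` and the TTL value are illustrative assumptions, not part of the author's design.

```python
import time
from typing import Optional


class VideoMetadataReader:
    """Cache-aside reader; dicts stand in for Redis and the Postgres replica."""

    def __init__(self, replica: dict, ttl_seconds: float = 300.0):
        self.replica = replica          # stand-in for the Postgres read replica
        self.cache = {}                 # stand-in for Redis: video_id -> (expires_at, metadata)
        self.ttl_seconds = ttl_seconds  # assumed TTL; the design does not specify one
        self.replica_reads = 0          # counts how often the replica is hit

    def get(self, video_id: str, now: Optional[float] = None) -> Optional[dict]:
        now = time.monotonic() if now is None else now
        entry = self.cache.get(video_id)
        if entry is not None and entry[0] > now:
            return entry[1]             # cache hit
        self.replica_reads += 1         # cache miss or expired entry: read the replica
        metadata = self.replica.get(video_id)
        if metadata is not None:
            self.cache[video_id] = (now + self.ttl_seconds, metadata)
        return metadata
```

Repeated reads of the same video within the TTL are served from the cache, so only the first read (and reads after expiry) reach the replica.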
On a video upload request, the service assigns a blob ID, creates a presigned URL for it, creates the metadata record, and returns the presigned URL for that video ID.
Likes service:
This service handles likes from users. Likes are fetched on every video fetch, and the service uses the Redis cache to read the like count for a video. On a like update, the service first writes the update to Cassandra.
Redis cache:
This cache serves the video_id to videoMetadata mapping to the video service and the video_id to likes counter mapping to the likes service. It is a Redis cluster with replica failover, sized based on the required QPS.
PostgreSQL:
This is the store for relational data such as Users and video metadata, with a single-replica failover strategy.
The combined read and write QPS is below 2000, so this setup is sufficiently scaled for the requirements.
Cassandra:
This is the likes DB, where each like/dislike event is stored; most reads are ideally served from Redis.
Rows are keyed on (video_id, user_id), so retries or a change between like and dislike overwrite the existing row instead of adding a new one.
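The overwrite behaviour of the (video_id, user_id) key can be sketched as follows. A dict stands in for the Cassandra table, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# In-memory stand-in for a Cassandra table keyed by (video_id, user_id).
ReactionTable = Dict[Tuple[str, str], str]


def record_reaction(table: ReactionTable, video_id: str, user_id: str, reaction: str) -> None:
    """Upsert: the latest reaction replaces any earlier one for the same key."""
    if reaction not in ("like", "dislike"):
        raise ValueError(f"unknown reaction: {reaction}")
    table[(video_id, user_id)] = reaction


def count_reactions(table: ReactionTable, video_id: str) -> Dict[str, int]:
    """Derive aggregate counts from the per-user rows."""
    counts = {"like": 0, "dislike": 0}
    for (vid, _user), reaction in table.items():
        if vid == video_id:
            counts[reaction] += 1
    return counts
```

Because a retried write lands on the same key, it cannot inflate the count; a switch from like to dislike moves the user from one bucket to the other.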
Flink:
On like events, Flink maintains the aggregate like counts for the events arriving in Cassandra and pushes the counts to Redis once per one-minute window.
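The one-minute aggregation can be illustrated without Flink itself. The sketch below is plain Python, not the Flink API: it groups events into tumbling windows and applies each closed window to a counter dict that stands in for Redis. The event shape (timestamp, video_id, delta) is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # one-minute window, as described in the text


def aggregate_by_window(events: Iterable[Tuple[float, str, int]]) -> Dict[int, Dict[str, int]]:
    """Group (timestamp, video_id, delta) events into tumbling windows.

    Returns {window_start: {video_id: net_like_delta}}.
    """
    windows: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for timestamp, video_id, delta in events:
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start][video_id] += delta
    return {start: dict(per_video) for start, per_video in windows.items()}


def flush_window(counters: Dict[str, int], window: Dict[str, int]) -> None:
    """Apply one closed window's deltas to the cached counters (Redis stand-in)."""
    for video_id, delta in window.items():
        counters[video_id] = counters.get(video_id, 0) + delta
```

Each video receives at most one counter update per window, regardless of how many like events arrived in that minute.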
Client:
The client is responsible for streaming. It uses HLS/DASH for adaptive bitrate streaming. For a video, it first calculates the average network bandwidth and the buffered video duration, then requests the corresponding chunk of the video. It dynamically increases or decreases the video quality based on these parameters. Requests go to the CDN and are served from the CDN.
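A minimal sketch of the client-side selection step, assuming a simple rule: use the average of recent throughput samples with some headroom, and drop to the lowest rendition when the buffer is nearly drained. The thresholds and the function name `pick_bitrate` are illustrative assumptions; real HLS/DASH players use more elaborate heuristics.

```python
from typing import Sequence


def pick_bitrate(
    ladder_kbps: Sequence[int],
    throughput_samples_kbps: Sequence[float],
    buffer_seconds: float,
    low_buffer_seconds: float = 10.0,  # assumed threshold
    safety_factor: float = 0.8,        # assumed headroom below measured bandwidth
) -> int:
    """Return the highest rendition bitrate that fits the estimated bandwidth."""
    if not ladder_kbps:
        raise ValueError("bitrate ladder is empty")
    ladder = sorted(ladder_kbps)
    if not throughput_samples_kbps or buffer_seconds < low_buffer_seconds:
        return ladder[0]  # no estimate yet, or buffer nearly drained: play safe
    average = sum(throughput_samples_kbps) / len(throughput_samples_kbps)
    budget = average * safety_factor
    affordable = [rate for rate in ladder if rate <= budget]
    return affordable[-1] if affordable else ladder[0]
```

The client would call this before each chunk request and fetch the next chunk from the rendition it returns.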
CDN:
Since video data is mostly static, we serve it from the CDN. Requests are mostly for 3-6 MB chunks of a video; if a chunk is not present, the CDN fetches it from the blob store.
Kafka and Encoding cluster:
Once the video upload is complete, the encoding servers encode, transform, and chunk the video into different resolutions/bitrates and upload the chunks to the blob store. This marks the video as processed.
Scalability considerations and Failure handling:
- We have a cache for fast responses, with a replica to handle a crash; if Redis crashes completely, we fall back to the database.
- Postgres also has a replica that can take over in the event of a crash.
- Cassandra can be sharded by videoId if required. We are not sharding right now, but it can be done if data volume or QPS increases.
- Since the update to Redis after a like event can fail, a like-counter job syncs the likes from Cassandra to Redis to keep them consistent.
- The video metadata has a status such as created | uploading | processing | completed. Kafka ensures that the encoding and processing servers pick up the updates. These servers commit the message only after completion. Encoding is de-duplicated, so if a server crashes partway through, the server that picks up the message resumes from where the previous one stopped.
- Video uploads are multipart: the client breaks the video into chunks and uploads them to the blob store, which stitches the parts together once the upload is complete. This gives us resumable uploads.
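The resumable multipart upload in the last bullet can be sketched as two steps: plan the byte ranges, then after an interruption re-send only the parts that were not acknowledged. The function names and the part size are assumptions for illustration; no specific blob store API is implied.

```python
from typing import List, Set, Tuple

Part = Tuple[int, int, int]  # (part_number, start_byte, end_byte_exclusive)


def plan_parts(file_size: int, part_size: int) -> List[Part]:
    """Split a file into numbered byte ranges of at most part_size bytes."""
    if file_size <= 0 or part_size <= 0:
        raise ValueError("file_size and part_size must be positive")
    parts = []
    start = 0
    number = 1
    while start < file_size:
        end = min(start + part_size, file_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts


def parts_to_resume(parts: List[Part], uploaded: Set[int]) -> List[Part]:
    """After an interruption, only parts not yet acknowledged need re-sending."""
    return [part for part in parts if part[0] not in uploaded]
```

For example, a 25-byte file with 10-byte parts yields three parts; if parts 1 and 3 were acknowledged before a disconnect, only part 2 is re-sent.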
The video metadata contains a status field (UPLOADING, PROCESSING, COMPLETED, FAILED).
When GET /videos/{videoId} is called, the Video Service returns metadata including the status. If the status is not COMPLETED, the response does not include the HLS manifest URL and instead returns a non-playable state to the client.
Even if a client bypasses the UI and directly attempts playback, the Video Service validates the status before generating or returning the manifest URL. Requests for videos in UPLOADING or PROCESSING state return a 409 VideoProcessing or similar error.
Only after the encoding pipeline successfully generates all HLS renditions and uploads the manifest and segments to blob storage does the processing service update the metadata status to COMPLETED. At that point the manifest URL becomes available and playback is enabled.
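The status gate described above might look like the following sketch. The 409 VideoProcessing response comes from the author's text; the 404 for a missing video, the handling of FAILED, and the field names are assumptions.

```python
from typing import Optional, Tuple

PLAYABLE_STATUS = "COMPLETED"


def playback_response(video: Optional[dict]) -> Tuple[int, dict]:
    """Return (http_status, body) for a manifest request."""
    if video is None:
        return 404, {"error": "VideoNotFound"}  # assumed; not stated by the author
    status = video["status"]
    if status == PLAYABLE_STATUS:
        return 200, {"status": status, "manifestUrl": video["manifest_url"]}
    if status in ("UPLOADING", "PROCESSING"):
        return 409, {"error": "VideoProcessing", "status": status}
    # FAILED or any unknown status: never expose a manifest URL (assumed behaviour).
    return 409, {"error": "VideoNotPlayable", "status": status}
```

The manifest URL appears in the body only on the COMPLETED branch, so a client that bypasses the UI still cannot obtain it early.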
Design Review
Hire Signal: Lean Hire
The candidate demonstrates good system decomposition, appropriate technology choices for large-scale video delivery, and thoughtful handling of upload-to-playback correctness. However, for a senior-level bar, the design is weaker on failure-mode handling and resilience of critical control-plane paths, especially around likes ingestion, metadata failover, and cache degradation.
AI Review: Senior (L5-L6)
✅ Good Availability prioritized over consistency for likes
The candidate explicitly states availability is more important than strict consistency and aligns the like counter design to that choice with minute-level aggregation and cache refresh. That is a reasonable NFR trade-off for a social counter where temporary staleness is acceptable.
✅ Good Consistency model is called out for like counters
They do not leave consistency implicit: likes are treated as eventually consistent with a concrete freshness target of roughly 1 minute. This is the kind of explicit trade-off a senior candidate should make because it clarifies what user-visible behavior is acceptable.
✅ Good Streaming latency requirement is tied to delivery strategy
Low-latency streaming is connected to concrete mechanisms such as CDN delivery and HLS/DASH adaptive bitrate streaming. That shows the latency goal is not just listed as a buzzword but is influencing the design.
warning Latency target is qualitative, not measurable
You say 'low latency video streaming,' but what happens when startup time or rebuffering gets worse in production? Without concrete targets such as video start time, chunk fetch latency, or acceptable buffering rate, it is hard to know whether the system meets the requirement or where to set capacity and alerting thresholds.
warning Availability target is not translated into service-level expectations
You mention 99.99% availability, but have you considered what that means per user-facing flow? Streaming, upload initiation, upload completion, metadata reads, and likes have very different failure tolerance. Without mapping the availability target to specific APIs or journeys, the number stays aspirational and does not clearly drive operational decisions.
warning Scalability numbers do not connect cleanly to the stated assumptions
The assumptions say 1B users, 10M DAU, and high-frequency streaming, but the explanation justifies Postgres with 'read QPS + write QPS is < 2000' without showing how that follows from those assumptions. What happens if metadata reads, upload state transitions, and like lookups are materially higher? The risk is under-sizing core paths because the throughput estimate is not defended.
warning Consistency choice is only justified for counters, not for user-visible like state
Eventual consistency is fine for aggregate counts, but have you considered what happens when a user likes a video and immediately refreshes? The design mentions overwriting by (video_id, user_id), but the NFR discussion does not separate the consistency needs of the per-user liked state from the aggregate counter. If both are eventually consistent, users may see their own action appear lost even if the backend eventually converges.
info Upload and processing NFRs would be stronger with explicit SLOs
You could improve this by defining measurable targets for upload success and processing completion, for example acceptable failure rate for resumable uploads and expected time from upload completion to playable video. Since upload and processing are core requirements, those numbers would make the scalability and availability discussion much more defensible.
🗃️ Core Entities Review
✅ Good Core nouns for the happy path are present
The design identifies the main entities needed for the stated requirements: User, Video, Like, and VideoMetadata. That is enough to cover upload, stream, and like flows without inventing out-of-scope concepts.
✅ Good Like modeled as its own entity
Separating Likes from Videos shows good domain thinking for a large-scale system. It supports per-user like state and deduplication via (video_id, user_id), instead of treating likes as only a counter on the video.
✅ Good Video lifecycle captured in metadata
The explanation gives VideoMetadata a clear role and includes status transitions such as UPLOADING, PROCESSING, COMPLETED, and FAILED. That is a meaningful domain concept for the upload-to-playback flow, not just a storage detail.
warning Relationship between Video and VideoMetadata is unclear
Have you considered whether VideoMetadata is a separate entity or just attributes of Video? Right now both are listed as core entities, but the relationship is not defined. If this is really a 1:1 extension of Video, say so explicitly; otherwise it is hard to reason about ownership and lifecycle.
warning User-to-Video ownership is not stated
What happens when you need to answer basic domain questions like 'which videos did this user upload' or enforce uploader permissions? The design names Users and Videos, but it never explicitly defines that a User owns or uploads many Videos. That 1:N relationship is part of the core upload flow and should be called out.
warning Like relationships are only partially defined
You imply Likes are keyed by (video_id, user_id), but the domain relationship should be stated directly: Users like Videos through a Like join entity, giving a many-to-many relationship with one Like per user-video pair. Without making that explicit, it is harder to reason about uniqueness, unlike/retoggle behavior, and how counts derive from source-of-truth records.
✅ Good Storage and egress are estimated from first principles
The candidate starts from bitrate, converts that into per-hour storage, then rolls it up to daily and yearly ingest and separately estimates viewing bandwidth from DAU and watch time. That is the right capacity methodology for a video platform because it ties user behavior directly to storage growth and CDN/network load.
✅ Good Peak traffic is considered instead of only averages
They do not stop at average concurrent viewers or average request rates; they apply a peak multiplier to bandwidth and metadata/likes QPS. That shows awareness that infrastructure must be sized for bursts, not just daily averages.
✅ Good Component choices are partially tied to the estimated load
The explanation explicitly connects the relatively low metadata QPS to Postgres plus cache, while pushing high-volume video delivery to the CDN and blob store. For the stated assumptions, that is a reasonable scale-based split rather than overbuilding the control plane.
warning Encoding capacity is not sized from the upload volume
You estimated 10K hours of uploads per day and the resulting storage footprint, but what happens when the transcoding backlog grows? Without converting daily upload hours into required parallel encoding throughput, GPU/CPU fleet size, and processing SLA, the system could accept uploads faster than it can make them playable.
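One way to turn upload volume into encoder count is sketched below. Only the 10K upload hours per day comes from the review; the number of renditions, the encode-time ratio, and the peak multiplier are placeholder assumptions that would need to be measured.

```python
import math


def required_encoders(
    upload_hours_per_day: float,
    renditions: int,
    encode_hours_per_video_hour: float,
    peak_multiplier: float = 1.0,
) -> int:
    """Workers needed so that one day's encode work finishes within 24 hours."""
    work_hours = upload_hours_per_day * renditions * encode_hours_per_video_hour
    return math.ceil(work_hours * peak_multiplier / 24)
```

With 10,000 upload hours per day, an assumed 5 renditions, and an assumed 0.5 encode hours per video hour per rendition, this gives 1042 workers running continuously, or 2084 with a 2x peak multiplier.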
warning CDN origin and cache-hit assumptions are missing
The 2 Tbps peak bandwidth is a good top-line number, but how much of that is expected to be served by the CDN versus pulled from origin? If cache hit rate drops for long-tail content or new uploads, origin bandwidth and blob read capacity can become the real bottleneck. You should size origin egress using an explicit cache-hit assumption.
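The origin sizing suggested here reduces to one multiplication. In the sketch below, the 2 Tbps peak is taken from the review, while the cache-hit ratios are assumptions chosen only to show the sensitivity.

```python
def origin_egress_gbps(peak_egress_gbps: float, cache_hit_ratio: float) -> float:
    """Traffic the CDN must pull from blob storage at peak."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return peak_egress_gbps * (1.0 - cache_hit_ratio)
```

At 2000 Gbps peak, an assumed 95% hit ratio leaves roughly 100 Gbps at origin; if the ratio falls to 80%, origin egress rises to roughly 400 Gbps.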
warning Replication and multi-region overhead are only applied to stored video
You accounted for 3x replication on encoded video storage, which is good, but have you considered the same overhead for likes data, metadata, Redis memory, and cross-region copies if the 1B-user platform is geographically distributed? At this scale, control-plane storage is smaller than media but still worth sizing so failover and regional expansion are grounded in numbers.
info Request-rate estimates could be tied more clearly to playback behavior
You could improve this by separating metadata/likes QPS from segment request volume. Since clients fetch many HLS/DASH chunks per session, clarifying that chunk traffic is absorbed by the CDN while the backend only handles manifest/metadata/like calls would make the capacity chain from viewers to each subsystem more explicit.
info Database sizing is justified qualitatively but not quantitatively
You mention that Postgres is enough because read plus write QPS is under 2000, but you could strengthen this by translating that into rough instance count, storage growth, and headroom under peak. The conclusion may be reasonable, but a quick capacity-to-infra mapping would make the argument more convincing.
✅ Good Core user flows are mostly covered
The routes cover the main functional requirements: fetching video metadata for playback, uploading a video via a presigned URL flow, and liking a video. Even though the route set is lightweight, a client can complete the primary product flows through these APIs.
✅ Good Appropriate protocol choice for streaming
Using HTTP-based HLS/DASH for playback is a solid choice for internet-scale video delivery because it works well with CDNs, adaptive bitrate streaming, and standard clients. That is a better fit here than trying to stream video through a custom API.
⭐ Excellent Playback gating by processing status is explicitly handled
The explanation addresses an important API edge case: a video may exist before it is playable. Returning metadata with status and withholding the manifest URL until processing completes gives the client a clear contract and avoids broken playback attempts.
warning Like API is missing user-facing idempotency and state semantics
Have you considered what happens when the client retries POST /likes/{video-id} because of a timeout or flaky mobile network? Without a clear contract for whether this is 'set my reaction' versus 'append an event', the client cannot know if a retry will double-apply, overwrite, or race with a dislike. You could improve this by making the endpoint explicitly represent the caller's current reaction, for example PUT /videos/{videoId}/like with a body like {"reaction":"like|dislike|none"}, and documenting idempotent retry behavior.
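A sketch of the "set my reaction" semantics suggested above, assuming the server stores one current reaction per (video_id, user_id) and returns the change to the aggregate counts. The function name and the delta shape are hypothetical.

```python
from typing import Dict, Tuple

VALID_REACTIONS = ("like", "dislike", "none")


def set_reaction(
    state: Dict[Tuple[str, str], str], video_id: str, user_id: str, reaction: str
) -> Dict[str, int]:
    """Set the caller's current reaction and return the change to aggregate counts."""
    if reaction not in VALID_REACTIONS:
        raise ValueError(f"unknown reaction: {reaction}")
    key = (video_id, user_id)
    previous = state.get(key, "none")
    delta = {"like": 0, "dislike": 0}
    if previous == reaction:
        return delta  # retry or no-op: nothing changes
    if previous != "none":
        delta[previous] -= 1
    if reaction == "none":
        state.pop(key, None)
    else:
        delta[reaction] += 1
        state[key] = reaction
    return delta
```

A retried request with the same body returns a zero delta, so a client can safely resend after a timeout.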
warning Error model is underspecified for upload and playback edge cases
What does the client see when upload initialization fails, a presigned URL expires mid-upload, the video is still processing, or the video ID does not exist? The explanation mentions a 409-style processing error, but the API contract does not consistently define status codes or error shapes across routes. At this scale, clients need predictable responses such as 404 for missing videos, 409 for not-yet-playable videos, 401/403 for unauthorized actions if applicable, and a structured error body with retry guidance.
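A consistent error contract could be modelled as a small catalogue shared by all routes. The status codes follow the ones named above; the error codes, messages, body shape, and retry interval are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiError:
    http_status: int
    code: str
    message: str
    retry_after_seconds: Optional[int] = None  # retry guidance, when retrying makes sense


ERRORS = {
    "video_not_found": ApiError(404, "VideoNotFound", "No video exists with this ID."),
    "video_processing": ApiError(409, "VideoProcessing", "The video is not playable yet.", 30),
    "unauthenticated": ApiError(401, "Unauthenticated", "Missing or invalid credentials."),
    "forbidden": ApiError(403, "Forbidden", "The caller may not perform this action."),
}


def error_body(key: str) -> dict:
    """Render the same JSON-ready error shape for any route."""
    err = ERRORS[key]
    body = {"error": {"code": err.code, "message": err.message}}
    if err.retry_after_seconds is not None:
        body["error"]["retryAfterSeconds"] = err.retry_after_seconds
    return body
```

Keeping the catalogue in one place means every route returns the same shape, and clients can branch on `code` instead of parsing messages.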
warning Upload flow is not cleanly modeled as an API contract
Have you considered how the client knows when a multipart upload is complete and when encoding should start? POST /videos returns a presigned URL, but PUT /presigendURL is not really your platform API and does not show the completion step back to your service. Without an explicit finalize/complete upload call or callback contract, the client-facing flow is ambiguous. You could improve this by modeling it as create upload session -> upload parts to blob -> complete upload session on your API.
info Resource design could be more consistent
You could improve this by making the routes more resource-oriented and easier to reason about, for example /videos/{videoId}, /videos/{videoId}/likes, and possibly /videos/{videoId}/upload-session. Right now likes are modeled as a top-level resource while video metadata is nested under videos, which makes the API feel less cohesive.
info Read/write semantics for likes are a bit unclear
What exactly does GET /likes/{video-id} return: aggregate counts, the current user's reaction, or both? A client rendering a like button usually needs both the total count and whether the viewer has already liked or disliked. You could improve this by defining the response shape more clearly so the client does not need extra round trips or guess at semantics.
⭐ Excellent Streaming path is correctly offloaded to CDN/object storage
The design keeps the hot video delivery path away from the application tier: metadata is fetched from Video Service, while actual HLS segments are served from CDN backed by blob storage. At the stated 2 Tbps peak bandwidth, this is the right architectural split and shows good awareness that app servers cannot sit in the media path.
✅ Good Upload and encoding pipeline is asynchronous
The candidate separates upload from heavy post-processing by issuing a presigned URL, storing raw video in blob storage, and triggering encoding through Kafka and worker nodes. That avoids blocking the user on transcoding and matches the requirement for scalable uploads with chunking/segmentation.
✅ Good Read-heavy metadata and likes paths use cache
Using Redis for video metadata and like counters is a sensible optimization for the stated read rates. The explanation also makes the fallback path explicit: cache miss goes to Postgres replica for metadata, and like counters are periodically rebuilt from Cassandra via Flink.
✅ Good Processing state is modeled explicitly
The explanation around UPLOADING, PROCESSING, COMPLETED, and FAILED states closes an important end-to-end gap: clients do not attempt playback until encoding finishes, and the service gates manifest access based on status. That is a solid design detail for correctness.
critical Like ingestion path appears to depend on Cassandra as the first write hop
What happens when Cassandra is slow or partially unavailable during a like spike? The current flow writes likes synchronously into Cassandra and only later aggregates to Redis. That means user-facing like requests would fail or back up behind the database. For an availability-first system, I would expect a more resilient write path, such as durable buffering through Kafka before async aggregation, or at least a clearly defined timeout/degradation strategy.
warning Postgres failover story is too weak for a 99.99% metadata service
Have you considered what happens when the primary Postgres node fails during video upload or metadata updates? The explanation mentions a single replica failover strategy, but the design does not show how failover is detected, how writes are rerouted, or what happens to in-flight metadata mutations. Since every upload and playback bootstrap depends on metadata, this is a key SPOF unless failover is automated and fast.
warning Redis is on the critical read path without a clear degradation plan under cache loss
What happens if the Redis cluster is unavailable or cold after failover? The explanation says the system falls back to the database, but both metadata reads and like reads would suddenly hit Postgres/Cassandra directly. At 10M DAU, even if average QPS looks manageable, cache loss can create a thundering herd on the backing stores. You could improve this by describing cache warmup, request coalescing, and per-key TTL/jitter to avoid stampedes.
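Two of the mitigations named above, TTL jitter and request coalescing, can be sketched in a few lines. This is an in-process illustration using threads; the class name `SingleFlight` and the jitter fraction are assumptions, and a production system would typically coalesce per service instance in front of the backing store.

```python
import random
import threading
from typing import Callable, Dict, Optional


def jittered_ttl(
    base_seconds: float, jitter_fraction: float = 0.1, rng: Optional[random.Random] = None
) -> float:
    """Spread expirations over base +/- jitter so hot keys do not expire together."""
    rng = rng or random
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)


class SingleFlight:
    """Collapse concurrent loads of the same key into one backing-store call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[str, dict] = {}

    def load(self, key: str, loader: Callable[[str], object]) -> object:
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._in_flight[key] = call
        if leader:
            try:
                call["value"] = loader(key)
            except Exception as exc:  # followers observe the same failure
                call["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call["done"].set()
        else:
            call["done"].wait()  # wait for the leader's result
        if call["error"] is not None:
            raise call["error"]
        return call["value"]
```

When many requests miss the cache for the same video at once, only the first caller reaches the database; the rest wait and reuse its result.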
warning Encoding completion and metadata update flow is underspecified under worker failure
Have you considered what happens if an encoding worker uploads only some renditions or crashes after writing segments but before the processing server updates Postgres? The explanation says workers commit only after completion and are deduped, which is good, but the HLD still leaves open how partial outputs are detected, retried, and cleaned up so clients never see incomplete manifests or stale PROCESSING states.
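One possible answer is a finalize step that flips the status only when every required rendition is present and that is safe to repeat after a crash. The rendition ladder, field names, and function names below are assumptions for illustration.

```python
from typing import Iterable, Set

REQUIRED_RENDITIONS = ("360p", "720p", "1080p")  # assumed ladder, for illustration only


def missing_renditions(
    uploaded: Iterable[str], required: Iterable[str] = REQUIRED_RENDITIONS
) -> Set[str]:
    """Renditions that still have to be produced before the video is playable."""
    return set(required) - set(uploaded)


def finalize_processing(video: dict, uploaded: Iterable[str], manifest_url: str) -> dict:
    """Flip to COMPLETED only when every required rendition exists.

    Safe to call again after a worker crash: a finished video is returned unchanged.
    """
    if video["status"] == "COMPLETED":
        return video
    missing = missing_renditions(uploaded)
    if missing:
        return {**video, "status": "PROCESSING", "missing": sorted(missing)}
    done = {k: v for k, v in video.items() if k != "missing"}
    return {**done, "status": "COMPLETED", "manifest_url": manifest_url}
```

A worker that picks up the message after a crash can call the same function: partial output keeps the video in PROCESSING with an explicit list of what to retry, and the manifest URL is attached only on the final transition.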
info Some components are loosely connected or ambiguously placed in the request flow
You could improve the HLD by making a few flows more explicit. For example, the separate 'Client web/app' and 'User' nodes make the end-to-end path harder to trace, and the HLS adaptation box is shown as a service even though the explanation says adaptation happens on the client. Tightening these connections would make it clearer which logic is client-side versus server-side and reduce ambiguity around orphaned components.
info CDN miss and origin protection strategy is not called out
You could strengthen the design by explaining what happens during a cache miss storm on a newly popular video. Without origin shielding, request collapsing, or prewarming for hot content, the blob store can become the first bottleneck when many viewers request the same manifest and early segments at once.