
Spotify / Music Streaming — system design by AgileViper46

Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.


This is my high-level diagram. Users interact with my load balancer/API, which performs load balancing, rate limiting, and auth for me. Requests are then load balanced across my song service and the playlist service. Both of them talk to PostgreSQL, which stores the song metadata. The actual song data, the blob chunks, is stored in Azure Blob. The song service can provide a presigned URL (SAS URL) to the user, and with that URL the user can download or upload songs directly from or to the blob.

Deep Dive

Now coming to the deep dive. For adaptive bitrate streaming, whenever a user wants to play a song, the request first goes to my CDN servers. Here is how it works: when you want to play a song, say song 123, the client requests the master file for it, which has the information about all the segments in that song, and then downloads each segment one after another. The adaptive part works like this: the client monitors network activity on the user's mobile device or web app and figures out the best bitrate for the music, for example 128 kbps or 256 kbps, and requests higher or lower quality depending on the bandwidth, round-trip time, latency, and so on. Once it has the master file, it starts asking for the next set of chunks. If those chunks are not available in the CDN, the CDN pulls them from Azure Blob and then serves the user. This ensures the data is cached in the CDN on the first request, and the song information and chunks are kept as close to the users as possible across multiple CDN servers.

For song uploads, users communicate with the song service. The song service creates an entry in SQL for the song's metadata, marks its status as created, creates a presigned URL in Azure Blob for it, and returns that URL to the user. The user then uploads the song into Azure Blob part by part. I'm not going into detail because this is a blob-store design where you track part-by-part progress, and once every part is completed we say the song is completely uploaded. This helps with resumable upload, download, and so on. Once that is done, the client makes a final call saying the upload has been completed, and the song service marks the song as uploaded.

After the song has been uploaded, it is stitched together in Azure Blob as a complete song. Then we have a transcoding and encoding layer, which compresses it, breaks it into multiple chunks for different bitrates, and stores the chunks in the blob. Once transcoding and encoding are complete, we can say the song is ready to be streamed. For this we can have a queue: whenever we get an upload-complete status, CDC workers put an event into Kafka, and the encoding servers take the event and complete the encoding. Once the encoding is complete, they put another event into Kafka saying that encoding is complete, and the song service can mark the status as completed.
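The upload lifecycle described above (created, then uploaded, then completed) can be sketched as a small state machine. This is a minimal illustration, not the candidate's implementation: the names `SongStatus`, `ALLOWED`, and `apply_event` are hypothetical, and treating a repeated event as a no-op is an assumption added here because queue consumers such as Kafka workers may see the same event more than once.

```python
from enum import Enum


class SongStatus(Enum):
    CREATED = "created"      # metadata row exists, presigned URL issued
    UPLOADED = "uploaded"    # all parts are in blob storage
    COMPLETED = "completed"  # encoding finished, song can be streamed


# Allowed forward transitions; anything else is rejected.
ALLOWED = {
    SongStatus.CREATED: {SongStatus.UPLOADED},
    SongStatus.UPLOADED: {SongStatus.COMPLETED},
    SongStatus.COMPLETED: set(),
}


def apply_event(current: SongStatus, target: SongStatus) -> SongStatus:
    """Return the new status, treating duplicate events as no-ops."""
    if target == current:
        return current  # same event delivered twice: safe to ignore
    if target in ALLOWED[current]:
        return target
    raise ValueError(f"illegal transition {current.value} -> {target.value}")
```

Rejecting skipped steps (for example created straight to completed) keeps a song from being marked streamable before its parts have been confirmed in the blob store.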
Now coming to the playlist side. We would have a playlist service that creates playlists directly in PostgreSQL as metadata; a playlist just holds its name, references to the songs it contains, and related information. The playlist service is responsible for creating, updating, and deleting playlists.

To scale the system on the search side, I would introduce an Elasticsearch cluster. This cluster is kept in sync with our database using CDC workers: whenever song data or playlist information changes, the change is captured and applied to Elasticsearch. Our search service is responsible for handling the search routes for playlists and songs. One thing to call out explicitly is that this is eventually consistent, and that is fine for the search service because we want low latency and high availability over consistent search. If a song is uploaded and only becomes visible after 3 to 5 seconds, that is not an issue for this system. Similarly, if a playlist has been created but is not searchable for 5 to 10 seconds, that is not a big deal for now.

Another bottleneck here is that a single PostgreSQL instance cannot handle 1 billion songs with multiple playlists and everything else.
To handle this, the Postgres primary needs to be sharded so that writes can be distributed. For every Postgres primary I'm going to add two Postgres replicas for backup, and reads can happen from the replicas. This way we can scale our reads as well. I'm also going to introduce a Redis cache cluster, which would help serve frequently requested song metadata and playlist metadata.

There are a couple of things we need to take care of here. The song data generally does not change much: once a song is created and uploaded, it stays as it is for its lifetime, so we can have a long TTL for it. But for the metadata, or for any hot song that is being streamed the most, if the entry expires in the CDN and all the requests then try to pull the same data from our blob, we get a thundering herd problem. So I am going to do request coalescing, where only one request goes to the backend while the others wait. We can also do jitter-based TTL expiry so that every chunk and every song does not expire at the same time; each expires at a random interval, so fetches do not turn into bursty traffic against the backend.

Now coming to offline download. The download happens directly from the CDN. It is the same as streaming the music: we do it in parts, downloading chunk by chunk. The difference is that when it is requested as a download, we can fetch the highest-bitrate version of the song. The song service on the hot path is not responsible for downloading the assets. Everything works just like streaming: the client fetches all the chunks, and the client-side application stores the chunks and the header segment file in the phone's or client's file system instead of temporary storage, and plays the song from there.

For PostgreSQL we would be sharding the data. Say we have a songs metadata table and a playlist metadata table; we would shard by song ID. We can hash the song ID and place the data on a consistent hash ring; we can use devisium or other software like that to handle this. When a particular primary goes down, writes are affected only for the song and playlist IDs that live on that shard, and it does not bring the complete system down. We already have two replicas, one of which takes over in a failover. We might see a small interruption on the write path because writes are rejected during that window. This is a read-heavy system with very few writes, and we accept those failures on the write path to maintain consistency. Once the secondary is promoted, everything flows again.
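The shard routing described in the walkthrough (hash the song ID, place it on a consistent ring) could look roughly like the sketch below. It is only an illustration under assumptions: the class name `HashRing`, the shard labels, the use of MD5 as the hash, and the 64 virtual nodes per shard are all choices made here, not details from the design.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # First 8 bytes of MD5 give a stable 64-bit position on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring mapping song IDs to shard names."""

    def __init__(self, shards, vnodes=64):
        # Each shard owns several virtual points to even out the distribution.
        self._points = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def shard_for(self, song_id: str) -> str:
        # Walk clockwise to the first point after the key, wrapping at the end.
        idx = bisect.bisect_right(self._keys, _hash(song_id)) % len(self._keys)
        return self._points[idx][1]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("song-123"))
```

The property that matters for resharding is that adding a shard only moves keys onto the new shard; keys never shuffle between the existing ones.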

Hire Signal: Lean Hire

The candidate demonstrates solid system instincts and chooses several correct large-scale patterns, especially around media delivery, async processing, and search. But the design stops short of the level of completeness and rigor expected from a strong senior candidate: core entity relationships, API contracts, control-plane HA, and metadata scaling details are not fully worked through.

⭐ Excellent

Consistency trade-off is explicitly chosen for search

The candidate clearly states that search is eventually consistent and explains why that is acceptable here: low-latency, high-availability reads matter more than immediate visibility of newly uploaded songs or playlists. That is the right kind of NFR reasoning because it ties a consistency choice to user impact and system goals.

✅ Good

Scalability concerns are connected to the stated scale

They recognize that a single relational primary will not handle 1B songs plus playlist metadata and propose sharding with replicas, while also pushing streaming traffic to CDN/blob storage so the core services are not on the hot path for media delivery. This shows awareness that the 1B-user / 100M-DAU assumption must materially shape the design.

✅ Good

Low-latency streaming and search are called out as first-class quality goals

The explanation does not just mention latency abstractly; it connects low-latency streaming to adaptive chunk delivery via CDN and low-latency search to a dedicated search index. That is a reasonable translation of the stated NFRs into concrete quality attributes.

warning

Availability target is named but not decomposed by critical user journey

You mention 99.9% availability for the system, but what happens when a shard primary fails, Kafka is delayed, or the search cluster is degraded? The explanation discusses some failover behavior, but it never defines which paths must meet 99.9% separately: streaming, search, playlist reads, uploads, and metadata writes have very different failure tolerance. You could improve this by stating per-path availability expectations and what graceful degradation is acceptable for each one.

warning

Latency goals are qualitative, not measurable

You say 'low latency adaptive streaming' and 'low latency search,' but what happens if search takes 2 seconds or stream startup takes 5 seconds? Without concrete targets such as search p95/p99 latency or playback start latency, it is hard to judge whether the design actually meets the NFRs or where to spend complexity. You could improve this by defining measurable SLOs for startup time, chunk fetch latency, and search response time.

warning

Scalability numbers do not fully connect to throughput assumptions

The design references 1B users, 100M DAU, and 10 songs/day, but what happens at peak traffic? Those assumptions imply very large daily and peak read throughput for streaming, metadata lookups, and search, yet the explanation does not translate them into request rates or peak concurrency to justify shard count, replica strategy, cache hit expectations, or search cluster sizing. You could improve this by converting the stated assumptions into peak QPS and using those numbers to defend the NFR choices.

info

Consistency is justified for search, but not for playlists and metadata reads

You explicitly justify eventual consistency for search, which is good, but have you considered what consistency users expect when editing a playlist and immediately reopening or sharing it? Some paths may require read-after-write semantics even if search does not. You could strengthen the design by stating which entities are strongly consistent on direct reads/writes versus eventually consistent in derived views like search.

✅ Good

Core nouns for the main product flows are identified

The design names the three central entities the system revolves around: User, Song, and Playlist. Those cover the primary user-facing flows in the requirements: streaming songs, uploading songs, and creating/sharing playlists.

✅ Good

Song lifecycle is modeled implicitly through status transitions

In the explanation, Song is not treated as a flat blob; it moves through created, uploaded, transcoded, and ready states. That shows useful domain thinking because upload and playback depend on different phases of the same entity.

warning

Uploader/ownership relationship is left unclear

Have you considered how Song connects back to User for the upload flow? Users are allowed to upload music, but the entity model never makes explicit whether a song has an owner/uploader or how that relationship is represented. Without that link, it becomes hard to reason about permissions, attribution, and even basic queries like 'show me songs uploaded by this user.'

warning

Playlist-to-song relationship is underspecified

Have you considered what happens when playlists contain many songs or the same song appears in many playlists? The design says a playlist 'has references' to songs, but it does not clearly define the relationship as many-to-many or how membership is modeled. At this scale, that relationship is a core part of the happy path and should be called out explicitly.
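One way to make the many-to-many relationship explicit is a junction table between playlists and songs. The sketch below is an assumption-heavy illustration: the table and column names are hypothetical, and SQLite is used only so the example runs on its own, whereas the design itself stores this metadata in PostgreSQL.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create songs, playlists, and the junction table linking them."""
    conn.executescript(
        """
        CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE playlists (playlist_id TEXT PRIMARY KEY, name TEXT NOT NULL);
        -- One row per (playlist, slot); the same song may appear in many playlists.
        CREATE TABLE playlist_songs (
            playlist_id TEXT NOT NULL REFERENCES playlists(playlist_id),
            song_id     TEXT NOT NULL REFERENCES songs(song_id),
            position    INTEGER NOT NULL,
            PRIMARY KEY (playlist_id, position)
        );
        -- Supports the reverse lookup: which playlists contain this song?
        CREATE INDEX idx_playlist_songs_song ON playlist_songs(song_id);
        """
    )


def playlist_tracks(conn: sqlite3.Connection, playlist_id: str):
    """Return (song_id, title) pairs for a playlist in playlist order."""
    rows = conn.execute(
        "SELECT s.song_id, s.title FROM playlist_songs ps "
        "JOIN songs s ON s.song_id = ps.song_id "
        "WHERE ps.playlist_id = ? ORDER BY ps.position",
        (playlist_id,),
    )
    return rows.fetchall()
```

Keying membership on `(playlist_id, position)` keeps all rows of one playlist together, which would also be a natural shard key for playlist reads.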

warning

Searchable music metadata is not modeled as part of the domain

Have you considered which entity actually carries the searchable attributes required by the product, such as artist name and song name? Search is a functional requirement, but the core entity section only lists Song without clarifying whether artist/album-style metadata is part of Song or represented separately. Even if you keep Artist as embedded metadata rather than a standalone entity, that relationship should be made explicit so the search flow is grounded in the domain model.

✅ Good

Back-of-envelope traffic sizing is present

The candidate does translate the stated DAU and songs-per-day assumption into daily reads, then into QPS and a peak multiplier. That shows the right capacity-planning instinct: start from user behavior and estimate steady-state plus peak load rather than jumping straight to infrastructure.

✅ Good

Storage estimate includes replication overhead

The storage calculation for 1B songs at 5MB each, followed by 3x replication, is directionally sound. Including replication is important at this scale because raw object size alone would materially understate the footprint.

warning

Read QPS is underspecified for chunked streaming

Have you considered what happens when the chunking model is applied consistently to the peak calculation? You estimate 1B song plays/day, multiply by 10 for chunking, and get ~116K average read QPS, but the stated 500K peak appears to apply the 5x multiplier to song plays rather than chunk requests. If chunked delivery is the real unit of load, peak request volume is materially higher, and that affects CDN miss traffic, blob-store egress, and metadata lookups.
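The arithmetic behind these figures is short enough to write out. The sketch below uses only numbers quoted in this review (100M DAU, 10 songs per user per day, 10 chunks per song, a 5x peak multiplier) and shows song plays and chunk requests side by side, so it is clear which unit a given peak figure refers to. The variable names are illustrative.

```python
DAU = 100_000_000
SONGS_PER_USER_PER_DAY = 10
CHUNKS_PER_SONG = 10
PEAK_MULTIPLIER = 5
SECONDS_PER_DAY = 86_400

plays_per_day = DAU * SONGS_PER_USER_PER_DAY              # 1B plays/day
chunk_requests_per_day = plays_per_day * CHUNKS_PER_SONG  # 10B chunk requests/day

avg_play_qps = plays_per_day / SECONDS_PER_DAY            # about 11.6K/s
avg_chunk_qps = chunk_requests_per_day / SECONDS_PER_DAY  # about 116K/s

peak_play_qps = avg_play_qps * PEAK_MULTIPLIER            # about 58K/s
peak_chunk_qps = avg_chunk_qps * PEAK_MULTIPLIER          # about 579K/s

for name, value in [
    ("avg play QPS", avg_play_qps),
    ("avg chunk QPS", avg_chunk_qps),
    ("peak play QPS", peak_play_qps),
    ("peak chunk QPS", peak_chunk_qps),
]:
    print(f"{name}: {value:,.0f}")
```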

warning

Major write-side capacities are missing

Have you considered what happens if uploads, playlist mutations, and search indexing spike? The design includes user uploads, transcoding, CDC, Kafka, playlist creation, and search sync, but there are no write QPS or ingestion-volume estimates to show whether those pipelines are sized appropriately. At senior level, I would expect at least rough sizing for upload throughput and background processing because those can dominate internal capacity even if end-user reads are CDN-heavy.

warning

Storage growth is only calculated for source audio

Have you considered what happens to storage once transcoding creates multiple bitrate renditions and offline-download variants? The 5PB raw estimate appears to count one 5MB object per song, but the explanation explicitly stores multiple encoded outputs and chunks. That can multiply object count and total bytes significantly, so the current storage number likely understates blob capacity and replication needs.
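A rough sketch of how renditions change the storage number follows. The 1B songs, 5 MB source size, and 3x replication come from the estimate under review, and 128 kbps and 256 kbps are the two bitrates named in the walkthrough. Everything else is an assumption made for illustration: a 3.5-minute average track, keeping the source file alongside the renditions, and ignoring container and manifest overhead.

```python
SONGS = 1_000_000_000
SOURCE_BYTES = 5_000_000                       # 5 MB source object per song
REPLICATION = 3                                # 3x replication
AVG_TRACK_SECONDS = 210                        # ASSUMPTION: 3.5-minute average track
RENDITION_BITRATES_BPS = [128_000, 256_000]    # bitrates named in the walkthrough

# Size of one rendition = bitrate (bits/s) * duration (s) / 8 bits per byte.
rendition_bytes = sum(b * AVG_TRACK_SECONDS // 8 for b in RENDITION_BITRATES_BPS)
per_song_bytes = SOURCE_BYTES + rendition_bytes  # ASSUMPTION: source is retained

source_only_pb = SONGS * SOURCE_BYTES / 1e15     # 5 PB, the original estimate
raw_pb = SONGS * per_song_bytes / 1e15           # about 15.1 PB with renditions
replicated_pb = raw_pb * REPLICATION             # about 45.2 PB after 3x replication

print(f"source only: {source_only_pb:.2f} PB")
print(f"with renditions: {raw_pb:.2f} PB raw, {replicated_pb:.2f} PB replicated")
```

Under these assumptions the footprint is roughly three times the source-only figure, and each extra bitrate or offline variant adds to it.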

warning

No bandwidth or egress sizing for streaming workload

Have you considered what happens to network capacity at peak playback? For a music system, request QPS alone is not enough; bitrate drives CDN and blob egress. Without a rough estimate of average stream bitrate, cache hit rate, and resulting backend egress, it is hard to tell whether the proposed CDN/blob architecture actually fits the stated 100M DAU scale.
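An egress estimate of the kind asked for here might look like the sketch below. Only the 1B plays per day and the 5x peak multiplier come from the design; the average track length, blended bitrate, and CDN hit ratio are placeholder assumptions chosen to show the method, not measured values.

```python
PLAYS_PER_DAY = 1_000_000_000   # 100M DAU x 10 songs/day, from the design
PEAK_MULTIPLIER = 5             # peak factor used in the design
AVG_TRACK_SECONDS = 210         # ASSUMPTION: 3.5-minute average track
AVG_BITRATE_BPS = 160_000       # ASSUMPTION: blended 160 kbps stream
CDN_HIT_RATIO = 0.95            # ASSUMPTION: 95% of bytes served from CDN cache
SECONDS_PER_DAY = 86_400

# Average number of streams playing at any instant.
avg_concurrent_streams = PLAYS_PER_DAY * AVG_TRACK_SECONDS / SECONDS_PER_DAY

avg_edge_gbps = avg_concurrent_streams * AVG_BITRATE_BPS / 1e9
peak_edge_gbps = avg_edge_gbps * PEAK_MULTIPLIER
# Only cache misses reach blob storage.
peak_origin_gbps = peak_edge_gbps * (1 - CDN_HIT_RATIO)

print(f"avg concurrent streams: {avg_concurrent_streams:,.0f}")
print(f"edge egress: {avg_edge_gbps:,.1f} Gbps avg, {peak_edge_gbps:,.1f} Gbps peak")
print(f"origin egress at peak: {peak_origin_gbps:,.1f} Gbps")
```

The origin number is very sensitive to the hit ratio: dropping from 95% to 90% doubles blob egress, which is why the cache-hit assumption deserves to be stated explicitly.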

info

Tie infrastructure choices back to estimated load

You could improve this by explicitly connecting the calculated scale to component choices: for example, why PostgreSQL sharding is sufficient for the expected metadata/write volume, why Kafka is needed for the upload/transcoding event rate, and what cache-hit assumptions justify Redis and CDN offload. The components are plausible, but the capacity argument would be stronger if each one were justified by a rough load estimate.

✅ Good

Core user flows are mostly represented

The routes cover the main functional paths in the prompt: searching songs, fetching song metadata, streaming chunks, creating playlists, uploading songs via presigned URL, and downloading songs for offline use. Even though some flows are loosely specified, a client could roughly execute the primary product actions from these APIs.

✅ Good

Cursor pagination is included on search/list endpoints

Using cursor and limit on /songs and /playlists shows awareness that search results can be large at this scale. That is a better fit than offset pagination for high-volume catalogs and public playlists.

✅ Good

Direct blob upload/download via presigned URL is a sensible API boundary

Returning a presigned URL for uploads keeps large media transfer off the application servers and matches the candidate's explanation that clients should interact directly with blob storage/CDN for heavy content transfer.

warning

Playlist APIs do not fully support the stated create/share flow

Have you considered how a client updates playlist contents or metadata after creation? Right now there is POST, GET, and DELETE, but no clear update route such as PATCH /playlists/{id} or song-level add/remove operations. Without that, playlist management is incomplete for a real user flow.

warning

Upload flow is missing the completion/finalization API described in the explanation

What happens after the client uploads the file to the presigned URL? In your walkthrough, the song moves through states like created, uploaded, and encoded, but the route list has no finalize/complete endpoint or status endpoint. Without that, the client does not know how to tell the system the upload finished or how to poll until the song is ready to stream.

warning

Streaming protocol is only partially specified

Have you considered how the client discovers the manifest/master playlist before requesting chunks? The explanation talks about requesting a master file and then adaptive bitrate chunks, but the API list only shows GET /songs/{songId}/{variant}/{chunkId}. Without a manifest endpoint or a clear contract on how variant and chunk IDs are discovered, the streaming flow is underspecified.
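To make the missing contract concrete, here is a minimal sketch of what a client could do once it has a master file listing the available variants. The `Variant` type, the `playlist_url` field, the example URLs, and the 0.8 safety factor are all hypothetical; the design does not specify a manifest format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    bitrate_kbps: int
    playlist_url: str  # hypothetical URL of the per-bitrate chunk list


def pick_variant(variants, measured_kbps, safety=0.8):
    """Pick the highest bitrate that fits within a safety margin of bandwidth."""
    budget = measured_kbps * safety
    ordered = sorted(variants, key=lambda v: v.bitrate_kbps)
    fitting = [v for v in ordered if v.bitrate_kbps <= budget]
    # Fall back to the lowest bitrate when nothing fits the budget.
    return fitting[-1] if fitting else ordered[0]


# Example master-file contents for one song (URLs are placeholders).
master = [
    Variant(128, "https://cdn.example/songs/123/128k/index"),
    Variant(256, "https://cdn.example/songs/123/256k/index"),
]
print(pick_variant(master, measured_kbps=400).bitrate_kbps)
```

Because the client only follows URLs found in the master file, the server can change chunk layout or variant naming without breaking clients.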

warning

Resource design mixes storage details into public URLs

What happens if your chunking scheme or variant naming changes? Exposing /songs/{songId}/{variant}/{chunkId} tightly couples clients to internal media layout. A cleaner contract would return a manifest or signed media URLs so the server can evolve chunk structure without breaking clients.

warning

HTTP method and route semantics are inconsistent

Have you considered the semantics of PUT /preSignedUrl and GET /songs/{id}/download? PUT /preSignedUrl is not really updating a presigned-url resource; it appears to upload media bytes to object storage. Similarly, GET /songs/{id}/download likely needs to return a redirect or signed URL rather than the file itself. The current route naming makes it unclear what the client should actually send and receive.

warning

Error handling and retry behavior are not defined

What does the client see when search is rate limited, a presigned URL expires, an upload part fails midway, or a song is not yet encoded? Without status codes, error shapes, and retry guidance, clients cannot reliably recover from common failures. At this level, I would expect at least a consistent error contract and explicit handling for transient vs terminal failures.

info

Download endpoint could be made more explicit

You could improve this by defining whether offline download returns a signed manifest, a highest-bitrate asset, or chunk URLs. Right now the explanation says download reuses the streaming path through CDN, but the route list suggests a separate GET /songs/{id}/download flow without clarifying the response contract.

⭐ Excellent

Streaming path is offloaded to CDN and blob storage

The design keeps the hot media delivery path away from the application tier: clients fetch manifests/chunks from the CDN, which falls back to blob storage on cache miss. That is the right high-level shape for adaptive streaming at this scale because the API tier is not in the data path for every chunk request.

✅ Good

Upload and transcoding are decoupled asynchronously

Using presigned upload URLs plus Kafka-driven encoding avoids blocking the user on heavy media processing. The candidate also models state transitions like created/uploaded/encoded, which makes the end-to-end upload-to-stream flow coherent.

✅ Good

Search is separated from the transactional store

Routing search to Elasticsearch and feeding it via CDC is a sensible trade-off for low-latency search over a large song catalog. The candidate explicitly calls out eventual consistency as acceptable for search, which shows awareness of the trade-off.

✅ Good

Candidate identified cache stampede risk on hot content

Calling out request coalescing and jittered TTLs for hot metadata/chunk expiry shows useful operational thinking beyond just adding a cache box.
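Both techniques are small enough to sketch. The code below is an in-process illustration only: `SingleFlight` and `jittered_ttl` are hypothetical names, the 10% spread is an assumed default, and a real deployment would apply the same ideas at the CDN or cache tier rather than inside one Python process.

```python
import random
import threading


class SingleFlight:
    """Collapse concurrent loads of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result holder)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = loader()  # the only call that hits the backend
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # wake every waiting follower
        else:
            done.wait()  # followers reuse the leader's result
        if "error" in holder:
            raise holder["error"]
        return holder["value"]


def jittered_ttl(base_seconds, spread=0.1, rng=random):
    """Return the base TTL scaled by a random factor in [1 - spread, 1 + spread]."""
    return base_seconds * rng.uniform(1 - spread, 1 + spread)
```

Coalescing handles many readers missing on the same key at once, while jitter keeps different keys from expiring in the same instant; they address separate halves of the stampede problem.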

warning

Primary database is still a major bottleneck in the current diagram

You mention sharding in the explanation, but the diagram still shows a single Postgres cluster serving song metadata, playlist metadata, CDC, and replication. At 1B songs and large playlist traffic, what happens when that single logical cluster becomes the coordination bottleneck for writes, failover, and CDC fan-out? The design would be stronger if the sharded metadata layer were made explicit, including how Song Service and Playlist Service route to shards.

warning

Playlist and song metadata are mixed into one storage path without clear partitioning

Have you considered what happens when playlist growth and song catalog growth stress the same Postgres cluster differently? Public playlists can become very hot and have very different access patterns from song metadata. Without a clear separation of tables/services/shards for songs versus playlists, one workload can interfere with the other.

warning

Read path for metadata is not fully traced end-to-end

The design shows Song Service and Playlist Service writing to Redis, but it does not show how reads use Redis first and fall back to the database. What happens on a cache miss for GET /songs/{id} or GET /playlists/{id}? The system likely works, but the hot metadata read path is underspecified in the HLD, which matters at this scale.
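A cache-aside read path for metadata could be traced as in the sketch below. It is an assumption about how the reads might work, not something shown in the diagram: the `CacheAside` class is hypothetical, an in-memory dict stands in for Redis, and `load_from_db` stands in for a replica query.

```python
import time


class CacheAside:
    """Read-through helper: check the cache first, fall back to the database."""

    def __init__(self, load_from_db, ttl_seconds=300, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value); stand-in for Redis

    def get(self, key):
        hit = self._entries.get(key)
        if hit and hit[0] > self._clock():
            return hit[1]  # cache hit: the database is not touched
        value = self._load(key)  # cache miss or expired entry
        if value is not None:
            # Populate the cache so the next read is served from memory.
            self._entries[key] = (self._clock() + self._ttl, value)
        return value
```

Missing rows are deliberately not cached here, so a lookup for a song that does not exist always reaches the database; whether to cache negative results is a separate design decision.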

warning

Offline download and streaming share the same asset path without any control point

You reuse the CDN path for offline download, which is directionally fine, but what happens when large download bursts compete with interactive streaming traffic for the same cached objects and origin bandwidth? Without a separate policy or traffic shaping layer, download-heavy behavior can degrade the low-latency streaming requirement.

info

Replica nodes are orphaned in the current design

The two Replica nodes are drawn from Postgres, but no service is shown reading from them. You could improve this by explicitly showing which read paths use replicas versus cache versus primary, otherwise they look unused in the HLD.

info

Duplicate CDC components make the flow harder to reason about

There are two CDC nodes, one feeding Kafka and one feeding Elasticsearch, but they are not clearly differentiated. You could improve this by labeling them by purpose or showing a single CDC pipeline with multiple sinks so the failure and consistency model is easier to understand.

warning

Failure handling around encoding completion is thin

What happens when encoding succeeds for some variants but the completion event to Song Service is delayed or lost? The song could exist in blob storage but remain unavailable or partially available from the metadata perspective. The design would benefit from a clearer source of truth for asset readiness and idempotent completion handling.

warning

Single points of failure are not fully addressed for critical control-plane services

You discuss DB replica failover, but what happens if Song Service, Playlist Service, Search Service, Redis, or Kafka has a node failure or partition? The HLD implies clustered components in some places, but the HA story is only explicit for Postgres. At 99.9% availability, the control plane should not depend on a single instance of these services.
