Dropbox / File Storage — system design by AgileViper46

Hire

Reviewed by 6 specialized AI reviewers. Explore the diagram and the full per-section feedback below.

The client will chunk the file into 5-10MB pieces and calculate a fingerprint for each chunk. It will also calculate a fingerprint for the entire file, which is used to check for duplicates and resumability. The client will send a request to check if a file with the same fingerprint already exists for this user. If it does and has a status of "uploading", the client can resume the upload by fetching the existing chunk statuses.

If the file does not exist, the client will POST a request to initiate a multipart upload. The backend will call S3's CreateMultipartUpload API to get an uploadId, generate presigned URLs for each part, save the file metadata in the FileMetadata table with a status of "uploading", and return the uploadId along with presigned URLs for each chunk.

The client will then upload each chunk to S3 using its corresponding presigned URL (each part requires its own presigned URL with the uploadId and partNumber). After each chunk is uploaded, the client sends a PATCH request to our backend with the chunk status and ETag. Our backend can then verify the chunk uploads with S3's ListParts API before updating the chunks field in the FileMetadata table to mark the chunk as "uploaded".

Once all chunks in our chunks array are marked as "uploaded", the backend calls S3's CompleteMultipartUpload API with the list of part numbers and ETags. This tells S3 to assemble all the parts into a single object. Only after S3 confirms successful assembly does the backend update the FileMetadata table to mark the file as "uploaded".

2) How can we make uploads, downloads, and syncing as fast as possible?

We've already touched on a few ways to speed up both download and upload respectively, but there is still more we can do to make the system as fast as possible.

To recap, for download we used a CDN to cache the file closer to the user. This made it so that the file doesn't have to travel as far to get to the user, reducing latency and speeding up download times.
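The client-side chunking and fingerprinting step that the upload flow relies on could look roughly like the sketch below. It is an illustration under stated assumptions, not the candidate's implementation: SHA-256 as the fingerprint, a fixed 5 MiB chunk size, and the names `ChunkInfo`, `fingerprint_chunks`, and `chunks_to_upload` are all hypothetical. It also works on in-memory bytes for brevity, whereas a real client would stream from disk.

```python
import hashlib
from dataclasses import dataclass

CHUNK_SIZE = 5 * 1024 * 1024  # assumed: 5 MiB, the low end of the 5-10MB range


@dataclass
class ChunkInfo:
    part_number: int  # 1-based, matching multipart part numbering
    offset: int
    size: int
    fingerprint: str  # hex SHA-256 of the chunk bytes (assumed hash choice)


def fingerprint_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks; return (file_fingerprint, chunks)."""
    file_hash = hashlib.sha256()
    chunks = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        piece = data[offset:offset + chunk_size]
        file_hash.update(piece)
        chunks.append(
            ChunkInfo(index + 1, offset, len(piece), hashlib.sha256(piece).hexdigest())
        )
    return file_hash.hexdigest(), chunks


def chunks_to_upload(local_chunks, remote_status):
    """Resume helper: keep only chunks the server has not marked 'uploaded'.

    remote_status maps chunk fingerprint -> status string, as a hypothetical
    shape for the chunk statuses fetched from the backend.
    """
    return [c for c in local_chunks if remote_status.get(c.fingerprint) != "uploaded"]
```

On resume, the client would fetch the stored chunk statuses for the file fingerprint and pass them to `chunks_to_upload`, so only the missing parts are sent again.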
For upload, chunking, beyond being useful for resumable uploads, also plays a significant role in speeding up the upload process. While bandwidth is fixed (put another way, the pipe is only so big), we can use chunking to make the most of the bandwidth we have. By sending multiple chunks in parallel, and utilizing adaptive chunk sizes based on network conditions, we can maximize the use of available bandwidth.

The same chunking approach can be used for syncing files. When a file changes, we only need to sync the chunks that actually changed rather than the entire file, making syncing much faster.

Beyond what we've already discussed, we can also use compression to speed up both uploads and downloads. Compression reduces the size of the file, which means fewer bytes need to be transferred. Since we're uploading directly to S3, compression happens entirely on the client side: the client compresses the file before uploading, and the compressed data is stored in S3 as-is. When downloading, the client decompresses the file after retrieving it. This keeps our backend out of the data path while still benefiting from reduced transfer sizes.

We'll need to be smart about when we compress though. Compression is only useful if the time saved by transferring fewer bytes outweighs the time it takes to compress and decompress the file. For some file types, particularly media files like images and videos, the compression ratio is so low that it's not worth the time it takes to compress and decompress the file. If you take a .png off your computer right now and compress it, you'll be lucky to have decreased the file size by more than a few percent, so it's not worth it. For text files, on the other hand, the compression ratio is much higher and, depending on network conditions, it may very well be worth it. A 5GB text file could compress down to 1GB or even less depending on the content.
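One way to picture the compression trade-off above is a client-side check that skips formats that are already compressed and otherwise trial-compresses a small sample. This is only a sketch: the extension list, the 64 KiB sample size, the 10% savings threshold, and the function names are assumptions for illustration, and it leaves out network conditions entirely.

```python
import os
import zlib

# Assumed list of formats that are already compressed; extend as needed.
ALREADY_COMPRESSED = {".png", ".jpg", ".jpeg", ".gif", ".mp4", ".mov", ".zip", ".gz"}
SAMPLE_BYTES = 64 * 1024  # assumed sample size
MIN_SAVINGS = 0.10        # assumed threshold: require at least 10% reduction


def estimated_savings(sample: bytes) -> float:
    """Fraction of bytes saved by compressing the sample (0.0 for empty input)."""
    if not sample:
        return 0.0
    compressed = zlib.compress(sample, 1)  # level 1: fast, good enough to estimate
    return max(0.0, 1.0 - len(compressed) / len(sample))


def should_compress(filename: str, sample: bytes) -> bool:
    """Decide from the file extension and a trial compression of a sample."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in ALREADY_COMPRESSED:
        return False
    return estimated_savings(sample[:SAMPLE_BYTES]) >= MIN_SAVINGS
```

A fuller version would also weigh measured upload bandwidth against local compression speed, since on a very fast link even a good ratio may not pay for the CPU time.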
In the end, you'll want to implement logic on the client that decides whether or not to compress the file before uploading it based on the file type, size, and network conditions.

3) How can you ensure file security?

Security is a critical aspect of any file storage system. We need to ensure that files are secure and only accessible to authorized users.

Encryption in Transit: Sure, to most candidates, this is a no-brainer. We should use HTTPS to encrypt the data as it's transferred between the client and the server. This is a standard practice and is supported by all modern web browsers.

Encryption at Rest: We should also encrypt the files when they are stored in S3. This is a feature of S3 and is easy to enable. When a file is uploaded to S3, we can specify that it should be encrypted. S3 will then encrypt the file using a unique key and store the key separately from the file. This way, even if someone gains access to the file, they won't be able to decrypt it without the key. See the S3 encryption documentation for details.

Access Control: Our shareList or separate share table/cache is our basic ACL. As discussed earlier, we make sure that we share download links only with authorized users. But what happens if an authorized user shares a download link with an unauthorized user? For example, an authorized user may, intentionally or unintentionally, post a download link to a public forum or social media, and we need to make sure that unauthorized users cannot download the file.

This is where the signed URLs we talked about earlier come back into play. When a user requests a download link, we generate a signed URL that is only valid for a short period of time (e.g. 5 minutes). This signed URL is then sent to the user, who can use it to download the file. It's worth noting that signed URLs are bearer tokens: anyone with a valid, unexpired URL can download the file. The short expiration window limits the exposure, but doesn't fully prevent sharing.
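The expiry-bound signed URL idea can be sketched as follows. This is a generic HMAC illustration with a shared secret, not the signing scheme S3 or any CDN actually uses; `SECRET_KEY`, `sign_url`, `verify_url`, and the `expires`/`sig` query parameters are all hypothetical names chosen for the example.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET_KEY = b"demo-secret"  # hypothetical; a real key would come from a secret store


def _signature(path: str, expires: int) -> str:
    """HMAC-SHA256 over the path and expiry, so neither can be altered."""
    message = f"{path}?expires={expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int = 300, now=None) -> str:
    """Return the path with an expiry timestamp and signature appended."""
    issued_at = now if now is not None else time.time()
    expires = int(issued_at) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now=None) -> bool:
    """Accept only unexpired URLs whose signature matches the path and expiry."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        supplied = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False
    expected = _signature(parts.path, expires)
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

Because the signature covers both the path and the expiry, changing either one invalidates the URL. Anyone holding an unexpired URL still gets access, which is the bearer-token caveat noted above.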
For higher security scenarios, you could add additional restrictions like IP binding or require the signed URL to be used in conjunction with authentication cookies. Signed URLs also work with modern CDNs like CloudFront and are a feature of S3. Here is how:

Generation: A signed URL is generated on the server, including a signature that typically incorporates the URL path, an expiration timestamp, and possibly other restrictions (like IP address). For CloudFront, this signature is created using the content provider's private key.

Distribution: The signed URL is distributed to an authorized user, who can use it to access the specified resource directly from the CDN.

Validation: When the CDN receives a request with a signed URL, it verifies the signature using the corresponding public key (which was registered with CloudFront), then checks the expiration timestamp and any other restrictions. If the signature is valid and the URL has not expired, the CDN serves the requested content. If not, it denies access.

The Postgres SQL database will run as a cluster with multiple write primaries, each backed by at least two replicas, so reads can go to the replicas while writes go to the primary. The data can be partitioned and sharded by file ID or user ID, depending on the table. This should let Postgres handle the load of one billion total users: the data volume is large, but the QPS is not that high. We can also introduce Redis between the file service and the SQL database to serve file metadata on the GET path. The sync service can also make use of that same GET path.
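The sharding and read/write split described in the last paragraph could be routed along the lines of the sketch below. It is a minimal illustration, not the candidate's design: hash-modulo placement, round-robin replica selection, and the names `Shard`, `shard_index`, and `route` are assumptions, and the node names are placeholders.

```python
import hashlib
import itertools


class Shard:
    """One write primary plus its read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin over replicas

    def node_for(self, is_write: bool) -> str:
        # Writes always go to the primary; reads rotate across replicas.
        return self.primary if is_write else next(self._replicas)


def shard_index(key: str, shard_count: int) -> int:
    """Stable hash of the shard key (user ID or file ID) onto a shard number."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(shards, key: str, is_write: bool) -> str:
    """Pick the node that should serve a request for the given shard key."""
    return shards[shard_index(key, len(shards))].node_for(is_write)
```

Plain modulo placement moves most keys when the shard count changes, so a real deployment would more likely use a lookup table or consistent hashing. Reads served from replicas can also be slightly stale, which is consistent with the stated preference for availability over consistency.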

Hire Signal: Lean Hire

The candidate demonstrates good senior-level instincts on the core architecture, transfer path, and security model, and the design would likely work for the main file storage flows. However, there are meaningful gaps in control-plane completeness and consistency handling around sync, metadata authority, and API contracts that keep it from a clear hire at this level.

✅ Good

Availability is explicitly prioritized over consistency

The candidate does identify a core quality trade-off up front and aligns parts of the design to it, such as direct-to-object-store transfers, CDN-backed downloads, and resumable multipart upload flows that reduce dependence on a single backend hop.

✅ Good

Low-latency and reliability goals are tied to concrete mechanisms

The NFRs around fast and reliable transfer are not left abstract: chunking, parallel part upload, resumability, adaptive chunk sizing, and CDN usage are all concrete choices that directly support large-file transfer performance and recovery from interrupted uploads/downloads.

✅ Good

Security requirements are backed by specific controls

For the 'secure and reliable storage and sharing' goal, the explanation names concrete protections such as TLS in transit, encryption at rest, ACL-based authorization, and short-lived signed URLs, which is stronger than simply stating that the system is secure.

warning

NFRs are mostly qualitative and lack measurable targets

Have you considered what happens when the team needs to decide whether the system is meeting its goals? 'Low latency', 'reliable', and 'availability > consistency' are directionally useful, but without concrete targets like upload initiation latency, download p95/p99, resumability success rate, or availability SLOs, it is hard to validate the design or make trade-offs under load.

warning

Consistency model is stated but not justified for metadata and sharing flows

Have you considered what happens if metadata or ACL updates are stale while availability is prioritized? For example, if a file is deleted or unshared on one device, can another device still see or download it briefly? The design says availability is more important than consistency, but it does not spell out which operations are eventual versus strong, or what user-visible anomalies are acceptable.

warning

Numbers do not connect back to the stated scale assumptions

Have you considered whether the NFR choices still hold for 1B total users, 10M DAU, and up to 50GB files? The explanation gives useful mechanisms, but it does not translate the assumptions into defensible targets such as expected concurrent multipart uploads, metadata QPS, storage growth, or bandwidth pressure. Without that linkage, the NFRs feel floating rather than derived from the stated scale.

info

Define reliability more precisely for large-file transfers

You could improve this by stating what 'reliable storage and sharing' means operationally: for example, durability expectations, retry behavior for failed parts, cleanup policy for abandoned multipart uploads, and acceptable recovery behavior after client/network interruption. That would make the large-file and resumability NFRs more testable.

✅ Good

Core nouns for the main file flow are present

User, File/FileMetadata, and SharedFiles cover the basic happy path for upload, ownership, and sharing. For this problem scope, those are the main domain concepts the system revolves around.

✅ Good

Candidate distinguishes file content from metadata

Separating File from FileMetadata is a sensible modeling choice for a large file storage system because ownership, upload state, chunk state, and sharing data evolve independently from the blob itself.

warning

Relationship between File and FileMetadata is unclear

Have you considered whether File and FileMetadata are truly separate entities and how they connect? Right now it's hard to tell if this is a 1:1 relationship, whether File is the logical user-visible file and FileMetadata is its state record, or whether they are overlapping names for the same concept. That ambiguity makes ownership, deletion, and sharing semantics harder to reason about.

warning

Sharing model is underspecified

What happens when one file is shared with many users, or a user has many files shared with them? SharedFiles suggests a join entity, but the relationships are not spelled out. For this flow, you want it to be explicit that sharing is effectively User-to-File many-to-many, typically represented by a share/ACL record with owner and recipient relationships.

warning

Sync-specific entity or ownership/version relationship is missing

Have you considered what domain record drives automatic sync across devices? The explanation talks about resumable uploads and changed chunks, but the entity list does not show a version/revision concept or a device-file association. Without some explicit way to represent the latest file state per user and detect changes over time, the sync flow is left implicit rather than modeled.

✅ Good

Core storage sizing is grounded in stated assumptions

The candidate starts from the given user/storage assumptions and derives total blob storage (1B users × 10GB). That is the right first-order methodology for a storage-heavy system like file sync, and it shows awareness that persistent object storage is the dominant capacity dimension.

✅ Good

Peak traffic factor is considered

They do not stop at average daily request counts; they apply a peak multiplier to derive higher read/write QPS. Even if the exact numbers are rough, accounting for burstiness is an important capacity habit at this level.

warning

Request QPS is disconnected from actual byte throughput

Have you considered what happens when average QPS looks modest but each request moves large files or many chunks? For a file storage system, 10K read QPS and 1.25K write QPS are not enough by themselves to size the system. The real bottleneck is network and object-store throughput: 1PB/day of uploads is roughly 11.6GB/s sustained before replication/CDN effects, and peak bandwidth could be several times higher. Without carrying the calculation through to ingress/egress bandwidth, CDN offload, and S3 request rates, the infrastructure could be badly underprovisioned.
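The byte-throughput figure above can be reproduced with a few lines of arithmetic. The inputs come from the stated assumptions (10M DAU, 100MB/day uploaded per DAU, decimal units); the 5x peak factor is an assumption, taken here from the ratio between the 250 average and 1.25K peak write QPS figures.

```python
SECONDS_PER_DAY = 86_400


def sustained_bytes_per_second(dau: int, bytes_per_user_per_day: int) -> float:
    """Average ingress rate if the daily volume were spread evenly over 24 hours."""
    return dau * bytes_per_user_per_day / SECONDS_PER_DAY


DAU = 10_000_000
UPLOAD_PER_DAU = 100 * 10**6  # 100MB/day in decimal bytes
PEAK_FACTOR = 5               # assumption: ratio of peak to average load

average = sustained_bytes_per_second(DAU, UPLOAD_PER_DAU)
peak = average * PEAK_FACTOR

print(f"daily upload volume: {DAU * UPLOAD_PER_DAU / 10**15:.1f} PB")
print(f"average ingress:     {average / 10**9:.1f} GB/s")
print(f"peak ingress:        {peak / 10**9:.1f} GB/s")
```

This gives about 11.6 GB/s on average and about 58 GB/s at the assumed peak, before replication or CDN effects.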

warning

Chunked upload design likely multiplies control-plane load far beyond stated QPS

What happens when each uploaded file is split into 5–10MB parts and every part generates presigned URL handling, status updates, and completion bookkeeping? The capacity section counts uploads as 250 QPS average, but the explanation implies many backend interactions per file. At 100MB/day average upload per DAU, that is already multiple chunks per file; for larger files it explodes further. You should translate file-level uploads into part-level API calls and metadata writes, otherwise the control plane and metadata store may be undersized by a large factor.
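To make the amplification concrete, the sketch below converts file-level uploads into part-level backend requests. It assumes the flow from the explanation (one existence check, one initiate call, and one PATCH per part) and decimal units; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def parts_per_file(file_bytes: int, part_bytes: int) -> int:
    """Number of multipart parts, rounding up; every file has at least one part."""
    return max(1, -(-file_bytes // part_bytes))


def backend_requests_per_file(file_bytes: int, part_bytes: int) -> int:
    """Client-to-backend calls: 1 existence check + 1 initiate + 1 PATCH per part."""
    return 2 + parts_per_file(file_bytes, part_bytes)


def average_part_rate(daily_upload_bytes: int, part_bytes: int) -> float:
    """Average parts per second implied by the total daily upload volume."""
    return daily_upload_bytes / part_bytes / SECONDS_PER_DAY


MB, GB, PB = 10**6, 10**9, 10**15

print(backend_requests_per_file(100 * MB, 5 * MB))  # a 100MB file
print(backend_requests_per_file(50 * GB, 5 * MB))   # a 50GB file
print(round(average_part_rate(1 * PB, 5 * MB)))     # parts per second at 1PB/day
```

With 5MB parts, 1PB/day works out to roughly 2,300 part-level status updates per second on average, about nine times the 250 QPS counted at file level, and each update may also trigger a ListParts verification and a metadata write.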

warning

Storage growth ignores replication and metadata overhead

Have you considered what happens to the 10EB estimate once durability and indexing are included? Raw user data is only the floor. Object storage replication/erasure coding overhead, versions during sync conflicts, multipart staging, deleted-file retention, and metadata tables all add material capacity. At this scale, even a small percentage overhead becomes enormous, so the storage plan should at least acknowledge effective stored bytes versus logical user bytes.

warning

Download assumptions understate capacity risk for a sync product

What happens during device re-installs, new device onboarding, or hot shared-file fanout? The estimate uses 5 downloads per DAU per day, but sync systems often have asymmetric spikes where one upload fans out to many device downloads. Since the functional requirements include automatic sync across devices and sharing, it would strengthen the model to account for fanout-driven reads rather than treating downloads as independent user actions.

info

Tie infrastructure choices back to the calculated scale

You could improve this by explicitly connecting the numbers to component sizing: for example, expected metadata DB write rate, Redis working set, object-store request rate, CDN cache hit assumptions, and shard count. The explanation says SQL can handle the load because QPS is low, but that claim would be much stronger if supported by estimated metadata rows, write amplification from chunk tracking, and expected hot-key/read patterns.

✅ Good

Sync API covers the core delta flow

The GET /files/changes?since={timestamp} endpoint gives clients a concrete way to discover created/modified/deleted files and drive cross-device sync without re-listing all files. That is a sensible API for the stated sync requirement.

✅ Good

Direct upload/download via signed URLs is an appropriate protocol choice

Using a control-plane API to return presigned URLs keeps the backend out of the large file data path, which is the right API pattern for 50GB files. The explanation also shows awareness that upload is multipart and download links should be short-lived.

warning

Upload flow is not fully usable through the listed routes

What happens after POST /files returns a presigned URL for a large multipart upload? In the explanation, the client needs additional API steps to resume an existing upload, report per-part completion, and finalize the multipart upload, but those routes are not actually listed. Without explicit initiate/resume/complete endpoints, the core upload flow is underspecified and hard for a client to implement correctly.

warning

Download API mixes metadata and file bytes ambiguously

What does GET /files/{fileId} return for a 50GB file: raw bytes, metadata, or a redirect/signed URL? Returning 'File & FileMetadata' is not realistic at this size and creates protocol ambiguity for clients. A cleaner contract would separate metadata retrieval from download-link generation so clients know whether to expect JSON, a redirect, or a streamed body.

warning

No API for listing a user's files or shared files

How does a new device bootstrap sync or let a user browse their files and files shared with them? The delta endpoint only works once the client already has a checkpoint. Without at least one list/discovery API, the system is missing an obvious way to initialize state for upload/download/share workflows.

warning

Change feed needs pagination and a stable cursor

What happens when a device has been offline for weeks and GET /files/changes?since={timestamp} returns a huge backlog? A timestamp alone is fragile under high write volume because multiple events can share the same time and clients can miss or duplicate events around boundaries. This endpoint should expose pagination or, better, an opaque cursor/token with deterministic ordering.
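A cursor-based version of the change feed might look like the sketch below. It assumes the server assigns every change a strictly increasing sequence number and hides it inside an opaque token; `encode_cursor`, `decode_cursor`, `get_changes`, and the response field names are hypothetical, and an in-memory list stands in for the change table.

```python
import base64
import json


def encode_cursor(seq: int) -> str:
    """Wrap the last-seen sequence number in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"seq": seq}).encode()).decode()


def decode_cursor(cursor) -> int:
    """Return the sequence number inside a cursor; no cursor means start at 0."""
    if not cursor:
        return 0
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["seq"]


def get_changes(change_log, cursor=None, limit=100):
    """Return one page of changes after the cursor.

    change_log is a list of dicts sorted by a strictly increasing 'seq'.
    """
    after = decode_cursor(cursor)
    page = [change for change in change_log if change["seq"] > after][:limit]
    next_seq = page[-1]["seq"] if page else after
    has_more = bool(page) and page[-1]["seq"] < change_log[-1]["seq"]
    return {
        "changes": page,
        "next_cursor": encode_cursor(next_seq),
        "has_more": has_more,
    }
```

Because ordering is by sequence number rather than wall-clock time, two changes can never tie, and a client that was offline for weeks simply keeps requesting pages until `has_more` is false.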

warning

Share API does not define idempotency or partial-failure behavior

What happens if the client retries POST /files/{id}/share after a timeout, or some users in the User[] are invalid or unauthorized? Without a clear response model for duplicate shares, per-user failures, and retry-safe semantics, clients can easily create inconsistent UX and may not know which recipients actually received access.
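One retry-safe shape for the share operation is to treat each (file, recipient) pair as a unique record and report an outcome per recipient. The sketch below is an illustration only: the `share_file` name, the outcome strings, and the in-memory set standing in for the share table are all assumptions.

```python
def share_file(shares, known_users, file_id, recipients):
    """Grant access per recipient and report what happened to each one.

    shares is a set of (file_id, user_id) pairs standing in for the share table;
    a unique constraint on that pair is what makes retries harmless.
    """
    results = {}
    for user_id in recipients:
        if user_id not in known_users:
            results[user_id] = "invalid_user"
        elif (file_id, user_id) in shares:
            results[user_id] = "already_shared"
        else:
            shares.add((file_id, user_id))
            results[user_id] = "shared"
    return results
```

A client that retries after a timeout gets `already_shared` for the recipients that succeeded the first time, so it can tell exactly who has access without creating duplicate records.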

warning

Error contract and retry guidance are missing

What does the client see when a presigned URL expires, a multipart upload is incomplete, a file is not found, or access is denied to a shared file? The routes do not define status codes, error shape, or which operations are safe to retry. For a sync client, this matters a lot because it needs to distinguish retryable failures from permanent ones.

info

Resource design could be cleaner by separating actions from resources

You could improve this by making the API more explicit around resources such as uploads, shares, and download links instead of overloading /files. For example, separate endpoints for file metadata, upload sessions, and share records would make lifecycle and client behavior clearer, especially for resumable multipart uploads.

⭐ Excellent

Direct-to-object-store upload/download keeps backend off the data path

Routing large file transfer through presigned URLs to S3/blob storage is a strong architectural choice for the stated 50GB file cap and 1B-user footprint. It avoids turning the file service into a bandwidth bottleneck and makes resumable multipart upload practical.

✅ Good

Upload completion is decoupled from sync propagation

Using object-store upload completion events through Kafka into the sync service is a good separation of concerns. It keeps the user-facing upload path focused on metadata and transfer orchestration while pushing downstream change propagation asynchronously.

✅ Good

Realtime plus pull-based sync model

Combining WebSocket servers for push notifications with a changes-since API gives clients both low-latency updates and a recovery path when they reconnect after being offline. That is a sensible end-to-end sync design.

✅ Good

Metadata and sharing are separated from blob storage

Keeping file metadata and sharing relationships in SQL while storing file bytes in blob storage is a good logical decomposition. It matches the access patterns: transactional metadata updates in the database and large immutable payloads in object storage.

warning

Sync path ownership is split and may produce inconsistent state

Have you considered what happens when upload finalization is handled by the file service in one flow, but the sync service also writes to Postgres from Kafka events? Without a clear source of truth and idempotent state transitions, the system can race: metadata may say 'uploaded' before or after the event processor updates it, duplicate events may create duplicate change records, and clients may receive out-of-order sync notifications. You could improve this by making one service authoritative for metadata state transitions and treating Kafka consumers as idempotent projectors.
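An idempotent projector of the kind suggested above could be sketched like this. It assumes every event carries a unique ID and that file status moves through a small fixed set of transitions; the `ALLOWED` table, the `MetadataProjector` class, and the status names beyond "uploading" and "uploaded" are hypothetical, and in-memory structures stand in for the database.

```python
# Assumed state machine: which status may follow which (None means no record yet).
ALLOWED = {
    None: {"uploading"},
    "uploading": {"uploaded"},
    "uploaded": {"deleted"},
    "deleted": set(),
}


class MetadataProjector:
    """Applies status events at most once and only along allowed transitions."""

    def __init__(self):
        self.status = {}      # file_id -> current status
        self.applied = set()  # event IDs already processed
        self.changes = []     # change records that sync clients would consume

    def apply(self, event_id: str, file_id: str, new_status: str) -> str:
        if event_id in self.applied:
            return "duplicate"  # redelivered event: no second change record
        current = self.status.get(file_id)
        if new_status not in ALLOWED[current]:
            return "rejected"   # stale or out-of-order event: state unchanged
        self.applied.add(event_id)
        self.status[file_id] = new_status
        self.changes.append((file_id, new_status))
        return "applied"
```

In a real system the deduplication check, the status update, and the change record would need to be written in one transaction, so that a crash between them cannot leave the metadata and the change feed out of step.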

warning

Delete/share flows are not connected into the sync architecture

What happens when a user deletes a file or shares a file with another user? The diagram shows Kafka fed by blob upload completion, but there is no equivalent event path from file service for delete/share mutations into the sync service. That means other devices may not learn about deletes or newly shared files promptly, even though automatic sync is a core requirement. You could improve this by emitting change events for all metadata mutations, not just uploads.

warning

Redis usage is underspecified for the hottest read paths

Have you considered where the first bottleneck appears on download and sync reads? The design mentions Redis between file service and SQL, but the critical hot path is authorization plus metadata lookup for generating download links and listing changes. If cache keys, invalidation, and what is cached are not defined, Postgres becomes the first scaling pressure point for repeated metadata/share checks. You could improve this by explicitly caching file metadata, ACL/share lookups, and recent change cursors with clear invalidation on write.

warning

WebSocket tier may become a fan-out bottleneck during reconnect storms

What happens when many clients reconnect after a mobile network flap or regional outage? The WS servers appear to depend on the sync service directly, but there is no shown mechanism for connection state distribution, backpressure, or horizontal fan-out. A reconnect storm could overload the WS tier or sync service and delay notifications. You could improve this by making WS servers stateless, storing session/subscription state in Redis or a broker, and using bounded queues/backpressure for push delivery.

info

CDN helps downloads, but private file access needs a clearer control point

You could improve this by making the download flow explicit: client requests metadata/download authorization from file service, file service checks ownership/share ACL, then returns a short-lived signed CDN or object-store URL. Right now CDN is connected to the client and storage, but the control path that enforces per-user sharing permissions is only implied.

info

Single-region assumptions are hidden in several core components

What happens when the Postgres cluster, Redis cluster, or Kafka deployment has a regional failure? The design names clusters, which is good, but the high-level architecture does not show whether these are multi-AZ only or multi-region. Given the stated preference for availability over consistency, you could strengthen the design by explaining which failures are tolerated locally versus across regions and how clients fail over.
