What Is System Design?
What the discipline actually is, why interviews test it, and how to think about trade-offs.
System design is the discipline of turning a product goal into a working arrangement of services, databases, caches, queues, APIs, background jobs, networks, and operational practices. It asks: what must the system do, how much load must it survive, what can fail, and which trade-offs are acceptable for this product?
The problem: software leaves one machine
Small programs are often easy to reason about because everything happens in one process and one database. Real products outgrow that shape. Users arrive from many regions, traffic spikes without warning, machines die, data gets large, and teams need to change parts of the system independently. Each of those pressures forces a choice, and system design gives you a vocabulary and a process for making those choices deliberately instead of by accident.
prototype:
browser ──▶ web app ──▶ database
what changes in production:
- 10 users become 10 million users
- one server becomes a fleet
- one database becomes replicas, shards, indexes, caches, backups
- one happy path becomes retries, timeouts, deploys, outages, and abuse
The failure mode is not usually that engineers know too little tech.
- They jump to components: Redis, Kafka, Kubernetes, DynamoDB, Elasticsearch.
- They skip the question those components are supposed to answer: latency, throughput, durability, cost, correctness, or operability.
- They produce a diagram that looks sophisticated but does not solve the actual product problem.
Functional vs non-functional requirements
Every design starts with requirements. Functional requirements describe what users can do. Non-functional requirements describe how well the system must do it. The second category is where most system design decisions come from.
| Requirement type | Question it answers | Examples | Design impact |
|---|---|---|---|
| Functional | What behavior must exist? | Users can create posts, follow accounts, upload images, search messages | Defines APIs, data model, workflows, and product scope |
| Non-functional | How well must it behave? | p95 latency < 200 ms, 99.99% availability, durable uploads, 100K writes/sec | Defines scaling, replication, caching, partitioning, failover, and cost |
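Non-functional targets such as "99.99% availability" or "p95 latency < 200 ms" are easier to reason about once they are turned into numbers you can check. The following is a minimal sketch, not a monitoring implementation: the function names, the 365-day year, and the nearest-rank percentile method are all assumptions chosen for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed per period at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


def percentile_nearest_rank(samples, p):
    """p-th percentile using the nearest-rank method (one of several definitions)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_target(samples_ms, p=95, target_ms=200):
    """True if the p-th percentile latency is strictly below the target."""
    return percentile_nearest_rank(samples_ms, p) < target_ms


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% availability allows {downtime_budget_minutes(target):.1f} min/year of downtime")
```

Under these assumptions, 99.99% availability leaves about 52.6 minutes of downtime per year, which is why that single number drives replication and failover decisions.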
Why the distinction matters
Suppose the feature is "users can post photos." That is not enough to design the system. A family photo app, Instagram, and a medical imaging archive all accept photo uploads, but they need very different durability, privacy, moderation, latency, and storage-cost choices.
Feature:
upload and view photos
Consumer social app:
p95 image view < 200 ms
tolerate delayed counters
optimize for CDN cache hit rate and cheap storage
Medical imaging archive:
strong audit trail
strict access control
long retention
correctness and compliance outrank feed latency
You will often use capacity estimation to turn vague words like "large scale" into numbers, and ideas like the CAP theorem to explain what happens when replicas and networks disagree.
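A capacity estimate is mostly multiplication and division, which makes it easy to sketch. Every input below is an illustrative assumption rather than a measurement: the user count, the per-user read and write rates, the peak factor, and the post size are placeholders you would replace with figures agreed during requirements gathering.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(daily_users, actions_per_user_per_day):
    """Average requests per second implied by a daily usage pattern."""
    return daily_users * actions_per_user_per_day / SECONDS_PER_DAY


# Illustrative assumptions only.
DAILY_ACTIVE_USERS = 100_000_000
READS_PER_USER_PER_DAY = 100
WRITES_PER_USER_PER_DAY = 1   # assumption: users post far less than they read
PEAK_FACTOR = 3               # assumption: peak traffic is 3x the daily average
BYTES_PER_POST = 1_000        # assumption: text plus metadata, no media

read_rps = average_rate_per_second(DAILY_ACTIVE_USERS, READS_PER_USER_PER_DAY)
write_rps = average_rate_per_second(DAILY_ACTIVE_USERS, WRITES_PER_USER_PER_DAY)
peak_read_rps = read_rps * PEAK_FACTOR
daily_storage_gb = DAILY_ACTIVE_USERS * WRITES_PER_USER_PER_DAY * BYTES_PER_POST / 1e9

if __name__ == "__main__":
    print(f"average reads/sec:  {read_rps:,.0f}")
    print(f"average writes/sec: {write_rps:,.0f}")
    print(f"peak reads/sec:     {peak_read_rps:,.0f}")
    print(f"new storage/day:    {daily_storage_gb:,.0f} GB")
```

With these inputs the read path sees roughly 100 times the traffic of the write path, which is the kind of result that tells you where caching and replication effort should go.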
The trade-off mindset: there is no perfect design
System design is not a hunt for the perfect diagram. It is a sequence of explicit trade-offs. Every mechanism buys one property by spending another. Caches reduce read latency but introduce staleness and invalidation. Replication improves read capacity and availability but creates consistency questions. Sharding raises write capacity but makes queries and resharding harder.
| Choice | What it buys | What it costs |
|---|---|---|
| Cache hot data | Lower read latency and database load | Stale reads, invalidation bugs, extra memory |
| Replicate data | Higher availability and read scale | Replication lag, failover complexity, split-brain risk |
| Shard a database | Higher write/storage capacity | Cross-shard queries, hotspots, migration complexity |
| Use a queue | Absorb bursts and decouple services | Eventual processing, retries, duplicate handling |
| Use strong consistency | Simpler correctness model | More coordination, higher latency, lower availability during network partitions |
Real systems show this clearly. Amazon DynamoDB lets teams tune capacity, indexes, and consistency per access pattern. Cassandra favors write availability and tunable consistency. PostgreSQL favors a strong, relational model and can scale surprisingly far before you need more exotic machinery. The right answer depends on what you promised users.
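The "cache hot data" row is a good one to see in code, because the cost (stale reads) shows up in only a few lines. This is a minimal single-process sketch, assuming a read-through cache with a fixed time-to-live; `TTLCache`, its `loader` callback, and the injectable `clock` are hypothetical names for illustration, not the API of any particular cache product.

```python
import time


class TTLCache:
    """Read-through cache: entries are served until their time-to-live expires."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. a database read
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the demo can control time
        self._entries = {}         # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            self.hits += 1
            return entry[0]        # may be stale if the source changed
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Explicit invalidation: the other way to bound staleness."""
        self._entries.pop(key, None)


# Demo with a fake database and a manually advanced clock.
database = {"user:1": "Ada"}
fake_now = [0.0]
cache = TTLCache(loader=database.__getitem__, ttl_seconds=30, clock=lambda: fake_now[0])

first = cache.get("user:1")    # miss: loads from the database
database["user:1"] = "Grace"   # a write that bypasses the cache
stale = cache.get("user:1")    # hit: returns the old value
fake_now[0] = 31.0             # move past the TTL
fresh = cache.get("user:1")    # miss: reloads the new value
```

The second read avoids the database but returns the old value; the TTL is the knob that trades database load against how stale a read is allowed to be.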
A practical design framework
A design conversation is easier when you follow a repeatable path. You do not have to be robotic, but you should avoid wandering. Start broad, choose assumptions, sketch the system, then spend depth where risk is highest.
1. Clarify requirements
- users, core features, out-of-scope features
- latency, availability, consistency, durability, security
2. Back-of-the-envelope estimate
- daily active users, requests/sec, read/write ratio
- storage growth, bandwidth, peak traffic
3. API design
- request/response shape and idempotency
- pagination, auth, rate limits
4. Data model
- entities, indexes, access patterns, retention
5. High-level architecture
- clients, load balancers, services, stores, caches, queues
6. Deep dive
- scale the bottleneck: feed fanout, search, upload path, hot keys
7. Identify bottlenecks and mitigations
- failure modes, observability, backpressure, retries, capacity limits
How this sounds in an interview
- Clarify: Are we designing only posting and reading tweets, or also search, ads, DMs, and media?
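- Estimate: If we have 100 million daily active users and each reads 100 posts per day, that is 10 billion reads per day, or about 116K reads/sec on average. Assuming users post far less often than they read, reads dominate writes.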
- API: Define POST /posts and GET /timeline before choosing storage.
- Data model: Store posts by author and timeline entries by viewer because those are different access patterns.
- Deep dive: Discuss fanout-on-write vs fanout-on-read because the home timeline is likely the bottleneck.
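The two access patterns and the two fanout strategies mentioned above can be sketched with in-memory dictionaries. This is a toy model under loud assumptions: the `Timelines` class and its method names are hypothetical, everything lives in one process, and follows are assumed to happen before posts so that both strategies see the same data.

```python
from collections import defaultdict


class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)           # author -> viewers who follow them
        self.following = defaultdict(set)           # viewer -> authors they follow
        self.posts_by_author = defaultdict(list)    # access pattern 1: posts by author
        self.timeline_by_viewer = defaultdict(list) # access pattern 2: entries by viewer
        self._seq = 0                               # stand-in for a timestamp

    def follow(self, viewer, author):
        self.followers[author].add(viewer)
        self.following[viewer].add(author)

    def post(self, author, text):
        self._seq += 1
        entry = (self._seq, author, text)
        self.posts_by_author[author].append(entry)
        # Fanout-on-write: pay at post time, once per follower.
        for viewer in self.followers[author]:
            self.timeline_by_viewer[viewer].append(entry)

    def timeline_fanout_on_write(self, viewer, limit=10):
        # Cheap read: the timeline was precomputed when posts were written.
        return sorted(self.timeline_by_viewer[viewer], reverse=True)[:limit]

    def timeline_fanout_on_read(self, viewer, limit=10):
        # Cheap write: gather and merge followed authors' posts at read time.
        entries = [
            entry
            for author in self.following[viewer]
            for entry in self.posts_by_author[author]
        ]
        return sorted(entries, reverse=True)[:limit]


service = Timelines()
service.follow("alice", "bob")
service.follow("alice", "carol")
service.post("bob", "hi")
service.post("carol", "yo")
precomputed = service.timeline_fanout_on_write("alice")
merged = service.timeline_fanout_on_read("alice")
```

Both calls return the same timeline here; the difference is where the work happens. An author with millions of followers makes the write loop expensive, while a viewer who follows thousands of authors makes the read-time merge expensive.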
Edge cases and gotchas
Good system design includes the messy edges, not just the happy-path boxes. These are the places production systems usually fail.
- Scope creep: a prompt like "design YouTube" is too large. Pick the core user journey and explicitly defer the rest.
- Single points of failure: one database, one region, one queue, or one deployment pipeline can take down the whole system.
- Hotspots: one viral post, one celebrity account, or one partition key can overload an otherwise scalable design.
- Retries without idempotency: retrying payments, orders, or message sends can duplicate side effects unless the API is designed for safe repetition.
- Ignoring operations: backups, dashboards, alerts, deploy safety, and on-call runbooks are part of the system, not an afterthought.
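The "retries without idempotency" gotcha is commonly handled with a client-supplied idempotency key. The sketch below assumes a hypothetical in-memory `PaymentService`; a real service would persist keys durably, handle concurrent requests, and reject a reused key that arrives with different parameters.

```python
class PaymentService:
    """Toy service: remembers the result for each idempotency key it has seen."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # A retry with a known key returns the stored result, with no new side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {
            "charge_id": len(self.charges),
            "account": account,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


payments = PaymentService()
# The client generates one key per logical operation and reuses it on every retry.
first_attempt = payments.charge("order-42", account="acct-1", amount_cents=1999)
retry_attempt = payments.charge("order-42", account="acct-1", amount_cents=1999)
```

The retry returns the original result and the account is charged once. Without the key, a timeout followed by a retry would produce two charges.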
Key takeaways
- System design arranges many components to meet product goals under load, failure, growth, and operational constraints.
- Functional requirements say what the system does; non-functional requirements say how well it must do it.
- There is no perfect design: every cache, replica, queue, index, and shard buys one property while costing another.
- A strong design flow is clarify requirements → estimate capacity → define APIs → model data → draw architecture → deep dive → find bottlenecks.
- The biggest beginner mistake is jumping to components before clarifying scope, scale, correctness, and failure expectations.