
What Is System Design?

What the discipline actually is, why interviews test it, and how to think about trade-offs.

System design is the discipline of turning a product goal into a working arrangement of services, databases, caches, queues, APIs, background jobs, networks, and operational practices. It asks: what must the system do, how much load must it survive, what can fail, and which trade-offs are acceptable for this product?

🔭Think of it like…

Writing application code is like designing one room in a building. System design is planning the whole airport: where passengers enter, how baggage moves, what happens when a security lane closes, how emergency exits work, and how the airport expands without shutting down. A beautiful room does not matter if the whole building cannot move people safely.

The problem: software leaves one machine

Small programs are often easy to reason about because everything happens in one process and one database. Real products outgrow that shape. Users arrive from many regions, traffic spikes without warning, machines die, data gets large, and teams need to change parts of the system independently. System design gives you a vocabulary and a process for making those choices deliberately instead of by accident.

the simple shape breaks at scale

prototype:
browser ──▶ web app ──▶ database

what changes in production:
- 10 users become 10 million users
- one server becomes a fleet
- one database becomes replicas, shards, indexes, caches, backups
- one happy path becomes retries, timeouts, deploys, outages, and abuse

The usual failure mode is not that engineers know too little technology. It is that they reach for it in the wrong order:

They jump to components: Redis, Kafka, Kubernetes, DynamoDB, Elasticsearch.
They skip the requirements those components are supposed to serve: latency, throughput, durability, cost, correctness, or operability.
They produce a diagram that looks sophisticated but does not solve the actual product problem.

The naive failure mode

A beginner hears "design Twitter" and immediately draws a load balancer, app servers, a cache, a queue, and a database. A staff-level engineer first asks what version of Twitter we are designing: posting tweets, home timeline, search, DMs, media upload, ads, or all of them? The right architecture depends on the scope.

Functional vs non-functional requirements

Every design starts with requirements. Functional requirements describe what users can do. Non-functional requirements describe how well the system must do it. The second category is where most system design decisions come from.

Requirement type	Question it answers	Examples	Design impact
Functional	What behavior must exist?	Users can create posts, follow accounts, upload images, search messages	Defines APIs, data model, workflows, and product scope
Non-functional	How well must it behave?	p95 latency < 200 ms, 99.99% availability, durable uploads, 100K writes/sec	Defines scaling, replication, caching, partitioning, failover, and cost
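
Targets such as "99.99% availability" and "p95 latency < 200 ms" are easier to reason about once they are turned into numbers you can check. The sketch below is illustrative only: the function names are hypothetical, and the nearest-rank percentile is one common definition among several.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over `days` for a given availability target."""
    return (1 - availability_pct / 100.0) * days * 24 * 60


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latency samples."""
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_target(samples_ms: list[float], target_ms: float) -> bool:
    """True if the p95 latency is strictly below the target."""
    return p95(samples_ms) < target_ms


if __name__ == "__main__":
    # 99.99% over a 365-day year leaves roughly 52.6 minutes of downtime.
    print(round(downtime_budget_minutes(99.99), 1))
    # 19 fast requests and one slow one: the single outlier sits above the p95 rank.
    print(meets_target([100.0] * 19 + [900.0], 200.0))
```

Note that a p95 target says nothing about the slowest 5% of requests, which is why tail targets such as p99 are often specified alongside it.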

Why the distinction matters

Suppose the feature is "users can post photos." That is not enough to design the system. A family photo app, Instagram, and a medical imaging archive all accept photo uploads, but they need very different durability, privacy, moderation, latency, and storage-cost choices.

same feature, different non-functional targets

Feature:
  upload and view photos

Consumer social app:
  p95 image view < 200 ms
  tolerate delayed counters
  optimize for CDN cache hit rate and cheap storage

Medical imaging archive:
  strong audit trail
  strict access control
  long retention
  correctness and compliance outrank feed latency

You will often use capacity estimation to turn vague words like "large scale" into numbers, and ideas like the CAP theorem to explain what happens when replicas and networks disagree.

The trade-off mindset: there is no perfect design

System design is not a hunt for the perfect diagram. It is a sequence of explicit trade-offs. Every mechanism buys one property by spending another. Caches reduce read latency but introduce staleness and invalidation. Replication improves read capacity and availability but creates consistency questions. Sharding raises write capacity but makes queries and resharding harder.

Choice	What it buys	What it costs
Cache hot data	Lower read latency and database load	Stale reads, invalidation bugs, extra memory
Replicate data	Higher availability and read scale	Replication lag, failover complexity, split-brain risk
Shard a database	Higher write/storage capacity	Cross-shard queries, hotspots, migration complexity
Use a queue	Absorb bursts and decouple services	Eventual processing, retries, duplicate handling
Use strong consistency	Simpler correctness model	More coordination, higher latency, lower availability during network partitions
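
The "hotspots" cost in the sharding row can be made concrete with a few lines. This is a minimal sketch assuming simple hash partitioning; the function names, the four-shard setup, and the traffic mix are all made up for illustration, and real databases use their own partitioning schemes.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used here to keep the mapping repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(request_keys: list[str], num_shards: int) -> Counter:
    """Count how many requests land on each shard."""
    return Counter(shard_for(key, num_shards) for key in request_keys)


if __name__ == "__main__":
    # Hypothetical traffic: 1,000 ordinary users with one request each,
    # plus 5,000 requests that all target a single celebrity key.
    keys = [f"user-{i}" for i in range(1000)] + ["celebrity"] * 5000
    print(load_per_shard(keys, 4))
```

Hashing spreads distinct keys across shards, but every request for the same key still lands on the same shard, so one very popular key can overload it no matter how many shards exist.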

Strong answers name the trade-off

"I will add Redis" is a component choice. "I will cache celebrity profile pages for 60 seconds to reduce database reads, accepting brief staleness because profile edits are rare" is system design reasoning.

Real systems show this clearly. Amazon DynamoDB lets teams tune capacity, indexes, and read consistency per access pattern. Cassandra favors write availability and offers tunable consistency. PostgreSQL favors a relational model with strong transactional guarantees and can scale surprisingly far before you need more exotic machinery. The right answer depends on what you promised users.
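
The 60-second profile cache from the callout above can be sketched as a small read-through cache with a time-to-live. This is an in-process stand-in, not Redis: the class name, the loader callback, and the injectable clock are assumptions made so the staleness window is easy to see.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Read-through cache that serves a value until its TTL expires."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader          # fetches from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            # Cache hit: may be up to ttl_seconds out of date.
            return entry[1]
        # Miss or expired: reload and store with a fresh expiry time.
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is visible in `get`: a hit skips the database entirely, and the price is that an edit made at the source stays invisible until the entry expires.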

A practical design framework

A design conversation is easier when you follow a repeatable path. You do not have to be robotic, but you should avoid wandering. Start broad, choose assumptions, sketch the system, then spend depth where risk is highest.

system design interview flow

1. Clarify requirements
   - users, core features, out-of-scope features
   - latency, availability, consistency, durability, security

2. Back-of-the-envelope estimate
   - daily active users, requests/sec, read/write ratio
   - storage growth, bandwidth, peak traffic

3. API design
   - request/response shape and idempotency
   - pagination, auth, rate limits

4. Data model
   - entities, indexes, access patterns, retention

5. High-level architecture
   - clients, load balancers, services, stores, caches, queues

6. Deep dive
   - scale the bottleneck: feed fanout, search, upload path, hot keys

7. Identify bottlenecks and mitigations
   - failure modes, observability, backpressure, retries, capacity limits
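
Step 2 of the flow above is mostly arithmetic, so it can be sketched in a few lines. The 100 million daily users and 100 reads per user match the interview example in this lesson; the write rate, post size, and peak factor are assumptions chosen only to make the arithmetic concrete.

```python
SECONDS_PER_DAY = 86_400


def estimate(
    daily_active_users: int,
    reads_per_user_per_day: int,
    writes_per_user_per_day: int,
    bytes_per_write: int,
    peak_factor: float,
) -> dict[str, float]:
    """Back-of-the-envelope traffic and storage numbers."""
    reads_per_day = daily_active_users * reads_per_user_per_day
    writes_per_day = daily_active_users * writes_per_user_per_day
    avg_read_rps = reads_per_day / SECONDS_PER_DAY
    avg_write_rps = writes_per_day / SECONDS_PER_DAY
    return {
        "avg_read_rps": avg_read_rps,
        "avg_write_rps": avg_write_rps,
        "peak_read_rps": avg_read_rps * peak_factor,
        "read_write_ratio": reads_per_day / writes_per_day,
        "storage_gb_per_day": writes_per_day * bytes_per_write / 1e9,
    }


if __name__ == "__main__":
    numbers = estimate(
        daily_active_users=100_000_000,
        reads_per_user_per_day=100,
        writes_per_user_per_day=1,   # assumption
        bytes_per_write=1_000,       # assumption: about 1 KB per post
        peak_factor=3.0,             # assumption: peak is 3x the average
    )
    for name, value in numbers.items():
        print(f"{name}: {value:,.0f}")
```

With these inputs the average read rate is about 116,000 requests per second and reads outnumber writes 100 to 1, which is the kind of result that should steer the rest of the design toward the read path.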

How this sounds in an interview

Clarify: Are we designing only posting and reading tweets, or also search, ads, DMs, and media?
Estimate: If we have 100 million daily active users and each reads 100 posts per day, that is 10 billion reads per day, so reads almost certainly dominate writes.
API: Define POST /posts and GET /timeline before choosing storage.
Data model: Store posts by author and timeline entries by viewer because those are different access patterns.
Deep dive: Discuss fanout-on-write vs fanout-on-read because the home timeline is likely the bottleneck.
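
The two fanout strategies named in the deep dive can be contrasted with a toy in-memory model. This is a sketch under simplifying assumptions: the class names are hypothetical, the follow graph is a plain dictionary, and a global counter stands in for timestamps.

```python
from collections import defaultdict


class FanoutOnWrite:
    """Copy each new post into every follower's stored timeline."""

    def __init__(self, followers: dict[str, set[str]]) -> None:
        self.followers = followers                 # author -> followers
        self.timelines: dict[str, list[str]] = defaultdict(list)

    def post(self, author: str, post_id: str) -> None:
        # Write cost grows with the author's follower count.
        for viewer in self.followers.get(author, ()):
            self.timelines[viewer].append(post_id)

    def timeline(self, viewer: str) -> list[str]:
        # Read is a single lookup, newest first.
        return list(reversed(self.timelines[viewer]))


class FanoutOnRead:
    """Store posts by author and merge them when a timeline is requested."""

    def __init__(self, following: dict[str, set[str]]) -> None:
        self.following = following                 # viewer -> followed authors
        self.posts_by_author: dict[str, list[tuple[int, str]]] = defaultdict(list)
        self._seq = 0                              # stand-in for a timestamp

    def post(self, author: str, post_id: str) -> None:
        # Write is a single append.
        self._seq += 1
        self.posts_by_author[author].append((self._seq, post_id))

    def timeline(self, viewer: str) -> list[str]:
        # Read cost grows with the number of accounts the viewer follows.
        merged = [
            entry
            for author in self.following.get(viewer, ())
            for entry in self.posts_by_author[author]
        ]
        return [post_id for _, post_id in sorted(merged, reverse=True)]
```

Both classes return the same timeline for the same history. What differs is where the work happens: fanout-on-write makes reads cheap but makes a post by an account with millions of followers expensive, while fanout-on-read does the opposite.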

Do breadth before depth

If you spend 25 minutes perfecting the API before estimating traffic, you may never discover that the real challenge is 10 million timeline reads per second. First map the terrain, then dig where the system is most likely to break.

Edge cases and gotchas

Good system design includes the messy edges, not just the happy-path boxes. These are the places production systems usually fail.

Scope creep: a prompt like "design YouTube" is too large. Pick the core user journey and explicitly defer the rest.
Single points of failure: one database, one region, one queue, or one deployment pipeline can take down the whole system.
Hotspots: one viral post, one celebrity account, or one partition key can overload an otherwise scalable design.
Retries without idempotency: retrying payments, orders, or message sends can duplicate side effects unless the API is designed for safe repetition.
Ignoring operations: backups, dashboards, alerts, deploy safety, and on-call runbooks are part of the system, not an afterthought.
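
The "retries without idempotency" item above is the easiest of these to show in code. The sketch below assumes the client sends a unique idempotency key with each logical request; the class and field names are hypothetical, and it is a single-process illustration rather than a production pattern, since a real service would need the key check and the side effect to be atomic in a shared store.

```python
class PaymentService:
    """Toy payment endpoint that is safe to retry with the same key."""

    def __init__(self) -> None:
        self._responses: dict[str, dict] = {}        # idempotency key -> stored response
        self.charges: list[tuple[str, int]] = []     # side effects actually performed

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A retry with a known key returns the stored response
        # instead of charging the account again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"status": "charged", "charge_number": len(self.charges)}
        self._responses[idempotency_key] = response
        return response


if __name__ == "__main__":
    service = PaymentService()
    first = service.charge("key-123", "acct-1", 500)
    retry = service.charge("key-123", "acct-1", 500)  # e.g. after a client timeout
    print(first == retry, len(service.charges))
```

The key belongs to the logical operation, not the network attempt: the client generates it once and reuses it for every retry of that operation.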

Key takeaways

System design arranges many components to meet product goals under load, failure, growth, and operational constraints.
Functional requirements say what the system does; non-functional requirements say how well it must do it.
There is no perfect design: every cache, replica, queue, index, and shard buys one property while costing another.
A strong design flow is clarify requirements → estimate capacity → define APIs → model data → draw architecture → deep dive → find bottlenecks.
The biggest beginner mistake is jumping to components before clarifying scope, scale, correctness, and failure expectations.

Jumping to components first fails because components are answers to requirements. Without scope and non-functional targets, you do not know whether you need a cache, a queue, a strongly consistent database, or a globally replicated store. The diagram may look impressive while solving the wrong problem.

Functional requirements describe user-visible behavior, such as posting a message or uploading an image. Non-functional requirements describe quality targets, such as latency, availability, durability, scale, privacy, and cost.

Naming the trade-off proves the choice is intentional. It explains what the design gains, what it gives up, and why that is acceptable for this product, rather than being copied from another architecture.
