The Capacity Chain Template
A repeatable pipeline from DAU to QPS to storage to node count for every design.
The capacity chain is a repeatable pipeline for turning a product prompt into engineering numbers. Start with users, convert behavior into QPS, convert payloads into storage and bandwidth, then divide by realistic per-node capacity. The output is not a final procurement plan; it is a scale-aware map for the architecture you are about to draw.
The problem: diagrams without numbers lie
A whiteboard architecture can look reasonable while being off by orders of magnitude. One Postgres primary may be perfect for 500 QPS and wrong for 500K QPS. A single object store bucket may hold the data, but the CDN and encoding pipeline may dominate bandwidth. The capacity chain forces each major design choice to connect back to load.
The template: DAU → QPS → storage → bandwidth → nodes
Use the same path every time. Keep reads, writes, and large media transfers separate if they stress different components. Use the memorized anchors from Numbers to Memorize, then refine with real benchmarks when available.
1. Daily actions
daily_actions = DAU * actions_per_user_per_day
2. Average QPS
avg_qps = daily_actions / 86,400
3. Peak QPS
peak_qps = avg_qps * peak_factor
4. Storage
logical_storage_per_day = items_per_day * average_item_size
raw_storage = logical_storage_per_day * retention_days * replication_factor
5. Bandwidth
bytes_per_second = peak_qps * response_or_payload_size
6. Node count
nodes = ceil(peak_qps / safe_qps_per_node)
storage_nodes = ceil(raw_storage / safe_storage_per_node)
- DAU: daily active users, not total registered users. Use monthly users only after converting to daily activity.
- Actions/day: separate write actions from read actions. A chat user may send 40 messages per day but read hundreds of timeline entries.
- Peak factor: 3× for steady enterprise, 5× for consumer/social, and more for launches, live events, or flash sales.
- Safe per-node throughput: use conservative numbers that leave CPU, memory, network, and failure headroom.
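The template above can be sketched as a small Python helper. This is a minimal illustration, not a sizing tool: the names `ChainInputs`, `capacity_chain`, and `raw_storage_bytes` are hypothetical, the storage step assumes the item count is per day, and the sample workload at the bottom is invented for demonstration.

```python
import math
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class ChainInputs:
    """Assumptions for one traffic class (reads or writes), kept explicit."""
    dau: int                         # daily active users
    actions_per_user_per_day: float  # for this traffic class only
    payload_bytes: int               # request or response size
    peak_factor: float               # e.g. 3 for steady enterprise, 5 for consumer
    safe_qps_per_node: float         # conservative per-node throughput


def capacity_chain(inp: ChainInputs) -> dict:
    """Steps 1-3, 5, and 6 of the template for a single traffic class."""
    daily_actions = inp.dau * inp.actions_per_user_per_day
    avg_qps = daily_actions / SECONDS_PER_DAY
    peak_qps = avg_qps * inp.peak_factor
    return {
        "daily_actions": daily_actions,
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "bytes_per_second": peak_qps * inp.payload_bytes,
        "nodes": math.ceil(peak_qps / inp.safe_qps_per_node),
    }


def raw_storage_bytes(items_per_day: int, average_item_bytes: int,
                      retention_days: int, replication_factor: int) -> int:
    """Step 4: daily logical bytes, kept for the retention window, then replicated."""
    return items_per_day * average_item_bytes * retention_days * replication_factor


# Hypothetical enterprise workload, not taken from the text.
reads = capacity_chain(ChainInputs(
    dau=1_000_000, actions_per_user_per_day=20, payload_bytes=2_000,
    peak_factor=3, safe_qps_per_node=500,
))
print(reads["nodes"])  # 2
```

Running the chain once per traffic class keeps reads and writes separate, and keeping the inputs in one dataclass makes every assumption easy to find and change.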
One worked example: photo sharing feed
Suppose you are designing a photo sharing feed for 50 million daily active users. Each user uploads 2 photos per day, reads the feed 30 times per day, each feed read returns about 20 KB of JSON metadata, and each compressed photo averages 2 MB. Keep photos for 1 year, replicate storage 3 ways, and size consumer peaks at 5× average.
Inputs
DAU = 50,000,000 users
photo uploads = 2 per user per day
feed reads = 30 per user per day
feed response metadata = 20 KB
photo size = 2 MB
peak factor = 5
replication factor = 3
retention = 365 days
Writes: photo uploads
daily_uploads = 50M * 2 = 100M uploads/day
avg_write_qps = 100M / 86,400 ≈ 1,160 writes/s
peak_write_qps = 1,160 * 5 ≈ 5,800 writes/s
Reads: feed requests
daily_feed_reads = 50M * 30 = 1.5B reads/day
avg_read_qps = 1.5B / 86,400 ≈ 17,400 reads/s
peak_read_qps = 17,400 * 5 ≈ 87,000 reads/s
Photo storage
logical_photo_storage_per_day = 100M * 2 MB = 200 TB/day
logical_photo_storage_1_year = 200 TB/day * 365 ≈ 73 PB
raw_replicated_storage = 73 PB * 3 ≈ 219 PB
Feed metadata bandwidth at peak
peak_feed_bandwidth = 87,000 reads/s * 20 KB ≈ 1.7 GB/s
peak_feed_bandwidth_bits = 1.7 GB/s * 8 ≈ 13.6 Gbps before protocol overhead
Rough node counts
app servers at safe 5K QPS/node:
ceil(87,000 / 5,000) = ceil(17.4) = 18 nodes for feed reads
add redundancy and multi-AZ headroom → ~30+ nodes
upload metadata DB at safe 2K writes/s/primary shard:
ceil(5,800 / 2,000) = ceil(2.9) = 3 write shards before headroom
object storage:
219 PB raw is not a single-disk problem; use managed object storage
and put a CDN in front for popular photo downloads
The architecture now follows the numbers. Feed reads need caching and horizontal app servers. Photo bytes should go directly to object storage, not through app servers. Storage volume is large enough that lifecycle policies, compression, and CDN hit rate are first-class design concerns.
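The arithmetic in the worked example can be re-run as a short script. This sketch uses only the inputs stated above, with decimal units (1 KB = 1,000 bytes). The variable names are illustrative, and the peak upload bandwidth line is an extra figure derived from the same inputs rather than one the example lists.

```python
import math

SECONDS_PER_DAY = 86_400
KB, MB, PB = 10**3, 10**6, 10**15  # decimal units, matching the example

# Inputs from the worked example
dau = 50_000_000
uploads_per_user = 2
feed_reads_per_user = 30
feed_response_bytes = 20 * KB
photo_bytes = 2 * MB
peak_factor = 5
replication_factor = 3
retention_days = 365

# Writes and reads
daily_uploads = dau * uploads_per_user
peak_write_qps = daily_uploads / SECONDS_PER_DAY * peak_factor
daily_feed_reads = dau * feed_reads_per_user
peak_read_qps = daily_feed_reads / SECONDS_PER_DAY * peak_factor

# Storage: daily bytes * retention * replication
raw_storage_pb = daily_uploads * photo_bytes * retention_days * replication_factor / PB

# Bandwidth in Gbps (bytes/s * 8 bits / 1e9)
peak_feed_gbps = peak_read_qps * feed_response_bytes * 8 / 1e9
peak_upload_gbps = peak_write_qps * photo_bytes * 8 / 1e9  # derived, not listed in the example

# Minimum node counts before headroom
app_nodes = math.ceil(peak_read_qps / 5_000)
write_shards = math.ceil(peak_write_qps / 2_000)

print(f"peak writes/s:   {peak_write_qps:,.0f}")
print(f"peak reads/s:    {peak_read_qps:,.0f}")
print(f"raw storage:     {raw_storage_pb:,.0f} PB")
print(f"feed bandwidth:  {peak_feed_gbps:.1f} Gbps")
print(f"upload ingest:   {peak_upload_gbps:.1f} Gbps")
print(f"app nodes:       {app_nodes}, write shards: {write_shards}")
```

Without intermediate rounding, feed bandwidth comes out near 13.9 Gbps; the 13.6 Gbps figure above results from rounding to 1.7 GB/s first. Peak upload ingest is roughly 93 Gbps, several times the feed metadata traffic, which supports sending photo bytes straight to object storage.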
Turning numbers into architecture choices
After the chain, ask which number dominates. Sometimes QPS dominates and you need caching, partitioning, and load balancing. Sometimes storage dominates and you need object storage, compaction, retention, or tiering. Sometimes bandwidth dominates and you need CDN, compression, batching, or a different product contract.
- High read QPS: cache hot objects, precompute read models, and avoid expensive joins on the critical path.
- High write QPS: partition by key, queue bursty work, and make writes idempotent for retries.
- High storage: define retention early, separate hot and cold data, and model index overhead.
- High bandwidth: move bytes through CDN or object storage, compress responses, and avoid fanout copies when references work.
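One rough way to ask which number dominates is to compute how many nodes each resource would need on its own and compare. The sketch below does that; `dominant_dimension` is a hypothetical helper, and the per-node bandwidth (1 GB/s) and storage (100 TB) limits are illustrative assumptions, not figures from the text. In practice the three counts often refer to different tiers, so treat the result as a signal for where to focus, not as a node count.

```python
import math


def dominant_dimension(peak_qps, peak_bytes_per_s, raw_storage_bytes,
                       safe_qps_per_node, safe_bytes_per_s_per_node,
                       safe_storage_bytes_per_node):
    """Return the resource that needs the most nodes, plus all three counts."""
    needed = {
        "qps": math.ceil(peak_qps / safe_qps_per_node),
        "bandwidth": math.ceil(peak_bytes_per_s / safe_bytes_per_s_per_node),
        "storage": math.ceil(raw_storage_bytes / safe_storage_bytes_per_node),
    }
    return max(needed, key=needed.get), needed


# Photo feed numbers from the worked example; per-node limits are assumed.
dimension, needed = dominant_dimension(
    peak_qps=87_000,
    peak_bytes_per_s=87_000 * 20_000,        # feed metadata only
    raw_storage_bytes=219 * 10**15,          # 219 PB
    safe_qps_per_node=5_000,
    safe_bytes_per_s_per_node=10**9,         # assumed 1 GB/s per node
    safe_storage_bytes_per_node=100 * 10**12,  # assumed 100 TB per node
)
print(dimension, needed)
```

With these assumptions storage dominates by two orders of magnitude, which matches the conclusion that retention, tiering, and object storage deserve the most design attention for the photo feed.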
Gotchas and calibration
The capacity chain is simple, but inputs are often fuzzy. Make your assumptions visible and keep them easy to change. A design doc should say "assuming 5× peak and 20 KB feed responses", not hide those values inside a final node count.
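- Do not mix units: confusing bits with bytes creates an 8× error, and slipping a prefix (KB vs. MB) creates a 1,000× error. KB vs. KiB differs by only about 2.4%, but the gap grows to roughly 10% at TB vs. TiB.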
- Separate reads and writes: one blended QPS number can hide the fact that databases and caches face very different loads.
- Include retries: timeouts, at-least-once consumers, and client retries can amplify peak load during incidents.
- Leave headroom: node counts from division are minimums. Add capacity for deploys, failures, uneven partitions, and growth.
- The capacity chain is DAU → actions/day → average QPS → peak QPS → storage → bandwidth → node count.
- Use peak QPS, not average QPS, for synchronous serving paths; queue only work that the product can tolerate delaying.
- Storage estimates must include item size, replication factor, indexes, backups, and retention window.
- Bandwidth is peak QPS times payload size; large bytes should usually move through CDN or object storage rather than app servers.
- Node counts come from dividing by conservative per-node capacity and then adding headroom for failures, deploys, and growth.
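The retry and headroom cautions can be folded into the node-count step. The sketch below is one possible way to do it, under stated assumptions: `nodes_with_headroom` is a hypothetical helper, and `retry_amplification`, `target_utilization`, and `spare_nodes` are illustrative knobs rather than values from the text. If your `safe_qps_per_node` already includes headroom, set `target_utilization` to 1.0 to avoid counting it twice.

```python
import math


def nodes_with_headroom(peak_qps, safe_qps_per_node,
                        retry_amplification=1.0,
                        target_utilization=0.6,
                        spare_nodes=1):
    """Return (minimum, padded) node counts.

    minimum: plain division of retry-inflated peak load by per-node capacity.
    padded:  same load with each node run at target_utilization, plus spares.
    """
    effective_peak = peak_qps * retry_amplification
    minimum = math.ceil(effective_peak / safe_qps_per_node)
    padded = math.ceil(effective_peak / (safe_qps_per_node * target_utilization))
    return minimum, padded + spare_nodes


# Feed reads from the worked example, no retries.
baseline = nodes_with_headroom(87_000, 5_000)

# Same load with an assumed 20% retry amplification during an incident.
incident = nodes_with_headroom(87_000, 5_000, retry_amplification=1.2)

print(baseline)  # (18, 30)
print(incident)  # (21, 36)
```

With 60% target utilization and one spare, the 18-node minimum for feed reads grows to 30, in line with the "~30+ nodes" estimate in the worked example.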