Latency vs Throughput
Two numbers people constantly mix up — and the difference matters in every design.
Latency and throughput are the two most common performance words in system design, but they measure different things. Latency is how long one operation takes. Throughput is how many operations the system completes per unit of time. Confusing them leads to designs that scale the wrong bottleneck.
The problem: performance has two axes
A system can be slow for one user, overloaded for all users, or both. Latency tells you about the experience of an individual request. Throughput tells you about the system's total work rate. If you measure only one, you can end up applying the wrong fix.
low throughput problem:
incoming requests: 20,000 RPS
service capacity: 5,000 RPS
result: queue grows, timeouts rise, latency gets worse because work waits
high latency problem:
incoming requests: 100 RPS
service capacity: 5,000 RPS
one request path: browser → API → DB → remote API → API → browser
result: each request waits on slow dependencies even though the fleet is idle
| Metric | Measures | Common units | User question |
|---|---|---|---|
| Latency | Duration of one operation | ms, seconds, microseconds | How long did my request take? |
| Throughput | Completed work per time | RPS, QPS, MB/s, messages/sec | How many requests can the system handle? |
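The overload case at the start of this section can be sketched with simple arithmetic. This is a rough model, not a queueing simulator: it assumes constant arrival and service rates, an unbounded FIFO queue, and no timeouts or load shedding. The function names `backlog_after` and `queue_wait_seconds` are made up for illustration.

```python
def backlog_after(seconds: float, arrival_rps: float, capacity_rps: float) -> float:
    """Requests still waiting after `seconds` of sustained load.

    Assumes constant rates and an unbounded queue (illustrative only).
    """
    surplus_rps = max(0.0, arrival_rps - capacity_rps)
    return surplus_rps * seconds


def queue_wait_seconds(backlog: float, capacity_rps: float) -> float:
    """How long a newly arrived request waits behind the backlog (FIFO)."""
    return backlog / capacity_rps


if __name__ == "__main__":
    # Throughput problem: 20,000 RPS arriving, 5,000 RPS of capacity.
    backlog = backlog_after(10, arrival_rps=20_000, capacity_rps=5_000)
    print(backlog)                             # 150,000 requests queued after 10 s
    print(queue_wait_seconds(backlog, 5_000))  # 30 s of waiting for a new arrival

    # Latency problem: 100 RPS arriving, so no backlog forms at all.
    print(backlog_after(10, arrival_rps=100, capacity_rps=5_000))
```

In the second case the backlog is zero, so any slowness has to come from the request path itself, not from a lack of capacity.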
Why latency and throughput are independent
They are related in overloaded systems, but they are not the same knob. You can raise throughput by doing more work in parallel while the latency of each unit stays unchanged. You can lower latency by removing hops while maximum throughput stays unchanged.
one worker:
each request takes 100 ms
max throughput ≈ 10 requests/sec
ten identical workers:
each request still takes 100 ms
max throughput ≈ 100 requests/sec
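The worker arithmetic above can be written as a one-line model. It is a sketch that assumes each worker handles one request at a time and that the workers share no bottleneck such as a database or a lock. The function name `max_throughput_rps` is hypothetical.

```python
def max_throughput_rps(workers: int, latency_s: float) -> float:
    """Upper bound on completed requests per second.

    Assumes one request per worker at a time and no shared bottleneck.
    """
    return workers / latency_s


if __name__ == "__main__":
    latency_s = 0.100  # each request takes 100 ms regardless of worker count
    for workers in (1, 10):
        print(workers, "workers:", round(max_throughput_rps(workers, latency_s)), "requests/sec")
```

Latency is an input to the model and never changes. Only the worker count moves the throughput ceiling.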
The trip did not get shorter. More trips happen at once.
Real system examples
- Video transcoding: a batch pipeline may process millions of videos per day while one video still takes minutes. High throughput, high latency.
- High-frequency trading: a service may optimize a single decision path to microseconds even if total request volume is modest. Low latency is the product.
- CDNs such as Cloudflare or Akamai: caching content near users lowers latency. Their global fleet also raises aggregate throughput, but those are separate benefits.
Percentiles: p50, p95, p99, and the tail
Average latency hides pain. Users do not experience averages; each user experiences one request at a time. Percentiles tell you how latency is distributed across requests.
| Percentile | Meaning | Why it matters |
|---|---|---|
| p50 | 50% of requests are faster than this | Typical experience; useful for baseline health |
| p95 | 95% are faster; 5% are slower | Good product SLO for interactive APIs |
| p99 | 99% are faster; 1% are slower | Captures tail pain at scale |
| p99.9 | Only 0.1% are slower | Important for huge systems where rare events happen constantly |
latencies for 10 requests:
10 ms, 10 ms, 11 ms, 12 ms, 12 ms, 13 ms, 14 ms, 15 ms, 16 ms, 2000 ms
average ≈ 210 ms
p50 ≈ 12 ms
p99 ≈ 2000 ms in this tiny sample (by nearest rank, p90 is still only 16 ms)
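The ten-request sample can be checked with a short script. It assumes the nearest-rank definition of a percentile, which is one of several in common use. Under nearest rank, p90 of ten sorted values is the ninth value, so the 2000 ms outlier first appears at p99. Interpolating definitions give different numbers for a sample this small. The helper name `percentile` is illustrative.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


latencies_ms = [10, 10, 11, 12, 12, 13, 14, 15, 16, 2000]

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))            # 211.3
    print("p50:", percentile(latencies_ms, 50))   # 12
    print("p90:", percentile(latencies_ms, 90))   # 16
    print("p99:", percentile(latencies_ms, 99))   # 2000
```

Nine of the ten requests finished in 16 ms or less, yet the mean is above 200 ms because of a single outlier.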
The average is not the typical user, and it does not describe the worst pain well.
Tail latency dominates user experience because modern pages and mobile screens often fan out to many backend calls. If a page needs 50 calls, one slow dependency can make the whole page slow.
If each backend call has a 1% chance of being slow:
single call slow chance: 1%
50 independent calls:
chance at least one is slow = 1 - 0.99^50
≈ 39.5%
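The same calculation works for any fan-out width. This sketch assumes the calls are independent and share one slow probability, which real dependencies often violate. The function name `chance_any_slow` is made up for illustration.

```python
def chance_any_slow(calls: int, slow_probability: float) -> float:
    """Probability that at least one of `calls` independent calls is slow."""
    return 1 - (1 - slow_probability) ** calls


if __name__ == "__main__":
    for calls in (1, 10, 50, 100):
        print(calls, "calls:", f"{chance_any_slow(calls, 0.01):.1%}")
```

For 50 calls this prints roughly 39.5%, matching the arithmetic above.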
At high fan-out, rare slow calls become a common user-visible event.
Little's Law: connecting concurrency, throughput, and latency
Little's Law is a small formula with enormous design value. In a stable system, the average number of in-flight items equals arrival rate multiplied by average time in the system.
L = λ × W
L = average number of requests in the system
λ = arrival/completion rate (requests per second) in a stable system
W = average time each request spends in the system (seconds)
Example:
λ = 2,000 RPS
W = 0.150 seconds
L = 2,000 × 0.150 = 300 in-flight requests
This tells you how much concurrency you need. If your service handles 2,000 RPS and each request takes 150 ms on average, you should expect about 300 concurrent requests in flight even before spikes. Little's Law works on averages, so if slow dependencies push average latency to 2 seconds at the same arrival rate, in-flight work grows to about 4,000 requests.
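Little's Law is easy to turn into a sizing helper. This sketch assumes a stable system where arrivals match completions, and it works on average latency, not on a percentile. The function name `in_flight` is hypothetical.

```python
def in_flight(arrival_rps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the average number of requests in the system."""
    return arrival_rps * avg_latency_s


if __name__ == "__main__":
    # Normal operation: 2,000 RPS at 150 ms average latency.
    print(round(in_flight(2_000, 0.150)))  # about 300 concurrent requests

    # Same arrival rate, but average latency degrades to 2 seconds.
    print(round(in_flight(2_000, 2.0)))    # about 4,000 concurrent requests
```

The result is a lower bound when sizing connection pools, worker threads, and queue depths. If the system cannot hold that many requests at once, it cannot sustain that arrival rate at that latency.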
Batching: trading latency for throughput
Batching groups many small operations into one larger operation. It often increases throughput because fixed overhead is paid once per batch instead of once per item. The cost is latency: the first item in a batch waits for more items to arrive or for a timer to fire.
without batching:
1 message → 1 network call
overhead paid 1,000 times for 1,000 messages
low waiting latency, lower throughput
with batching:
collect up to 100 messages or wait up to 50 ms
100 messages → 1 network call
overhead paid 10 times for 1,000 messages
higher throughput, but first message may wait 50 ms
| Technique | Latency effect | Throughput effect | Used by |
|---|---|---|---|
| Small batches | Adds a bounded wait | Better CPU/network efficiency | Kafka producers, database bulk inserts |
| Large batches | Can add visible delay | Very high throughput | Analytics ETL, log compaction |
| No batching | Lowest wait per item | More overhead per item | Interactive payments, login requests |
Kafka, Kinesis, SQS consumers, database bulk loaders, GPU inference servers, and analytics systems all use batching. It is a great fit when throughput matters more than the latency of any single item.
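The "collect up to 100 messages or wait up to 50 ms" rule can be sketched as a small size-or-time batcher. This is an illustrative class, not the API of Kafka or any other system named above. The `Batcher` class and its `add` and `poll` methods are hypothetical, and the caller passes in timestamps so the behavior is easy to follow and test.

```python
from typing import Any, List, Optional


class Batcher:
    """Flush when the batch is full or when its oldest item has waited too long."""

    def __init__(self, max_items: int = 100, max_wait_s: float = 0.050) -> None:
        self.max_items = max_items
        self.max_wait_s = max_wait_s
        self._items: List[Any] = []
        self._first_arrival: Optional[float] = None

    def add(self, item: Any, now: float) -> Optional[List[Any]]:
        """Buffer an item; return the batch if the size limit is reached."""
        if not self._items:
            self._first_arrival = now  # the wait timer starts with the first item
        self._items.append(item)
        if len(self._items) >= self.max_items:
            return self._flush()
        return None

    def poll(self, now: float) -> Optional[List[Any]]:
        """Return the pending batch if the oldest item has waited max_wait_s."""
        if self._items and now - self._first_arrival >= self.max_wait_s:
            return self._flush()
        return None

    def _flush(self) -> List[Any]:
        batch, self._items = self._items, []
        self._first_arrival = None
        return batch


if __name__ == "__main__":
    batcher = Batcher(max_items=3, max_wait_s=0.050)
    print(batcher.add("a", now=0.000))  # None: still collecting
    print(batcher.add("b", now=0.010))  # None
    print(batcher.add("c", now=0.020))  # ['a', 'b', 'c']: size limit reached
    print(batcher.add("d", now=0.030))  # None
    print(batcher.poll(now=0.090))      # ['d']: timer expired for a partial batch
```

The two limits are the tuning knobs. A larger `max_items` amortizes more overhead per call, while `max_wait_s` caps the latency that batching can add to any single item.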
Real latency numbers and gotchas
You do not need to memorize every number, but you should know orders of magnitude. They help you spot impossible designs.
- Coordinated omission: a benchmark that sends the next request only after the previous response can hide queueing latency.
- Warm vs cold paths: cache hits, JIT warmup, open database connections, and TLS session reuse can make demos faster than real cold requests.
- Retries inflate tails: retries improve success rate but can make p99 latency much worse unless bounded by timeouts and deadlines.
- Bandwidth is not latency: a 10 Gbps link can move many bytes per second, but it cannot make light cross an ocean instantly.
Key takeaways
- Latency is the time for one operation; throughput is completed work per unit time.
- They are independent axes until overload creates queueing, which makes latency spike.
- Use p50, p95, and p99 instead of averages because tail latency is what users feel at scale.
- Little's Law, L = λ × W, connects in-flight concurrency, throughput, and response time.
- Batching often raises throughput by amortizing overhead, but it adds waiting latency.
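The "bandwidth is not latency" gotcha can be made concrete with a rough transfer-time model. The 150 ms round-trip time below is an assumed figure for a long intercontinental path, not a measurement, and the model ignores connection setup, congestion control, and packet loss. The function name `transfer_time_s` is made up for illustration.

```python
def transfer_time_s(size_bytes: float, bandwidth_bps: float, rtt_s: float) -> float:
    """One round trip plus the time to push the payload through the link."""
    return rtt_s + (size_bytes * 8) / bandwidth_bps


if __name__ == "__main__":
    rtt_s = 0.150  # assumed round-trip time across an ocean
    small_payload = 1_000  # a 1 KB API response

    at_10_gbps = transfer_time_s(small_payload, 10e9, rtt_s)
    at_100_gbps = transfer_time_s(small_payload, 100e9, rtt_s)
    print(f"10 Gbps:  {at_10_gbps * 1000:.4f} ms")
    print(f"100 Gbps: {at_100_gbps * 1000:.4f} ms")
```

For a small payload, ten times the bandwidth saves less than a microsecond, because the round trip dominates. Reducing that term means moving the data closer or removing round trips.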