Latency vs Throughput

Two numbers people constantly mix up — and the difference matters in every design.

Latency and throughput are the two most common performance words in system design, but they measure different things. Latency is how long one operation takes. Throughput is how many operations the system completes per unit of time. Confusing them leads to designs that scale the wrong bottleneck.

🔭Think of it like…

Think of a highway. Latency is the time one car takes to drive from one city to another. Throughput is how many cars pass a checkpoint each minute. Adding lanes can let more cars through without making the road shorter. Clearing an accident can reduce one car's trip time without adding any new lanes.

The problem: performance has two axes

A system can be slow for one user, overloaded for all users, or both. Latency tells you about the experience of an individual request. Throughput tells you about the system's total work rate. If you only measure one, you can make the wrong fix.

two different failure modes

low throughput problem:
incoming requests: 20,000 RPS
service capacity:   5,000 RPS
result: queue grows, timeouts rise, latency gets worse because work waits

high latency problem:
incoming requests: 100 RPS
service capacity:  5,000 RPS
one request path:  browser → API → DB → remote API → API → browser
result: each request waits on slow dependencies even though the fleet is idle
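The two failure modes can be put into rough numbers. This is a minimal sketch: the function names and the per-hop latencies are illustrative assumptions, not measurements from a real system, and the queue is assumed to be unbounded with constant rates.

```python
def backlog_after(seconds: float, arrival_rps: float, capacity_rps: float) -> float:
    """Requests waiting after `seconds`, assuming constant rates and an unbounded queue."""
    return max(0.0, (arrival_rps - capacity_rps) * seconds)


def path_latency_ms(hop_latencies_ms: list[float]) -> float:
    """Latency of a sequential request path is the sum of its hops."""
    return sum(hop_latencies_ms)


# Low throughput problem: demand exceeds capacity, so the queue grows every second.
print(backlog_after(10, arrival_rps=20_000, capacity_rps=5_000))  # 150000.0

# High latency problem: the fleet has spare capacity, so nothing queues...
print(backlog_after(10, arrival_rps=100, capacity_rps=5_000))  # 0.0

# ...but each request still pays for every hop on its path.
# Hypothetical per-hop times in ms: browser→API, API→DB, API→remote API, API→browser.
print(path_latency_ms([40, 15, 600, 40]))  # 695
```

In the first case adding capacity helps. In the second it does not, because the time is spent on the path and not in a queue.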

Metric	Measures	Common units	User question
Latency	Duration of one operation	ms, seconds, microseconds	How long did my request take?
Throughput	Completed work per time	RPS, QPS, MB/s, messages/sec	How many requests can the system handle?

Precise definitions

Latency is a time interval: request start to response end, or write start to durable commit. Throughput is a rate: completed operations divided by elapsed time. One is measured in time; the other is measured in work per time.
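As a sketch of how the two definitions differ in practice, the snippet below derives both numbers from the same request log. The `RequestRecord` type and the sample timestamps are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float  # when the request was received
    end_s: float    # when the response finished


def latencies_ms(records: list[RequestRecord]) -> list[float]:
    """Latency is a time interval: one value per request."""
    return [(r.end_s - r.start_s) * 1000 for r in records]


def throughput_rps(records: list[RequestRecord]) -> float:
    """Throughput is a rate: completed requests divided by the observation window."""
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return len(records) / window_s


records = [
    RequestRecord(0.00, 0.25),
    RequestRecord(0.50, 0.75),
    RequestRecord(1.00, 1.25),
    RequestRecord(1.50, 2.00),
]
print(latencies_ms(records))    # [250.0, 250.0, 250.0, 500.0]
print(throughput_rps(records))  # 2.0
```

The same log yields a list of durations for latency and a single rate for throughput, which is why the two cannot be read off each other.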

Why latency and throughput are independent

They are related in overloaded systems, but they are not the same knob. You can raise throughput by doing more work in parallel while the latency of each unit stays unchanged. You can lower latency by removing hops while maximum throughput stays unchanged.

same latency, higher throughput

one worker:
  each request takes 100 ms
  max throughput ≈ 10 requests/sec

ten identical workers:
  each request still takes 100 ms
  max throughput ≈ 100 requests/sec

The trip did not get shorter. More trips happen at once.
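The worker example can be checked with a small deterministic simulation. This is a sketch under simplifying assumptions: every request is ready at time 0, every request takes exactly the same time, and the function name `max_throughput_rps` is made up for this illustration.

```python
import heapq


def max_throughput_rps(num_requests: int, workers: int, service_ms: int) -> float:
    """Completed requests per second when identical jobs run on parallel workers.

    Assumes every request is ready at time 0 and takes exactly `service_ms`.
    """
    free_at_ms = [0] * workers  # min-heap: when each worker is next free
    for _ in range(num_requests):
        start_ms = heapq.heappop(free_at_ms)  # earliest available worker
        heapq.heappush(free_at_ms, start_ms + service_ms)
    makespan_ms = max(free_at_ms)  # when the last request finishes
    return num_requests * 1000 / makespan_ms


print(max_throughput_rps(100, workers=1, service_ms=100))   # 10.0
print(max_throughput_rps(100, workers=10, service_ms=100))  # 100.0
```

The service time per request is 100 ms in both runs. Only the number of requests in progress at once changes.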

Real system examples

Video transcoding: a batch pipeline may process millions of videos per day while one video still takes minutes. High throughput, high latency.
High-frequency trading: a service may optimize a single decision path to microseconds even if total request volume is modest. Low latency is the product.
CDNs such as Cloudflare or Akamai: caching content near users lowers latency. Their global fleet also raises aggregate throughput, but those are separate benefits.

Queueing links the two under overload

When arrival rate approaches service capacity, requests wait in a queue. The actual work may still take 20 ms, but waiting 800 ms before work begins makes observed latency 820 ms. This is why an overloaded system often shows both low throughput headroom and terrible latency.
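One way to see how quickly waiting grows is the textbook M/M/1 queue. That model is an assumption introduced here, not something the lesson specifies: one server, random (Poisson) arrivals, and exponentially distributed service times. Real services behave differently in detail, but the shape of the curve is similar.

```python
def mm1_time_in_system_ms(service_ms: float, utilization: float) -> float:
    """Average time in system (waiting + service) for an M/M/1 queue.

    W = S / (1 - rho), where S is the mean service time and
    rho = arrival rate / service capacity.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1 or above the queue grows without bound")
    return service_ms / (1 - utilization)


for rho in (0.50, 0.90, 0.95, 0.99):
    total = mm1_time_in_system_ms(20, rho)
    print(f"utilization {rho:.0%}: {total:6.0f} ms total, {total - 20:6.0f} ms of it waiting")
```

Under this model, 20 ms of work turns into about 200 ms of observed latency at 90% utilization and about 2,000 ms at 99%, even though the work itself never got slower.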

Percentiles: p50, p95, p99, and the tail

Average latency hides pain. Users do not experience averages; each user experiences one request at a time. Percentiles tell you how latency is distributed across requests.

Percentile	Meaning	Why it matters
p50	50% of requests are faster than this	Typical experience; useful for baseline health
p95	95% are faster; 5% are slower	Common SLO target for interactive APIs
p99	99% are faster; 1% are slower	Captures tail pain at scale
p99.9	Only 0.1% are slower	Important for huge systems where rare events happen constantly

why averages lie

latencies for 10 requests:
10 ms, 10 ms, 11 ms, 12 ms, 12 ms, 13 ms, 14 ms, 15 ms, 16 ms, 2000 ms

average ≈ 211 ms
p50     ≈ 12 ms
p90     ≈ 16 ms (nearest rank: the 9th of 10 sorted values)
p99     ≈ 2000 ms in this tiny sample

The average is not the typical user, and it does not describe the worst pain well.
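The same sample can be checked in a few lines. The sketch below uses the nearest-rank percentile definition, which is an assumption: other definitions interpolate between samples and give different values on a sample this small. The helper name `nearest_rank` is made up for this example.

```python
import math
from statistics import mean


def nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest sample such that at least `pct` percent of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


samples_ms = [10, 10, 11, 12, 12, 13, 14, 15, 16, 2000]
print(mean(samples_ms))              # 211.3
print(nearest_rank(samples_ms, 50))  # 12
print(nearest_rank(samples_ms, 99))  # 2000
```

A single outlier drags the mean to roughly 17 times the median, while the high percentile reports the outlier directly.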

Tail latency dominates user experience because modern pages and mobile screens often fan out to many backend calls. If a page needs 50 calls, one slow dependency can make the whole page slow.

tail latency compounds across fanout

If each backend call has a 1% chance of being slow:

single call slow chance:       1%
50 independent calls:
  chance at least one is slow = 1 - 0.99^50
                              ≈ 39.5%

At fanout, rare slow calls become a common user-visible event.
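The fanout arithmetic generalizes to any call count. A minimal sketch, assuming the calls are independent and each has the same chance of being slow (the function name is hypothetical):

```python
def chance_any_slow(fanout: int, p_slow: float) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1 - (1 - p_slow) ** fanout


for n in (1, 10, 50, 100):
    print(f"{n:3d} calls: {chance_any_slow(n, 0.01):.1%}")
# prints:
#   1 calls: 1.0%
#  10 calls: 9.6%
#  50 calls: 39.5%
# 100 calls: 63.4%
```

Real dependencies are rarely fully independent, so treat the result as an estimate of the trend, not a prediction.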

Little's Law: connecting concurrency, throughput, and latency

Little's Law is a small formula with enormous design value. In a stable system, the average number of in-flight items equals arrival rate multiplied by average time in the system.

Little's Law

L = λ × W

L = average number of requests in the system
λ = arrival/completion rate (requests per second) in a stable system
W = average time each request spends in the system (seconds)

Example:
λ = 2,000 RPS
W = 0.150 seconds
L = 2,000 × 0.150 = 300 in-flight requests

This tells you how much concurrency you need. If your service handles 2,000 RPS and each request takes 150 ms on average, you should expect about 300 concurrent requests in flight even before spikes. If average latency rises to 2 seconds at the same arrival rate, that becomes about 4,000 in-flight requests, enough to exhaust thread pools, connection pools, or memory.

Use it for quick sanity checks

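If someone claims a single-threaded service can handle 10,000 RPS while each request takes 50 ms of blocking work, Little's Law should make you suspicious. 10,000 × 0.050 means 500 requests need to be in progress at once, but a single thread doing blocking work holds only one, which caps it at about 1 / 0.050 = 20 RPS.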
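Both directions of the formula fit in a few lines. This is a sketch: the function names are made up, and times are passed in milliseconds only to keep the arithmetic exact.

```python
def in_flight(arrival_rps: float, time_in_system_ms: float) -> float:
    """Little's Law: L = lambda * W (W converted from ms to seconds)."""
    return arrival_rps * time_in_system_ms / 1000


def max_rps(concurrency: int, time_in_system_ms: float) -> float:
    """Rearranged: lambda = L / W, the rate a fixed concurrency can sustain."""
    return concurrency * 1000 / time_in_system_ms


print(in_flight(2_000, 150))    # 300.0
print(in_flight(2_000, 2_000))  # 4000.0  (same traffic, 2 s in the system)
print(in_flight(10_000, 50))    # 500.0
print(max_rps(1, 50))           # 20.0    (one blocking thread, 50 ms per request)
```

The second form is often the more useful one in design reviews: fix the concurrency a service can actually hold, then see what rate that supports.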

Batching: trading latency for throughput

Batching groups many small operations into one larger operation. It often increases throughput because fixed overhead is paid once per batch instead of once per item. The cost is latency: the first item in a batch waits for more items to arrive or for a timer to fire.

batching trade-off

without batching:
  1 message → 1 network call
  overhead paid 1,000 times for 1,000 messages
  low waiting latency, lower throughput

with batching:
  collect up to 100 messages or wait up to 50 ms
  100 messages → 1 network call
  overhead paid 10 times for 1,000 messages
  higher throughput, but first message may wait 50 ms

Technique	Latency effect	Throughput effect	Used by
Small batches	Adds a bounded wait	Better CPU/network efficiency	Kafka producers, database bulk inserts
Large batches	Can add visible delay	Very high throughput	Analytics ETL, log compaction
No batching	Lowest wait per item	More overhead per item	Interactive payments, login requests

Kafka, Kinesis, SQS consumers, database bulk loaders, GPU inference servers, and analytics systems all use batching. It is a great fit when throughput matters more than the latency of any single item.
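The trade-off can be sketched with a simple cost model and a size-or-timeout batcher. Everything here is illustrative: the function names, the 9 ms overhead, and the 1 ms per-item cost are assumptions, and this is not how any particular system such as Kafka implements batching.

```python
def throughput_per_s(batch_size: int, overhead_ms: float, per_item_ms: float) -> float:
    """Items per second for one sender that pays a fixed overhead per call."""
    call_ms = overhead_ms + batch_size * per_item_ms
    return batch_size * 1000 / call_ms


def batch_arrivals(
    arrivals_ms: list[int], max_items: int, max_wait_ms: int
) -> list[tuple[int, list[int]]]:
    """Group sorted arrival timestamps into batches flushed on size or timeout.

    Returns (flush_time_ms, arrival_times_in_batch) pairs.
    """
    batches = []
    current: list[int] = []
    for t in arrivals_ms:
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))  # timer fired first
            current = []
        current.append(t)
        if len(current) == max_items:
            batches.append((t, current))  # batch filled first
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))  # final timer flush
    return batches


# Throughput: fixed overhead is paid once per call, not once per item.
print(throughput_per_s(1, overhead_ms=9, per_item_ms=1))              # 100.0
print(round(throughput_per_s(100, overhead_ms=9, per_item_ms=1), 1))  # 917.4

# Latency: the first item in each batch waits for the fill or the timer.
print(batch_arrivals([0, 10, 20, 200], max_items=3, max_wait_ms=50))
# [(20, [0, 10, 20]), (250, [200])]
```

In the last example the item that arrived at 0 ms waits 20 ms for the batch to fill, and the lone item at 200 ms waits the full 50 ms for the timer. The maximum wait is the knob that bounds the added latency.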

Real latency numbers and gotchas

You do not need to memorize every number, but you should know orders of magnitude. They help you spot impossible designs.

Approximate latency	Operation
~0.1 ms	Read 1 MB sequentially from memory
~0.1-1 ms	SSD random read, depending on device and queueing
~0.5-1 ms	Same-zone network round trip
~20-80 ms	Cross-country network round trip
~100-200 ms	Intercontinental network round trip
Seconds	Cold starts, overloaded queues, retries, or slow third-party APIs

Coordinated omission: a benchmark that sends the next request only after the previous response can hide queueing latency.
Warm vs cold paths: cache hits, JIT warmup, open database connections, and TLS session reuse can make demos faster than real cold requests.
Retries inflate tails: retries improve success rate but can make p99 latency much worse unless bounded by timeouts and deadlines.
Bandwidth is not latency: a 10 Gbps link can move many bytes per second, but it cannot make light cross an ocean instantly.
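The last gotcha is easy to quantify. The sketch below uses a deliberately rough model, one round trip plus time on the wire, and ignores handshakes, TCP slow start, and packet loss. The function name and the 10 KB payload are assumptions for illustration.

```python
def transfer_time_ms(size_bytes: int, bandwidth_bps: float, rtt_ms: float) -> float:
    """Rough time to fetch a payload: one round trip plus time on the wire.

    Ignores connection setup, TCP slow start, and packet loss.
    """
    wire_ms = size_bytes * 8 / bandwidth_bps * 1000
    return rtt_ms + wire_ms


payload = 10_000  # a 10 KB API response
for gbps in (1, 10):
    total = transfer_time_ms(payload, bandwidth_bps=gbps * 1e9, rtt_ms=150)
    print(f"{gbps:2d} Gbps over a 150 ms round trip: {total:.3f} ms")
```

Under this model, moving from 1 Gbps to 10 Gbps saves well under a tenth of a millisecond on a small response, while the 150 ms round trip is unchanged. Bandwidth matters for large transfers; distance dominates small ones.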

Key takeaways

Latency is the time for one operation; throughput is completed work per unit time.
They are independent axes until overload creates queueing, which makes latency spike.
Use p50, p95, and p99 instead of averages because tail latency is what users feel at scale.
Little's Law, L = λ × W, connects in-flight concurrency, throughput, and response time.
Batching often raises throughput by amortizing overhead, but it adds waiting latency.

Adding more servers does not, by itself, make a single request faster. More servers usually raise throughput by handling more requests in parallel. A single request still performs the same work unless the added servers remove queueing or allow the request itself to be parallelized.

Tail latency matters because, at scale, rare slow requests happen constantly, and user journeys often require many backend calls. One slow call can make an entire page or transaction feel slow, even when the average looks healthy.

Batching trades latency for throughput. Items wait for a batch to fill or a timer to fire, but the system pays fixed overhead fewer times and can process more total work per second.
