Rate Limiting
Protect your system from abuse and overload by capping how fast clients can call you.
Rate limiting controls how many requests a caller can make in a period of time. It protects shared resources from abuse, bugs, scrapers, brute-force attacks, and sudden traffic spikes while preserving fairness for well-behaved users.
The problem: one caller can consume the shared system
The failure mode is not only malicious denial-of-service. A mobile app bug can retry in a tight loop, a partner integration can batch the wrong way, a crawler can walk every URL, or a login attacker can try millions of passwords. Without limits, one actor can exhaust CPU, database connections, queue capacity, third-party quotas, or your cloud budget.
- Availability: protect backends so one noisy caller does not starve everyone else.
- Security: slow brute-force logins, credential stuffing, scraping, and token guessing.
- Fairness: enforce per-plan or per-tenant quotas so paid capacity is shared predictably.
- Cost control: cap expensive operations such as AI calls, exports, SMS sends, and search aggregations.
What key are you limiting?
Rate limiting is only as good as the identity key behind it. Different endpoints need different dimensions, and production systems often stack several limits on the same request.
| Dimension | Good for | Caution |
|---|---|---|
| Per IP | Anonymous traffic, login attempts, public pages | NATs and mobile carriers put many users behind one IP |
| Per user | Authenticated product limits and fairness | Attackers may create many accounts |
| Per API key | Developer platforms and partner quotas | Keys can be leaked or shared |
| Per tenant / org | B2B plan enforcement | One tenant can contain many users with different roles |
| Per route / action | Protect expensive endpoints | Policies become complex if every route is unique |
A login endpoint might limit per IP, per username, and per device fingerprint. An API platform might limit per API key globally, per route for expensive exports, and per organization for monthly plan quotas.
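As a rough illustration of stacking limits on one request, the sketch below derives one counter key per dimension for a login attempt. The `Limit` type, the `rl:login:...` key layout, and every quota shown are hypothetical choices for the example, not values from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """One rate-limit rule: a storage key plus its quota."""
    key: str
    max_requests: int
    window_seconds: int


def login_limits(ip: str, username: str, device_id: str) -> list[Limit]:
    """Build the stacked limits for one login attempt (hypothetical quotas)."""
    # Normalize the username so "Alice" and "alice " share one counter.
    user = username.strip().lower()
    return [
        Limit(f"rl:login:ip:{ip}", max_requests=20, window_seconds=60),
        Limit(f"rl:login:user:{user}", max_requests=5, window_seconds=300),
        Limit(f"rl:login:device:{device_id}", max_requests=10, window_seconds=300),
    ]


def is_allowed(limits: list[Limit], counts: dict[str, int]) -> bool:
    """Allow the request only if every stacked limit still has room."""
    return all(counts.get(limit.key, 0) < limit.max_requests for limit in limits)
```

Because the request must pass every rule, an attacker who rotates IPs still hits the per-username limit, and a shared office IP does not lock out a user whose own counter is low.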
Algorithms: fixed, sliding, token, and leaky buckets
The algorithm decides how strict the limiter is about bursts and how much state it must store. There is no universal best choice; pick the simplest algorithm that matches the product and abuse pattern.
| Algorithm | How it works | Strength | Gotcha |
|---|---|---|---|
| Fixed window | Count requests in wall-clock buckets such as 10:00:00-10:00:59 | Very simple and cheap | Allows double bursts at boundaries |
| Sliding window log | Store timestamps for each request and count those within the last N seconds | Accurate rolling limit | Potentially high memory for hot keys |
| Sliding window counter | Blend previous and current fixed-window counts by elapsed time | Good approximation with low state | Approximate, not exact |
| Token bucket | Tokens refill steadily up to a capacity; each request spends tokens | Allows controlled bursts with a long-run rate | Needs careful atomic math across servers |
| Leaky bucket | Requests drain at a constant rate like a queue | Smooths downstream traffic | Can add latency or drop when queue is full |
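To make the boundary-burst gotcha concrete, the following sketch compares a fixed window with a sliding window counter just after a window rolls over. The function names, the 100-per-minute limit, and the scenario are assumptions chosen for illustration.

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            elapsed: float, window: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counts.

    elapsed is how far we are into the current fixed window, in seconds.
    """
    # The trailing window still overlaps this fraction of the previous window.
    prev_weight = (window - elapsed) / window
    return prev_count * prev_weight + curr_count


def allow_fixed(curr_count: int, limit: int) -> bool:
    """Fixed window: only the current bucket matters."""
    return curr_count < limit


def allow_sliding(prev_count: int, curr_count: int,
                  elapsed: float, window: float, limit: int) -> bool:
    """Sliding window counter: the previous bucket still counts, weighted by overlap."""
    return sliding_window_estimate(prev_count, curr_count, elapsed, window) < limit


def admitted_in_burst(burst: int, allow) -> int:
    """Count how many of `burst` back-to-back requests a check admits."""
    count = 0
    for _ in range(burst):
        if not allow(count):
            break
        count += 1
    return count


# Scenario: the caller spent all 100 requests late in the previous minute and
# sends another burst of 100 one second into the new minute.
LIMIT, WINDOW = 100, 60.0
fixed_admitted = admitted_in_burst(100, lambda curr: allow_fixed(curr, LIMIT))
sliding_admitted = admitted_in_burst(
    100, lambda curr: allow_sliding(100, curr, 1.0, WINDOW, LIMIT)
)
print(fixed_admitted, sliding_admitted)  # 100 2
```

In this scenario the fixed window admits the whole second burst, so roughly 200 requests pass within a couple of seconds, while the sliding counter admits only 2 because the previous window still carries most of its weight.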
Token bucket mechanics
Token bucket is popular because it separates burst size from average rate. A caller can save up tokens during idle time and spend them in a short burst, but over time tokens refill only at the configured rate.
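As a concrete illustration of the burst-then-refill behavior, here is a minimal in-memory sketch. The class name `TokenBucket`, the `try_acquire` method, and the injectable `clock` are assumptions for this example. It tracks a single key in a single process, so it is not a distributed limiter.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket for a single key (not shared across processes)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self._clock = clock
        self._tokens = capacity           # a new bucket starts full
        self._last_refill = clock()

    def try_acquire(self, cost: float = 1.0) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        elapsed = now - self._last_refill
        # Refill lazily on each call instead of running a background timer.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        self._last_refill = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True, 0.0
        # Time until enough tokens accumulate (assumes cost <= capacity).
        return False, (cost - self._tokens) / self.refill_rate
```

With `capacity=5` and `refill_rate=1.0`, five back-to-back calls succeed, the sixth is rejected with a one-second retry hint, and two seconds of idle time buy back two tokens.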

    # capacity = maximum burst, refill_rate = tokens per second
    now = current_time_seconds()
    bucket = load(key)  # { tokens, last_refill_at }
    elapsed = now - bucket.last_refill_at
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_rate)
    bucket.last_refill_at = now
    if bucket.tokens >= cost:
        bucket.tokens -= cost
        save(key, bucket)
        allow_request()
    else:
        retry_after = (cost - bucket.tokens) / refill_rate
        save(key, bucket)
        reject_429(retry_after)

GET /profile might cost one token, while POST /exports or an AI inference endpoint costs 50 tokens because it consumes more CPU, queue time, or money.
Distributed rate limiting with Redis
A limiter on one server is easy; a limiter across a fleet is harder. If you run ten API servers and each keeps its own in-memory counter, a caller whose requests are spread across all of them can receive up to ten times the intended quota. The decision must be made in a shared, atomic place or partitioned carefully by key.

    key = f"rl:{api_key}:{epoch_minute}"
    count = redis.incr(key)
    if count == 1:
        # first request in this window: set a TTL so old windows clean themselves up
        redis.expire(key, 60)
    if count > LIMIT:
        return 429
    return allow

Redis is common because commands such as INCR, EXPIRE, and sorted-set updates are fast and each executes atomically, and a Lua script runs as a single atomic unit. For token bucket or sliding-window logic, use a Lua script or Redis function so the read-modify-write happens as one operation.

    # unsafe if two servers do this concurrently:
    tokens = redis.get(key)
    if tokens > 0:
        redis.set(key, tokens - 1)

    # safe pattern (the 1 is numkeys, the number of key arguments that follow):
    EVAL token_bucket_script 1 key now capacity refill_rate cost
    # Redis executes the script atomically for that key

- Hot keys: a global limit can concentrate all traffic on one Redis key. Prefer per-caller keys or sharded counters when needed.
- Redis outage policy: decide whether to fail open, fail closed, or use a small local fallback for each route.
- Clock behavior: token calculations depend on time. Use server-side Redis time or consistent gateway clocks when possible.
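The atomic pattern and the clock advice above can be combined in one script. What follows is a sketch under stated assumptions, not a reference implementation: the hash fields `tokens` and `ts`, the millisecond units, and the `take_tokens` wrapper are all hypothetical. It reads the Redis server clock with `TIME` instead of passing `now` from the caller, which relies on a Redis version that lets a script call `TIME` before writing (effect-based script replication, which newer versions use by default). The wrapper accepts any client that exposes redis-py's `eval(script, numkeys, *keys_and_args)` call.

```python
# Lua source executed inside Redis; the whole script runs as one atomic step.
TOKEN_BUCKET_LUA = """
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local cost        = tonumber(ARGV[3])

-- Use Redis server time so every API server agrees on "now".
local t = redis.call('TIME')
local now_ms = tonumber(t[1]) * 1000 + math.floor(tonumber(t[2]) / 1000)

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity  -- a new key starts full
local ts = tonumber(state[2]) or now_ms

local elapsed = math.max(0, now_ms - ts) / 1000
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
local retry_after_ms = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  retry_after_ms = math.ceil((cost - tokens) / refill_rate * 1000)
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now_ms)
-- An idle bucket refills to capacity anyway, so let unused keys expire.
redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / refill_rate * 1000) * 2)
return {allowed, retry_after_ms}
"""


def take_tokens(client, key: str, capacity: int, refill_rate: float,
                cost: int = 1) -> tuple[bool, float]:
    """Spend `cost` tokens atomically; return (allowed, retry_after_seconds)."""
    # numkeys=1: exactly one key follows, then the script arguments.
    allowed, retry_after_ms = client.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refill_rate, cost
    )
    return bool(allowed), retry_after_ms / 1000.0
```

A production version would likely cache the script (for example with redis-py's `register_script`) instead of sending the source on every call, and would wrap the call in whatever fail-open or fail-closed policy the route needs.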
Response semantics: 429, Retry-After, and RateLimit headers
A good limiter does not just say no; it teaches clients how to behave. The standard rejection status is 429 Too Many Requests. Include retry information so SDKs, browsers, and partner integrations can back off instead of hammering the endpoint.

    HTTP/1.1 429 Too Many Requests
    Content-Type: application/json
    Retry-After: 17
    RateLimit-Limit: 100
    RateLimit-Remaining: 0
    RateLimit-Reset: 17

    {
      "error": "rate_limited",
      "message": "Too many requests. Retry after 17 seconds."
    }

- Retry-After: seconds or an HTTP date indicating when the client should retry.
- RateLimit-Limit: the quota being applied, such as 100 requests per minute.
- RateLimit-Remaining: how many requests remain in the current policy view.
- RateLimit-Reset: when capacity is expected to become available again.
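A rejection like the one above could be assembled by a small helper. This is a minimal sketch: the function name `build_429` and its return shape are assumptions, and a real service would adapt it to its web framework's response type.

```python
import json
import math


def build_429(limit: int, retry_after_seconds: float) -> tuple[int, dict[str, str], str]:
    """Return (status, headers, body) for a rate-limited response."""
    # Round up so clients never retry slightly too early.
    wait = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(wait),
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(wait),
    }
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Too many requests. Retry after {wait} seconds.",
    })
    return 429, headers, body
```

Calling `build_429(100, 16.2)` yields `Retry-After: 17`, matching the example response, because the fractional wait is rounded up.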
Where to enforce rate limits
Rate limits can be enforced at several layers. The right answer is often layered: coarse protection at the edge, product quotas at the gateway, and domain-specific limits near the expensive resource.
| Layer | Best at | Example |
|---|---|---|
| CDN / WAF edge | Absorbing obvious abuse before it reaches your network | Block IPs sending thousands of requests per second |
| API gateway | Consistent API-key, user, tenant, and route quotas | 1000 requests/min per partner key |
| Service | Domain-specific costs and permissions | Only 3 password reset emails per account per hour |
| Queue / worker | Smoothing expensive asynchronous work | Limit report generation concurrency per tenant |
| Database / third-party client | Protecting scarce downstream capacity | Cap SMS sends or payment-provider calls |
Edge enforcement is fast and cheap but may not know the authenticated user. Service enforcement knows the domain but happens after traffic has already consumed gateway and network capacity. Use both when the endpoint is important.
Edge cases and production gotchas
- Window boundary bursts: fixed windows can allow nearly double the intended rate around reset time.
- Identity evasion: attackers rotate IPs, accounts, or API keys. Combine dimensions and anomaly detection for sensitive routes.
- Legitimate shared IPs: schools, offices, and mobile carriers can make many real users appear as one IP.
- Retries amplify load: clients that ignore Retry-After can turn a small overload into a retry storm.
- Multi-region limits: globally strict limits require cross-region coordination, which adds latency; many systems accept approximate regional limits for availability.
- Observability: track allowed, rejected, shadow-rejected, and near-limit counts by route and caller type before tightening policies.
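On the client side, the retry-storm gotcha is usually handled by honoring Retry-After and adding jittered backoff. The sketch below shows one way to compute the wait. The function names and the defaults for `base` and `cap` are assumptions for illustration, and the jitter strategy shown (a random fraction of the exponential backoff) is only one common option.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Parse Retry-After as delta-seconds or an HTTP date; None if absent or invalid."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, cap: float = 60.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    backoff = min(cap, base * (2 ** attempt))
    # Randomize the wait so many clients do not retry in lockstep.
    jittered = backoff * rng()
    server_hint = parse_retry_after(retry_after)
    # Never retry earlier than the server asked, even if our backoff is shorter.
    return jittered if server_hint is None else max(server_hint, jittered)
```

When the server sends `Retry-After: 17`, the client waits at least 17 seconds no matter how small its own backoff is. Without the header, it falls back to jittered exponential backoff capped at `cap`.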
Key takeaways
- Rate limiting protects availability, security, fairness, and cost by controlling how much work each caller can create.
- Choose the right identity key: per IP, user, API key, tenant, route, or a layered combination depending on the endpoint.
- Fixed windows are simple, sliding logs are accurate, sliding counters approximate rolling limits, token buckets allow controlled bursts, and leaky buckets smooth output.
- Distributed limiters need shared atomic state, commonly Redis with INCR/EXPIRE, sorted sets, or Lua scripts for token bucket logic.
- Reject with HTTP 429 plus Retry-After and RateLimit headers, and enforce limits at the edge, gateway, service, or queue depending on what you are protecting.
A rate-limited response should return 429 with a clear error body and retry metadata, especially Retry-After. RateLimit headers can also tell the client the policy, remaining capacity, and reset time.