Rate Limiting

Protect your system from abuse and overload by capping how fast clients can call you.

Rate limiting controls how many requests a caller can make in a period of time. It protects shared resources from abuse, bugs, scrapers, brute-force attacks, and sudden traffic spikes while preserving fairness for well-behaved users.

🔭Think of it like…

A rate limiter is the bouncer and ticket dispenser at a busy venue. People can enter quickly when there is spare capacity, but the door never lets an unlimited crowd stampede inside. Some guests have VIP tickets, some have general admission, and everyone gets a clear answer about when to try again.

The problem: one caller can consume the shared system

The failure mode is not only malicious denial-of-service. A mobile app bug can retry in a tight loop, a partner integration can batch the wrong way, a crawler can walk every URL, or a login attacker can try millions of passwords. Without limits, one actor can exhaust CPU, database connections, queue capacity, third-party quotas, or your cloud budget.

Availability: protect backends so one noisy caller does not starve everyone else.
Security: slow brute-force logins, credential stuffing, scraping, and token guessing.
Fairness: enforce per-plan or per-tenant quotas so paid capacity is shared predictably.
Cost control: cap expensive operations such as AI calls, exports, SMS sends, and search aggregations.

The core idea

Identify a caller, choose a quota policy, atomically record consumption, and either allow the request or reject it with a clear retry contract.

What key are you limiting?

Rate limiting is only as good as the identity key behind it. Different endpoints need different dimensions, and production systems often stack several limits on the same request.

Dimension	Good for	Caution
Per IP	Anonymous traffic, login attempts, public pages	NATs and mobile carriers put many users behind one IP
Per user	Authenticated product limits and fairness	Attackers may create many accounts
Per API key	Developer platforms and partner quotas	Keys can be leaked or shared
Per tenant / org	B2B plan enforcement	One tenant can contain many users with different roles
Per route / action	Protect expensive endpoints	Policies become complex if every route is unique

A login endpoint might limit per IP, per username, and per device fingerprint. An API platform might limit per API key globally, per route for expensive exports, and per organization for monthly plan quotas.
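
Stacking can be sketched as checking every applicable key and allowing the request only when all of them have headroom. The block below is a minimal in-memory sketch, not a production design: `LimitRule`, `StackedLimiter`, and the example limits are hypothetical names and values chosen for illustration, and a real deployment would keep the counters in a shared store and expire old windows.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitRule:
    """One dimension of a stacked policy (hypothetical structure)."""
    name: str              # dimension label, e.g. "ip" or "username"
    limit: int             # maximum requests per window
    window_seconds: int


class StackedLimiter:
    """Fixed-window counters held in process memory, for illustration only."""

    def __init__(self, rules, clock=time.time):
        self.rules = rules
        self.clock = clock
        self.counts = {}   # (rule name, identity, window index) -> count; never pruned here

    def check(self, identities):
        """identities maps rule name -> caller value, e.g. {"ip": "203.0.113.7"}.

        Returns (allowed, name of the rule that blocked, or None).
        """
        now = self.clock()
        keys = []
        for rule in self.rules:
            window = int(now // rule.window_seconds)
            key = (rule.name, identities[rule.name], window)
            if self.counts.get(key, 0) >= rule.limit:
                return False, rule.name   # reject without consuming any quota
            keys.append(key)
        for key in keys:                  # every rule passed, so record the request on each
            self.counts[key] = self.counts.get(key, 0) + 1
        return True, None


login_rules = [LimitRule("ip", 20, 60), LimitRule("username", 5, 60)]
limiter = StackedLimiter(login_rules)
allowed, blocked_by = limiter.check({"ip": "203.0.113.7", "username": "alice"})
```

Checking all rules before incrementing any of them means a request rejected by one dimension does not eat quota on the others. That is a design choice; some systems prefer to count rejected attempts too, especially on login routes.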

Algorithms: fixed, sliding, token, and leaky buckets

The algorithm decides how strict the limiter is about bursts and how much state it must store. There is no universal best choice; pick the simplest algorithm that matches the product and abuse pattern.

Algorithm	How it works	Strength	Gotcha
Fixed window	Count requests in wall-clock buckets such as 10:00:00-10:00:59	Very simple and cheap	Allows double bursts at boundaries
Sliding window log	Store timestamps for each request and count those within the last N seconds	Accurate rolling limit	Potentially high memory for hot keys
Sliding window counter	Blend previous and current fixed-window counts by elapsed time	Good approximation with low state	Approximate, not exact
Token bucket	Tokens refill steadily up to a capacity; each request spends tokens	Allows controlled bursts with a long-run rate	Needs careful atomic math across servers
Leaky bucket	Requests drain at a constant rate like a queue	Smooths downstream traffic	Can add latency or drop when queue is full
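
To make the sliding window counter row concrete, here is a minimal single-key sketch. It estimates the rolling count as `previous_count * (1 - elapsed_fraction) + current_count`, where `elapsed_fraction` is how far the clock is into the current fixed window. The class name and the injectable `clock` parameter are assumptions made for this example.

```python
import time


class SlidingWindowCounter:
    """Approximate rolling limit for one caller using two fixed-window counts."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_index = None   # index of the fixed window being counted
        self.current_count = 0
        self.previous_count = 0

    def allow(self):
        now = self.clock()
        index = int(now // self.window)
        if self.current_index is None:
            self.current_index = index
        elif index != self.current_index:
            # Roll forward; the old count only matters if it is the adjacent window
            adjacent = index == self.current_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index

        elapsed_fraction = (now % self.window) / self.window
        estimated = self.previous_count * (1.0 - elapsed_fraction) + self.current_count
        if estimated + 1 > self.limit:
            return False
        self.current_count += 1
        return True
```

With a limit of 10 per 60 seconds, a caller who spends all 10 requests at second 59 is still rejected at second 61, because roughly 9.8 of those requests are still weighted into the estimate. A plain fixed window would have allowed a fresh 10.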

Token bucket mechanics

Token bucket is popular because it separates burst size from average rate. A caller can save up tokens during idle time and spend them in a short burst, but over time tokens refill only at the configured rate.

token bucket pseudocode

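# capacity = maximum burst, refill_rate = tokens per second, cost = tokens this request spends
now = current_time_seconds()
bucket = load(key)  # { tokens, last_refill_at }; a new key starts full: { capacity, now }

# Refill for the time that has passed; clamp at zero in case the clock stepped backward
elapsed = max(0, now - bucket.last_refill_at)
bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_rate)
bucket.last_refill_at = now

if bucket.tokens >= cost:
    bucket.tokens -= cost
    save(key, bucket)
    allow_request()
else:
    # Seconds until enough tokens have accumulated to cover this request
    retry_after = (cost - bucket.tokens) / refill_rate
    save(key, bucket)
    reject_429(retry_after)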
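
The same logic can be written as a small runnable class. This is a single-process sketch under stated assumptions: the name `TokenBucket`, the `try_consume` method, and the injectable `clock` are invented for the example, and the class is not thread-safe or shared across servers.

```python
import time


class TokenBucket:
    """In-memory token bucket for one caller (illustrative, not thread-safe)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # maximum burst
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)          # a new bucket starts full
        self.last_refill_at = clock()

    def try_consume(self, cost=1.0):
        """Return (allowed, retry_after_seconds).

        A cost larger than capacity can never be satisfied; callers should
        validate costs against capacity before relying on retry_after.
        """
        now = self.clock()
        elapsed = max(0.0, now - self.last_refill_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill_at = now

        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        return False, (cost - self.tokens) / self.refill_rate


bucket = TokenBucket(capacity=5, refill_rate=1)
allowed, retry_after = bucket.try_consume()
```

Using a monotonic clock by default avoids negative elapsed time when the wall clock is adjusted, which only works because this sketch lives in one process.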

Requests can have different costs

Not all requests are equal. A cheap GET /profile might cost one token, while POST /exports or an AI inference endpoint costs 50 tokens because it consumes more CPU, queue time, or money.
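
One simple way to express this is a per-route cost table that the limiter consults before spending tokens. The table below is a hypothetical sketch: the names `ROUTE_COSTS`, `cost_for`, and `burst_size` are made up, and the costs mirror the examples in the paragraph above.

```python
# Hypothetical cost table; routes and numbers are illustrative only.
ROUTE_COSTS = {
    ("GET", "/profile"): 1,
    ("POST", "/exports"): 50,
}
DEFAULT_COST = 1


def cost_for(method, path):
    """Tokens a request should spend; unknown routes fall back to DEFAULT_COST."""
    return ROUTE_COSTS.get((method.upper(), path), DEFAULT_COST)


def burst_size(capacity, method, path):
    """How many back-to-back calls a full bucket of `capacity` tokens can pay for."""
    return capacity // cost_for(method, path)
```

With a bucket capacity of 100, this allows a burst of 100 profile reads but only 2 exports, so one policy can cover both cheap and expensive routes.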

Distributed rate limiting with Redis

A limiter on one server is easy; a limiter across a fleet is harder. If you run ten API servers and each keeps its own in-memory counter, a caller whose requests are spread across the fleet can receive up to ten times the intended quota. The decision must be made in a shared, atomic place or partitioned carefully by key.
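
A short simulation shows the size of the problem. It assumes a round-robin load balancer, ten servers, and a quota of 100; the function names and constants are invented for this sketch.

```python
from itertools import cycle

LIMIT = 100    # intended quota per caller per window (example value)
SERVERS = 10   # size of the API fleet (example value)


def allowed_with_local_counters(total_requests):
    """Each server counts on its own; a round-robin balancer spreads the load."""
    local_counts = [0] * SERVERS
    allowed = 0
    for server, _ in zip(cycle(range(SERVERS)), range(total_requests)):
        if local_counts[server] < LIMIT:
            local_counts[server] += 1
            allowed += 1
    return allowed


def allowed_with_shared_counter(total_requests):
    """All servers consult one counter, standing in for a shared atomic store."""
    shared_count = 0
    allowed = 0
    for _ in range(total_requests):
        if shared_count < LIMIT:
            shared_count += 1
            allowed += 1
    return allowed
```

For 5000 requests from a single caller, the local-counter version admits 1000 of them while the shared-counter version admits 100.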

fixed-window Redis limiter with atomic increment

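import time
import redis

r = redis.Redis()  # assumes a reachable Redis; connection settings omitted
LIMIT = 100        # example quota: requests per minute

def is_allowed(api_key):
    epoch_minute = int(time.time() // 60)
    key = f"rl:{api_key}:{epoch_minute}"
    # Queue INCR and EXPIRE in one MULTI/EXEC so a crash between the two
    # commands cannot leave a counter without a TTL
    pipe = r.pipeline(transaction=True)
    pipe.incr(key)
    pipe.expire(key, 60)
    count, _ = pipe.execute()
    return count <= LIMIT  # False means respond with 429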

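Redis is a common choice because it runs each command, and each Lua script or Redis function, to completion before starting the next, so operations such as INCR, EXPIRE, and sorted-set updates are fast and atomic. Token bucket and sliding-window logic need several commands per decision, so wrap them in a Lua script or Redis function to make the read-modify-write a single atomic operation.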

why Lua/atomic scripts matter

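# unsafe if two servers do this concurrently:
# both can read the same value and both write tokens - 1, losing one decrement
tokens = int(redis.get(key) or 0)
if tokens > 0:
    redis.set(key, tokens - 1)

# safe pattern (the 1 is numkeys, the count of key names that follow):
EVAL token_bucket_script 1 key now capacity refill_rate cost
# Redis runs the whole script without interleaving other commands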
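
As a rough sketch of what such a script could look like, the block below stores the bucket in a Redis hash and wraps the call with redis-py's `register_script`. The hash field names (`tokens`, `ts`), the TTL heuristic, the class name `RedisTokenBucket`, and the argument order are all assumptions for this example, not a canonical implementation. The client is passed in, so any object exposing `register_script` (such as a `redis.Redis` instance) works.

```python
import time

# Lua source that Redis executes atomically. KEYS[1] is the bucket key.
TOKEN_BUCKET_LUA = """
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local cost        = tonumber(ARGV[3])
local now         = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity   -- missing key: start full
local ts     = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_rate)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
-- Idle buckets refill completely, so they can safely expire (heuristic TTL)
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_rate) * 2)
return {allowed, tostring(tokens)}
"""


class RedisTokenBucket:
    """Thin Python wrapper around the Lua script above (illustrative)."""

    def __init__(self, client, capacity, refill_rate):
        # register_script returns a callable that runs the script on the server
        self._script = client.register_script(TOKEN_BUCKET_LUA)
        self.capacity = capacity
        self.refill_rate = refill_rate

    def try_consume(self, key, cost=1, now=None):
        """Return (allowed, tokens_left) for the bucket stored at `key`."""
        now = time.time() if now is None else now
        allowed, tokens = self._script(
            keys=[key], args=[self.capacity, self.refill_rate, cost, now]
        )
        if isinstance(tokens, bytes):   # redis-py returns bytes unless decode_responses is set
            tokens = tokens.decode()
        return bool(allowed), float(tokens)
```

The remaining token count is returned as a string because Redis truncates Lua numbers to integers when converting script results. Passing `now` from the caller keeps the sketch simple but makes the outcome depend on each server's clock.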

Hot keys: a global limit can concentrate all traffic on one Redis key. Prefer per-caller keys or sharded counters when needed.
Redis outage policy: decide whether to fail open, fail closed, or use a small local fallback for each route.
Clock behavior: token calculations depend on time. Use server-side Redis time or consistent gateway clocks when possible.
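
The outage policy can be made explicit in code rather than left to whatever the client library does on error. The wrapper below is a minimal sketch: `check_with_outage_policy`, the policy constants, and the `store_errors` parameter are hypothetical, and the default exception types are Python built-ins. A real Redis client raises its own exception classes, which you would pass in instead.

```python
import logging

logger = logging.getLogger("ratelimit")

FAIL_OPEN = "fail_open"      # allow traffic when the store is down
FAIL_CLOSED = "fail_closed"  # reject traffic when the store is down


def check_with_outage_policy(check, key, policy,
                             store_errors=(ConnectionError, TimeoutError)):
    """Run check(key) -> bool and apply `policy` if the shared store is unreachable."""
    try:
        return check(key)
    except store_errors:
        logger.warning("rate limit store unavailable for %s; applying %s", key, policy)
        return policy == FAIL_OPEN
```

A public read endpoint might use `FAIL_OPEN` to stay available, while a login or SMS-sending route might use `FAIL_CLOSED` because unlimited attempts are the greater risk.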

Related building block

Redis is the usual shared counter store. Review the Redis pattern for cache, counter, and atomic-operation design details.

Response semantics: 429, Retry-After, and RateLimit headers

A good limiter does not just say no; it teaches clients how to behave. The standard rejection status is 429 Too Many Requests. Include retry information so SDKs, browsers, and partner integrations can back off instead of hammering the endpoint.

rate limit response example

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 17
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 17

{
  "error": "rate_limited",
  "message": "Too many requests. Retry after 17 seconds."
}

Retry-After: a number of seconds or an HTTP date indicating when the client should retry.
RateLimit-Limit: the quota being applied, such as 100 requests per minute.
RateLimit-Remaining: how many requests remain in the current policy view.
RateLimit-Reset: the number of seconds until capacity is expected to become available again (17 in the example above).
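
A small helper can keep the status, headers, and body consistent with each other. This is a framework-neutral sketch: `build_429` is a hypothetical function that returns a plain tuple, and the header names and body fields follow the response example above.

```python
import json
import math


def build_429(limit, retry_after_seconds):
    """Return (status, headers, body) for a rate-limited response."""
    # Round up so clients never retry before capacity is actually available
    seconds = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(seconds),
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(seconds),
    }
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Too many requests. Retry after {seconds} seconds.",
    })
    return 429, headers, body
```

Deriving `Retry-After`, `RateLimit-Reset`, and the message from one value avoids responses whose headers and body disagree.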

Do not leak sensitive policy details

Headers are useful for legitimate clients, but auth and abuse endpoints may intentionally return coarser information. For example, login limits should avoid revealing whether a username exists.

Where to enforce rate limits

Rate limits can be enforced at several layers. The right answer is often layered: coarse protection at the edge, product quotas at the gateway, and domain-specific limits near the expensive resource.

Layer	Best at	Example
CDN / WAF edge	Absorbing obvious abuse before it reaches your network	Block IPs sending thousands of requests per second
API gateway	Consistent API-key, user, tenant, and route quotas	1000 requests/min per partner key
Service	Domain-specific costs and permissions	Only 3 password reset emails per account per hour
Queue / worker	Smoothing expensive asynchronous work	Limit report generation concurrency per tenant
Database / third-party client	Protecting scarce downstream capacity	Cap SMS sends or payment-provider calls

Edge enforcement is fast and cheap but may not know the authenticated user. Service enforcement knows the domain but happens after traffic has already consumed gateway and network capacity. Use both when the endpoint is important.

Edge cases and production gotchas

Window boundary bursts: fixed windows can allow nearly double the intended rate around reset time.
Identity evasion: attackers rotate IPs, accounts, or API keys. Combine dimensions and anomaly detection for sensitive routes.
Legitimate shared IPs: schools, offices, and mobile carriers can make many real users appear as one IP.
Retries amplify load: clients that ignore Retry-After can turn a small overload into a retry storm.
Multi-region limits: globally strict limits require cross-region coordination, which adds latency; many systems accept approximate regional limits for availability.
Observability: track allowed, rejected, shadow-rejected, and near-limit counts by route and caller type before tightening policies.
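
On the client side, the retry-storm risk is usually handled with exponential backoff, jitter, and respect for the server's hint. The function below is one possible sketch; the name `next_delay` and the defaults for `base` and `cap` are assumptions, not values from any particular SDK.

```python
import random


def next_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Uses capped exponential backoff with full jitter. When the server sent
    Retry-After, the jitter is added on top so the client never retries early
    and clients that were rejected together do not all return together.
    """
    backoff = min(cap, base * (2 ** attempt))
    jittered = backoff * rng()   # uniform in [0, backoff)
    if retry_after is not None:
        return float(retry_after) + jittered
    return jittered
```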

Roll out in shadow mode first

Before rejecting real traffic, compute the limit and log what would have been blocked. Shadow mode reveals accidental customer impact and helps tune thresholds.
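
Shadow mode can be sketched as a wrapper around any limiter check that records the decision but only enforces it when a flag is set. `ShadowLimiter`, its `enforce` flag, and the `stats` counters are hypothetical names for this example.

```python
import logging

logger = logging.getLogger("ratelimit.shadow")


class ShadowLimiter:
    """Wraps check(key) -> bool; logs would-be rejections unless enforcing."""

    def __init__(self, check, enforce=False):
        self.check = check
        self.enforce = enforce
        self.stats = {"allowed": 0, "rejected": 0, "shadow_rejected": 0}

    def allow(self, key):
        if self.check(key):
            self.stats["allowed"] += 1
            return True
        if self.enforce:
            self.stats["rejected"] += 1
            return False
        # Over the limit, but shadow mode lets the request through
        self.stats["shadow_rejected"] += 1
        logger.info("shadow: would have rejected %s", key)
        return True
```

Flipping `enforce` to `True` per route, once the shadow-rejected counts look acceptable, turns the same policy into a real limit without changing the check itself.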

Key takeaways

Rate limiting protects availability, security, fairness, and cost by controlling how much work each caller can create.
Choose the right identity key: per IP, user, API key, tenant, route, or a layered combination depending on the endpoint.
Fixed windows are simple, sliding logs are accurate, sliding counters approximate rolling limits, token buckets allow controlled bursts, and leaky buckets smooth output.
Distributed limiters need shared atomic state, commonly Redis with INCR/EXPIRE, sorted sets, or Lua scripts for token bucket logic.
Reject with HTTP 429 plus Retry-After and RateLimit headers, and enforce limits at the edge, gateway, service, or queue depending on what you are protecting.
