
Rate Limiting

Protect your system from abuse and overload by capping how fast clients can call you.

Rate limiting controls how many requests a caller can make in a period of time. It protects shared resources from abuse, bugs, scrapers, brute-force attacks, and sudden traffic spikes while preserving fairness for well-behaved users.

🔭Think of it like…
A rate limiter is the bouncer and ticket dispenser at a busy venue. People can enter quickly when there is spare capacity, but the door never lets an unlimited crowd stampede inside. Some guests have VIP tickets, some have general admission, and everyone gets a clear answer about when to try again.

The problem: one caller can consume the shared system

The failure mode is not only malicious denial-of-service. A mobile app bug can retry in a tight loop, a partner integration can batch the wrong way, a crawler can walk every URL, or a login attacker can try millions of passwords. Without limits, one actor can exhaust CPU, database connections, queue capacity, third-party quotas, or your cloud budget.

  • Availability: protect backends so one noisy caller does not starve everyone else.
  • Security: slow brute-force logins, credential stuffing, scraping, and token guessing.
  • Fairness: enforce per-plan or per-tenant quotas so paid capacity is shared predictably.
  • Cost control: cap expensive operations such as AI calls, exports, SMS sends, and search aggregations.
The core idea
Identify a caller, choose a quota policy, atomically record consumption, and either allow the request or reject it with a clear retry contract.

What key are you limiting?

Rate limiting is only as good as the identity key behind it. Different endpoints need different dimensions, and production systems often stack several limits on the same request.

| Dimension | Good for | Caution |
| --- | --- | --- |
| Per IP | Anonymous traffic, login attempts, public pages | NATs and mobile carriers put many users behind one IP |
| Per user | Authenticated product limits and fairness | Attackers may create many accounts |
| Per API key | Developer platforms and partner quotas | Keys can be leaked or shared |
| Per tenant / org | B2B plan enforcement | One tenant can contain many users with different roles |
| Per route / action | Protect expensive endpoints | Policies become complex if every route is unique |

A login endpoint might limit per IP, per username, and per device fingerprint. An API platform might limit per API key globally, per route for expensive exports, and per organization for monthly plan quotas.
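The stacking described above can be sketched as a function that lists every limit applying to one request. This is a minimal illustration, not a prescribed design: the `Request` shape, the key prefixes, and all numeric limits are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical request shape; real frameworks expose these fields differently."""
    ip: str
    route: str
    username: Optional[str] = None
    api_key: Optional[str] = None
    org_id: Optional[str] = None


def limit_keys(req: Request) -> list[tuple[str, int, int]]:
    """Return (key, max_requests, window_seconds) for every limit that applies.

    The numbers are placeholders, not recommendations.
    """
    limits = [(f"ip:{req.ip}", 300, 60)]  # coarse per-IP ceiling for everyone
    if req.route == "/login" and req.username:
        # A per-username limit means rotating IPs does not bypass the cap.
        limits.append((f"login:{req.username.lower()}", 5, 300))
    if req.api_key:
        limits.append((f"key:{req.api_key}", 1000, 60))
        if req.route == "/exports":
            # Expensive route gets its own, much tighter, per-key limit.
            limits.append((f"key:{req.api_key}:exports", 10, 3600))
    if req.org_id:
        # Monthly plan quota for the whole organization.
        limits.append((f"org:{req.org_id}", 1_000_000, 30 * 24 * 3600))
    return limits
```

A request would proceed only if every returned key is still under its limit, so the tightest applicable policy wins.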

Algorithms: fixed, sliding, token, and leaky buckets

The algorithm decides how strict the limiter is about bursts and how much state it must store. There is no universal best choice; pick the simplest algorithm that matches the product and abuse pattern.

| Algorithm | How it works | Strength | Gotcha |
| --- | --- | --- | --- |
| Fixed window | Count requests in wall-clock buckets such as 10:00:00-10:00:59 | Very simple and cheap | Allows double bursts at boundaries |
| Sliding window log | Store timestamps for each request and count those within the last N seconds | Accurate rolling limit | Potentially high memory for hot keys |
| Sliding window counter | Blend previous and current fixed-window counts by elapsed time | Good approximation with low state | Approximate, not exact |
| Token bucket | Tokens refill steadily up to a capacity; each request spends tokens | Allows controlled bursts with a long-run rate | Needs careful atomic math across servers |
| Leaky bucket | Requests drain at a constant rate like a queue | Smooths downstream traffic | Can add latency or drop when queue is full |
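To make the sliding window counter row concrete, here is a minimal in-memory sketch. The class name and the explicit `now` parameter are assumptions chosen so the example is deterministic; a production version would keep the two counts in a shared store and evict old windows.

```python
class SlidingWindowCounter:
    """Approximate rolling limit using two fixed-window counts per key."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        # (key, window index) -> count; old windows are never evicted in this sketch
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, key: str, now: float) -> bool:
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self.counts.get((key, index), 0)
        previous = self.counts.get((key, index - 1), 0)
        # Weight the previous window by how much of it still overlaps
        # the rolling window that ends at `now`.
        estimated = previous * (1.0 - elapsed_fraction) + current
        if estimated + 1 > self.limit:
            return False
        self.counts[(key, index)] = current + 1
        return True
```

With a limit of 10 per 60 seconds, a caller who spends all 10 requests at second 59 is still rejected at second 61, because almost all of the previous window counts against them. A plain fixed window would have allowed 10 more requests at that point.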

Token bucket mechanics

Token bucket is popular because it separates burst size from average rate. A caller can save up tokens during idle time and spend them in a short burst, but over time tokens refill only at the configured rate.

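token bucket sketch (Python; an in-memory dict stands in for the shared store)
import time

CAPACITY = 10.0     # maximum burst in tokens (example value)
REFILL_RATE = 2.0   # tokens added per second (example value)
buckets = {}        # key -> (tokens, last_refill_at)

def check(key, cost=1.0, now=None):
    """Return (allowed, retry_after_seconds) for one request."""
    now = time.monotonic() if now is None else now
    # A caller seen for the first time starts with a full bucket.
    tokens, last_refill_at = buckets.get(key, (CAPACITY, now))

    # Refill for the time elapsed since the last check, capped at capacity.
    elapsed = max(0.0, now - last_refill_at)
    tokens = min(CAPACITY, tokens + elapsed * REFILL_RATE)

    if tokens >= cost:
        buckets[key] = (tokens - cost, now)
        return True, 0.0              # allow the request
    buckets[key] = (tokens, now)
    # Reject with 429; this is how long until enough tokens have refilled.
    return False, (cost - tokens) / REFILL_RATE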
Requests can have different costs
Not all requests are equal. A cheap GET /profile might cost one token, while POST /exports or an AI inference endpoint costs 50 tokens because it consumes more CPU, queue time, or money.
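One way to express weighted costs is a lookup table plus the arithmetic for how long a caller must wait before an expensive request becomes affordable. The table contents, function names, and default cost below are illustrative assumptions; only the 1-token and 50-token figures come from the example above.

```python
# Hypothetical price list in tokens, keyed by (method, path).
ROUTE_COSTS = {
    ("GET", "/profile"): 1,
    ("POST", "/exports"): 50,
}
DEFAULT_COST = 1


def request_cost(method: str, path: str) -> int:
    """Look up how many tokens a request spends; unknown routes cost the default."""
    return ROUTE_COSTS.get((method.upper(), path), DEFAULT_COST)


def seconds_until_affordable(tokens: float, cost: int,
                             refill_rate: float, capacity: float) -> float:
    """How long a caller holding `tokens` must wait before `cost` is available."""
    if cost > capacity:
        # The bucket can never hold enough tokens, so waiting will not help.
        raise ValueError("cost exceeds bucket capacity")
    return max(0.0, (cost - tokens) / refill_rate)
```

For example, with a capacity of 100 tokens refilling at 5 tokens per second, a caller holding 20 tokens must wait (50 - 20) / 5 = 6 seconds before POST /exports can succeed. Any route priced above the bucket capacity can never be served, which is worth validating when policies are configured.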

Distributed rate limiting with Redis

A limiter on one server is easy; a limiter across a fleet is harder. If you run ten API servers and each keeps its own in-memory counter, a user can receive ten times the intended quota. The decision must be made in a shared, atomic place or partitioned carefully by key.
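The multiplication effect is easy to see in a small simulation. The sketch below assumes round-robin load balancing within a single window; the function name and parameters are invented for the illustration.

```python
def count_allowed(total_requests: int, servers: int, limit: int, shared: bool) -> int:
    """Count how many of one caller's requests get through in a single window."""
    # One counter for the whole fleet, or one private counter per server.
    counters = [0] if shared else [0] * servers
    allowed = 0
    for i in range(total_requests):
        slot = 0 if shared else i % servers  # round-robin across servers
        if counters[slot] < limit:
            counters[slot] += 1
            allowed += 1
    return allowed
```

With a limit of 100 and ten servers, a caller sending 5000 requests gets 1000 through when counters are local and 100 when the counter is shared.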

fixed-window Redis limiter with atomic increment
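import time
import redis  # redis-py client

r = redis.Redis()
LIMIT = 100  # requests per minute (example value)

def allow_request(api_key):
    epoch_minute = int(time.time() // 60)
    key = f"rl:{api_key}:{epoch_minute}"
    count = r.incr(key)        # atomic; creates the key at 1 if it is missing
    if count == 1:
        # First hit in this window sets the TTL. INCR and EXPIRE are separate
        # commands, so a crash between them leaves a key that never expires.
        r.expire(key, 60)
    return count <= LIMIT      # False means respond with 429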

Redis is common because operations such as increment, expire, sorted-set updates, and Lua scripts are fast and atomic on a single key. For token bucket or sliding-window logic, use a Lua script or Redis function so read-modify-write happens as one operation.

why Lua/atomic scripts matter
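# unsafe if two servers do this concurrently: both can read the same value
tokens = int(redis.get(key) or 0)
if tokens > 0:
    redis.set(key, tokens - 1)

# safe pattern (EVAL takes the script, the number of keys, then keys and args):
EVAL <token_bucket_script> 1 key now capacity refill_rate cost
# Redis runs the whole script without interleaving other commands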
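
As a hedged illustration of what such a script might contain, the sketch below stores the bucket in a Redis hash and reads the clock with the server's TIME command. It assumes a redis-py style client passed in as `client` and Redis 5 or newer (so a script may call TIME before writing). The hash field names, the expiry rule, and `make_token_bucket` are choices made for this example, not a canonical implementation.

```python
TOKEN_BUCKET_LUA = """
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local cost        = tonumber(ARGV[3])

-- Use the Redis server clock so every API server sees the same time.
local t   = redis.call('TIME')
local now = tonumber(t[1]) + tonumber(t[2]) / 1000000

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity   -- new callers start full
local ts     = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_rate)
local allowed = 0
if tokens >= cost then
  tokens  = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Idle buckets expire once they would have refilled completely anyway.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_rate) + 1)
return {allowed, tostring(tokens)}
"""


def make_token_bucket(client, capacity: float, refill_rate: float):
    """Return check(key, cost) -> (allowed, tokens_left).

    `client` is expected to be a redis-py client such as redis.Redis();
    refill_rate must be positive.
    """
    script = client.register_script(TOKEN_BUCKET_LUA)

    def check(key: str, cost: float = 1.0):
        allowed, tokens_left = script(keys=[key], args=[capacity, refill_rate, cost])
        return bool(allowed), float(tokens_left)

    return check
```

The remaining token count is returned as a string because Redis truncates Lua numbers to integers in replies. Usage would look like `check = make_token_bucket(redis.Redis(), capacity=10, refill_rate=2)` followed by `allowed, left = check("rl:user:42")`.
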
  • Hot keys: a global limit can concentrate all traffic on one Redis key. Prefer per-caller keys or sharded counters when needed.
  • Redis outage policy: decide whether to fail open, fail closed, or use a small local fallback for each route.
  • Clock behavior: token calculations depend on time. Use server-side Redis time or consistent gateway clocks when possible.
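
The outage-policy bullet can be sketched as a thin wrapper around whatever function talks to the shared store. The policy table, the `allow_with_fallback` name, and the `check` callable are hypothetical. The example catches Python's built-in `ConnectionError` and `TimeoutError`; redis-py raises its own classes under `redis.exceptions`, which would need to be listed in `STORE_ERRORS` instead.

```python
import logging

logger = logging.getLogger("ratelimit")

# Hypothetical per-route policy for when the shared store is unreachable.
OUTAGE_POLICY = {"/login": "closed", "/search": "open"}
DEFAULT_POLICY = "open"

# Errors treated as "store unavailable"; swap in your client's exception types.
STORE_ERRORS = (ConnectionError, TimeoutError)


def allow_with_fallback(route: str, key: str, check) -> bool:
    """Run check(key) -> bool, applying the route's outage policy on store errors."""
    try:
        return check(key)
    except STORE_ERRORS:
        policy = OUTAGE_POLICY.get(route, DEFAULT_POLICY)
        logger.warning("limiter store unavailable route=%s policy=fail-%s", route, policy)
        # Fail open lets traffic through; fail closed rejects it.
        return policy == "open"
```

Failing closed on login keeps brute-force protection in place during an outage, while failing open on a read-heavy route trades protection for availability.
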
Related building block
Redis is the usual shared counter store. Review the Redis pattern for cache, counter, and atomic-operation design details.

Response semantics: 429, Retry-After, and RateLimit headers

A good limiter does not just say no; it teaches clients how to behave. The standard rejection status is 429 Too Many Requests. Include retry information so SDKs, browsers, and partner integrations can back off instead of hammering the endpoint.

rate limit response example
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 17
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 17

{
  "error": "rate_limited",
  "message": "Too many requests. Retry after 17 seconds."
}
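  • Retry-After: a delay in seconds, or an HTTP date, after which the client should retry.
  • RateLimit-Limit: the quota being applied, such as 100 requests per minute.
  • RateLimit-Remaining: how many requests remain in the current window.
  • RateLimit-Reset: seconds until capacity is expected to become available again (17 in the example, matching Retry-After).
  • Naming: the RateLimit-* fields come from an IETF draft rather than a finished standard, and many APIs emit X-RateLimit-* variants instead, so document whichever names you send.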
Do not leak sensitive policy details
Headers are useful for legitimate clients, but auth and abuse endpoints may intentionally return coarser information. For example, login limits should avoid revealing whether a username exists.
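
The response shape shown above can be assembled by a small helper. This is a framework-neutral sketch: the function name and the `expose_quota` flag (for endpoints that should return coarser information) are assumptions for illustration.

```python
import json
import math


def rate_limited_response(limit: int, retry_after_seconds: float,
                          expose_quota: bool = True):
    """Build (status, headers, body) for a rejected request."""
    # Round up to whole seconds and never advertise a zero-second wait.
    wait = max(1, math.ceil(retry_after_seconds))
    headers = {"Content-Type": "application/json", "Retry-After": str(wait)}
    if expose_quota:
        # Skipped for sensitive routes that should not reveal policy details.
        headers.update({
            "RateLimit-Limit": str(limit),
            "RateLimit-Remaining": "0",
            "RateLimit-Reset": str(wait),
        })
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Too many requests. Retry after {wait} seconds.",
    })
    return 429, headers, body
```

Calling `rate_limited_response(100, 16.2)` yields a 429 with `Retry-After: 17`, while `expose_quota=False` keeps only the retry delay.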

Where to enforce rate limits

Rate limits can be enforced at several layers. The right answer is often layered: coarse protection at the edge, product quotas at the gateway, and domain-specific limits near the expensive resource.

| Layer | Best at | Example |
| --- | --- | --- |
| CDN / WAF edge | Absorbing obvious abuse before it reaches your network | Block IPs sending thousands of requests per second |
| API gateway | Consistent API-key, user, tenant, and route quotas | 1000 requests/min per partner key |
| Service | Domain-specific costs and permissions | Only 3 password reset emails per account per hour |
| Queue / worker | Smoothing expensive asynchronous work | Limit report generation concurrency per tenant |
| Database / third-party client | Protecting scarce downstream capacity | Cap SMS sends or payment-provider calls |

Edge enforcement is fast and cheap but may not know the authenticated user. Service enforcement knows the domain but happens after traffic has already consumed gateway and network capacity. Use both when the endpoint is important.

Edge cases and production gotchas

  • Window boundary bursts: fixed windows can allow nearly double the intended rate around reset time.
  • Identity evasion: attackers rotate IPs, accounts, or API keys. Combine dimensions and anomaly detection for sensitive routes.
  • Legitimate shared IPs: schools, offices, and mobile carriers can make many real users appear as one IP.
  • Retries amplify load: clients that ignore Retry-After can turn a small overload into a retry storm.
  • Multi-region limits: globally strict limits require cross-region coordination, which adds latency; many systems accept approximate regional limits for availability.
  • Observability: track allowed, rejected, shadow-rejected, and near-limit counts by route and caller type before tightening policies.
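
On the client side, the retry-storm problem is usually addressed by honoring Retry-After and otherwise backing off with jitter. The helper below is one possible sketch. The function name, the default `base` and `cap` values, and the choice of full-jitter exponential backoff are assumptions, not something the limiter mandates.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(retry_after: Optional[str], attempt: int,
                base: float = 0.5, cap: float = 60.0,
                now: Optional[datetime] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            return float(value)  # delay-seconds form
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: exponential backoff with full jitter, so many
    # clients do not all retry at the same instant.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

The server's value is returned uncapped on purpose: retrying earlier than the server asked would defeat the point of the header.
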
Roll out in shadow mode first
Before rejecting real traffic, compute the limit and log what would have been blocked. Shadow mode reveals accidental customer impact and helps tune thresholds.
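
Shadow mode can be as small as a flag around the final decision. In the sketch below, `enforce`, the metric names, and the in-process `Counter` are illustrative stand-ins for whatever metrics system is already in place.

```python
import logging
from collections import Counter

logger = logging.getLogger("ratelimit.shadow")
metrics = Counter()  # stand-in for a real metrics client


def enforce(key: str, over_limit: bool, shadow: bool) -> bool:
    """Return True if the request should proceed."""
    if not over_limit:
        metrics["allowed"] += 1
        return True
    if shadow:
        # Record what would have been blocked, but let the request through.
        metrics["shadow_rejected"] += 1
        logger.info("shadow-rejected key=%s", key)
        return True
    metrics["rejected"] += 1
    return False
```

Comparing the `shadow_rejected` count by caller against known good customers is what reveals whether a threshold is too tight before it starts returning 429s.
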
Key takeaways
  • Rate limiting protects availability, security, fairness, and cost by controlling how much work each caller can create.
  • Choose the right identity key: per IP, user, API key, tenant, route, or a layered combination depending on the endpoint.
  • Fixed windows are simple, sliding logs are accurate, sliding counters approximate rolling limits, token buckets allow controlled bursts, and leaky buckets smooth output.
  • Distributed limiters need shared atomic state, commonly Redis with INCR/EXPIRE, sorted sets, or Lua scripts for token bucket logic.
  • Reject with HTTP 429 plus Retry-After and RateLimit headers, and enforce limits at the edge, gateway, service, or queue depending on what you are protecting.