Rate Limiting & Quotas
Implement per-user rate limits and token budgets to control API usage.
If my AI API costs money per call, how do I prevent one user from burning through my entire budget?
Rate limiting. You track how many requests each user makes in a time window and reject requests that exceed the limit:
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = timedelta(seconds=window_seconds)
        self.requests = {}  # user -> list of timestamps

    def is_allowed(self, user: str) -> bool:
        now = datetime.utcnow()
        cutoff = now - self.window
        # Drop timestamps that have fallen out of the window
        if user in self.requests:
            self.requests[user] = [
                t for t in self.requests[user] if t > cutoff
            ]
        else:
            self.requests[user] = []
        # Reject if the user has already hit the limit in this window
        if len(self.requests[user]) >= self.max_requests:
            return False
        self.requests[user].append(now)
        return True
This is a sliding window rate limiter: it keeps timestamps of recent requests and counts how many fall within the window. One caveat: it stores state in process memory, so counts reset on restart and aren't shared across workers. Production deployments usually back the counters with a shared store like Redis.
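To see the window behave, here's a quick standalone check. The class is repeated so the snippet runs on its own, and the limit of 3 requests is arbitrary:

```python
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = timedelta(seconds=window_seconds)
        self.requests = {}  # user -> list of timestamps

    def is_allowed(self, user: str) -> bool:
        now = datetime.utcnow()
        cutoff = now - self.window
        # Keep only timestamps still inside the window
        self.requests[user] = [
            t for t in self.requests.get(user, []) if t > cutoff
        ]
        if len(self.requests[user]) >= self.max_requests:
            return False
        self.requests[user].append(now)
        return True

limiter = RateLimiter(max_requests=3, window_seconds=60)

# First three requests pass, the fourth is rejected
results = [limiter.is_allowed("alice") for _ in range(4)]
print(results)  # [True, True, True, False]

# A different user has an independent counter
print(limiter.is_allowed("bob"))  # True
```

Because the window slides, "alice" gets capacity back one request at a time as old timestamps age out, rather than all at once at a fixed boundary.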
How does this integrate with FastAPI?
As a dependency, naturally:
from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()
limiter = RateLimiter(max_requests=10, window_seconds=60)

def check_rate_limit(user: str = Depends(require_auth)):
    if not limiter.is_allowed(user):
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded. Try again later."
        )
    return user

@app.post("/chat")
def chat(user: str = Depends(check_rate_limit)):
    return {"message": "Response here"}
HTTP 429 means "Too Many Requests" — the standard status for rate limiting.
What about token budgets? LLM costs depend on how many tokens you process, not just how many requests.
Track tokens per user in addition to request counts:
class TokenBudget:
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.usage = {}  # user -> {"date": str, "tokens": int}

    def check_and_deduct(self, user: str, tokens: int) -> bool:
        today = datetime.utcnow().strftime("%Y-%m-%d")
        # Reset the counter when the UTC day rolls over
        if user not in self.usage or self.usage[user]["date"] != today:
            self.usage[user] = {"date": today, "tokens": 0}
        if self.usage[user]["tokens"] + tokens > self.daily_limit:
            return False
        self.usage[user]["tokens"] += tokens
        return True
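A quick standalone check of the budget logic (class repeated so it runs on its own; the 1,000-token limit is arbitrary):

```python
from datetime import datetime

class TokenBudget:
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.usage = {}  # user -> {"date": str, "tokens": int}

    def check_and_deduct(self, user: str, tokens: int) -> bool:
        today = datetime.utcnow().strftime("%Y-%m-%d")
        # Reset the counter when the UTC day rolls over
        if user not in self.usage or self.usage[user]["date"] != today:
            self.usage[user] = {"date": today, "tokens": 0}
        if self.usage[user]["tokens"] + tokens > self.daily_limit:
            return False
        self.usage[user]["tokens"] += tokens
        return True

budget = TokenBudget(daily_limit=1000)

print(budget.check_and_deduct("alice", 600))  # True  (600/1000 used)
print(budget.check_and_deduct("alice", 600))  # False (would exceed 1000)
print(budget.check_and_deduct("alice", 400))  # True  (exactly 1000)
```

In a real endpoint you don't know the exact token count until the LLM responds, so a common pattern is to check against an estimate before the call, then reconcile with the provider's reported usage afterward.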
Should I rate limit by IP address or by user identity?
Both. IP-based limits protect against unauthenticated abuse — someone hammering your login endpoint. User-based limits protect against authenticated abuse — one user consuming all your API budget.
Layer them:
- IP rate limit on all endpoints (100 requests/minute)
- User rate limit on authenticated endpoints (20 requests/minute)
- Token budget on AI endpoints (10,000 tokens/day)
Each layer catches different kinds of abuse. The tighter limits go on the most expensive operations.
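The layering above can be sketched as a single admission check, ordered from cheapest to most expensive. This is a minimal illustration, not a full implementation: `make_limiter` and `admit` are hypothetical names, the limits are the example numbers from the list, and the daily reset for the token counter is omitted for brevity:

```python
from datetime import datetime, timedelta

# Minimal sliding-window limiter (same idea as RateLimiter above),
# returned as a closure so we can key one by IP and one by user
def make_limiter(max_requests: int, window_seconds: int):
    window = timedelta(seconds=window_seconds)
    seen = {}  # key -> list of timestamps
    def is_allowed(key: str) -> bool:
        now = datetime.utcnow()
        seen[key] = [t for t in seen.get(key, []) if t > now - window]
        if len(seen[key]) >= max_requests:
            return False
        seen[key].append(now)
        return True
    return is_allowed

ip_allowed = make_limiter(100, 60)   # 100 requests/minute per IP
user_allowed = make_limiter(20, 60)  # 20 requests/minute per user
token_used = {}                      # user -> tokens spent (daily reset omitted)
DAILY_TOKENS = 10_000

def admit(ip: str, user: str, est_tokens: int) -> bool:
    # Check the broadest, cheapest layer first
    if not ip_allowed(ip):
        return False
    if not user_allowed(user):
        return False
    if token_used.get(user, 0) + est_tokens > DAILY_TOKENS:
        return False
    token_used[user] = token_used.get(user, 0) + est_tokens
    return True

print(admit("203.0.113.5", "alice", 500))   # True
print(admit("203.0.113.5", "alice", 9800))  # False (token budget exhausted)
```

Ordering the checks this way means an abusive IP is rejected before you spend any per-user or per-token bookkeeping on it.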