Caching & Performance
Implement response caching, TTL expiration, and async patterns for faster APIs.
LLM calls are slow and expensive. If two users ask the same question, do I really need to call the model twice?
No. That's what caching solves. Store the result of expensive operations and serve the cached version for identical requests:
```python
from datetime import datetime, timedelta, timezone

cache = {}

def get_cached(key: str):
    if key in cache:
        entry = cache[key]
        if datetime.now(timezone.utc) < entry["expires_at"]:
            return entry["value"]
        del cache[key]  # Expired
    return None

def set_cached(key: str, value, ttl_seconds: int = 300):
    cache[key] = {
        "value": value,
        "expires_at": datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds),
    }
```
TTL (Time To Live) means cached values expire after a set time. A 5-minute TTL means identical requests within 5 minutes hit the cache instead of the LLM.
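A quick sketch of that expiry behavior, using a deliberately short 1-second TTL so the expiry is observable (the helpers are compact versions of the ones above):

```python
import time
from datetime import datetime, timedelta, timezone

cache = {}

def set_cached(key: str, value, ttl_seconds: int = 300):
    # Store the value alongside its absolute expiry time
    cache[key] = {
        "value": value,
        "expires_at": datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds),
    }

def get_cached(key: str):
    # Return the value only while it is still fresh; evict it once expired
    if key in cache:
        entry = cache[key]
        if datetime.now(timezone.utc) < entry["expires_at"]:
            return entry["value"]
        del cache[key]
    return None

set_cached("greeting", "hello", ttl_seconds=1)
print(get_cached("greeting"))  # fresh: hello
time.sleep(1.1)
print(get_cached("greeting"))  # expired: None
```

The second lookup comes back `None` because the entry outlived its TTL and was evicted on read.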
How do I generate the cache key?
Hash the inputs that determine the output. For an AI endpoint, that's usually the prompt and model parameters:
```python
import hashlib
import json

def make_cache_key(prompt: str, model: str, temperature: float) -> str:
    # sort_keys keeps the serialization stable, so only the values affect the hash
    data = json.dumps(
        {"prompt": prompt, "model": model, "temp": temperature}, sort_keys=True
    )
    return hashlib.sha256(data.encode()).hexdigest()[:16]
```
Same inputs always produce the same hash. Different inputs produce different hashes. The key should capture everything that affects the output.
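A small check of that property (the model name here is just a placeholder):

```python
import hashlib
import json

def make_cache_key(prompt: str, model: str, temperature: float) -> str:
    # Stable serialization, then a short hex digest as the cache key
    data = json.dumps(
        {"prompt": prompt, "model": model, "temp": temperature}, sort_keys=True
    )
    return hashlib.sha256(data.encode()).hexdigest()[:16]

k1 = make_cache_key("What is caching?", "demo-model", 0.0)
k2 = make_cache_key("What is caching?", "demo-model", 0.0)
k3 = make_cache_key("What is caching?", "demo-model", 0.7)

print(k1 == k2)  # True: identical inputs produce the identical key
print(k1 == k3)  # False: temperature changed, so the key changed
```

Forgetting to include a parameter in the key is a classic bug: two requests that should produce different outputs would silently share one cache entry.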
What about caching at the HTTP level?
FastAPI supports cache headers that tell clients and CDNs to cache responses:
```python
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/data")
def get_data(response: Response):
    response.headers["Cache-Control"] = "public, max-age=60"
    return {"data": "cacheable"}
```
But for AI responses, server-side caching gives you more control — you can cache by semantic similarity, invalidate on updates, and share the cache across all users.
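Invalidation is one place that control shows up. As a minimal sketch, assuming cache keys carry a resource prefix (the `doc:` scheme here is invented for illustration), updating a document can evict everything derived from it:

```python
# Toy cache keyed by an illustrative "doc:<id>:<task>" scheme
cache = {
    "doc:1:summary": {"value": "..."},
    "doc:1:qa": {"value": "..."},
    "doc:2:summary": {"value": "..."},
}

def invalidate(prefix: str) -> int:
    # Drop every entry whose key starts with the prefix; return how many were removed
    stale = [key for key in cache if key.startswith(prefix)]
    for key in stale:
        del cache[key]
    return len(stale)

print(invalidate("doc:1:"))  # 2
print(sorted(cache))         # ['doc:2:summary']
```

An HTTP `max-age` can't do this: once a response is cached downstream, you can't reach out and expire it early.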
What about async? Does that help performance?
Async lets your server handle other requests while one is waiting on slow I/O. In a synchronous handler, the worker making an LLM call sits idle for the entire round trip; with async, that same worker picks up other requests during the wait:
```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Simulate slow LLM call
    await asyncio.sleep(2)
    return f"Response to: {prompt}"

@app.get("/ask")
async def ask(q: str):
    # Check cache first
    cached = get_cached(q)
    if cached:
        return {"answer": cached, "source": "cache"}
    # Cache miss: call LLM
    answer = await call_llm(q)
    set_cached(q, answer)
    return {"answer": answer, "source": "llm"}
```
The combination of caching + async is powerful: cached responses return instantly, and uncached requests don't block each other.
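To see the overlap concretely, here is a small sketch that fires several simulated LLM calls concurrently with `asyncio.gather` (the 0.2-second delay is a stand-in for a real call):

```python
import asyncio
import time

async def call_llm(prompt: str) -> str:
    # Stand-in for a slow network call (shortened from 2 s to keep the demo fast)
    await asyncio.sleep(0.2)
    return f"Response to: {prompt}"

async def main():
    start = time.perf_counter()
    # Five calls run concurrently, so their waits overlap instead of adding up
    answers = await asyncio.gather(*(call_llm(f"q{i}") for i in range(5)))
    elapsed = time.perf_counter() - start
    print(len(answers), f"{elapsed:.1f}s")  # five answers in roughly 0.2 s, not 1.0 s
    return answers, elapsed

asyncio.run(main())
```

Run sequentially, the same five calls would take about a second; concurrently, the total is close to the duration of a single call.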