Day 26 · ~12m

Caching & Performance

Implement response caching, TTL expiration, and async patterns for faster APIs.

🧑‍💻

LLM calls are slow and expensive. If two users ask the same question, do I really need to call the model twice?

👩‍🏫

No. That's what caching solves. Store the result of expensive operations and serve the cached version for identical requests:

from datetime import datetime, timedelta, timezone

cache = {}

def get_cached(key: str):
    if key in cache:
        entry = cache[key]
        if datetime.now(timezone.utc) < entry["expires_at"]:
            return entry["value"]
        del cache[key]  # Expired — drop the stale entry
    return None

def set_cached(key: str, value, ttl_seconds: int = 300):
    cache[key] = {
        "value": value,
        "expires_at": datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)
    }

TTL (Time To Live) means cached values expire after a set time. A 5-minute TTL means identical requests within 5 minutes hit the cache instead of the LLM.
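Here is a minimal standalone sketch of those helpers in action, with a deliberately short TTL so the expiry is observable (the functions are repeated so the snippet runs on its own):

```python
import time
from datetime import datetime, timedelta, timezone

cache = {}

def get_cached(key: str):
    # Serve the value only while it is still fresh
    if key in cache:
        entry = cache[key]
        if datetime.now(timezone.utc) < entry["expires_at"]:
            return entry["value"]
        del cache[key]  # Expired
    return None

def set_cached(key: str, value, ttl_seconds: int = 300):
    cache[key] = {
        "value": value,
        "expires_at": datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds),
    }

set_cached("greeting", "hello", ttl_seconds=1)
print(get_cached("greeting"))  # hello  (fresh hit)
time.sleep(1.1)
print(get_cached("greeting"))  # None   (expired and evicted)
```

Note that expired entries are only evicted when they are next looked up; a production cache would also sweep stale entries in the background.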

🧑‍💻

How do I generate the cache key?

👩‍🏫

Hash the inputs that determine the output. For an AI endpoint, that's usually the prompt and model parameters:

import hashlib
import json

def make_cache_key(prompt: str, model: str, temperature: float) -> str:
    data = json.dumps({"prompt": prompt, "model": model, "temp": temperature}, sort_keys=True)
    return hashlib.sha256(data.encode()).hexdigest()[:16]

Same inputs always produce the same hash. Different inputs produce different hashes. The key should capture everything that affects the output.
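A quick check of those properties (the model name below is just an example value):

```python
import hashlib
import json

def make_cache_key(prompt: str, model: str, temperature: float) -> str:
    # sort_keys makes the serialized form stable regardless of dict ordering
    data = json.dumps({"prompt": prompt, "model": model, "temp": temperature}, sort_keys=True)
    return hashlib.sha256(data.encode()).hexdigest()[:16]

k1 = make_cache_key("What is caching?", "gpt-4o", 0.0)
k2 = make_cache_key("What is caching?", "gpt-4o", 0.0)
k3 = make_cache_key("What is caching?", "gpt-4o", 0.7)
print(k1 == k2)  # True  — identical inputs, identical key
print(k1 == k3)  # False — temperature changed, so the key changed
```

Truncating the hex digest to 16 characters keeps keys short; collisions are still astronomically unlikely at that length for cache-sized workloads.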

🧑‍💻

What about caching at the HTTP level?

👩‍🏫

Cache-Control is a standard HTTP header that tells clients and CDNs how long they may cache a response. In FastAPI you set it on the Response object:

from fastapi import Response

@app.get("/data")
def get_data(response: Response):
    response.headers["Cache-Control"] = "public, max-age=60"
    return {"data": "cacheable"}

But for AI responses, server-side caching gives you more control — you can cache by semantic similarity, invalidate on updates, and share the cache across all users.
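As one illustration of that control, server-side invalidation can be as simple as dropping entries by key prefix. This is a sketch with hypothetical key names, not a prescribed scheme:

```python
cache = {}

def set_cached(key: str, value):
    cache[key] = value

def invalidate(prefix: str):
    # Drop every cached entry whose key starts with the prefix —
    # e.g. clear all answers derived from a document that just changed
    for key in [k for k in cache if k.startswith(prefix)]:
        del cache[key]

set_cached("doc42:summary", "...")
set_cached("doc42:qa", "...")
set_cached("doc7:summary", "...")
invalidate("doc42:")
print(sorted(cache))  # ['doc7:summary']
```

A client-side Cache-Control header, by contrast, cannot be revoked once sent: the client keeps serving its stale copy until max-age runs out.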

🧑‍💻

What about async? Does that help performance?

👩‍🏫

Async lets your server handle multiple requests while waiting for slow I/O. Without async, one LLM call blocks the entire server. With async, other requests proceed while you wait:

import asyncio

async def call_llm(prompt: str) -> str:
    # Simulate slow LLM call
    await asyncio.sleep(2)
    return f"Response to: {prompt}"

@app.get("/ask")
async def ask(q: str):
    # Check cache first
    cached = get_cached(q)
    if cached:
        return {"answer": cached, "source": "cache"}
    # Cache miss — call LLM
    answer = await call_llm(q)
    set_cached(q, answer)
    return {"answer": answer, "source": "llm"}

The combination of caching + async is powerful: cached responses return instantly, and uncached requests don't block each other.
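To see the async half of that in isolation, run several simulated LLM calls concurrently: total latency stays near one call's duration rather than the sum, because the waits overlap:

```python
import asyncio
import time

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(1)  # Simulate a slow LLM call
    return f"Response to: {prompt}"

async def main():
    start = time.perf_counter()
    # Three "requests" awaited concurrently — they overlap instead of queueing
    answers = await asyncio.gather(*(call_llm(f"q{i}") for i in range(3)))
    elapsed = time.perf_counter() - start
    print(f"{len(answers)} answers in {elapsed:.1f}s")  # ~1s, not ~3s

asyncio.run(main())
```

This is the same mechanism FastAPI uses under the hood: while one async handler awaits I/O, the event loop serves other requests.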
