Streaming Responses
Implement async generators and Server-Sent Events for real-time AI output.
When I call an LLM, it takes seconds to generate the full response. Can I show output as it arrives instead of waiting?
That's streaming. Instead of waiting for the complete response, you send chunks as they're generated. The user sees text appearing word by word — much better UX.
The pattern uses async generators in Python:
async def generate_words(text: str):
    words = text.split()
    for word in words:
        yield word + " "
yield sends one piece at a time. The caller gets each word as it's produced, without waiting for all words.
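To see that in action, you can consume the generator with async for. A minimal sketch (the consume helper is just for illustration):

```python
import asyncio

async def generate_words(text: str):
    words = text.split()
    for word in words:
        yield word + " "

async def consume(text: str) -> list[str]:
    # Each word arrives as soon as it is yielded,
    # not after the whole text has been processed.
    received = []
    async for word in generate_words(text):
        received.append(word)
    return received

chunks = asyncio.run(consume("hello streaming world"))
print(chunks)  # ['hello ', 'streaming ', 'world ']
```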
How does that connect to a FastAPI endpoint?
FastAPI's StreamingResponse wraps your generator:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_generator(prompt: str):
    # Simulate LLM tokens
    words = ["The", " answer", " is", " 42", "."]
    for word in words:
        yield f"data: {word}\n\n"

@app.get("/stream")
async def stream_response():
    return StreamingResponse(
        token_generator("test"),
        media_type="text/event-stream",
    )
The text/event-stream media type tells the browser this is a Server-Sent Events (SSE) stream. Each chunk follows the format data: <content>\n\n.
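Even without starting a server, you can drive token_generator directly to see exactly what the endpoint will stream. A sketch (the collect helper is illustrative, not part of FastAPI):

```python
import asyncio

async def token_generator(prompt: str):
    # Same simulated tokens as the endpoint above
    words = ["The", " answer", " is", " 42", "."]
    for word in words:
        yield f"data: {word}\n\n"

async def collect(prompt: str) -> list[str]:
    # Gather every SSE event the generator would send
    return [event async for event in token_generator(prompt)]

events = asyncio.run(collect("test"))
print(events[0])  # 'data: The\n\n' — one SSE event per token
```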
What's the data: prefix for?
It's the SSE protocol. The browser's EventSource API and most HTTP clients understand this format:
async def sse_generator(prompt: str):
    chunks = ["Hello", ", ", "world", "!"]
    for chunk in chunks:
        yield f"data: {chunk}\n\n"
    yield "data: [DONE]\n\n"  # Signal end of stream
The client reads events one at a time. When it sees [DONE], it knows the stream is complete. This is the same pattern that OpenAI's API uses.
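On the client side, parsing boils down to reading lines, stripping the data: prefix, and stopping at [DONE]. A minimal, framework-free sketch (parse_sse_lines is a hypothetical helper, not a library function):

```python
def parse_sse_lines(lines):
    """Collect payloads from 'data: ...' lines, stopping at [DONE]."""
    payloads = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators, comments, keep-alives
        content = line[len("data: "):]
        if content == "[DONE]":
            break  # stream is complete
        payloads.append(content)
    return payloads

# Lines as they might arrive from the sse_generator above
raw = ["data: Hello", "", "data: , ", "", "data: world", "",
       "data: !", "", "data: [DONE]", ""]
print("".join(parse_sse_lines(raw)))  # Hello, world!
```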
How do I accumulate the full response while streaming?
Collect chunks on both sides. The server can accumulate too:
async def stream_and_collect(prompt: str):
    full_response = []
    words = ["Streaming", " is", " powerful"]
    for word in words:
        full_response.append(word)
        yield f"data: {word}\n\n"
    # After streaming, full_response has everything
    complete = "".join(full_response)
    yield "data: [DONE]\n\n"
The client concatenates chunks as they arrive. By the end of the stream, both sides have the complete response. This lets you log the full output, count tokens, or store the result — all while the user watches it appear in real time.
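Putting both halves together: a sketch in which a simulated server stream is consumed and reassembled on the client (fake_sse_stream and accumulate are illustrative stand-ins, not a real LLM call):

```python
import asyncio

async def fake_sse_stream():
    # Simulated server: emits SSE events, then the [DONE] sentinel
    for chunk in ["Streaming", " is", " powerful"]:
        yield f"data: {chunk}\n\n"
    yield "data: [DONE]\n\n"

async def accumulate() -> str:
    # Client side: strip the SSE framing and concatenate payloads
    parts = []
    async for event in fake_sse_stream():
        content = event[len("data: "):].rstrip("\n")
        if content == "[DONE]":
            break
        parts.append(content)
    return "".join(parts)

result = asyncio.run(accumulate())
print(result)  # Streaming is powerful
```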