Three independent classifications run sequentially — q1 waits for the network, q2 waits, q3 waits. Total time = sum of latencies. In parallel — all three fire at once. Total time = max of latencies.
from concurrent.futures import ThreadPoolExecutor

from pydantic_ai import Agent  # assumes a pydantic-ai Agent; model is defined earlier

def call(prompt):
    return Agent(model).run_sync(prompt).output

prompts = ["Reply: alpha", "Reply: beta", "Reply: gamma"]
with ThreadPoolExecutor(max_workers=3) as pool:
    answers = list(pool.map(call, prompts))

3 calls, all in flight at once. The wall-clock time is roughly the slowest single call, not the sum.
Why threads, not asyncio?
Both work. ThreadPoolExecutor is simpler — you write blocking code (.run_sync) and it parallelizes for you. asyncio requires async def and await everywhere. For LLM clients (network-bound, I/O-bound), threads are fine and read like normal code.
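For comparison, a minimal asyncio sketch, assuming pydantic-ai's async Agent.run (the async counterpart of the run_sync used above) and the same model variable:

import asyncio

from pydantic_ai import Agent  # assumed: same Agent and model as the threaded version

async def call_async(prompt):
    result = await Agent(model).run(prompt)  # .run is the async counterpart of .run_sync
    return result.output

async def main():
    prompts = ["Reply: alpha", "Reply: beta", "Reply: gamma"]
    # gather fires all three coroutines concurrently, like the thread pool does
    return await asyncio.gather(*(call_async(p) for p in prompts))

answers = asyncio.run(main())

Same wall-clock win, but every function in the call chain has to become async.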
When NOT to parallelize?
When the calls depend on each other. If turn 2 needs turn 1's answer, parallel makes no sense. Multi-turn conversations are sequential. Independent classifications, eval batches, embedding batches — those parallelize cleanly.
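For contrast, here is what a dependent chain looks like (a sketch reusing the call helper defined above; the prompts are made up). Each step consumes the previous step's output, so no two calls can overlap:

# Sequential by necessity: step 2 cannot start until step 1 returns.
draft = call("Draft a one-line tagline for a coffee shop")
critique = call(f"Critique this tagline in one sentence: {draft}")
final = call(f"Rewrite the tagline to address the critique: {critique}")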
from concurrent.futures import ThreadPoolExecutor
import time

def call(prompt):
    return Agent(model).run_sync(prompt).output

prompts = ["q1", "q2", "q3"]
start = time.time()
with ThreadPoolExecutor(max_workers=3) as pool:
    answers = list(pool.map(call, prompts))
elapsed = time.time() - start
print(f"3 calls in {elapsed:.1f}s")  # roughly the slowest single call

LLM calls are I/O-bound — most time is spent waiting for the network. Threads release the GIL during I/O, so multiple threads make progress concurrently. CPU-bound code wouldn't benefit (the GIL lets only one thread execute Python bytecode at a time), but LLM calls aren't CPU-bound.
ThreadPoolExecutor(max_workers=N)
max_workers=N caps the number of concurrent in-flight calls. N parallel calls consume N quota slots, the same total as sequential: parallelism saves latency, not cost. (For lessons we keep N=2-3 to avoid burning quota on demonstration.)
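To see the cap in action, a small sketch (reusing the call helper from above; the prompt strings are made up). With more prompts than workers, pool.map queues the extras, so at most max_workers calls are in flight at once:

from concurrent.futures import ThreadPoolExecutor

# 6 prompts but max_workers=2: at most 2 calls in flight; the other 4 wait in the queue.
prompts = [f"Reply: item {i}" for i in range(6)]
with ThreadPoolExecutor(max_workers=2) as pool:
    answers = list(pool.map(call, prompts))
# Wall-clock time is roughly 3 "waves" (ceil(6 / 2)), each as slow as its slowest call.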
pool.map preserves input ordering — answers[i] corresponds to prompts[i]. If you use pool.submit + as_completed, results come back in completion order — fast calls first. Pick based on whether you want ordered results or first-result-first.
from concurrent.futures import as_completed

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(call, p) for p in prompts]
    for future in as_completed(futures):
        try:
            answer = future.result()
        except Exception as e:
            print(f"call failed: {e}")

One failed call doesn't crash the others. Each future carries its own exception state.