You can't optimize what you can't see. Cost observability is the simplest production discipline: log every LLM call with its prompt version, latency, and a size-based cost proxy, then aggregate.
```python
import time

from pydantic_ai import Agent  # assumes pydantic_ai is installed

USAGE_LOG = []  # one record per LLM call

def tracked_call(version, prompt):
    start = time.time()
    # `model` is whatever model identifier you configured elsewhere
    result = Agent(model).run_sync(prompt)
    elapsed = time.time() - start
    record = {
        "version": version,
        "prompt_chars": len(prompt),
        "answer_chars": len(result.output),
        "elapsed_s": round(elapsed, 2),
    }
    USAGE_LOG.append(record)
    return result.output
```

Every call appends a row. At the end of the run, dump the log to a Sheet for eyeballing, or aggregate in code (sum, mean) if you just want the totals.
Token count?
pydantic_ai's `result.usage()` exposes input/output token counts when the provider returns them. For a portable observability pattern, character count is a good-enough proxy that works on any model: roughly 4 characters per token for English text, rough but useful.
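Where the provider does report usage, you can record real token counts instead. A minimal sketch; the attribute names on the usage object vary across pydantic_ai versions, so the `getattr` fallbacks below are defensive assumptions, not a documented API:

```python
def token_counts(result):
    """Best-effort token counts from a pydantic_ai result; (None, None) if absent."""
    usage = result.usage()
    # Attribute names differ across pydantic_ai versions; probe defensively.
    inp = getattr(usage, "input_tokens", None) or getattr(usage, "request_tokens", None)
    out = getattr(usage, "output_tokens", None) or getattr(usage, "response_tokens", None)
    return inp, out
```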
And the Sheet?
One row per call. Columns: timestamp, version, prompt_chars, answer_chars, elapsed_s. After a batch of calls, the Sheet is your cost dashboard; no separate observability service needed. We use Tasks today (auto-provisioned) for the same effect.
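No Sheet handy? A plain CSV gives the same one-row-per-call dashboard. A sketch using only the standard library (the file name is arbitrary):

```python
import csv

def dump_log(path="usage_log.csv"):
    """Write USAGE_LOG to a CSV that opens in any spreadsheet."""
    if not USAGE_LOG:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(USAGE_LOG[0]))
        writer.writeheader()
        writer.writerows(USAGE_LOG)
```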
```python
import time

from pydantic_ai import Agent  # assumes pydantic_ai is installed

USAGE_LOG = []

def tracked_call(version, prompt):
    start = time.time()
    result = Agent(model).run_sync(prompt)  # `model` configured elsewhere
    elapsed = time.time() - start
    USAGE_LOG.append({
        "version": version,
        "prompt_chars": len(prompt),
        "answer_chars": len(result.output),
        "elapsed_s": round(elapsed, 2),
        "timestamp": time.time(),
    })
    return result.output

def summary():
    n = len(USAGE_LOG)
    if n == 0:
        return {"calls": 0}
    total_chars = sum(r["prompt_chars"] + r["answer_chars"] for r in USAGE_LOG)
    avg_latency = sum(r["elapsed_s"] for r in USAGE_LOG) / n
    return {
        "calls": n,
        "total_chars": total_chars,
        "avg_latency_s": round(avg_latency, 2),
    }
```

| Field | Why |
|---|---|
| version | Group by prompt version — A/B comparison |
| prompt_chars / answer_chars | Cost proxy (~ tokens) |
| elapsed_s | Latency — slow calls have user-visible impact |
| timestamp | Trend over time |
| user_id (if multi-user) | Per-user spend |
| tool_used (if applicable) | Which subroutines burn budget |
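To turn the character proxy into a rough dollar figure, divide by ~4 chars per token and multiply by your model's rates. The prices below are placeholders, not any provider's real pricing:

```python
CHARS_PER_TOKEN = 4        # rough average for English text
PRICE_IN_PER_MTOK = 3.0    # placeholder: USD per 1M input tokens
PRICE_OUT_PER_MTOK = 15.0  # placeholder: USD per 1M output tokens

def estimated_cost_usd(log):
    in_tokens = sum(r["prompt_chars"] for r in log) / CHARS_PER_TOKEN
    out_tokens = sum(r["answer_chars"] for r in log) / CHARS_PER_TOKEN
    return (in_tokens * PRICE_IN_PER_MTOK + out_tokens * PRICE_OUT_PER_MTOK) / 1_000_000
```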
```python
# Total cost (proxy)
total_chars = sum(r["prompt_chars"] + r["answer_chars"] for r in USAGE_LOG)

# Per-version
from collections import defaultdict

by_version = defaultdict(list)
for r in USAGE_LOG:
    by_version[r["version"]].append(r)

for v, rows in by_version.items():
    avg = sum(r["elapsed_s"] for r in rows) / len(rows)
    print(f"{v}: {len(rows)} calls, avg {avg:.2f}s")
```

Two rules keep this safe and cheap. Log `{prompt_chars: 412}`, not `{prompt: "customer says ..."}`: never put PII in your dashboard. And `print` is enough; don't build dashboards for one-off scripts.