Yesterday's cost log captured per-call metrics. Agent observability is one level up: per run, capture the sequence of tool calls, args, and outcomes. When something goes wrong, the trace is the audit trail.
```python
class Trace:
    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def record(self, kind, name, **fields):
        self.events.append({"kind": kind, "name": name, **fields})

# Inside the agent
trace = Trace("run-abc")

trace.record("tool_call", "retrieve", query=q, k=3)
result = retrieve(q, k=3)
trace.record("tool_result", "retrieve", n_chunks=len(result))

trace.record("llm_call", "answer", prompt_chars=len(prompt))
answer = generate(prompt)
trace.record("llm_result", "answer", answer_chars=len(answer))
```

At run end: dump the trace to a Sheet, JSON, or your observability platform.
This is a lot of code for one run.
Three lines per event. With a decorator, it's one line:
```python
@traced(trace)
def retrieve(q, k=3): ...
```

The decorator records `tool_call` before and `tool_result` after, with the args and return value. Same shape, less boilerplate.
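A sketch of what such a decorator could look like, built on the `Trace` class above (the argument-capture details and the 200-character truncation are assumptions, not a library API):

```python
import functools

def traced(trace):
    """Wrap a tool so every call is recorded on the given Trace."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            # Record the call before invoking, the result (or error) after.
            trace.record("tool_call", fn.__name__,
                         args=repr(args)[:200], kwargs=repr(kwargs)[:200])
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                trace.record("tool_error", fn.__name__, error=str(exc))
                raise
            trace.record("tool_result", fn.__name__, result_summary=repr(result)[:200])
            return result
        return inner
    return wrap
```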
What does this catch?
Bugs that hide in the sequence: a tool fired with the wrong args, a fallback fired when it shouldn't have, a retry kept retrying past the cap. Without a trace, you see "the agent failed". With a trace, you see exactly which step failed.
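As an illustration, a post-mortem over the events list is just a couple of comprehensions; the field names match the fuller `Trace` below, and the retry cap of 3 is made up:

```python
errors = [e for e in trace.events if e["kind"] == "tool_error"]
retrieve_calls = [e for e in trace.events
                  if e["kind"] == "tool_call" and e["name"] == "retrieve"]

if errors:
    print("first failing step:", errors[0])
if len(retrieve_calls) > 3:  # hypothetical retry cap
    print(f"retrieve fired {len(retrieve_calls)} times; retry cap not respected")
```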
A fuller `Trace` adds a run id, relative timestamps, and a JSON dump:

```python
import uuid, time, json

class Trace:
    def __init__(self):
        self.run_id = uuid.uuid4().hex[:8]
        self.started_at = time.time()
        self.events = []

    def record(self, kind, name, **fields):
        self.events.append({
            "t": round(time.time() - self.started_at, 3),  # seconds since run start
            "kind": kind,
            "name": name,
            **fields,
        })

    def dump_json(self):
        return json.dumps({"run_id": self.run_id, "events": self.events}, indent=2)
```

| Kind | When |
|---|---|
| tool_call | Before invoking a tool / function |
| tool_result | After the tool returned (with a summary, not raw data) |
| tool_error | After the tool raised |
| llm_call | Before an LLM call (with prompt version) |
| llm_result | After the LLM returned (with an output summary) |
| decision | When the agent picked between options (route, recover, etc.) |
Three to five kinds is plenty. Don't over-engineer the schema.
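For the kinds not shown in the earlier snippet, recording looks the same; `search_web` and the route name here are made up purely for illustration:

```python
def search_web(q):
    # Stand-in for any flaky external tool.
    raise TimeoutError("upstream search timed out")

trace.record("tool_call", "search_web", query=q)
try:
    hits = search_web(q)
    trace.record("tool_result", "search_web", n_hits=len(hits))
except TimeoutError as exc:
    trace.record("tool_error", "search_web", error=str(exc))
    trace.record("decision", "route", chosen="fallback_llm", reason="search timed out")
```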
What goes into `**fields`? Summaries, not raw data:
```python
# Bad — pollutes the trace, leaks PII
trace.record("llm_call", "answer", prompt=full_prompt)

# Good — summary, no leakage
trace.record("llm_call", "answer", prompt_chars=len(full_prompt), version="v3")
```

A full trace lets you replay a run (with mocks for tools that already side-effected). Reproduce a bug deterministically by feeding the trace's tool results back into the same agent code. This is how production debugging of agent failures actually works.