Three weeks of primitives. Today they all show up at once. Agent with 2 tools, eval suite with 3 cases, pass-rate threshold. The synthesis pattern.
```python
from pydantic_ai import Agent
import re

VALUES = {"x": 10, "y": 20, "z": 30}

# `model` is the model configured in earlier lessons,
# e.g. the string "openai:gpt-4o".
agent = Agent(model)

@agent.tool_plain
def add(a: int, b: int) -> int:
    """Return the sum of two integers."""
    return a + b

@agent.tool_plain
def lookup(key: str) -> int:
    """Look up the integer value associated with a single-letter key."""
    return VALUES[key]

cases = [
    ("What is x + y?", 30),
    ("What is x + 5?", 15),
    ("What is z + y?", 50),
]

def parse_int(s: str) -> int | None:
    """Return the last integer found in s, or None if there is none."""
    digits = re.findall(r"-?\d+", s)
    return int(digits[-1]) if digits else None

pass_count = 0
for prompt, expected in cases:
    out = agent.run_sync(prompt).output
    got = parse_int(out)
    ok = got == expected
    pass_count += int(ok)
    print(f"  {'PASS' if ok else 'FAIL'} — {prompt} → got={got} expected={expected}")

print(f"\n{pass_count} / {len(cases)} passed")
assert pass_count >= 2, f"expected at least 2/3, got {pass_count}"
```

Tool calling (add + lookup), agent loop (two tools, multi-step), output validation (regex-parse the int), eval suite (three cases), pass-rate threshold (≥ 2/3). Five primitives in about thirty lines.
Right. Each primitive sits where it fits. The agent decides which tool to call (lookup or add). The eval cases force composition — case 1 needs two lookups + add; case 2 needs one lookup + add; case 3 needs two lookups + add. The verification is a Python threshold over the suite.
What's the deterministic post-processing for?
The agent's result.output is text — "The answer is 30." or "30" or "x + y equals 30." — it varies by run. We pull out the digit runs and take the last one to get the integer. Same pattern as week 1 of AI Foundations: validate the shape of the response, not the exact text.
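A few quick checks make "validate the shape" concrete; these run as-is against the parse_int defined above:

```python
# Shape validation: any phrasing passes as long as the number is right.
assert parse_int("30") == 30
assert parse_int("The answer is 30.") == 30
assert parse_int("x + y equals 30.") == 30
assert parse_int("no number here") is None
```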
Five primitives composed:
| Primitive | From lesson | What it does here |
|---|---|---|
| Tool calling (single) | L4 | add registered with the agent |
| Multi-step tools | L5 | lookup then add for cases needing both |
| Multi-tool agent | L18 | Agent picks lookup vs add per task |
| Output validation | L11 | re.findall(r"-?\d+", out) parses the integer |
| Eval suite + threshold | L19 | 3 cases, pass-rate ≥ 2/3 |
No new concepts. The exercise is putting them together on a small generic problem.
("What is x + y?", 30) → lookup(x)=10, lookup(y)=20, add(10, 20)=30
("What is x + 5?", 15) → lookup(x)=10, add(10, 5)=15
("What is z + y?", 50) → lookup(z)=30, lookup(y)=20, add(30, 20)=50
Each case requires a different sequence of tool calls. Same agent, same toolset — the loop adapts per prompt. That's the agent's value: it composes tools without you writing the dispatch logic.
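If you want to see those sequences rather than take them on faith, pydantic-ai keeps the message history on the run result. A minimal sketch, assuming the agent defined above and a pydantic-ai version that exposes all_messages() and ToolCallPart:

```python
from pydantic_ai.messages import ToolCallPart

# Replay one case and print every tool call the agent made.
result = agent.run_sync("What is x + y?")
for message in result.all_messages():
    for part in message.parts:
        if isinstance(part, ToolCallPart):
            print(f"tool call: {part.tool_name}({part.args})")
print("final:", result.output)
```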
LLM sampling. Even with sharp prompts, occasional misses happen. 2/3 ≈ 67% — the gate passes when the agent works most of the time. Tightening it to 3/3 would make the lesson flake on sampling noise. Production thresholds depend on the cost and severity of a miss.
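The same gate expressed as a rate rather than a raw count scales with suite size. A sketch against the pass_count and cases above; THRESHOLD is a name introduced here, not part of the lesson code:

```python
# Rate-based variant of the final assert: survives a larger suite.
THRESHOLD = 2 / 3
rate = pass_count / len(cases)
assert rate >= THRESHOLD, f"pass rate {rate:.0%} below {THRESHOLD:.0%}"
```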
The agent might say "30" or "The answer is 30." or "x + y equals 30." — all should pass. Parsing the last integer in the output is robust to those wording variants: re.findall(r"-?\d+", s) collects the digit runs and parse_int takes the last one. It is a heuristic, though; a reply that ends on a different number would defeat it.
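One line demonstrates that limitation, runnable against parse_int above:

```python
# Edge case of last-integer parsing: the trailing number wins.
assert parse_int("It equals 30 in base 10.") == 10  # not 30
```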
Each case = 1 agent run = 2–4 LLM calls (one to plan, one or more to call tools, one to finalise). Three cases = roughly 6–12 calls against your quota. Substantial — but synthesis lessons run once a week.
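Spelled out (the 2–4 calls-per-run figure is the lesson's estimate, not a guarantee):

```python
# Budget estimate: calls per case times number of cases.
low, high = 2, 4   # LLM calls per agent run: plan, tool call(s), finalise
n_cases = 3
print(f"{n_cases * low}-{n_cases * high} LLM calls per eval run")  # 6-12
```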
This synthesis is deliberately compact. No moderation, no self-critique, no chained prompts — those are different patterns. AI Patterns' final-week synthesis (L28) brings in the real Composio integration and a wider composition.