Last code lesson. Compose six primitives on a tiny generic problem so the orchestration is what shines, not the domain.
```python
from pydantic_ai import Agent
import re

# === The setup ===
VALUES = {"a": 5, "b": 7, "c": 11}

# Any model string you configured in earlier lessons works here;
# "openai:gpt-4o" is just a stand-in.
agent = Agent("openai:gpt-4o")

@agent.tool_plain
def lookup(key: str) -> int:
    """Look up the integer value associated with a single-letter key."""
    return VALUES[key]

@agent.tool_plain
def add(x: int, y: int) -> int:
    """Return the sum of two integers."""
    return x + y

# === Eval suite ===
cases = [
    ("What is a + b?", 12),
    ("What is c + a?", 16),
    ("What is b + c?", 18),
]

def parse_int(s):
    """Extract the last integer from the agent's free-text answer."""
    digits = re.findall(r"-?\d+", s)
    return int(digits[-1]) if digits else None

# === Run + score ===
rubric_per_case = []
for prompt, expected in cases:
    out = agent.run_sync(prompt).output
    got = parse_int(out)
    rubric_per_case.append({
        "correct": got == expected,
        "is_integer": got is not None,
        "non_empty": bool(out.strip()),
    })

# Weighted rubric: correctness weighs most
WEIGHTS = {"correct": 0.7, "is_integer": 0.2, "non_empty": 0.1}

scores = []
for item in rubric_per_case:
    s = sum(WEIGHTS[k] * int(v) for k, v in item.items())
    scores.append(s)

final = sum(scores) / len(scores)

for i, (item, s) in enumerate(zip(rubric_per_case, scores), 1):
    print(f"  case {i}: {item}, score={s:.2f}")
print(f"\nFinal weighted score: {final:.2f}")
```

Tools (lookup, add), an agent loop (the 3-case eval), output validation (regex parse), a rubric (weighted scoring), a threshold (the final score check). Each piece is doing one job.
Right. Six primitives, three tiny generic cases. The orchestration — calling the agent on each case, scoring each output across a rubric, averaging — is the lesson. The math itself is incidental.
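One way to make that orchestration explicit is to pull it into a small runner. The `run_eval` helper below is a sketch, not part of the lesson's script; it just re-expresses the run/parse/score/average loop with the tools, cases, and rubric passed in:

```python
# Sketch: the same orchestration, parameterised so cases and rubric
# can be swapped without touching the loop itself.
def run_eval(agent, cases, rubric, weights):
    scores = []
    for prompt, expected in cases:
        out = agent.run_sync(prompt).output
        checks = rubric(out, expected)  # dict of criterion -> bool
        scores.append(sum(weights[k] * int(v) for k, v in checks.items()))
    return sum(scores) / len(scores)

def toy_rubric(out, expected):
    got = parse_int(out)
    return {
        "correct": got == expected,
        "is_integer": got is not None,
        "non_empty": bool(out.strip()),
    }

final = run_eval(agent, cases, toy_rubric, WEIGHTS)
```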
Why not test on something real?
Because the pattern is the point. Once you've internalised the orchestration on toy math, you can plug in any tool (Composio actions from L27), any rubric (domain-specific predicates), any cases (your real eval set) and the structure is the same. Synthesis lessons stay tiny on purpose.
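For instance, a domain swap only touches the data. The ticket-triage cases and criteria below are invented for illustration and are not from the course; the loop sketched above is reused unchanged:

```python
# Hypothetical domain swap: different cases, different rubric, same structure.
support_cases = [
    ("Classify this ticket: 'My card was charged twice.'", "billing"),
    ("Classify this ticket: 'The app crashes on login.'", "bug"),
]

def support_rubric(out: str, expected: str) -> dict:
    return {
        "correct": expected in out.lower(),  # expected label appears in the answer
        "terse": len(out.split()) <= 3,      # a classification, not an essay
        "non_empty": bool(out.strip()),
    }

SUPPORT_WEIGHTS = {"correct": 0.7, "terse": 0.2, "non_empty": 0.1}
final = run_eval(agent, support_cases, support_rubric, SUPPORT_WEIGHTS)
```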
| Primitive | From | Role here |
|---|---|---|
| Multi-tool agent | L18 | lookup + add registered, model picks per case |
| Tool calling | L4 | Each tool is a typed Python function with docstring |
| Output validation | L11/L24 | parse_int regex-extracts the integer from agent text |
| Eval suite | L19 | 3 (input, expected) cases |
| Scoring rubric | L25 | 3-criterion weighted score per case |
| Threshold check | L19/L21 | Final mean score >= 0.7 |
Six primitives. Twenty-five lines (excluding the toy data). Generic math. No domain smuggling.
(Swap lookup/add with two Composio tools — the rest of the script stays the same.) Synthesis isn't "use everything". It's "use what fits". A real production agent would pick a different subset for its problem.
You have the AI Patterns kit. To apply it, swap in your own tools, your own cases, and your own rubric.
The exercise of building a real eval suite for your real problem is the move from "finished an LLM track" to "can ship LLM features". AI Advanced (deferred) adds embeddings, RAG, model routing, caching — refinements on top of this kit, not replacements.
The code above. Run it. Verification asserts the final mean weighted score >= 0.7 — meaning at least 70% of the criteria, weighted by importance, passed across the 3 cases.
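If you want the script to enforce that gate itself rather than rely on the external check, a minimal sketch:

```python
# Threshold check: fail loudly if the mean weighted score drops below 0.7.
THRESHOLD = 0.7
assert final >= THRESHOLD, f"eval failed: {final:.2f} < {THRESHOLD}"
```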