You want the agent to return exactly one word — positive, neutral, or negative. How reliable is a single-word system prompt on its own?
Probably mostly reliable? I'd guess the model sometimes adds a period or capitalizes it?
You'd guess correctly. You can write a perfect system prompt and still get "Positive." with a period, "NEGATIVE" in caps, or "The sentiment is positive." once in every twenty runs. The instruction is a suggestion, not a contract — so you normalize the output after the fact:
```python
agent = Agent(model, system_prompt="Classify the sentiment of the text. Reply with exactly one word: positive, neutral, or negative.")
result = agent.run_sync(text)
label = result.output.strip().lower()
```

`.strip()` trims leading and trailing whitespace. `.lower()` folds case. Together they make `"Positive"`, `" positive"`, and `"POSITIVE"` all compare equal to `"positive"`.
Why both .strip() and .lower()? Isn't one enough?
Each handles a different failure mode. Whitespace leaks from the model rarely but deterministically — one trailing newline breaks == comparisons. Capitalization varies based on temperature and prompt phrasing. Stripping without lowering fails on "Positive"; lowering without stripping fails on " positive". Applying both gives you one canonical shape to match against:
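The two failure modes can be reproduced with plain strings, no model call needed. A minimal sketch (the raw outputs are illustrative examples, not captured model responses):

```python
# Raw strings a model might return when asked for "positive"
raw_outputs = ["Positive", " positive", "POSITIVE", "positive\n"]

# Stripping alone misses case: "Positive".strip() is still "Positive"
assert not all(s.strip() == "positive" for s in raw_outputs)

# Lowering alone misses whitespace: " positive".lower() is still " positive"
assert not all(s.lower() == "positive" for s in raw_outputs)

# Both together collapse every variant to one canonical shape
assert all(s.strip().lower() == "positive" for s in raw_outputs)
```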
```python
def classify_sentiment(text: str) -> str:
    agent = Agent(model, system_prompt="Classify the sentiment of the text. Reply with exactly one word: positive, neutral, or negative.")
    return agent.run_sync(text).output.strip().lower()
```

So `.strip().lower()` is defensive programming for LLM outputs?
Exactly. Any time you compare an agent's output against a known label — sentiment, urgency, intent — normalize first. It is a one-line change that prevents a whole category of runtime bugs.
And the test cases check for a short lowercase string, not the exact word, because the model might still wobble?
Right. The schema validates shape — a string with at least a few characters — not the exact label. Write classify_sentiment(text) now: a single-word system prompt, run_sync, then .strip().lower() on .output before returning.
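A shape check in that spirit might look like the sketch below. The `looks_like_label` helper is hypothetical — the lesson's actual tests are not shown — but it captures the idea of validating shape rather than the exact label:

```python
def looks_like_label(output: str) -> bool:
    """Shape check: a short, lowercase, trimmed string -- not an exact-label match."""
    return 3 <= len(output) <= 20 and output == output.strip().lower()

assert looks_like_label("positive")
assert looks_like_label("neutral")
assert not looks_like_label("Positive")    # fails: not lowercase
assert not looks_like_label(" negative ")  # fails: untrimmed whitespace
```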
TL;DR: system_prompt shapes output; .strip().lower() handles the wobble.
- `positive | neutral | negative` — the three allowed labels
- `.strip()` — removes leading and trailing whitespace
- `.lower()` — folds capitalization

| Raw output | Normalized |
|---|---|
| `"Positive"` | `"positive"` |
| `" negative "` | `"negative"` |
| `"NEUTRAL"` | `"neutral"` |
| `"Positive."` | `"positive."` |
The trailing period still leaks through — for stricter guarantees, Day 11 shows result_type=Literal[...].
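Until then, one pragmatic stopgap — an extension beyond what the lesson covers, so treat it as an assumption — is to also strip trailing punctuation with `rstrip`:

```python
raw = "Positive."
# .rstrip(".") removes any trailing periods after the usual normalization
label = raw.strip().lower().rstrip(".")
assert label == "positive"
```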