Most LLM mistakes on multi-step problems aren't because the model can't do the steps. They happen because it tries to jump straight to the answer. Chain-of-thought prompting, at its simplest, is one phrase: "Let's think step by step." It coaxes the model to write out its reasoning, and the answer at the end is reliably better.
problem = "A box has 12 apples. Half are green. Of the green, a third are spoiled. How many are spoiled?"
# Direct
direct = Agent(model).run_sync(f"{problem} Reply with just the number.")
print("DIRECT:", direct.output)
# Chain-of-thought
cot = Agent(model).run_sync(f"{problem}\n\nLet's think step by step.")
print("COT:", cot.output)The CoT version writes out the reasoning — half of 12 is 6 green, a third of 6 is 2 — and lands on 2. The direct version sometimes gets it right, sometimes hallucinates.
Does writing the reasoning out actually make the answer better?
Yes. The exposed reasoning gives the model more tokens to check itself with. Each step is conditioned on the prior steps, which constrains the next prediction. It's the same mechanism — next-token prediction — but the prompt invites a longer, more structured trajectory.
Trade-off?
More tokens out → more cost. For trivial questions CoT is overkill ("What is 2 + 2?" doesn't need a paragraph of reasoning). For multi-step word problems, classification with multiple criteria, or any task where the model is currently flaky, it's the cheapest accuracy upgrade you can buy.
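For a rough sense of that cost gap, you can compare how long the two responses are. This is only a character-count proxy for tokens, and it assumes the direct and cot results from the earlier snippet are still around:

# Character length as a crude stand-in for token count.
# Real billing is per token, but the order-of-magnitude gap shows up either way.
print("direct length:", len(direct.output))
print("cot length:", len(cot.output))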
A prompt-engineering pattern that asks the model to write out its reasoning before producing the final answer. The classic phrase: "Let's think step by step."
LLMs are next-token predictors. Each token they produce is conditioned on what they've already produced. If you prompt them to produce reasoning first, the final-answer tokens are conditioned on that reasoning — making them more accurate on multi-step problems.
Without CoT, the model has to do the whole reasoning implicitly between the prompt and the answer token. Sometimes that works; often it doesn't.
Three common phrasings:
# Most basic
prompt = problem + "\n\nLet's think step by step."
# More explicit structure
prompt = problem + "\n\nReason through this step by step, then give the final answer."
# With output format
prompt = problem + "\n\nFirst write your reasoning, then on the last line write 'ANSWER: <number>'."

The last variant is most useful programmatically — the reasoning is human-readable, but you can re.search for "ANSWER: (\d+)" to extract the number.
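As a sketch of that extraction (assuming the ANSWER-format prompt above and the same problem, model, and Agent setup), it might look like:

import re

# Run the ANSWER-format variant and pull the number off the last line.
answer_prompt = problem + "\n\nFirst write your reasoning, then on the last line write 'ANSWER: <number>'."
result = Agent(model).run_sync(answer_prompt)
match = re.search(r"ANSWER:\s*(\d+)", result.output)
answer = int(match.group(1)) if match else None
print("reasoning:\n", result.output)
print("extracted answer:", answer)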
Print the answers to a small word problem with and without CoT. Verification just confirms both responses ran. Compare them yourself — feel the difference in the reasoning trace.