Prompts evolve. v1 worked. v2 fixed a bug. v3 added a constraint. When the eval suite (week 2) shows a regression, the question is which version produced the regression. Without versioning, you don't know.
from pydantic_ai import Agent  # Agent(model).run_sync(...) as used throughout; model is any model identifier, e.g. "openai:gpt-4o"

model = "openai:gpt-4o"  # example model identifier

PROMPTS = {
    "classify_v1": "Classify: {input}",
    "classify_v2": "Classify {input} as positive or negative. Reply with one word.",
    "classify_v3": "Classify {input} as exactly one of: positive, negative, neutral. Reply with one word, lowercased.",
}
ACTIVE_VERSION = "classify_v3"

def classify(text):
    template = PROMPTS[ACTIVE_VERSION]
    prompt = template.format(input=text)
    answer = Agent(model).run_sync(prompt).output.strip().lower()
    return {"version": ACTIVE_VERSION, "answer": answer}

Every answer is tagged with the prompt version that produced it. When eval pass rate drops, you grep the logs by version.
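One way those logs might look: append each tagged result as a JSON line, then filter by the version field. A minimal sketch; the results.jsonl path and the helper names are illustrative, not part of the lesson's code:

import json

LOG_PATH = "results.jsonl"  # hypothetical log file

def log_result(result):
    # Append the tagged result (version + answer) as one JSON line.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(result) + "\n")

def results_for_version(version):
    # Pull back only the results produced by one prompt version.
    with open(LOG_PATH) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r["version"] == version]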
Why a dict instead of separate functions?
A dict is a registry. You can iterate it for A/B testing (tomorrow's lesson), look prompts up by version string at runtime, and ship a new version by adding a key. Keeping prompts in code as strings under version keys is the simplest possible prompt-management system.
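As a preview of why iteration matters, here is a minimal sketch that runs the same input through every registered version, reusing the classify-style call from above; the run_all_versions name is illustrative:

def run_all_versions(text):
    # Render and run every prompt version against the same input,
    # tagging each answer with the version that produced it.
    results = {}
    for version, template in PROMPTS.items():
        prompt = template.format(input=text)
        results[version] = Agent(model).run_sync(prompt).output.strip().lower()
    return results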
Production tools for this?
PromptLayer, Helicone, LangSmith, OpenAI's prompt-templates feature. All are built on the same idea: a prompt has a version, an answer is tagged with the version that produced it, and the eval suite groups results by version. You can ship a Promptfoo-grade workflow in 20 lines of Python.
class PromptRegistry:
    def __init__(self):
        self.prompts = {}   # "{name}_v{version}" -> template
        self.active = None  # key of the currently selected prompt

    def register(self, name, version, template):
        self.prompts[f"{name}_v{version}"] = template

    def use(self, name, version):
        self.active = f"{name}_v{version}"

    def render(self, **kwargs):
        # Returns the filled-in prompt plus the version key that produced it.
        return self.prompts[self.active].format(**kwargs), self.active

| Capability | Without versioning | With versioning |
|---|---|---|
| Reproduce a regression | "It worked yesterday" | Pin to v2, replay |
| A/B test prompts | Manual swap, hope | Run both, compare per-version pass rate (tomorrow) |
| Roll back | Edit code, ship, hope | registry.use("classify", 2) |
| Audit | What was the prompt 3 weeks ago? | Lookup classify_v2 |
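Putting the registry to work, a minimal sketch; the registered templates are the same strings as in the PROMPTS dict above, and the example input is illustrative:

registry = PromptRegistry()
registry.register("classify", 2, "Classify {input} as positive or negative. Reply with one word.")
registry.register("classify", 3, "Classify {input} as exactly one of: positive, negative, neutral. Reply with one word, lowercased.")

registry.use("classify", 3)
prompt, version = registry.render(input="the product arrived broken")
# prompt -> the rendered v3 template, version -> "classify_v3"

registry.use("classify", 2)  # roll back: one line, no prompt text edited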
Naming convention: {purpose}_v{N}, e.g. classify_v3, summarise_v2. Purpose first (so related versions cluster), version suffix (so sort order is intuitive).
Semantic versioning (v1.2.3) is overkill — prompts rarely have backward-compat concerns. Integer monotonic versions (v1, v2, v3) keep the system simple.
Where the registry lives:
Start in code. Migrate when team size or change frequency demands it.
When prompt v3 changes the output format (added a confidence field), downstream code must know which schema to expect:
result = {"version": "classify_v3", "answer": "positive", "confidence": 0.9}
result["version"]  # downstream uses this to dispatch parsing logic

The same version string drives prompt selection AND output handling.
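What that dispatch might look like, as a minimal sketch; it assumes, per the lesson, that only v3 results carry a confidence field, and the parse_result name is illustrative:

def parse_result(result):
    # v3 added a confidence field; earlier versions only return an answer.
    if result["version"] == "classify_v3":
        return result["answer"], result["confidence"]
    return result["answer"], None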