ML Prompt Operations

Prompts are production code. They define task framing, tool policy, response format, safety posture, retrieval usage, and error behavior. Prompt changes should be versioned, reviewed, evaluated, and rolled back like model or application changes.

Prompt templates should be treated as named, versioned artifacts instead of anonymous strings hidden in application code.
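
One way to make a template a named, versioned artifact is to wrap it in a small record type and keep it in a registry. The sketch below is illustrative only: `PromptTemplate`, `REGISTRY`, `register`, and the `support.answer` ID are hypothetical names, and it assumes placeholders use Python's `str.format` syntax.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """A named, versioned prompt artifact (hypothetical structure)."""

    template_id: str  # stable name, e.g. "support.answer"
    version: str      # bumped on every reviewed change
    text: str         # template body with str.format placeholders

    @property
    def text_hash(self) -> str:
        # Hash of the exact template text, so silent edits are detectable.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

    def render(self, **variables: str) -> str:
        # str.format raises KeyError when a placeholder has no value.
        return self.text.format(**variables)


# In-memory registry keyed by (template_id, version); an assumption for this sketch.
REGISTRY: dict = {}


def register(template: PromptTemplate) -> None:
    """Add a template, refusing to change text under an existing version."""
    key = (template.template_id, template.version)
    existing = REGISTRY.get(key)
    if existing is not None and existing.text_hash != template.text_hash:
        raise ValueError(f"{key} already registered with different text")
    REGISTRY[key] = template
```

Rejecting a text change under an existing version forces every edit through a version bump, which is what makes review and rollback possible.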

Command Examples

find prompts -type f | sort
git diff -- prompts/

Example output and meaning:

Command	Example output	What it does
`find prompts -type f | sort`	Sorted paths of every file under prompts/, one per line.	Shows which prompt files the prompt loader can actually consume.
`git diff -- prompts/`	A unified diff of unstaged changes to files under prompts/, or no output if nothing changed.	Shows whether the prompts in the working tree differ from the version under review.

If prompts live in code or a database, export the exact active prompt bundle before debugging.

Prompt Contract

Part	What It Controls
System message	Global role, safety constraints, output boundaries.
Developer message	Application-specific task rules and tool rules.
User message	User intent and input data.
Retrieved context	External evidence, citations, and grounding material.
Tool schema	Available actions and structured input constraints.
Output schema	JSON, citations, refusal format, or other response contract.

Versioning Checklist

prompt template ID,
prompt text hash,
model ID,
tokenizer/chat template,
retrieval prompt assembly code,
tool schema version,
output schema version,
generation config,
eval report,
rollout status.
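
The checklist can be captured as a single manifest with a derived bundle ID. This is a minimal sketch under stated assumptions: the field names, the example values (`example-model-v1`, `reports/eval-0001.json`, and so on), and the choice to exclude the eval report and rollout status from the ID are all illustrative, not a prescribed format.

```python
import hashlib
import json

# Fields that change model behavior and therefore define the bundle's identity.
# Eval report and rollout status are tracked but excluded, so the ID stays
# stable while a bundle moves through evaluation and rollout (an assumption).
IDENTITY_FIELDS = (
    "prompt_template_id",
    "prompt_text_hash",
    "model_id",
    "chat_template",
    "retrieval_assembly_version",
    "tool_schema_version",
    "output_schema_version",
    "generation_config",
)


def bundle_id(manifest: dict) -> str:
    """Derive a short, stable ID by hashing the canonical JSON of identity fields."""
    identity = {field: manifest[field] for field in IDENTITY_FIELDS}
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical manifest; every value here is a placeholder.
manifest = {
    "prompt_template_id": "support.answer",
    "prompt_text_hash": hashlib.sha256(b"Answer the question: {question}").hexdigest(),
    "model_id": "example-model-v1",
    "chat_template": "example-chat-template-v1",
    "retrieval_assembly_version": "1.2.0",
    "tool_schema_version": "3",
    "output_schema_version": "2",
    "generation_config": {"temperature": 0.2, "max_tokens": 512},
    "eval_report": "reports/eval-0001.json",
    "rollout_status": "canary",
}

print(bundle_id(manifest))
```

Sorting keys before hashing keeps the ID independent of dictionary ordering, so two services that build the same manifest agree on the same ID.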

Structured Outputs

Use structured outputs when downstream systems need predictable fields. Treat the schema as an API contract.

{
  "answer": "short response",
  "citations": ["source-id"],
  "confidence": "low|medium|high",
  "needs_human_review": false
}

Structured output still needs semantic validation. Valid JSON can contain unsupported claims or unsafe tool arguments.
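
A semantic check for the schema above might look like the following sketch. `validate_response` and `known_source_ids` are hypothetical names, and the rule that high confidence requires at least one citation is an assumed policy added for illustration.

```python
import json

ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def validate_response(raw: str, known_source_ids: set) -> list:
    """Return a list of problems; an empty list means the response passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []

    # Structural checks: field presence and types.
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        problems.append("answer must be a non-empty string")
    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        problems.append("confidence must be low, medium, or high")
    if not isinstance(data.get("needs_human_review"), bool):
        problems.append("needs_human_review must be a boolean")

    # Semantic checks: citations must point at sources that were actually retrieved.
    citations = data.get("citations")
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        problems.append("citations must be a list of strings")
    else:
        unknown = [c for c in citations if c not in known_source_ids]
        if unknown:
            problems.append(f"citations not in retrieved context: {unknown}")
        # Assumed policy: a high-confidence answer must cite something.
        if not citations and confidence == "high":
            problems.append("high confidence requires at least one citation")

    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every violation per response and aggregate them by type in an eval report.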

Prompt Failure Modes

Symptom	Likely Cause	Fix
Ignores retrieved evidence	Context too long, weak instruction hierarchy, conflicting chunks.	Improve context assembly and citation checks.
Verbose or off-format output	Output schema unclear or not enforced.	Use schema validation and examples.
Unsafe tool choice	Tool policy embedded only in natural language.	Enforce tool authorization outside the prompt.
Fragile improvements	Prompt tuned to a tiny eval set.	Expand slices and run regression cases.
Injection success	Untrusted text mixed with instructions.	Delimit data, quote sources, and add injection evals.
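
Two of the fixes in the table, enforcing tool authorization outside the prompt and delimiting untrusted data, can be sketched in a few lines. The role names, tool names, and `<source>` delimiter format below are assumptions chosen for illustration, and `source_id` is assumed to come from trusted application code.

```python
# Hypothetical allowlist: which tools each caller role may invoke.
ROLE_TOOL_ALLOWLIST = {
    "viewer": {"search_docs"},
    "agent": {"search_docs", "create_ticket"},
}


def authorize_tool_call(role: str, tool_name: str) -> bool:
    """Decide in application code whether a model-requested tool call may run."""
    # Unknown roles get an empty allowlist, so the default is deny.
    return tool_name in ROLE_TOOL_ALLOWLIST.get(role, set())


def wrap_untrusted(source_id: str, text: str) -> str:
    """Wrap retrieved text in explicit delimiters so it is presented as data."""
    # Neutralize delimiter look-alikes so the text cannot close its own wrapper.
    cleaned = text.replace("<source", "&lt;source").replace("</source", "&lt;/source")
    return f'<source id="{source_id}">\n{cleaned}\n</source>'
```

The authorization check runs regardless of what the prompt says, so a successful injection can at most request a tool call that the application then refuses. Delimiting reduces injection risk but does not eliminate it, which is why the table pairs it with injection evals.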

Prompt Release Runbook

Render prompts for representative cases.
Run golden, safety, retrieval, and formatting evals.
Compare token count and latency against baseline.
Review failures by slice and severity.
Canary the new prompt with per-version metrics.
Roll back by prompt bundle ID if thresholds fail.
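
The comparison and rollback steps can be expressed as an explicit gate. This is a sketch, not a recommended policy: `EvalMetrics`, `release_blockers`, and the default thresholds are hypothetical and should be replaced with values agreed for the system in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    pass_rate: float       # fraction of eval cases passed, 0.0 to 1.0
    mean_tokens: float     # mean tokens per request
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def release_blockers(
    baseline: EvalMetrics,
    candidate: EvalMetrics,
    max_pass_rate_drop: float = 0.01,   # assumed threshold
    max_token_growth: float = 0.10,     # assumed threshold
    max_latency_growth: float = 0.10,   # assumed threshold
) -> list:
    """Return reasons to block or roll back; an empty list means proceed."""
    reasons = []
    if candidate.pass_rate < baseline.pass_rate - max_pass_rate_drop:
        reasons.append("eval pass rate regressed beyond threshold")
    if candidate.mean_tokens > baseline.mean_tokens * (1 + max_token_growth):
        reasons.append("token count grew beyond threshold")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + max_latency_growth):
        reasons.append("p95 latency grew beyond threshold")
    return reasons
```

Running the same gate during the canary, with metrics keyed by prompt bundle ID, gives the rollback step a concrete trigger instead of a judgment call made under pressure.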

Study Cards

Question

Why are prompts production code?

Answer

They control task behavior, tool policy, output format, safety boundaries, and retrieval usage.

Question

What should be versioned with a prompt?

Answer

Template text, model, chat template, tool schema, output schema, retrieval assembly, generation config, and eval report.

Question

Why is valid JSON not enough for structured outputs?

Answer

The JSON can still contain unsupported claims, unsafe actions, or semantically invalid values.
