ML Evaluation and CI/CD

ML CI/CD should test behavior, not only build artifacts. A release can change model weights, prompts, retrieval, tools, generation parameters, runtime kernels, quantization, or safety policy. Each change needs a gate that matches the risk.

Regression gates compare a candidate against the current approved baseline and block release when required behavior gets worse beyond an agreed threshold.
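
The comparison described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the metric names, scores, and thresholds are hypothetical, and every metric is assumed to be higher-is-better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    metric: str
    baseline: float
    candidate: float
    passed: bool


def regression_gate(baseline, candidate, max_drop):
    """Compare higher-is-better metrics and flag drops beyond the threshold."""
    results = []
    for metric, allowed in max_drop.items():
        base = baseline[metric]
        cand = candidate[metric]
        # A drop larger than the agreed threshold fails the gate.
        passed = (base - cand) <= allowed
        results.append(GateResult(metric, base, cand, passed))
    return results


def release_allowed(results):
    """The release is blocked if any gated metric regressed too far."""
    return all(r.passed for r in results)


# Hypothetical scores and thresholds, for illustration only.
baseline = {"golden_pass_rate": 0.98, "citation_support": 0.91}
candidate = {"golden_pass_rate": 0.97, "citation_support": 0.85}
thresholds = {"golden_pass_rate": 0.02, "citation_support": 0.03}

results = regression_gate(baseline, candidate, thresholds)
for r in results:
    print(r.metric, "ok" if r.passed else "REGRESSED")
print("release allowed:", release_allowed(results))
```

Keeping one threshold per metric lets a team tolerate small noise on some measures while holding others nearly fixed.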

Command Examples

git diff --name-only
date -Is

Example output and meaning:

Command	Example output	What it does
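`git diff --name-only`	`prompts/support.yaml`, `evals/golden.jsonl` (one path per line)	Lists files with unstaged changes in the working tree. To see which prompt, eval, or policy artifacts changed in a release, pass the baseline and candidate refs.
`date -Is`	`2026-06-06T10:24:33-07:00`	Prints the current time in ISO 8601 format with seconds precision (GNU `date`), so command output and logs can be tied to an exact timestamp.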

Before running evals, identify what changed: data, model, prompt, retrieval, tool, serving runtime, or policy.
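
One way to make this step mechanical is to map changed paths to change types. The sketch below assumes a hypothetical repository layout (`prompts/`, `evals/`, `models/`, and so on); the patterns are placeholders and would need to match the real project.

```python
from fnmatch import fnmatch

# Hypothetical path patterns; adjust to the repository's real layout.
CHANGE_PATTERNS = [
    ("data", "data/*"),
    ("eval", "evals/*"),
    ("model", "models/*"),
    ("prompt", "prompts/*"),
    ("retrieval", "retrieval/*"),
    ("tool", "tools/*"),
    ("serving runtime", "serving/*"),
    ("policy", "policies/*"),
]


def classify_changes(paths):
    """Return the set of change types touched by a list of changed paths."""
    kinds = set()
    for path in paths:
        matched = False
        for kind, pattern in CHANGE_PATTERNS:
            # fnmatch's "*" also matches "/" so nested files are covered.
            if fnmatch(path, pattern):
                kinds.add(kind)
                matched = True
        if not matched:
            # Unknown paths are surfaced rather than silently ignored.
            kinds.add("unclassified")
    return kinds


# The paths would normally come from `git diff --name-only`, one per line.
changed = ["prompts/support.yaml", "evals/golden.jsonl", "README.md"]
print(sorted(classify_changes(changed)))
```

Treating unmatched paths as "unclassified" is a conservative choice: a reviewer has to decide whether they affect behavior.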

Evaluation Harness

Component	Purpose
Case schema	Defines input, expected behavior, metadata, risk, and slices.
Runner	Executes the candidate system in a reproducible environment.
Judge	Scores deterministic checks, model-graded checks, human review, or hybrid criteria.
Baseline	Current production system or last approved release.
Report	Shows aggregate, slice, regression, cost, latency, and failure examples.
Gate	Blocks release when thresholds or severity rules fail.
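
The components in the table can be wired together roughly as follows. This is a deliberately small sketch: the `Case` fields, the substring judge, and `toy_system` are illustrative assumptions, and a real harness would add model-graded or human judging, cost and latency tracking, and a baseline comparison.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Case:
    case_id: str
    input: str
    must_contain: str          # expected behavior, as a deterministic check
    risk: str = "low"          # "low" or "high" in this sketch
    slices: List[str] = field(default_factory=list)


def judge(case, output):
    """Deterministic judge: the output must contain the expected text."""
    return case.must_contain.lower() in output.lower()


def run_suite(system, cases):
    """Runner plus report: per-slice pass rates and failing case ids."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    failures = []
    for case in cases:
        ok = judge(case, system(case.input))
        if not ok:
            failures.append(case.case_id)
        for name in ["all"] + case.slices:
            totals[name] += 1
            passes[name] += int(ok)
    rates = {name: passes[name] / totals[name] for name in totals}
    return rates, failures


def gate(rates, failures, cases, min_rate):
    """Block on a low aggregate pass rate or on any failed high-risk case."""
    high_risk = {c.case_id for c in cases if c.risk == "high"}
    severe = [cid for cid in failures if cid in high_risk]
    return rates["all"] >= min_rate and not severe


def toy_system(prompt):
    # Stand-in for the candidate system under test.
    if "refund" in prompt:
        return "Refunds are available within 30 days."
    return "I am not sure."


cases = [
    Case("c1", "What is the refund window?", "30 days", risk="high", slices=["billing"]),
    Case("c2", "How do I reset my password?", "reset link", slices=["account"]),
]
rates, failures = run_suite(toy_system, cases)
print(rates, failures, gate(rates, failures, cases, min_rate=0.9))
```

The severity rule in `gate` is separate from the aggregate threshold so that a single high-risk failure cannot be averaged away.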

Release Pipeline

flowchart LR
  Change[Change] --> Build[Build artifact]
  Build --> Smoke[Smoke eval]
  Smoke --> Regression[Regression suite]
  Regression --> Safety[Safety and policy eval]
  Safety --> Load[Serving load test]
  Load --> Canary[Canary]
  Canary --> Promote[Promote or rollback]
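
The flowchart can be read as a sequence of checks that stops at the first failure. The sketch below assumes each stage is a callable returning pass or fail; the stage identifiers are illustrative, and the split between "blocked" and "rollback" is one interpretation of the final node, not something the diagram specifies.

```python
from typing import Callable, Dict, Tuple

# Stage order taken from the flowchart; the identifiers are illustrative.
STAGES = [
    "build",
    "smoke_eval",
    "regression_suite",
    "safety_policy_eval",
    "load_test",
    "canary",
]


def run_pipeline(checks: Dict[str, Callable[[], bool]]) -> Tuple[str, str]:
    """Run stages in order and stop at the first failure.

    Returns (decision, stage); stage is the failing stage, or "" on success.
    """
    for stage in STAGES:
        if not checks[stage]():
            # Assumption: only a canary failure needs a rollback, because
            # earlier stages fail before anything reaches production traffic.
            decision = "rollback" if stage == "canary" else "blocked"
            return decision, stage
    return "promote", ""


all_green = {stage: (lambda: True) for stage in STAGES}
print(run_pipeline(all_green))       # ('promote', '')

safety_fails = dict(all_green, safety_policy_eval=lambda: False)
print(run_pipeline(safety_fails))    # ('blocked', 'safety_policy_eval')
```

Ordering cheap checks first means an obviously broken candidate fails in the smoke eval before the slower suites and the load test run.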

Gate Matrix

Change	Required Gates
Prompt template	Golden set, prompt-injection cases, latency/token budget.
Model weights	Target eval, regression eval, safety eval, calibration, serving benchmark.
LoRA adapter	Base/adapter compatibility, merge comparison, task and regression evals.
Retrieval index	Recall@k, citation support, ACL tests, freshness checks.
Tool schema	Authorization tests, idempotency, trajectory evals, audit replay.
vLLM/runtime	Latency, throughput, KV-cache pressure, output compatibility, rollback.
Quantization	Quality, calibration, safety, and hardware-specific performance tests.
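
The matrix lends itself to a lookup table that a pipeline can query for the gates a release needs. The sketch below encodes the rows above with hypothetical slug names; treating identically named gates (for example calibration) as one shared gate across change types is an assumption.

```python
# Gate names follow the matrix above; the slugs themselves are illustrative.
GATE_MATRIX = {
    "prompt_template": ["golden_set", "prompt_injection", "latency_token_budget"],
    "model_weights": ["target_eval", "regression_eval", "safety_eval",
                      "calibration", "serving_benchmark"],
    "lora_adapter": ["base_adapter_compatibility", "merge_comparison",
                     "task_eval", "regression_eval"],
    "retrieval_index": ["recall_at_k", "citation_support", "acl_tests",
                        "freshness_checks"],
    "tool_schema": ["authorization_tests", "idempotency", "trajectory_evals",
                    "audit_replay"],
    "runtime": ["latency", "throughput", "kv_cache_pressure",
                "output_compatibility", "rollback"],
    "quantization": ["quality", "calibration", "safety_eval",
                     "hardware_performance"],
}


def required_gates(changes):
    """Return the de-duplicated gates for a set of changes, in first-seen order."""
    gates = []
    for change in changes:
        if change not in GATE_MATRIX:
            # Fail loudly so an unmapped change type cannot skip its gates.
            raise ValueError(f"no gates defined for change type: {change}")
        for gate in GATE_MATRIX[change]:
            if gate not in gates:
                gates.append(gate)
    return gates


print(required_gates(["prompt_template", "retrieval_index"]))
```

A release that touches several rows runs the union of their gates, so a combined model and quantization change does not run the calibration check twice.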

Human Review

Human review is most useful for ambiguous usefulness, policy boundary cases, preference labels, and incident samples. It is weakest when reviewers lack clear labeling instructions, or when results are reported only as aggregate scores that hide severe individual failures.

Review packet:

case_id:
input:
retrieved_context:
candidate_output:
baseline_output:
expected_behavior:
risk_slice:
reviewer_label:
notes:

Study Cards

Question

Why does ML CI/CD need behavioral gates?

Answer

Model, prompt, retrieval, runtime, or policy changes can pass builds while changing user-visible behavior.

Question

What is a golden set?

Answer

A set of critical cases that must not regress across releases.

Question

Why compare against a baseline?

Answer

Absolute scores can hide regressions that only appear relative to the current production system.
