Inference Runbooks

Inference incidents need fast symptom splitting. Avoid starting with “the GPU is slow.” First classify the failure as behavior, queueing, prefill, decode, memory, API, routing, or release regression.

Command Examples

curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics | grep -E 'queue|token|kv|error|request'
nvidia-smi

What each command returns and what it separates:

Command	What it returns	What it separates
`curl -s http://localhost:8000/v1/models`	JSON payload listing the served models, or a connection, TLS, or proxy error.	Separates reachability, TLS, and proxy problems from application behavior.
`curl -s http://localhost:8000/metrics | grep -E 'queue|token|kv|error|request'`	Metric lines whose names match queue, token, kv, error, or request.	Separates queueing, token throughput, KV-cache pressure, and error counts as the server reports them.
`nvidia-smi`	GPU utilization, memory use, driver and CUDA versions, and the processes on each GPU.	Separates accelerator visibility from model-serving capacity and latency, which it does not report.
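
When the metrics scrape has to be compared across replicas or over time, it can help to parse it rather than read grep output. The following is a minimal sketch, assuming the endpoint serves the Prometheus text format without trailing timestamps. The metric names in `SAMPLE` are hypothetical; real names depend on the serving engine.

```python
import re

# Same keyword filter as the grep in the command examples.
PATTERN = re.compile(r"queue|token|kv|error|request")

def filter_metrics(text: str) -> dict[str, float]:
    """Return {sample name (with labels): value} for matching sample lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not PATTERN.search(line):
            continue
        # Assumption: the value is the last space-separated field.
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples

# Hypothetical scrape body, for illustration only.
SAMPLE = """\
# HELP demo_queue_depth Requests waiting for admission.
# TYPE demo_queue_depth gauge
demo_queue_depth 7
demo_kv_cache_usage_ratio 0.83
demo_request_errors_total{code="500"} 2
process_cpu_seconds_total 12.5
"""

print(filter_metrics(SAMPLE))
```

Saving one parsed snapshot per replica makes it easier to spot the worker whose queue or cache numbers differ from the rest.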

Record route, model ID, adapter, tokenizer/template revision, runtime version, generation config, prompt tokens, output tokens, tenant, and release time.
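
One way to make sure none of these fields is skipped is to capture them in a fixed structure. This is a sketch only: the class name, field names, and sample values are hypothetical and should be adapted to whatever the incident tooling already uses.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class IncidentContext:
    """Fields to record before debugging; names are illustrative."""
    route: str
    model_id: str
    adapter: Optional[str]
    tokenizer_revision: str
    template_revision: str
    runtime_version: str
    generation_config: dict
    prompt_tokens: int
    output_tokens: int
    tenant: str
    release_time: str  # ISO 8601 time of the most recent release

def to_log_line(ctx: IncidentContext) -> str:
    """Serialize with sorted keys so two records diff cleanly."""
    return json.dumps(asdict(ctx), sort_keys=True)

# Hypothetical values, for illustration only.
ctx = IncidentContext(
    route="/v1/chat/completions",
    model_id="demo-model",
    adapter=None,
    tokenizer_revision="rev-a",
    template_revision="rev-b",
    runtime_version="0.0.0",
    generation_config={"temperature": 0.0, "max_tokens": 256},
    prompt_tokens=1200,
    output_tokens=256,
    tenant="tenant-1",
    release_time="2024-01-01T00:00:00Z",
)
print(to_log_line(ctx))
```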

Symptom Split

Symptom	First Question
Wrong output	Does a deterministic single-call repro fail outside the serving engine?
High TTFT	Is the delay queueing or prefill?
Slow streaming	Is ITL high on server metrics or only at the client?
OOM/rejections	Are weights, KV cache, or workspace the pressure source?
Low throughput	Is bottleneck compute, memory bandwidth, scheduler, or routing?
Queue buildup	Are workers healthy and admitting requests?
Cache preemptions	Are prompt/output/concurrency limits too high?
Runtime regression	Did model, tokenizer, adapter, quantization, or engine change?

Wrong Output

Reproduce with deterministic decoding.
Compare model revision, tokenizer, chat template, generation config, and adapter.
Render prompt text and token IDs.
Test same input in a reference runtime.
Check retrieval/tool context if present.
Roll back changed artifact if release-linked.
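
Comparing token IDs between the serving path and a reference runtime is easier with the first point of disagreement in hand. The helper below is a small sketch; `first_divergence` is a hypothetical name, and the token IDs are made up. It assumes both sequences were produced from the same input with deterministic decoding.

```python
from itertools import zip_longest
from typing import Optional, Sequence

def first_divergence(served: Sequence[int], reference: Sequence[int]) -> Optional[int]:
    """Index of the first position where the token IDs differ, else None.

    A length mismatch counts as a divergence at the end of the shorter sequence.
    """
    for i, (a, b) in enumerate(zip_longest(served, reference)):
        if a != b:
            return i
    return None

# Made-up token IDs, for illustration only.
served_ids = [101, 7592, 2088, 102]
reference_ids = [101, 7592, 2154, 102]
print(first_divergence(served_ids, reference_ids))
```

A divergence at index 0 of the prompt tokens usually points at the tokenizer or chat template; a divergence only in the output tokens points at the model, adapter, generation config, or runtime.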

High TTFT

Evidence	Meaning
High queue time	Admission, capacity, or worker health issue.
Long prompt tokens	Prefill-heavy request shape.
Low prefix hit rate	Prefix caching not helping.
High waiting requests	Scheduler pressure.
Cold model	Load/warmup path.

Levers: reduce prompt tokens, add replicas, route long prompts, enable prefix caching, use chunked prefill, lower concurrency per replica, or scale hardware.
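
The queueing-versus-prefill question can be answered per request if three timestamps are available. The sketch below assumes the server logs arrival time, the time the scheduler admitted the request, and the time of the first token; the function and key names are hypothetical.

```python
def split_ttft(arrival_s: float, scheduled_s: float, first_token_s: float) -> dict:
    """Split time to first token into queue time and prefill time."""
    queue_s = scheduled_s - arrival_s        # waiting for admission
    prefill_s = first_token_s - scheduled_s  # prompt processing until first token
    return {
        "ttft_s": first_token_s - arrival_s,
        "queue_s": queue_s,
        "prefill_s": prefill_s,
        "dominant": "queueing" if queue_s >= prefill_s else "prefill",
    }

# Illustrative timestamps in seconds.
print(split_ttft(arrival_s=0.0, scheduled_s=1.5, first_token_s=2.0))
```

If queueing dominates, the capacity and admission levers apply; if prefill dominates, the prompt-length and prefix-caching levers apply.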

Slow Inter-Token Latency

Check output token counts, decode tokens/sec, KV-cache usage, GPU memory bandwidth, batch occupancy, speculative acceptance rate, and client or network buffering. Levers: a smaller model, speculative decoding, quantization, a decode-focused pool, lower output caps, or more hardware.
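
To tell server-side decode latency from client or network buffering, compare the gaps between token timestamps on both sides. This is a crude sketch that compares only the largest gap; the `slack` threshold and all names are assumptions, and each sequence needs at least two timestamps.

```python
from typing import Sequence

def gaps(timestamps_s: Sequence[float]) -> list[float]:
    """Differences between consecutive token timestamps, in seconds."""
    return [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]

def locate_itl(server_ts: Sequence[float], client_ts: Sequence[float],
               slack: float = 1.5) -> dict:
    """Report where the largest inter-token gap appears to be introduced."""
    server_max = max(gaps(server_ts))
    client_max = max(gaps(client_ts))
    # Buffering delivers tokens in bursts: the client sees larger gaps than the server emitted.
    downstream = client_max > slack * server_max
    return {
        "server_max_gap_s": server_max,
        "client_max_gap_s": client_max,
        "source": "client_or_network" if downstream else "server",
    }

# Illustrative timestamps: the server emits steadily, the client receives pairs.
server_ts = [0.0, 0.25, 0.5, 0.75, 1.0]
client_ts = [0.0, 0.0, 0.5, 0.5, 1.0]
print(locate_itl(server_ts, client_ts))
```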

OOM and Cache Pressure

Compare weight memory estimate to available GPU memory.
Estimate KV cache for p95 prompt plus output at active concurrency.
Check kv_cache_usage, preemptions, swaps, and rejected requests.
Lower max_model_len, output cap, or concurrency.
Add KV quantization, more memory, or separate long-context pool.
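
The KV-cache estimate in the steps above can be sketched as a short calculation. It assumes a standard multi-head or grouped-query attention layout, where each token stores one key and one value vector per layer per KV head, and it ignores block fragmentation and runtime workspace. The model shape and traffic numbers are illustrative, not taken from a specific model.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_bytes(per_token: int, prompt_tokens: int, output_tokens: int,
                   concurrency: int) -> int:
    """KV cache needed for `concurrency` sequences at the given lengths."""
    return per_token * (prompt_tokens + output_tokens) * concurrency

# Illustrative shape: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128,
                               bytes_per_element=2)
# Illustrative traffic: p95 prompt of 3000 tokens, 1096 output tokens, 16 active sequences.
total = kv_cache_bytes(per_token, prompt_tokens=3000, output_tokens=1096,
                       concurrency=16)
print(per_token // 1024, "KiB per token")
print(total / 1024**3, "GiB total")
```

Subtracting the weight memory estimate and this figure from available GPU memory shows how much headroom is left for workspace before preemptions or rejections start.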

Low Throughput

Cause	Check
Small batches	Batch occupancy and request arrival pattern.
Sequence skew	Long outputs holding decode slots.
Tensor parallel overhead	Interconnect and NCCL metrics.
CPU bottleneck	Tokenization, routing, logging, JSON serialization.
Kernel mismatch	Runtime version, quantization, unsupported shape.

Release Regression

Use this rollback checklist:

Identify changed model/runtime/tokenizer/template/adapter/quantization/flags.
Compare golden prompts and structured-output evals.
Compare TTFT, ITL, queue, cache, and error metrics.
Stop the canary if the p95 latency, error rate, or quality gate fails.
Restore previous artifact bundle and flags.
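
The canary stop condition in the checklist above can be expressed as an explicit gate. This is a sketch under stated assumptions: the metric keys and the three thresholds are hypothetical and should come from the service's own SLOs and eval suite.

```python
def canary_stop_reasons(baseline: dict, canary: dict,
                        max_p95_ratio: float = 1.2,
                        max_error_rate: float = 0.01,
                        min_quality_pass_rate: float = 0.95) -> list[str]:
    """Return the gates the canary fails; an empty list means it may proceed."""
    reasons = []
    if canary["p95_latency_s"] > max_p95_ratio * baseline["p95_latency_s"]:
        reasons.append("p95 latency")
    if canary["error_rate"] > max_error_rate:
        reasons.append("error rate")
    if canary["quality_pass_rate"] < min_quality_pass_rate:
        reasons.append("quality gate")
    return reasons

# Illustrative numbers only.
baseline = {"p95_latency_s": 2.0, "error_rate": 0.001, "quality_pass_rate": 0.99}
canary = {"p95_latency_s": 3.0, "error_rate": 0.001, "quality_pass_rate": 0.99}
print(canary_stop_reasons(baseline, canary))
```

Any non-empty result is a signal to stop the canary and restore the previous artifact bundle and flags.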

Study Cards

Question

What is the first split in an inference incident?

Answer

Separate behavior, queueing, prefill, decode, memory, API, routing, and release-regression causes.

Question

What usually drives high TTFT?

Answer

Queue time, long prompt prefill, cold starts, low prefix-cache hits, or insufficient capacity.

Question

What usually drives slow ITL?

Answer

Decode bottlenecks, memory bandwidth, long outputs, poor batch occupancy, or ineffective speculation.
