Tech Study Guide
Inference Runbooks
Operational runbooks for LLM inference incidents: wrong output, high TTFT, slow inter-token latency, OOM, low throughput, queue buildup, cache pressure, tokenizer mismatch, and runtime regressions.
Inference incidents need fast symptom splitting. Avoid starting with “the GPU is slow.” First classify the failure as behavior, queueing, prefill, decode, memory, API, routing, or release regression.
Command Examples
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics | grep -E 'queue|token|kv|error|request'
nvidia-smi
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| curl -s http://localhost:8000/v1/models | JSON payload (typically the served model IDs), or a connection, TLS, or proxy failure. | Separates reachability, TLS, proxy, and application behavior. |
| curl -s http://localhost:8000/metrics \| grep -E 'queue\|token\|kv\|error\|request' | Metric lines whose text matches queue, token, kv, error, or request. | Surfaces queueing, token, KV-cache, error, and request metrics from the serving process. |
| nvidia-smi | GPU utilization, memory use, driver and CUDA versions, and GPU processes. | Separates accelerator visibility from model-serving capacity and latency. |
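The metrics grep above can also be done in a script when the values need to be compared or logged. The following is a minimal sketch, assuming the endpoint returns Prometheus-style text lines of the form `name value` with no trailing timestamps; the metric names in `SAMPLE` are hypothetical and not taken from any particular runtime.

```python
import re

def filter_metrics(text, pattern=r"queue|token|kv|error|request"):
    """Return {sample_name: value} for metric samples whose name matches pattern.

    Assumes each sample line is "<name-with-optional-labels> <value>".
    Comment lines (# HELP / # TYPE) are skipped, unlike plain grep.
    """
    rx = re.compile(pattern)
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not rx.search(name):
            continue
        try:
            out[name] = float(value)
        except ValueError:
            continue  # not a numeric sample; ignore it
    return out

# Hypothetical metric names, for illustration only.
SAMPLE = """\
# HELP example_queue_depth Hypothetical metric name.
# TYPE example_queue_depth gauge
example_queue_depth 7
example_kv_cache_usage_ratio{pool="a"} 0.91
example_uptime_seconds 1234
"""

if __name__ == "__main__":
    for name, value in filter_metrics(SAMPLE).items():
        print(name, value)
```

In practice `text` would be the body fetched from the `/metrics` endpoint rather than the inline sample.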
Record route, model ID, adapter, tokenizer/template revision, runtime version, generation config, prompt tokens, output tokens, tenant, and release time.
Symptom Split
| Symptom | First Question |
|---|---|
| Wrong output | Does a deterministic single-call repro fail outside the serving engine? |
| High TTFT | Is the delay queueing or prefill? |
| Slow streaming | Is ITL high on server metrics or only at the client? |
| OOM/rejections | Are weights, KV cache, or workspace the pressure source? |
| Low throughput | Is bottleneck compute, memory bandwidth, scheduler, or routing? |
| Queue buildup | Are workers healthy and admitting requests? |
| Cache preemptions | Are prompt/output/concurrency limits too high? |
| Runtime regression | Did model, tokenizer, adapter, quantization, or engine change? |
Wrong Output
- Reproduce with deterministic decoding.
- Compare model revision, tokenizer, chat template, generation config, and adapter.
- Render prompt text and token IDs.
- Test same input in a reference runtime.
- Check retrieval/tool context if present.
- Roll back changed artifact if release-linked.
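When comparing rendered token IDs between the serving path and a reference runtime, the first position where the two sequences differ is usually the most useful fact. A minimal sketch, assuming both sides can export token IDs as lists of integers; the IDs below are made up for illustration.

```python
def first_divergence(ids_a, ids_b):
    """Return the first index where two token-ID sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, returns the length of the shorter one.
    """
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None

if __name__ == "__main__":
    serving = [1, 518, 25580, 29962, 15043]    # hypothetical IDs from the serving engine
    reference = [1, 518, 25580, 29962, 29871]  # hypothetical IDs from the reference runtime
    print(first_divergence(serving, reference))
```

A divergence inside the prompt tokens points at the tokenizer or chat template; a divergence only in generated tokens points at the model, adapter, or generation config.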
High TTFT
| Evidence | Meaning |
|---|---|
| High queue time | Admission, capacity, or worker health issue. |
| Long prompt tokens | Prefill-heavy request shape. |
| Low prefix hit rate | Prefix caching not helping. |
| High waiting requests | Scheduler pressure. |
| Cold model | Load/warmup path. |
Levers: reduce prompt tokens, add replicas, route long prompts, enable prefix caching, use chunked prefill, lower concurrency per replica, or scale hardware.
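The queueing-versus-prefill question can be answered per request if three timestamps are available. This is a sketch under that assumption: `arrival_s`, `scheduled_s`, and `first_token_s` are hypothetical field names, and a given runtime may expose these times under different names or only as aggregated histograms.

```python
def ttft_split(arrival_s, scheduled_s, first_token_s):
    """Split time-to-first-token into queue time and prefill time.

    arrival_s:     when the request reached the server
    scheduled_s:   when the scheduler admitted it for execution
    first_token_s: when the first output token was produced
    """
    queue = scheduled_s - arrival_s
    prefill = first_token_s - scheduled_s
    return {
        "ttft_s": first_token_s - arrival_s,
        "queue_s": queue,
        "prefill_s": prefill,
        "dominant": "queueing" if queue > prefill else "prefill",
    }

if __name__ == "__main__":
    print(ttft_split(arrival_s=0.0, scheduled_s=1.5, first_token_s=2.0))
```

If queueing dominates, the capacity and admission levers apply; if prefill dominates, the prompt-length, prefix-caching, and chunked-prefill levers apply.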
Slow Inter-Token Latency
Check output tokens, decode tokens/sec, KV-cache usage, GPU memory bandwidth, batch occupancy, speculative acceptance rate, and client/network buffering. Levers include a smaller model, speculation, quantization, a decode-focused pool, lower output caps, or more hardware.
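To tell server-side decode slowness from client or network buffering, compute inter-token gaps from token arrival timestamps on both sides and compare them. A minimal sketch, assuming a list of per-token timestamps in seconds is available; the function names are illustrative.

```python
from statistics import mean

def inter_token_latencies(timestamps_s):
    """Gaps between consecutive token timestamps, in seconds."""
    return [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]

def itl_summary(timestamps_s):
    """Mean and max inter-token latency plus decode tokens/sec, or None if < 2 tokens."""
    gaps = inter_token_latencies(timestamps_s)
    if not gaps:
        return None
    avg = mean(gaps)
    return {"mean_s": avg, "max_s": max(gaps), "tokens_per_s": 1.0 / avg}

if __name__ == "__main__":
    # Hypothetical client-side arrival times; the last gap hints at a stall or buffering.
    print(itl_summary([0.0, 0.25, 0.5, 0.75, 1.5]))
```

A low mean with a large max gap on the client, but not in server metrics, suggests buffering between server and client rather than a decode bottleneck.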
OOM and Cache Pressure
- Compare weight memory estimate to available GPU memory.
- Estimate KV cache for p95 prompt plus output at active concurrency.
- Check kv_cache_usage, preemptions, swaps, and rejected requests.
- Lower max_model_len, output cap, or concurrency.
- Add KV quantization, more memory, or separate long-context pool.
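The weight and KV-cache estimates in the checklist above can be approximated with simple arithmetic. This sketch assumes a standard transformer with multi-head or grouped-query attention, where each token stores one key and one value vector per layer per KV head; architectures that compress the cache differently need a different formula, and real runtimes add workspace and fragmentation overhead on top. The model shape in the example is hypothetical.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Approximate weight memory; 2 bytes per parameter corresponds to FP16/BF16."""
    return num_params * bytes_per_param

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   tokens_per_seq, concurrency, bytes_per_elem=2):
    """Approximate KV-cache memory for `concurrency` sequences of `tokens_per_seq` tokens.

    The leading factor 2 accounts for storing both keys and values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens_per_seq * concurrency

if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache,
    # p95 prompt plus output of 4096 tokens, 16 active sequences.
    kv = kv_cache_bytes(32, 8, 128, tokens_per_seq=4096, concurrency=16)
    print(f"KV cache: {kv / GIB:.1f} GiB")
    print(f"Weights:  {weight_bytes(7_000_000_000) / GIB:.1f} GiB")
```

Halving `bytes_per_elem` models KV quantization to 8 bits, and lowering `tokens_per_seq` or `concurrency` models the max_model_len, output cap, and concurrency levers.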
Low Throughput
| Cause | Check |
|---|---|
| Small batches | Batch occupancy and request arrival pattern. |
| Sequence skew | Long outputs holding decode slots. |
| Tensor parallel overhead | Interconnect and NCCL metrics. |
| CPU bottleneck | Tokenization, routing, logging, JSON serialization. |
| Kernel mismatch | Runtime version, quantization, unsupported shape. |
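Batch occupancy, the check for the small-batches row, can be estimated from periodic samples of how many sequences are running. A rough sketch, assuming such samples are available; `max_batch_size` is a hypothetical name for whatever per-replica concurrency limit the runtime enforces.

```python
from statistics import mean

def batch_occupancy(running_samples, max_batch_size):
    """Average fraction of batch slots in use across sampled scheduler steps."""
    if not running_samples or max_batch_size <= 0:
        raise ValueError("need at least one sample and a positive max_batch_size")
    return mean(running_samples) / max_batch_size

if __name__ == "__main__":
    # Hypothetical samples of running sequences, against a limit of 16.
    print(batch_occupancy([4, 8, 12], max_batch_size=16))
```

Low occupancy with a non-empty queue points at the scheduler or routing; low occupancy with an empty queue points at the request arrival pattern.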
Release Regression
Use this rollback checklist:
- Identify changed model/runtime/tokenizer/template/adapter/quantization/flags.
- Compare golden prompts and structured-output evals.
- Compare TTFT, ITL, queue, cache, and error metrics.
- Stop the canary if p95 latency, error rate, or quality gates fail.
- Restore previous artifact bundle and flags.
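The canary stop condition from the checklist above can be written as an explicit gate so the decision is not made by eye. This is a sketch only: the dictionary keys and the default thresholds (`max_p95_ratio`, `max_error_rate`, `min_quality`) are hypothetical and should be replaced with the team's own gates.

```python
def canary_gate(baseline, canary, max_p95_ratio=1.2,
                max_error_rate=0.01, min_quality=0.95):
    """Return the list of failed gates; an empty list means the canary may continue."""
    failures = []
    if canary["p95_latency_s"] > baseline["p95_latency_s"] * max_p95_ratio:
        failures.append("p95_latency")
    if canary["error_rate"] > max_error_rate:
        failures.append("error_rate")
    if canary["quality_score"] < min_quality:
        failures.append("quality")
    return failures

if __name__ == "__main__":
    baseline = {"p95_latency_s": 1.0, "error_rate": 0.0, "quality_score": 0.99}
    canary = {"p95_latency_s": 1.5, "error_rate": 0.0, "quality_score": 0.99}
    failed = canary_gate(baseline, canary)
    print("stop canary and roll back:" if failed else "continue:", failed)
```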
Study Cards
What is the first split in an inference incident?
Separate behavior, queueing, prefill, decode, memory, API, routing, and release-regression causes.
What usually drives high TTFT?
Queue time, long prompt prefill, cold starts, low prefix-cache hits, or insufficient capacity.
What usually drives slow ITL?
Decode bottlenecks, memory bandwidth, long outputs, poor batch occupancy, or ineffective speculation.