Inference Benchmarking

Inference benchmarks should answer a production question: can this model/runtime/hardware combination meet quality, latency, throughput, and cost goals for the real request mix? A benchmark that only reports average tokens/sec on short prompts is usually not enough.

Command Examples

curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics | grep -E 'queue|token|kv|request'
nvidia-smi dmon -s pucm

Example output and meaning:

Command	Example output	What it does
`curl -s http://localhost:8000/v1/models`	`JSON listing the model IDs the server exposes, or a connection/TLS/proxy error.`	Confirms the endpoint is reachable and shows which model it is actually serving.
`curl -s http://localhost:8000/metrics | grep -E 'queue|token|kv|request'`	`Metric lines whose names contain queue, token, kv, or request.`	Shows queue depth, token counters, KV-cache usage, and request counts while the benchmark runs.
`nvidia-smi dmon -s pucm`	`One row per GPU per sampling interval: power and temperature, utilization, clocks, and memory use.`	Separates accelerator visibility and load from model-serving capacity and latency.

Record model revision, tokenizer, runtime version, GPU type, driver, CUDA version, precision, quantization, and serving flags with every result.
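One way to make that habit hard to skip is to store the configuration in the same record as the numbers. The sketch below is illustrative only: the `RunMetadata` class, the `tag_result` helper, and the field names are assumptions made for this example, not part of any serving runtime.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Configuration that must travel with every benchmark result (hypothetical schema)."""
    model_revision: str
    tokenizer: str
    runtime_version: str
    gpu_type: str
    driver: str
    cuda_version: str
    precision: str
    quantization: str
    serving_flags: tuple = ()


def tag_result(metadata: RunMetadata, metrics: dict) -> str:
    """Serialize one result together with the configuration that produced it."""
    record = {"metadata": asdict(metadata), "metrics": metrics}
    return json.dumps(record, sort_keys=True)
```

Writing one such JSON line per run keeps results comparable later, because two numbers are only comparable when their metadata matches on everything except the variable under test.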

Benchmark Design

Axis	Include
Prompt length	p50, p90, p95, p99, and max allowed.
Output length	p50, p90, p95, p99, and max allowed.
Concurrency	Steady state, burst, overload, tenant hot spot.
Traffic mix	Short chat, long RAG, summarization, agents, batch.
Generation config	Deterministic settings (for example greedy decoding with a fixed seed) and the product's default sampling settings.
Runtime variants	Baseline, quantized, prefix cache, speculation, new engine.
Quality gates	Golden prompts, structured output, safety slices.
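The traffic-mix axis can be made reproducible by sampling request shapes from a seeded, weighted mix. This is a minimal sketch: the class names echo the table above, but the weights, token ranges, and the `sample_requests` helper are placeholder assumptions and should be replaced with values derived from production logs.

```python
import random

# Hypothetical mix: weights and token ranges are placeholders, not measurements.
TRAFFIC_MIX = {
    "short_chat": {"weight": 0.5, "prompt_tokens": (50, 500), "max_output_tokens": 256},
    "long_rag": {"weight": 0.3, "prompt_tokens": (2000, 8000), "max_output_tokens": 512},
    "summarization": {"weight": 0.2, "prompt_tokens": (1000, 4000), "max_output_tokens": 300},
}


def sample_requests(mix, n, seed=0):
    """Draw n request shapes from the weighted mix; the seed makes runs repeatable."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name]["weight"] for name in names]
    requests = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        low, high = mix[name]["prompt_tokens"]
        requests.append({
            "class": name,
            "prompt_tokens": rng.randint(low, high),  # inclusive on both ends
            "max_output_tokens": mix[name]["max_output_tokens"],
        })
    return requests
```

Keeping the seed in the benchmark report lets a later run replay the same request sequence against a different runtime variant.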

Metrics

Metric	Why It Matters
TTFT p50/p95/p99	User-visible wait before output starts.
ITL p50/p95/p99	Streaming smoothness during decode.
End-to-end latency	Total request duration.
Prompt tokens/sec	Prefill throughput.
Generation tokens/sec	Decode throughput.
Queue time	Admission and capacity pressure.
Active/waiting requests	Scheduler health.
KV-cache utilization	Memory headroom.
Preemptions/swaps/recomputes	Cache pressure symptoms.
Error/retry/reject rate	Reliability under load.
Cost per 1K tokens	Unit economics.
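Several of these metrics fall out of per-token arrival timestamps recorded by the load generator. The sketch below assumes the client records the send time and the arrival time of each streamed token in seconds. The helper names are hypothetical, and the cost formula assumes a flat hourly hardware price and a sustained token rate.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def request_metrics(sent_at, token_times):
    """Return (TTFT, inter-token latencies, end-to-end latency) for one streamed request."""
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - sent_at
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    e2e = token_times[-1] - sent_at
    return ttft, itl, e2e


def cost_per_1k_tokens(hourly_cost, tokens_per_second):
    """Unit cost, assuming a flat hourly price and a sustained token rate."""
    return hourly_cost / (tokens_per_second * 3600) * 1000
```

Collect TTFT and ITL samples across all requests in a run and pass each list to `percentile` with 50, 95, and 99. Computing percentiles per request and then averaging them would hide the tail again.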

Methodology

Warm up model load, CUDA context, kernels, and cache allocator.
Run a single-request correctness smoke test.
Run fixed-shape tests to isolate prefill and decode.
Run production-shape traffic mixes.
Sweep concurrency until SLO failure.
Repeat with quantization, prefix cache, speculation, and engine changes.
Pair every latency result with quality and safety evals.
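The concurrency sweep can be sketched as a loop that stops at the first level that misses the SLO. In this illustration `measure_p95_ttft` stands in for a real load-generator run, and `toy_measure` is a made-up latency model used only so that the example runs on its own.

```python
def sweep_concurrency(measure_p95_ttft, levels, slo_seconds):
    """Try each concurrency level in order and stop at the first SLO failure.

    Returns (highest passing level or None, list of (level, p95, passed)).
    """
    results = []
    last_passing = None
    for level in levels:
        p95 = measure_p95_ttft(level)
        passed = p95 <= slo_seconds
        results.append((level, p95, passed))
        if not passed:
            break
        last_passing = level
    return last_passing, results


def toy_measure(level):
    """Stand-in for a real measurement: latency grows linearly with concurrency."""
    return 0.1 + 0.02 * level
```

With `toy_measure`, levels `[1, 2, 4, 8, 16, 32, 64]`, and a 0.5 second SLO, the sweep passes through 16 and stops at 32. A real sweep would run each level long enough to reach steady state and would record reject and error rates next to the latency figure.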

Anti-Patterns

Anti-Pattern	Why It Misleads
Average-only latency	Hides p95/p99 tail behavior.
Short prompts only	Misses prefill and KV-cache pressure.
Fixed output length only	Misses decode occupancy variation.
No warmup	Measures cold start instead of steady state.
No quality gate	Faster output may be worse output.
Ignoring rejects	Overload may look fast because excess requests were rejected or dropped instead of served.
Single tenant only	Misses noisy-neighbor and routing behavior.
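The first and sixth rows are easy to demonstrate with synthetic numbers. The sketch below uses invented latencies (98 fast requests and 2 slow ones) and a hypothetical `summarize` helper. It reports the reject rate next to the latency figures so that dropped requests cannot flatter the result.

```python
import statistics


def summarize(latencies, rejected):
    """Report mean, median, p99, and reject rate for one run."""
    total = len(latencies) + rejected
    return {
        "mean_s": statistics.mean(latencies),
        "p50_s": statistics.median(latencies),
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
        "p99_s": statistics.quantiles(latencies, n=100)[98],
        "reject_rate": rejected / total,
    }


# Synthetic data: 98 requests at 0.2 s and 2 requests at 5.0 s, plus 25 rejects.
report = summarize([0.2] * 98 + [5.0] * 2, rejected=25)
```

Here the mean is roughly 0.3 seconds while the p99 is 5.0 seconds, and one request in five was never served. Reporting only the mean would hide both facts.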

Benchmark Report

Field	Value
Model / revision
Tokenizer / template
Runtime / version
Hardware / driver
Precision / quantization
Serving flags
Traffic mix
SLO target
Pass/fail decision
Rollback risk

Study Cards

Question

Why are average tokens/sec benchmarks weak?

Answer

They can hide tail latency, prompt-shape effects, quality regressions, and overload behavior.

Question

What should every inference benchmark record?

Answer

Model, tokenizer, runtime, hardware, precision, serving flags, traffic mix, latency, throughput, quality, and errors.

Question

Why benchmark prompt and generation tokens separately?

Answer

Prefill and decode stress different parts of the serving system.
