vLLM Operations

vLLM is a production-oriented LLM serving engine. Operating it well means controlling request shapes, KV-cache pressure, scheduler behavior, model compatibility, runtime flags, metrics, and rollout risk.

Command Examples

vllm serve <model> --help | sed -n '1,160p'
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics | grep '^vllm:'

Example output and meaning:

Command	Example output	What it does
`vllm serve <model> --help | sed -n '1,160p'`	First 160 lines of the usage text: the flags and defaults accepted by the installed vLLM version.	Confirms which flags this build supports before they go into a deployment config.
`curl -s http://localhost:8000/v1/models`	JSON list of the model IDs the server is serving.	Confirms the OpenAI-compatible API is reachable and shows the model name clients must send.
`curl -s http://localhost:8000/metrics | grep '^vllm:'`	Prometheus text samples such as `vllm:num_requests_running` and `vllm:num_requests_waiting`, one per line.	Confirms metrics are exported and gives a quick snapshot of load; the `^vllm:` filter drops the `# HELP` and `# TYPE` comment lines.

These checks prove the server responds. They do not prove capacity or compatibility.
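
To go one step beyond `grep`, the metrics text can be parsed into numbers that a script can compare. The following is a minimal sketch, assuming the endpoint emits the standard Prometheus text format without trailing timestamps. `SAMPLE` is illustrative text written for this example, not captured server output, and the helper names are hypothetical.

```python
# Illustrative sample of /metrics text; label values and numbers are made up.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
process_cpu_seconds_total 41.5
"""


def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Return {series: value} for sample lines whose name starts with 'vllm:'."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("vllm:"):
            continue  # skips '# HELP', '# TYPE', and non-vLLM metrics
        # A sample line is '<name>{labels} <value>' or '<name> <value>'.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def metric_total(samples: dict[str, float], name: str) -> float:
    """Sum one metric across all of its label sets."""
    return sum(
        v for k, v in samples.items()
        if k == name or k.startswith(name + "{")
    )


metrics = parse_vllm_metrics(SAMPLE)
print(metric_total(metrics, "vllm:num_requests_waiting"))
```

Against a live server, the same function could be fed the body returned by the `curl` command above.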

Operational Concepts

Concept	What To Watch
Scheduler	Waiting/running requests, queue time, priority, fairness.
PagedAttention	KV block allocation, fragmentation, cache pressure.
Continuous batching	Batch occupancy under mixed prompt/output lengths.
Prefix caching	Hit rate, tokenized prefix stability, tenant boundaries.
Speculative decoding	Accepted tokens, draft overhead, quality compatibility.
Parallelism	Tensor/pipeline split, NCCL, interconnect, latency.
Adapters	LoRA compatibility, active adapters, adapter-specific latency.
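
The PagedAttention row is easier to reason about with a little arithmetic. This sketch models only the block-counting idea: each sequence holds whole KV blocks, so the last block is usually partly empty. The block size of 16 tokens is an assumption for illustration; check the value your deployment actually uses. The function names are hypothetical and are not vLLM APIs.

```python
import math


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Whole KV blocks required to hold num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


def wasted_slots(num_tokens: int, block_size: int = 16) -> int:
    """Unused token slots in the sequence's last block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


def can_admit(seq_lens: list[int], total_blocks: int, block_size: int = 16) -> bool:
    """True if all sequences fit in the block pool at their current lengths."""
    required = sum(blocks_needed(n, block_size) for n in seq_lens)
    return required <= total_blocks


print(blocks_needed(33), wasted_slots(33))  # a 33-token sequence spills into a third block
print(can_admit([33, 100], total_blocks=9))
```

Because waste is bounded by one block per sequence, cache pressure in this model comes mainly from total token count rather than from fragmentation.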

Important Flags

Flag	Use
`--max-model-len`	Bound context length and KV-cache demand.
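`--gpu-memory-utilization`	Fraction of each GPU's memory vLLM may use for weights, activations, and KV cache; what remains after weights and activations becomes KV-cache capacity.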
`--tensor-parallel-size`	Split tensors across GPUs.
`--pipeline-parallel-size`	Split layers across stages.
`--enable-prefix-caching`	Reuse stable prompt prefixes.
`--generation-config`	Choose whether default sampling parameters come from the model repo's generation config or from vLLM's own defaults.
`--speculative-config`	Enable supported speculative decoding mode.
`--kv-cache-dtype`	Use supported KV-cache dtype/quantization.
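
A rough calculation shows how `--max-model-len` and `--kv-cache-dtype` interact with KV-cache capacity. The sketch below uses the usual per-token estimate for a standard attention layout (one key and one value vector per layer per KV head). The model dimensions and the 20 GiB budget are hypothetical, and the estimate ignores tensor-parallel sharding, block rounding, and runtime overhead.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_full_length_sequences(kv_budget_bytes: int, max_model_len: int,
                              bytes_per_token: int) -> int:
    """How many sequences fit if every one grows to max_model_len tokens."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_bytes_per_token(32, 8, 128, dtype_bytes=2)
fp8 = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)

print(fp16, fp8)
print(max_full_length_sequences(20 * GIB, 8192, fp16))
print(max_full_length_sequences(20 * GIB, 8192, fp8))
```

Under these assumptions, halving the KV-cache element size or halving the maximum context length each doubles the number of worst-case sequences that fit.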

Metrics Checklist

Metric Family	Incident Meaning
Queue time	Admission or scheduler pressure.
TTFT	Time to first token: queue time plus prefill, as the user experiences it.
ITL	Inter-token latency: how smoothly decode proceeds once output starts.
KV-cache usage	Memory pressure.
Preemptions	KV cache ran short during decode, so running requests were evicted and resumed later; a sign of cache or scheduler stress.
Running/waiting requests	Load and fairness.
Prefix cache hits	Prefix caching value.
Token throughput	Prefill and decode capacity.
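
TTFT and ITL can also be measured from the client side when streaming. This is a minimal sketch, assuming you have recorded the time the request was sent and the arrival time of each streamed token (in seconds); the function names and timestamps are illustrative.

```python
def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to first token: arrival of the first token minus send time."""
    return token_times[0] - request_sent


def inter_token_latencies(token_times: list[float]) -> list[float]:
    """Gaps between consecutive token arrivals."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


# Illustrative arrival times for four tokens of one request sent at t = 0.
arrivals = [0.5, 0.75, 1.0, 1.5]

print(ttft(0.0, arrivals))
gaps = inter_token_latencies(arrivals)
print(gaps, max(gaps))
```

A client-side TTFT includes network and proxy time, so it will normally exceed the server-side figure; the difference between the two is itself a useful signal.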

vLLM Runbook

Identify model ID, runtime version, flags, adapter, and request shape.
Split symptom into wrong output, high TTFT, slow ITL, OOM, or errors.
Compare prompt/output token histograms.
Inspect queue time, running/waiting requests, KV usage, and preemptions.
Check tokenizer, chat template, generation config, and adapter compatibility.
If the symptom began with a release, roll back the model, runtime, or config that changed.
Add the request shape to load tests and evals.
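
The histogram comparison in the runbook can be approximated by comparing percentiles of token counts between a baseline window and the incident window. This is a sketch using a nearest-rank percentile; the token counts are made-up examples, and the function names are hypothetical.

```python
def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile using integer arithmetic (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def tail_shift(baseline: list[int], incident: list[int], pct: int = 95) -> float:
    """Ratio of the incident percentile to the baseline percentile."""
    return percentile(incident, pct) / percentile(baseline, pct)


# Made-up prompt token counts per request.
baseline_prompts = [200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
incident_prompts = [200, 300, 400, 500, 600, 700, 800, 900, 4000, 8000]

print(percentile(baseline_prompts, 50), percentile(incident_prompts, 50))
print(percentile(baseline_prompts, 95), percentile(incident_prompts, 95))
print(round(tail_shift(baseline_prompts, incident_prompts), 2))
```

In this example the median is unchanged while the 95th percentile grows sharply, which is the pattern to look for when a few long prompts drive TTFT or KV-cache pressure.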

Study Cards

Question

What does vLLM PagedAttention operate on?

Answer

The KV cache blocks used by active autoregressive sequences.

Question

Which vLLM metrics matter during OOM-like incidents?

Answer

KV-cache usage, preemptions, running/waiting requests, queue time, and token histograms.

Question

Why version vLLM flags?

Answer

Flags such as max model length, memory utilization, generation config, prefix caching, and speculation change capacity and behavior.
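
One way to version flags is to fingerprint the full launch configuration so that dashboards, incidents, and load tests can name the exact setup they ran against. The sketch below is one possible approach, not a vLLM feature; the version string, model ID, and flag values are placeholders.

```python
import hashlib
import json


def config_fingerprint(vllm_version: str, model_id: str, flags: dict) -> str:
    """Short, order-independent hash of the runtime version, model, and flags."""
    payload = json.dumps(
        {"version": vllm_version, "model": model_id, "flags": flags},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
flags_a = {"max-model-len": 8192, "gpu-memory-utilization": 0.9,
           "enable-prefix-caching": True}
flags_b = {"enable-prefix-caching": True, "gpu-memory-utilization": 0.9,
           "max-model-len": 8192}
flags_c = dict(flags_a, **{"max-model-len": 16384})

same = config_fingerprint("0.0.0-example", "example/model", flags_a)
reordered = config_fingerprint("0.0.0-example", "example/model", flags_b)
changed = config_fingerprint("0.0.0-example", "example/model", flags_c)

print(same == reordered, same == changed)
```

Recording the fingerprint alongside metrics makes it possible to tell whether a capacity or behavior change coincided with a flag change.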
