Tech Study Guide
vLLM Operations
Operational guide for vLLM serving with scheduler concepts, PagedAttention, KV cache, prefix caching, speculative decoding, parallelism, metrics, flags, and runbooks.
vLLM is a production-oriented LLM serving engine. Operating it well means controlling request shapes, KV-cache pressure, scheduler behavior, model compatibility, runtime flags, metrics, and rollout risk.
Command Examples
vllm serve <model> --help | sed -n '1,160p'
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics | grep '^vllm:'
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| vllm serve <model> --help \| sed -n '1,160p' | The first 160 lines of CLI help: usage text and the flags the installed version supports. | Shows which flags this vLLM version accepts, without starting a server. |
| curl -s http://localhost:8000/v1/models | A JSON list of served model IDs, or a connection, TLS, or proxy error. | Confirms that the OpenAI-compatible API is reachable and which model names it serves. |
| curl -s http://localhost:8000/metrics \| grep '^vllm:' | Prometheus text-format samples whose names start with vllm:, or no output if the request fails. | Confirms that metrics are exposed and shows current scheduler, cache, and latency values. |
These checks prove the server responds. They do not prove capacity or compatibility.
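The grep over /metrics can also be done in a script when the values need to be compared or alerted on. The following is a minimal sketch, assuming the endpoint returns Prometheus text exposition format; the sample text and the `parse_vllm_metrics` helper are illustrative, and exact metric names vary between vLLM versions.

```python
# Illustrative sample of /metrics output; real names and labels depend on the vLLM version.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 7.0
python_gc_objects_collected_total{generation="0"} 123.0
"""


def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Return {sample name with labels: value} for lines starting with 'vllm:'."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skips blank lines, '# HELP' / '# TYPE' comments, and non-vLLM metrics.
        if not line.startswith("vllm:"):
            continue
        if "}" in line:
            # Labelled sample: everything up to the last '}' is the key.
            head, _, tail = line.rpartition("}")
            key = head + "}"
        else:
            # Unlabelled sample: the key ends at the first space.
            key, _, tail = line.partition(" ")
        # The value is the first field after the key; an optional timestamp may follow.
        samples[key] = float(tail.split()[0])
    return samples


if __name__ == "__main__":
    for key, value in parse_vllm_metrics(SAMPLE).items():
        print(f"{key} = {value}")
```

In practice the text would come from the /metrics endpoint rather than a constant; a dedicated Prometheus client parser is the sturdier choice for anything beyond a quick check.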
Operational Concepts
| Concept | What To Watch |
|---|---|
| Scheduler | Waiting/running requests, queue time, priority, fairness. |
| PagedAttention | KV block allocation, fragmentation, cache pressure. |
| Continuous batching | Batch occupancy under mixed prompt/output lengths. |
| Prefix caching | Hit rate, tokenized prefix stability, tenant boundaries. |
| Speculative decoding | Accepted tokens, draft overhead, quality compatibility. |
| Parallelism | Tensor/pipeline split, NCCL, interconnect, latency. |
| Adapters | LoRA compatibility, active adapters, adapter-specific latency. |
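KV-cache pressure is easier to reason about with a back-of-envelope estimate. The sketch below uses the standard per-token KV size for a decoder-only transformer (one key and one value vector per layer) and rounds each sequence up to whole blocks, as a paged allocator would. The model dimensions, the 16-token block size, and the 16 GiB budget are assumptions for illustration, not values read from vLLM, and the estimate ignores prefix sharing.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: key + value, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Whole blocks a sequence occupies; the last block may be partly empty."""
    return math.ceil(num_tokens / block_size)


def max_concurrent_sequences(cache_budget_bytes: int, tokens_per_seq: int,
                             bytes_per_token: int, block_size: int = 16) -> int:
    """Sequences of a given length that fit in the cache budget at once."""
    total_blocks = cache_budget_bytes // (bytes_per_token * block_size)
    return total_blocks // blocks_needed(tokens_per_seq, block_size)


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print(per_token)                       # 131072 bytes (128 KiB) per token
    print(blocks_needed(1000))             # 63 blocks for a 1000-token sequence
    budget = 16 * 1024**3                  # assumed 16 GiB left for KV cache
    print(max_concurrent_sequences(budget, 1000, per_token))  # 130
```

The same arithmetic shows why long prompts and long outputs both reduce batch occupancy: each active sequence holds blocks for every token it has processed so far.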
Important Flags
| Flag | Use |
|---|---|
| --max-model-len | Bound context length and, with it, worst-case KV-cache demand per request. |
| --gpu-memory-utilization | Set the fraction of each GPU's memory that vLLM may use for weights, activations, and KV cache. |
| --tensor-parallel-size | Split each layer's tensors across GPUs. |
| --pipeline-parallel-size | Split layers across pipeline stages. |
| --enable-prefix-caching | Reuse KV-cache blocks for stable prompt prefixes. |
| --generation-config | Choose whether generation defaults come from the model repo or from vLLM. |
| --speculative-config | Enable a supported speculative decoding mode. |
| --kv-cache-dtype | Use a supported KV-cache dtype or quantization. |
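One way to keep these flags reviewable is to hold them in a config mapping and render the launch command from it. The following is a minimal sketch: `build_serve_command`, the model ID `my-org/my-model`, and the flag values are hypothetical and chosen only for illustration, not as recommendations.

```python
import shlex


def build_serve_command(model: str, flags: dict[str, object]) -> list[str]:
    """Render a 'vllm serve' argv list from a flag mapping."""
    argv = ["vllm", "serve", model]
    for name, value in flags.items():
        if value is True:
            # Boolean switches such as --enable-prefix-caching take no argument.
            argv.append(name)
        elif value is False or value is None:
            # Disabled or unset flags are omitted entirely.
            continue
        else:
            argv.extend([name, str(value)])
    return argv


if __name__ == "__main__":
    # Hypothetical values; choose real ones from capacity tests.
    flags = {
        "--max-model-len": 8192,
        "--gpu-memory-utilization": 0.9,
        "--tensor-parallel-size": 2,
        "--enable-prefix-caching": True,
    }
    print(shlex.join(build_serve_command("my-org/my-model", flags)))
```

Building an argv list instead of a single string avoids shell-quoting mistakes when the command is passed to a process launcher.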
Metrics Checklist
| Metric Family | Incident Meaning |
|---|---|
| Queue time | Admission or scheduler pressure. |
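| TTFT (time to first token) | User-visible wait before output starts: queue time plus prefill. |
| ITL (inter-token latency) | Decode smoothness once output has started. |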
| KV-cache usage | Memory pressure. |
| Preemptions | Cache or scheduler stress. |
| Running/waiting requests | Load and fairness. |
| Prefix cache hits | Whether prefix caching is actually saving prefill work. |
| Token throughput | Prefill and decode capacity. |
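Latency metrics such as TTFT and ITL are typically exported as Prometheus histograms, so percentiles have to be estimated from cumulative bucket counts. The sketch below applies the usual linear interpolation within a bucket; the bucket bounds and counts are made-up illustration data, and `histogram_quantile` here is a local helper, not a vLLM or Prometheus client function.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with an infinite bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate; report the highest finite bound reached.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical TTFT buckets in seconds: (le, cumulative count).
    ttft = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
    print(round(histogram_quantile(0.5, ttft), 3))   # 0.42
    print(histogram_quantile(0.95, ttft))            # 1.0
```

A result equal to the largest finite bound, as in the second call, means the true percentile lies beyond the configured buckets and should be read as a lower bound.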
vLLM Runbook
- Identify model ID, runtime version, flags, adapter, and request shape.
- Classify the symptom: wrong output, high TTFT, slow ITL, OOM, or errors.
- Compare prompt/output token histograms.
- Inspect queue time, running/waiting requests, KV usage, and preemptions.
- Check tokenizer, chat template, generation config, and adapter compatibility.
- Roll back model/runtime/config if release-linked.
- Add the request shape to load tests and evals.
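The metric-inspection step of the runbook can be sketched as a small triage function. Everything here is an assumption for illustration: the `Snapshot` fields, the thresholds, and the finding labels are hypothetical and should be replaced with values derived from your own SLOs and dashboards.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time readings taken from serving metrics."""
    waiting: int            # requests waiting in the scheduler queue
    running: int            # requests currently being decoded
    kv_usage: float         # KV-cache usage as a fraction, 0.0 to 1.0
    preemptions_delta: int  # preemptions since the previous snapshot
    ttft_p95_s: float       # p95 time to first token, seconds
    itl_p95_s: float        # p95 inter-token latency, seconds


def triage(s: Snapshot, ttft_slo_s: float = 2.0, itl_slo_s: float = 0.2) -> list[str]:
    """Map a snapshot to coarse findings; thresholds are illustrative defaults."""
    findings: list[str] = []
    if s.preemptions_delta > 0 or s.kv_usage >= 0.95:
        findings.append("kv-pressure: check context length and request shapes")
    if s.waiting > 0 and s.ttft_p95_s > ttft_slo_s:
        findings.append("queueing: admission or scheduler pressure")
    if s.itl_p95_s > itl_slo_s:
        findings.append("slow-decode: check batch occupancy and parallelism")
    return findings or ["no-capacity-signal: check tokenizer, template, and config"]


if __name__ == "__main__":
    snap = Snapshot(waiting=12, running=8, kv_usage=0.97,
                    preemptions_delta=3, ttft_p95_s=4.1, itl_p95_s=0.05)
    for finding in triage(snap):
        print(finding)
```

The fallback finding mirrors the runbook: when capacity metrics look healthy, the remaining suspects are tokenizer, chat template, generation config, and adapter compatibility.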
Study Cards
What does vLLM PagedAttention operate on?
The KV cache blocks used by active autoregressive sequences.
Which vLLM metrics matter during OOM-like incidents?
KV-cache usage, preemptions, running/waiting requests, queue time, and token histograms.
Why version vLLM flags?
Flags such as max model length, memory utilization, generation config, prefix caching, and speculation change capacity and behavior.
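Versioning flags is simpler when each deployed configuration has a stable fingerprint that can be logged next to metrics and incidents. The following is a minimal sketch using a hash of the canonical JSON form; `config_fingerprint`, the model ID, and the version string are hypothetical placeholders.

```python
import hashlib
import json


def config_fingerprint(model: str, runtime_version: str, flags: dict) -> str:
    """Short, order-independent fingerprint of a serving configuration."""
    payload = json.dumps(
        {"model": model, "runtime_version": runtime_version, "flags": flags},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # fixed separators keep the encoding canonical
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    # Placeholder model ID and version, for illustration only.
    base = {"--max-model-len": 8192, "--enable-prefix-caching": True}
    changed = {"--max-model-len": 16384, "--enable-prefix-caching": True}
    print(config_fingerprint("my-org/my-model", "0.0.0", base))
    print(config_fingerprint("my-org/my-model", "0.0.0", changed))
```

Two deployments with the same fingerprint were launched with the same model, runtime version, and flags, which narrows a regression to traffic shape or infrastructure.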