Advanced Inference and vLLM

Advanced inference engineering is about managing tokens, memory, scheduling, and tail latency. vLLM provides a production-oriented serving engine, but operators still need to size KV cache, control prompt/output shapes, choose parallelism, and measure real traffic.

Command Examples

vllm serve <model> --help | sed -n '1,80p'
curl -s http://localhost:8000/metrics | grep '^vllm:'

Example output and meaning:

Command	Example output	What it does
`vllm serve <model> --help | sed -n '1,80p'`	Usage text followed by the first 80 lines of CLI options.	Shows which serving flags the installed vLLM version supports before you tune them.
`curl -s http://localhost:8000/metrics | grep '^vllm:'`	Prometheus-format metric lines whose names start with `vllm:`.	Exposes engine-level serving metrics such as running and waiting request counts, which separates engine load from reachability or GPU visibility problems.
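The metrics check above can also be scripted. The following is a minimal sketch that filters Prometheus text output down to `vllm:` samples. The sample text is made up for illustration (the values and the `model_name="demo"` label are not from a real server), and the parser assumes lines carry no trailing timestamp.

```python
def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Return {sample_name_with_labels: value} for lines starting with 'vllm:'."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skips blanks, '# HELP' / '# TYPE' comments, and non-vLLM metrics.
        if not line.startswith("vllm:"):
            continue
        # Prometheus text format: '<name>{labels} <value>'.
        # Assumption: no timestamp follows the value.
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # Ignore lines that do not end in a number.
    return samples


# Hypothetical sample input, shaped like the output of the curl command.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
process_cpu_seconds_total 41.5
"""

if __name__ == "__main__":
    for name, value in parse_vllm_metrics(SAMPLE).items():
        print(f"{name} = {value}")
```

Metric names vary between vLLM versions, so treat the names in the sample as placeholders and check the live endpoint for the ones your deployment exposes.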

vLLM Architecture Concepts

Concept	Operational Meaning
PagedAttention	KV-cache memory management that avoids large contiguous allocations.
Continuous batching	Adds and removes requests dynamically as sequences progress.
Chunked prefill	Splits large prompt prefill into smaller schedulable chunks.
Disaggregated prefill	Runs prefill and decode in separate instances and transfers KV cache; experimental in current vLLM docs.
Prefix caching	Reuses KV cache for repeated prompt prefixes.
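Speculative decoding	Proposes candidate tokens with a draft model or a prompt/suffix lookup method and verifies them with the target model; reduces inter-token latency only when candidates are accepted often enough.
Quantized KV cache	Stores cached K/V values in a lower-precision format to reduce KV memory in supported configurations; validate output quality before rollout.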
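To make the speculative decoding row concrete, here is a rough sketch of how acceptance rate translates into tokens emitted per target-model step. It assumes each candidate token is accepted independently with the same probability, which is a simplification; real acceptance depends on the workload and the draft method. The function name and parameters are illustrative, not part of vLLM.

```python
def expected_tokens_per_step(accept_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes independent acceptance with probability `accept_rate` for each of
    `num_draft_tokens` candidates. The step always yields at least one token,
    because the target model supplies a token where the first rejection occurs.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if accept_rate == 1.0:
        return float(num_draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (num_draft_tokens + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for rate in (0.3, 0.6, 0.9):
        print(rate, round(expected_tokens_per_step(rate, 4), 2))
```

The result is an upper bound on the useful speedup, since it ignores the cost of producing the candidates and the extra verification compute.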

KV-Cache Memory Math

Approximate KV memory grows with:

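layers * 2 (K and V) * active_sequences * context_tokens * kv_width_per_layer * bytes_per_value

Here kv_width_per_layer is num_kv_heads * head_dim, which is smaller than the model's hidden size when the model uses grouped-query or multi-query attention.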

This is why long context and high concurrency are memory problems even when model weights fit.
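The formula above can be worked through in a few lines. This sketch assumes the per-layer hidden term equals `num_kv_heads * head_dim`, and the model shape in the example is hypothetical rather than taken from any specific checkpoint. It also ignores block-granularity rounding and other allocator overhead.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    num_sequences: int,
    bytes_per_value: int = 2,  # 2 for FP16/BF16, 1 for an 8-bit KV cache
) -> int:
    """Approximate KV-cache size in bytes for fully populated sequences."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * context_tokens * num_sequences


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
    total = kv_cache_bytes(
        num_layers=32, num_kv_heads=8, head_dim=128,
        context_tokens=8192, num_sequences=16,
    )
    print(f"{total / 2**30:.1f} GiB")  # 16.0 GiB
```

With this shape each token costs 128 KiB of cache, so a single 8192-token sequence takes 1 GiB and 16 concurrent sequences take 16 GiB, independent of how much memory the weights use.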

Parallelism Choices

Strategy	Helps	Cost
Tensor parallelism	Split large matrix work across GPUs.	Collective communication and interconnect dependence.
Pipeline parallelism	Split layers across stages.	Bubbles and latency for small batches.
Data parallel serving	More replicas for more traffic.	More model copies and routing complexity.
Disaggregated prefill/decode	Tune TTFT and ITL separately.	KV transfer, experimental behavior, more moving parts.
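A quick way to compare the strategies in the table is to compute how many GPUs a layout consumes and how much weight memory lands on each GPU. The sketch below assumes weights shard evenly across tensor-parallel and pipeline-parallel ranks, which real models only approximate. The `Layout` class and the 70-billion-parameter example are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    tensor_parallel: int    # GPUs sharing each layer's matrices
    pipeline_parallel: int  # stages that split the layer stack
    replicas: int           # independent data-parallel copies

    @property
    def gpus(self) -> int:
        return self.tensor_parallel * self.pipeline_parallel * self.replicas


def weight_bytes_per_gpu(num_params: int, bytes_per_param: int, layout: Layout) -> float:
    """Weight memory per GPU, assuming an even split across TP and PP ranks."""
    shards = layout.tensor_parallel * layout.pipeline_parallel
    return num_params * bytes_per_param / shards


if __name__ == "__main__":
    layout = Layout(tensor_parallel=4, pipeline_parallel=1, replicas=2)
    per_gpu = weight_bytes_per_gpu(70_000_000_000, 2, layout)
    print(layout.gpus, "GPUs,", per_gpu / 1e9, "GB of weights per GPU")
```

Replicas do not reduce per-GPU weight memory; they only add capacity, which is why data-parallel serving is listed with "more model copies" as its cost.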

Advanced vLLM Lab: Capacity Worksheet

model:
  parameters:
  dtype:
  max_model_len:
traffic:
  p50_prompt_tokens:
  p95_prompt_tokens:
  p50_output_tokens:
  p95_output_tokens:
targets:
  ttft_p95:
  itl_p95:
  cost_per_1k_tokens:
tuning:
  tensor_parallel_size:
  gpu_memory_utilization:
  enable_prefix_caching:
  speculative_config:

Tuning Decision Tree

Goal	First Lever	Second Lever
Lower TTFT	Reduce prompt tokens or add prefix caching.	More replicas or disaggregated prefill.
Lower ITL	Speculative decoding or smaller model.	Decode-focused routing and hardware sizing.
Fit longer context	More memory or lower concurrency.	Quantized KV cache after quality testing.
Improve throughput	Continuous batching and admission control.	Tensor parallelism or more replicas.
Support many adapters	Adapter routing and compatibility checks.	Separate pools for heavy adapters.
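Every row in the table depends on measuring TTFT and ITL from real requests before and after a change. The sketch below shows one way to derive both from client-side timestamps of a streamed response and to summarize them with a nearest-rank percentile. The function names and the timestamp values are illustrative, and the sketch assumes one timestamp per streamed token.

```python
import math


def ttft_and_itl(request_sent: float, token_times: list[float]) -> tuple[float, list[float]]:
    """Return (time to first token, list of inter-token latencies) in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, for example pct=95 for p95."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical timestamps (seconds) for one streamed response.
    ttft, itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.5])
    print("TTFT:", ttft, "ITL p95:", percentile(itl, 95))
```

Collect these per request across representative traffic, then compare the p95 values against the `ttft_p95` and `itl_p95` targets before deciding which lever to pull.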

Study Cards

Question

Why is disaggregated prefill useful?

Answer

It lets operators tune time to first token and inter-token latency separately by separating prefill and decode work.

Question

What does KV-cache memory scale with?

Answer

Layers, active sequences, context tokens, hidden dimensions, and bytes per cached value.

Question

When is speculative decoding most useful?

Answer

When workloads are decode-latency sensitive and candidate tokens are accepted often enough to offset overhead.
