LLM Inference Systems

LLM inference is the production system that turns model artifacts, tokenizers, chat templates, prompts, adapters, hardware, runtime kernels, schedulers, KV cache, and API contracts into generated tokens. Treat inference as a systems problem: the same weights can behave differently or cost much more when the tokenizer, generation config, serving engine, cache policy, precision, batching, or request shape changes.

Command Examples

nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
python -c "from transformers import AutoTokenizer; print('tokenizer import ok')"
vllm serve <model> --help | sed -n '1,120p'
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics | grep -E 'vllm:|kv|queue|token'

Example output and meaning:

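Command	Example output	What it does
`nvidia-smi`	Driver and CUDA versions, visible GPUs, per-GPU utilization and memory use.	Confirms the accelerator is visible; says nothing about model-serving capacity or latency.
`python -c "import torch; ..."`	`True` or `False`.	Confirms PyTorch can reach a CUDA device.
`python -c "from transformers import AutoTokenizer; ..."`	`tokenizer import ok`, or an `ImportError`.	Confirms the transformers package is installed; no tokenizer is loaded.
`vllm serve <model> --help | sed -n '1,120p'`	The first 120 lines of CLI help.	Lists the serving flags available in the installed vLLM version.
`curl -s http://localhost:8000/v1/models`	A JSON list of served model IDs.	Confirms the OpenAI-compatible server is up and which models it exposes.
`curl -s http://localhost:8000/metrics | grep ...`	Prometheus-format metric lines matching the filter.	Shows which cache, queue, and token metrics the server exports.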

These checks prove only basic runtime reachability. They do not prove model compatibility, output quality, cache capacity, tenant isolation, or cost.

System Layers

Layer	What to Verify	Common Failure
Model artifact	File format, architecture, dtype, shards, license.	Runtime cannot load or silently falls back to slow behavior.
Tokenizer	Vocabulary, special tokens, BOS/EOS behavior, chat template.	Same prompt produces different token IDs across engines.
Weights	Parameter count, precision, quantization, sharding, adapters.	Model fits at rest but fails under KV-cache load.
Runtime engine	Scheduler, kernels, KV layout, batching, API server.	Good single-call output but poor production tail latency.
Request policy	Max prompt, max output, sampling defaults, tenant quota.	Unbounded requests exhaust cache or create cost spikes.
Observability	TTFT, ITL, queue time, tokens/sec, cache pressure, errors.	Incidents are diagnosed from generic GPU utilization only.
Release gate	Quality evals, load tests, canary, rollback, compatibility checks.	Runtime or model upgrades change behavior without detection.

Model Artifacts and Formats

Format	Where It Shows Up	What to Know
PyTorch checkpoint	Training, fine-tuning, research code.	Flexible, but often not the most efficient production serving artifact.
`safetensors`	Hugging Face model repos and serving stacks.	Safer tensor-only format with fast loading and no pickle execution.
Sharded weights	Large models split across multiple files.	Need all shards, matching index, enough host RAM, and compatible dtype.
GGUF	llama.cpp, local/offline inference, quantized edge use.	Bundles metadata and quantized weights for llama.cpp-family runtimes.
ONNX	Cross-runtime export and optimization.	Useful for some models, but decoder-only LLM serving often needs engine-specific support.
TensorRT engine	TensorRT-LLM deployment artifact.	Built for specific model, precision, shape assumptions, and hardware target.
LoRA adapter	Parameter-efficient serving over a base model.	Must match base model, tokenizer, target modules, rank, dtype, and merge state.

Artifact reviews should record model ID, revision, license, architecture, context window, tokenizer revision, chat template, dtype, quantization scheme, adapter list, generation config, eval baseline, and rollback artifact.

Weights, Activations, and KV Cache

Memory Class	Lifetime	Scales With	Why It Matters
Weights	Loaded while model replica is running.	Parameter count and bytes per parameter.	Determines base GPU memory before traffic.
Activations	Temporary during forward passes.	Batch shape, sequence length, layers, kernels.	Training is activation-heavy; inference still has temporary workspace.
KV cache	Grows per active sequence during prefill/decode.	Layers, tokens, KV heads, head dim, dtype, concurrency.	Often the production serving bottleneck.
Optimizer state	Training only.	Parameters and optimizer type.	Usually absent from inference replicas.
Workspace	Runtime-specific temporary buffers.	Kernel choices, parallelism, compilation.	Can cause OOM even when rough math looks safe.

Rough weight memory:

weight_memory ~= parameters * bytes_per_parameter

Common bytes per parameter:

Precision	Bytes	Typical Use
FP32	4	Training or high-precision baselines.
BF16 / FP16	2	Common GPU inference and training precision.
FP8	1	Newer accelerator paths with calibration and kernel support.
INT8	1	Quantized serving where quality holds.
INT4	0.5	Aggressive weight quantization for memory/cost reduction.
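
As a rough illustration of the formula and table above, the following Python sketch estimates weight memory for a hypothetical 7-billion-parameter model. The function name, the dictionary, and the parameter count are illustrative assumptions. The estimate ignores quantization scales, layers kept at higher precision, and runtime overhead.

```python
# Bytes per parameter, taken from the table above.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(parameters: float, precision: str) -> float:
    """Rough weight footprint in GiB: parameters * bytes_per_parameter."""
    return parameters * BYTES_PER_PARAM[precision] / 2**30


# Hypothetical 7e9-parameter model at three precisions.
for precision in ("bf16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gib(7e9, precision):.2f} GiB")
```

Treat the result as a lower bound on GPU memory before traffic, not a fit guarantee.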

Rough KV-cache memory per active sequence:

kv_cache_bytes_per_sequence ~= layers * active_tokens * 2 (K and V) * kv_heads * head_dim * bytes_per_value

Grouped-query attention and multi-query attention reduce KV memory by using fewer KV heads than query heads. This is one reason two models with similar parameter counts can have different serving capacity.
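
A minimal Python sketch of the per-sequence KV estimate, comparing full multi-head attention with grouped-query attention. The model shape (32 layers, head dimension 128, 4096 active tokens, 2-byte values) is a hypothetical example, not a specific model.

```python
def kv_cache_bytes_per_sequence(layers, active_tokens, kv_heads, head_dim,
                                bytes_per_value=2):
    """layers * tokens * 2 (K and V) * kv_heads * head_dim * bytes_per_value."""
    return layers * active_tokens * 2 * kv_heads * head_dim * bytes_per_value


# Hypothetical shape: same layers and head_dim, different KV head counts.
mha = kv_cache_bytes_per_sequence(32, 4096, kv_heads=32, head_dim=128)
gqa = kv_cache_bytes_per_sequence(32, 4096, kv_heads=8, head_dim=128)

print(f"MHA: {mha / 2**30:.2f} GiB per sequence")
print(f"GQA: {gqa / 2**30:.2f} GiB per sequence")
```

Under these assumed shapes, cutting KV heads from 32 to 8 reduces per-sequence KV memory by a factor of four, which translates directly into more concurrent sequences for the same memory budget.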

Engine Choice Matrix

Engine	Strengths	Watch Out For	Good Fit
vLLM	PagedAttention, continuous batching, OpenAI-compatible API, prefix caching, speculative decoding, metrics.	Feature compatibility, cache pressure, version-specific flags, operational tuning.	General high-throughput LLM serving.
TensorRT-LLM	NVIDIA-optimized kernels, engine build workflow, in-flight batching, paged KV cache support.	Build complexity, hardware specificity, engine rebuilds for shape/model changes.	NVIDIA fleets needing maximum optimized throughput.
Hugging Face TGI	Hugging Face ecosystem integration, production server, common open-model workflow.	Engine-specific model support and tuning differences.	Teams standardized on HF model repos and APIs.
llama.cpp	GGUF ecosystem, CPU/GPU/offload, local and edge inference.	Different performance profile, model conversion/quant choices, not a drop-in GPU fleet server.	Local apps, edge, laptops, CPU-heavy deployments.
SGLang	Serving runtime with structured generation and high-performance LLM execution features.	Operational maturity and feature fit should be validated for your workload.	Structured generation, agentic, or programmatic generation workloads.
Ollama	Simple local model management and developer UX.	Convenience layer, not usually the final high-scale serving control plane.	Local development, demos, lightweight internal tools.

Engine selection should be workload-driven. Compare prompt length distribution, output length distribution, concurrency, latency targets, hardware, model format, quantization needs, adapter support, API compatibility, observability, and upgrade process.

Inference Mechanics

Mechanic	Practical Meaning	Primary Metric
Prefill	Processes prompt tokens and creates initial KV cache.	Time to first token and prompt tokens/sec.
Decode	Generates tokens one step at a time using KV cache.	Inter-token latency and generation tokens/sec.
Streaming	Sends partial output while decode continues.	Smoothness, cancellation, client disconnects.
Static batching	Runs preselected batches together.	Batch throughput, but can add waiting.
Dynamic batching	Batches requests arriving near each other.	Throughput vs latency tradeoff.
Continuous batching	Adds/removes active sequences while decoding.	GPU occupancy and tail latency under mixed lengths.
Admission control	Limits traffic before the runtime collapses.	Queue time, rejection rate, cache pressure.
Request-shape routing	Splits short, long-context, and long-output traffic.	Per-pool SLOs and utilization.

Measure prefill and decode separately. Long prompts raise TTFT and claim KV memory before the first token is produced. Long outputs extend decode, grow the KV cache, and keep scheduler slots occupied. A system can report good average tokens/sec while users still see poor p95 TTFT or uneven streaming.
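
One way to keep the two phases apart is to derive TTFT and ITL from client-side token arrival times. The sketch below assumes you already record a request start time and one timestamp per streamed token, in seconds; the function and field names are illustrative.

```python
from statistics import mean


def latency_breakdown(request_start, token_times):
    """Split one streamed response into TTFT, ITL, and end-to-end latency.

    request_start: time the request was sent.
    token_times: arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens approximate inter-token latency.
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,   # queue + prefill
        "mean_itl_s": mean(gaps) if gaps else 0.0,  # decode smoothness
        "max_itl_s": max(gaps) if gaps else 0.0,    # worst stall in the stream
        "e2e_s": token_times[-1] - request_start,
    }


print(latency_breakdown(0.0, [0.5, 0.75, 1.0, 1.5]))
```

Client-side TTFT includes network and queue time, so compare it with server-side metrics before attributing it to prefill alone.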

Sampling and Decoding Controls

Control	Effect	Risk
Greedy decoding	Always picks the highest-probability token.	Deterministic but can be dull or brittle.
Temperature	Scales logits before sampling; lower values are more deterministic, higher values more random.	High values can reduce reliability.
Top-p	Samples from the smallest set of tokens whose cumulative probability reaches p.	Poor defaults can degrade safety behavior and output format.
Top-k	Samples from the k most probable tokens.	Too small a k can overconstrain output.
Repetition penalty	Discourages repeated tokens.	Can damage exact quoting or structured output.
Beam search	Explores multiple candidate sequences.	Expensive and often not ideal for chat serving.
Max tokens	Caps output length.	Too high raises cost/cache use; too low truncates.
Stop sequences	Ends generation at known boundaries.	Bad stops can cut valid answers.

Version generation settings with deployments. A model-repo default, runtime default, SDK default, and product default can differ.
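
To make the controls concrete, here is a pure-Python sketch of temperature, top-k, and top-p applied to a small logit vector. It is an illustration only: the function names are hypothetical, and real engines may apply these filters in a different order or with different tie-breaking, which is one reason defaults should be versioned.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k, then top-p."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)),
                    reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for prob, token_id in ranked:
        kept.append((prob, token_id))
        cumulative += prob
        if cumulative >= top_p:  # smallest set reaching the mass cutoff
            break
    norm = sum(prob for prob, _ in kept)
    return {token_id: prob / norm for prob, token_id in kept}


def sample_token(distribution, rng):
    """Draw one token id from the filtered distribution."""
    ids = list(distribution)
    return rng.choices(ids, weights=[distribution[i] for i in ids], k=1)[0]


dist = filtered_distribution([2.0, 1.0, 0.0], temperature=0.7, top_k=2)
print(dist, sample_token(dist, random.Random(0)))
```

Logging the effective values of these parameters per request makes it much easier to tell a model change from a default change.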

API Contracts, Streaming, and Cancellation

Surface	What to Lock Down	Failure Mode
OpenAI-compatible endpoints	Request schema, response schema, model IDs, tool/function fields, streaming chunks.	Client works in one engine but breaks after runtime swap.
Streaming semantics	Token/chunk ordering, finish reasons, errors, timeouts.	Clients hang, double-render, or lose final usage data.
Cancellation	Client disconnects, timeout cancellation, server-side abort.	Abandoned requests keep decoding and holding KV cache.
Usage accounting	Prompt tokens, completion tokens, cached tokens, rejected requests.	Cost attribution and rate limits drift from reality.
Error taxonomy	Rate limit, context length, safety, overload, model unavailable.	Retry logic overloads a degraded system.
Backward compatibility	Old SDKs, old model aliases, old tool schemas.	Runtime upgrade becomes a client incident.

Treat API compatibility as part of the model contract. A serving engine migration should replay real client requests, including streaming, tool calls, long prompts, stop sequences, cancellation, and overload behavior.

KV Cache Essentials

KV cache stores attention key/value tensors for previous tokens so autoregressive decode does not recompute the whole prompt for every new token. It is internal model state for active inference, not final-answer caching and not durable conversation memory.

Topic	Important Detail
Lifecycle	Allocate during prefill, append during decode, free on finish/cancel.
Memory pressure	Increases with prompt tokens, generated tokens, concurrency, layers, heads, and dtype.
Decode performance	Often memory-bandwidth-sensitive because each new token reads prior KV.
Prefix reuse	Shared prefixes can reuse cached KV only when model, tokenizer, template, and prefix tokens match.
Quantized KV	Can reduce memory/bandwidth but must pass quality and kernel support checks.
Offloaded KV	Moves some cache to CPU memory but adds PCIe/NVLink latency and bandwidth limits.
Sliding window	Some architectures keep only a recent attention window.
Tenant isolation	Cache reuse must not cross tenant or authorization boundaries.

Capacity planning should include a table of request classes:

Class	Prompt Tokens	Output Tokens	Concurrency	Serving Pool
Chat short	p50/p95	p50/p95	target	Low-latency pool.
RAG long prompt	p50/p95	p50/p95	target	Prefix-cache or long-context pool.
Batch summarization	p50/p95	p50/p95	target	Throughput-oriented pool.
Agent/tool loop	p50/p95 per turn	p50/p95 per turn	target	Strict timeout and cost limits.
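
The request-class table can be turned into a rough concurrency estimate per class. The sketch below assumes a hypothetical 80 GiB GPU, 14 GiB of weights, 6 GiB of workspace headroom, and a GQA model with 32 layers, 8 KV heads, head dimension 128, and 2-byte KV values. All of these numbers and names are placeholders to replace with measured values.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV bytes added by one token: layers * 2 (K and V) * heads * dim * bytes."""
    return layers * 2 * kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(gpu_memory_gib, weight_gib, workspace_gib,
                             bytes_per_token, tokens_per_sequence):
    """How many sequences of a given length fit in the remaining KV budget."""
    budget_bytes = (gpu_memory_gib - weight_gib - workspace_gib) * 2**30
    if budget_bytes <= 0:
        return 0
    return int(budget_bytes // (bytes_per_token * tokens_per_sequence))


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

# Hypothetical p95 prompt + output tokens per request class.
request_classes = {"chat_short": 1280, "rag_long_prompt": 4096}
for name, tokens in request_classes.items():
    fit = max_concurrent_sequences(80, 14, 6, per_token, tokens)
    print(f"{name}: about {fit} concurrent sequences")
```

Sizing on p95 rather than p50 lengths leaves room for bursts; actual capacity also depends on block granularity and scheduler behavior, so confirm with a load test.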

PagedAttention Essentials

PagedAttention is a KV-cache memory-management technique used by vLLM. It splits KV cache into fixed-size physical blocks and maps logical token positions to those blocks. This avoids requiring one large contiguous allocation per sequence.

Concept	What It Means
Logical block	A token range in a sequence’s cache.
Physical block	A GPU memory block holding actual KV tensors.
Block table	Mapping from logical blocks to physical blocks.
Non-contiguous storage	One request can span scattered physical blocks.
Block sharing	Shared prefixes or parallel samples can reuse blocks.
Copy-on-write	Shared blocks are copied only when sequences diverge.
Fragmentation control	Avoids reserving worst-case contiguous memory per sequence, so less memory is wasted.

PagedAttention improves packing, concurrency, continuous batching, and cache reuse. It does not make attention compute free, change model quality, or remove the need for prompt/output limits.
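
The concepts in the table can be sketched as a toy block allocator. This is not vLLM code: the class names are hypothetical, no KV tensors are stored, and the copy step only swaps block IDs where a real engine would also copy the block's contents.

```python
class BlockPool:
    """Fixed pool of physical blocks with reference counts."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.refs = {}  # physical block id -> number of sequences using it

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.refs[block] = 1
        return block

    def share(self, block):
        self.refs[block] += 1

    def release(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block]
            self.free_blocks.append(block)


class Sequence:
    """One active sequence; block_table maps logical to physical blocks."""

    def __init__(self, pool):
        self.pool = pool
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.pool.block_size == 0:
            # Last block is full (or no block yet): take a new physical block.
            self.block_table.append(self.pool.allocate())
        elif self.pool.refs[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared.
            fresh = self.pool.allocate()
            self.pool.release(self.block_table[-1])
            self.block_table[-1] = fresh
        self.num_tokens += 1

    def fork(self):
        """Create a sequence sharing every existing block (e.g. parallel samples)."""
        child = Sequence(self.pool)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.pool.share(block)
        return child

    def release(self):
        for block in self.block_table:
            self.pool.release(block)
        self.block_table = []
        self.num_tokens = 0
```

Forking a six-token sequence with a block size of four shares two blocks; the first divergent token copies only the partially filled last block, while the full first block stays shared.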

Prefix Caching, Speculation, and Prefill/Decode Split

Technique	Helps	Best Signal
Prefix caching	Reuses KV for stable repeated prefixes.	Prefix cache hit rate and lower TTFT.
Chunked prefill	Slices long prefill into schedulable chunks.	Fairness and p95 TTFT under mixed traffic.
Disaggregated prefill/decode	Separates prefill-heavy and decode-heavy workers.	Independent TTFT and ITL capacity control.
Speculative decoding	Drafts candidate tokens and verifies them.	Accepted-token rate and lower ITL.
Multi-LoRA serving	Serves adapters over one base model.	Adapter-specific latency, cache pressure, quality.
Quantized serving	Reduces memory and sometimes improves throughput.	Quality evals, latency, supported kernels.

Prefix caching is not the same as response caching. It reuses internal KV for identical token prefixes. Any change in system prompt, chat template, retrieved document order, tokenizer, adapter, or tenant boundary can make reuse unsafe or impossible.
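
One way to express these reuse conditions is a cache key that chains token blocks and is seeded by everything that must match. The sketch below is a hypothetical keying scheme, not any engine's implementation; the scope fields and the block size of 16 are assumptions.

```python
import hashlib


def prefix_block_keys(token_ids, scope, block_size=16):
    """Return one key per full block of the prefix.

    Each key depends on the scope (tenant, model, tokenizer, template,
    adapter) and on every token up to the end of that block, so a key
    matches only when the whole prefix and the whole scope match.
    """
    # Seed the chain with the scope so different scopes never share keys.
    digest = hashlib.sha256(repr(sorted(scope.items())).encode()).digest()
    keys = []
    full_length = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_length, block_size):
        block = ",".join(str(t) for t in token_ids[start:start + block_size])
        digest = hashlib.sha256(digest + block.encode()).digest()
        keys.append(digest.hex())
    return keys


scope = {"tenant": "tenant-a", "model": "example-model@rev1",
         "tokenizer": "tok@rev1", "template": "chat-v2", "adapter": "none"}
keys = prefix_block_keys(list(range(40)), scope)
print(len(keys), "reusable block keys")
```

Because each key chains the previous digest, a change in the first block invalidates every later block, which mirrors why reordering retrieved documents usually destroys prefix reuse.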

Quantization for Serving

Type	What Is Quantized	Main Benefit	Main Risk
Weight quantization	Model parameters.	Lower model memory and cost.	Quality loss, unsupported kernels, calibration gaps.
Activation quantization	Runtime activations.	Faster/lower-memory kernels where supported.	Accuracy and hardware dependence.
KV-cache quantization	Cached keys and values.	More context/concurrency.	Quality loss and decode kernel compatibility.
GGUF quant levels	llama.cpp model weights.	Local/edge memory reduction.	Quant choice materially affects output and speed.
FP8 serving	Weights/activations on supported hardware.	High throughput on modern accelerators.	Calibration and engine support complexity.

Quantization is a release, not a toggle. Run task evals, safety evals, long-context evals, structured-output evals, and latency/load tests on the exact serving engine.
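
A toy example shows why quality can shift. The sketch below applies symmetric absmax INT8 quantization to a handful of made-up values; production schemes use per-channel or per-group scales and calibration, so this only illustrates the rounding effect.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map the largest magnitude to 127."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up values for illustration
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(codes, f"max abs error = {max(errors):.4f}")
```

Here the small value 0.003 rounds to code 0 because one outlier sets the scale for the whole group. Effects like this are why evals must run on the exact quantized artifact and engine.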

Parallelism and Routing

Strategy	Use	Cost
Data parallel replicas	Scale independent traffic.	More weight copies.
Tensor parallelism	Split matrix work across GPUs.	Collective communication and interconnect sensitivity.
Pipeline parallelism	Split layers across stages.	Pipeline bubbles and higher latency for small batches.
Expert parallelism	MoE serving.	Routing/load-balance complexity.
Model routing	Choose model by task, tenant, cost, latency.	Routing becomes part of product behavior.
Fallback routing	Fail over to smaller/older/remote models.	Compatibility and user-visible quality changes.
Adapter routing	Pick LoRA adapter per tenant/task.	Cache pressure and adapter compatibility.

Routing policy should record the model, adapter, tokenizer, generation config, prompt template, safety policy, and rollback route for every served request.

Security and Tenant Isolation

Risk	Why It Matters	Control
Cross-tenant cache reuse	Prefix or response caches can expose another tenant’s context.	Scope caches by tenant, auth boundary, model, tokenizer, template, adapter, and policy.
Prompt logging leakage	Prompts and outputs may contain secrets, PII, documents, or tool results.	Redaction, retention limits, access controls, sampling, and audit logs.
Adapter mix-up	Wrong LoRA adapter can leak tenant behavior or data.	Route-level adapter IDs, compatibility checks, and per-adapter metrics.
Tool-call drift	Engine or prompt changes can alter tool arguments.	Schema validation, deterministic authorization, replay tests.
Model supply chain	Artifacts may be malicious, unlicensed, or unreviewed.	Use trusted formats, pinned revisions, license review, checksum/signature policy.
Overload abuse	Long contexts and outputs can exhaust KV cache.	Quotas, max prompt/output tokens, admission control, and tenant rate limits.

Security review should include cache boundaries, logs, model artifacts, adapters, generation config, routing rules, and client-visible API behavior.
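
The overload controls in the table (quotas, max prompt/output tokens, admission control) can be sketched as a check that runs before a request reaches the engine. The class names, reason strings, and limits below are hypothetical; a real deployment would also need locking, expiry of stale reservations, and token counts from the serving tokenizer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TenantPolicy:
    max_prompt_tokens: int
    max_output_tokens: int
    max_inflight_tokens: int  # prompt + reserved output across active requests


@dataclass
class AdmissionController:
    policies: dict                                 # tenant id -> TenantPolicy
    inflight: dict = field(default_factory=dict)   # tenant id -> reserved tokens

    def admit(self, tenant, prompt_tokens, max_output_tokens):
        """Return (admitted, reason); reserve tokens when admitted."""
        policy = self.policies.get(tenant)
        if policy is None:
            return False, "unknown_tenant"
        if prompt_tokens > policy.max_prompt_tokens:
            return False, "prompt_too_long"
        if max_output_tokens > policy.max_output_tokens:
            return False, "output_cap_too_high"
        reserved = prompt_tokens + max_output_tokens
        used = self.inflight.get(tenant, 0)
        if used + reserved > policy.max_inflight_tokens:
            return False, "tenant_quota_exceeded"
        self.inflight[tenant] = used + reserved
        return True, "admitted"

    def release(self, tenant, prompt_tokens, max_output_tokens):
        """Call on finish, cancel, or timeout so the reservation is returned."""
        reserved = prompt_tokens + max_output_tokens
        self.inflight[tenant] = max(0, self.inflight.get(tenant, 0) - reserved)


controller = AdmissionController({"acme": TenantPolicy(4096, 512, 6000)})
print(controller.admit("acme", 3000, 512))
print(controller.admit("acme", 3000, 512))
```

Reserving the full output cap up front is conservative, but it bounds the worst-case KV growth a single tenant can cause.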

Performance Test Matrix

Test Axis	Values to Include
Prompt length	p50, p90, p95, p99, max allowed.
Output length	p50, p90, p95, p99, max allowed.
Concurrency	expected, burst, overload, tenant hot spot.
Traffic mix	short chat, long RAG, batch, agent loops, streaming.
Runtime variant	baseline engine, new engine, quantized, speculative, prefix cache.
Model variant	old/new weights, adapter, tokenizer, chat template.
Hardware	GPU type, count, interconnect, driver, CUDA/runtime version.

Track these metrics:

Metric	Why
TTFT p50/p95/p99	User waits before seeing output.
ITL p50/p95/p99	Streaming smoothness.
End-to-end latency	Total user-visible time.
Prompt tokens/sec	Prefill throughput.
Generation tokens/sec	Decode throughput.
Queue time	Admission and capacity pressure.
Active/waiting requests	Scheduler load.
KV-cache utilization	Memory pressure.
Preemptions/swaps/recomputes	Cache exhaustion or scheduler contention.
Prefix cache hit rate	Prefix caching value.
Speculative acceptance rate	Speculation value.
Error/retry/reject rate	User-visible reliability.
Cost per 1K input/output tokens	Unit economics.
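
Two of these metrics are simple enough to compute by hand when validating a dashboard. The sketch below uses a nearest-rank percentile and a cost model that assumes a single GPU hourly price and a sustained tokens/sec figure; both helper names and the sample numbers are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k_tokens(gpu_hourly_cost, tokens_per_second):
    """Cost of 1,000 tokens at a sustained throughput on one GPU."""
    return gpu_hourly_cost / (tokens_per_second * 3600) * 1000


# Made-up TTFT samples in milliseconds.
ttft_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 2400]
for pct in (50, 95, 99):
    print(f"TTFT p{pct}: {percentile(ttft_ms, pct)} ms")
print(f"cost per 1K tokens: {cost_per_1k_tokens(3.6, 1000):.4f}")
```

Monitoring systems often interpolate percentiles or estimate them from histogram buckets, so small differences from a nearest-rank calculation are expected.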

Debugging Runbooks

Symptom	First Split	Likely Checks
Wrong output	Model behavior vs serving mismatch.	Tokenizer, chat template, generation config, adapter, model revision.
High TTFT	Queue vs prefill.	Queue time, prompt tokens, prefix cache hits, chunked prefill, replicas.
Slow streaming	Decode vs network/client.	ITL, output tokens, KV pressure, memory bandwidth, speculation.
OOM under burst	Weights vs KV vs workspace.	Max model len, concurrency, cache usage, quantization, request limits.
GPU busy but throughput low	Kernel/shape/scheduler issue.	Batch occupancy, sequence length mix, tensor parallel overhead.
Queue high but GPU low	Admission/routing bottleneck.	Router, rate limits, worker health, blocked scheduler.
Prefix cache no gain	Prefix mismatch.	Tokenized prefix equality, template drift, tenant boundaries.
Speculation no gain	Low acceptance or overhead.	Draft quality, sampling config, accepted-token counters.
Adapter latency spike	Adapter-specific behavior.	Adapter rank, active adapters, cache pressure, routing.
Runtime upgrade regression	Compatibility drift.	Golden prompts, structured outputs, latency diff, metrics names.

Release Gates

Before changing model, runtime, quantization, prompt template, tokenizer, adapter, or generation config:

Record old and new artifact revisions.
Run golden quality evals and structured-output checks.
Run safety, refusal, and jailbreak slices where relevant.
Compare tokenization for representative prompts.
Load test short, long prompt, long output, and burst shapes.
Compare TTFT, ITL, throughput, queue time, cache usage, and error rate.
Canary by tenant or route with automatic rollback conditions.
Preserve old engine/model artifacts until rollback is proven.
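
The metric comparison step can be automated as a simple gate. The sketch below assumes every metric is "lower is better" and that tolerances are expressed as a maximum relative increase; the metric names and thresholds are hypothetical.

```python
def gate_regressions(baseline, candidate, max_relative_increase):
    """Return {metric: relative_change} for metrics that worsened beyond tolerance."""
    failures = {}
    for name, limit in max_relative_increase.items():
        old, new = baseline[name], candidate[name]
        if old == 0:
            # No baseline to scale against: any increase counts as a failure.
            if new > 0:
                failures[name] = float("inf")
            continue
        change = (new - old) / old
        if change > limit:
            failures[name] = round(change, 4)
    return failures


baseline = {"ttft_p95_s": 0.8, "itl_p95_s": 0.040, "error_rate": 0.01}
candidate = {"ttft_p95_s": 1.0, "itl_p95_s": 0.041, "error_rate": 0.01}
limits = {"ttft_p95_s": 0.10, "itl_p95_s": 0.10, "error_rate": 0.10}

failures = gate_regressions(baseline, candidate, limits)
print("block release" if failures else "proceed to canary", failures)
```

A gate like this covers only latency and error metrics; quality, safety, and tokenization comparisons need their own checks before the canary step.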

Practical Labs

Lab	Goal
KV-cache capacity worksheet	Estimate memory by model, context, dtype, and concurrency.
Engine shootout	Compare vLLM, TGI, TensorRT-LLM, llama.cpp, SGLang, or Ollama for one model.
Prefix caching benchmark	Measure TTFT with repeated and non-repeated prefixes.
Long-context OOM lab	Find the prompt/output/concurrency point that exhausts KV memory.
Streaming cancellation lab	Confirm disconnected clients release decode work and KV cache.
Continuous batching demo	Compare static-style load to mixed-length continuous batching.
Speculative decoding A/B	Compare accepted-token rate, ITL, quality, and cost.
Quantization comparison	Compare weight quantization and KV quantization separately.
Multi-LoRA routing	Measure adapter-specific latency and correctness.
Canary drill	Practice rollback from a runtime or generation-config regression.
Metrics dashboard	Build panels for TTFT, ITL, queue, cache, preemptions, errors, cost.

Study Cards

Question

Why is LLM inference a systems problem?

Answer

Outputs and cost depend on weights, tokenizer, prompt format, generation config, runtime engine, KV cache, batching, hardware, and observability.

Question

How are weights different from KV cache?

Answer

Weights are loaded model parameters shared by requests; KV cache is per-active-sequence attention state that grows with prompt and generated tokens.

Question

When should you compare inference engines?

Answer

Whenever model format, latency or throughput targets, quantization needs, adapter support, hardware, observability requirements, or the operating model change.

Question

Why separate TTFT from ITL?

Answer

TTFT captures queue and prefill delay, while ITL captures decode and streaming smoothness.

Question

What makes prefix caching safe?

Answer

The reused prefix must match at the token level and respect model, tokenizer, template, adapter, and tenant boundaries.

Question

Why test quantization as a release?

Answer

It can change quality, safety, latency, memory use, and kernel compatibility.
