Inference Engine Comparison

Inference engines are not interchangeable wrappers. Each one makes its own choices about kernels, schedulers, KV-cache layouts, API behavior, quantization support, adapter handling, batching policy, and metrics. Compare engines against your actual workload, not just a headline benchmark.

Command Examples

vllm serve <model> --help | sed -n '1,80p'
text-generation-launcher --help 2>/dev/null | head
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"

Example output and meaning:

Command	Expected output	What it does
`vllm serve <model> --help | sed -n '1,80p'`	The first 80 lines of the `vllm serve` help text: the usage line followed by option descriptions.	Shows which serving flags the installed vLLM version accepts.
`text-generation-launcher --help 2>/dev/null | head`	The first 10 lines of the TGI launcher help text, or nothing if the launcher is not installed (stderr is discarded).	Confirms that TGI is on the path and previews its launch options.
Python snippet	The name of CUDA device 0, or `cpu` when PyTorch cannot see a CUDA device.	Checks accelerator visibility before any model is loaded, which separates driver and visibility problems from serving problems.

Record runtime versions and flags. Small version changes can alter supported features and metric names.
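
One way to make that record repeatable is to capture it as structured data next to each benchmark run. The sketch below is illustrative only: the function name `collect_runtime_record`, the package list, and the sample flags are assumptions, not part of any engine's tooling.

```python
import json
import platform
from importlib import metadata


def collect_runtime_record(packages, flags):
    """Return installed package versions plus the launch flags that were used."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "packages": versions,
        "flags": list(flags),
    }


if __name__ == "__main__":
    # Hypothetical package list and flags; substitute whatever you actually run.
    record = collect_runtime_record(
        ["vllm", "torch", "transformers"],
        ["--max-model-len", "8192"],
    )
    print(json.dumps(record, indent=2, sort_keys=True))
```

Storing this JSON alongside results makes it easier to explain why two runs of the "same" engine disagree.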

Feature Matrix

Engine	Strength	Watch
vLLM	High-throughput serving, PagedAttention, continuous batching, OpenAI-compatible API.	Tuning flags, feature compatibility, cache pressure.
TensorRT-LLM	NVIDIA-optimized engine build and kernels.	Build workflow, hardware specificity, shape assumptions.
Hugging Face TGI	HF-native model serving and common production deployment path.	Supported model/quantization combinations.
llama.cpp	GGUF, local CPU/GPU/offload, quantized edge inference.	Different operational model from GPU fleet serving.
SGLang	Structured/programmatic generation and high-performance runtime.	Validate workload fit and operational maturity.
Ollama	Simple local model management and developer UX.	Convenience layer, not usually a high-scale control plane.

Comparison Checklist

Area	Questions
Model support	Architecture, tokenizer, context length, adapters, MoE, multimodal.
Artifact format	`safetensors`, GGUF, TensorRT engine, quantized artifact.
API behavior	OpenAI-compatible requests, streaming, tool calls, errors, usage.
Performance	TTFT, ITL, throughput, queue time, cache pressure, cost.
Quantization	AWQ, GPTQ, FP8, INT8, INT4, GGUF, KV-cache quantization.
Cache features	Paged KV, prefix caching, offload, sliding window, preemption.
Observability	Metrics, traces, logs, per-model/per-tenant labels.
Operations	Upgrade path, rolling deploys, canaries, rollback, debugging.
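
The Performance row is easier to compare across engines when TTFT and ITL are computed the same way for each one. The following is a minimal sketch that assumes you have already recorded, per request, the send time and the arrival time of each streamed token; `latency_summary` is a hypothetical helper. If an engine streams several tokens per chunk, the gaps here measure chunks rather than tokens.

```python
from statistics import mean


def latency_summary(request_sent, token_times):
    """Compute TTFT, mean ITL, and tokens/s for one streamed request.

    request_sent: timestamp (seconds) when the request was sent.
    token_times:  ascending timestamps (seconds) when each token arrived.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    # Inter-token latency: gaps between consecutive token arrivals.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_sent
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "tokens_per_s": len(token_times) / total if total > 0 else float("inf"),
    }
```

Aggregate these per-request values into percentiles rather than averages alone, since queue time and cache pressure tend to show up in the tail.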

Migration Plan

Pick representative prompts, long contexts, streaming requests, tool calls, and overload cases.
Render the chat template in both engines and compare the resulting token IDs.
Match generation config and stop behavior.
Run deterministic golden prompts.
Run traffic-shape load tests.
Compare metrics and logs.
Canary with rollback.
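
The token-ID and golden-prompt steps can be sketched as a simple diff. This assumes you have already collected token ID sequences for the same prompts from both engines; how you obtain them is engine-specific and not shown. The names `first_divergence` and `compare_golden` are hypothetical.

```python
def first_divergence(ids_a, ids_b):
    """Return the index of the first differing token ID, or None if identical."""
    for index, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return index
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


def compare_golden(golden):
    """Map each prompt name to its first divergence index (None means a match).

    golden: dict of prompt name -> (token IDs from engine A, token IDs from engine B).
    """
    return {
        name: first_divergence(ids_a, ids_b)
        for name, (ids_a, ids_b) in golden.items()
    }


if __name__ == "__main__":
    # Made-up token IDs purely for illustration.
    report = compare_golden({
        "greeting": ([1, 2, 3], [1, 2, 3]),
        "tool_call": ([1, 2, 3], [1, 9, 3]),
    })
    print(report)
```

A divergence at index 0 or 1 on the rendered prompt usually points at chat-template or special-token differences, while a divergence deep in the completion points at sampling, kernels, or quantization.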

Study Cards

Question

Why can engine migration change model behavior?

Answer

Tokenization, chat templates, generation defaults, stop behavior, quantization, and adapter support can differ.

Question

When is TensorRT-LLM attractive?

Answer

When an NVIDIA fleet needs optimized engine builds and high throughput for supported model shapes.

Question

Why keep llama.cpp separate in an engine comparison?

Answer

It targets GGUF/local/offload workflows with different tradeoffs than GPU fleet servers.
