Inference Engine Comparison

Inference engines are not interchangeable wrappers. They differ in kernels, schedulers, KV-cache layouts, API behavior, quantization support, adapter handling, batching policy, and the metrics they expose. Compare engines against your own workload, not just a headline benchmark.

Command Examples

vllm serve <model> --help | sed -n '1,80p'
text-generation-launcher --help 2>/dev/null | head
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"

Example output and meaning:

| Command | Example output | What it does |
| --- | --- | --- |
| `vllm serve <model> --help \| sed -n '1,80p'` | The first 80 lines of the `vllm serve` help text: usage and the flags accepted by the installed version. | Shows which serving options this vLLM build exposes before you tune them. |
| `text-generation-launcher --help 2>/dev/null \| head` | The first 10 lines of the TGI launcher's usage text, or nothing if the launcher is not installed (stderr is discarded). | Confirms that the TGI launcher is on the PATH and shows the start of its option list. |
| Python snippet | The name of CUDA device 0 as reported by PyTorch, or `cpu` if CUDA is not available. | Checks that PyTorch can see an accelerator, which is a separate question from whether an engine can serve a model on it. |

Record runtime versions and flags. Small version changes can alter supported features and metric names.
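
One way to make that record repeatable is a small script that writes versions and launch flags to JSON next to each benchmark run. This is a minimal sketch using only the standard library; the package names in the example call and the use of the script's own arguments as "flags" are assumptions for illustration, not a convention of any engine.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Dict, List, Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def runtime_record(packages: List[str], flags: List[str]) -> Dict[str, object]:
    """Collect interpreter, platform, package versions, and launch flags."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "flags": list(flags),
    }


if __name__ == "__main__":
    # Placeholder package list; replace with the engines you actually run.
    # Flags are taken from this script's own arguments purely for illustration.
    record = runtime_record(["vllm", "torch", "transformers"], sys.argv[1:])
    print(json.dumps(record, indent=2, sort_keys=True))
```

Storing this output with every result makes it possible to tell a real regression from a changed default after an upgrade.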

Feature Matrix

| Engine | Strength | Watch |
| --- | --- | --- |
| vLLM | High-throughput serving, PagedAttention, continuous batching, OpenAI-compatible API. | Tuning flags, feature compatibility, cache pressure. |
| TensorRT-LLM | NVIDIA-optimized engine build and kernels. | Build workflow, hardware specificity, shape assumptions. |
| Hugging Face TGI | HF-native model serving and common production deployment path. | Supported model/quantization combinations. |
| llama.cpp | GGUF, local CPU/GPU/offload, quantized edge inference. | Different ops model than GPU fleet serving. |
| SGLang | Structured/programmatic generation and high-performance runtime. | Workload fit and operational maturity validation. |
| Ollama | Simple local model management and developer UX. | Convenience layer, not usually a high-scale control plane. |

Comparison Checklist

| Area | Questions |
| --- | --- |
| Model support | Architecture, tokenizer, context length, adapters, MoE, multimodal. |
| Artifact format | safetensors, GGUF, TensorRT engine, quantized artifact. |
| API behavior | OpenAI-compatible requests, streaming, tool calls, errors, usage. |
| Performance | TTFT, ITL, throughput, queue time, cache pressure, cost. |
| Quantization | AWQ, GPTQ, FP8, INT8, INT4, GGUF, KV-cache quantization. |
| Cache features | Paged KV, prefix caching, offload, sliding window, preemption. |
| Observability | Metrics, traces, logs, per-model/per-tenant labels. |
| Operations | Upgrade path, rolling deploys, canaries, rollback, debugging. |
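
The Performance row names TTFT (time to first token) and ITL (inter-token latency). The sketch below shows one way to derive both from the arrival timestamps of a single streamed response, so the same definition can be applied to every engine under test. The function name and the returned keys are hypothetical, and it assumes one timestamp per token; if a stream delivers several tokens per chunk, the gaps measure chunk spacing instead.

```python
from statistics import mean
from typing import Dict, List


def latency_summary(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, ITL, and output rate for one streamed request.

    request_start and token_times are seconds on the same monotonic clock,
    with token_times in arrival order.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    # TTFT: delay between sending the request and the first token arriving.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive token arrivals after the first token.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "max_itl_s": max(gaps) if gaps else 0.0,
        "output_tokens_per_s": len(token_times) / total if total > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    print(latency_summary(0.0, [0.5, 0.75, 1.0, 1.5]))
```

Benchmark tools do not all define ITL the same way, so computing it from raw timestamps avoids comparing two engines on mismatched definitions.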

Migration Plan

  1. Pick representative prompts, long contexts, streaming requests, tool calls, and overload cases.
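  2. Render the chat template in both engines and compare the resulting token IDs.
  3. Match generation config and stop behavior.
  4. Run golden prompts deterministically (greedy decoding or a fixed seed) and diff the outputs.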
  5. Run traffic-shape load tests.
  6. Compare metrics and logs.
  7. Canary with rollback.
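
The token-ID comparison in step 2 can be sketched as a small diff helper. This assumes you have already obtained the token IDs each engine produces for the same rendered prompt; how you extract them depends on the engine and is not shown. The function names and the example IDs are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token ID, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge where it ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_cases(
    cases: Dict[str, Tuple[Sequence[int], Sequence[int]]]
) -> Dict[str, int]:
    """Map each mismatching case name to its first divergence index."""
    report = {}
    for name, (old_ids, new_ids) in cases.items():
        idx = first_divergence(old_ids, new_ids)
        if idx is not None:
            report[name] = idx
    return report


if __name__ == "__main__":
    # Made-up token IDs for illustration only.
    cases = {
        "short_prompt": ([1, 15, 27, 2], [1, 15, 27, 2]),
        "tool_call": ([1, 15, 27, 2], [1, 15, 99, 2]),
    }
    print(compare_cases(cases))
```

A divergence near index 0 usually points at special tokens or the chat template, while a late divergence in generated output is more likely a sampling or numerics difference.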

Study Cards

Question

Why can engine migration change model behavior?

Answer

Tokenization, chat templates, generation defaults, stop behavior, quantization, and adapter support can differ.

Question

When is TensorRT-LLM attractive?

Answer

When an NVIDIA fleet needs optimized engine builds and high throughput for supported model shapes.

Question

Why keep llama.cpp separate in an engine comparison?

Answer

It targets GGUF/local/offload workflows with different tradeoffs than GPU fleet servers.

References