Tech Study Guide
Inference Engine Comparison
Comparison guide for vLLM, TensorRT-LLM, Hugging Face TGI, llama.cpp, SGLang, Ollama, OpenAI-compatible APIs, quantization, adapters, and migration checks.
Inference engines are not interchangeable wrappers. They choose kernels, schedulers, KV-cache layouts, API behavior, quantization support, adapter handling, batching policy, and metrics. Compare engines against the workload, not only a headline benchmark.
Command Examples
vllm serve <model> --help | sed -n '1,80p'
text-generation-launcher --help 2>/dev/null | head
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| vllm serve <model> --help \| sed -n '1,80p' | The first 80 lines of the CLI help: usage text and the serving flags this build accepts. | Shows which options the installed vLLM version exposes before you compare configurations. |
| text-generation-launcher --help 2>/dev/null \| head | The first 10 lines of the launcher help, or typically nothing if the binary is not installed (stderr is discarded). | Confirms the TGI launcher is present and previews its options. |
| Python snippet | The name of CUDA device 0, or cpu when PyTorch cannot see a CUDA device. | Checks accelerator visibility separately from model-serving capacity and latency. |
Record runtime versions and flags. Small version changes can alter supported features and metric names.
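One way to make that record repeatable is a small script that captures package versions alongside the launch flags. The sketch below is a minimal illustration, not a prescribed tool: the function names `collect_versions` and `runtime_record` are hypothetical, and the package list is an assumption to adjust for the engine under test.

```python
import json
import platform
import sys
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def runtime_record(packages, flags):
    """Bundle interpreter, platform, package versions, and launch flags."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": collect_versions(packages),
        "flags": list(flags),
    }


if __name__ == "__main__":
    # The package names here are illustrative assumptions; pass the engine's
    # actual launch flags as command-line arguments to store them verbatim.
    record = runtime_record(["vllm", "torch", "transformers"], sys.argv[1:])
    print(json.dumps(record, indent=2, sort_keys=True))
```

Storing this JSON next to each benchmark result makes it possible to tell later whether a metric changed because of the engine, its version, or its flags.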
Feature Matrix
| Engine | Strength | Watch |
|---|---|---|
| vLLM | High-throughput serving, PagedAttention, continuous batching, OpenAI-compatible API. | Tuning flags, feature compatibility, cache pressure. |
| TensorRT-LLM | NVIDIA-optimized engine build and kernels. | Build workflow, hardware specificity, shape assumptions. |
| Hugging Face TGI | HF-native model serving and common production deployment path. | Supported model/quantization combinations. |
| llama.cpp | GGUF, local CPU/GPU/offload, quantized edge inference. | Different operational model from GPU fleet serving. |
| SGLang | Structured/programmatic generation and high-performance runtime. | Validate workload fit and operational maturity. |
| Ollama | Simple local model management and developer UX. | Convenience layer, not usually a high-scale control plane. |
Comparison Checklist
| Area | What to compare |
|---|---|
| Model support | Architecture, tokenizer, context length, adapters, MoE, multimodal. |
| Artifact format | safetensors, GGUF, TensorRT engine, quantized artifact. |
| API behavior | OpenAI-compatible requests, streaming, tool calls, errors, usage. |
| Performance | Time to first token (TTFT), inter-token latency (ITL), throughput, queue time, cache pressure, cost. |
| Quantization | AWQ, GPTQ, FP8, INT8, INT4, GGUF, KV-cache quantization. |
| Cache features | Paged KV, prefix caching, offload, sliding window, preemption. |
| Observability | Metrics, traces, logs, per-model/per-tenant labels. |
| Operations | Upgrade path, rolling deploys, canaries, rollback, debugging. |
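The latency metrics in the Performance row can be computed the same way for every engine if you record client-side timestamps for a streamed response. The sketch below is a minimal illustration under stated assumptions: `latency_summary` is a hypothetical helper, timestamps are in seconds, and each recorded arrival is treated as one token (some servers pack several tokens into one streamed chunk, which would need adjusting).

```python
from statistics import mean


def latency_summary(request_sent, token_times):
    """Compute TTFT, mean ITL, and per-request output rate from timestamps.

    request_sent: time the request was sent, in seconds.
    token_times:  arrival time of each streamed token, in order, in seconds.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # Time to first token: delay between sending and the first arrival.
    ttft = token_times[0] - request_sent
    # Inter-token latency: mean gap between consecutive arrivals.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    # Output rate for this single request, measured end to end.
    total = token_times[-1] - request_sent
    tokens_per_s = len(token_times) / total if total > 0 else float("inf")
    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_s": tokens_per_s}


# Hypothetical timestamps relative to the moment the request was sent.
summary = latency_summary(0.0, [0.5, 0.75, 1.0, 1.25])
print(summary)
```

For the sample timestamps this gives a TTFT of 0.5 s and a mean ITL of 0.25 s. Measuring on the client keeps the definition identical across engines, whose own metric names and boundaries may differ.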
Migration Plan
- Pick representative prompts, long contexts, streaming requests, tool calls, and overload cases.
- Render each prompt through both engines' chat templates and tokenizers, then compare the resulting token IDs.
- Match generation config and stop behavior.
- Run deterministic golden prompts.
- Run traffic-shape load tests.
- Compare metrics and logs.
- Canary with rollback.
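The token ID comparison step can be sketched as a plain diff over the IDs each engine produces for the same prompt. This is a minimal illustration, assuming you have already collected the IDs from both sides into dictionaries keyed by prompt name; `first_divergence` and `compare_prompts` are hypothetical helpers, and the IDs shown are made up.

```python
def first_divergence(ids_a, ids_b):
    """Return the index where two token ID sequences first differ, or None."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    # Same common prefix: a length mismatch diverges where the shorter one ends.
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None


def compare_prompts(rendered_a, rendered_b):
    """Compare {prompt name: token IDs} maps collected from two engines."""
    report = {}
    for name in sorted(set(rendered_a) | set(rendered_b)):
        if name not in rendered_a or name not in rendered_b:
            report[name] = "missing"
            continue
        idx = first_divergence(rendered_a[name], rendered_b[name])
        report[name] = "match" if idx is None else f"diverges at {idx}"
    return report


# Made-up token IDs standing in for real tokenizer output.
old_engine = {"greet": [1, 15, 27, 2], "tool": [1, 9, 9]}
new_engine = {"greet": [1, 15, 27, 2], "tool": [1, 9, 9, 2]}
print(compare_prompts(old_engine, new_engine))
```

A divergence at the very start or end of a sequence often points to special-token or chat-template differences rather than the tokenizer vocabulary, so the index is worth reading before rerunning any quality evaluation.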
Study Cards
Why can engine migration change model behavior?
Tokenization, chat templates, generation defaults, stop behavior, quantization, and adapter support can differ.
When is TensorRT-LLM attractive?
When an NVIDIA fleet needs optimized engine builds and high throughput for supported model shapes.
Why keep llama.cpp separate in an engine comparison?
It targets GGUF/local/offload workflows with different tradeoffs than GPU fleet servers.