Quantized Serving

Quantization represents model data with fewer bits to reduce memory, bandwidth, and cost. In serving, quantization is a release decision because it can change quality, latency, safety behavior, kernel compatibility, and rollback behavior.
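
The memory claim can be checked with simple arithmetic. The sketch below assumes a hypothetical 7-billion-parameter model and counts weight storage only; it ignores quantization scales and zero-points, activations, and the KV cache, so real footprints will be somewhat larger.

```python
# Approximate weight storage at several bit widths.
# Assumption: a hypothetical 7B-parameter model; metadata such as
# per-group scales is ignored, so these are lower bounds.
BITS_PER_WEIGHT = {"fp32": 32, "bf16": 16, "int8": 8, "int4": 4}


def weight_gib(num_params: int, bits: int) -> float:
    """Return weight storage in GiB for num_params at the given bit width."""
    return num_params * bits / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # hypothetical model size
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gib(NUM_PARAMS, bits):.2f} GiB")
```

Halving the bit width halves weight storage, but it says nothing about quality or latency, which is why the gates later in this note still apply.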

Command Examples

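# Print the dtype names used when comparing weight and KV-cache precision.
python - <<'PY'
import torch
for dtype in [torch.float32, torch.bfloat16, torch.float16, torch.int8]:
    print(dtype)
PY
# List the quantization-related flags this vLLM build exposes.
vllm serve <model> --help | grep -i quant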

Example output and meaning:

Command	Example output	What it does
Python snippet	`torch.float32`, `torch.bfloat16`, `torch.float16`, and `torch.int8`.	Confirms dtype names before comparing weight and KV-cache precision choices.
`vllm serve <model> --help | grep -i quant`	Quantization flags and supported option names.	Verifies the runtime exposes the quantization mode you plan to use.

Do not assume a runtime supports every quantized artifact, or that a supported artifact is quality-equivalent to the full-precision model.

Quantization Types

Type	What Changes	Main Benefit	Main Risk
Weight-only INT8/INT4	Model weights.	Lower model memory.	Quality loss and kernel support.
Activation quantization	Intermediate tensors.	Faster kernels on supported hardware.	Calibration sensitivity.
KV-cache quantization	Cached K/V tensors.	More context or concurrency.	Decode quality and kernel compatibility.
FP8	Weights, activations, or both, stored in 8-bit floating point.	High throughput on modern GPUs.	Hardware and calibration requirements.
GGUF quant levels	llama.cpp-family weights.	Local and edge memory reduction.	Quant choice affects speed and quality.
QLoRA	Quantized base for adapter training.	Lower fine-tuning memory.	Merge and serving compatibility.
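
To make the "quality loss" risk in the table concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in plain Python. It is an illustration under simplifying assumptions (one scale for the whole tensor, round-to-nearest, made-up weights), not how AWQ, GPTQ, or any particular runtime implements it.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.003, 0.45]  # made-up values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

The small weight `0.003` rounds to zero here. Per-weight error is bounded by half the scale, so a single large outlier widens the scale and costs precision for every other weight, which is one reason per-channel or per-group scales are common in practice.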

Common Schemes

Scheme	Where Seen	Notes
AWQ	LLM weight quantization.	Activation-aware Weight Quantization: a weight-only scheme that uses activation statistics to decide which weights to protect.
GPTQ	Post-training weight quantization.	Artifact/runtime compatibility matters.
bitsandbytes	Training and inference workflows.	Convenient, but not always the fastest production path.
GGUF Q-types	llama.cpp.	Choose by model, hardware, and quality tolerance.
FP8	H100/H200/B200-class paths and optimized engines.	Requires exact hardware/runtime validation.

Quality Gates

Gate	Why
Golden prompts	Catch obvious behavior drift.
Structured output	Quantization can break exact schemas.
Long-context eval	Small numeric drift can compound.
Safety/refusal eval	Safety behavior can shift.
RAG faithfulness	Answers may become less grounded.
Tool-call eval	Arguments and formatting must remain valid.
Latency/load test	Smaller memory does not always mean faster serving.
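
The gates above can be turned into an automated release check. The sketch below compares a quantized candidate against the full-precision baseline and reports which gates regressed beyond a tolerance. The metric names, tolerances, and numbers are hypothetical placeholders; real values should come from your own eval suite.

```python
# Hypothetical gates: metric name -> (which direction is better, allowed regression).
GATES = {
    "golden_prompt_pass_rate": ("higher", 0.01),
    "schema_valid_rate": ("higher", 0.00),
    "refusal_accuracy": ("higher", 0.01),
    "p95_ttft_ms": ("lower", 25.0),
}


def failed_gates(baseline, candidate, gates=GATES):
    """Return the names of metrics where the candidate regressed past tolerance."""
    failures = []
    for name, (direction, tolerance) in gates.items():
        delta = candidate[name] - baseline[name]
        # A regression is a drop for "higher is better", a rise for "lower is better".
        regression = -delta if direction == "higher" else delta
        if regression > tolerance:
            failures.append(name)
    return failures


# Made-up measurements for illustration.
baseline = {"golden_prompt_pass_rate": 0.98, "schema_valid_rate": 1.00,
            "refusal_accuracy": 0.97, "p95_ttft_ms": 410.0}
candidate = {"golden_prompt_pass_rate": 0.975, "schema_valid_rate": 0.99,
             "refusal_accuracy": 0.97, "p95_ttft_ms": 380.0}

print(failed_gates(baseline, candidate))
```

In this made-up run the candidate is faster but fails the zero-tolerance schema gate, so it would not ship even though latency improved.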

Rollout Flow

Baseline the full-precision model on the exact evals and traffic shapes planned for the quantized model.
Quantize the model or select a quantized artifact.
Verify tokenizer, chat template, adapter, and generation config compatibility.
Run quality, safety, long-context, and structured-output evals.
Run load tests for TTFT, ITL, throughput, cache usage, and errors.
Canary by route or tenant.
Preserve rollback artifact and runtime flags.
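
One way to canary by tenant is a deterministic hash bucket, so a given tenant always lands on the same backend for the length of the experiment. This is a minimal sketch; the salt, the backend names, and the tenant IDs are hypothetical, and a real router would also need the route-level overrides your platform provides.

```python
import hashlib


def in_canary(tenant_id: str, percent: float, salt: str = "quant-canary-v1") -> bool:
    """Deterministically place a tenant in the canary for the given percentage."""
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # percent=5 selects buckets 0..499


def pick_backend(tenant_id: str, percent: float) -> str:
    """Return the hypothetical backend name that should serve this tenant."""
    return "quantized" if in_canary(tenant_id, percent) else "baseline"


tenants = [f"tenant-{i}" for i in range(10_000)]  # made-up tenant IDs
share = sum(in_canary(t, 5) for t in tenants) / len(tenants)
print(f"canary share at 5%: {share:.3f}")
```

Setting the percentage to 0 sends every tenant back to the baseline, which only works as a rollback if the baseline artifact and its runtime flags were preserved.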

Study Cards

Question

Why is quantized serving not just a memory toggle?

Answer

It can change quality, safety, latency, kernel support, and rollback compatibility.

Question

How is KV-cache quantization different from weight quantization?

Answer

Weight quantization compresses parameters; KV-cache quantization compresses per-request attention state.
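
The per-request nature of the KV cache shows up in the arithmetic: its size grows with tokens in flight, not with parameter count. The sketch below assumes a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, and it ignores any scale metadata a quantized cache format would add.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of cached K and V state added by one token."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(tokens, **config):
    """Total KV-cache size in GiB for a given number of cached tokens."""
    return tokens * kv_bytes_per_token(**config) / 2**30


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for label, width in [("16-bit", 2), ("8-bit", 1)]:
    per_token = kv_bytes_per_token(bytes_per_element=width, **shape)
    total = kv_cache_gib(8192, bytes_per_element=width, **shape)
    print(f"{label}: {per_token} bytes/token, {total:.2f} GiB for 8192 tokens")
```

Under these assumptions an 8-bit cache fits twice as many cached tokens in the same memory, which can be spent on longer context or on more concurrent requests.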

Question

Why test structured output after quantization?

Answer

Small numeric changes can break formatting, schemas, tool calls, or stop behavior.
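
A structured-output check can be as simple as parsing each response and validating it against the expected shape. The sketch below validates a tool call encoded as JSON; the tool name `get_forecast`, its arguments, and the sample outputs are hypothetical, and a production eval would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical tool schema: argument name -> required Python type.
REQUIRED_ARGS = {"city": str, "days": int}


def valid_tool_call(raw: str, tool_name: str = "get_forecast") -> bool:
    """Return True if raw is a well-formed call to tool_name with exact arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # truncated or malformed JSON
    if not isinstance(call, dict) or call.get("name") != tool_name:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(REQUIRED_ARGS):
        return False  # missing or unexpected argument names
    # Strict type match, so a JSON boolean is not accepted as an integer.
    return all(type(args[k]) is t for k, t in REQUIRED_ARGS.items())


# Made-up model outputs for illustration.
samples = [
    '{"name": "get_forecast", "arguments": {"city": "Oslo", "days": 3}}',
    '{"name": "get_forecast", "arguments": {"city": "Oslo", "days": "3"}}',
    '{"name": "get_forecast", "arguments": {"city": "Oslo", "days": 3}',
]
valid_rate = sum(valid_tool_call(s) for s in samples) / len(samples)
print(f"schema-valid rate: {valid_rate:.2f}")
```

Running the same prompts through the baseline and the quantized model and comparing the valid rate gives a direct measure of formatting drift.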
