Model Memory Math

Model memory math turns model size, precision, architecture, context length, and traffic shape into capacity estimates. Exact numbers depend on kernels and runtime layout, but rough math catches impossible deployments before they become GPU incidents.

Command Examples

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
python - <<'PY'
params_b = 70
bytes_per_param = 2
print(f"weights ~= {params_b * bytes_per_param:.0f} GB decimal")
PY

Example output and meaning:

Command	Example output	What it does
`nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv`	A CSV header line, then one row per visible GPU with its name, total memory, and used memory in MiB.	Shows how much device memory exists and how much is already in use before a model loads.
`Python snippet`	`weights ~= 140 GB decimal`	Multiplies 70 billion parameters by 2 bytes each (FP16/BF16) to get a rough weight footprint.
`params_b = 70`	No output; it is an input.	Sets the parameter count in billions; change it and `bytes_per_param` to match the model and precision.

These checks prove only the GPU memory budget and rough weight size. They do not include KV cache, temporary workspace, allocator overhead, fragmentation, replicas, or host RAM needed during load.
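
The two checks can be combined into one comparison. The following is a minimal sketch, assuming the query is run with `--format=csv,noheader,nounits` so that each row is just `name, total, used` with memory in MiB. The helper names `parse_gpu_csv` and `free_gb` and the sample row are illustrative assumptions, not output captured from a real machine.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse rows from
    `nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader,nounits`
    into dicts. Memory values are MiB."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        name, total_mib, used_mib = (field.strip() for field in row)
        gpus.append(
            {"name": name, "total_mib": int(total_mib), "used_mib": int(used_mib)}
        )
    return gpus


def free_gb(gpu):
    """Free device memory in decimal GB (1 MiB = 1024**2 bytes)."""
    return (gpu["total_mib"] - gpu["used_mib"]) * 1024**2 / 1e9


# Illustrative sample row (hypothetical), standing in for real command output.
sample = "NVIDIA A100-SXM4-80GB, 81920, 1024\n"

gpus = parse_gpu_csv(sample)
weights_gb = 70 * 2  # 70B parameters at 2 bytes each, as in the snippet above
total_free_gb = sum(free_gb(g) for g in gpus)
print(
    f"free ~= {total_free_gb:.1f} GB, weights ~= {weights_gb} GB, "
    f"weights fit: {total_free_gb > weights_gb}"
)
```

Even when this comparison says the weights fit, the caveats above still apply: the result says nothing about KV cache, workspace, or fragmentation.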

Memory Classes

flowchart LR
  Artifact[Model artifact] --> Load[Load replica]
  Load --> Weights[Weights on GPU]
  Request[Active requests] --> Prefill[Prefill tokens]
  Prefill --> KV[KV cache blocks]
  Prefill --> Workspace[Runtime workspace]
  KV --> Decode[Decode tokens]
  Decode --> KV
  Decode --> Complete[Free per-request KV]

Memory	When It Exists	Capacity Driver
Weights	While a replica is loaded.	Parameter count and bytes per parameter.
KV cache	During active autoregressive requests.	Layers, tokens, KV heads, head dim, dtype, active sequences.
Activations	During forward passes.	Batch shape, sequence length, kernels, runtime.
Workspace	Runtime-specific temporary buffers.	Engine, kernels, parallelism, compilation.
Optimizer state	Training or optimizer-backed fine-tuning.	Optimizer type and parameter count.
Host staging memory	During load and sharding.	Shard format, loader, CPU RAM, mmap behavior.

Weights vs KV Cache vs Activations

Question	Weights	KV Cache	Activations
Shared across requests?	Yes, one loaded replica shares weights.	No, cache is per active sequence, except where the runtime explicitly and safely reuses a shared prefix.	Mostly temporary per forward pass.
Biggest input variable	Parameter count and precision.	Context length, output length, KV heads, dtype, and concurrency.	Batch shape, sequence length, kernels, and model architecture.
Serving failure signal	Model will not load, or replica count is too high.	Requests queue, preempt, reject, or OOM under long contexts.	OOM or slow prefill from large batch/prompt shapes.
Quantization effect	Weight quantization can materially shrink base memory.	Only KV-specific quantization shrinks cache.	Activation quantization depends on kernels and hardware.
Capacity planning mistake	Assuming the model fits because weights fit.	Ignoring p95/p99 prompt and output tokens.	Forgetting workspace and temporary buffers during warmup.

Weight Memory

weight_memory ~= parameters * bytes_per_parameter

Model Size	FP16/BF16	FP8/INT8	INT4
7B	~14 GB	~7 GB	~3.5 GB
13B	~26 GB	~13 GB	~6.5 GB
34B	~68 GB	~34 GB	~17 GB
70B	~140 GB	~70 GB	~35 GB

Leave headroom. Runtime metadata, embeddings, temporary buffers, CUDA context, fragmentation, and KV cache all consume additional memory.
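
The table can be reproduced with a few lines of Python. This is a rough sketch: the `BYTES_PER_PARAM` mapping and the function name `weight_memory_gb` are assumptions made for illustration, and real quantized checkpoints usually carry extra scale and zero-point data that this estimate ignores.

```python
# Bytes per parameter for each precision in the table above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Weight memory in decimal GB: parameters * bytes_per_parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 13, 34, 70):
    cells = ", ".join(
        f"{p}: ~{weight_memory_gb(size, p):g} GB" for p in ("bf16", "int8", "int4")
    )
    print(f"{size}B -> {cells}")
```

Treat the result as a floor for weight memory, then add the headroom described above.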

KV-Cache Examples

kv_cache ~= layers * tokens * 2(K,V) * kv_heads * head_dim * bytes_per_value * active_sequences

Architecture Shape	Why It Changes Memory
Multi-head attention	KV heads usually match query heads; highest KV memory.
Grouped-query attention	Fewer KV heads than query heads; lower KV memory.
Multi-query attention	One or very few KV heads; lowest KV memory.
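Sliding-window attention	Keeps only a recent attention window for some layers, so KV memory for those layers is capped at the window size instead of growing with the full context.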
MoE	Expert layers change compute and weights, but attention KV still scales with active tokens.

Example for 32 layers, 8 KV heads, head dim 128, BF16, 8K tokens:

32 * 8192 * 2 * 8 * 128 * 2 bytes = 1,073,741,824 bytes = 1 GiB per active 8K-token sequence

The same context with 32 KV heads is roughly 4 GiB. This is why GQA/MQA can materially improve serving capacity.
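
The KV-cache formula above translates directly into code. The sketch below uses the same example shape (32 layers, head dim 128, BF16, 8K tokens) and varies only the number of KV heads. The function name `kv_cache_bytes` is an assumption for illustration, and paged or block-based runtimes may round allocations up to block boundaries.

```python
GIB = 1024**3


def kv_cache_bytes(layers, tokens, kv_heads, head_dim, bytes_per_value,
                   active_sequences=1):
    """KV cache size in bytes. The factor 2 covers the K and V tensors."""
    return (layers * tokens * 2 * kv_heads * head_dim
            * bytes_per_value * active_sequences)


# Same shape as the worked example; only the KV head count changes.
gqa = kv_cache_bytes(layers=32, tokens=8192, kv_heads=8, head_dim=128, bytes_per_value=2)
mha = kv_cache_bytes(layers=32, tokens=8192, kv_heads=32, head_dim=128, bytes_per_value=2)
mqa = kv_cache_bytes(layers=32, tokens=8192, kv_heads=1, head_dim=128, bytes_per_value=2)

for label, size in (("MHA, 32 KV heads", mha), ("GQA, 8 KV heads", gqa), ("MQA, 1 KV head", mqa)):
    print(f"{label}: {size / GIB:.3f} GiB per sequence")
```

Passing `active_sequences` scales the result linearly, which is why concurrency matters as much as context length.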

Capacity Worksheet

Input	Value
GPU memory per device
Number of GPUs
Model parameters
Weight precision
Tensor parallel size
Runtime headroom target	10-20 percent minimum
Layers
KV heads
Head dim
KV dtype
p95 prompt tokens
p95 output tokens
Target concurrent sequences

Compute in this order:

1. Estimate weight memory per replica.
2. Divide or shard weights according to tensor/pipeline parallelism.
3. Reserve runtime workspace and allocator headroom.
4. Estimate KV cache per active sequence by request class.
5. Multiply by concurrency and burst target.
6. Check whether long-context requests need a separate serving pool.
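
The worksheet can be sketched as a small calculation. Everything below is an illustrative assumption rather than a measured deployment: a hypothetical 70B-parameter BF16 model with 80 layers, 8 KV heads, and head dim 128, served on four 80 GiB GPUs with 15 percent headroom, with weights and KV cache treated as evenly sharded so that memory can be pooled across the tensor-parallel group.

```python
GIB = 1024**3


def kv_gib_per_sequence(layers, kv_heads, head_dim, bytes_per_value, tokens):
    """KV cache for one sequence in GiB (factor 2 covers K and V)."""
    return layers * tokens * 2 * kv_heads * head_dim * bytes_per_value / GIB


def max_concurrent_sequences(gpu_gib, num_gpus, weight_gib,
                             headroom_percent, kv_gib_per_seq):
    """Sequences that fit after reserving headroom and weights.

    Assumes even sharding across GPUs, so memory is treated as one pool.
    """
    total = gpu_gib * num_gpus
    reserved = total * headroom_percent / 100
    kv_budget = total - reserved - weight_gib
    if kv_budget <= 0:
        return 0  # weights plus headroom already exceed the memory budget
    return int(kv_budget // kv_gib_per_seq)


# Hypothetical inputs; replace with values from the worksheet.
weight_gib = 70e9 * 2 / GIB                # 70B parameters at 2 bytes each
kv = kv_gib_per_sequence(
    layers=80, kv_heads=8, head_dim=128, bytes_per_value=2,
    tokens=4096 + 512,                     # p95 prompt + p95 output tokens
)
n = max_concurrent_sequences(
    gpu_gib=80, num_gpus=4, weight_gib=weight_gib,
    headroom_percent=15, kv_gib_per_seq=kv,
)
print(f"weights ~= {weight_gib:.1f} GiB, KV ~= {kv:.3f} GiB/seq, max sequences ~= {n}")
```

Run it once per request class. If the long-context class drives the result toward zero, that is a signal to consider the separate serving pool mentioned in the last step.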

Loading and Cold Start

Issue	Practical Meaning
Shard loading	All shard files and index metadata must match the model revision.
Host RAM spike	Some loaders stage weights in CPU memory before GPU transfer.
mmap behavior	`safetensors` avoids pickle-based code execution and supports memory-mapped loading, but the runtime still needs enough host and GPU memory.
Warmup	First requests may compile kernels, allocate cache blocks, or populate CUDA context.
Tensor parallel load	Every rank needs the right shard, NCCL health, and matching config.
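
One way to observe the host RAM spike is to read the process's peak resident set size before and after a load. This is a minimal sketch using Python's standard `resource` module (Unix only). The `measure_load` helper and the stand-in loader are hypothetical; a real test would call the actual model loader. The measurement covers only the current process, not worker subprocesses.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_load(load_fn):
    """Call load_fn and return (result, peak before, peak after) in MiB."""
    before = peak_rss_mib()
    result = load_fn()
    after = peak_rss_mib()
    return result, before, after


# Stand-in for a real model loader: allocates about 64 MiB of host memory.
_, before, after = measure_load(lambda: bytearray(64 * 1024**2))
print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
```

Because `ru_maxrss` is a high-water mark, the "after" value keeps the spike even if the loader frees its staging buffers once the weights are on the GPU.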

Common Mistakes

Mistake	Better Check
Counting only weight memory.	Include KV cache, workspace, fragmentation, and CUDA context.
Using max context for all traffic.	Capacity plan by request class and p95/p99 token lengths.
Assuming INT4 weights mean small KV cache.	KV cache may still be BF16/FP16 unless separately quantized.
Ignoring host RAM.	Test cold start from empty cache and record peak host memory.
Using average tokens.	Plan around p95/p99 prompt and output tokens.
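
The gap between average and tail token counts is easy to show with a small example. The prompt lengths below are made up for illustration, the `percentile` helper uses the nearest-rank method, and the per-token KV size assumes the 32-layer, 8-KV-head, head-dim-128, BF16 shape used earlier in this page.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical traffic: mostly short prompts with a long tail.
prompt_tokens = [500] * 18 + [6000, 8000]

mean_tokens = sum(prompt_tokens) / len(prompt_tokens)
p95_tokens = percentile(prompt_tokens, 95)

# KV bytes per token: layers * 2 (K, V) * kv_heads * head_dim * bytes_per_value
kv_bytes_per_token = 32 * 2 * 8 * 128 * 2

MIB = 1024**2
print(f"mean: {mean_tokens:.0f} tokens -> {mean_tokens * kv_bytes_per_token / MIB:.2f} MiB KV")
print(f"p95:  {p95_tokens} tokens -> {p95_tokens * kv_bytes_per_token / MIB:.2f} MiB KV")
```

With this traffic shape, a plan sized from the mean would reserve only a fraction of the KV cache that a p95 request actually needs.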

Study Cards

Question

What memory is shared across requests?

Answer

Model weights are shared by a loaded replica; KV cache is per active sequence.

Question

Why can INT4 weights still OOM?

Answer

KV cache, activations, workspace, fragmentation, and long-context concurrency can exceed memory even when weights fit.

Question

Why does GQA reduce KV-cache memory?

Answer

Grouped-query attention uses fewer key/value heads than query heads, reducing cached K/V tensors.
