ML Performance Engineering

ML performance work starts with measurement. Without it, guessing often optimizes the wrong layer. The real bottleneck may be Python overhead, the input pipeline, GPU kernels, memory bandwidth, collective communication, or request queueing, and any fix still has to respect model quality constraints.

Command Examples

nvidia-smi dmon
python -m torch.utils.bottleneck train.py

Example output and meaning:

Command | Example output | What it does
nvidia-smi dmon | One row per GPU per sampling interval: power draw, temperature, SM and memory utilization, encoder/decoder utilization, and clocks. | Streams device-level counters, so you can see whether the GPU is busy or idle while the job runs.
python -m torch.utils.bottleneck train.py | An environment summary, followed by cProfile and autograd profiler summaries for one run of the script. | Gives a first-pass view of whether time goes to Python code or to operators.
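
The tools above report what the device and the script are doing; a small timing harness turns that into numbers you can compare between runs. The sketch below is a minimal example, not a standard API: `benchmark`, `fake_step`, and the `sync` hook are hypothetical names. It assumes the work is wrapped in a zero-argument callable. For GPU work, where kernels are launched asynchronously, you would pass a synchronization function (for example `torch.cuda.synchronize` in PyTorch) as `sync` so the clock stops only after the device has finished.

```python
import statistics
import time


def benchmark(step, items_per_step, warmup=3, iters=20, sync=None):
    """Time `step()` after a warmup and return throughput and p95 latency.

    `sync` is an optional callable that blocks until queued asynchronous
    work has finished (an assumption: needed only for async backends).
    """
    # Warmup runs absorb one-time costs such as caching and lazy initialization.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for outstanding work before stopping the clock
        latencies.append(time.perf_counter() - start)

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=100)[94]
    return {
        "throughput": items_per_step * iters / sum(latencies),  # items per second
        "p95_latency": p95,                                     # seconds
    }


def fake_step():
    # Stand-in workload; replace with one training or inference step.
    sum(i * i for i in range(10_000))


report = benchmark(fake_step, items_per_step=32)
print(report)
```

Reporting a tail percentile alongside throughput matters because a change can raise average speed while making the slowest steps worse.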

Bottleneck Map

Bottleneck | Signal | Lever
CPU input pipeline | GPU idle, CPU busy. | More workers, caching, prefetch, vectorized transforms.
Host-device copies | Copy time high. | Pinned memory, async copies, move transforms to GPU.
Kernel overhead | Many tiny kernels. | Fuse ops, compile, larger batches.
Memory bandwidth | FLOPS low, bandwidth high. | Better layout, fused kernels, lower precision.
Activation memory | OOM in training. | Activation checkpointing, sequence/batch reduction.
Collectives | Distributed stalls. | Topology, bucket sizes, faster interconnect.
Inference queue | High TTFT. | Autoscale, admission control, batching policy.
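
The memory bandwidth row can be checked with roofline-style arithmetic: compare an operation's arithmetic intensity (FLOPs per byte moved) with the device's ratio of peak compute to peak bandwidth. The sketch below is illustrative only. `classify_kernel` is a hypothetical helper, and the peak figures are made-up round numbers rather than the specification of any real accelerator.

```python
def classify_kernel(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline estimate: is the operation limited by compute or by memory traffic?"""
    intensity = flops / bytes_moved        # FLOPs per byte for this operation
    ridge = peak_flops / peak_bandwidth    # intensity at which the two limits meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute" if intensity >= ridge else "memory bandwidth"
    return {
        "intensity": intensity,
        "ridge": ridge,
        "attainable_flops": attainable,
        "bound": bound,
    }


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

# Elementwise add of two fp32 vectors: 1 FLOP per element,
# 12 bytes per element (two 4-byte reads, one 4-byte write).
n = 1_000_000
vector_add = classify_kernel(n, 12 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# Square fp16 matmul: 2*m^3 FLOPs, three m-by-m matrices of 2-byte values,
# assuming each matrix crosses the memory bus exactly once.
m = 4096
matmul = classify_kernel(2 * m**3, 3 * m * m * 2, PEAK_FLOPS, PEAK_BANDWIDTH)

print(vector_add["bound"], matmul["bound"])
```

An operation that lands on the memory-bound side gains little from faster math units, which is why the levers in that row target data movement instead.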

Advanced Techniques

Technique | Helps | Risk
FlashAttention | Efficient attention memory and speed. | Kernel/version compatibility.
Fused kernels | Less launch overhead and memory traffic. | Debugging and portability.
torch.compile | Graph optimization. | Dynamic shape and op support issues.
Activation checkpointing | Lower training memory. | Extra compute.
Quantization | Lower inference memory/cost. | Quality and calibration.
Speculative decoding | Lower decode latency. | Acceptance rate and overhead.
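
To make the quantization row concrete, here is a minimal sketch of symmetric int8 quantization with a single absmax scale per tensor. It is one simple scheme among many and is not how any particular library does it. `quantize_int8`, `dequantize`, and `max_error` are hypothetical helpers, and the weight values are invented. It shows why calibration matters: one outlier stretches the scale and raises the rounding error for every other value.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0  # avoid dividing by zero for all-zero input
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


def max_error(values):
    """Largest absolute round-trip error after quantize then dequantize."""
    quantized, scale = quantize_int8(values)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.023, -0.517, 0.314, 1.27]   # invented values for illustration
with_outlier = weights + [50.0]          # one large value stretches the scale

print("max error, clean:       ", max_error(weights))
print("max error, with outlier:", max_error(with_outlier))
```

The memory saving is the same in both cases, so the quality cost only becomes visible when outputs are compared against the unquantized baseline.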

Practical Lab: Performance Report

baseline:
  throughput:
  p95_latency:
  gpu_utilization:
  memory_used:
candidate:
  change:
  throughput:
  p95_latency:
  quality_delta:
decision:

Never accept a performance gain without a quality and regression check.
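
The rule above can be made mechanical by filling the decision field of the report from explicit thresholds. The sketch below assumes a single scalar quality score where higher is better. `Run`, `decide`, and the default thresholds are hypothetical and should be set per project.

```python
from dataclasses import dataclass


@dataclass
class Run:
    throughput: float   # items per second, higher is better
    p95_latency: float  # seconds, lower is better
    quality: float      # eval score, higher is better (assumed scalar)


def decide(baseline, candidate, min_speedup=1.05,
           max_latency_regression=0.0, max_quality_drop=0.005):
    """Accept a candidate only if it is faster and does not regress latency or quality."""
    speedup = candidate.throughput / baseline.throughput
    latency_ratio = candidate.p95_latency / baseline.p95_latency
    quality_delta = candidate.quality - baseline.quality

    reasons = []
    if speedup < min_speedup:
        reasons.append(f"speedup {speedup:.2f}x is below {min_speedup:.2f}x")
    if latency_ratio > 1 + max_latency_regression:
        reasons.append(f"p95 latency regressed to {latency_ratio:.2f}x of baseline")
    if quality_delta < -max_quality_drop:
        reasons.append(f"quality dropped by {-quality_delta:.3f}")

    return {
        "decision": "reject" if reasons else "accept",
        "speedup": speedup,
        "quality_delta": quality_delta,
        "reasons": reasons,
    }


baseline = Run(throughput=1000.0, p95_latency=0.120, quality=0.82)
faster_but_worse = Run(throughput=1400.0, p95_latency=0.095, quality=0.80)
print(decide(baseline, faster_but_worse))
```

Recording the rejection reasons alongside the decision keeps the report useful when a change is revisited later.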

Study Cards

Question

Why profile before optimizing ML code?

Answer

The bottleneck may be input, CPU, copies, kernels, memory, communication, queueing, or decoding, and only measurement shows which one dominates.

Question

What does activation checkpointing trade?

Answer

It saves memory by recomputing activations, increasing compute time.

Question

Why pair performance tests with quality evals?

Answer

Optimization can change precision, decoding, batching, or context behavior in ways that affect outputs.
