PyTorch Fundamentals

PyTorch combines a tensor library, an automatic differentiation system, and an ecosystem of neural-network modules. Most training code is a loop over batches: run the model, compute the loss, backpropagate gradients, and update weights, with evaluation run periodically on held-out data.

Command Examples

python - <<'PY'
import torch
print(torch.__version__)
print(torch.cuda.is_available())
x = torch.randn(4, 8)
print(x.mean().item())
PY

Example output and meaning:

Command	Example output	What it does
`Python snippet`	PyTorch version, `True` or `False` for CUDA, and a numeric tensor mean.	Proves the framework imports, reports accelerator visibility, and executes tensor operations.

Core Objects

Object	Role
Tensor	N-dimensional numeric array with device and dtype.
`nn.Module`	Stateful model component containing parameters and submodules.
Parameter	Tensor registered as trainable model state.
Autograd graph	Dynamic graph PyTorch builds to compute gradients.
Loss	Scalar objective to minimize or maximize.
Optimizer	Updates parameters using gradients and optimizer state.
Dataset/DataLoader	Input examples and batching pipeline.
Checkpoint	Saved model, optimizer, scheduler, and training metadata.
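
To make the table concrete, the following minimal sketch shows how a module registers parameters and how they can be counted. `TinyNet` and its attribute names are hypothetical and chosen only for illustration; the layer sizes mirror the training loop used later in these notes.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical two-layer network used only to illustrate the objects above."""

    def __init__(self) -> None:
        super().__init__()
        # Submodules assigned as attributes are registered automatically.
        self.hidden = nn.Linear(10, 32)
        self.out = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(torch.relu(self.hidden(x)))


model = TinyNet()
# Each parameter is a tensor with requires_grad=True that an optimizer can update.
num_params = sum(p.numel() for p in model.parameters())
param_names = [name for name, _ in model.named_parameters()]
print(num_params)   # 10*32 + 32 + 32*1 + 1 = 385
print(param_names)
```

The names returned by `named_parameters()` are the same keys that appear in `model.state_dict()`, which is why renaming attributes can break checkpoint loading.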

Minimal Training Loop

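import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data (an assumption so the loop runs): 64 examples, 10 features, 1 target.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for features, target in dataloader:
    prediction = model(features)            # forward pass
    loss = loss_fn(prediction, target)      # scalar loss for this batch
    optimizer.zero_grad(set_to_none=True)   # clear gradients from the previous step
    loss.backward()                         # populate .grad on each parameter
    optimizer.step()                        # apply the parameter update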

This pattern hides important details: device movement, dtype, gradient accumulation, clipping, evaluation mode, checkpointing, and reproducibility.
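
As one possible way to fill in three of those details, the sketch below adds device movement, gradient accumulation, and gradient clipping to the same loop. The synthetic data, `accum_steps`, and `max_grad_norm` are assumptions made for illustration, not recommended values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic stand-in data (assumption): 64 examples, 10 features, 1 target.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

accum_steps = 4      # assumed: 4 micro-batches per optimizer step
max_grad_norm = 1.0  # assumed clipping threshold
optimizer_steps = 0

model.train()
optimizer.zero_grad(set_to_none=True)
for step, (features, target) in enumerate(dataloader, start=1):
    features, target = features.to(device), target.to(device)
    loss = loss_fn(model(features), target)
    # Divide so the accumulated gradient matches the mean over the larger effective batch.
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        # clip_grad_norm_ returns the total gradient norm measured before clipping.
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        optimizer_steps += 1

print(optimizer_steps, float(grad_norm))
```

With 8 micro-batches and `accum_steps = 4`, the optimizer steps twice per epoch. If the number of batches is not a multiple of `accum_steps`, the leftover gradients need an explicit final step or a deliberate decision to drop them.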

Training vs Evaluation

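model.train()
# Training mode: dropout is active and batch norm uses per-batch statistics.

model.eval()
# Evaluation mode: dropout is disabled and batch norm uses running statistics.
with torch.no_grad():  # skip autograd bookkeeping during inference
    prediction = model(batch)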

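model.train() and model.eval() switch modules such as dropout and batch norm between training and evaluation behavior; they do not affect gradient tracking. torch.no_grad() disables gradient tracking, which saves memory during inference and evaluation, but does not change module behavior. Evaluation code usually needs both.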
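
A small helper can make the mode switch hard to forget. The sketch below is one way to write it; `evaluate` is a hypothetical name, and the dropout model exists only to make the behavior observable.

```python
import torch
from torch import nn


def evaluate(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Run inference in eval mode without autograd, then restore the previous mode."""
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            return model(batch)
    finally:
        # Restore whatever mode the caller was in, even if the forward pass raised.
        model.train(was_training)


model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
batch = torch.randn(2, 4)

model.train()
first = evaluate(model, batch)
second = evaluate(model, batch)

print(torch.equal(first, second))  # True: dropout is inactive in eval mode
print(first.requires_grad)         # False: no graph was recorded
print(model.training)              # True: training mode was restored
```

Restoring the previous mode matters when validation runs in the middle of an epoch; otherwise the remaining training steps silently run with dropout disabled.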

Device and Dtype

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
features = features.to(device)

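All tensors participating in an operation must generally be on the same device. Mixing CPU and GPU tensors in one operation typically raises a RuntimeError, because PyTorch does not move data between devices implicitly. Slow paths usually come from explicit host-device transfers, such as repeated .to(), .cpu(), or .item() calls inside the training loop.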
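
The snippet above covers device placement only; dtype follows similar rules. The sketch below, which assumes PyTorch's default float32 dtype is unchanged, shows type promotion and the fact that `.to()` on a tensor returns a new tensor rather than modifying it in place.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.ones(2, 3)                        # float32 on CPU by default
y = torch.ones(2, 3, dtype=torch.float64)

# Type promotion picks the wider floating dtype for mixed-dtype arithmetic.
z = x + y
print(z.dtype)                              # torch.float64

# For tensors, .to() returns a converted copy; x itself is unchanged.
x_dev = x.to(device=device, dtype=torch.float16)
print(x.dtype, x_dev.dtype, x_dev.device.type)
```

This is why the snippet above reassigns `features = features.to(device)`, while `model.to(device)` needs no reassignment: modules are moved in place, tensors are not.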

Checkpointing

torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": epoch,
}, "checkpoint.pt")

Save enough state to resume or audit the run: architecture config, weights, optimizer, scheduler, tokenizer/preprocessor, random seeds, dataset version, and metrics.
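
Saving is only half of the workflow. The sketch below shows one way to resume from a checkpoint with the same keys as above; the temporary directory, the tiny `nn.Linear` model, and the epoch value are assumptions for illustration.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one step so the optimizer has state worth saving.
model(torch.randn(4, 10)).sum().backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": 3},
    path,
)

# Rebuild the objects first, then load the saved state into them.
resumed_model = nn.Linear(10, 1)
resumed_optimizer = torch.optim.AdamW(resumed_model.parameters(), lr=1e-3)

checkpoint = torch.load(path, map_location="cpu")
resumed_model.load_state_dict(checkpoint["model"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1

print(start_epoch)
```

Because `state_dict()` stores tensors rather than code, the architecture has to be reconstructed before loading, which is why the architecture config belongs in the checkpoint or alongside it.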

Training Loop Correctness Gaps

PyTorch gives direct control over the training loop, which means correctness depends on explicit handling of state, randomness, gradient flow, and evaluation behavior.

Gap	What To Fill	Operational Check
ML-GAP-024 Dataset and DataLoader Boundaries	Separate dataset indexing, transforms, collation, shuffling, worker state, and batch shape assumptions.	Test one batch end-to-end before starting a long run.
ML-GAP-025 Reproducibility Seeds	Seed Python, NumPy, PyTorch, workers, and distributed ranks while recording nondeterministic kernel settings.	Store seed, library versions, device type, and determinism flags with each run.
ML-GAP-026 Autograd Graph Lifetime	Know when graphs are freed, retained, detached, or accidentally expanded across iterations.	Watch memory growth and use `detach()` intentionally for recurrent or cached tensors.
ML-GAP-027 Gradient Accumulation	Accumulate gradients over several micro-batches when the desired effective batch size does not fit in memory.	Scale the loss or learning rate consistently and step the optimizer only at accumulation boundaries.
ML-GAP-028 Gradient Clipping	Clip exploding gradients for unstable sequence, RL, or fine-tuning workloads.	Log gradient norms before and after clipping.
ML-GAP-029 Optimizer State	Adam-like optimizers keep state that can exceed model weight memory.	Include optimizer memory in GPU budget and checkpoint size estimates.
ML-GAP-030 Learning Rate Schedules	Warmup, decay, cosine, step, or constant schedules change convergence and stability.	Plot learning rate against loss and validation metrics.
ML-GAP-031 AMP GradScaler	FP16 automatic mixed precision needs loss scaling so that small gradients do not underflow; the scaler skips optimizer steps when it finds `inf` or `nan` gradients.	Check skipped steps, scaler value, `nan` gradients, and dtype coverage.
ML-GAP-032 DistributedDataParallel	DDP needs correct rank setup, sampler sharding, gradient synchronization, and identical model graphs.	Confirm each rank sees a distinct data shard and reaches barriers.
ML-GAP-033 Checkpoint Resume State	Resume requires model, optimizer, scheduler, scaler, epoch/step, RNG, and data cursor state.	Resume from a checkpoint in CI or a smoke run, not only after an outage.
ML-GAP-034 Torch Compile and Export	`torch.compile`, TorchScript, ONNX, and export paths can alter supported ops, shapes, and debugging behavior.	Compare compiled/exported output against eager output on representative inputs.
ML-GAP-035 Evaluation Mode Hazards	Forgetting `eval()` or `no_grad()` changes dropout, batch norm, gradient memory, and reported metrics.	Wrap validation in a helper that sets mode and restores the previous mode.
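
As an illustration of ML-GAP-025, the sketch below seeds the Python and PyTorch generators and gives the DataLoader its own seeded generator so the shuffle order is repeatable. `seed_everything` and `first_batch` are hypothetical helpers; NumPy seeding, worker seeding, and determinism flags are left out to keep the sketch short and dependency-free.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the Python and PyTorch global generators."""
    random.seed(seed)
    torch.manual_seed(seed)


def first_batch(seed: int) -> torch.Tensor:
    """Return the first shuffled batch produced under a given seed."""
    seed_everything(seed)
    dataset = TensorDataset(torch.arange(16))
    # A dedicated generator makes the shuffle order depend only on the seed.
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=generator)
    return next(iter(loader))[0]


run_a = first_batch(123)
run_b = first_batch(123)
print(torch.equal(run_a, run_b))  # True: same seed, same first batch
```

Seeding alone does not guarantee bitwise-identical training on GPUs, which is why the table also asks for library versions, device type, and determinism flags to be recorded with each run.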

Common Failures

Symptom	Likely Cause
Loss is `nan`	Learning rate too high, bad labels, unstable precision, exploding gradients.
GPU unused	Tensors or model still on CPU, dataloader bottleneck, tiny batch.
Eval differs from training	Forgot `model.eval()`, data leakage, preprocessing mismatch.
Cannot resume	Saved only weights, not optimizer/scheduler/config.

Study Cards

Question

What does autograd compute?

Answer

Gradients of a scalar output, usually the loss, with respect to every tensor that requires gradients, so optimizers can update model parameters.

Question

Why call optimizer.zero_grad before backward?

Answer

PyTorch accumulates gradients by default, so old gradients must be cleared or intentionally accumulated.
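
The accumulation behavior is easy to observe on a single scalar. This is a minimal sketch; the variable names are arbitrary.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w * w).backward()      # d(w^2)/dw = 2w = 6
first = w.grad.item()

(w * w).backward()      # a second backward adds to w.grad instead of replacing it
second = w.grad.item()

w.grad = None           # what zero_grad(set_to_none=True) does for each parameter
print(first, second)    # 6.0 12.0
```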

Question

What does model.eval change?

Answer

It switches modules such as dropout and batch norm to evaluation behavior.

Question

Why save optimizer state in checkpoints?

Answer

Optimizers such as Adam keep momentum-like state needed to resume training faithfully.
