Tech Study Guide
Deep Learning Fundamentals
Deep learning fundamentals: neural networks, layers, activations, embeddings, CNNs, RNNs, transformers, optimizers, normalization, regularization, initialization, and training stability.
Deep learning stacks differentiable layers so a model can learn representations from raw or lightly processed data. The same core mechanics appear in vision, speech, recommendation, language, and multimodal systems.
Command Examples
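```python
import torch
from torch import nn

# Two-layer MLP: 8 input features -> 16 hidden units -> 2 outputs.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

x = torch.randn(4, 8)  # batch of 4 examples, 8 features each
print(model(x).shape)  # torch.Size([4, 2])
```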
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| `print(model(x).shape)` | `torch.Size([4, 2])` | Confirms the forward pass runs and maps a batch of 4 inputs to 2 outputs each, so the example produces measurable output instead of silent success. |
Core Building Blocks
| Block | Role |
|---|---|
| Linear layer | Weighted projection from inputs to outputs. |
| Activation | Adds nonlinearity so networks can model complex functions. |
| Embedding | Learned vector lookup for tokens, categories, users, or items. |
| Convolution | Local pattern detector for grids and signals. |
| Recurrent layer | Processes sequence state over time. |
| Attention | Mixes information across positions by learned relevance. |
| Normalization | Stabilizes activations or residual streams. |
| Dropout | Regularizes by randomly masking activations during training. |
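To make the table concrete, the following sketch chains one instance of each block and tracks tensor shapes. The sizes (vocabulary of 100, width 16, 4 heads) and the order in which the blocks are combined are arbitrary assumptions chosen for illustration, not a recommended architecture.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not taken from the guide).
batch, seq_len, dim, vocab = 4, 10, 16, 100
token_ids = torch.randint(0, vocab, (batch, seq_len))

embedding = nn.Embedding(num_embeddings=vocab, embedding_dim=dim)
linear = nn.Linear(dim, dim)
activation = nn.ReLU()
conv = nn.Conv1d(in_channels=dim, out_channels=dim, kernel_size=3, padding=1)
recurrent = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
attention = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
norm = nn.LayerNorm(dim)
dropout = nn.Dropout(p=0.1)

h = embedding(token_ids)        # (batch, seq_len, dim): one learned vector per ID
h = activation(linear(h))       # (batch, seq_len, dim): projection plus nonlinearity

# Conv1d expects (batch, channels, length), so swap the last two axes around it.
h_conv = conv(h.transpose(1, 2)).transpose(1, 2)

# The GRU returns per-step outputs and the final hidden state.
h_rnn, last_state = recurrent(h_conv)

# Self-attention: queries, keys, and values all come from the same sequence.
h_attn, attn_weights = attention(h_rnn, h_rnn, h_rnn)

# Residual add, then normalization, then dropout (active in training mode).
out = dropout(norm(h_rnn + h_attn))

print(out.shape, attn_weights.shape, last_state.shape)
```

Every block here preserves the `(batch, seq_len, dim)` layout, which is why they compose freely; the attention weights have one row per query position and one column per key position.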
Training Stability
| Issue | Signal | Control |
|---|---|---|
| Exploding gradients | NaN loss, huge gradient norm. | Gradient clipping, lower LR, normalization. |
| Vanishing gradients | No learning in early layers. | Residual connections, better initialization, normalization. |
| Overfitting | Training loss improves while validation loss worsens. | More data, regularization, early stopping. |
| Poor initialization | Slow or unstable start. | Established initializers (Xavier/Glorot, He/Kaiming) and architecture defaults. |
| Bad schedule | Loss spikes or plateaus. | Warmup, cosine decay, step decay, LR sweep. |
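Two of the controls in the table, warmup with cosine decay and gradient clipping by global norm, can be sketched in plain Python. The function names `lr_at_step` and `clip_by_global_norm` and all default values are hypothetical choices for illustration; in practice a framework's built-in scheduler and clipping utility would normally be used instead.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the clipped values and the norm measured before clipping,
    which is the number worth logging.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

schedule = [lr_at_step(s) for s in range(1000)]
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(schedule[0], max(schedule), schedule[-1])
print(norm_before, clipped)
```

Plotting `schedule` before a run is a cheap way to catch a warmup that is too short or a decay that reaches the floor too early. Clipping rescales the whole gradient vector, so its direction is preserved and only its length is capped.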
Architecture Families
| Family | Typical Use |
|---|---|
| MLP | Tabular, embeddings, simple features. |
| CNN | Images, audio spectrograms, local signal patterns. |
| RNN/LSTM/GRU | Legacy sequence and time-series workloads. |
| Transformer | Language, code, multimodal, long-context modeling. |
| Diffusion network | Generative media through denoising. |
Practical Lab: Training Stability Checklist
Before a long run:
one batch overfits: yes/no
gradient norms logged: yes/no
validation split clean: yes/no
checkpoint resume tested: yes/no
learning-rate schedule plotted: yes/no
eval mode used for validation: yes/no
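Several checklist items can be exercised together in a short smoke test. The sketch below, which assumes a tiny classifier and random data purely for illustration, tries to overfit a single batch, logs the gradient norm at every step, and switches to eval mode before measuring the final loss. The hyperparameters (learning rate, step count, clipping threshold) are arbitrary.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# Tiny model and a single fixed batch (illustrative assumptions).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 8)
y = torch.tensor([0, 1, 0, 1])

model.train()  # dropout active while fitting
history = []
for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # clip_grad_norm_ returns the total norm measured before clipping,
    # so one call both logs and controls the gradient scale.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    history.append((loss.item(), grad_norm.item()))

model.eval()  # dropout disabled for evaluation
with torch.no_grad():
    eval_loss = loss_fn(model(x), y).item()

all_finite = all(math.isfinite(g) for _, g in history)
print(f"first loss {history[0][0]:.4f}, eval loss {eval_loss:.4f}, finite norms: {all_finite}")
```

If the loss on one memorized batch does not fall well below its starting value, the problem is usually in the model, loss, or optimizer wiring, and a long run would only spend compute confirming it.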
Study Cards
Why do neural networks need activations?
Without nonlinear activations, stacked linear layers collapse into another linear function.
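The collapse can be checked numerically. In this sketch two `nn.Linear` layers are stacked without an activation and compared against a single affine map built from their weights, using W = W2·W1 and b = W2·b1 + b2. The layer sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

f1 = nn.Linear(8, 16)
f2 = nn.Linear(16, 2)
x = torch.randn(4, 8)

with torch.no_grad():
    # Two stacked linear layers with no activation in between.
    stacked = f2(f1(x))

    # The same function written as one affine map.
    W = f2.weight @ f1.weight            # shape (2, 8)
    b = f2.weight @ f1.bias + f2.bias    # shape (2,)
    collapsed = x @ W.T + b

    # Inserting a ReLU breaks the equivalence.
    with_relu = f2(torch.relu(f1(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))
print(torch.allclose(with_relu, collapsed, atol=1e-5))
```

The 16-unit hidden layer adds parameters but no expressive power until a nonlinearity sits between the two projections.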
What does normalization help with?
It stabilizes training by controlling activation or residual-stream scale.
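As a minimal illustration of scale control, the sketch below implements the core of layer normalization for a single feature vector in plain Python. It omits the learned gain and bias that library implementations usually include, and the helper name `layer_norm` and the input values are made up for the example.

```python
import math

def layer_norm(values, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost exactly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

small = [1.0, -2.0, 3.0, 0.5]
large = [v * 100 for v in small]  # same pattern, 100x the scale

norm_small = layer_norm(small)
norm_large = layer_norm(large)

print(norm_small)
print(norm_large)
```

Both inputs normalize to nearly the same vector, so a layer downstream sees a consistent scale even when upstream activations grow or shrink during training.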
Why use residual connections?
They improve gradient flow and make deep networks easier to optimize.
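A scalar toy model shows the effect on gradient flow. The sketch below applies the chain rule through a stack of identical `tanh` layers, with and without a skip connection; the depth, weight, and helper name `input_gradient` are illustrative assumptions, and a real network would have matrices instead of scalars.

```python
import math

def input_gradient(depth, weight, x, residual):
    """Return d(output)/d(input) for a scalar chain of tanh layers."""
    h, grad = x, 1.0
    for _ in range(depth):
        pre = weight * h
        # Derivative of tanh(weight * h) with respect to h.
        local = weight * (1.0 - math.tanh(pre) ** 2)
        if residual:
            grad *= 1.0 + local       # layer computes h + tanh(weight * h)
            h = h + math.tanh(pre)
        else:
            grad *= local             # layer computes tanh(weight * h)
            h = math.tanh(pre)
    return grad

plain = input_gradient(depth=30, weight=0.5, x=1.0, residual=False)
skip = input_gradient(depth=30, weight=0.5, x=1.0, residual=True)

print(f"plain: {plain:.3e}, residual: {skip:.3e}")
```

Without the skip path each layer multiplies the gradient by a factor of at most 0.5, so it shrinks geometrically with depth. With the skip path each factor is at least 1 in this toy setup, so the signal reaching the input cannot vanish.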