Tech Study Guide
LLM Training Lifecycle
Covers the LLM lifecycle from pretraining and continued pretraining through SFT, RLHF, DPO, and RLAIF, along with preference data, synthetic data, data filtering, and stage-specific evaluation.
LLM behavior is built in stages. Each stage changes the data, the objective, the evaluation, and the risks. A production team needs to know which stage created the behavior it is trying to preserve or change, because that determines which dataset or training run to revisit.
Lifecycle Stages
| Stage | Objective | Main Risk |
|---|---|---|
| Pretraining | Learn broad language/statistical structure from large corpora. | Data quality, memorization, contamination, compute cost. |
| Continued pretraining | Adapt base model to domain distribution. | Forgetting, domain overfit, data licensing. |
| Supervised fine-tuning | Teach instruction following and task format. | Bad examples, template drift, eval leakage. |
| Preference tuning | Prefer better answers over worse answers. | Labeler bias, reward hacking, oversmoothing. |
| RLHF | Train a reward model on human preferences, then optimize the policy against it with RL. | Training instability and reward-model misspecification. |
| DPO | Optimize directly from preference pairs. | Pair quality and preference coverage. |
| RLAIF | Use AI feedback to scale preference signals. | Judge-model bias and correlated failure. |
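The DPO row above can be made concrete with the standard per-pair DPO loss. The sketch below assumes you already have summed sequence log-probabilities for the chosen and rejected answers under both the policy and a frozen reference model. The function name, argument names, and the default `beta` are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities (illustrative)."""
    # How far the policy has moved from the reference on each answer.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # Scaled margin: positive when the policy favors the chosen answer
    # more strongly than the reference does.
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen answer relative to the rejected one, which is why pair quality and coverage dominate the outcome.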
Data Filtering
Data filtering removes:
- duplicates and near-duplicates,
- unsafe or disallowed data,
- low-quality boilerplate,
- private or secret material,
- eval contamination,
- malformed examples,
- language or domain outliers when not intended.
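A few of the filters above can be sketched in a single pass. This is a minimal illustration, assuming documents are plain strings: it covers malformed (too short) examples, exact duplicates after whitespace and case normalization, and eval contamination by shared word n-grams. Near-duplicate detection, safety filtering, and secret scanning need dedicated tooling and are not shown. The function names and thresholds (`n`, `min_chars`) are hypothetical.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def ngrams(text, n):
    """Set of word n-grams from the normalized text."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_corpus(docs, eval_texts, n=8, min_chars=20):
    """Return (kept_docs, drop_counts) after three simple filters."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    seen = set()
    kept = []
    dropped = {"malformed": 0, "duplicate": 0, "contaminated": 0}
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:          # too short to be a usable example
            dropped["malformed"] += 1
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:                 # exact duplicate after normalization
            dropped["duplicate"] += 1
            continue
        seen.add(digest)
        if ngrams(doc, n) & eval_grams:    # shares an n-gram with the eval set
            dropped["contaminated"] += 1
            continue
        kept.append(doc)
    return kept, dropped
```

Returning the drop counts alongside the kept documents makes it easier to notice when one filter removes far more data than expected.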
Synthetic Data
Synthetic data can fill coverage gaps, but it can also amplify model errors or create narrow, unrealistic patterns. Treat generated examples as candidates that need review, filtering, and eval.
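One cheap signal for narrow patterns is how often generated examples share the same opening words. The sketch below is a heuristic under that assumption only; the function name and the choice of `k` are hypothetical, and a high value is a prompt for manual review rather than a verdict.

```python
from collections import Counter

def opening_concentration(examples, k=3):
    """Share of non-empty examples that start with the most common k-word opening."""
    openings = Counter(
        " ".join(example.lower().split()[:k])
        for example in examples
        if example.strip()
    )
    total = sum(openings.values())
    if total == 0:
        return 0.0
    # Fraction of the batch that begins with the single most frequent opening.
    return openings.most_common(1)[0][1] / total
```

For example, a batch where three of four candidates begin with "Sure, here is" scores 0.75, which suggests the generator has collapsed onto one template.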
Stage-Specific Evaluation
| Stage | Eval Focus |
|---|---|
| Pretraining | Perplexity, contamination, broad capability probes. |
| Continued pretraining | Domain understanding and general regression. |
| SFT | Instruction following, format, refusal, task success. |
| Preference tuning | Win rate, safety, helpfulness, calibration. |
| Release | Golden set, red team, latency, cost, rollback. |
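Two of the metrics in the table reduce to short formulas. The sketch below assumes per-token natural-log probabilities for perplexity and a list of pairwise judgments for win rate. Counting a tie as half a win is one common convention, used here as an assumption; the function names are illustrative.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def win_rate(outcomes):
    """outcomes: list of 'win', 'loss', or 'tie'; a tie counts as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[o] for o in outcomes) / len(outcomes)
```

Lower perplexity means the model assigns higher probability to the held-out text; a win rate is only meaningful relative to the baseline model and the judge used to produce the outcomes.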
Practical Lab: Training Stage Audit
model_candidate:
base_model:
continued_pretraining_data:
sft_dataset:
preference_dataset:
chat_template:
eval_report:
safety_report:
rollback_artifacts:
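A filled-in audit record can be checked mechanically before review. The sketch below assumes the record has been loaded into a Python dict keyed by the field names above; the `missing_fields` helper is hypothetical.

```python
REQUIRED_FIELDS = [
    "model_candidate",
    "base_model",
    "continued_pretraining_data",
    "sft_dataset",
    "preference_dataset",
    "chat_template",
    "eval_report",
    "safety_report",
    "rollback_artifacts",
]

def missing_fields(audit):
    """Return the required fields that are absent or blank, in template order."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(audit.get(field) or "").strip()
    ]
```

An empty result means every stage of the lifecycle has a named artifact; any field it returns points to a stage whose data or evidence cannot yet be traced.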
Study Cards
What does continued pretraining usually adapt?
It adapts a base model to a domain distribution before task-specific instruction tuning.
Why is synthetic data risky?
It can amplify model errors, narrow the style of the data, or introduce unrealistic patterns if it is not reviewed and evaluated.
How is DPO different from RLHF at a high level?
DPO optimizes the policy directly on preference pairs with a classification-style loss, without training a separate reward model or running an online RL loop.