LLM Training Lifecycle

LLM behavior is built in stages. Each stage changes the training data, the objective, the evaluation, and the risks. A production team needs to know which stage produced a behavior before it tries to preserve or change that behavior.

Lifecycle Stages

Stage | Objective | Main Risk
Pretraining | Learn broad language/statistical structure from large corpora. | Data quality, memorization, contamination, compute cost.
Continued pretraining | Adapt base model to domain distribution. | Forgetting, domain overfit, data licensing.
Supervised fine-tuning | Teach instruction following and task format. | Bad examples, template drift, eval leakage.
Preference tuning | Prefer better answers over worse answers. | Labeler bias, reward hacking, oversmoothing.
RLHF | Optimize against a reward model with RL. | Instability and reward-model misspecification.
DPO | Optimize directly from preference pairs. | Pair quality and preference coverage.
RLAIF | Use AI feedback to scale preference signals. | Judge-model bias and correlated failure.

RLHF, DPO, and RLAIF are alternative methods for the preference-tuning stage rather than separate sequential stages.
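
To make "optimize directly from preference pairs" concrete, here is a minimal sketch of the standard DPO loss for a single pair. It assumes the summed log-probabilities of each answer under the policy and under a frozen reference model have already been computed. The function name, argument names, and the default beta=0.1 are illustrative choices, not values taken from this guide.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # answer over the rejected one, measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # loss = -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss falls as the policy raises the chosen answer relative to the rejected one, which is why pair quality and coverage dominate the outcome: the pairs are the only training signal.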

Data Filtering

Data filtering removes:

  • duplicates and near-duplicates,
  • unsafe or disallowed data,
  • low-quality boilerplate,
  • private or secret material,
  • eval contamination,
  • malformed examples,
  • out-of-scope languages or domains, unless they are intentionally included.
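
Three of the filters above (duplicates, eval contamination, malformed examples) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it catches only exact duplicates after normalization (near-duplicate detection usually needs a technique such as MinHash), it treats very short documents as malformed, and it flags contamination by shared word n-grams. The function names and the thresholds n=8 and min_tokens=5 are hypothetical.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_corpus(docs, eval_texts, n=8, min_tokens=5):
    """Return (kept_docs, drop_counts) after three simple filters."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)

    seen = set()
    kept = []
    dropped = {"duplicate": 0, "contaminated": 0, "malformed": 0}
    for doc in docs:
        norm = normalize(doc)
        # Assumption: documents this short are treated as malformed.
        if len(norm.split()) < min_tokens:
            dropped["malformed"] += 1
            continue
        # Exact-duplicate check on the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(digest)
        # Contamination check: any n-gram shared with an eval item.
        if ngrams(doc, n) & eval_grams:
            dropped["contaminated"] += 1
            continue
        kept.append(doc)
    return kept, dropped
```

Returning the drop counts alongside the kept documents makes it easy to log how much each filter removed, which helps when a filter turns out to be too aggressive.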

Synthetic Data

Synthetic data can fill coverage gaps, but it can also amplify model errors or create narrow, unrealistic patterns. Treat generated examples as candidates that need review, filtering, and eval.
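
One way to treat generated examples as candidates is to promote only reviewed items and to reject a batch that looks too uniform. The sketch below is an assumption-heavy illustration: the record fields (id, text), the reviewer-approved ID set, the opening-prefix heuristic for "narrow patterns", and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

def prefix_concentration(examples, k=3):
    """Share of examples that begin with the most common k-token prefix."""
    if not examples:
        return 0.0
    prefixes = Counter(" ".join(e.lower().split()[:k]) for e in examples)
    return prefixes.most_common(1)[0][1] / len(examples)

def gate_synthetic(candidates, approved_ids, max_concentration=0.5):
    """Accept reviewed candidates only if the batch is not too uniform."""
    # Keep only candidates that a reviewer explicitly approved.
    reviewed = [c for c in candidates if c["id"] in approved_ids]
    concentration = prefix_concentration([c["text"] for c in reviewed])
    # A high concentration suggests the generator fell into one template.
    accepted = reviewed if concentration <= max_concentration else []
    return {"accepted": accepted, "prefix_concentration": concentration}
```

A gate like this covers only review and a crude narrowness check. Accepted examples still need the same eval as any other training data.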

Stage-Specific Evaluation

Stage | Eval Focus
Pretraining | Perplexity, contamination, broad capability probes.
Continued pretraining | Domain understanding and general regression.
SFT | Instruction following, format, refusal, task success.
Preference tuning | Win rate, safety, helpfulness, calibration.
Release | Golden set, red team, latency, cost, rollback.
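
Win rate, listed above for preference tuning, is usually computed from pairwise judgments of a candidate model against a baseline. A minimal sketch follows, assuming each judgment is one of the labels "win", "loss", or "tie" and that a tie counts as half a win. Both the label set and the tie convention are assumptions; teams differ on how they handle ties.

```python
def win_rate(judgments):
    """Fraction of comparisons the candidate wins, counting ties as half."""
    counts = {"win": 0, "loss": 0, "tie": 0}
    for label in judgments:
        counts[label] += 1  # raises KeyError on an unexpected label
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total
```

For example, win_rate(["win", "win", "loss", "tie"]) returns 0.625. A win rate near 0.5 means the candidate is roughly indistinguishable from the baseline on that prompt set.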

Practical Lab: Training Stage Audit

model_candidate:
  base_model:
  continued_pretraining_data:
  sft_dataset:
  preference_dataset:
  chat_template:
  eval_report:
  safety_report:
  rollback_artifacts:
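
The audit template can be checked mechanically. The sketch below assumes the template has been loaded into a Python dict with the same field names and reports which fields are still empty. The function name and the example values are hypothetical.

```python
REQUIRED_FIELDS = [
    "base_model",
    "continued_pretraining_data",
    "sft_dataset",
    "preference_dataset",
    "chat_template",
    "eval_report",
    "safety_report",
    "rollback_artifacts",
]

def missing_audit_fields(model_candidate):
    """List the audit fields that are absent or empty for a candidate."""
    return [f for f in REQUIRED_FIELDS if not model_candidate.get(f)]

# Hypothetical, partially completed audit record.
candidate = {
    "base_model": "example-base-7b",
    "sft_dataset": "support-tickets-v2",
    "chat_template": "",
}
print(missing_audit_fields(candidate))
```

An empty result means every stage has a traceable artifact. A non-empty result shows which stage cannot yet be audited or rolled back.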

Study Cards

Question

What does continued pretraining usually adapt?

Answer

It adapts a base model to a domain distribution before task-specific instruction tuning.

Question

Why is synthetic data risky?

Answer

It can amplify model errors and introduce a narrow style or unrealistic patterns if generated examples are not reviewed, filtered, and evaluated.

Question

How is DPO different from RLHF at a high level?

Answer

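DPO learns directly from preference pairs, without training a separate reward model or running an online RL loop. RLHF first fits a reward model and then optimizes the policy against it with RL.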
