ML Data Pipelines and Feature Stores

Data pipelines define what the model can learn and what it can see at inference time. A strong model trained on leaky, stale, mislabeled, or irreproducible data becomes a fragile system.

Command Examples

date -Is
ls -lh data/
find data -maxdepth 2 -type f | sort | head

Example output and meaning:

Command	Example output	What it does
`date -Is`	`2026-06-06T10:24:33-07:00`	Prints the current time in ISO 8601 format with second precision, so command output and logs can be tied to an exact timestamp.
`ls -lh data/`	File names, sizes, owners, permissions, and modification times.	Confirms the expected artifacts exist with usable ownership and freshness.
`find data -maxdepth 2 -type f | sort | head`	Sorted file paths such as `data/train.parquet` and `data/validation.parquet`.	Shows which files the pipeline or prompt loader will actually consume.

In production, replace filesystem checks with source inventory, lineage, data contracts, and dataset version manifests.
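
As a small step from filesystem checks toward a dataset version manifest, the sketch below hashes every file under a directory. It is a minimal illustration, not a full versioning system: the function names `file_sha256` and `build_manifest` and the manifest layout are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: str) -> dict:
    """Map each file's path (relative to data_dir) to its size and SHA-256.

    The layout is hypothetical; a real manifest would likely also record
    the extraction query, schema, and code version.
    """
    root = Path(data_dir)
    files = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        files[path.relative_to(root).as_posix()] = {
            "bytes": path.stat().st_size,
            "sha256": file_sha256(path),
        }
    return {"root": root.name, "files": files}
```

Calling `build_manifest("data")` and storing the result next to the training run gives a record that can later be compared against the files actually on disk.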

Pipeline Boundaries

Boundary	What To Version
Raw source	Source system, extraction query, timestamp, schema, permissions.
Cleaning	Dedup rules, filters, redaction, normalization, language detection.
Labeling	Label source, instructions, adjudication, quality audits.
Feature generation	Code version, backfill window, joins, aggregation windows.
Dataset split	Split key, time boundary, leakage checks, holdout policy.
Training manifest	Dataset hashes, preprocessing, tokenizer, feature schema, model config.
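
The dataset split row can be made concrete with a deterministic split. The sketch below assumes one possible policy: rows at or after a time boundary go to the holdout set, and earlier rows are assigned to train or validation by hashing the entity key. The names `assign_split` and `cross_split_entities`, the salt, and the default fraction are all hypothetical.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone


def assign_split(
    entity_key: str,
    event_time: datetime,
    holdout_start: datetime,
    validation_fraction: float = 0.1,
    salt: str = "split-v1",  # hypothetical; version it with the dataset
) -> str:
    """Assign a row to a split without depending on row order or randomness."""
    if event_time >= holdout_start:
        return "holdout"
    digest = hashlib.sha256(f"{salt}:{entity_key}".encode("utf-8")).digest()
    # Map the hash to an integer bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    threshold = round(validation_fraction * 10_000)
    return "validation" if bucket < threshold else "train"


def cross_split_entities(assignments):
    """Return entities that appear in more than one split.

    `assignments` is an iterable of (entity_key, split_name) pairs.
    """
    seen = defaultdict(set)
    for entity_key, split in assignments:
        seen[entity_key].add(split)
    return {key: splits for key, splits in seen.items() if len(splits) > 1}
```

Hashing the key keeps every pre-boundary row for an entity in the same split. An entity with rows on both sides of the time boundary will still show up in `cross_split_entities`, which leaves the holdout policy to decide whether that overlap is acceptable.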

Feature Store Model

Feature stores help when the same feature definitions must serve both offline training and low-latency online inference.

Concept	Why It Matters
Offline store	Historical feature values for training and backfills.
Online store	Fresh low-latency features for inference.
Entity key	Stable join key such as user, account, host, document, or device.
Event time	Time the fact happened; needed for point-in-time correctness.
Materialization	Copying computed features into serving storage.
Freshness SLA	Maximum tolerated lag between source and online value.
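
Entity key, event time, and freshness come together in a point-in-time lookup. The sketch below is an in-memory illustration only, not how any particular feature store implements it; the class `FeatureHistory` and its methods are hypothetical. It returns the latest value whose event time is at or before the prediction time, and optionally rejects values older than a maximum age.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Any, Optional


class FeatureHistory:
    """Per-entity feature values kept sorted by event time."""

    def __init__(self) -> None:
        self._times: dict = {}   # entity_key -> sorted list of event times
        self._values: dict = {}  # entity_key -> values aligned with _times

    def add(self, entity_key: str, event_time: datetime, value: Any) -> None:
        times = self._times.setdefault(entity_key, [])
        values = self._values.setdefault(entity_key, [])
        position = bisect_right(times, event_time)  # keep both lists sorted
        times.insert(position, event_time)
        values.insert(position, value)

    def as_of(
        self,
        entity_key: str,
        prediction_time: datetime,
        max_age: Optional[timedelta] = None,
    ) -> Any:
        """Latest value with event_time <= prediction_time, or None."""
        times = self._times.get(entity_key, [])
        position = bisect_right(times, prediction_time)
        if position == 0:
            return None  # nothing was known yet at prediction time
        latest = position - 1
        if max_age is not None and prediction_time - times[latest] > max_age:
            return None  # value exists but is staler than allowed
        return self._values[entity_key][latest]
```

Whether a fact stamped exactly at the prediction time counts as known is a policy choice; this sketch includes it. Building training rows with `as_of` at each label's prediction time is what keeps later facts out of the features.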

Train/Serve Skew

Train/serve skew happens when training features differ from inference features. Common causes:

training uses future data unavailable at inference,
online features lag or are missing,
preprocessing code differs between batch and request path,
categorical vocabularies or tokenizers drift,
time windows are computed differently,
null/default behavior differs.

Skew check: record the following for every feature, then compare the offline and online entries.

For each feature:
  source system
  event-time semantics
  offline transform
  online transform
  null/default policy
  freshness target
  owner
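
The checklist above can be kept as a small record per feature, with a parity test that runs the offline and online transforms on the same raw inputs. This is a sketch under assumptions: `FeatureSpec`, `find_skew`, and the example feature are hypothetical, and transforms are modeled as plain functions of one optional number.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional

Transform = Callable[[Optional[float]], float]


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    source_system: str
    event_time_semantics: str
    offline_transform: Transform
    online_transform: Transform
    null_default: float
    freshness_target_seconds: int
    owner: str


def find_skew(spec: FeatureSpec, raw_samples, tolerance: float = 1e-9):
    """Return (raw, offline, online) triples where the two paths disagree."""
    mismatches = []
    for raw in raw_samples:
        offline = spec.offline_transform(raw)
        online = spec.online_transform(raw)
        if not math.isclose(offline, online, rel_tol=0.0, abs_tol=tolerance):
            mismatches.append((raw, offline, online))
    return mismatches


# Hypothetical feature whose two paths differ only in null handling.
spend_log = FeatureSpec(
    name="spend_log",
    source_system="billing",
    event_time_semantics="time the charge settled",
    offline_transform=lambda x: math.log1p(x) if x is not None else 0.0,
    online_transform=lambda x: math.log1p(x) if x is not None else -1.0,
    null_default=0.0,
    freshness_target_seconds=300,
    owner="payments-ml",
)
```

Here `find_skew(spend_log, [None, 0.0, 9.0])` reports only the `None` input, which is the null/default mismatch from the list of causes. Including nulls and boundary values in the sample set is what makes such a check useful.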

Data Quality Gates

Gate	Blocks
Schema contract	Missing fields, type drift, enum drift.
Range checks	Impossible values, unit changes, clipped distributions.
Freshness checks	Late sources, failed materialization, stale online store.
Leakage checks	Post-outcome fields, future timestamps, duplicate entities across splits.
Privacy checks	PII, secrets, disallowed training data, retention violations.
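
Several of these gates reduce to per-row checks that run before data reaches training. The sketch below covers a schema contract, an enum check, a range check, and a future-timestamp leakage check. The schema, field names, allowed values, and bounds are invented for illustration.

```python
from datetime import datetime, timezone

# Hypothetical contract for one input table.
SCHEMA = {
    "user_id": str,
    "amount_usd": float,
    "country": str,
    "event_time": datetime,
}
ALLOWED_COUNTRIES = {"US", "CA", "MX"}
AMOUNT_RANGE = (0.0, 1_000_000.0)


def gate_violations(row: dict, now: datetime) -> list:
    """Return a list of human-readable violations; empty means the row passes."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in row:
            problems.append(f"schema: missing {field}")
        elif not isinstance(row[field], expected):
            actual = type(row[field]).__name__
            problems.append(f"schema: {field} is {actual}, expected {expected.__name__}")
    if problems:
        return problems  # value checks are meaningless on a malformed row

    if row["country"] not in ALLOWED_COUNTRIES:
        problems.append(f"enum: unexpected country {row['country']!r}")
    low, high = AMOUNT_RANGE
    if not low <= row["amount_usd"] <= high:  # NaN also fails this comparison
        problems.append(f"range: amount_usd {row['amount_usd']} outside [{low}, {high}]")
    if row["event_time"] > now:
        problems.append("leakage: event_time is in the future")
    return problems
```

A batch-level gate would typically aggregate these results and block the run when the violation rate exceeds an agreed threshold, rather than failing on a single bad row.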

Study Cards

Question

What is train/serve skew?

Answer

A mismatch between features used during training and features available or computed at inference time.

Question

Why does event time matter for ML features?

Answer

It prevents training from using facts that would not have been known at prediction time.

Question

When does a feature store help?

Answer

When the same feature definitions must support offline training and online low-latency inference.
