ML Data Pipelines and Feature Stores

Data pipelines define what the model can learn and what it can see at inference time. A strong model trained on leaky, stale, mislabeled, or irreproducible data becomes a fragile system.

Command Examples

date -Is
ls -lh data/
find data -maxdepth 2 -type f | sort | head

Example output and meaning:

| Command | Example output | What it does |
| --- | --- | --- |
| date -Is | 2026-06-06T10:24:33-07:00 | Pins command output and logs to an exact incident timestamp. |
| ls -lh data/ | File names, sizes, owners, permissions, and modification times. | Confirms the expected artifacts exist with usable ownership and freshness. |
| find data -maxdepth 2 -type f \| sort \| head | Sorted file paths such as data/train.parquet and data/validation.parquet. | Shows which files the pipeline or prompt loader will actually consume. |

In production, replace filesystem checks with source inventory, lineage, data contracts, and dataset version manifests.

Pipeline Boundaries

| Boundary | What To Version |
| --- | --- |
| Raw source | Source system, extraction query, timestamp, schema, permissions. |
| Cleaning | Dedup rules, filters, redaction, normalization, language detection. |
| Labeling | Label source, instructions, adjudication, quality audits. |
| Feature generation | Code version, backfill window, joins, aggregation windows. |
| Dataset split | Split key, time boundary, leakage checks, holdout policy. |
| Training manifest | Dataset hashes, preprocessing, tokenizer, feature schema, model config. |
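
One way to make the training manifest row concrete is to hash every dataset file and store the hashes next to the build settings. The sketch below is a minimal illustration, not a prescribed format: the function names (sha256_of_file, build_manifest, manifest_id) and the manifest layout are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, config: dict) -> dict:
    """Record one hash per dataset file plus the settings used to build them.

    The layout ("datasets" and "config" keys) is a hypothetical example.
    """
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        "datasets": {
            p.relative_to(data_dir).as_posix(): sha256_of_file(p) for p in files
        },
        "config": config,
    }


def manifest_id(manifest: dict) -> str:
    """Stable identifier: identical files and config always give the same id."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the manifest id alongside the trained model gives a cheap way to tell whether two training runs saw the same bytes and the same settings.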

Feature Store Model

Feature stores help when the same feature definitions must serve both offline training and low-latency online inference.

| Concept | Why It Matters |
| --- | --- |
| Offline store | Historical feature values for training and backfills. |
| Online store | Fresh low-latency features for inference. |
| Entity key | Stable join key such as user, account, host, document, or device. |
| Event time | Time the fact happened; needed for point-in-time correctness. |
| Materialization | Copying computed features into serving storage. |
| Freshness SLA | Maximum tolerated lag between source and online value. |
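
Point-in-time correctness can be illustrated with a small lookup: for each training label, take the latest feature value whose event time is at or before the prediction time. This is a minimal sketch using only the standard library. The in-memory offline_store dictionary, the entity key "user_1", and the function name are assumptions for illustration, not the API of any particular feature store.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(history, as_of):
    """Return the latest value with event_time <= as_of, or None if none exists.

    `history` is a list of (event_time, value) tuples sorted by event_time.
    """
    times = [event_time for event_time, _ in history]
    index = bisect_right(times, as_of)
    return history[index - 1][1] if index else None


# Hypothetical offline store: entity key -> feature history ordered by event time.
offline_store = {
    "user_1": [
        (datetime(2026, 1, 1), 3),
        (datetime(2026, 1, 10), 7),
    ],
}

# Each label carries the entity key and the time the prediction would be made.
labels = [
    ("user_1", datetime(2026, 1, 5)),
    ("user_1", datetime(2025, 12, 31)),
]

rows = [
    (entity, as_of, point_in_time_value(offline_store.get(entity, []), as_of))
    for entity, as_of in labels
]
# The first row gets 3 (the Jan 10 value is in the future and is ignored);
# the second row gets None because no fact existed yet.
```

A naive join on entity key alone would hand both rows the value 7, which is exactly the future-data leak that event time is meant to prevent.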

Train/Serve Skew

Train/serve skew happens when training features differ from inference features. Common causes:

  • training uses future data unavailable at inference,
  • online features lag or are missing,
  • preprocessing code differs between batch and request path,
  • categorical vocabularies or tokenizers drift,
  • time windows are computed differently,
  • null/default behavior differs.

Skew check:

For each feature:
  source system
  event-time semantics
  offline transform
  online transform
  null/default policy
  freshness target
  owner
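
The offline and online transform entries in the checklist can be compared directly by running both code paths over the same raw records. The sketch below assumes two hypothetical transforms that differ only in their null/default policy. The function names, the "amount" field, and the tolerance are illustrative choices, not part of any real pipeline.

```python
import math


def offline_transform(raw):
    """Batch path (hypothetical): a missing amount defaults to 0.0."""
    amount = raw.get("amount")
    return math.log1p(amount if amount is not None else 0.0)


def online_transform(raw):
    """Request path (hypothetical): a missing amount defaults to 1.0."""
    amount = raw.get("amount")
    return math.log1p(amount if amount is not None else 1.0)


def find_skew(records, offline_fn, online_fn, tolerance=1e-9):
    """Return (record, offline_value, online_value) wherever the paths disagree."""
    mismatches = []
    for record in records:
        offline_value = offline_fn(record)
        online_value = online_fn(record)
        if not math.isclose(
            offline_value, online_value, rel_tol=0.0, abs_tol=tolerance
        ):
            mismatches.append((record, offline_value, online_value))
    return mismatches


# Include null and missing cases on purpose; they are where defaults diverge.
sample = [{"amount": 10.0}, {"amount": None}, {}]
skewed = find_skew(sample, offline_transform, online_transform)
# Two records disagree: the explicit None and the missing key.
```

A parity check like this only catches differences in transform logic. Lagging or missing online values still need a separate freshness check.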

Data Quality Gates

| Gate | Blocks |
| --- | --- |
| Schema contract | Missing fields, type drift, enum drift. |
| Range checks | Impossible values, unit changes, clipped distributions. |
| Freshness checks | Late sources, failed materialization, stale online store. |
| Leakage checks | Post-outcome fields, future timestamps, duplicate entities across splits. |
| Privacy checks | PII, secrets, disallowed training data, retention violations. |
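
Several of these gates reduce to small predicates that a pipeline can run before publishing a dataset. The sketch below covers schema, range, freshness, and split-leakage checks and leaves privacy checks out, since those usually depend on dedicated scanning tools. The schema, the allowed country codes, the age range, and all function names are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract; a real one would come from the data owner.
SCHEMA = {"user_id": str, "age": int, "country": str}
ALLOWED_COUNTRIES = {"US", "DE", "JP"}
AGE_RANGE = (0, 130)


def schema_violations(row):
    """Schema contract: missing fields, type drift, enum drift."""
    problems = []
    for field, expected_type in SCHEMA.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(f"type drift: {field}")
    country = row.get("country")
    if isinstance(country, str) and country not in ALLOWED_COUNTRIES:
        problems.append(f"enum drift: country={country}")
    return problems


def range_violations(row):
    """Range check: flag impossible values."""
    age = row.get("age")
    low, high = AGE_RANGE
    if isinstance(age, int) and not low <= age <= high:
        return [f"out of range: age={age}"]
    return []


def is_stale(last_materialized, now, max_lag):
    """Freshness check: True when the lag exceeds the tolerated maximum."""
    return now - last_materialized > max_lag


def leaked_entities(train_keys, holdout_keys):
    """Leakage check: entity keys that appear in both splits."""
    return set(train_keys) & set(holdout_keys)


rows = [
    {"user_id": "u1", "age": 34, "country": "US"},
    {"user_id": "u2", "age": 340, "country": "FR"},
    {"user_id": "u3", "age": "41"},
]
failures = [
    problem
    for row in rows
    for problem in schema_violations(row) + range_violations(row)
]
# A non-empty `failures` list would block the pipeline run.
```

Returning descriptive strings instead of a bare boolean keeps the failure report readable when a gate blocks a run.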

Study Cards

Question

What is train/serve skew?

Answer

A mismatch between features used during training and features available or computed at inference time.

Question

Why does event time matter for ML features?

Answer

It prevents training from using facts that would not have been known at prediction time.

Question

When does a feature store help?

Answer

When the same feature definitions must support offline training and online low-latency inference.
