Multimodal ML

Multimodal systems combine text, images, audio, video, documents, or structured data. They add preprocessing, alignment, storage, privacy, and evaluation problems beyond text-only LLMs.

Modality Map

Modality	Common Tasks	Operational Risk
Vision	Classification, detection, segmentation, OCR.	Image quality, resolution, privacy, bias.
Audio	Speech recognition, diarization, classification.	Noise, accents, latency, consent.
Text-to-speech	Voice generation.	Misuse, consent, watermarking.
Vision-language	Captioning, document QA, UI understanding.	Hallucinated visual details.
Video	Temporal understanding and generation.	Cost, storage, frame sampling.

Multimodal RAG

Multimodal RAG retrieves evidence from text, images, tables, diagrams, and OCR output. To keep answers verifiable, the system needs to store source metadata, page and region coordinates, and image hashes with each retrieved item, and to render citations that users can inspect.
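One possible shape for such an evidence record is sketched below. The `Evidence` class, its field names, the `(x0, y0, x1, y1)` box order, and the file name `invoice.pdf` are illustrative assumptions, not a fixed schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    """One retrieved item plus the provenance needed to cite it (hypothetical schema)."""
    source: str                                          # document name or ID
    page: int                                            # 1-based page number
    modality: str                                        # e.g. "text", "image", "table"
    content: str                                         # OCR text, caption, or cell text
    region: Optional[Tuple[int, int, int, int]] = None   # assumed order: (x0, y0, x1, y1)
    image_sha256: Optional[str] = None                   # hash of the page or crop bytes


def hash_image(image_bytes: bytes) -> str:
    """Return a SHA-256 hex digest so a cited image can be re-verified later."""
    return hashlib.sha256(image_bytes).hexdigest()


def render_citation(ev: Evidence) -> str:
    """Format a citation that a reader can check against the source."""
    parts = [ev.source, f"p. {ev.page}"]
    if ev.region is not None:
        parts.append(f"region {list(ev.region)}")
    if ev.image_sha256 is not None:
        parts.append(f"sha256:{ev.image_sha256[:12]}")
    return "[" + ", ".join(parts) + "]"


# Example values; the image bytes are a placeholder, not a real page render.
evidence = Evidence(
    source="invoice.pdf",
    page=2,
    modality="text",
    content="Due: 2026-06-15",
    region=(120, 310, 260, 340),
    image_sha256=hash_image(b"example page bytes"),
)
citation = render_citation(evidence)
print(citation)
```

Keeping the hash next to the coordinates lets a reviewer confirm that the cited region comes from the same page image that was indexed.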

OCR Pipeline

flowchart LR
  PDF[PDF or image] --> Render[Render pages]
  Render --> OCR[OCR text]
  Render --> Vision[Image embeddings]
  OCR --> Chunks[Text chunks]
  Vision --> Index[Image/vector index]
  Chunks --> Index
  Index --> Answer[Grounded answer]
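
The indexing half of this flow can be sketched roughly as follows. The OCR and image-embedding steps are replaced by hard-coded stand-in values, and `chunk_page_text`, `build_index`, and the entry layout are hypothetical names chosen for illustration, not a specific library's API.

```python
from typing import Dict, List


def chunk_page_text(page: int, text: str, max_words: int = 50) -> List[Dict]:
    """Split one page of OCR text into word-bounded chunks that keep their page number."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_words):
        chunks.append({
            "kind": "text",
            "page": page,
            "text": " ".join(words[start:start + max_words]),
        })
    return chunks


def build_index(ocr_pages: Dict[int, str], image_vectors: Dict[int, List[float]]) -> List[Dict]:
    """Merge text chunks and page-image embeddings into one list of index entries."""
    index: List[Dict] = []
    for page, text in sorted(ocr_pages.items()):
        index.extend(chunk_page_text(page, text))
    for page, vector in sorted(image_vectors.items()):
        index.append({"kind": "image", "page": page, "vector": vector})
    return index


# Stand-in outputs of the OCR and image-embedding steps (made-up values).
ocr_pages = {1: "Invoice 1001 from Example Corp", 2: "Due: 2026-06-15"}
image_vectors = {1: [0.1, 0.2], 2: [0.3, 0.4]}

index = build_index(ocr_pages, image_vectors)
for entry in index:
    print(entry["kind"], entry["page"])
```

Every entry carries its page number, so whatever retrieval step sits on top of the index can still point back to the source page.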

Practical Lab: Document QA Evidence

question: "What is the invoice due date?"
evidence:
  page: 2
  bounding_box: [120, 310, 260, 340]
  extracted_text: "Due: 2026-06-15"
answer_policy:
  - cite page and field
  - abstain if OCR confidence is low
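
A minimal sketch of how this answer policy might be enforced in code is shown below. The function name `answer_with_citation`, the 0.80 threshold, and the confidence values passed in are assumptions for illustration; a real threshold would need tuning on representative documents.

```python
from typing import Dict

MIN_OCR_CONFIDENCE = 0.80  # assumed threshold; tune on your own documents


def answer_with_citation(evidence: Dict, ocr_confidence: float) -> str:
    """Return a cited answer, or abstain when OCR confidence is below the threshold."""
    if ocr_confidence < MIN_OCR_CONFIDENCE:
        return "Cannot answer: OCR confidence is too low for the cited region."
    # Split "Due: 2026-06-15" into the field label and its value.
    field, _, value = evidence["extracted_text"].partition(":")
    return (
        f"{value.strip()} "
        f"(page {evidence['page']}, field '{field.strip()}', "
        f"box {evidence['bounding_box']})"
    )


evidence = {
    "page": 2,
    "bounding_box": [120, 310, 260, 340],
    "extracted_text": "Due: 2026-06-15",
}

confident = answer_with_citation(evidence, ocr_confidence=0.97)
uncertain = answer_with_citation(evidence, ocr_confidence=0.42)
print(confident)
print(uncertain)
```

Abstaining is a deliberate output here: a misread digit in a due date is worse than telling the user the region could not be read reliably.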

Study Cards

Question

Why is multimodal RAG harder than text RAG?

Answer

Evidence can live in images, coordinates, OCR text, tables, or temporal segments that need different indexing and citation.

Question

What should document QA cite?

Answer

The source document plus page, region, or extracted field that supports the answer.

Question

Why does audio ML need privacy review?

Answer

Audio can contain voices, identity, background speech, and sensitive context.
