Tech Study Guide
Multimodal ML
This guide covers multimodal ML: vision, audio, speech-to-text, text-to-speech, vision-language models, image embeddings, OCR pipelines, multimodal RAG, and video model basics.
Multimodal systems combine text, images, audio, video, documents, or structured data. Compared with text-only LLMs, they raise additional problems in preprocessing, alignment, storage, privacy, and evaluation.
Modality Map
| Modality | Common Tasks | Operational Risk |
|---|---|---|
| Vision | Classification, detection, segmentation, OCR. | Image quality, resolution, privacy, bias. |
| Audio | Speech recognition, diarization, classification. | Noise, accents, latency, consent. |
| Text-to-speech | Voice generation. | Misuse, consent, watermarking. |
| Vision-language | Captioning, document QA, UI understanding. | Hallucinated visual details. |
| Video | Temporal understanding and generation. | Cost, storage, frame sampling. |
Multimodal RAG
Multimodal RAG retrieves evidence from text, images, tables, diagrams, and OCR. The system needs source metadata, page/region coordinates, image hashes, and citation rendering that users can inspect.
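As a rough illustration of the metadata listed above, the sketch below stores one evidence record and renders a citation from it. The `Evidence` schema, the field names, the pixel-coordinate convention, and the file name `invoice.pdf` are all assumptions made for this example, not a standard format.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    """One retrievable piece of evidence (hypothetical schema)."""
    source: str                                  # document name or ID
    page: int                                    # 1-based page number
    region: Optional[Tuple[int, int, int, int]]  # (x0, y0, x1, y1), assumed page pixels
    modality: str                                # "text", "image", "table", ...
    text: Optional[str] = None                   # OCR or native text, if any
    image_sha256: Optional[str] = None           # hash of the cropped image bytes


def hash_image(image_bytes: bytes) -> str:
    """Content hash so a cited image can be checked against the stored one."""
    return hashlib.sha256(image_bytes).hexdigest()


def render_citation(ev: Evidence) -> str:
    """Build a short citation string a user can follow back to the source."""
    parts = [ev.source, f"p. {ev.page}"]
    if ev.region is not None:
        parts.append("region " + ",".join(str(v) for v in ev.region))
    if ev.image_sha256 is not None:
        parts.append("img " + ev.image_sha256[:8])
    return "[" + " | ".join(parts) + "]"


# Hypothetical record for illustration only.
ev = Evidence(
    source="invoice.pdf",
    page=2,
    region=(120, 310, 260, 340),
    modality="text",
    text="Due: 2026-06-15",
)
print(render_citation(ev))  # [invoice.pdf | p. 2 | region 120,310,260,340]
```

Keeping the page and region on every record is what lets a viewer highlight the exact area a citation points to; the image hash is one possible way to detect that the stored crop has changed since indexing.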
OCR Pipeline
flowchart LR
  PDF[PDF or image] --> Render[Render pages]
  Render --> OCR[OCR text]
  Render --> Vision[Image embeddings]
  OCR --> Chunks[Text chunks]
  Vision --> Index[Image/vector index]
  Chunks --> Index
  Index --> Answer[Grounded answer]
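The flow above can be sketched end to end with stand-in components. Everything here is an assumption for illustration: `fake_ocr` and `fake_image_embedding` are placeholders rather than a real OCR engine or image encoder, the "page images" are plain bytes holding made-up sample text, and retrieval is simple word overlap instead of vector search.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class IndexEntry:
    kind: str        # "text" or "image"
    page: int        # 1-based page number
    payload: object  # text chunk or embedding vector


def fake_ocr(page_image: bytes) -> str:
    """Stand-in for an OCR engine: the 'image' is just UTF-8 text here."""
    return page_image.decode("utf-8")


def fake_image_embedding(page_image: bytes) -> List[float]:
    """Stand-in for an image encoder: a toy vector, not a real embedding."""
    return [float(len(page_image)), float(sum(page_image) % 256)]


def chunk_text(text: str, max_words: int = 8) -> List[str]:
    """Split OCR text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, so 'Due:' matches 'due'."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(pages: List[bytes]) -> List[IndexEntry]:
    """Render -> OCR -> chunks, plus Render -> image embedding, in one index."""
    index: List[IndexEntry] = []
    for page_no, image in enumerate(pages, start=1):
        for chunk in chunk_text(fake_ocr(image)):
            index.append(IndexEntry("text", page_no, chunk))
        index.append(IndexEntry("image", page_no, fake_image_embedding(image)))
    return index


def search_text(index: List[IndexEntry], query: str) -> List[IndexEntry]:
    """Toy retrieval: text chunks sharing at least one token with the query."""
    terms = tokens(query)
    return [e for e in index
            if e.kind == "text" and terms & tokens(str(e.payload))]


# Made-up sample pages standing in for rendered page images.
pages = [b"Invoice 1042 from Example Corp", b"Total 300.00 Due: 2026-06-15"]
index = build_index(pages)
hits = search_text(index, "due date")
print([(h.page, h.payload) for h in hits])
```

Each entry keeps its page number so that whatever produces the final answer can cite where the supporting chunk came from. In a real system the two stand-ins would be replaced by an actual OCR engine and image encoder, and the list by a vector index.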
Practical Lab: Document QA Evidence
question: "What is the invoice due date?"
evidence:
  page: 2
  bounding_box: [120, 310, 260, 340]
  extracted_text: "Due: 2026-06-15"
answer_policy:
  - cite page and field
  - abstain if OCR confidence is low
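One way to apply the answer policy from the lab record is sketched below. The `ocr_confidence` field, its 0.0 to 1.0 scale, the `MIN_CONFIDENCE` threshold of 0.80, and the abstention wording are assumptions added for this example; the lab record itself does not specify them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FieldEvidence:
    page: int
    bounding_box: Tuple[int, int, int, int]
    extracted_text: str
    ocr_confidence: float  # assumed 0.0-1.0 score; not part of the lab record


MIN_CONFIDENCE = 0.80  # assumed threshold; tune on labeled documents
ABSTAIN = "I cannot answer reliably from the document."


def answer_due_date(ev: Optional[FieldEvidence]) -> str:
    """Apply the lab policy: cite page and field, abstain on weak OCR."""
    if ev is None or ev.ocr_confidence < MIN_CONFIDENCE:
        return ABSTAIN
    # Split "Due: 2026-06-15" into the field label and its value.
    label, _, value = ev.extracted_text.partition(":")
    if not value.strip():
        return ABSTAIN
    box = ", ".join(str(v) for v in ev.bounding_box)
    return f"{value.strip()} (page {ev.page}, field '{label.strip()}', box [{box}])"


good = FieldEvidence(2, (120, 310, 260, 340), "Due: 2026-06-15", 0.97)
blurry = FieldEvidence(2, (120, 310, 260, 340), "Due: 2026-06-15", 0.41)
print(answer_due_date(good))    # 2026-06-15 (page 2, field 'Due', box [120, 310, 260, 340])
print(answer_due_date(blurry))  # I cannot answer reliably from the document.
```

Abstaining is treated as a normal return value rather than an error, so the caller can show it to the user or route the document to manual review.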
Study Cards
Why is multimodal RAG harder than text RAG?
Evidence can live in images, coordinates, OCR text, tables, or temporal segments that need different indexing and citation.
What should document QA cite?
The source document plus page, region, or extracted field that supports the answer.
Why does audio ML need privacy review?
Audio can contain voices, identity, background speech, and sensitive context.