Long-Context Serving

Long context expands what a model can read, but it does not make all tokens equally useful or cheap. Long prompts increase prefill latency, KV-cache memory, retrieval cost, and evaluation complexity.

Command Examples

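prompt_budget:               # all values are token counts
  system: 800                # system prompt
  tools: 1200                # tool and function schemas
  retrieved_context: 24000   # retrieved documents
  user: 500                  # user message
  output_cap: 2000           # maximum generated tokens
  model_context: 32768       # hard limit; every entry above must fit inside it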

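Budget input and output together: generated tokens count against the same context window and add to the KV cache. In the example above the components total 28,500 tokens, which leaves 4,268 tokens of headroom under the 32,768-token limit.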
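
The budget arithmetic can be checked mechanically before a request is admitted. The sketch below is a minimal illustration, not a real serving API: the `check_budget` helper and the dictionary layout are assumptions made for this example, and the numbers simply mirror the prompt_budget values.

```python
# Hypothetical budget check; the values mirror the prompt_budget example.
budget = {
    "system": 800,
    "tools": 1200,
    "retrieved_context": 24000,
    "user": 500,
    "output_cap": 2000,  # output is budgeted alongside the input
}
MODEL_CONTEXT = 32768


def check_budget(parts: dict[str, int], model_context: int) -> int:
    """Return the remaining headroom in tokens; raise if the budget overflows."""
    total = sum(parts.values())
    if total > model_context:
        raise ValueError(f"budget {total} exceeds context {model_context}")
    return model_context - total


headroom = check_budget(budget, MODEL_CONTEXT)
print(f"used={sum(budget.values())} headroom={headroom}")
```

Raising on overflow, rather than silently truncating, makes it explicit which component has to shrink.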

Long-Context Mechanics

Topic	Why It Matters
Context window	Hard limit on input plus generated tokens.
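RoPE scaling	Extends the usable position range beyond the trained length, but can reduce quality.
Sliding-window attention	Limits each token's attention to a recent window, which bounds KV-cache growth in models that support it.
Sink tokens	Some models keep the first few tokens cached so attention stays stable when older tokens are evicted.
Lost in the middle	Models may underuse evidence buried in the middle of a long context.
Prefix caching	A stable long prefix shared across requests can be prefilled once and reused.
Chunked prefill	Splits a long prefill into chunks that can be interleaved with other requests, improving scheduling fairness.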
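
To make the sliding-window and sink-token rows concrete, the sketch below computes which token positions would still have KV entries under a toy retention policy: keep the first few tokens plus the most recent window. This is an illustration under stated assumptions only. The `kept_positions` function and its parameters are hypothetical, and real models implement these mechanisms inside the attention kernel and cache manager, with details that vary by model.

```python
def kept_positions(seq_len: int, window: int, n_sink: int = 0) -> list[int]:
    """Positions whose KV entries remain cached after seq_len tokens.

    Toy policy (an assumption for illustration): keep the first n_sink
    tokens plus the most recent `window` tokens, and evict the rest.
    """
    if window <= 0 or n_sink < 0:
        raise ValueError("window must be positive and n_sink non-negative")
    sinks = set(range(min(n_sink, seq_len)))
    recent = set(range(max(seq_len - window, 0), seq_len))
    return sorted(sinks | recent)


# The cache stays bounded at window + n_sink entries however long the input grows.
print(kept_positions(seq_len=10, window=4, n_sink=2))
print(len(kept_positions(seq_len=100_000, window=4096, n_sink=4)))
```

Everything between the sink tokens and the recent window is no longer visible to attention under this policy, which is one reason evidence position matters in evaluation.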

Serving Risks

Risk	Mitigation
High TTFT	Compress prompts, cache prefixes, chunk prefill, route long prompts separately.
KV-cache OOM	Lower the context limit, lower concurrency, quantize the KV cache, add memory, or use GQA/MQA models.
Lower answer quality	Evaluate evidence position and retrieval ordering.
Higher cost	Budget tokens and apply admission control.
Noisy neighbors	Separate long-context pool from short chat pool.
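
A rough KV-cache estimate shows why the KV-cache OOM mitigations help. The sketch uses the standard per-sequence estimate of one key and one value tensor per layer. The model shape (32 layers, 32 attention heads, head dimension 128) is a hypothetical example, not a specific model, and the estimate ignores allocator overhead and paging granularity.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 for fp16/bf16, 1 for an 8-bit KV cache
) -> int:
    """Approximate KV-cache size: K and V per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 2**30
# Hypothetical 32-layer model at a 32,768-token context, one sequence.
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=32768)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=32768)
gqa_int8 = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=32768, bytes_per_elem=1)

print(f"MHA fp16: {mha / GIB:.0f} GiB")
print(f"GQA fp16: {gqa / GIB:.0f} GiB")
print(f"GQA int8: {gqa_int8 / GIB:.0f} GiB")
```

The size scales linearly with context length, concurrency (`batch_size`), number of KV heads, and bytes per element, which is why each mitigation in the table reduces one of those factors.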

Long-Context Eval Design

Eval	What It Catches
Evidence at beginning/middle/end	Lost-in-the-middle behavior.
Multi-document distractors	Retrieval and reasoning robustness.
Long prompt with short answer	Prefill cost vs value.
Long output generation	Decode occupancy and cache growth.
Truncation tests	Whether critical instructions survive budget pressure.
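
The evidence-position eval can be set up by inserting the same evidence passage at different depths among distractor passages. The sketch below is a minimal illustration: `place_evidence`, the position names, and the sample passages are assumptions made for this example, and scoring the model's answers is left out.

```python
def place_evidence(distractors: list[str], evidence: str, position: str) -> list[str]:
    """Return the passages with `evidence` inserted at the requested depth."""
    slots = {
        "beginning": 0,
        "middle": len(distractors) // 2,
        "end": len(distractors),
    }
    if position not in slots:
        raise ValueError(f"unknown position: {position!r}")
    idx = slots[position]
    return distractors[:idx] + [evidence] + distractors[idx:]


distractors = [f"Unrelated passage {i}." for i in range(4)]
evidence = "The deployment key rotates every 30 days."

# One prompt variant per position; the question and expected answer stay fixed.
variants = {
    pos: "\n\n".join(place_evidence(distractors, evidence, pos))
    for pos in ("beginning", "middle", "end")
}
```

Comparing accuracy across the three variants, with everything else held constant, isolates the effect of position from the effect of retrieval quality.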

Study Cards

Question

Why is long context not free?

Answer

It increases prefill time, KV-cache memory, cost, and eval complexity.

Question

What is lost-in-the-middle behavior?

Answer

The model may fail to use relevant evidence placed in the middle of a long context.

Question

Why route long-context traffic separately?

Answer

It prevents long prompts from consuming cache and scheduler capacity needed for short-latency traffic.
