Scenario Labs

Each lab starts from symptoms, then asks for evidence before an answer. The same lab cards are embedded into the related topic pages so the scenario stays near the concepts and commands it exercises.

Kubernetes

Kubernetes DNS Outage

Pods intermittently fail name resolution while node-level DNS still works.

Symptoms

  • nslookup inside Pods times out for ClusterIP Services.
  • CoreDNS Pods are running but query latency spikes.
  • Node lookups through the host resolver succeed.

Evidence

  • Compare /etc/resolv.conf inside a Pod with the node resolver.
  • Check CoreDNS logs, metrics, endpoints, and EndpointSlice watch errors.
  • Run dig +search +showsearch to expose ndots expansion.

Command Examples

Command

kubectl -n kube-system logs deploy/coredns --tail=100

Example output

[ERROR] plugin/errors: 2 kubernetes.default.svc.cluster.local. A: read udp 10.244.1.8:39948->10.96.0.10:53: i/o timeout
[INFO] 10.244.2.17:43321 - 44812 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR

What it does: Shows whether CoreDNS is returning normal answers, timing out upstream, failing Kubernetes API watches, or logging plugin-level errors.

Command

kubectl -n kube-system get endpointslice,endpoints,svc -l k8s-app=kube-dns

Example output

NAME                              ADDRESSTYPE   PORTS   ENDPOINTS
endpointslice.discovery.k8s.io/kube-dns-abc12   IPv4          53      10.244.1.8,10.244.2.9
NAME                 TYPE        CLUSTER-IP    PORT(S)
service/kube-dns     ClusterIP   10.96.0.10    53/UDP,53/TCP

What it does: Confirms the DNS Service has ready CoreDNS endpoints and exposes both UDP and TCP port 53.

Command

kubectl exec deploy/debug -- dig kubernetes.default.svc.cluster.local

Example output

;; status: NOERROR
kubernetes.default.svc.cluster.local. 30 IN A 10.96.0.1

What it does: Tests name resolution from the same Pod network path that application Pods use.

Answer: Prove whether the failure is stub resolver search expansion, CoreDNS health, upstream recursion, NetworkPolicy, or API watch state before changing the Corefile.
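
The search-expansion case in the answer above can be reasoned about offline. The following is a minimal sketch, assuming glibc-style stub resolver rules; the function name and the sample search list are illustrative, and other resolvers (musl, for example) behave differently. It lists the names a Pod would query, in order, for a given `search` list and `ndots` value.

```python
def search_candidates(name, search, ndots):
    """Return the query names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search expansion.
        return [name]
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: the name is tried as-is first, then with each suffix.
        return [absolute] + expanded
    # Too few dots: every search suffix is tried before the name itself.
    return expanded + [absolute]


# Assumed values for a Pod in the "default" namespace with ndots:5.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in search_candidates("kubernetes.default.svc.cluster.local", search, 5):
    print(candidate)
```

The Service name has four dots, which is below `ndots:5`, so three suffixed queries go out before the real name is tried. Each of those extra lookups is additional load on CoreDNS during an outage.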

Databases

PostgreSQL Failover and Pooling

A failover completes, but application errors continue through stale pooled connections.

Symptoms

  • Application logs show read-only transaction or connection reset errors.
  • PgBouncer pools point at the old primary.
  • Replication lag is low, but write traffic still fails.

Evidence

  • Compare pg_is_in_recovery() across endpoints.
  • Inspect PgBouncer SHOW POOLS and SHOW SERVERS.
  • Check application DNS TTL and connection retry behavior.

Command Examples

Command

psql -c "select pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp()"

Example output

pg_is_in_recovery | ?column?
-------------------+----------
f                 |

What it does: Identifies whether the endpoint is the writable primary and whether replay lag is relevant.

Command

psql -p 6432 pgbouncer -c "show pools"

Example output

database | user | cl_active | cl_waiting | sv_active | sv_idle
app      | app  | 120       | 18         | 20        | 0

What it does: Shows whether clients are stuck behind PgBouncer pools or mapped to stale server connections.
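
As a rough illustration of reading that output, the sketch below parses a pipe-delimited `SHOW POOLS` table and flags pools where clients are waiting while no server connection is idle. The helper names are hypothetical, and the rule is only a heuristic: a flagged pool still needs `SHOW SERVERS` to show which host its server connections point at.

```python
def parse_pipe_table(text):
    """Parse psql-style 'a | b | c' output into a list of dicts keyed by header."""
    lines = [
        line for line in text.strip().splitlines()
        # Skip blank lines and separator rows made only of '-' and '+'.
        if line.strip() and not set(line.strip()) <= set("-+")
    ]
    header = [cell.strip() for cell in lines[0].split("|")]
    return [
        dict(zip(header, (cell.strip() for cell in line.split("|"))))
        for line in lines[1:]
    ]


def saturated_pools(rows):
    """Return (database, user) pairs with waiting clients and no idle servers."""
    return [
        (row["database"], row["user"])
        for row in rows
        if int(row["cl_waiting"]) > 0 and int(row["sv_idle"]) == 0
    ]


sample = """
database | user | cl_active | cl_waiting | sv_active | sv_idle
app      | app  | 120       | 18         | 20        | 0
"""
print(saturated_pools(parse_pipe_table(sample)))
```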

Command

psql -p 6432 pgbouncer -c "reconnect"

Example output

RECONNECT

What it does: Tells PgBouncer to close each existing server connection as it is released, so replacement connections are opened against the promoted primary or the changed endpoint.

Answer: Treat database promotion, pool reconnection, DNS/service routing, and application retry budgets as one cutover sequence.

Linux

Linux Boot Recovery

A host reaches emergency mode after a package or filesystem change.

Symptoms

  • Console shows failed mounts or initramfs device discovery errors.
  • systemd drops to emergency target.
  • Remote SSH never starts.

Evidence

  • Read journalctl -xb from the failed boot.
  • Compare /etc/fstab UUIDs with blkid.
  • Check initramfs contents and kernel command line.

Command Examples

Command

journalctl -xb -p warning

Example output

systemd[1]: dev-disk-by\x2duuid-...device: Job timed out.
systemd[1]: Dependency failed for /data.

What it does: Finds the failed unit, mount, device, or boot dependency from the current emergency boot.

Command

blkid

Example output

/dev/nvme0n1p2: UUID="9f7c..." TYPE="ext4"
/dev/nvme1n1p1: UUID="2ad4..." TYPE="xfs"

What it does: Compares real block-device UUIDs and filesystem types against `/etc/fstab` and boot configuration.
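
The comparison can be scripted when the recovery environment has Python available. This is a minimal sketch, assuming `/etc/fstab` entries use the `UUID=` form; the function names and the sample UUIDs are made up for illustration.

```python
import re


def fstab_uuids(fstab_text):
    """Map each UUID referenced in fstab to its mount point."""
    uuids = {}
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and fields[0].startswith("UUID="):
            uuids[fields[0][len("UUID="):].strip('"')] = fields[1]
    return uuids


def blkid_uuids(blkid_text):
    """Collect filesystem UUIDs from blkid output (PARTUUID and UUID_SUB are ignored)."""
    return set(re.findall(r'\bUUID="([^"]+)"', blkid_text))


def missing_mounts(fstab_text, blkid_text):
    """Return {uuid: mount_point} for fstab entries with no matching block device."""
    present = blkid_uuids(blkid_text)
    return {u: m for u, m in fstab_uuids(fstab_text).items() if u not in present}


# Hypothetical sample data; real input comes from /etc/fstab and `blkid`.
fstab = """
UUID=1111aaaa-0000-4000-8000-000000000001 /     ext4 defaults 0 1
UUID=2222bbbb-0000-4000-8000-000000000002 /data xfs  defaults 0 2
"""
blkid = """
/dev/nvme0n1p2: UUID="1111aaaa-0000-4000-8000-000000000001" TYPE="ext4" PARTUUID="abcd-02"
/dev/nvme1n1p1: UUID="3333cccc-0000-4000-8000-000000000003" TYPE="xfs"
"""
print(missing_mounts(fstab, blkid))
```

An entry reported here is a candidate for the mount that timed out, for example after a filesystem was recreated and received a new UUID.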

Command

systemctl --failed

Example output

UNIT             LOAD   ACTIVE SUB    DESCRIPTION
data.mount       loaded failed failed /data

What it does: Lists failed systemd units so repair starts from the blocking dependency instead of guessing.

Answer: Restore boot by identifying the failing unit or mount, disabling the bad dependency, rebuilding the initramfs when needed, and keeping the root filesystem remounted read/write only for as long as the repair requires.

Machine Learning

vLLM Inference Latency Spike

Token latency rises after traffic mix changes even though GPU utilization looks acceptable.

Symptoms

  • Time to first token is stable, but inter-token latency rises.
  • Queue depth grows during long-context requests.
  • KV-cache pressure increases before errors appear.

Evidence

  • Compare prompt length, output length, and batch shape histograms.
  • Track KV-cache utilization, prefill/decode split, and scheduler queue time.
  • Check whether speculative decoding or tensor parallel settings changed.

Command Examples

Command

nvidia-smi dmon -s uc

Example output

# gpu   sm  mem  enc  dec  mclk  pclk
# Idx    %    %    %    %   MHz   MHz
  0     72   88    0    0  1593  1410

What it does: Separates GPU compute pressure from memory-bandwidth pressure during prefill and decode.

Command

curl -sS http://localhost:8000/metrics | grep vllm

Example output

vllm:num_requests_waiting{model_name="llama"} 14
vllm:gpu_cache_usage_perc{model_name="llama"} 0.91
vllm:time_to_first_token_seconds_bucket{le="1.0"} 248

What it does: Shows queue depth, KV-cache pressure, and token-latency signals from the serving engine.
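
To compare these signals over time rather than by eye, the scrape can be parsed into samples. This is a minimal sketch of a Prometheus text-format parser; the function name is illustrative, and it assumes label values contain no `}` character, which holds for the example output above.

```python
import re

# name{labels} value, where the {labels} part is optional.
METRIC_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)


def parse_metrics(text, prefix="vllm:"):
    """Return (name, labels, value) tuples for metric lines that start with prefix."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = METRIC_LINE.match(line)
        if match and match.group("name").startswith(prefix):
            samples.append(
                (match.group("name"), match.group("labels") or "", float(match.group("value")))
            )
    return samples


scrape = '''
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama"} 14
vllm:gpu_cache_usage_perc{model_name="llama"} 0.91
vllm:time_to_first_token_seconds_bucket{le="1.0"} 248
'''
for name, labels, value in parse_metrics(scrape):
    print(name, labels, value)
```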

Command

kubectl top pod -l app=vllm

Example output

NAME            CPU(cores)   MEMORY(bytes)
vllm-0          920m         38Gi

What it does: Confirms whether Kubernetes-visible CPU and memory pressure line up with model-server symptoms.

Answer: Separate queueing, prefill saturation, decode throughput, KV-cache eviction, and model parallelism before scaling replicas or changing batch limits.

Ceph

Ceph Degraded PGs After OSD Loss

A drive failure leaves placement groups degraded while client latency rises during recovery.

Symptoms

  • ceph -s shows active+degraded or active+undersized placement groups.
  • One OSD is down or flapping and recovery traffic increases.
  • Application writes are slower but still mostly succeeding.

Evidence

  • Capture ceph health detail, ceph osd tree, and ceph pg dump_stuck before changing flags.
  • Check whether any OSD or CRUSH subtree is nearfull, backfillfull, or full.
  • Compare client latency with recovery and backfill activity.

Command Examples

Command

ceph -s && ceph health detail

Example output

health: HEALTH_WARN
64 pgs degraded
osd.12 is down

What it does: Establishes the cluster health state and the specific warnings driving recovery work.

Command

ceph osd tree && ceph osd df tree

Example output

ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT
12  hdd   7.276   osd.12         down   1.00000
ID  REWEIGHT SIZE  USE  AVAIL %USE
12  1.00000  7.3T  0B   7.3T  0.00

What it does: Locates the failed OSD and checks whether capacity or CRUSH placement will constrain recovery.
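
The capacity check can be made explicit by classifying each OSD's `%USE` against the fullness ratios. The sketch below uses the stock Ceph defaults (0.85 nearfull, 0.90 backfillfull, 0.95 full) as an assumption; the ratios actually in force are shown by `ceph osd dump`, and the function names and sample figures are illustrative.

```python
def fullness_state(used_fraction, nearfull=0.85, backfillfull=0.90, full=0.95):
    """Classify one OSD's utilization against the cluster fullness ratios."""
    if used_fraction >= full:
        return "full"
    if used_fraction >= backfillfull:
        return "backfillfull"
    if used_fraction >= nearfull:
        return "nearfull"
    return "ok"


def constrained_osds(percent_used_by_osd, **ratios):
    """Return {osd: state} for every OSD that is not 'ok'."""
    states = {
        osd: fullness_state(percent / 100.0, **ratios)
        for osd, percent in percent_used_by_osd.items()
    }
    return {osd: state for osd, state in states.items() if state != "ok"}


# Hypothetical %USE values taken from `ceph osd df tree`.
print(constrained_osds({"osd.3": 72.4, "osd.7": 91.2, "osd.9": 86.0}))
```

An OSD at or above the backfillfull ratio refuses backfill, so recovery onto the surviving OSDs can stall even though the failed drive is the only one reported down.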

Command

ceph pg dump_stuck && ceph -w

Example output

ok
2026-06-06T10:14:12 64 pgs active+degraded; recovery io 120 MiB/s

What it does: Watches whether placement groups are making progress toward `active+clean` or staying stuck.

Answer: Restore redundancy without creating extra churn: identify the failed domain, replace or restart only the bad OSD, avoid setting cluster-wide flags such as noout and then forgetting to clear them, and watch recovery until every placement group is active+clean.

Istio

Istio mTLS Policy Breakage

A service-to-service call starts returning proxy-generated 503 or 403 after a policy rollout.

Symptoms

  • Client Pods can resolve the Service, but requests fail at the proxy.
  • Access logs show mTLS, authorization, or upstream health response flags.
  • Direct app logs do not show the failed request reaching the handler.

Evidence

  • Compare PeerAuthentication, DestinationRule, AuthorizationPolicy, and RequestAuthentication scope.
  • Inspect Envoy clusters, routes, endpoints, and secrets for the source or gateway proxy.
  • Check workload identities and namespace labels before changing policy.

Command Examples

Command

istioctl analyze --all-namespaces

Example output

Warning [IST0101] (VirtualService payments.payments) Referenced host not found: payments-api

What it does: Finds mesh configuration conflicts before inspecting individual Envoy proxies.

Command

istioctl proxy-config clusters <pod> -n <namespace>

Example output

SERVICE FQDN                  PORT     SUBSET     DIRECTION     TYPE
payments-api.payments.svc     8443     -          outbound      EDS

What it does: Shows whether the proxy received an outbound cluster for the target workload and port.

Command

istioctl proxy-config secret <pod> -n <namespace>

Example output

RESOURCE NAME     TYPE           STATUS     VALID CERT     SERIAL NUMBER
default           Cert Chain     ACTIVE     true           1a2b3c

What it does: Verifies that the sidecar has active mTLS certificates and can participate in the mesh identity path.

Answer: Treat YAML as intent and proxy config as enforcement; prove whether the break is TLS mode, identity, authorization, endpoint readiness, or route attachment before relaxing policy.
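
For the TLS-mode case, it helps to work out which PeerAuthentication policy actually applies to a workload. The following is a simplified sketch of the precedence rule (workload selector, then namespace-wide, then mesh-wide in the root namespace). The dict shape and function name are hypothetical, `istio-system` is assumed to be the root namespace, and port-level overrides and tie-breaking between multiple matching policies are left out.

```python
def effective_mtls_mode(policies, namespace, workload_labels, root_namespace="istio-system"):
    """Return the mTLS mode that applies to one workload under simplified precedence."""

    def selects(policy):
        selector = policy.get("selector") or {}
        return all(workload_labels.get(key) == value for key, value in selector.items())

    workload = [p for p in policies
                if p["namespace"] == namespace and p.get("selector") and selects(p)]
    namespace_wide = [p for p in policies
                      if p["namespace"] == namespace and not p.get("selector")]
    mesh_wide = [p for p in policies
                 if p["namespace"] == root_namespace and not p.get("selector")]

    # The narrowest scope wins; UNSET defers to the next wider scope.
    for level in (workload, namespace_wide, mesh_wide):
        for policy in level:
            if policy.get("mode", "UNSET") != "UNSET":
                return policy["mode"]
    return "PERMISSIVE"  # behavior when no policy sets a mode


policies = [
    {"namespace": "istio-system", "mode": "PERMISSIVE"},
    {"namespace": "payments", "mode": "STRICT"},
    {"namespace": "payments", "selector": {"app": "legacy"}, "mode": "DISABLE"},
]
print(effective_mtls_mode(policies, "payments", {"app": "payments-api"}))
print(effective_mtls_mode(policies, "payments", {"app": "legacy"}))
```

Comparing this expected mode with the client-side DestinationRule TLS setting is one way to spot a mismatch before reading Envoy configuration.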

Networking

NAT Exhaustion and API Errors

Outbound calls to third-party APIs fail intermittently during a traffic spike.

Symptoms

  • Error rates rise for outbound dependencies while internal service traffic remains healthy.
  • Connection attempts time out or fail with resets during peak concurrency.
  • One NAT gateway, firewall, or node egress path carries most of the traffic.

Evidence

  • Compare active connections, conntrack usage, SNAT port use, and upstream destination fan-out.
  • Check whether retries multiply outbound concurrency.
  • Correlate failures with one source node, subnet, gateway, or external IP.

Command Examples

Command

conntrack -S

Example output

entries 1048211
searched 23041017
insert_failed 482

What it does: Shows whether the host or gateway is failing to allocate new connection-tracking entries.

Command

ss -Htan state established state time-wait | wc -l

Example output

58234

What it does: Estimates active and recently closed TCP connection pressure that can consume SNAT ports.

Command

tcpdump -nn host <destination-ip> and tcp

Example output

10.0.12.15.53412 > 198.51.100.20.443: Flags [S]
10.0.12.15.53412 > 198.51.100.20.443: Flags [S], retransmission

What it does: Distinguishes dropped SYNs, resets, and successful handshakes on the egress path.

Answer: Separate upstream failure from egress exhaustion by proving SNAT port, conntrack, retry, and destination distribution before scaling clients or blaming the API provider.
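
The SNAT side of that proof is mostly arithmetic: each connection to one destination IP and port needs its own source port on a NAT address. The sketch below is an estimate under stated assumptions, with hypothetical function names. The default of 64512 ports per address assumes the full 1024-65535 range is usable; managed NAT gateways usually publish a lower per-destination limit that should be used instead.

```python
def snat_utilization(conns_per_destination, nat_ips=1, ports_per_ip=64512):
    """Fraction of SNAT source ports in use for each (dest_ip, dest_port)."""
    capacity = nat_ips * ports_per_ip
    return {dest: count / capacity for dest, count in conns_per_destination.items()}


def worst_case_concurrency(base_concurrency, max_attempts):
    """Upper bound on concurrent connections if every request uses all its attempts."""
    return base_concurrency * max_attempts


# Hypothetical figures: connections to one API endpoint behind a single NAT address.
usage = snat_utilization({("198.51.100.20", 443): 58234})
for dest, fraction in usage.items():
    print(dest, f"{fraction:.0%}")
print(worst_case_concurrency(20000, 3))
```

Because the limit applies per destination, traffic concentrated on a single third-party endpoint can exhaust ports long before the connection-tracking table as a whole looks full.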

Networking

TLS Certificate Expiry at the Edge

Browsers reject a public endpoint while internal health checks still pass.

Symptoms

  • Users see certificate expired, wrong host, or incomplete-chain errors.
  • TCP connectivity works and HTTP health checks may remain green.
  • A recent load balancer, ingress, or gateway change touched TLS termination.

Evidence

  • Inspect the certificate chain from the same SNI and address users hit.
  • Compare DNS, load balancer listener, gateway secret, and backend certificate boundaries.
  • Check whether automation renewed a secret but the serving proxy did not reload it.

Command Examples

Command

openssl s_client -connect <host>:443 -servername <host> -showcerts </dev/null

Example output

subject=CN=www.example.com
issuer=C=US,O=Example CA
notAfter=Jun  7 12:00:00 2026 GMT
Verify return code: 0 (ok)

What it does: Inspects the public certificate chain served for the exact SNI users hit.
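
The `notAfter` value is easier to act on as a number of days. This is a minimal sketch, assuming the timestamp is in the format OpenSSL prints (as in the example output above, or from `openssl x509 -noout -enddate`) and that month names are parsed under an English or C locale; the function name is illustrative.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days from `now` to an OpenSSL-style notAfter timestamp (negative if expired)."""
    # Example input: "Jun  7 12:00:00 2026 GMT". OpenSSL always reports GMT.
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y GMT")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400


remaining = days_until_expiry("Jun  7 12:00:00 2026 GMT")
print(f"{remaining:.1f} days until expiry")
```

A certificate with a `Verify return code: 0 (ok)` result can still be one day from expiring, so the remaining lifetime is worth checking separately from chain validity.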

Command

curl -vI https://<host>/

Example output

* SSL connection using TLSv1.3
* subjectAltName: host "www.example.com" matched cert's "www.example.com"
HTTP/2 200

What it does: Confirms TLS negotiation, hostname validation, and the edge HTTP response in one request.

Command

kubectl get secret <secret-name> -o yaml

Example output

type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t...

What it does: Confirms the Kubernetes Secret exists and contains TLS material, before checking whether a proxy reloaded it.

Answer: Prove the exact TLS endpoint and SNI first; then rotate or reload the certificate at that layer and verify the public chain, SANs, expiry, and route behavior.

Databases

OpenSearch Shard Pressure

Search latency and indexing errors rise after shard count and disk use grow.

Symptoms

  • Cluster health is yellow or red, or relocation never finishes.
  • JVM pressure, disk watermarks, or pending tasks increase.
  • Search p95 latency rises while some nodes are much hotter than others.

Evidence

  • Compare shard allocation, disk watermarks, heap pressure, rejected threadpool tasks, and hot nodes.
  • Check index count, shard size, replica count, and recent rollover behavior.
  • Separate query load, indexing pressure, relocation, and disk-full risk.

Command Examples

Command

curl -s localhost:9200/_cluster/health?pretty

Example output

{
  "status" : "yellow",
  "active_shards_percent_as_number" : 94.2,
  "number_of_pending_tasks" : 37
}

What it does: Establishes cluster health, pending tasks, and whether shard availability is degraded.

Command

curl -s localhost:9200/_cat/shards?v

Example output

index    shard prirep state      docs  store ip          node
logs-01  2     r      UNASSIGNED
logs-01  2     p      STARTED    12m   42gb  10.0.1.12  os-node-1

What it does: Shows which shards are unassigned, relocating, or concentrated on hot nodes.

Command

curl -s localhost:9200/_cat/allocation?v

Example output

shards disk.indices disk.used disk.avail disk.percent host       node
318    1.9tb        2.7tb     110gb      96           10.0.1.12  os-node-1

What it does: Connects shard placement to disk watermarks and node imbalance.
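
To make the watermark link concrete, the sketch below parses `_cat/allocation?v` output and labels each node by the disk watermark it has crossed. The thresholds are the default low, high, and flood-stage percentages (85, 90, 95) and are an assumption here, since clusters can override them; the function names are illustrative.

```python
def parse_cat_table(text):
    """Parse whitespace-aligned _cat output with a header row into dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) == len(header):  # skip rows with empty columns (e.g. UNASSIGNED)
            rows.append(dict(zip(header, fields)))
    return rows


def watermark_state(disk_percent, low=85, high=90, flood_stage=95):
    """Name the highest disk watermark a node has reached."""
    if disk_percent >= flood_stage:
        return "flood_stage"
    if disk_percent >= high:
        return "high"
    if disk_percent >= low:
        return "low"
    return "ok"


allocation = """
shards disk.indices disk.used disk.avail disk.percent host       node
318    1.9tb        2.7tb     110gb      96           10.0.1.12  os-node-1
"""
for row in parse_cat_table(allocation):
    print(row["node"], watermark_state(int(row["disk.percent"])))
```

Above the low watermark a node stops receiving new shards, above the high watermark shards are moved away, and at flood stage indices with shards on the node are blocked for writes, which matches the indexing errors in the symptoms.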

Answer: Do not add random shards under pressure; first prove whether the bottleneck is disk watermark, heap, allocation, query load, indexing load, or shard-count overhead.

Machine Learning

RAG Quality Regression

Answers become less grounded after a retriever, embedding, or prompt change.

Symptoms

  • The model still responds fluently but cites irrelevant or stale context.
  • Retrieval scores look plausible while user task success drops.
  • A prompt, chunking, embedding model, or reranker deployment changed recently.

Evidence

  • Compare query text, retrieved chunk IDs, scores, reranker order, and final prompt context.
  • Replay a fixed evaluation set across old and new retriever pipelines.
  • Check whether chunking, metadata filters, or tenant boundaries changed.

Command Examples

Command

grep -R "retrieved_chunk_ids" logs/

Example output

request_id=42 query="renew cert" retrieved_chunk_ids=["tls-17","tls-22"] scores=[0.82,0.79]

What it does: Confirms which chunks were retrieved for failing answers and whether the IDs changed after release.

Command

python evals/rag_replay.py --before old.jsonl --after new.jsonl

Example output

query_set=golden_2026_06
recall@5: 0.82 -> 0.61
grounded_answer_rate: 0.74 -> 0.58

What it does: Replays fixed examples to separate retrieval regression from generation noise.
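
The `recall@5` figure in that output is a standard retrieval metric. The sketch below shows one way to compute it and is not the contents of `evals/rag_replay.py`; the function names, chunk IDs, and labels are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each example needs at least one relevant chunk ID")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(examples, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in examples]
    return sum(scores) / len(scores)


# Hypothetical golden-set entries: (retriever output, labeled relevant chunks).
old_run = [(["tls-17", "tls-22", "dns-3"], ["tls-17", "tls-22"])]
new_run = [(["dns-3", "tls-17", "nat-9"], ["tls-17", "tls-22"])]
print(mean_recall_at_k(old_run, 5), mean_recall_at_k(new_run, 5))
```

Holding the query set and relevance labels fixed across both runs is what lets a drop in this number be attributed to the retriever rather than to the generator.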

Command

curl -sS http://localhost:8000/search?q='<query>'

Example output

{"results":[{"chunk_id":"tls-17","score":0.82,"title":"TLS renewal runbook"}]}

What it does: Checks the live retrieval endpoint without running the full answer-generation path.

Answer: Treat RAG quality as a pipeline incident: isolate retrieval recall, reranking, prompt assembly, generation config, and citation policy before changing the model.

## Study Cards

Question

Why start labs from symptoms instead of commands?

Answer

Symptoms force you to name the failing user-visible behavior before collecting evidence.

Question

What should evidence do in an operational lab?

Answer

It should distinguish competing failure domains without depending on one favored fix.

Question

Why embed labs in topic pages?

Answer

The lab can be practiced beside the protocol, system, or tool behavior it depends on.

## References

- [Incident Entry Points](/docs/troubleshooting/incident-entrypoints/)
- [Kubernetes DNS and CoreDNS](/docs/kubernetes/dns-coredns/)
- [PostgreSQL Operations and HA](/docs/databases/postgres/operations-ha/)
- [Ceph Operations and Recovery](/docs/ceph/operations-recovery/)
- [Istio Security, mTLS, and Policy](/docs/istio/security-mtls-policy/)
- [NAT Gateways and NAT](/docs/networking/nat-gateways/)
- [Certificates and HTTPS](/docs/networking/certificates-https/)
- [OpenSearch](/docs/databases/opensearch/)
- [Retrieval-Augmented Generation](/docs/ml/rag/)
- [ML Serving, Inference, and vLLM](/docs/ml/serving-inference-vllm/)