Tech Study Guide
Scenario Labs
Operational labs for practicing cross-layer debugging with symptoms, evidence, checks, and answers.
Each lab starts from symptoms, then asks for evidence before an answer. The same lab cards are embedded into the related topic pages so the scenario stays near the concepts and commands it exercises.
Kubernetes DNS Outage
Pods intermittently fail name resolution while node-level DNS still works.
Symptoms
- `nslookup` inside Pods times out for ClusterIP Services.
- CoreDNS Pods are running but query latency spikes.
- Node lookups through the host resolver succeed.
Evidence
- Compare `/etc/resolv.conf` inside a Pod with the node resolver.
- Check CoreDNS logs, metrics, endpoints, and EndpointSlice watch errors.
- Run `dig +search +showsearch` to expose `ndots` expansion.
Command Examples
Command
kubectl -n kube-system logs deploy/coredns --tail=100
Example output
[ERROR] plugin/errors: 2 kubernetes.default.svc.cluster.local. A: read udp 10.244.1.8:39948->10.96.0.10:53: i/o timeout
[INFO] 10.244.2.17:43321 - 44812 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR
What it does: Shows whether CoreDNS is returning normal answers, timing out upstream, failing Kubernetes API watches, or logging plugin-level errors.
Command
kubectl -n kube-system get endpointslice,endpoints,svc -l k8s-app=kube-dns
Example output
NAME ADDRESSTYPE PORTS ENDPOINTS
endpointslice.discovery.k8s.io/kube-dns-abc12 IPv4 53 10.244.1.8,10.244.2.9
NAME TYPE CLUSTER-IP PORT(S)
service/kube-dns ClusterIP 10.96.0.10 53/UDP,53/TCP
What it does: Confirms the DNS Service has ready CoreDNS endpoints and exposes both UDP and TCP port 53.
Command
kubectl exec deploy/debug -- dig kubernetes.default.svc.cluster.local
Example output
;; status: NOERROR
kubernetes.default.svc.cluster.local. 30 IN A 10.96.0.1
What it does: Tests name resolution from the same Pod network path that application Pods use.
Answer: Prove whether the failure is stub resolver search expansion, CoreDNS health, upstream recursion, NetworkPolicy, or API watch state before changing the Corefile.
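The `ndots` search expansion named in the evidence list can be sketched in a few lines. This is a simplified model of glibc-style stub resolver behavior, not the resolver's actual code. The function name `candidate_names` and the search list (typical for a Pod in the `default` namespace with `ndots:5`) are assumptions for illustration.

```python
def candidate_names(name, search, ndots=5):
    """Return the query names a stub resolver would try, in order.

    A name ending in '.' is absolute and is tried alone. A name with
    fewer than `ndots` dots is tried with each search suffix first and
    as-is last; otherwise it is tried as-is first.
    """
    if name.endswith("."):
        return [name]
    with_search = [f"{name}.{suffix}." for suffix in search]
    as_is = f"{name}."
    if name.count(".") < ndots:
        return with_search + [as_is]
    return [as_is] + with_search


# Assumed search list for a Pod in namespace "default".
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
names = candidate_names("api.example.com", search, ndots=5)
for n in names:
    print(n)
```

Under these assumptions an external name with fewer than five dots produces three cluster-suffixed queries before the real one, which multiplies CoreDNS load. A trailing dot skips the expansion.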
PostgreSQL Failover and Pooling
A failover completes, but application errors continue through stale pooled connections.
Symptoms
- Application logs show read-only transaction or connection reset errors.
- PgBouncer pools point at the old primary.
- Replication lag is low, but write traffic still fails.
Evidence
- Compare `pg_is_in_recovery()` across endpoints.
- Inspect PgBouncer `SHOW POOLS` and `SHOW SERVERS`.
- Check application DNS TTL and connection retry behavior.
Command Examples
Command
psql -c "select pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp()"
Example output
pg_is_in_recovery | ?column?
-------------------+----------
f |
What it does: Identifies whether the endpoint is the writable primary and whether replay lag is relevant.
Command
psql -p 6432 pgbouncer -c "show pools"
Example output
database | user | cl_active | cl_waiting | sv_active | sv_idle
app | app | 120 | 18 | 20 | 0
What it does: Shows whether clients are stuck behind PgBouncer pools or mapped to stale server connections.
Command
psql -p 6432 pgbouncer -c "reconnect"
Example output
RECONNECT
What it does: Forces PgBouncer to drop old server connections after promotion or endpoint changes.
Answer: Treat database promotion, pool reconnection, DNS/service routing, and application retry budgets as one cutover sequence.
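The two checks above can be combined into a small decision helper. This is a minimal sketch: the endpoint names `db-a:5432` and `db-b:5432` and both function names are hypothetical, and the inputs are assumed to have been collected already with the `psql` commands shown earlier.

```python
def find_primaries(recovery_by_endpoint):
    """Return endpoints whose pg_is_in_recovery() result is False (writable)."""
    return sorted(ep for ep, in_recovery in recovery_by_endpoint.items()
                  if not in_recovery)


def pool_is_saturated(pool):
    """True when clients are queued and no idle server connection is free."""
    return pool["cl_waiting"] > 0 and pool["sv_idle"] == 0


# Hypothetical results of running the recovery check against each endpoint.
recovery = {"db-a:5432": True, "db-b:5432": False}
primaries = find_primaries(recovery)

# Row copied from the SHOW POOLS example output above.
pool = {"cl_active": 120, "cl_waiting": 18, "sv_active": 20, "sv_idle": 0}

if len(primaries) != 1:
    print("unexpected number of writable endpoints:", primaries)
else:
    print("writable primary:", primaries[0])
print("pool saturated:", pool_is_saturated(pool))
```

Zero or multiple writable endpoints points at the promotion step. Exactly one primary with a saturated pool points at pool reconnection or routing instead.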
Linux Boot Recovery
A host reaches emergency mode after a package or filesystem change.
Symptoms
- Console shows failed mounts or initramfs device discovery errors.
- `systemd` drops to the emergency target.
- Remote SSH never starts.
Evidence
- Read `journalctl -xb` from the failed boot.
- Compare `/etc/fstab` UUIDs with `blkid`.
- Check initramfs contents and kernel command line.
Command Examples
Command
journalctl -xb -p warning
Example output
systemd[1]: dev-disk-by\x2duuid-...device: Job timed out.
systemd[1]: Dependency failed for /data.
What it does: Finds the failed unit, mount, device, or boot dependency from the current emergency boot.
Command
blkid
Example output
/dev/nvme0n1p2: UUID="9f7c..." TYPE="ext4"
/dev/nvme1n1p1: UUID="2ad4..." TYPE="xfs"
What it does: Compares real block-device UUIDs and filesystem types against `/etc/fstab` and boot configuration.
Command
systemctl --failed
Example output
UNIT LOAD ACTIVE SUB DESCRIPTION
data.mount loaded failed failed /data
What it does: Lists failed systemd units so repair starts from the blocking dependency instead of guessing.
Answer: Restore boot by proving the failing unit or mount, disabling the bad dependency, rebuilding initramfs when needed, and keeping the recovery shell read/write only for the minimum repair.
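The `/etc/fstab` versus `blkid` comparison is easy to get wrong by eye, so it can help to script it. The sketch below works on captured text rather than reading the live system. The UUIDs in the sample data are made-up placeholders, not the elided values from the example output, and the function names are hypothetical.

```python
import re


def fstab_uuids(fstab_text):
    """Return {uuid: mount_point} for UUID= entries, skipping comments."""
    result = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("UUID="):
            result[fields[0][len("UUID="):].strip('"')] = fields[1]
    return result


def blkid_uuids(blkid_text):
    """Return the set of filesystem UUIDs reported by blkid."""
    # \b keeps PARTUUID="..." from matching as a filesystem UUID.
    return set(re.findall(r'\bUUID="([^"]+)"', blkid_text))


FSTAB = """
# <device> <mount> <type> <options> <dump> <pass>
UUID=1111-aaaa / ext4 defaults 0 1
UUID=3333-cccc /data xfs defaults 0 2
"""
BLKID = """
/dev/nvme0n1p2: UUID="1111-aaaa" TYPE="ext4" PARTUUID="9999-ffff"
/dev/nvme1n1p1: UUID="2222-bbbb" TYPE="xfs"
"""

present = blkid_uuids(BLKID)
missing = {u: mp for u, mp in fstab_uuids(FSTAB).items() if u not in present}
print("fstab entries with no matching device:", missing)
```

Any mount point listed as missing is a candidate for the dependency that blocks boot.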
vLLM Inference Latency Spike
Token latency rises after traffic mix changes even though GPU utilization looks acceptable.
Symptoms
- Time to first token is stable, but inter-token latency rises.
- Queue depth grows during long-context requests.
- KV-cache pressure increases before errors appear.
Evidence
- Compare prompt length, output length, and batch shape histograms.
- Track KV-cache utilization, prefill/decode split, and scheduler queue time.
- Check whether speculative decoding or tensor parallel settings changed.
Command Examples
Command
nvidia-smi dmon
Example output
# gpu sm mem enc dec mclk pclk
# Idx % % % % MHz MHz
0 72 88 0 0 1593 1410
What it does: Separates GPU compute pressure from memory-bandwidth pressure during prefill and decode.
Command
curl -sS http://localhost:8000/metrics | grep vllm
Example output
vllm:num_requests_waiting{model_name="llama"} 14
vllm:gpu_cache_usage_perc{model_name="llama"} 0.91
vllm:time_to_first_token_seconds_bucket{le="1.0"} 248
What it does: Shows queue depth, KV-cache pressure, and token-latency signals from the serving engine.
Command
kubectl top pod -l app=vllm
Example output
NAME CPU(cores) MEMORY(bytes)
vllm-0 920m 38Gi
What it does: Confirms whether Kubernetes-visible CPU and memory pressure line up with model-server symptoms.
Answer: Separate queueing, prefill saturation, decode throughput, KV-cache eviction, and model parallelism before scaling replicas or changing batch limits.
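To turn the metrics scrape into a quick signal, the text can be parsed and compared against thresholds. This is a rough sketch: the metric names are taken from the example output above, while the thresholds (`max_waiting`, `max_cache`) and function names are arbitrary assumptions that need tuning per deployment.

```python
def parse_metrics(text):
    """Parse simple Prometheus text lines into {metric_name: value}.

    Labels are dropped, so this only suits metrics with one series each.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        values[name] = float(value)
    return values


def pressure_signals(metrics, max_waiting=10, max_cache=0.9):
    """Return which pressure signals exceed the assumed thresholds."""
    signals = []
    if metrics.get("vllm:num_requests_waiting", 0) > max_waiting:
        signals.append("queueing")
    if metrics.get("vllm:gpu_cache_usage_perc", 0) > max_cache:
        signals.append("kv-cache pressure")
    return signals


SCRAPE = '''
vllm:num_requests_waiting{model_name="llama"} 14
vllm:gpu_cache_usage_perc{model_name="llama"} 0.91
'''

metrics = parse_metrics(SCRAPE)
signals = pressure_signals(metrics)
print(signals)
```

Seeing both signals together suggests that long-context requests are holding KV-cache blocks and delaying admission, which is a different problem from raw compute saturation.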
Ceph Degraded PGs After OSD Loss
A drive failure leaves placement groups degraded while client latency rises during recovery.
Symptoms
- `ceph -s` shows `active+degraded` or `active+undersized` placement groups.
- One OSD is down or flapping and recovery traffic increases.
- Application writes are slower but still mostly succeeding.
Evidence
- Capture `ceph health detail`, `ceph osd tree`, and `ceph pg dump_stuck` before changing flags.
- Check whether any OSD or CRUSH subtree is nearfull, backfillfull, or full.
- Compare client latency with recovery and backfill activity.
Command Examples
Command
ceph -s && ceph health detail
Example output
health: HEALTH_WARN
64 pgs degraded
osd.12 is down
What it does: Establishes the cluster health state and the specific warnings driving recovery work.
Command
ceph osd tree && ceph osd df tree
Example output
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT
12 hdd 7.276 osd.12 down 1.00000
ID REWEIGHT SIZE USE AVAIL %USE
12 1.00000 7.3T 0B 7.3T 0.00
What it does: Locates the failed OSD and checks whether capacity or CRUSH placement will constrain recovery.
Command
ceph pg dump_stuck && ceph -w
Example output
ok
2026-06-06T10:14:12 64 pgs active+degraded; recovery io 120 MiB/s
What it does: Watches whether placement groups are making progress toward `active+clean` or staying stuck.
Answer: Restore redundancy without creating extra churn: identify the failed domain, replace or restart only the bad OSD, avoid setting broad cluster flags that are later forgotten, and watch recovery until placement groups reach `active+clean`.
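Watching recovery means counting how many placement groups still lack redundancy between two snapshots. A minimal sketch, assuming the PG state strings have already been extracted from Ceph output into a list; the function name and the sample states are illustrative only.

```python
def summarize_pg_states(states):
    """Count PGs by whether they are clean or still missing redundancy."""
    summary = {"clean": 0, "missing_redundancy": 0, "other": 0}
    for state in states:
        flags = set(state.split("+"))
        if "degraded" in flags or "undersized" in flags:
            summary["missing_redundancy"] += 1
        elif {"active", "clean"} <= flags:
            summary["clean"] += 1
        else:
            summary["other"] += 1
    return summary


# Illustrative PG states, one string per placement group.
states = [
    "active+clean",
    "active+degraded",
    "active+undersized+degraded",
    "active+clean+scrubbing",
    "peering",
]
summary = summarize_pg_states(states)
print(summary)
```

If `missing_redundancy` falls between snapshots, recovery is progressing. A flat count alongside high recovery I/O is a reason to look for stuck PGs.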
Istio mTLS Policy Breakage
A service-to-service call starts returning proxy-generated 503 or 403 after a policy rollout.
Symptoms
- Client Pods can resolve the Service, but requests fail at the proxy.
- Access logs show mTLS, authorization, or upstream health response flags.
- Direct app logs do not show the failed request reaching the handler.
Evidence
- Compare PeerAuthentication, DestinationRule, AuthorizationPolicy, and RequestAuthentication scope.
- Inspect Envoy clusters, routes, endpoints, and secrets for the source or gateway proxy.
- Check workload identities and namespace labels before changing policy.
Command Examples
Command
istioctl analyze --all-namespaces
Example output
Warning [IST0101] (VirtualService payments.payments) Referenced host not found: payments-api
What it does: Finds mesh configuration conflicts before inspecting individual Envoy proxies.
Command
istioctl proxy-config clusters <pod-name> -n <namespace>
Example output
SERVICE FQDN PORT SUBSET DIRECTION TYPE
payments-api.payments.svc 8443 - outbound EDS
What it does: Shows whether the proxy received an outbound cluster for the target workload and port.
Command
istioctl proxy-config secret <pod-name> -n <namespace>
Example output
RESOURCE NAME TYPE STATUS VALID CERT SERIAL NUMBER
default Cert Chain ACTIVE true 1a2b3c
What it does: Verifies that the sidecar has active mTLS certificates and can participate in the mesh identity path.
Answer: Treat YAML as intent and proxy config as enforcement; prove whether the break is TLS mode, identity, authorization, endpoint readiness, or route attachment before relaxing policy.
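One common TLS-mode break is a mismatch between the server-side PeerAuthentication mode and the client-side DestinationRule TLS mode. The sketch below encodes only the two clearest conflicts. It is a deliberate simplification that ignores auto mTLS, port-level overrides, and policy precedence, and the function name is hypothetical.

```python
def mtls_mismatch(peer_auth_mode, client_tls_mode):
    """Return a reason string when the two settings conflict, else None.

    peer_auth_mode: PeerAuthentication mode applied to the server workload
        (STRICT, PERMISSIVE, or DISABLE).
    client_tls_mode: DestinationRule trafficPolicy.tls.mode seen by the
        client (only ISTIO_MUTUAL and DISABLE are modeled here).
    """
    if peer_auth_mode == "STRICT" and client_tls_mode == "DISABLE":
        return "server requires mTLS but client sends plaintext"
    if peer_auth_mode == "DISABLE" and client_tls_mode == "ISTIO_MUTUAL":
        return "client starts mTLS but server expects plaintext"
    return None


for server, client in [("STRICT", "DISABLE"), ("PERMISSIVE", "DISABLE"),
                       ("DISABLE", "ISTIO_MUTUAL"), ("STRICT", "ISTIO_MUTUAL")]:
    print(server, client, "->", mtls_mismatch(server, client) or "compatible")
```

A `None` result only rules out this class of break. Identity, authorization, and endpoint readiness still need their own evidence.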
Networking
NAT Exhaustion and API Errors
Outbound calls to third-party APIs fail intermittently during a traffic spike.
Symptoms
- Error rates rise for outbound dependencies while internal service traffic remains healthy.
- Connection attempts time out or fail with resets during peak concurrency.
- One NAT gateway, firewall, or node egress path carries most of the traffic.
Evidence
- Compare active connections, conntrack usage, SNAT port use, and upstream destination fan-out.
- Check whether retries multiply outbound concurrency.
- Correlate failures with one source node, subnet, gateway, or external IP.
Command Examples
Command
conntrack -S
Example output
entries 1048211
searched 23041017
insert_failed 482
What it does: Shows whether the host or gateway is failing to allocate new connection-tracking entries.
Command
ss -tan state established state time-wait | wc -l
Example output
58234
What it does: Estimates active and recently closed TCP connection pressure that can consume SNAT ports.
Command
tcpdump -nn host <destination-ip> and tcp
Example output
10.0.12.15.53412 > 198.51.100.20.443: Flags [S]
10.0.12.15.53412 > 198.51.100.20.443: Flags [S], retransmission
What it does: Distinguishes dropped SYNs, resets, and successful handshakes on the egress path.
Answer: Separate upstream failure from egress exhaustion by proving SNAT port, conntrack, retry, and destination distribution before scaling clients or blaming the API provider.
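Two pieces of arithmetic sit behind this answer: how close one external IP is to its SNAT port budget for a destination, and how much retries inflate outbound attempts. The sketch below assumes 64512 usable source ports (1024 through 65535) per destination. Real NAT gateways often allow fewer, so treat `ports_available` as a placeholder, and note that both function names are hypothetical.

```python
def snat_utilization(connections, ports_available=64512):
    """Fraction of SNAT ports used toward one destination from one external IP."""
    return connections / ports_available


def expected_attempts(failure_rate, max_retries):
    """Average attempts per request when each failed attempt is retried."""
    return sum(failure_rate ** i for i in range(max_retries + 1))


# 58234 is the connection count from the ss example above; treating all of
# it as traffic to a single destination is a worst-case assumption.
utilization = snat_utilization(58234)
amplification = expected_attempts(failure_rate=0.5, max_retries=3)
print(f"SNAT port utilization: {utilization:.0%}")
print(f"attempts per request: {amplification}")
```

With half of all attempts failing and three retries allowed, each request averages almost two connection attempts, so a partial outage can push an already busy egress path into exhaustion.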
Networking
TLS Certificate Expiry at the Edge
Browsers reject a public endpoint while internal health checks still pass.
Symptoms
- Users see certificate expired, wrong host, or incomplete-chain errors.
- TCP connectivity works and HTTP health checks may remain green.
- A recent load balancer, ingress, or gateway change touched TLS termination.
Evidence
- Inspect the certificate chain from the same SNI and address users hit.
- Compare DNS, load balancer listener, gateway secret, and backend certificate boundaries.
- Check whether automation renewed a secret but the serving proxy did not reload it.
Command Examples
Command
openssl s_client -connect <address>:443 -servername <hostname> -showcerts </dev/null
Example output
subject=CN=www.example.com
issuer=C=US,O=Example CA
notAfter=Jun 7 12:00:00 2026 GMT
Verify return code: 0 (ok)
What it does: Inspects the public certificate chain served for the exact SNI users hit.
Command
curl -vI https://<hostname>/
Example output
* SSL connection using TLSv1.3
* subjectAltName: host "www.example.com" matched cert's "www.example.com"
HTTP/2 200
What it does: Confirms TLS negotiation, hostname validation, and the edge HTTP response in one request.
Command
kubectl get secret <secret-name> -o yaml
Example output
type: kubernetes.io/tls
data:
tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t...
What it does: Confirms the Kubernetes Secret exists and contains TLS material, before checking whether a proxy reloaded it.
Answer: Prove the exact TLS endpoint and SNI first; then rotate or reload the certificate at that layer and verify the public chain, SANs, expiry, and route behavior.
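Once the `notAfter` value has been captured from the served certificate, the remaining lifetime is a simple date subtraction. A minimal sketch, assuming the OpenSSL-style timestamp format shown in the example output and an English month-name locale; the function name and the fixed `now` value are illustrative.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Days between `now` and an OpenSSL-style notAfter timestamp."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y GMT")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds() / 86400


# Timestamp from the example output above; `now` is fixed for repeatability.
now = datetime(2026, 6, 6, 12, 0, 0, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun 7 12:00:00 2026 GMT", now)
print(f"days until expiry: {remaining:.1f}")
```

A negative result means the certificate served at that endpoint has already expired, even if a renewed Secret exists elsewhere.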
Databases
OpenSearch Shard Pressure
Search latency and indexing errors rise after shard count and disk use grow.
Symptoms
- Cluster health is yellow or red, or relocation never finishes.
- JVM pressure, disk watermarks, or pending tasks increase.
- Search p95 latency rises while some nodes are much hotter than others.
Evidence
- Compare shard allocation, disk watermarks, heap pressure, rejected threadpool tasks, and hot nodes.
- Check index count, shard size, replica count, and recent rollover behavior.
- Separate query load, indexing pressure, relocation, and disk-full risk.
Command Examples
Command
curl -s 'localhost:9200/_cluster/health?pretty'
Example output
{
"status" : "yellow",
"active_shards_percent_as_number" : 94.2,
"number_of_pending_tasks" : 37
}
What it does: Establishes cluster health, pending tasks, and whether shard availability is degraded.
Command
curl -s 'localhost:9200/_cat/shards?v'
Example output
index shard prirep state docs store ip node
logs-01 2 r UNASSIGNED
logs-01 2 p STARTED 12m 42gb 10.0.1.12 os-node-1
What it does: Shows which shards are unassigned, relocating, or concentrated on hot nodes.
Command
curl -s 'localhost:9200/_cat/allocation?v'
Example output
shards disk.indices disk.used disk.avail disk.percent host node
318 1.9tb 2.7tb 110gb 96 10.0.1.12 os-node-1
What it does: Connects shard placement to disk watermarks and node imbalance.
Answer: Do not add random shards under pressure; first prove whether the bottleneck is disk watermark, heap, allocation, query load, indexing load, or shard-count overhead.
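The disk-watermark part of that check can be scripted against the `_cat/allocation` output. This sketch assumes the cluster still uses the default watermarks of 85% (low), 90% (high), and 95% (flood stage). Clusters with overridden settings need different values, and both function names are hypothetical.

```python
def watermark_state(disk_percent, low=85, high=90, flood=95):
    """Name the highest disk watermark a node has crossed."""
    if disk_percent >= flood:
        return "flood_stage"
    if disk_percent >= high:
        return "high"
    if disk_percent >= low:
        return "low"
    return "ok"


def parse_cat_table(text):
    """Turn whitespace-separated _cat output (with a header row) into dicts."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]


# Header and row copied from the example output above.
ALLOCATION = """
shards disk.indices disk.used disk.avail disk.percent host node
318 1.9tb 2.7tb 110gb 96 10.0.1.12 os-node-1
"""

node_states = {row["node"]: watermark_state(int(row["disk.percent"]))
               for row in parse_cat_table(ALLOCATION)}
print(node_states)
```

A node past the flood-stage watermark explains indexing errors on its own, which makes disk the first thing to relieve before touching shard counts.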
Machine Learning
RAG Quality Regression
Answers become less grounded after a retriever, embedding, or prompt change.
Symptoms
- The model still responds fluently but cites irrelevant or stale context.
- Retrieval scores look plausible while user task success drops.
- A prompt, chunking, embedding model, or reranker deployment changed recently.
Evidence
- Compare query text, retrieved chunk IDs, scores, reranker order, and final prompt context.
- Replay a fixed evaluation set across old and new retriever pipelines.
- Check whether chunking, metadata filters, or tenant boundaries changed.
Command Examples
Command
grep -R "retrieved_chunk_ids" logs/
Example output
request_id=42 query="renew cert" retrieved_chunk_ids=["tls-17","tls-22"] scores=[0.82,0.79]
What it does: Confirms which chunks were retrieved for failing answers and whether the IDs changed after release.
Command
python evals/rag_replay.py --before old.jsonl --after new.jsonl
Example output
query_set=golden_2026_06
recall@5: 0.82 -> 0.61
grounded_answer_rate: 0.74 -> 0.58
What it does: Replays fixed examples to separate retrieval regression from generation noise.
Command
curl -sS 'http://localhost:8000/search?q=<query>'
Example output
{"results":[{"chunk_id":"tls-17","score":0.82,"title":"TLS renewal runbook"}]}
What it does: Checks the live retrieval endpoint without running the full answer-generation path.
Answer: Treat RAG quality as a pipeline incident: isolate retrieval recall, reranking, prompt assembly, generation config, and citation policy before changing the model.
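Isolating retrieval recall starts with a metric that can be replayed on a fixed query set. Below is a minimal recall@k sketch. The chunk IDs and the relevance labels are invented for illustration, and this is not the implementation behind `evals/rag_replay.py`.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


# Hypothetical labels: the chunks a correct answer must draw on.
relevant = {"tls-17", "tls-22"}

# Hypothetical retrieval results before and after the release.
before = ["tls-17", "tls-22", "tls-03"]
after = ["tls-17", "dns-04", "dns-09"]

recall_before = recall_at_k(before, relevant)
recall_after = recall_at_k(after, relevant)
print(f"recall@5: {recall_before} -> {recall_after}")
```

A drop here with an unchanged prompt and model points at the retriever, chunking, or filters rather than at generation.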
## Study Cards
Question
Why start labs from symptoms instead of commands?
Answer
Symptoms force you to name the failing user-visible behavior before collecting evidence.
Question
What should evidence do in an operational lab?
Answer
It should distinguish competing failure domains without depending on one favored fix.
Question
Why embed labs in topic pages?
Answer
The lab can be practiced beside the protocol, system, or tool behavior it depends on.
## References
- [Incident Entry Points](/docs/troubleshooting/incident-entrypoints/)
- [Kubernetes DNS and CoreDNS](/docs/kubernetes/dns-coredns/)
- [PostgreSQL Operations and HA](/docs/databases/postgres/operations-ha/)
- [Ceph Operations and Recovery](/docs/ceph/operations-recovery/)
- [Istio Security, mTLS, and Policy](/docs/istio/security-mtls-policy/)
- [NAT Gateways and NAT](/docs/networking/nat-gateways/)
- [Certificates and HTTPS](/docs/networking/certificates-https/)
- [OpenSearch](/docs/databases/opensearch/)
- [Retrieval-Augmented Generation](/docs/ml/rag/)
- [ML Serving, Inference, and vLLM](/docs/ml/serving-inference-vllm/)