Incident Entry Points

Start with the symptom the caller sees, then move toward the layer that can prove or disprove it. Each entry point links back to the deeper pages; the goal here is to keep the first ten minutes of an incident concrete.

Command Examples

date -Is
curl -v --connect-timeout 3 --max-time 15 "$URL"
getent hosts "$HOST"
ip route get "$IP"
ss -tanp
journalctl -k --since -10min -g 'TCP|conntrack|DROP|REJECT|oom|reset'

For Kubernetes:

kubectl get pod,svc,endpointslice,ingress,gateway -A -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -50
kubectl exec deploy/client -- curl -v "$URL"

Example output and meaning:

Command	Example output	What it does
`curl -v --connect-timeout 3 --max-time 15 "$URL"`	Connection timing, TLS details, response headers, or timeout.	Anchors the user-visible symptom in one reproducible request.
`journalctl -k --since -10min -g 'TCP|conntrack|DROP|REJECT|oom|reset'`	Kernel log lines for drops, conntrack, resets, or OOM kills.	Finds host-level evidence that dashboards often miss.
`kubectl get events -A --sort-by=.lastTimestamp | tail -50`	Recent scheduling, readiness, pull, probe, or endpoint events.	Adds cluster control-plane context to the same incident window.

Symptom Map

Symptom	First Split	Likely Pages
Timeout	SYN timeout, TLS timeout, HTTP 504, DB timeout, or queue wait.	Request Path, Cross-Layer Incident Runbooks, Resilience, Timeouts, and Draining
Reset	RST before TLS, after idle, during upload, or from proxy.	TCP and Sockets, Packet Capture and Analysis
DNS works on node but not Pod	Resolver config, NetworkPolicy, CoreDNS, NodeLocal DNSCache, or CNI.	Kubernetes DNS and CoreDNS, Pod Networking and CNI
Large payload hangs	MTU, fragmentation, proxy body limit, TLS record path, or streaming timeout.	Packet Path, ICMP, MTU, and Path Testing
Intermittent 5xx	Endpoint churn, load balancer health, retries, database pool, or partial node datapath.	Load Balancers and Proxies, Kubernetes Services and EndpointSlices

Timeouts

Timeouts are not one failure. Split them by boundary:

curl -v --connect-timeout 2 --max-time 10 https://api.example.com/
openssl s_client -connect api.example.com:443 -servername api.example.com -brief </dev/null
tcpdump -nn -i any 'host 198.51.100.10 and tcp'

If connect times out, inspect route, firewall, NAT, listener, and return path. If TLS times out, inspect SNI, ALPN, certificate policy, large packets, and proxy routing. If HTTP returns 504, inspect upstream health and timeout budget.
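The split above can be automated from curl's own timing fields. The following is a minimal sketch, not a definitive tool: `classify_attempt` is a hypothetical helper name, and it assumes an HTTPS URL and that curl reports `0` for any phase it never completed.

```shell
# True when a curl timing value is greater than zero, i.e. the phase completed.
reached() { awk -v t="$1" 'BEGIN { exit !(t + 0 > 0) }'; }

# Classify one curl attempt by the boundary it reached.
#   $1 curl exit code   $2 time_namelookup   $3 time_connect
#   $4 time_appconnect  $5 http_code
classify_attempt() {
  rc=$1 t_dns=$2 t_connect=$3 t_tls=$4 http=$5
  case "$rc" in
    6)  echo "dns-failure" ;;                      # could not resolve host
    7)  echo "connect-refused-or-unreachable" ;;   # failed to connect
    28)                                            # a timeout fired
      if   ! reached "$t_dns";     then echo "dns-timeout"
      elif ! reached "$t_connect"; then echo "connect-timeout"
      elif ! reached "$t_tls";     then echo "tls-timeout"
      else                              echo "response-timeout"
      fi ;;
    0)
      if [ "$http" = "504" ]; then echo "http-504"; else echo "completed-http-$http"; fi ;;
    *)  echo "other-curl-error-$rc" ;;
  esac
}

# Usage (not run here; $URL is whatever the caller was requesting):
#   out=$(curl -s -o /dev/null --connect-timeout 2 --max-time 10 \
#         -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{http_code}' "$URL"); rc=$?
#   classify_attempt "$rc" $out
```

For a plain HTTP URL `time_appconnect` is always zero, so the `tls-timeout` label would be misleading there. The result only says which boundary to inspect first; it does not replace a packet capture.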

Resets

Resets are active signals. Identify who sent the RST before changing policy.

tcpdump -nn -i any 'tcp[tcpflags] & tcp-rst != 0'
ss -tanpi
journalctl -u nginx --since -10min

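RST immediately after SYN usually means nothing is listening on the port (connection refused) or a firewall rule rejected the connection. RST after idle usually means mismatched idle timeouts: a NAT, conntrack table, load balancer, or proxy dropped the flow state before the endpoints stopped using it. RST after ClientHello often means a TLS policy, SNI, ALPN, mTLS, or proxy mismatch.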
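To see who is sending the resets, the text output of the capture can be tallied by source address. This is a rough sketch: `rst_senders` is a hypothetical helper, and it assumes tcpdump's default one-line text format, where the source address follows the `IP` or `IP6` token and RST packets show as `Flags [R]` or `Flags [R.]`.

```shell
# Count RST packets per source address from tcpdump text output on stdin.
rst_senders() {
  awk '
    /Flags \[R/ {
      src = ""
      # With "-i any", newer tcpdump prints interface and direction first,
      # so locate the protocol token instead of using a fixed column.
      for (i = 1; i < NF; i++)
        if ($i == "IP" || $i == "IP6") { src = $(i + 1); break }
      if (src == "") next
      sub(/\.[0-9]+$/, "", src)   # strip the trailing port
      count[src]++
    }
    END { for (s in count) print count[s], s }
  ' | sort -rn
}

# Usage (not run here):
#   tcpdump -nn -i any 'tcp[tcpflags] & tcp-rst != 0' -c 200 | rst_senders
```

The source address only shows which side of the capture point the RST came from. A proxy or firewall can send a reset on behalf of the peer, so comparing captures from both ends is still worthwhile.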

DNS Works On Node But Not Pod

kubectl exec deploy/client -- cat /etc/resolv.conf
kubectl exec deploy/client -- nslookup kubernetes.default.svc.cluster.local
kubectl exec deploy/client -- nslookup api.example.com
kubectl -n kube-system logs deployment/coredns --since=10m
kubectl get networkpolicy -A

Compare the node resolver, Pod resolver, CoreDNS Service, NodeLocal DNSCache, and NetworkPolicy. A successful lookup from a node shell does not prove that Pod DNS works.
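A quick way to compare the node and Pod resolvers is to reduce each `resolv.conf` to one line. The helper below is a sketch; `resolver_summary` is a hypothetical name, and it assumes the glibc default of `ndots:1` when no `options ndots:` entry is present.

```shell
# Summarize a resolv.conf read from stdin: nameservers, search-domain count, ndots.
resolver_summary() {
  awk '
    $1 == "nameserver" { ns = ns (ns ? "," : "") $2 }
    $1 == "search"     { n = NF - 1 }
    $1 == "options"    { for (i = 2; i <= NF; i++)
                           if ($i ~ /^ndots:/) nd = substr($i, 7) }
    END {
      printf "nameservers=%s search_domains=%d ndots=%s\n", ns, n, (nd == "" ? "1" : nd)
    }
  '
}

# Usage (not run here):
#   resolver_summary < /etc/resolv.conf                                   # on the node
#   kubectl exec deploy/client -- cat /etc/resolv.conf | resolver_summary # in the Pod
```

If the two summaries differ, the node lookup was using a different nameserver or search path than the Pod, which is why it cannot stand in for a Pod-side test.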

Large Payloads Hang

tracepath api.example.com
ping -M do -s 1472 api.example.com
curl -v --data-binary @large.bin https://api.example.com/upload
tcpdump -nn -i any 'icmp or host 198.51.100.10'

If small requests work and large ones hang, inspect MTU, ICMP Packet Too Big, proxy request body limits, upload buffering, and database write latency.
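The `1472` in the ping command above is the 1500-byte Ethernet MTU minus the IPv4 and ICMP headers. When the suspected path MTU is smaller, the payload size has to shrink with it. A small sketch of that arithmetic, with `ping_payload_for_mtu` as a hypothetical helper name:

```shell
# Largest ICMP echo payload that fits in one unfragmented packet.
#   $1 = path MTU in bytes, $2 = IP version (4 or 6, default 4)
ping_payload_for_mtu() {
  mtu=$1 ver=${2:-4}
  case "$ver" in
    4) overhead=28 ;;  # 20-byte IPv4 header (no options) + 8-byte ICMP header
    6) overhead=48 ;;  # 40-byte IPv6 header + 8-byte ICMPv6 header
    *) echo "unknown IP version: $ver" >&2; return 1 ;;
  esac
  echo $((mtu - overhead))
}

# Usage (not run here; 1450 is an assumed overlay MTU, not a value from this page):
#   ping -M do -s "$(ping_payload_for_mtu 1450)" api.example.com
```

If the computed size succeeds with `-M do` and one byte more fails, the path MTU is confirmed at that value.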

Intermittent 5xx

Intermittent 5xx is usually partial capacity, partial routing, or partial dependency failure.

kubectl get endpointslice -l kubernetes.io/service-name=api -o wide
kubectl get pods -l app=api -o wide
kubectl logs deploy/api --since=10m | grep -E '\b5[0-9]{2}\b|timeout|reset|pool|deadline'

Look for one bad node, one bad endpoint, a small subset of failing request IDs, endpoint churn during rollout, or retries amplifying a dependency issue.
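To check whether failures cluster on one backend, error lines can be counted per Pod. This is a sketch under assumptions: `errors_by_pod` is a hypothetical helper, the log lines are assumed to carry the `[pod/<name>/<container>]` prefix that `kubectl logs --prefix` adds, and the application is assumed to log the status code as a standalone three-digit number.

```shell
# Count error-looking log lines per source Pod from prefixed log lines on stdin.
errors_by_pod() {
  grep -E '(^|[^0-9])5[0-9]{2}([^0-9]|$)|timeout|reset|pool|deadline' \
    | awk '{ c[$1]++ } END { for (p in c) print c[p], p }' \
    | sort -rn
}

# Usage (not run here):
#   kubectl logs -l app=api --prefix --since=10m --tail=-1 | errors_by_pod
```

A skewed count points at one Pod or its node; an even spread points back at a shared dependency or the load balancer. Note that `kubectl logs deploy/api` reads from a single Pod of the Deployment, so a label selector is needed to see all replicas.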

Study Cards

Question

Why start incidents from symptoms?

Answer

The visible error narrows which boundary to test first while preserving evidence from the caller's point of view.

Question

Why is node DNS not proof of Pod DNS?

Answer

Pods have their own resolver config, search paths, NetworkPolicy, CoreDNS path, and sometimes NodeLocal DNSCache behavior.

Question

What does intermittent 5xx often imply?

Answer

A partial failure such as one backend, one node, one route, endpoint churn, or a dependency pool problem.
