Cross-Layer Incident Runbooks

Production network incidents rarely stay inside one layer. An HTTP 504 can be an overloaded app, a stale EndpointSlice, a proxy timeout, a DNS answer pointing at the wrong load balancer, a conntrack table at capacity, or a Pod-to-node datapath issue. The job is to preserve evidence and walk from symptom to boundary, not to tune the first knob that looks familiar.

Command Examples

Capture the exact source, destination name, resolved address, port, protocol, timestamp, and error string before changing state.

date -Is
getent hosts api.example.com
curl -v --connect-timeout 3 --max-time 10 https://api.example.com/health
ss -tanp '( dport = :443 or sport = :443 )'
ip route get "$(getent ahostsv4 api.example.com | awk 'NR==1 {print $1}')"
journalctl -k --since -10min -g 'TCP|conntrack|DROP|REJECT|martian|oom'
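
One way to keep that evidence together is a small wrapper that records each command, when it ran, and its exit status. This is a minimal sketch: the capture function and the EVIDENCE_DIR variable are hypothetical names chosen for illustration, not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical helper: append a command's output, a UTC timestamp, and its
# exit status to a per-label file. EVIDENCE_DIR is an assumed variable name.
EVIDENCE_DIR="${EVIDENCE_DIR:-${TMPDIR:-/tmp}/evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
mkdir -p "$EVIDENCE_DIR"

capture() {
    # $1 is a short label used as the file name; the rest is the command.
    label="$1"; shift
    out="$EVIDENCE_DIR/$label.txt"
    {
        printf '### %s\n### $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
        rc=0
        "$@" 2>&1 || rc=$?          # a failing command is evidence, not an error
        printf '### exit=%s\n' "$rc"
    } >>"$out"
}

# Example use during an incident (read-only commands from the list above):
#   capture dns   getent hosts api.example.com
#   capture http  curl -v --connect-timeout 3 --max-time 10 https://api.example.com/health
#   capture route ip route get 198.51.100.10
```

Because the wrapper only appends, rerunning a command later leaves the earlier output in place for comparison.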

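In Kubernetes, collect API state and datapath hints together. Drop gateway from the resource list if the Gateway API CRDs are not installed, because an unknown resource type makes the whole command fail: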

kubectl get pod,svc,endpointslice,ingress,gateway -A -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -50
kubectl -n kube-system get pods -o wide
kubectl -n kube-system logs deployment/coredns --since=10m

Example output and meaning:

Command	Example output	What it tells you
`date -Is`	`2026-06-06T10:24:33-07:00`	Pins evidence to a timestamp before logs rotate or retries change state.
`curl -v --connect-timeout 3 --max-time 10 https://api.example.com/health`	Connect, TLS, and HTTP status timing.	Places the symptom at the client, edge, proxy, or application boundary.
`kubectl -n kube-system logs deployment/coredns --since=10m`	CoreDNS query errors, SERVFAILs, or normal answers.	Checks whether DNS is part of the incident window.

HTTP 504

An HTTP 504 means a gateway or proxy timed out waiting for an upstream response. It does not prove the upstream process was down.

Boundary	What To Prove	Evidence
Client to edge	DNS, TCP, TLS, and HTTP reach the gateway.	`curl -v`, SNI, access logs.
Edge route	Host/path matched the intended backend.	proxy route config, ingress status, Gateway route status.
Upstream selection	Backend endpoints were healthy and ready.	EndpointSlices, load-balancer target health, proxy upstream stats.
Backend path	The app accepted and completed the request.	app logs, `ss`, latency metrics, dependency traces.
Timeout budget	Outer timeouts are longer than expected inner work.	client, LB, proxy, app, database timeout config.

Practical checks:

curl -vk --resolve api.example.com:443:198.51.100.10 https://api.example.com/slow
kubectl get endpointslice -l kubernetes.io/service-name=api -o wide
kubectl logs deploy/api --since=10m | grep -E 'timeout|deadline|upstream|cancel'

If backend logs show work finishing after the proxy gives up, fix the timeout budget or make the operation asynchronous. If backend logs show no request at all, focus on route matching, EndpointSlices, NetworkPolicy, kube-proxy or CNI state, and proxy upstream health.
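
The timeout-budget rule from the table (each outer timeout longer than the inner work it waits on) can be checked mechanically once the values are collected. The sketch below assumes whole-second timeouts listed from outermost to innermost; check_timeout_budget and the layer names are hypothetical, and the numbers in the usage comment are illustrative, not recommendations.

```shell
#!/bin/sh
# Hypothetical helper: verify that every layer's timeout is strictly longer
# than the next inner layer's. Arguments are name=seconds, outermost first.
check_timeout_budget() {
    prev_name=""; prev_val=""; status=0
    for pair in "$@"; do
        name="${pair%%=*}"; val="${pair#*=}"
        if [ -n "$prev_name" ] && [ "$val" -ge "$prev_val" ]; then
            printf 'VIOLATION: %s (%ss) is not longer than %s (%ss)\n' \
                "$prev_name" "$prev_val" "$name" "$val"
            status=1
        fi
        prev_name="$name"; prev_val="$val"
    done
    [ "$status" -eq 0 ] && echo "OK: each outer timeout exceeds the next inner one"
    return "$status"
}

# Example with made-up values:
#   check_timeout_budget client=30 lb=25 proxy=20 app=15 db=10
```

A violation such as a 10-second proxy timeout in front of a 15-second application deadline is the pattern that produces a 504 while the backend still completes the work.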

Connection Refused

ECONNREFUSED usually means a TCP RST came back in response to the SYN: the destination host was reachable but nothing was listening on the port, or a firewall actively rejected the flow.

nc -vz api.example.com 443
ss -ltnp | grep ':443'
tcpdump -nn -i any 'tcp port 443 and (tcp[tcpflags] & (tcp-rst|tcp-syn) != 0)'

In Kubernetes:

kubectl get svc api -o yaml
kubectl get endpointslice -l kubernetes.io/service-name=api -o yaml
kubectl exec deploy/client -- nc -vz api.default.svc.cluster.local 443
kubectl exec deploy/client -- nc -vz <pod-ip> 8443

If the direct Pod IP works but the Service IP fails, inspect kube-proxy or its replacement, EndpointSlices, the Service port and targetPort, and the node-local datapath. If the Pod IP itself refuses, inspect container ports, the app bind address, readiness, and process listeners.
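
The same decision can be written down as a small lookup so the two probe results are always interpreted the same way. This is a sketch only: next_step is a hypothetical helper, and its inputs are the outcomes ("ok" or "fail") of the Service and Pod IP probes shown above.

```shell
#!/bin/sh
# Hypothetical helper: map the Service probe and Pod IP probe results
# ("ok" or "fail") to the area worth inspecting next.
next_step() {
    case "$1/$2" in
        ok/ok)
            echo "Both paths connect: retest from the failing client and namespace." ;;
        fail/ok)
            echo "Service path: check EndpointSlices, port and targetPort, kube-proxy or its replacement, node-local datapath." ;;
        ok/fail|fail/fail)
            echo "Workload: check container ports, app bind address, readiness, process listeners." ;;
        *)
            echo "usage: next_step <ok|fail> <ok|fail>" >&2
            return 2 ;;
    esac
}

# Example wiring (placeholders left as in the commands above):
#   svc=fail; kubectl exec deploy/client -- nc -vz api.default.svc.cluster.local 443 && svc=ok
#   pod=fail; kubectl exec deploy/client -- nc -vz <pod-ip> 8443 && pod=ok
#   next_step "$svc" "$pod"
```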

Connection Reset

Resets are active aborts. They may come from the app, kernel, proxy, firewall, load balancer, or peer.

tcpdump -nn -i any 'host 203.0.113.10 and tcp[tcpflags] & tcp-rst != 0'
ss -ti dst 203.0.113.10
journalctl -u nginx --since -10min
journalctl -k --since -10min -g 'reset|conntrack|nf_conntrack'

Use sequence numbers and capture position to identify who sent the reset. A reset from the proxy after an idle period points at timeout alignment. A reset immediately after ClientHello often points at TLS policy, SNI, ALPN, or mTLS mismatch.
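
As a first pass before reading sequence numbers, it can help to tally which source addresses the resets claim to come from. The sketch below assumes the default one-line text output of tcpdump -nn (an "IP" or "IP6" token followed by source.port, then a "Flags [..]" field); count_rst_senders is a hypothetical name. A source address alone does not prove the sender, since a middlebox can forge it, so treat the result as a pointer for where to capture next.

```shell
#!/bin/sh
# Hypothetical helper: read tcpdump -nn text output on stdin and count
# packets with the RST flag per source address (port stripped).
count_rst_senders() {
    awk '
        {
            src = ""; rst = 0
            for (i = 1; i < NF; i++) {
                # With "-i any", newer tcpdump prints the interface and
                # direction first, so locate the IP token instead of
                # assuming a fixed column.
                if (src == "" && ($i == "IP" || $i == "IP6")) src = $(i + 1)
                if ($i == "Flags" && index($(i + 1), "R") > 0) rst = 1
            }
            if (rst && src != "") {
                sub(/\.[0-9]+$/, "", src)   # drop the trailing .port
                count[src]++
            }
        }
        END { for (s in count) printf "%s %d\n", s, count[s] }
    ' | sort
}

# Example:
#   tcpdump -nn -i any 'host 203.0.113.10 and tcp[tcpflags] & tcp-rst != 0' | count_rst_senders
```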

TLS Timeout or Handshake Failure

Separate TCP connection, TLS handshake, certificate validation, and HTTP.

openssl s_client -connect api.example.com:443 -servername api.example.com -showcerts
curl -vk --tlsv1.3 https://api.example.com/
tcpdump -nn -i any 'host 198.51.100.10 and port 443'

Symptom	Likely Boundary
TCP SYN retransmits	route, firewall, NAT, listener, or return path.
ClientHello sent, no ServerHello	TLS policy, SNI routing, proxy, blocked large packet, or backend stall.
Certificate verify failed	trust store, missing intermediate, SAN, expiry, or private CA.
mTLS alert	missing client certificate, wrong client CA, EKU, SPIFFE/SAN policy, or proxy termination.
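
To place a slow or failing request in one of these phases, curl's cumulative write-out timers (time_namelookup, time_connect, time_appconnect, time_starttransfer) can be turned into per-phase durations. This is a sketch: phase_breakdown is a hypothetical helper, and it assumes an HTTPS request, since time_appconnect is 0 for plain HTTP. Certificate validation happens inside the TLS phase and is not separated out here.

```shell
#!/bin/sh
# Hypothetical helper: convert curl's cumulative timers (seconds) into
# per-phase durations in milliseconds.
# Args: time_namelookup time_connect time_appconnect time_starttransfer
phase_breakdown() {
    awk -v dns="$1" -v tcp="$2" -v tls="$3" -v ttfb="$4" 'BEGIN {
        printf "dns=%.0fms tcp=%.0fms tls=%.0fms http_wait=%.0fms\n",
            dns * 1000, (tcp - dns) * 1000, (tls - tcp) * 1000, (ttfb - tls) * 1000
    }'
}

# Example:
#   phase_breakdown $(curl -so /dev/null \
#       -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer}' \
#       https://api.example.com/)
```

A large tls value with a small tcp value points at the handshake rows of the table; a large http_wait points past TLS to the proxy or backend.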

DNS Intermittent

DNS intermittency is usually caused by cache scope, negative caching, split-horizon routing, resolver health, UDP loss, or search-path amplification.

getent hosts api.example.com
dig api.example.com A
dig api.example.com AAAA
dig @<configured-resolver> api.example.com A +tries=1 +time=2
dig +tcp @<configured-resolver> api.example.com A
resolvectl status

In Kubernetes:

kubectl exec deploy/client -- cat /etc/resolv.conf
kubectl exec deploy/client -- nslookup api.default.svc.cluster.local
kubectl -n kube-system logs deployment/coredns --since=10m
kubectl get networkpolicy -A

If UDP DNS fails but TCP DNS succeeds, check MTU, fragments, firewall state, and conntrack. If only short names fail, inspect search domains and ndots.
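
Search-path amplification is easier to reason about when the candidate names are listed explicitly. The sketch below assumes glibc-style resolver behavior as described in resolv.conf(5): a name with fewer dots than ndots is tried against each search domain first and as an absolute name last, while a name with a trailing dot skips the search list. Other resolvers (musl, for example) behave differently. expand_queries is a hypothetical helper, and the search domains in the example are typical of a Pod in the default namespace, not read from a real cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the names a glibc-style stub resolver would
# try, in order. Args: name, ndots, then the search domains in order.
expand_queries() {
    name="$1"; ndots="$2"; shift 2
    case "$name" in
        *.) echo "$name"; return 0 ;;   # trailing dot: absolute, no search list
    esac
    dots="$(printf '%s' "$name" | tr -cd '.' | wc -c)"
    dots=$((dots))                      # normalize any padding from wc
    [ "$dots" -ge "$ndots" ] && echo "$name."
    for domain in "$@"; do
        echo "$name.$domain."
    done
    [ "$dots" -lt "$ndots" ] && echo "$name."
    return 0
}

# Example with ndots:5:
#   expand_queries api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
# lists three search-domain candidates before the absolute name.
```

Each candidate is typically queried for both A and AAAA records, so one lookup of an external name can turn into several upstream queries.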

Large Requests Hang

Small requests that work while large requests hang usually point to an MTU, fragmentation, proxy body-size, TLS record, or backend streaming problem.

tracepath api.example.com
ping -M do -s 1472 api.example.com
curl -v --data-binary @large.bin https://api.example.com/upload
tcpdump -nn -i any 'icmp or host 198.51.100.10'

For overlay, VPN, or cloud paths, account for VXLAN, Geneve, IPsec, WireGuard, GRE, and provider encapsulation overhead. MSS clamping can protect TCP, but it does not fix UDP or the underlying PMTUD failure.
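
The probe size in the ping command above follows from header arithmetic: an IPv4 header is 20 bytes and an ICMP echo header is 8, so a 1500-byte MTU leaves 1472 bytes of payload. The sketch below applies the same arithmetic with an encapsulation overhead subtracted. ping_payload and tcp_mss are hypothetical helpers; the overhead is a parameter because it depends on the tunnel type and options, and the 50-byte figure in the example is the commonly quoted value for VXLAN over IPv4, stated here as an assumption.

```shell
#!/bin/sh
# Hypothetical helpers for IPv4 without IP or TCP options.
# Args: link MTU in bytes, then encapsulation overhead in bytes (default 0).

# Largest ICMP echo payload that fits without fragmentation (for ping -M do -s).
ping_payload() {
    echo $(( $1 - ${2:-0} - 20 - 8 ))    # 20 = IPv4 header, 8 = ICMP header
}

# Largest TCP segment payload (MSS) that fits in one packet.
tcp_mss() {
    echo $(( $1 - ${2:-0} - 20 - 20 ))   # 20 = IPv4 header, 20 = TCP header
}

# Examples:
#   ping_payload 1500        # plain Ethernet path
#   ping_payload 1500 50     # same link carrying an assumed 50-byte overlay
#   tcp_mss 1500 50
```

If a probe at the plain-Ethernet size fails while one at the reduced size succeeds, the path MTU is smaller than the interface MTU suggests.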

Node Can Reach Service but Pod Cannot

This usually means the node network namespace path differs from the Pod network namespace path.

kubectl exec deploy/client -- ip addr
kubectl exec deploy/client -- ip route
kubectl exec deploy/client -- curl -v http://api.default.svc.cluster.local:8080
kubectl debug node/<node> -it --image=busybox

Check these differences:

Difference	What It Can Break
DNS config	Pod search paths, CoreDNS, NodeLocal DNSCache, or NetworkPolicy.
Source address	Firewall, cloud security group, app allowlist, or mTLS identity.
Route table	CNI routes, overlay tunnel, Pod CIDR route, or egress gateway.
NetworkPolicy	Pod egress denied while node traffic is unrestricted.
Service datapath	kube-proxy (iptables, IPVS, or nftables mode) or an eBPF replacement can be in a different state on each node.

Study Cards

Question

What does an HTTP 504 prove?

Answer

A gateway or proxy timed out waiting for an upstream response; it does not by itself prove the backend process was down.

Question

Why test direct Pod IP after Service IP fails?

Answer

It separates workload listener behavior from Service, EndpointSlice, kube-proxy, or CNI datapath behavior.

Question

Why can large requests hang while small requests work?

Answer

MTU, fragmentation, PMTUD, proxy body limits, or streaming behavior may affect only larger payloads.
