Kubernetes Troubleshooting

Kubernetes troubleshooting is a graph walk. Start from the user’s symptom, then move through the API object, controller, Pod, node, network, storage, and external dependency that must all agree before the workload works.

Command Examples

kubectl get nodes
kubectl get pods --all-namespaces -o wide
kubectl get events --sort-by=.lastTimestamp
kubectl get deployment,statefulset,daemonset,job,cronjob --all-namespaces
kubectl get svc,endpointslice,ingress,networkpolicy --all-namespaces
kubectl get pv,pvc,storageclass --all-namespaces

Example output and meaning:

Command	Example output	What it does
`kubectl get nodes`	`node-a Ready ...`	Shows node readiness, roles, age, and kubelet version. Add `-o wide` to see addresses such as `INTERNAL-IP 10.0.1.10`, the OS image, and the container runtime.
`kubectl get pods --all-namespaces -o wide`	Rows with READY, STATUS, RESTARTS, IP, and NODE columns.	Maps symptoms to namespaces, Pod IPs, nodes, and readiness state.
`kubectl get events --sort-by=.lastTimestamp`	Rows with LAST SEEN, TYPE, REASON, OBJECT, and MESSAGE columns.	Lists events in the current namespace, oldest first, so scheduler, kubelet, and controller messages become evidence for the next debugging step. Add `--all-namespaces` for a cluster-wide view.
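On a busy cluster the Pod listing is long, so it can help to filter it down to the Pods that need attention. The following is a minimal sketch, not a standard tool: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column order of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS).

```shell
# Print Pods that are not fully ready, skipping Completed Pods.
# Expects `kubectl get pods --all-namespaces --no-headers` on stdin.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '{
    split($3, ready, "/")            # READY looks like "1/2"
    if ($4 == "Completed") next      # finished Job Pods are expected to show 0/1
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2 ": " $4 " (" $3 " ready)"
  }'
}

# Usage (needs cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

The filter only narrows the list. Each Pod it prints still needs `describe` and logs to find the cause.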

Read the results from the most specific object to the most general:

The object that users touched or traffic targets.
Its controller status and conditions.
The Pod status, events, logs, probes, and mounted config.
The node where the Pod landed.
Service, EndpointSlice, DNS, NetworkPolicy, ingress, and external dependency paths.

Common Paths

Symptom	Command Evidence	Common Causes
Pod stuck `Pending`	`kubectl describe pod`, events, scheduler messages, PVC status.	Node capacity, taints, tolerations, node affinity, unbound PVC, missing RuntimeClass.
`CrashLoopBackOff`	Previous logs, current logs, exit code, command, env, mounts, probes.	Bad config, missing secret, wrong command, dependency unavailable, liveness probe killing startup.
`ImagePullBackOff`	Pod events, image name, tag, registry auth, pull secret.	Typo, missing tag, private registry credentials, registry TLS/DNS/network issue.
Service unreachable	Service selector, EndpointSlice, port names, DNS, NetworkPolicy, Pod readiness.	Selector mismatch, Pods not ready, wrong targetPort, DNS search issue, policy deny.
Rollout stuck	Deployment conditions, ReplicaSet status, Pod events.	New Pods unavailable, probe failure, quota, image pull, PDB or scheduling pressure.
PVC pending	PVC events, StorageClass, CSI controller logs, volume attachments.	Missing default StorageClass, unsupported access mode, capacity, zone mismatch, CSI failure.
Node `NotReady`	Node conditions, kubelet logs, container runtime logs, disk and memory pressure.	Kubelet stopped, CNI broken, runtime down, disk pressure, certificate/bootstrap issue.

Pod Failure Workflow

Use the Pod as the smallest debuggable unit. A controller may create the Pod, but the Pod status tells you what Kubernetes actually tried to run.

kubectl -n <namespace> get pod <pod> -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> --all-containers
kubectl -n <namespace> logs <pod> --all-containers --previous
kubectl -n <namespace> get pod <pod> -o jsonpath='{.status.containerStatuses[*]}'

Interpret the results:

Waiting with image pull reasons means the container never started.
Terminated with a nonzero exit code means the process started and exited.
Running but not Ready usually points at readiness probes, app startup, or dependency checks.
Repeated liveness probe failures can hide the real startup error by restarting the container before it can finish initialization.
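A terminated container's exit code carries a little more information than "nonzero". By common Unix convention, codes above 128 mean the process was killed by signal number (code - 128). The sketch below decodes that convention; `explain_exit_code` is a hypothetical helper, and the hints are starting points rather than diagnoses.

```shell
# Explain a container exit code, for example the value of
# .status.containerStatuses[0].lastState.terminated.exitCode.
# The function body runs in a subshell so its variables stay local.
explain_exit_code() (
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: process ended cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))              # convention: 128 + signal number
    case $sig in
      9)  hint="SIGKILL, check for OOMKilled or a forced kill" ;;
      15) hint="SIGTERM, check liveness probes and shutdown handling" ;;
      *)  hint="see kill -l $sig" ;;
    esac
    echo "exit $code: signal $sig ($hint)"
  else
    echo "exit $code: application error, read the previous logs"
  fi
)

# Usage (needs cluster access):
#   explain_exit_code "$(kubectl -n <namespace> get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```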

Service and DNS Workflow

Services do not send traffic to arbitrary Pods. They select ready endpoints. Debug the object chain in order:

kubectl -n <namespace> get svc <service> -o yaml
kubectl -n <namespace> get endpointslice -l kubernetes.io/service-name=<service> -o wide
kubectl -n <namespace> get pods -l '<selector>' -o wide --show-labels
kubectl -n <namespace> exec -it <debug-pod> -- nslookup <service>.<namespace>.svc.cluster.local
kubectl -n <namespace> exec -it <debug-pod> -- nc -vz <service> <port>

If DNS resolves but connections fail, move to Service ports, EndpointSlices, kube-proxy or CNI dataplane, NetworkPolicy, and application listen ports. If DNS does not resolve, inspect CoreDNS, the Pod’s /etc/resolv.conf, namespace, search path, and service name.
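A selector mismatch is easy to miss by eye when labels are long. As a rough sketch, the check below confirms that every pair in a Service selector appears in a Pod's labels. `selector_matches` is a hypothetical helper, and both arguments are assumed to be comma-separated `key=value` lists, the format that `--show-labels` prints.

```shell
# Return 0 when every key=value pair in the Service selector ($1)
# also appears in the Pod labels ($2). Both are comma-separated.
# The function body runs in a subshell, so IFS changes stay local.
selector_matches() (
  selector=$1
  labels=",$2,"                      # pad so every label is wrapped in commas
  IFS=,
  for pair in $selector; do
    case $labels in
      *",$pair,"*) ;;                # this pair is present
      *) echo "missing label: $pair"; exit 1 ;;
    esac
  done
)

# Usage:
#   selector_matches "app=web,tier=frontend" "app=web,pod-template-hash=abc12,tier=frontend"
```

Take the selector from `.spec.selector` in the Service YAML and the labels from the LABELS column of the `--show-labels` output. A Pod that matches but is not Ready still receives no traffic.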

Rollout and Controller Workflow

Controllers reconcile desired state into Pods. When a rollout is stuck, compare the desired, updated, and available replica counts, and check that the observed generation has caught up with the current generation of the object.

kubectl -n <namespace> rollout status deployment/<deployment>
kubectl -n <namespace> describe deployment <deployment>
kubectl -n <namespace> get rs -l app=<app> -o wide
kubectl -n <namespace> get events --sort-by=.lastTimestamp
kubectl -n <namespace> rollout history deployment/<deployment>

A Deployment that cannot progress usually has a Pod-level reason. Do not tune rollout settings until you know why the new ReplicaSet cannot produce ready Pods.
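The comparison of generation and replica counts can be sketched as a small decision function. This is a simplified illustration, not how `kubectl rollout status` is implemented: `rollout_state` is a hypothetical helper, and its inputs are assumed to come from `.metadata.generation`, `.status.observedGeneration`, `.spec.replicas`, `.status.updatedReplicas`, and `.status.availableReplicas`.

```shell
# Summarize Deployment progress from five numbers, in this order:
#   generation observedGeneration desired updated available
# Missing updated/available values default to 0.
rollout_state() (
  gen=$1; observed=$2; desired=$3; updated=${4:-0}; available=${5:-0}
  if [ "$observed" -lt "$gen" ]; then
    echo "controller has not observed the latest spec yet"
  elif [ "$updated" -lt "$desired" ]; then
    echo "new ReplicaSet is still creating Pods ($updated/$desired updated)"
  elif [ "$available" -lt "$desired" ]; then
    echo "new Pods exist but are not available ($available/$desired), check Pod events and probes"
  else
    echo "rollout complete"
  fi
)

# Usage:
#   rollout_state 3 3 4 4 3
```

The order of the checks mirrors the advice above: only the last two branches point at Pod-level causes, and those are the common ones.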

Node and Cluster Workflow

When many unrelated Pods fail on one node, inspect node health before debugging each workload.

kubectl describe node <node>
kubectl get node <node> -o jsonpath='{.status.conditions}'
kubectl top node <node>
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> -o wide

Node conditions such as MemoryPressure, DiskPressure, PIDPressure, and Ready=False explain scheduling and eviction behavior. On the node, kubelet, container runtime, CNI plugin, disk, and certificate state are the usual next checks.

Node pressure decision tree:

flowchart TD
  Symptom[Pods evicted or node NotReady] --> Conditions[kubectl describe node conditions]
  Conditions --> Memory[MemoryPressure]
  Conditions --> Disk[DiskPressure]
  Conditions --> PID[PIDPressure]
  Conditions --> Ready[Ready=False]
  Memory --> MemChecks[PSI, OOM events, top RSS, cgroup memory]
  Disk --> DiskChecks[imagefs/nodefs usage, logs, emptyDir, inode pressure]
  PID --> PIDChecks[pid limits, fork storms, process counts]
  Ready --> NodeChecks[kubelet, runtime, CNI, certificates, network]

Treat node pressure as a scheduling and eviction problem first. Deleting Pods without fixing the pressure source usually recreates the same failure on the same or another node.
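To scan node conditions quickly, one option is to print them as `Type=Status` pairs and flag the abnormal ones. The sketch below assumes that format on stdin. `abnormal_conditions` is a hypothetical helper, and it treats any condition other than Ready as a problem when its status is True, which holds for the pressure conditions named above.

```shell
# Flag abnormal node conditions.
# Expects "Type=Status" lines on stdin, for example "MemoryPressure=False".
abnormal_conditions() {
  awk -F= '
    # Ready is healthy only when True; False and Unknown both need attention.
    $1 == "Ready" && $2 != "True" { print "Ready is " $2 ": check kubelet, runtime, CNI"; next }
    # Pressure-style conditions are healthy when False.
    $1 != "Ready" && $2 == "True" { print $1 " is True" }
  '
}

# Usage (needs cluster access):
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | abnormal_conditions
```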

Image Pull Failure Workflow

Image pulls cross registry naming, credentials, DNS, TLS, network policy, runtime, and node disk state.

kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> get secret <pull-secret> -o yaml
kubectl get node <node> -o wide
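kubectl debug node/<node> -it --image=nicolaka/netshoot
# Run the next two commands inside the node debug shell.
nslookup <registry-host>
# -k skips certificate verification; drop it to reproduce an x509 error.
curl -vk https://<registry-host>/v2/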

Event or Symptom	Likely Cause
`manifest unknown`	Wrong image name or tag.
`unauthorized`	Missing or wrong imagePullSecret, registry scope, or token expiry.
TLS x509 error	Registry certificate chain, MITM proxy, or node trust store.
DNS timeout	Node resolver, firewall, proxy, or private registry DNS.
Pull starts then stalls	Registry throttling, NAT/proxy, node disk pressure, or MTU.
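The table above can be turned into a first-pass classifier for Pod event messages. This is only a sketch: `classify_pull_error` is a hypothetical helper, the match strings are substrings taken from the table plus `no such host` (an assumed resolver error string), and real registry or runtime messages vary.

```shell
# Map an image pull event message to a likely cause from the table.
# Matching is case-insensitive and substring-based.
classify_pull_error() (
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case $msg in
    *"manifest unknown"*)        echo "wrong image name or tag" ;;
    *unauthorized*)              echo "pull secret, registry scope, or token expiry" ;;
    *x509*)                      echo "registry certificate chain or node trust store" ;;
    *"no such host"*|*timeout*)  echo "DNS or network path to the registry" ;;
    *)                           echo "unclassified, read the full event" ;;
  esac
)

# Usage:
#   classify_pull_error "Failed to pull image: manifest unknown"
```

Treat the result as a hint for which row of the table to investigate first, not as a verdict.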

Evidence Capture

For incidents, capture state before deleting Pods or restarting components:

# Replace the placeholders before running; unquoted angle brackets are a shell syntax error.
ns='<namespace>'
app='<app-label>'
mkdir -p k8s-capture
kubectl -n "$ns" get deploy,rs,pod,svc,endpointslice,networkpolicy,pvc -l app="$app" -o yaml > k8s-capture/objects.yaml
kubectl -n "$ns" get events --sort-by=.lastTimestamp > k8s-capture/events.txt
kubectl -n "$ns" describe pod -l app="$app" > k8s-capture/pods.describe.txt
kubectl -n "$ns" logs -l app="$app" --all-containers --prefix --tail=300 > k8s-capture/logs.txt

Study Cards

Question

Why start Kubernetes troubleshooting with describe and events?

Answer

They show scheduler, kubelet, image pull, probe, volume, and controller messages that are not visible from a simple get command.

Question

What does CrashLoopBackOff mean?

Answer

The container process keeps starting and exiting or being killed, so Kubernetes backs off before trying again.

Question

Why can a Service exist but send no traffic?

Answer

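Its EndpointSlices may hold no ready endpoints because the selector does not match any Pod labels or the matching Pods are not ready. A wrong port or targetPort can also leave the Service with endpoints that refuse connections.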

Question

What should you capture before deleting a broken Pod?

Answer

Object YAML, events, describe output, current and previous logs, and node placement.
