Istio Observability and Troubleshooting

Istio troubleshooting is about comparing intended configuration with proxy behavior and observed traffic. Kubernetes objects, Istio resources, xDS config, Envoy access logs, metrics, and application logs each answer different questions.

Signals

flowchart LR
  K8s[Kubernetes Services and Endpoints] --> Istiod[istiod]
  Istio[Istio resources] --> Istiod
  Istiod --> XDS[xDS snapshots]
  XDS --> Envoy[Envoy sidecar / gateway / waypoint]
  Envoy --> Logs[Access logs]
  Envoy --> Metrics[Metrics and response flags]
  Request[Request] --> Envoy
  Envoy --> Upstream[Upstream workload]

| Signal | What It Answers |
| --- | --- |
| istioctl analyze | Is mesh config internally inconsistent or invalid? |
| Proxy sync | Did proxies receive current xDS from istiod? |
| Proxy config | What listeners, routes, clusters, endpoints, and secrets are active? |
| Access logs | What happened to individual requests at the proxy? |
| Metrics | Where are latency, volume, errors, and saturation changing? |
| Traces | Which service hop consumed time or returned an error? |

Intent vs Runtime State

| Question | Istio/Kubernetes Intent | Envoy Runtime State |
| --- | --- | --- |
| Where to inspect | YAML resources, status conditions, istioctl analyze, Kubernetes Services and EndpointSlices. | istioctl proxy-config, access logs, response flags, Envoy metrics, proxy sync. |
| What it proves | What operators asked the mesh to do. | What the proxy actually received and enforced. |
| Common drift | Wrong host, namespace, selector, export scope, route attachment, or policy target. | Stale xDS, empty clusters, missing secrets, wrong TLS mode, or endpoint health mismatch. |
| Best for 403 | AuthorizationPolicy, RequestAuthentication, principals, claims, namespace scope. | Access log flags, matched route, peer principal, JWT validation and enforced policy. |
| Best for 503 | DestinationRule, subsets, PeerAuthentication, Service and EndpointSlice readiness. | Cluster TLS settings, endpoints, outlier ejection, upstream health, connection errors. |

Core Commands

istioctl version
istioctl proxy-status
istioctl analyze --all-namespaces
istioctl proxy-config listeners <pod> -n <namespace>
istioctl proxy-config routes <pod> -n <namespace>
istioctl proxy-config clusters <pod> -n <namespace>
istioctl proxy-config endpoints <pod> -n <namespace>
istioctl proxy-config secret <pod> -n <namespace>
kubectl logs -n istio-system deploy/istiod
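
In a large mesh the proxy-status listing can be long, so it may help to filter it down to proxies that have not acknowledged their latest config. The sketch below is one way to do that; the function name unsynced_proxies is hypothetical, and it assumes your istioctl version prints the literal status value STALE for xDS types that istiod sent but the proxy has not acknowledged. Rows showing NOT SENT are deliberately ignored, since that usually just means istiod had nothing to send.

```shell
# Keep the header row plus any proxy-status row that reports STALE.
# Assumption: out-of-date xDS types appear with the literal text STALE.
unsynced_proxies() {
  awk 'NR == 1 || /STALE/'
}

# Usage against a live cluster (not executed here):
#   istioctl proxy-status | unsynced_proxies
```

If the output format of your istioctl release differs, adjust the match rather than trusting an empty result.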

Ambient mode adds ztunnel and waypoint checks:

istioctl ztunnel-config workloads -n <namespace>
istioctl ztunnel-config services -n <namespace>
istioctl waypoint list -A
kubectl logs -n istio-system -l app=ztunnel

Response Pattern Triage

| Symptom | Likely Area |
| --- | --- |
| 404 at gateway | Host/path mismatch, route order, wrong Gateway attachment. |
| 503 no healthy upstream | Endpoint, subset, DestinationRule, mTLS, readiness, or outlier ejection issue. |
| 403 from proxy | AuthorizationPolicy, JWT requirement, principal mismatch, or deny policy. |
| TLS handshake failure | Secret, SNI, certificate chain, PeerAuthentication, or TLS mode mismatch. |
| Proxy not synced | istiod reachability, version skew, invalid config, or overloaded control plane. |
| Works from pod but not gateway | Gateway listener, route host, TLS, AuthorizationPolicy, or external load balancer. |
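
Envoy's response flags in the access log often point at one of the areas above faster than the status code alone. The following is a minimal sketch that maps a few common flags to a hint; the function name flag_hint and the hint wording are illustrative, and the list is intentionally incomplete.

```shell
# Map an Envoy access-log response flag to a likely triage area.
# Only a handful of common flags are covered; others fall through.
flag_hint() {
  case "$1" in
    NR) echo "no route matched: check hosts, paths, and Gateway attachment" ;;
    UH) echo "no healthy upstream: check endpoints, subsets, readiness, outlier ejection" ;;
    UF) echo "upstream connection failure: check TLS mode, ports, and reachability" ;;
    UO) echo "upstream overflow: check circuit breaker limits in the DestinationRule" ;;
    UT) echo "upstream request timeout: check route timeouts and upstream latency" ;;
    NC) echo "no cluster found: check that the route destination exists in proxy config" ;;
    -)  echo "no proxy-side flag: the response likely came from the upstream application" ;;
    *)  echo "unmapped flag: consult the Envoy access log documentation" ;;
  esac
}

# Example: flag_hint UH
```

A 503 with UH and a 503 with no flag lead to different places: the first is the proxy finding no usable backend, the second is usually the application itself answering.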

xDS Inspection

xDS output is often verbose, but each resource type answers a predictable question:

  • listeners show where Envoy accepts traffic,
  • routes show HTTP match and destination decisions,
  • clusters show upstream service definitions and TLS settings,
  • endpoints show concrete backend IPs,
  • secrets show certificates and validation context.

If a VirtualService was applied but routes do not show it, check hosts, export visibility, namespace, gateway binding, and analyzer output. If routes are right but endpoints are empty, check Kubernetes Service selectors, EndpointSlices, subsets, and workload readiness.
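
Narrowing proxy-config output to one port and one cluster makes the route-then-endpoint check above quicker. The sketch below assumes a sidecar proxy, where outbound HTTP route names match the service port and outbound cluster names follow the outbound|<port>|<subset>|<fqdn> pattern. The function names outbound_cluster and inspect_path are hypothetical helpers, and the pod, namespace, and service values are placeholders.

```shell
# Build the cluster name Istio generates for an outbound service.
# The subset part is empty when no DestinationRule subset is used.
outbound_cluster() {
  port=$1 subset=$2 fqdn=$3
  printf 'outbound|%s|%s|%s\n' "$port" "$subset" "$fqdn"
}

# Show the routes on one port, then the endpoints behind one cluster.
# Defined only; call it with real values when connected to a cluster.
inspect_path() {
  pod=$1 ns=$2 port=$3 fqdn=$4
  istioctl proxy-config routes "$pod" -n "$ns" --name "$port"
  istioctl proxy-config endpoints "$pod" -n "$ns" \
    --cluster "$(outbound_cluster "$port" "" "$fqdn")"
}

# Example: inspect_path <pod> <namespace> 9080 reviews.default.svc.cluster.local
```

An empty endpoints result for a cluster that the route does reference points back at Service selectors, EndpointSlices, or subset labels rather than at the VirtualService.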

Access Logs and Metrics

Access logs are useful when they include response code, response flags, upstream host, authority, route name, duration, and source identity. Metrics show whether the issue is isolated or systemic.

Common metric questions:

  • Which service saw error-rate change first?
  • Did request volume spike before latency?
  • Are the errors 4xx policy denials or 5xx upstream failures?
  • Is the problem isolated to one source, destination, route, or revision?
  • Did a rollout or config change precede the shift?
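
When metrics are not at hand, a rough tally from access logs can answer the 4xx-versus-5xx question for a single proxy. This sketch assumes access logging is enabled and uses Istio's default text format, in which the quoted request line is followed by the response code and then the response flags. The function name tally_errors is hypothetical, and a JSON log encoding would need a different parser.

```shell
# Tally error responses by status class and response flag, reading
# access log lines on stdin. Splitting on double quotes puts the
# response code and flags at the start of the third field.
tally_errors() {
  awk -F'"' '
    NF >= 3 {
      split($3, f, " ")   # f[1] = response code, f[2] = response flags
      if (f[1] + 0 >= 400) {
        count[substr(f[1], 1, 1) "xx " f[2]]++
      }
    }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}

# Usage against a live cluster (not executed here):
#   kubectl logs <pod> -n <namespace> -c istio-proxy | tally_errors
```

Each output line is a count followed by the status class and flag, so a dominant "5xx UH" row suggests missing backends while "4xx -" rows are more likely policy or application responses.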

Practical Debug Flow

  1. Reproduce with a single request and capture host, path, headers, source pod, destination, and response.
  2. Check istioctl analyze --all-namespaces.
  3. Confirm proxy sync for the source, destination, gateway, waypoint, or ztunnel.
  4. Inspect the proxy config at the enforcement point.
  5. Check access logs for response flags and upstream host.
  6. Compare with Kubernetes endpoints and application logs.
  7. Roll back or narrow the last mesh config change if evidence points to it.
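
Steps 2 through 6 can be strung together for one enforcement point. The following is a sketch rather than a complete runbook: the helper names run and debug_flow are hypothetical, the pod and namespace arguments are placeholders, and it assumes the proxy container is named istio-proxy. Setting DRY_RUN=1 prints the commands instead of running them, which is useful for reviewing the sequence before touching a cluster.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Walk the analyze, sync, proxy config, access log, and endpoint checks
# for the proxy identified by pod and namespace.
debug_flow() {
  pod=$1 ns=$2
  run istioctl analyze --all-namespaces
  run istioctl proxy-status
  run istioctl proxy-config routes "$pod" -n "$ns"
  run istioctl proxy-config endpoints "$pod" -n "$ns"
  run kubectl logs "$pod" -n "$ns" -c istio-proxy --tail=50
  run kubectl get endpointslices -n "$ns"
}

# Example: DRY_RUN=1; debug_flow <pod> <namespace>
```

Reproducing the request (step 1) and deciding on a rollback (step 7) stay manual, since both depend on evidence gathered from the output.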

Study Cards

Question

What does istioctl proxy-status show?

Answer

Whether data-plane proxies are connected to istiod and synchronized with current xDS configuration.

Question

What does an empty endpoint list usually suggest?

Answer

Kubernetes endpoints, subset labels, readiness, or service selection are not producing usable backends.

Question

Why are access logs useful in Istio?

Answer

They show per-request response codes, flags, upstreams, durations, and route behavior at the proxy.

Question

Why compare YAML with proxy config?

Answer

Applied resources are intent; proxy config shows what Envoy actually received and enforces.

Question

What should you inspect for a proxy-generated 403?

Answer

AuthorizationPolicy, RequestAuthentication, source principal, JWT claims, namespace scope, and deny policies.

References