Kubernetes DNS and CoreDNS

Kubernetes DNS is service discovery for Pods. CoreDNS watches the Kubernetes API and answers names for Services and, in selected cases, Pods. Most application-to-application traffic starts with this layer, so DNS failures often look like service, network, or application outages.

Command Examples

kubectl -n kube-system get deploy,svc,endpointslice -l k8s-app=kube-dns
kubectl -n kube-system logs deployment/coredns
kubectl exec -it <pod> -- cat /etc/resolv.conf
kubectl exec -it <pod> -- nslookup kubernetes.default.svc.cluster.local
kubectl -n kube-system get configmap coredns -o yaml

Example output and meaning:

Command	Example output	What it does
`kubectl -n kube-system get deploy,svc,endpointslice -l k8s-app=kube-dns`	`deployment.apps/coredns 2/2` and EndpointSlices with Pod IPs on port `53`.	Proves the DNS Service has ready CoreDNS backends.
`kubectl exec -it <pod> -- cat /etc/resolv.conf`	`nameserver 10.96.0.10`, `search default.svc.cluster.local ...`, `options ndots:5`.	Shows the resolver config the application actually uses.
`kubectl exec -it <pod> -- nslookup kubernetes.default.svc.cluster.local`	`Name: kubernetes.default.svc.cluster.local` and `Address: 10.96.0.1`.	Confirms cluster DNS resolution from the Pod network path.

Names Kubernetes Publishes

The most important Service name forms are:

Name	Scope
`service`	Same namespace as the querying Pod through search paths.
`service.namespace`	Service in another namespace.
`service.namespace.svc`	Service under the cluster service zone.
`service.namespace.svc.cluster.local`	Fully qualified Service name in the default cluster domain.

Headless Services use clusterIP: None and publish endpoint records directly. StatefulSets often combine headless Services, stable Pod hostnames, and ordered identities.
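The name forms above follow a fixed pattern, which can be sketched in a few lines. This is an illustrative helper, not part of any Kubernetes client library: the function names are hypothetical, and the `nats-0`, `nats`, and `messaging` values are made-up inputs. The per-Pod form assumes a StatefulSet whose serviceName points at a headless Service.

```python
from typing import List


def service_names(service: str, namespace: str,
                  cluster_domain: str = "cluster.local") -> List[str]:
    """Return the Service name forms, shortest first."""
    return [
        service,                                        # same namespace only
        f"{service}.{namespace}",                       # any namespace
        f"{service}.{namespace}.svc",                   # under the service zone
        f"{service}.{namespace}.svc.{cluster_domain}",  # fully qualified
    ]


def statefulset_pod_fqdn(pod: str, headless_service: str, namespace: str,
                         cluster_domain: str = "cluster.local") -> str:
    """Stable per-Pod name published behind a headless Service."""
    return f"{pod}.{headless_service}.{namespace}.svc.{cluster_domain}"


names = service_names("kubernetes", "default")
pod_name = statefulset_pod_fqdn("nats-0", "nats", "messaging")  # hypothetical
print(names[-1])
print(pod_name)
```

Only the last form is independent of the querying Pod's search list; the shorter forms resolve because the resolver appends search suffixes.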

Pod Resolver Configuration

Kubelet writes Pod DNS configuration based on dnsPolicy, dnsConfig, cluster DNS settings, and node resolver configuration. With the common ClusterFirst policy, the Pod sends every query to the cluster DNS Service; CoreDNS answers names in the cluster domain itself and forwards everything else to its upstream resolvers.

Key fields to inspect:

nameserver for the cluster DNS Service IP,
search list for namespace and cluster suffixes,
options ndots:5 or other resolver options,
custom dnsPolicy and dnsConfig on the Pod.

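With ndots:5, any name with fewer than five dots is tried against every search suffix first. One external lookup such as api.example.com therefore becomes several cluster-local queries that fail before the absolute query is sent.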
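The expansion can be made concrete with a small simulation. This is a simplified sketch of glibc-style resolver ordering, not a real resolver: other libc implementations (musl, for example) differ in detail, and the search list below is the typical one for a Pod in the default namespace, assumed here for illustration.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """List the fully qualified names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # trailing dot: already absolute, search list is skipped
    absolute = name + "."
    expanded = [f"{name}.{suffix}." for suffix in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search suffixes first


# Assumed search list for a Pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

queries = candidate_queries("api.example.com", SEARCH)
absolute_only = candidate_queries("api.example.com.", SEARCH)
for q in queries:
    print(q)
```

Each candidate is usually queried for both A and AAAA records, so the real query count is often double the list length. Writing the trailing dot in application configuration, or lowering ndots through dnsConfig, removes the extra attempts.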

CoreDNS Corefile

CoreDNS behavior is configured through the coredns ConfigMap. The kubernetes plugin answers Service and endpoint names. The forward plugin sends non-cluster queries upstream. Other plugins commonly add caching, loop detection, health endpoints, readiness, metrics, reload behavior, and error logging.

Operational risks:

forwarding loops from node stub resolvers,
missing RBAC for Services, EndpointSlices, namespaces, or Pods,
CoreDNS Pods not ready or overloaded,
NetworkPolicy blocking UDP or TCP 53,
upstream resolver timeouts,
cache hiding recently changed answers.

CoreDNS plugin chain example:

flowchart LR
  Query[Pod DNS query] --> Errors[errors]
  Errors --> Health[health / ready]
  Health --> Kubernetes[kubernetes plugin for cluster.local]
  Kubernetes --> Cache[cache]
  Cache --> Forward[forward external names upstream]
  Forward --> Loop[loop detection]
  Loop --> Metrics[prometheus metrics]

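Plugin order matters, but it is fixed when CoreDNS is compiled (through its plugin.cfg), not by the order of lines in the Corefile; the diagram above follows the order in which plugins are usually listed. In the compiled chain the kubernetes plugin runs before forward, so cluster names are answered before external forwarding. The cache plugin reduces repeated lookups and upstream load, but it can also make recent changes appear delayed. The loop plugin detects forwarding loops, such as forwarding to a node stub resolver that points back at CoreDNS, and halts CoreDNS when it finds one.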
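A forwarding-loop risk can often be spotted before the loop plugin fires by reading the Corefile next to the node's resolver file. The sketch below is a rough static check, not a CoreDNS tool. It assumes the CoreDNS Pod inherits the node's resolv.conf, as it does with dnsPolicy: Default. The Corefile and resolv.conf contents are illustrative samples, and all function names are hypothetical.

```python
import ipaddress
from typing import List

# Sample Corefile in the style commonly shipped by cluster installers.
COREFILE = """\
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
"""

# Sample node resolver file pointing at a local stub resolver.
NODE_RESOLV_CONF = "nameserver 127.0.0.53\noptions edns0 trust-ad\n"


def forward_targets(corefile: str) -> List[str]:
    """Return the upstream arguments of every forward line (zone excluded)."""
    targets: List[str] = []
    for line in corefile.splitlines():
        words = line.split()
        if words and words[0] == "forward":
            targets.extend(w for w in words[2:] if w != "{")
    return targets


def nameservers(resolv_conf: str) -> List[str]:
    """Return the nameserver addresses from resolv.conf text."""
    found: List[str] = []
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            found.append(parts[1])
    return found


def loop_risks(corefile: str, node_resolv_conf: str) -> List[str]:
    """Flag upstreams that are loopback addresses from inside the CoreDNS Pod."""
    risky: List[str] = []
    for target in forward_targets(corefile):
        # A file path target means "use the nameservers listed in that file".
        upstreams = nameservers(node_resolv_conf) if target.startswith("/") else [target]
        for upstream in upstreams:
            try:
                if ipaddress.ip_address(upstream).is_loopback:
                    risky.append(upstream)
            except ValueError:
                pass  # host:port and scheme forms are not handled in this sketch
    return risky


risks = loop_risks(COREFILE, NODE_RESOLV_CONF)
safe = loop_risks(COREFILE.replace("/etc/resolv.conf", "10.0.0.2"), NODE_RESOLV_CONF)
print(risks)
```

A loopback upstream is risky because, inside the CoreDNS Pod's network namespace, a loopback address reaches CoreDNS itself rather than the node's stub resolver.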

CoreDNS Performance and Capacity

CoreDNS is a shared dependency. DNS query spikes can come from ndots search expansion, chatty clients, short TTLs, JVM or Go resolver behavior, broken retry loops, or slow external upstreams.

Signal	Meaning	Action
High CoreDNS CPU	Query volume, expensive plugins, logging, or upstream retries.	Check `coredns_dns_requests_total` (`coredns_dns_request_count_total` before CoreDNS 1.7.0), top query names, and `ndots`.
Rising SERVFAIL	API watch/RBAC failure, upstream failure, DNSSEC issue, or loop.	Split cluster names from external names and inspect logs.
Long forward latency	Upstream resolver or network path is slow.	Compare multiple upstreams and check NAT/firewall paths.
Uneven node symptoms	NodeLocal DNSCache, CNI, or node-local firewall issue.	Test from Pods on multiple nodes.

Scale CoreDNS replicas with topology in mind, keep requests/limits realistic, and avoid verbose query logging during normal operation. If NodeLocal DNSCache is enabled, plan capacity at both the node-local caches and the central CoreDNS layer.
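The "Rising SERVFAIL" signal can be quantified from the CoreDNS metrics endpoint. The following is a minimal sketch that parses Prometheus text output for `coredns_dns_responses_total` and computes the SERVFAIL share. The sample values are invented, and the exact label set varies by CoreDNS version, so only the `rcode` label is read.

```python
import re
from collections import Counter

# Invented sample in Prometheus text exposition format.
SAMPLE_METRICS = """\
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 9400
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 500
coredns_dns_responses_total{rcode="SERVFAIL",server="dns://:53",zone="."} 100
"""

SAMPLE_LINE = re.compile(r'^coredns_dns_responses_total\{([^}]*)\}\s+(\S+)$')
RCODE_LABEL = re.compile(r'rcode="([^"]+)"')


def rcode_totals(metrics_text: str) -> Counter:
    """Sum response counters per rcode across all servers and zones."""
    totals: Counter = Counter()
    for line in metrics_text.splitlines():
        sample = SAMPLE_LINE.match(line.strip())
        if not sample:
            continue  # comments and other metrics are ignored
        rcode = RCODE_LABEL.search(sample.group(1))
        if rcode:
            totals[rcode.group(1)] += float(sample.group(2))
    return totals


def servfail_ratio(totals: Counter) -> float:
    """Fraction of all responses that were SERVFAIL."""
    total = sum(totals.values())
    return totals["SERVFAIL"] / total if total else 0.0


totals = rcode_totals(SAMPLE_METRICS)
ratio = servfail_ratio(totals)
print(f"SERVFAIL share: {ratio:.2%}")
```

These counters are cumulative since process start, so a trend needs the difference between two scrapes rather than a single reading.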

DNS Resolution Diagram

flowchart LR
  Pod[Pod getaddrinfo] --> Resolv[/Pod resolv.conf/]
  Resolv --> DNSIP[cluster DNS Service IP]
  DNSIP --> CoreDNS[CoreDNS Pod]
  CoreDNS --> KubeAPI[Kubernetes API watch]
  CoreDNS --> Upstream[forward upstream resolver]
  KubeAPI --> Svc[Service and EndpointSlice answers]
  Upstream --> External[external A/AAAA/CNAME answer]

CoreDNS Failure Labs

Lab	Test	Expected Evidence
Forwarding loop	Point CoreDNS forward target at a node stub resolver that points back to cluster DNS in a lab.	CoreDNS loop plugin logs, SERVFAIL, rising error count.
Stub resolver issue	Compare node `/etc/resolv.conf`, CoreDNS forward target, and Pod resolver.	Node-local `127.0.0.53` should not be blindly used as a cluster-wide upstream.
NodeLocal DNSCache	Query from a Pod on two nodes and capture DNS at node-local and CoreDNS points.	Packets may terminate at the node-local cache first instead of going straight to CoreDNS.
`ndots` expansion	Resolve `api.example.com` with and without trailing dot.	Multiple cluster-suffix queries before the absolute external name.
SERVFAIL	Query cluster name and external name separately.	Cluster-only SERVFAIL points at Kubernetes plugin/API/RBAC; external-only SERVFAIL points at forwarder/upstream.
TCP fallback	Query a large response with `dig +bufsize=4096` and `dig +tcp`.	UDP truncation or loss should be separated from TCP fallback behavior.
EndpointSlice RBAC	Remove EndpointSlice watch permission in a lab cluster only.	Service answers become stale, empty, or SERVFAIL depending on plugin behavior and cache.

Practical commands:

kubectl -n kube-system get configmap coredns -o yaml
kubectl -n kube-system logs deployment/coredns --since=10m
kubectl auth can-i list endpointslices.discovery.k8s.io --as=system:serviceaccount:kube-system:coredns -n default
kubectl exec <pod> -- sh -c 'nslookup api.example.com; nslookup api.example.com.'
kubectl exec <pod> -- sh -c 'dig +tcp kubernetes.default.svc.cluster.local'
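The SERVFAIL lab's split between cluster and external names amounts to a small decision table. The sketch below encodes it. The function name is hypothetical, and the "both fail" and "no SERVFAIL" branches are reasonable guesses added for completeness rather than rules stated in the table above.

```python
def triage(cluster_rcode: str, external_rcode: str) -> str:
    """Map the rcodes of one cluster query and one external query to a next step."""
    cluster_bad = cluster_rcode == "SERVFAIL"
    external_bad = external_rcode == "SERVFAIL"
    if cluster_bad and external_bad:
        # Assumption: a shared cause is more likely than two separate faults.
        return "both: check CoreDNS health, loops, and the Pod-to-DNS path"
    if cluster_bad:
        return "cluster-only: check the kubernetes plugin, API watches, and RBAC"
    if external_bad:
        return "external-only: check the forward plugin and upstream resolvers"
    return "no SERVFAIL: look at timeouts, search expansion, or the application"


hint = triage("NOERROR", "SERVFAIL")
print(hint)
```

The inputs would come from the status field of two dig runs, one for a name such as kubernetes.default.svc.cluster.local and one for an external name with a trailing dot.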

DNS, NAT, and Egress

DNS decides which network path a Pod will try. For cluster names, CoreDNS usually returns ClusterIP or endpoint records that stay inside the cluster datapath. For external names, CoreDNS forwards upstream, and the resulting address may send Pod traffic through node SNAT, a cloud NAT gateway, an egress gateway, a proxy, or a private service endpoint.

NAT-related DNS issues:

Pods resolve a public address instead of a private endpoint and unnecessarily cross the NAT gateway.
CoreDNS upstream queries are SNATed through node or cloud NAT, so upstream resolvers see a shared source IP.
Upstream DNS rate limits can affect many Pods at once when queries share one NAT source.
ndots and search paths multiply external lookups, increasing CoreDNS and NAT gateway load.
NodeLocal DNSCache changes cache locality, source addresses, and the point where DNS packet captures should be taken.
DNS and app traffic may use different egress controls, causing names to resolve but connections to fail.

When a Pod can resolve a name but cannot connect, examine the DNS answer and the NAT path together: nslookup shows the selected address, while ip route get, flow logs, NAT gateway metrics, and packet captures show how traffic to that address leaves the cluster.
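A first guess at the egress path can be made from the answer alone by checking whether the address is in private space. This is only a heuristic sketch: the real path depends on route tables, CNI configuration, and cloud networking. The function name and example addresses are illustrative.

```python
import ipaddress

# Private ranges checked here: RFC 1918 for IPv4, unique local for IPv6.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")
]


def egress_hint(answer: str) -> str:
    """Classify a resolved address as 'private' or 'public'."""
    addr = ipaddress.ip_address(answer)
    # Membership tests across IP versions simply return False.
    if any(addr in net for net in PRIVATE_NETWORKS):
        return "private"
    return "public"


private_hint = egress_hint("10.20.1.15")  # e.g. a private service endpoint
public_hint = egress_hint("8.8.8.8")      # e.g. a public service address
print(private_hint, public_hint)
```

A "public" result for a service that should be reached over a private endpoint suggests checking split-horizon DNS or private zone configuration before looking at the NAT gateway.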

NATS Example

NATS shows why Kubernetes DNS details matter. Application clients usually connect to a stable Service name on the NATS client port, while NATS servers in a cluster often use StatefulSet Pod DNS names behind a headless Service for peer route connections. The names that NATS servers gossip through their advertise settings must be ones that receiving peers or clients can resolve, reach, and validate with TLS.

For the full walkthrough, see NATS, DNS, and Kubernetes Networking.

Debugging Flow

Test from inside an affected Pod, not only from a node.
Inspect the Pod’s /etc/resolv.conf.
Resolve a known cluster name such as kubernetes.default.svc.cluster.local.
Resolve the failing Service FQDN and the short name.
Check CoreDNS logs and readiness.
Check the CoreDNS ConfigMap and upstream forwarding.
Check whether NetworkPolicy allows UDP and TCP 53 to cluster DNS.
Check Service selectors and EndpointSlices if only one Service name fails.
Check whether external DNS answers use public NAT egress or private endpoint paths.

Study Cards

Question

What does CoreDNS watch for Kubernetes service discovery?

Answer

Kubernetes API objects such as Services, namespaces, Pods, and EndpointSlices, depending on configuration and permissions.

Question

Why can ndots:5 increase DNS load?

Answer

External-looking names may be expanded through several cluster search suffixes before the absolute query is tried.

Question

Why test DNS from inside the affected Pod?

Answer

The Pod's resolver config, namespace search path, NetworkPolicy, and node path can differ from a node shell.

Question

How can DNS affect NAT gateway use?

Answer

The DNS answer can choose a public address that leaves through NAT or a private endpoint that stays on private routing.

Question

What can EndpointSlice RBAC break in CoreDNS?

Answer

CoreDNS may be unable to watch Service backends correctly, causing stale, empty, or failing Service DNS answers.

Scenario Lab

Kubernetes DNS Outage

Pods intermittently fail name resolution while node-level DNS still works.

Symptoms

nslookup inside Pods times out for ClusterIP Services.
CoreDNS Pods are running but query latency spikes.
Node lookups through the host resolver succeed.

Evidence

Compare /etc/resolv.conf inside a Pod with the node resolver.
Check CoreDNS logs, metrics, endpoints, and EndpointSlice watch errors.
Run dig +search +showsearch to expose ndots expansion.
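Comparing the two resolver files is easier once they are parsed into the same shape. The sketch below handles only the nameserver, search, and options keywords. Both file contents are illustrative samples (the node file assumes a local stub resolver and a made-up `corp.example` search domain), and the function name is hypothetical.

```python
from typing import Dict, List


def parse_resolv_conf(text: str) -> Dict[str, List[str]]:
    """Parse nameserver, search, and options lines from resolv.conf text."""
    config: Dict[str, List[str]] = {"nameserver": [], "search": [], "options": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank lines and comments
        keyword, *values = line.split()
        if keyword == "nameserver":
            config["nameserver"].extend(values[:1])
        elif keyword == "search":
            config["search"] = values  # a later search line replaces earlier ones
        elif keyword == "options":
            config["options"].extend(values)
    return config


# Illustrative samples, not captured from a real cluster.
POD_FILE = (
    "nameserver 10.96.0.10\n"
    "search default.svc.cluster.local svc.cluster.local cluster.local\n"
    "options ndots:5\n"
)
NODE_FILE = "nameserver 127.0.0.53\nsearch corp.example\noptions edns0 trust-ad\n"

pod = parse_resolv_conf(POD_FILE)
node = parse_resolv_conf(NODE_FILE)
differences = {key: (pod[key], node[key]) for key in pod if pod[key] != node[key]}
for key, (pod_value, node_value) in differences.items():
    print(f"{key}: pod={pod_value} node={node_value}")
```

Differences in all three fields are normal for a ClusterFirst Pod. The check becomes useful when the Pod's file unexpectedly matches the node's, which points at dnsPolicy: Default or hostNetwork settings.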

Command Examples

Command

kubectl -n kube-system logs deploy/coredns --tail=100

Example output

[ERROR] plugin/errors: 2 kubernetes.default.svc.cluster.local. A: read udp 10.244.1.8:39948->10.96.0.10:53: i/o timeout
[INFO] 10.244.2.17:43321 - 44812 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR

What it does: Shows whether CoreDNS is returning normal answers, timing out upstream, failing Kubernetes API watches, or logging plugin-level errors.
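When the log plugin is enabled, query lines like the one above can be tallied per name and response code. This is a rough sketch that assumes the default log format. The first two sample lines are the ones shown above, and the third is a hypothetical line added to illustrate search expansion.

```python
import re
from collections import Counter

# Matches: [INFO] client:port - id "TYPE CLASS name proto size do bufsize" RCODE
QUERY_LOG = re.compile(
    r'^\[INFO\] (?P<client>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) [^"]*" '
    r'(?P<rcode>[A-Z]+)'
)

SAMPLE_LOG = """\
[ERROR] plugin/errors: 2 kubernetes.default.svc.cluster.local. A: read udp 10.244.1.8:39948->10.96.0.10:53: i/o timeout
[INFO] 10.244.2.17:43321 - 44812 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR
[INFO] 10.244.2.17:51234 - 1201 "A IN api.example.com.default.svc.cluster.local. udp 60 false 512" NXDOMAIN
"""


def rcode_by_name(log_text: str) -> Counter:
    """Count (query name, rcode) pairs from CoreDNS query log lines."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        entry = QUERY_LOG.match(line)
        if entry:  # [ERROR] lines and other messages are skipped
            counts[(entry.group("name"), entry.group("rcode"))] += 1
    return counts


counts = rcode_by_name(SAMPLE_LOG)
for (name, rcode), seen in counts.most_common():
    print(f"{seen:3d} {rcode:9s} {name}")
```

Many NXDOMAIN entries for external names with a cluster suffix appended are the log-side signature of ndots expansion. Query logging is expensive, so it is better enabled briefly during an investigation than left on.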

Command

kubectl -n kube-system get endpointslice,endpoints,svc -l k8s-app=kube-dns

Example output

NAME                                            ADDRESSTYPE   PORTS   ENDPOINTS
endpointslice.discovery.k8s.io/kube-dns-abc12   IPv4          53      10.244.1.8,10.244.2.9

NAME               TYPE        CLUSTER-IP   PORT(S)
service/kube-dns   ClusterIP   10.96.0.10   53/UDP,53/TCP

What it does: Confirms the DNS Service has ready CoreDNS endpoints and exposes both UDP and TCP port 53.

Command

kubectl exec deploy/debug -- dig kubernetes.default.svc.cluster.local

Example output

;; status: NOERROR
kubernetes.default.svc.cluster.local. 30 IN A 10.96.0.1

What it does: Tests name resolution from the same Pod network path that application Pods use.

Answer: Establish whether the failure comes from stub resolver search expansion, CoreDNS health, upstream recursion, NetworkPolicy, or API watch state before changing the Corefile.
