Linux Kernel Network Performance

Network performance on Linux is a pipeline problem. A packet moves through a NIC queue, interrupt handling, NAPI polling, softirq work, protocol processing, socket buffers, qdisc, and finally application code or the wire. A bottleneck can sit at any layer, so useful troubleshooting separates link errors, packet drops, CPU placement, queueing, retransmits, and application backpressure.

Command Examples

sar -n DEV,TCP,ETCP 1
ss -s
ip -s link
ethtool -S <interface>
cat /proc/net/softnet_stat
mpstat -P ALL 1

What to look for in the output:

Command	What to look for	What it tells you
`sar -n DEV,TCP,ETCP 1`	Per-interface throughput plus TCP retransmit counters.	Correlates traffic rate with TCP error symptoms.
`cat /proc/net/softnet_stat`	One row of hex counters per CPU; the second (dropped) and third (time_squeeze) columns increase under pressure.	Detects backlog drops and softirq polling that ran out of budget or time.
`mpstat -P ALL 1`	One CPU high in `%soft` while others are low.	Shows receive processing imbalance or flow concentration.

Use these commands to separate interface drops, TCP retransmits, socket pressure, softnet backlog drops, and CPU imbalance. One busy CPU doing most softirq work while other CPUs are idle usually means receive processing is not spread well, or a small number of flows dominate.
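The hex rows in /proc/net/softnet_stat are easier to compare once decoded. The sketch below is a minimal parser, assuming the common layout in which the first three columns are processed, dropped, and time_squeeze; later columns vary by kernel version and are ignored. The function name and the sample rows are illustrative, not captured from a real host.

```python
def parse_softnet_stat(text):
    """Decode the first three hex columns of each /proc/net/softnet_stat row."""
    rows = []
    for index, line in enumerate(text.strip().splitlines()):
        fields = [int(value, 16) for value in line.split()]
        rows.append({
            "row": index,               # row order, usually one row per online CPU
            "processed": fields[0],     # packets handled by this CPU
            "dropped": fields[1],       # dropped because the backlog was full
            "time_squeeze": fields[2],  # polling stopped with work still pending
        })
    return rows


# Illustrative sample (assumption): two CPUs, the second under pressure.
SOFTNET_SAMPLE = """\
0003a2f1 00000000 00000004 00000000 00000000
00120c44 0000001b 000000a7 00000000 00000000
"""

for row in parse_softnet_stat(SOFTNET_SAMPLE):
    print(row)
```

On a live host the same function could be fed `open("/proc/net/softnet_stat").read()`. The counters are cumulative, so two readings a few seconds apart say more than one.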

Receive Path

On receive, the NIC writes packets into RX descriptor rings and interrupts the host. Modern drivers then use NAPI: the interrupt schedules polling, and the kernel drains packets in batches. This avoids one interrupt per packet under load, but it moves the visible cost into softirq CPU time (the `%soft` column in mpstat).

Important receive checkpoints:

Layer	What To Look For
NIC RX ring	Drops, missed packets, overruns, no-buffer counters from `ethtool -S`.
IRQ affinity	Whether NIC queue interrupts land on useful CPUs.
NAPI / softirq	High `%soft` CPU, growing `/proc/net/softnet_stat` drops, or CPU imbalance.
Protocol stack	TCP retransmits, resets, backlog pressure, conntrack pressure.
Socket receive buffer	Application not reading fast enough, receive queue growth in `ss`.

Softirq load is still real CPU load. If network processing consumes cores, the application may look slow even though user CPU is low.
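One way to see how receive softirq work is spread is to turn the NET_RX row of /proc/softirqs into per-CPU shares. This is a rough sketch, assuming the usual layout of a CPU header line followed by one labelled row per softirq type; the function name and the sample counts are made up for illustration.

```python
def net_rx_share(text):
    """Return each CPU's fraction of the NET_RX softirq count."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        name, _, counts = line.partition(":")
        if name.strip() == "NET_RX":
            values = [int(value) for value in counts.split()]
            total = sum(values) or 1  # avoid dividing by zero on an idle host
            return {cpu: value / total for cpu, value in zip(cpus, values)}
    return {}


# Illustrative sample (assumption): CPU0 handles most receive softirqs.
SOFTIRQS_SAMPLE = """\
                    CPU0       CPU1       CPU2       CPU3
          HI:          0          0          0          0
      NET_TX:         12          3          1          0
      NET_RX:        900         50         30         20
"""

for cpu, share in net_rx_share(SOFTIRQS_SAMPLE).items():
    print(f"{cpu}: {share:.0%}")
```

The counts are cumulative since boot, so the difference between two snapshots is a better indicator of current imbalance than a single reading.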

Transmit Path

On transmit, the application writes to a socket, TCP may segment and pace data, qdisc decides when packets are released, and the NIC consumes TX descriptors. Drops can happen because the qdisc is full, the NIC queue is overloaded, shaping is configured, or lower-layer errors exist.

`tc -s qdisc show dev <interface>` is useful when latency or loss appears before packets reach the wire. A default qdisc may be fine for simple hosts, but traffic shaping, fq, fq_codel, and classful qdiscs change how queueing and drops behave.
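To compare qdisc drops and backlog over time, the text statistics can be reduced to a few numbers. The sketch below assumes the usual "Sent ... (dropped ..., overlimits ... requeues ...)" and "backlog ..." lines printed by tc -s; the helper name and the sample text are illustrative, and qdisc-specific extra lines are ignored.

```python
import re

SENT_RE = re.compile(
    r"Sent (\d+) bytes (\d+) pkt \(dropped (\d+), overlimits (\d+) requeues (\d+)\)"
)
BACKLOG_RE = re.compile(r"backlog (\S+) (\d+)p")


def parse_qdisc_stats(text):
    """Return one dict per qdisc with its kind, handle, and basic counters."""
    qdiscs = []
    for line in text.splitlines():
        if line.startswith("qdisc "):
            _, kind, handle = line.split()[:3]
            qdiscs.append({"kind": kind, "handle": handle})
            continue
        if not qdiscs:
            continue  # ignore anything before the first qdisc line
        sent = SENT_RE.search(line)
        if sent:
            names = ("bytes", "packets", "dropped", "overlimits", "requeues")
            qdiscs[-1].update(zip(names, map(int, sent.groups())))
        backlog = BACKLOG_RE.search(line)
        if backlog:
            qdiscs[-1]["backlog_packets"] = int(backlog.group(2))
    return qdiscs


# Illustrative sample (assumption), shaped like `tc -s qdisc show dev eth0`.
QDISC_SAMPLE = """\
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms
 Sent 123456789 bytes 98765 pkt (dropped 12, overlimits 0 requeues 3)
 backlog 4542b 3p requeues 3
"""

print(parse_qdisc_stats(QDISC_SAMPLE))
```

A dropped count that rises between two runs, together with a persistent backlog, points at queueing on the host rather than loss on the path.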

Scaling Across CPUs

Linux has several mechanisms for spreading network work:

Mechanism	Purpose
RSS	NIC hardware hashes flows across receive queues.
RPS	Kernel steers receive packet processing to configured CPUs.
RFS	Kernel steers receive processing toward the CPU running the consuming application thread.
Accelerated RFS	Hardware-assisted steering when the NIC and driver support it.
XPS	Kernel chooses transmit queues based on CPU or receive-queue affinity.

These tools are not automatically better in every environment. On a host with one hardware queue and many CPUs, RPS may help. On a host where RSS already maps queues cleanly to CPUs, extra RPS can add overhead. For latency-sensitive work, NUMA locality and cache locality matter as much as raw parallelism.

Queue and CPU steering map:

flowchart LR
  Wire[Packets on wire] --> NIC[NIC]
  NIC --> RX0[RX queue 0]
  NIC --> RX1[RX queue 1]
  RX0 --> IRQ0[IRQ CPU 0]
  RX1 --> IRQ1[IRQ CPU 1]
  IRQ0 --> NAPI0[NAPI poll / softirq]
  IRQ1 --> NAPI1[NAPI poll / softirq]
  NAPI0 --> RPS[RPS/RFS optional CPU steering]
  NAPI1 --> RPS
  RPS --> TCP[TCP/IP stack and socket queues]
  App[Application threads] --> XPS[XPS selects TX queue]
  XPS --> TX0[TX queue]
  TX0 --> NIC

Interpretation:

Symptom	Likely Steering Issue
One CPU has most network softirq time	RSS queues, IRQ affinity, or one dominant flow concentrates work.
Many RX queues but only one interrupt counter increases	NIC indirection table or driver queue setup is not distributing flows.
RPS enabled on every CPU with worse latency	Cache locality or NUMA locality is being lost.
Transmit queue imbalance	XPS or application CPU placement does not line up with TX queues.
Drops in `/proc/net/softnet_stat`	Per-CPU backlog cannot drain fast enough.

Useful places to inspect:

cat /proc/interrupts
cat /proc/softirqs
cat /sys/class/net/<interface>/queues/rx-0/rps_cpus
cat /sys/class/net/<interface>/queues/rx-0/rps_flow_cnt
cat /sys/class/net/<interface>/queues/tx-0/xps_cpus
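The rps_cpus and xps_cpus files hold hexadecimal CPU bitmaps, with comma-separated 32-bit groups on larger hosts. A small decoder makes them readable; the function name is an illustrative choice.

```python
def cpus_from_mask(mask_text):
    """Convert a hex CPU bitmap such as '0000000f' into a list of CPU numbers."""
    bits = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(bits.bit_length()) if (bits >> cpu) & 1]


print(cpus_from_mask("0000000f"))           # CPUs 0 through 3
print(cpus_from_mask("00000001,00000000"))  # CPU 32 on a host with more than 32 CPUs
print(cpus_from_mask("00000000"))           # empty list: steering is off for this queue
```

An all-zero rps_cpus mask means RPS is disabled for that queue, which is the default.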

Offloads

NIC and kernel offloads reduce per-packet CPU cost by doing larger chunks of work at once.

Offload	What It Does
Checksum offload	NIC computes or verifies packet checksums.
TSO	TCP segmentation offload: NIC splits large TCP buffers into wire-sized packets.
GSO	Generic segmentation offload in the kernel.
GRO	Generic receive offload: kernel coalesces related received packets before upper-layer processing.
LRO	Large receive offload: the NIC or driver merges received TCP segments. Usually disabled on routers and bridges because merged packets can no longer be forwarded unchanged.

Offloads can make packet captures confusing because tcpdump may see packets before segmentation or checksum completion. Disable offloads briefly only for controlled debugging; leaving them off can waste CPU.

ethtool -k <interface>
ethtool -K <interface> gro off tso off gso off
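Before and after a debugging session it helps to record which offloads were on, so they can be restored. The following sketch parses the "name: on/off" lines printed by ethtool -k; the function name and sample text are illustrative, and real output lists many more features.

```python
def parse_offloads(text):
    """Map each feature name to its state and whether the driver marks it fixed."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # skips the "Features for <interface>:" header and blank lines
        tokens = value.split()
        features[name.strip()] = {
            "on": tokens[0] == "on",
            "fixed": "[fixed]" in tokens,  # fixed features cannot be toggled
        }
    return features


# Illustrative sample (assumption), shaped like `ethtool -k eth0`.
OFFLOAD_SAMPLE = """\
Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

enabled = [name for name, state in parse_offloads(OFFLOAD_SAMPLE).items() if state["on"]]
print(enabled)
```

Saving this mapping before running ethtool -K gives a concrete list of settings to turn back on afterwards.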

Drops, Retransmits, and Queue Pressure

Do not treat every drop counter the same. A drop at the NIC ring, softnet backlog, qdisc, conntrack table, or socket buffer points at a different fix.

Common interpretations:

RX errors or missed packets: physical link, driver, NIC ring, or host CPU not draining fast enough.
/proc/net/softnet_stat drops: per-CPU network backlog pressure.
TCP retransmits: loss, reordering, overloaded middleboxes, bad path, or receiver pressure.
Large send or receive queues in ss: application, peer, congestion window, or buffer pressure.
Full conntrack table: new stateful connections can fail even when some established flows continue.
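For the socket queue case, a short filter over plain `ss -tan` output can list the connections with data waiting. This is a sketch under the assumption that the first five columns are State, Recv-Q, Send-Q, local address, and peer address; listening sockets are skipped because ss reports accept-queue length and backlog in those columns instead of bytes. The function name, threshold, and sample are illustrative.

```python
def busy_sockets(ss_text, threshold=1):
    """Return non-listening sockets whose Recv-Q or Send-Q is at least threshold bytes."""
    busy = []
    for line in ss_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5 or fields[0] == "LISTEN":
            continue
        state, recv_q, send_q, local, peer = fields[:5]
        if int(recv_q) >= threshold or int(send_q) >= threshold:
            busy.append({
                "state": state,
                "recv_q": int(recv_q),
                "send_q": int(send_q),
                "local": local,
                "peer": peer,
            })
    return busy


# Illustrative sample (assumption), shaped like `ss -tan`.
SS_SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     0.0.0.0:8080        0.0.0.0:*
ESTAB   0       0       10.0.0.5:8080       10.0.0.9:51514
ESTAB   41280   0       10.0.0.5:8080       10.0.0.7:40022
ESTAB   0       87000   10.0.0.5:8080       10.0.0.8:40310
"""

for sock in busy_sockets(SS_SAMPLE, threshold=4096):
    print(sock)
```

A large Recv-Q suggests the local application is not reading fast enough; a large Send-Q suggests the peer, the path, or the congestion window is limiting the sender.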

Sysctls and Limits

Kernel network tunables are sharp tools. Change them only after measuring the bottleneck and recording the old values.

Useful areas:

Tunable Area	Why It Matters
`net.core.netdev_max_backlog`	Maximum packets queued on the per-CPU input backlog.
`net.core.rmem_max` / `net.core.wmem_max`	Upper bounds for socket buffers.
`net.ipv4.tcp_rmem` / `net.ipv4.tcp_wmem`	TCP buffer autotuning ranges.
`net.core.somaxconn`	Upper cap on listen backlog.
`net.ipv4.tcp_max_syn_backlog`	Queue for half-open TCP handshakes.

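Tune the kernel and the application together. An application that calls listen() with a backlog of 128 gets no benefit from a host somaxconn of 4096, and a larger application backlog is silently capped at somaxconn; the effective limit is the smaller of the two.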
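The interaction between the two limits is simple arithmetic. The helper below is a hypothetical illustration, not a kernel or library API.

```python
def effective_listen_backlog(app_backlog, somaxconn):
    """The accept queue is bounded by the smaller of the two settings."""
    return min(app_backlog, somaxconn)


print(effective_listen_backlog(128, 4096))    # the application value is the limit
print(effective_listen_backlog(65535, 4096))  # the kernel cap is the limit
```

On a live host the second argument could come from /proc/sys/net/core/somaxconn, and the first from the backlog the application passes to listen().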

Runbook

Confirm which namespace and interface the traffic uses.
Check link counters with ip -s link and NIC counters with ethtool -S.
Check TCP health with sar -n TCP,ETCP, ss -s, and ss -ti.
Check per-CPU softirq and backlog pressure with /proc/softirqs, /proc/net/softnet_stat, and mpstat.
Check RSS/RPS/RFS/XPS, IRQ affinity, and NUMA placement before changing sysctls.
Inspect qdisc statistics if latency or drops happen before transmit.
Make one change at a time, capture before/after counters, and keep a rollback.
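The before/after comparison in the last step can be sketched as a diff of two counter snapshots. The example assumes the "name: value" lines printed by ethtool -S; counter names differ between drivers, so the ones in the samples are hypothetical, as are the function names.

```python
def parse_nic_stats(text):
    """Parse 'name: value' lines into a dict, skipping headers and non-numeric lines."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.rpartition(":")
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            continue  # e.g. the "NIC statistics:" header
    return stats


def changed_counters(before, after):
    """Return the increase of every counter present in both snapshots that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Illustrative snapshots (assumption): taken before and after a test run.
STATS_BEFORE = """\
NIC statistics:
     rx_packets: 100000
     rx_dropped: 7
     rx_missed_errors: 2
"""
STATS_AFTER = """\
NIC statistics:
     rx_packets: 105000
     rx_dropped: 7
     rx_missed_errors: 42
"""

delta = changed_counters(parse_nic_stats(STATS_BEFORE), parse_nic_stats(STATS_AFTER))
print(delta)
```

Only counters that moved are reported, which keeps attention on what changed during the test instead of on totals accumulated since boot.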

Syscall-to-NIC Diagnostic Path

When a network service is slow, tie application behavior to kernel and NIC evidence in one path instead of collecting disconnected commands.

Boundary	Command	What It Shows
Application syscall	`strace -ttT -p <pid> -e trace=network,poll,epoll_wait`	Blocking connect, accept, read, write, send, recv, or event loop waits.
CPU profile	`perf top -p <pid>` or `perf top -a`	User code, kernel TCP work, copy cost, crypto, or softirq hot paths.
Socket state	`ss -tanpi`	Queues, retransmits, RTT, congestion window, timers.
Kernel counters	`cat /proc/net/snmp /proc/net/netstat`	TCP retransmits, listen overflows, resets, IP errors.
Interface	`ip -s link`, `ethtool -S <if>`	Drops, errors, ring pressure, driver counters.
Scheduling	`mpstat -P ALL 1`, `/proc/softirqs`	Softirq imbalance, CPU saturation, IRQ placement.
Queueing	`tc -s qdisc show dev <if>`	qdisc backlog, shaping, drops before transmit.
Tracing	`bpftrace`, BCC, or CNI tools	Kernel tracepoints, kprobes, XDP/tc programs, drops with context.

Example flow for a slow HTTP service:

strace -ttT -p <pid> -e trace=network,poll,epoll_wait
ss -tanpi '( sport = :8080 )'
perf top -p <pid>
nstat -az | grep -E 'Listen|Retrans|Timeout|Backlog'
ethtool -S <interface> | grep -Ei 'drop|err|miss|discard|fifo|no_buf|timeout'
tc -s qdisc show dev <interface>

If strace shows the app blocked in epoll_wait while receive queues grow, the app may not be waking or accepting fast enough. If perf shows kernel and softirq cost with NIC drops, tune receive distribution, rings, offloads, or flow placement before changing application code.
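The /proc/net/snmp and /proc/net/netstat files from the table above store counter names and values on paired lines, which makes them awkward to filter by name. A minimal sketch of pairing them up follows; the function name is illustrative and the sample holds only a handful of made-up values.

```python
def parse_proc_net_pairs(text):
    """Join each 'Prefix: names...' line with the 'Prefix: values...' line after it."""
    counters = {}
    lines = text.strip().splitlines()
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}{name}"] = int(number)
    return counters


# Illustrative sample (assumption), shaped like /proc/net/netstat.
NETSTAT_SAMPLE = """\
TcpExt: SyncookiesSent ListenOverflows ListenDrops TCPTimeouts TCPBacklogDrop
TcpExt: 0 17 17 42 3
IpExt: InNoRoutes InOctets OutOctets
IpExt: 0 123456 654321
"""

counters = parse_proc_net_pairs(NETSTAT_SAMPLE)
for name in ("TcpExtListenOverflows", "TcpExtTCPTimeouts", "TcpExtTCPBacklogDrop"):
    print(name, counters[name])
```

Nonzero and rising ListenOverflows or ListenDrops values point at the accept queue, which fits the case where the application is not accepting fast enough.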

Study Cards

Question

What does NAPI change about packet receive handling?

Answer

It lets the driver switch from interrupt-driven notification to batched polling, usually processed through softirq context.

Question

What is RPS?

Answer

Receive Packet Steering, a kernel mechanism that spreads receive packet processing across configured CPUs.

Question

Why can offloads confuse packet captures?

Answer

Captures may observe packets before checksum completion or before large buffers are segmented into wire-sized packets.

Question

What does /proc/net/softnet_stat help reveal?

Answer

Per-CPU network backlog pressure, including drops when the kernel cannot drain receive work fast enough.

Question

Why combine strace, ss, perf, and ethtool?

Answer

Together they connect application syscalls, socket queues, CPU cost, kernel packet processing, and NIC counters.
