Tech Study Guide
Linux Kernel Network Performance
Linux kernel packet processing, NAPI, softirq, NIC queues, RPS, RFS, XPS, offloads, qdisc, drops, and practical network performance troubleshooting.
Network performance on Linux is a pipeline problem. A packet moves through a NIC queue, interrupt handling, NAPI polling, softirq work, protocol processing, socket buffers, qdisc, and finally application code or the wire. A bottleneck can sit at any layer, so useful troubleshooting separates link errors, packet drops, CPU placement, queueing, retransmits, and application backpressure.
Command Examples
sar -n DEV,TCP,ETCP 1
ss -s
ip -s link
ethtool -S <interface>
cat /proc/net/softnet_stat
mpstat -P ALL 1
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| sar -n DEV,TCP,ETCP 1 | Per-interface throughput plus TCP retransmit counters. | Correlates traffic rate with TCP error symptoms. |
| cat /proc/net/softnet_stat | Hex counters where dropped or squeezed columns increase. | Detects packet processing backlog pressure in softirq. |
| mpstat -P ALL 1 | One CPU high in %soft while others are low. | Shows receive processing imbalance or flow concentration. |
Use these commands to separate interface drops, TCP retransmits, socket pressure, softnet backlog drops, and CPU imbalance. One busy CPU doing most softirq work while other CPUs are idle usually means receive processing is not spread well, or a small number of flows dominate.
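The hex columns in /proc/net/softnet_stat are hard to read by eye. The sketch below decodes the first three columns of each row, assuming the common layout (processed, dropped, time_squeeze) and that each row maps to one CPU in order. The function name `softnet_summary` is illustrative, and column meanings can differ between kernel versions.

```shell
# softnet_summary: decode softnet_stat rows read from stdin.
# Assumed layout: column 1 = processed, column 2 = dropped,
# column 3 = time_squeeze, all hexadecimal, one row per CPU.
softnet_summary() {
    row=0
    while read -r processed dropped squeezed _rest; do
        # $((0x...)) converts a hex field to decimal in shell arithmetic.
        printf 'row%d processed=%d dropped=%d squeezed=%d\n' \
            "$row" "$((0x$processed))" "$((0x$dropped))" "$((0x$squeezed))"
        row=$((row + 1))
    done
}
```

A typical call would be `softnet_summary < /proc/net/softnet_stat`, run twice a few seconds apart so that growth in dropped or squeezed stands out from the totals since boot.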
Receive Path
On receive, the NIC writes packets into RX descriptor rings and interrupts the host. Modern drivers then use NAPI: interrupts schedule polling, and the kernel drains packets in batches. This avoids one interrupt per packet under load, but it shifts visible cost into softirq CPU.
Important receive checkpoints:
| Layer | What To Look For |
|---|---|
| NIC RX ring | Drops, missed packets, overruns, no-buffer counters from ethtool -S. |
| IRQ affinity | Whether NIC queue interrupts land on useful CPUs. |
| NAPI / softirq | High %soft CPU, growing /proc/net/softnet_stat drops, or CPU imbalance. |
| Protocol stack | TCP retransmits, resets, backlog pressure, conntrack pressure. |
| Socket receive buffer | Application not reading fast enough, receive queue growth in ss. |
Softirq load is still real CPU load. If network processing consumes cores, the application may look slow even though user CPU is low.
Transmit Path
On transmit, the application writes to a socket, TCP may segment and pace data, qdisc decides when packets are released, and the NIC consumes TX descriptors. Drops can happen because the qdisc is full, the NIC queue is overloaded, shaping is configured, or lower-layer errors exist.
tc -s qdisc show dev <interface> is useful when latency or loss appears before the wire. A default qdisc may be fine for simple hosts, but traffic shaping, fq, fq_codel, and classful qdiscs change how queueing and drops behave.
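To watch qdisc drops without reading the full statistics dump, the counters can be pulled out of the `Sent ... (dropped N, overlimits N requeues N)` line that `tc -s` prints for each qdisc. This is a minimal sketch that assumes that line format; `qdisc_drops` is a hypothetical helper name.

```shell
# qdisc_drops: read `tc -s qdisc show dev <interface>` output on stdin and
# print "<kind> <handle> dropped=<n> overlimits=<n>" for each qdisc.
qdisc_drops() {
    awk '
        # Remember which qdisc the following statistics line belongs to.
        $1 == "qdisc" { kind = $2; handle = $3 }
        $1 == "Sent" {
            for (i = 1; i < NF; i++) {
                if ($i == "(dropped")   { d = $(i + 1); sub(/,/, "", d) }
                if ($i == "overlimits") { o = $(i + 1) }
            }
            print kind, handle, "dropped=" d, "overlimits=" o
        }
    '
}
```

For example, `tc -s qdisc show dev <interface> | qdisc_drops` gives one line per qdisc. A dropped count that rises between runs means packets are being lost in the host before they reach the NIC.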
Scaling Across CPUs
Linux has several mechanisms for spreading network work:
| Mechanism | Purpose |
|---|---|
| RSS | NIC hardware hashes flows across receive queues. |
| RPS | Kernel steers receive packet processing to configured CPUs. |
| RFS | Kernel steers receive processing toward the CPU running the consuming application thread. |
| Accelerated RFS | Hardware-assisted steering when the NIC and driver support it. |
| XPS | Kernel chooses transmit queues based on CPU or receive-queue affinity. |
These tools are not automatically better in every environment. On a host with one hardware queue and many CPUs, RPS may help. On a host where RSS already maps queues cleanly to CPUs, extra RPS can add overhead. For latency-sensitive work, NUMA locality and cache locality matter as much as raw parallelism.
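RPS and XPS are configured with hexadecimal CPU bitmasks, where bit N selects CPU N. The helper below is a small sketch for building such a mask from a list of CPU numbers. `cpu_mask` is an illustrative name, and the function handles only CPUs 0 through 62 because it relies on shell integer arithmetic.

```shell
# cpu_mask: print the hexadecimal bitmask for the CPU numbers given as arguments.
# Sketch only: limited to CPUs 0-62 by shell integer width.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        # Set the bit that corresponds to this CPU number.
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}
```

As root, `cpu_mask 0 1 2 3 > /sys/class/net/<interface>/queues/rx-0/rps_cpus` would steer that queue's receive processing to CPUs 0 through 3. Record the previous value first so the change can be rolled back.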
Queue and CPU steering map:
flowchart LR
Wire[Packets on wire] --> NIC[NIC]
NIC --> RX0[RX queue 0]
NIC --> RX1[RX queue 1]
RX0 --> IRQ0[IRQ CPU 0]
RX1 --> IRQ1[IRQ CPU 1]
IRQ0 --> NAPI0[NAPI poll / softirq]
IRQ1 --> NAPI1[NAPI poll / softirq]
NAPI0 --> RPS[RPS/RFS optional CPU steering]
NAPI1 --> RPS
RPS --> TCP[TCP/IP stack and socket queues]
App[Application threads] --> XPS[XPS selects TX queue]
XPS --> TX0[TX queue]
TX0 --> NIC
Interpretation:
| Symptom | Likely Steering Issue |
|---|---|
| One CPU has most network softirq time | RSS queues, IRQ affinity, or one dominant flow concentrates work. |
| Many RX queues but only one interrupt moves | NIC indirection table or driver queue setup is not distributing flows. |
| RPS enabled on every CPU with worse latency | Cache locality or NUMA locality is being lost. |
| Transmit queue imbalance | XPS or application CPU placement does not line up with TX queues. |
| Drops in /proc/net/softnet_stat | Per-CPU backlog cannot drain fast enough. |
Useful places to inspect:
cat /proc/interrupts
cat /proc/softirqs
cat /sys/class/net/<interface>/queues/rx-0/rps_cpus
cat /sys/class/net/<interface>/queues/rx-0/rps_flow_cnt
cat /sys/class/net/<interface>/queues/tx-0/xps_cpus
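One way to quantify softirq imbalance is to turn the NET_RX row of /proc/softirqs into per-CPU percentages. The sketch below assumes the usual layout, with one count per CPU column after the `NET_RX:` label. `net_rx_share` is a hypothetical name.

```shell
# net_rx_share: read /proc/softirqs content on stdin and print each CPU
# column's share of the NET_RX total as a whole percentage.
net_rx_share() {
    awk '
        $1 == "NET_RX:" {
            for (i = 2; i <= NF; i++) total += $i
            for (i = 2; i <= NF; i++)
                printf "cpu%d %d%%\n", i - 2, (total ? 100 * $i / total : 0)
        }
    '
}
```

Run it as `net_rx_share < /proc/softirqs`. The counts are cumulative since boot, so a skew seen here reflects history. For current behavior, compare two snapshots or watch %soft in mpstat.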
Offloads
NIC and kernel offloads reduce per-packet CPU cost by doing larger chunks of work at once.
| Offload | What It Does |
|---|---|
| Checksum offload | NIC computes or verifies packet checksums. |
| TSO | TCP segmentation offload: NIC splits large TCP buffers into wire-sized packets. |
| GSO | Generic segmentation offload in the kernel. |
| GRO | Generic receive offload: kernel coalesces related received packets before upper-layer processing. |
| LRO | Large receive offload, usually avoided on routers/bridges because it can distort packets being forwarded. |
Offloads can make packet captures confusing because tcpdump may see packets before segmentation or checksum completion. Disable offloads briefly only for controlled debugging; leaving them off can waste CPU.
ethtool -k <interface>
ethtool -K <interface> gro off tso off gso off
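Because offloads should be disabled only briefly, it helps to capture the current state before touching it. The sketch below reads `ethtool -k` output and prints the `ethtool -K` command that would restore GRO, TSO, and GSO to their current values. `offload_restore_cmd` is an illustrative name, and the mapping covers only these three features.

```shell
# offload_restore_cmd: read `ethtool -k <interface>` output on stdin and print
# the `ethtool -K` command that restores the current GRO/TSO/GSO state.
# The interface name is passed as the first argument.
offload_restore_cmd() {
    awk -v dev="$1" '
        BEGIN {
            # Map the long names printed by -k to the short names accepted by -K.
            short["tcp-segmentation-offload:"] = "tso"
            short["generic-segmentation-offload:"] = "gso"
            short["generic-receive-offload:"] = "gro"
        }
        ($1 in short) { args = args " " short[$1] " " $2 }
        END { if (args != "") print "ethtool -K " dev args }
    '
}
```

Running `ethtool -k <interface> | offload_restore_cmd <interface>` before the debugging session yields a single command to paste back afterwards.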
Drops, Retransmits, and Queue Pressure
Do not treat every drop counter the same. A drop at the NIC ring, softnet backlog, qdisc, conntrack table, or socket buffer points at a different fix.
Common interpretations:
- RX errors or missed packets: physical link, driver, NIC ring, or host CPU not draining fast enough.
- /proc/net/softnet_stat drops: per-CPU network backlog pressure.
- TCP retransmits: loss, reordering, overloaded middleboxes, bad path, or receiver pressure.
- Large send or receive queues in ss: application, peer, congestion window, or buffer pressure.
- Full conntrack table: new stateful connections can fail even when some established flows continue.
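A raw retransmit count is easier to judge as a fraction of segments sent. The sketch below computes RetransSegs as a percentage of OutSegs from the two `Tcp:` lines in /proc/net/snmp (a header line of names followed by a line of values). `tcp_retrans_pct` is a hypothetical helper name.

```shell
# tcp_retrans_pct: read /proc/net/snmp content on stdin and print
# RetransSegs as a percentage of OutSegs, with two decimals.
tcp_retrans_pct() {
    awk '
        # The first Tcp: line lists field names; remember each column index.
        $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        # The second Tcp: line carries the values in the same order.
        $1 == "Tcp:" {
            out = $(col["OutSegs"]); re = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", 100 * re / out
        }
    '
}
```

Invoke it as `tcp_retrans_pct < /proc/net/snmp`. The counters are totals since boot, so for a live incident the per-second rates from sar -n ETCP are usually the better signal.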
Sysctls and Limits
Kernel network tunables are sharp tools. Change them only after measuring the bottleneck and recording the old values.
Useful areas:
| Tunable Area | Why It Matters |
|---|---|
| net.core.netdev_max_backlog | Maximum packets queued on the per-CPU input backlog. |
| net.core.rmem_max / net.core.wmem_max | Upper bounds for socket buffers. |
| net.ipv4.tcp_rmem / net.ipv4.tcp_wmem | TCP buffer autotuning ranges. |
| net.core.somaxconn | Upper cap on listen backlog. |
| net.ipv4.tcp_max_syn_backlog | Queue for half-open TCP handshakes. |
Tune the kernel and the application together. An application that calls listen() with a backlog of 128 gains nothing from a host somaxconn of 4096, and a backlog of 8192 is silently capped at 4096; the effective limit is the smaller of the two values.
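The interaction between the application backlog and somaxconn comes down to a minimum. The sketch below shows that arithmetic. `effective_backlog` is an illustrative name, and the second argument defaults to the live sysctl value when omitted.

```shell
# effective_backlog: print the accept-queue limit that actually applies,
# given the backlog the application passes to listen() and somaxconn.
effective_backlog() {
    app=$1
    # Fall back to the running kernel's value if no second argument is given.
    somaxconn=${2:-$(sysctl -n net.core.somaxconn)}
    if [ "$app" -lt "$somaxconn" ]; then
        echo "$app"
    else
        echo "$somaxconn"
    fi
}
```

`effective_backlog 128 4096` prints 128, which shows that raising the sysctl alone changes nothing until the application also asks for more.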
Runbook
- Confirm which namespace and interface the traffic uses.
- Check link counters with ip -s link and NIC counters with ethtool -S.
- Check TCP health with sar -n TCP,ETCP, ss -s, and ss -ti.
- Check per-CPU softirq and backlog pressure with /proc/softirqs, /proc/net/softnet_stat, and mpstat.
- Check RSS/RPS/RFS/XPS, IRQ affinity, and NUMA placement before changing sysctls.
- Inspect qdisc statistics if latency or drops happen before transmit.
- Make one change at a time, capture before/after counters, and keep a rollback.
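The before/after step in the runbook can be sketched as a comparison of two saved snapshots. The helper below assumes `name: value` lines such as those printed by `ethtool -S` and reports only the counters that changed. `counter_delta` and the snapshot file names are hypothetical.

```shell
# counter_delta: compare two saved "name: value" snapshots and print
# "<name> <difference>" for every counter whose value changed.
counter_delta() {
    awk -F': *' '
        # First file: remember each counter value.
        NR == FNR { before[$1] = $2; next }
        # Second file: report counters that differ from the first snapshot.
        ($1 in before) && $2 != before[$1] {
            name = $1; sub(/^[ \t]+/, "", name)
            print name, $2 - before[$1]
        }
    ' "$1" "$2"
}
```

A possible workflow is to save `ethtool -S <interface>` into a file named before.txt, apply one change, wait, save a second snapshot into after.txt, and run `counter_delta before.txt after.txt`.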
Syscall-to-NIC Diagnostic Path
When a network service is slow, tie application behavior to kernel and NIC evidence in one path instead of collecting disconnected commands.
| Boundary | Command | What It Shows |
|---|---|---|
| Application syscall | strace -ttT -p <pid> -e trace=network,poll,epoll_wait | Blocking connect, accept, read, write, send, recv, or event loop waits. |
| CPU profile | perf top -p <pid> or perf top -a | User code, kernel TCP work, copy cost, crypto, or softirq hot paths. |
| Socket state | ss -tanpi | Queues, retransmits, RTT, congestion window, timers. |
| Kernel counters | cat /proc/net/snmp /proc/net/netstat | TCP retransmits, listen overflows, resets, IP errors. |
| Interface | ip -s link, ethtool -S <if> | Drops, errors, ring pressure, driver counters. |
| Scheduling | mpstat -P ALL 1, /proc/softirqs | Softirq imbalance, CPU saturation, IRQ placement. |
| Queueing | tc -s qdisc show dev <if> | qdisc backlog, shaping, drops before transmit. |
| Tracing | bpftrace, BCC, or CNI tools | Kernel tracepoints, kprobes, XDP/tc programs, drops with context. |
Example flow for a slow HTTP service:
strace -ttT -p <pid> -e trace=network,poll,epoll_wait
ss -tanpi '( sport = :8080 )'
perf top -p <pid>
nstat -az | grep -E 'Listen|Retrans|Timeout|Backlog'
ethtool -S <interface> | grep -E 'drop|err|miss|timeout'
tc -s qdisc show dev <interface>
If strace shows the app sitting in epoll_wait while Recv-Q in ss keeps growing, the app is not reading or accepting as fast as data arrives, so look at its event loop first. If perf shows time in kernel and softirq paths and NIC drop counters are rising, tune receive distribution, ring sizes, offloads, or flow placement before changing application code.
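Growing receive queues can be spotted by filtering ss output on the Recv-Q column. The sketch below assumes the plain `ss -tan` column order (State, Recv-Q, Send-Q, local address, peer address). It skips LISTEN rows because Recv-Q means accept-queue length there. `recvq_over` is a hypothetical helper name.

```shell
# recvq_over: read `ss -tan` output on stdin and print non-listening sockets
# whose Recv-Q exceeds the byte threshold given as the first argument.
recvq_over() {
    awk -v min="$1" '
        # NR > 1 skips the header row; "+ 0" forces numeric comparison.
        NR > 1 && $1 != "LISTEN" && $2 + 0 > min + 0 { print $1, $2, $4, $5 }
    '
}
```

For instance, `ss -tan '( sport = :8080 )' | recvq_over 0` lists connections with unread data. If the same sockets appear across repeated runs, the application is falling behind on reads.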
Study Cards
What does NAPI change about packet receive handling?
It lets the driver switch from interrupt-driven notification to batched polling, usually processed through softirq context.
What is RPS?
Receive Packet Steering, a kernel mechanism that spreads receive packet processing across configured CPUs.
Why can offloads confuse packet captures?
Captures may observe packets before checksum completion or before large buffers are segmented into wire-sized packets.
What does /proc/net/softnet_stat help reveal?
Per-CPU network backlog pressure, including drops when the kernel cannot drain receive work fast enough.
Why combine strace, ss, perf, and ethtool?
Together they connect application syscalls, socket queues, CPU cost, kernel packet processing, and NIC counters.