Tech Study Guide
Linux Performance Triage Runbooks
Symptom-driven Linux runbooks for high CPU, high load, memory pressure, disk latency, softirq saturation, file descriptor exhaustion, and cgroup throttling.
Performance triage starts by naming the bottleneck. “The server is slow” is not actionable. The first split is CPU, run queue, memory reclaim, disk latency, network softirq, file descriptor pressure, cgroup throttling, or dependency latency.
These runbooks are meant for the first 10 minutes of an incident.
Command Examples
uptime
vmstat 1
mpstat -P ALL 1
iostat -xz 1
pidstat -durh 1
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| uptime | load average: 8.42, 7.91, 6.03 | Shows whether runnable or blocked work is piling up. |
| vmstat 1 | One line per second: run queue (r), blocked tasks (b), free and cached memory, swap in/out (si/so), block I/O (bi/bo), and the CPU split. | Separates CPU saturation from swapping and I/O wait at a glance. |
| mpstat -P ALL 1 | Per-CPU %usr, %sys, %iowait, %irq, %soft, %steal, and %idle. | Locates pressure on specific CPUs and in user, kernel, interrupt, or hypervisor time. |
Capture before restarting. Restarts often erase the evidence that explains the incident.
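A capture step like this can be scripted ahead of time so it costs one command during an incident. The following is a minimal sketch, not a standard tool: the function name `capture_snapshot`, the default output directory under /tmp, and the sample count are all assumptions made for illustration. It records a note instead of failing when a tool is not installed.

```shell
#!/bin/sh
# Hypothetical helper: snapshot the first-look commands into one directory.
# Usage: capture_snapshot [out_dir] [samples]
capture_snapshot() (
    out_dir="${1:-/tmp/triage-$(date +%Y%m%dT%H%M%S)}"   # assumed default location
    samples="${2:-5}"                                     # assumed default sample count
    mkdir -p "$out_dir" || exit 1

    # Run one command, keeping stdout and stderr; note missing tools instead of failing.
    save() {
        name="$1"; shift
        if command -v "$1" >/dev/null 2>&1; then
            "$@" >"$out_dir/$name.txt" 2>&1
        else
            echo "not installed: $1" >"$out_dir/$name.txt"
        fi
        return 0
    }

    save uptime  uptime
    save vmstat  vmstat 1 "$samples"
    save mpstat  mpstat -P ALL 1 "$samples"
    save iostat  iostat -xz 1 "$samples"
    save pidstat pidstat -durh 1 "$samples"

    # PSI files exist only on kernels with pressure stall information enabled.
    for res in cpu memory io; do
        if [ -r "/proc/pressure/$res" ]; then
            cat "/proc/pressure/$res" >"$out_dir/psi-$res.txt"
        fi
    done

    # Print where the evidence was written.
    echo "$out_dir"
)
```

Calling `capture_snapshot` with no arguments writes a timestamped directory. The samplers run one after another, so the default of 5 samples takes roughly 20 seconds; pass a smaller count as the second argument if that is too long.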
High CPU
Goal: identify whether CPU time is going to user code, kernel work, interrupts, softirq, or steal time.
mpstat -P ALL 1
top -H -p <pid>
pidstat -t -p <pid> 1
perf top
Interpretation:
| Signal | Likely Cause |
|---|---|
| High %usr | Application code or runtime work. |
| High %sys | Syscall-heavy workload, kernel path, filesystem, network, or locks. |
| High %soft | Network or block softirq processing. |
| High %steal | Hypervisor contention. |
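When mpstat is not installed, the same split can be approximated from two readings of the aggregate cpu line in /proc/stat. This is a sketch under stated assumptions: `cpu_breakdown` is a hypothetical name, and nice time is folded into usr here, whereas mpstat reports %nice separately.

```shell
#!/bin/sh
# Hypothetical helper: percentage split between two aggregate "cpu" lines
# from /proc/stat. Fields after the label: user nice system idle iowait
# irq softirq steal (guest time is already included in user and nice).
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total <= 0) { print "no elapsed ticks"; exit 1 }
            printf "usr=%.1f sys=%.1f iowait=%.1f irq=%.1f soft=%.1f steal=%.1f idle=%.1f\n",
                100 * (d[2] + d[3]) / total, 100 * d[4] / total,
                100 * d[6] / total, 100 * d[7] / total,
                100 * d[8] / total, 100 * d[9] / total,
                100 * d[5] / total
        }'
}
```

One way to use it: store `grep '^cpu ' /proc/stat` in a variable, sleep for a second, read it again, and pass both values to `cpu_breakdown`. The output maps onto the table above: a large soft or steal share points away from application code.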
High Load, Low CPU
High load with low CPU usually means tasks are stuck in uninterruptible sleep (D state), which Linux counts in the load average alongside runnable tasks, or are runnable but not being scheduled.
ps -eo state,pid,ppid,comm,wchan:32 --sort=state | head -50
cat /proc/pressure/cpu
cat /proc/pressure/io
Many D state tasks point toward storage, filesystem, NFS, block devices, or kernel waits. Many runnable tasks point toward CPU saturation or scheduler contention.
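The D-versus-runnable split can be counted rather than eyeballed, and PSI can be reduced to its 10-second averages. The sketch below assumes two hypothetical helpers, `count_task_states` and `psi_avg10`; the PSI one expects the usual `some avg10=... avg60=... avg300=... total=...` line layout.

```shell
#!/bin/sh
# Hypothetical helper: count runnable (R) and uninterruptible (D) tasks.
# Reads one state string per line on stdin, e.g. from: ps -eo state=
count_task_states() {
    awk '{ s = substr($1, 1, 1); n[s]++ }
         END { printf "R=%d D=%d\n", n["R"], n["D"] }'
}

# Hypothetical helper: print the avg10 value of each line in a PSI file,
# e.g. /proc/pressure/io, as "some=<pct> full=<pct>".
psi_avg10() {
    awk '{ split($2, kv, "=")
           printf "%s%s=%s", (NR > 1 ? " " : ""), $1, kv[2] }
         END { if (NR) print "" }' "$1"
}
```

Typical calls would be `ps -eo state= | count_task_states` and `psi_avg10 /proc/pressure/io`. A D count that stays high together with a nonzero I/O full value supports the storage or filesystem hypothesis; a high R count with CPU pressure supports saturation.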
Memory Pressure
free -h
cat /proc/meminfo
cat /proc/pressure/memory
vmstat 1
journalctl -k -g 'oom|Out of memory|Killed process'
Decide whether pressure is process RSS, page cache/writeback, slab, swap, or cgroup limit. For containers, check cgroup memory files before blaming the node.
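To make that decision quickly, the relevant /proc/meminfo fields can be pulled into one short summary. This is an illustrative sketch: `meminfo_summary` is a hypothetical name, and which fields matter most is a judgment call rather than a fixed rule.

```shell
#!/bin/sh
# Hypothetical helper: summarize the /proc/meminfo fields that separate
# page cache, dirty/writeback, slab, and swap usage. Values are in kB.
meminfo_summary() {
    awk '
        { key = $1; sub(/:$/, "", key); kb[key] = $2 }
        END {
            if (kb["MemTotal"] <= 0) { print "MemTotal not found"; exit 1 }
            printf "available_pct=%.1f\n", 100 * kb["MemAvailable"] / kb["MemTotal"]
            printf "page_cache_kb=%d\n", kb["Cached"]
            printf "dirty_writeback_kb=%d\n", kb["Dirty"] + kb["Writeback"]
            printf "slab_kb=%d\n", kb["Slab"]
            printf "swap_used_kb=%d\n", kb["SwapTotal"] - kb["SwapFree"]
        }' "${1:-/proc/meminfo}"
}
```

Run with no argument it reads /proc/meminfo. A low available percentage alongside a large page cache is usually less alarming than the same percentage with growing swap use or slab.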
Disk Latency
iostat -xz 1
pidstat -d 1
lsblk -o NAME,TYPE,MODEL,ROTA,SIZE,MOUNTPOINTS
journalctl -k -g 'I/O error|nvme|scsi|reset|EXT4|XFS'
Look for high await, high utilization, queue depth, errors, resets, and one process dominating writes or fsyncs.
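If iostat is unavailable, average latency and utilization can be approximated from two copies of /proc/diskstats taken a known interval apart. The sketch below assumes a hypothetical helper `disk_latency` and merges reads and writes into a single await figure, which is coarser than the separate r_await and w_await columns in recent iostat versions.

```shell
#!/bin/sh
# Hypothetical helper: await and utilization for one device from two
# /proc/diskstats snapshots.
# Usage: disk_latency <device> <before_file> <after_file> <interval_seconds>
# Columns used: $4 reads completed, $7 ms spent reading,
#               $8 writes completed, $11 ms spent writing, $13 ms doing I/O.
disk_latency() {
    cat "$2" "$3" | awk -v dev="$1" -v secs="$4" '
        $3 != dev { next }
        !seen { ios = $4 + $8; ms = $7 + $11; busy = $13; seen = 1; next }
        {
            d_ios = ($4 + $8) - ios
            d_ms = ($7 + $11) - ms
            d_busy = $13 - busy
            await = (d_ios > 0) ? d_ms / d_ios : 0
            printf "%s await_ms=%.2f util_pct=%.1f\n", dev, await, 100 * d_busy / (secs * 1000)
            done = 1
        }
        END { if (!done) print "device not found in both snapshots: " dev }'
}
```

For example, copy /proc/diskstats to a file, sleep 5 seconds, copy it again, then call `disk_latency sda before.txt after.txt 5`. The device name and file names here are placeholders.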
Network Softirq Saturation
sar -n DEV,TCP,ETCP 1
mpstat -P ALL 1
cat /proc/softirqs
cat /proc/net/softnet_stat
ethtool -S <interface>
One CPU doing most softirq work points at queue placement, RSS/RPS/RFS/XPS, IRQ affinity, or one dominant flow.
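The counters in /proc/net/softnet_stat are hexadecimal, which makes an imbalance hard to spot by eye. A minimal sketch that decodes the first three columns per row follows; `softnet_summary` is a hypothetical name, and it assumes rows appear in CPU order, which is the usual layout but worth verifying on the kernel in question.

```shell
#!/bin/sh
# Hypothetical helper: decode the first three hex columns of softnet_stat
# (packets processed, packets dropped, time_squeeze), one row per CPU.
softnet_summary() (
    row=0
    while read -r processed dropped squeezed rest; do
        # printf accepts 0x-prefixed hex for %d, which keeps this POSIX sh.
        printf 'row%d processed=%d dropped=%d time_squeeze=%d\n' \
            "$row" "0x$processed" "0x$dropped" "0x$squeezed"
        row=$((row + 1))
    done < "${1:-/proc/net/softnet_stat}"
)
```

Run with no argument it reads the live file. One row with a processed count far above the others, or with rising dropped or time_squeeze values between two runs, matches the single-hot-CPU pattern described above.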
File Descriptor Exhaustion
cat /proc/sys/fs/file-nr
ls /proc/<pid>/fd | wc -l
cat /proc/<pid>/limits
lsof -p <pid> | head
ss -s
EMFILE means the process hit its limit. ENFILE means the system-wide file table is under pressure.
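Whether a process is close to EMFILE can be checked by comparing its open descriptor count with its soft limit. The following is a sketch only: `fd_usage` is a hypothetical helper, and the optional second argument (an alternate proc root) exists purely so the function can be exercised against fixture files.

```shell
#!/bin/sh
# Hypothetical helper: open descriptor count versus the soft "Max open files"
# limit for one process.
# Usage: fd_usage <pid> [proc_root]
fd_usage() (
    pid="$1"
    proc="${2:-/proc}"
    # One directory entry per open descriptor; plain ls avoids the "total" line.
    used=$(ls "$proc/$pid/fd" | wc -l)
    # In /proc/<pid>/limits the soft limit is the 4th whitespace-separated field.
    soft=$(awk '/^Max open files/ { print $4 }' "$proc/$pid/limits")
    echo "pid=$pid open_fds=$((used)) soft_limit=$soft"
)
```

Reading another user's /proc/<pid>/fd normally requires root. For the system-wide ENFILE side, the three numbers in /proc/sys/fs/file-nr are allocated handles, unused handles, and the maximum.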
Cgroup Throttling and Noisy Neighbors
cat /proc/<pid>/cgroup
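cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
cat /sys/fs/cgroup/<cgroup-path>/memory.events
cat /sys/fs/cgroup/<cgroup-path>/io.stat
On cgroup v2, <cgroup-path> is the path printed by the first command; from inside a container the files usually sit directly under /sys/fs/cgroup. CPU throttling, memory high events, and I/O limits can make a service slow while host-level resources look available.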
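CPU throttling in particular is easy to quantify from cpu.stat. The sketch below assumes cgroup v2 and two hypothetical helpers: `cgroup_dir_of`, which turns the `0::` line of a /proc/<pid>/cgroup file into a directory under /sys/fs/cgroup, and `throttle_summary`, which reports the share of quota periods that were throttled.

```shell
#!/bin/sh
# Hypothetical helper: map the cgroup v2 line ("0::/path") of a
# /proc/<pid>/cgroup file to its directory under /sys/fs/cgroup.
cgroup_dir_of() {
    awk -F'::' '$1 == "0" { print "/sys/fs/cgroup" $2 }' "$1"
}

# Hypothetical helper: share of CPU quota periods that were throttled,
# read from a cgroup v2 cpu.stat file.
throttle_summary() {
    awk '
        $1 == "nr_periods"     { periods = $2 }
        $1 == "nr_throttled"   { throttled = $2 }
        $1 == "throttled_usec" { usec = $2 }
        END {
            if (periods > 0)
                printf "throttled_periods_pct=%.1f throttled_usec=%d\n", 100 * throttled / periods, usec
            else
                print "no CPU quota periods recorded"
        }' "$1"
}
```

A typical sequence is `dir=$(cgroup_dir_of /proc/<pid>/cgroup)` followed by `throttle_summary "$dir/cpu.stat"`. The counters are cumulative, so comparing two readings taken during the slowdown says more than a single value.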
Runbook
- Capture uptime, vmstat, mpstat, iostat, pidstat, and PSI.
- Pick one dominant bottleneck; do not tune everything at once.
- Identify affected process, cgroup, device, interface, or dependency.
- Mitigate with the smallest reversible action: reduce concurrency, shed load, move traffic, pause batch work, raise a limit, or restart one bad process.
- Save before/after counters.
- Convert the finding into an alert or dashboard gap.
Study Cards
What does high load with low CPU often suggest?
Tasks may be stuck in uninterruptible sleep or waiting on IO rather than consuming CPU.
Why check PSI during performance triage?
It shows time tasks lost waiting for CPU, memory, or IO resources.
What is the difference between EMFILE and ENFILE?
EMFILE is a per-process file descriptor limit; ENFILE is system-wide file table pressure.
Why can cgroup throttling hide behind healthy host metrics?
A workload can hit its cgroup CPU, memory, or IO limit while the host still has spare capacity.