Tech Study Guide
Linux Memory Pressure and OOM
Linux memory troubleshooting for RSS, VSZ, page cache, slab, THP, NUMA, swap, cgroup memory, PSI, and OOM killer behavior.
A Linux memory incident is rarely just “free memory is low.” The kernel uses memory for page cache, slab, anonymous pages, file mappings, buffers, and kernel stacks, and it accounts for that usage per cgroup. Healthy systems often keep free memory low because the kernel puts otherwise idle RAM to work as cache.
The operational question is whether reclaim, compaction, swapping, cgroup limits, or the OOM killer are affecting workloads, and whether Pressure Stall Information (PSI) shows tasks stalling on memory.
Command Examples
free -h
cat /proc/meminfo
cat /proc/pressure/memory
vmstat 1
ps -eo pid,ppid,comm,rss,vsz,%mem --sort=-rss | head
journalctl -k -g 'Out of memory|Killed process|oom-kill'
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| free -h | Mem: 31Gi used 2.1Gi free 18Gi buff/cache. | Separates free memory from reclaimable page cache. |
| cat /proc/pressure/memory | some avg10=4.20 and full avg10=0.35. | Shows whether tasks are stalled on memory pressure. |
| journalctl -k -g 'Out of memory\|Killed process\|oom-kill' | Killed process 1234 (java) total-vm:... anon-rss:.... | Confirms whether the OOM killer acted and which process lost. |
Start with system pressure, then separate process RSS, kernel memory, page cache, swap activity, and cgroup limits.
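The PSI file has one line per stall class, so it is easy to turn into numbers for alerting or comparison. A minimal Python sketch, assuming the kernel's documented `some`/`full` line layout; the function name `parse_psi` and the sample values are illustrative, not taken from a real host:

```python
def parse_psi(text):
    """Parse /proc/pressure/memory content into {kind: {field: float}}."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=4.20 avg60=... avg300=... total=..."
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


# Sample text in the PSI format (values are made up for illustration).
sample = (
    "some avg10=4.20 avg60=1.10 avg300=0.40 total=123456\n"
    "full avg10=0.35 avg60=0.10 avg300=0.02 total=23456\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["avg10"])
```

On a live host, the same function can be fed `open("/proc/pressure/memory").read()`. `some` means at least one task was stalled; `full` means all non-idle tasks were stalled at once.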
Memory Types
| Term | Meaning | Incident Signal |
|---|---|---|
| RSS | Resident physical pages mapped into a process. | Large or growing process memory use. |
| VSZ / VIRT | Virtual address space reserved or mapped by a process. | Often high without real pressure; not a leak by itself. |
| Anonymous memory | Heap, stack, and private writable pages not backed by files. | Main source of process memory pressure. |
| Page cache | File data cached by the kernel. | Usually reclaimable, but writeback or dirty pages can stall. |
| Slab | Kernel object caches. | Can reveal dentries, inodes, conntrack, or filesystem pressure. |
| Buffers | Block-device metadata buffers. | Usually less important than page cache/slab. |
| Swap | Disk-backed memory extension. | Sustained swap-in/out indicates pressure and latency risk. |
Available memory (MemAvailable in /proc/meminfo, the available column in free -h) is usually a better quick signal than free memory because it estimates how much memory can be used without heavy reclaim.
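The gap between free and available memory is easiest to see with numbers. A small Python sketch that parses `/proc/meminfo`-style text and compares the two; the helper name `parse_meminfo` and the sample values are assumptions for illustration:

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as {name: int}.

    Most fields are in kB; a few (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


# Made-up sample resembling a host with a large page cache.
sample = """MemTotal:       32768000 kB
MemFree:         2202000 kB
MemAvailable:   19660800 kB
Cached:         17000000 kB
"""
info = parse_meminfo(sample)
free_pct = 100 * info["MemFree"] / info["MemTotal"]
avail_pct = 100 * info["MemAvailable"] / info["MemTotal"]
print(f"free {free_pct:.1f}%  available {avail_pct:.1f}%")
```

In this sample only about 7% of RAM is free, yet 60% is available, which is the normal shape of a healthy cache-heavy host.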
Page Cache and Reclaim
Linux aggressively caches file data. Page cache is useful until reclaim has to fight active workloads.
Useful checks:
grep -E 'MemAvailable|Cached|Dirty|Writeback|Slab|SReclaimable|SUnreclaim' /proc/meminfo
sar -B 1
vmstat 1
Watch for sustained page scanning, dirty writeback stalls, and low MemAvailable. Dropping caches is rarely a fix; it can hide evidence and make the next read path slower.
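Page scanning can also be read from two snapshots of `/proc/vmstat`. The sketch below computes pages scanned, pages reclaimed, and the ratio between them, similar in spirit to what `sar -B` reports. It assumes the counter names `pgscan_kswapd`, `pgscan_direct`, `pgsteal_kswapd`, and `pgsteal_direct` found on recent kernels (older kernels split these per zone), and the sample numbers are invented:

```python
SCAN_KEYS = ("pgscan_kswapd", "pgscan_direct")
STEAL_KEYS = ("pgsteal_kswapd", "pgsteal_direct")


def parse_vmstat(text):
    """Return /proc/vmstat counters as {name: int}."""
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}


def reclaim_delta(before, after):
    """Pages scanned, pages reclaimed, and efficiency (%) between two snapshots."""
    scanned = sum(after[k] - before[k] for k in SCAN_KEYS)
    stolen = sum(after[k] - before[k] for k in STEAL_KEYS)
    efficiency = 100 * stolen / scanned if scanned else None
    return scanned, stolen, efficiency


# Two invented snapshots taken some seconds apart.
t0 = parse_vmstat("pgscan_kswapd 1000\npgscan_direct 200\npgsteal_kswapd 900\npgsteal_direct 100")
t1 = parse_vmstat("pgscan_kswapd 5000\npgscan_direct 1200\npgsteal_kswapd 2900\npgsteal_direct 600")
scanned, stolen, efficiency = reclaim_delta(t0, t1)
print(scanned, stolen, efficiency)
```

A rising direct-scan count together with falling efficiency suggests reclaim is struggling to find pages it can free.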
OOM Killer
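The OOM killer chooses a victim when the kernel cannot satisfy a memory allocation even after reclaim. The badness score is driven mainly by the task's memory footprint (resident pages, swap, and page tables) and is shifted up or down by oom_score_adj; older kernels also gave privileged processes a small discount.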
Inspect scores:
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj
grep -E 'VmRSS|VmSize|RssAnon|RssFile|RssShmem' /proc/<pid>/status
In containers, the kernel can kill a process because the cgroup limit is hit even when the host has memory available.
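The RSS fields in `/proc/<pid>/status` show what a candidate victim's resident memory is made of. A minimal Python sketch, assuming the `VmRSS`, `RssAnon`, `RssFile`, and `RssShmem` fields are present (they are on reasonably recent kernels); the function name `rss_breakdown` and the sample values are illustrative:

```python
def rss_breakdown(status_text):
    """Extract RSS fields (in kB) from /proc/<pid>/status content."""
    wanted = ("VmRSS", "RssAnon", "RssFile", "RssShmem")
    out = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            out[name] = int(rest.split()[0])
    return out


# Invented sample for a process dominated by anonymous memory.
sample = """Name:   java
VmSize:  8388608 kB
VmRSS:   1048576 kB
RssAnon:  943718 kB
RssFile:   94371 kB
RssShmem:  10487 kB
"""
rss = rss_breakdown(sample)
anon_share = 100 * rss["RssAnon"] / rss["VmRSS"]
print(f"anonymous share of RSS: {anon_share:.0f}%")
```

A high anonymous share points at heap, stacks, or native allocations rather than file mappings that the kernel could drop under pressure.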
Cgroups and Containers
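For cgroup v2 (paths as seen from inside a container; on the host, use the service's or container's own directory under /sys/fs/cgroup, because the root cgroup does not expose memory.current, memory.max, memory.high, or memory.events):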
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.high
cat /sys/fs/cgroup/memory.events
cat /sys/fs/cgroup/memory.pressure
Important events:
| Event | Meaning |
|---|---|
| high | Workload exceeded memory.high and was throttled/reclaimed. |
| max | Workload hit memory.max. |
| oom | Cgroup allocation failed and OOM handling ran. |
| oom_kill | A task in the cgroup was killed. |
Kubernetes memory limits map to cgroup limits. A pod can be OOMKilled while the node looks mostly healthy.
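Because `memory.events` holds cumulative counters, the useful signal is what changed between two reads. A small Python sketch; the helper names `parse_events` and `new_events` and the sample counts are assumptions for illustration:

```python
def parse_events(text):
    """Parse cgroup v2 memory.events content into {event: count}."""
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}


def new_events(before, after):
    """Return only the counters that increased between two snapshots."""
    return {
        k: after[k] - before.get(k, 0)
        for k in after
        if after[k] > before.get(k, 0)
    }


# Invented snapshots taken before and after an incident window.
before = parse_events("low 0\nhigh 12\nmax 3\noom 0\noom_kill 0")
after = parse_events("low 0\nhigh 40\nmax 9\noom 1\noom_kill 1")
delta = new_events(before, after)
print(delta)
```

A non-zero `oom_kill` delta for a container's cgroup confirms a cgroup OOM even when host-level memory looked fine during the same window.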
OOM Comparison Matrix
| OOM Shape | Boundary | Evidence | Common Fix |
|---|---|---|---|
| Host OOM | Whole node memory is exhausted after reclaim. | dmesg, journalctl -k, low available memory, host OOM victim. | Reduce node pressure, add memory, tune workload placement, fix leaks. |
| cgroup OOM | A service or container cgroup hits memory.max. | memory.events oom_kill, container exit, host may still have memory. | Raise limit, reduce heap/native/page-cache use, split sidecars, fix leaks. |
| kubelet eviction | Node pressure crosses eviction thresholds. | Pod Evicted, kubelet events, node pressure conditions. | Adjust requests, free ephemeral storage/memory, fix noisy workloads. |
| Application heap OOM | Runtime heap limit is hit before cgroup limit. | JVM/Go/Python/node error logs, heap dumps, process exit. | Tune runtime heap, fix allocations, account for native memory. |
Do not assume every OOMKilled status is a node outage. First decide which boundary was hit: the host, a cgroup, a kubelet eviction threshold, or a language runtime limit.
Runtime Memory Examples
Language runtimes reserve and report memory differently, so a single RSS number does not explain the whole failure.
| Runtime Shape | What Uses Memory | Common Surprise | Command Evidence |
|---|---|---|---|
| Java service | Java heap, metaspace, thread stacks, direct buffers, JIT/code cache, mmap files, native libraries. | -Xmx is not the container limit; native memory can push RSS beyond heap. | jcmd <pid> VM.native_memory summary, GC logs, -XX:MaxRAMPercentage, cgroup limit. |
| Go service | Go heap, goroutine stacks, spans, caches, mmap, Cgo/native allocations. | Go may hold memory for reuse after GC, so RSS can stay high after heap drops. | GODEBUG=gctrace=1, pprof heap, runtime.MemStats, cgroup memory. |
| Native C/C++ service | malloc arenas, thread stacks, mmap regions, file mappings, allocator fragmentation. | Leaks may be outside application metrics, and glibc arenas can grow with thread count. | /proc/<pid>/smaps_rollup, pmap -x, allocator stats, perf, ASAN in test. |
| Python/Node service | Managed heap plus native extensions, buffers, JIT/runtime overhead, mmap files. | Runtime heap limit and cgroup limit can disagree. | Runtime heap tools, /proc/<pid>/status, cgroup events. |
Example: a Java container with memory.max=1GiB and -Xmx1g can still die because thread stacks, direct buffers, metaspace, and libc allocations need memory outside the Java heap. Leave headroom or use container-aware heap sizing.
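The headroom arithmetic for the Java example can be sketched in a few lines of Python. This only approximates percentage-based heap sizing such as `-XX:MaxRAMPercentage`; the function name `heap_budget` and the 75% figure are illustrative assumptions, not a recommendation:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def heap_budget(limit_bytes, heap_percent):
    """Split a cgroup limit into a heap cap and the remaining non-heap headroom."""
    heap = limit_bytes * heap_percent // 100
    return heap, limit_bytes - heap


# Heap capped at 75% of a 1 GiB limit (illustrative value).
heap, headroom = heap_budget(1 * GIB, 75)
print(heap // MIB, headroom // MIB)

# Heap equal to the limit, as in -Xmx1g with memory.max=1GiB: no headroom left.
full_heap, no_headroom = heap_budget(1 * GIB, 100)
print(full_heap // MIB, no_headroom // MIB)
```

With the heap set equal to the limit, every byte of metaspace, thread stack, and direct buffer pushes the cgroup toward memory.max, which is why the second case has nothing to spare.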
Example: a Go service can show a stable application heap in pprof while RSS grows from mmap, Cgo, or retained spans. Compare pprof with /proc/<pid>/smaps_rollup before blaming the garbage collector.
THP, NUMA, and Swap
Transparent Huge Pages can improve TLB efficiency for some workloads and hurt latency for others through compaction and allocation stalls. NUMA can make memory access slower when processes run far from their memory. Swap can preserve availability but create latency spikes when hot pages are swapped out.
Checks:
cat /sys/kernel/mm/transparent_hugepage/enabled
numastat
numastat -p <pid>
swapon --show
cat /proc/swaps
Treat THP and swap policy as workload-specific. Databases often need explicit guidance from vendor docs and measured testing.
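The THP `enabled` file lists every mode and marks the active one with square brackets. A tiny Python sketch for extracting it, useful when auditing many hosts; the function name `active_thp_mode` is an assumption for illustration:

```python
def active_thp_mode(text):
    """Return the bracketed (active) mode from the THP 'enabled' file content."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


print(active_thp_mode("always [madvise] never"))
```

On a live host, the input would come from `open("/sys/kernel/mm/transparent_hugepage/enabled").read()`.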
Leak Triage
- Confirm whether RSS is growing or only VSZ is large.
- Compare process memory, cgroup memory, and node memory.
- Split anonymous RSS from file-backed RSS.
- Check whether page cache, slab, or dirty writeback dominates.
- Review deployment changes, traffic changes, and batch jobs.
- Capture heap profiles when the application runtime supports them.
- If kernel memory grows, inspect slab caches and subsystem counters.
Useful commands:
pmap -x <pid> | tail -20
grep -E 'Rss|Pss|Private|Shared' /proc/<pid>/smaps_rollup
slabtop
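To confirm that RSS is actually growing, a few timestamped samples are enough for a first estimate. The Python sketch below uses only the first and last sample, which is crude and ignores noise in between; the function name `rss_growth_kb_per_hour` and the sample data are illustrative assumptions:

```python
def rss_growth_kb_per_hour(samples):
    """Estimate RSS growth from (unix_seconds, rss_kb) samples, oldest first."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (r1 - r0) * 3600 / (t1 - t0)


# Invented samples taken every 30 minutes.
samples = [(0, 500000), (1800, 520000), (3600, 560000)]
rate = rss_growth_kb_per_hour(samples)
print(rate)
```

A steady positive rate across several hours, with flat traffic, is stronger leak evidence than a single large RSS reading.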
Runbook
- Confirm whether the symptom is latency, allocation failure, swap storm, cgroup OOM, or node OOM.
- Save free -h, /proc/meminfo, PSI, vmstat, top RSS processes, and kernel OOM logs.
- Check cgroup memory files for affected services or containers.
- Identify dominant memory: anonymous RSS, page cache, slab, dirty pages, or swap.
- Apply the smallest mitigation: reduce concurrency, restart one leaking workload, raise a cgroup limit, disable a batch job, or shed traffic.
- After recovery, add alerts for PSI, cgroup oom_kill, swap activity, and sustained RSS growth.
Study Cards
Why is low free memory not automatically a Linux problem?
Linux uses otherwise idle memory for cache; MemAvailable and pressure signals are more useful.
What is the difference between RSS and VSZ?
RSS is resident physical memory; VSZ is virtual address space and can be large without real pressure.
Why can a container be OOMKilled on a healthy node?
The cgroup memory limit can be exhausted even when the host still has available memory.
What does memory PSI show?
How much time tasks are stalled because memory reclaim or allocation pressure blocks forward progress.