Linux Storage Health and Performance

Storage incidents often start as latency before they become hard failures. Operators need to read device saturation, kernel errors, SMART/NVMe health, queue behavior, filesystem symptoms, and application latency together.

Command Examples

iostat -xz 1
lsblk -D
smartctl -a /dev/sda
nvme smart-log /dev/nvme0
dmesg -T | grep -Ei 'I/O error|medium error|nvme|scsi|reset'
journalctl -k -p warning..alert

Key output fields and meaning:

Command	Key output fields	What it does
`iostat -xz 1`	Per-device `r/s`, `w/s`, `r_await`, `w_await`, `aqu-sz`, `%util`.	Prints extended device statistics every second and omits idle devices, so latency and queueing can be tied to a specific device.
`lsblk -D`	`DISC-ALN`, `DISC-GRAN`, `DISC-MAX`, `DISC-ZERO` for each block device.	Shows whether each layer of the block stack advertises discard (TRIM/UNMAP) support.
`smartctl -a /dev/sda`	Device identity (model, serial, firmware), overall-health result, attribute table, error and self-test logs.	Prints all SMART information so health symptoms can be tied to a specific physical drive.

Latency, Utilization, and Saturation

High storage latency is not the same as high throughput: a device can respond slowly while moving very little data. It may be saturated, retrying errors, waiting on firmware, throttled by the hypervisor, blocked behind a controller queue, or overloaded by sync writes.

Useful signals:

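await trends (r_await and w_await in iostat; the older svctm service-time column is unreliable and has been removed from recent sysstat releases),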
read/write IOPS and throughput,
queue depth,
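%util or equivalent saturation indicators (for NVMe, SSD, and RAID devices that serve requests in parallel, %util near 100 does not by itself prove saturation),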
filesystem read-only remounts,
application fsync latency,
PSI I/O pressure,
cgroup I/O throttling.
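
The signals above can be pulled out of tool output with small filters. The following is a minimal sketch, not a monitoring solution: the function names, the `AWAIT_MS` and `UTIL_PCT` variables, and their default thresholds are assumptions made for illustration. It looks up column positions from the iostat header because the column layout differs between sysstat versions.

```shell
# Flag devices whose await or %util exceeds a threshold.
# Reads `iostat -x` text on stdin. AWAIT_MS and UTIL_PCT are hypothetical
# tuning knobs; the defaults are illustrative, not recommendations.
flag_slow_devices() {
    awk -v await_ms="${AWAIT_MS:-20}" -v util_pct="${UTIL_PCT:-90}" '
        NF == 0 { split("", col); next }      # blank line ends a report block
        /^Device/ {
            # Remember which column holds each metric in this report.
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NF > 1 && ("%util" in col) {
            r = ("r_await" in col) ? $(col["r_await"]) : 0
            w = ("w_await" in col) ? $(col["w_await"]) : 0
            u = $(col["%util"])
            if (r + 0 > await_ms || w + 0 > await_ms || u + 0 > util_pct)
                printf "%s r_await=%s w_await=%s util=%s\n", $1, r, w, u
        }
    '
}

# Print the 10-second "some" average from a PSI file.
# Line format: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
psi_some_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "${1:-/proc/pressure/io}"
}
```

A possible invocation is `LC_ALL=C iostat -xyz 1 3 | flag_slow_devices`, where `LC_ALL=C` keeps decimal points parseable and `-y` skips the since-boot report. `psi_some_avg10` with no argument reads `/proc/pressure/io`, which exists only on kernels with PSI enabled.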

SMART and NVMe Health

SMART and NVMe health data can expose media errors, wear, temperature, unsafe shutdowns, controller resets, and lifetime indicators. The exact fields vary by device and vendor, so treat them as signals, not a universal pass/fail oracle.

Operational rules:

collect health data before and after incidents,
alert on growing media errors or critical warnings,
know how your cloud provider exposes disk health,
replace suspect devices before redundancy is exhausted,
correlate kernel errors with physical slot, serial, or by-id path.
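
Correlating a kernel name such as sda with a stable identifier can be sketched as below. This is an illustrative helper, assuming udev populates /dev/disk/by-id on the host; the function name `persistent_names` and its optional directory argument are hypothetical.

```shell
# List every persistent name that resolves to the given device node.
# The optional second argument (default /dev/disk/by-id) lets the same
# function be pointed at /dev/disk/by-path or at a test directory.
persistent_names() {   # usage: persistent_names DEVICE [LINK_DIR]
    dev=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            basename "$link"
        fi
    done
}
```

For example, `persistent_names /dev/sda` would print the by-id names (which typically embed model and serial) for the device the kernel currently calls sda. Recording that output alongside the kernel log makes it easier to pull the right drive later.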

SMART and NVMe Interpretation Gallery

Evidence	Why It Matters	Action
SMART `Reallocated_Sector_Ct` or `Current_Pending_Sector` rising	Media is remapping or waiting to remap unreadable sectors.	Replace the disk before RAID or replicas are exhausted.
SMART `UDMA_CRC_Error_Count` rising	Often cable, backplane, controller, or signal integrity rather than platter/media failure.	Reseat or replace cable/backplane path and correlate with kernel link resets.
SMART `Power_On_Hours` high but no errors	Age alone is not a failure, but it changes risk and maintenance planning.	Watch trend data and replace by fleet policy.
SMART overall-health `PASSED` with kernel I/O errors	SMART summary is too coarse; the OS is already seeing failures.	Trust kernel errors and detailed attributes over the summary.
NVMe `critical_warning` nonzero	Controller reports a critical health condition such as spare, temperature, reliability, or read-only risk.	Treat as urgent and inspect detailed log.
NVMe `percentage_used` over 100	Device has exceeded vendor endurance estimate; not always immediate failure.	Plan replacement and reduce write amplification.
NVMe media/data integrity errors increasing	Controller is reporting unrecovered data integrity issues.	Replace or evacuate; verify application and filesystem consistency.
NVMe unsafe shutdowns rising	Power loss or reset path may be unhealthy and can explain journal recovery.	Check power, firmware, host resets, and platform events.
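
Several rows above depend on a value rising rather than on its absolute level, which requires comparing snapshots. The sketch below compares two saved `smartctl -A` outputs. It assumes the classic ATA attribute table in which RAW_VALUE is the tenth column and the raw value is a plain integer; the function names are hypothetical, and drives that do not report an attribute are skipped.

```shell
# Print the RAW_VALUE of one attribute from saved `smartctl -A` output.
smart_raw() {   # usage: smart_raw FILE ATTRIBUTE_NAME
    awk -v name="$2" '$2 == name { print $10; exit }' "$1"
}

# Report watched attributes whose raw value grew between two snapshots.
smart_rising() {   # usage: smart_rising BEFORE_FILE AFTER_FILE
    for attr in Reallocated_Sector_Ct Current_Pending_Sector UDMA_CRC_Error_Count; do
        old=$(smart_raw "$1" "$attr")
        new=$(smart_raw "$2" "$attr")
        # Skip attributes that are missing from either snapshot.
        [ -n "$old" ] && [ -n "$new" ] || continue
        if [ "$new" -gt "$old" ]; then
            echo "$attr: $old -> $new"
        fi
    done
}
```

A typical use would be to save `smartctl -A /dev/sdX` to a file on a schedule and run `smart_rising` on the two most recent files. Any output is a reason to look at the drive and its cabling more closely.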

Useful captures:

smartctl -a /dev/sdX
smartctl -x /dev/sdX
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0
journalctl -k -g 'I/O error|reset|timeout|nvme|scsi|blk_update_request'
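
The NVMe rows of the gallery can be checked mechanically against a captured smart-log. The filter below is a rough sketch: it assumes the human-readable `key : value` layout with the field names `critical_warning`, `percentage_used`, and `media_errors`, which may differ between nvme-cli versions, and the function name `nvme_findings` is hypothetical. Where available, the tool's JSON output is likely a more stable thing to parse.

```shell
# Read `nvme smart-log` text on stdin and print any concerning fields.
# Prints nothing when none of the three checks trips.
nvme_findings() {
    awk -F: '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the field name
            gsub(/[ \t%,]/, "", val)            # drop padding, "%" and separators
        }
        # Compare as text so a hex value such as 0x4 still counts as nonzero.
        key == "critical_warning" && val != "" && val != "0" && val != "0x0" {
            print "critical_warning=" val
        }
        key == "percentage_used" && val + 0 > 100 { print "percentage_used=" val }
        key == "media_errors"    && val + 0 > 0   { print "media_errors=" val }
    '
}
```

One way to use it is `LC_ALL=C nvme smart-log /dev/nvme0 | nvme_findings`. Any line of output corresponds to a gallery row that calls for action.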

Kernel I/O Errors

Kernel logs are often the first place storage failure appears. Look for I/O errors, resets, timeouts, medium errors, filesystem aborts, ext4 or XFS warnings, NVMe controller resets, SCSI sense data, and read-only remounts.

Do not immediately run filesystem repair when kernel logs show device errors. Stabilize or replace the lower layer first.
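
That rule can be turned into a simple pre-check before any repair tool is run. The sketch below is an assumption-laden illustration: the function name `lower_layer_clean` is hypothetical, and the pattern list reuses the broad terms from the commands earlier in this section, so it will produce false positives and is not an exhaustive catalogue of device errors.

```shell
# Succeed only if the kernel log on stdin shows no device-level errors.
# The pattern is deliberately broad; review matches by hand before acting.
lower_layer_clean() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    hits=$(grep -Eic 'I/O error|medium error|timeout|reset' || true)
    if [ "$hits" -gt 0 ]; then
        echo "device-level errors in kernel log: $hits (fix the device before fsck)" >&2
        return 1
    fi
}
```

For example, `journalctl -k -b --no-pager | lower_layer_clean && echo "no device errors logged this boot"` prints the confirmation only when the current boot's kernel log contains none of the patterns.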

Discard and Thin Provisioning

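lsblk -D shows each block device's discard granularity (DISC-GRAN) and maximum discard size (DISC-MAX); a value of 0 in both means the device does not advertise discard. fstrim can return unused blocks to SSDs, thin LUNs, and virtual disks, but only if every layer passes discards through: filesystem, dm-crypt, LVM, multipath, hypervisor, and storage backend.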
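
A quick way to see which layers advertise discard is to label each device from the lsblk discard columns. This is a minimal sketch with a hypothetical function name, `discard_report`. It expects rows of name, granularity, and maximum in bytes, and it only reports what each block device advertises, not whether the backend actually reclaims space.

```shell
# Read "NAME DISC-GRAN DISC-MAX" rows (byte values) and label each device.
# A granularity or maximum of 0 is treated as "no discard advertised".
discard_report() {
    while read -r name gran max; do
        [ -n "$name" ] || continue
        if [ "$gran" -gt 0 ] && [ "$max" -gt 0 ]; then
            echo "$name discard=yes"
        else
            echo "$name discard=no"
        fi
    done
}
```

A possible invocation is `lsblk -nlb -o NAME,DISC-GRAN,DISC-MAX | discard_report`. If a lower device reports `discard=yes` but a dm-crypt or LVM device stacked on it reports `discard=no`, that layer is where discards stop.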

Failure Response

Stop unnecessary writes.
Capture lsblk, dmesg, journalctl -k, iostat, and health logs.
Identify the physical or virtual device by serial, by-id path, or cloud disk ID.
Check RAID, multipath, LVM, and filesystem state.
Decide whether redundancy is intact.
Replace or detach the failing layer before repair.
Verify backups and restore paths.
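
The capture step is easier to carry out under pressure if it is scripted in advance. The following sketch shows one way to do that. The function names, the output file names, and the choice of lsblk columns are assumptions for illustration, and per-device health logs (smartctl, nvme) would still need to be added for the devices involved.

```shell
# Run one command and save stdout and stderr to OUTDIR/NAME.txt.
# A missing or failing tool is recorded instead of aborting the capture.
capture() {   # usage: capture OUTDIR NAME COMMAND [ARGS...]
    outdir=$1; name=$2; shift 2
    mkdir -p "$outdir" || return 1
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$outdir/$name.txt" 2>&1 || echo "exit status $?" >> "$outdir/$name.txt"
    else
        echo "$1: not installed" > "$outdir/$name.txt"
    fi
}

# Capture the evidence named in the steps above into one directory.
capture_all() {   # usage: capture_all OUTDIR
    capture "$1" lsblk     lsblk -o NAME,SIZE,TYPE,SERIAL,MOUNTPOINT
    capture "$1" dmesg     dmesg -T
    capture "$1" journal-k journalctl -k -b --no-pager
    capture "$1" iostat    iostat -xz 1 3
}
```

Writing the output to a different, healthy filesystem (for example `capture_all /mnt/usb/incident`, a hypothetical path) is consistent with the first step of stopping unnecessary writes to the affected device.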

Study Cards

Question

Why is high storage latency different from high throughput?

Answer

Latency can come from retries, queueing, firmware stalls, sync writes, throttling, or errors even when throughput is low.

Question

Why collect serial or by-id paths during failures?

Answer

They map kernel errors to the physical or virtual device that must be replaced or inspected.

Question

Why stabilize the block layer before fsck?

Answer

Filesystem repair on failing storage can worsen corruption and destroy recoverable data.
