Linux GPU Drivers

Linux GPU incidents cross several layers at once: PCIe enumeration, kernel modules, firmware, display management, compute runtimes, device-node permissions, userspace libraries, containers, and application frameworks. Do not treat “the driver” as one file. A working GPU stack is a matched set of kernel and userspace pieces.

Command Examples

lspci -nnk | grep -A4 -E 'VGA|3D|Display'
lsmod | grep -E 'amdgpu|radeon|nvidia|nouveau'
ls -l /dev/dri /dev/nvidia* 2>/dev/null
dmesg -T | grep -Ei 'drm|amdgpu|nvidia|nouveau|xid|firmware'
cat /proc/driver/nvidia/version 2>/dev/null
nvidia-smi 2>/dev/null

What each command shows:

Command	Output contains	What it confirms
`lspci -nnk | grep -A4 -E 'VGA|3D|Display'`	GPU PCI IDs and bound kernel driver names.	Hardware presence and driver binding.
`lsmod | grep -E 'amdgpu|radeon|nvidia|nouveau'`	Loaded GPU modules such as nvidia, amdgpu, radeon, or nouveau.	Which kernel driver stack is active.
`ls -l /dev/dri /dev/nvidia* 2>/dev/null`	Device node names, owners, groups, permissions, and major/minor numbers.	The expected device nodes exist and the intended users can open them.

These commands answer the first operational questions: is the PCI device visible, which kernel module is bound to it, which device nodes exist, did firmware or reset errors appear in the kernel log, and can NVIDIA userspace talk to the loaded NVIDIA kernel driver?
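The first two questions (is the device visible, and which module is bound to it) can be answered in one pass over `lspci -nnk` output. The following is a minimal sketch rather than a definitive tool. `gpu_bound_drivers` is a hypothetical helper name, and the parsing assumes the usual `lspci -nnk` layout: an unindented header line per device starting with the bus address, followed by indented detail lines including `Kernel driver in use:`.

```shell
# Hypothetical helper: reads `lspci -nnk` output on stdin and prints
# "<bus address> <bound driver>" for each GPU-class device.
gpu_bound_drivers() {
  awk '
    /^[0-9a-f]/ {
      # A device header line starts with the bus address; flush the previous GPU.
      if (gpu) print addr, (drv == "" ? "(none)" : drv)
      gpu = ($0 ~ /VGA compatible controller|3D controller|Display controller/)
      addr = $1
      drv = ""
    }
    gpu && /Kernel driver in use:/ { drv = $NF }
    END { if (gpu) print addr, (drv == "" ? "(none)" : drv) }
  '
}

# Usage on a live host:
#   lspci -nnk | gpu_bound_drivers
```

A GPU reported with `(none)` is enumerated on the bus but has no kernel driver bound, which points at the kernel module or Secure Boot layer rather than at hardware.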

Stack Model

Layer	AMD Examples	NVIDIA Examples	What Breaks
PCIe and platform	PCI bus, BARs, IOMMU, power, firmware	PCI bus, BARs, IOMMU, power, firmware	Device not enumerated, AER errors, reset failures, wrong NUMA path.
Kernel module	`amdgpu`, older `radeon` for some legacy GPUs	`nvidia`, `nvidia_uvm`, `nvidia_modeset`, `nvidia_drm`; `nouveau` as open driver	Module not loaded, wrong module bound, Secure Boot rejection, DKMS build failure.
Firmware and microcode	`linux-firmware`, PSP/SMU/display firmware	GSP firmware on supported devices, driver-bundled components	Missing firmware, failed initialization, feature unavailable.
Kernel graphics API	DRM, KMS, GEM/TTM, render nodes	NVIDIA kernel/user APIs plus optional DRM KMS integration	No `/dev/dri`, display fails, compositor cannot modeset.
Userspace graphics	Mesa RadeonSI for OpenGL, RADV for Vulkan	NVIDIA OpenGL/Vulkan userspace libraries	Wrong library path, mixed vendor libraries, app uses software renderer.
Compute runtime	ROCm/HIP, OpenCL stacks	CUDA, NVML, NCCL, container runtime	Runtime cannot see GPU, unsupported GPU, version mismatch.
Containers	`/dev/dri/renderD*`, ROCm userspace, group permissions	NVIDIA Container Toolkit, driver-mounted libraries, CUDA images	Devices or libraries missing inside container.

The kernel module and userspace libraries must match each other and the workload. A desktop can render through the display stack while compute fails because ROCm/CUDA support, permissions, or container hooks are wrong. Conversely, a headless compute node can run CUDA with no graphical session at all.

Device Nodes and Permissions

Common device paths:

Path	Meaning
`/dev/dri/card*`	DRM primary nodes, often used by display servers and privileged graphics management.
`/dev/dri/renderD*`	DRM render nodes for unprivileged rendering or compute userspace.
`/dev/kfd`	AMD kernel fusion driver node used by ROCm/HSA workloads.
`/dev/nvidia*`	NVIDIA driver devices for control, memory, UVM, and GPU access.

Permissions matter. On many systems users need membership in groups such as render or video, depending on distro rules. For containers, passing only the GPU device node may not be enough; the matching userspace libraries and runtime hooks must also be present.
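A quick way to turn the permission question into a yes/no answer per node is sketched below. `check_node_access` is a hypothetical helper name. It only tests the current user's read and write permission bits, so it will always say `ok` for root and it does not see container device-cgroup restrictions.

```shell
# Hypothetical helper: reports whether the current user can open each
# given device node for reading and writing.
check_node_access() {
  local node
  for node in "$@"; do
    if [ ! -e "$node" ]; then
      echo "$node missing"
    elif [ -r "$node" ] && [ -w "$node" ]; then
      echo "$node ok"
    else
      echo "$node denied"
    fi
  done
}

# Usage on a live host (unmatched globs are reported as missing):
#   check_node_access /dev/dri/renderD* /dev/kfd /dev/nvidia*
```

Run it as the account that actually executes the workload, not as root, since that is the identity whose group membership matters.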

AMD GPU Stack

The upstream Linux kernel amdgpu DRM driver supports AMD Radeon GPUs based on GCN, RDNA, and CDNA architectures. It covers kernel responsibilities such as device initialization, memory management, scheduling rings, interrupts, display, power management, firmware interfaces, reset handling, and sysfs/debugfs exposure.

AMD graphics userspace is commonly Mesa:

Component	Role
`amdgpu`	Kernel driver for supported AMD GPUs.
`linux-firmware`	Firmware blobs needed by many GPUs for display, power, security, and compute microcontrollers.
RadeonSI	Mesa OpenGL driver for AMD GPUs.
RADV	Mesa Vulkan driver for AMD GCN/RDNA GPUs.
ROCm/HIP	AMD compute platform and programming stack for supported GPUs and operating systems.
`rocm-smi` / `rocminfo`	ROCm-oriented status and discovery tools.

Important nuance: some older GCN generations may still bind to the older radeon kernel driver by default on some distributions. Mesa’s RADV documentation notes that older GFX6-7 GPUs may require explicitly enabling amdgpu and disabling radeon through kernel parameters before RADV works. Do not cargo-cult those parameters onto modern systems; first confirm the PCI ID, generation, current bound driver, and distro defaults.
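Before adding or removing such parameters, it helps to see whether any are already set on the running kernel. The sketch below assumes the overrides use the commonly documented names `radeon.si_support`, `amdgpu.si_support`, `radeon.cik_support`, and `amdgpu.cik_support` (GFX6 is Southern Islands, GFX7 is Sea Islands); confirm the names against `modinfo` on your kernel. `legacy_amd_overrides` is a hypothetical helper name.

```shell
# Hypothetical helper: reads a kernel command line on stdin and prints any
# radeon/amdgpu Southern Islands (si) or Sea Islands (cik) support overrides.
legacy_amd_overrides() {
  tr ' ' '\n' | grep -E '^(radeon|amdgpu)\.(si|cik)_support='
}

# Usage on a live host (no output means no override is set):
#   legacy_amd_overrides < /proc/cmdline
```

An override found on a system with a modern RDNA or CDNA GPU is a hint that the parameters were copied from an unrelated guide.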

AMD checks:

lspci -nnk | grep -A4 -E 'VGA|3D|Display'
modinfo amdgpu | head
grep -H . /sys/module/amdgpu/parameters/* 2>/dev/null
ls -l /dev/dri /dev/kfd 2>/dev/null
find -L /sys/class/drm -maxdepth 3 -type f \( -name '*busy*' -o -name '*mem*' \) 2>/dev/null
dmesg -T | grep -i -E 'amdgpu|firmware|gpu reset|ring|psp|smu'

ROCm checks:

rocminfo
rocm-smi
groups
ls -l /dev/kfd /dev/dri/renderD* 2>/dev/null

For ROCm, always check AMD’s current support matrix for the exact GPU, OS, kernel, and ROCm version. A GPU can work well for Mesa graphics while still being unsupported or partially supported for ROCm compute. AMD’s current ROCm Linux installation docs also emphasize package-manager based installation paths, so avoid stale installer instructions unless the current vendor doc for your product explicitly says otherwise.
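The `groups` check in the ROCm list can be made explicit by comparing the user's groups against the ones the device nodes require. This is a minimal sketch: `missing_gpu_groups` is a hypothetical helper name, and `render` and `video` are only the typical group names; check the actual group owner of `/dev/kfd` and `/dev/dri/renderD*` on your distribution.

```shell
# Hypothetical helper: given a space-separated group list and one or more
# required group names, prints the required groups that are absent.
missing_gpu_groups() {
  local have g
  have=" $1 "
  shift
  for g in "$@"; do
    case "$have" in
      *" $g "*) ;;          # already a member
      *) echo "$g" ;;       # not in the list
    esac
  done
}

# Usage on a live host; which groups matter depends on distro udev rules:
#   missing_gpu_groups "$(id -nG)" render video
```

Group changes made with `usermod` only apply to new login sessions, so a correct `/etc/group` entry can still produce a missing group here until the user logs in again.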

NVIDIA GPU Stack

NVIDIA’s production Linux stack commonly pairs NVIDIA kernel modules with NVIDIA userspace libraries. On Ubuntu, driver packaging distinguishes Unified Driver Architecture (UDA) drivers, intended for generic desktop use, from Enterprise Ready Driver (ERD) packages, whose names commonly include -server and which are recommended for server and compute tasks. Ubuntu also packages open kernel module variants, marked by -open in the package name, for supported hardware.

Core NVIDIA pieces:

Component	Role
`nvidia`	Main NVIDIA kernel module.
`nvidia_uvm`	Unified Virtual Memory support used by CUDA workloads.
`nvidia_modeset` / `nvidia_drm`	Display and DRM integration pieces when used.
`nvidia-smi`	CLI for monitoring and management through NVML.
NVML	C library API behind many stable monitoring integrations.
CUDA libraries	Userspace compute runtime and libraries.
Fabric Manager	Required for some NVSwitch/NVLink fabric environments.
NVIDIA Container Toolkit	Makes host NVIDIA GPUs and driver libraries available to containers.

NVIDIA notes:

nvidia-smi text output is useful for humans, but NVIDIA documents NVML as the better target for tools that must survive driver-release changes.
Use GPU UUID or PCI bus ID in automation; nvidia-smi warns that natural GPU enumeration order is not guaranteed across reboots.
Persistence mode keeps the driver initialized even when no clients are attached, which avoids repeated driver load and teardown between jobs. On Linux, nvidia-persistenced is the daemon-based approach.
Some nvidia-smi settings require root and do not persist across reboots; others, such as ECC mode, are persistent or take effect only after a reboot, depending on the setting.
GPU reset is not guaranteed to work in every case. NVIDIA explicitly recommends verifying GPU health afterward and power cycling the node if the device is not healthy.
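The UUID and PCI bus ID advice above can be sketched as a small lookup that resolves a stable PCI bus ID to a GPU UUID instead of relying on the index. `gpu_uuid_for_bus` is a hypothetical helper name; the input format assumed is the comma-and-space separated rows that `nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader` prints.

```shell
# Hypothetical helper: reads "index, uuid, pci.bus_id" CSV rows on stdin and
# prints the UUID of the GPU at the given PCI bus ID (case-insensitive).
gpu_uuid_for_bus() {
  awk -F', ' -v bus="$1" 'toupper($3) == toupper(bus) { print $2 }'
}

# Usage on a live host (the bus ID shown is a placeholder):
#   nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader \
#     | gpu_uuid_for_bus 00000000:01:00.0
```

The resulting UUID can then be used wherever the tooling accepts one, for example as the argument to `nvidia-smi -i`, so that scripts keep pointing at the same physical card after a reboot.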

NVIDIA checks:

nvidia-smi
nvidia-smi -L
nvidia-smi --query-gpu=uuid,pci.bus_id,driver_version,persistence_mode,memory.used,memory.total,temperature.gpu,power.draw --format=csv
nvidia-smi topo -m
cat /proc/driver/nvidia/version
lsmod | grep -E 'nvidia|nouveau'
dmesg -T | grep -i -E 'nvidia|xid|nouveau|NVRM'
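One use of `/proc/driver/nvidia/version` is to compare the loaded kernel module against the module installed on disk, since a driver package upgrade without a reboot or module reload leaves the two out of step. The sketch below assumes the file contains an `NVRM version:` line whose first dotted number is the module version. `nvrm_version` is a hypothetical helper name.

```shell
# Hypothetical helper: reads /proc/driver/nvidia/version text on stdin and
# prints the first dotted version number on the "NVRM version" line
# (assumed to be the loaded kernel module version).
nvrm_version() {
  grep -m1 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Usage on a live host: compare the loaded module with the one on disk.
#   loaded=$(nvrm_version < /proc/driver/nvidia/version)
#   ondisk=$(modinfo -F version nvidia 2>/dev/null)
#   [ "$loaded" = "$ondisk" ] && echo "match: $loaded" || echo "loaded=$loaded ondisk=$ondisk"
```

A difference between the two values is worth resolving before investigating CUDA or container problems, because userspace libraries are normally upgraded together with the on-disk module.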

Ubuntu NVIDIA Packaging and Secure Boot

On Ubuntu, prefer ubuntu-drivers or Ubuntu/NVIDIA apt packages over runfile installers for routine server operations. Ubuntu’s documentation warns that drivers installed from sources outside the guide can overwrite packaged drivers and may break Secure Boot.

Useful commands:

sudo ubuntu-drivers list --gpgpu
sudo ubuntu-drivers install --gpgpu
apt-cache policy 'nvidia-driver-*'
apt-cache policy "linux-modules-nvidia-*-$(uname -r)"
apt-cache policy linux-headers-$(uname -r)
mokutil --sb-state

Secure Boot is a common failure boundary:

Prebuilt Ubuntu linux-modules-nvidia-* packages are signed for supported kernels.
DKMS-built modules need headers and signing/enrollment when Secure Boot is enabled.
A driver install can appear successful while the kernel refuses to load an unsigned module.
Kernel upgrades can strand a host if matching NVIDIA modules are not available for the running kernel ABI.
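The "install succeeded but the module will not load" case can be spotted early by combining the Secure Boot state with the module's signer field. This is a sketch under assumptions: `secure_boot_verdict` is a hypothetical helper name, and it relies on `mokutil --sb-state` printing `SecureBoot enabled` or `SecureBoot disabled`. A non-empty signer only shows that the module carries a signature, not that the signing key is enrolled on this machine.

```shell
# Hypothetical helper: combines `mokutil --sb-state` text ($1) with a module
# signer string ($2) into a one-line verdict.
secure_boot_verdict() {
  case "$1" in
    *"SecureBoot enabled"*)
      if [ -n "$2" ]; then
        echo "Secure Boot on; module signed by: $2"
      else
        echo "Secure Boot on; module unsigned and likely to be rejected"
      fi
      ;;
    *"SecureBoot disabled"*)
      echo "Secure Boot off; module signature not required"
      ;;
    *)
      echo "Secure Boot state unknown"
      ;;
  esac
}

# Usage on a live host:
#   secure_boot_verdict "$(mokutil --sb-state 2>/dev/null)" \
#                       "$(modinfo -F signer nvidia 2>/dev/null)"
```

If the verdict reports a signer but the module still fails to load, the kernel log and the enrolled key list (`mokutil --list-enrolled`) are the next places to look.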

Containers and Kubernetes

GPU containers are not just normal containers with a device file added.

For NVIDIA, the NVIDIA Container Toolkit wires host driver libraries and devices into containers. Kubernetes clusters usually layer a device plugin or GPU Operator on top, but the host still needs a working kernel driver and runtime integration.

For AMD, containers often need /dev/kfd, /dev/dri/renderD*, matching group permissions, and ROCm userspace compatible with the host kernel/driver stack. Kubernetes deployments additionally need correct node labels, a working device plugin, and images built for the expected ROCm version.

Container debugging split:

Does the host see the GPU?
Does the host runtime tool work (nvidia-smi, rocminfo, rocm-smi)?
Does the container have the device nodes?
Does the container have compatible userspace libraries?
Does the orchestrator expose the GPU resource and schedule onto the right node?
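The third question in the split above (does the container have the device nodes) can be checked by diffing a host listing against a container listing. The following is a minimal sketch: `missing_in_container` is a hypothetical helper name, the file names are placeholders, and `gpu-test` stands in for whatever container is under test.

```shell
# Hypothetical helper: prints lines present in the host listing file ($1)
# but absent from the container listing file ($2).
missing_in_container() {
  grep -Fxv -f "$2" "$1"
}

# Usage sketch; "gpu-test" is a placeholder container name:
#   ls -d /dev/nvidia* /dev/dri/* /dev/kfd 2>/dev/null > host.txt
#   docker exec gpu-test sh -c 'ls -d /dev/nvidia* /dev/dri/* /dev/kfd 2>/dev/null' > ctr.txt
#   missing_in_container host.txt ctr.txt
```

Not every host node has to appear in the container, so treat the output as a list of candidates to explain rather than as a list of errors.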

Failure Modes

Symptom	Likely Boundary	Checks
GPU absent from `lspci`	Hardware, firmware, BIOS, PCIe, power, passthrough	BIOS/firmware, BMC inventory, PCIe slot, IOMMU, host logs.
PCI device visible but no driver	Kernel module, Secure Boot, unsupported ID	`lspci -nnk`, `modprobe`, `dmesg`, `mokutil --sb-state`.
Driver loaded but tool fails	Userspace/kernel mismatch or permissions	`cat /proc/driver/nvidia/version`, package versions, `/dev` permissions.
AMD graphics works but ROCm fails	ROCm support matrix, `/dev/kfd`, groups, userspace	`rocminfo`, `rocm-smi`, supported GPU/OS/kernel matrix.
NVIDIA CUDA fails but display works	`nvidia_uvm`, CUDA library mismatch, container runtime	`lsmod`, `nvidia-smi`, CUDA sample, container runtime config.
Xid or GPU reset messages	NVIDIA device, driver, power, thermal, PCIe, app workload	`dmesg`, `nvidia-smi -q`, power/thermal history, workload logs.
`amdgpu` ring timeout or reset	AMD kernel/display/firmware/runtime boundary	`dmesg`, firmware package, kernel version, workload trigger.
GPU numbering changes	Enumeration order changed	Use UUID or PCI bus ID, not index.
Host works but container fails	Device/runtime/library exposure	Container device list, runtime hooks, library paths, group IDs.
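For the Xid row in the table above, counting events per device and code makes it easier to tell a single incident from a recurring fault. The sketch below assumes kernel log lines of the form `NVRM: Xid (PCI:<address>): <code>, ...`; verify that against your own logs, since the exact text can vary by driver release. `xid_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints one line per
# (PCI device, Xid code) pair with the number of occurrences.
xid_summary() {
  sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p' \
    | sort | uniq -c \
    | awk '{ print $2, "xid=" $3, "count=" $1 }'
}

# Usage on a live host:
#   dmesg -T | xid_summary
```

The meaning of each code is defined in NVIDIA's Xid documentation; the summary only tells you which device and which code to look up first.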

Troubleshooting Flow

Identify the exact GPU and PCI address with lspci -nnk.
Confirm the intended kernel module is loaded and bound.
Check Secure Boot, DKMS status, kernel headers, and package versions.
Read kernel logs for firmware, reset, Xid, ring, or BAR/IOMMU messages.
Confirm device nodes and user permissions.
Confirm userspace libraries match the driver branch and workload runtime.
For compute, validate vendor tools before testing frameworks.
For containers, compare host success with container device and library visibility.
Use vendor support matrices before upgrading kernels, ROCm, CUDA, or driver branches.
Preserve logs before rebooting: a reboot clears the kernel ring buffer, and after a failed GPU reset that buffer often holds the most useful evidence.
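The last step of the flow can be sketched as a small capture routine to run before any reboot. `preserve_gpu_logs` is a hypothetical helper name, and the default destination `/var/tmp` and the file names are assumptions; pick a location that survives the reboot on your systems. Tools that are not installed are skipped.

```shell
# Hypothetical helper: saves kernel and GPU state into a timestamped
# directory under $1 (default /var/tmp) and prints the directory path.
preserve_gpu_logs() {
  local dest
  dest="${1:-/var/tmp}/gpu-logs-$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$dest" || return 1
  # Kernel ring buffer; errors (for example, missing privileges) are kept in the file.
  dmesg -T > "$dest/dmesg.txt" 2>&1
  if command -v lspci >/dev/null 2>&1; then
    lspci -nnk > "$dest/lspci.txt" 2>&1
  fi
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q > "$dest/nvidia-smi-q.txt" 2>&1
  fi
  echo "$dest"
}

# Usage on a live host:
#   sudo sh -c '. ./gpu-logs.sh; preserve_gpu_logs /var/tmp'
```

The `gpu-logs.sh` file name in the usage line is a placeholder for wherever the function is saved. Running it with root privileges matters on systems that restrict `dmesg` to privileged users.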

Study Cards

Question

Why is a Linux GPU driver not just one component?

Answer

A working GPU stack includes PCIe/platform state, kernel modules, firmware, device nodes, userspace graphics or compute libraries, and sometimes container runtime hooks.

Question

What is the difference between /dev/dri/card* and /dev/dri/renderD*?

Answer

Primary DRM nodes are often used by display and management paths, while render nodes allow unprivileged graphics or compute userspace to submit work.

Question

What does amdgpu provide?

Answer

It is the upstream Linux DRM kernel driver for AMD GCN, RDNA, and CDNA GPUs.

Question

Why can AMD graphics work while ROCm fails?

Answer

Mesa graphics support and ROCm compute support have different userspace stacks and support matrices.

Question

Why should NVIDIA automation use UUID or PCI bus ID?

Answer

Natural GPU index ordering is not guaranteed to stay consistent across reboots.

Question

Why is nvidia-smi text output a weak automation API?

Answer

NVIDIA documents NVML as the more stable API for tools that must work across driver releases.
