Tech Study Guide
Ceph Operations and Recovery
Ceph health triage, OSD failure response, recovery and backfill, scrub repair, maintenance flags, full ratios, and safe operational runbooks.
Ceph recovery work should preserve data first, then restore redundancy, then restore performance. The fastest-looking action is not always the safest one when PGs are degraded, OSDs are near full, or the cluster is already moving data.
flowchart LR
A[Detect health warning] --> B[Capture ceph -s and health detail]
B --> C[Identify failed domain]
C --> D[Preserve quorum and placement]
D --> E[Choose repair or replace]
E --> F[Watch recovery and backfill]
F --> G[Clear temporary flags]
First Health Pass
ceph -s
ceph health detail
ceph versions
ceph mon stat
ceph mgr stat
ceph osd stat
ceph osd tree
ceph osd df tree
ceph pg stat
ceph pg dump_stuck
Read ceph health detail before taking action. HEALTH_WARN can mean an expected maintenance flag, but it can also mean clock skew, full OSDs, inactive PGs, scrub errors, or daemons that are down.
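As a rough illustration of reading health detail before acting, the sketch below ranks health checks by urgency. It assumes the checks were loaded from JSON-formatted health output shaped like `{"checks": {CODE: {"summary": {"message": ...}}}}`; that shape, the `URGENCY` tiers, and the `triage` helper are illustrative assumptions, not an official ranking.

```python
# Hypothetical urgency tiers: data availability first, then redundancy,
# then everything else. Adjust to your own runbook.
URGENCY = {
    "PG_AVAILABILITY": 0,   # inactive PGs, client I/O may be blocked
    "OSD_FULL": 0,          # writes can fail
    "PG_DAMAGED": 0,        # scrub found inconsistent data
    "PG_DEGRADED": 1,       # redundancy is reduced
    "OSD_DOWN": 1,
    "OSD_BACKFILLFULL": 1,
    "OSD_NEARFULL": 2,
    "MON_CLOCK_SKEW": 2,
    "OSD_SCRUB_ERRORS": 2,
    "OSDMAP_FLAGS": 3,      # often an expected maintenance flag
}
DEFAULT_TIER = 2  # unknown checks land in the middle


def triage(health):
    """Return (tier, check_code, message) tuples, most urgent first."""
    rows = []
    for code, body in health.get("checks", {}).items():
        message = body.get("summary", {}).get("message", "")
        rows.append((URGENCY.get(code, DEFAULT_TIER), code, message))
    return sorted(rows)


# Made-up sample data for illustration only.
sample = {
    "status": "HEALTH_WARN",
    "checks": {
        "OSDMAP_FLAGS": {"summary": {"message": "noout flag(s) set"}},
        "PG_DEGRADED": {"summary": {"message": "Degraded data redundancy"}},
        "PG_AVAILABILITY": {"summary": {"message": "Reduced data availability"}},
    },
}

for tier, code, message in triage(sample):
    print(tier, code, message)
```

The ordering mirrors the guide's priority: an expected `noout` flag sorts last, while inactive PGs sort first even though all three appear under the same HEALTH_WARN status.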
OSD Failure Response
- Identify the OSD, host, device path, and by-id serial.
- Check whether the OSD is down, out, full, slow, or flapping.
- Confirm whether PGs are degraded, undersized, inactive, or backfilling.
- Verify backups or replicas for critical clients before risky repair.
- Replace failed hardware or redeploy the OSD through the orchestrator.
- Watch recovery until PGs return to active+clean.
ceph osd find <osd-id>
ceph osd metadata <osd-id>
ceph device ls
ceph device info <devid>
ceph orch device ls
ceph orch daemon restart osd.<id>
ceph orch osd rm <id>
Avoid repeatedly marking OSDs in and out while the root cause is unknown. Flapping can create extra peering and recovery churn.
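One way to make "flapping" concrete is to count how often an OSD went from up to down in a recent window. The sketch below assumes the operator has already collected `(timestamp_seconds, state)` events from logs or monitoring; the event format and the `count_flaps` helper are hypothetical.

```python
def count_flaps(events, window_seconds, now):
    """Count up->down transitions that happened within the last window.

    events: list of (timestamp_seconds, state) tuples sorted by time,
            where state is "up" or "down" (hypothetical format).
    """
    cutoff = now - window_seconds
    flaps = 0
    previous = None
    for timestamp, state in events:
        if previous == "up" and state == "down" and timestamp >= cutoff:
            flaps += 1
        previous = state
    return flaps


# Made-up history for one OSD.
events = [
    (0, "up"), (100, "down"), (130, "up"),
    (400, "down"), (420, "up"), (500, "down"),
]
print(count_flaps(events, window_seconds=300, now=600))  # → 2
```

A count above a threshold of your choosing is a signal to stop toggling the OSD and look for the hardware or network cause instead.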
Recovery and Backfill
Recovery restores missing replicas or chunks after failures. Backfill moves data to satisfy the current CRUSH placement after topology or weight changes. Both consume disk, CPU, and network.
Key questions:
- Is recovery making progress or stuck?
- Is a full or nearfull OSD blocking backfill?
- Is recovery throttled intentionally?
- Are clients suffering because recovery is too aggressive?
- Is the cluster at risk because recovery is too slow?
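The first question, whether recovery is progressing or stuck, can be approximated by comparing degraded-object counts over time. This is a minimal sketch that assumes the operator samples the degraded object count (for example, the figure shown in the cluster status) at regular intervals; `recovery_trend` and its `min_drop` threshold are hypothetical.

```python
def recovery_trend(samples, min_drop=1):
    """Classify a series of degraded-object counts, oldest first."""
    if len(samples) < 2:
        return "unknown"       # not enough data to compare
    first, last = samples[0], samples[-1]
    if last == 0:
        return "done"
    if first - last >= min_drop:
        return "progressing"
    if last > first:
        return "worsening"     # more objects degraded than before
    return "stalled"


print(recovery_trend([1200, 900, 640]))  # → progressing
print(recovery_trend([500, 500, 500]))   # → stalled
print(recovery_trend([100, 250]))        # → worsening
```

A "stalled" result is a prompt to check for full OSDs, set flags, or intentional throttling before changing any tuning.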
Recovery tuning decision matrix:
| Situation | Prefer | Avoid |
|---|---|---|
| Client latency is critical and redundancy risk is low | Lower recovery/backfill concurrency temporarily and monitor degraded time. | Disabling recovery indefinitely. |
| Multiple OSDs down or PGs undersized | Restore redundancy first, even if clients slow down. | Prioritizing performance while data loss risk is rising. |
| Backfill blocked by nearfull OSDs | Add capacity, reweight carefully, or free space before forcing movement. | Marking more OSDs out and increasing pressure. |
| Flapping device or host | Stabilize hardware/network before repeated in/out changes. | Repeatedly restarting daemons without root-cause evidence. |
| Planned host maintenance | Set narrow noout, drain or stop one failure domain at a time, remove flags after. | Broad flags left in place after the window. |
ceph -w
ceph osd perf
ceph osd blocked-by
ceph tell osd.* dump_recovery_reservations
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
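While watching recovery, it helps to bucket PG state strings such as `active+undersized+degraded`, which are `+`-joined flags. The sketch below groups them into coarse categories; the category names and the `classify_pg` helper are illustrative assumptions, and the input list stands in for states gathered from the PG commands above.

```python
from collections import Counter

REDUCED = {"undersized", "degraded"}
MOVING = {"backfilling", "backfill_wait", "recovering", "recovery_wait", "remapped"}


def classify_pg(state):
    """Map a '+'-joined PG state string to a coarse category."""
    flags = set(state.split("+"))
    if "active" not in flags:
        return "inactive"            # clients cannot use this PG
    if flags & REDUCED:
        return "reduced-redundancy"  # serving I/O with fewer copies
    if flags & MOVING:
        return "moving"              # data is being recovered or backfilled
    if "clean" in flags:
        return "clean"
    return "other"


# Made-up states for illustration.
states = [
    "active+clean",
    "active+clean+scrubbing+deep",
    "active+undersized+degraded",
    "active+remapped+backfilling",
    "peering",
]
print(Counter(classify_pg(s) for s in states))
```

Checking `inactive` before anything else reflects the priority order in this guide: availability and data safety come before redundancy, and redundancy before performance.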
Scrub and Inconsistency
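Scrub compares object metadata, such as size and attributes, across replicas. Deep scrub also reads the object data and verifies checksums. If Ceph reports inconsistent PGs, identify the scope (which PGs, which objects, which OSDs) before repair.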
ceph health detail
ceph pg <pgid> query
ceph pg deep-scrub <pgid>
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid>
ceph pg repair <pgid>
ceph pg repair is not a generic first step. Repair can choose an authoritative copy based on available information, but operators should understand what is inconsistent and whether backups or application-level validation are needed.
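To understand what is inconsistent before repairing, the per-object report can be reduced to "which OSD shards report errors for which object". The sketch below assumes JSON shaped like the output of `rados list-inconsistent-obj` with an `inconsistents` list whose entries carry `object.name` and per-shard `osd` and `errors` fields; treat that shape as an assumption to verify against your release. The `shards_with_errors` helper and the sample values are hypothetical.

```python
def shards_with_errors(report):
    """Map object name -> sorted OSD ids whose shard lists any error."""
    result = {}
    for item in report.get("inconsistents", []):
        name = item.get("object", {}).get("name", "<unknown>")
        bad = [s["osd"] for s in item.get("shards", []) if s.get("errors")]
        result[name] = sorted(bad)
    return result


# Made-up report for illustration only.
report = {
    "epoch": 42,
    "inconsistents": [
        {
            "object": {"name": "obj-a"},
            "errors": ["data_digest_mismatch"],
            "shards": [
                {"osd": 3, "errors": []},
                {"osd": 7, "errors": ["data_digest_mismatch_info"]},
                {"osd": 11, "errors": []},
            ],
        }
    ],
}
print(shards_with_errors(report))  # → {'obj-a': [7]}
```

If the same OSD appears across many objects, that points toward a device problem rather than isolated corruption, which changes whether repair or replacement is the safer path.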
Maintenance Flags
Flags are useful during controlled work and dangerous when forgotten.
| Flag | Typical Use |
|---|---|
| noout | Prevent OSDs from being marked out during short maintenance. |
| norebalance | Stop rebalancing while changing topology carefully. |
| nobackfill | Temporarily stop backfill pressure. |
| norecover | Temporarily stop recovery pressure. |
| noscrub / nodeep-scrub | Avoid scrub load during sensitive windows. |
ceph osd set noout
ceph osd unset noout
ceph osd dump | grep flags
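Because forgotten flags are the main risk, a post-maintenance check can scan the `flags` line of the OSD dump for any of the flags in the table. This is a minimal sketch that takes the dump text as a string; `leftover_flags` is a hypothetical helper and the sample text is made up.

```python
MAINTENANCE_FLAGS = {
    "noout", "norebalance", "nobackfill",
    "norecover", "noscrub", "nodeep-scrub",
}


def leftover_flags(osd_dump_text):
    """Return maintenance flags still set, based on the 'flags ...' line."""
    for line in osd_dump_text.splitlines():
        parts = line.split(None, 1)
        # Only match the cluster-wide line that starts with "flags";
        # pool lines also contain the word but start with "pool".
        if len(parts) == 2 and parts[0] == "flags":
            flags = {f.strip() for f in parts[1].split(",")}
            return sorted(flags & MAINTENANCE_FLAGS)
    return []


sample_dump = (
    "epoch 812\n"
    "flags noout,nodeep-scrub,sortbitwise,recovery_deletes,purged_snapdirs\n"
    "pool 1 'rbd' replicated size 3 min_size 2 flags hashpspool\n"
)
print(leftover_flags(sample_dump))  # → ['nodeep-scrub', 'noout']
```

An empty list after the maintenance window is the expected result; anything else should be unset deliberately or documented.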
Full Cluster Risks
Ceph needs slack space for recovery. A pool can fail writes because an OSD or CRUSH subtree is full even when raw cluster capacity appears available.
Watch:
- nearfull, backfillfull, and full health checks,
- skew in ceph osd df tree,
- misplaced objects that cannot move,
- pools with quotas or bad target ratios,
- device class capacity, not only total capacity.
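The sketch below shows why average utilization can mislead: it classifies each OSD against the nearfull, backfillfull, and full thresholds. The values 0.85, 0.90, and 0.95 are the usual Ceph defaults, but a cluster may override them, so read the real ratios from your cluster. The `classify_osd` and `at_risk` helpers and the utilization figures are hypothetical.

```python
# Common defaults; assumed here, verify against your cluster's settings.
NEARFULL, BACKFILLFULL, FULL = 0.85, 0.90, 0.95


def classify_osd(used_fraction, nearfull=NEARFULL,
                 backfillfull=BACKFILLFULL, full=FULL):
    """Return the most severe threshold an OSD has crossed."""
    if used_fraction >= full:
        return "full"
    if used_fraction >= backfillfull:
        return "backfillfull"
    if used_fraction >= nearfull:
        return "nearfull"
    return "ok"


def at_risk(utilization):
    """Map OSD name -> status for every OSD that is not 'ok'."""
    statuses = {osd: classify_osd(frac) for osd, frac in utilization.items()}
    return {osd: s for osd, s in statuses.items() if s != "ok"}


# Made-up utilization: the average looks fine, one OSD does not.
utilization = {"osd.0": 0.62, "osd.1": 0.58, "osd.2": 0.92, "osd.3": 0.66}
average = sum(utilization.values()) / len(utilization)
print(f"average used: {average:.0%}")
print(at_risk(utilization))  # → {'osd.2': 'backfillfull'}
```

Here the cluster averages under 70% used, yet `osd.2` would already refuse backfill, which is the skew the list above warns about.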
Study Cards
What should you check before repairing an inconsistent PG?
The affected PG, objects, acting OSDs, health detail, and whether backups or application validation are needed.
Why can OSD flapping make an outage worse?
Repeated in/out changes trigger peering, remapping, and recovery churn before the root cause is fixed.
What is the difference between recovery and backfill?
Recovery restores missing replicas or chunks; backfill moves data to satisfy current placement.
Why is noout dangerous if forgotten?
It can hide real OSD loss and prevent the cluster from restoring redundancy after maintenance.
Why does a nearly full Ceph cluster recover poorly?
Recovery and rebalance need free space; full OSDs can block data movement and client writes.