Ceph Operations and Recovery

Ceph recovery work should preserve data first, then restore redundancy, then restore performance. The fastest-looking action is not always the safest one when PGs are degraded, OSDs are near full, or the cluster is already moving data.

flowchart LR
  A[Detect health warning] --> B[Capture ceph -s and health detail]
  B --> C[Identify failed domain]
  C --> D[Preserve quorum and placement]
  D --> E[Choose repair or replace]
  E --> F[Watch recovery and backfill]
  F --> G[Clear temporary flags]

First Health Pass

ceph -s
ceph health detail
ceph versions
ceph mon stat
ceph mgr stat
ceph osd stat
ceph osd tree
ceph osd df tree
ceph pg stat
ceph pg dump_stuck

Read ceph health detail before taking action. HEALTH_WARN can mean an expected maintenance flag, but it can also mean clock skew, full OSDs, inactive PGs, scrub errors, or daemons that are down.
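The first-pass commands above can be saved before anything is changed, so later decisions can be compared against the starting state. The following is a minimal sketch, not an official tool: `capture_health` and `CEPH_BIN` are hypothetical names introduced here, and `ceph status` is used as the long form of `ceph -s`.

```shell
# Sketch: save the first-pass health commands to a timestamped directory.
# CEPH_BIN and capture_health are hypothetical names for this example.
CEPH_BIN="${CEPH_BIN:-ceph}"

capture_health() {
  local outdir="${1:-ceph-health-$(date +%Y%m%dT%H%M%S)}"
  local cmd name
  mkdir -p "$outdir" || return 1
  for cmd in "status" "health detail" "versions" "mon stat" "mgr stat" \
             "osd stat" "osd tree" "osd df tree" "pg stat" "pg dump_stuck"; do
    # File name is the subcommand with spaces replaced, e.g. osd_df_tree.txt
    name="$(printf '%s' "$cmd" | tr ' ' '_')"
    # $cmd is intentionally unquoted so "osd df tree" splits into arguments.
    $CEPH_BIN $cmd > "$outdir/$name.txt" 2>&1 || echo "failed: ceph $cmd" >&2
  done
  # Print the directory so callers can record where the snapshot went.
  echo "$outdir"
}
```

Calling `capture_health` with no argument writes to a new directory named after the current time. Passing a path, such as a ticket-specific directory, keeps the evidence next to the incident notes.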

OSD Failure Response

Identify the OSD, host, device path, and by-id serial.
Check whether the OSD is down, out, full, slow, or flapping.
Confirm whether PGs are degraded, undersized, inactive, or backfilling.
Verify backups or replicas for critical clients before risky repair.
Replace failed hardware or redeploy the OSD through the orchestrator.
Watch recovery until PGs return to active+clean.

ceph osd find <osd-id>
ceph osd metadata <osd-id>
ceph device ls
ceph device info <devid>
ceph orch device ls
ceph orch daemon restart osd.<id>
ceph orch osd rm <id>

Avoid repeatedly marking OSDs in and out while the root cause is unknown. Flapping can create extra peering and recovery churn.

Recovery and Backfill

Recovery restores missing replicas or chunks after failures. Backfill moves data to satisfy the current CRUSH placement after topology or weight changes. Both consume disk, CPU, and network.

Key questions:

Is recovery making progress or stuck?
Is a full or nearfull OSD blocking backfill?
Is recovery throttled intentionally?
Are clients suffering because recovery is too aggressive?
Is the cluster at risk because recovery is too slow?

Recovery tuning decision matrix:

Situation	Prefer	Avoid
Client latency is critical and redundancy risk is low	Lower recovery/backfill concurrency temporarily and monitor degraded time.	Disabling recovery indefinitely.
Multiple OSDs down or PGs undersized	Restore redundancy first, even if clients slow down.	Prioritizing performance while data loss risk is rising.
Backfill blocked by nearfull OSDs	Add capacity, reweight carefully, or free space before forcing movement.	Marking more OSDs out and increasing pressure.
Flapping device or host	Stabilize hardware/network before repeated in/out changes.	Repeatedly restarting daemons without root-cause evidence.
Planned host maintenance	Set narrow `noout`, drain or stop one failure domain at a time, remove flags after.	Broad flags left in place after the window.
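The planned-maintenance row can be sketched as a wrapper that scopes `noout` to one CRUSH host and clears it when the work finishes. This assumes a release that provides `ceph osd set-group` and `ceph osd unset-group`. `with_host_noout`, `CEPH_BIN`, and the host name `node3` in the usage note are hypothetical.

```shell
# Sketch: run one maintenance command with noout limited to a single host.
# with_host_noout and CEPH_BIN are hypothetical names for this example.
CEPH_BIN="${CEPH_BIN:-ceph}"

with_host_noout() {
  local host="$1" rc
  shift
  # Set noout only for the OSDs under this CRUSH host, not cluster-wide.
  $CEPH_BIN osd set-group noout "$host" || return 1
  # Run the maintenance command and remember its exit status.
  "$@"
  rc=$?
  # Clear the flag whether the maintenance command succeeded or failed.
  $CEPH_BIN osd unset-group noout "$host"
  return "$rc"
}
```

For example, `with_host_noout node3 ssh node3 reboot` would set the flag, run the reboot, and then unset the flag. A real maintenance window would also wait for the host's OSDs to come back up before moving to the next failure domain.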

ceph -w
ceph osd perf
ceph osd blocked-by
ceph tell osd.* dump_recovery_reservations
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
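When client latency matters more than recovery speed, the two options queried above can be lowered temporarily and then reset. The sketch below assumes no earlier central override exists for these options, because `ceph config rm` removes whatever was set. On releases that use the mClock scheduler, these options may be ignored unless `osd_mclock_override_recovery_settings` is enabled. The function names and `CEPH_BIN` are hypothetical.

```shell
# Sketch: temporarily throttle recovery/backfill, then restore defaults.
# throttle_recovery, restore_recovery and CEPH_BIN are hypothetical names.
CEPH_BIN="${CEPH_BIN:-ceph}"

# Lower backfill and recovery concurrency for all OSDs (defaults to 1 and 1).
throttle_recovery() {
  $CEPH_BIN config set osd osd_max_backfills "${1:-1}" &&
  $CEPH_BIN config set osd osd_recovery_max_active "${2:-1}"
}

# Remove the overrides so OSDs fall back to their built-in defaults.
restore_recovery() {
  $CEPH_BIN config rm osd osd_max_backfills &&
  $CEPH_BIN config rm osd osd_recovery_max_active
}
```

Pairing every `throttle_recovery` call with a scheduled `restore_recovery` keeps the throttle from becoming a forgotten setting that extends degraded time.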

Scrub and Inconsistency

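Scrub compares object metadata, such as size and attributes, across replicas. Deep scrub also reads the object data and verifies checksums, so it costs more I/O. If Ceph reports inconsistent PGs, identify the scope (which PGs, objects, and OSDs) before repair.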

ceph health detail
ceph pg <pgid> query
ceph pg deep-scrub <pgid>
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid>
ceph pg repair <pgid>

ceph pg repair is not a generic first step. Repair can choose an authoritative copy based on available information, but operators should understand what is inconsistent and whether backups or application-level validation are needed.

Maintenance Flags

Flags are useful during controlled work and dangerous when forgotten.

Flag	Typical Use
`noout`	Prevent OSDs from being marked out during short maintenance.
`norebalance`	Stop rebalancing while changing topology carefully.
`nobackfill`	Temporarily stop backfill pressure.
`norecover`	Temporarily stop recovery pressure.
`noscrub` / `nodeep-scrub`	Avoid scrub load during sensitive windows.

ceph osd set noout
ceph osd unset noout
ceph osd dump | grep flags
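Because forgotten flags are the main risk, the `grep flags` check can be turned into a filter that prints only the maintenance flags from the table above. This is a sketch that assumes the `flags` line of `ceph osd dump` is a comma-separated list, as in `flags noout,sortbitwise`. `leftover_flags` is a hypothetical helper name.

```shell
# Sketch: read `ceph osd dump` output on stdin and print any maintenance
# flags that are still set. leftover_flags is a hypothetical helper name.
leftover_flags() {
  awk '$1 == "flags" {
    n = split($2, f, ",")
    for (i = 1; i <= n; i++)
      # Match only the maintenance flags listed in the table above.
      if (f[i] ~ /^(noout|norebalance|nobackfill|norecover|noscrub|nodeep-scrub)$/)
        print f[i]
  }'
}
```

Running `ceph osd dump | leftover_flags` at the end of a maintenance window should print nothing. Any output names a flag that still needs `ceph osd unset`.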

Full Cluster Risks

Ceph needs slack space for recovery. A pool can fail writes because an OSD or CRUSH subtree is full even when raw cluster capacity appears available.

Watch:

nearfull, backfillfull, and full health checks,
skew in ceph osd df tree,
misplaced objects that cannot move,
pools with quotas or bad target ratios,
device class capacity, not only total capacity.
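The three fullness states are thresholds on per-OSD utilization. The sketch below classifies a `%USE` value from `ceph osd df tree`, assuming the commonly documented default ratios of 0.85 (nearfull), 0.90 (backfillfull), and 0.95 (full). A cluster's real values appear in `ceph osd dump` and may differ. `classify_use` is a hypothetical helper name.

```shell
# Sketch: classify one OSD's %USE against the fullness thresholds.
# Arguments: use_percent [nearfull] [backfillfull] [full], all in percent.
# Defaults assume ratios of 0.85 / 0.90 / 0.95; check the cluster's own values.
classify_use() {
  awk -v use="$1" -v near="${2:-85}" -v back="${3:-90}" -v full="${4:-95}" 'BEGIN {
    if (use >= full)      print "full"          # client writes are blocked
    else if (use >= back) print "backfillfull"  # backfill onto this OSD is refused
    else if (use >= near) print "nearfull"      # warning only
    else                  print "ok"
  }'
}
```

For example, `classify_use 91.4` prints `backfillfull`. An OSD in that state can still serve clients, but it can no longer receive backfill, so marking further OSDs out only adds pressure.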

Study Cards

Question

What should you check before repairing an inconsistent PG?

Answer

The affected PG, objects, acting OSDs, health detail, and whether backups or application validation are needed.

Question

Why can OSD flapping make an outage worse?

Answer

Repeated in/out changes trigger peering, remapping, and recovery churn before the root cause is fixed.

Question

What is the difference between recovery and backfill?

Answer

Recovery restores missing replicas or chunks; backfill moves data to satisfy current placement.

Question

Why is noout dangerous if forgotten?

Answer

While noout is set, down OSDs are never marked out, so a real OSD loss stays hidden and the cluster does not restore redundancy for the affected data after maintenance ends.

Question

Why does a nearly full Ceph cluster recover poorly?

Answer

Recovery and rebalance need free space; full OSDs can block data movement and client writes.

Scenario Lab

Ceph Degraded PGs After OSD Loss

A drive failure leaves placement groups degraded while client latency rises during recovery.

Symptoms

ceph -s shows active+degraded or active+undersized placement groups.
One OSD is down or flapping and recovery traffic increases.
Application writes are slower but still mostly succeeding.

Evidence

Capture ceph health detail, ceph osd tree, and ceph pg dump_stuck before changing flags.
Check whether any OSD or CRUSH subtree is nearfull, backfillfull, or full.
Compare client latency with recovery and backfill activity.

Command Examples

Command

ceph -s && ceph health detail

Example output

health: HEALTH_WARN
64 pgs degraded
osd.12 is down

What it does: Establishes the cluster health state and the specific warnings driving recovery work.

Command

ceph osd tree && ceph osd df tree

Example output

ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT
12  hdd   7.276   osd.12         down   1.00000
ID  REWEIGHT SIZE  USE  AVAIL %USE
12  1.00000  7.3T  0B   7.3T  0.00

What it does: Locates the failed OSD and checks whether capacity or CRUSH placement will constrain recovery.

Command

ceph pg dump_stuck && ceph -w

Example output

ok
2026-06-06T10:14:12 64 pgs active+degraded; recovery io 120 MiB/s

What it does: Watches whether placement groups are making progress toward `active+clean` or staying stuck.

Answer: Restore redundancy without creating extra churn: identify the failed domain, replace or restart only the bad OSD, avoid broad forgotten flags, and watch recovery to active+clean.
