<aside> ⚠️ Following this document will format a disk and destroy the data on it. Proceed with extreme care!

</aside>

Problem and Goal

This article provides a fast and clean way to recreate a Rook-managed Ceph OSD when an OSD is down.

Environment

A Kubernetes cluster with at least three nodes, at least one of which is a worker node, and with Rook Ceph installed.

Step-by-step Method

<aside> 💡 Replace the names below with the real names from your environment before running the commands.

</aside>

The environment-specific values used in this operation are listed as follows:

| Item | Description | Name in this doc |
| --- | --- | --- |
| OSD ID | The ID of the OSD; the value is a natural number. | 2 |
| Block device name | The OS-assigned device name of the disk backing the OSD. | sdb |
| Host | The host on which the OSD is stored. | test-k8s-1 |
| Ceph LV name | The LV created by Rook to store the OSD data. | ceph--eba2294c--dac1--4c30--98c7--d02cd5c9830e-osd--data--16c28b6c--b9da--4280--a6a1--4da469fdd3a4 |
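
Throughout the steps below, `kubectl ceph ...` is assumed to be a local alias or wrapper that runs the Ceph CLI inside the Rook toolbox pod. If your environment has no such wrapper, a rough equivalent (assuming a toolbox pod labelled app=rook-ceph-tools exists in the rook namespace) looks like this:

    # Look up the toolbox pod and run an arbitrary ceph command in it
    # (the label and namespace here are assumptions; adjust them to your deployment)
    $ TOOLS_POD=$(kubectl -n rook get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
    $ kubectl -n rook exec -it "${TOOLS_POD}" -- ceph -s
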
  1. Check Ceph status

    $ kubectl ceph -s
      cluster:
        id:     4ecc59d7-0173-4c2d-9802-ae0c9c9aea6c
        health: HEALTH_WARN
     
      services:
        mon: 3 daemons, quorum b,a,c
        mgr: a(active)
        osd: 9 osds: 8 up, 8 in
     
      data:
        pools:   1 pools, 128 pgs
        objects: 1.49 k objects, 4.4 GiB
        usage:   21 GiB used, 210 GiB / 231 GiB avail
        pgs:     128 active+clean
     
      io:
        client:   9.3 KiB/s wr, 0 op/s rd, 1 op/s wr
    

    We can see from the osd line (9 osds: 8 up, 8 in) that one OSD is down.
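
    If the summary alone does not make it obvious which daemon is affected, the detailed health report spells it out; for example:

    # Show the detailed reasons behind HEALTH_WARN, including any down OSDs
    $ kubectl ceph health detail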

  2. Set the noout and nobackfill flags on the Ceph cluster to prevent data rebalancing and new backfill operations while the OSD is being replaced.

    $ kubectl ceph osd set nobackfill
    
    $ kubectl ceph osd set noout
    
    $ kubectl ceph -s
      cluster:
        id:     4ecc59d7-0173-4c2d-9802-ae0c9c9aea6c
        health: HEALTH_WARN
                noout,nobackfill flag(s) set
      .....
    
  3. Find the down OSD.

    $ kubectl ceph osd tree | grep down
    2   hdd 0.01859         osd.2           down  1.00000 1.00000
    

    We can see that the ID of the down OSD is 2.

  4. Check the Rook OSD pod status.

    $ kubectl -n rook get pod | grep osd-2
    rook-ceph-osd-2-8567f5668f-72b8g         1/1     Running     1          XXm
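
    Note that the pod can still report Running while Ceph marks the daemon down (the restart count is the hint). If you want to see why it failed, the logs of the previous container instance usually tell you, e.g.:

    # Inspect the previously crashed container of the osd pod (pod name from the listing above)
    $ kubectl -n rook logs rook-ceph-osd-2-8567f5668f-72b8g --previous | tail -n 20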
    
  5. Verify which host and disk the OSD is stored on.

    $ kubectl ceph osd metadata 2 | grep -e devices -e hostname
    "container_hostname": "rook-ceph-osd-2-64bf499c74-kp29f",
        "devices": "dm-2,sdb",
        "hostname": "test-k8s-1",
    

    The block device backing osd.2 is sdb on host test-k8s-1.
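
    To double-check that you are about to wipe the right disk, the same query can be run for every OSD at once. A small sketch using the same command convention:

    # Print host and devices for every OSD ID known to the cluster
    $ for id in $(kubectl ceph osd ls); do
        echo "osd.${id}:"
        kubectl ceph osd metadata "${id}" | grep -e devices -e hostname
      done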

  6. Connect to test-k8s-1 and check the block device list.

    $ lsblk  | grep -A1 sdb
    sdb                                                        259:4    0   20G  0 disk
    └─ceph--eba2294c--dac1--4c30--98c7--d02cd5c9830e-osd--data--16c28b6c--b9da--4280--a6a1--4da469fdd3a4 253:1    0   19G  0 lvm
    

    The disk is currently occupied by an LV that was previously created by Rook.

    Take note of the LV name shown in the output for later use. In this case, it is ceph--eba2294c--dac1--4c30--98c7--d02cd5c9830e-osd--data--16c28b6c--b9da--4280--a6a1--4da469fdd3a4.
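
    If you prefer not to copy the long name by hand, it can be captured into a shell variable. A sketch that assumes sdb has exactly one LVM child:

    # Capture the device-mapper name of the LV sitting on top of sdb
    $ CEPH_LV=$(lsblk -ln -o NAME,TYPE /dev/sdb | awk '$2 == "lvm" {print $1; exit}')
    $ echo "${CEPH_LV}"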

  7. Back on the control node, purge the down OSD.

    <aside> ⚠️ Make sure you purge the correct OSD ID, or you may destroy a working OSD by accident.

    </aside>

    # Purge the osd
    $ kubectl ceph osd out osd.2
    marked out osd.2.
    
    $ kubectl ceph osd down osd.2
    osd.2 is already down.
    
    $ kubectl ceph osd purge osd.2 --yes-i-really-mean-it
    purged osd.2
    
    # Remove the Rook OSD deployment so that the old osd.2 pod is not recreated
    $ kubectl -n rook delete deployment rook-ceph-osd-2
    deployment.extensions "rook-ceph-osd-2" deleted
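
    Before moving on, confirm that osd.2 is really gone from the cluster:

    # The purged OSD should no longer appear in the OSD tree
    $ kubectl ceph osd tree | grep -w 'osd\.2' || echo "osd.2 removed"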
    
  8. Connect to test-k8s-1 and clean up the block device (sdb in this case) so that Rook can use it again.

    <aside> ⚠️ Make sure you connect to the correct host and operate on the correct block device, or you may destroy a working OSD by accident.

    </aside>

    $ ssh test-k8s-1
    
    # Clean the GPT data structures on the disk
    $ sudo sgdisk --zap /dev/sdb
     
    # Erase filesystem signatures (magic strings) from the disk so that libblkid no longer detects them.
    $ sudo wipefs -a -f /dev/sdb
    
    # Remove the device-mapper mapping of the old Ceph LV
    # <CEPH-FOLDER> is the LV name we captured in step 6
    $ sudo dmsetup remove /dev/mapper/<CEPH-FOLDER> 
    
    # Manually remove the block special file in case it is not removed by dmsetup
    $ sudo rm /dev/mapper/<CEPH-FOLDER>
    
    # Check the udev properties of the target disk and confirm it is fully wiped (the grep should print nothing)
    $ udevadm info --query=property /dev/sdb | grep ID_FS_TYPE
    
    # Remove any outdated Rook files for this OSD on the host
    $ sudo rm -rf /var/lib/rook/osd2
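
    As a final sanity check before handing the disk back to Rook, confirm that nothing is left on it:

    # wipefs with no options only lists remaining signatures; expect no output
    $ sudo wipefs /dev/sdb

    # lsblk should show sdb with no LVM child left underneath it
    $ lsblk /dev/sdb

    # the old Ceph LV should be gone from the device-mapper table
    $ sudo dmsetup ls | grep <CEPH-FOLDER>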
    
  9. Back on the control node, check the CephCluster config to ensure the device name matches the deviceFilter pattern.

    $ kubectl get cephcluster -n rook -o yaml | less
    .....
    - config:
        osdsPerDevice: "1"
        storeType: bluestore
      deviceFilter: ^sd[b-f]
      name: test-k8s-1
      resources: {}
    .....
    

    We can see that sdb matches the deviceFilter pattern ^sd[b-f] for test-k8s-1.
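
    Instead of paging through the whole object, the per-node filters can also be pulled out directly. A sketch that assumes a single CephCluster object with the usual spec.storage.nodes layout:

    # Print "<node name>: <deviceFilter>" for every node entry in the CephCluster spec
    $ kubectl -n rook get cephcluster -o jsonpath='{range .items[0].spec.storage.nodes[*]}{.name}{": "}{.deviceFilter}{"\n"}{end}'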

  10. Restart the rook-ceph-operator so that it recreates the OSD.

    # Stop rook-ceph-operator
    $ kubectl -n rook-system scale deploy rook-ceph-operator --replicas 0
    
    # Start rook-ceph-operator
    $ kubectl -n rook-system scale deploy rook-ceph-operator --replicas 1
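
    To make sure the operator is back before following its logs in the next step, you can wait for the rollout to finish:

    # Block until the operator deployment reports it is fully rolled out again
    $ kubectl -n rook-system rollout status deploy/rook-ceph-operator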
    
  11. Follow the pod logs of rook-ceph-operator to monitor the OSD recreation.

    # Get the pod name of rook-ceph-operator
    $ kubectl -n rook-system get pod | grep rook-ceph-operator
    
    # Follow the pod log of rook-ceph-operator
    $ kubectl -n rook-system logs <rook-ceph-operator-pod-name> -f
    # You will see a new osd deployment message for the node where the OSD was previously stored
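
    Once the new OSD pod is running and the OSD is back up and in, clear the flags that were set in step 2 so that backfill and rebalancing can resume, using the same command convention as above:

    # The new osd pod should appear on the node where the disk is attached
    $ kubectl -n rook get pod -o wide | grep osd

    # Clear the flags set in step 2
    $ kubectl ceph osd unset nobackfill
    $ kubectl ceph osd unset noout

    # The cluster should eventually return to HEALTH_OK with all 9 OSDs up and in
    $ kubectl ceph -s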