Goal

This document guides you through recreating a Ceph monitor on a Kubernetes cluster when one of the Ceph monitors is out of quorum.

Step-by-step Method

  1. Check the Ceph cluster status. In this example one Ceph monitor, mon.a, is out of quorum. (A note on the kubectl ceph command follows the output.)

    $ kubectl ceph -s
      cluster:
        id:     4ecc59d7-0173-4c2d-9802-ae0c9c9aea6c
        health: HEALTH_WARN
     
      services:
        mon: 3 daemons, quorum b,c, out of quorum: a
        mgr: a(active)
        osd: 9 osds: 9 up, 9 in
     
      data:
        pools:   1 pools, 128 pgs
        objects: 1.49 k objects, 4.4 GiB
        usage:   21 GiB used, 210 GiB / 231 GiB avail
        pgs:     128 active+clean
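
    Note that "kubectl ceph" is not a built-in kubectl command; this guide assumes
    it is a local alias or plugin that runs ceph commands inside the Rook toolbox
    pod. An equivalent direct invocation, assuming the toolbox deployment is
    named rook-ceph-tools, would look like this:

    # Run ceph status through the Rook toolbox (deployment name assumed)
    $ kubectl -n rook exec -it deploy/rook-ceph-tools -- ceph -s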
    
  2. Check the Ceph health detail.

    $ kubectl ceph health detail
    ...
    MON_DOWN 1/3 mons down, quorum b,c
        mon.a (rank 0) addr 10.43.116.150:6789/0 is down (out of quorum)
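
    To double-check which monitors are currently in quorum, Ceph's quorum_status
    command can also be used (a sketch, run through the same alias):

    # "quorum_names" lists the monitors currently in quorum, e.g. ["b","c"] here
    $ kubectl ceph quorum_status -f json-pretty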
    
  3. Check the Ceph monitor pod status; in this example all the monitor pods are Running. (This is not required for the next step, it is only recorded here; a log check on the failing monitor is sketched below.)

    $ kubectl -n rook get pod | grep mon
    rook-ceph-mon-a-7fdc8559b6-vfqgx         1/1     Running     1          11d
    rook-ceph-mon-b-8597ccdd76-q9qsr         1/1     Running     1          11d
    rook-ceph-mon-c-7c7c7fbdff-gbxwx         1/1     Running     1          11d
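
    If you want to see why mon.a fell out of quorum before removing it, its pod
    log is a good first stop (a sketch using the deployment name from the
    listing above):

    # Inspect the failing monitor's recent log output
    $ kubectl -n rook logs deploy/rook-ceph-mon-a --tail=50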
    
  4. Check the Kubernetes cluster status and make sure no node is down.

    $ kubectl get nodes
    NAME         STATUS   ROLES                      AGE   VERSION
    test-k8s-1   Ready    controlplane,etcd,worker   11d   v1.21.9
    test-k8s-2   Ready    controlplane,etcd,worker   11d   v1.21.9
    test-k8s-3   Ready    worker                     11d   v1.21.9
    
  5. Scale the deployment rook-ceph-mon-a down to 0 so that the failed monitor pod is not recreated (a quick verification follows the commands).

    $ kubectl -n rook get deploy | grep mon
    rook-ceph-mon-a   1/1     1            1           11d
    rook-ceph-mon-b   1/1     1            1           11d
    rook-ceph-mon-c   1/1     1            1           11d
    
    $ kubectl -n rook scale deploy rook-ceph-mon-a --replicas 0
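
    Before moving on, it is worth confirming that the scale-down took effect and
    the mon.a pod is gone:

    # READY should now show 0/0 and no rook-ceph-mon-a pod should be listed
    $ kubectl -n rook get deploy rook-ceph-mon-a
    $ kubectl -n rook get pod | grep mon-a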
    
  6. Edit the Rook configmap rook-ceph-mon-endpoints to remove the inactive Ceph monitor (a safer, backup-first variant is sketched after the example).

    $ kubectl -n rook edit cm rook-ceph-mon-endpoints
    
    # Before
    apiVersion: v1
    data:
      csi-cluster-config-json: '[{"clusterID":"rook","monitors":["10.43.1.72:6789","10.43.206.183:6789","10.43.116.150:6789"]}]'
      data: b=10.43.1.72:6789,c=10.43.206.183:6789,a=10.43.116.150:6789
      mapping: '{"node":{"a":{"Name":"test-k8s-2","Hostname":"test-k8s-2","Address":"192.168.1.12"},"b":{"Name":"test-k8s-3","Hostname":"test-k8s-3","Address":"192.168.1.13"},"c":{"Name":"test-k8s-1","Hostname":"test-k8s-1","Address":"192.168.1.11"}}}'
      maxMonId: "2"
    
    # After, remove mon.a
    apiVersion: v1
    data:
      csi-cluster-config-json: '[{"clusterID":"rook","monitors":["10.43.1.72:6789","10.43.206.183:6789"]}]'
      data: b=10.43.1.72:6789,c=10.43.206.183:6789
      mapping: '{"node":{"b":{"Name":"test-k8s-3","Hostname":"test-k8s-3","Address":"192.168.1.13"},"c":{"Name":"test-k8s-1","Hostname":"test-k8s-1","Address":"192.168.1.11"}}}'
      maxMonId: "2"
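
    Editing a live configmap by hand is easy to get wrong. A safer variant is to
    back up the configmap first and apply an edited copy, sketched below with
    standard kubectl commands:

    # Back up the current configmap before touching it
    $ kubectl -n rook get cm rook-ceph-mon-endpoints -o yaml > mon-endpoints-backup.yaml
    # Copy it, remove the mon.a entries from csi-cluster-config-json, data
    # and mapping as shown above, then apply the edited copy
    $ cp mon-endpoints-backup.yaml mon-endpoints.yaml
    $ kubectl -n rook apply -f mon-endpoints.yaml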
    
  7. Remove mon.a from the Ceph monitor map.

    $ kubectl ceph mon remove a
    removing mon.a at 10.43.116.150:6789/0, there will be 2 monitors
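
    The removal can be verified with the monitor map; mon.a should no longer
    appear in the output:

    # Dump the monmap and confirm only mon.b and mon.c remain
    $ kubectl ceph mon dump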
    
  8. Check the Ceph cluster status again to confirm that mon.a is no longer in the monitor quorum.

    $ kubectl ceph -s
      cluster:
        id:     4ecc59d7-0173-4c2d-9802-ae0c9c9aea6c
        health: HEALTH_OK
     
      services:
        mon: 2 daemons, quorum b,c
        mgr: a(active)
        osd: 9 osds: 9 up, 9 in
     
      data:
        pools:   1 pools, 128 pgs
        objects: 1.49 k objects, 4.4 GiB
        usage:   21 GiB used, 210 GiB / 231 GiB avail
        pgs:     128 active+clean
     
      io:
        client:   7.7 KiB/s wr, 0 op/s rd, 0 op/s wr
    
  9. Restart rook-ceph-operator so that Rook recreates the missing Ceph monitor automatically (an equivalent one-liner is shown after the commands).

    # Stop rook-ceph-operator
    $ kubectl -n rook-system scale deploy rook-ceph-operator --replicas 0
    
    # Restart rook-ceph-operator
    $ kubectl -n rook-system scale deploy rook-ceph-operator --replicas 1
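
    On kubectl v1.15 or later, the two scale commands can be replaced by a single
    rollout restart, and following the operator log shows the monitor being
    recreated:

    # Equivalent one-liner
    $ kubectl -n rook-system rollout restart deploy rook-ceph-operator

    # Watch the operator reconcile the monitor count
    $ kubectl -n rook-system logs -f deploy/rook-ceph-operator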
    
  10. Check the Rook pods to confirm that the new Ceph monitor has been created. Here mon.d has been created on node test-k8s-1.

    $ kubectl -n rook get pod -o wide | grep mon
    rook-ceph-mon-c-7fdc8559b6-vfqgx         1/1     Running     1          11d   192.168.1.12    test-k8s-2   <none>           <none>
    rook-ceph-mon-b-8597ccdd76-q9qsr         1/1     Running     1          11d   192.168.1.13    test-k8s-3   <none>           <none>
    rook-ceph-mon-d-9fdc2df9b6-b1d45         1/1     Running     1          60s   192.168.1.11    test-k8s-1   <none>           <none>
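
    Instead of grep, the monitor pods can also be selected by label (assuming
    Rook's standard app=rook-ceph-mon label):

    $ kubectl -n rook get pod -l app=rook-ceph-mon -o wide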
    
  11. Delete the unused Ceph monitor deployment rook-ceph-mon-a.

    $ kubectl -n rook delete deploy rook-ceph-mon-a
    deployment.extensions "rook-ceph-mon-a" deleted
    
    $ kubectl -n rook get deploy | grep mon
    rook-ceph-mon-b   1/1     1            1           11d
    rook-ceph-mon-c   1/1     1            1           11d
    rook-ceph-mon-d   1/1     1            1           2m
    
  12. Connect to the node on which the new Ceph monitor was created; its local files are stored under /var/lib/rook/<mon-id>/data/. The result will look like this:

    # On the node on which the new Ceph monitor was created
    $ ll /var/lib/rook/mon-d/data/
    total 20
    drwxr-xr-x 3  167  167 4096 Sep  1 20:24 ./
    drwxr-xr-x 3 root root 4096 Sep  1 20:21 ../
    -rw------- 1  167  167   77 Sep  1 20:24 keyring
    -rw-r--r-- 1  167  167    8 Sep  1 20:24 kv_backend
    drwxr-xr-x 2  167  167 4096 Sep 13 16:31 store.db/
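
    The 167 owner in the listing is the UID/GID of the ceph user inside the
    Rook/Ceph container images (an assumption worth verifying for your image
    version). A quick permission check on the keyring, which must stay readable
    only by that user:

    # keyring should be mode 600 and owned by 167:167 (the in-container ceph user)
    $ stat -c '%a %u:%g %n' /var/lib/rook/mon-d/data/keyring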
    
  13. Check the Ceph cluster status and unset any flags set by Rook during the operation; here noscrub and nodeep-scrub are still set (a final check is sketched after the commands).

    $ kubectl ceph -s
      cluster:
        id:     4ecc59d7-0173-4c2d-9802-ae0c9c9aea6c
        health: HEALTH_WARN
                noscrub,nodeep-scrub flag(s) set
     
      services:
        mon: 3 daemons, quorum b,d,c
        mgr: a(active)
        osd: 9 osds: 9 up, 9 in
     
      data:
        pools:   1 pools, 128 pgs
        objects: 1.49 k objects, 4.4 GiB
        usage:   21 GiB used, 210 GiB / 231 GiB avail
        pgs:     128 active+clean
     
      io:
        client:   9.3 KiB/s wr, 0 op/s rd, 0 op/s wr
    
    $ kubectl ceph osd unset noscrub
    noscrub is unset
    
    $ kubectl ceph osd unset nodeep-scrub
    nodeep-scrub is unset
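
    As a final verification, the OSD map should no longer list either flag:

    # The flags line should not contain noscrub or nodeep-scrub any more
    $ kubectl ceph osd dump | grep flags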
    

This completes the procedure.
