From 4d6f0d3c8e373c004159109cb6ff5d83b219dc58 Mon Sep 17 00:00:00 2001
From: Erickson Silva de Oliveira
Date: Mon, 7 Apr 2025 22:58:42 -0300
Subject: [PATCH] Check cephfs recovery commands during rook-ceph restore

If the open deployment model is used, then depending on how the
monitors and OSDs are distributed in the cluster, the 'cephfs' commands
may hang during cephfs recovery due to missing data, since only
controller-0 is running at that point.

To solve this, the remaining cephfs recovery commands are not executed
if the first command fails.

This does not affect the rook-ceph restore, as there is a cephfs check
at the end of the process, when all hosts are online, and the recovery
is redone.

Test Plan:
- PASS: B&R on STD w/ 3 workers, open deployment model, and 1 mon and
  1 OSD on each host.

Closes-Bug: 2106479
Change-Id: Ifa136dfe76bd9ee76e346f86f097f66dbfdff463
Signed-off-by: Erickson Silva de Oliveira
---
 .../files/recover_rook_ceph.py | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py b/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py
index 8bbcf407b..f7da957cd 100644
--- a/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py
+++ b/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py
@@ -415,12 +415,15 @@ data:
 
     # Try to recover from some common errors
     # The timeout command was used because depending on the status of the cluster, it can get stuck
-    # on "cephfs-journal-tool" commands. But this will not cause any problems in recovery.
-    timeout 180 cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
-    timeout 180 cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
-    cephfs-table-tool ${FS_NAME}:0 reset session
-    cephfs-table-tool ${FS_NAME}:0 reset snap
-    cephfs-table-tool ${FS_NAME}:0 reset inode
+    # on "cephfs" commands. But this will not cause any problems in recovery.
+    CEPHFS_CMD_TIMEOUT=180
+    timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
+    if [ $? -eq 0 ]; then
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset session
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset snap
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset inode
+    fi
   fi
 
   kubectl -n rook-ceph scale deployment -l app=rook-ceph-osd --replicas 1
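
Note: for readers of this patch, below is a minimal standalone sketch of
the guard this change introduces, written with the first command directly
in the if condition (equivalent to the $? check used in the hunk above).
FS_NAME, the 180-second timeout, and the cephfs-journal-tool /
cephfs-table-tool binaries are assumed to be provided by the surrounding
recovery script and Ceph toolbox environment; this is an illustration,
not the patched code itself.

    #!/bin/bash
    # Run the first recovery command under a timeout; only if it exits 0
    # do we continue with the remaining journal/table resets. If it hangs
    # (and is killed by timeout) or fails, the rest of the sequence is
    # skipped and the later cephfs check redoes the recovery once all
    # hosts are online.
    CEPHFS_CMD_TIMEOUT=180
    if timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary; then
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset session
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset snap
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset inode
    fi

Either form skips the follow-up resets when recover_dentries fails or is
killed by the timeout, which is what prevents the hang described in the
commit message.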