From 4d6f0d3c8e373c004159109cb6ff5d83b219dc58 Mon Sep 17 00:00:00 2001
From: Erickson Silva de Oliveira
Date: Mon, 7 Apr 2025 22:58:42 -0300
Subject: [PATCH] Check cephfs recovery commands during rook-ceph restore

If the open deployment model is used, then depending on how the
monitors and OSDs are distributed in the cluster, the 'cephfs' commands
may hang during cephfs recovery due to missing data, since only
controller-0 is running at that point.

To solve this, the remaining cephfs recovery commands are not executed
if the first command fails.

This does not affect the rook-ceph restore, as there is a cephfs check
at the end of the process, when all hosts are online, and the recovery
is redone.

Test Plan:
- PASS: B&R on STD w/ 3 workers, open deployment model, and 1 mon and
  1 OSD on each host.

Closes-Bug: 2106479
Change-Id: Ifa136dfe76bd9ee76e346f86f097f66dbfdff463
Signed-off-by: Erickson Silva de Oliveira
---
 .../files/recover_rook_ceph.py | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py b/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py
index 8bbcf407b..f7da957cd 100644
--- a/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py
+++ b/playbookconfig/src/playbooks/roles/recover-rook-ceph-data/files/recover_rook_ceph.py
@@ -415,12 +415,15 @@ data:
 
     # Try to recover from some common errors
     # The timeout command was used because depending on the status of the cluster, it can get stuck
-    # on "cephfs-journal-tool" commands. But this will not cause any problems in recovery.
-    timeout 180 cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
-    timeout 180 cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
-    cephfs-table-tool ${FS_NAME}:0 reset session
-    cephfs-table-tool ${FS_NAME}:0 reset snap
-    cephfs-table-tool ${FS_NAME}:0 reset inode
+    # on "cephfs" commands. But this will not cause any problems in recovery.
+    CEPHFS_CMD_TIMEOUT=180
+    timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
+    if [ $? -eq 0 ]; then
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset session
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset snap
+      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset inode
+    fi
   fi
 
   kubectl -n rook-ceph scale deployment -l app=rook-ceph-osd --replicas 1
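
Note: for readers of this patch, below is a minimal standalone sketch of
the guard this change introduces, written with the first command directly
in the if condition (equivalent to the $? check used in the hunk above).
FS_NAME, the 180-second timeout, and the cephfs-journal-tool /
cephfs-table-tool binaries are assumed to be provided by the surrounding
recovery script and Ceph toolbox environment; this is an illustration,
not the patched code itself.

    #!/bin/bash
    # Run the first recovery command under a timeout; only if it exits 0
    # do we continue with the remaining journal/table resets. If it hangs
    # (and is killed by timeout) or fails, the rest of the sequence is
    # skipped and the later cephfs check redoes the recovery once all
    # hosts are online.
    CEPHFS_CMD_TIMEOUT=180
    if timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary; then
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset session
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset snap
      timeout ${CEPHFS_CMD_TIMEOUT} cephfs-table-tool ${FS_NAME}:0 reset inode
    fi

Either form skips the follow-up resets when recover_dentries fails or is
killed by the timeout, which is what prevents the hang described in the
commit message.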