Eric MacDonald 1daa670126 Improve DOR Recovery banner to include all hosts and their status
A power outage of a running system is referred to by StarlingX as a
DOR or Dead Office Recovery event, where some or all of the servers of
a system lose power and then automatically regain power and reboot
after the power outage is resolved.

The Maintenance Agent (mtcAgent) detects a DOR (Dead Office Recovery)
event when it starts up on a controller whose uptime is less than 15
minutes and remains in a DOR mode active state for up to 20 minutes.
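
The detection rule above can be sketched as follows. This is only an
illustration, not the mtcAgent code: the function and constant names are
hypothetical, and measuring the 20 minute window from agent start is an
assumption.

```python
# Illustrative sketch only; names are hypothetical, not taken from mtcAgent.
DOR_UPTIME_THRESHOLD_SECS = 15 * 60  # controller uptime below this implies DOR
DOR_MODE_TIMEOUT_SECS = 20 * 60      # DOR mode stays active for up to this long


def read_uptime_secs(path="/proc/uptime"):
    """Return system uptime in seconds (first field of /proc/uptime on Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


def dor_mode_detected(uptime_secs, threshold_secs=DOR_UPTIME_THRESHOLD_SECS):
    """True when the agent starts on a controller that booted very recently."""
    return uptime_secs < threshold_secs


def dor_mode_active(secs_since_agent_start, detected,
                    timeout_secs=DOR_MODE_TIMEOUT_SECS):
    """DOR mode remains active until the timeout elapses.

    Assumption: the window is measured from agent start.
    """
    return detected and secs_since_agent_start < timeout_secs
```

For example, dor_mode_detected(read_uptime_secs()) would be evaluated once
at agent startup.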

Servers of different models, makes and vintages can recover from a power
outage at different rates. Some may take longer than others. Therefore,
while in DOR mode, maintenance is more forgiving towards node recovery.

The existing implementation of DOR handling produces what is called a
"DOR Recovery" banner in its log files. The logs that comprise this
banner are produced at the times when maintenance detects those hosts'
recoveries. The banner can be displayed with the following command:

cat /var/log/mtcAgent.log | grep "DOR Recovery"

See DOR banner as comment in this review.

The issue leading to this update was that hosts that experienced no
heartbeat failure over the DOR recovery were not included in the DOR
recovery banner. The DOR recovery banner contains key performance
indicators (KPIs) for DOR recovery and is much more useful if all hosts
are included in it.

This update adds a heartbeat soak to the maintenance add handler as a
means to affirmatively detect and report successful DOR recoveries.
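
One way to picture a heartbeat soak is shown below. This is a minimal
sketch under assumptions: this message does not describe how the soak is
implemented, so the class name, the "consecutive successes" rule and the
required count are all hypothetical.

```python
# Illustrative sketch only; not the mtcAgent add handler implementation.
class HeartbeatSoak:
    """Counts consecutive heartbeat successes for one host (hypothetical)."""

    def __init__(self, required_successes):
        self.required = required_successes
        self.count = 0

    def record(self, heartbeat_ok):
        """Record one heartbeat period; a miss restarts the soak."""
        self.count = self.count + 1 if heartbeat_ok else 0
        return self.passed()

    def passed(self):
        """True once enough consecutive heartbeats have been seen."""
        return self.count >= self.required
```

In this model a host is reported as recovered only after the soak passes,
so a host that never lost heartbeat is still positively confirmed.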

The following additional fixes were implemented:

- The hard coded 15 minute (900 seconds) DOR uptime threshold is
  too short for some servers/systems. As a result, on such systems,
  DOR mode is never activated in real DOR events and the banner is
  not produced.
  This update raises that threshold to 20 minutes and makes it
  configurable through a dor_mode_detect label in mtc.conf.
- Added a DOR recovery count summary: x of y nodes recovered during DOR
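
The two fixes above can be sketched as follows. This is an illustration
under assumptions: the "agent" section name, the use of seconds as the
unit for dor_mode_detect, the host names, and the choice of which states
count as recovered are hypothetical and not confirmed by this message.

```python
# Illustrative sketch only; not the mtcAgent implementation.
import configparser

DEFAULT_DOR_MODE_DETECT_SECS = 20 * 60  # new default stated in this update


def load_dor_mode_detect(conf_text, section="agent"):
    """Read dor_mode_detect from INI-style text.

    Assumptions: the section name and the unit (seconds) are hypothetical.
    Falls back to the default when the section or label is missing.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint(section, "dor_mode_detect",
                         fallback=DEFAULT_DOR_MODE_DETECT_SECS)


def dor_recovery_summary(host_states, recovered_states=("ENABLED",)):
    """Build an 'x of y nodes recovered' line from a host -> state mapping.

    Assumption: only the states in recovered_states count as recovered.
    """
    recovered = sum(1 for s in host_states.values() if s in recovered_states)
    return f"{recovered} of {len(host_states)} nodes recovered during DOR"


# Hypothetical example data
print(dor_recovery_summary({"controller-0": "ENABLED", "worker-0": "FAILED"}))
```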

Test Plan:

PASS: Install AIO DX plus 3 worker nodes system

Verify DOR recovery banner after the following DOR recovery conditions

PASS: - all nodes recover enabled    ; DOR Recovery Banner all ENABLED
PASS: - node that recovers degraded  ; DOR Recovery Banner DEGRADED
PASS: - node that gracefully recovers; DOR Recovery Banner ENABLED
PASS: - node that fails heartbeat    ; DOR Recovery Banner FAILED
PASS: - node that fails goenabled    ; DOR Recovery Banner FAILED
PASS: - node that fails config error ; DOR Recovery Banner FAILED
PASS: - node that fails host services; DOR Recovery Banner FAILED
PASS: - node that never comes online ; DOR Recovery Banner OFFLINE
PASS: - node in auto recovery disable; DOR Recovery Banner DISABLE

Combination Cases:

PASS: - all worker nodes do not recover online; with and without MNFA.
PASS: - when one controller powers up 90 seconds after the other,
        one compute is unlocked-disabled-offline while the other
        computes are powered up 4 minutes after initial controller.
PASS: - when both controllers reboot but computes don't ; no power loss
PASS: - when one worker experiences a double config error while another
        worker experiences UNHEALTHY error.
PASS: - when one controller never recovers online.
PASS: - when only one controller of a 2+3 node set recovers online.
PASS: - when worker nodes come up well before the controller nodes.
PASS: - when one controller and one worker are locked.

Regression:

PASS: Verify locked nodes do not show up in the banner.
PASS: Verify heartbeat loss handling due to spontaneous reboot.
PASS: Verify mtcAgent and mtcClient logging for all above test cases.
PASS: Verify uptimes reported in the DOR Recovery banner.
PASS: Verify results of overnight DOR soak ; 30+ DORs.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/942744

Closes-Bug: 2100667
Change-Id: I2f28dd1fd6e8544b9cda9dedda2023b6f76ceeda
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-04-03 15:23:05 +00:00