
A power outage of a running system is referred to by StarlingX as a
DOR (Dead Office Recovery) event: some or all of the servers of a
system lose power, then automatically regain power and reboot once
the outage is resolved.

The Maintenance Agent (mtcAgent) detects a DOR event when it starts
up on a controller whose uptime is less than 15 minutes, and it then
remains in a DOR mode active state for up to 20 minutes.

Servers of different makes, models and vintages can recover from a
power outage at different rates; some take longer than others.
Therefore, while in DOR mode, maintenance is more forgiving towards
node recovery.

The existing implementation of DOR handling produces what is called
a "DOR Recovery" banner in its log files. The logs that make up this
banner are produced at the times when maintenance detects each
host's recovery.

The banner can be displayed with the following command:

    cat /var/log/mtcAgent.log | grep "DOR Recovery"

See DOR banner as comment in this review.

The issue leading to this update was that hosts that experienced no
heartbeat failure over the DOR recovery were not included in the DOR
Recovery banner. The banner contains key performance indicators
(KPIs) for DOR recovery and is much more useful when all hosts are
included in it.

This update adds a heartbeat soak to the maintenance add handler as
a means to affirmatively detect and report successful DOR recoveries.

The following additional fixes were implemented:

- The hard coded 15 minute (900 second) DOR uptime threshold is too
  short for some servers/systems. On such systems DOR mode is never
  activated in real DOR events, so the banner is not produced.
  This update changes that threshold to 20 minutes and makes it
  configurable through a dor_mode_detect label in mtc.conf.

- Added a DOR recovery count summary; x of y nodes recovered during
  DOR.

Test Plan:

PASS: Install AIO DX plus 3 worker nodes system

Verify DOR recovery banner after the following DOR recovery conditions

PASS: - all nodes recover enabled    ; DOR Recovery Banner all ENABLED
PASS: - node that recovers degraded  ; DOR Recovery Banner DEGRADED
PASS: - node that gracefully recovers; DOR Recovery Banner ENABLED
PASS: - node that fails heartbeat    ; DOR Recovery Banner FAILED
PASS: - node that fails goenabled    ; DOR Recovery Banner FAILED
PASS: - node that fails config error ; DOR Recovery Banner FAILED
PASS: - node that fails host services; DOR Recovery Banner FAILED
PASS: - node that never comes online ; DOR Recovery Banner OFFLINE
PASS: - node in auto recovery disable; DOR Recovery Banner DISABLE

Combination Cases:

PASS: - all worker nodes do not recover online; with and without MNFA.
PASS: - when one controller powers up 90 seconds after the other,
        one compute is unlocked-disabled-offline while the other
        computes are powered up 4 minutes after initial controller.
PASS: - when both controllers reboot but computes don't ; no power loss
PASS: - when one worker experiences a double config error while
        another worker experiences UNHEALTHY error.
PASS: - when one controller never recovers online.
PASS: - when only one controller of a 2+3 node set recovers online.
PASS: - when worker nodes come up well before the controller nodes.
PASS: - when one controller and 1 worker is locked.

Regression:

PASS: Verify locked nodes do not show up in the banner.
PASS: Verify heartbeat loss handling due to spontaneous reboot.
PASS: Verify mtcAgent and mtcClient logging for all above test cases.
PASS: Verify uptimes reported in the DOR Recovery banner.
PASS: Verify results of overnight DOR soak ; 30+ DORs.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/942744
Closes-Bug: 2100667
Change-Id: I2f28dd1fd6e8544b9cda9dedda2023b6f76ceeda
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
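
The DOR mode detection described in the commit message above can be
sketched roughly as follows. This is an illustration only, not the
mtcAgent code: the names DorConfig, dor_mode_detected and
dor_mode_active are hypothetical, and expressing the dor_mode_detect
value in seconds is an assumption. Only the 20 minute figures and the
existence of a configurable detect threshold come from the message.

```cpp
// Hypothetical holder for the value read from the dor_mode_detect
// label in mtc.conf. Storing it in seconds is an assumption.
struct DorConfig {
    unsigned int dor_mode_detect_secs = 1200;  // 20 minute default
};

// DOR mode is entered when the agent starts on a controller whose
// uptime is below the configured detect threshold.
bool dor_mode_detected(unsigned int controller_uptime_secs,
                       const DorConfig& cfg) {
    return controller_uptime_secs < cfg.dor_mode_detect_secs;
}

// Once entered, DOR mode stays active for up to 20 minutes.
// max_active_secs is a hypothetical parameter for that window.
bool dor_mode_active(unsigned int secs_since_detect,
                     unsigned int max_active_secs = 1200) {
    return secs_since_detect < max_active_secs;
}
```

With the default threshold, a controller that has been up for 1000
seconds when the agent starts would be treated as a DOR event, whereas
the previous 900 second threshold would have missed it.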