
Eric MacDonald 1daa670126 Improve DOR Recovery banner to include all hosts and their status

StarlingX refers to a power outage of a running system as a DOR
(Dead Office Recovery) event: some or all of the servers of a system
lose power, then automatically regain power and reboot after the
power outage is resolved.

The Maintenance Agent (mtcAgent) detects a DOR (Dead Office Recovery)
event when it starts up on a controller whose uptime is less than 15
minutes and remains in a DOR mode active state for up to 20 minutes.
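
The detection rule above can be sketched as two small predicates. This is
only an illustration: the function and parameter names are hypothetical and
are not taken from the mtce source, and measuring the mode timeout from the
moment of detection is an assumption.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical helper: a DOR event is assumed when the controller's uptime,
// sampled as mtcAgent starts, is below the detection threshold.
bool is_dor_event(seconds controller_uptime, seconds detect_threshold) {
    return controller_uptime < detect_threshold;
}

// Hypothetical helper: DOR mode stays active until the mode timeout expires.
// Assumption: the timeout is measured from the moment DOR was detected.
bool dor_mode_active(bool dor_detected, seconds since_detection,
                     seconds mode_timeout) {
    return dor_detected && since_detection < mode_timeout;
}
```

With the values described above, a controller that has been up for 5 minutes
when mtcAgent starts would satisfy is_dor_event(seconds(300), seconds(900)).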

Servers of different models, makes and vintages can recover from a
power outage at different rates; some take longer than others.
Therefore, while in DOR mode, maintenance is more forgiving towards
node recovery.

The existing implementation of DOR handling produces what is called a
"DOR Recovery" banner in its log files. The logs that make up this
banner are produced as maintenance detects each host's recovery.
The banner can be displayed with the following command:

cat /var/log/mtcAgent.log | grep "DOR Recovery"

See DOR banner as comment in this review.

The issue leading to this update was that hosts that experienced no
heartbeat failure over the DOR recovery were not included in the DOR
recovery banner. The DOR recovery banner contains key performance
indicators (KPIs) for DOR recovery and is much more useful if all
hosts are included in it.

This update adds a heartbeat soak to the maintenance add handler as a
means to affirmatively detect and report successful DOR recoveries.
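
The idea of a heartbeat soak can be sketched as below. This is not the
mtcAgent implementation: the class name, the result states and the rule that
a single missed heartbeat ends the soak as a failure are all assumptions made
for illustration.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical outcome of a soak; not the maintenance state names.
enum class SoakResult { Soaking, Recovered, Failed };

// Hypothetical soak tracker: a host is reported as recovered only after
// heartbeat has been healthy for the whole soak period.
class HeartbeatSoak {
public:
    explicit HeartbeatSoak(seconds soak_period) : soak_period_(soak_period) {}

    // Feed one heartbeat sample covering 'interval' of elapsed time.
    SoakResult sample(bool heartbeat_ok, seconds interval) {
        if (result_ != SoakResult::Soaking) return result_;  // outcome is final
        if (!heartbeat_ok) {
            result_ = SoakResult::Failed;      // assumed: any miss fails the soak
        } else {
            soaked_ += interval;
            if (soaked_ >= soak_period_) result_ = SoakResult::Recovered;
        }
        return result_;
    }

private:
    seconds soak_period_;
    seconds soaked_{0};
    SoakResult result_ = SoakResult::Soaking;
};
```

The point of the design is that a host with no heartbeat failure still
produces an explicit outcome when the soak completes, so it can be reported
rather than silently omitted.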

The following additional fixes were implemented:

- The hard-coded 15 minute (900 second) DOR uptime threshold is too
  short for some servers/systems. On such systems DOR mode is never
  activated during a real DOR event, so the banner is not produced.
  This update raises that threshold to 20 minutes and makes it
  configurable through a dor_mode_detect label in mtc.conf.
- Added a DOR recovery count summary: x of y nodes recovered during DOR.
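
As a rough sketch of how a configurable threshold with a 20 minute default
might be read, the function below looks up dor_mode_detect in "label = value"
text. The function name, the assumption that the value is expressed in
seconds and the comment characters handled are all hypothetical; the actual
mtc.conf parsing in mtce may differ.

```cpp
#include <exception>
#include <sstream>
#include <string>

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical lookup: returns the dor_mode_detect value (assumed seconds)
// or the fallback (1200 s = 20 minutes) if the label is absent or invalid.
int read_dor_mode_detect(const std::string& conf_text, int fallback_secs = 1200) {
    std::istringstream in(conf_text);
    std::string line;
    while (std::getline(in, line)) {
        const auto comment = line.find_first_of(";#");   // drop trailing comments
        if (comment != std::string::npos) line.erase(comment);
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;           // section header or blank
        if (trim(line.substr(0, eq)) != "dor_mode_detect") continue;
        try {
            const int value = std::stoi(line.substr(eq + 1));
            if (value > 0) return value;
        } catch (const std::exception&) {
            // Malformed value: fall through to the default.
        }
        return fallback_secs;
    }
    return fallback_secs;
}
```

For example, a line such as "dor_mode_detect = 1500" would yield 1500, while
a file without the label would yield the 1200 second default.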

Test Plan:

PASS: Install AIO DX plus 3 worker nodes system

Verify DOR recovery banner after the following DOR recovery conditions

PASS: - all nodes recover enabled    ; DOR Recovery Banner all ENABLED
PASS: - node that recovers degraded  ; DOR Recovery Banner DEGRADED
PASS: - node that gracefully recovers; DOR Recovery Banner ENABLED
PASS: - node that fails heartbeat    ; DOR Recovery Banner FAILED
PASS: - node that fails goenabled    ; DOR Recovery Banner FAILED
PASS: - node that fails config error ; DOR Recovery Banner FAILED
PASS: - node that fails host services; DOR Recovery Banner FAILED
PASS: - node that never comes online ; DOR Recovery Banner OFFLINE
PASS: - node in auto recovery disable; DOR Recovery Banner DISABLE

Combination Cases:

PASS: - all worker nodes do not recover online; with and without MNFA.
PASS: - when one controller powers up 90 seconds after the other,
        one compute is unlocked-disabled-offline while the other
        computes are powered up 4 minutes after initial controller.
PASS: - when both controllers reboot but computes don't ; no power loss
PASS: - when one worker experiences a double config error while another
        worker experiences UNHEALTHY error.
PASS: - when one controller never recovers online.
PASS: - when only one controller of a 2+3 node set recovers online.
PASS: - when worker nodes come up well before the controller nodes.
PASS: - when one controller and one worker are locked.

Regression:

PASS: Verify locked nodes do not show up in the banner.
PASS: Verify heartbeat loss handling due to spontaneous reboot.
PASS: Verify mtcAgent and mtcClient logging for all above test cases.
PASS: Verify uptimes reported in the DOR Recovery banner.
PASS: Verify results of overnight DOR soak ; 30+ DORs.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/942744

Closes-Bug: 2100667
Change-Id: I2f28dd1fd6e8544b9cda9dedda2023b6f76ceeda
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

2025-04-03 15:23:05 +00:00

api-ref/source

Fix reference to LLDP neighbors API path in the documentation

2024-02-22 16:32:47 -03:00

bsp-files

Upversion to 25.09

2025-01-28 13:32:10 +00:00

devstack

Security: Handle nospectre_v1 in the bootargs

2020-01-28 18:21:13 -05:00

doc

Fix tox-docs failing sphinx

2023-08-29 16:50:22 -04:00

installer

Update input_file in pxeboot-update script

2024-09-09 15:37:07 -03:00

kickstart

Merge "miniboot: Lock the root account during subcloud install"

2025-03-24 21:41:33 +00:00

mtce

Improve DOR Recovery banner to include all hosts and their status

2025-04-03 15:23:05 +00:00

mtce-common

Improve DOR Recovery banner to include all hosts and their status

2025-04-03 15:23:05 +00:00

mtce-compute

Remove CentOS/OpenSUSE build support

2024-05-02 16:01:04 -04:00

mtce-control

Remove CentOS/OpenSUSE build support

2024-05-02 16:01:04 -04:00

mtce-storage

Remove CentOS/OpenSUSE build support

2024-05-02 16:01:04 -04:00

releasenotes

Switch to newer openstackdocstheme and reno versions

2020-06-04 14:32:46 +02:00

tools

Remove CentOS/OpenSUSE build support

2024-05-02 16:01:04 -04:00

.gitignore

Update tox.ini files to use stein constraints

2019-06-25 13:20:35 -04:00

.gitreview

OpenDev Migration Patch

2019-04-19 19:52:33 +00:00

.zuul.yaml

Fix github mirroring for this repo

2023-04-28 12:38:51 -04:00

CONTRIBUTORS.wrs

StarlingX open source release updates

2018-05-31 07:36:43 -07:00

debian_build_layer.cfg

Add debian_build_layer.cfg file

2021-10-05 14:08:23 -04:00

debian_iso_image.inc

Debian: metal: update debian_iso_image.inc

2022-11-16 12:06:51 +08:00

debian_pkg_dirs

Include upgrades meta files to Debian ISO

2022-08-02 21:01:58 +00:00

debian_stable_docker_images.inc

debian: port rvmc docker image to Debian

2022-08-12 16:30:01 +00:00

LICENSE

StarlingX open source release updates

2018-05-31 07:36:43 -07:00

pylint.rc

Add pylint py3 portability checks for the metal repo

2021-09-13 11:57:42 -03:00

README.rst

starlingx/metal README improvement

2023-07-19 12:32:13 -03:00

test-requirements.txt

Removed wait_for_worker_config_init in AIO systems

2021-07-08 18:48:28 -04:00

tox.ini

Update tox.ini to work with tox 4

2022-12-26 23:26:54 +00:00

README.rst

metal

The starlingx/metal repository handles StarlingX Bare Metal Management¹.

This repository is not intended to be developed standalone, but rather as part of the StarlingX Source System, which is defined by the StarlingX manifest².

References

Languages

C++ 83%

Shell 10.2%

Python 3.3%

C 2.5%

Makefile 1%