Eric MacDonald cbcb19420c Send Node Locked command on pxeboot network as corrective action.
This update affects only locked nodes.

If a remote node fails early config in a way that prevents IPSec
over management from being established, and no cluster interface
is configured or provisioned, then Node Locked commands sent from
mtcAgent over management and cluster networks are not received by
mtcClient.

This leads to a perpetual watchdog reset loop. The pmon process fails
to reach the configured state, and without the presence of
the .node_locked file, the watchdog treats the node as unlocked.
A quorum failure triggers a crashdump reset, repeating indefinitely.
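
The watchdog decision described above can be sketched as follows. This is
only an illustration, not the hostw source: WatchdogAction,
node_locked_file_present() and on_quorum_failure() are hypothetical names,
and the /var/run/.node_locked path is taken from the test notes below.

```cpp
#include <fstream>
#include <string>

// Hypothetical outcome of a quorum failure seen by the host watchdog.
enum class WatchdogAction { Ignore, CrashdumpReset };

// Returns true if the node locked flag file can be opened.
// The default path is an assumption based on this commit message.
bool node_locked_file_present(const std::string& path = "/var/run/.node_locked") {
    std::ifstream f(path.c_str());
    return f.good();
}

// A locked node (flag file present) is left alone; a node that looks
// unlocked (flag file missing) gets a crashdump reset.
WatchdogAction on_quorum_failure(bool locked_file_present) {
    return locked_file_present ? WatchdogAction::Ignore
                               : WatchdogAction::CrashdumpReset;
}
```

If the Node Locked command never arrives, the flag file is never written,
so every quorum failure takes the CrashdumpReset branch.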

The mtcAgent detects this and attempts corrective action by resending
the Node Locked command over the same failing networks, which also
fails.

This update adds a fallback: the Node Locked command is also sent
over the pxeboot network.
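
A minimal sketch of the fallback idea, assuming each network path can be
modelled as a named send callback. Network and send_on_all() are
hypothetical names for illustration and are not the mtcAgent API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical model of one network path from mtcAgent to mtcClient.
struct Network {
    std::string name;
    std::function<bool(const std::string&)> send;  // true if the send succeeded
};

// Send the command on every network rather than stopping at the first
// failure, and return the names of the networks where the send succeeded.
std::vector<std::string> send_on_all(const std::vector<Network>& nets,
                                     const std::string& cmd) {
    std::vector<std::string> delivered;
    for (const auto& n : nets) {
        if (n.send && n.send(cmd)) {
            delivered.push_back(n.name);
        }
    }
    return delivered;
}
```

With management and cluster sends failing, a list that also contains a
pxeboot entry still gets the command delivered once.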

Testing also revealed that mtcClient socket recovery stops at the
first socket failure rather than trying to recover them all.

This update improves socket recovery by attempting all sockets in
order. The pxeboot socket is tried first, now followed by management
and cluster sockets.
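
The difference between the two recovery strategies can be sketched as
below. This is an illustrative model only: Socket, its reopen callback and
both function names are hypothetical and do not come from the mtcClient
source.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical model of one mtcClient socket.
struct Socket {
    std::string name;
    bool ok;                       // true if the socket is currently usable
    std::function<bool()> reopen;  // returns true if the reopen succeeded
};

// Behaviour described as the problem: give up at the first socket that
// fails to recover, leaving later sockets untouched.
int recover_stop_at_first_failure(std::vector<Socket>& socks) {
    int recovered = 0;
    for (auto& s : socks) {
        if (s.ok) continue;
        if (!s.reopen()) break;
        s.ok = true;
        ++recovered;
    }
    return recovered;
}

// Behaviour described as the fix: attempt every failed socket in list
// order and return how many are still down afterwards.
int recover_all(std::vector<Socket>& socks) {
    int still_failed = 0;
    for (auto& s : socks) {
        if (s.ok) continue;
        s.ok = s.reopen();
        if (!s.ok) ++still_failed;
    }
    return still_failed;
}
```

Ordering the list as pxeboot, management, cluster mirrors the order given
above; a pxeboot failure no longer prevents the other two from recovering.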

Test Plan:

PASS: Verify mtcClient socket init and failure recovery handling.
PASS: Verify the mtcAgent sends the Node Locked command on the
      pxeboot network when it sees a node locked state mismatch.
PASS: Verify a locked node with failing management and cluster
      networking will get the node locked command serviced and
      node locked file produced as expected on the remote node.
      This event is noted by the following host-specific mtcAgent log.

      "hostname mtcAlive reporting unlocked while locked ; correcting"

      Note: Before this update the above 'correcting' log appears
            every 5 seconds. With this update the log appears only
            once and the remote node does not go into a perpetual
            crashdump loop.

      Note: The host watchdog will not force a quorum failure
            crashdump if the /var/run/.node_locked file is present.

Closes-Bug: 2103863
Change-Id: I020c7ebe1e83254c52219546ec938f6cf3284c2e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-03-25 12:48:10 +00:00

metal

The starlingx/metal repository handles StarlingX Bare Metal Management [1].

This repository is not intended to be developed standalone, but rather as part of the StarlingX Source System, which is defined by the StarlingX manifest [2].

References


  1. https://docs.starlingx.io/api-ref/metal

  2. https://opendev.org/starlingx/manifest.git

Description
StarlingX Bare Metal and Node Management, Hardware Maintenance
Languages
C++ 83%
Shell 10.2%
Python 3.3%
C 2.5%
Makefile 1%