This update affects only locked nodes.
If a remote node fails early configuration in a way that prevents
IPSec over the management network from being established, and no
cluster interface is configured or provisioned, then Node Locked
commands sent by mtcAgent over the management and cluster networks
are never received by mtcClient.
This leads to a perpetual watchdog reset loop. The pmon process fails
to reach the configured state, and without the presence of
the .node_locked file, the watchdog treats the node as unlocked.
A quorum failure triggers a crashdump reset, repeating indefinitely.
The mtcAgent detects this and attempts corrective action by resending
the Node Locked command over the same failing networks, which also
fails.
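As an illustration only, the flag file gate described above can be
sketched as follows. The path matches the one noted later in this
message, but the helper names are hypothetical and not the actual
host watchdog code:

    #include <sys/stat.h>
    #include <cstdio>

    static const char *NODE_LOCKED_FLAG = "/var/run/.node_locked";

    /* locked state is inferred from the presence of the flag file */
    static bool node_is_locked(void)
    {
        struct stat st;
        return (stat(NODE_LOCKED_FLAG, &st) == 0);
    }

    int main(void)
    {
        if (node_is_locked())
            printf("quorum failure on locked node ; no crashdump\n");
        else
            printf("quorum failure on unlocked node ; crashdump reset\n");
        return 0;
    }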
This update adds a fallback: the Node Locked command is also sent
over the pxeboot network.
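The sketch below shows the fallback idea only; the enum and send
helper are hypothetical and not the mtcAgent transport API. The
command is attempted on each network, pxeboot included:

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    enum class Network { Mgmt, Cluster, Pxeboot };

    /* hypothetical transport helper ; returns true if the Node
     * Locked command was sent on the given network */
    static bool send_node_locked(Network net)
    {
        return (net == Network::Pxeboot); /* mgmt/cluster failing */
    }

    int main(void)
    {
        const std::vector<std::pair<Network, std::string>> nets = {
            { Network::Mgmt,    "management" },
            { Network::Cluster, "cluster"    },
            { Network::Pxeboot, "pxeboot"    },
        };
        for (const auto &n : nets)
            printf("Node Locked over %s : %s\n", n.second.c_str(),
                   send_node_locked(n.first) ? "sent" : "failed");
        return 0;
    }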
Testing also revealed that mtcClient socket recovery stops at the
first socket failure rather than trying to recover them all.
This update improves socket recovery by attempting all sockets in
order: the pxeboot socket is tried first, followed by the management
and cluster sockets.
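A minimal sketch of that recovery ordering, assuming an illustrative
per-socket helper rather than the real mtcClient routines:

    #include <cstdio>
    #include <string>
    #include <vector>

    /* illustrative per-socket (re)initialization ; returns true
     * when the socket is recovered */
    static bool recover_socket(const std::string &name)
    {
        printf("recovering %s socket\n", name.c_str());
        return true;
    }

    int main(void)
    {
        const std::vector<std::string> sockets =
            { "pxeboot", "management", "cluster" };

        bool all_ok = true;
        for (const auto &s : sockets)
        {
            /* keep going even if one socket fails so the remaining
             * sockets still get a recovery attempt */
            if (!recover_socket(s))
                all_ok = false;
        }
        printf("socket recovery %s\n",
               all_ok ? "complete" : "incomplete");
        return 0;
    }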
Test Plan:
PASS: Verify mtcClient socket init and failure recovery handling.
PASS: Verify the mtcAgent sends the Node Locked command on the
pxeboot network when it sees a node locked state mismatch.
PASS: Verify a locked node with failing management and cluster
      networking gets the Node Locked command serviced and the
      node locked file created as expected on the remote node.
This event is noted by the following host-specific mtcAgent log:
"hostname mtcAlive reporting unlocked while locked ; correcting"
Note: Before this update the above 'correcting' log appeared
      every 5 seconds. With this update the log appears only
      once and the remote node does not enter a perpetual
      crashdump loop.
Note: The host watchdog will not force a quorum failure
      crashdump if the /var/run/.node_locked file is present.
Closes-Bug: 2103863
Change-Id: I020c7ebe1e83254c52219546ec938f6cf3284c2e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>