Eric MacDonald 649e94c8da Add pxeboot mtcAlive messaging alarm handling
This update adds alarm handling to the recently introduced pxeboot
network mtcAlive messaging; see the first Depends-On review below.

A new 200.003 maintenance alarm is introduced by the second Depends-On
update below. This new alarm is MINOR but also Management Affecting
because the pxeboot network is required for node installation.

This update enhances the new pxeboot_mtcAlive_monitor FSM to detect
pxeboot mtcAlive message loss, raise the alarm, and then clear the
alarm once pxeboot mtcAlive messaging resumes.

The new alarm assertion and clear are debounced:
 - alarm is asserted once message loss accumulates to 12 missed
   messages or after 2 minutes of complete message loss.
 - alarm is cleared once the missed message counter decrements to
   zero or after 1 minute of loss-less messaging.
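
The debounce rules above can be sketched as a small counter-based
state update. This is a minimal illustration only, not the mtcAgent
implementation: the type, function and macro names are hypothetical,
and it assumes one update call per 10 second monitor period (the
period listed in the test plan below).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; values are taken from the commit message text. */
#define PXEBOOT_MONITOR_PERIOD_SECS  (10)   /* monitor period          */
#define PXEBOOT_MISS_THRESHOLD       (12)   /* accumulated misses      */
#define PXEBOOT_LOSS_ALARM_SECS      (120)  /* complete loss window    */
#define PXEBOOT_OK_CLEAR_SECS        (60)   /* loss-less clear window  */

typedef struct {
    int  missed;     /* accumulated missed messages (up/down counter) */
    int  loss_secs;  /* seconds of uninterrupted message loss         */
    int  ok_secs;    /* seconds of uninterrupted message receipt      */
    bool alarmed;    /* true while the alarm is asserted              */
} pxeboot_debounce_t;

/* Call once per monitor period; returns the resulting alarm state. */
bool pxeboot_debounce_update(pxeboot_debounce_t *d, bool received)
{
    assert(d != NULL);

    if (received) {
        d->loss_secs = 0;
        if (d->ok_secs < PXEBOOT_OK_CLEAR_SECS)
            d->ok_secs += PXEBOOT_MONITOR_PERIOD_SECS;
        if (d->missed > 0)
            d->missed--;

        /* clear on counter reaching zero or 1 minute without loss */
        if (d->alarmed &&
            (d->missed == 0 || d->ok_secs >= PXEBOOT_OK_CLEAR_SECS)) {
            d->alarmed = false;
            d->missed = 0;
        }
    } else {
        d->ok_secs = 0;
        if (d->loss_secs < PXEBOOT_LOSS_ALARM_SECS)
            d->loss_secs += PXEBOOT_MONITOR_PERIOD_SECS;
        if (d->missed < PXEBOOT_MISS_THRESHOLD)
            d->missed++;

        /* assert on 12 accumulated misses or 2 minutes of total loss */
        if (!d->alarmed &&
            (d->missed >= PXEBOOT_MISS_THRESHOLD ||
             d->loss_secs >= PXEBOOT_LOSS_ALARM_SECS))
            d->alarmed = true;
    }
    return d->alarmed;
}
```

With a 10 second period, 12 consecutive misses and 2 minutes of complete
loss coincide; the accumulation counter matters for intermittent loss,
where misses outnumber received messages over time.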

Upgrades are supported with the addition of a features list to the
mtcClient ready event. All new mtcClients that support pxeboot network
messaging now publish pxeboot mtcAlive support through this new
features list. This is rendered in the logs like this:

    <hostname> mtcClient ready ; with pxeboot mtcAlive support

The mtcAgent does not expect/monitor pxeboot mtcAlive messages from
hosts that don't publish the feature support.
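
The feature gating described above could look roughly like the
following. This is a sketch under stated assumptions: the encoding of
the features list is not given in this commit message, so a
comma-separated string is assumed here, and the feature name and
function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature name; the real token is not shown above. */
#define MTC_FEATURE__PXEBOOT_MTCALIVE "pxeboot_mtcalive"

/* Return true if 'name' appears as a whole token in the assumed
 * comma-separated 'features' string from the ready event. */
bool mtc_feature_list_has(const char *features, const char *name)
{
    size_t name_len;
    const char *p;

    if (features == NULL || name == NULL || *name == '\0')
        return false;

    name_len = strlen(name);
    p = features;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t tok_len = end ? (size_t)(end - p) : strlen(p);

        /* exact token match only; prefixes do not count */
        if (tok_len == name_len && strncmp(p, name, name_len) == 0)
            return true;
        if (end == NULL)
            break;
        p = end + 1;
    }
    return false;
}
```

An agent following this pattern would start pxeboot mtcAlive monitoring
for a host only when the check returns true, so older mtcClients that
send no features list are never alarmed.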

Test Plan:

PASS: Verify mtcAlive period is 5 seconds.
PASS: Verify pxeboot mtcAlive monitor period is 10 seconds.
PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every
      mtcAlive monitor miss.
PASS: Verify pxeboot mtcAlive alarm is not raised while a node is
      locked.

Alarm attributes:

PASS: Verify severity is minor.
PASS: Verify alarm is cleared while node is locked.
PASS: Verify alarm can be suppressed while unlocked.
PASS: Verify asserted alarm is management affecting.
PASS: Verify alarm-show output format including cause and repair
      action text.

Process Restart Handling:

PASS: Verify alarm is maintained over a mtcAgent process restart.
PASS: Verify pxeboot monitoring resumes with or without asserted alarm
      immediately following a mtcAgent process restart.
PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging
      immediately following mtcClient process restart for locked or
      unlocked nodes.

Alarm Debounce Handling:

PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss.
PASS: Verify alarm clear after 1 minute of mtcAlive recovery.
PASS: Verify assertion and recovery debounce logging.
PASS: Verify alarm management miss and loss controls handle all
      boundary conditions exercised by a 12 hr soak with randomized
      period between message loss and recovery.

Host Action Handling:

PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable.
PASS: Verify mtcAlive alarm is not raised over a Host Graceful Recovery.
PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On.
PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset.
PASS: Verify mtcAlive alarm is not raised over a Host Reinstall.
PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling.
PASS: Verify pxeboot alarm handling for node that does not send
      pxeboot mtcAlive after unlock.

Stuck Alarm Avoidance Handling:

PASS: Verify typical alarm assertion and clear handling.
PASS: Verify alarm is maintained or cleared over node reboot if the
      messaging issue persists or resolves over the reboot recovery.
PASS: Verify mtcAlive alarm is maintained over a Swact and cleared
      if the messaging is ok on the newly active controller.
PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled
      Swact due to active controller reboot.
PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot
      messaging recovers over that reboot.

Upgrades Case:

PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients
      that actually support pxeboot network mtcAlive monitoring.

PASS: Verify parsing of the new mtcClient features list, which
      enables pxeboot mtcAlive monitoring for that node.

PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled
      towards nodes whose mtcClient does not publish pxeboot mtcAlive
      messaging feature support.
PROG: Verify AIO DX upgrade from 22.12 to current master branch.
      Focus on pxeboot messaging over the upgrade process.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654
Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660
Story: 2010940
Task: 49542
Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-04-09 14:13:23 +00:00


#ifndef __INCLUDE_FITCODES_H__
#define __INCLUDE_FITCODES_H__
/*
* Copyright (c) 2013, 2016, 2024 Wind River Systems, Inc.
*
* SPDX-License-Identifier: Apache-2.0
*
*/
/**
* @file
* Wind River CGTS Platform Common Fault Insertion Code Definitions
*/
/*************************************************************************************
*
* These definitions are used for fault insertion testing.
*
* Here are examples of how they are used,
*
* - touch the 'no_reboot' file on the mtcClient to cause it to
* service the reboot request without actually rebooting
*
* - touch the 'no_mgmnt_ack' file on the mtcClient to cause
* it to handle command requests but drop/not send the ack message
* if it came in on the management network ; same for cluster-host
*
* - touch the 'no_mtcalive' file to tell mtcClient to stop sending
* its mtcAlive messages while this file is present.
*
**************************************************************************************/
/**
* This is the Fault Insertion Dir - Code that looks for multiple fit files need not
* bother if the dir is not present
**/
#define MTC_CMD_FIT__DIR ("/var/run/fit")
#define MTC_CMD_FIT__NO_REBOOT ("/var/run/fit/no_reboot") /* mtcClient */
#define MTC_CMD_FIT__NO_RESET ("/var/run/fit/no_reset") /* mtcClient */
#define MTC_CMD_FIT__NO_WIPEDISK ("/var/run/fit/no_wipedisk") /* mtcClient */
#define MTC_CMD_FIT__NO_MGMNT_ACK ("/var/run/fit/no_mgmnt_ack") /* mtcClient */
#define MTC_CMD_FIT__NO_CLSTR_ACK ("/var/run/fit/no_clstr_ack") /* mtcClient */
#define MTC_CMD_FIT__NO_MTCALIVE ("/var/run/fit/no_mtcalive") /* mtcClient */
#define MTC_CMD_FIT__PXEBOOT_RXSOCK ("/var/run/fit/pxeboot_rxsock") /* mtcClient */
#define MTC_CMD_FIT__PXEBOOT_TXSOCK ("/var/run/fit/pxeboot_txsock") /* mtcClient */
#define MTC_CMD_FIT__MGMNT_RXSOCK ("/var/run/fit/mgmnt_rxsock") /* mtcClient */
#define MTC_CMD_FIT__MGMNT_TXSOCK ("/var/run/fit/mgmnt_txsock") /* mtcClient */
#define MTC_CMD_FIT__CLSTR_RXSOCK ("/var/run/fit/clstr_rxsock") /* mtcClient */
#define MTC_CMD_FIT__CLSTR_TXSOCK ("/var/run/fit/clstr_txsock") /* mtcClient */
#define MTC_CMD_FIT__AMON_SOCK ("/var/run/fit/amon_sock") /* mtcClient */
#define MTC_CMD_FIT__NO_CLSTR_RSP ("/var/run/fit/no_clstr_rsp") /* hbsClient */
#define MTC_CMD_FIT__NO_MGMNT_RSP ("/var/run/fit/no_mgmnt_rsp") /* hbsClient */
#define MTC_CMD_FIT__LINKLIST ("/var/run/fit/linklist") /* hbsAgent */
#define MTC_CMD_FIT__HBSSILENT ("/var/run/fit/hbs_silent_fault") /* hbsAgent */
#define MTC_CMD_FIT__SENSOR_DATA ("/var/run/fit/sensor_data") /* hwmond */
#define MTC_CMD_FIT__INLINE_CREDS ("/var/run/fit/inline_creds") /* mtcAgent */
#define MTC_CMD_FIT__POWER_CMD ("/var/run/fit/power_cmd_result") /* mtcAgent */
#define MTC_CMD_FIT__ROOT_QUERY ("/var/run/fit/root_query") /* mtcAgent */
#define MTC_CMD_FIT__MC_INFO ("/var/run/fit/mc_info") /* mtcAgent */
#define MTC_CMD_FIT__POWER_STATUS ("/var/run/fit/power_status") /* mtcAgent */
#define MTC_CMD_FIT__RESTART_CAUSE ("/var/run/fit/restart_cause") /* mtcAgent */
#define MTC_CMD_FIT__UPTIME ("/var/run/fit/uptime") /* mtcAgent */
#define MTC_CMD_FIT__LOUD_BM_PW ("/var/run/fit/loud_bm_pw") /* mtcAgent & hwmond */
#define MTC_CMD_FIT__START_SVCS ("/var/run/fit/host_services") /* mtcClient */
#define MTC_CMD_FIT__NO_HS_ACK ("/var/run/fit/no_hs_ack") /* mtcClient */
#define MTC_CMD_FIT__GOENABLE_AUDIT ("/var/run/fit/goenable_audit") /* mtcAgent */
#define MTC_CMD_FIT__JSON_LEAK_SOAK ("/var/run/fit/json_leak_soak") /* mtcAgent */
#define MTC_CMD_FIT__BMC_ACC_FAIL ("/var/run/fit/bmc_access_fail")/* mtcAgent */
#define MTC_CMD_FIT__MEM_LEAK_DEBUG ("/var/run/fit/mem_leak_debug")/* mtcAgent */
#define MTC_CMD_FIT__FM_ERROR_CODE ("/var/run/fit/fm_error_code") /* mtcAgent */
/*****************************************************
* Fault Insertion Codes
*****************************************************/
/*****************************************************************************
*
* the /var/run/fit/fitinfo file uses the following format,
* - code and process are required
* - other fields are optional
* - no spaces, exclude <>
*
* proc=<process shortname>
* code=<decimal number>
* host=<hostname>
* name=<some string>
* data=<some string>
*
*****************************************************************************/
/*********************** Common FIT Codes **********************************/
#define FIT_CODE__NONE (0)
#define FIT_CODE__CORRUPT_TOKEN (1)
#define FIT_CODE__ADD_DELETE (2)
#define FIT_CODE__STUCK_TASK (3)
#define FIT_CODE__AVOID_N_FAIL_BMC_REQUEST (4)
#define FIT_CODE__THREAD_TIMEOUT (5)
#define FIT_CODE__THREAD_SEGFAULT (6)
#define FIT_CODE__SIGNAL_NOEXIT (7)
#define FIT_CODE__STRESS_THREAD (8)
#define FIT_CODE__DO_NOTHING_THREAD (9)
#define FIT_CODE__EMPTY_BM_PASSWORD (10)
#define FIT_CODE__INVALIDATE_MGMNT_IP (11)
#define FIT_CODE__INVALIDATE_CLSTR_IP (12)
#define FIT_CODE__WORK_QUEUE (13)
#define FIT_CODE__NO_READY_EVENT (14)
#define FIT_CODE__NO_PULSE_REQUEST (15)
#define FIT_CODE__NO_PULSE_RESPONSE (16)
#define FIT_CODE__TOKEN (17)
#define FIT_CODE__FAST_PING_AUDIT_HOST (20)
#define FIT_CODE__FAST_PING_AUDIT_ALL (21)
#define FIT_CODE__TRANSLATE_LOCK_TO_FORCELOCK (30)
#define FIT_CODE__LOCK_HOST (31)
#define FIT_CODE__FORCE_LOCK_HOST (32)
#define FIT_CODE__UNLOCK_HOST (33)
#define FIT_CODE__FAIL_SWACT (34)
#define FIT_CODE__FAIL_PXEBOOT_MTCALIVE (35)
#define FIT_CODE__FM_SET_ALARM (40)
#define FIT_CODE__FM_GET_ALARM (41)
#define FIT_CODE__FM_CLR_ALARM (42)
#define FIT_CODE__FM_QRY_ALARMS (43)
#define FIT_CODE__BMC_COMMAND_SEND (60)
#define FIT_CODE__BMC_COMMAND_RECV (61)
#define FIT_CODE__START_HOST_SERVICES (70)
#define FIT_CODE__STOP_HOST_SERVICES (71)
#define FIT_CODE__SOCKET_SETUP (72)
#define FIT_CODE__READ_JSON_FROM_FILE (73)
#define FIT_CODE__HTTP_WORKQUEUE_OPERATION_FAILED (75)
#define FIT_CODE__HTTP_WORKQUEUE_REQUEST_TIMEOUT (76)
#define FIT_CODE__HTTP_WORKQUEUE_CONNECTION_LOSS (77)
/***************** Process Fit Codes ********************************/
/* Hardware Monitor FIT Codes */
#define FIT_CODE__HWMON__CORRUPT_TOKEN (101)
#define FIT_CODE__HWMON__AVOID_TOKEN_REFRESH (102)
#define FIT_CODE__HWMON__THREAD_TIMEOUT (103)
#define FIT_CODE__HWMON__AVOID_SENSOR_QUERY (104)
#define FIT_CODE__HWMON__SENSOR_STATUS (105)
#define FIT_CODE__HWMON__STARTUP_STATES_FAILURE (106)
#define FIT_CODE__HWMON__HTTP_LOAD_SENSORS (120)
#define FIT_CODE__HWMON__HTTP_ADD_SENSOR (121)
#define FIT_CODE__HWMON__HTTP_DEL_SENSOR (122)
#define FIT_CODE__HWMON__HTTP_MOD_SENSOR (123)
#define FIT_CODE__HWMON__ADD_SENSOR (130)
#define FIT_CODE__HWMON__BAD_SENSOR (131)
#define FIT_CODE__HWMON__GET_SENSOR (132)
#define FIT_CODE__HWMON__CREATE_ORPHAN_SENSOR_ALARM (136)
#define FIT_CODE__HWMON__HTTP_LOAD_GROUPS (140)
#define FIT_CODE__HWMON__HTTP_ADD_GROUP (141)
#define FIT_CODE__HWMON__HTTP_DEL_GROUP (142)
#define FIT_CODE__HWMON__HTTP_MOD_GROUP (143)
#define FIT_CODE__HWMON__HTTP_GROUP_SENSORS (144)
#define FIT_CODE__HWMON__ADD_GROUP (150)
#define FIT_CODE__HWMON__BAD_GROUP (151)
#define FIT_CODE__HWMON__GET_GROUP (152)
#define FIT_CODE__HWMON__CREATE_ORPHAN_GROUP_ALARM (156)
#define FIT_CODE__HWMON__NO_DATA (160)
#define FIT_CODE__HWMON__RAISE_SENSOR_ALARM (170)
#define FIT_CODE__HWMON__CLEAR_SENSOR_ALARM (171)
#define FIT_CODE__HWMON__RAISE_GROUP_ALARM (172)
#define FIT_CODE__HWMON__CLEAR_GROUP_ALARM (173)
#define FIT_CODE__HWMON__SET_DB_SENSOR_STATUS (175)
#define FIT_CODE__HWMON__SET_DB_SENSOR_STATE (176)
#define FIT_CODE__HWMON__SET_DB_GROUP_STATUS (177)
#define FIT_CODE__HWMON__SET_DB_GROUP_STATE (178)
#define TESTMASK__MSG__MTCALIVE_STRESS (0x00000001)
#endif /* __INCLUDE_FITCODES_H__ */
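
The fitinfo format documented in the header comment (one key=value pair
per line, with proc and code required) can be parsed along these lines.
This is an illustrative sketch, not the parser used by the maintenance
daemons: the struct, the function names and the 64 byte field limit are
assumptions, and the input is taken from a string so the example stays
self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define FITINFO_FIELD_MAX (64)  /* assumed limit, for illustration */

typedef struct {
    char proc[FITINFO_FIELD_MAX];
    char host[FITINFO_FIELD_MAX];
    char name[FITINFO_FIELD_MAX];
    char data[FITINFO_FIELD_MAX];
    int  code;
    bool have_code;
} fitinfo_t;

/* Copy a value up to end of line, truncating to the field size. */
static void fitinfo_copy(char *dst, const char *src)
{
    size_t n = strcspn(src, "\r\n");
    if (n >= FITINFO_FIELD_MAX)
        n = FITINFO_FIELD_MAX - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Parse fitinfo text; returns true only if proc and code are present. */
bool fitinfo_parse(const char *text, fitinfo_t *info)
{
    assert(text != NULL && info != NULL);
    memset(info, 0, sizeof(*info));

    while (*text != '\0') {
        if (strncmp(text, "proc=", 5) == 0) {
            fitinfo_copy(info->proc, text + 5);
        } else if (strncmp(text, "code=", 5) == 0) {
            char *end = NULL;
            long value = strtol(text + 5, &end, 10);
            if (end != text + 5) {          /* at least one digit read */
                info->code = (int)value;
                info->have_code = true;
            }
        } else if (strncmp(text, "host=", 5) == 0) {
            fitinfo_copy(info->host, text + 5);
        } else if (strncmp(text, "name=", 5) == 0) {
            fitinfo_copy(info->name, text + 5);
        } else if (strncmp(text, "data=", 5) == 0) {
            fitinfo_copy(info->data, text + 5);
        }
        text += strcspn(text, "\n");        /* advance to end of line */
        if (*text == '\n')
            text++;
    }
    return info->proc[0] != '\0' && info->have_code;
}
```

For example, text containing a proc line and `code=35` would select
FIT_CODE__FAIL_PXEBOOT_MTCALIVE for that process; the process short
names themselves are not listed in this header.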