239 Commits

Author SHA1 Message Date
Eric MacDonald
d863aea172 Increase mtce host offline threshold to handle slow host shutdown
Mtce polls/queries the remote host for mtcAlive messages
for 42 intervals of 100 ms over unlock or host-failed cases.
Absence of mtcAlive during this (~5 sec) period indicates
the node is offline.

However, in the rare case where shutdown is slow, 5 seconds
is not long enough. Rare cases have been seen where a 7 or 8
second wait is required to properly declare the node offline.

To avoid the rare transient 200.004 host alarm over an
unlock operation, this update increases the mtce host
offline window from 5 to 10 seconds (approx) by modifying
the mtce configuration file offline threshold from 42 to 90.
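The change amounts to a single value in the mtce configuration file; a hypothetical sketch of the relevant entries (section and key names assumed, not verified against the actual mtc.ini):

```ini
; hypothetical excerpt of the mtce agent configuration (mtc.ini)
[agent]
offline_period    = 100   ; mtcAlive poll/query interval, in milliseconds
offline_threshold = 90    ; was 42; 90 x 100 ms plus query overhead ~= 10 sec
```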

Test Plan:

PASS: Verify the unchallenged failed-to-offline period is ~10 secs
PASS: Verify algorithm restarts if there is mtcAlive received
      anytime during the polls/queries (challenge) window.
PASS: Verify challenge handling leads to a longer but
      successful offline declaration.
PASS: Verify above handling for both unlock and spontaneous
      failure handling cases.

Closes-Bug: 2024249
Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-06-16 18:14:08 +00:00
Matheus Guilhermino
a0e270b51b Add mpath support to wipedisk script
The wipedisk script was not able to find the boot device
when using multipath disks, because multipath devices are
not listed under /dev/disk/by-path/.

To support multipath devices, the script should look
for the boot device under /dev/disk/by-id/ as well.
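The fallback described above can be sketched as a lookup that scans by-path first, then by-id (directory names from the commit; the helper and candidate-directory arguments are hypothetical):

```shell
#!/bin/sh
# Sketch: resolve a boot device symlink by scanning candidate directories
# in order, so multipath devices (only present under by-id) are still found.
find_boot_device() {
    target="$1"; shift   # kernel device name, e.g. dm-0 or sda
    for dir in "$@"; do  # candidate dirs, e.g. /dev/disk/by-path /dev/disk/by-id
        for link in "$dir"/*; do
            [ -e "$link" ] || continue
            if [ "$(basename "$(readlink -f "$link")")" = "$target" ]; then
                echo "$link"
                return 0
            fi
        done
    done
    return 1
}
```

In the real script the candidate directories would be /dev/disk/by-path and /dev/disk/by-id, in that order.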

Test Plan
PASS: Successfully run wipedisk on an AIO-SX with multipath
PASS: Successfully run wipedisk on an AIO-SX w/o multipath

Closes-bug: 2013391

Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
Change-Id: I3af76cd44f22795784a9184daf75c66fc1b9874f
2023-04-10 17:10:22 -03:00
Al Bailey
37c5910a62 Update mtce debian package ver based on git
Update debian package versions to use git commits for:
 - mtce         (old 9, new 30)
 - mtce-common  (old 1, new 9)
 - mtce-compute (old 3, new 4)
 - mtce-control (old 7, new 10)
 - mtce-storage (old 3, new 4)

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.
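The per-directory commit count that drives such a version bump can be obtained with git; a sketch of that step (the actual packaging hook is not shown in the commit message):

```shell
#!/bin/sh
# Sketch: derive a package revision from the number of commits that touch
# a given subdirectory, as the Debian packaging now does per directory.
pkg_rev_for_dir() {
    # $1 = repo path, $2 = subdirectory (e.g. mtce, mtce-common)
    git -C "$1" rev-list --count HEAD -- "$2"
}
```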

Test Plan:
  PASS: build-pkgs -p mtce
  PASS: build-pkgs -p mtce-common
  PASS: build-pkgs -p mtce-compute
  PASS: build-pkgs -p mtce-control
  PASS: build-pkgs -p mtce-storage

Story: 2010550
Task: 47401
Task: 47402
Task: 47403
Task: 47404
Task: 47405

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I4846804320b0ad3ec10799a468a9ee3bf7973587
2023-03-02 14:50:35 +00:00
Kyale, Eliud
502662a8a7 Cleanup mtcAgent error logging during startup
- reduced log level in http util to warning
- use inservice test handler to ensure state change notification
  is sent to vim
- reduce retry count from 3 to 1 for add_handler state_change
  vim notification

Test plan:
PASS - AIO-SX: ansible controller startup (race condition)
PASS - AIO-DX: ansible controller startup
PASS - AIO-DX: SWACT
PASS - AIO-DX: power off restart
PASS - AIO-DX: full ISO install
PASS - AIO-DX: Lock Host
PASS - AIO-DX: Unlock Host
PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller)

Story: 2010533
Task: 47338

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45
2023-02-14 14:18:02 -05:00
Christopher Souza
56ab793bc5 Change hostwd emergency log to write to /dev/kmsg
The hostwd emergency logs were written to /dev/console;
this change adds the prefix "hostwd:" to the log message
and writes to /dev/kmsg.
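The write itself is a plain file write to the kernel log device; a hedged sketch (the destination is parameterized so it can be exercised without root, and the exact message format is assumed):

```shell
#!/bin/sh
# Sketch: emit a prefixed emergency log line to the kernel log buffer.
# KMSG defaults to /dev/kmsg (requires root); override it for testing.
KMSG="${KMSG:-/dev/kmsg}"
hostwd_emergency_log() {
    printf 'hostwd: %s\n' "$*" >> "$KMSG"
}
```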

Test Plan:

Pass: AIO-SX and AIO DX full deployment.
Pass: kill pmond and wait for the emergency log to be written.
Pass: check if the emergency log was written to /dev/kmsg.
Pass: Verify logging for quorum report missing failure.
Pass: Verify logging for quorum process failure.
Pass: Verify emergency log crash dump logging to mesg and
      console logging for each of the 2 cases above with
      stressng overloading the server (CPU, FS and Memory);
      stress-ng --vm-bytes 4000000000 --vm-keep -m 30 -i 30 -c 30

Story: 2010533
Task: 47216

Co-authored-by: Eric MacDonald <eric.macdonald@windriver.com>
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Co-authored-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Signed-off-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Change-Id: I0da82f964dd096840259c4d0ed4e5f558debdf22
2023-02-01 23:41:14 +00:00
Eric MacDonald
a3cba57a1f Adapt Host Watchdog to use kdump-tools
The Debian package for kdump changed from kdump to kdump-tools

Test Plan:

PASS: Verify build and install AIO DX system
PASS: Verify host watchdog detects kdump as active in debian

Closes-Bug: 2001692
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie1ac29d3d29f3d9c843789cdedf85081fe790616
2023-01-04 12:57:19 -05:00
Robert Church
1796ed8740 Update wipedisk for LVM based rootfs
Now that the root filesystem is based on an LVM logical volume, discover
the root disk by searching for the boot partition.

Changes include:
 - remove detection of rootfs_part/rootfs and adjust rootfs related
   references with boot_disk.
 - run bashate on the script and resolve indentation and syntax related
   errors. Leave long-line errors alone for improved readability.
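Deriving boot_disk from the discovered boot partition is mostly string handling; a hedged sketch of that step (helper name hypothetical; the real script may also consult lsblk/udev):

```shell
#!/bin/sh
# Sketch: strip the partition suffix from a partition device node to get
# the underlying disk, covering sdX, nvme, and mmcblk naming schemes.
disk_of_partition() {
    part=$1
    case "$part" in
        *[0-9]p[0-9]*) echo "${part%p[0-9]*}" ;;  # nvme0n1p1 -> nvme0n1
        *)             echo "${part%%[0-9]*}" ;;  # sda1 -> sda
    esac
}
```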

Test Plan:
PASS - run 'wipedisk', answer prompts, and ensure all partitions are
       cleaned up except for the platform backup partition
PASS - run 'wipedisk --include-backup', answer prompts, and ensure all
       partitions are cleaned up
PASS - run 'wipedisk --include-backup --force' and ensure all partitions
       are cleaned up

Change-Id: I036ce745353b6a26bc2615ffc6e3b8955b4dd1ec
Closes-Bug: #1998204
Signed-off-by: Robert Church <robert.church@windriver.com>
2022-11-29 05:04:38 -06:00
Eric MacDonald
da398e0c5f Debian: Make Mtce offline handler more resilient to slow shutdowns
The current offline handler assumes the node is offline after
'offline_search_count' reaches 'offline_threshold' count
regardless of whether mtcAlive messages were received during
the search window.

The offline algorithm requires that no mtcAlive messages
be seen for the full offline_threshold count.

During a slow shutdown the mtcClient runs for longer than
it should and as a result can lead to maintenance seeing
the node as recovered before it should.

This update manages the offline search counter to ensure that
it only reaches the count threshold after seeing no mtcAlive
messages for the full search count. Any mtcAlive message seen
during the count triggers a count reset.
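The corrected counting rule can be modeled as a toy function: offline is declared only after the threshold number of consecutive silent polls (function name and inputs are hypothetical, not the actual mtce code):

```shell
#!/bin/sh
# Sketch: declare offline only after `threshold` consecutive polls with no
# mtcAlive; any mtcAlive seen during the count resets the counter.
offline_result() {
    threshold=$1; shift
    count=0
    for poll in "$@"; do          # each poll is "alive" or "silent"
        if [ "$poll" = alive ]; then
            count=0               # the fix: reset on any mtcAlive message
        else
            count=$((count + 1))
        fi
        [ "$count" -ge "$threshold" ] && { echo offline; return 0; }
    done
    echo online
}
```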

This update also
1. Adjusts the reset retry cadence from 7 to 12 secs
   to prevent unnecessary reboot thrash during
   the current shutdown.
2. Clears the hbsClient ready event at the start of the
   subfunction handler so the heartbeat soak is only
   started after seeing heartbeat client ready events
   that follow the main config.

Test Plan:

PASS: Debian and CentOS Build and DX install
PASS: Verify search count management
PASS: Verify issue does not occur over lock/unlock soak (100+)
      - where the same test without update did show issue.
PASS: Monitor alive logs for behavioral correctness
PASS: Verify recovery reset occurs after expected extended time.

Closes-Bug: 1993656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
2022-10-24 15:57:43 +00:00
Eric MacDonald
3f4c2cbb45 Mtce: Add ActionInfo extension support for reset operations.
StarlingX Maintenance supports host power and reset control through
both IPMI and Redfish Platform Management protocols when the host's
BMC (Baseboard Management Controller) is provisioned.

The power and reset action commands for Redfish are learned through
HTTP payload annotations at the Systems level: "/redfish/v1/Systems".

The existing maintenance implementation only supports the
"ResetType@Redfish.AllowableValues" payload property annotation at
the #ComputerSystem.Reset Actions property level.

However, the Redfish schema also supports an 'ActionInfo' extension
at /redfish/v1/Systems/1/ResetActionInfo.

This update adds support for the 'ActionInfo' extension for Reset
and power control command learning.

For more information refer to section 6.3, ActionInfo 1.3.0, of
the Redfish Data Model Specification linked in the launchpad report.
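For illustration, a toy extraction of the allowable reset values from an abridged, hypothetical ActionInfo payload (a real client such as redfishtool parses the JSON properly; the crude sed here is only to show what is being learned):

```shell
#!/bin/sh
# Sketch: pull the AllowableValues list for the ResetType parameter out of
# an ActionInfo-style payload. The payload is abridged and hypothetical.
payload='{"Parameters":[{"Name":"ResetType","AllowableValues":["On","ForceOff","GracefulRestart"]}]}'
allowable_reset_values() {
    # crude text extraction; a real implementation would use a JSON parser
    printf '%s\n' "$1" |
        sed -n 's/.*"AllowableValues":\[\([^]]*\)\].*/\1/p' |
        tr -d '"' | tr ',' ' '
}
```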

Test Plan:

PASS: Verify CentOS build and patch install.
PASS: Verify Debian build and ISO install.
PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0
PASS: Verify reset/power control cmd load from newly added second
      level query from ActionInfo service.

Failure Handling: Significant failure path testing with this update

PASS: Verify Redfish protocol is periodically retried from start
      when bm_type=redfish fails to connect.
PASS: Verify BMC access protocol defaults to IPMI when
      bm_type=dynamic but failed connect using redfish.
      Connection failures in the above cases include
      - redfish bmc root query fails
      - redfish bmc info query fails
      - redfish bmc load power/reset control actions fails
      - missing second level Parameters label list
      - missing second level AllowableValues label list
PASS: Verify sensor monitoring is relearned to ipmi from failed and
      retried with bm_type=redfish after switch to bm_type=dynamic
      or bm_type=ipmi by sysinv update command.

Regression:

PASS: Verify with CentOS redfishtool 1.1.0
PASS: Verify switch back and forth between ipmi and redfish using
      update bm_type=ipmi and bm_type=redfish commands
PASS: Verify switch from ipmi to redfish using bm_type=dynamic for
      hosts that support redfish
PASS: Verify redfish protocol is preferred in bm_type=dynamic mode
PASS: Verify IPMI sensor monitoring when bm_type=ipmi
PASS: Verify IPMI sensor monitoring when bm_type=dynamic
      and redfish connect fails.
PASS: Verify redfish sensor event assert/clear handling with
      alarm and degrade condition for both IPMI and redfish.
PASS: Verify reset/power command learn by single level query.
PASS: Verify mtcAgent.log logging

Closes-Bug: 1992286
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35
2022-10-13 17:40:05 +00:00
Zuul
8fd1bcbb97 Merge "Alarm Hostname controller function has in-service failure reported" 2022-10-06 20:47:06 +00:00
Al Bailey
dd5a24037d Fix bashate failure in zuul
This review allows this repo to pass zuul.

When tox is run locally it pulls in an older
bashate 0.6.0 but the zuul jobs are pulling in
the higher version.

Bashate 2.1.1 was released Oct 6, 2022.

Changed the upper constraints to allow developers
to pull in dependencies that are more aligned with zuul.

Fixed the new bashate error.
Also cleaned up the yamllint syntax.

Closes-Bug: 1991971
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I9cda349a20c63f9d222a3c3fc3645c5ceb4c2751
2022-10-06 17:22:12 +00:00
Girish Subramanya
86681b7598 Alarm Hostname controller function has in-service failure reported
When compute services remain healthy:
 - listing alarms shall not refer to the below Obsoleted alarm
 - 200.012 alarm hostname controller function has an in-service failure

This update deletes the definition of the obsoleted alarm:
200.012 is removed from the events.yaml file, along with any
references to this alarm definition.
A bug also needs to be raised to track the corresponding doc change.

Test Plan:
Verify on a Standard configuration that no alarms are listed for
the hostname controller in-service failure.
Code (removal) changes were exercised prior to ansible bootstrap
and host-unlock; verified no unexpected alarms.
Regression:
There is no need to test the alarm referred to here as it is obsolete.

Closes-Bug: 1991531

Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com>

Change-Id: I255af68155c5392ea42244b931516f742fa838c3
2022-10-05 10:30:01 -04:00
Zuul
6bcd8333b2 Merge "Debian: Remove conf files from etc-pmon.d" 2022-09-30 19:41:16 +00:00
Leonardo Fagundes Luz Serrano
d1c0d04719 Debian: Remove conf files from etc-pmon.d
Removed conf files from /etc/pmon.d/
as they are being moved to another location.

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

The change is debian-only, since centos support
will be dropped soon.
Centos' pmon conf files remain in /etc/pmon.d/

Test Plan:
PASS - deb doesn't install anything to /etc/pmon.d/
PASS - rpm files unchanged
PASS - AIOSX unlocked-enabled-available
PASS - Standard 2+2 unlocked-enabled-available

Story: 2010211
Task: 46306

Depends-On: https://review.opendev.org/c/starlingx/metal/+/855095

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I086db0750df5626d2a8ba1010153ce4f45535ca5
2022-09-26 13:41:40 +00:00
Charles Short
3935abf187 mtcAgent: Run in active mode
Run the mtcAgent in active mode by default. This was done because
mtcAgent was observed causing increased CPU load under Debian.

Story: 2009964
Task: 46202

Test-Plan
PASS Build playbookconfig package
PASS Boot ISO
PASS Bootstrap simplex
PASS Check for running mtcAgent
PASS Install and provision CentOS 2+3 Standard System

Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: If4278ab6e14cd30c995ce5004004fab955ad23eb
2022-09-13 21:38:50 +00:00
Davi Frossard
646192989d Remove sm-watchdog residues
Due to change
bd9e560d4b,
which removed sm-watchdog, we also need to remove its residues
in the kickstart config.

Story: 2010087
Task: 46007

Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I17911773ec4db1549df32a77acd43cd4615b28ee
2022-09-01 12:35:06 +00:00
Leonardo Fagundes Luz Serrano
a5e7a108f5 Duplicate pmon.d conf files to another location
Created a duplicate install of /etc/pmon.d/*.conf files
to /usr/share/starlingx/pmon.d/
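In Debian packaging, such a duplicate install is typically expressed as a second destination line in the package's .install file; a hypothetical sketch (actual file names in the repo not verified):

```
# debian/<package>.install (sketch)
etc/pmon.d/*.conf   etc/pmon.d
etc/pmon.d/*.conf   usr/share/starlingx/pmon.d
```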

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

Test Plan:
PASS: duplicate conf on deb

Story: 2010211
Task: 46112

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ie07c1bfa370da5b2ec71fe3fce948d59be1dd098
2022-08-26 16:21:18 -03:00
Andy Ning
162398acbc Add pmon configuration file for sssd
This is part of the change to replace nslcd with sssd to
support multiple secure ldap backends.

This change added pmon configuration file for sssd so that it
is monitored by pmon.
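For illustration, a pmon conf file is an ini-style process description; a hypothetical sketch for sssd (key names modeled on other StarlingX pmon.d files, values invented, not copied from the actual change):

```ini
; hypothetical /etc/pmon.d/sssd.conf
[process]
process  = sssd            ; process name to monitor
pidfile  = /run/sssd.pid   ; pid file pmon audits
severity = major           ; alarm severity on failure
restarts = 3               ; restart attempts before escalation
interval = 5               ; seconds between monitor audits
```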

Test Plan on Debian (SX and DX):
PASS: Package build, image build.
PASS: System deployment.
PASS: After controller is unlocked, sssd is running.
PASS: ldap user creation by ldapadduser and ldapusersetup.
PASS: ldap user login on console.
PASS: ldap user remote login by oam IP address:
      ssh <ldapuser>@<controller-oam-ip-address>
PASS: ldap user login by local ldap domain within controllers:
      ssh <ldapuser>@controller
PASS: For DX system, same ldap functions still work properly after
      swact.
PASS: Kill sssd process, verify that it is brought up by pmon.

Story: 2009834
Task: 46064
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: I701a4cbbda0f900dafd0456aad63132b62d8424f
2022-08-24 14:42:25 -04:00
Eric MacDonald
038eb198fd Re-enable sensor suppression support in Mtce Hardware Monitor
Sensor and sensorgroup suppression was temporarily disabled in
Debian while System Inventory was modified to align API types
with database types.

That update is now merged so this update removes the Debian only
gate on sensor and sensorgroup suppression.

Test Plan:

PASS: Verify Debian build and install
PASS: Verify CentOS build and install
PASS: Verify multiple individual sensor suppression/unsuppression
PASS: Verify sensorgroup suppression/unsuppression
PASS: Verify host degrade and alarm mgmt for each above 2 cases

Story: 2009968
Task: 45964
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I34abf6bd7c72df2f7da743e4f20300956248c6d7
2022-08-06 00:02:29 +00:00
Eric MacDonald
f7f552ad8e Debian: Fix mtcAgent segfault on SM host state change requests
The mtcAgent communicates with Service Management using libEvent.

The host state change notification requests are all blocking
requests. Both the common and service manager handlers are
freeing the object. This double free results in a segmentation
fault with the newer version of libEvent in Debian.

The bug is fixed by removing the free in the service handler
to allow the dispatch handler to manage the object free as it
does for other blocking requests for other services.

Test Plan:

PASS: Verify mtcAgent does not crash on SM state change request
PASS: Verify all blocking state change requests
PASS: Verify no memory leak (before ; request stress ; after)

Regression:

PASS: Verify Debian Build and Install (duplex/duplex)
PASS: Verify CentOS Build and Patch (duplex)
PASS: Verify CentOS Swact
PASS: Verify Logging

Story: 2009968
Task: 45675
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Iad27a0e77cb9d2233a2f2e1b6f8216b93964335b
2022-06-26 20:18:20 +00:00
Jiping Ma
c031a990f2 Debian: modify crashDumpMgr to adapt to the vmcore name format.
This commit modifies crashDumpMgr to support the current vmcore name
format for debian.

In Debian they are named dmesg.202206101633 and dump.202206101633;
in CentOS they are vmcore-dmesg.txt and vmcore.
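The name matching the manager now needs to handle can be sketched as follows (patterns inferred from the examples above; the helper name is hypothetical):

```shell
#!/bin/sh
# Sketch: classify a crash-dump file by its name, accepting both the
# CentOS naming and the Debian timestamped naming.
is_vmcore() {
    case "$(basename "$1")" in
        vmcore)      echo yes ;;  # CentOS
        dump.[0-9]*) echo yes ;;  # Debian, e.g. dump.202206101633
        *)           echo no  ;;
    esac
}
```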

Test Plan:
PASS: Image builds successfully.
PASS: vmcore files are moved to /var/log/crash successfully.
PASS: Create dump files manually in /var/crash with the format of
      CentOS, then run the crashDumpMgr.
PASS: Create dump files manually in /var/crash with the format of
      Debian, then run the crashDumpMgr.

Story: 2009964
Task: 45629

Depends-On: https://review.opendev.org/c/starlingx/integ/+/845883

Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>
Change-Id: Ic540f7004a4fffd3ce7c008968ac10dca4d1c4d0
2022-06-17 11:31:29 -04:00
Eric MacDonald
aaf9d08028 Mtce: Fix bmc password fetch error handling
The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.

With that issue fixed, the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process, which
then segfaults for a similar reason.

In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.

This update

 1. Prevents the segfault case by properly managing acquired
    json-c object releases. There was one in the mtcAgent and
    another in the hardware monitor (hwmond).

    The json_object_put object release api should only be called
    against objects that were created with very specific apis.
    See new comments in the code.

 2. Avoids log flooding error case by performing a password size
    check rather than assume the password is valid following the
    secret payload receive stage.

 3. Simplifies the secret fsm and error and retry handling.

 4. Deletes useless creation and release of a few unused json
    objects in the common jsonUtil and hwmonJson modules.

Note: This update temporarily disables sensor and sensorgroup
      suppression support for the debian hardware monitor while
      a suppression type fix in sysinv is being investigated.

Test Plan:

PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify all of jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
             alarming and relearning. This test requires suppress
             disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish and switch between them with
             just bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
             failure and success path handling with no process
             failures

Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
2022-06-01 15:21:05 +00:00
Zuul
9af975fbdb Merge "Fix pmon scripts path (Debian)" 2022-03-17 23:32:09 +00:00
Roberto Luiz Martins Nogueira
bb06207bd2 debian: correct bindir for maintenance services - mtce
The maintenance (mtce) services were deploying to /usr/local/bin
instead of /usr/bin, detected on Debian 11 (bullseye).
Without this patch the OCF script will fail to run.

Test Plan:
PASS Build package
PASS Build ISO
PASS Bootstrap in VM
PASS Fresh new build

Story: 2009101
Task: 44699

Signed-off-by: Roberto Luiz Martins Nogueira <robertoluiz.martinsnogueira@windriver.com>
Change-Id: I60471ff51e9e9770de41f67ee1f48a08408eec7d
2022-03-10 11:13:47 +00:00
Lucas Cavalcante
d39f461031 Fix pmon scripts path (Debian)
Puppet expects pmon-* executables to be found at /usr/local/sbin,
therefore Debian should install these files at the correct location.

Test Plan:

PASS: Unlock controller (Debian)
SKIPPED: Unlock controller (Centos)

Story: 2009101
Task: 44711
Change-Id: I5abe5a4c79b58c0a58649f74f54475cca8d29593
Signed-off-by: Lucas Cavalcante <lucasmedeiros.cavalcante@windriver.com>
2022-03-09 11:48:36 -03:00
Matheus Machado Guilhermino
360a344370 Fix remaining failing mtce services on Debian
Modified mtce to address the following
failing services on Debian:
crashDumpMgr.service
fsmon.service
goenabled.service
hostw.service
hwclock.service
mtcClient.service
pmon.service

Applied fix:
- Included modified .service files for debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are behaving as desired
PASS: Services work as expected on CentOS
PASS: Bootstrap and host-unlock successful on CentOS

Story: 2009101
Task: 44323

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: Ie61cedac24f84baea80cab6a69772f8b2e9e1395
2022-01-25 12:10:39 -03:00
Matheus Machado Guilhermino
4c8abe18d3 Fix failing mtce services on Debian
Modified mtce and mtce-control to address the following
failing services on Debian:
hbsAgent.service
hbsClient.service
hwmon.service
lmon.service
mtcalarm.service
mtclog.service
runservices.service

Applied fix:
- Included modified .service files for debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are active

Story: 2009101
Task: 44192

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: I50915c17d6f50f5e20e6448d3e75bfe54a75acc0
2022-01-14 10:50:09 -03:00
Zuul
13d18b98ad Merge "Reduce log rates for daemon-ocf" 2021-11-08 16:39:50 +00:00
Tracey Bogue
0551c665cb Add Debian packaging for mtce packages
Some of the code used TRUE instead of true which did not compile
for Debian. These instances were changed to true.
Some #define constants generated narrowing errors because their
values are negative as 32-bit integers. These values were
explicitly cast to int in the case statements that caused the errors.

Story: 2009101
Task: 43426

Signed-off-by: Tracey Bogue <tracey.bogue@windriver.com>
Change-Id: Iffc4305660779010969e0c506d4ef46e1ebc2c71
2021-10-29 09:17:00 -05:00
Delfino Curado
366b68d3c7 Add option --include-backup to wipedisk
The option --include-backup offers the possibility to wipe the
directory /opt/platform-backup by ignoring the "protected" partition
GUID.
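A hedged sketch of the option handling described above (the GUID value and helper names are placeholders, not taken from the actual script):

```shell
#!/bin/sh
# Sketch: honor --include-backup by no longer skipping partitions that
# carry the "protected" platform-backup partition type GUID.
BACKUP_GUID="ba5eba11-0000-1111-2222-000000000002"  # placeholder value
include_backup=no
parse_args() {
    for arg in "$@"; do
        case "$arg" in
            --include-backup) include_backup=yes ;;
        esac
    done
}
should_wipe() {  # $1 = partition type GUID
    if [ "$include_backup" = yes ]; then echo yes; return; fi
    [ "$1" = "$BACKUP_GUID" ] && echo no || echo yes
}
```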

Test Plan:

PASS: Verify that wipedisk without parameters keeps the contents
of /opt/platform-backup
PASS: Verify that wipedisk with parameter --include-backup removes
the contents of /opt/platform-backup

Story: 2009291
Task: 43719
Signed-off-by: Delfino Curado <delfinogomes.curadofilho@windriver.com>
Change-Id: I1a7c0b284a4c229d6ea59433fd7db296745ead2f
2021-10-22 11:58:39 -04:00
jmusico
5138cb12e4 Reduce log rates for daemon-ocf
This change demotes a few info logs to debug level.
To be able to see these logs, the HA_debug=1 variable must be
added to each process's ocf script.
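The gating can be sketched with a toy helper (the helper name is hypothetical; real OCF scripts use the logging helpers from ocf-shellfuncs):

```shell
#!/bin/sh
# Sketch: emit debug lines only when HA_debug=1 is set in the environment.
ha_debug_log() {
    [ "${HA_debug:-0}" = "1" ] && echo "DEBUG: $*"
    return 0
}
```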

Test Plan:

PASS: Verify that selected logs to be changed to debug are not logged
as info anymore
PASS: Verify after enabling debug level logs these logs are correctly
logged as debug

Failure Path:

PASS: Verify logs are not logged if variable is removed or set to 0

Regression:

PASS: Verify system install
PASS: Verify all log levels, other than debug, are still being
generated (related to task 43606)

Story: 2009272
Task: 43728

Signed-off-by: jmusico <joaopaulotavares.musico@windriver.com>
Change-Id: Ie58683054fd6e60ee5ae496cb823d9ae956251cd
2021-10-21 21:54:42 +00:00
M. Vefa Bicakci
2d25f71f2a pmon.h: Ensure compat. with v5.10 kernel
The v5.10 kernel no longer guards the task_state_notify_info data
structure with #ifdef CONFIG_SIGEXIT, which causes a
redefinition-related compilation error. Work around this by checking for
the existence of the PR_DO_NOTIFY_TASK_STATE macro, and only define the
PR_DO_NOTIFY_TASK_STATE and the task_state_notify_info structure if the
kernel does not do so.

Story: 2008921
Task: 42915

Change-Id: I4bb499e2b52e20542f202dea1c2c55d88bb8ba61
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
2021-07-29 17:36:31 -04:00
Eric MacDonald
74bfeba7d3 Increase maximum preserved crash dump vmcore file size to 5Gi
The current crashDumpMgr service has several filesystem
protection methods that can result in the auto deletion
of a crashdump vmcore file. One is a hard cap of 3Gi.

This max vmcore size is too small for some applications.
Crash dump vmcore files can get large on servers that have
a lot of memory and large applications.

This update modifies the crashDumpMgr service file
max_size override to 5Gi.

Test Plan:

PASS: Verify change functions as expected
PASS: Verify change is inserted after patch apply
PASS: Verify crash dump under-size threshold handling
PASS: Verify crash dump over-size threshold handling
PASS: Verify change is reverted after patch removal

Change-Id: I867600460ba9311818ace466986603f5bffe4cd7
Closes-Bug: 1936976
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-07-21 01:46:30 +00:00
Zuul
469cc3ba06 Merge "Clear bmc alarm over mtcAgent process restart for ALL system types" 2021-06-15 20:44:12 +00:00
Eric MacDonald
d6932f49d7 Remove swerr log in hbsAgent cluster delete
The mtcAgent does not track the stopped or started
heartbeat state of a host; that is left to the
heartbeat service itself, in response to the mtcAgent
commanding heartbeat start and stop based on current
running state.

Therefore the heartbeat stop command is sometimes issued
against a host that is already in the stopped state.

The heartbeat stop command results in a call in the
hbsAgent to delete a host from the heartbeat cluster;
hbs_cluster_del.

If that host is not already in the cluster then this
call can result in a Swerr (Software Error) log.

This update removes this success path Swerr log.

Change-Id: Idb96a791a932827749e329a123f60006ff7c48ec
Closes-Bug: 1931911
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-06-14 19:04:33 -04:00
Eric MacDonald
fd5dd4254a Clear bmc alarm over mtcAgent process restart for ALL system types
If a host's BMC is provisioned and the mtcAgent process
is restarted then remove the gating condition that avoids
clearing the BMC access alarm in AIO SX.

Change-Id: I0734c2203a7acaee27c40c3c0d259b4cc5726b5d
Closes-Bug: 1931906
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-06-14 16:46:41 -04:00
Eric MacDonald
ba6c61584d Refactor background in-service start host services handling
The maintenance add_handler fsm loads inventory and recovers
host state over a process restart. If the active controller's
uptime is less than 15 minutes the restart event is treated as
a Dead Office Recovery (DOR) and is more forgiving to host
recovery by scheduling the 'start host services' as a
background operation so as to not hold up the add operation.

The current implementation of the background handling of
'start host services' is not handling the AIO subfunction
case properly in DOR mode as well as being difficult to
follow and therfore fix and maintain. This miss handling
leads to maintenance incorrectly failing the node with a
subfunction configuration error over the DOR case.

This update refactors the background handling of 'start host
services' to fix the issue and improve its clearity and
maintainability.

Test Cases:

PASS: Verify AIO DX DOR handling
PASS: Verify AIO DX active controller reboot handling
      - standby with uptime ; < 15 min and > 15 min
PASS: Verify AIO DX standby controller reboot handling
PASS: Verify subfunction configuration error handling

Regression:

PASS: Verify start host services wait/retry handling.
PASS: Verify start host services failure handling.
PASS: Verify DOR of Standard system
PASS: Verify DOR of AIO Plus system
PASS: Verify AIO System Install
PASS: Verify Standard System Install
PASS: Verify AIO plus system install

Change-Id: Ia4683672e3a2852b5b4837167b2dcd2a1e4e6d57
Closes-Bug: 1928095
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-05-11 12:25:27 -04:00
Eric MacDonald
ce75299649 Fix enabling heartbeat of self from the peer controller
This issue only occurs over an hbsAgent process restart
where the ready event response does not include the
heartbeat start of the peer controller.

This update reverts a small code change that was
introduced by the following update.

https://review.opendev.org/c/starlingx/metal/+/788495

Remove the my_hostname gate introduced at line 1267 of
mtcCtrlMsg.cpp because it prevents enabling heartbeat
of self by the peer controller.

Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-05-06 13:35:54 -04:00
Eric MacDonald
48978d804d Improved maintenance handling of spontaneous active controller reboot
Performing a forced reboot of the active controller sometimes
results in a second reboot of that controller. The cause of the
second reboot was its reported uptime in the first mtcAlive
message following the reboot being greater than 10 minutes.

Maintenance has a long-standing graceful recovery threshold of
10 minutes, meaning that if a host loses heartbeat and enters
Graceful Recovery, if the uptime value extracted from the first
mtcAlive message following the recovery of that host exceeds 10
minutes, then maintenance interprets that the host did not reboot.
If a host goes absent for longer than this threshold then for
reasons not limited to security, maintenance declares the host
as 'failed' and force re-enables it through a reboot.

With the introduction of containers and addition of new features
over the last few releases, boot times on some servers are
approaching the 10 minute threshold and in this case exceeded
the threshold.

The primary fix in this update is to increase this long-standing
threshold to 15 minutes to account for evolution of the product.

During the debug of this issue a few other related undesirable
behaviors related to Graceful Recovery were observed with the
following additional changes implemented.

 - Remove hbsAgent process restart in ha service management
   failover failure recovery handling. This change is in the
   ha git with a loose dependency placed on this update.
   Reason: https://review.opendev.org/c/starlingx/ha/+/788299

 - Prevent the hbsAgent from sending heartbeat clear events
   to maintenance in response to a heartbeat stop command.
   Reason: Maintenance receiving these clear events while in
           Graceful Recovery causes it to pop out of graceful
           recovery only to re-enter as a retry and therefore
           needlessly consumes one (of a max of 5) retry count.

 - Prevent successful Graceful Recovery until all heartbeat
   monitored networks recover.
   Reason: If heartbeat of one network (say cluster) recovers but
           another (management) does not, then it's possible the
           max Graceful Recovery retry count could be reached
           quite quickly, causing maintenance to fail the host
           and force a full enable with reboot.

 - Extend the wait for the hbsClient ready event in the graceful
   recovery handler timeout from 1 minute to the worker config timeout.
   Reason: To give the worker config time to complete before force
           starting the recovery handler's heartbeat soak.

 - Add Graceful Recovery Wait state recovery over process restart.
   Reason: Avoid double reboot of Gracefully Recovering host over
           SM service bounce.

 - Add requirement for a valid out-of-band mtce flags value before
   declaring configuration error in the subfunction enable handler.
   Reason: rebooting the active controller can sometimes result in
           a falsely reported configuration error due to the
           subfunction enable handler interpreting a zero value as
           a configuration error.

 - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
   Reason: To assist log analysis and issue debug

Test Plan:

PASS: Verify handling active controller reboot
             cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
             cases: with and without timeout, with and without bmc
             cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
             cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot to standby
             takeover when the standby was up for less than 15 minutes.

Regression:

PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
             cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
             cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
             cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby
             controller' and worker nodes in AIO Plus.

PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing

Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-30 15:35:53 +00:00
Eric MacDonald
7539d36c3f Prevent mtcClient from sending to uninitialized socket in AIO SX
The mtcClient will perform a socket reinit if it detects a socket
failure. The mtcClient also avoids setting up its controller-1
cluster network socket for the AIO SX system type ; because there
is no controller-1 provisioned.

Most AIO SX systems have the management/cluster networks set to
the 'loopback' interface. However, when an AIO SX system is setup
with its management and cluster networks on physical interfaces,
with or without vlan, the mtcAlive send message utility will try
to send to the uninitialized controller-1 cluster socket. This
leads to a socket error that triggers a socket reinitialization
loop which causes log flooding.

This update adds a check to the mtcAlive send utility to avoid
sending mtcAlive to controller-1 for AIO SX system type where
there is no controller-1 provisioned; no send, no error, no flood.

Since this update needed to add a system type check, this update
also implemented a system type definition rename from CPE to AIO.
Other related definitions and comments were also changed to make
the code base more understandable and maintainable.

Test Plan:

PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
PASS: Verify AIO SX locked-disabled-online state
PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)

Regression:

PASS: Verify AIO SX Lock and Unlock (lazy reboot)
PASS: Verify AIO DX and DC install with pv regression and sanity
PASS: Verify Standard system install with pv regression and sanity

Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
Closes-Bug: 1897334
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-21 10:20:10 -04:00
Zuul
412ff83f25 Merge "Modify mtce daemon log rotation config files" 2021-04-12 21:45:08 +00:00
Eric MacDonald
3c1e9d9601 Modify mtce daemon log rotation config files
This update makes the following setting changes to the
maintenance log rotation configuration files

 - add 'create' with permissions to each tuple
 - add 'delaycompress'
 - group together log files with similar settings
 - move global settings to local settings
 - remove 'copytruncate' global setting
 - remove the 'nodateext' global and local setting
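For illustration, a logrotate tuple reflecting the settings listed above might look like the fragment below. The file names, size, and rotate count are assumptions, not the shipped mtce configuration.

```
# hypothetical mtce logrotate tuple ; settings per the list above
/var/log/mtcAgent.log /var/log/mtcClient.log
{
    size 10M
    rotate 5
    compress
    delaycompress
    create 0640 root root
    missingok
    notifempty
}
```

Grouping files with identical settings into one tuple, and using 'create' instead of the removed global 'copytruncate', means each rotated file is reopened fresh with 0640 permissions rather than truncated in place.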

Test Plan:

PASS: Verify log rotation for all mtc log files
PASS: Verify no log loss over rotation
PASS: Verify log rotation file naming convention
PASS: Verify delaycompress on all mtce log files
PASS: Verify log permissions after rotate are 0640

Regression:

PASS: Verify AIO system install
PASS: Verify Standard system install
PASS: Verify full and dated collect

Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
Partial-Bug: 1918979
Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-07 20:47:54 +00:00
Zuul
d3b9a1f0c0 Merge "Add in-service test to clear stale config failure alarm" 2021-04-06 14:39:20 +00:00
Eric MacDonald
031818e55b Add in-service test to clear stale config failure alarm
A configuration failure alarm can get stuck asserted if
a node experiences an uncontrolled reboot that recovers
without a configuration failure.

This update adds an in-service test that audits host health
while there is a configuration failure alarm raised and
clears that alarm if the failure condition goes away. This
could be a result of an in-service manifest that runs and
corrects the configuration or if the node reboots and comes
back up in a healthy (properly configured) state.

Fixed a bug that cleared the config alarm severity state
when a heartbeat clear event was received.

This update also goes a step further and introduces an
alarm state audit that detects and corrects maintenance
alarm state mismatches.

Test Plan:

PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
      ... degrade - active controller
      ... fail    - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch

Regression:

PASS: Verify enable sequence config failure handling
PASS: ... active controller     - recoverable degrade
PASS: ... other nodes           - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install

Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-29 16:39:52 -04:00
Zuul
0d98938f2b Merge "Fix reinstall of controller nodes" 2021-03-28 15:19:40 +00:00
Eric MacDonald
5c83453fdf Fix Graceful Recovery handling while in Graceful Recovery handling
The current Graceful Recovery handler is not properly handling
back-to-back Multi Node Failure Avoidance (MNFA) events.

There are two phases to MNFA

 phase 1: waiting for number of failed nodes to fall below
          mnfa_threshold as each affected node's heartbeat
          is recovered.
 phase 2: then a Graceful Recovery Wait period which is an
          11 second heartbeat soak to verify that a stable
          heartbeat is regained before declaring the MNFA
          event complete.

The Graceful Recovery Wait status has been seen to be left
uncleared (stuck) on one or more of the affected nodes if
phase 2 of MNFA is interrupted by another MNFA event;
aka MNFA Nesting.

Although this stuck status is not service affecting it does leave
one or more nodes' host.task field, as observed under host-show,
with "Graceful Recovery Wait" rather than empty.

This update makes Multi Node Failure Avoidance (MNFA) handling
changes to ensure that, upon MNFA exit, the recovery handler
is properly restarted if MNFA Nesting occurs.

Two additional Graceful Recovery phase issues were identified
and fixed by this update.

 1. Cut Graceful Recovery handling time in half

    - Found and removed a redundant 11 second heartbeat soak
      at the very end of the recovery handler.
    - This cuts the graceful recovery handling time down from
      22 to 11 seconds thereby cutting potential for nesting
      in half.

 2. Increased supported Graceful Recovery nesting from 3 to 5

    - Found that some links bounce more than others so a nesting
      count of 3 can lead to an occasional single node failure.
    - This adds a bit more resiliency to MNFA handling of cases
      that exhibit more link messaging bounce.

Test Plan: Verified 60+ MNFA occurrences across 4 different
           system types including AIO Plus, Standard and Storage

PASS: Verify Single Node Graceful Recovery Handling
PASS: Verify Multi Node Graceful Recovery Handling
PASS: Verify Single Node Graceful Recovery Nesting Handling
PASS: Verify Multi Node Graceful Recovery Nesting Handling
PASS: Verify MNFA of up to 5 nests can be gracefully recovered
PASS: Verify MNFA of 6 nests leads to full enable of affected nodes
PASS: Verify update as a patch
PASS: Verify mtcAgent logging

Regression:

PASS: Verify standard system install
PASS: Verify product verification maintenance regression (4 runs)
PASS: Verify MNFA threshold increase and below threshold behavior
PASS: Verify MNFA with reduced timeout behavior for
      ... nested case that does not timeout
      ... case that does not timeout
      ... case that does timeout

Closes-Bug: 1892877
Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-17 14:25:19 -04:00
Mihnea Saracin
497a6f93f4 Fix reinstall of controller nodes
At shutdown, systemd will try to remount everything read-only
before attempting to unmount it. In the wipedisk script we
are deleting the partitions without unmounting
their corresponding filesystems. This leads to errors because
systemd will try to remount filesystems
whose partitions were deleted.

To fix this we have to unmount the filesystems that are linked to the
removed partitions.

Closes-Bug: 1919153
Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Change-Id: I49a3c06ae6bce1324dd06f4fc63fb3e5cd4d28c1
2021-03-16 14:02:10 +02:00
Zuul
84ba5f693a Merge "Fix mtce compiling issue with gcc8" 2021-03-15 22:33:45 +00:00
Eric MacDonald
4f5bf78f55 Improve mtcAgent interrupted thread cleanup
A BMC command send will be rejected if its thread
is not in the IDLE state going into the call.

This issue is seen to occur over a reprovisioning action
while the bmc access alarmable condition exists.

Maintenance will do retries, so the only visible side effect
of this issue is a failure to provision to 'redfish' over a
provisioning switch to 'dynamic' (learn mode). Instead,
ipmi is selected.

The non-return to idle can occur when the bmc handler FSM
is interrupted by a reprovisioning request while a bmc
command is in flight.

This update enhances the thread management module by
introducing a thread consumption utility that is called
by the bmc command send utility. If the send finds that
its thread is not in the IDLE state it will either kill
the thread if it is running or free a completed but-not-
consumed thread result.

Note: Maintenance only supports the execution of
a single thread per host per process at one time.

Test Plan:

PASS: Verify BMC provisioning change from ipmi to dynamic
      while the ipmi provisioning was failing prior to
      re-provisioning. Verify the previous error is cleaned
      up and the reprovisioning request succeeds as expected.

PASS: Verify thread 'execution timeout kill' cleanup handling.
PASS: Verify thread 'complete but not consumed' cleanup handling.
PASS: Verify logging during regression soaks

Regression:

PASS: Verify bmc protocol reprovisioning script soak
PASS: Verify sensor monitoring following BMC reprovisioning
PASS: Verify product verification mtce regression test suite

Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178
Closes-Bug: 1864906
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-15 10:51:16 -04:00
Eric MacDonald
4f7d82308f Add NonRecoverable property to Hardware Monitor's Redfish
This update adds 'NonRecoverable' sensor health property
to the Hardware Monitor's Redfish platform management
protocol support.

Test Plan:

PASS: Verify handling of Redfish NonRecoverable sensor
      ... using redfish
      ... switching between ipmi and redfish and back
PASS: Verify sensor model relearn over change of bmc protocol

Regression:

PASS: Verify sensor model relearn by command
PASS: Verify sensor suppression
PASS: Verify sensor alarm and degrade management
      ... as sensor events come and go
      ... on sensor suppression and unsuppression
PASS: Verify sensor monitoring regression test
PASS: Verify update as a patch (apply/remove)

Change-Id: I2770e63f4d44e269b4410f392707f3cd01e9a2cc
Closes-Bug: 1918152
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-11 11:13:59 -05:00