239 Commits

Author SHA1 Message Date
Eric MacDonald
d863aea172 Increase mtce host offline threshold to handle slow host shutdown
Mtce polls/queries the remote host for mtcAlive messages
for 42 intervals of 100 ms over unlock or host-failed cases.
Absence of mtcAlive during this (~5 sec) period indicates
the node is offline.

However, in the rare case where shutdown is slow, 5 seconds
is not long enough. Rare cases have been seen where a 7 or 8
second wait is required to properly declare the node offline.

To avoid the rare transient 200.004 host alarm over an
unlock operation, this update increases the mtce host
offline window from 5 to 10 seconds (approx) by modifying
the mtce configuration file offline threshold from 42 to 90.
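The change amounts to a single value in the mtce configuration file; a hypothetical sketch of the relevant entries (section and key names assumed, not verified against the actual mtc.ini):

```ini
; hypothetical excerpt of the mtce agent configuration (mtc.ini)
[agent]
offline_period    = 100   ; mtcAlive poll/query interval, in milliseconds
offline_threshold = 90    ; was 42; 90 x 100 ms plus query overhead ~= 10 sec
```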

Test Plan:

PASS: Verify the unchallenged failed-to-offline period is ~10 secs
PASS: Verify algorithm restarts if there is mtcAlive received
      anytime during the polls/queries (challenge) window.
PASS: Verify challenge handling leads to a longer but
      successful offline declaration.
PASS: Verify above handling for both unlock and spontaneous
      failure handling cases.

Closes-Bug: 2024249
Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-06-16 18:14:08 +00:00
Matheus Guilhermino
a0e270b51b Add mpath support to wipedisk script
The wipedisk script was not able to find the boot device
when using multipath disks, because multipath devices are
not listed under /dev/disk/by-path/.

To support multipath devices, the script should look
for the boot device under /dev/disk/by-id/ as well.
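The fallback described above can be sketched as a lookup that scans by-path first, then by-id (directory names from the commit; the helper and candidate-directory arguments are hypothetical):

```shell
#!/bin/sh
# Sketch: resolve a boot device symlink by scanning candidate directories
# in order, so multipath devices (only present under by-id) are still found.
find_boot_device() {
    target="$1"; shift   # kernel device name, e.g. dm-0 or sda
    for dir in "$@"; do  # candidate dirs, e.g. /dev/disk/by-path /dev/disk/by-id
        for link in "$dir"/*; do
            [ -e "$link" ] || continue
            if [ "$(basename "$(readlink -f "$link")")" = "$target" ]; then
                echo "$link"
                return 0
            fi
        done
    done
    return 1
}
```

In the real script the candidate directories would be /dev/disk/by-path and /dev/disk/by-id, in that order.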

Test Plan
PASS: Successfully run wipedisk on an AIO-SX with multipath
PASS: Successfully run wipedisk on an AIO-SX w/o multipath

Closes-bug: 2013391

Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
Change-Id: I3af76cd44f22795784a9184daf75c66fc1b9874f
2023-04-10 17:10:22 -03:00
Al Bailey
37c5910a62 Update mtce debian package ver based on git
Update debian package versions to use git commits for:
 - mtce         (old 9, new 30)
 - mtce-common  (old 1, new 9)
 - mtce-compute (old 3, new 4)
 - mtce-control (old 7, new 10)
 - mtce-storage (old 3, new 4)

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.
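The per-directory commit count that drives such a version bump can be obtained with git; a sketch of that step (the actual packaging hook is not shown in the commit message):

```shell
#!/bin/sh
# Sketch: derive a package revision from the number of commits that touch
# a given subdirectory, as the Debian packaging now does per directory.
pkg_rev_for_dir() {
    # $1 = repo path, $2 = subdirectory (e.g. mtce, mtce-common)
    git -C "$1" rev-list --count HEAD -- "$2"
}
```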

Test Plan:
  PASS: build-pkgs -p mtce
  PASS: build-pkgs -p mtce-common
  PASS: build-pkgs -p mtce-compute
  PASS: build-pkgs -p mtce-control
  PASS: build-pkgs -p mtce-storage

Story: 2010550
Task: 47401
Task: 47402
Task: 47403
Task: 47404
Task: 47405

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I4846804320b0ad3ec10799a468a9ee3bf7973587
2023-03-02 14:50:35 +00:00
Kyale, Eliud
502662a8a7 Cleanup mtcAgent error logging during startup
- reduced log level in http util to warning
- use inservice test handler to ensure state change notification
  is sent to vim
- reduce retry count from 3 to 1 for add_handler state_change
  vim notification

Test plan:
PASS - AIO-SX: ansible controller startup (race condition)
PASS - AIO-DX: ansible controller startup
PASS - AIO-DX: SWACT
PASS - AIO-DX: power off restart
PASS - AIO-DX: full ISO install
PASS - AIO-DX: Lock Host
PASS - AIO-DX: Unlock Host
PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller)

Story: 2010533
Task: 47338

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45
2023-02-14 14:18:02 -05:00
Christopher Souza
56ab793bc5 Change hostwd emergency log to write to /dev/kmsg
The hostwd emergency logs were written to /dev/console;
this change adds the prefix "hostwd:" to the log message
and writes to /dev/kmsg.
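The write itself is a plain file write to the kernel log device; a hedged sketch (the destination is parameterized so it can be exercised without root, and the exact message format is assumed):

```shell
#!/bin/sh
# Sketch: emit a prefixed emergency log line to the kernel log buffer.
# KMSG defaults to /dev/kmsg (requires root); override it for testing.
KMSG="${KMSG:-/dev/kmsg}"
hostwd_emergency_log() {
    printf 'hostwd: %s\n' "$*" >> "$KMSG"
}
```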

Test Plan:

Pass: AIO-SX and AIO DX full deployment.
Pass: kill pmond and wait for the emergency log to be written.
Pass: check if the emergency log was written to /dev/kmsg.
Pass: Verify logging for quorum report missing failure.
Pass: Verify logging for quorum process failure.
Pass: Verify emergency log crash dump logging to mesg and
      console logging for each of the 2 cases above with
      stressng overloading the server (CPU, FS and Memory);
      stress-ng --vm-bytes 4000000000 --vm-keep -m 30 -i 30 -c 30

Story: 2010533
Task: 47216

Co-authored-by: Eric MacDonald <eric.macdonald@windriver.com>
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Co-authored-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Signed-off-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Change-Id: I0da82f964dd096840259c4d0ed4e5f558debdf22
2023-02-01 23:41:14 +00:00
Eric MacDonald
a3cba57a1f Adapt Host Watchdog to use kdump-tools
The Debian package for kdump changed from kdump to kdump-tools

Test Plan:

PASS: Verify build and install AIO DX system
PASS: Verify host watchdog detects kdump as active in debian

Closes-Bug: 2001692
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie1ac29d3d29f3d9c843789cdedf85081fe790616
2023-01-04 12:57:19 -05:00
Robert Church
1796ed8740 Update wipedisk for LVM based rootfs
Now that the root filesystem is based on an LVM logical volume, discover
the root disk by searching for the boot partition.

Changes include:
 - remove detection of rootfs_part/rootfs and adjust rootfs related
   references with boot_disk.
 - run bashate on the script and resolve indentation and syntax related
   errors. Leave long-line errors alone for improved readability.
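Deriving boot_disk from the discovered boot partition is mostly string handling; a hedged sketch of that step (helper name hypothetical; the real script may also consult lsblk/udev):

```shell
#!/bin/sh
# Sketch: strip the partition suffix from a partition device node to get
# the underlying disk, covering sdX, nvme, and mmcblk naming schemes.
disk_of_partition() {
    part=$1
    case "$part" in
        *[0-9]p[0-9]*) echo "${part%p[0-9]*}" ;;  # nvme0n1p1 -> nvme0n1
        *)             echo "${part%%[0-9]*}" ;;  # sda1 -> sda
    esac
}
```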

Test Plan:
PASS - run 'wipedisk', answer prompts, and ensure all partitions are
       cleaned up except for the platform backup partition
PASS - run 'wipedisk --include-backup', answer prompts, and ensure all
       partitions are cleaned up
PASS - run 'wipedisk --include-backup --force' and ensure all partitions
       are cleaned up

Change-Id: I036ce745353b6a26bc2615ffc6e3b8955b4dd1ec
Closes-Bug: #1998204
Signed-off-by: Robert Church <robert.church@windriver.com>
2022-11-29 05:04:38 -06:00
Eric MacDonald
da398e0c5f Debian: Make Mtce offline handler more resilient to slow shutdowns
The current offline handler assumes the node is offline after
'offline_search_count' reaches 'offline_threshold' count
regardless of whether mtcAlive messages were received during
the search window.

The offline algorithm requires that no mtcAlive messages
be seen for the full offline_threshold count.

During a slow shutdown the mtcClient runs for longer than
it should and as a result can lead to maintenance seeing
the node as recovered before it should.

This update manages the offline search counter to ensure that
it only reaches the count threshold after seeing no mtcAlive
messages for the full search count. Any mtcAlive message seen
during the count triggers a count reset.
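The corrected counting rule can be modeled as a toy function: offline is declared only after the threshold number of consecutive silent polls (function name and inputs are hypothetical, not the actual mtce code):

```shell
#!/bin/sh
# Sketch: declare offline only after `threshold` consecutive polls with no
# mtcAlive; any mtcAlive seen during the count resets the counter.
offline_result() {
    threshold=$1; shift
    count=0
    for poll in "$@"; do          # each poll is "alive" or "silent"
        if [ "$poll" = alive ]; then
            count=0               # the fix: reset on any mtcAlive message
        else
            count=$((count + 1))
        fi
        [ "$count" -ge "$threshold" ] && { echo offline; return 0; }
    done
    echo online
}
```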

This update also
1. Adjusts the reset retry cadence from 7 to 12 secs
   to prevent unnecessary reboot thrash during
   the current shutdown.
2. Clears the hbsClient ready event at the start of the
   subfunction handler so the heartbeat soak is only
   started after seeing heartbeat client ready events
   that follow the main config.

Test Plan:

PASS: Debian and CentOS Build and DX install
PASS: Verify search count management
PASS: Verify issue does not occur over lock/unlock soak (100+)
      - where the same test without update did show issue.
PASS: Monitor alive logs for behavioral correctness
PASS: Verify recovery reset occurs after expected extended time.

Closes-Bug: 1993656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
2022-10-24 15:57:43 +00:00
Eric MacDonald
3f4c2cbb45 Mtce: Add ActionInfo extension support for reset operations.
StarlingX Maintenance supports host power and reset control through
both IPMI and Redfish Platform Management protocols when the host's
BMC (Baseboard Management Controller) is provisioned.

The power and reset action commands for Redfish are learned through
HTTP payload annotations at the Systems level: "/redfish/v1/Systems".

The existing maintenance implementation only supports the
"ResetType@Redfish.AllowableValues" payload property annotation at
the #ComputerSystem.Reset Actions property level.

However, the Redfish schema also supports an 'ActionInfo' extension
at /redfish/v1/Systems/1/ResetActionInfo.

This update adds support for the 'ActionInfo' extension for Reset
and power control command learning.

For more information refer to section 6.3, ActionInfo 1.3.0, of
the Redfish Data Model Specification linked in the launchpad report.
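For illustration, a toy extraction of the allowable reset values from an abridged, hypothetical ActionInfo payload (a real client such as redfishtool parses the JSON properly; the crude sed here is only to show what is being learned):

```shell
#!/bin/sh
# Sketch: pull the AllowableValues list for the ResetType parameter out of
# an ActionInfo-style payload. The payload is abridged and hypothetical.
payload='{"Parameters":[{"Name":"ResetType","AllowableValues":["On","ForceOff","GracefulRestart"]}]}'
allowable_reset_values() {
    # crude text extraction; a real implementation would use a JSON parser
    printf '%s\n' "$1" |
        sed -n 's/.*"AllowableValues":\[\([^]]*\)\].*/\1/p' |
        tr -d '"' | tr ',' ' '
}
```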

Test Plan:

PASS: Verify CentOS build and patch install.
PASS: Verify Debian build and ISO install.
PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0
PASS: Verify reset/power control cmd load from newly added second
      level query from ActionInfo service.

Failure Handling: Significant failure path testing with this update

PASS: Verify Redfish protocol is periodically retried from start
      when bm_type=redfish fails to connect.
PASS: Verify BMC access protocol defaults to IPMI when
      bm_type=dynamic but failed connect using redfish.
      Connection failures in the above cases include
      - redfish bmc root query fails
      - redfish bmc info query fails
      - redfish bmc load power/reset control actions fails
      - missing second level Parameters label list
      - missing second level AllowableValues label list
PASS: Verify sensor monitoring is relearned to ipmi from failed and
      retried with bm_type=redfish after switch to bm_type=dynamic
      or bm_type=ipmi by sysinv update command.

Regression:

PASS: Verify with CentOS redfishtool 1.1.0
PASS: Verify switch back and forth between ipmi and redfish using
      update bm_type=ipmi and bm_type=redfish commands
PASS: Verify switch from ipmi to redfish using bm_type=dynamic for
      hosts that support redfish
PASS: Verify redfish protocol is preferred in bm_type=dynamic mode
PASS: Verify IPMI sensor monitoring when bm_type=ipmi
PASS: Verify IPMI sensor monitoring when bm_type=dynamic
      and redfish connect fails.
PASS: Verify redfish sensor event assert/clear handling with
      alarm and degrade condition for both IPMI and redfish.
PASS: Verify reset/power command learn by single level query.
PASS: Verify mtcAgent.log logging

Closes-Bug: 1992286
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35
2022-10-13 17:40:05 +00:00
Zuul
8fd1bcbb97 Merge "Alarm Hostname controller function has in-service failure reported" 2022-10-06 20:47:06 +00:00
Al Bailey
dd5a24037d Fix bashate failure in zuul
This review allows this repo to pass zuul.

When tox is run locally it pulls in an older
bashate 0.6.0 but the zuul jobs are pulling in
the higher version.

Bashate 2.1.1 was released Oct 6, 2022.

Changed the upper constraints to allow developers
to pull in dependencies that are more aligned with zuul.

Fixed the new bashate error.
Also cleaned up the yamllint syntax.

Closes-Bug: 1991971
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I9cda349a20c63f9d222a3c3fc3645c5ceb4c2751
2022-10-06 17:22:12 +00:00
Girish Subramanya
86681b7598 Alarm Hostname controller function has in-service failure reported
When compute services remain healthy:
 - listing alarms shall not refer to the below Obsoleted alarm
 - 200.012 alarm hostname controller function has an in-service failure

This update deletes the definition of the obsoleted alarm:
200.012 is removed from the events.yaml file, along with any
references to this alarm definition.
A bug also needs to be raised to track the corresponding doc change.

Test Plan:
Verify on a Standard configuration that no alarms are listed for
the hostname controller in-service failure.
Code (removal) changes were exercised prior to ansible bootstrap
and host-unlock; verified no unexpected alarms.
Regression:
There is no need to test the alarm referred to here as it is obsolete.

Closes-Bug: 1991531

Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com>

Change-Id: I255af68155c5392ea42244b931516f742fa838c3
2022-10-05 10:30:01 -04:00
Zuul
6bcd8333b2 Merge "Debian: Remove conf files from etc-pmon.d" 2022-09-30 19:41:16 +00:00
Leonardo Fagundes Luz Serrano
d1c0d04719 Debian: Remove conf files from etc-pmon.d
Removed conf files from /etc/pmon.d/
as they are being moved to another location.

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

The change is debian-only, since centos support
will be dropped soon.
Centos' pmon conf files remain in /etc/pmon.d/

Test Plan:
PASS - deb doesn't install anything to /etc/pmon.d/
PASS - rpm files unchanged
PASS - AIOSX unlocked-enabled-available
PASS - Standard 2+2 unlocked-enabled-available

Story: 2010211
Task: 46306

Depends-On: https://review.opendev.org/c/starlingx/metal/+/855095

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I086db0750df5626d2a8ba1010153ce4f45535ca5
2022-09-26 13:41:40 +00:00
Charles Short
3935abf187 mtcAgent: Run in active mode
Run the mtcAgent in active mode by default. This was done because
mtcAgent was observed causing increased CPU load under Debian.

Story: 2009964
Task: 46202

Test-Plan
PASS Build playbookconfig package
PASS Boot ISO
PASS Bootstrap simplex
PASS Check for running mtcAgent
PASS Install and provision CentOS 2+3 Standard System

Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: If4278ab6e14cd30c995ce5004004fab955ad23eb
2022-09-13 21:38:50 +00:00
Davi Frossard
646192989d Remove sm-watchdog residues
Due to change
bd9e560d4b,
which removed sm-watchdog, we also need to remove its residues
in the kickstart config.

Story: 2010087
Task: 46007

Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I17911773ec4db1549df32a77acd43cd4615b28ee
2022-09-01 12:35:06 +00:00
Leonardo Fagundes Luz Serrano
a5e7a108f5 Duplicate pmon.d conf files to another location
Created a duplicate install of /etc/pmon.d/*.conf files
to /usr/share/starlingx/pmon.d/
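In Debian packaging, such a duplicate install is typically expressed as a second destination line in the package's .install file; a hypothetical sketch (actual file names in the repo not verified):

```
# debian/<package>.install (sketch)
etc/pmon.d/*.conf   etc/pmon.d
etc/pmon.d/*.conf   usr/share/starlingx/pmon.d
```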

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

Test Plan:
PASS: duplicate conf on deb

Story: 2010211
Task: 46112

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ie07c1bfa370da5b2ec71fe3fce948d59be1dd098
2022-08-26 16:21:18 -03:00
Andy Ning
162398acbc Add pmon configuration file for sssd
This is part of the change to replace nslcd with sssd to
support multiple secure ldap backends.

This change added pmon configuration file for sssd so that it
is monitored by pmon.
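For illustration, a pmon conf file is an ini-style process description; a hypothetical sketch for sssd (key names modeled on other StarlingX pmon.d files, values invented, not copied from the actual change):

```ini
; hypothetical /etc/pmon.d/sssd.conf
[process]
process  = sssd            ; process name to monitor
pidfile  = /run/sssd.pid   ; pid file pmon audits
severity = major           ; alarm severity on failure
restarts = 3               ; restart attempts before escalation
interval = 5               ; seconds between monitor audits
```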

Test Plan on Debian (SX and DX):
PASS: Package build, image build.
PASS: System deployment.
PASS: After controller is unlocked, sssd is running.
PASS: ldap user creation by ldapadduser and ldapusersetup.
PASS: ldap user login on console.
PASS: ldap user remote login by oam IP address:
      ssh <ldapuser>@<controller-oam-ip-address>
PASS: ldap user login by local ldap domain within controllers:
      ssh <ldapuser>@controller
PASS: For DX system, same ldap functions still work properly after
      swact.
PASS: Kill sssd process, verify that it is brought up by pmon.

Story: 2009834
Task: 46064
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: I701a4cbbda0f900dafd0456aad63132b62d8424f
2022-08-24 14:42:25 -04:00
Eric MacDonald
038eb198fd Re-enable sensor suppression support in Mtce Hardware Monitor
Sensor and sensorgroup suppression was temporarily disabled in
Debian while System Inventory was modified to align API types
with database types.

That update is now merged so this update removes the Debian only
gate on sensor and sensorgroup suppression.

Test Plan:

PASS: Verify Debian build and install
PASS: Verify CentOS build and install
PASS: Verify multiple individual sensor suppression/unsuppression
PASS: Verify sensorgroup suppression/unsuppression
PASS: Verify host degrade and alarm mgmt for each above 2 cases

Story: 2009968
Task: 45964
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I34abf6bd7c72df2f7da743e4f20300956248c6d7
2022-08-06 00:02:29 +00:00
Eric MacDonald
f7f552ad8e Debian: Fix mtcAgent segfault on SM host state change requests
The mtcAgent communicates with Service Management using libEvent.

The host state change notification requests are all blocking
requests. Both the common and service manager handlers are
freeing the object. This double free results in a segmentation
fault with the newer version of libEvent in Debian.

The bug is fixed by removing the free in the service handler
to allow the dispatch handler to manage the object free as it
does for other blocking requests for other services.

Test Plan:

PASS: Verify mtcAgent does not crash on SM state change request
PASS: Verify all blocking state change requests
PASS: Verify no memory leak (before ; request stress ; after)

Regression:

PASS: Verify Debian Build and Install (duplex/duplex)
PASS: Verify CentOS Build and Patch (duplex)
PASS: Verify CentOS Swact
PASS: Verify Logging

Story: 2009968
Task: 45675
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Iad27a0e77cb9d2233a2f2e1b6f8216b93964335b
2022-06-26 20:18:20 +00:00
Jiping Ma
c031a990f2 Debian: modify crashDumpMgr to adapt to the vmcore name format.
This commit modifies crashDumpMgr to support the current vmcore name
format for debian.

In Debian they are named dmesg.202206101633 and dump.202206101633;
in CentOS they are vmcore-dmesg.txt and vmcore.
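The name matching the manager now needs to handle can be sketched as follows (patterns inferred from the examples above; the helper name is hypothetical):

```shell
#!/bin/sh
# Sketch: classify a crash-dump file by its name, accepting both the
# CentOS naming and the Debian timestamped naming.
is_vmcore() {
    case "$(basename "$1")" in
        vmcore)      echo yes ;;  # CentOS
        dump.[0-9]*) echo yes ;;  # Debian, e.g. dump.202206101633
        *)           echo no  ;;
    esac
}
```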

Test Plan:
PASS: Image builds successfully.
PASS: vmcore files are moved to /var/log/crash successfully.
PASS: Create dump files manually in /var/crash with the format of
      CentOS, then run the crashDumpMgr.
PASS: Create dump files manually in /var/crash with the format of
      Debian, then run the crashDumpMgr.

Story: 2009964
Task: 45629

Depends-On: https://review.opendev.org/c/starlingx/integ/+/845883

Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>
Change-Id: Ic540f7004a4fffd3ce7c008968ac10dca4d1c4d0
2022-06-17 11:31:29 -04:00
Eric MacDonald
aaf9d08028 Mtce: Fix bmc password fetch error handling
The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.

With that issue fixed, the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process, which
then segfaults for a similar reason.

In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.

This update

 1. Prevents the segfault case by properly managing acquired
    json-c object releases. There was one in the mtcAgent and
    another in the hardware monitor (hwmond).

    The json_object_put object release api should only be called
    against objects that were created with very specific apis.
    See new comments in the code.

 2. Avoids log flooding error case by performing a password size
    check rather than assume the password is valid following the
    secret payload receive stage.

 3. Simplifies the secret fsm and error and retry handling.

 4. Deletes useless creation and release of a few unused json
    objects in the common jsonUtil and hwmonJson modules.

Note: This update temporarily disables sensor and sensorgroup
      suppression support for the debian hardware monitor while
      a suppression type fix in sysinv is being investigated.

Test Plan:

PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify all of jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
             alarming and relearning. This test requires suppress
             disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish and switch between them with
             just bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
             failure and success path handling with no process
             failures

Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
2022-06-01 15:21:05 +00:00
Zuul
9af975fbdb Merge "Fix pmon scripts path (Debian)" 2022-03-17 23:32:09 +00:00
Roberto Luiz Martins Nogueira
bb06207bd2 debian: correct bindir for maintenance services - mtce
The maintenance (mtce) services were deploying to /usr/local/bin
instead of /usr/bin, detected on Debian 11 (bullseye).
Without this patch the OCF script will fail to run.

Test Plan:
PASS Build package
PASS Build ISO
PASS Bootstrap in VM
PASS Fresh new build

Story: 2009101
Task: 44699

Signed-off-by: Roberto Luiz Martins Nogueira <robertoluiz.martinsnogueira@windriver.com>
Change-Id: I60471ff51e9e9770de41f67ee1f48a08408eec7d
2022-03-10 11:13:47 +00:00
Lucas Cavalcante
d39f461031 Fix pmon scripts path (Debian)
Puppet expects pmon-* executables to be found at /usr/local/sbin,
therefore Debian should install these files at the correct location.

Test Plan:

PASS: Unlock controller (Debian)
SKIPPED: Unlock controller (Centos)

Story: 2009101
Task: 44711
Change-Id: I5abe5a4c79b58c0a58649f74f54475cca8d29593
Signed-off-by: Lucas Cavalcante <lucasmedeiros.cavalcante@windriver.com>
2022-03-09 11:48:36 -03:00
Matheus Machado Guilhermino
360a344370 Fix remaining failing mtce services on Debian
Modified mtce to address the following
failing services on Debian:
crashDumpMgr.service
fsmon.service
goenabled.service
hostw.service
hwclock.service
mtcClient.service
pmon.service

Applied fix:
- Included modified .service files for debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are behaving as desired
PASS: Services work as expected on CentOS
PASS: Bootstrap and host-unlock successful on CentOS

Story: 2009101
Task: 44323

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: Ie61cedac24f84baea80cab6a69772f8b2e9e1395
2022-01-25 12:10:39 -03:00
Matheus Machado Guilhermino
4c8abe18d3 Fix failing mtce services on Debian
Modified mtce and mtce-control to address the following
failing services on Debian:
hbsAgent.service
hbsClient.service
hwmon.service
lmon.service
mtcalarm.service
mtclog.service
runservices.service

Applied fix:
- Included modified .service files for debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are active

Story: 2009101
Task: 44192

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: I50915c17d6f50f5e20e6448d3e75bfe54a75acc0
2022-01-14 10:50:09 -03:00
Zuul
13d18b98ad Merge "Reduce log rates for daemon-ocf" 2021-11-08 16:39:50 +00:00
Tracey Bogue
0551c665cb Add Debian packaging for mtce packages
Some of the code used TRUE instead of true which did not compile
for Debian. These instances were changed to true.
Some #define constants generated narrowing errors because their
values are negative as 32-bit integers. These values were
explicitly cast to int in the case statements that caused the errors.

Story: 2009101
Task: 43426

Signed-off-by: Tracey Bogue <tracey.bogue@windriver.com>
Change-Id: Iffc4305660779010969e0c506d4ef46e1ebc2c71
2021-10-29 09:17:00 -05:00
Delfino Curado
366b68d3c7 Add option --include-backup to wipedisk
The option --include-backup offers the possibility to wipe the
directory /opt/platform-backup by ignoring the "protected" partition
GUID.
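A hedged sketch of the option handling described above (the GUID value and helper names are placeholders, not taken from the actual script):

```shell
#!/bin/sh
# Sketch: honor --include-backup by no longer skipping partitions that
# carry the "protected" platform-backup partition type GUID.
BACKUP_GUID="ba5eba11-0000-1111-2222-000000000002"  # placeholder value
include_backup=no
parse_args() {
    for arg in "$@"; do
        case "$arg" in
            --include-backup) include_backup=yes ;;
        esac
    done
}
should_wipe() {  # $1 = partition type GUID
    if [ "$include_backup" = yes ]; then echo yes; return; fi
    [ "$1" = "$BACKUP_GUID" ] && echo no || echo yes
}
```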

Test Plan:

PASS: Verify that wipedisk without parameters keeps the contents
of /opt/platform-backup
PASS: Verify that wipedisk with parameter --include-backup removes
the contents of /opt/platform-backup

Story: 2009291
Task: 43719
Signed-off-by: Delfino Curado <delfinogomes.curadofilho@windriver.com>
Change-Id: I1a7c0b284a4c229d6ea59433fd7db296745ead2f
2021-10-22 11:58:39 -04:00
jmusico
5138cb12e4 Reduce log rates for daemon-ocf
This change demotes a few info logs to debug level.
To be able to see these logs, the HA_debug=1 variable must be
added to each process's ocf script.
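The gating can be sketched with a toy helper (the helper name is hypothetical; real OCF scripts use the logging helpers from ocf-shellfuncs):

```shell
#!/bin/sh
# Sketch: emit debug lines only when HA_debug=1 is set in the environment.
ha_debug_log() {
    [ "${HA_debug:-0}" = "1" ] && echo "DEBUG: $*"
    return 0
}
```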

Test Plan:

PASS: Verify that selected logs to be changed to debug are not logged
as info anymore
PASS: Verify after enabling debug level logs these logs are correctly
logged as debug

Failure Path:

PASS: Verify logs are not logged if variable is removed or set to 0

Regression:

PASS: Verify system install
PASS: Verify all log levels, other than debug, are still being
generated (related to task 43606)

Story: 2009272
Task: 43728

Signed-off-by: jmusico <joaopaulotavares.musico@windriver.com>
Change-Id: Ie58683054fd6e60ee5ae496cb823d9ae956251cd
2021-10-21 21:54:42 +00:00
M. Vefa Bicakci
2d25f71f2a pmon.h: Ensure compat. with v5.10 kernel
The v5.10 kernel no longer guards the task_state_notify_info data
structure with #ifdef CONFIG_SIGEXIT, which causes a
redefinition-related compilation error. Work around this by checking for
the existence of the PR_DO_NOTIFY_TASK_STATE macro, and only define the
PR_DO_NOTIFY_TASK_STATE and the task_state_notify_info structure if the
kernel does not do so.

Story: 2008921
Task: 42915

Change-Id: I4bb499e2b52e20542f202dea1c2c55d88bb8ba61
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
2021-07-29 17:36:31 -04:00
Eric MacDonald
74bfeba7d3 Increase maximum preserved crash dump vmcore file size to 5Gi
The current crashDumpMgr service has several filesystem
protection methods that can result in the auto deletion
of a crashdump vmcore file. One is a hard cap of 3Gi.

This max vmcore size is too small for some applications.
Crash dump vmcore files can get large on servers that have
a lot of memory and large applications.

This update modifies the crashDumpMgr service file
max_size override to 5Gi.

Test Plan:

PASS: Verify change functions as expected
PASS: Verify change is inserted after patch apply
PASS: Verify crash dump under-size threshold handling
PASS: Verify crash dump over-size threshold handling
PASS: Verify change is reverted after patch removal

Change-Id: I867600460ba9311818ace466986603f5bffe4cd7
Closes-Bug: 1936976
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-07-21 01:46:30 +00:00
Zuul
469cc3ba06 Merge "Clear bmc alarm over mtcAgent process restart for ALL system types" 2021-06-15 20:44:12 +00:00
Eric MacDonald
d6932f49d7 Remove swerr log in hbsAgent cluster delete
The mtcAgent does not track the stopped or started
heartbeat state of a host; that is left to the
heartbeat service itself, in response to the mtcAgent
commanding heartbeat start and stop based on current
running state.

Therefore the heartbeat stop command is sometimes issued
against a host that is already in the stopped state.

The heartbeat stop command results in a call in the
hbsAgent to delete a host from the heartbeat cluster;
hbs_cluster_del.

If that host is not already in the cluster then this
call can result in a Swerr (Software Error) log.

This update removes this success path Swerr log.

Change-Id: Idb96a791a932827749e329a123f60006ff7c48ec
Closes-Bug: 1931911
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-06-14 19:04:33 -04:00
Eric MacDonald
fd5dd4254a Clear bmc alarm over mtcAgent process restart for ALL system types
If a host's BMC is provisioned and the mtcAgent process
is restarted then remove the gating condition that avoids
clearing the BMC access alarm in AIO SX.

Change-Id: I0734c2203a7acaee27c40c3c0d259b4cc5726b5d
Closes-Bug: 1931906
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-06-14 16:46:41 -04:00
Eric MacDonald
ba6c61584d Refactor background in-service start host services handling
The maintenance add_handler fsm loads inventory and recovers
host state over a process restart. If the active controller's
uptime is less than 15 minutes the restart event is treated as
a Dead Office Recovery (DOR) and is more forgiving to host
recovery by scheduling the 'start host services' as a
background operation so as to not hold up the add operation.

The current implementation of the background handling of
'start host services' is not handling the AIO subfunction
case properly in DOR mode as well as being difficult to
follow and therfore fix and maintain. This miss handling
leads to maintenance incorrectly failing the node with a
subfunction configuration error over the DOR case.

This update refactors the background handling of 'start host
services' to fix the issue and improve its clearity and
maintainability.

Test Cases:

PASS: Verify AIO DX DOR handling
PASS: Verify AIO DX active controller reboot handling
      - standby with uptime ; < 15 min and > 15 min
PASS: Verify AIO DX standby controller reboot handling
PASS: Verify subfunction configuration error handling

Regression:

PASS: Verify start host services wait/retry handling.
PASS: Verify start host services failure handling.
PASS: Verify DOR of Standard system
PASS: Verify DOR of AIO Plus system
PASS: Verify AIO System Install
PASS: Verify Standard System Install
PASS: Verify AIO plus system install

Change-Id: Ia4683672e3a2852b5b4837167b2dcd2a1e4e6d57
Closes-Bug: 1928095
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-05-11 12:25:27 -04:00
Eric MacDonald
ce75299649 Fix enabling heartbeat of self from the peer controller
This issue only occurs over an hbsAgent process restart
where the ready event response does not include the
heartbeat start of the peer controller.

This update reverts a small code change that was
introduced by the following update.

https://review.opendev.org/c/starlingx/metal/+/788495

Remove the my_hostname gate introduced at line 1267 of
mtcCtrlMsg.cpp because it prevents enabling heartbeat
of self by the peer controller.

Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-05-06 13:35:54 -04:00
Eric MacDonald
48978d804d Improved maintenance handling of spontaneous active controller reboot
Performing a forced reboot of the active controller sometimes
results in a second reboot of that controller. The cause of the
second reboot was its reported uptime in the first mtcAlive
message following the reboot being greater than 10 minutes.

Maintenance has a long-standing graceful recovery threshold of
10 minutes, meaning that if a host loses heartbeat and enters
Graceful Recovery, if the uptime value extracted from the first
mtcAlive message following the recovery of that host exceeds 10
minutes, then maintenance interprets that the host did not reboot.
If a host goes absent for longer than this threshold then for
reasons not limited to security, maintenance declares the host
as 'failed' and force re-enables it through a reboot.

With the introduction of containers and addition of new features
over the last few releases, boot times on some servers are
approaching the 10 minute threshold and in this case exceeded
the threshold.

The primary fix in this update is to increase this long-standing
threshold to 15 minutes to account for evolution of the product.

During the debug of this issue a few other related undesirable
behaviors related to Graceful Recovery were observed with the
following additional changes implemented.

 - Remove hbsAgent process restart in ha service management
   failover failure recovery handling. This change is in the
   ha git with a loose dependency placed on this update.
   Reason: https://review.opendev.org/c/starlingx/ha/+/788299

 - Prevent the hbsAgent from sending heartbeat clear events
   to maintenance in response to a heartbeat stop command.
   Reason: Maintenance receiving these clear events while in
           Graceful Recovery causes it to pop out of graceful
           recovery only to re-enter as a retry and therefore
           needlessly consumes one (of a max of 5) retry count.

 - Prevent successful Graceful Recovery until all heartbeat
   monitored networks recover.
   Reason: If heartbeat of one network (say cluster) recovers but
           another (management) does not, then it's possible the
           max Graceful Recovery retry count could be reached
           quite quickly, causing maintenance to fail the host
           and force a full enable with reboot.

 - Extend the wait for the hbsClient ready event in the graceful
   recovery handler timeout from 1 minute to the worker config timeout.
   Reason: To give the worker config time to complete before force
           starting the recovery handler's heartbeat soak.

 - Add Graceful Recovery Wait state recovery over process restart.
   Reason: Avoid double reboot of Gracefully Recovering host over
           SM service bounce.

 - Add requirement for a valid out-of-band mtce flags value before
   declaring configuration error in the subfunction enable handler.
   Reason: rebooting the active controller can sometimes result in
           a falsely reported configuration error due to the
           subfunction enable handler interpreting a zero value as
           a configuration error.

 - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
   Reason: To assist log analysis and issue debug

Test Plan:

PASS: Verify handling active controller reboot
             cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
             cases: with and without timeout, with and without bmc
             cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
             cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot to standby
             takeover when the standby was up for less than 15 minutes.

Regression:

PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
             cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
             cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
             cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby
             controller' and worker nodes in AIO Plus.

PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing

Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-30 15:35:53 +00:00
Eric MacDonald
7539d36c3f Prevent mtcClient from sending to uninitialized socket in AIO SX
The mtcClient will perform a socket reinit if it detects a socket
failure. The mtcClient also avoids setting up its controller-1
cluster network socket for the AIO SX system type ; because there
is no controller-1 provisioned.

Most AIO SX systems have the management/cluster networks set to
the 'loopback' interface. However, when an AIO SX system is setup
with its management and cluster networks on physical interfaces,
with or without vlan, the mtcAlive send message utility will try
to send to the uninitialized controller-1 cluster socket. This
leads to a socket error that triggers a socket reinitialization
loop which causes log flooding.

This update adds a check to the mtcAlive send utility to avoid
sending mtcAlive to controller-1 for AIO SX system type where
there is no controller-1 provisioned; no send, no error, no flood.

Since this update needed to add a system type check, this update
also implemented a system type definition rename from CPE to AIO.
Other related definitions and comments were also changed to make
the code base more understandable and maintainable.

Test Plan:

PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
PASS: Verify AIO SX locked-disabled-online state
PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)

Regression:

PASS: Verify AIO SX Lock and Unlock (lazy reboot)
PASS: Verify AIO DX and DC install with pv regression and sanity
PASS: Verify Standard system install with pv regression and sanity

Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
Closes-Bug: 1897334
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-21 10:20:10 -04:00
Zuul
412ff83f25 Merge "Modify mtce daemon log rotation config files" 2021-04-12 21:45:08 +00:00
Eric MacDonald
3c1e9d9601 Modify mtce daemon log rotation config files
This update makes the following setting changes to the
maintenance log rotation configuration files

 - add 'create' with permissions to each tuple
 - add 'delaycompress'
 - group together log files with similar settings
 - move global settings to local settings
 - remove 'copytruncate' global setting
 - remove the 'nodateext' global and local setting
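For illustration, a logrotate tuple reflecting the settings listed above might look like the fragment below. The file names, size, and rotate count are assumptions, not the shipped mtce configuration.

```
# hypothetical mtce logrotate tuple ; settings per the list above
/var/log/mtcAgent.log /var/log/mtcClient.log
{
    size 10M
    rotate 5
    compress
    delaycompress
    create 0640 root root
    missingok
    notifempty
}
```

Grouping files with identical settings into one tuple, and using 'create' instead of the removed global 'copytruncate', means each rotated file is reopened fresh with 0640 permissions rather than truncated in place.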

Test Plan:

PASS: Verify log rotation for all mtc log files
PASS: Verify no log loss over rotation
PASS: Verify log rotation file naming convention
PASS: Verify delaycompress on all mtce log files
PASS: Verify log permissions after rotate are 0640

Regression:

PASS: Verify AIO system install
PASS: Verify Standard system install
PASS: Verify full and dated collect

Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
Partial-Bug: 1918979
Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-07 20:47:54 +00:00
Zuul
d3b9a1f0c0 Merge "Add in-service test to clear stale config failure alarm" 2021-04-06 14:39:20 +00:00
Eric MacDonald
031818e55b Add in-service test to clear stale config failure alarm
A configuration failure alarm can get stuck asserted if
a node experiences an uncontrolled reboot that recovers
without a configuration failure.

This update adds an in-service test that audits host health
while there is a configuration failure alarm raised and
clears that alarm if the failure condition goes away. This
could be a result of an in-service manifest that runs and
corrects the configuration or if the node reboots and comes
back up in a healthy (properly configured) state.

Fixed a bug that cleared the config alarm severity state
when a heartbeat clear event was received.

This update also goes a step further and introduces an
alarm state audit that detects and corrects maintenance
alarm state mismatches.

Test Plan:

PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
      ... degrade - active controller
      ... fail    - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch

Regression:

PASS: Verify enable sequence config failure handling
PASS: ... active controller     - recoverable degrade
PASS: ... other nodes           - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install

Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-29 16:39:52 -04:00
Zuul
0d98938f2b Merge "Fix reinstall of controller nodes" 2021-03-28 15:19:40 +00:00
Eric MacDonald
5c83453fdf Fix Graceful Recovery handling while in Graceful Recovery handling
The current Graceful Recovery handler is not properly handling
back-to-back Multi Node Failure Avoidance (MNFA) events.

There are two phases to MNFA

 phase 1: waiting for number of failed nodes to fall below
          mnfa_threshold as each affected node's heartbeat
          is recovered.
 phase 2: then a Graceful Recovery Wait period which is an
          11 second heartbeat soak to verify that a stable
          heartbeat is regained before declaring the MNFA
          event complete.

The Graceful Recovery Wait status has been seen to be left
uncleared (stuck) on one or more of the affected nodes if
phase 2 of MNFA is interrupted by another MNFA event;
aka MNFA Nesting.

Although this stuck status is not service affecting it does leave
one or more nodes' host.task field, as observed under host-show,
with "Graceful Recovery Wait" rather than empty.

This update makes Multi Node Failure Avoidance (MNFA) handling
changes to ensure that, upon MNFA exit, the recovery handler
is properly restarted if MNFA Nesting occurs.

Two additional Graceful Recovery phase issues were identified
and fixed by this update.

 1. Cut Graceful Recovery handling time in half

    - Found and removed a redundant 11 second heartbeat soak
      at the very end of the recovery handler.
    - This cuts the graceful recovery handling time down from
      22 to 11 seconds thereby cutting potential for nesting
      in half.

 2. Increased supported Graceful Recovery nesting from 3 to 5

    - Found that some links bounce more than others so a nesting
      count of 3 can lead to an occasional single node failure.
    - This adds a bit more resiliency to MNFA handling of cases
      that exhibit more link messaging bounce.

Test Plan: Verified 60+ MNFA occurrences across 4 different
           system types including AIO Plus, Standard and Storage

PASS: Verify Single Node Graceful Recovery Handling
PASS: Verify Multi Node Graceful Recovery Handling
PASS: Verify Single Node Graceful Recovery Nesting Handling
PASS: Verify Multi Node Graceful Recovery Nesting Handling
PASS: Verify MNFA of up to 5 nests can be gracefully recovered
PASS: Verify MNFA of 6 nests leads to full enable of affected nodes
PASS: Verify update as a patch
PASS: Verify mtcAgent logging

Regression:

PASS: Verify standard system install
PASS: Verify product verification maintenance regression (4 runs)
PASS: Verify MNFA threshold increase and below threshold behavior
PASS: Verify MNFA with reduced timeout behavior for
      ... nested case that does not timeout
      ... case that does not timeout
      ... case that does timeout

Closes-Bug: 1892877
Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-17 14:25:19 -04:00
Mihnea Saracin
497a6f93f4 Fix reinstall of controller nodes
At shutdown, systemd will try to remount everything read-only
before attempting to unmount it. In the wipedisk script we
are deleting the partitions without unmounting
their corresponding filesystems. This leads to errors because
systemd will try to remount filesystems
whose partitions were deleted.

To fix this we have to unmount the filesystems that are linked to the
removed partitions.

Closes-Bug: 1919153
Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Change-Id: I49a3c06ae6bce1324dd06f4fc63fb3e5cd4d28c1
2021-03-16 14:02:10 +02:00
Zuul
84ba5f693a Merge "Fix mtce compiling issue with gcc8" 2021-03-15 22:33:45 +00:00
Eric MacDonald
4f5bf78f55 Improve mtcAgent interrupted thread cleanup
A BMC command send will be rejected if its thread
is not in the IDLE state going into the call.

This issue is seen to occur over a reprovisioning action
while the bmc access alarmable condition exists.

Maintenance will do retries, so the only visible side effect
of this issue is a failure to provision to 'redfish' over a
provisioning switch to 'dynamic' (learn mode). Instead,
ipmi is selected.

The non-return to idle can occur when the bmc handler FSM
is interrupted by a reprovisioning request while a bmc
command is in flight.

This update enhances the thread management module by
introducing a thread consumption utility that is called
by the bmc command send utility. If the send finds that
its thread is not in the IDLE state it will either kill
the thread if it is running or free a completed but-not-
consumed thread result.

Note: Maintenance only supports the execution of
a single thread per host per process at one time.

Test Plan:

PASS: Verify BMC provisioning change from ipmi to dynamic
      while the ipmi provisioning was failing prior to
      re-provisioning. Verify the previous error is cleaned
      up and the reprovisioning request succeeds as expected.

PASS: Verify thread 'execution timeout kill' cleanup handling.
PASS: Verify thread 'complete but not consumed' cleanup handling.
PASS: Verify logging during regression soaks

Regression:

PASS: Verify bmc protocol reprovisioning script soak
PASS: Verify sensor monitoring following BMC reprovisioning
PASS: Verify product verification mtce regression test suite

Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178
Closes-Bug: 1864906
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-15 10:51:16 -04:00
Eric MacDonald
4f7d82308f Add NonRecoverable property to Hardware Monitor's Redfish
This update adds 'NonRecoverable' sensor health property
to the Hardware Monitor's Redfish platform management
protocol support.

Test Plan:

PASS: Verify handling of Redfish NonRecoverable sensor
      ... using redfish
      ... switching between ipmi and redfish and back
PASS: Verify sensor model relearn over change of bmc protocol

Regression:

PASS: Verify sensor model relearn by command
PASS: Verify sensor suppression
PASS: Verify sensor alarm and degrade management
      ... as sensor events come and go
      ... on sensor suppression and unsuppression
PASS: Verify sensor monitoring regression test
PASS: Verify update as a patch (apply/remove)

Change-Id: I2770e63f4d44e269b4410f392707f3cd01e9a2cc
Closes-Bug: 1918152
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-11 11:13:59 -05:00