metal/mtce/src/maintenance/mtcThreads.h
Eric MacDonald a1256a3c32 Make Hardware Monitor sensor list a thread local variable
The current sensor list is shared across all hosts. On large systems,
this can lead to list corruption when host sensor read threads output
data concurrently.

This update makes sensor_list thread local, so each thread gets its
own unique instance. Although thread_local variables do not live on
the stack, their storage, known as TLS (Thread-Local Storage), is tied
to the thread’s resources and in many cases is drawn from the same
per-thread region as the stack.
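
The per-thread behaviour described above can be illustrated with a
minimal sketch. The sensor_sample struct, MAX_SAMPLES and the helper
names are hypothetical stand-ins, not the real hwmond definitions; the
only point shown is that a thread_local array gives each thread its own
copy.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real sensor list; names and sizes are
// illustrative only.
struct sensor_sample { int value; };
constexpr std::size_t MAX_SAMPLES = 4;

// thread_local: every thread that touches this gets its own instance.
thread_local sensor_sample sample_list[MAX_SAMPLES];

// Writes id into this thread's list, waits until all n workers have
// written, then reads the value back. With a shared list the read-back
// would show another worker's id.
void worker(int id, int n, std::atomic<int>* written, int* out)
{
    for (auto& s : sample_list) s.value = id;
    written->fetch_add(1);
    while (written->load() < n) std::this_thread::yield();
    *out = sample_list[0].value;
}

// Returns true when each of n concurrent workers read back its own id
// and the calling thread's copy was left untouched.
bool lists_are_unique(int n)
{
    sample_list[0].value = -1;          // the caller's private copy
    std::atomic<int> written{0};
    std::vector<int> seen(n, -2);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back(worker, i, n, &written, &seen[i]);
    for (auto& t : threads) t.join();
    for (int i = 0; i < n; ++i)
        if (seen[i] != i) return false;
    return sample_list[0].value == -1;
}
```

Calling lists_are_unique(8) is expected to return true; removing the
thread_local keyword turns the array into the shared list the commit
describes, and the same check would then be expected to fail.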

The TLS area is often allocated adjacent to or within the thread’s
stack mapping. A large thread_local variable increases the TLS
requirement, and if it exceeds the reserved space or overlaps with the
stack, thread creation may fail with EAGAIN ("Resource temporarily
unavailable").

To accommodate this, the per-thread stack size was increased and the
list itself was made smaller. The sensor_list reserved space for up to
512 sensors per host, which is excessive. This update reduces the
maximum to 256 sensors per host, cutting the list size from 327 KB to
163 KB per thread.
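
The size figures can be cross-checked with a little arithmetic. The
640-byte per-sensor size below is an assumption inferred from the
numbers quoted here (327 KB for 512 sensors); the actual struct size is
not stated in this message.

```cpp
#include <cstddef>

// Assumption: per-sensor footprint inferred from 327 KB / 512 sensors.
// The real sensor sample struct size is not given in the source.
constexpr std::size_t BYTES_PER_SENSOR = 640;

// Size in bytes of a list holding max_sensors entries.
constexpr std::size_t list_bytes(std::size_t max_sensors)
{
    return max_sensors * BYTES_PER_SENSOR;
}

// 512 sensors -> 327,680 bytes (quoted as 327 KB).
static_assert(list_bytes(512) == 327680, "512-sensor list size");
// 256 sensors -> 163,840 bytes (quoted as 163 KB).
static_assert(list_bytes(256) == 163840, "256-sensor list size");
// Under this assumption the reduced list is still larger than a
// 128 KB (0x20000 byte) stack.
static_assert(list_bytes(256) > 0x20000, "list exceeds 128 KB");
```

If that assumption holds, the halved list is 160 KiB, which alone is
larger than the 128 KB stack and is consistent with the stack size
still having to grow.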

Even with this reduction, the thread stack size needed to be increased
from 128 KB to 512 KB. The Mtce Thread utility was updated to support
custom stack sizes. This allows mtcAgent to remain at 128 KB while
hwmond threads can specify a larger size.
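
A caller-selected stack size is typically passed to POSIX threads
through a thread attribute. The sketch below shows only that general
pattern; launch_with_stack, set_flag and HWMOND_STACK_SIZE are
hypothetical names and are not taken from the Mtce Thread utility.

```cpp
#include <pthread.h>
#include <cstddef>

constexpr std::size_t DEFAULT_STACK_SIZE = 0x20000; // 128 KB, as MTCAGENT_STACK_SIZE
constexpr std::size_t HWMOND_STACK_SIZE  = 0x80000; // 512 KB, hypothetical name

// Starts fn(arg) on a new thread whose stack is stack_size bytes.
// Returns 0 on success or the pthread error number (e.g. EAGAIN).
int launch_with_stack(pthread_t* tid, void* (*fn)(void*), void* arg,
                      std::size_t stack_size)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;

    rc = pthread_attr_setstacksize(&attr, stack_size);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}

// Trivial thread body used to show the launch succeeded.
void* set_flag(void* arg)
{
    *static_cast<int*>(arg) = 1;
    return nullptr;
}
```

With this shape, one caller can keep passing DEFAULT_STACK_SIZE while
another passes HWMOND_STACK_SIZE, which mirrors the split between
mtcAgent and hwmond described above.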

This update also adds a debug feature to create dated sensor reading
files for each host. While testing, it was found that output files were
created with inconsistent permissions. This update fixes the file mode
to 0644.
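
This message does not say how the mode is enforced. One common cause
of inconsistent permissions is that the mode passed to open() is
filtered by the process umask; a minimal sketch of one way to pin the
mode, using hypothetical helper names, is to follow the open() with
fchmod():

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Creates or truncates path, forces its mode to 0644 regardless of the
// umask, and writes len bytes of data. Returns 0 on success, -1 on error.
int write_with_mode_0644(const char* path, const char* data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    // open() applies the umask to the requested mode; fchmod() does not.
    int rc = fchmod(fd, 0644);
    if (rc == 0 && write(fd, data, len) != static_cast<ssize_t>(len))
        rc = -1;

    close(fd);
    return rc;
}

// Returns the permission bits of path, or 0 if it cannot be inspected.
mode_t file_mode(const char* path)
{
    struct stat st;
    if (stat(path, &st) != 0) return 0;
    return st.st_mode & 0777;
}
```

Even with a restrictive umask such as 077, a file written this way is
expected to end up with mode 0644.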

Test Plan: Verified on a 2+2+50 node system

PASS: Verify large system install and sensor monitoring
PASS: Verify large system sensor monitoring over DOR and Swact
PASS: Verify the sensor_sample list storage is unique per thread
PASS: Verify sensor read file permissions
PASS: Verify dated debug sensor read files
PASS: Verify added debug options are disabled by default
PASS: Verify 24 hour provision/monitor/deprovision soak
PASS: Verify sensor monitoring following host delete and readd
PASS: Verify sensor model is deleted completely with host delete
PASS: Verify sensor model is recreated over host readd

Regression:

PASS: Verify sensor monitoring and alarm management
PASS: Verify hardware monitor process restart handling
PASS: Verify no coredumps
PASS: Verify logging for all test cases

Closes-Bug: 2102671
Change-Id: I9263ec2242e03d46e9dc768af965fed7e1ac9175
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-03-29 01:23:16 +00:00

#ifndef __INCLUDE_MTCTHREAD_HH__
#define __INCLUDE_MTCTHREAD_HH__
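/*
 * Copyright (c) 2013-2017 Wind River Systems, Inc.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 */

/**
 * @file
 * Wind River CGTS Platform Node Maintenance "Thread Header"
 * Header and Maintenance API
 */

#include <string>

/* Extra information handed to the BMC threads.
 * The bm_ prefix is read here as "board management". */
typedef struct
{
    std::string bm_ip  ; /* board management IP address        */
    std::string bm_un  ; /* board management username          */
    std::string bm_pw  ; /* board management password          */
    std::string bm_cmd ; /* board management command to run    */
} thread_extra_info_type ;

#define MTCAGENT_STACK_SIZE (0x20000) // 128 kBytes

void * mtcThread_bmc      ( void * );
void * mtcThread_bmc_test ( void * arg );
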
#endif // __INCLUDE_MTCTHREAD_HH__