metal

Author	SHA1	Message	Date
Eric MacDonald	4074c6c25d	Move the mtce /etc/mtc/tmp/.node_locked flag file out of /etc When the mtcAgent locks a node, it commands the mtcClient to create a persistent .node_locked flag file at /etc/mtc/tmp/.node_locked. Conversely, when the node is unlocked, the mtcAgent commands the mtcClient to remove this flag file. However, an issue arises where an unlocked node may still have the .node_locked file present after an upgrade-rollback or patch-removal operation. The issue occurs because the OSTree upgrade deployment process runs while the node is locked. During this process, OSTree takes a snapshot of the /etc directory, which includes the .node_locked file. Even if the file is later removed by maintenance actions, after deploy but before reboot, OSTree restores it from the snapshot resulting in the reinstatement of the .node_locked file on an unlocked node. To eliminate this file management conflict, this update moves the persistent .node_locked flag file to a location outside of OSTree's management, specifically to /var/persist/mtc/.node_locked. The directory name 'persist' was chosen to clearly indicate that the files in this directory are intended to persist across reboots. This update also fixed a post install script logging error trying to rename the hwclock.sh.<init>.bak file with one already present. Test Plan: PASS: Verify the creation of the new /var/persist/mtc directory. PASS: Verify any files under this directory persist over reboot. PASS: Verify proper management of the node locked file over upgrade and rollback. PASS: Install an AIO DX and verify the node locked file management. PASS: Verify AIO DX upgrade from MR2PLUS to 24.09 master. PASS: Install a Standard DX System with 1 worker and 2 storage and verify the node locked file management over and following an upgrade from MR2PLUS to 24.09 master. PASS: Verify obsoleted /etc/mtc/tmp/.node_locked file is auto removed by both package install and over a mtcClient startup/restart. 
PASS: Verify /etc/mtc/tmp dir remains. PASS: Verify mtce debian package installs without error or warning. Closes-Bug: 2095212 Change-Id: I3431abfef74c678fbeaa149bf6ac29ee254be111 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2025-02-06 16:54:33 +00:00
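The flag file handling described in the commit above can be sketched roughly as follows. This is an illustrative Python sketch only, not the mtcClient implementation (which is not shown in this log); the function names and the `root` parameter are hypothetical, and only the two file paths come from the commit message.

```python
from pathlib import Path

# Paths taken from the commit message, expressed relative to a root
# directory so the sketch can be exercised against a scratch tree.
OLD_FLAG = Path("etc/mtc/tmp/.node_locked")      # inside OSTree-managed /etc
NEW_FLAG = Path("var/persist/mtc/.node_locked")  # outside OSTree management


def remove_obsolete_flag(root: Path) -> bool:
    """Delete the obsolete flag file if present; leave its directory alone.

    Returns True if a file was removed. (Hypothetical helper name.)
    """
    try:
        (root / OLD_FLAG).unlink()
        return True
    except FileNotFoundError:
        return False


def set_node_locked(root: Path, locked: bool) -> None:
    """Create or remove the persistent flag at the new location."""
    flag = root / NEW_FLAG
    if locked:
        flag.parent.mkdir(parents=True, exist_ok=True)
        flag.touch()
    else:
        try:
            flag.unlink()
        except FileNotFoundError:
            pass  # already unlocked; nothing to do
```

Calling `remove_obsolete_flag(Path("/"))` at process startup would correspond to the "auto removed ... over a mtcClient startup/restart" test case, under the assumptions above.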
Eric MacDonald	0853bb3fcc	Add configured add host delay to mtcAgent The active 'controller' domain name is used by the mtcAgent management interface to communicate with the mtcClient. The System Swact (Switch Activity) function dynamically migrates active controller services between controller-0 and controller-1. During this process, the mtcAgent, along with other services, are restarted on the newly active controller. When the mtcAgent starts, it reads the system inventory and adds the hosts to its internal control structure. During this "add" operation, the mtcAgent sends commands and expects responses from the local and remote mtcClients on individual nodes, using the controller domain name, which represents the management network's floating IP address. A new feature, the FQDN (Fully Qualified Domain Name) Resolution Manager, was introduced to handle domain name resolution in the StarlingX system. However, an issue was identified where the FQDN resolution manager does not have the 'controller' domain name resolution support fully available (qualified) when the mtcAgent starts messaging with its mtcClients. As a result, the communication between the mtcAgent and mtcClient can lead to silent message loss. This issue can cause the "add host" operation to fail, potentially being service affecting for that host. This update adds a small, manually configurable delay, to the mtcAgent host add operation start. This gives FQDN the time to complete setting up name resolution for the required 'controller' domain name. The default add_host_delay of 20 seconds was selected after seeing the occasional failure with a 10 second delay. This update can be removed in the future if the system makes changes to avoid starting the mtcAgent before all name resolution is ready. Test Plan: PASS: Verify issue in system, apply update, verify issue is resolved. PASS: Verify package/iso build along with AIO DX system install. PASS: Verify mtcAgent logging. 
Regression: PASS: Verify standby controller lock/unlock soak ; 10+ loops. PASS: Verify Swact soak of 20+ swacts succeeds without reproducing the issue this update is designed to fix. PASS: Verify heart beating is enabled on all remote hosts on both controllers following an install and multiple Swacts. PASS: Verify sensor monitoring is enabled on all hosts that have their BMC provisioned over a Swact. PASS: Verify mtcClient, mtcAgent, hbsAgent and hbsClient logs for unexpected behavior. PASS: Verify default add host delay can be changed and a mtcAgent configuration reload or process restart uses the modified value. PASS: Verify no add host delay is imposed if the new configuration label is removed from the config file or set to 0. PASS: Verify host lock immediately following a swact and successful system host-list. Closes-Bug: 2093381 Change-Id: I694322eff0945c7c56bf21051b3d6cccacf829a2 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2025-01-10 12:54:46 +00:00
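As a rough illustration of the configurable delay described above, the following Python sketch reads an `add_host_delay` value and treats a missing label or a value of 0 as "no delay". The label name comes from the commit message; the `[agent]` section name, the INI layout, and the function name are assumptions made for the example and may not match the real mtce configuration file.

```python
import configparser


def read_add_host_delay(config_text: str) -> int:
    """Return the add-host delay in seconds; 0 means no delay is imposed."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # "agent" is an assumed section name. The fallback covers both a
    # missing section and a missing label.
    raw = parser.get("agent", "add_host_delay", fallback="0")
    try:
        delay = int(raw)
    except ValueError:
        return 0  # unparsable value: behave as if the label were absent
    return max(delay, 0)


if __name__ == "__main__":
    print(read_add_host_delay("[agent]\nadd_host_delay = 20\n"))
```

Rereading the file on a configuration reload and calling the function again would pick up a modified value, which is the behavior the test plan checks.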
Zuul	1db4094905	Merge "Remove sw-patch-agent from kickstart"	2024-11-26 18:57:22 +00:00
mmachado	1924d5191a	Remove sw-patch-agent from kickstart sw-patch is being removed and mentions of it should be removed as they serve no purpose. Depends-On: https://review.opendev.org/c/starlingx/update/+/934968 Test-Plan: PASS: AIO-SX upgrade using sw-manager strategy PASS: AIO-DX System Controller upgrade using strategy PASS: subcloud upgrade using dcmanager strategy PASS: AIO-DX initial install and bootstrap Story: 2010676 Task: 51401 Change-Id: I980aafc59b2abf6ecf405add8cdeef7ae4b3a7a3 Signed-off-by: mmachado <mmachado@windriver.com>	2024-11-25 19:34:53 +00:00
Jim Gauld	d368475197	Configure systemd CPUShares/CPUQuota for pmon.service This updates CPUShares and CPUQuota for pmon.service. This gives reduced shares and quota since pmon.service has sporadic CPU usage yet is not latency critical. Significant hirunner CPU usage comes from various audits (unrelated to pmon process itself) running under the systemd pmon.service cgroup. For example: ceph health audit, ceph osd audit, can easily require 100% cpu for several seconds, often taking 30% occupancy for multiple seconds. This reduces pmon cgroup to 150 CPUShares from 1024 and sets CPUQuota to 15%. This smooths out behaviour of poorly behaved audits. This effectively slows down the audit behaviour by a few seconds due to throttling. This is part of an overall set of adjustments required for systemd cgroups CPUShares, CPUQuota, and AllowedCPUs for key system services. This will improve latency of Kubernetes critical components, and throttle less important services. Partial-Bug: 2084714 TEST PLAN: AIO-SX, AIO-DX, Standard, Storage, DC, AIO-DX with ceph: - PASS: Fresh install - PASS: verify systemd parameters for pmon Example: systemctl show pmon.service | grep -e CPUShares -e CPUQuota AIO-SX, AIO-DX: - PASS: B&R AIO-DX: - PASS: K8S orchestrated upgrade 1.24 - 1.29 - TODO: controller swact Change-Id: I6ee5c6029c2a5a0fae26e9231401e4d4f1c016df Signed-off-by: Jim Gauld <James.Gauld@windriver.com>	2024-11-15 09:21:05 -05:00
fperez	e62642e97f	Fix intermittent process failure alarms auto-clear issues This commit addresses the issue of intermittent failures that occur when errors are encountered while opening files with extra text for specific processes. These errors led to mismatches between the entity_instance_id of the created alarm and the alarm being deleted. With this commit, the extra text is now appended only to the alarm when it is created, and it will not be considered when the system attempts to remove the alarm. This change helps prevent the mismatches caused by file errors and ensures alarms are handled correctly. Test plan PASS: Build package. PASS: Install package and bootstrap system PASS: Use Eric MacDonald's pmon regression tests to verify behavior. closes-bug: 2078986 Change-Id: I622450c45770d251d62a80ccb964c65ce9e4d935 Signed-off-by: fperez <fabrizio.perez@windriver.com>	2024-09-11 16:09:12 -03:00
Eric MacDonald	dab9c4774b	Maintenance does not auto-start worker host services in AIO The mtcClient is required to 'start host services' autonomously following a node reboot. This is to handle the use case where the administrator disables maintenance heartbeat loss auto recovery. If that node then reboots on its own, for whatever reason, maintenance needs to ensure that it auto starts 'host services'. A fairly recent update delivered support for that use case: https://opendev.org/starlingx/metal/commit/1335bc484df331771e995ae822df3af84cc5739d However, the current mechanism the mtcClient used to manage auto-starting host services did not handle the worker subfunction case. Moreover, the current implementation is not handling the potential concurrency between the mtcClient process startup case and mtcAgent requests during unlock recovery. This case also fixes an issue where the mtcClient sometimes gets into a mode where it floods the mtcAgent with a start host services result message ; 20 unnecessary messages / sec. The aforementioned update modified the mtcAgent to log receipt of this message which then floods the mtcAgent log leading to unnecessary message handling and log rotations. Test Plan: Success Path: PASS: Verify mtcClient success path handling of start and stop host services function for the various node types in a ... - standard system with worker and storage nodes - all-in-one system with worker node PASS: Verify appropriate start host services are run on each node type following a Dead Office Recovery (DOR). - standard system with worker and storage nodes - all-in-one system with worker node PASS: Verify the mtcClient does not unnecessarily send host services result messages. PASS: Verify handling of periodic start host services message while a node is in service. Failure Path: PASS: Verify mtcClient failure path handling of start and stop host services function for the various node types in a ... 
- standard system with worker and storage nodes - all-in-one system with worker node PASS: Verify mtcClient start host services command handling when message requests interleave with auto start handling during unlock recovery. Closes-Bug: 2073802 Change-Id: I0da7a16c1f600cc60364f6bcec7587e2ff71c624 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-08-09 14:48:05 +00:00
Eric MacDonald	50204147ff	Add PS Redundancy Sensor to Redfish server power sensor group This update adds the Power Supply Redundancy sensor to the redfish server power sensor group. Some special handling is required to make the assertion of this new sensor have a 'major' severity level and only while there are 2 or more power supplies provisioned. See code comments in the review that highlight the assertion only applies when the redundancy sensor count is 2 and severity is overridden from critical to major. This update does not apply to the IPMI 'server power' sensor group. This is because the IPMI protocol does not distinguish between single and redundant power supply provisioning cases and reports a redundancy loss in the single power supply case even when that power supply is operating fine. Test Plan: PASS: Verify new PS Redundancy sensor is added to the server power sensor group with redfish sensor monitoring. PASS: Verify no PS Redundancy assertion with redundant power supplies installed while both have AC power input. PASS: Verify major PS Redundancy assertion with redundant power supplies installed while one not receiving AC power input. PASS: Verify no PS Redundancy assertion with single power supply. PASS: Verify PS Redundancy sensor goes offline when 'state' is not 'Enabled' and returns to operating state when re-Enabled. PASS: Verify PS Redundancy sensor goes 'offline' when Redundancy label is missing. PASS: Verify PS Redundancy sensor goes 'offline' when RedundancySet count is missing. PASS: Verify PS Redundancy sensor goes 'offline' when Status label is missing. PASS: Verify PS Redundancy sensor assertion when Status:Health is not 'OK'. PASS: Verify PS Redundancy sensor goes 'offline' when Status:State is not 'Enabled'. PASS: Verify new PS Redundancy sensor survives a process restart. PASS: Verify new PS Redundancy sensor asserts with non-OK status while redundancy count is greater than one. 
Regression: PASS: Verify host is degraded when PS redundancy alarm is asserted. PASS: Verify alarm and degrade is cleared if sensor reads OK. PASS: Verify alarm and degrade is cleared if sensor goes offline. PASS: Verify a 'logged-major' PS Redundancy assertion raises alarm when the group action is changed to 'alarm'. PASS: Verify an 'alarm-major' PS Redundancy assertion clears alarm when the group action is changed to 'log'. PASS: Verify no PS Redundancy sensor is added to the server power sensor group with ipmi sensor monitoring. PASS: Verify no PS Redundancy assertion with single or redundant power supplies with ipmi sensor monitoring. PASS: Verify all sensor assertions are cleared when a server's BMC is reprovisioned by bm_type or bm_ip address or completely deprovisioned by bm_type=none. PASS: Verify basic hardware monitor sensor assertion/clear operations. Closes-Bug: 2076200 Change-Id: Ieae8f2b8681d1a2b29da0707b2f439cf10c47a2c Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-08-07 12:23:40 +00:00
Eric MacDonald	b29fb32f60	Clear 200.014 sensor=profile alarm over model relearn and deprovision The 200.014 Sensor Config sensor=profile alarm does not get cleared over Sensor Profile Relearn or BMC Deprovision actions. This can then lead to a stuck alarm if the sensor read / groups create issue never resolves. Sensor alarms against a host must get deleted if the BMC for that host is deprovisioned. This update removes the long time obsolete sensor=sensors alarm references and adds a clear sensor config "profile" alarm to the 'sensor group profile relearn' and 'bmc deprovisioning' code paths. Test Plan: PASS: Verify sensor config profile alarm is deleted when PASS: - sensor model is relearned PASS: - bmc deprovisioned PASS: - sensor model is properly created (FIT tested) PASS: Verify raised 200.014 alarm persists over a hwmond restart Regression: PASS: Verify basic hardware monitoring and alarming PASS: Verify sensor deprovisioning PASS: Verify sensor model relearn operation PASS: Verify sensor alarming and clear function Closes-Bug: 2074760 Change-Id: I3165105e9e4e933ab7b723bd0b6241a6a2b046ae Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-07-30 16:16:33 +00:00
Eric MacDonald	fd66519339	Fix Start Host Services race condition The following update, merged in early June, introduced a change to the mtcClient to auto-run the Start Host Services command on process startup like it does for the goenable tests. https://opendev.org/starlingx/metal/commit/1335bc484df331771e995ae822df3af84cc5739d This change introduced the potential for a race condition that did not occur during the testing of that update, likely due to the low reproduction rate. With that update in place it is possible for maintenance to receive the acknowledgement of a "Start Host Services" request followed immediately by the "Start Host Services Result" message. Receiving these messages back to back in a batch does not give maintenance enough time to update its command handler with the next expected message. The Command handler is a separate time-sliced FSM that needs to run at least once following the start request's message ack. Otherwise, the result message is dropped which leads to a Start Host Services timeout. The fix is to accept a "Start Host Services Result" response anytime it arrives while a "Start Host Services" request is outstanding. Test Plan: PASS: Verify issue occurs at a rate greater than 75% and then apply this change and verify there are no failures in a lock/unlock soak of 100 iterations. Closes-Bug: 2073802 Change-Id: I657e5fd917073f6c7a37dc13517559a9740a62e9 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-07-23 15:54:42 +00:00
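The fix described in the commit above, accepting the result any time a request is outstanding rather than only once the handler has advanced to an "expecting result" stage, can be sketched as below. This is a simplified Python model with hypothetical class and method names; the actual mtcAgent command handler is a C++ FSM that this log does not show.

```python
class StartHostServicesTracker:
    """Tracks one outstanding 'Start Host Services' request."""

    def __init__(self):
        self.outstanding = False  # a request was sent and no result seen yet
        self.acked = False        # the request's ack has been received
        self.result = None        # payload of the result message, if any

    def send_request(self):
        self.outstanding = True
        self.acked = False
        self.result = None

    def on_message(self, kind, payload=None):
        """Handle an incoming message; return True if it was accepted."""
        if not self.outstanding:
            return False  # nothing pending: stale or unsolicited message
        if kind == "ack":
            self.acked = True
            return True
        if kind == "result":
            # Accepted whenever a request is outstanding, even if the ack
            # stage has not been processed yet (the back-to-back case).
            self.result = payload
            self.outstanding = False
            return True
        return False
```

With this shape, an ack and a result delivered in the same receive batch are both consumed before the time-sliced handler runs again, so the result is no longer dropped.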
Eric MacDonald	fb36d3b810	Prevent maintenance setup of the pxeboot network on simplex systems The pxeboot network is used to install system nodes. However, simplex systems do not have system nodes. Therefore, the pxeboot network setup is not needed on SX systems. This update implements changes to Maintenance, specifically the mtcAgent and mtcClient processes, to not setup and service messaging on the pxeboot network on simplex systems. Test Plan: PASS: Verify before and after update behavior PASS: Verify Build, install and enable AIO SX PASS: Verify the pxeboot network is not setup on SX systems PASS: Verify pxeboot messaging and alarming works on DX systems PASS: Verify install and enable DX systems with no pxeboot alarms PASS: Verify mtcAgent and mtcClient logging PASS: Verify SX to DX Migration Closes-Bug: 2073292 Change-Id: I0e3749bab29d88917f36bc29e8b775dfd5e8a13f Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-07-23 13:51:34 +00:00
Zuul	13df7262c0	Merge "Remove CentOS/OpenSUSE build support"	2024-07-10 12:08:09 +00:00
Kyale, Eliud	94b9761011	Replace bmc system() commands with fork() execv() Mtce uses the system() command to run the ipmitool and redfishtool. The system() command launches a shell process that is susceptible to code injection. By switching to fork() execv() we can prevent command injection attacks if for example the bmc parameters are compromised. The bmc parameters are: - bm_type - bm_ip - bm_username - bm_password These are initially provided as user input and stored in either barbican (bm_password) or the sysinv postgres database. If these parameters are compromised, the injected code will not be run. For example, if bm_username="root; reboot&" the reboot command will not be run. Test plan: PASS - Code testing: designer testing of failure paths, verifying logs by compiling errors in the code - fork fail error path - file open failure path - dup/dup2 failure path - execv failure PASS - AIO-SX: iso install PASS - AIO-DX: iso install PASS - AIO-SX: ipmi bmc sensor/device queries system host-sensor-list <controller-0> PASS - AIO-SX: ipmi bmc reset designer modification of sysinv to allow simplex reset PASS - AIO-SX: modify bmc parameters in postgres and verify bmc command failure and proper handling e.g. bm_username="root; reboot&" PASS - AIO-SX: file leak testing of execv error path sudo lsof -p `pidof mtcAgent` sudo lsof -p `pidof hwmond` PASS - AIO-SX: memory leak and file leak testing soak sudo /usr/sbin/dmemchk.sh --C mtcAgent hwmond PASS - AIO-DX: ipmi bmc reset Virtual machine AIO-DX configured to physical bmc simulate reset on virtual machine by power down at the same time as system host-reset <controller> PASS - AIO-DX: ipmi bmc sensor/device queries system host-sensor-list <controller-0|1> Example postgres commands to compromise the bm_username parameter: sudo -u postgres \ psql -d sysinv \ -c "select bm_username from i_host where hostname='controller-0';" sudo -u postgres \ psql -d sysinv \ -c \ "update i_host set bm_username='root; reboot&' "\ "where 
hostname='controller-0';" Story: 2011095 Task: 50344 Change-Id: I250900d1c757d7e04058f4c954502b1a38db235e Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>	2024-06-13 14:43:44 -04:00
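To illustrate why fork() plus execv() avoids the injection that system() allows, here is a small Python sketch using the equivalent `os.fork`, `os.dup2` and `os.execv` calls (POSIX only). It is not the mtce code: the function name is hypothetical, and the Python interpreter stands in for ipmitool/redfishtool so the example is runnable anywhere.

```python
import os
import sys
import tempfile


def run_without_shell(argv, output_path):
    """Run argv[0] via fork()+execv(), capturing stdout/stderr in a file.

    No shell is involved, so each element of argv reaches the program as
    one literal argument. Returns the exit code, or -1 if the child did
    not exit normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child: redirect output, then replace the process image.
        try:
            fd = os.open(output_path,
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            os.dup2(fd, 1)
            os.dup2(fd, 2)
            os.close(fd)
            os.execv(argv[0], argv)
        finally:
            # Reached only if open/dup2/execv failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "out.txt")
        # The "username" contains shell metacharacters; the stand-in
        # program just echoes back its first argument.
        rc = run_without_shell(
            [sys.executable, "-c", "import sys; print(sys.argv[1])",
             "root; reboot&"],
            out)
        with open(out) as f:
            print(rc, repr(f.read().strip()))
```

The value `root; reboot&` arrives as a single argument string and is never parsed by a shell, which matches the behavior the commit's test plan verifies for a compromised bm_username.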
Zuul	dbb9543c08	Merge "Deprovision mtcClient bmc info when bmc for node is deprovisioned"	2024-06-10 16:37:24 +00:00
Eric MacDonald	508b619400	Deprovision mtcClient bmc info when bmc for node is deprovisioned A node's BMC is provisioned and deprovisioned through the system CLI. Maintenance shares controller node BMC provisioning info with the mtcClient on each controller node. The mtcClient uses this BMC provisioning info to reset its peer controller when it sees the appropriate signal from SM (a flag file). However, when a controller node's BMC is deprovisioned from the system CLI, the mtcAgent does not send the deprovisioning data to the mtcClient. Without getting the deprovisioning data the mtcClient will continue to use the previous provisioning data. This is incorrect and the reason for this fix. This update fixes this by having the mtcAgent periodically share controller node BMC provisioning data to each controller's mtcClient regardless of its provisioning state. The BMC provisioning data update period remains the same as it was while the BMCs were provisioned. This update also offers the following messaging/logging improvements. - restrict the updates to the management network only. There is no need to send the same data over the pxeboot. - stop logging while the BMC is deprovisioned. The absence/presence of the logs is sufficient to know what the provisioning state is without needlessly logging when the BMCs are not provisioned. - bypass sending the bmc provisioning data to the controller-0 mtcClient in an SX system. The data is only needed in a DX system. Test Plan: PASS: Verify mtcClient gets BMC deprovisioning data ; fix for this bug. PASS: Verify mtcClient periodically logs valid BMC provisioning data. PASS: Verify mtcClient doesn't log unprovisioned BMC provisioning data. PASS: Verify mtcAgent does not send bmc provision data on SX systems. PASS: Verify mtcAgent does send bmc provision data on DX systems. PASS: Verify worker and storage never receive bmc provisioning data. 
Closes-Bug: 2067925 Change-Id: I29e5eb0b072ee38358d99d682555c466de322f2d Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-06-05 15:27:19 +00:00
Eric MacDonald	1335bc484d	Add auto run goenabled and start hosts services to mtcClient The 'mtcClient' currently automatically runs the main function's 'goenabled' scripts on process startup for all nodes if and when their run preconditions are met. However, that is not true for 'start host services' and, in the AIO system type case, the subfunction 'goenabled' scripts. Typically, this is acceptable because the 'mtcAgent' will request these scripts to be run during unlock and failure recovery scenarios. However, if the system administrator reconfigures the maintenance heartbeat fault handling action from the default 'fail' to any other setting [degrade,alarm,none] and a node reboots outside of maintenance control, then upon reboot recovery, the 'start host services' and, if the node is an AIO controller, the required subfunction 'goenabled' scripts are not executed. In such a case, the missing subfunction 'goenabled' flag file (/var/run/goenabled_subf) prevents the hbsAgent and hbsClient on that node from entering their in-service mode of operation. Instead they run waiting for the node's In-Test phase to complete ; which never happens. This can lead to what appears to be stuck maintenance heartbeat alarms. However, it is really caused by the maintenance heartbeat processes on that node being gated from performing their mission mode function. The /var/run/goenabled_subf flag file is the AIO In-Test complete gate. It is set if the subfunction 'goenabled' tests pass. However, because this flag file is in /var/run (a volatile directory) it is lost/cleared over a reboot. This update adds the automatic execution of the AIO controller's subfunction 'goenabled' scripts and the 'start host services' for all nodes. Once all the required preconditions are met the scripts are run and that node is ready for service, regardless of how and the conditions under which it rebooted. Testing of this update is focused on - Verifying the originating issue is resolved. 
- Verify the changed behavior over the install of all system types. - Verify the changed behavior with an uncontrolled reboot of each node type for all the supported maintenance heartbeat failure action modes. Test Plan: PASS: Verify install of the following system types PASS: - AIO SX PASS: - AIO DX and AIO DX Plus PASS: - Standard DX with worker and storage nodes (vbox) PASS: - System Controller with 1 subcloud (dc-libvirt) PASS: Verify spontaneous reboot of unlocked active AIO controller with PASS: - heartbeat_failure_action=fail PASS: - heartbeat_failure_action=degrade PASS: - heartbeat_failure_action=alarm PASS: - heartbeat_failure_action=none PASS: Verify spontaneous reboot of unlocked standby AIO controller with PASS: - heartbeat_failure_action=fail PASS: - heartbeat_failure_action=degrade PASS: - heartbeat_failure_action=alarm PASS: - heartbeat_failure_action=none PASS: Verify reboot recovery after spontaneous reboot of worker PASS: Verify reboot recovery after spontaneous reboot of storage PASS: Verify start host services is run on mtcClient process startup. PASS: Verify start host services is run on worker and storage nodes when rebooted with all heartbeat failure recovery action modes. Regression: PASS: Verify degrade and alarm management over in-service heartbeat failure while heartbeat_failure_action=fail PASS: Verify degrade and alarm management over in-service heartbeat failure while heartbeat_failure_action=degrade PASS: Verify degrade and alarm management over in-service heartbeat failure while heartbeat_failure_action=alarm PASS: Verify no alarm or degrade over in-service heartbeat failure while heartbeat_failure_action=none PASS: Verify mtcClient over AIO standby controller lock/unlock PASS: Verify start host services is run on mtcClient on every node by command from mtcAgent process startup. PASS: Verify start host services is run on mtcClient over an unlock or graceful recovery by command from mtcAgent. 
PASS: Verify start host services check follows goenabled test completion on process startup. PASS: Verify stop host services is run over a node lock. PASS: Verify goenable main and subfunction failure handling PASS: Verify start hosts service failure handling PASS: Verify no coredump or crashdumps PASS: Verify no stuck alarms Closes-Bug: 2067917 Change-Id: Ie8aaf5da20b092267f637ad3df125019c244991b Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-06-04 19:42:54 +00:00
Scott Little	b31d11314b	Remove CentOS/OpenSUSE build support StarlingX stopped supporting CentOS builds after release 7.0. This update will strip CentOS from our code base. It will also remove references to the failed OpenSUSE feature as well. There are centos references in the kickstarts that still appear to be packaged in the debian build. I won't touch those. Story: 2011110 Task: 49956 Change-Id: Ifb5aa75b71a17db52e66d6fd91e7c52ed931532d Signed-off-by: Scott Little <scott.little@windriver.com>	2024-05-02 16:01:04 -04:00
Eric MacDonald	4e62e3ac9f	Prevent process coredump due to missing token in response header Both Maintenance and the Hardware Monitor use a common token refresh utility that has been seen to crash the calling process when a token 'get' request is missing the token in its response header. This update avoids that by exiting the token handler at the error detection point rather than continuing to handle the response with invalid data. Significant fault insertion testing was performed on the update which led to some additional improvements in token request error handling that both processes benefit from. Additional specific fixes include - fixed race condition memory leak around authentication error handling - differentiate token refresh from failure recovery renewal. - fixed a few missing event status / rc updates. Test Plan: - used mtce fault insertion tools to create failure modes - 24+ hr memory leak test run for both success & token error handling - all tests were done with both hwmond and mtcAgent PASS: Verify build and AIO DX install. PASS: Verify reported hwmon coredump issue is avoided/resolved. PASS: Verify issue also exists in the mtcAgent and is also avoided/resolved by this update. Regression: PASS: Verify token get failure retry handling: PASS: - get first token inline - retry cadence: 5 seconds PASS: - refresh token by http - retry cadence: 10, 30 and 1200 secs PASS: Verify recovery handling cases: PASS: - corrupt token PASS: - no token present PASS: - no token in header PASS: Verify token renewal stress soak ; every 10 seconds for 24+ hrs PASS: - repeat over token get failure cases PASS: - in each success and failure case verify no memory leaks. PASS: Verify authentication error handling soak - every 10-60 secs for 24+ hrs - token is corrupted followed by a sysinv request to exercise authentication error handling and renewal process. PASS: Verify no coredumps. PASS: Verify logging and token retry. 
PASS: Verify process continues to use the previous token until a new one is acquired. - Token Refresh is on time. - Token Renew is on event. PASS: Verify soak of persistent authentication error / token renewal cycle. No memory leak or coredumps. Closes-Bug: 2063475 Change-Id: I5eef62518ac606e6b54323b46fbb6f9475b5c1ef	2024-04-29 13:11:26 +00:00
Zuul	975e868431	Merge "Change mtcInfo log in mtcCtrlMsg.cpp to a dlog"	2024-04-17 12:10:16 +00:00
Eric MacDonald	97092bd38b	Change mtcInfo log in mtcCtrlMsg.cpp to a dlog The mtcInfo message log was enhanced to include the payload in a previous update without realizing that message contained the target BMC's username and password. This update switches that log to a debug (not enabled by default) log to avoid revealing provisioned BMC credentials in the mtce logs. Test Plan: PASS: Verify mtce package build PASS: Verify mtcInfo log with bmc info payload is no longer logged. Story: 2010940 Task: 49857 Change-Id: I35db04e9292471d92c24c98922350cfb72b5035e	2024-04-11 17:17:45 +00:00
Eric MacDonald	649e94c8da	Add pxeboot mtcAlive messaging alarm handling This update adds alarm handling to the recently introduced pxeboot network mtcAlive messaging, see depends on review below. A new 200.003 maintenance alarm is introduced with the second depends on update below. This new alarm is MINOR but also Management Affecting because the pxeboot network is required for node installation. This update enhances the new pxeboot_mtcAlive_monitor FSM for the purpose of detecting pxeboot mtcAlive message loss, alarming and then clearing the alarm once pxceboot mtcAlive messaging resumes. The new alarm assertion and clear is debounced: - alarm is asserted if message loss persists to the accumulation of 12 missed messages or after 2 minutes of complete message loss. - alarm is cleared after decrementing the message missed counter to zero or 1 minute of loss-less messaging. Upgrades are supported with the addition of a features list to the mtcClient ready event. All new mtcClients that support pxeboot network messaging now publish pxeboot mtcAlive support through this new features list. This is rendered in the logs like this: <hostname> mtcClient ready ; with pxeboot mtcAlive support The mtcAgent does not expect/monitor pxeboot mtcAlive messages from hosts that don't publish the feature support. Test Plan: PASS: Verify mtcAlive period is 5 seconds. PASS: Verify pxeboot mtcAlive monitor period is 10 seconds. PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every mtcAlive monitor miss. PASS: Verify pxeboot mtcAlive alarm is not raised while a node is locked. Alarm attributes: PASS: Verify severity is minor. PASS: Verify alarm is cleared while node is locked. PASS: Verify alarm can be suppressed while unlocked. PASS: Verify asserted alarm is management affecting. PASS: Verify alarm-show output format including cause and repair action text. Process Restart Handling: PASS: Verify alarm is maintained over a mtcAgent process restart. 
PASS: Verify pxeboot monitoring resumes with or without asserted alarm immediately following a mtcAgent process restart. PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging immediately following mtcClient process restart for locked or unlocked nodes. Alarm Debounce Handling: PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss. PASS: Verify alarm clear after 1 minutes of mtcAlive recovery. PASS: Verify assertion and recovery debounce logging. PASS: Verify alarm management miss and loss controls handle all boundary conditions exercised by a 12 hr soak with randomized period between message loss and recovery. Host Action Handling: PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable. PASS: Verify mtcAlive alarm is not raised over a Host Graceful Recovery. PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On. PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset. PASS: Verify mtcAlive alarm is not raised over a Host Reinstall. PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling. PASS: Verify pxeboot alarm handling for node that does not send pxeboot mtcAlive after unlock. Stuck Alarm Avoidance Handling: PASS: Verify typical alarm assertion and clear handling. PASS: Verify alarm is maintained or cleared over node reboot if the messaging issue persists or resolves over the reboot recovery. PASS: Verify mtcAlive alarm is maintained over a Swact and cleared if the messaging is ok on the newly active controller. PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled Swact due to active controller reboot. PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot messaging recovers over that reboot. Upgrades Case: PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients that actually support pxeboot network mtcAlive monitoring. PASS: Verify mtcClient new features list, parsing which enables pxeboot mtcAlive monitoring for that node. 
PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled towards nodes whose mtcClient does not publish pxeboot mtcAlive messaging feature support. PROG: Verify AIO DX upgrade from 22.12 to current master branch. Focus on pxeboot messaging over the upgrade process. Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654 Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660 Story: 2010940 Task: 49542 Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-04-09 14:13:23 +00:00
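The assert/clear debounce described in commit 649e94c8da can be illustrated with a small sketch. This is not the pxeboot_mtcAlive_monitor FSM itself. The class name, the `tick` method and the choice to decrement the miss counter by one per good period are assumptions made for illustration. Only the 12-miss and 2-minute assert thresholds, the 1-minute clear threshold and the 10 second monitor period come from the commit message.

```python
class PxebootAlarmDebounce:
    """Hypothetical debounce model; names are illustrative, not mtce code."""

    ASSERT_MISSES = 12       # from the commit: 12 accumulated misses
    ASSERT_LOSS_SECS = 120   # from the commit: 2 minutes of complete loss
    CLEAR_OK_SECS = 60       # from the commit: 1 minute of loss-less messaging

    def __init__(self, period_secs=10):
        self.period = period_secs  # monitor period (10 s per the test plan)
        self.missed = 0            # accumulated miss counter
        self.loss_secs = 0         # consecutive seconds without a message
        self.ok_secs = 0           # consecutive seconds with messages
        self.alarmed = False

    def tick(self, received):
        """Process one monitor period; return the resulting alarm state."""
        if received:
            self.loss_secs = 0
            self.ok_secs += self.period
            if self.missed > 0:
                self.missed -= 1   # assumption: decrement by one per good period
            if self.alarmed and (self.missed == 0
                                 or self.ok_secs >= self.CLEAR_OK_SECS):
                self.alarmed = False
                self.missed = 0
        else:
            self.ok_secs = 0
            self.loss_secs += self.period
            self.missed = min(self.missed + 1, self.ASSERT_MISSES)
            if not self.alarmed and (self.missed >= self.ASSERT_MISSES
                                     or self.loss_secs >= self.ASSERT_LOSS_SECS):
                self.alarmed = True
        return self.alarmed
```

With a 10 second period, 12 consecutive misses and 2 minutes of complete loss coincide, so the miss counter only acts as an independent trigger when loss is intermittent.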
Eric MacDonald	14bb67789e	Add pxeboot network mtcAlive messaging to Maintenance The introduction of the new pxeboot network requires that maintenance verify and report on messaging failures over that network. Towards that, this update introduces periodic mtcAlive messaging between the mtcAgent and mtcClient. Test Plan: PASS: Verify install and provision each system type with a mix of networking modes ; ethernet, bond and vlan - AIO SX, AIO DX, AIO DX plus - Standard System 2+1 - Storage System 2+1+1 PASS: Verify feature with physical on management interface PASS: Verify feature with vlan on management interface PASS: Verify feature with bonded management interface PASS: Verify feature with bonded vlans on management interface PASS: Verify in bonded cases handling with 2, 1 or no slaves found PASS: Verify mgmt-combined or separate cluster-host network PASS: Verify mtcClient pxeboot interface address learning - for worker and storage nodes ; dhcp leases file - for controller nodes before unlock ; dhcp leases file - for controller nodes after unlock ; static from ifcfg - from controller within 10 seconds of process restart PASS: Verify mtcAgent pxeboot interface address learning from dnsmasq.hosts file PASS: Verify pxeboot mtcAlive initiation, handling, loss detection and recovery PASS: Verify success and failure handling of all new pxeboot ip address learning functions ; - dhcp - all system node installs. - dnsmasq.hosts - active controller for all hosts. - interfaces.d - controller's mtcClient pxeboot address. - pxeboot req mtcAlive - mtcAgent mtcAlive request message. PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot' command handling for ethernet, vlan and bond configs. PASS: Verify mtcAlive sequence number monitoring, out-of-sequence detection, handling and logging. PASS: Verify pxeboot rx socket binding and non-blocking attribute PASS: Verify mtcAgent handling stress soaking of sustained incoming 500+ msgs/sec ; batch handling and logging. 
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging, failure recovery handling and logging. PASS: Verify pxeboot receiver is not setup on the oam interface on controller-0 first install until after initial config complete. Regression: PASS: Verify mtcAgent/mtcClient online and offline state management PASS: Verify mtcAgent/mtcClient command handling - over management network - over cluster-host network PASS: Verify mtcClient interface chain log for all iface types - bond : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9 - vlan : vlan123 -> enp0s8 - ethernet: enp0s8 PASS: Verify mtcAgent/mtcClient handling and logging including debug logging for standard operations - node install and unlock - node lock and unlock - node reinstall, reboot, reset PASS: Verify graceful recovery handling of heartbeat loss failure. - node reboot - management interface down PASS: Verify systemcontroller and subcloud install with dc-libvirt PASS: Verify no log flooding, coredumps, memory leaks Story: 2010940 Task: 49541 Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-03-28 15:28:27 +00:00
Eric MacDonald	3c94b0e552	Avoid creating non-volatile node locked file while in simplex mode It is possible to lock controller-0 on a DX system before controller-1 has been configured/enabled. Due to the following recent updates this can lead to SM disabling all controller services on that now locked controller-0 thereby preventing any subsequent controller-0 unlock attempts. https://review.opendev.org/c/starlingx/metal/+/907620 https://review.opendev.org/c/starlingx/ha/+/910227 This update modifies the mtce node locked flag file management so that the non-volatile node locked file (/etc/mtc/tmp/.node_locked) is only created on a locked host after controller-1 is installed, provisioned and configured. This prevents SM from shutting down if the administrator locks controller-0 before controller-1 is configured. Test Plan: PASS: Verify AIO DX Install. PASS: Verify Standard System Install. PASS: Verify Swact back and forth. PASS: Verify lock/unlock of controller-0 prior to controller-1 config PASS: Verify the non-volatile node locked flag file is not created while the /etc/platform/simplex file exists on the active controller. PASS: Verify lock and delete of controller-1 puts the system back into simplex mode where the non-volatile node locked flag file is once again not created if controller-0 is then unlocked. PASS: Verify an existing non-volatile node locked flag file is removed if present on a node that is locked without new persist option. PASS: Verify original reported issue is resolved for DX systems. Closes-Bug: 2051578 Change-Id: I40e9dd77aa3e5b0dc03dca3b1d3d73153d8816be Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-03-09 12:45:54 +00:00
Eric MacDonald	d9982a3b7e	Mtce: Create non-volatile backup of node locked flag file The existing /var/run/.node_locked flag file is volatile, meaning it is lost over a host reboot, which has DOR implications. Service Management (SM) sometimes selects and activates services on a locked controller following a DOR (Dead Office Recovery). This update is part one of a two-part update that solves both of the above problems. Part two is a change to SM in the ha git. This update can be merged without part two. This update maintains the existing volatile node locked file because it is looked at by other system services. So to minimize the change and therefore patchback impact, a new non-volatile 'backup' of the existing node locked flag file is created. This update incorporates modifications to the mtcAgent and mtcClient, introducing a new backup file and ensuring their synchronized management to guarantee their simultaneous presence or absence. Note: A design choice was made to not use a symlink of one to the other rather than add support to manage symlinks in the code. This approach was chosen for its simplicity and reliability in directly managing both files. At some point in the future the volatile file could be deprecated contingent upon identifying and updating all services that directly reference it. This update also removes some dead code that was adjacent to my update. Test Plan: This test plan covers the maintenance management of both files to ensure they always align and the expected behavior exists. PASS: Verify AIO DX Install. PASS: Verify Storage System Install. PASS: Verify Swact back and forth. PASS: Verify mtcClient and mtcAgent logging. PASS: Verify node lock/unlock soak. Non-volatile (Nv) node locked management test cases: PASS: Verify Nv node locked file is present when a node is locked. Confirmed on all node types. PASS: Verify any system node install comes up locked with both node locked flag files present. 
PASS: Verify mtcClient logs when a node is locked and unlocked. PASS: Verify Nv node locked file present/absent state mirrors the already existing /var/run/.node_locked flag file. PASS: Verify node locked file is present on controller-0 during ansible run following initial install and removed as part of the self-unlock. PASS: Verify the Nv node locked file is removed over the unlock along with the administrative state change prior to the unlock reboot. PASS: Verify both node locked files are always present or absent together. PASS: Verify node locked file management while the management interface is down. File is still managed over cluster network. PASS: Verify node locked file management while the cluster interface is down. File is still managed over management network. PASS: Verify behavior if the new unlocked message is received by a mtcClient process that does not support it ; unknown command log. PASS: Verify a node locked state is auto corrected while not in a locked/unlocked action change state. ... Manually remove either file on locked node and verify they are both recreated within 5 seconds. ... Manually create either node locked file on unlocked worker or storage node and verify the created files are removed within 5 seconds. Note: doing this to the new backup file on the active controller will cause SM to shutdown as expected. PASS: Verify Nv node locked file is auto created on a node that spontaneously rebooted while it was unlocked. During the reboot the node was administratively locked. The node should come online with both node locked files present. Partial-Bug: 2051578 Change-Id: I0c279b92491e526682d43d78c66f8736934221de Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-02-14 00:54:11 +00:00
Eric MacDonald	191c0aa6a8	Add a wait time between http request retries Maintenance interfaces with sysinv, sm and the vim using http requests. Request timeout's have an implicit delay between retries. However, command failures or outright connection failures don't. This has only become obvious in mtce's communication with the vim where there appears to be a process startup timing change that leads to the 'vim' not being ready to handle commands before mtcAgent startup starts sending them after a platform services group startup by sm. This update adds a 10 second http retry wait as a configuration option to mtc.conf. The mtcAgent loads this value at startup and uses it in a new HTTP__RETRY_WAIT state of http request work FSM. The number of retries remains unchanged. This update is only forcing a minimum wait time between retries, regardless of cause. Failure path testing was done using Fault Insertion Testing (FIT). Test Plan: PASS: Verify the reported issue is resolved by this update. PASS: Verify http retry config value load on process startup. PASS: Verify updated value is used over a process -sighup. PASS: Verify default value if new mtc.conf config value is not found. PASS: Verify http connection failure http retry handling. PASS: Verify http request timeout failure retry handling. PASS: Verify http request operation failure retry handling. Regression: PASS: Build and install ISO - Standard and AIO DX. PASS: Verify http failures do not fail a lock operation. PASS: Verify host unlock fails if its http done queue shows failures. PASS: Verify host swact. PASS: Verify handling of random and persistent http errors involving the need for retries. Closes-Bug: 2047958 Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-02-07 20:33:01 +00:00
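The retry-wait behavior from commit 191c0aa6a8 can be sketched as follows. This is a minimal illustration, not the mtcAgent http work FSM. The function name, its parameters and the injectable `sleep` hook are hypothetical. Only the 10 second wait and the idea of a minimum wait between retries regardless of failure cause come from the commit message.

```python
import time


def request_with_retry(send, max_retries=3, retry_wait_secs=10, sleep=time.sleep):
    """Call send() until it succeeds or the retries are exhausted.

    send            -- hypothetical callable; returns True on success, may raise
    max_retries     -- retries after the first attempt (value is an assumption)
    retry_wait_secs -- minimum wait between retries (10 s per the commit)
    sleep           -- injectable so the wait can be observed in tests
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            if send():
                return True, attempts
        except (ConnectionError, TimeoutError):
            pass  # connection failures and timeouts are retried the same way
        if attempts > max_retries:
            return False, attempts
        # Same wait whether the failure was a timeout, a command failure
        # or an outright connection failure.
        sleep(retry_wait_secs)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.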
Eric Macdonald	50dc29f6c0	Improve maintenance power/reset control command retry handling This update improves on and drives consistency into the maintenance power on/off and reset handling in terms of retries and use of graceful and immediate commands. This update maintains the 10 retries for both power-on and power-off commands and increases the number of retries for the reset command from 5 to 10 to line up with the power operation commands. This update also ensures that the first 5 retries are done with the graceful action command while the last 5 are with the immediate. This update also removed a power on handling case that could have led to a stuck state. This case was virtually impossible to hit based on the required sequence of intermittent command failures but that scenario handling was fixed up anyway. Issues have been seen with the power-off handling on some servers. It is suspected that those servers need more time to power off. So this update introduces a 30 second delay following a power-off command before issuing the power status query to give the server some time to power off before retrying the power-off command. 
Test Plan: Both IPMI and Redfish PASS: Verify power on/off and reset handling support up to 10 retries PASS: Verify graceful command is used for the first power on/off or reset try and the first 5 retries PASS: Verify immediate command is used for the final 5 retries PASS: Verify reset handling with/without retries (none/mid/max) PASS: Verify power-on handling with/without retries (none/mid/max) PASS: Verify power-off handling with/without retries (none/mid/max) PASS: Verify power status command failure handling for power on/off NOTE: FIT (fault insertion testing) was used to create retry scenarios PASS: Verify power-off inter retry delay feature PASS: Verify 30 second power-off to power query delay PASS: Verify redfish power/reset commands used are logged by default PASS: Verify power-off/on and reset logging Regression: PASS: verify power-on/off and reset handling without retries PASS: Verify power-off handling when power is already off PASS: Verify power-on handling when power is already on Closes-Bug: 2031945 Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com> Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36	2024-01-25 22:42:26 +00:00
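The graceful/immediate split from commit 50dc29f6c0 reduces to a small selection rule. The sketch below is illustrative only and the function and constant names are hypothetical. The counts (10 retries, graceful for the first try and first 5 retries, immediate for the final 5) are taken from the commit message and its test plan.

```python
MAX_RETRIES = 10        # power-on, power-off and reset all allow 10 retries
GRACEFUL_RETRIES = 5    # first try plus the first 5 retries use graceful


def command_mode(retry):
    """Return the command flavor for a given retry number (0 = first try)."""
    if not 0 <= retry <= MAX_RETRIES:
        raise ValueError("retry out of range")
    return "graceful" if retry <= GRACEFUL_RETRIES else "immediate"


# One entry per attempt: the first try followed by retries 1 through 10.
schedule = [command_mode(n) for n in range(MAX_RETRIES + 1)]
```

The 30 second power-off to power-query delay is a separate timer and is not modeled here.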
Zuul	125601c2f9	Merge "Failure case handling of LUKS service"	2023-12-14 18:09:46 +00:00
Jagatguru Prasad Mishra	1210ed450a	Failure case handling of LUKS service luks-fs-mgr service creates and unseals the LUKS volume used to store keys/secrets. This change handles the failure case if this essential service is inactive. It introduces an alarm LUKS_ALARM_ID which is raised if service is inactive which implies that there is an issue in creating or unsealing the LUKS volume. Test Plan: PASS: build-pkgs -c -p mtce-common PASS: build-pkgs -c -p mtce PASS: build-image PASS: AIO-SX bootstrap with luks volume status active PASS: AIO-DX bootstrap with volume status active PASS: Standard setup with 2 controllers and 1 compute node with luks volume status active. There should not be any alarm and node status should be unlocked/enabled/available. PASS: AIO-DX node enable failure on the controller where luks volume is inactive. Node availability should be failed. A critical alarm with id 200.016 should be displayed with 'fm alarm-list' PASS: AIO-SX node enable failure on the controller-0. Node availability should be failed. A critical alarm with id 200.016 should be displayed with 'fm alarm-list' PASS: Standard- node enable failure on the node (controller-0, controller-1, storage-0, compute-1). Node availability should be failed. A critical alarm with id 200.016 should be displayed with 'fm alarm-list' for the failed host. PASS: AIO-DX In service volume inactive should be detected and a critical alarm should be raised with ID 200.016. Node availability should be changed to degraded. PASS: AIO-SX In service volume inactive status should be detected and a critical alarm should be raised with ID 200.016. Node availability should be changed to degraded. PASS: Standard ( 2 controller, 1 storage, 1 compute) In service volume inactive status should be detected and a critical alarm should be raised with ID 200.016. Node availability should be changed to degraded. PASS: AIO-DX In service: If volume becomes active and a LUKS alarm is active, alarm should be cleared. 
Node availability should be changed to available. PASS: AIO-SX In service: If volume becomes active and a LUKS alarm is active, alarm should be cleared. Node availability should be changed to available. PASS: Standard ( 2 controller, 1 storage, 1 compute) In service: If volume becomes active and a LUKS alarm is active, alarm should be cleared. Node availability should be changed to available. PASS: AIO-SX, AIO-DX, Standard- If intest fails and node availability is 'failed'. After fixing the volume issue, a lock/unlock should make the node available. Story: 2010872 Task: 49108 Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>	2023-12-06 00:34:02 -05:00
Zuul	1332ebb7a7	Merge "Replace a file test from fsmond"	2023-12-04 14:03:11 +00:00
Teresa Ho	36814db843	Increase timeout for runtime manifest In management network reconfiguration for AIO-SX, the runtime manifest executed during host unlock could take more than five minutes to complete. This commit is to extend the timeout period from five minutes to eight minutes. Test Plan: PASS: AIO-SX subcloud mgmt network reconfiguration Story: 2010722 Task: 49133 Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e Signed-off-by: Teresa Ho <teresa.ho@windriver.com>	2023-11-23 16:51:22 -05:00
Erickson Silva de Oliveira	16181a2ce8	Replace a file test from fsmond fsmond tries to create a test file in "/.fs-test" but it is not possible because "/" is blocked by ostree. So the fix is to replace this path from fsmond monitoring with /sysroot/.fs_test. Below is a comparison of the logs: - Before change: ( 196) fsmon_service : Warn : File (/.fs-test) test failed - After change: ( 201) fsmon_service : Info : tests passed Test Plan: - PASS: Build mtce package - PASS: Replace fsmond binary on AIO-SX - PASS: Check fsmond.log output Closes-Bug: 2043712 Change-Id: Ib4bad73448735bce1dff598151fce86f867f4db7 Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>	2023-11-17 08:15:28 -03:00
Eric MacDonald	79d8644b1e	Add bmc reset delay in the reset progression command handler This update solves two issues involving bmc reset. Issue #1: A race condition can occur if the mtcAgent finds an unlocked-disabled or heartbeat failing node early in its startup sequence, say over a swact or an SM service restart and needs to issue a one-time-reset. If at that point it has not yet established access to the BMC then the one-time-reset request is skipped. Issue #2: When issue #1 race condition does not occur before BMC access is established the mtcAgent will issue its one-time reset to a node. If this occurs as a result of a crashdump then this one-time reset can interrupt the collection of the vmcore crashdump file. This update solves both of these issues by introducing a bmc reset delay following the detection and in the handling of a failed node that 'may' need to be reset to recover from being network isolated. The delay prevents the crashdump from being interrupted and removes the race condition by giving maintenance more time to establish bmc access required to send the reset command. To handle significantly long bmc reset delay values this update cancels the posted 'in waiting' reset if the target recovers online before the delay expires. It is recommended to use a bmc reset delay that is longer than a typical node reboot time. This is so that in the typical case, where there is no crashdump happening, we don't reset the node late in its almost done recovery. The number of seconds till the pending reset countdown is logged periodically. It can take upwards of 2-3 minutes for a crashdump to complete. To avoid the double reboot, in the typical case, the bmc reset delay is set to 5 minutes which is longer than a typical boot time. This means that if the node recovers online before the delay expires then great, the reset wasn't needed and is cancelled. 
However, if the node is truly isolated or the shutdown sequence hangs then although the recovery is delayed a bit to accommodate the crashdump case, the node is still recovered after the bmc reset delay period. This could lead to a double reboot if the node recovery-to-online time is longer than the bmc reset delay. This update implements this change by adding a new 'reset send wait' phase to the existing reset progression command handler. Some consistency driven logging improvements were also implemented. Test Plan: PASS: Verify failed node crashdump is not interrupted by bmc reset. PASS: Verify bmc is accessible after the bmc reset delay. PASS: Verify handling of a node recovery case where the node does not come back before bmc_reset_delay timeout. PASS: Verify posted reset is cancelled if the node goes online before the bmc reset delay and uptime shows less than 5 mins. PASS: Verify reset is not cancelled if node comes back online without reboot before bmc reset delay and still seeing mtcAlive on one or more links. Handles the cluster-host only heartbeat loss case. The node is still rebooted with the bmc reset delay as backup. 
PASS: Verify reset progression command handling, with and without reboot ACKs, with and without bmc PASS: Verify reset delay defaults to 5 minutes PASS: Verify reset delay change over a manual change and sighup PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500 PASS: Verify host-reset when host is already rebooting PASS: Verify host-reboot when host is already rebooting PASS: Verify timing of retries and bmc reset timeout PASS: Verify posted reset throttled log countdown Failure Mode Cases: PASS: Verify recovery handling of failed powered off node PASS: Verify recovery handling of failed node that never comes online PASS: Verify recovery handling when bmc is never accessible PASS: Verify recovery handling cluster-host network heartbeat loss PASS: Verify recovery handling management network heartbeat loss PASS: Verify recovery handling both heartbeat loss PASS: Verify mtcAgent restart handling finding unlocked disabled host Regression: PASS: Verify build and DX system install PASS: Verify lock/unlock (soak 10 loops) PASS: Verify host-reboot PASS: Verify host-reset PASS: Verify host-reinstall PASS: Verify reboot graceful recovery (force and no force) PASS: Verify transient heartbeat failure handling PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks PASS: Verify SM peer reset handling when standby controller is rebooted PASS: Verify logging and issue debug ability Closes-Bug: 2042567 Closes-Bug: 2042571 Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2023-11-02 20:58:00 +00:00
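The posted-reset handling in commit 79d8644b1e can be summarized as a three-way decision. The sketch below is an illustration under stated assumptions, not the 'reset send wait' phase of the actual command handler. The function name, its arguments and the use of a fixed 5 minute uptime check are hypothetical readings of the test plan. Only the 5 minute default delay, the cancel-on-recovery behavior and the no-cancel case for a node that came online without rebooting are taken from the commit message.

```python
BMC_RESET_DELAY_SECS = 300   # default bmc reset delay: 5 minutes
RECENT_REBOOT_SECS = 300     # assumption: "uptime shows less than 5 mins"


def reset_decision(elapsed_secs, online, uptime_secs,
                   delay_secs=BMC_RESET_DELAY_SECS):
    """Decide what to do with a posted ('in waiting') BMC reset.

    Returns "cancel", "reset" or "wait".
    """
    # Node is back online and its uptime shows it really rebooted:
    # the reset is no longer needed.
    if online and uptime_secs < RECENT_REBOOT_SECS:
        return "cancel"
    # Delay expired without a confirmed reboot: send the reset.
    if elapsed_secs >= delay_secs:
        return "reset"
    # Otherwise keep counting down, which leaves a crashdump time to finish.
    return "wait"
```

A node that is online with a long uptime did not reboot, so the posted reset stays pending and is still sent as a backup when the delay expires.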
Enzo Candotti	23143abbca	Update crashDumpMgr to source config from envfile This commit updates the crashDumpMgr service in order to: - Clean up the current service naming and packaging to follow the standard Linux naming convention: - Repackage /etc/init.d/crashDumpMgr to /usr/sbin/crash-dump-manager - Rename crashDumpMgr.service to crash-dump-manager.service - Add EnvironmentFile to crash-dump-manager service file to source configuration from /etc/default/crash-dump-manager. - Update ExecStart of crash-dump-manager service to use parameters from EnvironmentFile - Update crash-dump-manager service dependencies to run after config.service. - Update logrotate configuration to support the retention policies of the maximum files. The “rotate 1” option was removed to permit crash-dump-manager to manage pruning old files. - Modify the crash-dump-manager script to enable updates to the max_files parameter to a lower value. If there are currently more files than the new max_files value, the oldest files will be deleted the next time a crash dump file needs to be stored, thus adhering to the new max_files values. Test Plan: PASS: Build ISO and perform a fresh install. Verify the new crash-dump-manager service is enabled and working as expected. PASS: Add and apply new crashdump service parameters and force a kernel panic. Verify that after the reboot, the max_files, max_used, min_available and max_size values are updated accordingly to the service parameters values. PASS: Verify that the crashdump files are rotated as expected. Story: 2010893 Task: 48910 Change-Id: I4a81fcc6ba456a0d73067b77588ee4a125e44e62 Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>	2023-10-06 23:06:54 +00:00
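The max_files pruning rule from commit 23143abbca (oldest files deleted the next time a crash dump needs to be stored) can be sketched as below. The real crash-dump-manager is a shell script, so this Python function is purely illustrative. The function name, the (name, mtime) input shape and the choice to count the incoming file toward the limit are all assumptions.

```python
def files_to_prune(existing, max_files):
    """Return the names of the oldest files to delete before storing one more.

    existing  -- list of (name, mtime) tuples for the saved crash dump files
    max_files -- configured limit on saved files
    Assumption: the file about to be stored counts toward max_files.
    """
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    by_age = sorted(existing, key=lambda entry: entry[1])  # oldest first
    excess = len(by_age) + 1 - max_files
    return [name for name, _ in by_age[:max(excess, 0)]]
```

Lowering max_files therefore has no immediate effect in this model. The extra files are removed only when the next dump arrives, which matches the behavior the commit describes.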
Enzo Candotti	a120cc5fea	Add new configuration parameters to crashDumpMgr This commit updates crashDumpMgr in order to add three new parameters and enhance the existing one. 1. Maximum Files: Added 'max-files' parameter to specify the maximum number of saved crash dump files. The default value is 4. 2. Maximum Size: Updated the 'max-size' parameter to support the 'unlimited' value. The default value is 5GiB. 3. Maximum Used: Included 'max-used' parameter to limit the maximum storage used by saved crash dump files. It supports 'unlimited' and has a default value of unlimited. 4. Minimum Available: Implemented 'min-available' parameter, enabling the definition of a minimum available storage threshold on the crash dump file system. The value is restricted to a minimum of 1GB and defaults to 10%. These enhancements refine the crash dump management process and offer more control over storage usage and crash dump file retention. Story: 2010893 Task: 48676 Test Plan: 1) max-files parameter: PASS: don't set max-files param. Ensure the default value is used. Create 5 directories inside /var/crash. Each of them contains dmesg.<date> and dump.<date>. run the crashDumpMgr script. Verify: PASS: the vmcore_first.tar.1.gz is created when the first directory is read. PASS: 4 more vmcore_<date>.tar files are created. PASS: There will be 1 vmcore_first.tar.1.gz and 4 vmcore_<date>.tar inside /var/log/crash. PASS: There will be one summary file for each directory: <date>_dmesg.<date> inside /var/crash 2) max-size parameter PASS: don't set max-size param. Ensure the default value is used (5GiB). PASS: Set a fixed max-size param. Create a dump.<date> file greater than the max-size param. Run the crashDumpMgr script. Verify that the crash dump file is not generated and a log message is displayed. 3) max-used parameter: PASS: don't set max-used param. Ensure the default value is used (unlimited). PASS: Set a fixed max-used param. 
Create a dump.<date> file that makes the used space greater than the max-used param. Run the crashDumpMgr script. Verify that the crash dump file is not generated, a log message is displayed and the directory is deleted. 4) min-available parameter: PASS: don't set min-available param. Ensure the default value is used (10% of /var/log/crash). PASS: Set a fixed 'min-available' param. Generate a 'dump.<date>' file to simulate a situation where the remaining space is less than the 'min-available' parameter. Run the crashDumpMgr script and ensure that it does not create the crashdump file, displays a log message, and deletes the entry. 5) PASS: Since the crashDumpMgr.service file is not being modified, verify that the script takes the default values. Note: All tests have also been conducted by generating a kernel panic and ensuring the crashDumpMgr script follows the correct workflow. Change-Id: I8948593469dae01f190fd1ea21da3d0852bd7814 Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>	2023-09-18 19:22:09 +00:00
Eric MacDonald	d863aea172	Increase mtce host offline threshold to handle slow host shutdown Mtce polls/queries the remote host for mtcAlive messages for 42 x 100 ms intervals over unlock or host failed cases. Absence of mtcAlive during this (~5 sec) period indicates the node is offline. However, in the rare case where shutdown is slow, 5 seconds is not long enough. Rare cases have been seen where 7 or 8 second wait time is required to properly declare offline. To avoid the rare transient 200.004 host alarm over an unlock operation, this update increases the mtce host offline window from 5 to 10 seconds (approx) by modifying the mtce configuration file offline threshold from 42 to 90. Test Plan: PASS: Verify unchallenged failed to offline period to be ~10 secs PASS: Verify algorithm restarts if there is mtcAlive received anytime during the polls/queries (challenge) window. PASS: Verify challenge handling leads to a longer but successful offline declaration. PASS: Verify above handling for both unlock and spontaneous failure handling cases. Closes-Bug: 2024249 Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2023-06-16 18:14:08 +00:00
Matheus Guilhermino	a0e270b51b	Add mpath support to wipedisk script The wipedisk script was not able to find the boot device when using multipath disks. This is due to the fact that multipath devices are not listed under /dev/disk/by-path/. To add support to multipath devices, the script should look for the boot device under /dev/disk/by-id/ as well. Test Plan PASS: Successfully run wipedisk on a AIO-SX with multipath PASS: Successfully run wipedisk on a AIO-SX w/o multipath Closes-bug: 2013391 Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com> Change-Id: I3af76cd44f22795784a9184daf75c66fc1b9874f	2023-04-10 17:10:22 -03:00
Al Bailey	37c5910a62	Update mtce debian package ver based on git Update debian package versions to use git commits for: - mtce (old 9, new 30) - mtce-common (old 1, new 9) - mtce-compute (old 3, new 4) - mtce-control (old 7, new 10) - mtce-storage (old 3, new 4) The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Test Plan: PASS: build-pkgs -p mtce PASS: build-pkgs -p mtce-common PASS: build-pkgs -p mtce-compute PASS: build-pkgs -p mtce-control PASS: build-pkgs -p mtce-storage Story: 2010550 Task: 47401 Task: 47402 Task: 47403 Task: 47404 Task: 47405 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I4846804320b0ad3ec10799a468a9ee3bf7973587	2023-03-02 14:50:35 +00:00
Kyale, Eliud	502662a8a7	Cleanup mtcAgent error logging during startup - reduced log level in http util to warning - use inservice test handler to ensure state change notification is sent to vim - reduce retry count from 3 to 1 for add_handler state_change vim notification Test plan: PASS - AIO-SX: ansible controller startup (race condition) PASS - AIO-DX: ansible controller startup PASS - AIO-DX: SWACT PASS - AIO-DX: power off restart PASS - AIO-DX: full ISO install PASS - AIO-DX: Lock Host PASS - AIO-DX: Unlock Host PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller) Story: 2010533 Task: 47338 Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com> Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45	2023-02-14 14:18:02 -05:00
Christopher Souza	56ab793bc5	Change hostwd emergency log to write to /dev/kmsg The hostwd emergency logs were written to /dev/console. This change adds the prefix "hoswd:" to the log message and writes it to /dev/kmsg. Test Plan: Pass: AIO-SX and AIO DX full deployment. Pass: kill pmond and wait for the emergency log to be written. Pass: check if the emergency log was written to /dev/kmsg. Pass: Verify logging for quorum report missing failure. Pass: Verify logging for quorum process failure. Pass: Verify emergency log crash dump logging to mesg and console logging for each of the 2 cases above with stressng overloading the server (CPU, FS and Memory); stress-ng --vm-bytes 4000000000 --vm-keep -m 30 -i 30 -c 30 Story: 2010533 Task: 47216 Co-authored-by: Eric MacDonald <eric.macdonald@windriver.com> Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Co-authored-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com> Signed-off-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com> Change-Id: I0da82f964dd096840259c4d0ed4e5f558debdf22	2023-02-01 23:41:14 +00:00
Eric MacDonald	a3cba57a1f	Adapt Host Watchdog to use kdump-tools The Debian package for kdump changed from kdump to kdump-tools Test Plan: PASS: Verify build and install AIO DX system PASS: Verify host watchdog detects kdump as active in debian Closes-Bug: 2001692 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: Ie1ac29d3d29f3d9c843789cdedf85081fe790616	2023-01-04 12:57:19 -05:00
Robert Church	1796ed8740	Update wipedisk for LVM based rootfs Now that the root filesystem is based on an LVM logical volume, discover the root disk by searching for the boot partition. Changes include: - remove detection of rootfs_part/rootfs and adjust rootfs related references with boot_disk. - run bashate on the script and resolve indentation and syntax related errors. Leave long-line errors alone for improved readability. Test Plan: PASS - run 'wipedisk', answer prompts, and ensure all partitions are cleaned up except for the platform backup partition PASS - run 'wipedisk --include-backup', answer prompts, and ensure all partitions are cleaned up PASS - run 'wipedisk --include-backup --force' and ensure all partitions are cleaned up Change-Id: I036ce745353b6a26bc2615ffc6e3b8955b4dd1ec Closes-Bug: #1998204 Signed-off-by: Robert Church <robert.church@windriver.com>	2022-11-29 05:04:38 -06:00
Eric MacDonald	da398e0c5f	Debian: Make Mtce offline handler more resilient to slow shutdowns The current offline handler assumes the node is offline after 'offline_search_count' reaches 'offline_threshold' count regardless of whether mtcAlive messages were received during the search window. The offline algorithm requires that no mtcAlive messages be seen for the full offline_threshold count. During a slow shutdown the mtcClient runs for longer than it should, which can lead to maintenance seeing the node as recovered before it should. This update manages the offline search counter to ensure that it only reaches the count threshold after seeing no mtcAlive messages for the full search count. Any mtcAlive message seen during the count triggers a count reset. This update also 1. Adjusts the reset retry cadence from 7 to 12 secs to prevent unnecessary reboot thrash during the current shutdown. 2. Clears the hbsClient ready event at the start of the subfunction handler so the heartbeat soak is only started after seeing heartbeat client ready events that follow the main config. Test Plan: PASS: Debian and CentOS Build and DX install PASS: Verify search count management PASS: Verify issue does not occur over lock/unlock soak (100+) - where the same test without update did show issue. PASS: Monitor alive logs for behavioral correctness PASS: Verify recovery reset occurs after expected extended time. Closes-Bug: 1993656 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e	2022-10-24 15:57:43 +00:00
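The counter behaviour described in the commit above can be illustrated with a minimal sketch. This is not the mtcAgent implementation: the class `OfflineSearch` and its `audit` method are hypothetical names, and only `offline_search_count` and `offline_threshold` come from the commit message. It assumes one call per audit interval, with a flag saying whether an mtcAlive message arrived in that interval.

```python
from dataclasses import dataclass


@dataclass
class OfflineSearch:
    """Hypothetical model of the offline search counter (illustration only)."""

    offline_threshold: int
    offline_search_count: int = 0

    def audit(self, mtc_alive_seen: bool) -> bool:
        """Process one audit interval; return True once the node counts as offline."""
        if mtc_alive_seen:
            # Any mtcAlive message during the search restarts the count.
            self.offline_search_count = 0
            return False
        self.offline_search_count += 1
        # Offline only after a full, uninterrupted run of silent intervals.
        return self.offline_search_count >= self.offline_threshold


search = OfflineSearch(offline_threshold=3)
# A late mtcAlive in the third interval (slow shutdown) delays the offline verdict.
results = [search.audit(seen) for seen in (False, False, True, False, False, False)]
```

With a threshold of 3, the node is reported offline only on the sixth interval, because the mtcAlive message in the third interval resets the count.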
Eric MacDonald	3f4c2cbb45	Mtce: Add ActionInfo extension support for reset operations. StarlingX Maintenance supports host power and reset control through both IPMI and Redfish Platform Management protocols when the host's BMC (Board Management Controller) is provisioned. The power and reset action commands for Redfish are learned through HTTP payload annotations at the Systems level; "/redfish/v1/Systems". The existing maintenance implementation only supports the "ResetType@Redfish.AllowableValues" payload property annotation at the #ComputerSystem.Reset Actions property level. However, the Redfish schema also supports an 'ActionInfo' extension at /redfish/v1/Systems/1/ResetActionInfo. This update adds support for the 'ActionInfo' extension for Reset and power control command learning. For more information refer to section 6.3 ActionInfo 1.3.0 of the Redfish Data Model Specification link in the launchpad report. Test Plan: PASS: Verify CentOS build and patch install. PASS: Verify Debian build and ISO install. PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0 PASS: Verify reset/power control cmd load from newly added second level query from ActionInfo service. Failure Handling: Significant failure path testing with this update PASS: Verify Redfish protocol is periodically retried from start when bm_type=redfish fails to connect. PASS: Verify BMC access protocol defaults to IPMI when bm_type=dynamic but failed connect using redfish. Connection failures in the above cases include - redfish bmc root query fails - redfish bmc info query fails - redfish bmc load power/reset control actions fails - missing second level Parameters label list - missing second level AllowableValues label list PASS: Verify sensor monitoring is relearned to ipmi from failed and retried with bm_type=redfish after switch to bm_type=dynamic or bm_type=ipmi by sysinv update command. 
Regression: PASS: Verify with CentOS redfishtool 1.1.0 PASS: Verify switch back and forth between ipmi and redfish using update bm_type=ipmi and bm_type=redfish commands PASS: Verify switch from ipmi to redfish using bm_type=dynamic for hosts that support redfish PASS: Verify redfish protocol is preferred in bm_type=dynamic mode PASS: Verify IPMI sensor monitoring when bm_type=ipmi PASS: Verify IPMI sensor monitoring when bm_type=dynamic and redfish connect fails. PASS: Verify redfish sensor event assert/clear handling with alarm and degrade condition for both IPMI and redfish. PASS: Verify reset/power command learn by single level query. PASS: Verify mtcAgent.log logging Closes-Bug: 1992286 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35	2022-10-13 17:40:05 +00:00
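The single-level and second-level reset command learning described in the commit above can be sketched as follows. This is an illustration, not the mtcAgent code: `learn_reset_types` and the `fetch` callback are hypothetical, and the sample payloads are made up. The property names (`Actions`, `#ComputerSystem.Reset`, `ResetType@Redfish.AllowableValues`, `@Redfish.ActionInfo`, `Parameters`, `AllowableValues`) follow the Redfish schema. The sketch assumes the inline annotation is tried first and the ActionInfo resource is queried only when it is absent.

```python
from typing import Callable, Optional

RESET_ACTION = "#ComputerSystem.Reset"
INLINE_KEY = "ResetType@Redfish.AllowableValues"
ACTION_INFO_KEY = "@Redfish.ActionInfo"


def learn_reset_types(system: dict, fetch: Callable[[str], dict]) -> Optional[list]:
    """Return the allowable ResetType values, or None if they cannot be learned."""
    action = system.get("Actions", {}).get(RESET_ACTION, {})

    # Single-level query: values annotated directly on the Reset action.
    inline = action.get(INLINE_KEY)
    if inline:
        return list(inline)

    # Second-level query: follow the ActionInfo link and read its Parameters.
    uri = action.get(ACTION_INFO_KEY)
    if not uri:
        return None
    info = fetch(uri)
    for param in info.get("Parameters", []):
        if param.get("Name") == "ResetType":
            values = param.get("AllowableValues")
            return list(values) if values else None
    return None  # missing Parameters list or no ResetType entry


# Made-up payloads standing in for BMC responses.
resources = {
    "/redfish/v1/Systems/1/ResetActionInfo": {
        "Parameters": [
            {"Name": "ResetType", "AllowableValues": ["On", "ForceOff", "ForceRestart"]}
        ]
    }
}
system_with_link = {
    "Actions": {
        RESET_ACTION: {ACTION_INFO_KEY: "/redfish/v1/Systems/1/ResetActionInfo"}
    }
}
learned = learn_reset_types(system_with_link, resources.__getitem__)
```

Returning `None` for a missing `Parameters` or `AllowableValues` list mirrors the failure cases listed in the test plan, where the caller would treat the Redfish connection as failed.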
Zuul	8fd1bcbb97	Merge "Alarm Hostname controller function has in-service failure reported"	2022-10-06 20:47:06 +00:00
Al Bailey	dd5a24037d	Fix bashate failure in zuul This review allows this repo to pass zuul. When tox is run locally it pulls in an older bashate 0.6.0 but the zuul jobs are pulling in the higher version. Bashate 2.1.1 was released Oct 6, 2022 Changed the upper constraints to allow developers to pull in dependencies that are more aligned with zuul. Fixed the new bashate error. Also cleaned up the yamllint syntax. Closes-Bug: 1991971 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I9cda349a20c63f9d222a3c3fc3645c5ceb4c2751	2022-10-06 17:22:12 +00:00
Girish Subramanya	86681b7598	Alarm Hostname controller function has in-service failure reported When compute services remain healthy: - listing alarms shall not refer to the below Obsoleted alarm - 200.012 alarm hostname controller function has an in-service failure This update deletes definition of the obsoleted alarm and any references 200.012 is removed in events.yaml file Also updated any reference to this alarm definition. Need to also raise a Bug to track the Doc change. Test Plan: Verify on a Standard configuration no alarms are listed for hostname controller in-service failure Code (removal) changes exercised with fix prior to ansible bootstrap and host-unlock and verify no unexpected alarms Regression: There is no need to test the alarm referred here as they are obsolete Closes-Bug: 1991531 Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com> Change-Id: I255af68155c5392ea42244b931516f742fa838c3	2022-10-05 10:30:01 -04:00
Zuul	6bcd8333b2	Merge "Debian: Remove conf files from etc-pmon.d"	2022-09-30 19:41:16 +00:00
Leonardo Fagundes Luz Serrano	d1c0d04719	Debian: Remove conf files from etc-pmon.d Removed conf files from /etc/pmon.d/ as they are being moved to another location. This is part of an effort to allow pmon conf files to be selected at runtime by kickstarts. The change is debian-only, since centos support will be dropped soon. Centos' pmon conf files remain in /etc/pmon.d/ Test Plan: PASS - deb doesn't install anything to /etc/pmon.d/ PASS - rpm files unchanged PASS - AIOSX unlocked-enabled-available PASS - Standard 2+2 unlocked-enabled-available Story: 2010211 Task: 46306 Depends-On: https://review.opendev.org/c/starlingx/metal/+/855095 Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com> Change-Id: I086db0750df5626d2a8ba1010153ce4f45535ca5	2022-09-26 13:41:40 +00:00
Charles Short	3935abf187	mtcAgent: Run in active mode Run the mtcAgent with active mode by default. This was done because it was being observed that mtcAgent was causing an increased CPU load under Debian. Story: 2009964 Task: 46202 Test-Plan PASS Build playbookconfig package PASS Boot ISO PASS Bootstrap simplex PASS Check for running mtcAgent PASS Install and provision CentOS 2+3 Standard System Signed-off-by: Charles Short <charles.short@windriver.com> Change-Id: If4278ab6e14cd30c995ce5004004fab955ad23eb	2022-09-13 21:38:50 +00:00
Davi Frossard	646192989d	Remove sm-watchdog residues Due to the changes `bd9e560d4b` which removed the sm-watchdog, we also need to remove residues in kickstart config. Story: 2010087 Task: 46007 Signed-off-by: Davi Frossard <dbarrosf@windriver.com> Change-Id: I17911773ec4db1549df32a77acd43cd4615b28ee	2022-09-01 12:35:06 +00:00

273 Commits