This review allows this repo to pass zuul.
When tox is run locally it pulls in an older
bashate 0.6.0 but the zuul jobs are pulling in
the higher version.
Bashate 2.1.1 was released Oct 6, 2022.
Changed the upper constraints to allow developers
to pull in dependencies that are more aligned with zuul.
Fixed the new bashate error.
Also cleaned up the yamllint syntax.
Closes-Bug: 1991971
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I9cda349a20c63f9d222a3c3fc3645c5ceb4c2751
Due to the changes in bd9e560d4b, which removed the sm-watchdog,
we also need to remove its leftovers from the kickstart config.
Story: 2010087
Task: 46007
Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I17911773ec4db1549df32a77acd43cd4615b28ee
This is part of the change to replace nslcd with sssd to
support multiple secure ldap backends.
This change adds a pmon configuration file for sssd so that it
is monitored by pmon.
Test Plan on Debian (SX and DX):
PASS: Package build, image build.
PASS: System deployment.
PASS: After controller is unlocked, sssd is running.
PASS: ldap user creation by ldapadduser and ldapusersetup.
PASS: ldap user login on console.
PASS: ldap user remote login by oam IP address:
ssh <ldapuser>@<controller-oam-ip-address>
PASS: ldap user login by local ldap domain within controllers:
ssh <ldapuser>@controller
PASS: For DX system, same ldap functions still work properly after
swact.
PASS: Kill sssd process, verify that it is brought up by pmon.
Story: 2009834
Task: 46064
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: I701a4cbbda0f900dafd0456aad63132b62d8424f
Modified mtce to address the following
failing services on Debian:
crashDumpMgr.service
fsmon.service
goenabled.service
hostw.service
hwclock.service
mtcClient.service
pmon.service
Applied fix:
- Included modified .service files for Debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.
Test Plan:
PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are behaving as desired
PASS: Services work as expected on CentOS
PASS: Bootstrap and host-unlock successful on CentOS
Story: 2009101
Task: 44323
Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: Ie61cedac24f84baea80cab6a69772f8b2e9e1395
The v5.10 kernel no longer guards the task_state_notify_info data
structure with #ifdef CONFIG_SIGEXIT, which causes a
redefinition-related compilation error. Work around this by checking for
the existence of the PR_DO_NOTIFY_TASK_STATE macro, and only define the
PR_DO_NOTIFY_TASK_STATE and the task_state_notify_info structure if the
kernel does not do so.
Story: 2008921
Task: 42915
Change-Id: I4bb499e2b52e20542f202dea1c2c55d88bb8ba61
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
This update makes the following setting changes to the
maintenance log rotation configuration files:
- add 'create' with permissions to each tuple
- add 'delaycompress'
- group together log files with similar settings
- move global settings to local settings
- remove 'copytruncate' global setting
- remove the 'nodateext' global and local setting
Test Plan:
PASS: Verify log rotation for all mtc log files
PASS: Verify no log loss over rotation
PASS: Verify log rotation file naming convention
PASS: Verify delaycompress on all mtce log files
PASS: Verify log permissions after rotate are 0640
Regression:
PASS: Verify AIO system install
PASS: Verify Standard system install
PASS: Verify full and dated collect
Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
Partial-Bug: 1918979
Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A failure to query process monitor alarms from
FM during process startup can lead to a stuck
failed process alarm.
Rather than hold up the process monitor startup
sequence due to an unresponsive fault manager,
this update introduces an in-service alarm audit
that looks for asserted alarms and compares that
readout to the process monitor's runtime view.
A difference in view is considered a state mismatch
that requires corrective action. The runtime state
of the process monitor always takes precedence over
what is found in the FM database.
A mismatch is declared and corrective action is
taken if:
- FM has a process failure alarm that pmond does not have.
Corrective Action: Clear alarm in FM database
- FM has a process failure alarm with a severity
that differs from the pmond runtime state.
Corrective Action: Update severity in FM database
- FM has a process failure alarm for a process
that pmond does not recognize.
Corrective Action: Clear alarm in FM database
This update only runs the audit on process startup
until first successful query.
A future update may enable the audit in-service.
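The three mismatch cases above can be sketched as follows. This is an
illustrative Python model only, not the pmond implementation: the function
name, the dictionary layout and the "clear" marker are assumptions made
for the example.

```python
from typing import Dict, List, Tuple


def audit_alarms(fm_alarms: Dict[str, str],
                 pmond_state: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return corrective actions as (action, process, severity) tuples.

    fm_alarms:   process name -> severity of the alarm asserted in FM.
    pmond_state: process name -> pmond runtime severity, where the
                 hypothetical value "clear" means "not failed".
    The pmond runtime view always wins over the FM database.
    """
    actions = []
    for process, fm_severity in sorted(fm_alarms.items()):
        runtime = pmond_state.get(process)
        if runtime is None:
            # FM has an alarm for a process pmond does not recognize.
            actions.append(("clear", process, ""))
        elif runtime == "clear":
            # FM has an alarm that pmond does not have.
            actions.append(("clear", process, ""))
        elif runtime != fm_severity:
            # Same alarm, different severity: push pmond's severity to FM.
            actions.append(("update", process, runtime))
    return actions
```

A matching alarm (same process, same severity) produces no action, which
corresponds to the "valid active alarm" case in the test plan.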
Test Plan:
PASS: Verify all mismatch case handling
PASS: Verify handling of valid active alarm
PASS: Verify handling severity mismatch ; unsupported
PASS: Verify pmond failure handling regression soak
PASS: Verify pmond process restart regression soak
PASS: Verify alarm handling over pmond process restart
PASS: Verify alarmed state audit period and logging
PASS: Verify pmond process failure alarm remains ignored by pmond
PASS: Verify handling of persistently failed process over pmond restart
PASS: Verify audit handling while FM is not running
- audit retries every 50 seconds until fm query is successful
COND: Verify audit handling while FM is stopped/blocked/stalled
- alarm query blocks till fm runs again or is killed
- this is the reason the audit is not run in-service.
Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
Closes-Bug: 1892884
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The maintenance process monitor (pmon) should only
recover failed processes when the system state is
'running' or 'degraded'.
The current implementation allowed process recovery
for other non-inservice states, including an unknown
state if systemd returns no data on the state query.
This update tightens up the system state check by
adding retries to the state query utility and
restricting accepted states to 'running' and 'degraded'.
This change prevents pmon from inadvertently killing and
recovering the mtcClient, which would indirectly kill the
mtcClient's fail-safe sysreq reboot child thread, when the
pmon state query returns anything other than 'running' or
'degraded' during a shutdown.
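A minimal sketch of the tightened check, in Python for illustration only.
It assumes the state comes from 'systemctl is-system-running' and that a
retry happens only when the query returns no data. The retry count, the
delay and all function names are hypothetical.

```python
import subprocess
import time
from typing import Callable

# Only these states permit process recovery.
ACCEPTED_STATES = frozenset({"running", "degraded"})


def query_system_state() -> str:
    """Ask systemd for the system state; empty string means no data."""
    try:
        result = subprocess.run(
            ["systemctl", "is-system-running"],
            capture_output=True, text=True, check=False)
    except OSError:
        return ""
    return result.stdout.strip()


def recovery_allowed(query: Callable[[], str] = query_system_state,
                     retries: int = 3, delay_secs: float = 1.0) -> bool:
    """Return True only if the state is 'running' or 'degraded'."""
    for attempt in range(retries):
        state = query()
        if state:
            # Any answer ends the retries; only accepted states pass.
            return state in ACCEPTED_STATES
        if attempt + 1 < retries:
            time.sleep(delay_secs)
    # No data after all retries is treated as "do not recover".
    return False
```

The point of the sketch is that an unknown or empty state now blocks
recovery instead of allowing it.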
Change-Id: I605ae8be06f8f8351a51afce98a4f8bae54a40fd
Closes-Bug: 1883519
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
1. Rename Titanium Cloud to StarlingX for .spec files
2. Rename Titanium Cloud to StarlingX for .service file
Test:
After the de-brand change, bootimage.iso was built in the flock layer
and installed on a dev machine to validate the changes.
Please note that the de-brand changes are being made in batches; this
is batch 1.
Story: 2006387
Task: 36207
Change-Id: Ifa4dc5c7aa3189815e00b796fc833852e88c8fe3
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
The pmon-restart service, through a call to respawn_process,
increments that process's restarts counter but does not clear
that counter after a successful restart.
So, each pmon-restart mistakenly contributes to that process's
failure count. This has the effect of pre-loading that process's
restart counter by one for every pmon-restart of that process.
The effect is best described by example.
Say a process is pmon-restart'ed 4 times during one day, which
increments that process's restart counter to 4. Assuming its
conf file specifies a threshold of 3, it has already exceeded
its threshold. Then, if that process experiences a real failure
even days later, pmon will immediately take the severity action
because the failure threshold had already been exceeded.
This update ensures a process's restart counter is cleared
after a successful pmon-restart operation, in the process pid
registration phase of recovery.
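The example above can be modelled with a short Python sketch. The class
and method names are hypothetical stand-ins for the pmond code paths, and
the ">=" threshold comparison is an assumption.

```python
class MonitoredProcess:
    """Toy model of a monitored process's restart accounting."""

    def __init__(self, name: str, threshold: int) -> None:
        self.name = name
        self.threshold = threshold
        self.restarts = 0

    def respawn(self) -> None:
        # Both real failures and pmon-restart requests pass through here.
        self.restarts += 1

    def register_pid(self, after_pmon_restart: bool) -> None:
        # The fix: a successful pmon-restart clears the counter so that
        # it does not count toward the failure threshold.
        if after_pmon_restart:
            self.restarts = 0

    def threshold_exceeded(self) -> bool:
        return self.restarts >= self.threshold


proc = MonitoredProcess("collectd", threshold=3)
for _ in range(4):
    proc.respawn()
    proc.register_pid(after_pmon_restart=True)
# The counter is back at zero, so a later real failure starts fresh.
```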
Test Plan:
PASS: Verify pmon-restart continues to work.
PASS: Verify proper thresholding of failed process following
many pmon-restart operations.
PEND: Verify pmon-restart and process failure automated test script
against this update. 5 loops, all processes.
Change-Id: Ib01446f2e053846cd30cb0ca0e06d7c987cdf581
Closes-Bug: 1853330
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Each monitored process's config file contains a startuptime
label that specifies how many seconds it takes for that newly
started process to stabilize and produce its pidfile.
The pmon-restart feature needs to delay monitoring a
newly restarted process for 'startuptime' seconds.
Failing to do so can cause it to fail the restarted
process too early if there is a pidfile creation delay.
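As a rough illustration of the delay, in Python and with hypothetical
names, monitoring of a restarted process resumes only once 'startuptime'
seconds have elapsed since the restart:

```python
import time
from typing import Optional


def monitoring_due(restart_time: float, startuptime_secs: float,
                   now: Optional[float] = None) -> bool:
    """Return True once the startuptime window after a restart is over.

    restart_time and now are monotonic clock readings in seconds;
    now defaults to the current monotonic time.
    """
    if now is None:
        now = time.monotonic()
    return (now - restart_time) >= startuptime_secs
```

Until this returns True, a missing pidfile would not be treated as a
process failure.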
Test Plan:
PASS: Verify collectd pmon-restart function with soak ;
> 5000+ collectd pmon-restarts.
PASS: Verify pmond regression test suite (test-pmon-action.sh)
> restart command ; graceful restart all monitored processes. (5 loops)
> kill command ; kill and recover all monitored processes. (5 loops)
Regression:
PASS: Verify pmon-stop command/function
PASS: Verify pmon-start command/function also honors the startuptime.
PASS: Verify pmon-stop auto start after auto-start timeout
PASS: Verify System Install
PASS: Verify Patching (soak)
Change-Id: I9fd7bba8e49fe4c28281539ab4930bdac370ef11
Closes-Bug: #1844724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The following warnings are being addressed.
- hbsUtil.cpp: The correct value for MAX_ENTRY_STR_LEN should be 13,
considering that values can be higher than 9999, plus the final space
and the leading '\n'.
- hwmonAlarm.cpp: After a discussion, Eric MacDonald suggested
removing the entire case for FM_ALARM_STATE_CLEAR, as the reason
buffer is not used and thus there's no need to store a string there.
- pmonHdlr.cpp: A truncation warning was shown due to possible use of an
uninitialized buffer. The fix here is to check for NULL.
Change-Id: I3c80cce99b2f521f8c7a9de4ce2b6036960dfaf6
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
The disable_worker_services file was originally created
to prevent the (bare metal) nova-compute services from
running on a newly upgraded controller in an AIO-DX
configuration. This situation no longer exists because
the bare metal nova-compute services do not exist after
transitioning to containers. This flag is no longer needed.
Removing all references to the disable_worker_services file.
Change-Id: I20e08db737bb0df6ba34c071e2435f1a18f7c3ed
Partial-Bug: #1838432
Signed-off-by: marvin <weifei.yu@intel.com>
The LSB headers are required by the openSUSE build system as one of its
quality checks. This patch adds the header and missing fields. A $null
value was set for all unchanged fields.
Change-Id: I22ee3571b70b22e1fe8238c2a94b3f4d099d41cd
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
Add a free call after malloc in the Pmon alarm module.
Change-Id: I2ca948f28959e1b99f19777410b2f906b27a3e2e
Closes-Bug: #1836130
Signed-off-by: YeHuiSheng <hsye@fiberhome.com>
This also changes the group wrs_protected to sys_protected
to de-brand the user and group names.
Depends-On: I887464a20fc17d66529caea03be2b445156f9426
Change-Id: Icfd2faec0ba8236762c8045f5c244eaf13008ee4
Story: 2004716
Task: 28749
Signed-off-by: Saul Wold <sgw@linux.intel.com>
Maintenance no longer has any plan to interface with
ceilometer so this update removes all such references.
In addition it removes 3 obsoleted files that also make
reference to ceilometer.
Change-Id: Iae0738946ff241acde44720024d25f8c38f65433
Story: 2004764
Task: 30666
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
All rmon resource monitoring has been moved to collectd.
This update removes rmon from mtce and the load.
Story: 2002823
Task: 30045
Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d
Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
libc6 renamed siginfo.h to siginfo-const.h sometime between
2.23 (in Xenial) and 2.27 (in bionic).
This builds on bionic and centos7 and in fact is required to
get DevStack to complete on bionic.
This is last in the stack since it has not been tested
beyond the compile/install that DevStack does. There
may be a better/alternate solution...but with this we should
get a passing DevStack job.
Change-Id: I5a2ed9455b05e604731c3775d0f402c6137da2ef
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
This allows DevStack plugins to add their configured STX_INST_DIR
to the linker search path.
Change-Id: I277204cd89767b93eec6c96969fc33d23e04516b
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
This update replaces compute references with worker in mtce,
kickstarts, installer and bsp files.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked-enabled with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
The maintenance process monitor is failing the hbsClient
process over config or process reload operations.
The issue relates to the hbsClient's subfunction being
'last-config' without pmon properly gating the active
monitoring FSM from starting until the passive monitoring
phase is complete and in the MANAGE state.
Test Plan
PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure
PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.
Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A number of Makefiles use '[[' in their test to set
STATIC_ANALYSIS_TOOL_EXISTS. Set SHELL=/bin/bash
Change-Id: Ie9536d7cafd518f3e65acf38ac5b30aa7536ea79
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
The commit shown below introduced a main loop audit that
mistakenly registers subfunction processes that are in the
'polling' state, waiting for /var/run/.compute_config_complete,
during unlock enable.
Doing so inadvertently changes the monitor FSM stage
from 'Poll' to 'Manage' before configuration is complete.
Since config is not complete, the hbsClient has not initialized
its socket interface and is unable to service active monitoring
requests. This leads to quorum failure and watchdog reboot.
commit 537935bb0c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date: Mon Jul 9 08:36:22 2018 -0400
Reorder process restart operations to prevent pmond futex deadlock
The Fix: Don't run the audit for processes that are still
waiting in the 'polling' state.
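The fix can be sketched as a guard at the top of the audit loop. This is
a simplified Python model, not the pmond source. Only the 'Poll' and
'Manage' stage names come from this message; the function name and the
callback are hypothetical.

```python
from typing import Callable, Dict, List


def run_audit(stages: Dict[str, str],
              audit_one: Callable[[str], None]) -> List[str]:
    """Audit every process except those still in the 'Poll' stage.

    stages maps process name -> monitor FSM stage name.
    Returns the names that were audited, in sorted order.
    """
    audited = []
    for name in sorted(stages):
        if stages[name] == "Poll":
            # Still waiting for config complete; leave the FSM alone.
            continue
        audit_one(name)
        audited.append(name)
    return audited
```

Skipping the process leaves its FSM in 'Poll' until configuration
completes.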
Test Plan:
Provision AIO , verify no quorum failure and inspect logs for
correct behavior.
Change-Id: I179c78309517a34285783ee99bbb3d699915cb83
Closes-Bug: 1804318
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server and guest-agent
from mtce by splitting the guest component into the stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>