Baseline changes to comply with Release Notes Management
based on Reno [0], a release notes manager.
[0] https://docs.openstack.org/reno/latest/
Story: 2003101
Task: 25744
Change-Id: Ib52641346d5a788df53a2bab97c98f2e1de0b170
Signed-off-by: Abraham Arce <abraham.arce.moreno@intel.com>
Fix the linter issues below:
E001 Trailing Whitespace
E003 Indent not multiple of 4
E006 Line too long
E011 Then keyword is not on same line as if or elif keyword
E020 Function declaration not in format ^function name {$
E040 Syntax error: syntax error near unexpected token `;'
Ignore cases added in the tox setup:
E006 Line too long
E010 The "do" should be on same line as for
Story: 2003368
Task: 24427
Change-Id: I6acf64271a4e608be8bc8fa965cac4fa31e0c05b
Signed-off-by: Sun Austin <austin.sun@intel.com>
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all those hosts until heartbeat resumes
or after a set period of time.
This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the hosts threshold set to 3 and timeout set to 100 secs.
This update implements enhancements to that existing feature by
making the 'number-of-hosts threshold' and 'timeout period'
customer configurable service parameters.
The new service parameters are listed under platform:maintenance and
can be displayed with the following command:
> system service-parameter-list
mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that must
fail heartbeat as a group, within the heartbeat failure window
(heartbeat_failure_threshold), before maintenance activates MNFA Mode.
This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.
mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drops below the mnfa_threshold or this
timer expires. MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout making the
deactivation criteria solely host based.
This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid timeout range from 100 to 86400 secs (1 day).
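The parameter ranges and the deactivation rule described above can be
sketched as follows. This is a minimal illustration only; the function
and constant names are hypothetical and are not taken from the mtce
source.

```python
# Hypothetical sketch of the MNFA service parameter rules in this message.
MNFA_THRESHOLD_RANGE = (2, 100)    # number of hosts
MNFA_TIMEOUT_RANGE = (100, 86400)  # seconds; 0 is also valid (no timeout)


def valid_mnfa_threshold(hosts):
    """True if the host count is inside the configurable range."""
    low, high = MNFA_THRESHOLD_RANGE
    return low <= hosts <= high


def valid_mnfa_timeout(secs):
    """True for 0 (no timeout) or a value inside the configurable range."""
    low, high = MNFA_TIMEOUT_RANGE
    return secs == 0 or low <= secs <= high


def mnfa_should_deactivate(failing_hosts, active_secs, threshold=2, timeout=0):
    """MNFA mode ends on whichever comes first: the number of failing
    hosts drops below the threshold, or the timer (if any) expires."""
    if failing_hosts < threshold:
        return True
    return timeout != 0 and active_secs >= timeout
```

With the new defaults (threshold 2, timeout 0) the second condition never
fires, so deactivation depends only on the number of failing hosts.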
Test Plan:
PASS - Verify duplex and 4 compute DOR
PASS - Verify default MNFA - 1 inactive controller and 4 computes
PASS - Verify default MNFA - 4 computes
PASS - Verify default MNFA - 1 active controller and 3 computes and failed host
PASS - Verify Single host heartbeat failure handling - fail host
PASS - Verify Multi Node failure below mnfa_threshold - fail hosts
PASS - Verify MNFA handling with timeout of zero and threshold of 3
PASS - Verify MNFA timeout handling with timeout set at 100 sec
PASS - Verify MNFA service parameter listing, default value and mtc.ini
PASS - Verify MNFA service parameter change and inservice apply
PASS - Verify MNFA timeout service parameter change from value to 0
PASS - Verify MNFA timeout service parameter change from 0 to in-range value
PASS - Verify MNFA service parameter out of range change handling
PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active)
DocImpact
Story: 2003576
Task: 24903
Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This is an enhancement request to add the screen package to
controller nodes
This specific modification prevents the screen package from being installed
on other nodes (compute or storage)
The screen package is added in another commit
(see https://review.openstack.org/#/c/595249/)
Story: 2003061
Task: 23100
Depends-on: https://review.openstack.org/#/c/595249/
Change-Id: I355d517ba0d0392d40fe78991798ddf6e5d16fde
Signed-off-by: Paul-Emile Element <Paul-Emile.Element@windriver.com>
The low-capacity Swift solution this Story is implementing is on
controllers only.
Story: 2003518
Task: 24811
Depends-On: https://review.openstack.org/595330
Change-Id: I7bb98195bbda2a97f004329f024701475f139d53
Signed-off-by: Jack Ding <jack.ding@windriver.com>
All compute hosts seen to self reboot by hostw during patching due to
stuck pmond process
The current method to kill the running process leads to a race condition
that results in a user space futex deadlock that hangs pmond and
results in a watchdog self-reset due to quorum master 'pmond' failure.
The deadlock was traced to the ordering of the kill process.
Current steps to kill:
- kill process
- remove pidfile
- unregister pid with kernel
The deadlock is avoided by reversing the kill steps into a more
logical order:
- unregister pid with kernel
- remove pidfile
- kill process
Also introduced an audit that registers manually restarted processes
with the kernel.
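The reordered kill sequence can be sketched as below. The three
callables are hypothetical stand-ins for the real pmond operations; the
only point shown is the ordering.

```python
def kill_monitored_process(proc, unregister, remove_pidfile, kill):
    """Tear down a monitored process using the reordered steps.

    The callables are placeholders supplied by the caller; this
    sketch does not model the kernel registration itself.
    """
    unregister(proc)      # 1. unregister pid with kernel
    remove_pidfile(proc)  # 2. remove pidfile
    kill(proc)            # 3. kill process last
```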
Failure Rate Before Fix: 1 every 25 process restarts.
Mostly fails before 5.
Failure Rate After Fix: No failures after 15000 process restarts
across 8 hosts, including all host types, between 2 different labs and
2 different loads (18.07 and 18.08).
Test Method: Pmon restart regression test restarts all processes on
a host. Total soak restart of 25 monitored processes for 50 loops
over 12 hosts = 15000 restarts.
Also regressed process kill / recovery handling.
(5000 process recoveries)
Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007
Two Node System: VMs did not switch to ERROR state after host reboot
A logically failed (rebooted) active controller is not being
administratively failed by maintenance. As a result the host's
offline availability state is not reported to the VIM and the
VMs on that (rebooted) All-in-one host are not evacuated.
This issue only applies to two node systems because of how the heartbeat
enable of an All-in-one host needs to be held off until its compute
manifests apply in the DOR case so as to avoid maintenance failing the
peer controller over a DOR.
The challenge in maintenance is to distinguish between this spontaneous
failure and a DOR. For All-in-one hosts, DOR mode is active for a
whopping 600 seconds, long enough to account for both sets of manifests
to apply.
It's that long delay that is making this silent fault stand out so
obviously.
This update uses 'active DOR mode' to decide whether or not to enable a
host's heartbeat in the add handler.
To better handle early active controller failure, the qualifier for DOR
mode was reduced from 20 to 15 minutes. Maintenance DOR mode is now
activated if the host uptime is less than 15 minutes, rather than 20 as
it was before this update. Note that normally the active controller
starts maintenance with an uptime of 5-7 minutes.
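A minimal sketch of the uptime qualifier follows. The names are
hypothetical, and the direction of the add handler decision (heartbeat
held off while DOR mode is active, enabled otherwise) is an assumption
drawn from the description above rather than from the code.

```python
DOR_MODE_UPTIME_LIMIT_SECS = 15 * 60  # reduced from 20 * 60 by this update


def dor_mode_active(host_uptime_secs):
    """DOR mode qualifies when host uptime is under the limit."""
    return host_uptime_secs < DOR_MODE_UPTIME_LIMIT_SECS


def enable_heartbeat_in_add_handler(host_uptime_secs):
    # Assumption: heartbeat enable is held off during active DOR mode
    # and performed immediately otherwise.
    return not dor_mode_active(host_uptime_secs)
```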
Story: 2002995
Task: 23009
Change-Id: I749aefef45b9db6e86a2c6b81d131ebeccc68926
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
The mtcAgent process has been seen to segfault and coredump on process
exit.
The exit code is iterating over a C++ list that can change due to http
interrupt response handling.
The dump code is commented out with a note indicating why and when it
could be re-enabled.
Change-Id: Ie4ef684a65ded533c347ae07fdfa47f332412f7d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002994
Task: 23008
When the Hardware Monitor starts up it reads existing alarms and sensor
state from the sysinv database. It then uses this pre-existing state to
align its internal structure accordingly moving forward.
The hardware monitor manage_startup_states utility is incorrectly
requesting degrade clear rather than degrade set in response to finding
a pre-existing critical sensor assertion on process startup.
This update fixes this issue by calling set_degraded_state rather
than clear_degraded_state against the sensor in this case.
Change-Id: Ic1ecc1f11d7a729c16da63c6d43b7d758bb9e467
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002882
Task: 22845
Filter out the fm client and fm rest api packages from
compute and storage nodes
Story: 2002828
Task: 22747
Depends-On: https://review.openstack.org/#/c/591452/
Change-Id: If0663dfb2cc1b557a1b9439c64d3ccb36bd66503
Signed-off-by: Tao Liu <tao.liu@windriver.com>
Currently compiling a new package and adding it
to the iso still requires a multi-git update because
image.inc is a single centralized file in the root git.
It would be better to allow a single git update to add
a package. To allow this, image.inc must be split across
the git repos and the build tools must be changed to
read/merge those files to arrive at the final package list.
Current scheme is to name the image.inc files using this
schema.
${distro}_${build_target}_image_${build_type}.inc
distro = centos, ...
build_target = iso, guest ...
build_type = std, rt ...
Traditionally build_type=std is omitted from config files,
so we instead use ${distro}_${build_target}_image.inc.
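The naming schema and the merge step can be sketched as follows. The
function names are hypothetical and this is not the build tools'
implementation; dropping duplicate package names during the merge is
also an assumption.

```python
def image_inc_name(distro, build_target, build_type="std"):
    """Build the image.inc file name; 'std' is omitted from the name."""
    if build_type == "std":
        return f"{distro}_{build_target}_image.inc"
    return f"{distro}_{build_target}_image_{build_type}.inc"


def merge_package_lists(lists):
    """Merge per-repo package lists into one final list.

    Assumption: blank entries are skipped and duplicates are dropped,
    keeping the first-seen order.
    """
    merged = []
    seen = set()
    for packages in lists:
        for name in packages:
            name = name.strip()
            if name and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged
```

For example, image_inc_name("centos", "iso") gives
centos_iso_image.inc, while image_inc_name("centos", "iso", "rt") gives
centos_iso_image_rt.inc.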
Change-Id: I9ef0304ff286be15d95f7ce944ee4ccf9bacc439
Story: 2003447
Task: 24649
Depends-On: Ib39b8063e7759842ba15330c68503bfe2dea6e20
Signed-off-by: Scott Little <scott.little@windriver.com>
As a part of changes to make sm-api independent, calling sm-api
requires keystone authentication.
This change is to enable mtce to call sm rest api with keystone
authentication.
Story: 2002827
Task: 22744
Change-Id: If3b58d3e36b9bd7fd88829d61e9c1daa00ab5048
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Introduction of PTP service requires NTP service to be disabled.
Process monitoring of NTP daemon must be turned off as well.
There is no way to start/stop process monitoring from MTCE.
Puppet can check NTP status at startup and enable/disable monitoring.
Therefore the NTP-related PMON script needs to move from MTCE to Puppet.
This is the first step: removing NTP references from MTCE.
Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01
Story: 2002935
Task: 24520
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
Updating kickstart to provision 5G for new gnocchi filesystem in
cgcs disk partition.
Story: 2002825
Task: 24240
Change-Id: Ie6182a636e6b9c580af2cce671dcbb267acb305f
Signed-off-by: Angie Wang <angie.wang@windriver.com>
If the first mtcAlive message from a host that was supposed to be
rebooted reports uptime in excess of 40 minutes then that means it did
not reboot as expected.
This was seen to happen during an extended offline case where the host
failed heartbeat, then was reported offline during Graceful Recovery
which forced a full enable. When the host eventually came back online
its reported uptime made it clear that it never rebooted but mtce
allowed it to come into service anyway.
This is a security issue that can lead to a host disappearing, being
security hacked and brought back into the system without reboot.
To fix that, this update requires that a host's uptime, reported in its
first mtcAlive message, indicate that it has been up for less than twice
the configured mtcAlive timeout; otherwise the enable fails until the
host is proven to have reset.
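The uptime check can be sketched as below. The function name is
hypothetical; the comparison is the one stated above.

```python
def rebooted_as_expected(reported_uptime_secs, mtcalive_timeout_secs):
    """True if the uptime in the first mtcAlive message is less than
    twice the configured mtcAlive timeout."""
    return reported_uptime_secs < 2 * mtcalive_timeout_secs
```

A host reporting an uptime at or beyond that limit is treated as not
having rebooted, and its enable is failed.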
Story: 2002882
Task: 22845
Change-Id: I9b3ff0bc1ba5af2ca5b07a58db9da9f288b59576
Signed-off-by: Jack Ding <jack.ding@windriver.com>
On a Heartbeat Failure of a host, the Mtce Heartbeat Agent
declares heartbeat recovery upon the first successful heartbeat
reply after the loss is declared; in other words, edge triggered
recovery.
In cases where a networking issue causes heartbeat loss of a group of
hosts, Maintenance tracks the group of hosts that experienced heartbeat
loss and puts the system into 'Multi Node Failure Avoidance' mode.
Maintenance then simply waits up to a configured timeout period for
hosts to regain heartbeat.
As heartbeat is regained for each host, Maintenance attempts to
'Gracefully Recover' that host.
However, if the networking issue persists in a way that the occasional
transient heartbeat pulse gets through, then the maintenance system can
prematurely take hosts, and then 'the system', out of MNFA mode. It then
finds that heartbeat is not actually recovered, and fails and force
reboots/resets each node that is still experiencing heartbeat loss.
This update changes the heartbeat service from 'edge' to 'level'
sensitive recovery by requiring a number of back-to-back heartbeat pulses
following a failure before that host is declared as recovered and pulled
out of the MNFA pool.
This update makes the system's MNFA recovery algorithm more
robust in the face of transient heartbeat loss for a group of hosts.
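The difference between edge and level sensitive recovery can be
sketched as a consecutive pulse counter. The class name and the
required pulse count of 3 are hypothetical; the message above does not
state the actual number.

```python
class HeartbeatRecovery:
    """Declare recovery only after N back-to-back heartbeat pulses."""

    def __init__(self, required_pulses=3):  # hypothetical value
        self.required_pulses = required_pulses
        self.consecutive = 0

    def on_heartbeat_period(self, pulse_received):
        """Record one heartbeat period; return True once recovered."""
        if pulse_received:
            self.consecutive += 1
        else:
            # Any miss restarts the count, so a lone transient pulse
            # can no longer pull the host out of the MNFA pool.
            self.consecutive = 0
        return self.consecutive >= self.required_pulses
```

With required_pulses set to 1 this reduces to the previous edge
triggered behavior.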
Story: 2002882
Task: 22845
Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>
To improve kubernetes support, update kernel to CentOS 7.5 version
and enable user namespaces in kernel bootargs.
Depends-On: https://review.openstack.org/580689
Change-Id: I4d8620ea17a19a764c6627cd79eb548c79c56bfd
Signed-off-by: Jason McKenna <jason.mckenna@windriver.com>
Story: 2002761
Task: 22841
Adds support to the mtcAgent for detecting the absence of the 'host
services execution enhancement feature' in the mtcClient and falls back
to the pre-upgrade behavior in that case. When mtcAgent tries to lock
a storage node running the pre-upgrade version it will implement a 90s
lock wait before proceeding to declare that storage host as
locked-disabled.
Story: 2002886
Task: 22847
Change-Id: I99fb5576e027621019adb5eff553d52773f608db
Signed-off-by: Jack Ding <jack.ding@windriver.com>
Implements customer configuration of the spectre/meltdown related
kernel options. The default (with "nopti nospectre_v2" options) can be
changed to "" using
system modify -S spectre_meltdown_all
Change-Id: I183a22fa681e6524415558c0009aa8786418cc07
Signed-off-by: Jack Ding <jack.ding@windriver.com>
This update adds Maintenance support for receiving host degrade assert
and clear messages from collectd.
This update also disables platform memory, cpu and file system resource
monitoring in the maintenance resource monitor process rmon.
These disabled resources are now monitored by collectd and therefore
should not be monitored by rmond any longer.
Change-Id: I13fd033bb1d14f299dcb97fa80296641c958d0a9
Signed-off-by: Jack Ding <jack.ding@windriver.com>