192 Commits

Author SHA1 Message Date
Abraham Arce
61abc3aafc [Doc] Release Notes Management
Baseline changes to comply with Release Notes Management
based on Reno [0], a release notes manager.

[0] https://docs.openstack.org/reno/latest/

Story: 2003101
Task: 25744

Change-Id: Ib52641346d5a788df53a2bab97c98f2e1de0b170
Signed-off-by: Abraham Arce <abraham.arce.moreno@intel.com>
2018-09-05 19:59:26 -05:00
Abraham Arce
451bad46e9 [Doc] Building docs following Docs Contrib Guide
Baseline changes to comply with OpenStack Documentation
Contributor Guide [0] starting with the following sections:

- Project guide setup
  - [1] sphinx-quickstart
  - [2] doc/source/ layout
- Building documentation
  - [3] tox -e docs
- Using documentation tools
  - [4] openstackdocstheme

[0] https://docs.openstack.org/doc-contrib-guide
[1] http://www.sphinx-doc.org/en/master/usage/quickstart.html
[2] https://docs.openstack.org/doc-contrib-guide/project-guides.html
[3] https://docs.openstack.org/doc-contrib-guide/docs-builds.html
[4] https://docs.openstack.org/openstackdocstheme/

Story: 2002708
Task: 24449

Story: 2002813
Task: 24450

Change-Id: I961c7c90c51248926d11b2a2a89c0231f58f7fd0
Signed-off-by: Abraham Arce <abraham.arce.moreno@intel.com>
2018-09-05 19:59:26 -05:00
Sun Austin
fedb95ba79 Fix linters issues and enable tox/zuul linters job as gate
Fix the following linter issues:
 E001 Trailing Whitespace
 E003 Indent not multiple of 4
 E006 Line too long
 E011 Then keyword is not on same line as if or elif keyword
 E020 Function declaration not in format ^function name {$
 E040 Syntax error: syntax error near unexpected token `;'

Ignore cases are added in the tox setup:
 E006 Line too long
 E010 'do' not on the same line as 'for'

Story: 2003368
Task: 24427

Change-Id: I6acf64271a4e608be8bc8fa965cac4fa31e0c05b
Signed-off-by: Sun Austin <austin.sun@intel.com>
2018-09-05 09:02:25 +08:00
Eric MacDonald
82e851d651 Mtce: Make Multi-Node Failure Avoidance Configurable
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all those hosts until heartbeat resumes
or after a set period of time.

This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the hosts threshold set to 3 and timeout set to 100 secs.

This update implements enhancements to that existing feature by
making the 'number-of-hosts threshold' and 'timeout period'
customer configurable service parameters.

The new service parameters are listed under platform:maintenance and
are displayed with the following command:

> system service-parameter-list

mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that must
fail heartbeat as a group, within the heartbeat failure window
(heartbeat_failure_threshold), before maintenance activates MNFA Mode.

This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.

mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drops below the mnfa_threshold or this
timer expires. The MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout, making the
deactivation criteria solely host based.

This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid timeout range from 100 to 86400 secs (1 day).
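
The parameter rules described above can be sketched in a few lines of
Python. This is only an illustration of the stated defaults, ranges and
deactivation criteria; the function names are hypothetical and do not
come from the maintenance code itself.

```python
# Defaults and ranges as stated in this commit message.
MNFA_THRESHOLD_DEFAULT = 2
MNFA_THRESHOLD_RANGE = (2, 100)
MNFA_TIMEOUT_DEFAULT = 0            # 0 means no timeout
MNFA_TIMEOUT_RANGE = (100, 86400)   # seconds; 86400 secs is 1 day


def mnfa_params_valid(threshold, timeout):
    """Return True if both values are within the ranges described above."""
    low, high = MNFA_THRESHOLD_RANGE
    if not low <= threshold <= high:
        return False
    if timeout == 0:
        return True  # zero is accepted and means "no timeout"
    low, high = MNFA_TIMEOUT_RANGE
    return low <= timeout <= high


def mnfa_should_activate(failing_hosts, threshold=MNFA_THRESHOLD_DEFAULT):
    """MNFA mode activates once the failing group reaches the threshold."""
    return failing_hosts >= threshold


def mnfa_should_deactivate(failing_hosts, elapsed_secs,
                           threshold=MNFA_THRESHOLD_DEFAULT,
                           timeout=MNFA_TIMEOUT_DEFAULT):
    """Deactivate on the first of: too few failing hosts, or timer expiry."""
    if failing_hosts < threshold:
        return True
    # With timeout == 0 the criteria are solely host based.
    return timeout != 0 and elapsed_secs >= timeout
```

With the new defaults, a group of two failing hosts activates MNFA mode
and only a drop below two failing hosts deactivates it.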

Test Plan:

PASS - Verify duplex and 4 compute DOR
PASS - Verify default MNFA - 1 inactive controller and 4 computes
PASS - Verify default MNFA - 4 computes
PASS - Verify default MNFA - 1 active controller and 3 computes and failed host
PASS - Verify Single host heartbeat failure handling - fail host
PASS - Verify Multi Node failure below mnfa_threshold - fail hosts
PASS - Verify MNFA handling with timeout of zero and threshold of 3
PASS - Verify MNFA timeout handling with timeout set at 100 sec
PASS - Verify MNFA service parameter listing, default value and mtc.ini
PASS - Verify MNFA service parameter change and inservice apply
PASS - Verify MNFA timeout service parameter change from value to 0
PASS - Verify MNFA timeout service parameter change from 0 to in-range value
PASS - Verify MNFA service parameter out of range change handling
PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active)

DocImpact
Story: 2003576
Task: 24903

Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-08-31 15:35:08 -04:00
hazelnutsgz
482d1acea8 Fix the print syntax inconsistency between python2 and python3
Use an automation tool and manual checks to fix the print syntax.
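
As a generic illustration of the kind of change involved (not taken from
this commit's diff), the function-call form of print is the one that
Python 3 accepts. The helper name below is hypothetical.

```python
import sys

# Python 2 statement form, a syntax error on Python 3:
#     print "status:", message
#
# On Python 2 the form below behaves like Python 3 only when the module
# starts with `from __future__ import print_function`.


def report(message, stream=None):
    """Write a status line using the function-call form of print."""
    if stream is None:
        stream = sys.stdout
    print("status:", message, file=stream)


report("ok")
```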
Task: 24595
Story: 2003426

Change-Id: I3844c9644aabeeeb27bc2abb106c839b9921fe78
2018-08-29 16:09:27 +08:00
Zuul
c3d9e4e689 Merge "Add linux screen package to controllers" 2018-08-24 17:48:38 +00:00
Zuul
5581422b0e Merge "Exclude openstack-swift pkgs from compute/storage" 2018-08-23 19:31:02 +00:00
Zuul
0c9a69ccd1 Merge "Mtce: mtcAgent sometimes coredumps on process exit" 2018-08-23 13:36:33 +00:00
Zuul
4d43887256 Merge "Maintain sensor degrade over a process restart" 2018-08-23 13:25:16 +00:00
Zuul
8290718f81 Merge "Reorder process restart operations to prevent pmond futex deadlock" 2018-08-23 13:24:40 +00:00
Zuul
6a1999a371 Merge "Enable host heartbeat in add handler when not in DOR mode" 2018-08-23 13:21:20 +00:00
Paul-Emile Element
1d9f594147 Add linux screen package to controllers
This is an enhancement request to add the screen package to
controller nodes

This specific modification prevents the screen package from being installed
on other nodes (compute or storage)
The screen package is added in another commit
(see https://review.openstack.org/#/c/595249/)

Story: 2003061
Task: 23100

Depends-on: https://review.openstack.org/#/c/595249/
Change-Id: I355d517ba0d0392d40fe78991798ddf6e5d16fde
Signed-off-by: Paul-Emile Element <Paul-Emile.Element@windriver.com>
2018-08-22 17:41:27 -04:00
Jack Ding
ae26bbdca3 Exclude openstack-swift pkgs from compute/storage
The low-capacity Swift solution this Story is implementing is on
controllers only.

Story: 2003518
Task: 24811
Depends-On: https://review.openstack.org/595330

Change-Id: I7bb98195bbda2a97f004329f024701475f139d53
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-08-22 16:04:09 -04:00
Zuul
319af8ffa3 Merge "Remove old repo map files" 2018-08-19 15:38:37 +00:00
Dean Troyer
96f96cdf42 Remove old repo map files
Change-Id: Ia0acd0e3dedd6f6f4c050b7328688038d7834269
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
2018-08-17 16:01:27 -05:00
Zuul
02c0b0f7c8 Merge "Split image.inc across git repos" 2018-08-17 17:24:03 +00:00
Zuul
429d024009 Merge "Decouple Fault Management from stx-config" 2018-08-17 16:29:42 +00:00
Eric MacDonald
537935bb0c Reorder process restart operations to prevent pmond futex deadlock
All compute hosts were seen to self-reboot by hostw during patching due
to a stuck pmond process.

The current method to kill the running process leads to a race condition
that results in a user space futex deadlock that hangs pmond and
results in a watchdog self-reset due to quorum master 'pmond' failure.

The deadlock was traced to the ordering of the kill process.

Current steps to kill:

 - kill process
 - remove pidfile
 - unregister pid with kernel

Deadlock is avoided by reversing the kill steps into a more
logical order:

 - unregister pid with kernel
 - remove pidfile
 - kill process

Also introduced an audit that registers manually restarted processes
with the kernel.
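
A minimal Python sketch of the reordered steps follows. The three
callbacks are hypothetical stand-ins for the pmond internals, which this
message does not show; only the ordering is taken from the text above.

```python
def stop_monitored_process(pid, unregister_pid, remove_pidfile, kill_process):
    """Stop a monitored process using the post-fix ordering.

    The callables are placeholders; the real operations live in pmond.
    """
    unregister_pid(pid)   # 1. unregister pid with kernel
    remove_pidfile(pid)   # 2. remove pidfile
    kill_process(pid)     # 3. kill process


# Record the call order with simple stand-in callbacks.
calls = []
stop_monitored_process(
    1234,
    unregister_pid=lambda pid: calls.append("unregister"),
    remove_pidfile=lambda pid: calls.append("remove_pidfile"),
    kill_process=lambda pid: calls.append("kill"),
)
print(calls)
```

Passing the operations in as callables keeps the ordering testable
without touching real processes.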

Failure Rate Before Fix: 1 every 25 process restarts.
                         Mostly fails before 5.

Failure Rate  After Fix: No failures after 15000 process restarts
across 8 hosts, including all host types, between 2 different labs and
2 different loads (18.07 and 18.08).

Test Method: Pmon restart regression test restarts all processes on
a host. Total soak restart of 25 monitored processes for 50 loops
over 12 hosts = 15000 restarts.

Also regressed process kill / recovery handling. 
(5000 process recoveries)

Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007
2018-08-16 20:22:15 +00:00
Eric MacDonald
7da4eb945f Enable host heartbeat in add handler when not in DOR mode
Two Node System: VMs did not switch to ERROR state after host reboot

A logically failed (rebooted) active controller is not being
administratively failed by maintenance. As a result the host's
offline availability state is not reported to the VIM and the
VMs on that (rebooted) All-in-one host are not evacuated.

This issue only applies to two node systems because of how the heartbeat
enable of an All-in-one host needs to be held off until its compute 
manifests apply in the DOR case so as to avoid maintenance failing the 
peer controller over a DOR.

The challenge in maintenance is to distinguish between this spontaneous
failure and a DOR. For All-in-one hosts, DOR mode is active for a
whopping 600 seconds; long enough to account for both sets of manifests
to apply.

It's that long delay that is making this silent fault stand out so 
obviously.

This update uses 'active DOR mode' to decide whether or not to enable a
host's heartbeat in the add handler.

To better handle early active controller failure, the qualifier for DOR
mode was reduced from 20 to 15 minutes. This means that maintenance DOR
mode is activated if its host uptime is less than 15 minutes, rather
than 20 as it was before this update. Note that normally the active
controller starts maintenance with an uptime of 5-7 minutes.
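
The uptime qualifier can be sketched as below. This is a simplified
Python illustration with hypothetical names; it ignores the separate
600 second DOR window mentioned above, and the actual logic sits in the
maintenance add handler.

```python
# Qualifier after this update; it was 20 minutes before.
DOR_UPTIME_THRESHOLD_SECS = 15 * 60


def dor_mode_active(uptime_secs, threshold_secs=DOR_UPTIME_THRESHOLD_SECS):
    """DOR mode is activated when host uptime is below the qualifier."""
    return uptime_secs < threshold_secs


def enable_heartbeat_on_add(uptime_secs):
    """Enable a host's heartbeat in the add handler only outside DOR mode."""
    return not dor_mode_active(uptime_secs)
```

An uptime of 17 minutes counted as DOR mode under the old 20 minute
qualifier but does not under the new one.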

Story: 2002995
Task: 23009
Change-Id: I749aefef45b9db6e86a2c6b81d131ebeccc68926
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
2018-08-16 20:20:16 +00:00
Eric MacDonald
67dec7c6cf Mtce: mtcAgent sometimes coredumps on process exit
The mtcAgent process has been seen to segfault and coredump on process 
exit.

The exit path iterates over a C++ list that can change due to HTTP
interrupt response handling.

The dump code is commented out with a note indicating why and when it 
could be re-enabled.

Change-Id: Ie4ef684a65ded533c347ae07fdfa47f332412f7d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002994
Task: 23008
2018-08-16 20:16:07 +00:00
Eric MacDonald
cc53f1e689 Maintain sensor degrade over a process restart
When the Hardware Monitor starts up it reads existing alarms and sensor
state from the sysinv database. It then uses this pre-existing state to
align its internal structure accordingly moving forward.

The hardware monitor manage_startup_states utility is incorrectly 
requesting degrade clear rather than degrade set in response to finding 
a pre-existing critical sensor assertion on process startup.

This update fixes this issue by calling the set_degraded_state rather 
than clear_degraded_state against this sensor in this case.

Change-Id: Ic1ecc1f11d7a729c16da63c6d43b7d758bb9e467
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002882
Task: 22845
2018-08-16 20:14:24 +00:00
Zuul
706de7b423 Merge "Moving PMON script for NTP from MTCE to Puppet" 2018-08-16 16:25:57 +00:00
Tao Liu
f6834399a1 Decouple Fault Management from stx-config
Filter out the fm client and fm rest api packages from
compute and storage nodes

Story: 2002828
Task: 22747

Depends-On: https://review.openstack.org/#/c/591452/

Change-Id: If0663dfb2cc1b557a1b9439c64d3ccb36bd66503
Signed-off-by: Tao Liu <tao.liu@windriver.com>
2018-08-16 11:52:08 -04:00
Scott Little
44aa6ea4da Split image.inc across git repos
Currently compiling a new package and adding it
to the iso still requires a multi-git update because
image.inc is a single centralized file in the root git.

It would be better to allow a single git update to add
a package. To allow this, image.inc must be split across
the git repos and the build tools must be changed to
read/merge those files to arrive at the final package list.

The current scheme is to name the image.inc files using this
schema:

${distro}_${build_target}_image_${build_type}.inc

distro = centos, ...
build_target = iso, guest ...
build_type = std, rt ...

Traditionally build_type=std is omitted from config files,
so we instead use ${distro}_${build_target}_image.inc.
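
The naming rule can be expressed as a small Python helper. The function
name is hypothetical and the build tools themselves may implement this
differently; only the file name pattern comes from the text above.

```python
def image_inc_name(distro, build_target, build_type="std"):
    """Build the image.inc file name for a distro, target and build type.

    build_type "std" is omitted from the name, per the convention above.
    """
    if build_type == "std":
        return "{}_{}_image.inc".format(distro, build_target)
    return "{}_{}_image_{}.inc".format(distro, build_target, build_type)


print(image_inc_name("centos", "iso"))          # centos_iso_image.inc
print(image_inc_name("centos", "guest", "rt"))  # centos_guest_image_rt.inc
```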

Change-Id: I9ef0304ff286be15d95f7ce944ee4ccf9bacc439
Story: 2003447
Task:  24649
Depends-On: Ib39b8063e7759842ba15330c68503bfe2dea6e20
Signed-off-by: Scott Little <scott.little@windriver.com>
2018-08-15 16:45:56 -04:00
Zuul
c38acc947c Merge "Mtce calls sm rest api with keystone authentication" 2018-08-13 17:36:32 +00:00
Bin Qian
b4f8ef606c Mtce calls sm rest api with keystone authentication
As a part of changes to make sm-api independent, calling sm-api
requires keystone authentication.
This change is to enable mtce to call sm rest api with keystone
authentication.

Story: 2002827
Task: 22744

Change-Id: If3b58d3e36b9bd7fd88829d61e9c1daa00ab5048
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-08-13 10:14:45 -04:00
Alex Kozyrev
00520ac78c Moving PMON script for NTP from MTCE to Puppet
Introduction of the PTP service requires the NTP service to be disabled.
Process monitoring of the NTP daemon must be turned off as well.
There is no way to start/stop process monitoring from MTCE.
Puppet can check NTP status at startup and enable/disable monitoring.
So the NTP-related PMON script needs to move from MTCE to Puppet.
This is the first step: removing NTP references from MTCE.

Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01
Story: 2002935
Task: 24520
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2018-08-09 16:04:57 -04:00
Angie Wang
b2d963f0ef Extend cgcs disk partition for gnocchi usage
Updating kickstart to provision 5G for new gnocchi filesystem in
cgcs disk partition.

Story: 2002825
Task: 24240

Change-Id: Ie6182a636e6b9c580af2cce671dcbb267acb305f
Signed-off-by: Angie Wang <angie.wang@windriver.com>
2018.08.0
2018-08-08 15:54:44 -04:00
Angie Wang
3879fe15d6 Filter out gnocchi packages from compute and storage hosts
Story: 2002825
Task: 22871
Depends-On: https://review.openstack.org/587417

Change-Id: I48319b9b584bb8437df48ba5e74c2bfdb1b66827
Signed-off-by: Don Penney <don.penney@windriver.com>
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-31 10:17:24 -04:00
Jack Ding
29ed8f1c18 Cleanup internal references
Story: 2002971
Task: 22979

Change-Id: I095b52139ff4c702fe8a030c1d1697375ef6ff5a
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-31 10:09:27 -04:00
Eric MacDonald
cb2d1b3bfc Mtce: Fix logic compare looking for host that did not reboot
Story: 2002882
Task: 22845

Change-Id: I0ffab3476c32b0947f0cd44796e257ee4bb93029
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:13:05 -04:00
Eric MacDonald
e5cbfce297 Mtce: Increase MNFA timeout from 60 to 100 secs
Story: 2002882
Task: 22845

Change-Id: Ieabbb04877dfec1693a93d38abeefb474ac251a2
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:13:00 -04:00
Eric MacDonald
f649c5b9b4 Mtce: Hosts in MNFA pool are reported to be in Graceful Recovery during wait period
Story: 2002882
Task: 22845

Change-Id: Icbdf21d51f4b41192ed49f40bbe76f462e5aaba9
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:51 -04:00
Eric MacDonald
23d9dd711c Mtce: Enable offline handler during Graceful recovery
Story: 2002882
Task: 22845

Change-Id: Ie5e43a0fe150d277514ef75b9e4c9461951efc26
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:46 -04:00
Eric MacDonald
76fbef1d01 Mtce: Fix memory leak in Swact failure handling
Story: 2002882
Task: 22845

Change-Id: I8be5d26a2702cc9c2788335a27c8d0ebcacc2b2c
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:41 -04:00
Eric MacDonald
4d463fe074 Mtce: add host and iface name to msg debug log in hbsAgent
Story: 2002882
Task: 22845

Change-Id: If4a6768f7f210742130679afb56c5f5364273bfc
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:35 -04:00
Eric MacDonald
083d38923a Mtce: Force enable failure of host that did not reboot during enable.
If the first mtcAlive message from a host that was supposed to be
rebooted reports uptime in excess of 40 minutes then that means it did
not reboot as expected.

This was seen to happen during an extended offline case where the host
failed heartbeat, then was reported offline during Graceful Recovery
which forced a full enable. When the host eventually came back online
its reported uptime made it clear that it never rebooted but mtce
allowed it to come into service anyway.

This is a security issue that can lead to a host disappearing, being
security hacked and brought back into the system without reboot.

To fix that, this update requires that a host's uptime, reported in its
first mtcAlive message, indicate that it has been up for less than twice
the configured mtcAlive timeout; otherwise the enable fails until the
host is proven to have reset.
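
The check described above amounts to a single comparison. The Python
sketch below uses hypothetical names and an arbitrary example timeout;
the real check is part of the mtce enable handling.

```python
def rebooted_as_expected(reported_uptime_secs, mtcalive_timeout_secs):
    """Accept the enable only if uptime is less than twice the timeout."""
    return reported_uptime_secs < 2 * mtcalive_timeout_secs


# Example values only; the configured mtcAlive timeout is not stated here.
print(rebooted_as_expected(100, 600))    # True: host came up recently
print(rebooted_as_expected(2400, 600))   # False: host never rebooted
```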

Story: 2002882
Task: 22845

Change-Id: I9b3ff0bc1ba5af2ca5b07a58db9da9f288b59576
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:28 -04:00
Eric MacDonald
acd2d684f6 Mtce: Debounce heartbeat recovery
In the event of Heartbeat Failure with a host, the Mtce Heartbeat Agent
will declare heartbeat recovery upon the first successful heartbeat
reply after the loss is declared; basically edge-triggered
recovery.

In cases where a networking issue causes heartbeat loss of a group of
hosts, Maintenance tracks the group of hosts that experienced heartbeat
loss and puts the system into 'Multi Node Failure Avoidance' mode.
Maintenance then simply waits up to a configured timeout period for
hosts to regain heartbeat.
As heartbeat is regained for each host, a 'Graceful Recovery' of that
host is attempted.

However, if the networking issue persists in a way that the occasional
transient heartbeat pulse gets through, then the maintenance system can
prematurely take hosts and then 'the system' out of MNFA mode, only to
find that heartbeat is not actually recovered and then fail and force
reboot/reset each node that is still experiencing heartbeat loss.

This update changes the heartbeat service from an 'edge' to 'level'
sensitive recovery by requiring a number of back-to-back heartbeat pulses
following a failure before that host is declared as recovered and pulled
out of the MNFA pool.

Basically, this update makes the system's MNFA recovery algorithm more
robust in the face of transient heartbeat loss for a group of hosts.
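
The difference between edge and level sensitive recovery can be sketched
in Python as follows. The class and function names are hypothetical, and
the required pulse count of 3 is an assumption for illustration; the
message above only says 'a number of' pulses.

```python
class HeartbeatDebouncer:
    """Declare recovery only after N consecutive heartbeat pulses."""

    def __init__(self, required_pulses=3):  # 3 is an assumed example value
        self.required_pulses = required_pulses
        self.consecutive = 0

    def on_pulse(self):
        """Count a received pulse; return True once recovery is declared."""
        self.consecutive += 1
        return self.consecutive >= self.required_pulses

    def on_miss(self):
        """A missed pulse resets the back-to-back count."""
        self.consecutive = 0


def recovered_at(events, required_pulses=3):
    """Return the index of the event that declares recovery, or None.

    events is a sequence of booleans: True for a pulse, False for a miss.
    """
    debouncer = HeartbeatDebouncer(required_pulses)
    for index, pulse in enumerate(events):
        if pulse:
            if debouncer.on_pulse():
                return index
        else:
            debouncer.on_miss()
    return None
```

Setting required_pulses to 1 reproduces the old edge behavior, where a
single transient pulse is enough to declare recovery.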

Story: 2002882
Task: 22845

Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:19 -04:00
Eric MacDonald
ed1410a736 Mtce: Re-add explicit request for mtcAlive in Graceful Recovery handler
Story: 2002882
Task: 22845

Change-Id: Ib814416e46f988b3342a2da7b31e6e7273684c9e
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:11:59 -04:00
Zuul
8b4cb5f73d Merge "Update upgrade version to 18.03" 2018-07-10 17:14:33 +00:00
jmckenna
bb036defd6 Update boot configs to match CentOS 7.5 kernel
To improve kubernetes support, update kernel to CentOS 7.5 version
and enable user namespaces in kernel bootargs.

Depends-On:  https://review.openstack.org/580689

Change-Id: I4d8620ea17a19a764c6627cd79eb548c79c56bfd
Signed-off-by: Jason McKenna <jason.mckenna@windriver.com>
Story: 2002761
Task: 22841
2018-07-06 11:26:06 -04:00
Bart Wensley
3332b39ba2 Update upgrade version to 18.03
Story: 2002886
Task: 22847
Change-Id: Ieb01085e5ffa12ce90076c1bd8d9c0032396043d
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-06 09:19:38 -04:00
Eric MacDonald
7be3b9085a Add 90s delay before locking storage node for upgrade
Adds support to the mtcAgent for detecting the absence of the 'host
services execution enhancement feature' in the mtcClient and falls back
to the pre-upgrade behavior in that case. When mtcAgent tries to lock
a storage node running a pre-upgrade version it will implement a 90s
lock wait before proceeding to declare that storage host as
locked-disabled.

Story: 2002886
Task: 22847
Change-Id: I99fb5576e027621019adb5eff553d52773f608db
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-06 09:18:21 -04:00
Scott Little
51d572ceed Shorten "addons/wr-cgcs/layers/cgcs" to just "stx"
Part of the project to remove cgcs references.
Replace and shorten the needlessly long and complex
"addons/wr-cgcs/layers/cgcs" path with just "stx".

This update just fixes up paths found in scripts, comments
and config files.

Depends-On: https://review.openstack.org/579954
Depends-On: https://review.openstack.org/579957
Depends-On: https://review.openstack.org/580170
Depends-On: https://review.openstack.org/579975
Change-Id: I2110a0de13487492f62cdaf5d5513f4faf20d50d
Signed-off-by: Scott Little <scott.little@windriver.com>
2018-07-04 11:03:59 -04:00
Scott Little
89dd36625e Rename mwa-* subdirectories to match the git repo name
mwa-delphia -> stx-clients
mwa-pitta   -> stx-config
mwa-cleo    -> stx-fault
mwa-gplv2   -> stx-gplv2
mwa-gplv3   -> stx-gplv3
mwa-solon   -> stx-ha
mwa-sparta  -> stx-integ
mwa-beas    -> stx-metal
mwa-thales  -> stx-nfv
mwa-chilon  -> stx-update
mwa-perian  -> stx-upstream

Depends-On: https://review.openstack.org/579954
Depends-On: https://review.openstack.org/579957
Change-Id: I269a4e79425a41709381f8894456d21233463e9f
Signed-off-by: Scott Little <scott.little@windriver.com>
2018-07-03 16:29:24 -04:00
Zuul
db4063233b Merge "Spectre/meltdown kernel options controllable by customer" 2018-07-03 17:19:18 +00:00
Zuul
4a4c540a3c Merge "Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1" 2018-07-03 17:02:34 +00:00
Zuul
3c53bf4a47 Merge "pmond: add support for no script label in conf files" 2018-07-03 17:02:33 +00:00
jmckenna
fba0ef3f7c Spectre/meltdown kernel options controllable by customer
Implements customer configuration of spectre/meltdown related
kernel options.  The default (with "nopti nospectre_v2" options)
can be changed to "" using:

system modify -S spectre_meltdown_all

Change-Id: I183a22fa681e6524415558c0009aa8786418cc07
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-03 11:04:58 -04:00
Eric MacDonald
c038b1a9a7 Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1
This update adds Maintenance support for receiving host degrade assert
and clear messages from collectd.
This update also disables platform memory, cpu and file system resource
monitoring in the maintenance resource monitor process rmon.
These disabled resources are now monitored by collectd and therefore
should not be monitored by rmond any longer.

Change-Id: I13fd033bb1d14f299dcb97fa80296641c958d0a9
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-03 11:04:27 -04:00