metal

Author	SHA1	Message	Date
Abraham Arce	61abc3aafc	[Doc] Release Notes Management Baseline changes to comply with Release Notes Management based on Reno [0], a release notes manager. [0] https://docs.openstack.org/reno/latest/ Story: 2003101 Task: 25744 Change-Id: Ib52641346d5a788df53a2bab97c98f2e1de0b170 Signed-off-by: Abraham Arce <abraham.arce.moreno@intel.com>	2018-09-05 19:59:26 -05:00
Abraham Arce	451bad46e9	[Doc] Building docs following Docs Contrib Guide Baseline changes to comply with OpenStack Documentation Contributor Guide [0] starting with the following sections: - Project guide setup - [1] sphinx-quickstart - [2] doc/source/ layout - Building documentation - [3] tox -e docs - Using documentation tools - [4] openstackdocstheme [0] https://docs.openstack.org/doc-contrib-guide [1] http://www.sphinx-doc.org/en/master/usage/quickstart.html [2] https://docs.openstack.org/doc-contrib-guide/project-guides.html [3] https://docs.openstack.org/doc-contrib-guide/docs-builds.html [4] https://docs.openstack.org/openstackdocstheme/ Story: 2002708 Task: 24449 Story: 2002813 Task: 24450 Change-Id: I961c7c90c51248926d11b2a2a89c0231f58f7fd0 Signed-off-by: Abraham Arce <abraham.arce.moreno@intel.com>	2018-09-05 19:59:26 -05:00
Sun Austin	fedb95ba79	Fix linters issues and enable tox/zuul linters job as gate Fix below linters issues E001 Trailing Whitespace E003 Indent not multiple of 4 E006 Line too long E011 Then keyword is not on same line as if or elif keyword E020 Function declaration not in format ^function name {$ E040 Syntax error: syntax error near unexpected token `;' ignore cases are added in tox setup E006 Line too long E010: do not on the same line as for Story: 2003368 Task: 24427 Change-Id: I6acf64271a4e608be8bc8fa965cac4fa31e0c05b Signed-off-by: Sun Austin <austin.sun@intel.com>	2018-09-05 09:02:25 +08:00
Eric MacDonald	82e851d651	Mtce: Make Multi-Node Failure Avoidance Configurable The maintenance system implements a high availability (HA) feature designed to detect the simultaneous heartbeat failure of a group of hosts and avoid failing all those hosts until heartbeat resumes or after a set period of time. This feature is called Multi-Node Failure Avoidance, aka MNFA, and currently has the hosts threshold set to 3 and timeout set to 100 secs. This update implements enhancements to that existing feature by making the 'number-of-hosts threshold' and 'timeout period' customer-configurable service parameters. The new service parameters are listed under platform:maintenance and are displayed with the following command > system service-parameter-list mnfa_threshold: This new label and value is added to the puppet managed /etc/mtc.ini and represents the number of hosts that are required to fail heartbeat as a group, within the heartbeat failure window (heartbeat_failure_threshold), after which maintenance activates MNFA Mode. This update changes the default number of failing hosts from 3 to 2 while allowing a configurable range from 2 to 100. mnfa_timeout: This new label and value is added to the puppet managed /etc/mtc.ini. While MNFA mode is active, it will remain active until the number of failing hosts drops below the mnfa_threshold or this timer expires. The MNFA mode deactivates on the first occurrence of either case. Upon deactivation the remaining failed hosts are no longer treated as a failure group but instead are all Gracefully Recovered individually. A value of zero imposes no timeout, making the deactivation criteria solely host based. This update changes the default 100 second timer to 0 (no timeout) while permitting a valid time range from 100 to 86400 secs (1 day).
Test Plan: PASS - Verify duplex and 4 compute DOR PASS - Verify default MNFA - 1 inactive controller and 4 computes PASS - Verify default MNFA - 4 computes PASS - Verify default MNFA - 1 active controller and 3 computes and failed host PASS - Verify Single host heartbeat failure handling - fail host PASS - Verify Multi Node failure below mnfa_threshold - fail hosts PASS - Verify MNFA handling with timeout of zero and threshold of 3 PASS - Verify MNFA timeout handling with timeout set at 100 sec PASS - Verify MNFA service parameter listing, default value and mtc.ini PASS - Verify MNFA service parameter change and inservice apply PASS - Verify MNFA timeout service parameter change from value to 0 PASS - Verify MNFA timeout service parameter change from to inrange value PASS - Verify MNFA service parameter out of range change handling PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active) DocImpact Story: 2003576 Task: 24903 Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-08-31 15:35:08 -04:00
hazelnutsgz	482d1acea8	Fix the print syntax inconsistency between python2 and python3 Using the automation tool & manual check to fix the print syntax. Task: 24595 Story: 2003426 Change-Id: I3844c9644aabeeeb27bc2abb106c839b9921fe78	2018-08-29 16:09:27 +08:00
Zuul	c3d9e4e689	Merge "Add linux screen package to controllers"	2018-08-24 17:48:38 +00:00
Zuul	5581422b0e	Merge "Exclude openstack-swift pkgs from compute/storage"	2018-08-23 19:31:02 +00:00
Zuul	0c9a69ccd1	Merge "Mtce: mtcAgent sometimes coredumps on process exit"	2018-08-23 13:36:33 +00:00
Zuul	4d43887256	Merge "Maintain sensor degrade over a process restart"	2018-08-23 13:25:16 +00:00
Zuul	8290718f81	Merge "Reorder process restart operations to prevent pmond futex deadlock"	2018-08-23 13:24:40 +00:00
Zuul	6a1999a371	Merge "Enable host heartbeat in add handler when not in DOR mode"	2018-08-23 13:21:20 +00:00
Paul-Emile Element	1d9f594147	Add linux screen package to controllers This is an enhancement request to add the screen package to controller nodes This specific modification prevents the screen package from being installed on other nodes (compute or storage) The screen package is added in another commit (see https://review.openstack.org/#/c/595249/) Story: 2003061 Task: 23100 Depends-on: https://review.openstack.org/#/c/595249/ Change-Id: I355d517ba0d0392d40fe78991798ddf6e5d16fde Signed-off-by: Paul-Emile Element <Paul-Emile.Element@windriver.com>	2018-08-22 17:41:27 -04:00
Jack Ding	ae26bbdca3	Exclude openstack-swift pkgs from compute/storage The low-capacity Swift solution this Story is implementing is on controllers only. Story: 2003518 Task: 24811 Depends-On: https://review.openstack.org/595330 Change-Id: I7bb98195bbda2a97f004329f024701475f139d53 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-08-22 16:04:09 -04:00
Zuul	319af8ffa3	Merge "Remove old repo map files"	2018-08-19 15:38:37 +00:00
Dean Troyer	96f96cdf42	Remove old repo map files Change-Id: Ia0acd0e3dedd6f6f4c050b7328688038d7834269 Signed-off-by: Dean Troyer <dtroyer@gmail.com>	2018-08-17 16:01:27 -05:00
Zuul	02c0b0f7c8	Merge "Split image.inc across git repos"	2018-08-17 17:24:03 +00:00
Zuul	429d024009	Merge "Decouple Fault Management from stx-config"	2018-08-17 16:29:42 +00:00
Eric MacDonald	537935bb0c	Reorder process restart operations to prevent pmond futex deadlock All compute hosts seen to self reboot by hostw during patching due to stuck pmond process. Current method to kill the running process leads to a race condition that results in a user space futex deadlock that hangs pmond and results in a watchdog self-reset due to quorum master 'pmond' failure. The deadlock was traced to the ordering of the kill process. Current steps to kill: - kill process - remove pidfile - unregister pid with kernel Deadlock is avoided by reversing the kill steps to what is more logical. - unregister pid with kernel - remove pidfile - kill process Also introduced audit that registers manually restarted processes with the kernel. Failure Rate Before Fix: 1 every 25 process restarts. Mostly fails before 5. Failure Rate After Fix: No failures after 15000 process restarts across 8 hosts including all host types between 2 different labs and 2 different loads, 18.07 and 18.08. Test Method: Pmon restart regression test restarts all processes on a host. Total soak restart of 25 monitored processes for 50 loops over 12 hosts = 15000 restarts. Also regressed process kill / recovery handling. (5000 process recoveries) Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d Signed-off-by: David Sullivan <david.sullivan@windriver.com> Story: 2002993 Task: 23007	2018-08-16 20:22:15 +00:00
Eric MacDonald	7da4eb945f	Enable host heartbeat in add handler when not in DOR mode Two Node System: VMs did not switch to ERROR state after host reboot. A logically failed (rebooted) active controller is not being administratively failed by maintenance. As a result the host's offline availability state is not reported to the VIM and the VMs on that (rebooted) All-in-one host are not evacuated. This issue only applies to two node systems because of how the heartbeat enable of an All-in-one host needs to be held off until its compute manifests apply in the DOR case so as to avoid maintenance failing the peer controller over a DOR. The challenge in maintenance is to distinguish between this spontaneous failure and a DOR. For All-in-one hosts, DOR mode is active for a whopping 600 seconds; long enough to account for both sets of manifests to apply. It's that long delay that is making this silent fault stand out so obviously. This update uses 'active DOR mode' to decide whether or not to enable a host's heartbeat in the add handler. To better handle early active controller failure the qualifier for DOR mode was reduced from 20 to 15 minutes, meaning that maintenance DOR mode is activated if its host up time is less than 15 minutes, rather than 20 as it was before this update. Note that normally the active controller starts maintenance with an uptime of 5-7 minutes. Story: 2002995 Task: 23009 Change-Id: I749aefef45b9db6e86a2c6b81d131ebeccc68926 Signed-off-by: David Sullivan <david.sullivan@windriver.com>	2018-08-16 20:20:16 +00:00
Eric MacDonald	67dec7c6cf	Mtce: mtcAgent sometimes coredumps on process exit The mtcAgent process has been seen to segfault and coredump on process exit. The exit code is iterating over a c++ list that can change due to http interrupt response handling. The dump code is commented out with a note indicating why and when it could be re-enabled. Change-Id: Ie4ef684a65ded533c347ae07fdfa47f332412f7d Signed-off-by: David Sullivan <david.sullivan@windriver.com> Story: 2002994 Task: 23008	2018-08-16 20:16:07 +00:00
Eric MacDonald	cc53f1e689	Maintain sensor degrade over a process restart When the Hardware Monitor starts up it reads existing alarms and sensor state from the sysinv database. It then uses this pre-existing state to align its internal structure accordingly moving forward. The hardware monitor manage_startup_states utility is incorrectly requesting degrade clear rather than degrade set in response to finding a pre-existing critical sensor assertion on process startup. This update fixes this issue by calling the set_degraded_state rather than clear_degraded_state against this sensor in this case. Change-Id: Ic1ecc1f11d7a729c16da63c6d43b7d758bb9e467 Signed-off-by: David Sullivan <david.sullivan@windriver.com> Story: 2002882 Task: 22845	2018-08-16 20:14:24 +00:00
Zuul	706de7b423	Merge "Moving PMON script for NTP from MTCE to Puppet"	2018-08-16 16:25:57 +00:00
Tao Liu	f6834399a1	Decouple Fault Management from stx-config Filter out the fm client and fm rest api packages from compute and storage nodes Story: 2002828 Task: 22747 Depends-On: https://review.openstack.org/#/c/591452/ Change-Id: If0663dfb2cc1b557a1b9439c64d3ccb36bd66503 Signed-off-by: Tao Liu <tao.liu@windriver.com>	2018-08-16 11:52:08 -04:00
Scott Little	44aa6ea4da	Split image.inc across git repos Currently compiling a new package and adding it to the iso still requires a multi-git update because image.inc is a single centralized file in the root git. It would be better to allow a single git update to add a package. To allow this, image.inc must be split across the git repos and the build tools must be changed to read/merge those files to arrive at the final package list. Current scheme is to name the image.inc files using this schema: ${distro}_${build_target}_image_${build_type}.inc distro = centos, ... build_target = iso, guest ... build_type = std, rt ... Traditionally build_type=std is omitted from config files, so we instead use ${distro}_${build_target}_image.inc. Change-Id: I9ef0304ff286be15d95f7ce944ee4ccf9bacc439 Story: 2003447 Task: 24649 Depends-On: Ib39b8063e7759842ba15330c68503bfe2dea6e20 Signed-off-by: Scott Little <scott.little@windriver.com>	2018-08-15 16:45:56 -04:00
Zuul	c38acc947c	Merge "Mtce calls sm rest api with keystone authentication"	2018-08-13 17:36:32 +00:00
Bin Qian	b4f8ef606c	Mtce calls sm rest api with keystone authentication As a part of changes to make sm-api independent, calling sm-api requires keystone authentication. This change is to enable mtce to call sm rest api with keystone authentication. Story: 2002827 Task: 22744 Change-Id: If3b58d3e36b9bd7fd88829d61e9c1daa00ab5048 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-08-13 10:14:45 -04:00
Alex Kozyrev	00520ac78c	Moving PMON script for NTP from MTCE to Puppet Introduction of PTP service requires NTP service to be disabled. Process monitoring of NTP daemon must be turned off as well. There is no way to start/stop process monitoring from MTCE. Puppet can check NTP status at startup and enable/disable monitoring. So the NTP-related PMON script needs to move from MTCE to Puppet. This is the first step: removing NTP references from MTCE. Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01 Story: 2002935 Task: 24520 Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>	2018-08-09 16:04:57 -04:00
Angie Wang	b2d963f0ef	Extend cgcs disk partition for gnocchi usage Updating kickstart to provision 5G for new gnocchi filesystem in cgcs disk partition. Story: 2002825 Task: 24240 Change-Id: Ie6182a636e6b9c580af2cce671dcbb267acb305f Signed-off-by: Angie Wang <angie.wang@windriver.com> 2018.08.0	2018-08-08 15:54:44 -04:00
Angie Wang	3879fe15d6	Filter out gnocchi packages from compute and storage hosts Story: 2002825 Task: 22871 Depends-On: https://review.openstack.org/587417 Change-Id: I48319b9b584bb8437df48ba5e74c2bfdb1b66827 Signed-off-by: Don Penney <don.penney@windriver.com> Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-31 10:17:24 -04:00
Jack Ding	29ed8f1c18	Cleanup internal references Story: 2002971 Task: 22979 Change-Id: I095b52139ff4c702fe8a030c1d1697375ef6ff5a Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-31 10:09:27 -04:00
Eric MacDonald	cb2d1b3bfc	Mtce: Fix logic compare looking for host that did not reboot Story: 2002882 Task: 22845 Change-Id: I0ffab3476c32b0947f0cd44796e257ee4bb93029 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:13:05 -04:00
Eric MacDonald	e5cbfce297	Mtce: Increase MNFA timeout from 60 to 100 secs Story: 2002882 Task: 22845 Change-Id: Ieabbb04877dfec1693a93d38abeefb474ac251a2 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:13:00 -04:00
Eric MacDonald	f649c5b9b4	Mtce: Hosts in MNFA pool are reported to be in Graceful Recovery during wait period Story: 2002882 Task: 22845 Change-Id: Icbdf21d51f4b41192ed49f40bbe76f462e5aaba9 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:12:51 -04:00
Eric MacDonald	23d9dd711c	Mtce: Enable offline handler during Graceful recovery Story: 2002882 Task: 22845 Change-Id: Ie5e43a0fe150d277514ef75b9e4c9461951efc26 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:12:46 -04:00
Eric MacDonald	76fbef1d01	Mtce: Fix memory leak in Swact failure handling Story: 2002882 Task: 22845 Change-Id: I8be5d26a2702cc9c2788335a27c8d0ebcacc2b2c Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:12:41 -04:00
Eric MacDonald	4d463fe074	Mtce: add host and iface name to msg debug log in hbsAgent Story: 2002882 Task: 22845 Change-Id: If4a6768f7f210742130679afb56c5f5364273bfc Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:12:35 -04:00
Eric MacDonald	083d38923a	Mtce: Force enable failure of host that did not reboot during enable. If the first mtcAlive message from a host that was supposed to be rebooted reports uptime in excess of 40 minutes then that means it did not reboot as expected. This was seen to happen during an extended offline case where the host failed heartbeat, then was reported offline during Graceful Recovery which forced a full enable. When the host eventually came back online its reported uptime made it clear that it never rebooted but mtce allowed it to come into service anyway. This is a security issue that can lead to a host disappearing, being security hacked and brought back into the system without reboot. To fix that, this update requires that a host's uptime, reported in its first mtcAlive message, indicate that it has been up for less than twice the configured mtcAlive timeout or the enable will fail until it is proven to reset. Story: 2002882 Task: 22845 Change-Id: I9b3ff0bc1ba5af2ca5b07a58db9da9f288b59576 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:12:28 -04:00
Eric MacDonald	acd2d684f6	Mtce: Debounce heartbeat recovery For the event of Heartbeat Failure with a host, the Mtce Heartbeat Agent will declare heartbeat recovery upon the first successful heartbeat reply after the loss is declared; basically edge level trigger recovery. In cases where a networking issue causes heartbeat loss of a group of hosts, Maintenance tracks the group of hosts that experienced heartbeat loss and puts the system into 'Multi Node Failure Avoidance' mode. Maintenance then simply waits up to a configured timeout period for hosts to regain heartbeat. As heartbeat is regained for each host, maintenance attempts to 'Gracefully Recover' that host. However, if the networking issue persists in a way that the occasional transient heartbeat pulse gets through, then the maintenance system can prematurely take hosts and then 'the system' out of MNFA mode, only to find that heartbeat is actually not properly recovered/working, and then fail and force reboot/reset each node that is still experiencing heartbeat loss. This update changes the heartbeat service from an 'edge' to 'level' sensitive recovery by requiring a number of back-2-back heartbeat pulses following a failure before that host is declared as recovered and pulled out of the MNFA pool. Basically, this update makes the system's MNFA recovery algorithm more robust in the face of transient heartbeat loss for a group of hosts. Story: 2002882 Task: 22845 Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:12:19 -04:00
Eric MacDonald	ed1410a736	Mtce: Re-add explicit request for mtcAlive in Graceful Recovery handler Story: 2002882 Task: 22845 Change-Id: Ib814416e46f988b3342a2da7b31e6e7273684c9e Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 11:11:59 -04:00
Zuul	8b4cb5f73d	Merge "Update upgrade version to 18.03"	2018-07-10 17:14:33 +00:00
jmckenna	bb036defd6	Update boot configs to match CentOS 7.5 kernel To improve kubernetes support, update kernel to CentOS 7.5 version and enable user namespaces in kernel bootargs. Depends-On: https://review.openstack.org/580689 Change-Id: I4d8620ea17a19a764c6627cd79eb548c79c56bfd Signed-off-by: Jason McKenna <jason.mckenna@windriver.com> Story: 2002761 Task: 22841	2018-07-06 11:26:06 -04:00
Bart Wensley	3332b39ba2	Update upgrade version to 18.03 Story: 2002886 Task: 22847 Change-Id: Ieb01085e5ffa12ce90076c1bd8d9c0032396043d Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-06 09:19:38 -04:00
Eric MacDonald	7be3b9085a	Add 90s delay before locking storage node for upgrade Adds support to the mtcAgent for detecting the absence of the 'host services execution enhancement feature' in the mtcClient and implements the pre-upgrade implementation in that case. When mtcAgent tries to lock a storage node running a pre-upgrade version it will implement a 90s lock wait before proceeding to declare that storage host as locked-disabled. Story: 2002886 Task: 22847 Change-Id: I99fb5576e027621019adb5eff553d52773f608db Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-06 09:18:21 -04:00
Scott Little	51d572ceed	Shorten "addons/wr-cgcs/layers/cgcs" to just "stx" Part of the project to remove cgcs references. Replace the needlessly long and complex "addons/wr-cgcs/layers/cgcs" path with just "stx". This update just fixes up paths found in scripts, comments and config files. Depends-On: https://review.openstack.org/579954 Depends-On: https://review.openstack.org/579957 Depends-On: https://review.openstack.org/580170 Depends-On: https://review.openstack.org/579975 Change-Id: I2110a0de13487492f62cdaf5d5513f4faf20d50d Signed-off-by: Scott Little <scott.little@windriver.com>	2018-07-04 11:03:59 -04:00
Scott Little	89dd36625e	Rename mwa-* subdirectories to match the git repo name mwa-delphia -> stx-clients mwa-pitta -> stx-config mwa-cleo -> stx-fault mwa-gplv2 -> stx-gplv2 mwa-gplv3 -> stx-gplv3 mwa-solon -> stx-ha mwa-sparta -> stx-integ mwa-beas -> stx-metal mwa-thales -> stx-nfv mwa-chilon -> stx-update mwa-perian -> stx-upstream Depends-On: https://review.openstack.org/579954 Depends-On: https://review.openstack.org/579957 Change-Id: I269a4e79425a41709381f8894456d21233463e9f Signed-off-by: Scott Little <scott.little@windriver.com>	2018-07-03 16:29:24 -04:00
Zuul	db4063233b	Merge "Spectre/meltdown kernel options controllable by customer"	2018-07-03 17:19:18 +00:00
Zuul	4a4c540a3c	Merge "Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1"	2018-07-03 17:02:34 +00:00
Zuul	3c53bf4a47	Merge "pmond: add support for no script label in conf files"	2018-07-03 17:02:33 +00:00
jmckenna	fba0ef3f7c	Spectre/meltdown kernel options controllable by customer Implements customer configuration of kernel options to control spectre/meltdown related kernel options. Default (with "nopti nospectre_v2" options) can be changed to "" using system modify -S spectre_meltdown_all Change-Id: I183a22fa681e6524415558c0009aa8786418cc07 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-03 11:04:58 -04:00
Eric MacDonald	c038b1a9a7	Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1 This update adds Maintenance support for receiving host degrade assert and clear messages from collectd. This update also disables platform memory, cpu and file system resource monitoring in the maintenance resource monitor process rmon. These disabled resources are now monitored by collectd and therefore should not be monitored by rmond any longer. Change-Id: I13fd033bb1d14f299dcb97fa80296641c958d0a9 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-03 11:04:27 -04:00
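
The "Debounce heartbeat recovery" commit (acd2d684f6) in the log above describes moving from edge-sensitive to level-sensitive recovery: a host is declared recovered only after a number of back-to-back heartbeat pulses following a failure. A minimal Python sketch of that idea follows. The class name, method names, and the required count of 3 are hypothetical and chosen for illustration; they are not taken from the mtce source, which is not written in Python.

```python
class HeartbeatDebouncer:
    """Hypothetical sketch: declare recovery only after N consecutive pulses."""

    def __init__(self, required_pulses=3):  # 3 is an assumed value, not from the commit
        self.required_pulses = required_pulses
        self.failed = False
        self.consecutive = 0

    def declare_failure(self):
        """Mark the host as having lost heartbeat and restart the pulse count."""
        self.failed = True
        self.consecutive = 0

    def on_pulse(self, received):
        """Process one heartbeat period; return True when recovery is declared."""
        if not self.failed:
            return False
        if not received:
            # A missed pulse breaks the run, so a lone transient reply cannot
            # count toward recovery (the edge-triggered behaviour being replaced).
            self.consecutive = 0
            return False
        self.consecutive += 1
        if self.consecutive >= self.required_pulses:
            self.failed = False
            self.consecutive = 0
            return True
        return False
```

With this sketch, a sequence of reply, miss, reply, reply, reply after a declared failure yields recovery only on the final reply, because the miss resets the run.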

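The "Split image.inc across git repos" commit (44aa6ea4da) in the log above names the per-repo files `${distro}_${build_target}_image_${build_type}.inc`, with the `std` build type omitted from the file name. The naming rule can be sketched in Python as below. The function name `image_inc_name` is hypothetical and is not part of the actual build tools.

```python
def image_inc_name(distro, build_target, build_type="std"):
    """Hypothetical helper: build an image.inc file name per the commit's schema."""
    if build_type == "std":
        # build_type=std is traditionally omitted from config file names.
        return f"{distro}_{build_target}_image.inc"
    return f"{distro}_{build_target}_image_{build_type}.inc"
```

For example, `image_inc_name("centos", "iso")` gives `centos_iso_image.inc`, while `image_inc_name("centos", "iso", "rt")` gives `centos_iso_image_rt.inc`.
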

192 Commits