This update replaces 'compute' references with 'worker' in mtce,
kickstarts, installer and bsp files.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked and enabled with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.
Once a host reaches a particular failure mode's maximum failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way, with no further recovery action, until
it is manually locked and unlocked.
The thresholded Enable failure causes are
Configuration Failure ....... threshold:2 retry interval:30 secs
In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
Start Host Services Failure . threshold:2 retry interval:30 secs
Heartbeat Soak Failure ...... threshold:2 retry interval:10 mins
This update refactors the old auto recovery for AIO SX into this
more generic framework.
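For illustration only, a minimal sketch of the thresholded counting this
framework implies; the names, types and policy table are hypothetical,
not the actual mtcAgent implementation:

    #include <map>
    #include <string>

    enum autorecovery_cause { AR_CONFIG, AR_GOENABLED, AR_HOST_SERVICES, AR_HEARTBEAT };

    struct ar_policy { int threshold ; int retry_interval_secs ; };

    /* illustrative policy table matching the thresholds listed above */
    static const std::map<autorecovery_cause, ar_policy> ar_policies = {
        { AR_CONFIG,        { 2,  30 } },
        { AR_GOENABLED,     { 2,  30 } },
        { AR_HOST_SERVICES, { 2,  30 } },
        { AR_HEARTBEAT,     { 2, 600 } },
    };

    /* per-host, per-cause back-to-back failure counters */
    static std::map<std::string, std::map<autorecovery_cause, int>> ar_counts ;

    /* true  : retry the Enable after the cause's retry interval
     * false : leave the host unlocked-disabled-failed until it is
     *         manually locked and unlocked */
    bool autorecovery_allowed ( const std::string & host, autorecovery_cause cause )
    {
        int & count = ar_counts[host][cause] ;
        return ( ++count < ar_policies.at(cause).threshold ) ;
    }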
Story: 2003576
Task: 24905
Test Plan:
PASS: Verify AIO SX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling
PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)
PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS; Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message
Regression:
PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: verify in-service heartbeat failure handling
PASS: verify no segfaults during UT
Corner Cases:
PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.
Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A few small issues were found during integration testing with SM.
This update delivers those integration tested fixes.
1. Send cluster event to SM only after the first 10 heartbeat
pulses are received.
2. Only send inventory to hbsAgent on provisioned controllers.
3. Add new OOB SM_UNHEALTHY flag to detect and act on an SM
declared unhealthy controller.
4. Network monitoring enable fix.
5. Fix oldest entry tracking when a network history is not full.
6. Prevent clearing local uptime for a host that is being enabled.
7. Refactor cluster state change notification logging and handling.
These fixes were both UT and IT tested in multiple labs
Change-Id: I28485f241ac47bb3ed3ec1e2a8f4c09a1ca2070a
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A number of Makefiles use '[[', a bash construct, in the test that sets
STATIC_ANALYSIS_TOOL_EXISTS. Set SHELL=/bin/bash so those tests run under bash.
Change-Id: Ie9536d7cafd518f3e65acf38ac5b30aa7536ea79
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
This behavior is documented in json_object.h from version 0.11:
https://github.com/json-c/json-c/blob/json-c-0.11/json_object.h#L271
On json-c 0.11 there is no assert that checks the ref_count, so we do not
crash. But on json-c 0.13.1 (the latest release), json_object_put checks
the ref_count first, so mtcAgent will crash.
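For context, a small sketch of this class of reference-counting bug; it is
illustrative only and not the exact mtcAgent code path. In json-c,
json_object_object_get_ex() returns a borrowed reference owned by the
parent object, so calling json_object_put on it is an over-release:

    #include <json-c/json.h>
    #include <stdio.h>

    void handle_response ( const char * buf )
    {
        struct json_object * root = json_tokener_parse ( buf ) ; /* new (owned) reference */
        struct json_object * status = NULL ;

        if ( root && json_object_object_get_ex ( root, "status", &status ) )
        {
            printf ( "status: %s\n", json_object_get_string ( status ) ) ;
            /* WRONG: json_object_put ( status ) ;
             * Tolerated silently on json-c 0.11, but json-c 0.13.1 asserts
             * on the ref_count and aborts the process. */
        }
        json_object_put ( root ) ; /* correct: release only the owned reference */
    }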
Test Done:
Ran mtcAgent with json-c version 0.13.1 with this patch; no crash found.
Closes-Bug: 1807097
Change-Id: I35e5c1cad2e16ee0b6fc639380f1bdd3b64a7018
Signed-off-by: Yan Chen <yan.chen@intel.com>
This behavior is documented in json_object.h from version 0.11:
https://github.com/json-c/json-c/blob/json-c-0.11/json_object.h#L271
On json-c 0.11 there is no assert that checks the ref_count, so we do not
crash. But on json-c 0.13.1 (the latest release), json_object_put checks
the ref_count first, so mtcAgent will crash.
Test Done:
Ran mtcAgent with json-c version 0.13.1 with this patch; no crash found.
Closes-Bug: 1807097
Change-Id: I7f954c97804ae01f831c94a36b9dbdbb34dbf083
Signed-off-by: Yan Chen <yan.chen@intel.com>
A recent update to stx-metal/mtce-common removed a daemon_config
structure member that the stx-nfv/mtce-guest git depends on.
This was not detected during UT of the mtce-common change because
of a missing build dependency that should force a rebuild of
mtce-guest.
Delivering the code fix to unblock the community.
Will deliver the build dependency change shortly.
Change-Id: Ice08424f156ffc84e38651fbc40ebc184170eb20
Closes-Bug: 1804579
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update introduces mtce changes to support Active-Active Heartbeating.
The purpose of Active-Active Heartbeating is to help avoid Split-Brain.
Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.
This is referred to as the 'heartbeat cluster history'
Each controller then includes its cluster history in each heartbeat
pulse request message.
The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.
So when the hbsClient receives a pulse request from controller-0 it
saves its reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request ; i.e. its view of the system.
Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. The same applies to controller-1 with controller-0's history.
The hbsAgent is then further enhanced to support a query request
for this information.
So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide which
controller to make active.
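A rough sketch of the criss-cross described above; the type, sizes and
function name are illustrative assumptions, not the actual hbsClient code:

    #define MAX_CONTROLLERS 2

    struct cluster_history
    {
        /* per-network heartbeat response summary for the last ~5 seconds */
        unsigned char bytes[256] ;
    };

    /* last cluster history received from each controller's pulse request */
    static cluster_history saved_history[MAX_CONTROLLERS] ;

    void handle_pulse_request ( int from_controller,
                                const cluster_history & reported,
                                cluster_history & response_out )
    {
        /* save the requesting controller's reported history ... */
        saved_history[from_controller] = reported ;

        /* ... and respond with the view last reported by its peer so each
         * controller learns the other's view of the system */
        int peer = ( from_controller == 0 ) ? 1 : 0 ;
        response_out = saved_history[peer] ;
    }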
This involved removing the hbsAgent process from SM control and monitor
and adding a new hbsAgent LSB init script for process launch, service
file to run the init script and pmon config file for hbsAgent process
monitoring.
With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.
The hbsAgent running on the inactive controller
- does not send heartbeat events to maintenance
- does not raise or clear alarms or produce customer logs
Test Plan:
Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms
Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump
Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push
Story: 2003576
Task: 24907
Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The warnings might be treated as errors when the build system uses a
compiler, for example gcc/g++ 8.2.1, with the "-O2 -Wall -Wextra -Werror" options.
Story: 2004134
Task: 27591
Change-Id: I576a8c0305a4c32772fbc750ef39c73334b19336
Signed-off-by: Yong Hu <yong.hu@intel.com>
This is part one of a two-part HA Improvements feature that introduces
the collection of heartbeat health at the system level.
The full feature is intended to provide service management (SM)
with the last 2 seconds of maintenance's heartbeat health view that
is reflective of each controller's connectivity to each host
including its peer controller.
The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two controller system.
Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat
service populates and maintains a local copy of this structure
with data that reflects the responsiveness for each monitored
network of all the monitored hosts for the last 20 heartbeat
periods. Mtce sends the current cluster summary to SM upon request.
General flow of cluster feature wrt hbsAgent:
hbs_cluster_init: general data init
hbs_cluster_nums: set controller and network numbers
forever:
select:
hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent
hbs_sm_handler -> hbs_cluster_send: - send cluster to SM
heartbeating:
hbs_cluster_append: add controller cluster to pulse request
hbs_cluster_update: get controller cluster data from pulse responses
hbs_cluster_save: save other controller cluster view in cluster vault
hbs_cluster_log: log cluster state changes (clog)
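A minimal sketch of what such a published cluster summary might look like;
the field names and sizes are assumptions for illustration, not the actual
shared mtce/SM definition:

    #include <stdint.h>

    #define HISTORY_DEPTH 20   /* last 20 heartbeat periods, per the text above */
    #define MAX_NETWORKS   2   /* e.g. management and infrastructure            */

    struct network_history
    {
        uint16_t hosts_monitored ;
        uint16_t hosts_responding[HISTORY_DEPTH] ; /* responders per period            */
        uint16_t oldest_index ;                    /* ring-buffer position of oldest   */
        uint8_t  storage0_responding ;             /* storage-0 health, if provisioned */
    };

    struct cluster_summary
    {
        uint16_t        version ;
        uint16_t        controller ;               /* which controller produced this view */
        network_history network[MAX_NETWORKS] ;
    };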
Test Plan:
PASS: Verify compute system install
PASS: Verify storage system install
PASS: Verify cluster data ; all members of structure
PASS: Verify storage-0 state management
PASS: Verify add of second controller
PASS: Verify add of storage-0 node
PASS: Verify behavior over Swact
PASS: Verify lock/unlock of second controller ; overall behavior
PASS: Verify lock/unlock of storage-0 ; overall behavior
PASS: Verify lock/unlock of storage-1 ; overall behavior
PASS: Verify lock/unlock of compute nodes ; overall behavior
PASS: Verify heartbeat failure and recovery of compute node
PASS: Verify heartbeat failure and recovery of storage-0
PASS: Verify heartbeat failure and recovery of controller
PASS: Verify delete of controller node
PASS: Verify delete of storage-0
PASS: Verify delete of compute node
PASS: Verify cluster when controller-1 active / controller-0 disabled
PASS: Verify MNFA and recovery handling
PASS: Verify handling in presence of multiple failure conditions
PASS: Verify hbsAgent memory leak soak test with continuous SM query.
PASS: Verify active controller-1 infra network failure behavior.
PASS: Verify inactive controller-1 infra network failure behavior.
Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
Currently, the management interface can be shared with infrastructure only
over a VLAN. This update supports both management and infrastructure network
sharing a single interface.
Story: 2003087
Task: 23171
Depends-On: https://review.openstack.org/#/c/601156
Change-Id: Ie97dbd1260f5c98d7401b0e48361ebd87f060f65
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
Maintenance is seen to intermittently fail Swact requests early
after initial system provisioning, without logging an error
reason, only to always succeed later on.
The issue is difficult to reproduce so this update adds extra
logging to this code path and implements a speculative fix.
The event_base_loop call's non-zero return code was never being
logged. The libevent documentation states that this API will
return 1 while the target has not yet provided any data.
Theory is, because the call is local, that normally it returns
with data even on the first dispatch case. However, during early
system configuration, when the system is busy, that first dispatch
does not complete immediately like it normally does later on.
Speculation is that it instead returns 1, indicating retry, but the
existing code path treats that as a failure.
This update modifies the code to return a PASS if the command
dispatch returns a 1 while the error case of -1 gets enhanced
logging and continues to be treated as a failure.
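A sketch of the described change; it is illustrative only, with PASS/FAIL
and the function name standing in for the actual mtcAgent dispatch code:

    #include <event2/event.h>
    #include <cstdio>

    #define PASS 0
    #define FAIL 1   /* stand-ins for mtce return codes */

    int dispatch_request ( struct event_base * base )
    {
        int rc = event_base_loop ( base, EVLOOP_NONBLOCK ) ; /* flag is illustrative */
        if (( rc == 0 ) || ( rc == 1 ))
        {
            /* 0: events were dispatched.
             * 1: nothing ready yet ; per this update treat as PASS/retry
             *    rather than a failure. */
            return PASS ;
        }
        /* rc == -1 : real libevent error ; now logged rather than silent */
        fprintf ( stderr, "event_base_loop failed (rc:%d)\n", rc ) ;
        return FAIL ;
    }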
Test Plan:
PASS: Swact 5 times
PASS: Lock/Unlock Host
PASS: Large System DOR
Related Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Change-Id: I19b22e07d3224b2e9dd3f3569ecbe9aed7d9402f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The current maintenance heartbeat failure action handling is to Fail
and Gracefully Recover the host. This means that maintenance will
ensure that a heartbeat failed host is rebooted/reset before it is
recovered but will avoid rebooting it a second time if its recovered
uptime indicates that it has already rebooted.
This update expands that single action handling behavior to support
three new actions. In doing so it adds a new configuration service
parameter called heartbeat_failure_action. The customer can configure
this new parameter with any one of the following 4 actions in order of
decreasing impact.
fail - Host is failed and gracefully recovered.
- Current Network specific alarms continue to be raised/cleared.
Note: Prior to this update this was standard system behavior.
degrade - Host is only degraded while it is failing heartbeat.
- Current Network specific alarms continue to be raised/cleared.
- heartbeat degrade reason is cleared as are the alarms when
heartbeat responses resume.
alarm - The only indication of a heartbeat failure is by alarm.
- Same set of alarms as in above action cases
- Only in this case no degrade, no failure, no reboot/reset
none - Heartbeat is disabled ; no multicast heartbeat message is sent.
- All existing heartbeat alarms are cleared.
- The heartbeat soak as part of the enable sequence is bypassed.
The selected action is a system wide setting.
The selected setting also applies to Multi-Node Failure Avoidance.
The default action is the legacy action Fail.
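A hypothetical sketch of acting on the new parameter; the enum, parsing and
helper names are illustrative only and not the mtce source:

    #include <string>

    enum class hb_action { FAIL, DEGRADE, ALARM, NONE };

    void raise_alarms ()     { /* raise/clear the network-specific heartbeat alarms */ }
    void degrade_host ()     { /* degrade the host while it is failing heartbeat    */ }
    void fail_and_recover () { /* fail the host and run Graceful Recovery           */ }

    hb_action parse_action ( const std::string & value )
    {
        if ( value == "degrade" ) return hb_action::DEGRADE ;
        if ( value == "alarm"   ) return hb_action::ALARM ;
        if ( value == "none"    ) return hb_action::NONE ;
        return hb_action::FAIL ; /* legacy default */
    }

    void on_heartbeat_loss ( hb_action action )
    {
        switch ( action )
        {
            case hb_action::FAIL:    raise_alarms() ; degrade_host() ; fail_and_recover() ; break ;
            case hb_action::DEGRADE: raise_alarms() ; degrade_host() ; break ;
            case hb_action::ALARM:   raise_alarms() ; break ;
            case hb_action::NONE:    /* heartbeat disabled ; nothing to do */ break ;
        }
    }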
This update also
1. Removes redundant inservice failure alarm for MNFA case in support
of degrade only action. Keeping it would make that alarm handling
case unnecessarily complicated.
2. Removes the no longer used 'hbs calibration' code (cleanup).
3. Cleans up a small amount of heartbeat logging.
Test Plan:
PASS: fail: Verify MNFA and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery (from none)
PASS: degrade: Verify MNFA and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm)
PASS: alarm: Verify MNFA and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery (from degrade)
PASS: none: Verify heartbeat disable, fail ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery (from fail)
PASS: Verify action change behavior from none to alarm with active MNFA
PASS: Verify action change behavior from alarm to degrade with active MNFA
PASS: Verify action change behavior from degrade to none with active MNFA
PASS: Verify action change behavior from none to fail with active MNFA
PASS: Verify action change behavior from fail to none with active MNFA
PASS: Verify action change behavior from degrade to fail then MNFA timeout
PASS: Verify all heartbeat action change customer logs
PASS: verify heartbeat stats clear over action change
PASS: Verify LO DOR (several large labs - compute and storage systems)
PASS: Verify recovery from failure of active controller
PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail)
PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail)
Depends-On: https://review.openstack.org/601264
Change-Id: Iede5cdbb1c923898fd71b3a95d5289182f4287b4
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Use flake8 as the pep8 tool.
Enable check and gate for pep8 (voting).
Fix below flake8 issues:
E127 continuation line over-indented for visual indent
E211 whitespace before '('
E222 multiple spaces after operator
E302 expected 2 blank lines, found 1
E501 line too long (101 > 79 characters)
E502 the backslash is redundant between brackets
F401 'platform' imported but unused
W391 blank line at end of file
Change-Id: Idfb953e52c8ee35c2adefdf0e4143a381c7f49e2
Story: 2003426
Task: 24596
Signed-off-by: Sun Austin <austin.sun@intel.com>
Fix below linter issues:
E001 Trailing Whitespace
E003 Indent not multiple of 4
E006 Line too long
E011 Then keyword is not on same line as if or elif keyword
E020 Function declaration not in format ^function name {$
E040 Syntax error: syntax error near unexpected token `;'
Ignore cases added in tox setup:
E006 Line too long
E010: do not on the same line as for
Story: 2003368
Task: 24427
Change-Id: I6acf64271a4e608be8bc8fa965cac4fa31e0c05b
Signed-off-by: Sun Austin <austin.sun@intel.com>
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all those hosts until heartbeat resumes
or after a set period of time.
This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the host threshold set to 3 and the timeout set to 100 secs.
This update implements enhancements to that existing feature by
making the 'number-of-hosts threshold' and 'timeout period'
customer configurable service parameters.
The new service parameters are listed under platform:maintenance and are
displayed with the following command
> system service-parameter-list
mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that must
fail heartbeat as a group, within the heartbeat failure window
(heartbeat_failure_threshold), before maintenance activates MNFA mode.
This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.
mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drop below the mnfa_threshold or this
timer expires. The MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout making the
deactivation criteria solely host based.
This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid time range from 100 to 86400 secs (1 day).
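A hypothetical sketch of how the two parameters drive MNFA mode; the
structures and audit function are illustrative only, not the mtcAgent code:

    struct mnfa_config
    {
        int threshold ;     /* mnfa_threshold : default now 2 (was 3)      */
        int timeout_secs ;  /* mnfa_timeout   : default now 0 = no timeout */
    };

    struct mnfa_state
    {
        bool active ;
        int  failed_hosts ;
        int  active_secs ;
    };

    void mnfa_audit ( const mnfa_config & cfg, mnfa_state & state )
    {
        if ( !state.active && ( state.failed_hosts >= cfg.threshold ))
        {
            state.active = true ;       /* activate: treat failing hosts as a group */
            state.active_secs = 0 ;
        }
        else if ( state.active )
        {
            bool below_threshold = ( state.failed_hosts < cfg.threshold ) ;
            bool timed_out = ( cfg.timeout_secs != 0 ) &&
                             ( ++state.active_secs >= cfg.timeout_secs ) ;
            if ( below_threshold || timed_out )
                state.active = false ;  /* deactivate: gracefully recover the
                                         * remaining failed hosts individually */
        }
    }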
Test Plan:
PASS - Verify duplex and 4 compute DOR
PASS - Verify default MNFA - 1 inactive controller and 4 computes
PASS - Verify default MNFA - 4 computes
PASS - Verify default MNFA - 1 active controller and 3 computes and failed host
PASS - Verify Single host heartbeat failure handling - fail host
PASS - Verify Multi Node failure below mnfa_threshold - fail hosts
PASS - Verify MNFA handling with timeout of zero and threshold of 3
PASS - Verify MNFA timeout handling with timeout set at 100 sec
PASS - Verify MNFA service parameter listing, default value and mtc.ini
PASS - Verify MNFA service parameter change and inservice apply
PASS - Verify MNFA timeout service parameter change from value to 0
PASS - Verify MNFA timeout service parameter change from 0 to an in-range value
PASS - Verify MNFA service parameter out of range change handling
PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active)
DocImpact
Story: 2003576
Task: 24903
Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
All compute hosts were seen to self-reboot by hostw during patching due
to a stuck pmond process.
The current method of killing a running process leads to a race condition
that results in a user space futex deadlock that hangs pmond and
results in a watchdog self-reset due to quorum master 'pmond' failure.
The deadlock was traced to the ordering of the kill process.
Current steps to kill:
- kill process
- remove pidfile
- unregister pid with kernel
Deadlock is avoided by reversing the kill steps to what
is more logical.
- unregister pid with kernel
- remove pidfile
- kill process
Also introduced an audit that registers manually restarted processes
with the kernel.
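A sketch of the reordered kill sequence described above; the function name
and the kernel (un)registration stub are assumptions, not the pmond source:

    #include <signal.h>
    #include <unistd.h>

    /* hypothetical stand-in for pmond's kernel pid unregistration */
    void kernel_unregister_pid ( int pid ) { (void) pid ; }

    void kill_running_process ( int pid, const char * pidfile )
    {
        /* old order (kill, remove pidfile, unregister) raced and could
         * deadlock pmond ; the fix reverses it: */
        kernel_unregister_pid ( pid ) ;  /* 1. unregister pid with the kernel */
        unlink ( pidfile ) ;             /* 2. remove the pidfile             */
        kill ( pid, SIGKILL ) ;          /* 3. kill the process               */
    }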
Failure Rate Before Fix: 1 every 25 process restarts.
Mostly fails before 5.
Failure Rate After Fix: No failures after 15000 process restarts
across 8 hosts, including all host types, between 2 different labs and
2 different loads (18.07 and 18.08).
Test Method: Pmon restart regression test restarts all processes on
a host. Total soak restart of 25 monitored processes for 50 loops
over 12 hosts = 15000 restarts.
Also regressed process kill / recovery handling.
(5000 process recoveries)
Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007
Two Node System: VMs did not switch to ERROR state after host reboot
A logically failed (rebooted) active controller is not being
administratively failed by maintenance. As a result the host's
offline availability state is not reported to the VIM and the
VMs on that (rebooted) All-in-one host are not evacuated.
This issue only applies to two node systems because of how the heartbeat
enable of an All-in-one host needs to be held off until its compute
manifests apply in the DOR case so as to avoid maintenance failing the
peer controller over a DOR.
The challenge in maintenance is to distinguish between this spontaneous
failure and a DOR. For All-in-one hosts, DOR mode is active for a
whopping 600 seconds ; long enough to account for both sets of manifests
to apply.
It's that long delay that is making this silent fault stand out so
obviously.
This update uses 'active DOR mode' to decide whether or not to enable a
host's heartbeat in the add handler.
To better handle early active controller failure, the qualifier for DOR
mode was reduced from 20 to 15 minutes, meaning that maintenance DOR
mode is activated if the host uptime is less than 15 minutes rather
than 20 as it was before this update. Note that normally the active
controller starts maintenance with an uptime of 5-7 minutes.
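The qualifier change amounts to something like the following; a sketch only,
with an illustrative name rather than the actual mtcAgent variable:

    bool dor_mode_active ( unsigned int host_uptime_secs )
    {
        const unsigned int DOR_MODE_QUALIFIER_SECS = 15 * 60 ; /* was 20 minutes */
        return ( host_uptime_secs < DOR_MODE_QUALIFIER_SECS ) ;
    }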
Story: 2002995
Task: 23009
Change-Id: I749aefef45b9db6e86a2c6b81d131ebeccc68926
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
The mtcAgent process has been seen to segfault and coredump on process
exit.
The exit code iterates over a C++ list that can change due to http
interrupt response handling.
The dump code is commented out with a note indicating why and when it
could be re-enabled.
Change-Id: Ie4ef684a65ded533c347ae07fdfa47f332412f7d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002994
Task: 23008
When the Hardware Monitor starts up it reads existing alarms and sensor
state from the sysinv database. It then uses this pre-existing state to
align its internal structure accordingly moving forward.
The hardware monitor manage_startup_states utility is incorrectly
requesting degrade clear rather than degrade set in response to finding
a pre-existing critical sensor assertion on process startup.
This update fixes this issue by calling the set_degraded_state rather
than clear_degraded_state against this sensor in this case.
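A small sketch of the corrected startup handling; only the set-versus-clear
call is from the text above, the surrounding structure and helpers are
assumptions:

    struct sensor_type { bool preexisting_critical_alarm ; };

    void set_degraded_state   ( sensor_type & ) { /* assert degrade for this sensor */ }
    void clear_degraded_state ( sensor_type & ) { /* clear degrade for this sensor  */ }

    void manage_startup_states ( sensor_type & sensor )
    {
        if ( sensor.preexisting_critical_alarm )
            set_degraded_state ( sensor ) ; /* was incorrectly clear_degraded_state */
    }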
Change-Id: Ic1ecc1f11d7a729c16da63c6d43b7d758bb9e467
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002882
Task: 22845
As part of the changes to make sm-api independent, calling sm-api
requires keystone authentication.
This change enables mtce to call the SM REST API with keystone
authentication.
Story: 2002827
Task: 22744
Change-Id: If3b58d3e36b9bd7fd88829d61e9c1daa00ab5048
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Introduction of the PTP service requires the NTP service to be disabled.
Process monitoring of the NTP daemon must be turned off as well.
There is no way to start/stop process monitoring from MTCE, but
Puppet can check NTP status at startup and enable/disable monitoring.
So the NTP-related PMON script needs to move from MTCE to Puppet.
This is the first step: removing NTP references from MTCE.
Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01
Story: 2002935
Task: 24520
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
If the first mtcAlive message from a host that was supposed to be
rebooted reports uptime in excess of 40 minutes then that means it did
not reboot as expected.
This was seen to happen during an extended offline case where the host
failed heartbeat, then was reported offline during Graceful Recovery
which forced a full enable. When the host eventually came back online
its reported uptime made it clear that it never rebooted but mtce
allowed it to come into service anyway.
This is a security issue: a host could disappear, be compromised, and
be brought back into the system without a reboot.
To fix that, this update requires that a host's uptime, reported in its
first mtcAlive message, indicate that it has been up for less than twice
the configured mtcAlive timeout, or the enable will fail until the host
is proven to have reset.
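A sketch of the uptime sanity check described above; the names are
illustrative, only the threshold relationship is from the text:

    bool reboot_verified ( unsigned int first_mtcalive_uptime_secs,
                           unsigned int mtcalive_timeout_secs )
    {
        /* the host must report an uptime of less than twice the configured
         * mtcAlive timeout to be considered freshly rebooted */
        return ( first_mtcalive_uptime_secs < ( 2 * mtcalive_timeout_secs )) ;
    }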
Story: 2002882
Task: 22845
Change-Id: I9b3ff0bc1ba5af2ca5b07a58db9da9f288b59576
Signed-off-by: Jack Ding <jack.ding@windriver.com>
In the event of a Heartbeat Failure of a host, the Mtce Heartbeat Agent
will declare heartbeat recovery upon the first successful heartbeat
reply after the loss is declared ; basically an edge-triggered
recovery.
In cases where a networking issue causes heartbeat loss of a group of
hosts, Maintenance tracks the group of hosts that experienced heartbeat
loss and puts the system into 'Multi Node Failure Avoidance' mode.
Maintenance then simply waits up to a configured timeout period for
hosts to regain heartbeat.
As heartbeat is regained for each host, that host is 'Gracefully
Recovered'.
However, if the networking issue persists in a way that lets the
occasional transient heartbeat pulse through, then the maintenance
system can prematurely take hosts, and then 'the system', out of MNFA
mode only to find that heartbeat has not actually recovered, and then
fail and force a reboot/reset of each node that is still experiencing
heartbeat loss.
This update changes the heartbeat service from an 'edge' to a 'level'
sensitive recovery by requiring a number of back-to-back heartbeat pulses
following a failure before that host is declared as recovered and pulled
out of the MNFA pool.
Basically, this update makes the system's MNFA recovery algorithm more
robust in the face of transient heartbeat loss for a group of hosts.
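A hypothetical sketch of level-sensitive recovery; the structure, pulse
count and names are illustrative only, not the hbsAgent source:

    struct hb_host
    {
        bool failed ;
        int  b2b_good_pulses ;
    };

    const int HBS_RECOVERY_B2B_PULSES = 4 ; /* illustrative value only */

    void on_pulse_response ( hb_host & host, bool responded )
    {
        if ( !host.failed ) return ;
        if ( responded )
        {
            if ( ++host.b2b_good_pulses >= HBS_RECOVERY_B2B_PULSES )
            {
                host.failed = false ;       /* declare recovered and       */
                host.b2b_good_pulses = 0 ;  /* remove from the MNFA pool   */
            }
        }
        else
        {
            host.b2b_good_pulses = 0 ;      /* any miss restarts the count */
        }
    }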
Story: 2002882
Task: 22845
Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>
Adds support to the mtcAgent for detecting the absence of the 'host
services execution enhancement feature' in the mtcClient and falls back
to the pre-upgrade behavior in that case. When mtcAgent tries to lock
a storage node running a pre-upgrade version it will implement a 90s
lock wait before proceeding to declare that storage host as
locked-disabled.
Story: 2002886
Task: 22847
Change-Id: I99fb5576e027621019adb5eff553d52773f608db
Signed-off-by: Jack Ding <jack.ding@windriver.com>