This update replaces 'compute' references with 'worker' in mtce,
kickstarts, installer and bsp files.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked and enabled with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.
Once a host reaches a particular failure mode's maximum failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way, with no further recovery action, until
it is manually locked and unlocked.
The thresholded Enable failure causes are
Configuration Failure ....... threshold:2 retry interval:30 secs
In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
Start Host Services Failure . threshold:2 retry interval:30 secs
Heartbeat Soak Failure ...... threshold:2 retry interval:10 mins
This update refactors the old auto recovery for AIO SX into this
more generic framework.
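For illustration only, a minimal sketch of the thresholded counting this
framework implies; the names, types and policy table are hypothetical,
not the actual mtcAgent implementation:

    #include <map>
    #include <string>

    enum autorecovery_cause { AR_CONFIG, AR_GOENABLED, AR_HOST_SERVICES, AR_HEARTBEAT };

    struct ar_policy { int threshold ; int retry_interval_secs ; };

    /* illustrative policy table matching the thresholds listed above */
    static const std::map<autorecovery_cause, ar_policy> ar_policies = {
        { AR_CONFIG,        { 2,  30 } },
        { AR_GOENABLED,     { 2,  30 } },
        { AR_HOST_SERVICES, { 2,  30 } },
        { AR_HEARTBEAT,     { 2, 600 } },
    };

    /* per-host, per-cause back-to-back failure counters */
    static std::map<std::string, std::map<autorecovery_cause, int>> ar_counts ;

    /* true  : retry the Enable after the cause's retry interval
     * false : leave the host unlocked-disabled-failed until it is
     *         manually locked and unlocked */
    bool autorecovery_allowed ( const std::string & host, autorecovery_cause cause )
    {
        int & count = ar_counts[host][cause] ;
        return ( ++count < ar_policies.at(cause).threshold ) ;
    }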
Story: 2003576
Task: 24905
Test Plan:
PASS: Verify AIO SX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling
PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)
PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS; Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message
Regression:
PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: verify in-service heartbeat failure handling
PASS: verify no segfaults during UT
Corner Cases:
PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.
Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A few small issues were found during integration testing with SM.
This update delivers those integration tested fixes.
1. Send cluster event to SM only after the first 10 heartbeat
pulses are received.
2. Only send inventory to hbsAgent on provisioned controllers.
3. Add new OOB SM_UNHEALTHY flag to detect and act on an SM
declared unhealthy controller.
4. Network monitoring enable fix.
5. Fix oldest entry tracking when a network history is not full.
6. Prevent clearing local uptime for a host that is being enabled.
7. Refactor cluster state change notification logging and handling.
These fixes were both UT and IT tested in multiple labs
Change-Id: I28485f241ac47bb3ed3ec1e2a8f4c09a1ca2070a
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A number of Makefiles use '[[', a bash construct, in the test that sets
STATIC_ANALYSIS_TOOL_EXISTS. Set SHELL=/bin/bash so those tests run under bash.
Change-Id: Ie9536d7cafd518f3e65acf38ac5b30aa7536ea79
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
This behavior is documented in json_object.h from version 0.11:
https://github.com/json-c/json-c/blob/json-c-0.11/json_object.h#L271
On json-c 0.11 there is no assert that checks the ref_count, so we do not
crash. But on json-c 0.13.1 (the latest release), json_object_put checks
the ref_count first, so mtcAgent will crash.
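For context, a small sketch of this class of reference-counting bug; it is
illustrative only and not the exact mtcAgent code path. In json-c,
json_object_object_get_ex() returns a borrowed reference owned by the
parent object, so calling json_object_put on it is an over-release:

    #include <json-c/json.h>
    #include <stdio.h>

    void handle_response ( const char * buf )
    {
        struct json_object * root = json_tokener_parse ( buf ) ; /* new (owned) reference */
        struct json_object * status = NULL ;

        if ( root && json_object_object_get_ex ( root, "status", &status ) )
        {
            printf ( "status: %s\n", json_object_get_string ( status ) ) ;
            /* WRONG: json_object_put ( status ) ;
             * Tolerated silently on json-c 0.11, but json-c 0.13.1 asserts
             * on the ref_count and aborts the process. */
        }
        json_object_put ( root ) ; /* correct: release only the owned reference */
    }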
Test Done:
Ran mtcAgent with json-c version 0.13.1 with this patch; no crash found.
Closes-Bug: 1807097
Change-Id: I35e5c1cad2e16ee0b6fc639380f1bdd3b64a7018
Signed-off-by: Yan Chen <yan.chen@intel.com>
This behavior is documented in json_object.h from version 0.11:
https://github.com/json-c/json-c/blob/json-c-0.11/json_object.h#L271
On json-c 0.11 there is no assert that checks the ref_count, so we do not
crash. But on json-c 0.13.1 (the latest release), json_object_put checks
the ref_count first, so mtcAgent will crash.
Test Done:
Ran mtcAgent with json-c version 0.13.1 with this patch; no crash found.
Closes-Bug: 1807097
Change-Id: I7f954c97804ae01f831c94a36b9dbdbb34dbf083
Signed-off-by: Yan Chen <yan.chen@intel.com>
A recent update to stx-metal/mtce-common removed a daemon_config
structure member that the stx-nfv/mtce-guest git depends on.
This was not detected during UT of the mtce-common change because
of a missing build dependency that should force a rebuild of
mtce-guest.
Delivering the code fix to unblock the community.
Will deliver the build dependency change shortly.
Change-Id: Ice08424f156ffc84e38651fbc40ebc184170eb20
Closes-Bug: 1804579
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update introduces mtce changes to support Active-Active Heartbeating.
The purpose of Active-Active Heartbeating is to help avoid Split-Brain.
Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.
This is referred to as the 'heartbeat cluster history'
Each controller then includes its cluster history in each heartbeat
pulse request message.
The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.
So when the hbsClient receives a pulse request from controller-0 it
saves its reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request ; i.e. its view of the system.
Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. The same applies to controller-1 with controller-0's history.
The hbsAgent is then further enhanced to support a query request
for this information.
So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide which
controller to make active.
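A rough sketch of the criss-cross described above; the type, sizes and
function name are illustrative assumptions, not the actual hbsClient code:

    #define MAX_CONTROLLERS 2

    struct cluster_history
    {
        /* per-network heartbeat response summary for the last ~5 seconds */
        unsigned char bytes[256] ;
    };

    /* last cluster history received from each controller's pulse request */
    static cluster_history saved_history[MAX_CONTROLLERS] ;

    void handle_pulse_request ( int from_controller,
                                const cluster_history & reported,
                                cluster_history & response_out )
    {
        /* save the requesting controller's reported history ... */
        saved_history[from_controller] = reported ;

        /* ... and respond with the view last reported by its peer so each
         * controller learns the other's view of the system */
        int peer = ( from_controller == 0 ) ? 1 : 0 ;
        response_out = saved_history[peer] ;
    }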
This involved removing the hbsAgent process from SM control and monitor
and adding a new hbsAgent LSB init script for process launch, service
file to run the init script and pmon config file for hbsAgent process
monitoring.
With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.
The hbsAgent running on the inactive controller
- does not send heartbeat events to maintenance
- does not raise or clear alarms or produce customer logs
Test Plan:
Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms
Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump
Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push
Story: 2003576
Task: 24907
Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The warnings might be treated as errors when the build system uses a
compiler, for example gcc/g++ 8.2.1, with the "-O2 -Wall -Wextra -Werror" options.
Story: 2004134
Task: 27591
Change-Id: I576a8c0305a4c32772fbc750ef39c73334b19336
Signed-off-by: Yong Hu <yong.hu@intel.com>
This is part one of a two-part HA Improvements feature that introduces
the collection of heartbeat health at the system level.
The full feature is intended to provide service management (SM)
with the last 2 seconds of maintenance's heartbeat health view that
is reflective of each controller's connectivity to each host
including its peer controller.
The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two controller system.
Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat
service populates and maintains a local copy of this structure
with data that reflects the responsiveness for each monitored
network of all the monitored hosts for the last 20 heartbeat
periods. Mtce sends the current cluster summary to SM upon request.
General flow of cluster feature wrt hbsAgent:
hbs_cluster_init: general data init
hbs_cluster_nums: set controller and network numbers
forever:
select:
hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent
hbs_sm_handler -> hbs_cluster_send: - send cluster to SM
heartbeating:
hbs_cluster_append: add controller cluster to pulse request
hbs_cluster_update: get controller cluster data from pulse responses
hbs_cluster_save: save other controller cluster view in cluster vault
hbs_cluster_log: log cluster state changes (clog)
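A minimal sketch of what such a published cluster summary might look like;
the field names and sizes are assumptions for illustration, not the actual
shared mtce/SM definition:

    #include <stdint.h>

    #define HISTORY_DEPTH 20   /* last 20 heartbeat periods, per the text above */
    #define MAX_NETWORKS   2   /* e.g. management and infrastructure            */

    struct network_history
    {
        uint16_t hosts_monitored ;
        uint16_t hosts_responding[HISTORY_DEPTH] ; /* responders per period            */
        uint16_t oldest_index ;                    /* ring-buffer position of oldest   */
        uint8_t  storage0_responding ;             /* storage-0 health, if provisioned */
    };

    struct cluster_summary
    {
        uint16_t        version ;
        uint16_t        controller ;               /* which controller produced this view */
        network_history network[MAX_NETWORKS] ;
    };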
Test Plan:
PASS: Verify compute system install
PASS: Verify storage system install
PASS: Verify cluster data ; all members of structure
PASS: Verify storage-0 state management
PASS: Verify add of second controller
PASS: Verify add of storage-0 node
PASS: Verify behavior over Swact
PASS: Verify lock/unlock of second controller ; overall behavior
PASS: Verify lock/unlock of storage-0 ; overall behavior
PASS: Verify lock/unlock of storage-1 ; overall behavior
PASS: Verify lock/unlock of compute nodes ; overall behavior
PASS: Verify heartbeat failure and recovery of compute node
PASS: Verify heartbeat failure and recovery of storage-0
PASS: Verify heartbeat failure and recovery of controller
PASS: Verify delete of controller node
PASS: Verify delete of storage-0
PASS: Verify delete of compute node
PASS: Verify cluster when controller-1 active / controller-0 disabled
PASS: Verify MNFA and recovery handling
PASS: Verify handling in presence of multiple failure conditions
PASS: Verify hbsAgent memory leak soak test with continuous SM query.
PASS: Verify active controller-1 infra network failure behavior.
PASS: Verify inactive controller-1 infra network failure behavior.
Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
Currently, the management interface can be shared with infrastructure only
over a VLAN. This update supports both management and infrastructure network
sharing a single interface.
Story: 2003087
Task: 23171
Depends-On: https://review.openstack.org/#/c/601156
Change-Id: Ie97dbd1260f5c98d7401b0e48361ebd87f060f65
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
Maintenance is seen to intermittently fail Swact requests early
after initial system provisioning, without logging an error
reason, only to always succeed later on.
The issue is difficult to reproduce so this update adds extra
logging to this code path and implements a speculative fix.
The event_base_loop call's non-zero return code was never being
logged. The libevent documentation states that this API will
return 1 while the target has not yet provided any data.
Theory is, because the call is local, that normally it returns
with data even on the first dispatch case. However, during early
system configuration, when the system is busy, that first dispatch
does not complete immediately like it normally does later on.
Speculation is that it instead returns 1, indicating retry, but the
existing code path treats that as a failure.
This update modifies the code to return a PASS if the command
dispatch returns a 1 while the error case of -1 gets enhanced
logging and continues to be treated as a failure.
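A sketch of the described change; it is illustrative only, with PASS/FAIL
and the function name standing in for the actual mtcAgent dispatch code:

    #include <event2/event.h>
    #include <cstdio>

    #define PASS 0
    #define FAIL 1   /* stand-ins for mtce return codes */

    int dispatch_request ( struct event_base * base )
    {
        int rc = event_base_loop ( base, EVLOOP_NONBLOCK ) ; /* flag is illustrative */
        if (( rc == 0 ) || ( rc == 1 ))
        {
            /* 0: events were dispatched.
             * 1: nothing ready yet ; per this update treat as PASS/retry
             *    rather than a failure. */
            return PASS ;
        }
        /* rc == -1 : real libevent error ; now logged rather than silent */
        fprintf ( stderr, "event_base_loop failed (rc:%d)\n", rc ) ;
        return FAIL ;
    }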
Test Plan:
PASS: Swact 5 times
PASS: Lock/Unlock Host
PASS: Large System DOR
Related Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Change-Id: I19b22e07d3224b2e9dd3f3569ecbe9aed7d9402f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The current maintenance heartbeat failure action handling is to Fail
and Gracefully Recover the host. This means that maintenance will
ensure that a heartbeat failed host is rebooted/reset before it is
recovered but will avoid rebooting it a second time if its recovered
uptime indicates that it has already rebooted.
This update expands that single action handling behavior to support
three new actions. In doing so it adds a new configuration service
parameter called heartbeat_failure_action. The customer can configure
this new parameter with any one of the following 4 actions in order of
decreasing impact.
fail - Host is failed and gracefully recovered.
- Current Network specific alarms continue to be raised/cleared.
Note: Prior to this update this was standard system behavior.
degrade - Host is only degraded while it is failing heartbeat.
- Current Network specific alarms continue to be raised/cleared.
- heartbeat degrade reason is cleared as are the alarms when
heartbeat responses resume.
alarm - The only indication of a heartbeat failure is by alarm.
- Same set of alarms as in above action cases
- Only in this case no degrade, no failure, no reboot/reset
none - Heartbeat is disabled ; no multicast heartbeat message is sent.
- All existing heartbeat alarms are cleared.
- The heartbeat soak as part of the enable sequence is bypassed.
The selected action is a system wide setting.
The selected setting also applies to Multi-Node Failure Avoidance.
The default action is the legacy action Fail.
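A hypothetical sketch of acting on the new parameter; the enum, parsing and
helper names are illustrative only and not the mtce source:

    #include <string>

    enum class hb_action { FAIL, DEGRADE, ALARM, NONE };

    void raise_alarms ()     { /* raise/clear the network-specific heartbeat alarms */ }
    void degrade_host ()     { /* degrade the host while it is failing heartbeat    */ }
    void fail_and_recover () { /* fail the host and run Graceful Recovery           */ }

    hb_action parse_action ( const std::string & value )
    {
        if ( value == "degrade" ) return hb_action::DEGRADE ;
        if ( value == "alarm"   ) return hb_action::ALARM ;
        if ( value == "none"    ) return hb_action::NONE ;
        return hb_action::FAIL ; /* legacy default */
    }

    void on_heartbeat_loss ( hb_action action )
    {
        switch ( action )
        {
            case hb_action::FAIL:    raise_alarms() ; degrade_host() ; fail_and_recover() ; break ;
            case hb_action::DEGRADE: raise_alarms() ; degrade_host() ; break ;
            case hb_action::ALARM:   raise_alarms() ; break ;
            case hb_action::NONE:    /* heartbeat disabled ; nothing to do */ break ;
        }
    }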
This update also
1. Removes redundant inservice failure alarm for MNFA case in support
of degrade only action. Keeping it would make that alarm handling
case unnecessarily complicated.
2. Removes the no longer used 'hbs calibration' code (cleanup).
3. Cleans up a small amount of heartbeat logging.
Test Plan:
PASS: fail: Verify MNFA and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery (from none)
PASS: degrade: Verify MNFA and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm)
PASS: alarm: Verify MNFA and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery (from degrade)
PASS: none: Verify heartbeat disable, fail ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery (from fail)
PASS: Verify action change behavior from none to alarm with active MNFA
PASS: Verify action change behavior from alarm to degrade with active MNFA
PASS: Verify action change behavior from degrade to none with active MNFA
PASS: Verify action change behavior from none to fail with active MNFA
PASS: Verify action change behavior from fail to none with active MNFA
PASS: Verify action change behavior from degrade to fail then MNFA timeout
PASS: Verify all heartbeat action change customer logs
PASS: verify heartbeat stats clear over action change
PASS: Verify LO DOR (several large labs - compute and storage systems)
PASS: Verify recovery from failure of active controller
PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail)
PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail)
Depends-On: https://review.openstack.org/601264
Change-Id: Iede5cdbb1c923898fd71b3a95d5289182f4287b4
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Use flake8 as the pep8 tool.
Enable check and gate for pep8 (voting).
Fix below flake8 issues:
E127 continuation line over-indented for visual indent
E211 whitespace before '('
E222 multiple spaces after operator
E302 expected 2 blank lines, found 1
E501 line too long (101 > 79 characters)
E502 the backslash is redundant between brackets
F401 'platform' imported but unused
W391 blank line at end of file
Change-Id: Idfb953e52c8ee35c2adefdf0e4143a381c7f49e2
Story: 2003426
Task: 24596
Signed-off-by: Sun Austin <austin.sun@intel.com>
Fix below linter issues:
E001 Trailing Whitespace
E003 Indent not multiple of 4
E006 Line too long
E011 Then keyword is not on same line as if or elif keyword
E020 Function declaration not in format ^function name {$
E040 Syntax error: syntax error near unexpected token `;'
Ignore cases added in tox setup:
E006 Line too long
E010: do not on the same line as for
Story: 2003368
Task: 24427
Change-Id: I6acf64271a4e608be8bc8fa965cac4fa31e0c05b
Signed-off-by: Sun Austin <austin.sun@intel.com>
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all those hosts until heartbeat resumes
or after a set period of time.
This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the host threshold set to 3 and the timeout set to 100 secs.
This update implements enhancements to that existing feature by
making the 'number-of-hosts threshold' and 'timeout period'
customer configurable service parameters.
The new service parameters are listed under platform:maintenance and are
displayed with the following command
> system service-parameter-list
mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that must
fail heartbeat as a group, within the heartbeat failure window
(heartbeat_failure_threshold), before maintenance activates MNFA mode.
This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.
mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drop below the mnfa_threshold or this
timer expires. The MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout making the
deactivation criteria solely host based.
This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid time range from 100 to 86400 secs (1 day).
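A hypothetical sketch of how the two parameters drive MNFA mode; the
structures and audit function are illustrative only, not the mtcAgent code:

    struct mnfa_config
    {
        int threshold ;     /* mnfa_threshold : default now 2 (was 3)      */
        int timeout_secs ;  /* mnfa_timeout   : default now 0 = no timeout */
    };

    struct mnfa_state
    {
        bool active ;
        int  failed_hosts ;
        int  active_secs ;
    };

    void mnfa_audit ( const mnfa_config & cfg, mnfa_state & state )
    {
        if ( !state.active && ( state.failed_hosts >= cfg.threshold ))
        {
            state.active = true ;       /* activate: treat failing hosts as a group */
            state.active_secs = 0 ;
        }
        else if ( state.active )
        {
            bool below_threshold = ( state.failed_hosts < cfg.threshold ) ;
            bool timed_out = ( cfg.timeout_secs != 0 ) &&
                             ( ++state.active_secs >= cfg.timeout_secs ) ;
            if ( below_threshold || timed_out )
                state.active = false ;  /* deactivate: gracefully recover the
                                         * remaining failed hosts individually */
        }
    }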
Test Plan:
PASS - Verify duplex and 4 compute DOR
PASS - Verify default MNFA - 1 inactive controller and 4 computes
PASS - Verify default MNFA - 4 computes
PASS - Verify default MNFA - 1 active controller and 3 computes and failed host
PASS - Verify Single host heartbeat failure handling - fail host
PASS - Verify Multi Node failure below mnfa_threshold - fail hosts
PASS - Verify MNFA handling with timeout of zero and threshold of 3
PASS - Verify MNFA timeout handling with timeout set at 100 sec
PASS - Verify MNFA service parameter listing, default value and mtc.ini
PASS - Verify MNFA service parameter change and inservice apply
PASS - Verify MNFA timeout service parameter change from value to 0
PASS - Verify MNFA timeout service parameter change from 0 to an in-range value
PASS - Verify MNFA service parameter out of range change handling
PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active)
DocImpact
Story: 2003576
Task: 24903
Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
All compute hosts were seen to self-reboot by hostw during patching due
to a stuck pmond process.
The current method of killing a running process leads to a race condition
that results in a user space futex deadlock that hangs pmond and
results in a watchdog self-reset due to quorum master 'pmond' failure.
The deadlock was traced to the ordering of the kill process.
Current steps to kill:
- kill process
- remove pidfile
- unregister pid with kernel
Deadlock is avoided by reversing the kill steps to what
is more logical.
- unregister pid with kernel
- remove pidfile
- kill process
Also introduced an audit that registers manually restarted processes
with the kernel.
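A sketch of the reordered kill sequence described above; the function name
and the kernel (un)registration stub are assumptions, not the pmond source:

    #include <signal.h>
    #include <unistd.h>

    /* hypothetical stand-in for pmond's kernel pid unregistration */
    void kernel_unregister_pid ( int pid ) { (void) pid ; }

    void kill_running_process ( int pid, const char * pidfile )
    {
        /* old order (kill, remove pidfile, unregister) raced and could
         * deadlock pmond ; the fix reverses it: */
        kernel_unregister_pid ( pid ) ;  /* 1. unregister pid with the kernel */
        unlink ( pidfile ) ;             /* 2. remove the pidfile             */
        kill ( pid, SIGKILL ) ;          /* 3. kill the process               */
    }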
Failure Rate Before Fix: 1 every 25 process restarts.
Mostly fails before 5.
Failure Rate After Fix: No failures after 15000 process restarts
across 8 hosts, including all host types, between 2 different labs and
2 different loads (18.07 and 18.08).
Test Method: Pmon restart regression test restarts all processes on
a host. Total soak restart of 25 monitored processes for 50 loops
over 12 hosts = 15000 restarts.
Also regressed process kill / recovery handling.
(5000 process recoveries)
Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007
Two Node System: VMs did not switch to ERROR state after host reboot
A logically failed (rebooted) active controller is not being
administratively failed by maintenance. As a result the host's
offline availability state is not reported to the VIM and the
VMs on that (rebooted) All-in-one host are not evacuated.
This issue only applies to two node systems because of how the heartbeat
enable of an All-in-one host needs to be held off until its compute
manifests apply in the DOR case so as to avoid maintenance failing the
peer controller over a DOR.
The challenge in maintenance is to distinguish between this spontaneous
failure and a DOR. For All-in-one hosts, DOR mode is active for a
whopping 600 seconds ; long enough to account for both sets of manifests
to apply.
It's that long delay that is making this silent fault stand out so
obviously.
This update uses 'active DOR mode' to decide whether or not to enable a
host's heartbeat in the add handler.
To better handle early active controller failure, the qualifier for DOR
mode was reduced from 20 to 15 minutes, meaning that maintenance DOR
mode is activated if the host uptime is less than 15 minutes rather
than 20 as it was before this update. Note that normally the active
controller starts maintenance with an uptime of 5-7 minutes.
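The qualifier change amounts to something like the following; a sketch only,
with an illustrative name rather than the actual mtcAgent variable:

    bool dor_mode_active ( unsigned int host_uptime_secs )
    {
        const unsigned int DOR_MODE_QUALIFIER_SECS = 15 * 60 ; /* was 20 minutes */
        return ( host_uptime_secs < DOR_MODE_QUALIFIER_SECS ) ;
    }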
Story: 2002995
Task: 23009
Change-Id: I749aefef45b9db6e86a2c6b81d131ebeccc68926
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
The mtcAgent process has been seen to segfault and coredump on process
exit.
The exit code iterates over a C++ list that can change due to http
interrupt response handling.
The dump code is commented out with a note indicating why and when it
could be re-enabled.
Change-Id: Ie4ef684a65ded533c347ae07fdfa47f332412f7d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002994
Task: 23008
When the Hardware Monitor starts up it reads existing alarms and sensor
state from the sysinv database. It then uses this pre-existing state to
align its internal structure accordingly moving forward.
The hardware monitor manage_startup_states utility is incorrectly
requesting degrade clear rather than degrade set in response to finding
a pre-existing critical sensor assertion on process startup.
This update fixes this issue by calling the set_degraded_state rather
than clear_degraded_state against this sensor in this case.
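A small sketch of the corrected startup handling; only the set-versus-clear
call is from the text above, the surrounding structure and helpers are
assumptions:

    struct sensor_type { bool preexisting_critical_alarm ; };

    void set_degraded_state   ( sensor_type & ) { /* assert degrade for this sensor */ }
    void clear_degraded_state ( sensor_type & ) { /* clear degrade for this sensor  */ }

    void manage_startup_states ( sensor_type & sensor )
    {
        if ( sensor.preexisting_critical_alarm )
            set_degraded_state ( sensor ) ; /* was incorrectly clear_degraded_state */
    }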
Change-Id: Ic1ecc1f11d7a729c16da63c6d43b7d758bb9e467
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002882
Task: 22845
As part of the changes to make sm-api independent, calling sm-api
requires keystone authentication.
This change enables mtce to call the SM REST API with keystone
authentication.
Story: 2002827
Task: 22744
Change-Id: If3b58d3e36b9bd7fd88829d61e9c1daa00ab5048
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Introduction of the PTP service requires the NTP service to be disabled.
Process monitoring of the NTP daemon must be turned off as well.
There is no way to start/stop process monitoring from MTCE, but
Puppet can check NTP status at startup and enable/disable monitoring.
So the NTP-related PMON script needs to move from MTCE to Puppet.
This is the first step: removing NTP references from MTCE.
Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01
Story: 2002935
Task: 24520
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
If the first mtcAlive message from a host that was supposed to be
rebooted reports uptime in excess of 40 minutes then that means it did
not reboot as expected.
This was seen to happen during an extended offline case where the host
failed heartbeat, then was reported offline during Graceful Recovery
which forced a full enable. When the host eventually came back online
its reported uptime made it clear that it never rebooted but mtce
allowed it to come into service anyway.
This is a security issue: a host could disappear, be compromised, and
be brought back into the system without a reboot.
To fix that, this update requires that a host's uptime, reported in its
first mtcAlive message, indicate that it has been up for less than twice
the configured mtcAlive timeout, or the enable will fail until the host
is proven to have reset.
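A sketch of the uptime sanity check described above; the names are
illustrative, only the threshold relationship is from the text:

    bool reboot_verified ( unsigned int first_mtcalive_uptime_secs,
                           unsigned int mtcalive_timeout_secs )
    {
        /* the host must report an uptime of less than twice the configured
         * mtcAlive timeout to be considered freshly rebooted */
        return ( first_mtcalive_uptime_secs < ( 2 * mtcalive_timeout_secs )) ;
    }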
Story: 2002882
Task: 22845
Change-Id: I9b3ff0bc1ba5af2ca5b07a58db9da9f288b59576
Signed-off-by: Jack Ding <jack.ding@windriver.com>
In the event of a Heartbeat Failure of a host, the Mtce Heartbeat Agent
will declare heartbeat recovery upon the first successful heartbeat
reply after the loss is declared ; basically an edge-triggered
recovery.
In cases where a networking issue causes heartbeat loss of a group of
hosts, Maintenance tracks the group of hosts that experienced heartbeat
loss and puts the system into 'Multi Node Failure Avoidance' mode.
Maintenance then simply waits up to a configured timeout period for
hosts to regain heartbeat.
As heartbeat is regained for each host, that host is 'Gracefully
Recovered'.
However, if the networking issue persists in a way that lets the
occasional transient heartbeat pulse through, then the maintenance
system can prematurely take hosts, and then 'the system', out of MNFA
mode only to find that heartbeat has not actually recovered, and then
fail and force a reboot/reset of each node that is still experiencing
heartbeat loss.
This update changes the heartbeat service from an 'edge' to a 'level'
sensitive recovery by requiring a number of back-to-back heartbeat pulses
following a failure before that host is declared as recovered and pulled
out of the MNFA pool.
Basically, this update makes the system's MNFA recovery algorithm more
robust in the face of transient heartbeat loss for a group of hosts.
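A hypothetical sketch of level-sensitive recovery; the structure, pulse
count and names are illustrative only, not the hbsAgent source:

    struct hb_host
    {
        bool failed ;
        int  b2b_good_pulses ;
    };

    const int HBS_RECOVERY_B2B_PULSES = 4 ; /* illustrative value only */

    void on_pulse_response ( hb_host & host, bool responded )
    {
        if ( !host.failed ) return ;
        if ( responded )
        {
            if ( ++host.b2b_good_pulses >= HBS_RECOVERY_B2B_PULSES )
            {
                host.failed = false ;       /* declare recovered and       */
                host.b2b_good_pulses = 0 ;  /* remove from the MNFA pool   */
            }
        }
        else
        {
            host.b2b_good_pulses = 0 ;      /* any miss restarts the count */
        }
    }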
Story: 2002882
Task: 22845
Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>
Adds support to the mtcAgent for detecting the absence of the 'host
services execution enhancement feature' in the mtcClient and falls back
to the pre-upgrade behavior in that case. When mtcAgent tries to lock
a storage node running a pre-upgrade version it will implement a 90s
lock wait before proceeding to declare that storage host as
locked-disabled.
Story: 2002886
Task: 22847
Change-Id: I99fb5576e027621019adb5eff553d52773f608db
Signed-off-by: Jack Ding <jack.ding@windriver.com>