1. Build-iso - PASS
2. Install ISO and unlock all hosts - PASS
3. Force reboot an unlocked host to verify heartbeat failure detection
   and graceful recovery - PASS
4. Check hbsAgent logs for unexpected entries - PASS
Change-Id: Ia4f52d3ffa52152914f3c221fa6eb860d127724b
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
Closes-Bug: 1806963
In the case where the active controller fails due to a spontaneous
reboot, there is the potential for a race condition in the new
Active-Active Heartbeat model between the inactive hbsAgent and the
mtcAgent starting up on the newly active controller.
The inactive hbsAgent can report a heartbeat Loss before SM starts up
the mtcAgent. The result is that the heartbeat-failed host goes
undetected.
This update modifies the hbsAgent to continue reporting heartbeat Loss,
at a throttled rate, for as long as it continues to experience heartbeat
loss of enabled monitored hosts. This change is implemented in
nodeClass.cpp, as sketched below.
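A minimal sketch of that throttle, assuming a simple per-host timestamp
gate; the type, names and interval below are illustrative, not the
actual nodeClass.cpp symbols:

    #include <ctime>

    // Hypothetical per-host loss state ; not the real nodeClass members.
    struct HostLossState
    {
        bool   loss_active = false ; // heartbeat loss currently experienced
        time_t last_report = 0 ;     // when the loss was last reported
    };

    const time_t LOSS_REPORT_THROTTLE_SECS = 60 ; // assumed interval

    // Returns true each time a (re)report is due while the loss persists,
    // so a late-starting mtcAgent still receives the loss event.
    bool loss_report_due ( HostLossState & host, time_t now )
    {
        if ( ! host.loss_active )
            return false ;
        if (( now - host.last_report ) < LOSS_REPORT_THROTTLE_SECS )
            return false ;
        host.last_report = now ;
        return true ;
    }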
Debugging this issue also revealed another undesirable race condition
and a logging issue when a controller is locked. This is remedied by
introducing a 'locked' state in the control structure that is set on
controller lock and checked in the hbs_cluster_update utility
(hbsCluster.cpp), roughly as follows.
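A rough shape of that gate, again with hypothetical names; the real
control structure and check are in hbsCluster.cpp:

    // Hypothetical control structure ; only 'locked' matters here.
    static struct
    {
        bool locked = false ; // set on controller lock, cleared on unlock
    } ctrl ;

    void hbs_cluster_update_sketch ( void )
    {
        if ( ctrl.locked )
            return ; // ignore cluster updates while this controller is locked
        /* ... normal cluster update handling ... */
    }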
Two additional hbsAgent logging changes were implemented with this
update:
1. Only print "missing peer controller cluster view" on a state change
   event (see the sketch after this list). Otherwise this log becomes
   excessive whenever the inactive controller fails. (hbsAgent.cpp)
2. Don't print the full heartbeat inventory and state banner with
   hbsInv.print_node_info on every heartbeat Loss event. Otherwise this
   log becomes excessive in large systems. (hbsCluster.cpp)
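For item 1, a sketch of edge-triggered logging under assumed names; the
real check is in hbsAgent.cpp:

    #include <cstdio>

    // Log the missing-peer condition only when it changes state.
    static bool peer_view_missing = false ;

    void check_peer_cluster_view ( bool missing_now )
    {
        if ( missing_now != peer_view_missing )
        {
            peer_view_missing = missing_now ;
            if ( missing_now )
                printf ("missing peer controller cluster view\n");
            else
                printf ("peer controller cluster view restored\n");
        }
    }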
Test Plan:
PASS: Verify hbsAgent log stream for implemented improvements.
PASS: Verify Lock inactive controller several times.
PASS: Fail inactive controller several times ; verify detection.
PASS: Reboot active controller several times ; verify detection.
PASS: DOR System several times. Verify proper recovery.
PASS: DOR system but prevent power-up of several hosts. Verify detect.
Change-Id: I36e6309e141e9c7844b736cce0cf0cddff3eb588
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update replaces 'compute' references with 'worker' in mtce,
kickstart, installer and BSP files.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes unlock and enable with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
The maintenance process monitor is failing the hbsClient process over
config reload or process reload operations.
The issue relates to the hbsClient's subfunction being 'last-config'
while pmon fails to properly gate the start of the active monitoring
FSM until the passive monitoring phase is complete and in the MANAGE
state, as sketched below.
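A sketch of the missing gate under assumed names; the real FSM and
stage names live in the pmon source:

    // Passive monitoring stages ; names are illustrative.
    enum passive_stage_enum { STAGE_INIT, STAGE_POLL, STAGE_MANAGE };

    struct proc_ctrl
    {
        passive_stage_enum passive_stage = STAGE_INIT ;
        bool active_monitoring = false ;
    };

    // Only start the active monitoring FSM once passive monitoring has
    // reached the MANAGE stage, i.e. configuration is complete.
    void maybe_start_active_monitoring ( proc_ctrl & p )
    {
        if (( p.passive_stage == STAGE_MANAGE ) && ( ! p.active_monitoring ))
        {
            p.active_monitoring = true ;
            /* ... kick off active monitoring of the process ... */
        }
    }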
Test Plan:
PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure
PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.
Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update stops trying to recover hosts that have failed the Enable
sequence after a thresholded number of back-to-back tries.
Once a host reaches a particular failure mode's maximum failure
threshold, maintenance puts it into an 'unlocked-disabled-failed' state
and leaves it that way, with no further recovery action, until it is
manually locked and unlocked.
The thresholded Enable failure causes are:
Configuration Failure ....... threshold:2 retry interval:30 secs
In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
Start Host Services Failure . threshold:2 retry interval:30 secs
Heartbeat Soak Failure ...... threshold:2 retry interval:10 mins
This update refactors the old AIO SX auto recovery into this more
generic framework, sketched below.
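A minimal sketch of the generic framework, assuming per-cause counters
keyed by the table above; all names and shapes below are illustrative:

    #include <ctime>

    enum enable_failure_cause
    {
        CAUSE_CONFIG = 0 ,
        CAUSE_GOENABLED ,
        CAUSE_HOST_SERVICES ,
        CAUSE_HEARTBEAT_SOAK ,
        CAUSE_LAST
    };

    // Per-cause threshold and retry interval, mirroring the table above.
    // The interval would drive the retry timer between attempts (not shown).
    static const int    ar_threshold [CAUSE_LAST] = {  2,  2,  2,   2 };
    static const time_t ar_interval  [CAUSE_LAST] = { 30, 30, 30, 600 };

    static int ar_count [CAUSE_LAST] = { 0 };

    // Returns true once the back-to-back failure count for this cause
    // exceeds its threshold ; the host is then left
    // unlocked-disabled-failed until a manual lock/unlock clears the
    // counters.
    bool auto_recovery_disabled ( enable_failure_cause cause )
    {
        return ( ++ar_count[cause] > ar_threshold[cause] );
    }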
Story: 2003576
Task: 24905
Test Plan:
PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling
PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)
PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message
Regression:
PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: Verify in-service heartbeat failure handling
PASS: Verify no segfaults during UT
Corner Cases:
PASS: Verify mtcAlive boot failure behavior ; reset progression ; retry forever. (sleep in config script)
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.
Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A few small issues were found during integration testing with SM.
This update delivers those integration tested fixes.
1. Send cluster event to SM only after the first 10 heartbeat
pulses are received.
2. Only send inventory to hbsAgent on provisioned controllers.
3. Add new OOB SM_UNHEALTHY flag to detect and act on an SM
declared unhealthy controller.
4. Network monitoring enable fix.
5. Fix oldest entry tracking when a network history is not full
   (see the sketch below).
6. Prevent clearing local uptime for a host that is being enabled.
7. Refactor cluster state change notification logging and handling.
These fixes were both UT and IT tested in multiple labs.
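For fix 5, a sketch of the likely shape of the issue, assuming a
circular history buffer (layout and names are illustrative): until the
buffer has wrapped, the oldest entry is index 0 rather than the write
index.

    struct network_history
    {
        static const int DEPTH = 20 ;  // assumed history depth
        int entries [DEPTH] ;
        int count = 0 ;  // number of entries stored so far
        int next  = 0 ;  // next write position

        void add ( int sample )
        {
            entries[next] = sample ;
            next = ( next + 1 ) % DEPTH ;
            if ( count < DEPTH )
                count++ ;
        }

        // Oldest entry is index 0 until the history is full ; only after
        // wrapping is it the write index.
        int oldest_index ( void ) const
        {
            return ( count < DEPTH ) ? 0 : next ;
        }
    };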
Change-Id: I28485f241ac47bb3ed3ec1e2a8f4c09a1ca2070a
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A number of Makefiles use the bash-only '[[' construct in the test that
sets STATIC_ANALYSIS_TOOL_EXISTS, so set SHELL=/bin/bash in those
Makefiles.
Change-Id: Ie9536d7cafd518f3e65acf38ac5b30aa7536ea79
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
After discussion with Eslimi, an earlier patch disabled DDNS in
dhclient because network port 2105, used by dhclient, conflicted with
the same port used by mtcClient. This update changes the port used by
mtcClient from 2105 to 2118 to resolve the conflict, allowing that
patch to be removed.
Deployment test passed.
Story: 2003757
Task: 26445
Change-Id: I70559d73f51f85c840042cc4fc206fcd5bc3de27
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
Description:
In mtce/src/hwmon/hwmonThreads.cpp, line 266:
++dst_ptr = '\0' ;
should be modified as
*(++dst_ptr) = '\0' ;
Otherwise the statement is useless and generates a compile error with
newer gcc versions.
Reproduce:
Compiling the mtce code with gcc 8.2.1 causes a compile error; after
the fix, the error is gone.
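To illustrate the difference (this is not the hwmon code itself):
pre-increment binds tighter than assignment, so the original statement
parses as (++dst_ptr) = '\0' and assigns to the pointer rather than
through it.

    void terminate_example ( void )
    {
        char buf[4] = { 'a', 'b', 'c', 'x' } ;
        char * dst_ptr = &buf[1] ;

        /* ++dst_ptr = '\0' ;  parses as (++dst_ptr) = '\0' : it assigns to
         *                     the pointer itself (a char-to-pointer
         *                     conversion that gcc 8 rejects) and never
         *                     writes the terminator. */
        *(++dst_ptr) = '\0' ;  /* advances to buf[2], writes the terminator */
    }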
Closes-Bug: 1804599
Change-Id: I25df255fb14aa3d96c62927eeb7d3e23ae29af2b
Signed-off-by: Yan Chen <yan.chen@intel.com>
The commit shown below introduced a main loop audit that mistakenly
registers subfunction processes that are still in the 'polling' state,
waiting for /var/run/.compute_config_complete, during unlock enable.
Doing so inadvertently changes the process's monitor FSM stage from
'Poll' to 'Manage' before configuration is complete.
Since config is not complete, the hbsClient has not initialized its
socket interface and is unable to service active monitoring requests.
This leads to quorum failure and a watchdog reboot.
commit 537935bb0caa257df624a0b470a971c82d215152
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date: Mon Jul 9 08:36:22 2018 -0400
Reorder process restart operations to prevent pmond futex deadlock
The Fix: Don't run the audit for processes that are still in the
'polling' state waiting for configuration to complete, as sketched
below.
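A sketch of the shape of the fix under assumed names; the real audit
runs in pmond's main loop:

    enum monitor_stage { STAGE_POLL, STAGE_MANAGE };

    struct process_info
    {
        bool subfunction ;       // subfunction process, e.g. hbsClient
        bool config_complete ;   // /var/run/.compute_config_complete seen
        monitor_stage stage ;
    };

    void main_loop_audit ( process_info & p )
    {
        // The fix: skip processes still polling for config complete so
        // the audit cannot bump them from 'Poll' to 'Manage' prematurely.
        if ( p.subfunction && ! p.config_complete )
            return ;
        /* ... normal audit / registration handling ... */
    }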
Test Plan:
Provision AIO ; verify no quorum failure and inspect logs for correct
behavior.
Change-Id: I179c78309517a34285783ee99bbb3d699915cb83
Closes-Bug: 1804318
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update introduces mtce changes to support Active-Active
Heartbeating.
The purpose of Active-Active Heartbeating is to help avoid Split-Brain.
Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.
This is referred to as the 'heartbeat cluster history'.
Each controller then includes its cluster history in each heartbeat
pulse request message.
The hbsClient, now modified to handle heartbeat from both controllers,
saves each controllers' heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.
So when the hbsClient receives a pulse request from controller-0 it
saves its reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request ; i.e. its view of the system.
Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat cluster
history for the last 5 seconds, for each monitored network of every
monitored host in the system, from both controllers' perspectives. The
same applies to controller-1 with controller-0's history, as sketched
below.
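A sketch of the criss-cross with hypothetical types and sizes; the real
structures live in the mtce heartbeat cluster headers:

    #define MAX_NETWORKS (3)
    #define MAX_HOSTS    (32)

    // Summary of heartbeat responsiveness over the last 5 seconds, per
    // monitored network, per monitored host (shape is assumed).
    struct cluster_history
    {
        unsigned short pulses_ok [MAX_NETWORKS][MAX_HOSTS] ;
    };

    // hbsClient side: last history received from each controller.
    static cluster_history saved_view [2] ;

    // On a pulse request from 'ctrl' (0 or 1), save that controller's
    // reported history and reply with the peer's last-saved view.
    cluster_history pulse_response_history ( int ctrl,
                                             const cluster_history & reported )
    {
        saved_view[ctrl] = reported ;
        return saved_view[ ctrl ? 0 : 1 ] ;
    }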
The hbsAgent is then further enhanced to support a query request
for this information.
So now SM, when it needs to make a decision to avoid Split-Brain or
otherwise, can query either controller for its heartbeat cluster
history and get the last-5-second summary view of heartbeat (network)
responsiveness from both controllers' perspectives, to help decide
which controller to make active.
This involved removing the hbsAgent process from SM control and monitor
and adding a new hbsAgent LSB init script for process launch, service
file to run the init script and pmon config file for hbsAgent process
monitoring.
With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.
The hbsAgent running on the inactive controller
- does not send heartbeat events to maintenance
- does not raise or clear alarms or produce customer logs
Test Plan:
Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms
Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump
Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push
Story: 2003576
Task: 24907
Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Fix illegal access by adding boundary checking.
Verified on a deployed system: checked with the system host-list
command that get_adminAction_str/get_adminState_str/get_operState_str
in nodeClass are correct, that daemon_init in rmon runs with no error
message, and that active monitoring is correct for hbsClient and pmon
per hbsClient.log and pmon.log.
Closes-Bug: 1795110
Change-Id: Iff81d93702cfa5636c3d8c9567363ff9601be5f6
Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>
This is part one of a two-part HA Improvements feature that introduces
the collection of heartbeat health at the system level.
The full feature is intended to provide service management (SM) with
the last 2 seconds of maintenance's heartbeat health view, reflective
of each controller's connectivity to each host including its peer
controller.
The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two controller system.
Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat service
populates and maintains a local copy of this structure with data that
reflects the responsiveness of each monitored network of all the
monitored hosts over the last 20 heartbeat periods. Mtce sends the
current cluster summary to SM upon request.
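One plausible encoding of that per-host, per-network history, purely
illustrative (not the actual mtce definition): one bit per heartbeat
period.

    #define HISTORY_DEPTH (20)

    struct host_network_history
    {
        unsigned int bits = 0 ; // bit i set => response seen i periods ago

        void record ( bool responded )
        {
            bits = (( bits << 1 ) | ( responded ? 1u : 0u ))
                   & (( 1u << HISTORY_DEPTH ) - 1u ) ;
        }

        bool fully_responsive ( void ) const
        {
            return ( bits == (( 1u << HISTORY_DEPTH ) - 1u )) ;
        }
    };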
General flow of cluster feature wrt hbsAgent:
  hbs_cluster_init: general data init
  hbs_cluster_nums: set controller and network numbers
  forever:
    select:
      hbs_cluster_add / hbs_cluster_del: add/del hosts from mtcAgent
      hbs_sm_handler -> hbs_cluster_send: send cluster to SM
    heartbeating:
      hbs_cluster_append: add controller cluster to pulse request
      hbs_cluster_update: get controller cluster data from pulse responses
      hbs_cluster_save: save other controller cluster view in cluster vault
      hbs_cluster_log: log cluster state changes (clog)
Test Plan:
PASS: Verify compute system install
PASS: Verify storage system install
PASS: Verify cluster data ; all members of structure
PASS: Verify storage-0 state management
PASS: Verify add of second controller
PASS: Verify add of storage-0 node
PASS: Verify behavior over Swact
PASS: Verify lock/unlock of second controller ; overall behavior
PASS: Verify lock/unlock of storage-0 ; overall behavior
PASS: Verify lock/unlock of storage-1 ; overall behavior
PASS: Verify lock/unlock of compute nodes ; overall behavior
PASS: Verify heartbeat failure and recovery of compute node
PASS: Verify heartbeat failure and recovery of storage-0
PASS: Verify heartbeat failure and recovery of controller
PASS: Verify delete of controller node
PASS: Verify delete of storage-0
PASS: Verify delete of compute node
PASS: Verify cluster when controller-1 active / controller-0 disabled
PASS: Verify MNFA and recovery handling
PASS: Verify handling in presence of multiple failure conditions
PASS: Verify hbsAgent memory leak soak test with continuous SM query.
PASS: Verify active controller-1 infra network failure behavior.
PASS: Verify inactive controller-1 infra network failure behavior.
Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Maintenance is seen to intermittently fail Swact requests when it gets
no response from SM within 500 msecs of having successfully issued the
request.
A recent instrumentation update verified that the http request is being
launched properly, even in the failure cases.
It seems the 500 msec timeout may not be long enough to account for
SM's scheduling/handling.
This update increases the receive retry delay from 50 msec to 1 second.
Change-Id: I29d6ba03094843a2af9d8720dd074572d76a31a4
Related-Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server and guest-agent
from mtce by splitting the guest component into the stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>