Commit Graph

11 Commits (master)

Author SHA1 Message Date
Girish Subramanya 86681b7598 Alarm Hostname controller function has in-service failure reported
When compute services remain healthy:
 - listing alarms shall not refer to the below Obsoleted alarm
 - 200.012 alarm hostname controller function has an in-service failure

This update deletes definition of the obsoleted alarm and any references
200.012 is removed in events.yaml file
Also updated any reference to this alarm definition.
Need to also raise a Bug to track the Doc change.

Test Plan:
Verify on a Standard configuration no alarms are listed for
hostname controller in-service failure
Code (removal) changes exercised with fix prior to ansible bootstrap
and host-unlock and verify no unexpected alarms
Regression:
There is no need to test the alarm referred here as they are obsolete

Closes-Bug: 1991531

Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com>

Change-Id: I255af68155c5392ea42244b931516f742fa838c3
8 months ago
Matheus Machado Guilhermino 4c8abe18d3 Fix failing mtce services on Debian
Modified mtce and mtce-control to address the following
failing services on Debian:
hbsAgent.service
hbsClient.service
hwmon.service
lmon.service
mtcalarm.service
mtclog.service
runservices.service

Applied fix:
- Included modified .service files for debian
directly into into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are active

Story: 2009101
Task: 44192

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: I50915c17d6f50f5e20e6448d3e75bfe54a75acc0
1 year ago
Eric MacDonald 3c1e9d9601 Modify mtce daemon log rotation config files
This update make the following setting changes to the
maintenance log rotation configuration files

 - add 'create' with permissions to each tuple
 - add 'delaycompress'
 - group together log files with similar settings
 - move global settings ro local settings
 - remove 'copytruncate' global setting
 - remove the 'nodateext' global and local setting

Test Plan:

PASS: Verify log rotation for all mtc log files
PASS: Verify no log loss over rotation
PASS: Verify log rotation file naming convention
PASS: Verify delaycompress on all mtce log files
PASS: Verify log permissions after rotate are 0640

Regression:

PASS: Verify AIO system install
PASS: Verify Standard system install
PASS: Verify full and dated collect

Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
Partial-Bug: 1918979
Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2 years ago
Sharath Kumar K b725a0974b De-branding in starlingx/metal: Titanium Cloud -> StarlingX
1. Rename Titanium Cloud to StarlingX for .spec files
2. Rename Titanium Cloud to StarlingX for .service file

Test:
After the de-brand change, bootimage.iso has built in the flock layer
and installed on the dev machine to validate the changes.

Please note, doing de-brand changes in batches, this is batch1 changes.

Story: 2006387
Task: 36207

Change-Id: Ifa4dc5c7aa3189815e00b796fc833852e88c8fe3
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
3 years ago
Eric MacDonald f2fedc0446 Add alarm retry support to maintenance alarm handling daemon
The maintenance alarm handling daemon (mtcalarmd) should not
drop alarm requests simply because FM process is not running.
Insteads it should retry for it and other FM error cases that
will likely succeed in time if they are retried.

Some error cases however do need to be dropped such as those
that are unlikely to succeed with retries.

Reviewed FM return codes with FM designer which lead to a list
of errors that should drop and others that should retry.

This update implements that handling with a posting and
servicing of a first-in / first-out alarm queue.

Typical retry case is the NOCONNECT error code which occurs
when FM is not running.

Alarm ordering and first try timestamp is maintained.
Retries and logs are throttled to avoid flooding.

Test Plan:

PASS: Verify success path alarm handling End-to-End.
PASS: Verify retry handling while FM is not running.
PASS: Verify handling of all FM error codes (fit tool).
PASS: Verify alarm handling under stress (inject-alarm script) soak.
PASS: verify no memory leak over stress soak.
PASS: Verify logging (success, retry, failure)
PASS: Verify alarm posted date is maintained over retry success.

Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86
Closes-Bug: 1841653
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
4 years ago
Erich Cordoba 1390ce0e95 Add LSB headers to mtce service scripts
The LSB headers are required by openSUSE build system as one of the
quality checks. This patch adds the header and missing fields. A $null
value was set for all unchanged fields

Change-Id: I22ee3571b70b22e1fe8238c2a94b3f4d099d41cd
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
4 years ago
Teresa Ho 8e51a1660a Refactor infrastructure network in mtce code
Updated to read the host cluster-host parameter in /etc/hosts
file.
Replaced references of infra network with cluster-host network

Story: 2004273
Task: 29473

Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
4 years ago
Dean Troyer 83101e95ba Add EXTRALDFLAGS to linker in a number of Makefiles
This allows DevStack plugins to add its configured STX_INST_DIR
to the linker search path.

Change-Id: I277204cd89767b93eec6c96969fc33d23e04516b
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
4 years ago
Dean Troyer 58b987239f Set SHELL in Makefiles that use bash constructs
A number of Makefiles use '[[' in their test to set
STATIC_ANALYSIS_TOOL_EXISTS.  Set SHELL=/bin/bash

Change-Id: Ie9536d7cafd518f3e65acf38ac5b30aa7536ea79
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
5 years ago
Eric MacDonald 0b922227ac Implement Active-Active Heartbeat as HA Improvement
This update introduces mtce changes to support Active-Active Heartbeating.

The purpose of Active-Active Heartbeating is help avoid Split-Brain.

Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.

This is referred to as the 'heartbeat cluster history'

Each controller then includes its cluster history in each heartbeat
pulse request message.

The hbsClient, now modified to handle heartbeat from both controllers,
saves each controllers' heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.

So when the hbsClient receives a pulse request from controller-0 it
saves its reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request ; i.e. its view of the system.

Controller-0, receiving a host's pulse response, saves its peers
heartbeat cluster history so that it has summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. Same for controller-1 with controller-0's history.

The hbsAgent is then further enhanced to support a query request
for this information.

So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsivness from both controllers perspectives to help decide which
controller to make active.

This involved removing the hbsAgent process from SM control and monitor
and adding a new hbsAgent LSB init script for process launch, service
file to run the init script and pmon config file for hbsAgent process
monitoring.

With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.

The hbsAgent running on the inactive controller does not
 - does not send heartbeat events to maintenance
 - does not send raise or clear alarms or produce customer logs

Test Plan:

Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster into.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms

Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog  logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump

Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push

Story: 2003576
Task: 24907

Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
5 years ago
Jim Gauld 6a5e10492c Decouple Guest-server/agent from stx-metal
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.

This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.

Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.

The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
  service-mgmt, sm, and sm-api

mtce-common:
- contains common and daemon shared source utility code

mtce-common-dev:
- based on mtce-common, contains devel package required to build
  mtce-guest and mtce
- contains common library archives and headers

mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
  maintenance, mtclog, pmon, public, rmon

mtce-guest:
- contains guest component guest-server, guest-agent

Story: 2002829
Task: 22748

Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
5 years ago