metal/mtce-common/src/common/nodeTimers.h
Eric MacDonald 0b922227ac Implement Active-Active Heartbeat as HA Improvement
This update introduces mtce changes to support Active-Active Heartbeating.

The purpose of Active-Active Heartbeating is to help avoid Split-Brain.

Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.

This is referred to as the 'heartbeat cluster history'.
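
The 5 second per-network response cache described above can be pictured as a small ring buffer. The following is a minimal sketch only: `NetworkHistory`, `record` and `responses` are hypothetical names, and the assumption of one sample per second is an illustration, not something the commit states. One such object would be kept per monitored host and network.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: names and layout are illustrative, not the mtce types.
constexpr std::size_t HISTORY_SECS = 5;   // the 5 second window from the text

struct NetworkHistory
{
    std::array<bool, HISTORY_SECS> responded{}; // assumed: one slot per second
    std::size_t next  = 0;                      // slot the next sample overwrites
    std::size_t count = 0;                      // number of valid samples (<= 5)

    // Record whether the host answered the pulse request for this period.
    void record(bool got_response)
    {
        responded[next] = got_response;
        next = (next + 1) % HISTORY_SECS;
        if (count < HISTORY_SECS) ++count;
    }

    // Number of answered pulses within the retained window.
    std::size_t responses() const
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < count; ++i)
            if (responded[i]) ++n;
        return n;
    }
};
```

Older samples are overwritten in place, so memory use stays constant no matter how long the host is monitored.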

Each controller then includes its cluster history in each heartbeat
pulse request message.

The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.

So when the hbsClient receives a pulse request from controller-0 it
saves the reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request; i.e. controller-1's view of the system.
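
The criss-cross step can be sketched as below. This is an illustration under stated assumptions: `ClientCache` and `on_pulse_request` are hypothetical names, and the history is treated as an opaque string rather than the real message payload.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical sketch: the real hbsClient message layout is not shown here.
struct ClientCache
{
    // index 0 holds controller-0's last history, index 1 controller-1's
    std::array<std::string, 2> saved;

    // Handle a pulse request from `controller` (0 or 1) carrying its history.
    // Returns the history to place in the response: the peer's last saved view.
    std::string on_pulse_request(int controller, const std::string &history)
    {
        saved[controller] = history;   // remember the requester's view
        return saved[1 - controller];  // answer with the other controller's view
    }
};
```

Until the peer controller has sent its first pulse request, the returned history in this sketch is simply empty.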

Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. Same for controller-1 with controller-0's history.

The hbsAgent is then further enhanced to support a query request
for this information.

So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide which
controller to make active.
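
The commit does not describe SM's decision rule. Purely to illustrate how two perspectives could be compared, here is a sketch in which every name (`View`, `reachable_hosts`, `pick_active`) and the rule itself (prefer the controller that heard from more hosts, keep the current one on a tie) are assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary: responses seen per host over the last 5 seconds,
// from one controller's perspective. Not the real query response format.
using View = std::vector<int>;

// Count hosts that answered at least once in the window.
inline int reachable_hosts(const View &v)
{
    int n = 0;
    for (int responses : v)
        if (responses > 0) ++n;
    return n;
}

// Assumed rule, for illustration only: returns 0 or 1 for the controller
// with the better view; a tie keeps the currently active controller.
inline int pick_active(const View &c0, const View &c1, int current)
{
    const int r0 = reachable_hosts(c0);
    const int r1 = reachable_hosts(c1);
    if (r0 == r1) return current;
    return (r0 > r1) ? 0 : 1;
}
```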

This involved removing the hbsAgent process from SM control and monitor
and adding a new hbsAgent LSB init script for process launch, a service
file to run the init script and a pmon config file for hbsAgent process
monitoring.

With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.

The hbsAgent running on the inactive controller
 - does not send heartbeat events to maintenance
 - does not raise or clear alarms or produce customer logs
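
The active/inactive split above, combined with the test plan's note that the inactive hbsAgent still detects and logs heartbeat loss, can be sketched as a simple gate. `LossHandling` and `on_heartbeat_loss` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical sketch: the struct and function names are illustrative only.
struct LossHandling
{
    bool design_log;      // internal log of the loss (both controllers)
    bool send_mtc_event;  // heartbeat event sent to maintenance
    bool manage_alarms;   // raise/clear alarms and produce customer logs
};

// Decide what a hbsAgent does on heartbeat loss, given its controller role.
inline LossHandling on_heartbeat_loss(bool on_active_controller)
{
    return LossHandling{ true, on_active_controller, on_active_controller };
}
```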

Test Plan:

Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms

Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog  logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump

Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push

Story: 2003576
Task: 24907

Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-20 19:57:18 +00:00

#ifndef __INCLUDE_NODETIMERS_HH__
#define __INCLUDE_NODETIMERS_HH__
/*
* Copyright (c) 2013-2016 Wind River Systems, Inc.
*
* SPDX-License-Identifier: Apache-2.0
*
*/
/**
* @file
* Wind River CGTS Platform Node Maintenance "Timer Facility"
* Header and Maintenance API
*/
/**
* @details
* Detailed description ...
*
* Common timer struct
*
*/
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <signal.h>
#include <string> /* for the string members and parameters used below */
#define MAX_TIMER_DURATION (30000)
#define MTC_SECS_1         (    1)
#define MTC_SECS_2         (    2)
#define MTC_SECS_5         (    5)
#define MTC_SECS_10        (   10)
#define MTC_SECS_15        (   15)
#define MTC_SECS_20        (   20)
#define MTC_SECS_30        (   30)
#define MTC_MINS_1         (   60)
#define MTC_MINS_2         (  120)
#define MTC_MINS_3         (  180)
#define MTC_MINS_4         (  240)
#define MTC_MINS_5         (  300)
#define MTC_MINS_10        (  600)
#define MTC_MINS_15        (  900)
#define MTC_MINS_20        ( 1200)
#define MTC_MINS_30        ( 1800)
#define MTC_MINS_40        ( 2400)
#define MTC_HRS_1          ( 3600)
#define MTC_HRS_4          (14400)
#define MTC_HRS_8          (28800) /* old token refresh rate */
#define HOST_MTCALIVE_TIMEOUT (MTC_MINS_20)
#define HOST_GOENABLED_TIMEOUT (MTC_MINS_2)
#define MTC_CMD_RSP_TIMEOUT (10)
#define MTC_FORCE_LOCK_RESET_WAIT (30)
#define MTC_RECOVERY_TIMEOUT (16)
#define MTC_PMOND_READY_TIMEOUT (10)
/* If this interval changes, review the impact on
 * garbage collection in mtctimer_handler */
#define MTC_UPTIME_REFRESH_TIMER (MTC_MINS_1)
#define MTC_MNFA_RECOVERY_TIMER (3)
#define MTC_ALIVE_TIMER (5)
#define MTC_POWEROFF_DELAY (5)
#define MTC_SWACT_POLL_TIMER (10)
#define MTC_TASK_UPDATE_DELAY (10)
#define MTC_BM_PING_TIMEOUT (30)
#define MTC_BM_POWEROFF_TIMEOUT (30)
#define MTC_BM_POWERON_TIMEOUT (30)
#define MTC_RESET_PROG_TIMEOUT (20)
#define MTC_WORKQUEUE_TIMEOUT (60)
#define MTC_COMPUTE_CONFIG_TIMEOUT (900)
#define MTC_EXIT_DOR_MODE_TIMEOUT (60*15)
#define MTC_RESET_PROG_OFFLINE_TIMEOUT (20)
#define MTC_RESET_TO_OFFLINE_TIMEOUT (150)
#define MTC_POWEROFF_TO_OFFLINE_TIMEOUT (200)
#define MTC_POWERON_TO_ONLINE_TIMEOUT (900)
#define MTC_POWERCYCLE_COOLDOWN_DELAY (MTC_MINS_5)
#define MTC_POWERCYCLE_BACK2BACK_DELAY (MTC_MINS_5)
#define MTC_HEARTBEAT_SOAK_BEFORE_ENABLE (11)
#define MTC_REINSTALL_TIMEOUT_DEFAULT (MTC_MINS_40)
#define MTC_REINSTALL_TIMEOUT_MIN (MTC_MINS_1)
#define MTC_REINSTALL_TIMEOUT_MAX (MTC_HRS_4)
#define MTC_REINSTALL_WAIT_TIMER (10)
#define MTC_IPMITOOL_REQUEST_DELAY (10) /* consider making this shorter */
#define LAZY_REBOOT_RETRY_DELAY_SECS (60)
#define SM_NOTIFY_UNHEALTHY_DELAY_SECS (5)
#define MTC_MIN_ONLINE_PERIOD_SECS (7)
#define MTC_RETRY_WAIT (5)
#define MTC_AGENT_TIMEOUT_EXTENSION (5)
#define MTC_LOCK_CEPH_DELAY (90)
/** Host must stay enabled for this long for the
* failed_recovery_counter to get cleared */
#define MTC_ENABLED_TIMER (5)
/** Should be same or lower but not less than half of ALIVE_TIMER */
#define MTC_OFFLINE_TIMER (7)
#define TIMER_INIT_SIGNATURE (0x86752413)
struct mtc_timer
{
    /** linux timer structs */
    struct sigevent   sev   ; /**< set by util - time event specifier  */
    struct itimerspec value ; /**< set by util - time values           */
    struct sigaction  sa    ; /**< set by util and create parm handler */

    /** local service members */
    unsigned int init   ; /**< timer initialized signature */
    timer_t      tid    ; /**< the timer address pointer */
    bool         active ; /**< indicates that the timer is active */
    bool         mutex  ;
    bool         error  ;
    int         _guard  ;
    bool         ring   ; /**< set to true if the timer fires */
    int          guard_ ;
    int          secs   ; /**< set by create parm - sub second not supported */
    int          msec   ; /**< set by create parm - milliseconds */
    string       hostname ; /**< name of the host using the timer */
    string       service  ; /**< name of the service using the timer */
} ;
void mtcTimer_mem_log ( void );
void mtcTimer_init ( struct mtc_timer & mtcTimer );
void mtcTimer_init ( struct mtc_timer & mtcTimer, string hostname );
void mtcTimer_init ( struct mtc_timer & mtcTimer, string hostname, string service );
void mtcTimer_init ( struct mtc_timer * mtcTimer_ptr );
void mtcTimer_init ( struct mtc_timer * mtcTimer_ptr, string hostname, string service );
int mtcTimer_start ( struct mtc_timer & mtcTimer,
                     void (*handler)(int, siginfo_t*, void*),
                     int seconds );
int mtcTimer_start ( struct mtc_timer * mtcTimer_ptr,
                     void (*handler)(int, siginfo_t*, void*),
                     int seconds );
int mtcTimer_start_msec ( struct mtc_timer & mtcTimer,
                          void (*handler)(int, siginfo_t*, void*),
                          int msec );
int mtcTimer_start_msec ( struct mtc_timer * mtcTimer_ptr,
                          void (*handler)(int, siginfo_t*, void*),
                          int msec );
int mtcTimer_start_sec_msec ( struct mtc_timer * mtcTimer_ptr,
                              void (*handler)(int, siginfo_t*, void*),
                              int secs , int msec );
int mtcTimer_stop ( struct mtc_timer & mtc_timer );
int mtcTimer_stop ( struct mtc_timer * mtcTimer_ptr );
int mtcTimer_stop_int_safe ( struct mtc_timer & mtcTimer );
int mtcTimer_stop_int_safe ( struct mtc_timer * mtcTimer_ptr );
void mtcTimer_dump_data ( void );
/** Cleanup interface - stop and delete an unknown timer */
int mtcTimer_stop_tid ( timer_t * tid_ptr );
int mtcTimer_stop_tid_int_safe ( timer_t * tid_ptr );
/* returns true if the timer is not active or ring is true */
bool mtcTimer_expired ( struct mtc_timer & mtcTimer );
bool mtcTimer_expired ( struct mtc_timer * mtcTimer_ptr );
/* stops the timer if its tid is active (running) */
void mtcTimer_reset ( struct mtc_timer & mtcTimer );
void mtcTimer_reset ( struct mtc_timer * mtcTimer_ptr );
/* de-init a user timer */
void mtcTimer_fini ( struct mtc_timer & mtcTimer );
void mtcTimer_fini ( struct mtc_timer * mtcTimer_ptr );
void mtcWait_msecs ( int millisecs );
void mtcWait_secs ( int secs );
int mtcTimer_testhead ( void );
#endif
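
Two details of this header lend themselves to a short illustration: how a `secs`/`msec` pair could be turned into the `struct itimerspec` held in `mtc_timer::value`, and the documented `mtcTimer_expired` rule ("true if the timer is not active or ring is true"). The sketch below is not the mtce implementation: `one_shot_spec`, `TimerState` and `expired` are hypothetical stand-ins, and the one-shot (zero interval) behavior is an assumption.

```cpp
#include <cassert>
#include <time.h>

// Hypothetical helper, not part of nodeTimers.h: build a one-shot timer
// value from whole seconds plus milliseconds.
inline struct itimerspec one_shot_spec(int secs, int msec)
{
    struct itimerspec its = {};
    its.it_value.tv_sec  = secs + msec / 1000;                         // carry whole seconds
    its.it_value.tv_nsec = static_cast<long>(msec % 1000) * 1000000L;  // remainder in ns
    // it_interval stays zero: the timer fires once and does not reload (assumed)
    return its;
}

// Hypothetical stand-in for the two mtc_timer flags used by the expiry check.
struct TimerState
{
    bool active = false;  // timer has been started and not stopped
    bool ring   = false;  // set by the signal handler when the timer fires
};

// Mirrors the comment on mtcTimer_expired: not active, or already rung.
inline bool expired(const TimerState &t)
{
    return !t.active || t.ring;
}
```

Note that under this rule a timer that was never started also counts as expired, so callers polling the flag do not block on an idle timer.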