metal/mtce/src/hwmon/hwmonModel.cpp
Eric MacDonald c4b8171ddd Refactor BMC provisioning in Maintenance
The current mechanism used to preserve the learned bmc protocol in
the filesystem on the active controller is problematic over swact.

This update removes the file storage method in favor of preserving
the learned protocol in the system inventory database as a key/value
pair at the host level in the already existing mtce_info database field.

The specified or learned bmc access protocol is then shared with the
hardware monitor through inter-daemon maintenance messaging.

This update refactors bmc provisioning to accommodate bmc protocol
selection at the host rather than system level. To that end, this
update removes system level bmc_access_method selection in favor of
host level selection through bm_type. A bm_type of 'bmc' specifies
that the bmc access protocol for that host be learned. This has the
effect of making it the same as what is delivered today but without
support for changing it at the system level.

A system inventory update will be delivered shortly that enables bmc
access protocol selection at the host level. That update allows the
customer to specify the bmc access protocol at the host level to be
either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
That system inventory update delivers that information to maintenance
through bm_type via bmc provisioning. Until that update is delivered
bm_type always comes in as 'bmc' which gets interpreted as 'dynamic'
to maintain existing configuration.
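
The bm_type interpretation described above can be sketched as follows. This is only an illustration under stated assumptions: the enum `bmc_mode` and the function `bm_type_to_mode` are hypothetical names, not maintenance code. The only facts taken from the text are that 'bmc' is treated as dynamic (learned) and that 'redfish' and 'ipmi' force the protocol.

```cpp
#include <string>

/* Hypothetical access modes ; names are assumptions for illustration only. */
enum class bmc_mode { dynamic, redfish, ipmi, unknown };

/* Map the provisioned bm_type string to an access mode.
 * 'bmc' is interpreted as 'dynamic' so existing configuration is preserved. */
bmc_mode bm_type_to_mode ( const std::string & bm_type )
{
    if (( bm_type == "bmc" ) || ( bm_type == "dynamic" ))
        return bmc_mode::dynamic ;
    if ( bm_type == "redfish" )
        return bmc_mode::redfish ;
    if ( bm_type == "ipmi" )
        return bmc_mode::ipmi ;
    return bmc_mode::unknown ;
}
```

With this sketch, a host that still reports the legacy 'bmc' value lands in the same mode as one explicitly provisioned as 'dynamic'.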

The following additional issues were also fixed in this update.

1. The nodeTimers module defaults the 'ring' member of timers that are
   not running to false but should be true.

2. Added a pingUtil_restart function to facilitate quicker sensor
   monitoring following provisioning changes and bmc access failures.

3. Enhanced the hardware monitor sensor grouping filter to accommodate
   non-standard Redfish readout labelling so that more sensors fall
   into the existing canned groups ; leads to more monitored sensors.

4. Added a 'http security mode' to hardware monitor messaging. This
   defaults to https as that is all that is supported by the Redfish
   implementation today. This field can be used to specify non-secure
   'http' mode in the future when that gets implemented.

5. Ensure the hardware monitor performs a bmc password re-fetch on every
   provisioning change.

Test Plan:

PASS: Verify bmc access protocol store/fetched from the database (mtce_info)
PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
PASS: Verify bmc provisioning of ipmi and redfish servers
PASS: Verify learned bmc protocol persists over process restart and swact
PASS: Verify process startup with protocol already learned

Hardware Monitor:

PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
PASS: Verify sensor model delete and relearn over ip address change
PASS: Verify sensor model delete and relearn over bm_type change
PASS: Verify sensor model not relearned over username change
PASS: Verify bm pw is re-fetched over any (re)provisioning change
PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops)
PASS: Verify protocol change handling, file cleanup, model recreation
PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi
PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish
PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic
PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic
PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi
PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish
PASS: Verify sensor model creation waits for server power to be on
PASS: Verify sensor relearn by provisioning change during model creation. (soak)

Regression:

PASS: Verify host power off and on.
PASS: Verify BMC access alarm handling (assert and clear)
PASS: Verify mtcAgent and hwmond logs add value
PASS: Verify no core dumps / seg faults.
PASS: Verify no mtcAgent and hwmond memory leak.
PASS: Verify delete of BMC provisioned host
PASS: Verify sensor monitoring, alarming, degrade and then clear cycle
PASS: Verify static analysis report of changed modules.
PASS: Verify host level bm_type=bmc functions as would dynamic selection
PASS: Verify batch provisioning and deprovisioning (7 nodes)
PASS: Verify batch provisioning to different protocol (5 nodes)
PASS: Verify handling of flaky Redfish responses

PEND: Verify System Install

Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382
Closes-Bug: #1853471
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-12-09 09:39:49 -05:00


/*
* Copyright (c) 2015-2017 Wind River Systems, Inc.
*
* SPDX-License-Identifier: Apache-2.0
*
*
*
* @file
* Wind River Titanium Cloud Hardware Monitor "Sensor Model" Utilities
*
*
* These are the utilities that load, create, group and delete sensor models
*
*
* bmc_load_sensor_model ....... called by add_host_handler FSM
*
* bmc_create_sensor_model
*
* bmc_create_sample_model ... create model based on sample data
* bmc_create_groups
* bmc_create_sensors
* bmc_group_sensors
*
* bmc_create_quanta_model ... create model for Quanta server
* bmc_add_group
* load_profile_groups
* load_profile_sensors
* hwmon_group_sensors
*
* bmc_delete_sensor_model ..... called on model re-create
*
*****************************************************************************/
#include "daemon_ini.h" /* for ... parse_ini and MATCH */
#include "nodeBase.h" /* for ... mtce common definitions */
#include "jsonUtil.h" /* for ... json utilities */
#include "nodeUtil.h" /* for ... mtce common utilities */
#include "hwmonUtil.h" /* for ... get_severity */
#include "hwmonClass.h" /* for ... service class definition */
#include "hwmonHttp.h" /* for ... http module header */
#include "hwmonSensor.h" /* for ... this module header */
#include "hwmonBmc.h" /* for ... QUANTA_SENSOR_PROFILE_CHECKSUM */
/*****************************************************************************
*
* Name : bmc_create_sensor_model
*
* Description: Top level utility that creates a sensor model based on
* sample data.
*
* The caller has already determined if the sample set matches
* the special case Quanta server model. If it does then we
* use the Quanta sensor profile to create the model. Otherwise,
* the model is created based on sensor samples.
*
******************************************************************************/
int hwmonHostClass::bmc_create_sensor_model ( struct hwmonHostClass::hwmon_host * host_ptr )
{
int rc = PASS ;
ilog ("%s creating sensor model using %s:%s\n",
host_ptr->hostname.c_str(),
bmcUtil_getProtocol_str(host_ptr->protocol).c_str(),
host_ptr->bm_ip.c_str());
host_ptr->groups = 0 ;
/* If this is NOT a Quanta Server then ... */
if ( ! host_ptr->quanta_server )
{
/*
* Dynamically create a model based
* on the sensor sample reading data.
*/
rc = bmc_create_sample_model ( host_ptr );
}
/* Otherwise create the model based on the known Quanta sensor profile */
else
{
if ( ( rc = bmc_create_quanta_model ( host_ptr )) == PASS )
{
if ( host_ptr->groups >= MIN_SENSOR_GROUPS )
{
/*
* If this is a Quanta server then the best way to ensure the
* sensor profile is identical and backward compatible is to
* load the sensor profile from the legacy Quanta profile file.
*
* QUANTA_SENSOR_PROFILE_FILE
*/
struct sensor_group_type group_array [MAX_HOST_GROUPS] ;
sensor_type sensor_array [MAX_HOST_SENSORS];
int profile_groups ;
bool error = false ;
ilog ("%s provisioning Quanta server using %s\n",
host_ptr->hostname.c_str(), QUANTA_SENSOR_PROFILE_FILE );
profile_groups = load_profile_groups ( host_ptr, &group_array[0], MAX_HOST_GROUPS, error );
if (( error == false ) && ( profile_groups == host_ptr->groups ))
{
int profile_sensors;
for ( int g = 0 ; g < host_ptr->groups ; ++g )
{
/*
* Add the sensor label list to each host_ptr group[x].
*
* This list was fetched and attached to the group array
* in load_profile_groups.
*
* Having it prevents the need to parse the profile file
* again to associate the sensors to a group all over
* again inside load_profile_sensors
*/
host_ptr->group[g].sensor_labels = group_array[g].sensor_labels ;
blog ("%s '%s' group sensor list: %s\n",
host_ptr->hostname.c_str(),
host_ptr->group[g].group_name.c_str(),
host_ptr->group[g].sensor_labels.c_str());
}
ilog ( "%s %d profile groups loaded\n", host_ptr->hostname.c_str(), profile_groups );
profile_sensors = load_profile_sensors ( host_ptr, &sensor_array[0], MAX_HOST_SENSORS, error );
if (( error == false ) && ( profile_sensors ))
{
ilog ( "%s %d profile sensors loaded\n", host_ptr->hostname.c_str(), profile_sensors );
for ( int s = 0 ; s < profile_sensors ; ++s )
{
if (( rc = hwmonHttp_add_sensor ( host_ptr->hostname, host_ptr->event, sensor_array[s])) == PASS )
{
sensor_array[s].uuid = host_ptr->event.new_uuid ;
if (( rc = add_sensor ( host_ptr->hostname, sensor_array[s] )) == PASS )
{
blog ( "%s '%s' sensor added\n",
host_ptr->hostname.c_str(),
host_ptr->sensor[s].sensorname.c_str());
}
else
{
wlog ("%s '%s' sensor add failure (to hwmon)\n",
host_ptr->hostname.c_str(),
sensor_array[s].sensorname.c_str());
}
}
else
{
wlog ("%s '%s' sensor add failure (to sysinv)\n",
host_ptr->hostname.c_str(),
sensor_array[s].sensorname.c_str());
}
} /* end for loop */
}
else
{
elog ( "%s load_profile_sensors failed (error:%d) (%d)\n",
host_ptr->hostname.c_str(),
error,
profile_sensors );
}
}
else
{
elog ( "%s load_profile_groups failed (error:%d) (%d:%d)\n",
host_ptr->hostname.c_str(),
error,
profile_groups,
host_ptr->groups );
}
}
else
{
elog ("%s too few groups\n", host_ptr->hostname.c_str());
rc = FAIL_INVALID_DATA ;
}
}
else
{
elog ("%s failed to create group model (rc:%d)\n", host_ptr->hostname.c_str(), rc);
}
}
if (( rc == PASS ) && ( host_ptr->quanta_server))
{
/* Group all the sensors into the groups specified by the profile file */
rc = hwmonHostClass::hwmon_group_sensors ( host_ptr );
if ( rc == PASS )
{
ilog ("%s sensors grouped\n", host_ptr->hostname.c_str());
}
else
{
elog ("%s sensor grouping failed (rc:%d)\n", host_ptr->hostname.c_str(), rc );
}
plog ("%s sensor model created\n", host_ptr->hostname.c_str() );
}
if (( host_ptr->relearn == true ) ||
( host_ptr->interval < HWMON_MIN_AUDIT_INTERVAL ))
{
dlog ("%s requesting interval change (%d)\n",
host_ptr->hostname.c_str(),
host_ptr->interval );
host_ptr->interval_changed = true ;
}
/* make sure all sensors are updated with the group actions */
return (rc);
}
/******************************************************************************
*
* Name : bmc_create_sample_model
*
* Description: Create a sensor model based on sample data.
*
******************************************************************************/
int hwmonHostClass::bmc_create_sample_model ( struct hwmonHostClass::hwmon_host * host_ptr )
{
int rc = FAIL ;
if ( host_ptr->samples )
{
/* Start by creating a set of sensor groups based on sample data
* and specifically sensor type and save those groups in the database */
if ( ( rc = bmc_create_groups ( host_ptr ) ) == PASS )
{
/* add all the sensors to hwmon and save that in the database */
if ( ( rc = bmc_create_sensors ( host_ptr ) ) == PASS )
{
/* add the sensors to the groups and save that in the database */
rc = bmc_group_sensors ( host_ptr );
}
}
}
else
{
rc = FAIL_NO_DATA ;
elog ("%s failed sensor sample model create ; no sensor samples\n", host_ptr->hostname.c_str() );
}
return(rc);
}
/******************************************************************************
*
* Name : bmc_create_quanta_model
*
* Description: Create a static Quanta server sensor group model.
*
******************************************************************************/
int hwmonHostClass::bmc_create_quanta_model ( struct hwmonHostClass::hwmon_host * host_ptr )
{
int status = PASS ;
int rc = PASS ;
if ( host_ptr )
{
if ( host_ptr->quanta_server == true )
{
rc = bmc_add_group ( host_ptr , DISCRETE, "fan" , HWMON_CANNED_GROUP__FANS, "server fans", "show /SYS/fan");
if (( rc ) && ( !status )) status = rc ;
rc = bmc_add_group ( host_ptr , DISCRETE, "fan" , HWMON_CANNED_GROUP__FANS, "power supply fans", "show /SYS/fan");
if (( rc ) && ( !status )) status = rc ;
rc = bmc_add_group ( host_ptr , DISCRETE, "power" , HWMON_CANNED_GROUP__POWER, "server power", "show /SYS/powerSupply");
if (( rc ) && ( !status )) status = rc ;
rc = bmc_add_group ( host_ptr , DISCRETE, "temperature" , HWMON_CANNED_GROUP__TEMP, "server temperature", "show /SYS/temperature");
if (( rc ) && ( !status )) status = rc ;
rc = bmc_add_group ( host_ptr , DISCRETE, "voltage" , HWMON_CANNED_GROUP__VOLT, "server voltage", "show /SYS/voltage");
if (( rc ) && ( !status )) status = rc ;
}
}
return (status);
}
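
The `if (( rc ) && ( !status )) status = rc ;` idiom above keeps the first non-zero return code while still attempting every remaining group add. A minimal standalone sketch of that pattern follows. The names `run_all_keep_first_error` and `SKETCH_PASS` are hypothetical, and success is assumed to be 0, as the `!status` test implies.

```cpp
#include <functional>
#include <vector>

/* Assumption: success is 0, as the '!status' test in the idiom implies. */
constexpr int SKETCH_PASS = 0 ;

/* Run every step, even after a failure, and return the first non-zero code. */
int run_all_keep_first_error ( const std::vector<std::function<int()>> & steps )
{
    int status = SKETCH_PASS ;
    for ( const auto & step : steps )
    {
        int rc = step () ;
        /* latch only the first failure ; later failures do not overwrite it */
        if (( rc ) && ( !status )) status = rc ;
    }
    return status ;
}
```

The trade-off is that a later failure is not reported through the return value, but no step is skipped because an earlier one failed.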
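/******************************************************************************
*
* Name : bmc_delete_sensor_model
*
* Description: Delete the host's sensor groups and then its sensors, each
* from the end to the start. Called on model re-create.
*
* On a delete failure the retry counter is incremented and the
* error is returned so that the caller can retry.
*
******************************************************************************/
int hwmonHostClass::bmc_delete_sensor_model ( struct hwmonHostClass::hwmon_host * host_ptr )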
{
int rc = PASS ;
if ( host_ptr->relearn_retry_counter == 0 )
{
ilog ("%s ... saving group customizations\n",
host_ptr->hostname.c_str());
this->save_model_attributes ( host_ptr );
ilog ("%s ... clearing existing assertions\n",
host_ptr->hostname.c_str());
this->clear_bm_assertions ( host_ptr );
blog ("%s ... deleting sensor model\n",
host_ptr->hostname.c_str());
}
/* Delete the groups from the end to the start.
* If there is a failure then exit and the caller will retry.
*/
if ( host_ptr->groups )
{
for ( int g = host_ptr->groups-1 ;
host_ptr->groups != 0 ;
host_ptr->groups-- , g-- )
{
daemon_signal_hdlr ();
int rc_temp = hwmonHttp_del_group ( host_ptr->hostname,
host_ptr->event,
host_ptr->group[g] );
if ( rc_temp )
{
elog ("%s %s group delete failed (rc:%d) (%d)\n",
host_ptr->hostname.c_str(),
host_ptr->group[g].group_name.c_str(),
rc_temp, g );
host_ptr->relearn_retry_counter++ ;
return (rc_temp);
}
else
{
blog ("%s %s (index:%d)\n",
host_ptr->hostname.c_str(),
host_ptr->group[g].group_name.c_str(), g );
if ( host_ptr->group[g].timer.init == TIMER_INIT_SIGNATURE )
{
mtcTimer_reset ( host_ptr->group[g].timer );
}
hwmonGroup_init ( host_ptr->hostname, &host_ptr->group[g]);
}
}
}
/* Delete the sensors from the end to the start.
* If there is a failure then exit and the caller will retry.
*/
if ( host_ptr->sensors )
{
for ( int s = host_ptr->sensors-1 ;
host_ptr->sensors != 0 ;
host_ptr->sensors-- , s-- )
{
daemon_signal_hdlr ();
int rc_temp = hwmonHttp_del_sensor ( host_ptr->hostname,
host_ptr->event,
host_ptr->sensor[s] );
if ( rc_temp )
{
elog ("%s %s sensor delete failed (rc:%d) (%d)\n",
host_ptr->hostname.c_str(),
host_ptr->sensor[s].sensorname.c_str(),
rc_temp, s );
host_ptr->relearn_retry_counter++ ;
return (rc_temp);
}
else
{
blog ("%s %s (index:%d)\n",
host_ptr->hostname.c_str(),
host_ptr->sensor[s].sensorname.c_str(), s );
hwmonSensor_init ( host_ptr->hostname, &host_ptr->sensor[s]);
sensor_data_init ( host_ptr->sample[s] );
if ( host_ptr->sensors == 1 )
{
host_ptr->quanta_server = false ;
host_ptr->sensors =
host_ptr->samples =
host_ptr->profile_sensor_checksum =
host_ptr->sample_sensor_checksum =
host_ptr->last_sample_sensor_checksum = 0 ;
break ;
}
}
}
}
if (( host_ptr->sensors == 0 ) && ( host_ptr->groups == 0 ))
{
plog ("%s sensor model deleted\n", host_ptr->hostname.c_str() );
}
else
{
elog ("%s sensor model delete failed (%d:%d)\n",
host_ptr->hostname.c_str(),
host_ptr->groups,
host_ptr->sensors );
rc = FAIL ;
}
return (rc);
}
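
The delete loops above walk each list from the end to the start and return on the first failure so the caller can retry with whatever is left. A reduced sketch of that resume-on-retry behavior follows. The `sketch_model` type and `delete_from_end` function are hypothetical stand-ins, not hardware monitor code.

```cpp
#include <functional>
#include <string>
#include <vector>

/* Hypothetical stand-in for a host's group or sensor list. */
struct sketch_model
{
    std::vector<std::string> items ;
    int retries = 0 ;
};

/* Delete items from the end to the start. On the first failure, bump the
 * retry counter and return that code. Items already deleted stay deleted,
 * so a later call resumes with the remaining items. */
int delete_from_end ( sketch_model & model,
                      const std::function<int(const std::string &)> & del )
{
    while ( !model.items.empty() )
    {
        int rc = del ( model.items.back() ) ;
        if ( rc )
        {
            model.retries++ ;
            return rc ;
        }
        model.items.pop_back() ;
    }
    return 0 ;
}
```

Because the count shrinks only after a successful delete, a retry never revisits an entry that was already removed.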
/* *************************************************************************
*
* Name : bmc_load_sensor_model
*
* Description: Called from the add_handler to load sensors and groups
* for the specified host from the sysinv database.
*
* Warnings : Will return a failure and swerr if called with an
* already loaded sensor profile.
*
* Assumptions: Inservice sensor model reprovisioning is done with
* bmc_delete_sensor_model and bmc_create_sensor_model API.
*
*
* Scope : private hwmonHostClass
*
* Parameters : host_ptr
*
* Returns : TODO: handle modify errors better.
*
* *************************************************************************/
int hwmonHostClass::bmc_load_sensor_model ( struct hwmonHostClass::hwmon_host * host_ptr )
{
int rc ;
if (( host_ptr->sensors ) || ( host_ptr->groups ))
{
elog ("%s already has %d sensors across %d groups loaded - reloading\n",
host_ptr->hostname.c_str(),
host_ptr->sensors,
host_ptr->groups );
this->hwmon_del_sensors ( host_ptr );
this->hwmon_del_groups ( host_ptr );
rc = FAIL_INVALID_OPERATION ;
}
else
{
/* Load already provisioned sensors from the database
* into host_ptr->sensor list.
*
* Warning: This is a blocking call and always has been.
*/
rc = hwmonHttp_load_sensors ( host_ptr->hostname, host_ptr->event );
if ( rc == PASS )
{
daemon_signal_hdlr (); /* service the signals */
if ( host_ptr->sensors != 0 )
{
/* Load already provisioned groups from the database
* into host_ptr->group list */
rc = hwmonHttp_load_groups ( host_ptr->hostname, host_ptr->event );
if ( rc == PASS )
{
/* update sample severity to avoid state change
* from fail to ok to fail over a process restart */
for ( int s = 0 ; s < host_ptr->sensors ; s++ )
{
host_ptr->sensor[s].sample_severity = get_severity(host_ptr->sensor[s].status) ;
host_ptr->sensor[s].sample_status =
host_ptr->sensor[s].sample_status_last = host_ptr->sensor[s].status ;
}
rc = hwmonHostClass::hwmon_group_sensors ( host_ptr );
if ( rc == PASS )
{
blog ("%s sensors grouped\n", host_ptr->hostname.c_str());
}
else
{
wlog ("%s sensor grouping failed (in hwmon) (rc:%d)\n", host_ptr->hostname.c_str(), rc );
}
}
else
{
wlog ("%s sensor group load failed (from sysinv) (rc:%d)\n", host_ptr->hostname.c_str(), rc );
}
}
}
else
{
wlog ("%s sensors load failed (from sysinv) (rc:%d)\n", host_ptr->hostname.c_str(), rc );
}
}
if ( rc == PASS )
{
if (( host_ptr->sensors ) && ( host_ptr->groups ))
{
ilog ("%s has %d sensors across %d groups (in sysinv)\n",
host_ptr->hostname.c_str(),
host_ptr->sensors,
host_ptr->groups );
/* initialize sensor data */
for ( int i = 0 ; i < host_ptr->sensors ; ++i )
{
host_ptr->sensor[i].severity = get_severity ( host_ptr->sensor[i].status );
}
host_ptr->profile_sensor_checksum =
checksum_sensor_profile ( host_ptr->hostname,
host_ptr->sensors,
&host_ptr->sensor[0]);
ilog ("%s database profile checksum : %04x (%d sensors)\n",
host_ptr->hostname.c_str(),
host_ptr->profile_sensor_checksum,
host_ptr->sensors);
if ((( host_ptr->profile_sensor_checksum == QUANTA_SENSOR_PROFILE_CHECKSUM ) ||
( host_ptr->profile_sensor_checksum == QUANTA_SENSOR_PROFILE_CHECKSUM_13_53 )) &&
(( host_ptr->sensors == QUANTA_PROFILE_SENSORS ) || ( host_ptr->sensors == QUANTA_PROFILE_SENSORS_REVISED_1 )) &&
( host_ptr->groups == QUANTA_SENSOR_GROUPS ))
{
ilog ("%s ---------------------------------------------\n", host_ptr->hostname.c_str());
ilog ("%s is a Quanta server with legacy sensor profile\n", host_ptr->hostname.c_str());
ilog ("%s ---------------------------------------------\n", host_ptr->hostname.c_str());
host_ptr->quanta_server = true ;
}
else
{
ilog ("%s has unique sensor model\n", host_ptr->hostname.c_str());
}
}
else
{
/* Incomplete or no sensor/group model found in database */
ilog ("%s no valid sensor model found (in sysinv) (sensors:%d groups:%d)\n",
host_ptr->hostname.c_str(),
host_ptr->sensors,
host_ptr->groups );
if (( host_ptr->sensors ) || (host_ptr->groups ))
{
wlog ("%s has a corrupt sensor profile ; deleting ...\n", host_ptr->hostname.c_str());
bmc_delete_sensor_model ( host_ptr );
}
}
}
return (rc);
}
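
The legacy Quanta check in bmc_load_sensor_model combines a checksum match, a sensor count match and a group count match. The sketch below isolates that test as a predicate in which each alternative is compared against the field explicitly. The function name and all constant values are placeholders chosen for illustration. The real constants are defined elsewhere in the hardware monitor and are not shown in this file.

```cpp
/* Placeholder values only ; these are NOT the real Quanta constants. */
constexpr unsigned int SKETCH_CHECKSUM_A       = 0x1111 ;
constexpr unsigned int SKETCH_CHECKSUM_B       = 0x2222 ;
constexpr int          SKETCH_SENSORS          = 10 ;
constexpr int          SKETCH_SENSORS_REVISED  = 12 ;
constexpr int          SKETCH_GROUPS           = 5 ;

/* True only when the checksum, the sensor count and the group count
 * each match one of the accepted legacy profile values. */
bool is_legacy_quanta_profile ( unsigned int checksum, int sensors, int groups )
{
    const bool checksum_match = ( checksum == SKETCH_CHECKSUM_A ) ||
                                ( checksum == SKETCH_CHECKSUM_B ) ;
    const bool sensors_match  = ( sensors == SKETCH_SENSORS ) ||
                                ( sensors == SKETCH_SENSORS_REVISED ) ;
    return ( checksum_match && sensors_match && ( groups == SKETCH_GROUPS )) ;
}
```

Naming the intermediate results makes it easier to see that a bare constant never stands in for a comparison.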