148 Commits

Author SHA1 Message Date
Eric MacDonald
da398e0c5f Debian: Make Mtce offline handler more resilient to slow shutdowns
The current offline handler assumes the node is offline after
'offline_search_count' reaches 'offline_threshold' count
regardless of whether mtcAlive messages were received during
the search window.

The offline algorithm requires that no mtcAlive messages
be seen for the full offline_threshold count.

During a slow shutdown the mtcClient runs for longer than
it should and as a result can lead to maintenance seeing
the node as recovered before it should.

This update manages the offline search counter to ensure that
it only reaches the count threshold after seeing no mtcAlive
messages for the full search count. Any mtcAlive message seen
during the count triggers a count reset.
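
The counter behavior described above can be sketched as follows. This
is a minimal illustration only, not the mtcAgent code; the class and
method names are hypothetical and only 'offline_search_count' and
'offline_threshold' come from the text.

```python
class OfflineSearch:
    """Hypothetical model of the offline search counter."""

    def __init__(self, offline_threshold):
        self.offline_threshold = offline_threshold
        self.offline_search_count = 0

    def audit(self, mtc_alive_seen):
        """Run one search interval; return True once the node is offline."""
        if mtc_alive_seen:
            # Any mtcAlive seen inside the window restarts the search.
            self.offline_search_count = 0
            return False
        self.offline_search_count += 1
        return self.offline_search_count >= self.offline_threshold
```

With this shape the threshold can only be reached after a full run of
consecutive intervals with no mtcAlive message.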

This update also
1. Adjusts the reset retry cadence from 7 to 12 secs
   to prevent unnecessary reboot thrash during
   the current shutdown.
2. Clears the hbsClient ready event at the start of the
   subfunction handler so the heartbeat soak is only
   started after seeing heartbeat client ready events
   that follow the main config.

Test Plan:

PASS: Debian and CentOS Build and DX install
PASS: Verify search count management
PASS: Verify issue does not occur over lock/unlock soak (100+)
      - where the same test without update did show issue.
PASS: Monitor alive logs for behavioral correctness
PASS: Verify recovery reset occurs after expected extended time.

Closes-Bug: 1993656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
2022-10-24 15:57:43 +00:00
Eric MacDonald
3f4c2cbb45 Mtce: Add ActionInfo extension support for reset operations.
StarlingX Maintenance supports host power and reset control through
both IPMI and Redfish Platform Management protocols when the host's
BMC (Board Management Controller) is provisioned.

The power and reset action commands for Redfish are learned through
HTTP payload annotations at the Systems level; "/redfish/v1/Systems".

The existing maintenance implementation only supports the
"ResetType@Redfish.AllowableValues" payload property annotation at
the #ComputerSystem.Reset Actions property level.

However, the Redfish schema also supports an 'ActionInfo' extension
at /redfish/v1/Systems/1/ResetActionInfo.

This update adds support for the 'ActionInfo' extension for Reset
and power control command learning.
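
A rough sketch of the two-level learning, assuming the standard
Redfish property names ("@Redfish.ActionInfo", "Parameters",
"AllowableValues"). The function name and the 'fetch' callable are
hypothetical and do not reflect the mtce implementation.

```python
def learn_reset_actions(system, fetch):
    """Return the allowable ResetType values for a Systems resource.

    system: decoded JSON of the Systems member.
    fetch:  hypothetical callable that GETs a Redfish path, returns JSON.
    """
    reset = system.get("Actions", {}).get("#ComputerSystem.Reset", {})
    # First level: annotation directly on the Reset action.
    values = reset.get("ResetType@Redfish.AllowableValues")
    if values:
        return list(values)
    # Second level: follow the ActionInfo link and read its Parameters.
    info_path = reset.get("@Redfish.ActionInfo")
    if not info_path:
        return []
    info = fetch(info_path)
    for param in info.get("Parameters", []):
        if param.get("Name") == "ResetType":
            return list(param.get("AllowableValues", []))
    return []
```

An empty list here corresponds to the 'missing second level' failure
cases listed in the test plan.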

For more information refer to the section 6.3 ActionInfo 1.3.0 of
the Redfish Data Model Specification link in the launchpad report.

Test Plan:

PASS: Verify CentOS build and patch install.
PASS: Verify Debian build and ISO install.
PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0
PASS: Verify reset/power control cmd load from newly added second
      level query from ActionInfo service.

Failure Handling: Significant failure path testing with this update

PASS: Verify Redfish protocol is periodically retried from start
      when bm_type=redfish fails to connect.
PASS: Verify BMC access protocol defaults to IPMI when
      bm_type=dynamic but failed connect using redfish.
      Connection failures in the above cases include
      - redfish bmc root query fails
      - redfish bmc info query fails
      - redfish bmc load power/reset control actions fails
      - missing second level Parameters label list
      - missing second level AllowableValues label list
PASS: Verify sensor monitoring is relearned to ipmi from failed and
      retried with bm_type=redfish after switch to bm_type=dynamic
      or bm_type=ipmi by sysinv update command.

Regression:

PASS: Verify with CentOS redfishtool 1.1.0
PASS: Verify switch back and forth between ipmi and redfish using
      update bm_type=ipmi and bm_type=redfish commands
PASS: Verify switch from ipmi to redfish using bm_type=dynamic for
      hosts that support redfish
PASS: Verify redfish protocol is preferred in bm_type=dynamic mode
PASS: Verify IPMI sensor monitoring when bm_type=ipmi
PASS: Verify IPMI sensor monitoring when bm_type=dynamic
      and redfish connect fails.
PASS: Verify redfish sensor event assert/clear handling with
      alarm and degrade condition for both IPMI and redfish.
PASS: Verify reset/power command learn by single level query.
PASS: Verify mtcAgent.log logging

Closes-Bug: 1992286
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35
2022-10-13 17:40:05 +00:00
Zuul
ad1c87669f Merge "Debian: Redfishtool requests fail when IPV4 address has square brackets" 2022-10-11 15:41:59 +00:00
Eric MacDonald
db0b4ccadd Debian: Redfishtool requests fail when IPV4 address has square brackets
Redfishtool was introduced in CentOS for maintenance power control
and sensor monitoring. Both IPV4 and IPV6 addressing is supported.

The initial integration exposed an issue where square brackets were
required around the BMC IP address for IPV6 addressing. At the time
it was simpler to add the brackets for IPV4 as well.

    redfishtool -S Always -T 30 -r [${BM_IP}] root

However, the python3 version of redfishtool, introduced in Debian,
rejects requests with square brackets around IPV4 addresses.

    redfishtool -v -S Always -T 20 -r [${BM_IP}] root
    # Main: Error: rc=5

This update introduces a utility to the mtce msgClass module used
to distinguish between IPV4 and IPV6 addresses. The redfish request
create utility is updated to use this new utility when creating the
redfishtool request without adding the square brackets in Debian
for BMC's provisioned with IPV4 addressing.
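
The address check can be illustrated with the Python standard library.
This is only a sketch of the idea; the mtce utility itself is C++ and
the function name below is hypothetical.

```python
import ipaddress


def redfishtool_host(bm_ip):
    """Return the '-r' host argument: brackets only for IPv6."""
    addr = ipaddress.ip_address(bm_ip)  # raises ValueError if invalid
    return f"[{bm_ip}]" if addr.version == 6 else bm_ip
```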

Update testing revealed that the Debian based python3 version of
redfishtool takes a few seconds longer compared to the python2 in
CentOS. This exposes a timer race condition during sensor monitoring.
The BMC pthread is currently given 60 seconds to complete its
requests. However, unlike sensor monitoring using ipmi which uses
one request, redfish requires two requests.

Unfortunately, requests using the Debian python3 version of
redfishtool sometimes take longer than 30 seconds. If the two
requests together take longer than the current max timeout of 60
seconds then that is treated as a pthread timeout error condition
causing the hardware monitor to enter an error state for that host
which requires it to go through a full reconnection algorithm.

Given this additional issue this update also increases the BMC
thread timeout from 60 to 100 seconds to avoid needless reconnections
when using the mildly slower Debian python3 redfishtool.

Test Plan:

PASS: Verify Build Debian and CentOS iso images
PASS: Verify Patch CentOS change
PASS: Verify Install Debian image

For both Debian and CentOS

PASS: Verify redfish sensor monitoring over IPV4
PASS: Verify ipmi sensor monitoring over IPV4
PASS: Verify redfish sensor monitoring over IPV6
PASS: Verify ipmi sensor monitoring over IPV6
PASS: Verify no redfish connection failure/recovery errors; 3 hr soak

Regression:

PASS: Verify sensor model relearn by command and reprovision
PASS: Verify system host-modify <id> bm_type change handling
PASS: Verify redfish critical sensor assert/clear handling
PASS: Verify ipmi critical sensor assert/clear handling
PASS: Verify mtcAgent and hwmond logging

Closes-Bug: 1991819
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I3b69cc4f19c580687cd91b4f2033f6019be87e5e
2022-10-06 22:21:38 +00:00
Girish Subramanya
86681b7598 Alarm Hostname controller function has in-service failure reported
When compute services remain healthy:
 - listing alarms shall not refer to the below Obsoleted alarm
 - 200.012 alarm hostname controller function has an in-service failure

This update deletes the definition of the obsoleted alarm 200.012
from the events.yaml file and updates any references to that alarm
definition.
A Bug also needs to be raised to track the Doc change.

Test Plan:
Verify on a Standard configuration no alarms are listed for
hostname controller in-service failure
Code (removal) changes exercised with fix prior to ansible bootstrap
and host-unlock and verify no unexpected alarms
Regression:
There is no need to test the alarm referred to here as it is obsolete

Closes-Bug: 1991531

Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com>

Change-Id: I255af68155c5392ea42244b931516f742fa838c3
2022-10-05 10:30:01 -04:00
Eric MacDonald
aaf9d08028 Mtce: Fix bmc password fetch error handling
The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.

With that issue fixed the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process which
then segfaults for a similar reason.

In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.

This update

 1. Prevents the segfault case by properly managing acquired
    json-c object releases. There was one in the mtcAgent and
    another in the hardware monitor (hwmond).

    The json_object_put object release api should only be called
    against objects that were created with very specific apis.
    See new comments in the code.

 2. Avoids log flooding error case by performing a password size
    check rather than assume the password is valid following the
    secret payload receive stage.

 3. Simplifies the secret fsm and error and retry handling.

 4. Deletes useless creation and release of a few unused json
    objects in the common jsonUtil and hwmonJson modules.

Note: This update temporarily disables sensor and sensorgroup
      suppression support for the debian hardware monitor while
      a suppression type fix in sysinv is being investigated.

Test Plan:

PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify all of jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
             alarming and relearning. This test requires suppress
             disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish and switch between them with
             just bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
             failure and success path handling with no process
             failures

Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
2022-06-01 15:21:05 +00:00
Tracey Bogue
0551c665cb Add Debian packaging for mtce packages
Some of the code used TRUE instead of true which did not compile
for Debian. These instances were changed to true.
Some #define constants generated narrowing errors because their
values are negative in a 32 bit integer. These values were
explicitly cast to int in the case statements causing the errors.

Story: 2009101
Task: 43426

Signed-off-by: Tracey Bogue <tracey.bogue@windriver.com>
Change-Id: Iffc4305660779010969e0c506d4ef46e1ebc2c71
2021-10-29 09:17:00 -05:00
Eric MacDonald
48978d804d Improved maintenance handling of spontaneous active controller reboot
Performing a forced reboot of the active controller sometimes
results in a second reboot of that controller. The cause of the
second reboot was due to its reported uptime in the first mtcAlive
message, following the reboot, as greater than 10 minutes.

Maintenance has a long standing graceful recovery threshold of
10 minutes. Meaning that if a host loses heartbeat and enters
Graceful Recovery, if the uptime value extracted from the first
mtcAlive message following the recovery of that host exceeds 10
minutes, then maintenance interprets that the host did not reboot.
If a host goes absent for longer than this threshold then for
reasons not limited to security, maintenance declares the host
as 'failed' and force re-enables it through a reboot.

With the introduction of containers and addition of new features
over the last few releases, boot times on some servers are
approaching the 10 minute threshold and in this case exceeded
the threshold.

The primary fix in this update is to increase this long standing
threshold to 15 minutes to account for evolution of the product.
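
As a rough illustration of the threshold check (names and shape are
hypothetical, not the mtcAgent code), a host whose first reported
uptime exceeds the threshold is failed and force re-enabled:

```python
OLD_THRESHOLD_SECS = 10 * 60
NEW_THRESHOLD_SECS = 15 * 60


def needs_forced_reenable(uptime_secs, threshold_secs=NEW_THRESHOLD_SECS):
    """True when the first mtcAlive uptime exceeds the threshold."""
    return uptime_secs > threshold_secs
```

A server that takes 12 minutes to boot trips the old 10 minute
threshold but not the new 15 minute one.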

During the debug of this issue a few other related undesirable
behaviors related to Graceful Recovery were observed with the
following additional changes implemented.

 - Remove hbsAgent process restart in ha service management
   failover failure recovery handling. This change is in the
   ha git with a loose dependency placed on this update.
   Reason: https://review.opendev.org/c/starlingx/ha/+/788299

 - Prevent the hbsAgent from sending heartbeat clear events
   to maintenance in response to a heartbeat stop command.
   Reason: Maintenance receiving these clear events while in
           Graceful Recovery causes it to pop out of graceful
           recovery only to re-enter as a retry and therefore
           needlessly consumes one (of a max of 5) retry count.

 - Prevent successful Graceful Recovery until all heartbeat
   monitored networks recover.
   Reason: If heartbeat of one network, say cluster, recovers but
           another (management) does not then it's possible the
           max Graceful Recovery Retries could be reached quite
           quickly, while one network recovered but the other
           may not have, causing maintenance to fail the host and
           force a full enable with reboot.

 - Extend the wait for the hbsClient ready event in the graceful
   recovery handler timeout from 1 minute to worker config timeout.
   Reason: To give the worker config time to complete before force
           starting the recovery handler's heartbeat soak.

 - Add Graceful Recovery Wait state recovery over process restart.
   Reason: Avoid double reboot of Gracefully Recovering host over
           SM service bounce.

 - Add requirement for a valid out-of-band mtce flags value before
   declaring configuration error in the subfunction enable handler.
   Reason: rebooting the active controller can sometimes result in
           a falsely reported configuration error due to the
           subfunction enable handler interpreting a zero value as
           a configuration error.

 - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
   Reason: To assist log analysis and issue debug

Test Plan:

PASS: Verify handling active controller reboot
             cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
             cases: with and without timeout, with and without bmc
             cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
             cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot to standby
             takeover that is up for less than 15 minutes.

Regression:

PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
             cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
             cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
             cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby
             controller' and worker nodes in AIO Plus.

PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing

Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-30 15:35:53 +00:00
Eric MacDonald
7539d36c3f Prevent mtcClient from sending to uninitialized socket in AIO SX
The mtcClient will perform a socket reinit if it detects a socket
failure. The mtcClient also avoids setting up its controller-1
cluster network socket for the AIO SX system type ; because there
is no controller-1 provisioned.

Most AIO SX systems have the management/cluster networks set to
the 'loopback' interface. However, when an AIO SX system is setup
with its management and cluster networks on physical interfaces,
with or without vlan, the mtcAlive send message utility will try
to send to the uninitialized controller-1 cluster socket. This
leads to a socket error that triggers a socket reinitialization
loop which causes log flooding.

This update adds a check to the mtcAlive send utility to avoid
sending mtcAlive to controller-1 for AIO SX system type where
there is no controller-1 provisioned; no send, no error, no flood.
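
The added check amounts to a guard on the destination list. The
sketch below is illustrative only; the function name and the
"AIO-SX" string are assumptions, not identifiers from the mtce code.

```python
def mtcalive_targets(system_type):
    """Return the controllers an mtcAlive message is sent to."""
    targets = ["controller-0"]
    # AIO SX has no controller-1 provisioned, so its socket is never
    # set up and must not be used as a send target.
    if system_type != "AIO-SX":
        targets.append("controller-1")
    return targets
```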

Since this update needed to add a system type check, this update
also implemented a system type definition rename from CPE to AIO.
Other related definitions and comments were also changed to make
the code base more understandable and maintainable

Test Plan:

PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
PASS: Verify AIO SX locked-disabled-online state
PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)

Regression:

PASS: Verify AIO SX Lock and Unlock (lazy reboot)
PASS: Verify AIO DX and DC install with pv regression and sanity
PASS: Verify Standard system install with pv regression and sanity

Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
Closes-Bug: 1897334
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-21 10:20:10 -04:00
Eric MacDonald
5c83453fdf Fix Graceful Recovery handling while in Graceful Recovery handling
The current Graceful Recovery handler is not properly handling
back-to-back Multi Node Failure Avoidance (MNFA) events.

There are two phases to MNFA

 phase 1: waiting for number of failed nodes to fall below
          mnfa_threshold as each affected node's heartbeat
          is recovered.
 phase 2: then a Graceful Recovery Wait period which is an
          11 second heartbeat soak to verify that a stable
          heartbeat is regained before declaring the MNFA
          event complete.

The Graceful Recovery Wait status of one or more affected nodes
has been seen to be left uncleared (stuck) on one or more of the
affected nodes if phase 2 of MNFA is interrupted by another MNFA
event ; aka MNFA Nesting.

Although this stuck status is not service affecting it does leave
one or more nodes' host.task field, as observed under host-show,
with "Graceful Recovery Wait" rather than empty.

This update makes Multi Node Failure Avoidance (MNFA) handling
changes to ensure that, upon MNFA exit, the recovery handler
is properly restarted if MNFA Nesting occurs.

Two additional Graceful Recovery phase issues were identified
and fixed by this update.

 1. Cut Graceful recovery handling in half

    - Found and removed a redundant 11 second heartbeat soak
      at the very end of the recovery handler.
    - This cuts the graceful recovery handling time down from
      22 to 11 seconds thereby cutting potential for nesting
      in half.

 2. Increased supported Graceful Recovery nesting from 3 to 5

    - Found that some links bounce more than others so a nesting
      count of 3 can lead to an occasional single node failure.
    - This adds a bit more resiliency to MNFA handling of cases
      that exhibit more link messaging bounce.

Test Plan: Verified 60+ MNFA occurrences across 4 different
           system types including AIO plus, Standard and Storage

PASS: Verify Single Node Graceful Recovery Handling
PASS: Verify Multi Node Graceful Recovery Handling
PASS: Verify Single Node Graceful Recovery Nesting Handling
PASS: Verify Multi Node Graceful Recovery Nesting Handling
PASS: Verify MNFA of up to 5 nests can be gracefully recovered
PASS: Verify MNFA of 6 nests lead to full enable of affected nodes
PASS: Verify update as a patch
PASS: Verify mtcAgent logging

Regression:

PASS: Verify standard system install
PASS: Verify product verification maintenance regression (4 runs)
PASS: Verify MNFA threshold increase and below threshold behavior
PASS: Verify MNFA with reduced timeout behavior for
      ... nested case that does not timeout
      ... case that does not timeout
      ... case that does timeout

Closes-Bug: 1892877
Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-17 14:25:19 -04:00
Eric MacDonald
4f5bf78f55 Improve mtcAgent interrupted thread cleanup
A BMC command send will be rejected if its thread
is not in the IDLE state going into the call.

This issue is seen to occur over a reprovisioning action
while the bmc access alarmable condition exists.

Maintenance will do retries. So the only visible side effect
of this issue is a failure to provision to 'redfish' over a
provisioning switch to 'dynamic' (learn mode). Instead
ipmi is selected.

The non-return to idle can occur when the bmc handler FSM
is interrupted by a reprovisioning request while a bmc
command is in flight.

This update enhances the thread management module by
introducing a thread consumption utility that is called
by the bmc command send utility. If the send finds that
its thread is not in the IDLE state it will either kill
the thread if it is running or free a completed but-not-
consumed thread result.
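
A simplified model of that consume-before-send step is sketched
below. All names are hypothetical and the 'kill' is only a flag here;
the real thread management module is C++.

```python
IDLE, RUNNING, DONE = "idle", "running", "done"


class BmcThread:
    """Hypothetical per-host thread record."""

    def __init__(self):
        self.stage = IDLE
        self.result = None
        self.killed = False
        self.command = None


def consume_thread(thread):
    """Return the thread to IDLE before a new command is sent."""
    if thread.stage == RUNNING:
        thread.killed = True   # stand-in for killing the running thread
    elif thread.stage == DONE:
        thread.result = None   # free the completed-but-not-consumed result
    thread.stage = IDLE


def send_command(thread, command):
    """Send a BMC command, cleaning up an interrupted thread first."""
    if thread.stage != IDLE:
        consume_thread(thread)
    thread.command = command
    thread.stage = RUNNING
    return True
```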

Note: Maintenance only supports the execution of
a single thread per host per process at one time.

Test Plan:

PASS: Verify BMC provisioning change from ipmi to dynamic
      while the ipmi provisioning was failing prior to
      re-provisioning. Verify the previous error is cleaned
      up and the reprovisioning request succeeds as expected.

PASS: Verify thread 'execution timeout kill' cleanup handling.
PASS: Verify thread 'complete but not consumed' cleanup handling.
PASS: Verify logging during regression soaks

Regression:

PASS: Verify bmc protocol reprovisioning script soak
PASS: Verify sensor monitoring following BMC reprovisioning
PASS: Verify product verification mtce regression test suite

Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178
Closes-Bug: 1864906
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-15 10:51:16 -04:00
Eric MacDonald
9ab726b0eb Add support for peer controller reset via mtcClient
This update adds the ability for SM to passively
request the mtcClient to BMC reset its peer controller
as a means to recover a severely loaded active controller.

To do this the mtcAgent is modified to keep the controllers'
mtcClients updated with the BMC info of its peer.

The mtcClient is modified to audit for the SM signal
and then when asserted issue a BMC reset of its peer
controller using ipmitool system call.

The ability to command the peer mtcClient to 'sync'
prior to the BMC reset is implemented but configured
disabled for now.

Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
Partial-Bug: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-01-14 16:44:14 -05:00
Eric MacDonald
8c81914ea5 Add SM process heartbeat and status to the hbs cluster
This update is the mtc hbsAgent side of a new
SM -> hbsAgent heartbeat algorithm for the
purpose of detecting peer SM process stalls.

This update adds an 'SM Heartbeat status' bit to
the cluster view it injects into its multicast
heartbeat requests.

Its peer is able to read this on-going hbsAgent/SM
heartbeat status through the cluster.

The status bit reads 'ok' while the hbsAgent sees
the SM heartbeat as steady.

The status bit reads 'not ok' while the SM heartbeat
is lost for longer than 800 msecs.
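
The status decision reduces to a timestamp comparison. The sketch
below is an assumption about its shape, with hypothetical names; only
the 800 msec limit comes from the text.

```python
SM_HEARTBEAT_LOSS_MSECS = 800


def sm_heartbeat_ok(now_msecs, last_sm_pulse_msecs,
                    limit_msecs=SM_HEARTBEAT_LOSS_MSECS):
    """'ok' until the SM heartbeat has been lost for longer than the limit."""
    return (now_msecs - last_sm_pulse_msecs) <= limit_msecs
```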

Change-Id: I0f2079b0fafd7bce0b97ee26d29899659d66f81d
Partial-Fix: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-12-10 11:13:13 -05:00
Eric MacDonald
1350502720 Make Mtce Power-Off FSM verify power-off
If a host's BMC server accepts a power-off command without
error but does not actually power-off the host, the power-off
FSM reports success yet the host power is still on.

This update adds a verification component to the power-off
FSM. Once the power-off command is issued and succeeds at the
command level, the power-off FSM will now query power status
and retry the power-off command until the server is verified
to be powered-off or the retry max (10) is reached and the
power-off command is failed.
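
The verify-and-retry loop might look like the following sketch. The
two callables are hypothetical stand-ins for the BMC requests, and
treating the limit of 10 as total attempts is an assumption.

```python
def power_off_with_verify(send_power_off, query_power_state, max_retries=10):
    """Retry power-off until the host is verified off or retries run out.

    Returns (verified_off, attempts_used).
    """
    for attempt in range(1, max_retries + 1):
        # Command-level success alone is not trusted; query the state.
        if send_power_off() and query_power_state() == "off":
            return True, attempt
    return False, max_retries
```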

Test Plan:

PASS: Verify 200+ Mtce Power Off/On cycles (ipmi & redfish)
PASS: Verify 100+ Mtce Reinstalls with FIT (ipmi & redfish)

Change-Id: Iddd120d89d1152fc0b26915df123f586c38b909b
Closes-Bug: 1865087
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-22 13:38:33 +00:00
Eric MacDonald
1196056612 Disable Redfish BMC audit and improve reinstall failure handling
The Mtce Reinstall Handler can collide with the BMC Redfish
audit resulting in reinstall failure. The BMC handler's 2 minute
connection audit can collide with other BMC commands.

The reinstall handler, with 4 bmc command operations, is
particularly susceptible.

Two additional bmc communication improvements are implemented:

1. Add 'retry' handling to all BMC requests in the Maintenance
   Reinstall Handler FSM to handle transient command failures.

   Note: There are already retries to all but the power status
   query and the netboot requests in that handler and retries
   in other administrative commands that involve bmc requests.

2. Switch BMC power control command management from 'static' to
   'learned' lists. Some BMCs don't support both graceful and
   immediate power commands; Graceful Restart and Force Restart.
   To remove the possibility of using an unsupported BMC command,
   this update switches from static to learned power command lists
   with log produced if a server is missing command support.

   Power commands escalate from graceful to immediate in the
   presence of retries.
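
Command selection from a learned list, with escalation on retry, can
be sketched as below. 'GracefulRestart' and 'ForceRestart' are
standard Redfish ResetType values; the function itself is a
hypothetical illustration.

```python
GRACEFUL = ("GracefulRestart",)
IMMEDIATE = ("ForceRestart",)


def pick_reset_command(learned, retries):
    """Pick a reset command the BMC reported it supports.

    The first try prefers a graceful command; retries escalate to an
    immediate one. Returns None if the server supports neither.
    """
    graceful = [c for c in learned if c in GRACEFUL]
    immediate = [c for c in learned if c in IMMEDIATE]
    if retries == 0 and graceful:
        return graceful[0]
    if immediate:
        return immediate[0]
    return graceful[0] if graceful else None
```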

Test Cases:

PASS: Verify bmc handler redfish audit is disabled
PASS: Verify reinstall soak using redfish
PASS: Verify reinstall netboot and power status retry handling
PASS: Verify all power control commands using redfish
PASS: Verify graceful operations are used if available
PASS: Verify immediate operations are used for retries

Regression:

PASS: Verify bmc ping audit success and failure handling

PASS: Verify Reset        Handling soak (redfish and ipmi)
PASS: Verify Power-Off/On Handling soak (redfish and ipmi)
PASS: Verify Reinstall    Handling soak (redfish and ipmi)
PASS: Verify Standard System Install    (redfish and ipmi)
PASS: Verify AIO DX   System Install    (redfish and ipmi)

PASS: Verify this update as a patch

Change-Id: Idb484512ccb1b16e2d0ea9aff4ab7965347b1322
Closes-Bug: 1880578
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-16 15:15:22 +00:00
Eric MacDonald
2fc05673d1 Add SysRq crash dump support for pmon quorum health messaging loss
The hostwd process supports failure handling for two pmon
quorum failure modes.
 1. persistent pmon quorum process failure
 2. persistent absence of pmon's quorum health report

This update adds a new configuration option and associated
implementation required to force a crash dump action for
failure mode 2 above.

This means that if the Process Monitor itself gets stalled or stops
running for 3 (default config) minutes then the hostwd will trigger
a SysRq to force a crash dump.
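
In outline, the decision is a timeout check gated by the new option.
The sketch uses the 'kdump_on_stall' name from the test plan; the
function name and the 180 second default are illustrative assumptions.

```python
def should_force_crash_dump(secs_since_health_report, kdump_on_stall,
                            stall_limit_secs=180):
    """True when pmon's health report has been absent for the full
    stall limit and the crash dump option is enabled."""
    stalled = secs_since_health_report >= stall_limit_secs
    return stalled and kdump_on_stall
```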

Test Plan:

PASS: Verify kdump for pmon quorum health report message loss
PASS: Verify no kdump when kdump_on_stall is disabled
PASS: Verify handling when kdump service is not active
PASS: Verify sighup config change detection and handling

Regression:

PASS: Verify softdog timeout handling and logs
PASS: Verify quorum threshold config change and handling
PASS: Verify handling with reboot/reset recovery methods disabled
PASS: Verify enable reboot_on_err config change handling
PASS: Verify reboot/reset actions are ignored while host is locked
PASS: Verify pmon failure recovery handling before threshold reached

Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1
Partial-Bug: 1894889
Depends-On: https://review.opendev.org/#/c/750806
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-13 12:38:16 -05:00
Eric MacDonald
3a6fec50c1 Reduce Maintenance Host Watchdog timeout for controllers
This update makes changes to the maintenance host watchdog
and reduces the timeout from 5 to 3 minutes for controllers.

This update also decouples the pmon quorum monitoring
feature handling from the host watchdog timeout. Both were
driven off the same select timer which prevented the watchdog
timeout value from being changed independently without affecting
quorum monitoring.

A new config label 'kernwd_update_period_stall_detect' is
added and value loaded for hosts that need more rigid
process stall detection.

This new lower timeout value label is loaded and applied to
hosts that run the system controller function.

A few logging improvements were made.

Test Plan:

PASS: Verify pmon quorum failure handling while unlocked.
             Was and remains at 3 misses, 60 seconds each.
PASS: Verify watchdog TO at 12 seconds on controllers.
             Was 300 secs.
PASS: Verify kernel watchdog is not enabled when loaded
             kernwd_update_period is less than 5 seconds.
             Was 60 secs.
PASS: Verify process logging ; startup, failure, transient
PASS: Verify all config values loaded by hostwd process

Regression:

PASS: Verify watchdog TO at 300 seconds on non-controllers
PASS: Verify handling of failed quorum process while locked
PASS: Verify handling of failed quorum process while unlocked
PASS: Verify handling of transient quorum messaging loss while
             unlocked
PASS: Verify hostwd process patching ; locked and unlocked
             cases

PASS: Verify AIO DX System Install
PASS: Verify Standard System Install

Note: There is no kernel WD TO log.
      The log is output to the console.

Change-Id: Iad726436e28dfa48a06743aa166318969eb6915d
Closes-Bug: #1894889
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-13 07:52:59 -05:00
Eric MacDonald
126cdfa369 Make daemon_get_file_str return first line in specified file
The current implementation will return only the first
group of characters up to the first space from the first
line of the specified file.

This function was intended to return the entire first line.
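
The difference between the two behaviors, shown on file contents held
in a string (hypothetical helper names; the real function is C++ and
reads from the specified file):

```python
def first_word(contents):
    """Old behavior: stops at the first space of the first line."""
    line = contents.split("\n", 1)[0]
    return line.split(" ", 1)[0]


def first_line(contents):
    """Intended behavior: the entire first line."""
    return contents.split("\n", 1)[0]
```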

Change-Id: Ic34361c32aeff564f4645070279cdb53d5b87626
Closes-Bug: 1896669
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-09-22 18:19:24 -04:00
Eric MacDonald
55d5f43edb Fix heartbeat messaging when interface is set to 'lo'
Maintenance heartbeat service should not be multicast
messaging over an 'lo' interface which in IPv6 leads
to socket failures, log flooding and the inability to
detect and report pmond process failure.

To fix that this update
 - configures pulse messaging to unicast for monitored
   networks configured as 'lo'.
 - prevents heartbeating over the cluster network if both
   it and the management network are configured on
   the 'lo' interface.
 - improves logging to avoid flooding in the presence of
   socket setup or access errors.
 - stops logging netlink events (interface state changes)
   on unmonitored network interfaces.
 - maintains heartbeat disabled state until the management
   network is up.
 - modifies hbsAgent socket failure handling and its pmon
   conf file so that a persistent socket failure during
   startup is alarmed as an hbsAgent process failure.
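
The first two rules in the list above can be sketched as a small
planning function. This is a Python illustration under assumptions:
the function name, the network labels and the interface names are
hypothetical, and multicast is assumed for any non-'lo' interface.

```python
def plan_heartbeat(mgmt_iface, cluster_iface):
    """Return {network: mode} for the networks that should heartbeat."""
    def mode(iface):
        # Assumed: multicast everywhere except on the 'lo' interface.
        return "unicast" if iface == "lo" else "multicast"

    plan = {"mgmt": mode(mgmt_iface)}
    # Skip the cluster network when both networks sit on 'lo'.
    if not (mgmt_iface == "lo" and cluster_iface == "lo"):
        plan["cluster"] = mode(cluster_iface)
    return plan
```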

Test Plan:

PASS: Verify logging over system install and socket errors
PASS: Verify unicast messaging when cluster is set to 'lo'
PASS: Verify no cluster network heartbeat when it and mgmnt
      are set to 'lo'.

Regression:

PASS: Verify heartbeat messaging and cluster info
PASS: Verify pmond process failure alarm management
PASS: Verify heartbeat failure detection and graceful recovery
PASS: Verify AIO SX IPv6 system install and run
PASS: Verify AIO DX IPv6 system install and run
PASS: Verify Standard IPv6 system install and run
PASS: Verify Storage system IPv6 install and run
PASS: Verify Storage system IPv4 install and run
PASS: Verify MNFA handling in IPv6 storage system

Change-Id: I5a2a0b2dee0c690617c4e0b0e2ab8b1172b2dc49
Closes-Bug: 1884585
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-06-26 14:16:41 +00:00
Eric MacDonald
e379fdfe18 Prevent pmond process recovery when system is not running
The maintenance process monitor (pmon) should only
recover failed processes when the system state is
'running' or 'degraded'.

The current implementation allowed process recovery
for other non-inservice states, including an unknown
state if systemd returns no data on the state query.

This update tightens up the system state check by
adding retries to the state query utility and
restricting accepted states to 'running' and 'degraded'.

This change then prevents pmon from inadvertently killing
and recovering the mtcClient which indirectly kills off
the mtcClient's fail-safe sysreq reboot child thread
if pmon state query returns anything other than running
or degraded during a shut down.
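
A minimal sketch of the tightened check, in Python for illustration.
The query_state callable, the retry count and the choice to retry
only on an empty answer are assumptions, not the pmon code.

```python
ACCEPTED_STATES = ("running", "degraded")


def recovery_allowed(query_state, retries=3):
    """Return True only if the state query yields an accepted state.

    query_state is a hypothetical callable that returns the system
    state as a string, or an empty string when no data came back.
    """
    for _ in range(retries):
        state = query_state().strip()
        if state:
            return state in ACCEPTED_STATES
    # No data after all retries: state is unknown, do not recover.
    return False
```

With this shape an unknown or shutting-down state never leads to a
process kill and recover cycle.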

Change-Id: I605ae8be06f8f8351a51afce98a4f8bae54a40fd
Closes-Bug: 1883519
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-06-15 11:09:47 -04:00
Eric MacDonald
7d8be4bc1f Add auto-versioning to starlingx/metal mtce packages
This update makes use of the PKG_GITREVCOUNT variable
to auto-version the mtce packages in this repo.

Change-Id: Ifb4da4570e0261bbdcf0d7af79b8add7cfc133ac
Story: 2006166
Task: 39822
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-05-21 15:18:43 -04:00
Zuul
897eb75270 Merge "Fix mtce-common build error with gcc-8.2.1" 2020-04-28 12:36:08 +00:00
Dongqi Chen
7423edce9b Fix mtce-common build error with gcc-8.2.1
gcc-8.2.1 reports a "Werror=format-truncation" error because the
string may be truncated. Adding a return value check avoids the
error.

Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Dongqi Chen <chen.dq@neusoft.com>

Change-Id: I8fa08077e47ee3777a50f018af77b3e8fc6191f9
Story: 2007506
Task: 39278
2020-04-03 14:49:09 +08:00
Eric MacDonald
0826882308 Add mtcAgent socket initialization failure retry handling.
The main maintenance process (mtcAgent) exits on a process start-up
socket initialization failure. SM restarts the failed process within
seconds and will swact if the second restart also fails. From startup
to swact can be as quick as 4 seconds. This is too short to handle a
collision with a manifest.

This update adds a number of socket initialization retries to extend
the time the process has to resolve socket initialization failures by
giving the collided manifest time to complete between retries.

The number of retries and inter retry wait time is calibrated to ensure
that a persistently failing mtcAgent process exits in under 40 seconds.

This is to ensure that SM is able to detect and swact away from a
persistently failing maintenance process while also giving the process
a few tries to resolve on its own.
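
The calibration can be sketched like this. It is a Python
illustration only; the retry count and wait time are assumed values
chosen to fit the 40 second budget, and init_sockets is a
hypothetical callable.

```python
import time

MAX_RETRIES = 6          # assumed value
RETRY_WAIT_SECS = 5      # assumed value
EXIT_BUDGET_SECS = 40    # from the commit message

# Worst case time spent waiting must stay under the exit budget.
assert MAX_RETRIES * RETRY_WAIT_SECS < EXIT_BUDGET_SECS


def init_sockets_with_retry(init_sockets, sleep=time.sleep):
    """Call init_sockets until it succeeds or the retries run out.

    Returns True on success. False tells the caller to exit so that
    SM can detect the persistent failure.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        if init_sockets():
            return True
        if attempt < MAX_RETRIES:
            sleep(RETRY_WAIT_SECS)
    return False
```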

Test Plan:

PASS: Verify socket init failure thresholded retry handling
      with no, persistent and recovered failure conditions.
PASS: Verify swact if socket init failure is persistent
PASS: Verify no swact if socket failure recovers after first exit
PASS: Verify no swact if socket failure recovers over init retry
PASS: Verify an hour long soak of continuous socket open/close retry

Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444
Closes-Bug: 1869192
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-04-01 19:24:22 +00:00
Eric MacDonald
da7b2e94f1 Modify Mtce Reinstall FSM to first power-off BMC provisioned hosts
This update only applies to servers that support and are provisioned
for Board Management Control (BMC).

The BMCs of some servers silently reject the 'set next boot device'
command while the server is executing BIOS.

The current reinstall algorithm when the BMC is provisioned starts by
detecting the power state of the target server. If the power is off
it will 'first power it on' and then proceed to 'set next boot device'
to pxe followed by a reset. For the initial power off state case, the
timing of these operations is such that the server is in BIOS when the
'set next boot device' command is issued.

This update modifies the host reinstall algorithm to first power-off
a server followed by setting the next boot device while the server is
confirmed to be powered off, then powered on. This ensures the server
gets and handles the set next boot device command operation properly.

This update also fixes a race condition between the bmc_handler and
power_handler by moving the final power state update in the power
handler to the power done phase.
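
The new ordering can be sketched against a stand-in BMC. This is a
Python illustration, not the reinstall FSM; FakeBmc and its method
names are hypothetical, and skipping the power-off for a host that
is already off is an assumption.

```python
class FakeBmc:
    """Stand-in for a BMC; records the commands it is sent."""

    def __init__(self, powered_on):
        self.powered_on = powered_on
        self.log = []

    def power_off(self):
        self.powered_on = False
        self.log.append("power-off")

    def power_on(self):
        self.powered_on = True
        self.log.append("power-on")

    def set_next_boot_device(self, device):
        self.log.append("bootdev-" + device)


def reinstall(bmc):
    """Power off, set the next boot device to pxe, then power on."""
    if bmc.powered_on:
        bmc.power_off()
    if bmc.powered_on:
        raise RuntimeError("power-off not confirmed")
    # The server is confirmed off here, so it cannot be in BIOS.
    bmc.set_next_boot_device("pxe")
    bmc.power_on()
    return bmc.log
```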

Test Plan:

Verify all new reinstall failure path handling via fault insertion testing
Verify reinstall of powered off host
Verify reinstall of powered on host
Verify reinstall of Wildcat server with ipmi
Verify reinstall of Supermicro server with ipmi and redfish
Verify reinstall of Ironpass server with ipmi
Verify reinstall of WolfPass server with redfish and ipmi
Verify reinstall of Dell server with ipmi

Over 30 reinstalls were performed across all server types, with initial
power on and off using both ipmi and redfish (where supported).

Change-Id: Iefb17e9aa76c45f2ceadf83f23b1231ae82f000f
Closes-Bug: 1862065
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-02-12 15:44:26 +00:00
Eric MacDonald
9bf231a286 Fix BMC access loss handling
Recent refactoring of the BMC handler FSM introduced a code change that
prevents the BMC Access alarm from being raised after initial BMC
accessibility was established and is then lost.

This update ensures BMC access alarm management is working properly.

This update also implements ping failure debounce so that a single ping
failure does not trigger full reconnection handling. Instead that now
requires 3 ping failures in a row. This has the effect of adding a minute
to ping failure action handling before the usual 2 minute BMC access failure
alarm is raised. ping failure logging is reduced/improved.
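
The debounce can be sketched as a consecutive failure counter. This
is a Python illustration only; the class and method names are
hypothetical.

```python
PING_FAIL_THRESHOLD = 3  # consecutive failures, per the text above


class PingDebounce:
    def __init__(self):
        self.consecutive_failures = 0

    def record(self, ping_ok):
        """Record one ping result; return True when reconnect is due."""
        if ping_ok:
            # Any success resets the count, so a lone miss is ignored.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= PING_FAIL_THRESHOLD
```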

Test Plan: for both hwmond and mtcAgent

PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
PASS: Verify BMC ping failure debounce handling, recovery and logging
PASS: Verify BMC ping persistent failure handling
PASS: Verify BMC ping periodic miss handling
PASS: Verify BMC ping and access failure recovery timing
PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery

Regression:

PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
PASS: Verify BMC power-off request handling with BMC ping failing & recovering
PASS: Verify BMC power-on request handling with BMC ping failing & recovering
PASS: Verify BMC reset request handling with BMC ping failing & recovering
PASS: Verify BMC sensor group read failure handling & recovery
PASS: Verify sensor monitoring after ping failure handling & recovery

Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
Closes-Bug: 1858110
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-01-03 09:34:37 -05:00
Eric MacDonald
c4b8171ddd Refactor BMC provisioning in Maintenance
The current mechanism used to preserve the learned bmc protocol in
the filesystem on the active controller is problematic over swact.

This update removes the file storage method in favor of preserving
the learned protocol in the system inventory database as a key/value
pair at the host level in the already existing mtce_info database field.

The specified or learned bmc access protocol is then shared with the
hardware monitor through inter-daemon maintenance messaging.

This update refactors bmc provisioning to accommodate bmc protocol
selection at the host rather than system level. Towards that this
update removes system level bmc_access_method selection in favor of
host level selection through bm_type. A bm_type of 'bmc' specifies
that the bmc access protocol for that host be learned. This has the
effect of making it the same as what is delivered today but without
support for changing it at the system level.

A system inventory update will be delivered shortly that enables bmc
access protocol selection at the host level. That update allows the
customer to specify the bmc access protocol at the host level to be
either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
That system inventory update delivers that information to maintenance
through bm_type via bmc provisioning. Until that update is delivered
bm_type always comes in as 'bmc' which gets interpreted as 'dynamic'
to maintain existing configuration.
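
The bm_type interpretation described above can be sketched as
follows. This is a Python illustration only; the function names and
the 'bmc_protocol' key are hypothetical, and only the bm_type values
come from this commit message.

```python
VALID_PROTOCOLS = ("redfish", "ipmi")


def resolve_bm_type(bm_type):
    """Map a host's bm_type to a protocol selection mode."""
    if bm_type in VALID_PROTOCOLS:
        return bm_type
    if bm_type in ("bmc", "dynamic"):
        return "dynamic"     # protocol is learned
    raise ValueError("unsupported bm_type: %r" % (bm_type,))


def store_learned_protocol(mtce_info, protocol):
    """Return a copy of a host's mtce_info with the learned protocol."""
    updated = dict(mtce_info)
    updated["bmc_protocol"] = protocol   # hypothetical key name
    return updated
```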

The following additional issues were also fixed in this update.

1. The nodeTimers module defaults the 'ring' member of timers that are
   not running to false but should be true.

2. Added a pingUtil_restart function to facilitate quicker sensor
   monitoring following provisioning changes and bmc access failures.

3. Enhanced the hardware monitor sensor grouping filter to accommodate
   non-standard Redfish readout labelling so that more sensors fall
   into the existing canned groups ; leads to more monitored sensors.

4. Added a 'http security mode' to hardware monitor messaging. This
   defaults to https as that is all that is supported by the Redfish
   implementation today. This field can be used to specify non-secure
   'http' mode in the future when that gets implemented.

5. Ensure the hardware monitor performs a bmc password re-fetch on every
   provisioning change.

Test Plan:

PASS: Verify bmc access protocol store/fetched from the database (mtce_info)
PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
PASS: Verify bmc provisioning of ipmi and redfish servers
PASS: Verify learned bmc protocol persists over process restart and swact
PASS: Verify process startup with protocol already learned

Hardware Monitor:

PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
PASS: Verify sensor model delete and relearn over ip address change
PASS: Verify sensor model delete and relearn over bm_type change
PASS: Verify sensor model not relearned over username change
PASS: Verify bm pw is re-fetched over any (re)provisioning change
PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops)
PASS: Verify protocol change handling, file cleanup, model recreation
PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi
PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish
PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic
PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic
PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi
PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish
PASS: Verify sensor model creation waits for server power to be on
PASS: Verify sensor relearn by provisioning change during model creation. (soak)

Regression:

PASS: Verify host power off and on.
PASS: Verify BMC access alarm handling (assert and clear)
PASS: Verify mtcAgent and hwmond logs add value
PASS: Verify no core dumps / seg faults.
PASS: Verify no mtcAgent and hwmond memory leak.
PASS: Verify delete of BMC provisioned host
PASS: Verify sensor monitoring, alarming, degrade and then clear cycle
PASS: Verify static analysis report of changed modules.
PASS: Verify host level bm_type=bmc functions as would dynamic selection
PASS: Verify batch provisioning and deprovisioning (7 nodes)
PASS: Verify batch provisioning to different protocol (5 nodes)
PASS: Verify handling of flaky Redfish responses

PEND: Verify System Install

Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382
Closes-Bug: #1853471
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-12-09 09:39:49 -05:00
Eric MacDonald
66e8fbd747 Add urlencoding to ip address for redfish requests
This change applies to both IPv4 and IPv6 because
the specification permits it.
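
For illustration, percent-encoding leaves a dotted IPv4 address
unchanged and escapes the colons of an IPv6 address. The Python
sketch below uses the standard library; the addresses are made-up
examples and this is not the maintenance code.

```python
from urllib.parse import quote

# Dots and digits are unreserved characters, so IPv4 is unchanged.
ipv4 = quote("192.168.204.3")

# Colons are escaped as %3A.
ipv6 = quote("fd00:204::3")
```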

Test Plan:

PASS: Verify for both IPv4 and IPv6 addressing
PASS: Verify patched change for IPv4 and IPv6 cases.

Change-Id: I99dcb31c51dd287eed8eb3a038a1814763a4c600
Closes-Bug: #1852481
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-11-15 12:15:49 -05:00
Hang Li
f48eae8f35 fix spelling error
Fixing spelling mistakes in notes helps us
understand.

Change-Id: Ic9050bd5f0141153f74d357f7405032d6aa1e1f1
Closes-Bug: #1852689
2019-11-15 14:11:52 +08:00
Zuul
7661fe5680 Merge "Removing unused flag disable_worker_services" 2019-11-04 13:52:12 +00:00
Eric MacDonald
15c036f321 Separate hardware monitor power and thermal senser data
The redfish thermal sensor data output clobbers
the power sensor data.

This update directs the thermal and power sensor readouts
into two separate files so they are preserved for off box
analysis and continued support for sensor_data product
verification testing.

Removed unused procedure that did not support
two sensor data output files.

Test Plan:

PASS: Verify system install
PASS: Verify power and sensor monitoring.
PASS: Verify power fault insertion testing
PASS: Verify thermal fault insertion testing

Change-Id: Ie7717728944e93dd6fcc38a2c971189764276929
Story: 2005861
Task: 37203
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-17 20:53:14 -04:00
Zuul
069daf1e22 Merge "Add mtcAgent support for sm_node_unhealthy condition" 2019-10-16 19:15:03 +00:00
Zuul
6e024a648f Merge "Modify the strlen judgement to avoid memory leak." 2019-10-15 21:21:22 +00:00
Eric MacDonald
675f49d556 Add mtcAgent support for sm_node_unhealthy condition
When heartbeat over both networks fails, mtcAgent
provides a 5 second grace period for heartbeat to
recover before failing the node.

However, when heartbeat fails over only one of the
networks (management or cluster) the mtcAgent does
not honour that 5 second grace period ; a bug.

When it comes to peer controller heartbeat failure
handling, SM needs that 5 second grace period to handle
swact before mtcAgent declares the peer controller as
failed, resets the node and updates the database.

This update implements a change that forces a 2 second
wait time between each fast enable and fixes the fast
enable threshold count to be the intended 3 retries.
This ensures that at least 5 seconds, actually 6 in
the case of single network heartbeat loss, passes
before declaring the node as failed.
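
The timing works out as below. This is a Python sketch of the
arithmetic only; the constant names are hypothetical.

```python
FAST_ENABLE_RETRIES = 3      # threshold count from the text above
FAST_ENABLE_WAIT_SECS = 2    # forced wait between fast enables
GRACE_PERIOD_SECS = 5        # time SM needs to handle a swact

# 3 retries x 2 seconds = 6 seconds before the node is failed.
elapsed_before_fail = FAST_ENABLE_RETRIES * FAST_ENABLE_WAIT_SECS
grace_honoured = elapsed_before_fail >= GRACE_PERIOD_SECS
```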

In addition to that, a special condition is added to
detect and stop work if the active controller is
sm_node_unhealthy. We don't want mtcAgent to make
any database updates while in this failure mode.
This gives SM the time to handle the failure
according to the system's controllers' high
availability handling feature.

Test Plan:

PASS: Verify mtcAgent behavior on set and clear of
      SM node unhealthy state.
PASS: Verify SM has at least 5 seconds to shut down
      mtcAgent when heartbeat to peer controller fails
      for one or both networks.
PASS: Test real case scenario with link pull.
PASS: Verify logging in presence of real failure condition.

Change-Id: I8f8d6688040fe899aff6fc40aadda37894c2d5e9
Closes-Bug: 1847657
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-15 15:24:34 -04:00
Zuul
f2cba8f89b Merge "Maintenance Redfish support useability enhancements." 2019-10-10 18:38:21 +00:00
Zuul
ed22f11172 Merge "Add alarm retry support to maintenance alarm handling daemon" 2019-10-07 15:24:14 +00:00
Eric MacDonald
f2fedc0446 Add alarm retry support to maintenance alarm handling daemon
The maintenance alarm handling daemon (mtcalarmd) should not
drop alarm requests simply because FM process is not running.
Instead it should retry in that case and in other FM error cases
that are likely to succeed in time if they are retried.

Some error cases however do need to be dropped such as those
that are unlikely to succeed with retries.

A review of the FM return codes with the FM designer led to a list
of errors that should be dropped and others that should be retried.

This update implements that handling with a posting and
servicing of a first-in / first-out alarm queue.

Typical retry case is the NOCONNECT error code which occurs
when FM is not running.

Alarm ordering and first try timestamp is maintained.
Retries and logs are throttled to avoid flooding.
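
The queue handling can be sketched as follows. This is a Python
illustration, not mtcalarmd; the class, the send callable and the
result codes other than NOCONNECT are hypothetical, and throttling
is left out.

```python
import time
from collections import deque

# Hypothetical result codes; only NOCONNECT is named above.
OK, NOCONNECT, INVALID = "OK", "NOCONNECT", "INVALID"
RETRYABLE = {NOCONNECT}


class AlarmQueue:
    def __init__(self, send):
        self.send = send          # callable(alarm) -> result code
        self.queue = deque()      # first-in / first-out

    def post(self, alarm_id, now=None):
        # The first try timestamp is set once and kept over retries.
        first_try = time.time() if now is None else now
        self.queue.append({"id": alarm_id, "first_try": first_try})

    def service(self):
        """Send queued alarms in order; stop at a retryable error."""
        sent, dropped = [], []
        while self.queue:
            alarm = self.queue[0]
            result = self.send(alarm)
            if result in RETRYABLE:
                break             # keep it at the head; retry later
            self.queue.popleft()
            (sent if result == OK else dropped).append(alarm["id"])
        return sent, dropped
```

Stopping at the first retryable error, rather than skipping ahead,
is what keeps alarm ordering intact in this sketch.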

Test Plan:

PASS: Verify success path alarm handling End-to-End.
PASS: Verify retry handling while FM is not running.
PASS: Verify handling of all FM error codes (fit tool).
PASS: Verify alarm handling under stress (inject-alarm script) soak.
PASS: Verify no memory leak over stress soak.
PASS: Verify logging (success, retry, failure)
PASS: Verify alarm posted date is maintained over retry success.

Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86
Closes-Bug: 1841653
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-07 09:07:49 -04:00
Eric MacDonald
4c541f50d4 Maintenance Redfish support useability enhancements.
This update is a result of changes made during a suite of
end-to-end provisioning, reprovisioning and deprovisioning
customer experience testing of the maintenance RedFish support
feature.

1. Force reconnection and password fetch on provisioning changes
2. Force reconnection and password fetch on persistent connection failures
3. Fix redfish protocol learning (string compare) in hardware monitor
4. Improve logging for some typical error paths.

Test Plan:

PASS: Verify handling of reprovisioning BMC between hosts that support
             different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
             different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
             the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
             has been created using that protocol.
PASS: Verify handling of BMC reprovision - ip address change only
PASS: Verify handling of BMC reprovision - username change only
FAIL: Verify handling of BMC reprovision - password change only
             https://bugs.launchpad.net/starlingx/+bug/1846418

Change-Id: I4bf52a5dc3c97d7794ff623c881dff7886234e79
Closes-Bug: #1846212
Story: 2005861
Task: 36606
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-03 11:57:58 -04:00
Zuul
bc83d9e362 Merge "Update openSUSE OBS artifacts to build MTCE packages" 2019-10-02 15:09:50 +00:00
Marcela Rosales
b5f12793a1 Update openSUSE OBS artifacts to build MTCE packages
The openSUSE spec files need to have the path of the source code in
the setup to have the package generation automated through _service
file in OBS.

Change-Id: I2b7c08d5772025c02821dfb9fc944fff0f5b6f90
Story: 2006508
Task: 36812
Signed-off-by: Marcela Rosales <marcela.a.rosales.jimenez@intel.com>
2019-10-01 11:07:10 -05:00
Eric MacDonald
df9343b0cc Add redfish power/reset/reinstall bmc support to maintenance
This update delivers redfish support for Power-On/Off, Reset
and Netboot Reinstall handling to maintenance.

Test Plan: (Testing Continues)

PASS: Verify Redfish Power-Off action handling
PASS: Verify Redfish Power-On action handling
PASS: Verify Redfish Reset action handling
PASS: Verify compute Redfish Reinstall action handling from controller-0
PASS: Verify compute Redfish Reinstall action handling from controller-1
PASS: Verify Redfish Power-Off Action failure handling
PASS: Verify Redfish Power-On action failure handling
PASS: Verify Redfish Reset action failure handling
PASS: Verify Redfish Re-Install action failure handling
PASS: Verify Reset progression cycle does not leak memory.
PASS: Verify bmc_handler failure handling does not leak memory.
PASS: Verify Inservice BMC access (ping) failure and recovery handling.
PASS: Verify BMC access failure alarm handling
PASS: Verify BMC provisioning and deprovisioning soak (redfish - wolfpass)
PASS: Verify BMC provisioning and deprovisioning does not leak memory.
PASS: Verify BMC provisioning handling with bad ip and/or bad username
PASS: Verify BMC reprovisioning to same protocol
PASS: Verify BMC reprovisioning from ipmi host to redfish host
PASS: Verify BMC reprovisioning from redfish host to ipmi host
PASS: Verify mixed protocol support in same lab
PASS: Verify mixed server support in same lab
PASS: Verify Large System Install with BMCs provisioned (wp8-12)
PASS: Verify bmc access method (learn,ipmi,redfish) learned from mtc.init
PASS: Verify Swact with BMCs provisioned.
PASS: Verify no segfaults.
PASS: Verify AIO System Install in lab that supports redfish (WC3-6, WP8-12, Dell 720 3-7)
PASS: Verify AIO Simplex Install with Redfish Support (SM1, SM3)
PASS: Verify AIO Duplex Install with Redfish Support (SM 5-6, Dell 720 1-2)

Useability:

PASS: Verify handling of reprovisioning BMC between hosts that support
             different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
             different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
             the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify fault insertion for both protocols and action handling.
PASS: Verify protocol select handover.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
             has been created using that protocol.
PASS: Verify handling of missing bmc_access_method configuration select.
PASS: Verify inservice bmc_access_method service parameter modification handling.

Regression:

PASS: Verify redfish BMC info query logging.
PASS: Verify sensor monitoring and alarming still works.
PASS: Verify all power/reset/netboot commands for IPMI
PASS: Verify reprovisioning soak of Wolfpass servers
PASS: Verify reprovisioning soak of SM servers

Depends-on: https://review.opendev.org/#/c/679178/
Change-Id: I984057e04d7426e37d675cf4d334a4e35419f2e8
Story: 2005861
Task: 35826
Task: 36606
Task: 36467
Task: 36456
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-26 15:59:35 -04:00
marvin
5f743f1402 Removing unused flag disable_worker_services
The disable_worker_services file was originally created
to prevent the (bare metal) nova-compute services from
running on a newly upgraded controller in an AIO-DX
configuration. This situation no longer exists because
the bare metal nova-compute services do not exist after
transitioning to containers, so this flag is no longer needed.
Removing all references to the disable_worker_services file.

Change-Id: I20e08db737bb0df6ba34c071e2435f1a18f7c3ed
Partial-Bug: #1838432
Signed-off-by: marvin <weifei.yu@intel.com>
2019-09-25 02:06:31 +00:00
zhipengl
2fd91856d0 Enable protocol switch between ipmi and redfish for hwmon
1) Switch bmc protocol between ipmi and redfish for hwmon.
2) Get power status provided by mtc through the file below
   /var/run/bmc/<hostname>

Story: 2005861
Task: 35815

Change-Id: Ie577f7f9265b7cdb5c985dcc0861a90e74508026
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2019-09-22 22:28:30 -04:00
Marcela Rosales
a0a3693bc4 Add openSUSE OBS Artifacts for Maintenance services
StarlingX Open Build Service [0] builds MTCE packages using base
artifacts:
- Spec file
- Changelog

[0] https://build.opensuse.org/project/show/Cloud:StarlingX:2.0

Story: 2006508
Task: 36556
Task: 36557
Task: 36558
Task: 36559
Task: 36560
Task: 36561

Change-Id: I9bf59ab4b890ebe33a9304d3f886951c860412a6
Signed-off-by: Marcela Rosales <marcela.a.rosales.jimenez@intel.com>
2019-09-20 09:18:54 -05:00
Eric MacDonald
0d63a16d8d Improve BMC password first fetch handling in hwmon
Trying to get the BMC password through barbican before
the ping succeeds leads to an early bmc access lost
failure that
 1. produces a misleading bmc access lost failure log ;
    bmc access had not even been established yet.
 2. imposes a retry wait that delays re-establishing
    bmc access and therefore overall sensor monitoring.

This update also

 1. adds hostname to some of the secretUtil API
    interfaces so that logs are reported against the
    correct host rather than always the current
    controller hostname.

 2. changes some success path logging to dlogs to
    reduce log noise.

 3. simplifies a ping ok log

Change-Id: Ib3b7de212294d6dc350ee17d363f4009b3b0dcb0
Story: 2005861
Task: 36595
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-17 18:57:08 +00:00
zhipengl
67d4ba105f Redfish support for Sensor Monitoring in hwmond
Add redfish hwmon thread function and related parse function
for Power and Thermal sensor data.
Removed some unused old functions.
Rename common function or variable with bmc prefix

Test done for this patch on simplex bare metal setup.
system host-sensor-list
system host-sensor-show
system host-sensorgroup-list
system host-sensorgroup-show
system host-sensorgroup-relearn

Story: 2005861
Task: 35815

Depends-on: https://review.opendev.org/#/c/671340
Change-Id: If8a35581d44df15749a049eda945f23d2323fd35
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2019-09-12 01:56:42 +08:00
Eric MacDonald
4d2383818f Add bmc protocol select to maintenance
This update adds BMC Info Query command handling and
info logging to maintenance.

Example of the logs produced by the BMC Query are

  compute-2 manufacturer is Intel Corporation
  compute-2 model number:<str>  part number:<str>  serial number:<str>
  compute-2 BIOS firmware version is SE5C620.86B.00.01.0013.030920180427
  compute-2 BMC  firmware version is unavailable
  compute-2 power is on
  compute-2 has 2 processors
  compute-2 has 192 GiB of memory

Please note that the default protocol remains IPMI even
if Redfish support is detected. This is because the
power/reset/netboot control implementation for Redfish
has not yet been implemented.

Test Plan:

PASS: Verify redfish BMC info query logging.
PASS: Verify IPMI remains the default selected protocol.

Regression:

PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reboot command handling.
PASS: Verify reinstall (netboot) command handling.

Change-Id: I654056119018a1751a70495e3df8b541d9e00b93
Story: 2005861
Task: 35826
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-08 14:14:15 -04:00
Eric MacDonald
804ec52227 Add redfish support detection to maintenance
This update

1. Refactors some of the common maintenance ipmi
   definitions and utilities into a more generic
   'bmcUtil' module to reduce code duplication and
   improve code reuse with the introduction of a second
   bmc communication protocol ; redfish.

2. Creates a new 'redFishUtil' module similar to the existing
   'ipmiUtil' module but in support of common redfish
   utilities and definitions that can be used by both
   maintenance and the hardware monitor.

3. Moves the existing 'mtcIpmiUtil' module to a more common
   'mtcBmcUtil' and renames the 'ipmi_command_send/recv' to
   the more generic 'bmc_command_send/recv' which are enhanced
   to support both ipmi and redfish bmc communication methods.

4. Renames the bmc info collection and connection monitor ;
   'bm_handler' to 'bmc_handler' and adds support necessary
   to learn if a host's bmc supports redfish.

5. Renames the existing 'mtcThread_ipmitool' to a more common
   'mtcThread_bmc' and adds redfishtool support for the now
   common set of bmc thread commands, along with the new
   redfishtool bmc query, aka 'redfish root query', used to
   detect if a host's bmc supports redfish.

   Note: This aspect is the primary feature of this update.

         Namely the ability to detect and print a log indicating
         if a host's bmc supports redfish.

Test Plan:

PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reinstall (netboot) command handling.
PASS: Verify logging when redfish is not supported.
PASS: Verify logging when redfish is supported.
PASS: Verify ipmitool is used regardless of redfish support.
PASS: Verify mtce thread error handling for both protocols.

Change-Id: I72e63958f61d10f5c0d4a93a49a7f39bdd53a76f
Story: 2005861
Task: 35825
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-08-19 14:03:37 +00:00
Alex Kozyrev
0083538501 Properly handle Barbican IPv6 address in MTCE
barbican.conf stores Barbican IPv6 address enclosed by square brackets:
bind_host=[abde::2]
MTCE fails to connect to Barbican with such an IP address.
Need to strip square brackets during barbican.conf file read in MTCE.
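
The bracket stripping can be sketched like this. It is a Python
illustration only, not the MTCE file reader; the function name is
hypothetical.

```python
def parse_bind_host(line):
    """Return the bind_host value with surrounding brackets removed.

    Returns None when the line is not a bind_host setting.
    """
    key, _, value = line.partition("=")
    if key.strip() != "bind_host":
        return None
    value = value.strip()
    # An IPv6 address is stored as [addr]; strip only a matched pair.
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    return value
```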

Change-Id: I28ae627cd4998a5975d39b3edc466180e11aedf6
Closes-Bug: 1839870
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-08-12 15:14:00 -04:00
YeHuiSheng
f945697f26 Modify the strlen judgement to avoid memory leak.
Change-Id: I4752ba24943b4c5e47e63b3c3e95811dec73ecce
Closes-Bug: #1837344
Signed-off-by: YeHuiSheng <hsye@fiberhome.com>
2019-07-22 17:17:12 +08:00