Eric Macdonald 87b4a6990f Make Service Management API calls non-blocking through the workQueue
SM API calls in the Swact FSM are inline and blocking, which prevents
token renewal between retries. As a result, 401 authentication errors
that occur during a Swact exhaust all retry attempts before token
renewal can occur.

Resolved by migrating SM HTTP request handling from blocking to non-
blocking calls through the workQueue infrastructure, consistent with
VIM and SysInv API. This provides the opportunity for Maintenance
to renew its authentication token before HTTP request retries are
exhausted. This change alone constitutes the majority of this update.
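
A minimal sketch of the idea, not the mtcAgent implementation: requests sit
in a queue and are serviced one attempt at a time, so a 401 can trigger a
token renewal before the next retry instead of burning through all retries
inline. Every name here (Request, WorkQueue, service_once, renew_token, the
"stale"/"fresh" token strings) is hypothetical and the HTTP layer is replaced
by a callback.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>

// Hypothetical request record; not the real mtce event structure.
struct Request {
    std::string name;
    int retries_left;
};

// Hypothetical work queue that makes one attempt per service call.
struct WorkQueue {
    std::deque<Request> pending;
    std::string token = "stale";
    int renewals = 0;

    // Stand-in for the HTTP layer: returns an HTTP status code.
    std::function<int(const Request&, const std::string&)> send;

    // Stand-in for the real token renewal.
    void renew_token() { token = "fresh"; ++renewals; }

    // Make a single attempt on the request at the head of the queue and
    // return to the caller, so the main loop keeps running between tries.
    // Returns true only when a request completed successfully.
    bool service_once() {
        assert(send && "send callback must be set");
        if (pending.empty()) return false;

        Request req = pending.front();
        pending.pop_front();

        int status = send(req, token);
        if (status == 200) return true;

        if (status == 401) renew_token();   // renew before the next retry

        if (req.retries_left > 0) {
            --req.retries_left;
            pending.push_back(req);         // retry on a later pass
        }
        return false;
    }
};
```

With a blocking call the retry loop would run to completion with the same
stale token; here the renewal happens between the first and second attempt.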

The token renewal process is sped up by removing the existing token
renewal delay, making on-demand renewal virtually immediate following
an HTTP request that fails with a 401 Authentication Error.

Further to the token renewal process, this update prevents token
renewal nesting, where multiple concurrent API requests that experience
HTTP 401 errors simultaneously attempt token renewal, causing
libEvent base/struct key/value management conflicts.
A token_renewal_in_progress flag is introduced to wave off
redundant renewal requests during the renewal.
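
The wave-off behaviour can be sketched as follows. Only the flag name
token_renewal_in_progress comes from this commit message; TokenContext,
request_token_renewal and token_renewal_done are hypothetical names, and the
real code may structure this differently.

```cpp
#include <cassert>

// Hypothetical holder for the renewal state.
struct TokenContext {
    bool token_renewal_in_progress = false;
    int  renewals_started = 0;
};

// Returns true if this call started a renewal, false if it was waved off
// because another renewal is already in progress.
bool request_token_renewal(TokenContext& ctx) {
    if (ctx.token_renewal_in_progress) return false;   // redundant request
    ctx.token_renewal_in_progress = true;
    ++ctx.renewals_started;
    return true;
}

// Called when the renewal response has been handled, success or failure.
void token_renewal_done(TokenContext& ctx) {
    assert(ctx.token_renewal_in_progress);
    ctx.token_renewal_in_progress = false;
}
```

Several requests that hit a 401 at the same time would then result in a
single renewal rather than nested ones.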

Work queue processing, which relies on a valid token, is held off
while token renewal is in progress; in the success case renewal
completes in under a second. Monitoring and self-correction for a
stuck in-progress flag are added so that the new flag cannot hold off
work queue processing indefinitely. Verified with FIT testing.
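
One way to picture the hold-off and the stuck-flag correction, as a sketch
under stated assumptions: the 30 second limit, RenewalState, renewal_begin
and workqueue_may_run are all hypothetical, since the commit message does not
give the real threshold or function names.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical renewal state with the time the renewal started.
struct RenewalState {
    bool in_progress = false;
    Clock::time_point started{};
};

// Assumed limit; the real value is not stated in the commit message.
constexpr std::chrono::seconds kStuckLimit{30};

// Mark a renewal as started at time 'now'.
void renewal_begin(RenewalState& s, Clock::time_point now) {
    assert(!s.in_progress);   // nesting is expected to be prevented upstream
    s.in_progress = true;
    s.started = now;
}

// Returns true if the work queue may be serviced at time 'now'.
// While a renewal is in progress the queue is held off, unless the flag
// has been set for longer than kStuckLimit, in which case it is cleared.
bool workqueue_may_run(RenewalState& s, Clock::time_point now) {
    if (!s.in_progress) return true;
    if (now - s.started > kStuckLimit) {
        s.in_progress = false;   // self-correct a stuck flag
        return true;
    }
    return false;
}
```

Passing the time in as a parameter keeps the check easy to exercise with
injected faults.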

This update also adds retries to critical VIM API calls as a fix for
bug report 2135130.

Other related fixes and collateral code cleanup:

- Fixed retry handling of invalid token during initial inventory
  load during process startup by forcing renewal in the inline
  inventory load loop. FIT testing used to verify.

- Remove duplicate token addition in mtcHttpUtil_event_init since
  it is already added in mtcHttpUtil_api_request, called by the
  workQueue at send time.

- Increase the Work Queue overload threshold from 40 to 100 and move
  the detection/discard to before enqueue rather than after. Don't
  take on more work when already overloaded.

- Remove unnecessary inservice test force states updates that were
  added during the early phases of product development but never
  removed after the product matured.

- Remove obsoleted Keystone token renewal from mtcHttpUtil module.
  Keystone token renewal is already handled in the common httpUtil
  module so it can be shared with the hardware monitor. It provides
  a separate path for token renewal compared to the work queue which
  processes API requests that need a valid token.

- Remove mtcHttpUtil module 'unsupported non-blocking request call'
  that is never used and does not properly free http event resources.
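
The Work Queue overload item above moves the check ahead of the enqueue. A
minimal sketch of that ordering follows; the threshold of 100 is from this
commit message, while WorkQueue, enqueue and the exact comparison used at the
boundary are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Threshold value taken from the commit message.
constexpr std::size_t kOverloadThreshold = 100;

// Hypothetical queue of pending work items.
struct WorkQueue {
    std::deque<std::string> items;
};

// Check for overload before taking on the work, rather than enqueueing
// first and discarding afterwards. Returns false when the work is refused.
bool enqueue(WorkQueue& q, const std::string& work) {
    if (q.items.size() >= kOverloadThreshold) return false;   // overloaded
    q.items.push_back(work);
    assert(q.items.size() <= kOverloadThreshold);
    return true;
}
```

A caller that gets false back can treat it as an enqueue failure and retry
later.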

Test Plan:

PASS: Verify full clean system build
PASS: Verify AIO SX Install
PASS: Verify AIO DX Install
PASS: Verify DC System Install
PASS: Verify Standard System Install 2+1+1

Authentication Error Handling Tests: partial and full retry handling

PASS: Verify 10 seconds between API call retries
PASS: Verify SM  API authentication error handling and token renewal
PASS: Verify VIM API authentication error handling and token renewal
PASS: Verify INV API authentication error handling and token renewal
PASS: Verify SM/VIM/INV state change stress soak for 4 hrs with
      random token corruption every 20-30 seconds
PASS: Verify HW Monitor authentication error handling & token renewal
      4 hr soak with random token corruption every 20-30 seconds

Token Renewal Cases:

PASS: Verify sub-second token renewal ; detection to valid token
PASS: Verify invalid token handling during initial inventory renewal
PASS: Verify workQueue processing is held off during token renewal
PASS: Verify stuck in-progress and corrective action handling
PASS: Verify random token corruption, auto detection and renewal by
      and while running SM/VIM/INV API stress test soak for 8 hrs.
PASS: Verify no memory leak over 5 sec token renewal soak for 3 hrs.
      - both mtcAgent and hwmond
PASS: Verify no memory leak while corrupting token on every SM and INV
      http request during http request stress soak for 8 hrs

Swact Tests: partial = fails but succeeds before max retries
                full = fails all retries

PASS: Verify 10 seconds between tries
PASS: Verify 1 second between polling tries
PASS: Verify time between user Swact Action and Swact In Progress
      - is 2 seconds and includes completed swact query & swact action
PASS: Verify successful swact timing and handling
      - is 8 seconds to mtcAgent shutdown, 6 seconds of which is SM
PASS: Verify Swact soak - 50 iterations, 25 per side

Swact Failure Cases:

PASS: Verify Swact process failure retry handling - 10 tries max
PASS: - Swact Query request enqueue fail  - partial and full(~110 secs)
PASS: - Swact Query request failure       - partial and full(~110 secs)
PASS: - Swact Query request dropped       - partial and full(~110 secs)
PASS: - Swact Query request failure 401   - partial and full(~130 secs)
PASS: - Swact Action request enqueue fail - partial and full(~130 secs)
PASS: - Swact Action request failure      - partial and full(~150 secs)
PASS: - Swact Action request dropped      - partial and full(~300 secs)
PASS: - Swact Action request failure 500  - partial and full(~180 secs)
PASS: - Swact Action request failure 401  - partial and full(~110 secs)
PASS: Verify Swact timeout - mtcAgent service shutdown took too long
      - all commands pass but swact does not occur
      - timeout specified in /etc/mtc.conf:swact_timeout (def:120 secs)
PASS: Verify retry handling with min swact_timeout of 20 secs
      - the 20 second swact timeout was honoured.
      - was auto corrected to 20 secs from an invalid 10 second setting
PASS: Verify retry handling with larger swact_timeout of 300 secs
      - the 300 second timeout is honoured. 500 secs was also verified
PASS: Verify Swact failure handling when SM is active but not scheduling
PASS: - FIT with kill -STOP <sm pid>.
PASS: - Swact is failed after the swact request max retries are reached.
PASS: Verify Swact handling logging ; all success and failure paths
PASS: Verify Horizon task updates ; all success and failure paths

Regression:

PASS: Verify Lock/Unlock  of standby controller soak ; 20 iterations
PASS: Verify Automated Sanity
PASS: Verify memory leak 8 hr soak for hwmond and mtcAgent
PASS: Verify handling of spontaneous Active controller reboot
PASS: Verify handling of spontaneous Standby controller reboot
PASS: Verify password change handling - sudo passwd sysadmin
      - change is auto updated on all hosts

Closes-Bug: 2135128
Closes-Bug: 2135130
Change-Id: I0189cd9e682dcf387d89c3dbf955d7c7459a0bd0
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
2026-01-07 14:01:46 -05:00

metal

The starlingx/metal repository handles StarlingX Bare Metal Management [1].

This repository is not intended to be developed standalone, but rather as part of the StarlingX Source System, which is defined by the StarlingX manifest [2].

References

  1. https://docs.starlingx.io/api-ref/metal

  2. https://opendev.org/starlingx/manifest.git

Description
StarlingX Bare Metal and Node Management, Hardware Maintenance