SM API calls in the Swact FSM are inline and blocking, which prevents
token renewal between retries. As a result, 401 authentication errors
that occur during a Swact exhaust all retry attempts before token
renewal can occur.

Resolved by migrating SM HTTP request handling from blocking to
non-blocking calls through the workQueue infrastructure, consistent
with the VIM and SysInv APIs. This gives Maintenance the opportunity
to renew its authentication token before HTTP request retries are
exhausted. This change alone constitutes the majority of this update.

The token renewal process is sped up by removing the existing token
renewal delay, making on-demand renewal virtually immediate following
an HTTP request that fails with a 401 Authentication Error.

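The retry behaviour described above can be sketched roughly as follows.
This is an illustration only, not the mtcAgent code: RetryState,
Outcome and on_response are hypothetical names, and the default of 10
tries simply mirrors the "10 tries max" figure in the test plan.

```cpp
#include <functional>

// Hypothetical result of handling one HTTP response.
enum class Outcome { Done, Retry, Failed };

// Hypothetical per-request retry bookkeeping.
struct RetryState {
    int tries = 0;       // failed tries so far
    int max_tries = 10;  // assumed limit, mirrors the test plan
};

// Called from the main loop each time a response arrives, so the
// caller never blocks between tries. A 401 triggers an immediate
// on-demand token renewal before the next try is scheduled.
Outcome on_response(RetryState& s, int http_status,
                    const std::function<void()>& renew_token) {
    if (http_status >= 200 && http_status < 300) {
        return Outcome::Done;
    }
    if (++s.tries >= s.max_tries) {
        return Outcome::Failed;   // retries exhausted
    }
    if (http_status == 401) {
        renew_token();            // no renewal delay
    }
    return Outcome::Retry;
}
```

Because each response is handled as a separate step, the renewal has
time to complete before the next try is sent.
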
Further to the token renewal process, this update prevents token
renewal nesting, where multiple concurrent API requests that experience
HTTP 401 errors simultaneously attempt token renewal, causing
libEvent base/struct key/value management conflicts.
A token_renewal_in_progress flag is introduced to wave off
redundant renewal requests while a renewal is in progress.

Work queue processing, which relies on a valid token, is held off
while token renewal is in progress; in the success case renewal
completes in under a second. Monitoring and self-correction are added
so that the new in-progress flag cannot remain stuck. Verified with
FIT testing.

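A minimal sketch of the in-progress flag, the work queue hold-off and
the stuck-flag audit follows. It is not the actual implementation:
TokenRenewalGuard, WorkQueue and the 30 second stuck threshold are
assumptions made for illustration; only the idea of a
token_renewal_in_progress flag comes from this update.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical guard around token renewal.
struct TokenRenewalGuard {
    bool in_progress = false;              // the in-progress flag
    Clock::time_point started{};           // when this renewal began
    std::chrono::seconds stuck_after{30};  // assumed stuck threshold

    // Returns true if the caller should start a renewal, false if the
    // request is waved off because one is already running.
    bool request(Clock::time_point now) {
        if (in_progress) return false;
        in_progress = true;
        started = now;
        return true;
    }

    void complete() { in_progress = false; }

    // Self-correction: clear a flag that has been set for too long.
    // Returns true if a stuck flag was cleared.
    bool audit(Clock::time_point now) {
        if (in_progress && now - started > stuck_after) {
            in_progress = false;
            return true;
        }
        return false;
    }
};

// Hypothetical work queue that needs a valid token to run.
struct WorkQueue {
    std::deque<std::function<void()>> pending;

    // Runs queued work unless a renewal is in progress.
    // Returns the number of jobs run.
    std::size_t process(const TokenRenewalGuard& guard) {
        if (guard.in_progress) return 0;   // hold off
        std::size_t ran = 0;
        while (!pending.empty()) {
            auto job = std::move(pending.front());
            pending.pop_front();
            job();
            ++ran;
        }
        return ran;
    }
};
```

A second request() made while the flag is set returns false, which is
the wave-off; audit() would be called periodically from the main loop.
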
This update also adds retries to critical VIM API calls as a fix for
bug report 2135130.

Other related fixes and collateral code cleanup:
- Fix retry handling of an invalid token during the initial inventory
load at process startup by forcing renewal in the inline
inventory load loop. Verified with FIT testing.
- Remove duplicate token addition in mtcHttpUtil_event_init since
it is already added in mtcHttpUtil_api_request, called by the
workQueue at send time.
- Increase the Work Queue overload threshold from 40 to 100 and move
the detection/discard to before enqueue rather than after. Don't
take on more work when already overloaded.
- Remove unnecessary inservice test force states updates that were
added during the early phases of product development but never
removed after the product matured.
- Remove obsolete Keystone token renewal from the mtcHttpUtil module.
Keystone token renewal is already handled in the common httpUtil
module so it can be shared with the hardware monitor. The common
module provides a token renewal path that is separate from the work
queue, which processes API requests that need a valid token.
- Remove the mtcHttpUtil module 'unsupported non-blocking request call'
that is never used and does not properly free HTTP event resources.

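The "check before enqueue" ordering from the overload item above can be
sketched as below. The threshold of 100 is taken from this update;
Request, BoundedWorkQueue and the exact comparison (reject once the
queue already holds 100 entries) are assumptions for illustration.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Threshold value from this update; the name is hypothetical.
constexpr std::size_t kOverloadThreshold = 100;

// Hypothetical queued request.
struct Request {
    std::string name;
};

// Hypothetical queue that refuses new work when overloaded.
struct BoundedWorkQueue {
    std::deque<Request> items;
    std::size_t discarded = 0;

    // Overload is detected BEFORE the request is queued, so an
    // overloaded queue never grows; the new request is discarded.
    bool enqueue(const Request& r) {
        if (items.size() >= kOverloadThreshold) {
            ++discarded;
            return false;
        }
        items.push_back(r);
        return true;
    }
};
```

Callers see the rejection immediately through the return value and can
treat it as an enqueue failure.
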
Test Plan:
PASS: Verify full clean system build
PASS: Verify AIO SX Install
PASS: Verify AIO DX Install
PASS: Verify DC System Install
PASS: Verify Standard System Install 2+1+1
Authentication Error Handling Tests: partial and full retry handling
PASS: Verify 10 seconds between API call retries
PASS: Verify SM API authentication error handling and token renewal
PASS: Verify VIM API authentication error handling and token renewal
PASS: Verify INV API authentication error handling and token renewal
PASS: Verify SM/VIM/INV state change stress soak for 4 hrs with
random token corruption every 20-30 seconds
PASS: Verify HW Monitor authentication error handling & token renewal
4 hr soak with random token corruption every 20-30 seconds
Token Renewal Cases:
PASS: Verify sub-second token renewal ; detection to valid token
PASS: Verify invalid token handling during initial inventory renewal
PASS: Verify workQueue processing is held off during token renewal
PASS: Verify stuck in-progress and corrective action handling
PASS: Verify random token corruption, auto detection and renewal
while running SM/VIM/INV API stress test soak for 8 hrs.
PASS: Verify no memory leak over 5 sec token renewal soak for 3 hrs.
- both mtcAgent and hwmond
PASS: Verify no memory leak while corrupting token on every SM and INV
http request during http request stress soak for 8 hrs
Swact Tests: partial = fails but succeeds before max retries
full = fails all retries
PASS: Verify 10 seconds between tries
PASS: Verify 1 second between polling tries
PASS: Verify time between user Swact Action and Swact In Progress
- is 2 seconds and includes completed swact query & swact action
PASS: Verify successful swact timing and handling
- is 8 seconds to mtcAgent shutdown, 6 seconds is SM
PASS: Verify Swact soak - 50 iterations, 25 per side
Swact Failure Cases:
PASS: Verify Swact process failure retry handling - 10 tries max
PASS: - Swact Query request enqueue fail - partial and full(~110 secs)
PASS: - Swact Query request failure - partial and full(~110 secs)
PASS: - Swact Query request dropped - partial and full(~110 secs)
PASS: - Swact Query request failure 401 - partial and full(~130 secs)
PASS: - Swact Action request enqueue fail - partial and full(~130 secs)
PASS: - Swact Action request failure - partial and full(~150 secs)
PASS: - Swact Action request dropped - partial and full(~300 secs)
PASS: - Swact Action request failure 500 - partial and full(~180 secs)
PASS: - Swact Action request failure 401 - partial and full(~110 secs)
PASS: Verify Swact timeout - mtcAgent service shutdown took too long
- all commands pass but swact does not occur
- timeout specified in /etc/mtc.conf:swact_timeout (def:120 secs)
PASS: Verify retry handling with min swact_timeout of 20 secs
- the 20 second swact timeout was honoured.
- was auto corrected to 20 secs from an invalid 10 second setting
PASS: Verify retry handling with larger swact_timeout of 300 secs
- the 300 second timeout is honoured. 500 secs was also verified
PASS: Verify Swact failure handling when SM is active but not scheduling
PASS: - FIT with kill -STOP <sm pid>.
PASS: - Swact is failed after the swact request max retries are reached.
PASS: Verify Swact handling logging ; all success and failure paths
PASS: Verify Horizon task updates ; all success and failure paths
Regression:
PASS: Verify Lock/Unlock of standby controller soak ; 20 iterations
PASS: Verify Automated Sanity
PASS: Verify memory leak 8 hr soak for hwmond and mtcAgent
PASS: Verify handling of spontaneous Active controller reboot
PASS: Verify handling of spontaneous Standby controller reboot
PASS: Verify password change handling - sudo passwd sysadmin
- change is auto updated on all hosts
Closes-Bug: 2135128
Closes-Bug: 2135130
Change-Id: I0189cd9e682dcf387d89c3dbf955d7c7459a0bd0
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>