* StarlingX devstack has switched to Ubuntu Bionic, whose default
compiler is gcc 7.3.0. gcc 7.3.0 reports the compile error
"error: format not a string literal and no format arguments
[-Werror=format-security]" for the snprintf call in
pingUtil_send in pingUtil.cpp.
* gcc 4.8.5 doesn't report this warning, which is why the current
StarlingX build doesn't hit this issue.
Passed tests:
* Fresh build
* Deployment test
* Unit tests, verified the change doesn't impact the code behavior.
* System-level verification, mtcAgent and hwmond can start normally.
Story: 2003161
Task: 29793
Change-Id: I21e84ac4b2c9deb8926c752fe79ea284a0d92b30
Signed-off-by: Yi Wang <yi.c.wang@intel.com>
The preceding 4 reviews all needed to be in place in order for
the devstack run to complete. Enable it now.
Change-Id: I139c862b8edbe7214ad11b9820e400b7e613bd61
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
libc6 renamed siginfo.h to siginfo-const.h sometime between
2.23 (in Xenial) and 2.27 (in bionic).
This builds on bionic and centos7 and in fact is required to
get DevStack to complete on bionic.
This is last in the stack since it has not been tested
beyond the compile/install that DevStack does. There
may be a better/alternate solution...but with this we should
get a passing DevStack job.
Change-Id: I5a2ed9455b05e604731c3775d0f402c6137da2ef
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
This allows DevStack plugins to add their configured STX_INST_DIR
to the linker search path.
Change-Id: I277204cd89767b93eec6c96969fc33d23e04516b
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
* Install build artifacts to a fixed dir rather than attempting
to infer a location based on the Python binary location. That
was intended to work seamlessly in venvs; we'll burn that bridge
when we come to it. For now just put it all in
$DEST/usr/{include|lib}. This also removes the need for
root access for these files, allowing the build steps to be performed
on laptops that may not otherwise run DevStack.
* Install systemd unit files directly to /etc/systemd/system
and skip the requirement to copy them a second time
* Add the declarations to settings for the devstack playbook to
handle plugin precedence order properly.
Change-Id: I5d68465384e000c05eb650a8358b70f7a7a6c293
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
The default devstack user, stack, may not have permission to modify the
system file /etc/hosts. Use sudo to make sure the modification is applied.
Change-Id: Iabe47cae88da9d70a1f7788c1847d99856963713
Closes-Bug: 1816520
Signed-off-by: Yi Wang <yi.c.wang@intel.com>
Use the OpenStack Barbican API to retrieve BMC passwords stored by SysInv.
See the SysInv commit for details on how the password is written to Barbican.
MTCE finds the corresponding secret by host uuid and retrieves the
secret payload associated with it. mtcSecretApi_get is used to find the
secret reference, based on a hostname. mtcSecretApi_read is used to
read the password using the reference found in the previous step.
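
The two-step lookup described above can be sketched in Python as follows.
This is an illustration only, not the mtce C++ code: the helper names, the
example endpoint and the secret name used for the lookup are assumptions.
Only the /v1/secrets listing and the /payload path come from the Barbican
REST API.

```python
import json
from urllib.parse import quote

# Assumption: example Barbican endpoint; a real client takes it from the catalog.
BARBICAN_URL = "http://controller:9311"


def secret_list_url(secret_name):
    """Step 1 (cf. mtcSecretApi_get): URL that lists secrets matching a name."""
    return f"{BARBICAN_URL}/v1/secrets?name={quote(secret_name)}"


def secret_ref_from_listing(body):
    """Pick the first secret reference out of a Barbican list response."""
    secrets = json.loads(body).get("secrets", [])
    return secrets[0]["secret_ref"] if secrets else None


def secret_payload_url(secret_ref):
    """Step 2 (cf. mtcSecretApi_read): URL that returns the secret payload."""
    return secret_ref.rstrip("/") + "/payload"
```

A caller would issue a GET on the first URL, extract the reference, then
issue a GET on the payload URL, sending the Keystone token with each request.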
Also, did a little cleanup and removed old unused token handling code.
Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
Add the base DevStack job and make sure bashate runs on
the devstack plugin files.
Begin to re-structure the plugin to match the common structure.
Add devstack/build.sh and split out the build steps into
separate functions in devstack/lib/stx-metal.
This is complete; further work is to be done in follow-up changes.
Change-Id: I05f6df758e18f182fb0a05731eddc6cb7f599e51
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
Update pxeboot-update script to accept parameter for
installer base URL
Add a common function to parse the port number from
inst.repo
Update pxeboot and kickstart URLs to support a configurable
HTTP port
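
As an illustration of the port-parsing helper mentioned above, here is a
minimal Python sketch. The real helper lives in the kickstart/pxeboot
scripts; the function name, the default of 80 and the example host name
are assumptions made for this example.

```python
from urllib.parse import urlsplit


def repo_port(inst_repo, default=80):
    """Return the TCP port in an inst.repo URL, or `default` if none is given."""
    port = urlsplit(inst_repo).port
    return port if port is not None else default


# Example values only; the host name and path are hypothetical.
print(repo_port("http://pxecontroller:8080/feed/rel"))
print(repo_port("http://pxecontroller/feed/rel"))
```

Falling back to a default keeps URLs without an explicit port working
unchanged.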
Story: 2004642
Task: 28593
Depends-On: https://review.openstack.org/#/c/634237/
Change-Id: Ibd66e89e49794ca57b938eb43d227860eda6674a
Signed-off-by: Tao Liu <tao.liu@windriver.com>
Replace the existing mechanism for storing BMC passwords in Inventory.
Port all the changes made in SysInv to Inventory to bring the two to parity.
Inventory will use the Barbican API instead of keyring to store
BMC passwords for MTCE as well.
Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I74e971495fa7538d77cfebc28d76fd752af69f5e
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
This update refactors the daemon_system_type function so that it
returns a SIMPLEX system type if it is unable to properly
find and parse the system_mode/system_type from platform.conf.
This is needed for Ansible Bootstrap Deployment, where mtcAgent
and mtcClient need to run and function as they would in a
simplex system before the system type is added to the
platform.conf file.
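
The fallback behavior can be sketched in Python as below. This is not the
mtce C++ implementation: the enum labels and the "Standard" / "All-in-one"
/ "duplex" values are assumptions for illustration. The point shown is that
anything missing or unparseable resolves to SIMPLEX.

```python
from enum import Enum


class SystemType(Enum):
    # Hypothetical labels; mtce defines its own C++ values.
    NORMAL = "normal"
    DUPLEX = "duplex"
    SIMPLEX = "simplex"


def system_type_from_conf(text):
    """Derive the system type from platform.conf contents, defaulting to SIMPLEX."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    sys_type = settings.get("system_type")
    sys_mode = settings.get("system_mode")
    if sys_type == "Standard":
        return SystemType.NORMAL
    if sys_type == "All-in-one" and sys_mode == "duplex":
        return SystemType.DUPLEX
    # Missing or unrecognized values fall back to SIMPLEX.
    return SystemType.SIMPLEX
```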
Change-Id: Ib0130f3559ee3aa8d8d8203ea59d4896a571944f
Story: 2004695
Task: 28714
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update introduces a new Link Monitor daemon to the Mtce
flock of daemons and disables rmon's interface monitoring.
This new daemon parses the platform.conf file and, using the
interface names assigned to each monitored network (mgmt,
infra and oam), queries the kernel for their physical,
bonded and vlan interface names and then registers to listen
for netlink events.
All link/interface state change (netlink) events that correspond
to any of the interfaces or links associated with the monitored
networks are tracked by this new daemon.
This new daemon then also implements an http listener for
localhost initiated GET requests targeted to /mtce/lmond
on port 2122 and responds with a json link_info string that
contains a summary of monitored networks, links and their
current Up/Down status.
lmond behavioral summary:
1. learn interface/port model,
2. load initial link status for learned links,
3. listen for link status change events
4. provide link status info to http GET Query requests.
Another update to stx-integ implements the collectd interface
plugin that periodically issues the Link Status GET requests
for the purpose of alarming port and interface Down conditions,
clearing alarms on Up state changes, and storing sample data
that represents the percentage of active links for each monitored
network.
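
The sample calculation described above (percentage of active links per
monitored network) can be sketched in Python. The JSON layout used here,
with "link_info", "network", "links" and "state" keys, is an assumed schema
for illustration; the actual lmond response format is not given in this
message.

```python
import json


def active_link_percent(link_info_json):
    """Map each monitored network to the percentage of its links that are Up."""
    result = {}
    for net in json.loads(link_info_json)["link_info"]:
        links = net["links"]
        up = sum(1 for link in links if link["state"] == "Up")
        result[net["network"]] = 100.0 * up / len(links) if links else 0.0
    return result


# Hypothetical response body, as it might be returned by
# GET http://localhost:2122/mtce/lmond
sample = json.dumps({"link_info": [
    {"network": "mgmt", "links": [{"name": "eth0", "state": "Up"},
                                  {"name": "eth1", "state": "Down"}]},
    {"network": "oam", "links": [{"name": "eth2", "state": "Up"}]},
]})
print(active_link_percent(sample))
```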
Test Plan:
PASS: Verify lmond process startup
PASS: Verify lmond logging and log rotation
PASS: Verify lmond process monitoring by pmon
PASS: Verify lmond interface learning on process startup
PASS: Verify lmond port learning on process startup
PASS: Verify lmond handling of vlan and bond interface types
PASS: Verify lmond http link info GET Query handling
PASS: Verify lmond has no memory leak during normal and eventful operation
Change-Id: I58915644e60f31e3a12c3b451399c4f76ec2ea37
Story: 2002823
Task: 28635
Depends-On:
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The maintenance token request's response parser looks
for the nova compute endpoint, a day-one implementation from
when mtce actually managed nova. That has long since changed, but
this endpoint lookup remained.
In the new containerized environment the nova compute
endpoint is not always present, and when it's not, mtce
fails to get its token.
Since mtce needs the token for communication with sysinv
this update changes the endpoint lookup type to 'platform'
to match that of sysinv.
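
To illustrate the lookup change, here is a Python sketch of searching a
Keystone v3 token catalog by service type. It is not the mtce parser: the
function name, the default interface and the example URL are assumptions.
The catalog shape (a list of services, each with "type" and "endpoints")
follows the Keystone v3 token response.

```python
def find_endpoint_url(catalog, service_type, interface="internal"):
    """Return the URL of the first matching endpoint for a service type, or None."""
    for service in catalog:
        if service.get("type") != service_type:
            continue
        for endpoint in service.get("endpoints", []):
            if endpoint.get("interface") == interface:
                return endpoint.get("url")
    return None


# Hypothetical catalog with no nova compute entry.
catalog = [
    {"type": "platform", "name": "sysinv",
     "endpoints": [{"interface": "internal", "url": "http://controller:6385/v1"}]},
]
print(find_endpoint_url(catalog, "compute"))   # no compute endpoint present
print(find_endpoint_url(catalog, "platform"))
```

With a catalog like this one, a lookup keyed on 'compute' finds nothing,
while a lookup keyed on 'platform' succeeds.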
Change-Id: I389b64d345e47f7d7bc062671da7c7cc51ac398f
Story: 2004695
Task: 29213
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Remove the automated creation of storage host aggregates and host
population in inventory.
Story: 2004607
Task: 29068
Change-Id: I4a74a1ee1f8b3bc8dc6293a5c971d9c7ed1442b5
Signed-off-by: Jack Ding <jack.ding@windriver.com>
We want to default to running all tox environments under python 3, so
set the basepython value in each environment.
We do not want to specify a minor version number, because we do not
want to have to update the file every time we upgrade python.
We do not want to set the override once in testenv, because that
breaks the more specific versions used in default environments like
py35 and py36.
Change-Id: I1bd6a3aebbbe539d4f21ca71c76d92e3c325c1e8
Closes-Bug: #1802032
This update disables rmon NTP monitoring which is now done
as a collectd plugin with the following depends update.
Story: 2002823
Task: 22859
Depends-On: https://review.openstack.org/#/c/628685/
Change-Id: I736703542c8a6ba3dd9e9db2d6fb7ccbdc906643
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
doc index.rst:
1. Update intro sentence to read as a complete sentence
2. Remove unused toctree
3. Correct heading levels (impacting side nav and correct rendering
of content)
4. Remove "Indices and Tables" section: genindex page not used,
search searches only index (not useful here)
api-ref index.rst:
1. Update intro sentence to read as a complete sentence
2. Update text around search link for consistency (move to
follow intro)
3. Add heading before toctree for consistency with other pages
releasenotes index.rst:
1. Standardize page title reST markup
2. Remove search (make consistent with other openstack release
note pages)
Story: 2004737
Task: 28805
Change-Id: I388cc5d69db56e6e94bf034ece2478933c9d9c1e
Signed-off-by: Kristal Dale <kristal.dale@intel.com>
Add maintenance services as stx-metal plugin.
Enable services by both node type and metal components.
Target:
Mtce services are installed and active (running) in devstack.
Story: 2003161
Task: 23296
Change-Id: I2123c64fb1b70bd135e8945d7ff7f4f3691bdbcc
Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com>
This commit creates the cgts-vg volume group automatically on worker
nodes by kickstart. The cgts-vg volume group reserves space for
log-lv, scratch-lv, docker-lv and ceph-mon-lv.
For the AIO configuration, this commit reserves space in the cgts-vg
volume group for a 30G docker-lv and a 20G ceph-mon-lv.
Story: 2004520
Task: 28663
Change-Id: Ic77d00c354da1070e2c4c2da4545d70ab4a93d91
Signed-off-by: Wei Zhou <wei.zhou@windriver.com>
Starting collectd too early in the manifest apply is seen
to occasionally fail because the hostname resolution configuration
that FQDNLookup depends on is not yet complete.
Since influxdb is used by collectd and is a controller-only
service, this update moves it to the manifest apply
post stage as well and filters it out of non-controller
load types.
This issue is fixed by the following multi-git changes.
stx-metal: This update.
Filter influxdb out of storage and compute only loads.
No real inter git merge dependency
stx-integ:
Add startup Before=pmond dependency
stx-config:
Move collectd config and startup to manifest apply post stage
Move influxdb config and startup to manifest apply post stage
Test Plan:
PASS: Build iso
PASS: Verify install storage system and collectd startup
PASS: Verify Storage system DOR
PASS: Verify influxdb and extensions excluded in non-controller loads
PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
PASS: Verify collectd pmond process monitoring and recovery
PASS: Verify influxdb pmond process monitoring and recovery
PEND: Verify collectd statistics storage and fetch to/from influxdb
PEND: Install AIO DX and verify collectd and influxdb startup
Change-Id: I8c71f36978620e0650062cc848bfb9d85f6810b2
Closes-Bug: 1797909
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
1. Build-iso - PASS
2. Install iso and unlock all hosts - PASS
3. Force reboot on unlocked host to verify heartbeat failure detection
and graceful recovery. PASS
4. Verify hbsAgent logs for unexpected logs. PASS
Change-Id: Ia4f52d3ffa52152914f3c221fa6eb860d127724b
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
Closes-Bug: 1806963
In the case where the active controller experiences a
spontaneous reboot failure there is the potential for
a race condition in the new Active-Active Heartbeat
model between the inactive hbsAgent and mtcAgent
starting up on the newly active controller.
The inactive hbsAgent can report a heartbeat Loss before
SM starts up the mtcAgent. As a result, the heartbeat-failed
host goes undetected.
This update modifies the hbsAgent to continue to report
heartbeat Loss at a throttled rate while the hbsAgent
continues to experience heartbeat loss of enabled monitored
hosts. This change is implemented in nodeClass.cpp.
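
The throttled re-reporting idea can be sketched in Python as follows. This
is not the nodeClass.cpp code: the class name, method names and interval
are hypothetical, and time is passed in explicitly to keep the example
deterministic.

```python
class ThrottledLossReporter:
    """Re-report heartbeat loss for a host at most once per interval."""

    def __init__(self, interval_secs):
        self.interval_secs = interval_secs
        self.last_report = {}  # hostname -> time of last report

    def should_report(self, hostname, now):
        """Return True if a Loss report for this host is due at time `now`."""
        last = self.last_report.get(hostname)
        if last is not None and now - last < self.interval_secs:
            return False  # still inside the throttle window
        self.last_report[hostname] = now
        return True

    def clear(self, hostname):
        """Forget a host once its heartbeat recovers."""
        self.last_report.pop(hostname, None)
```

Because the report repeats while the loss persists, a mtcAgent that starts
late still receives a Loss notification after it comes up.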
Debug of this issue also revealed another undesirable race
condition and logging issue when a controller is locked. This
issue is remedied with the introduction of a control structure
'locked' state that is set on controller lock and checked in
the hbs_cluster_update utility (hbsCluster.cpp).
Two additional hbsAgent logging changes were implemented with
this update.
1. Only print "missing peer controller cluster view" on a
state change event. Otherwise, this becomes excessive
whenever the inactive controller fails.
hbsAgent.cpp
2. Don't print the full heartbeat inventory and state banner
with hbsInv.print_node_info on every heartbeat Loss event.
Otherwise, this becomes excessive in larger systems.
hbsCluster.cpp
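
The first logging change above amounts to edge-triggered logging. A minimal
Python sketch, with a hypothetical class name and an in-memory list
standing in for the log stream:

```python
class EdgeTriggeredLog:
    """Record a message only when its condition changes from false to true."""

    def __init__(self, message):
        self.message = message
        self.active = False
        self.lines = []  # stand-in for the real log stream

    def update(self, condition):
        if condition and not self.active:
            self.lines.append(self.message)
        self.active = condition


log = EdgeTriggeredLog("missing peer controller cluster view")
for missing in [True, True, True, False, True]:
    log.update(missing)
print(len(log.lines))
```

Five updates produce two log lines here, one per transition into the
missing state, rather than one line per update.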
Test Plan:
PASS: Verify hbsAgent log stream for implemented improvements.
PASS: Verify Lock inactive controller several times.
PASS: Fail inactive controller several times. verify detect.
PASS: Reboot active controller several times. verify detect.
PASS: DOR System several times. Verify proper recovery.
PASS: DOR system but prevent power-up of several hosts. Verify detect.
Change-Id: I36e6309e141e9c7844b736cce0cf0cddff3eb588
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This increases the default docker distribution partition size from
1G to 16G. This also increases the minimum disk requirements from
130G to 145G for small disk, 170G to 185G for large disk.
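
The new minimums follow directly from the partition growth, as this short
Python check shows (variable names are illustrative only):

```python
OLD_DOCKER_DIST_G = 1
NEW_DOCKER_DIST_G = 16
growth = NEW_DOCKER_DIST_G - OLD_DOCKER_DIST_G  # 15G of additional space

old_minimums = {"small": 130, "large": 170}
new_minimums = {kind: size + growth for kind, size in old_minimums.items()}
print(new_minimums)  # {'small': 145, 'large': 185}
```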
Story: 2004520
Task: 28526
Change-Id: I898cfac45757ff1f9e6ce7c4928bbd9a42dca77d
Signed-off-by: Angie Wang <angie.wang@windriver.com>
This update replaces compute references with worker in mtce,
kickstarts, installer and bsp files.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked-enabled with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
The maintenance process monitor is failing the hbsClient
process over config or process reload operations.
The issue relates to the hbsClient's subfunction being
'last-config' without pmon properly gating the active
monitoring FSM from starting until the passive monitoring
phase is complete and in the MANAGE state.
Test Plan
PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure
PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.
Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.
When a host reaches a particular failure mode's max failure
threshold, maintenance puts it into the 'unlocked-disabled-failed'
state and leaves it that way with no further recovery action until
it is manually locked and unlocked.
The thresholded Enable failure causes are:
Configuration Failure ....... threshold:2 retry interval:30 secs
In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
Start Host Services Failure . threshold:2 retry interval:30 secs
Heartbeat Soak Failure ...... threshold:2 retry interval:10 minutes
This update refactors the old auto recovery for AIO SX into this
more generic framework.
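
A minimal Python sketch of the thresholded retry logic follows. It is not
the mtce implementation: the class, method and key names are hypothetical,
and it assumes the count includes the first failure. The thresholds and
retry intervals are the ones listed above.

```python
from dataclasses import dataclass, field

# cause -> (threshold, retry interval in seconds), from the list above.
LIMITS = {
    "config": (2, 30),
    "goenabled": (2, 30),
    "host_services": (2, 30),
    "heartbeat_soak": (2, 600),
}


@dataclass
class AutoRecovery:
    counts: dict = field(default_factory=dict)
    disabled: bool = False

    def enable_failed(self, cause):
        """Record a failure; return the retry delay in seconds, or None to stop."""
        threshold, interval = LIMITS[cause]
        self.counts[cause] = self.counts.get(cause, 0) + 1
        if self.counts[cause] >= threshold:
            self.disabled = True  # host stays unlocked-disabled-failed
            return None
        return interval

    def enable_succeeded(self):
        """A successful enable breaks the back-to-back failure streak."""
        self.counts.clear()

    def lock_unlock(self):
        """A manual lock and unlock re-arms auto recovery."""
        self.counts.clear()
        self.disabled = False
```

Keeping the limits in one table is what lets a single code path serve
every failure cause, including the older AIO SX case.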
Story: 2003576
Task: 24905
Test Plan:
PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling
PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)
PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message
Regression:
PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: Verify in-service heartbeat failure handling
PASS: Verify no segfaults during UT
Corner Cases:
PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.
Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>