The disable_worker_services file was originally created
to prevent the (bare metal) nova-compute services from
running on a newly upgraded controller in an AIO-DX
configuration. This situation no longer exists because
the bare metal nova-compute services do not exist after
transitioning to containers, so this flag is no longer needed.
Remove all references to the disable_worker_services file.
Change-Id: Ic9555a36890f613f440e97f9090b22ff5ec8fd82
Partial-Bug: #1838432
Signed-off-by: marvin <weifei.yu@intel.com>
devstack is failing, most likely because StarlingX
uses postgres, and postgres was dropped in devstack by:
cf1c847191
I am not removing the devstack job declaration, or the devstack files
because in the future StarlingX could convert from postgres to
another DB backend, at which point we might want to revisit
using devstack.
Change-Id: I3adec4669d9181d71421f43905f86bf2e7e211c2
Partial-Bug: 1848557
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
In the case of a switch recycle, the connected NIC goes down and comes
back up, but communication is restored only after the switch is up and
running. This can take a few seconds (much longer than anticipated).
This change holds off the i/f state update to the peer.
Also remove the batching interface failover state change. This is already
handled in the failover fsm fail_pending state.
Change-Id: Ia810927dbbc4b3821f7915e6a42bceeac43d9e46
Closes-Bug: 1845393
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Not having this change causes a linter error in opensuse.
Change-Id: I52830fa64bdb5f1b5bb00c4052f3c047be728bb3
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
The sm component had the 1.0.0 version in the folder name. This
change removes that version and updates the centos_pkg_dirs.
Story: 2006623
Task: 36827
Depends-On: https://review.opendev.org/#/c/685128/
Change-Id: I6725d1f961c2a82275da5fabbff8e89a8dd6f245
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
The sm-db component had the 1.0.0 version in the folder name. This
change removes that version and updates the centos_pkg_dirs.
Story: 2006623
Task: 36829
Depends-On: https://review.opendev.org/#/c/685127
Change-Id: Ia6025337529f4f48a89c175bb524548d81bc993f
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
The sm-common component had the 1.0.0 version in the folder name. This
change removes that version and updates the centos_pkg_dirs.
Story: 2006623
Task: 36828
Change-Id: I0e998a3e2482bc06f3a91f9494a3e5d21faa28e7
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
The opensuse build system reported two linter issues regarding
the LSB scripts in sm. The issues are:
- For `sm`: Has `Should-Start` but no `Should-Stop`.
- For `sm-shutdown`: LSB header not found.
To fix these issues, a `Should-Stop` line was added to `sm` and
an LSB header was added to the `sm.shutdown` script.
In `sm.shutdown`, `Default-Start` and `Default-Stop` were set
to the same values as in `sm`. `sm.shutdown` does nothing in the start
stage, so this change won't affect any functionality.
Story: 2006508
Task: 36648
Change-Id: I4fac67a0a1c1abd82e47a3293aeae3036ee9722b
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
AIO-SX by design does not have a peer, so it never needs to
communicate with a potential peer before determining its role. On
AIO-SX, even if all network interfaces are down, the node should
still go enabled based on its own state.
Closes-Bug: 1844427
Change-Id: Iafe0a8209cdbd3f83514c07041856cf6b6824f9c
Signed-off-by: Bin Qian <bin.qian@windriver.com>
The linters in the opensuse build service are failing because sm_client has
unneeded python shebangs in the code. A python source file that is not
intended to be executed should not include a shebang.
The linter also fails because `/usr/bin/env python` is used, which causes the
dependency discovery tool to fail. It is safe to use `/usr/bin/python` as
we currently don't provide any other python version.
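The two rules described above can be sketched as follows. This is only an
illustration in Python, not the opensuse linter itself; the function name
`shebang_issues` and the message strings are hypothetical.

```python
def shebang_issues(source, executable):
    """Return a list of shebang problems for one Python source file.

    source     -- full text of the file
    executable -- True if the file is meant to be run directly
    """
    lines = source.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line.startswith("#!"):
        return []  # no shebang, nothing to report
    if not executable:
        # Rule 1: library modules should not carry a shebang at all.
        return ["unneeded shebang in non-executable file"]
    interpreter = first_line[2:].split()
    if interpreter and interpreter[0] == "/usr/bin/env":
        # Rule 2: the interpreter is resolved through env, so dependency
        # discovery cannot tell which interpreter package is required.
        return ["shebang uses /usr/bin/env; use an absolute interpreter path"]
    return []
```

With these rules a module that is only imported loses its shebang, and a
script that is executed switches to `#!/usr/bin/python`.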
Story: 2006508
Task: 36647
Change-Id: If3f83b9562414c3392515828a3c716a5bc23015d
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
Building sm is not possible in opensuse because the code produces
format-truncation warnings and the opensuse build system
enforces the -Werror flag.
The solution is to define the proper string lengths.
- SM_INTERFACE_NAME_MAX_CHAR was set to IFNAMSIZ.
- SM_SERVICE_ACTION_PLUGIN_EXIT_CODE_MAX_CHAR was increased to 32.
- SM_SERVICE_HEARTBEAT_ADDRESS_MAX_CHAR was decreased to 108.
The database schema was updated to match.
Story: 2006523
Task: 36551
Change-Id: Icce1d912c147fc6caaf06cc93de3cddadbcb0720
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
IPv6 multicast should be sent to the interface that the socket
binds to.
Closes-Bug: 1842949
Change-Id: I14b6c5193c67a0ddd69e31d1044219c4e9fd6b94
Signed-off-by: Bin Qian <bin.qian@windriver.com>
In AIO-DX, during a swact, kubectl commands issued by dbmon respond
slower than expected. dbmon reports an error when a kubectl command
does not respond within 5 seconds, and this 5 second timeout is too
short.
Extend the timeout to 10 seconds to avoid reporting unnecessary errors.
Change-Id: Ie07c84e0a53c00ac78970bf6b06e6cf0b19479e1
Closes-Bug: 1837919
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This commit updates the barbican OCF scripts to address
logging issues:
- barbican-api is updated to set permissions on the logfile
to restrict access
- barbican-keystone-listener and barbican-worker are updated
to log via syslog
Depends-On: I31b29bb8ffff28cd329b383704b88cf73199bcec
Change-Id: I814d35ca3e55fbfb9e0a462f3f05ff2db6a9cca5
Partial-Bug: 1836632
Signed-off-by: Don Penney <don.penney@windriver.com>
It turns out that when swacting we can end up with kubernetes going
down for a while, causing kubectl commands to hang.
Accordingly, let's add some timeouts to critical commands to limit
how long they can hang for.
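The idea can be sketched as follows. This is a Python illustration rather
than the shell used by the OCF scripts; `run_with_timeout` and the timeout
values are hypothetical and are not taken from this change.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return (ok, stdout).

    ok is False if the command exits non-zero or does not finish within
    timeout_s seconds. On timeout the child process is killed, so the
    caller never blocks longer than roughly timeout_s.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    # Stand-in for a kubectl call that hangs while kubernetes is down.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging, timeout_s=0.5))
```

A timed-out command is treated the same as a failed one, so the caller can
report "not running" instead of hanging for the whole outage.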
Depends-On: I8d91dc13cb9a9adb7f7a7a95faadad4339ddb466
Change-Id: I777895497300cc605762db002958a778cd204e49
Story: 2004712
Task: 30410
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
On a two-node system the openstack-helm chart for mariadb has issues.
If you run with a single replica then your failover times are very
long due to internal timeouts in kubernetes which prevent accessing
the backing volume on the newly-active node.
At the same time, you can't run a "garbd" pod the way we do on a full
lab configuration because there is no third node to run it on.
The only viable option we've found is to trigger something to
explicitly tell the mariadb pod on the active node to bootstrap a new
primary cluster if it loses quorum due to the other mariadb pod going
away unexpectedly.
Accordingly, this commit creates a new "dbmon" OCF script which
behaves basically as follows:
start -- return $OCF_SUCCESS
stop -- return $OCF_NOT_RUNNING
standby -- return $OCF_SUCCESS or $OCF_NOT_RUNNING depending on whether
           mariadb on this node is a member of the primary cluster
active -- if mariadb on this node is not a member of the primary cluster
          then tell it to bootstrap a new primary cluster. Then check
          again and return $OCF_SUCCESS or $OCF_NOT_RUNNING depending on
          whether mariadb on this node is a member of the primary cluster
monitor -- If mariadb on this node is a member of the primary cluster
           then return $OCF_RUNNING_MASTER on the active controller and
           $OCF_SUCCESS on the standby controller. If mariadb is not a
           member of the primary cluster return $OCF_NOT_RUNNING.
There are a few complicating factors.
If the openstack application or mariadb chart is not installed then treat it
like being a member of the primary cluster.
If the mariadb pod is still initializing treat it like not being a member
of the primary cluster.
If we're in a standard lab (with garbd running on a compute node) then
don't actually tell mariadb to bootstrap a new primary cluster but just
report whether it's a member of the primary cluster or not.
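The behaviour described above can be sketched as a small decision table.
This is an illustration in Python, not the dbmon script itself (which is an
OCF shell script). The `NodeState` fields, the `bootstrap` callback and the
order in which the special cases are checked are assumptions; the numeric
values are the standard OCF return codes.

```python
from dataclasses import dataclass

OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8


@dataclass
class NodeState:  # hypothetical snapshot of what the script would query
    chart_installed: bool       # openstack application and mariadb chart present
    pod_initializing: bool      # mariadb pod still starting up
    in_primary_cluster: bool    # membership as reported by mariadb
    garbd_present: bool         # standard lab with garbd on a compute node
    is_active_controller: bool


def is_primary_member(state):
    if not state.chart_installed:
        return True             # nothing to monitor: treat as a member
    if state.pod_initializing:
        return False            # not usable yet: treat as not a member
    return state.in_primary_cluster


def dbmon_action(action, state, bootstrap):
    """Return the OCF code for one action; bootstrap(state) is a callback."""
    if action == "start":
        return OCF_SUCCESS
    if action == "stop":
        return OCF_NOT_RUNNING
    member = is_primary_member(state)
    if action == "active":
        # Only force a new primary cluster when there is no garbd arbiter.
        if not member and not state.garbd_present:
            bootstrap(state)
            member = is_primary_member(state)  # check again
        return OCF_SUCCESS if member else OCF_NOT_RUNNING
    if action == "standby":
        return OCF_SUCCESS if member else OCF_NOT_RUNNING
    if action == "monitor":
        if not member:
            return OCF_NOT_RUNNING
        return OCF_RUNNING_MASTER if state.is_active_controller else OCF_SUCCESS
    raise ValueError("unknown action: %s" % action)
```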
Story: 2004712
Task: 30410
Depends-On: I2667d56a71b7d3881c03b6a5c1e5ed61d4f0b902
Change-Id: I8d91dc13cb9a9adb7f7a7a95faadad4339ddb466
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Create OCF scripts for controlling the lifecycle of Barbican processes.
There are three Barbican processes that need to be managed:
barbican-api, barbican-keystone-listener and barbican-worker.
Depends-On: I63a6fd3d112a98449ea22524bb2a83b5db8ce6d1
Change-Id: I2667d56a71b7d3881c03b6a5c1e5ed61d4f0b902
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
As part of switching to the upstream implementation of cinder B&R, we
need an OCF script to manage the cinder-backup process.
Depends-On: I6bec51c7401339f4c71f9558d73389d0c793093d
Change-Id: I63a6fd3d112a98449ea22524bb2a83b5db8ce6d1
Story: 2003715
Task: 26375
Signed-off-by: Scott Little <scott.little@windriver.com>
The following upstream projects did not have OCF scripts and these were
created for StarlingX:
aodh-api
aodh-evaluator
aodh-listener
aodh-notifier
ceilometer-agent-notification
heat-api
heat-api-cfn
heat-api-cloudwatch
ironic-api
ironic-conductor
magnum-api
magnum-conductor
murano-api
murano-engine
nova-conductor
nova-placement-api
nova-serialproxy
panko-api
Move these out of stx/git.openstack-ras and place them into a separate
package within the openstack/stx-upstream repo.
Depends-On: I080b6e893d5f6ccff04951879eed71e8ccbe0b52
Change-Id: I6bec51c7401339f4c71f9558d73389d0c793093d
Story: 2003715
Task: 26375
Signed-off-by: Scott Little <scott.little@windriver.com>
Use templates instead of individual jobs so that these
can be changed in one place.
Depends-On: https://review.opendev.org/677606
Change-Id: Ic70832ed4e4fba3343381f7ead611085c0849994
The glance devstack plugin is not working for us
and is not needed for our devstack to work, so update
the zuul job to use the "min" devstack version that is used
by other repos such as 'fault', and avoid setting up the
glance devstack plugin altogether.
Change-Id: Id16671961e10962530d2eaff28387b4b206e0a3b
Partial-Bug: 1840292
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
The reported bug occurred because the dbmon service audit timer was
accidentally overwritten, so no audit was performed and the
dbmon service was not actually being audited.
The major change is to enhance the timer system to use globally unique
timer ids (never reused) to ensure a timer is not deregistered twice by 2
different mechanisms (disarm/deregister).
Change the timer id to a 64 bit integer to ensure the id never overflows.
The above change eliminates the double deregistration issue, which
could accidentally deregister a new timer that reuses the same id.
Also clean up cases that could deregister a timer twice
(although this is no longer harmful now that the above change is
in place).
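Why unique ids make a second deregister harmless can be sketched as
follows. This is a Python illustration, not the sm timer code;
`TimerRegistry` and its methods are hypothetical names.

```python
import itertools


class TimerRegistry:
    """Hands out timer ids that are never reused.

    The id is a monotonically increasing counter (a 64 bit integer in a
    C/C++ implementation; Python integers do not overflow).
    """

    def __init__(self):
        self._next_id = itertools.count(1)
        self._timers = {}

    def register(self, name, callback):
        timer_id = next(self._next_id)
        self._timers[timer_id] = (name, callback)
        return timer_id

    def deregister(self, timer_id):
        # A stale id is simply not found, so a second deregister of an old
        # timer cannot remove a newer timer by accident.
        return self._timers.pop(timer_id, None) is not None

    def is_registered(self, timer_id):
        return timer_id in self._timers
```

If ids were reused, the second deregister of the old id could remove
whichever new timer had been given that id, which is how the audit timer
was lost.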
Change-Id: I2603870d2eb2749d78456e406095ae543353963f
Closes-Bug: 1837724
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This update adds the dcdbsync service for containerized openstack
services to SM. Note that this second dcdbsync instance also
runs on the platform (it is not containerized).
Story: 2004766
Task: 36099
Change-Id: If406127d26d6230771c0d44105da3a08facf3277
Signed-off-by: Andy Ning <andy.ning@windriver.com>
The filesystem /opt/cgcs is removed and the “helm_charts” and “keystone”
folders now reside under /opt/platform.
ls /opt/platform/
armada config helm nfv puppet sysinv
ls /opt/cgcs/
helm_charts keystone
Resources related to cgcs-drbd and /opt/cgcs are removed from puppet.
SMS is no longer monitoring these resources.
Tested in AIO-SX, AIO-DX and Standard hardware labs.
Depends-On: https://review.opendev.org/674360
Partial-Bug: 1830142
Change-Id: I4be7a877efb89bb9e5c2b067bdc7e4259f2b0c0c
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
Note: this only affects AIO-DX setups as that is the only kind
of setup where ceph-mon is managed by SM
In some edge-cases, during a swact, ceph-mon may take too long
to be stopped on the active controller resulting in a failed
swact.
This change increases the timeout to account for those
edge cases.
Change-Id: I3ace73650e4fe9aafc84c82e2ffe048f2039305e
Partial-bug: 1836075
Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>
Add mtc-agent service dependency to fm-mgr to ensure mtc-agent shuts
down before fm-mgr does.
In rare cases, a swact occurs while mtc-agent is trying to clear an
alarm after fm-mgr has already been disabled, so the clear alarm
message is lost. The alarm therefore can never be cleared.
Closes-Bug: 1829289
Change-Id: I39196d5f3ce764a14b4d1e0fb1a4f3344ddd6a1a
Signed-off-by: Bin Qian <bin.qian@windriver.com>