Add a NovaLibvirtMaxQueues role parameter to set [libvirt]/max_queues in
nova.conf on the computes. The default of 0 leaves the option unset, so
the legacy limits based on the reported kernel major version are used.
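A minimal environment file sketch, assuming a queue count of 4 is wanted
(the value is only an example):

    parameter_defaults:
      # Cap the number of virtio queue pairs; 0 keeps the legacy
      # kernel-version-based limits
      NovaLibvirtMaxQueues: 4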
Conflicts:
deployment/nova/nova-compute-container-puppet.yaml
Depends-On: https://review.opendev.org/c/openstack/puppet-nova/+/772805
Change-Id: I353e8ca2676bbdceb056f8b2b084bc5102f52c1f
(cherry picked from commit 67a5a78897)
nova-consoleauth was removed in Stein, so its compute service records
need to be deleted during major upgrades.
Related: https://bugzilla.redhat.com/1825849
Change-Id: I74465f5ae77a0666540d3465e2ad29b03f9bd3c3
(cherry picked from commit 04405abdd4)
When a node has hugepages enabled, live migrations can be helped along
by enabling NovaLiveMigrationPermitPostCopy and
NovaLiveMigrationPermitAutoConverge.
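As a sketch, both options can be turned on from an environment file
(boolean values assumed):

    parameter_defaults:
      NovaLiveMigrationPermitPostCopy: true
      NovaLiveMigrationPermitAutoConverge: true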
Related: https://bugzilla.redhat.com/1298201
Conflicts:
deployment/nova/nova-compute-container-puppet.yaml
Change-Id: I1133c210f35181d44f8ba56f09b52f00589e035c
(cherry picked from commit df207fd2e9)
CephFS gatewayed by NFS is more generally suitable for multi-tenant
OpenStack deployments than native CephFS since the latter requires
that VMs belonging to regular members of Keystone projects be exposed
to the Ceph infrastructure and run client software with capabilities
that are not appropriate for untrusted cloud tenants.
Change-Id: I269607d43f45f65efcbce33dd776e7eb4f475311
(cherry picked from commit 63c5a94f83)
libvirt-daemon is part of the default overcloud image, but it may not be
installed or may have been removed by operators. In that case,
tripleo_nova_libvirt_guests will fail.
Related: https://bugzilla.redhat.com/1810319
Change-Id: I0814bd8794ab82792837b27d0128e15c34b90adc
(cherry picked from commit 93b5c3a20e)
We currently forcefully install pacemaker + pcs in a number of upgrade
tasks. This is suboptimal because on certain nodes the pcs/pacemaker
repos are not enabled. Let puppet pull in those packages normally
instead.
Tested successfully during a Queens -> Train FFU on a composable roles
environment.
Closes-Bug: #1911684
Change-Id: I97b42e618bcd31e408374157eb10d315ac62f306
(cherry picked from commit 4a599e3721)
When we clear the cached facts and some nodes are unreachable, we
attempt to gather facts from them by default. This can cause those
nodes to be skipped in every future playbook, which ends up bypassing
all our failure percentage logic.
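A minimal sketch of the idea (not necessarily the exact playbook used
here): disable implicit fact gathering in the play that clears the
cache, so unreachable nodes are not penalized:

    # illustrative play: clear cached facts without contacting the nodes
    - hosts: all
      gather_facts: false
      tasks:
        - meta: clear_facts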
Change-Id: Ie240877496b73a37f553a84af47dfebdbaf899e5
Related-Bug: 1908573
(cherry picked from commit 969693e667)
When running a minor update in a composable HA, different
roles can run ansible tasks concurrently. However,
there is currently a race when pacemaker nodes are
stopped in parallel [1,2], which can cause nodes to
incorrectly stop themselves once they reconnect to the
cluster.
To prevent concurrent shutdown, use a cluster-wide lock
to signal that one node is about to shut down, and block
the others until that node disconnects from the cluster.
Tested the minor update in a composable HA environment:
. when run with "openstack update run", every role
is updated sequentially, and the shutdown lock
doesn't interfere.
. when running multiple ansible tasks in parallel
"openstack update run --limit role<X>", pacemaker
nodes are correctly stopped sequentially thanks
to the shutdown lock.
. when updating an existing overcloud, the new
locking script used in the review is correctly
injected on the overcloud, thanks to [3].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1791841
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1872404
[3] I2ac6bb98e1d4183327e888240fc8d5a70e0d6fcb
Closes-Bug: #1904193
Change-Id: I0e041c6a95a7f53019967f9263df2326b1408c6f
(cherry picked from commit cb55cc8ce5)
Without this patch, files in /srv/node are relabeled on every start of
the account_auditor and/or account_reaper containers. If there are many
files (e.g. when using Gnocchi) this takes a long time, sometimes
dozens of minutes per container start, and can break upgrades/updates.
Relabeling already happens in step 3; that is sufficient and avoids
additional delays when (re-)starting the containers.
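For illustration only (the real service definitions differ in detail),
the repeated relabeling comes from the ':z' SELinux flag on the bind
mount; omitting it for these containers avoids re-walking /srv/node on
each start:

    volumes:
      - /srv/node:/srv/node   # no ':z' suffix, so no relabel on container start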
Closes-Bug: 1907070
Change-Id: I172ae8f35df34887aaf61b3e03d5aaab1d462a60
(cherry picked from commit 191d160903)
In spine-and-leaf TLS-e deployments as done in OSP13,
services are filtered based on role networks when adding
metadata for novajoin. This filtering removes valid
services because the role's networks don't match the
global ServiceNetMap.
Add a role-based parameter, {{role.name}}ServiceNetMap,
that can be used to override the ServiceNetMap per role
when it is passed to {{role.name}}ServiceChain and
the {{role.name}} resource group.
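An illustrative sketch of a per-role override (the role name, service
key and network name are made up for the example):

    parameter_defaults:
      ComputeLeaf1ServiceNetMap:
        NovaMetadataNetwork: internal_api_leaf1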
Related: RHBZ#1875508
Closes-Bug: #1904482
Change-Id: I56b6dfe8a0e95385e469d9eac97a0ec24e147450
(cherry picked from commit be6a844a79)
The octavia_rsyslog container should not use the octavia user and group,
since the rsyslog image doesn't know about that user. It should run as
root and have a dedicated log directory on the host to avoid mixing
ownerships with the octavia containers, which run as the octavia user.
Change-Id: Ie7eb7905eb33235fc73f94b9e84f553394e951fd
Closes-Bug: #1907260
(cherry picked from commit ffd86b3f2c)
In order to configure OVN+SRIOV, the following THT files need to be
imported:
- neutron-ovn-ha.yaml
- neutron-ovn-sriov.yaml
Change-Id: Ia918d8b92d9bb1efac9bec2017b88593bf362834
Resolves: rhbz#1913700
(cherry picked from commit 831f5d65fa)
After change [1] nova-compute launches libguestfs using the default
``qemu:///system``, but when ``inject_password`` is set to true and a
user tries to create a VM, the VM creation succeeds while a libguestfs
error shows up in the nova-compute logs.
This change forces libguestfs to use the ``direct`` backend when
launching instances on the host.
[1] Ib55936ea562dfa965be0764647e2b8e3fa309fd6
Change-Id: I195358742c19d6ea0a3d32979896c0268e3b55a6
Closes-bug: #1912141
(cherry picked from commit 67917bf650)
Setting nova::metadata::dhcp_domain will no longer work unless nova::metadata
is included.
Since I07caa3185427b48e6e7d60965fa3e6157457018c we no longer include
nova::metadata on computes.
So we must now set nova::dhcp_domain in nova-base instead of relying on the
deprecated nova::metadata::dhcp_domain param.
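If a deployment needs a non-default domain, a hedged sketch using role
ExtraConfig (the domain value is only an example) would be:

    parameter_defaults:
      ComputeExtraConfig:
        nova::dhcp_domain: 'example.tld'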
Closes-bug: #1905418
Depends-on: I98fe83e0c245388944529cd19b5e2bbed134e855
Change-Id: Iaf7823ea8d456008c1f4a3d7631657faa65eb6d3
(cherry picked from commit bf7ef6b4d7)
Added MemcachedMaxConnections to allow overriding the maximum number of
connections. The current limit is 8192 connections, but in some cases
the environment will create more than 8192 connections to each
memcached server.
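A minimal environment sketch (the value is only an example):

    parameter_defaults:
      MemcachedMaxConnections: 16384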
Closes-Bug: #1911664
Change-Id: Iaef7c01127327f709577bef3d2e96db840ba2b80
(cherry picked from commit bbed1ef736)
The change https://review.opendev.org/#/c/616116 unwound the swift
part of the https://review.opendev.org/#/c/590008/ changes. As a
result, the contents of the /var/lib/config-data/swift_ringbuilder
config volume were managed by the container-puppet tool again. That
made the swift containers restart on every deployment/update run,
because the puppet-generated rings change each time.
Restore the unwound change and exclude the swift rings from the
management of the container-puppet tooling. Instead, make the init
containers swift_copy_rings and swift_setup_srv always execute (the
same approach as in https://review.opendev.org/#/c/564798/).
That also fixes the issue of swift_copy_rings apparently never being
executed - at least there is no trace of it in the CI job logs for the
swift init containers.
Change-Id: I23b469057e4c47c42601beb166f815ee71147c14
Closes-Bug: #1867765
Related-Bug: #1802066
Related-Bug: #1786065
Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
(cherry picked from commit 9bc6640907)
Updating rings today consists of multiple steps:
1. Apply puppet in the swift_ringbuilder container
2. Copy files in step 3 using swift_copy_rings
3. Run kolla_set_configs to copy files to /etc/swift before starting
the Swift service processes
Today this requires a container restart, because kolla_set_configs is
executed only on container (re-)starts.
This patch executes kolla_set_configs at step 5 of the deployment and
applies any ring changes to the Swift processes without a container
restart. The Swift processes will notice the changed ring files
within 15 seconds and will use the updated rings.
Co-authored-by: Bogdan Dobrelya <bdobreli@redhat.com>
Change-Id: Ibdd783b484a84c0fdfaac84d892a8ea46be85fde
(cherry picked from commit cb982440d7)
Add the ability to specify the private key size used when creating
certificates. The default value stays the same as before, 2048 bits.
The key_size value can also be overridden per service.
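An illustrative sketch of the new knobs (a global default plus a
per-service override; the exact per-service parameter prefix is an
assumption for the example):

    parameter_defaults:
      CertificateKeySize: '4096'
      HAProxyCertificateKeySize: '2048'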
Depends-on: I4da96f2164cf1d136f9471f1d6251bdd8cfd2d0b
Change-Id: Ic2edabb7f1bd0caf4a5550d03f60fab7c8354d65
(cherry picked from commit 9760977529)
Currently galera and ovn require a coordinated restart across the
controller nodes when certmonger determines that the certificate for a
node has expired and needs to be regenerated.
But right now, when the tripleo certmonger puppet module is called to
assert the state of the certificates, it ends up regenerating new
certificates unconditionally. So galera and ovn get restarted on stack
update, even when there is no need to.
To mitigate these unnecessary restarts, disable the post-action for now
until we fix the behaviour of tripleo's certmonger puppet module. This
has the side effect that services won't get restarted automatically
when the certificate expiration date is reached unless a stack update
takes place.
Related-Bug: #1906505
Change-Id: I17f1364932e43b8487515084e41b525e186888db
(cherry picked from commit 8b16911cc2)
A resource lock is used as a synchronization point between
pacemaker cluster nodes. It is currently implemented
by adding an attribute in an offline copy of the CIB and merging
the update into the CIB only if no concurrent update has
occurred in the meantime.
The problem with that approach is that - even if the concurrency
is enforced by pacemaker - the offline CIB contains a snapshot
of the cluster state, so pushing back the entire offline CIB
pushes old resource state back into the cluster. This puts
additional burden on the cluster and sometimes causes unexpected
cluster state transitions.
Reimplement the locking strategy with cibadmin; it is a much faster
approach that provides the same concurrency guarantees and only
changes one attribute rather than the entire CIB, so it doesn't
cause unexpected cluster state transitions.
Closes-Bug: #1905585
Change-Id: Id10f026c8b31cad7b7313ac9427a99b3e6744788
(cherry picked from commit c8f5fdfc36)
Currently, multiple scripts are stored in the
/var/lib/container-config-scripts directory. If any of these scripts
are used during the update_tasks, their content won't be up to date
(or the script will be missing entirely if it was added in a new
release), because the content of this folder is only updated during
the deploy tasks (step 1), which run after all the update_tasks.
This patch gathers the tasks responsible for creating the folder and
updating its content into a new playbook named
common_container_config_scripts.yaml. This way we can reference the
tasks from the deploy-tasks step 1 playbook (as happened up to now)
and also invoke them before the update_tasks playbook gets called.
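Conceptually, a sketch of the wiring (the exact invocation point in the
generated playbooks is an assumption):

    # run the shared tasks before any update_tasks are executed
    - import_playbook: common_container_config_scripts.yaml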
Change-Id: I2ac6bb98e1d4183327e888240fc8d5a70e0d6fcb
Related-Bug: #1904193
(cherry picked from commit bb8cb15d20)
The tripleo_free strategy cuts the run time of composable upgrades.
But as each node can be at a different point in the code, the
clear_facts part needs to be removed from tripleo-packages.
clear_facts applies globally, meaning that if a messaging node looks
up the distribution fact while a controller runs clear_facts, we will
fail. clear_facts was added to force ansible to reset the
ansible_python_interpreter. This can instead be avoided by simply
setting the default python with the alternatives command, and it is
irrelevant to any version other than stable/train.
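A hedged sketch of that idea as an Ansible task (the real task in the
upgrade flow may differ):

    - name: Point the python alternative at python3
      command: alternatives --set python /usr/bin/python3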
Change-Id: I556327228f23eb5e744b580618fb581a9fd2ce41
(cherry picked from commit 735faf0478)