51 Commits

Author SHA1 Message Date
Takashi Kajinami
e99a251ad4 Use consistent indent in .sh files
This change fixes the inconsistent indentation in the bash script files.
Details are described below.

* Use 4 spaces instead of tabs. Currently tabs are used in some places.

* Ensure the corresponding items (like if/else/fi) are placed at
  the same indentation level.

Change-Id: Iccf01cd325e171fba8e399d22ee9e0a00f3e781b
2022-03-09 08:38:02 +09:00
Bogdan Dobrelya
dbf5d36fdf Add timestamps to nova/placement wait for scripts
Related-bug: #1951577

Change-Id: I5ca99f53540d27b3e7824d22910ddc69cae3c9d0
Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
2021-11-22 11:02:03 +01:00
Zuul
c793e9174f Merge "Remove six library" 2021-10-12 00:55:27 +00:00
Brendan Shephard
b522254bc2 Remove six library
The six library was used to bridge the py2 to py3
gap. This library is no longer required on branches
that do not support Python 2.
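
For illustration, a typical py3-only cleanup of this kind (not the exact diff in this change) looks like:

    # Before, with six keeping py2 and py3 happy:
    #     import six
    #     from six.moves.urllib.parse import urlparse
    # After, on py3-only branches the stdlib is imported directly:
    from urllib.parse import urlparse

    def endpoint_host(url):
        # No six.moves indirection needed anymore
        return urlparse(url).hostname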

Change-Id: I40cb90bc6bc058dcbf3659b97dbb489b53adb9d3
2021-10-06 07:01:42 +00:00
Zuul
1e1b6d125c Merge "CentOS 9: support restart of HA resources" 2021-09-08 12:13:18 +00:00
Damien Ciabrini
128c2bcc25 CentOS 9: support restart of HA resources
Pacemaker 2.1 changed the naming conventions around multi-state
resources and OCF resource names. Adapt our resource restart
scripts so that they parse the proper data from the CIB.

Change-Id: Ieade3444e44e305f507c057991e02048ab5f3b3a
Closes-Bug: #1942771
2021-09-06 14:26:55 +02:00
Damien Ciabrini
ad2a13ab47 Check whether an HA resource already exists explicitly
With ephemeral heat we lost the meaning of the 'stack_action' hiera key,
which we previously used to distinguish between a fresh deployment and a
pre-existing deployment (aka redeploy).
Since this hiera key is not available anymore, in ansible we added a
TRIPLEO_HA_WRAPPER_RESOURCE_EXISTS env variable which will be true
when the resource existed even before calling puppet.

This way we can restore the previous behaviour (which was relying
on the stack_action hiera key) of restarting an HA
bundle on the bootstrap node in case of a configuration change.

While we're at it, we make sure that the logging takes place via logger
so that these events are captured in the journal.
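
A minimal sketch of the wrapper logic described above (illustrative only; the real code lives in the HA wrapper shell script):

    import os
    import subprocess

    def maybe_restart(resource):
        # Only restart on a redeploy, i.e. when the resource already existed
        # before puppet ran; log through logger so it ends up in the journal.
        exists = os.environ.get('TRIPLEO_HA_WRAPPER_RESOURCE_EXISTS',
                                'false').lower() == 'true'
        if not exists:
            subprocess.run(['logger', '-t', 'pcmkrestart',
                            'Initial deployment, skipping the restart of ' + resource])
            return
        subprocess.run(['logger', '-t', 'pcmkrestart',
                        'Restarting ' + resource + ' globally'])
        subprocess.run(['pcs', 'resource', 'restart', resource])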

Tested as follows:
1) Initial deploy:
[root@controller-0 ~]# journalctl |grep pcmkres
Sep 01 10:23:35 controller-0.alejandro.ftw pcmkrestart[47636]: Initial deployment, skipping the restart of haproxy-bundle
Sep 01 10:24:25 controller-0.alejandro.ftw pcmkrestart[49735]: Initial deployment, skipping the restart of galera-bundle
Sep 01 10:25:15 controller-0.alejandro.ftw pcmkrestart[53052]: Initial deployment, skipping the restart of rabbitmq-bundle
Sep 01 10:37:35 controller-0.alejandro.ftw pcmkrestart[148651]: Initial deployment, skipping the restart of openstack-cinder-volume

2) Redeploy changing only the haproxy config via a hiera key change:
Sep 01 11:12:29 controller-0.alejandro.ftw pcmkrestart[438507]: Wed Sep Restarting haproxy-bundle globally. Stopping:
Sep 01 11:12:37 controller-0.alejandro.ftw pcmkrestart[439271]: Wed Sep Restarting haproxy-bundle globally. Starting:

Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/806610/

Closes-Bug: #1942309

Change-Id: I90ea2287b5ab32c8dc6bbf5f91927e7488326dcd
2021-09-01 13:23:23 +02:00
Michele Baldessari
61f67eff10 nova_libvirt_init_secret Give a proper error if ceph is not configured properly
Let's make the error a little clearer when ceph is not
configured properly.

Before:
2021-08-13T12:42:07.472193117+00:00 stdout F ------------------------------------------------
2021-08-13T12:42:07.472193117+00:00 stdout F Initializing virsh secrets for: ceph:openstack
2021-08-13T12:42:07.481397478+00:00 stdout F --------
2021-08-13T12:42:07.481397478+00:00 stdout F Initializing the virsh secret for 'ceph' cluster () 'openstack' client
2021-08-13T12:42:07.484466828+00:00 stdout F Creating /etc/nova/ceph-secret.xml
2021-08-13T12:42:07.493435343+00:00 stderr F Usage: grep [OPTION]... PATTERN [FILE]...
2021-08-13T12:42:07.493435343+00:00 stderr F Try 'grep --help' for more information.
2021-08-13T12:42:07.591038798+00:00 stdout F Secret 5e23cf03-81b0-4e02-b678-7c5363fbf0e2 created
2021-08-13T12:42:07.591038798+00:00 stdout F
2021-08-13T12:42:07.671036635+00:00 stderr F error: failed to get secret '--base64'
2021-08-13T12:42:07.671036635+00:00 stderr F error: uuidstr in virSecretLookupByUUIDString must be a valid UUID
2021-08-13T12:42:07.674021136+00:00 stdout F

After:
2021-08-14T13:10:20.866443451+00:00 stdout F Initializing virsh secrets for: ceph:openstack
2021-08-14T13:10:20.880988730+00:00 stdout F Error: /etc/ceph/ceph.conf contained an empty fsid definition
2021-08-14T13:10:20.880988730+00:00 stdout F Check your ceph configuration
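
A sketch of the kind of guard added here (the actual check is implemented in the shell script; the parsing details below are illustrative):

    import configparser
    import sys

    def check_ceph_fsid(conf='/etc/ceph/ceph.conf'):
        # Fail early with a clear message instead of letting grep/virsh
        # produce the confusing errors shown above.
        cfg = configparser.ConfigParser(strict=False)
        cfg.read(conf)
        fsid = cfg.get('global', 'fsid', fallback='').strip()
        if not fsid:
            print('Error: %s contained an empty fsid definition' % conf)
            print('Check your ceph configuration')
            sys.exit(1)
        return fsid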

Change-Id: I781db8142015d713d9e99114aed42667418bf23b
2021-08-14 15:17:47 +02:00
Damien Ciabrini
1662600e6e HA minor update: fix bad pcs invocation
When an HA resource is in a failed state, the minor update
should normally try to restart it, but the associated
pcs invocation is currently invalid, so the resource never
gets a chance to be restarted.

Use the right pcs call to fix this minor update use case.

Change-Id: Iaf85807d067898bbab6d76ab40bc070e845a8b38
Closes-Bug: #1931500
2021-06-09 23:37:14 +02:00
Alan Bishop
e2936d7604 Add cinder RBD support for multiple ceph clusters
The CinderRbdMultiConfig parameter provides a mechanism for
configuring cinder RBD backends associated with external ceph
clusters defined by CephExternalMultiConfig.

A new nova_libvirt_init_secret.sh script handles the creation of
the libvirt secret that is required for nova to connect to volumes
on the cinder RBD backends.
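
Roughly, for each configured ceph cluster/client pair the script has to register a libvirt secret and attach the cephx key to it; a hedged sketch of those two steps (names are illustrative):

    import subprocess

    def define_libvirt_secret(secret_xml, secret_uuid, base64_key):
        # Register the secret definition with libvirt, then set its value to
        # the base64-encoded cephx key.
        subprocess.run(['virsh', 'secret-define', '--file', secret_xml],
                       check=True)
        subprocess.run(['virsh', 'secret-set-value', '--secret', secret_uuid,
                        '--base64', base64_key], check=True)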

Depends-On: I040e25341c9869ad289d7e7c98e831caef23fece
Change-Id: I73af5b868de629870a35d38f8436e7025aae791e
2021-04-14 12:44:45 -07:00
Damien Ciabrini
712cfcc71b Upgrade mariadb storage during upgrade tasks
When a tripleo major upgrade or FFU causes an update of mariadb
to a new major version (e.g. 10.1 -> 10.3), some internal DB
tables (myisam tables) must be upgraded, and sometimes the
existing user tables may be migrated to new mariadb defaults.

Move the db-specific upgrade steps into a dedicated script and
make sure that it is called at the right time while upgrading
the undercloud and/or the overcloud.
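
A simplified sketch of the kind of db-specific step involved (the dedicated script does more than this, and this assumes credentials come from the usual defaults file):

    import subprocess

    def upgrade_mariadb_tables():
        # After a major mariadb version jump, mysql_upgrade migrates the
        # internal (MyISAM) tables and, when needed, existing user tables.
        subprocess.run(['mysql_upgrade'], check=True)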

Closes-Bug: #1913438

Change-Id: I92353622994b28c895d95bdcbe348a73b6c6bb99
2021-02-16 09:08:40 +01:00
Damien Ciabrini
cb55cc8ce5 Serialize shutdown of pacemaker nodes
When running a minor update in a composable HA, different
roles could run ansible tasks concurrently. However,
there is currently a race when pacemaker nodes are
stopped in parallel [1,2], which could cause nodes to
incorrectly stop themselves once they reconnect to the
cluster.

To prevent concurrent shutdown, use a cluster-wide lock
to signal that one node is about to shut down, and block
the others until the node disconnects from the cluster.

Tested the minor update in a composable HA environment:
  . when run with "openstack update run", every role
    is updated sequentially, and the shutdown lock
    doesn't interfere.
  . when running multiple ansible tasks in parallel
    "openstack update run --limit role<X>", pacemaker
    nodes are correctly stopped sequentially thanks
    to the shutdown lock.
  . when updating an existing overcloud, the new
    locking script used in the review is correctly
    injected on the overcloud, thanks to [3].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1791841
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1872404
[3] I2ac6bb98e1d4183327e888240fc8d5a70e0d6fcb

Closes-Bug: #1904193
Change-Id: I0e041c6a95a7f53019967f9263df2326b1408c6f
2020-12-24 14:06:32 +01:00
Damien Ciabrini
c8f5fdfc36 HA: reimplement resource locks with cibadmin
A resource lock is used as a synchronization point between
pacemaker cluster nodes. It is currently implemented
by adding an attribute in an offline copy of the CIB, and merging
the update into the CIB only if no concurrent update has
occurred in the meantime.

The problem with that approach is that - even if the concurrency
is enforced by pacemaker - the offline CIB contains a snapshot
of the cluster state, so pushing back the entire offline CIB
pushes old resource state back into the cluster. This puts
additional burden on the cluster and sometimes caused unexpected
cluster state transitions.

Reimplement the locking strategy with cibadmin; it's a much faster
approach that provides the same concurrency guarantees, and it only
changes one attribute rather than the entire CIB, so it doesn't
cause unexpected cluster state transitions.
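
As a rough illustration of the one-attribute approach only (not the actual script; the exact CIB section and XML differ, and it assumes cibadmin --create fails when the id already exists):

    import subprocess
    import time

    def try_acquire_lock(lock_name, holder, ttl):
        # Store the lock as a single property set in the CIB; a failed
        # --create means another node already holds the lock.
        expiry = int(time.time()) + ttl
        xml = ('<cluster_property_set id="%s">'
               '<nvpair id="%s-holder" name="%s" value="%s:%d"/>'
               '</cluster_property_set>' % (lock_name, lock_name, lock_name,
                                            holder, expiry))
        rc = subprocess.run(['cibadmin', '--create', '-o', 'crm_config',
                             '--xml-text', xml]).returncode
        return rc == 0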

Closes-Bug: #1905585
Change-Id: Id10f026c8b31cad7b7313ac9427a99b3e6744788
2020-12-09 12:37:31 +00:00
Martin Schuppert
70818dc684 fix nova_statedir_ownership
With the change in Ic6f053d56194613046ae0a4a908206ebb453fcf4 the call
that triggered run() was removed; as a result the script does not
actually run.
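
Illustration only (the actual script is nova_statedir_ownership.py): the fix is essentially restoring the trigger, e.g.

    # The module defined run() but nothing invoked it anymore once the script
    # was executed, so the missing piece is the call itself:
    def run():
        print('setting ownership under /var/lib/nova')  # placeholder body

    if __name__ == '__main__':
        run()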

Change-Id: I5050f198f0109faa9299de85e01b0dbe4e5a30ab
Closes-Bug: #1903033
2020-11-20 16:05:21 +01:00
Oliver Walsh
c156534010 Skip Trilio dirs when setting ownership in /var/lib/nova
Trilio currently mounts an NFS export in /var/lib/nova to make it accessible
from within the nova_compute and nova_libvirt containers.
This can result in considerable delays when walking the directory tree to
ensure the ownership is correct.

This patch adds the ability to skip paths when recursively setting the
ownership and selinux context in /var/lib/nova. The list of paths to skip
can be set via the NovaStatedirOwnershipSkip heat parameter. This defaults to
the Trilio dir.
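
A hedged sketch of the skip logic (the default path shown is illustrative):

    import os

    NOVA_STATEDIR_OWNERSHIP_SKIP = ['/var/lib/nova/triliovault-mounts']

    def chown_tree(root, uid, gid, skip=NOVA_STATEDIR_OWNERSHIP_SKIP):
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune skipped subtrees so we never descend into them at all.
            dirnames[:] = [d for d in dirnames
                           if os.path.join(dirpath, d) not in skip]
            os.chown(dirpath, uid, gid)
            for name in filenames:
                os.chown(os.path.join(dirpath, name), uid, gid)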

Change-Id: Ic6f053d56194613046ae0a4a908206ebb453fcf4
2020-10-23 16:55:13 +00:00
Martin Magr
f84655ed55 Return details in output of container health check
This patch reformats the check-container-health script for sensubility to output
JSON-formatted data instead of semicolon-separated data. It also removes the
calculation of the duration of each container health check to keep the runtime shorter.
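
A minimal sketch of the new output shape (container names and the healthcheck invocation are illustrative):

    import json
    import subprocess

    def check_container_health(containers):
        # Emit one JSON document instead of semicolon-separated fields, and
        # skip the per-container duration measurement.
        results = []
        for name in containers:
            rc = subprocess.run(['podman', 'healthcheck', 'run', name],
                                capture_output=True).returncode
            results.append({'container': name, 'healthy': int(rc == 0)})
        print(json.dumps(results))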

Change-Id: I18bcde4b6031c79deae3f6c9ee6f2c4bb754be88
2020-10-14 11:57:56 +02:00
Zuul
bc199530de Merge "Fix typos" 2020-09-23 19:37:27 +00:00
Zuul
be04d1536a Merge "Adapt container health check for built-in podman health checks" 2020-09-23 03:56:56 +00:00
Rajesh Tailor
a672bedfc2 Fix typos
Change-Id: Ia9b0410d1ade1abc2d29d3634379b9128016d0e9
2020-09-16 15:45:12 +05:30
Martin Magr
1952a9ce64 Adapt container health check for built-in podman health checks
This patch removes a regression which was introduced by moving from the systemd
health check framework to the built-in podman health check support.

Change-Id: I1706e04b543e8c9ff3903a9575b7c2cd74b9a0b3
2020-09-15 16:52:56 +02:00
Michele Baldessari
87b365afd3 Fix Flakes and lower-constraints errors
With the switch to Ubuntu Focal for tox jobs via https://review.opendev.org/#/c/738322/
our 1.1.0 version of hacking pulls in old modules that are not compatible
with python3.8:
https://github.com/openstack/hacking/blob/1.1.0/requirements.txt#L6

Let's upgrade hacking to >= 3.0.1 and < 3.1.0 so that it supports python3.8
correctly. The newer hacking also triggered new errors which are
fixed in this review as well:
./tools/render-ansible-tasks.py:113:25: F841 local variable 'e' is assigned to but never used
./tools/yaml-validate.py:541:19: F999 '...'.format(...) has unused arguments at position(s): 2
./tools/render-ansible-tasks.py:126:1: E305 expected 2 blank lines after class or function definition, found 1
./tools/yaml-validate.py:33:1: E305 expected 2 blank lines after class or function definition, found 1
./container_config_scripts/tests/test_nova_statedir_ownership.py:35:1: E305 expected 2 blank lines after class or function definition, found 0

We also make sure to exclude .tox and __pycache__ from flake8.

We also need to change the lower-constraint requirements to make them
py3.8 compatible. See https://bugs.launchpad.net/nova/+bug/1886298
cffi==1.14.0
greenlet==0.4.15
MarkupSafe==1.1.0
paramiko==2.7.1

Suggested-By: Yatin Karel <ykarel@redhat.com>

Change-Id: Ic280ce9a51f26d165d4e93ba0dc0c47cdf8d7961
Closes-Bug: #1895093
2020-09-10 11:10:54 +02:00
Michele Baldessari
dcfc98d236 Fix pcs restart in composable HA
When a redeploy command is being run in a composable HA environment, if there
are any configuration changes, the <bundle>_restart containers will be kicked
off. These restart containers will then try and restart the bundles globally in
the cluster.

These restarts will be fired off in parallel from different nodes. So
haproxy-bundle will be restarted from controller-0, mysql-bundle from
database-0, rabbitmq-bundle from messaging-0.

This has proven to be problematic and very often (rhbz#1868113) it would fail
the redeploy with:
2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
2020-08-11T13:40:25.996896822+00:00 stderr F

or

2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F

After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
"""
Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1
"""

So the suggestion is to replace the restarts with an enable/disable cycle of the resource.
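
In other words, something along these lines instead of a plain restart (sketch only; the real change is in pacemaker_restart_bundle.sh):

    import subprocess

    def restart_bundle(resource):
        # Enable/disable cycle instead of "pcs resource restart"; --wait makes
        # pcs block until the state change has completed.
        subprocess.run(['pcs', 'resource', 'disable', resource, '--wait'],
                       check=True)
        subprocess.run(['pcs', 'resource', 'enable', resource, '--wait'],
                       check=True)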

Tested this on a dozen runs on a composable HA environment and did not observe the error
any longer.

Closes-Bug: #1892206

Change-Id: I9cc27b1539a62a88fb0bccac64e6b1ae9295f22e
2020-08-19 16:21:15 +02:00
Zuul
d13d010693 Merge "Rolling certificate update for HA services" 2020-08-12 21:22:53 +00:00
Damien Ciabrini
ba471ee461 Fix HA resource restart when no replicas are running
When the helper script pacemaker_restart_bundle.sh is called
during a stack update, it restarts the pacemaker resource via
a "pcs resource restart <name>".

When all the replicas are stopped due to a previous error,
pcs won't restart them because there is nothing to stop. In
that case, one must use "pcs resource cleanup <name>".
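
Conceptually the fix boils down to choosing the right pcs verb for the current state (simplified sketch, helper name is illustrative):

    import subprocess

    def restart_or_cleanup(resource, replicas_running):
        # "pcs resource restart" has nothing to stop when all replicas are
        # down; "pcs resource cleanup" clears the failed state so pacemaker
        # starts the replicas again.
        if replicas_running:
            subprocess.run(['pcs', 'resource', 'restart', resource], check=True)
        else:
            subprocess.run(['pcs', 'resource', 'cleanup', resource], check=True)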

Change-Id: I1790444d289d057e9a3f612c53efe485080978b5
Closes-Bug: #1889395
2020-08-03 21:00:44 +02:00
Damien Ciabrini
0f54889408 Rolling certificate update for HA services
There are certain HA clustered services (e.g. galera) that don't
have the ability natively to reload their TLS certificate without
being restarted. If too many replicas are restarted concurrently
this might result in full service disruption.

To ensure service availability, provide a means to ensure that
only one service replica is restarted at a time in the cluster.
This works by using pacemaker's CIB to implement a cluster-wide
restart lock for a service. The lock has a TTL so it's guaranteed
to be eventually released without requiring complex contingency
cleanup in case of failures.

Tested locally by running the following:
1. force recreate certificate on all nodes at once for galera
   (ipa-cert resubmit -i mysql), and verify that the resources
   restart one after the other

2. create a lock manually in pacemaker, recreate certificate for
   galera on all nodes, and verify that no resource is restarted
   before the manually created lock expires.

3. create a lock manually, let it expire, recreate a certificate,
   and verify that the resource is restarted appropriately and the
   lock gets cleaned up from pacemaker once the restart finishes.

Closes-Bug: #1885113
Change-Id: Ib2b62e33b34cf72edfdae6299cf432259bf960a2
2020-07-30 16:51:48 +02:00
Zuul
e59009a7e1 Merge "Avoid failing on deleted file" 2020-07-24 20:01:24 +00:00
Zuul
45c959a5ea Merge "Ensure redis_tls_proxy starts after all redis instances" 2020-07-23 04:31:17 +00:00
David Hill
6c3c8b41de Avoid failing on deleted file
Avoid failing on a deleted file, as sometimes a file might get
deleted while the script runs. Log the exception instead for
troubleshooting purposes.
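
A small sketch of the tolerant behaviour (function name is illustrative):

    import logging
    import os

    LOG = logging.getLogger(__name__)

    def safe_chown(path, uid, gid):
        # Files can disappear while the tree is being walked; log and carry
        # on instead of aborting the whole script.
        try:
            os.chown(path, uid, gid)
        except OSError as e:
            LOG.error('Could not change ownership of %s: %s', path, e)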

Change-Id: I733cec2b34ef0bd0780ba5b0520127b911505e1b
2020-07-08 13:23:52 +01:00
Damien Ciabrini
b91a1a09cb Ensure redis_tls_proxy starts after all redis instances
When converting an HA control plane to TLS-e, 1) the bootstrap node
tells pacemaker to restart all redis instances to take into
account the new TLS-e config; 2) a new container redis_tls_proxy
is started on every controller to encapsulate redis traffic in TLS
tunnels. This happens during step 2.

Redis servers have to be restarted everywhere for redis_tls_proxy
to be able to start tunnels properly. Since we can't guarantee that
across several nodes during the same step, tweak the startup of
redis_tls_proxy instead; make sure to only create the tunnels once
the targeted host:port can be bound (i.e. redis was restarted).
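
A sketch of the kind of startup check this implies (the real tweak lives in the redis_tls_proxy start logic):

    import socket
    import time

    def wait_until_bindable(host, port, timeout=600):
        # The tunnel endpoint can only be bound once the old redis listener
        # is gone, i.e. redis has been restarted with the new config.
        deadline = time.time() + timeout
        while time.time() < deadline:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                s.bind((host, port))
                return True
            except OSError:
                time.sleep(1)
            finally:
                s.close()
        return False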

Change-Id: I70560f80775dacddd82262e8079c13f86b0eb0e6
Closes-Bug: #1883096
2020-07-07 05:36:43 +00:00
Hervé Beraud
be280e39c2 Stop using the __future__ module.
The __future__ module [1] was used in this context to ensure compatibility
between python 2 and python 3.

We previously dropped support for python 2.7 [2] and now we only support
python 3, so we no longer need this module or the imports
listed below.

Imports commonly used and their related PEPs:
- `division` is related to PEP 238 [3]
- `print_function` is related to PEP 3105 [4]
- `unicode_literals` is related to PEP 3112 [5]
- `with_statement` is related to PEP 343 [6]
- `absolute_import` is related to PEP 328 [7]

[1] https://docs.python.org/3/library/__future__.html
[2] https://governance.openstack.org/tc/goals/selected/ussuri/drop-py27.html
[3] https://www.python.org/dev/peps/pep-0238
[4] https://www.python.org/dev/peps/pep-3105
[5] https://www.python.org/dev/peps/pep-3112
[6] https://www.python.org/dev/peps/pep-0343
[7] https://www.python.org/dev/peps/pep-0328

Change-Id: I2cf7495c5cb42c632993bb2372ffb626ab97bf0d
2020-07-02 15:27:27 +00:00
Hervé Beraud
11f84b6302 Use unittest.mock instead of mock
The mock third party library was needed for mock support in py2
runtimes. Since we now only support py36 and later, we can use the
standard lib unittest.mock module instead.
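
The change is purely an import swap, e.g.:

    # Before: import mock            (third-party package)
    # After:  standard library only
    from unittest import mock

    fake = mock.MagicMock(return_value=42)
    assert fake() == 42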

Change-Id: Iabd3e90a46fd087c8e780796e04fcc050c5277ab
2020-06-09 18:41:21 +02:00
Michele Baldessari
4d8eb35114 Drop bootstrap_host_exec from pacemaker_restart_bundle
bootstrap_host_exec does not exist on the host, so we cannot assume its
existence. Since, barring argument checking, it is three lines of shell
script [1], we just inline its logic directly. We also add an extra echo to make it
simpler to debug any bootstrap vs non-bootstrap issues.

[1] https://github.com/openstack/tripleo-common/blob/master/scripts/bootstrap_host_exec
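
Roughly, the inlined check amounts to the following (simplified sketch, hiera lookup omitted):

    import socket
    import subprocess
    import sys

    def run_on_bootstrap_only(bootstrap_node, cmd):
        # Only run cmd on the service's bootstrap node, and say which branch
        # was taken so bootstrap vs non-bootstrap issues are easy to spot.
        if socket.gethostname().split('.')[0].lower() == bootstrap_node.lower():
            print('Node is the bootstrap node, running: ' + ' '.join(cmd))
            sys.exit(subprocess.run(cmd).returncode)
        print('Node is not the bootstrap node, skipping')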

Change-Id: Ia850286682f09cd75651591a1158c2e467343c1d
Related-Bug: #1863442
2020-04-20 17:28:06 +02:00
Oliver Walsh
45dd4e18a5 Tolerate NFS exports in /var/lib/nova when selinux relabelling
When the :z bind mount option is used, podman performs a recursive relabel of
the mount point, which fails with "Operation not supported" if there are
any NFS exports mounted within. While it's possible for NFS to support true
selinux labelling, in practice it rarely does.

As we are already walking the tree to set ownership/permission, take ownership
of the relabelling logic too and skip relabelling on subtrees where we hit this
error.
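
A hedged sketch of the relabelling step with the skip behaviour (the label value and xattr usage are illustrative):

    import errno
    import os

    def safe_relabel(path, context=b'system_u:object_r:container_file_t:s0'):
        # Apply the selinux label ourselves while walking the tree; on
        # filesystems that do not support labelling (typically NFS) the
        # kernel returns EOPNOTSUPP, so skip that subtree instead of failing.
        try:
            os.setxattr(path, 'security.selinux', context)
        except OSError as e:
            if e.errno == errno.EOPNOTSUPP:
                return False
            raise
        return True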

Change-Id: Id5503ed274bd5dc0c5365cc994de7e5cdcbc2fb6
Closes-bug: #1869020
2020-03-26 11:22:38 +00:00
Damien Ciabrini
4d21bab8f2 HA: check before restarting resource on stack update
When container <service>_restart_bundle is run, it checks whether
it can call pcs to restart the associated pacemaker resource, when
applicable.
Make sure we enforce the checks in all cases (when we run during
a stack update / update converge, and during a minor update).

Change-Id: I0367a657ddf440f0b73c4de5346306f12439db15
Closes-Bug: #1868533
2020-03-23 10:48:08 +01:00
Damien Ciabrini
3230f005c1 HA: reorder init_bundle and restart_bundle for improved updates
A pacemaker bundle can be restarted either because:
  . a tripleo config has been updated (from /var/lib/config-data)
  . the bundle config has been updated (container image, bundle
    parameter,...)

In HA services, special container "*_restart_bundle" is in charge
of restarting the HA service on tripleo config change. Special
container "*_init_bundle" handles restart on bundle config change.

When both types of change occur at the same time, the bundle must
be restarted first, so that the container has a chance to be
recreated with all bind-mounts updated before it tries to reload
the updated config.

Implement the improvement with two changes:

1. Make the "*_restart_bundle" start after the "*_init_bundle", and
make sure "*_restart_bundle" is only enabled after the initial
deployment.

2. During minor update, make sure that the "*_restart_bundle" not
only restarts the container, but also waits until the service
is operational (e.g. galera fully promoted to Master). This forces
the rolling restart to happen sequentially, and avoids service
disruption in quorum-based clustered services like galera and
rabbitmq.

Tested the following update use cases:

* minor update: ensure that *_restart_bundle restarts all types of
  resources (OCF, bundles, A/P, A/P Master/Slave).

* minor update: ensure *_restart_bundle is not executed when no
  config or image update happened for a service.

* restart_bundle: when resource (OCF or container) fails to
  restart, bail out early instead of waiting for nothing until
  timeout is reached.

* restart_bundle: make sure a resource is restarted even when it
  is in a failed state when *_restart_bundle is called.

* restart_bundle: A/P can be restarted on any node, so watch
  restart globally. When the resource restarts as Slave, continue
  watching for a Master elsewhere in the cluster.

* restart_bundle: if an A/P is not running locally, make sure it
  doesn't get restarted anywhere else in the cluster.

* restart_bundle: do not try to restart stopped (disabled) or
  unmanaged resource. Bail out early instead, to not wait until
  timeout is reached.

* stack update: make sure that running a stack update with no
  change does not trigger any *_restart_bundle, and does not
  restart any HA container either.

* stack update: when bundle and config will change, ensure bundle
  is updated before HA containers are restarted (e.g. HAProxy
  migration to TLS everywhere).

Change-Id: Ic41d4597e9033f9d7847bb6c10c25f443fbd5b0e
Closes-Bug: #1839858
2020-01-23 16:09:36 +01:00
Sandeep Yadav
08ca0a97d4 Change optparse to argparse
The optparse module has been deprecated since python version 2.7 [1].
This change switches the remaining modules that were using optparse to the
newer argparse usage.

[1] https://docs.python.org/2/library/optparse.html
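
The equivalent argparse usage is straightforward, e.g. (options shown are illustrative):

    import argparse

    parser = argparse.ArgumentParser(description='example tool')
    parser.add_argument('-c', '--config', default='/etc/example.conf',
                        help='path to the configuration file')
    args = parser.parse_args()
    print(args.config)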

Change-Id: Iea9ef9dd4ac224a1f9fa5eaca0aa0959c802bcdd
2020-01-21 04:17:09 +00:00
Zuul
469b977e23 Merge "HA: ensure TRIPLEO_MINOR_UPDATE is defined for <svc>_restart_bundle" 2019-10-25 04:22:55 +00:00
Damien Ciabrini
81610bdc36 HA: ensure TRIPLEO_MINOR_UPDATE is defined for <svc>_restart_bundle
Containers <svc>_restart_bundle use the script pacemaker_restart_bundle.sh,
which behaves according to the value of the environment variable
TRIPLEO_MINOR_UPDATE. Set a default value in case this variable is
unset (i.e. during a stack update).
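
The defaulting itself is trivial, e.g. (sketch; the real default is set in the shell script):

    import os

    # Treat an unset TRIPLEO_MINOR_UPDATE (stack update case) as "false".
    minor_update = os.environ.get('TRIPLEO_MINOR_UPDATE', 'false').lower() == 'true'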

Change-Id: I59da2d3c50fa30a8f3e557a16367f889b103a6f8
Closes-Bug: #1849503
2019-10-23 16:56:38 +02:00
Martin Schuppert
d80d948fe7 Fix placement_wait_for_service
This fixes the indent and volumes of the placement_wait_for_service
and the corresponding placement_wait_for_service.py to use the
config of the extracted placement service.

It also
* changes to set placement::keystone::authtoken::auth_url
  instead of placement::keystone::authtoken::auth_uri, as auth_uri is
  deprecated and not supported by placement::keystone::authtoken.
* sets placement::keystone::authtoken::region_name

Related-Bug: 1842948

Change-Id: Ic24cf646efdd70ba1dbca42d3408847fe09a6e49
2019-10-17 16:08:36 +02:00
Zuul
cb5a99b905 Merge "Ensure nova-api is running before starting nova-compute containers" 2019-10-11 18:43:10 +00:00
Zuul
291f6472c2 Merge "Add multi region support in nova_wait_for_compute_service.py" 2019-10-04 03:43:01 +00:00
Oliver Walsh
8a87cbcc34 Ensure nova-api is running before starting nova-compute containers
If nova-api is delayed starting then the nova_wait_for_compute_service
can time out. A deployment using a slow/busy remote container repository is
particularly susceptible to this issue. To resolve this nova_compute and
nova_wait_for_compute_service have been postponed to step_5 and a task
has been added to step_4 to ensure nova_api is active before proceeding.

Change-Id: I6fcbc5cb5d4f3cbb618d9661d2a36c868e18b3d6
Closes-bug: #1842948
2019-10-01 11:11:44 +01:00
Takashi Kajinami
f47dfe1059 Enforce pep8/pyflakes rule on python codes
This change makes sure that we apply pep8/pyflakes checks on all python
code to improve its readability.

Note that there are some rules applied in other OpenStack projects,
but not yet turned on here, which should be enabled in the future.

Change-Id: Iaf0299983d3a3fe48e3beb8f47bd33c21deb4972
2019-09-05 15:40:46 +09:00
Damien Ciabrini
7f785e8757 HA: fix <service>_restart_bundle with minor update workflow
For each HA service we have a paunch container <service>_restart_bundle
which is started by paunch whenever config files change during stack
deploy/update. This container runs a pcs command on a single node to
restart all the service's containers (e.g. all galera on all controllers).
By design, when it is run, configs have already been regenerated by the
deploy tasks on all nodes.

For minor updates, the workflow runs differently: all the steps of the
deploy tasks are run one node after the other, so when
<service>_restart_bundle is called, there is no guarantee that the
service's configs have been regenerated on all the nodes yet.

To fix the wrong restart behaviour, only restart local containers when
running during a minor update, and run once per node. When the minor
update workflow calls <service>_restart_container, we still have the
guarantee that the config files are already regenerated locally.
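
A rough sketch of the split behaviour (container names and commands are illustrative, not the actual script):

    import os
    import subprocess

    def restart_ha_service(resource, local_containers):
        # Minor update: only bounce the local replicas, one node at a time.
        # Stack deploy/update: ask pacemaker for a cluster-wide restart from
        # a single node, since configs are already regenerated everywhere.
        if os.environ.get('TRIPLEO_MINOR_UPDATE', 'false').lower() == 'true':
            for name in local_containers:
                subprocess.run(['podman', 'restart', name], check=True)
        else:
            subprocess.run(['pcs', 'resource', 'restart', resource], check=True)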

Co-Authored-By: Michele Baldessari <michele@acksyn.org>
Co-Authored-By: Luca Miccini <lmiccini@redhat.com>

Change-Id: I92d4ddf2feeac06ce14468ae928c283f3fd04f45
Closes-Bug: #1841629
2019-08-30 18:46:31 +02:00
Gauvain Pocentek
2ed2b72021 Add multi region support in nova_wait_for_compute_service.py
The region_name parameter needs to be defined in a multi-region setup.

Change-Id: Ifa1b3ffe63a5390d6f53cba9ae2b73c43a105e83
2019-08-19 13:47:37 +02:00
Martin Schuppert
f8779e5023 Move nova cell v2 discovery to deploy_steps_tasks
Recent changes for e.g. edge scenarios intentionally moved discovery
from the controller to the bootstrap compute node. The task is triggered by
the deploy-identifier to make sure it gets run on any deploy, scale, ... run.
If the deploy run is triggered with the --skip-deploy-identifier flag, discovery
will not be triggered at all, which as a result causes failures in previously
supported scenarios.
This change moves the host discovery task into the ansible
deploy_steps_tasks so that it gets triggered even if --skip-deploy-identifier
is used, or the compute bootstrap node is blacklisted.

Closes-Bug: #1831711

Change-Id: I4bd8489e4f79e3e1bfe9338ed3043241dd605dcb
2019-07-02 17:24:27 +02:00
Martin Schuppert
bbd2d94483 Allow multiple same options in nova.conf
In python3 SafeConfigParser was renamed to ConfigParser, and strict
checking of duplicate options defaults to true. In the case of nova it is valid to
have duplicate option lines, e.g. pci_alias can be specified more than
once in nova.conf, which results in an error like the one seen in
https://bugs.launchpad.net/tripleo/+bug/1827775

https://docs.python.org/3/library/configparser.html#configparser.ConfigParser
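
The usual fix is to construct the parser with strict checking disabled, e.g.:

    import configparser

    # strict=False accepts duplicate option lines (e.g. several pci_alias
    # entries) instead of raising DuplicateOptionError.
    parser = configparser.ConfigParser(strict=False)
    parser.read('/etc/nova/nova.conf')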

Closes-Bug: #1827775

Change-Id: I410af66d8dceb6dde84828c9bd1969aa623bf34c
2019-05-09 09:22:22 +02:00
Martin Schuppert
4d4263f4f1 Set debug level of nova container_config_scripts only when enabled
Right now all scripts log at DEBUG level. This change only enables
DEBUG level if debug is also enabled for the nova service.
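
A minimal sketch of the gating (how the nova debug flag is obtained is illustrative):

    import logging

    def configure_logging(nova_debug_enabled):
        # Helper scripts only log at DEBUG when debug is enabled for nova.
        level = logging.DEBUG if nova_debug_enabled else logging.INFO
        logging.basicConfig(level=level)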

Change-Id: Ie58a6630877a58bec8ce763ede166997bd41f882
2019-04-30 14:40:33 +02:00
Oliver Walsh
908e6b9810 Avoid concurrent nova cell_v2 discovery instances
The nova_cell_v2_discover_hosts.py script was moved to run on compute
nodes instead of controllers, to allow adding computes without
touching controllers and to support the case where multiple stacks are
used to manage compute nodes. However, the nova-manage command run by
nova_cell_v2_discover_hosts.py races when it gets triggered at the same
time on multiple compute nodes.

With this change if this is _not_ an additional cell:
* in docker_config step4, on every compute, we start the nova-compute
  container and then start a (detach=false) container to wait for
  its service to appear in the service list.
* in docker_config step5, on the bootstrap node only, we run the
  discovery.

Change-Id: I1a159a7c2ac286373df2b7c566426b37b7734961
Closes-bug: 1824445
Co-authored-by: mschuppert@redhat.com
2019-04-18 16:23:15 +02:00
Gauvain Pocentek
8948eced73 Test the correct placement endpoint with multiple regions
In a multi-region setup (not yet supported but there are plans to
support it) the nova_wait_for_placement_service.py script might check
the wrong placement endpoint.

This change makes the script explicitly look for the endpoint in the
correct region.

Change-Id: I83e44e0d0cb104dbb10b3699469e00e15b320409
Closes-Bug: #1819174
2019-03-08 15:45:10 +01:00