51 Commits

Author SHA1 Message Date
Takashi Kajinami
e99a251ad4 Use consistent indent in .sh files
This change fixes the inconsistent indentation in the bash script files.
Details are described below.

* Use 4 spaces instead of tabs. Currently tabs are used in some places.

* Ensure the corresponding items (like if/else/fi) are placed at
  the same indentation level.

Change-Id: Iccf01cd325e171fba8e399d22ee9e0a00f3e781b
2022-03-09 08:38:02 +09:00
Bogdan Dobrelya
dbf5d36fdf Add timestamps to nova/placement wait for scripts
Related-bug: #1951577

Change-Id: I5ca99f53540d27b3e7824d22910ddc69cae3c9d0
Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
2021-11-22 11:02:03 +01:00
Zuul
c793e9174f Merge "Remove six library" 2021-10-12 00:55:27 +00:00
Brendan Shephard
b522254bc2 Remove six library
The six library was used to bridge the py2 to py3
gap. This library is no longer required on branches
that do not support Python 2.
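
For illustration, a typical py3-only cleanup of this kind (not the exact diff in this change) looks like:

    # Before, with six keeping py2 and py3 happy:
    #     import six
    #     from six.moves.urllib.parse import urlparse
    # After, on py3-only branches the stdlib is imported directly:
    from urllib.parse import urlparse

    def endpoint_host(url):
        # No six.moves indirection needed anymore
        return urlparse(url).hostname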

Change-Id: I40cb90bc6bc058dcbf3659b97dbb489b53adb9d3
2021-10-06 07:01:42 +00:00
Zuul
1e1b6d125c Merge "CentOS 9: support restart of HA resources" 2021-09-08 12:13:18 +00:00
Damien Ciabrini
128c2bcc25 CentOS 9: support restart of HA resources
Pacemaker 2.1 changed the naming conventions around multi-state
resources and OCF resource names. Adapt our resource restart
scripts so that they parse the proper data from the CIB.

Change-Id: Ieade3444e44e305f507c057991e02048ab5f3b3a
Closes-Bug: #1942771
2021-09-06 14:26:55 +02:00
Damien Ciabrini
ad2a13ab47 Check whether an HA resource already exists explicitly
With ephemeral heat we lost the meaning of the 'stack_action' hiera key,
which we previously used to distinguish between a fresh deployment and a
pre-existing deployment (aka redeploy).
Since this hiera key is not available anymore, in ansible we added a
TRIPLEO_HA_WRAPPER_RESOURCE_EXISTS env variable which will be true
when the resource existed even before calling puppet.

This way we can restore the previous behaviour (which was relying
on the stack_action hiera key) of restarting an HA
bundle on the bootstrap node in case of a configuration change.

While we're at it, we make sure that the logging takes place via logger
so that these events are captured in the journal.
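
A minimal sketch of the wrapper logic described above (illustrative only; the real code lives in the HA wrapper shell script):

    import os
    import subprocess

    def maybe_restart(resource):
        # Only restart on a redeploy, i.e. when the resource already existed
        # before puppet ran; log through logger so it ends up in the journal.
        exists = os.environ.get('TRIPLEO_HA_WRAPPER_RESOURCE_EXISTS',
                                'false').lower() == 'true'
        if not exists:
            subprocess.run(['logger', '-t', 'pcmkrestart',
                            'Initial deployment, skipping the restart of ' + resource])
            return
        subprocess.run(['logger', '-t', 'pcmkrestart',
                        'Restarting ' + resource + ' globally'])
        subprocess.run(['pcs', 'resource', 'restart', resource])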

Tested as follows:
1) Initial deploy:
[root@controller-0 ~]# journalctl |grep pcmkres
Sep 01 10:23:35 controller-0.alejandro.ftw pcmkrestart[47636]: Initial deployment, skipping the restart of haproxy-bundle
Sep 01 10:24:25 controller-0.alejandro.ftw pcmkrestart[49735]: Initial deployment, skipping the restart of galera-bundle
Sep 01 10:25:15 controller-0.alejandro.ftw pcmkrestart[53052]: Initial deployment, skipping the restart of rabbitmq-bundle
Sep 01 10:37:35 controller-0.alejandro.ftw pcmkrestart[148651]: Initial deployment, skipping the restart of openstack-cinder-volume

2) Redeploy changing only the haproxy config via a hiera key change:
Sep 01 11:12:29 controller-0.alejandro.ftw pcmkrestart[438507]: Wed Sep Restarting haproxy-bundle globally. Stopping:
Sep 01 11:12:37 controller-0.alejandro.ftw pcmkrestart[439271]: Wed Sep Restarting haproxy-bundle globally. Starting:

Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/806610/

Closes-Bug: #1942309

Change-Id: I90ea2287b5ab32c8dc6bbf5f91927e7488326dcd
2021-09-01 13:23:23 +02:00
Michele Baldessari
61f67eff10 nova_libvirt_init_secret Give a proper error if ceph is not configured properly
Let's make the error a little clearer when ceph is not
configured properly.

Before:
2021-08-13T12:42:07.472193117+00:00 stdout F ------------------------------------------------
2021-08-13T12:42:07.472193117+00:00 stdout F Initializing virsh secrets for: ceph:openstack
2021-08-13T12:42:07.481397478+00:00 stdout F --------
2021-08-13T12:42:07.481397478+00:00 stdout F Initializing the virsh secret for 'ceph' cluster () 'openstack' client
2021-08-13T12:42:07.484466828+00:00 stdout F Creating /etc/nova/ceph-secret.xml
2021-08-13T12:42:07.493435343+00:00 stderr F Usage: grep [OPTION]... PATTERN [FILE]...
2021-08-13T12:42:07.493435343+00:00 stderr F Try 'grep --help' for more information.
2021-08-13T12:42:07.591038798+00:00 stdout F Secret 5e23cf03-81b0-4e02-b678-7c5363fbf0e2 created
2021-08-13T12:42:07.591038798+00:00 stdout F
2021-08-13T12:42:07.671036635+00:00 stderr F error: failed to get secret '--base64'
2021-08-13T12:42:07.671036635+00:00 stderr F error: uuidstr in virSecretLookupByUUIDString must be a valid UUID
2021-08-13T12:42:07.674021136+00:00 stdout F

After:
2021-08-14T13:10:20.866443451+00:00 stdout F Initializing virsh secrets for: ceph:openstack
2021-08-14T13:10:20.880988730+00:00 stdout F Error: /etc/ceph/ceph.conf contained an empty fsid definition
2021-08-14T13:10:20.880988730+00:00 stdout F Check your ceph configuration
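
A sketch of the kind of guard added here (the actual check is implemented in the shell script; the parsing details below are illustrative):

    import configparser
    import sys

    def check_ceph_fsid(conf='/etc/ceph/ceph.conf'):
        # Fail early with a clear message instead of letting grep/virsh
        # produce the confusing errors shown above.
        cfg = configparser.ConfigParser(strict=False)
        cfg.read(conf)
        fsid = cfg.get('global', 'fsid', fallback='').strip()
        if not fsid:
            print('Error: %s contained an empty fsid definition' % conf)
            print('Check your ceph configuration')
            sys.exit(1)
        return fsid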

Change-Id: I781db8142015d713d9e99114aed42667418bf23b
2021-08-14 15:17:47 +02:00
Damien Ciabrini
1662600e6e HA minor update: fix bad pcs invocation
When an HA resource is in a failed state, the minor update
should normally try to restart it, but the associated
pcs invocation is currently invalid, so the resource never
gets a chance to be restarted.

Use the right pcs call to fix this minor update use case.

Change-Id: Iaf85807d067898bbab6d76ab40bc070e845a8b38
Closes-Bug: #1931500
2021-06-09 23:37:14 +02:00
Alan Bishop
e2936d7604 Add cinder RBD support for multiple ceph clusters
The CinderRbdMultiConfig parameter provides a mechanism for
configuring cinder RBD backends associated with external ceph
clusters defined by CephExternalMultiConfig.

A new nova_libvirt_init_secret.sh script handles the creation of
the libvirt secret that is required for nova to connect to volumes
on the cinder RBD backends.
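
Roughly, for each configured ceph cluster/client pair the script has to register a libvirt secret and attach the cephx key to it; a hedged sketch of those two steps (names are illustrative):

    import subprocess

    def define_libvirt_secret(secret_xml, secret_uuid, base64_key):
        # Register the secret definition with libvirt, then set its value to
        # the base64-encoded cephx key.
        subprocess.run(['virsh', 'secret-define', '--file', secret_xml],
                       check=True)
        subprocess.run(['virsh', 'secret-set-value', '--secret', secret_uuid,
                        '--base64', base64_key], check=True)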

Depends-On: I040e25341c9869ad289d7e7c98e831caef23fece
Change-Id: I73af5b868de629870a35d38f8436e7025aae791e
2021-04-14 12:44:45 -07:00
Damien Ciabrini
712cfcc71b Upgrade mariadb storage during upgrade tasks
When a tripleo major upgrade or FFU causes an update of mariadb
to a new major version (e.g. 10.1 -> 10.3), some internal DB
tables (myisam tables) must be upgraded, and sometimes the
existing user tables may be migrated to new mariadb defaults.

Move the db-specific upgrade steps into a dedicated script and
make sure that it is called at the right time while upgrading
the undercloud and/or the overcloud.
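
A simplified sketch of the kind of db-specific step involved (the dedicated script does more than this, and this assumes credentials come from the usual defaults file):

    import subprocess

    def upgrade_mariadb_tables():
        # After a major mariadb version jump, mysql_upgrade migrates the
        # internal (MyISAM) tables and, when needed, existing user tables.
        subprocess.run(['mysql_upgrade'], check=True)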

Closes-Bug: #1913438

Change-Id: I92353622994b28c895d95bdcbe348a73b6c6bb99
2021-02-16 09:08:40 +01:00
Damien Ciabrini
cb55cc8ce5 Serialize shutdown of pacemaker nodes
When running a minor update in a composable HA, different
roles could run ansible tasks concurrently. However,
there is currently a race when pacemaker nodes are
stopped in parallel [1,2], which could cause nodes to
incorrectly stop themselves once they reconnect to the
cluster.

To prevent concurrent shutdown, use a cluster-wide lock
to signal that one node is about to shut down, and block
the others until the node disconnects from the cluster.

Tested the minor update in a composable HA environment:
  . when run with "openstack update run", every role
    is updated sequentially, and the shutdown lock
    doesn't interfere.
  . when running multiple ansible tasks in parallel
    "openstack update run --limit role<X>", pacemaker
    nodes are correctly stopped sequentially thanks
    to the shutdown lock.
  . when updating an existing overcloud, the new
    locking script used in the review is correctly
    injected on the overcloud, thanks to [3].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1791841
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1872404
[3] I2ac6bb98e1d4183327e888240fc8d5a70e0d6fcb

Closes-Bug: #1904193
Change-Id: I0e041c6a95a7f53019967f9263df2326b1408c6f
2020-12-24 14:06:32 +01:00
Damien Ciabrini
c8f5fdfc36 HA: reimplement resource locks with cibadmin
A resource lock is used as a synchronization point between
pacemaker cluster nodes. It is currently implemented
by adding an attribute in an offline copy of the CIB, and merging
the update into the CIB only if no concurrent update has
occurred in the meantime.

The problem with that approach is that - even if the concurrency
is enforced by pacemaker - the offline CIB contains a snapshot
of the cluster state, so pushing back the entire offline CIB
pushes old resource state back into the cluster. This puts
additional burden on the cluster and sometimes caused unexpected
cluster state transitions.

Reimplement the locking strategy with cibadmin; it's a much faster
approach that provides the same concurrency guarantees, and it only
changes one attribute rather than the entire CIB, so it doesn't
cause unexpected cluster state transitions.
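
As a rough illustration of the one-attribute approach only (not the actual script; the exact CIB section and XML differ, and it assumes cibadmin --create fails when the id already exists):

    import subprocess
    import time

    def try_acquire_lock(lock_name, holder, ttl):
        # Store the lock as a single property set in the CIB; a failed
        # --create means another node already holds the lock.
        expiry = int(time.time()) + ttl
        xml = ('<cluster_property_set id="%s">'
               '<nvpair id="%s-holder" name="%s" value="%s:%d"/>'
               '</cluster_property_set>' % (lock_name, lock_name, lock_name,
                                            holder, expiry))
        rc = subprocess.run(['cibadmin', '--create', '-o', 'crm_config',
                             '--xml-text', xml]).returncode
        return rc == 0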

Closes-Bug: #1905585
Change-Id: Id10f026c8b31cad7b7313ac9427a99b3e6744788
2020-12-09 12:37:31 +00:00
Martin Schuppert
70818dc684 fix nova_statedir_ownership
With the change in Ic6f053d56194613046ae0a4a908206ebb453fcf4 the call
that triggered run() was removed; as a result the script does not
actually run.
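
Illustration only (the actual script is nova_statedir_ownership.py): the fix is essentially restoring the trigger, e.g.

    # The module defined run() but nothing invoked it anymore once the script
    # was executed, so the missing piece is the call itself:
    def run():
        print('setting ownership under /var/lib/nova')  # placeholder body

    if __name__ == '__main__':
        run()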

Change-Id: I5050f198f0109faa9299de85e01b0dbe4e5a30ab
Closes-Bug: #1903033
2020-11-20 16:05:21 +01:00
Oliver Walsh
c156534010 Skip Trilio dirs when setting ownership in /var/lib/nova
Trilio currently mounts an NFS export in /var/lib/nova to make it accessible
from within the nova_compute and nova_libvirt containers.
This can result in considerable delays when walking the directory tree to
ensure the ownership is correct.

This patch adds the ability to skip paths when recursively setting the
ownership and selinux context in /var/lib/nova. The list of paths to skip
can be set via the NovaStatedirOwnershipSkip heat parameter. This defaults to
the Trilio dir.
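
A hedged sketch of the skip logic (the default path shown is illustrative):

    import os

    NOVA_STATEDIR_OWNERSHIP_SKIP = ['/var/lib/nova/triliovault-mounts']

    def chown_tree(root, uid, gid, skip=NOVA_STATEDIR_OWNERSHIP_SKIP):
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune skipped subtrees so we never descend into them at all.
            dirnames[:] = [d for d in dirnames
                           if os.path.join(dirpath, d) not in skip]
            os.chown(dirpath, uid, gid)
            for name in filenames:
                os.chown(os.path.join(dirpath, name), uid, gid)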

Change-Id: Ic6f053d56194613046ae0a4a908206ebb453fcf4
2020-10-23 16:55:13 +00:00
Martin Magr
f84655ed55 Return details in output of container health check
This patch reformats the check-container-health script for sensubility to output
JSON-formatted data instead of semicolon-separated data. It also removes the
calculation of the duration of each container health check to keep the runtime shorter.
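
A minimal sketch of the new output shape (container names and the healthcheck invocation are illustrative):

    import json
    import subprocess

    def check_container_health(containers):
        # Emit one JSON document instead of semicolon-separated fields, and
        # skip the per-container duration measurement.
        results = []
        for name in containers:
            rc = subprocess.run(['podman', 'healthcheck', 'run', name],
                                capture_output=True).returncode
            results.append({'container': name, 'healthy': int(rc == 0)})
        print(json.dumps(results))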

Change-Id: I18bcde4b6031c79deae3f6c9ee6f2c4bb754be88
2020-10-14 11:57:56 +02:00
Zuul
bc199530de Merge "Fix typos" 2020-09-23 19:37:27 +00:00
Zuul
be04d1536a Merge "Adapt container health check for built-in podman health checks" 2020-09-23 03:56:56 +00:00
Rajesh Tailor
a672bedfc2 Fix typos
Change-Id: Ia9b0410d1ade1abc2d29d3634379b9128016d0e9
2020-09-16 15:45:12 +05:30
Martin Magr
1952a9ce64 Adapt container health check for built-in podman health checks
This patch removes a regression which was introduced by moving from the systemd
health check framework to the built-in podman health check support.

Change-Id: I1706e04b543e8c9ff3903a9575b7c2cd74b9a0b3
2020-09-15 16:52:56 +02:00
Michele Baldessari
87b365afd3 Fix Flakes and lower-constraints errors
With the switch to Ubuntu Focal for tox jobs via https://review.opendev.org/#/c/738322/
our 1.1.0 version of hacking pulls in old modules that are not compatible
with python3.8:
https://github.com/openstack/hacking/blob/1.1.0/requirements.txt#L6

Let's upgrade hacking to >= 3.0.1 and < 3.1.0 so that it supports python3.8
correctly. The newer hacking also triggered new errors which are
fixed in this review as well:
./tools/render-ansible-tasks.py:113:25: F841 local variable 'e' is assigned to but never used
./tools/yaml-validate.py:541:19: F999 '...'.format(...) has unused arguments at position(s): 2
./tools/render-ansible-tasks.py:126:1: E305 expected 2 blank lines after class or function definition, found 1
./tools/yaml-validate.py:33:1: E305 expected 2 blank lines after class or function definition, found 1
./container_config_scripts/tests/test_nova_statedir_ownership.py:35:1: E305 expected 2 blank lines after class or function definition, found 0

We also make sure to exclude .tox and __pycache__ from flake8.

We also need to change the lower-constraint requirements to make them
py3.8 compatible. See https://bugs.launchpad.net/nova/+bug/1886298
cffi==1.14.0
greenlet==0.4.15
MarkupSafe==1.1.0
paramiko==2.7.1

Suggested-By: Yatin Karel <ykarel@redhat.com>

Change-Id: Ic280ce9a51f26d165d4e93ba0dc0c47cdf8d7961
Closes-Bug: #1895093
2020-09-10 11:10:54 +02:00
Michele Baldessari
dcfc98d236 Fix pcs restart in composable HA
When a redeploy command is being run in a composable HA environment, if there
are any configuration changes, the <bundle>_restart containers will be kicked
off. These restart containers will then try and restart the bundles globally in
the cluster.

These restarts will be fired off in parallel from different nodes. So
haproxy-bundle will be restarted from controller-0, mysql-bundle from
database-0, rabbitmq-bundle from messaging-0.

This has proven to be problematic and very often (rhbz#1868113) it would fail
the redeploy with:
2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
2020-08-11T13:40:25.996896822+00:00 stderr F

or

2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F

After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
"""
Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1
"""

So the suggestion is to replace the restarts with an enable/disable cycle of the resource.
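
In other words, something along these lines instead of a plain restart (sketch only; the real change is in pacemaker_restart_bundle.sh):

    import subprocess

    def restart_bundle(resource):
        # Enable/disable cycle instead of "pcs resource restart"; --wait makes
        # pcs block until the state change has completed.
        subprocess.run(['pcs', 'resource', 'disable', resource, '--wait'],
                       check=True)
        subprocess.run(['pcs', 'resource', 'enable', resource, '--wait'],
                       check=True)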

Tested this on a dozen runs on a composable HA environment and did not observe the error
any longer.

Closes-Bug: #1892206

Change-Id: I9cc27b1539a62a88fb0bccac64e6b1ae9295f22e
2020-08-19 16:21:15 +02:00
Zuul
d13d010693 Merge "Rolling certificate update for HA services" 2020-08-12 21:22:53 +00:00
Damien Ciabrini
ba471ee461 Fix HA resource restart when no replicas are running
When the helper script pacemaker_restart_bundle.sh is called
during a stack update, it restarts the pacemaker resource via
a "pcs resource restart <name>".

When all the replicas are stopped due to a previous error,
pcs won't restart them because there is nothing to stop. In
that case, one must use "pcs resource cleanup <name>".
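
Conceptually the fix boils down to choosing the right pcs verb for the current state (simplified sketch, helper name is illustrative):

    import subprocess

    def restart_or_cleanup(resource, replicas_running):
        # "pcs resource restart" has nothing to stop when all replicas are
        # down; "pcs resource cleanup" clears the failed state so pacemaker
        # starts the replicas again.
        if replicas_running:
            subprocess.run(['pcs', 'resource', 'restart', resource], check=True)
        else:
            subprocess.run(['pcs', 'resource', 'cleanup', resource], check=True)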

Change-Id: I1790444d289d057e9a3f612c53efe485080978b5
Closes-Bug: #1889395
2020-08-03 21:00:44 +02:00
Damien Ciabrini
0f54889408 Rolling certificate update for HA services
There are certain HA clustered services (e.g. galera) that don't
have the ability natively to reload their TLS certificate without
being restarted. If too many replicas are restarted concurrently
this might result in full service disruption.

To ensure service availability, provide a means to ensure that
only one service replica is restarted at a time in the cluster.
This works by using pacemaker's CIB to implement a cluster-wide
restart lock for a service. The lock has a TTL so it's guaranteed
to be eventually released without requiring complex contingency
cleanup in case of failures.

Tested locally by running the following:
1. force recreate certificate on all nodes at once for galera
   (ipa-cert resubmit -i mysql), and verify that the resources
   restart one after the other

2. create a lock manually in pacemaker, recreate certificate for
   galera on all nodes, and verify that no resource is restarted
   before the manually created lock expires.

3. create a lock manually, let it expire, recreate a certificate,
   and verify that the resource is restarted appropriately and the
   lock gets cleaned up from pacemaker once the restart finishes.

Closes-Bug: #1885113
Change-Id: Ib2b62e33b34cf72edfdae6299cf432259bf960a2
2020-07-30 16:51:48 +02:00
Zuul
e59009a7e1 Merge "Avoid failing on deleted file" 2020-07-24 20:01:24 +00:00
Zuul
45c959a5ea Merge "Ensure redis_tls_proxy starts after all redis instances" 2020-07-23 04:31:17 +00:00
David Hill
6c3c8b41de Avoid failing on deleted file
Avoid failing on a deleted file, as sometimes a file might get
deleted while the script runs. Log the exception instead for
troubleshooting purposes.
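
A small sketch of the tolerant behaviour (function name is illustrative):

    import logging
    import os

    LOG = logging.getLogger(__name__)

    def safe_chown(path, uid, gid):
        # Files can disappear while the tree is being walked; log and carry
        # on instead of aborting the whole script.
        try:
            os.chown(path, uid, gid)
        except OSError as e:
            LOG.error('Could not change ownership of %s: %s', path, e)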

Change-Id: I733cec2b34ef0bd0780ba5b0520127b911505e1b
2020-07-08 13:23:52 +01:00
Damien Ciabrini
b91a1a09cb Ensure redis_tls_proxy starts after all redis instances
When converting an HA control plane to TLS-e, 1) the bootstrap node
tells pacemaker to restart all redis instances to take into
account the new TLS-e config; 2) a new container redis_tls_proxy
is started on every controller to encapsulate redis traffic in TLS
tunnels. This happens during step 2.

Redis servers have to be restarted everywhere for redis_tls_proxy
to be able to start tunnels properly. Since we can't guarantee that
across several nodes during the same step, tweak the startup of
redis_tls_proxy instead; make sure to only create the tunnels once
the targeted host:port can be bound (i.e. redis was restarted).
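
A sketch of the kind of startup check this implies (the real tweak lives in the redis_tls_proxy start logic):

    import socket
    import time

    def wait_until_bindable(host, port, timeout=600):
        # The tunnel endpoint can only be bound once the old redis listener
        # is gone, i.e. redis has been restarted with the new config.
        deadline = time.time() + timeout
        while time.time() < deadline:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                s.bind((host, port))
                return True
            except OSError:
                time.sleep(1)
            finally:
                s.close()
        return False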

Change-Id: I70560f80775dacddd82262e8079c13f86b0eb0e6
Closes-Bug: #1883096
2020-07-07 05:36:43 +00:00
Hervé Beraud
be280e39c2 Stop using the __future__ module.
The __future__ module [1] was used in this context to ensure compatibility
between python 2 and python 3.

We previously dropped support for python 2.7 [2] and now we only support
python 3, so we no longer need this module or the imports
listed below.

Imports commonly used and their related PEPs:
- `division` is related to PEP 238 [3]
- `print_function` is related to PEP 3105 [4]
- `unicode_literals` is related to PEP 3112 [5]
- `with_statement` is related to PEP 343 [6]
- `absolute_import` is related to PEP 328 [7]

[1] https://docs.python.org/3/library/__future__.html
[2] https://governance.openstack.org/tc/goals/selected/ussuri/drop-py27.html
[3] https://www.python.org/dev/peps/pep-0238
[4] https://www.python.org/dev/peps/pep-3105
[5] https://www.python.org/dev/peps/pep-3112
[6] https://www.python.org/dev/peps/pep-0343
[7] https://www.python.org/dev/peps/pep-0328

Change-Id: I2cf7495c5cb42c632993bb2372ffb626ab97bf0d
2020-07-02 15:27:27 +00:00
Hervé Beraud
11f84b6302 Use unittest.mock instead of mock
The mock third party library was needed for mock support in py2
runtimes. Since we now only support py36 and later, we can use the
standard lib unittest.mock module instead.
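
The change is purely an import swap, e.g.:

    # Before: import mock            (third-party package)
    # After:  standard library only
    from unittest import mock

    fake = mock.MagicMock(return_value=42)
    assert fake() == 42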

Change-Id: Iabd3e90a46fd087c8e780796e04fcc050c5277ab
2020-06-09 18:41:21 +02:00
Michele Baldessari
4d8eb35114 Drop bootstrap_host_exec from pacemaker_restart_bundle
bootstrap_host_exec does not exist on the host, so we cannot assume its
existence. Since, barring argument checking, it is three lines of shell
script [1], we just inline its logic directly. We also add an extra echo to make it
simpler to debug any bootstrap vs non-bootstrap issues.

[1] https://github.com/openstack/tripleo-common/blob/master/scripts/bootstrap_host_exec
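
Roughly, the inlined check amounts to the following (simplified sketch, hiera lookup omitted):

    import socket
    import subprocess
    import sys

    def run_on_bootstrap_only(bootstrap_node, cmd):
        # Only run cmd on the service's bootstrap node, and say which branch
        # was taken so bootstrap vs non-bootstrap issues are easy to spot.
        if socket.gethostname().split('.')[0].lower() == bootstrap_node.lower():
            print('Node is the bootstrap node, running: ' + ' '.join(cmd))
            sys.exit(subprocess.run(cmd).returncode)
        print('Node is not the bootstrap node, skipping')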

Change-Id: Ia850286682f09cd75651591a1158c2e467343c1d
Related-Bug: #1863442
2020-04-20 17:28:06 +02:00
Oliver Walsh
45dd4e18a5 Tolerate NFS exports in /var/lib/nova when selinux relabelling
When the :z bind mount option is used, podman performs a recursive relabel of
the mount point, which fails with "Operation not supported" if there are
any NFS exports mounted within. While it's possible for NFS to support true
selinux labelling, in practice it rarely does.

As we are already walking the tree to set ownership/permission, take ownership
of the relabelling logic too and skip relabelling on subtrees where we hit this
error.
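
A hedged sketch of the relabelling step with the skip behaviour (the label value and xattr usage are illustrative):

    import errno
    import os

    def safe_relabel(path, context=b'system_u:object_r:container_file_t:s0'):
        # Apply the selinux label ourselves while walking the tree; on
        # filesystems that do not support labelling (typically NFS) the
        # kernel returns EOPNOTSUPP, so skip that subtree instead of failing.
        try:
            os.setxattr(path, 'security.selinux', context)
        except OSError as e:
            if e.errno == errno.EOPNOTSUPP:
                return False
            raise
        return True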

Change-Id: Id5503ed274bd5dc0c5365cc994de7e5cdcbc2fb6
Closes-bug: #1869020
2020-03-26 11:22:38 +00:00
Damien Ciabrini
4d21bab8f2 HA: check before restarting resource on stack update
When container <service>_restart_bundle is run, it checks whether
it can call pcs to restart the associated pacemaker resource, when
applicable.
Make sure we enforce the checks in all cases (when we run during
a stack update / update converge, and during a minor update).

Change-Id: I0367a657ddf440f0b73c4de5346306f12439db15
Closes-Bug: #1868533
2020-03-23 10:48:08 +01:00
Damien Ciabrini
3230f005c1 HA: reorder init_bundle and restart_bundle for improved updates
A pacemaker bundle can be restarted either because:
  . a tripleo config has been updated (from /var/lib/config-data)
  . the bundle config has been updated (container image, bundle
    parameter,...)

In HA services, special container "*_restart_bundle" is in charge
of restarting the HA service on tripleo config change. Special
container "*_init_bundle" handles restart on bundle config change.

When both types of change occur at the same time, the bundle must
be restarted first, so that the container has a chance to be
recreated with all bind-mounts updated before it tries to reload
the updated config.

Implement the improvement with two changes:

1. Make the "*_restart_bundle" start after the "*_init_bundle", and
make sure "*_restart_bundle" is only enabled after the initial
deployment.

2. During minor update, make sure that the "*_restart_bundle" not
only restarts the container, but also waits until the service
is operational (e.g. galera fully promoted to Master). This forces
the rolling restart to happen sequentially, and avoids service
disruption in quorum-based clustered services like galera and
rabbitmq.

Tested the following update use cases:

* minor update: ensure that *_restart_bundle restarts all types of
  resources (OCF, bundles, A/P, A/P Master/Slave).

* minor update: ensure *_restart_bundle is not executed when no
  config or image update happened for a service.

* restart_bundle: when resource (OCF or container) fails to
  restart, bail out early instead of waiting for nothing until
  timeout is reached.

* restart_bundle: make sure a resource is restarted even when it
  is in a failed state when *_restart_bundle is called.

* restart_bundle: A/P can be restarted on any node, so watch
  restart globally. When the resource restarts as Slave, continue
  watching for a Master elsewhere in the cluster.

* restart_bundle: if an A/P is not running locally, make sure it
  doesn't get restarted anywhere else in the cluster.

* restart_bundle: do not try to restart stopped (disabled) or
  unmanaged resource. Bail out early instead, to not wait until
  timeout is reached.

* stack update: make sure that running a stack update with no
  change does not trigger any *_restart_bundle, and does not
  restart any HA container either.

* stack update: when bundle and config will change, ensure bundle
  is updated before HA containers are restarted (e.g. HAProxy
  migration to TLS everywhere).

Change-Id: Ic41d4597e9033f9d7847bb6c10c25f443fbd5b0e
Closes-Bug: #1839858
2020-01-23 16:09:36 +01:00
Sandeep Yadav
08ca0a97d4 Change optparse to argparse
The optparse module has been deprecated since python version 2.7 [1].
This change switches the remaining modules that were using optparse to the
newer argparse usage.

[1] https://docs.python.org/2/library/optparse.html
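
The equivalent argparse usage is straightforward, e.g. (options shown are illustrative):

    import argparse

    parser = argparse.ArgumentParser(description='example tool')
    parser.add_argument('-c', '--config', default='/etc/example.conf',
                        help='path to the configuration file')
    args = parser.parse_args()
    print(args.config)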

Change-Id: Iea9ef9dd4ac224a1f9fa5eaca0aa0959c802bcdd
2020-01-21 04:17:09 +00:00
Zuul
469b977e23 Merge "HA: ensure TRIPLEO_MINOR_UPDATE is defined for <svc>_restart_bundle" 2019-10-25 04:22:55 +00:00
Damien Ciabrini
81610bdc36 HA: ensure TRIPLEO_MINOR_UPDATE is defined for <svc>_restart_bundle
Containers <svc>_restart_bundle use the script pacemaker_restart_bundle.sh,
which behaves according to the value of the environment variable
TRIPLEO_MINOR_UPDATE. Set a default value in case this variable is
unset (i.e. during a stack update).
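
The defaulting itself is trivial, e.g. (sketch; the real default is set in the shell script):

    import os

    # Treat an unset TRIPLEO_MINOR_UPDATE (stack update case) as "false".
    minor_update = os.environ.get('TRIPLEO_MINOR_UPDATE', 'false').lower() == 'true'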

Change-Id: I59da2d3c50fa30a8f3e557a16367f889b103a6f8
Closes-Bug: #1849503
2019-10-23 16:56:38 +02:00
Martin Schuppert
d80d948fe7 Fix placement_wait_for_service
This fixes the indent and volumes of the placement_wait_for_service
and the corresponding placement_wait_for_service.py to use the
config of the extracted placement service.

It also
* changes to set placement::keystone::authtoken::auth_url
  instead of placement::keystone::authtoken::auth_uri, as auth_uri is
  deprecated and not supported by placement::keystone::authtoken.
* sets placement::keystone::authtoken::region_name

Related-Bug: 1842948

Change-Id: Ic24cf646efdd70ba1dbca42d3408847fe09a6e49
2019-10-17 16:08:36 +02:00
Zuul
cb5a99b905 Merge "Ensure nova-api is running before starting nova-compute containers" 2019-10-11 18:43:10 +00:00
Zuul
291f6472c2 Merge "Add multi region support in nova_wait_for_compute_service.py" 2019-10-04 03:43:01 +00:00
Oliver Walsh
8a87cbcc34 Ensure nova-api is running before starting nova-compute containers
If nova-api is delayed starting then the nova_wait_for_compute_service
can time out. A deployment using a slow/busy remote container repository is
particularly susceptible to this issue. To resolve this nova_compute and
nova_wait_for_compute_service have been postponed to step_5 and a task
has been added to step_4 to ensure nova_api is active before proceeding.

Change-Id: I6fcbc5cb5d4f3cbb618d9661d2a36c868e18b3d6
Closes-bug: #1842948
2019-10-01 11:11:44 +01:00
Takashi Kajinami
f47dfe1059 Enforce pep8/pyflakes rule on python codes
This change makes sure that we apply pep8/pyflakes checks on all python
code to improve its readability.

Note that there are some rules applied in other OpenStack projects,
but not yet turned on here, which should be enabled in the future.

Change-Id: Iaf0299983d3a3fe48e3beb8f47bd33c21deb4972
2019-09-05 15:40:46 +09:00
Damien Ciabrini
7f785e8757 HA: fix <service>_restart_bundle with minor update workflow
For each HA service we have a paunch container <service>_restart_bundle
which is started by paunch whenever config files change during stack
deploy/update. This container runs a pcs command on a single node to
restart all the service's containers (e.g. all galera on all controllers).
By design, when it is run, configs have already been regenerated by the
deploy tasks on all nodes.

For minor updates, the workflow runs differently: all the steps of the
deploy tasks are run one node after the other, so when
<service>_restart_bundle is called, there is no guarantee that the
service's configs have been regenerated on all the nodes yet.

To fix the wrong restart behaviour, only restart local containers when
running during a minor update, and run once per node. When the minor
update workflow calls <service>_restart_container, we still have the
guarantee that the config files are already regenerated locally.
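
A rough sketch of the split behaviour (container names and commands are illustrative, not the actual script):

    import os
    import subprocess

    def restart_ha_service(resource, local_containers):
        # Minor update: only bounce the local replicas, one node at a time.
        # Stack deploy/update: ask pacemaker for a cluster-wide restart from
        # a single node, since configs are already regenerated everywhere.
        if os.environ.get('TRIPLEO_MINOR_UPDATE', 'false').lower() == 'true':
            for name in local_containers:
                subprocess.run(['podman', 'restart', name], check=True)
        else:
            subprocess.run(['pcs', 'resource', 'restart', resource], check=True)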

Co-Authored-By: Michele Baldessari <michele@acksyn.org>
Co-Authored-By: Luca Miccini <lmiccini@redhat.com>

Change-Id: I92d4ddf2feeac06ce14468ae928c283f3fd04f45
Closes-Bug: #1841629
2019-08-30 18:46:31 +02:00
Gauvain Pocentek
2ed2b72021 Add multi region support in nova_wait_for_compute_service.py
The region_name parameter needs to be defined in a multi-region setup.

Change-Id: Ifa1b3ffe63a5390d6f53cba9ae2b73c43a105e83
2019-08-19 13:47:37 +02:00
Martin Schuppert
f8779e5023 Move nova cell v2 discovery to deploy_steps_tasks
Recent changes for e.g. edge scenarios intentionally moved discovery
from the controller to the bootstrap compute node. The task is triggered by
the deploy-identifier to make sure it gets run on any deploy, scale, ... run.
If the deploy run is triggered with the --skip-deploy-identifier flag, discovery
will not be triggered at all, which as a result causes failures in previously
supported scenarios.
This change moves the host discovery task into the ansible
deploy_steps_tasks so that it gets triggered even if --skip-deploy-identifier
is used, or the compute bootstrap node is blacklisted.

Closes-Bug: #1831711

Change-Id: I4bd8489e4f79e3e1bfe9338ed3043241dd605dcb
2019-07-02 17:24:27 +02:00
Martin Schuppert
bbd2d94483 Allow multiple same options in nova.conf
In python3 SafeConfigParser was renamed to ConfigParser, and strict
checking of duplicate options defaults to true. In the case of nova it is valid to
have duplicate option lines, e.g. pci_alias can be specified more than
once in nova.conf, which results in an error like the one seen in
https://bugs.launchpad.net/tripleo/+bug/1827775

https://docs.python.org/3/library/configparser.html#configparser.ConfigParser
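
The usual fix is to construct the parser with strict checking disabled, e.g.:

    import configparser

    # strict=False accepts duplicate option lines (e.g. several pci_alias
    # entries) instead of raising DuplicateOptionError.
    parser = configparser.ConfigParser(strict=False)
    parser.read('/etc/nova/nova.conf')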

Closes-Bug: #1827775

Change-Id: I410af66d8dceb6dde84828c9bd1969aa623bf34c
2019-05-09 09:22:22 +02:00
Martin Schuppert
4d4263f4f1 Set debug level of nova container_config_scripts only when enabled
Right now all scripts log at DEBUG level. This change only enables
DEBUG level if debug is also enabled for the nova service.
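
A minimal sketch of the gating (how the nova debug flag is obtained is illustrative):

    import logging

    def configure_logging(nova_debug_enabled):
        # Helper scripts only log at DEBUG when debug is enabled for nova.
        level = logging.DEBUG if nova_debug_enabled else logging.INFO
        logging.basicConfig(level=level)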

Change-Id: Ie58a6630877a58bec8ce763ede166997bd41f882
2019-04-30 14:40:33 +02:00
Oliver Walsh
908e6b9810 Avoid concurrent nova cell_v2 discovery instances
The nova_cell_v2_discover_hosts.py script was moved to run on compute
nodes instead of controllers, to allow adding computes without
touching controllers and to support the case where multiple stacks are
used to manage compute nodes. However, the nova-manage command run by
nova_cell_v2_discover_hosts.py races when it gets triggered at the same
time on multiple compute nodes.

With this change if this is _not_ an additional cell:
* in docker_config step4, on every compute, we start the nova-compute
  container and then start a (detach=false) container to wait for
  its service to appear in the service list.
* in docker_config step5, on the bootstrap node only, we run the
  discovery.

Change-Id: I1a159a7c2ac286373df2b7c566426b37b7734961
Closes-bug: 1824445
Co-authored-by: mschuppert@redhat.com
2019-04-18 16:23:15 +02:00
Gauvain Pocentek
8948eced73 Test the correct placement endpoint with multiple regions
In a multi-region setup (not yet supported but there are plans to
support it) the nova_wait_for_placement_service.py script might check
the wrong placement endpoint.

This change makes the script explicitly look for the endpoint in the
correct region.

Change-Id: I83e44e0d0cb104dbb10b3699469e00e15b320409
Closes-Bug: #1819174
2019-03-08 15:45:10 +01:00