This simplifies the ServiceNetMap/VipSubnetMap interfaces
to use the parameter merge strategy and removes the *Defaults
interfaces.
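A minimal sketch of how a single entry can now be overridden from an
environment file via Heat's parameter merge strategies (the key and value
below are illustrative):

  parameter_defaults:
    ServiceNetMap:
      MysqlNetwork: internal_api
  parameter_merge_strategies:
    ServiceNetMap: merge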
Change-Id: Ic73628a596e9051b5c02435b712643f9ef7425e3
With I57047682cfa82ba6ca4affff54fab5216e9ba51c Heat has added
a new template version for wallaby. This allows us to use the
2-argument variant of the ``if`` function, which enables e.g.
conditional definition of resource properties and helps clean up
templates. If only two arguments are passed to the ``if`` function,
the entire enclosing item is removed when the condition is false.
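A minimal sketch of the 2-argument form (parameter, condition and resource
names are illustrative):

  heat_template_version: wallaby

  parameters:
    FixedIP:
      type: string
      default: ''

  conditions:
    use_fixed_ip: {not: {equals: [{get_param: FixedIP}, '']}}

  resources:
    example_port:
      type: OS::Neutron::Port
      properties:
        network: private
        # with the 2-argument form the whole fixed_ips property is
        # dropped when use_fixed_ip is false
        fixed_ips:
          if:
            - use_fixed_ip
            - [{ip_address: {get_param: FixedIP}}]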
Change-Id: I25f981b60c6a66b39919adc38c02a051b6c51269
This is using the linux-system-roles.certificate ansible role,
which replaces puppet-certmonger for submitting certificate
requests to certmonger. Each service is configured through
its heat template.
Partial-Implements: blueprint ansible-certmonger
Depends-On: https://review.rdoproject.org/r/31713
Change-Id: Ib868465c20d97c62cbcb214bfc62d949bd6efc62
In order to support ANSIBLE_INJECT_FACT_VARS=False we have to use ansible_facts
instead of ansible_* vars. This change switches our distribution- and
hostname-related items to use ansible_facts instead.
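A minimal before/after sketch of the switch (task and condition are
illustrative):

  # before: relies on injected ansible_* fact variables
  - name: Install a package on Red Hat family distros
    package:
      name: somepackage
    when: ansible_distribution == 'RedHat'

  # after: works with ANSIBLE_INJECT_FACT_VARS=False
  - name: Install a package on Red Hat family distros
    package:
      name: somepackage
    when: ansible_facts['distribution'] == 'RedHat'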
Change-Id: I49a2c42dcbb74671834f312798367f411c819813
Related-Bug: #1915761
When a tripleo major upgrade or FFU causes an update of mariadb
to a new major version (e.g. 10.1 -> 10.3), some internal DB
tables must be upgraded (myisam tables), and sometimes the
existing user tables may be migrated to new mariadb defaults.
Move the db-specific upgrade steps into a dedicated script and
make sure that it is called at the right time while upgrading
the undercloud and/or the overcloud.
Closes-Bug: #1913438
Change-Id: I92353622994b28c895d95bdcbe348a73b6c6bb99
This was mainly there as a legacy interface which was
for internal use. Now that we pull the passwords from
the existing environment and don't use it, we can drop
this.
Reduces the number of heat resources.
Change-Id: If83d0f3d72a229d737a45b2fd37507dc11a04649
Currently galera and ovn require a coordinated restart across
the controller nodes when certmonger determines the certificate
for a node has expired and it needs to regenerate it.
But right now, when the tripleo certmonger puppet module is
called to assert the state of the certificates, it ends up
regenerating a new certificate unconditionally. So galera and
ovn get restarted on stack update, even when there is no need to.
To mitigate these unnecessary restarts, disable the post-action
for now until we fix the behaviour of tripleo's certmonger puppet
module. This has the side effect that services won't get restarted
automatically when a certificate reaches its expiration date unless
a stack update takes place.
Related-Bug: #1906505
Change-Id: I17f1364932e43b8487515084e41b525e186888db
Pulling images over the Internet is not a very stable operation,
as there can be a lot of issues (DNS, HTTP rate limits, route changes or
interface restarts ...). This patch retries the "pull" 3 times to mitigate
possible networking issues.
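A minimal sketch of the retry pattern in an Ansible task (task name and
image variable are illustrative):

  - name: Pull the container image, retrying on transient network failures
    command: podman pull "{{ container_image }}"
    register: pull_result
    until: pull_result.rc == 0
    retries: 3
    delay: 5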
Change-Id: I03643576c9f8444d6db36364a73bccce244c8446
Closes-Bug: 1899057
With [1] we added the ability to use a fixed, static prefix for
the container image names that pacemaker uses to set up the HA
resources. This name is just an alias to the real tripleo
image that is being used in the control plane, and it allows updating
the namespace part of the image during a minor update
without disrupting service.
Add a new Heat parameter to use a completely static container
image name, to allow arbitrary name change during minor update,
e.g. registryA/namespaceA/imgA:tagA to registryB/namespaceB/imgB:tagB
By default, this new parameter is disabled.
[1] Id369154d147cd5cf0a6f997bf806084fc7580e01
Change-Id: I124c1e4dbcc7a8ed38079411f41a8f31c8f62284
The mysql database is created by the mysql_bootstrap container,
which lets Kolla run mysqld_safe temporarily, and then
lets TripleO run it for additional setup.
Before running the second temporary mysqld server, make
sure that the mysqld_safe script started by Kolla is
always stopped, to avoid any race condition that would
cause the second mysqld_safe server to be killed by the
Kolla one.
Change-Id: Id7cf45fb95d3c8a2c5519b1a13a5651cf414a115
Co-Authored-By: Michele Baldessari <michele@acksyn.org>
Closes-Bug: #1896009
This is needed because the Mysql_ providers will prefetch
the mysql users if facter finds the '/bin/mysql' executable
on the system and we do not want to run any mysql task on the host
directly.
Co-Authored-By: Damien Ciabrini <dciabrin@redhat.com>
Change-Id: Ic6c65e6849368185177aeaa31d50f52761225f62
Related-Bug: #1863442
This implements the creation of the haproxy bundle on the host.
The testing protocol used is documented in the depends-on.
The reason for adding a post_update task is that during a minor update
the deployment tasks are not run during the node update procedure but
only during the final converge. So we run the role again there to make
sure that any config change will trigger a restart during the minor
update, so the disruption is only local to the single node being
updated. If we did not do this, a final converge could potentially
trigger a global restart of HA bundles, which would be quite disruptive.
NB: in this case we keep the container init_bundle (renamed to
wait_bundle) around and just use it to wait for galera to be up.
Depends-On: Iaa7e89f0d25221c2a6ef0b81eb88a6f496f01696
Change-Id: Ie14819b66cecdb5a9cc6299b68a0cc70a7aa3370
Related-Bug: #1863442
There are certain HA clustered services (e.g. galera) that don't
have the ability natively to reload their TLS certificate without
being restarted. If too many replicas are restarted concurrently
this might result in full service disruption.
To ensure service availability, provide a means to ensure that
only one service replica is restarted at a time in the cluster.
This works by using pacemaker's CIB to implement a cluster-wide
restart lock for a service. The lock has a TTL so it's guaranteed
to be eventually released without requiring complex contingency
cleanup in case of failures.
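A minimal sketch of the idea, storing an "owner:expiry" lock as a cluster
property via crm_attribute; the property name, TTL and exact mechanics here
are hypothetical, not the final implementation:

  - name: Acquire a cluster-wide restart lock for galera (hypothetical sketch)
    command: >-
      crm_attribute --type crm_config --name galera-restart-lock
      --update "{{ inventory_hostname }}:{{ lock_expiry }}"
    vars:
      # lock expires 600 seconds from now
      lock_expiry: "{{ (lookup('pipe', 'date +%s') | int) + 600 }}"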
Tested locally by running the following:
1. force certificate recreation on all nodes at once for galera
(ipa-cert resubmit -i mysql), and verify that the resources
restart one after the other
2. create a lock manually in pacemaker, recreate the certificate for
galera on all nodes, and verify that no resource is restarted
before the manually created lock expires.
3. create a lock manually, let it expire, recreate a certificate,
and verify that the resource is restarted appropriately and the
lock gets cleaned up from pacemaker once the restart finishes.
Closes-Bug: #1885113
Change-Id: Ib2b62e33b34cf72edfdae6299cf432259bf960a2
Now that the FFU process relies on the upgrade_tasks and deployment
tasks there is no need to keep the old fast_forward_upgrade_tasks.
This patch removes all the fast_forward_upgrade_tasks sections from
the services, as well as from the common structures.
Change-Id: I39b8a846145fdc2fb3d0f6853df541c773ee455e
Bind mounting single files is never a good idea, and
we're actually planning to potentially add one more script in there
to be used by the restart bundles, so it is best if we bind mount the whole
folder and use that directly.
Change-Id: I881f6fdb7f99575f017fb86d6ab2dc3d55348e46
When the operating system upgrade is performed, a
transfer data step must be executed (to allow the
db import once the operating system is upgraded). During
this step the pacemaker cluster is stopped, so we can
create a new cluster with the newly OS-upgraded node.
Therefore, as the pacemaker cluster is down, we need
to skip some of the tasks which would be executed
during a normal upgrade (check pcs status, stop pcs and
start pcs) from the upgrade_tasks.
This patch removes the use of the UpgradeLeappEnabled heat
parameter to identify this and instead uses the existence of
the flag file created during the transfer_data step to skip
the normal pacemaker upgrade tasks, by setting a new
fact, cluster_recreate, if this file exists.
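A minimal sketch of the flag-file check (the file path is hypothetical, the
fact name comes from the description above):

  - name: Check whether the data transfer flag file exists
    stat:
      path: /var/lib/tripleo/transfer-flags/var-lib-mysql  # hypothetical path
    register: transfer_flag

  - name: Mark the cluster as recreated so pacemaker upgrade tasks are skipped
    set_fact:
      cluster_recreate: true
    when: transfer_flag.stat.exists | bool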
Change-Id: Iba85e99f59258ce6ef4e05ccae737b9eeb6cfc57
Previously the --limit option from tripleoclient assumed an operator would only
use a comma ','. It now converts all limit formats into a standardized format
that is always colon ':' separated instead. This patch corrects issues
with upgrade tasks in Train and newer.
Change-Id: Icbd6b8568a697cbb0cf5740fc1a6c17b2b001c0e
Related-Change-Id: I190f6efe8d728f124c18ce80be715ae7c5c0da01
Signed-off-by: Luke Short <ekultails@gmail.com>
The operating system upgrade from RHEL 7 to RHEL 8 was required during the upgrade
from Rocky to Stein, however it isn't required anymore for the upgrade from Stein to
Train. But we can't get rid of these tasks, as they will be required for the
three-release jump from Queens to Train.
The solution has been making use of an existing heat parameter,
UpgradeLeappEnabled, which will be set when an operating system upgrade is required.
Before, this parameter defaulted to true, but from now on it
defaults to false and will be set to true during the prepare step.
Change-Id: I7ac0c74726f7bbeb773d54f6909c5f647717f79a
Almost every single tripleo service creates a persistent directory. To
simplify the creation, a with_items structure was being used, in which
the mode option was often set. However, that mode option
was not taken into account at the time of creating the directory. As a
consequence, the directory was being created with its parent directory's
rights, instead of the ones being passed in the template.
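A minimal sketch of the fixed pattern, where the mode from each item is now
honored (paths, modes and setype values are illustrative):

  - name: Create persistent directories
    file:
      path: "{{ item.path }}"
      state: directory
      setype: "{{ item.setype | default('container_file_t') }}"
      mode: "{{ item.mode | default(omit) }}"
    with_items:
      - { path: /var/log/containers/mysql, mode: '0750' }
      - { path: /var/lib/mysql, mode: '0755' }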
Change-Id: I215db2bb79029c19ab8c62a7ae8d93cec50fb8dc
Closes-Bug: #1871231
Current puppet modules use only absolute names to include classes,
so replace relative names with absolute names in template files so that
the template description can be consistent with the puppet implementation.
Change-Id: I7a704d113289d61ed05f7a31d65caf2908a7994a
With the HA NG work having landed, the impact of pacemaker
is reduced and only very few core services are being managed
by pacemaker. Since the HA deployments work just fine
with a single node, it makes little sense to use the non-ha
deployment as the default any longer (also because downstream
we default to the HA deployment and this
keeps confusing users).
This patch does the following:
* Remove Keepalived services from all CI scenarios running it.
* Make sure all HA services deployed in CI run with Pacemaker.
* Remove non HA containers so Pacemaker can
bootstrap the new containers safely.
* Before removing mysql container, create the clustercheck user and
grant correct permissions to avoid bootstrap issues later when galera
is spawned.
* Disable HA on the minor update job; it does not seem to work correctly if
only one controller is deployed.
Depends-On: https://review.opendev.org/#/c/718759
Change-Id: I0f61016df6a9f07971c5eab51cc9674a1458c66f
These roles were not renamed when we removed all of the hyphens.
This change removes the remaining hyphenated roles.
Change-Id: I3b25bfdef91b0bfc8d624d71a884d57508eaf004
In HA deployments, puppet-mysql is not in charge of deleting
all default users in the DB, so we end up keeping an extra
root@<controller-x> user that is never used nor supported for
password update. Make sure we delete it at creation time.
Change-Id: I0dbe6bd43ad0e6bcb884798912d195e94738c344
Closes-Bug: #1867165
- deploy-steps-tasks-step-1.yaml: Do not ignore errors when dealing
with check-mode directories. The file module is resilient enough to
not fail if the path is already absent.
- deploy-steps-tasks.yaml: Replace ignore_errors by another condition,
"not ansible_check_mode"; this task is not needed in check mode.
- generate-config-tasks.yaml: Replace ignore_errors by another
condition, "not ansible_check_mode"; this task is not needed in check mode.
- Neutron wrappers: use fail_key: False instead of ignore_errors: True
if a key can't be found in /etc/passwd.
- All services with service checks: Replace "ignore_errors: true" by
"failed_when: false". Since we don't care about whether or not the
task returns 0, let's just make the task never fail. It will only
improve UX when scrolling through logs; no more failures will be shown for
these tasks (see the sketch after this list).
- Same as above for cibadmin commands, cluster resource show
commands and the keepalived container restart command, and all other shell,
command or yum module uses where we just don't care about their potential
failures.
- Aodh/Gnocchi: Add pipefail so the task isn't supposed to fail
- tripleo-packages-baremetal-puppet and undercloud-upgrade: check shell
rc instead of "succeeded", since the task will always succeed.
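A minimal sketch of the service-check substitution described above (task name
and command are illustrative):

  # before: the failure is ignored but still reported, cluttering the logs
  - name: Check service health
    command: /openstack/healthcheck
    ignore_errors: true

  # after: the task can never fail, so no spurious failure shows up
  - name: Check service health
    command: /openstack/healthcheck
    failed_when: false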
Change-Id: I0c44db40e1b9a935e7dde115bb0c9affa15c42bf
While they are, at the SELinux level, exactly the same (one is an alias to
the other), the "container_file_t" name is easier to understand (and
shorter to write).
A second pass in a couple of days or weeks will be needed in order to
change files that were merged after this first pass.
Change-Id: Ib4b3e65dbaeb5894403301251866b9817240a9d5
The upgrade workflow to Stein has a guard task that
checks that the --limit option is being used when
running the overcloud upgrade run command, as the
upgrade needs to be performed node by node due to
the operating system upgrade. However, if the --limit
option is not passed, the upgrade fails in the
task right before the guard, as that task already
references the undefined variable. So, we need
to invert the order so that we fail where intended, in the guard
task.
Change-Id: I9ffddcaa52314c615362969757c94ebdf01a3b6d
Closes-Bug: #1861663
Back in the day we added the rmi -f container calls in order
to try and clean up any old unused container images whenever we updated
any HA container. Nowadays this already happens via the
tripleo_ansible/tripleo_podman/purge role which prunes any unused
container image.
There is no point in keeping this code around since we already purge
images as a post upgrade/update task. We want to remove this code also
because it fails horribly when we update the HA containers with an image
that is based on the previously deployed image. In fact it fails
with:
TASK [Remove previous galera images] *******************************************
Friday 31 January 2020 10:34:40 +0000 (0:00:02.684) 0:02:56.021 ********
fatal: [database-0]: FAILED! => {"changed": true, "cmd": "podman rmi -f 209e952aa6cb3c212e57e5f81693eb4776c0c4b6cf96fb4faabdaa7403b2a94d", "delta": "0:00:00.110460", "end": "2020-01-31 10:34:40.772522", "msg": "non-zero return code", "rc": 2, "start": "2020-01-31 10:34:40.662062", "stderr": "Error: unable to delete \"209e952aa6cb3c212e57e5f81693eb4776c0c4b6cf96fb4faabdaa7403b2a94d\" (cannot be forced) - image has dependent child images", "stderr_lines": ["Error: unable to delete \"209e952aa6cb3c212e57e5f81693eb4776c0c4b6cf96fb4faabdaa7403b2a94d\" (cannot be forced) - image has dependent child images"], "stdout": "", "stdout_lines": []}
This is particularly important because any hotfix container
generated with the tripleo-modify-image role will be affected by this issue.
We tested this by doing the following:
1) Deploying an overcloud
2) Patching all HA containers with tripleo-modify-image
3) Running an update
With this change the update did not fail any longer and the correct
images were being used by pacemaker after the update process.
Co-Authored-By: Sofer Athlan-Guyot <sathlang@redhat.com>
Change-Id: I5346b32962b8cee5c64e4f07c0b68e2512085e83
Closes-Bug: #1861498
The next iteration of fast-forward-upgrade will be
from queens through to train, so we update the names
accordingly.
Change-Id: Ia6d73c33774218b70c1ed7fa9eaad882fde2eefe
A pacemaker bundle can be restarted either because:
. a tripleo config has been updated (from /var/lib/config-data)
. the bundle config has been updated (container image, bundle
parameter,...)
In HA services, the special container "*_restart_bundle" is in charge
of restarting the HA service on tripleo config change. The special
container "*_init_bundle" handles restarts on bundle config change.
When both types of change occur at the same time, the bundle must
be restarted first, so that the container has a chance to be
recreated with all bind-mounts updated before it tries to reload
the updated config.
Implement the improvement with two changes:
1. Make the "*_restart_bundle" start after the "*_init_bundle", and
make sure "*_restart_bundle" is only enabled after the initial
deployment.
2. During minor update, make sure that the "*_restart_bundle" not
only restarts the container, but also waits until the service
is operational (e.g. galera fully promoted to Master). This forces
the rolling restart to happen sequentially, and avoids service
disruption in quorum-based clustered services like galera and
rabbitmq.
Tested the following update use cases:
* minor update: ensure that *_restart_bundle restarts all types of
resources (OCF, bundles, A/P, A/P Master/Slave).
* minor update: ensure *_restart_bundle is not executed when no
config or image update happened for a service.
* restart_bundle: when resource (OCF or container) fails to
restart, bail out early instead of waiting for nothing until
timeout is reached.
* restart_bundle: make sure a resource is restarted even when it
is in a failed state when *_restart_bundle is called.
* restart_bundle: A/P can be restarted on any node, so watch
restart globally. When the resource restarts as Slave, continue
watching for a Master elsewhere in the cluster.
* restart_bundle: if an A/P is not running locally, make sure it
doesn't get restarted anywhere else in the cluster.
* restart_bundle: do not try to restart stopped (disabled) or
unmanaged resources. Bail out early instead, so as not to wait until
the timeout is reached.
* stack update: make sure that running a stack update with no
change does not trigger any *_restart_bundle, and does not
restart any HA container either.
* stack update: when both the bundle and the config change, ensure the bundle
is updated before HA containers are restarted (e.g. the HAProxy
migration to TLS everywhere).
Change-Id: Ic41d4597e9033f9d7847bb6c10c25f443fbd5b0e
Closes-Bug: #1839858
Ansible has decided that roles with hyphens in them are no longer supported,
by not including support for them in collections. This change renames all
the roles we use to the new role names.
Depends-On: Ie899714aca49781ccd240bb259901d76f177d2ae
Change-Id: I4d41b2678a0f340792dd5c601342541ade771c26
Signed-off-by: Kevin Carter <kecarter@redhat.com>
Rather than use the shell module directly, we switch to using the
role so that we do things consistently.
Change-Id: I0fc349d0697b2c34bb1daac4a6a961300cff49fa
Since [1,2], the HA container image names used in pacemaker
look like cluster-common-tag/<servicename>:pcmklatest.
Unlike docker, podman prepends host information in
front of this tag, which confuses the podman resource
agent in pacemaker.
Make the first part of the tag look like an FQDN so that
podman doesn't change it.
Closes-Bug: #1858648
[1] Id369154d147cd5cf0a6f997bf806084fc7580e01
[2] I7a63e8e2d9457c5025f3d70aeed6922e24958049
Change-Id: I07d72b87225dbadbc4df46564452ccb232593219
HA services get their container image name from a pacemaker
resource configuration. This image name is shared between
all cluster nodes.
To achieve image update without service disruption, a pacemaker
resource is configured to use an intermediate image name
"<registry>/<namespace>/<servicename>:pcmklatest" pointing to
the real image name configured in Heat. This tag can then be
updated independently on every node during the minor update.
In order to support the same rolling update when the <namespace>
changes in the container image, we need a similar floating
approach for the prefix part of the container image.
Introduce a new Heat parameter ClusterCommonTag that, when enabled,
sets the intermediate image name to
"cluster-common-tag/<servicename>:pcmklatest". By default, this
parameter is disabled and the original naming scheme is preserved.
Note: by introducing this new naming scheme, we stop seeing a
meaningful image name prefix when doing a "pcs status", but since
we already can't tell what image ID the :pcmklatest tag points to,
we don't lose much information really.
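A minimal sketch of enabling it from an environment file, assuming the
parameter is a boolean:

  parameter_defaults:
    ClusterCommonTag: true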
Related-Bug: #1854730
Change-Id: Id369154d147cd5cf0a6f997bf806084fc7580e01
When podman parses a volume map that ends with a slash, it removes the slash
automatically and shows the volumes without the slash in the inspection output.
When comparing configurations this turns into a difference and
breaks the idempotency of containers, causing them to be recreated.
Change-Id: Ifdebecc8c7975b6f5cfefb14b0133be247b7abf0
The upgrade tasks are failing because podman can't locate
the log output file during the MySQL upgrade step. The
parameter LOG_DIR was misspelled as LOGDIR and Heat
could not substitute it.
Change-Id: I19a7b8495a4510bff59feff43a6042da0e74eb5a
Closes-Bug: #1853162
This change converts our firewall deployment practice to use
the tripleo-ansible firewall role. This change creates a new
"firewall_rules" object which is queried using YAQL from the
"FirewallRules" resource.
A new parameter has been added allowing users to input
additional firewall rules as needed. The new parameter is
`ExtraFirewallRules` and will be merged on top of the YAQL
interface.
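A minimal sketch of passing extra rules through the new parameter (the rule
name, port and source below are illustrative):

  parameter_defaults:
    ExtraFirewallRules:
      '301 allow monitoring agent':
        dport: 9100
        proto: tcp
        source: 192.0.2.0/24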
Depends-On: Ie5d0f51d7efccd112847d3f1edf5fd9cdb1edeed
Change-Id: I1be209a04f599d1d018e730c92f1fc8dd9bf884b
Signed-off-by: Kevin Carter <kecarter@redhat.com>
https://review.opendev.org/#/c/692850/ cleaned up the
legacy directories, but since then rhel8 jobs fail while
starting galera containers with an error about the missing
directory /var/log/mariadb, so this patch adds it again.
Closes-Bug: #1851847
Change-Id: Iea081ecb3fc021fc796c93631ed6f663fd9580db
There were two issues in the mysql-pacemaker upgrade tasks:
- SELinux: since we're using podman, we have proper selinux enforcing on
the system and proper selinux separation for the containers. Some
volumes were lacking the "z" flag, making them inaccessible.
- Since we're on podman, we have to correct the "log-driver" in the
command. This allows getting a dedicated log for debugging purposes.
Change-Id: Ia03e6e8e913198b315c47982c14ed52569ec702c
Closes-Bug: #1851617
Resolves: rhbz#1769291