In order to support ANSIBLE_INJECT_FACT_VARS=False we have to use ansible_facts
instead of ansible_* vars. This change switches our distribution and
hostname related items to use ansible_facts.
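A minimal sketch of the substitution (the facts shown are examples):
  - name: Show distribution and hostname via ansible_facts
    debug:
      msg: "{{ ansible_facts['distribution'] }} on {{ ansible_facts['hostname'] }}"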
Change-Id: Id01e754f0cf9f6e98c02f45a4011f3d6f59f80a1
Related-Bug: #1915761
This exposes a new var for tripleo_container_manage. It also
removes the check that force-set clean_orphans=True in the
needs_delete filter when there is one or no item in
the startup config.
When disabling services, it is possible that a step ends up with
only one or zero container_startup_configs.
Partial-Bug: #1893335
Change-Id: I9d08168015487c48d8b380a9575ba236b7fb7d0d
The default PID limit in a container is set to 4096. This limit might be
reached in a nova_libvirt container, after launching about 150 VMs.
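A hedged sketch of raising the limit, assuming the podman_container module's
pids_limit option; the image and value are illustrative:
  - name: Create nova_libvirt with a higher PID limit
    podman_container:
      name: nova_libvirt
      image: quay.io/tripleo/nova-libvirt   # illustrative image
      state: started
      pids_limit: 65536                     # illustrative value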
Change-Id: Ibfbe63cbd9e2a219f10ebc407596aeefe4a5b194
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1871885
Closes-Bug: #1892817
We don't have to use no_log if we use a label to prevent all the info from
being output during execution.
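A minimal sketch, assuming loop_control's label is the mechanism meant here;
the loop data is illustrative:
  - name: Process container configs without dumping their content
    debug:
      msg: "Processing {{ item.key }}"
    loop: "{{ container_configs | dict2items }}"
    loop_control:
      # only the container name shows up in the task output
      label: "{{ item.key }}"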
Change-Id: I084c7e933a024b76f4d6fa6ac000c58b6b3c7acf
1) shutdown: include tasks without a block. This should remove the stat
and include tasks, so that a single task includes the shutdown
playbook if needed.
2) Remove the containers_changed tasks. It is not useful to restart a
container just because its podman_container resource changed:
when the podman_container module applies a change to the container,
the container is already restarted.
3) Remove the all_containers_commands tasks; they aren't needed. The Ansible
output already provides the commands that are run via the
container_status module.
Change-Id: Ic625bc5dd7bbd964d36eab0a3f81eca31c533716
Support the checks of "podman exec" commands, and add the commands to
the list of things podman ran.
It reduces the number of tasks as it removes get_commands_exec and
re-uses the container_status action plugin.
Change-Id: I4c84b389b595a8fe18ef6d31e896d6b6608b9920
Instead of running a bunch of tasks to manage systemd resources, move
it into an action plugin which should make the execution faster and
easier to debug as well.
Example of task:
  - name: Manage container systemd services
    container_systemd:
      container_config:
        - keystone:
            image: quay.io/tripleo/keystone
            restart: always
        - mysql:
            image: quay.io/tripleo/mysql
            stop_grace_period: 25
            restart: always
The output is "restarted" for the list of services that were actually
restarted in systemd.
Note on testing: since that module is consumed by the
tripleo_container_manage role, there is no need to create dedicated
molecule tests; we already cover containers with a restart policy in that
role's molecule tests, so we'll re-use them.
Co-Authored-By: Alex Schultz <aschultz@redhat.com>
Co-Authored-By: Kevin Carter <kecarter@redhat.com>
Change-Id: I614766bd9b111bda9ddfea0a60b032e1dee09abc
Instead of running a bunch of tasks to figure out which container
commands have been run, which ones did not terminate after 5 minutes,
and which ones failed or finished with a wrong exit code, we now have an
action plugin that does it faster and with better logging.
It is faster because it reduces the number of tasks.
Better logging is provided: all errors are now displayed during a run and
we fail at the end.
Check mode is supported.
Re-using tripleo_container_manage role for molecule testing.
Co-Authored-By: Alex Schultz <aschultz@redhat.com>
Co-Authored-By: Kevin Carter <kecarter@redhat.com>
Change-Id: Ie7f8c9cceaf9540d7d33a9bb5f33258c46185e77
container_startup_config will replace a bunch of tasks that we did in
THT to generate the .json file for each container in
/var/lib/tripleo-config/container-startup-config.
It'll accelerate the deployment a bit by replacing those tasks with a
single module, so we can generate the startup configs much faster.
Also tolerate empty configs in container_config_data, turning a
failure into a debug message. If the config is empty, there is no need to run
the rest of the role tasks in tripleo_container_manage.
Note: we manage the SELinux context in openstack-selinux:
https://github.com/redhat-openstack/openstack-selinux/commit/0b62
TODO (later): Manage idempotency and only create configs that need to be
updated.
Change-Id: I501f31b52db6e10826bd9a346d38e34f95ae8b75
Rather than using Ansible task loops which increase the execution time,
this change creates a filter that is used to assert that the containers
used in execs are running before we run the execs.
Change-Id: I5e15cc71c45160109f5c303c13dd25a052ede3c3
Paunch was removed from TripleO during Victoria Cycle.
tripleo_container_manage was enabled by default in Ussuri.
Therefore there is no need to run these tasks anymore, since they were
supposed to run during a deployment or upgrade of Ussuri.
Change-Id: I4f61dc954695d6c235effd44f16215d5e5401088
This change will enable or disable no_log and debug options whenever the
verbosity is set to an integer greater than 2. This will ensure operators and
deployers are best equipped to troubleshoot issues by dynamically providing
additional data in an expected way. To ensure we're able to differentiate
between output masking and security masking, two options were used to enable or
disable no_log across our roles and playbooks.
> All debug options, without security implications, will now react to the
built-in `ansible_verbosity` variable by default. Changes have been made to our
skeleton role to ensure this is enforced on all new roles created going
forward.
> An additional prefixed role option, `*_hide_sensitive_logs`, has been added
to allow operators to easily toggle sensitive output when required. The
role-prefixed variables will respond to the global option `hide_sensitive_logs`
as defined in THT, which ensures a consistent user experience (see the sketch
below).
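A minimal sketch of the pattern in a role's defaults/main.yml; the role name
and exact verbosity threshold are illustrative:
  # react to ansible_verbosity for plain debug output
  tripleo_example_debug: "{{ ((ansible_verbosity | int) > 2) | bool }}"
  # react to the global THT option for sensitive output
  tripleo_example_hide_sensitive_logs: "{{ hide_sensitive_logs | default(true) }}"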
Depends-On: I84f3982811ade59bac5ebaf3a124f9bfa6fa22a4
Change-Id: Ia6658110326899107a0e277f0d2574c79a8a820b
Signed-off-by: Kevin Carter <kecarter@redhat.com>
These options can be used instead of the --privileged option with
some containerised services in TripleO.
Change-Id: If1d97e5f1697fdc1d6a7b845cf116d54b1897245
If debug is enabled we want the new log level option to be debug;
otherwise fall back to error, which is the module's default.
Change-Id: I7ceed8d9720c0df8a88c7e1de9d9afd05afde166
containers_to_check is a list of containers that run something and then
exit with a return code. Later we will check what their return code is
and make sure the containers didn't fail before we continue with the
deployment.
Also, if a container failed to be created or updated, we don't want to include
it in the list of containers whose return code is checked, since we
already know the container didn't terminate correctly.
Closes-Bug: #1878405
Change-Id: Ia3a985d923c5f44babaae601f3ba62b6d48659da
- helpers/haskey: add excluded_keys argument. It allows returning the
configs that have a given attribute while excluding some attributes.
The use case here is that we have some container configs
which have both "command" and "action". We want to use that filter to
build a list of containers whose return code has to be checked;
which is not the case for the containers with "action" in their
configs, since they are used for "podman exec" configs (and there is
nothing to check in return from podman inspect).
- check_exit_code: change the list of containers to check the exit code
to include all the containers with a "command" but not "action"
(see the sketch after this list).
It should cover all the containers which are used to run some
non-service things like db_sync etc.
- molecule: change the fedora_bis and fedora_three containers to run a
short sleep so we can actually test that change against these
containers; on the first deployment of fedora_bis and
fedora_three, we'll check their return code.
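A hypothetical usage sketch; the variable names are made up and the exact
filter signature lives in tripleo-ansible's helpers:
  containers_to_check: >-
    {{ container_configs | haskey(attribute='command',
                                  excluded_keys=['action']) }}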
Change-Id: I466a57bd788e02c32b1efb0ac0223684f0d39393
Closes-Bug: #1878074
On slow systems, it's possible that systemd takes more time than usual
to execute a task from Ansible (e.g. service restart), so Ansible
doesn't yet have the registered facts from systemd.
To make sure that Ansible doesn't fail with:
'dict object' has no attribute 'status'
we first check if status is defined.
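A minimal sketch of the guard, with an illustrative register name:
  - name: Act on the systemd status only when it was reported
    debug:
      msg: "{{ systemd_result.status.ActiveState }}"
    when: systemd_result.status is defined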
Change-Id: Ie73cecc115c87fe452a90892755a1df5b3d894a7
Closes-Bug: #1877449
tripleo_container_manage_create_retries and
tripleo_container_manage_exec_retries (default to 60) will allow a
timeout of 10 minutes for both podman exec and podman run commands.
Indeed, some containers (db-sync or when puppet runs) can take up to 10
minutes to execute and finish.
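An illustrative override of the new knobs, e.g. to allow roughly 20 minutes
instead of 10 (the values are made up):
  tripleo_container_manage_create_retries: 120
  tripleo_container_manage_exec_retries: 120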
Change-Id: Iff752cd124546bdd7cf857b0dacfc7d33b9a71a6
- In container_running, replace 2 tasks with one task
- In podman/create, move the check_exit_code tasks into their own playbook
- Rework podman/systemd to only be included if systemd services are
needed by the container configs, and also reduce the number of tasks
Change-Id: Ief05797caf12084d7c1432bea037ccd56107dcde
Now that Podman natively supports healthchecks, let's use them; which
will reduce our footprint in how we consume Podman.
Using native healthcheck brings a few benefits:
- Fewer Ansible tasks to manage the systemd resources, so deployment
should be slightly faster.
- Leverage features in the container tooling directly, not in tripleo.
This patch does the following:
- Fix the podman arguments for healthcheck options in podman_container
module, transparent for the end-user. Indeed, the args are "health-*".
- Remove the management of timers and healthcheck services and their
requires.
- New playbook "healthcheck_cleanup" to cleanup previous systemd
healthchecks if they exist.
- Update molecule default testing to test if new healthchecks work fine.
- Update the role manual for healthchecks usage.
This patch should be transparent for the end-users except that the
systemd healthchecks won't exist anymore:
instead of running "systemctl status tripleo_keystone_healthcheck.timer",
we would run "podman healthcheck run keystone" or check the
output of "podman inspect keystone".
The document has also been updated in the role manual.
It requires at least Podman 1.6, on which this patch has been tested.
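A hedged sketch of a native healthcheck, assuming the podman_container
module's healthcheck options; the command and values are illustrative:
  - name: Create keystone with a native healthcheck
    podman_container:
      name: keystone
      image: quay.io/tripleo/keystone   # illustrative image
      state: started
      healthcheck: /openstack/healthcheck   # illustrative command
      healthcheck_interval: 60s
      healthcheck_retries: 3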
Depends-On: https://review.opendev.org/720089
Change-Id: I37508cd8243999389f9e17d5ea354529bb042279
If the tasks are skipped the variables are empty and should default
to an empty list, which will return an empty list of services when figuring out
what services need a restart.
Change-Id: I852066179c86b97a7f775a7babb4e44e89a0d9a3
Instead of managing systemd services per start_order per step, we could
manage them per step.
The start_order was created to be able to create containers which run
some command or do an exec, but not really for the ones that are services
managed by systemd.
Doing it per step will reduce the number of tasks and therefore the
deployment time.
Note: it adds the dict_to_list() filter, which converts a dict of dicts to a
list of dicts. Ansible allows doing it via dict2items | list, but in this
particular case we don't want key/value pairs when later processing the data to
figure out if systemd is needed.
Change-Id: Ia38f2ec753dc3c21bcf91f057fe7ff8020d214e6
It's possible that in a CI environment or on slow servers the systemd
reload takes time to read the new systemd services and reload systemd while
Ansible has already moved on and tries to enable these services.
Adding a retry mechanism so we retry 5 times with a delay of 5 seconds
in between, until the systemd task returns success.
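A minimal sketch of the retry, with an illustrative service name:
  - name: Enable and start the container service
    systemd:
      name: tripleo_keystone.service
      state: started
      enabled: true
      daemon_reload: true
    register: service_start
    retries: 5
    delay: 5
    until: service_start is success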
Change-Id: I2bb39d2aae68a0eadaad92cf0fc4e3506fbfc9ca
Closes-Bug: #1873453
If a container config has by mistake a healthcheck but no systemd
restart policy, we don't want to manage the healthcheck because it
requires its service to be created.
To prevent that situation, we'll create the healthchecks only if they
are already part of the systemd services list that was created earlier.
For that, we're using the intersect() filter, which returns
the intersection of two lists (systemd services and healthchecks to
create).
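An illustrative use of intersect(); the variable names are assumptions:
  healthchecks_to_manage: "{{ systemd_services | intersect(healthchecks_to_create) }}"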
Adding molecule coverage to test this scenario.
Closes-Bug: #1873249
Change-Id: Id5cc784bae597def0648f07d28b6463b387d2212
It helps with debugging when no action is run for containers, instead
of just showing an empty list.
Change-Id: I27c550e49492df9a80a3e9ed119962cc412c491a
Separate the creation of systemd files and the service restarts so we don't
call systemd too many times, which makes the deployment faster.
It also uses a new filter that reads register data to figure out
which systemd files changed and therefore which containers need a restart.
Change-Id: I16596a5b262642a678a8b8b123384fc387f69c70
To avoid this error:
The error was: KeyError: 'container_data'
We need to fetch container_data from async_result_item in the async
results; that's where the key is. The unit tests are updated as well, along
with the task which creates the facts, so there is no confusion in the logs.
Change-Id: I2a5533335151c4b292e85aea310adfdc44ab1e02
If a container fails to start after many retries, the default logging of
the async_status tasks isn't great and it's hard to figure out what
container failed to start.
In this patch, we introduce a new filter that reads the
async_results and builds a list of containers which failed to start
(failed set to True) or did not finish starting (finished set to 0); the
async_status task ignores errors, but we fail a bit later after building that
list.
Change-Id: I5a2270130bdf5b9d781f4d81ec25c6ccf12fdc07
- octavia_controller_post_config: remove "ignore_errors: true". It's not
supposed to be needed, since there is already a
"failed_when: config_contents.rc != 0" which knows when to fail.
- octavia_undercloud, tripleo_cellv2, tripleo_ceph_common,
tripleo_container_manage, tripleo_packages and tripleo_puppet_cache:
replace "ignore_errors: true" with "failed_when: false" for a better debugging
experience. We know the tasks can fail and we don't care; let's just
not show them as failures in that case and force the task to never
fail (see the sketch after this list).
- tripleo_podman: instead of ignoring errors, check if the config file
actually exists before wiping it out.
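A minimal sketch of the substitution; the task itself is illustrative:
  - name: Best-effort cleanup that must never fail the play
    command: podman rm -f old_container   # illustrative command
    failed_when: false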
Change-Id: Ib3716e4823735a9db9bd3cac33b8daf0e5f3d186
container_config_data is a new module that replaces some tasks that were
previously used to read and process JSON files on a host.
Doing it in a module will reduce the number of tasks, and make it a
little bit faster and easier to debug in big deployments.
Unit tests will come later.
Change-Id: I9844ed166ecf0718935ac1537b719000da816dbb
It seems like some db_sync tasks are taking more time and this
results in jobs failing. Though probably not a long-term
solution, it would be good to increase the timeout to reduce the
large number of job failures.
Change-Id: Ifa494ffdd58772c39808bcaa3d5d37b3802af065
Related-Bug: #1865473
To match how Paunch created the systemd services for containers, we
add the requires so the timers require their service to run for proper
healthchecks. We also need to run a systemd reload right after.
Change-Id: Icc14d4f3bf137a543d9ef4f6a2f6384d9df65a70
container_data contains the config data for each container, which can include
passwords or other secrets. Ansible would show it in the logs when
executing the resources, which leaks sensitive data. Only enable that
display if tripleo_container_manage_debug is set to True, like we
already do for some other tasks.
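A minimal sketch of the pattern; the task body is illustrative:
  - name: Show container data only when debugging
    debug:
      var: container_data
    no_log: "{{ not (tripleo_container_manage_debug | bool) }}"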
Change-Id: I2d58a3a234b94d5b5d2cc16b47c9751f78a416d9
The timer task needs to be skipped if there is no healthcheck in the
configuration, so move it to the block reserved for healthcheck tasks so
it gets ignored when a container has no healthcheck.
Change-Id: Ie1a4c5af7c2e8ff3e6ff0b85f264239a8e30e913
Re-use tripleo_container_rm role to remove the containers.
This role was doing the same thing as tasks/podman/delete.yml:
- remove healthcheck
- remove systemd service
- remove container
It reduces duplication and re-uses what we can from other roles.
Change-Id: Ibea31bb738fac4608b01cfe035e6c248b488aa09
podman_container_info module generates a bunch of logging, which is all
the info reported by podman inspect. It's only useful when
debugging, so let's disable its output when debug isn't enabled.
Change-Id: Ib35d18e5bfa8bbfdff75fda6c52ea7c770f6f986
If a container has a new config (e.g. new image), it'll now be removed
right before being re-created.
Before, we were first removing that container among potential orphan
containers, and the container would be re-created later; but the downtime
in between could be several minutes, since we batch the container
creation.
Now, we separate the cleanup of orphan containers and the ones with new
configs.
The workflow is the following:
1) Remove orphan containers (not in the config anymore, missing Labels,
etc).
Then for each batch of containers:
2) Remove the containers where the config_data has changed.
3) Create the containers that are in the config.
This patch should reduce the downtime of containers that are updated.
Change-Id: I821d674dead4a21b7ac30b47b31b8dd34e0ecc8b
Getting to know which containers changed is a challenge since we create
the containers in async with podman_container.
This patch introduces a new filter, called get_changed_containers(), which
knows how to use the data returned from async and figure out which containers
have podman actions (which means podman rm and create); these have to
trigger the systemd playbook and make sure the services are triggering
changes in systemd.
Change-Id: I2ecf7d942dcca1e381d329e11939fd31299551f5
1) container_puppet_config: introduce update_config_hash_only
If update_config_hash_only is set to True, the module will do the
following:
Browse the container startup configs and for each config, check if the
config_hash has changed (e.g. Puppet has run before and generated a new
config, therefore the container will need a restart); then update the
startup configs with the new hash.
This extends container_puppet_config's capabilities instead of writing a
new module for that.
2) tripleo_container_manage: add tripleo_container_manage_check_puppet_config
tripleo_container_manage_check_puppet_config is a new parameter that is
set to False by default; if set to True, we will call the
container_puppet_config module with update_config_hash_only set to True
so we get the new config hashes in the container startup configs right
before we decide if a container needs to be restarted (see the sketch below).
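A minimal sketch of enabling the new behaviour; the variable comes from this
change:
  tripleo_container_manage_check_puppet_config: true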
Change-Id: I16b2972cdf79cd6ac927925607197ec2a969a28b
If tripleo_container_manage_valid_exit_code is set to a list of codes,
for each container that will be created and expected to run something,
Ansible will check that the container exited with a valid code, given in
the parameter.
It can be useful for the Puppet containers where the entrypoint returns
the code from Puppet that has run in the container.
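An illustrative override; the accepted codes depend on what the container
entrypoint returns (Puppet's --detailed-exitcodes, for instance, uses 2 for
"changes applied"):
  tripleo_container_manage_valid_exit_code: [0, 2]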
Change-Id: I268d7b257334d78da11c37cb0ba0783fbe2021c0
This module will do two things:
Summary:
(1) Generate container configs for each container that has a puppet
deployment step, and put the files in
/var/lib/tripleo-config/container-puppet-config
(2) Update the container-startup-config of the containers which have a new
hash; so they get restarted later.
Details:
(1) Here are the steps that happen to generate the puppet container
configs:
- Create /var/lib/tripleo-config/container-puppet-config/step_X
- Generate a JSON file in the same format as the
well-known container-startup-configs (which are understood by
Paunch and the tripleo-container-manage Ansible role). It mimics
the logic from THT/common/container-puppet.py to add the
required configuration so the container can run.
(2) If a container has a new configuration, the TRIPLEO_CONFIG_HASH
will be updated in the startup config of the container, so later
Paunch or tripleo-container-manage Ansible role can restart the
container so the config is applied.
Note: it processes a bunch of files and data, so it's better for it to
be a module and not an action plugin, so the file generation can be
delegated to the remote nodes instead of the undercloud.
In the future, we'll try to generate container configuration directly in
config-download so the data will be ready to be copied.
Change-Id: I6b656df725803db0c1cdaac6f534766398a15810
When setting "int" after default(omit) it doesn't omit this arg,
but sets always 0 as a value. Let's remove this "int", anyway it
doesn't make much sense in cli command, everything is string there.
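An illustration of the fixed form; the module and argument names are made up:
  - name: Pass the optional argument only when the variable is defined
    example_module:                          # hypothetical module
      # without the trailing "| int", omit really omits the argument
      some_arg: "{{ some_var | default(omit) }}"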
Change-Id: Ibbce7c75e9b221192d5cb27e93d66880464491ff
If the tripleo-container-shutdown.preset is already in place, we don't need to
execute the shutdown.yml playbook.
It avoids repeating these tasks at every step and therefore saves
time.
Change-Id: I0796695271ec7863bb23e5bc292876e1e845e0fe
Official documentation states:
If the value of async: is not high enough, this will cause the "check
on it later" task to fail because the temporary status file that the
async_status: is looking for will not have been written or no longer
exist.
And the example listed there uses async=1000.
Raise the value of 300 used in tripleo_container_manage for creating
new containers so it appears to be "high enough"...
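A minimal sketch of the async pattern in question; the container details and
retry values are illustrative:
  - name: Create the container asynchronously
    podman_container:
      name: example
      image: quay.io/tripleo/example   # illustrative image
      state: started
    async: 1000
    poll: 0
    register: create_async_results

  - name: Check on the async creation later
    async_status:
      jid: "{{ create_async_results.ansible_job_id }}"
    register: create_result
    until: create_result.finished
    retries: 60
    delay: 10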
Change-Id: I598d230bfbb5696f1c0799295d9e456eade56676
Closes-bug: #1861093
Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
This will ensure all boolean values accepted by Ansible are
trapped and handled correctly.
Change-Id: Ifbc293860cc81c0c689b127b2da8930f557a6ad5
Signed-off-by: Kevin Carter <kecarter@redhat.com>