Guarantee that ovn-dbs-pcmk update_tasks are run when the cluster is up

On particular role compositions, the code joining the update_tasks might order
things differently then on a typical 3ctrl control plane and the ovn-dbs tasks
at step1 (which require the cluster to be up) will happen after the pacemaker
task at step1 which stops the cluster.

So we can observe something like the following:
2021-09-10 10:05:13.370339 | 001c2891-506d-f833-ff5a-000000000954 | TASK | Change the bundle operation timeout
2021-09-10 10:05:14.136798 | 001c2891-506d-f833-ff5a-000000000954 | CHANGED | Change the bundle operation timeout | ovn-db-01
2021-09-10 10:05:14.137982 | 001c2891-506d-f833-ff5a-000000000954 | TIMING | Change the bundle operation timeout | ovn-db-01 | 0:00:54.808754 | 0.77s
2021-09-10 10:05:14.146853 | 001c2891-506d-f833-ff5a-000000000956 | TASK | Acquire the cluster shutdown lock to stop pacemaker cluster
2021-09-10 10:05:14.508085 | 001c2891-506d-f833-ff5a-000000000956 | CHANGED | Acquire the cluster shutdown lock to stop pacemaker cluster | ovn-db-01
2021-09-10 10:05:14.509257 | 001c2891-506d-f833-ff5a-000000000956 | TIMING | Acquire the cluster shutdown lock to stop pacemaker cluster | ovn-db-01 | 0:00:55.180032 | 0.36s
2021-09-10 10:05:14.518668 | 001c2891-506d-f833-ff5a-000000000957 | TASK | Stop pacemaker cluster
2021-09-10 10:05:18.559627 | 001c2891-506d-f833-ff5a-000000000957 | CHANGED | Stop pacemaker cluster | ovn-db-01

2021-09-10 10:05:18.560561 | 001c2891-506d-f833-ff5a-000000000957 | TIMING | Stop pacemaker cluster | ovn-db-01 | 0:00:59.231336 | 4.04s
2021-09-10 10:05:18.569161 | 001c2891-506d-f833-ff5a-000000000958 | TASK | Start pacemaker cluster
2021-09-10 10:05:18.627924 | 001c2891-506d-f833-ff5a-000000000958 | SKIPPED | Start pacemaker cluster | ovn-db-01
2021-09-10 10:05:18.628678 | 001c2891-506d-f833-ff5a-000000000958 | TIMING | Start pacemaker cluster | ovn-db-01 | 0:00:59.299453 | 0.06s
2021-09-10 10:05:18.637292 | 001c2891-506d-f833-ff5a-000000000959 | TASK | Release the cluster shutdown lock
2021-09-10 10:05:18.694945 | 001c2891-506d-f833-ff5a-000000000959 | SKIPPED | Release the cluster shutdown lock | ovn-db-01
2021-09-10 10:05:18.695717 | 001c2891-506d-f833-ff5a-000000000959 | TIMING | Release the cluster shutdown lock | ovn-db-01 | 0:00:59.366493 | 0.06s
2021-09-10 10:05:18.704368 | 001c2891-506d-f833-ff5a-00000000095a | TASK | Clear ovndb cluster pacemaker error
2021-09-10 10:05:19.368816 | 001c2891-506d-f833-ff5a-00000000095a | FATAL | Clear ovndb cluster pacemaker error | ovn-db-01 | error={"changed": true, "cmd": "pcs resource cleanup ovn-dbs-bundle", "delta": "0:00:00.399084", "end": "2021-09-10 10:05:20
.044985", "msg": "non-zero return code", "rc": 1, "start": "2021-09-10 10:05:19.645901", "stderr": "Error: Unable to forget failed operations of resource: ovn-dbs-bundle\nError connecting to the CIB manager: Transport endpoint is not connected\nError perf
orming operation: Transport endpoint is not connected", "stderr_lines": ["Error: Unable to forget failed operations of resource: ovn-dbs-bundle", "Error connecting to the CIB manager: Transport endpoint is not connected", "Error performing operation: Tran
sport endpoint is not connected"], "stdout": "", "stdout_lines": []}

We cannot call pcs resource cleanup at step1, we must call it at step0 so we're
guaranteed that the cluster is up, no matter how heat/ansible decide to order
the update_tasks.

Note: This is the short-term less-invasive fix. The mid-long term fix
should be around verifying that we can now remove those workarounds
that were implemented for OVN bugs.

Closes-Bug: #1943254
Change-Id: Idd827f72c0033978db7b9a8ea6acec2086cda961
This commit is contained in:
Michele Baldessari 2021-09-10 14:34:34 +02:00
parent 7679d5e0fe
commit ff3173bdcf
1 changed files with 17 additions and 17 deletions

View File

@ -269,6 +269,23 @@ outputs:
- {get_param: CertificateKeySize}
ca: ipa
update_tasks:
# When a schema change happens, the newer slaves don't connect
# back to the older master and end up timing out. So we clean
# up the error here until we get a fix for
# https://bugzilla.redhat.com/show_bug.cgi?id=1759974
- name: Clear ovndb cluster pacemaker error
shell: "pcs resource cleanup ovn-dbs-bundle"
when:
- step|int == 0
# Then we ban the resource for this node. It has no effect on
# the first two controllers, but when we reach the last one,
# it avoids a cut in the control plane as master get chosen in
# one of the updated Stopped ovn. They are in error, that why
# we need the cleanup just before.
- name: Ban ovndb resource on the current node.
shell: "pcs resource ban ovn-dbs-bundle $(hostname | cut -d. -f1)"
when:
- step|int == 0
- name: Tear-down non-HA ovn-dbs containers
when:
- step|int == 1
@ -282,23 +299,6 @@ outputs:
- ovn_north_db_server
- ovn_south_db_server
- ovn_northd
# When a schema change happens, the newer slaves don't connect
# back to the older master and end up timing out. So we clean
# up the error here until we get a fix for
# https://bugzilla.redhat.com/show_bug.cgi?id=1759974
- name: Clear ovndb cluster pacemaker error
shell: "pcs resource cleanup ovn-dbs-bundle"
when:
- step|int == 1
# Then we ban the resource for this node. It has no effect on
# the first two controllers, but when we reach the last one,
# it avoids a cut in the control plane as master get chosen in
# one of the updated Stopped ovn. They are in error, that why
# we need the cleanup just before.
- name: Ban ovndb resource on the current node.
shell: "pcs resource ban ovn-dbs-bundle $(hostname | cut -d. -f1)"
when:
- step|int == 1
- name: ovn-dbs fetch and retag container image for pacemaker
when:
- step|int == 3