tripleo-heat-templates/deployment/ovn
Sofer Athlan-Guyot d9c60ab05e Workaround ovn cluster failure during update when schema change.
During an update the ovndb server can have a schema change. The problem
is that an updated slave ovndb won't connect to a master which
still has the old db schema.  After the timeout (200000ms) pacemaker
puts the resource in a Timed Out error state, then waits for the
operator to clean up the resource.  This means the update can go like this:

 - Original state, nothing updated (M = Master, S = Slave, F = Failed):
   - ctl0-M-old
   - ctl1-S-old
   - ctl2-S-old
 - First state: after update of ctl0
   - ctl0-F-new
   - ctl1-M-old
   - ctl2-S-old
 - Second state: after update of ctl1
   - ctl0-F-new
   - ctl1-F-new
   - ctl2-M-old
 - Third and final state: after update of ctl2
   - ctl0-F-new
   - ctl1-F-new
   - ctl2-M-new

During the third state we have a cut in the control plane as ctl2 is
the master and there is no slave to fall back to. We then end up
losing HA as only one node is active.  The error persists after a
reboot.  Only a pcs resource cleanup will bring the cluster back online.
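
The state listing above can be replayed with a small simulation. This is only a sketch of the failure mode as described here, not of the real pacemaker or OCF agent logic: the `Node` class, `update_node` and `rolling_update` are hypothetical names, and the rule "an updated slave fails when the master still has the old schema" is the only behavior modeled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str    # "M" master, "S" slave, "F" failed
    schema: str  # "old" or "new"


def update_node(cluster, index):
    """Update one controller and apply the failure rule described above."""
    node = cluster[index]
    was_master = node.role == "M"
    node.schema = "new"
    if was_master:
        # The master role moves to a healthy slave if one is left.
        slaves = [n for n in cluster if n is not node and n.role == "S"]
        if slaves:
            slaves[0].role = "M"
            node.role = "S"  # the updated node tries to rejoin as a slave
        else:
            node.role = "M"  # nobody to fail over to: it stays master
    master = next((n for n in cluster if n.role == "M"), None)
    # Modeled rule: a slave cannot connect to a master on another schema.
    if node.role == "S" and master is not None and master.schema != node.schema:
        node.role = "F"


def describe(cluster):
    return [f"{n.name}-{n.role}-{n.schema}" for n in cluster]


def rolling_update():
    """Return the cluster state before the update and after each node."""
    cluster = [
        Node("ctl0", "M", "old"),
        Node("ctl1", "S", "old"),
        Node("ctl2", "S", "old"),
    ]
    states = [describe(cluster)]
    for index in range(len(cluster)):
        update_node(cluster, index)
        states.append(describe(cluster))
    return states


if __name__ == "__main__":
    for state in rolling_update():
        print(" ".join(state))
```

Each entry returned by `rolling_update()` corresponds to one of the four states listed above, in the same order.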

The real solution will come from ovndb and the associated ocf agent,
but in the meantime we work around it with:
 - a cleanup of the resource;
 - a ban of the resource;
in step 1 and:
 - a cleanup of the resource;
 - an unban of the resource;
in step 5.

This has the net effect of preventing the cut in the control plane for
the last node, as we move the master to an updated controller, which
then forms a cluster of one master and one slave (as two are updated).
The last one will happily join once it is updated.
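
As a rough illustration of the step 1 and step 5 actions, the sketch below only builds the pcs command lines and does not run them. The resource name `ovn-dbs-bundle`, the per-node ban, and the helper names are assumptions made for the example, not taken from the template; `pcs resource clear` is used as the "unban" since that is the pcs subcommand that removes a ban constraint.

```python
import shlex

# Assumed resource name, for illustration only.
RESOURCE = "ovn-dbs-bundle"


def step1_commands(node, resource=RESOURCE):
    """Commands sketched for update step 1: cleanup, then ban."""
    return [
        ["pcs", "resource", "cleanup", resource],
        ["pcs", "resource", "ban", resource, node],
    ]


def step5_commands(node, resource=RESOURCE):
    """Commands sketched for update step 5: cleanup, then unban."""
    return [
        ["pcs", "resource", "cleanup", resource],
        # "clear" removes the constraint created by "ban".
        ["pcs", "resource", "clear", resource, node],
    ]


if __name__ == "__main__":
    # "ctl0" is a hypothetical node name matching the listing above.
    for argv in step1_commands("ctl0") + step5_commands("ctl0"):
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes it easy to review them before handing them to something like `subprocess.run`.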

That means:
 - we always have either 1 or 2 nodes working;
 - we end the update with the cluster converged back to a stable
 state.

The problems are:
 - we could hide a real ovndb cluster issue;
 - if the update breaks in between, we could have a leftover ban on one
 of the nodes.

But, all things considered, this looks like the best compromise for
the time being.

Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
Closes-Bug: #1847780
(cherry picked from commit 751b3fc096)
2019-10-16 04:43:03 +00:00
ovn-controller-container-puppet.yaml Add the ability to configure ovn-remote-probe-interval 2019-07-29 17:15:59 +02:00
ovn-dbs-container-puppet.yaml Move containers-common.yaml into deployment 2019-04-14 18:15:12 -04:00
ovn-dbs-pacemaker-puppet.yaml Workaround ovn cluster failure during update when schema change. 2019-10-16 04:43:03 +00:00
ovn-metadata-container-puppet.yaml Merge "Move containers-common.yaml into deployment" 2019-04-15 17:57:15 +00:00