puppet-tripleo/manifests/profile/pacemaker
Michele Baldessari a0cfe0afdd Close OVN VIP race by adding an ordering constraint
Currently there is a race in OVN high availability when a controller is
reset. The VIP that OVN uses (the internal_api VIP by default) only has
a colocation constraint with the master role of the ovn-dbs resource.
This leaves the following race open:
1) We reboot ctrl-0, which hosts the master role of ovn-dbs.
2) OVN becomes master on ctrl-1 from pacemaker's point of view (but the
   promotion operation running in the background has not completed).
3) The OVN VIP moves to ctrl-1 even though it is still in slave mode
   (there is only a colocation constraint between the VIP and the
   master role for ovn).
4) OVN controllers on the overcloud connect to the VIP, but it is in
   read-only mode because it was a slave.
5) OVN controllers that connected at step 4) stay in read-only mode
   forever until they are restarted manually.

With the addition of this constraint we force the VIP to move only
after the master role has been promoted. This makes it much more
unlikely for a client to connect to the VIP and get a read-only db in
the background. With only this patch applied I did not manage to
reproduce the issue (even after 7 reboots of controllers).
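The pre-existing colocation constraint and the ordering constraint this
change adds can be sketched with `pcs` commands roughly as follows
(resource and VIP names here are illustrative, not the literal names
from the deployment; the actual change is applied through puppet in
ovn_dbs_bundle.pp):

```shell
# Pre-existing: keep the VIP on the node that runs the ovn-dbs master.
# On its own this lets the VIP move as soon as pacemaker assigns the
# master role, even if the background promotion has not finished.
pcs constraint colocation add ip-internal_api with master ovn-dbs-bundle

# Added by this change: start (i.e. move) the VIP only after the
# promotion of ovn-dbs to master has happened.
pcs constraint order promote ovn-dbs-bundle then start ip-internal_api
```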
Note that a small race window remains because the current OVN resource
agent has a bug: it reports a resource as promoted to master after
issuing the promotion command to the DB, without waiting for that
promotion to complete. A patch for the OVN resource agent will also be
submitted, but initial testing suggests this change alone is largely
sufficient.

Also note that this change introduces a small, less desirable
side-effect: a failover of the internal VIP will now take a bit longer,
because it happens only after ovn-dbs has been promoted to master.
We plan to address this fully by decoupling the OVN VIP from the
internal_api one. This change addresses the immediate issue of
ovn_controllers being stuck in read-only mode due to premature
promotion. (OVN upstream is discussing how to make connections to a
read-only VIP eventually trigger a reconnection.)

Closes-Bug: #1835830

Change-Id: I3fa07e28c4e37197890664d12a265f1673c780f2
(cherry picked from commit 5c10f33197)
(cherry picked from commit dc4bb7e7cb)
(cherry picked from commit bcea8ea1aa)
2019-08-12 10:06:20 +00:00
ceph Fix lint issues to upgrade to puppet-lint 2.3 2017-07-21 11:42:45 +02:00
cinder Force cinder properties to be set only on nodes with pcmk on it 2018-08-19 09:00:18 +02:00
database Allow mysql options to be set for the HA bundle 2019-06-08 15:18:34 +00:00
manila Allow external Ganesha for the cephfs manila backend 2019-06-04 11:21:03 +02:00
neutron Replace bootstrap_nodeid with SERVICE_short_bootstrap_node_name 2019-06-20 16:10:53 -04:00
ceph_nfs.pp Fix ceph-nfs duplicate property 2018-10-17 09:36:33 +00:00
clustercheck.pp Use clustercheck credentials to poll galera state in container 2017-07-31 12:39:51 -04:00
compute_instanceha.pp Instance HA support 2017-12-06 11:42:53 +01:00
haproxy.pp Configure VIPs for all networks including composable networks 2018-01-04 15:23:35 -05:00
haproxy_bundle.pp Fix up property names in case of mixed case hostnames 2018-05-30 04:21:40 +02:00
manila.pp Move manila backend configuration from pacemaker to base 2017-09-01 11:37:20 +02:00
ovn_dbs_bundle.pp Close OVN VIP race by adding an ordering constraint 2019-08-12 10:06:20 +00:00
ovn_northd.pp Fix generating connections to OVN db 2019-04-01 09:53:15 +02:00
rabbitmq.pp Make sure rabbitmq is fully up before creating any rabbitmq resources 2017-11-17 07:26:55 +01:00
rabbitmq_bundle.pp RabbitMQ: always allow promotion on HA queue during failover 2019-06-19 19:00:02 +02:00