Close OVN VIP race by adding an ordering constraint

Currently there is a race in the high availability of OVN when resetting a
controller. The VIP that OVN uses (the internal_api VIP by default) only
has a colocation constraint with the master role of the ovn-dbs resource.
This leaves the following race open:
1) We reboot ctrl-0, which hosts the master role of ovn-dbs
2) OVN becomes master on ctrl-1 from pacemaker's POV (but the
   promotion operation running in the background is not completed)
3) OVN VIP moves to ctrl-1 even though the DB there is still in slave
   mode (there is only a colocation constraint between the VIP and the
   master role for ovn)
4) OVN controllers on the overcloud connect to the VIP but the DB behind
   it is in read-only mode because it is still a slave
5) OVN controllers that connected at 4) stay in read-only mode
   indefinitely, until they get restarted manually.

With the addition of this constraint we force the VIP to move only after
the master role has been promoted. This makes it much less likely
for a client to connect to the VIP and get a read-only DB behind it.
With only this patch applied I did not manage to reproduce
the issue (even after 7 reboots of controllers).
Note that there is still a small race window possible because the
current OVN resource agent has a bug: it promotes a resource to master
after issuing the promotion command to the DB but without waiting for
this promotion to complete. A patch for OVN-ra will also be submitted
but from initial testing this change seems to be largely sufficient.
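
As a rough illustration of the difference the constraint makes, here is a
toy model in Python. It is only a sketch and not pacemaker's scheduler:
OvnDbsNode and vip_may_start are hypothetical names, and the model ignores
everything except the two conditions discussed above.

```python
from dataclasses import dataclass


@dataclass
class OvnDbsNode:
    """Toy view of one controller as seen by the cluster (hypothetical)."""
    name: str
    assigned_master: bool    # pacemaker has picked this node for the master role
    promote_finished: bool   # the promote action has returned on this node


def vip_may_start(node: OvnDbsNode, with_order_constraint: bool) -> bool:
    """Return True if the VIP is allowed to start on `node` in this toy model."""
    # Colocation alone only asks: is the master role placed on this node?
    if not node.assigned_master:
        return False
    if not with_order_constraint:
        return True
    # The ordering "promote ovn-dbs-bundle, then start the VIP" additionally
    # waits for the promote action to have finished.
    return node.promote_finished


# Step 3) of the race: ctrl-1 was chosen as master, promote still running.
ctrl1 = OvnDbsNode("ctrl-1", assigned_master=True, promote_finished=False)
print(vip_may_start(ctrl1, with_order_constraint=False))
print(vip_may_start(ctrl1, with_order_constraint=True))
```

With colocation only, the model lets the VIP start while the promotion is
still running; with the ordering added, it does not. Note that
promote_finished stands for the promote action returning, which, given the
resource agent bug described above, can still be slightly earlier than the
DB actually becoming writable.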

Also note that this change introduces a small, less desirable
side-effect:
A failover of the internal VIP will now take a bit longer because it
will happen only after ovn-dbs gets promoted to master.
We plan to take care of this fully by decoupling the OVN VIP from the
internal_api one. This change addresses the immediate issue related
to ovn_controllers being stuck in read-only due to premature promotion.
(OVN upstream is discussing how to make connections to a read-only DB
behind the VIP eventually trigger a reconnection.)

Closes-Bug: #1835830

Change-Id: I3fa07e28c4e37197890664d12a265f1673c780f2
(cherry picked from commit 5c10f33197)
Michele Baldessari 2019-07-08 08:53:27 +02:00 committed by Alex Schultz
parent acbede7584
commit dc4bb7e7cb
1 changed files with 11 additions and 1 deletions

@@ -160,9 +160,19 @@ sb_master_port=${sb_db_port} manage_northd=yes inactive_probe_interval=180000",
         tries => $pcs_tries,
       }
+      pacemaker::constraint::order { "${ovndb_vip_resource_name}-with-${ovndb_servers_resource_name}":
+        first_resource => 'ovn-dbs-bundle',
+        second_resource => "${ovndb_vip_resource_name}",
+        first_action => 'promote',
+        second_action => 'start',
+        constraint_params => 'kind=Optional',
+        tries => $pcs_tries,
+      }
       Pacemaker::Resource::Bundle['ovn-dbs-bundle']
         -> Pacemaker::Resource::Ocf["${ovndb_servers_resource_name}"]
-          -> Pacemaker::Constraint::Colocation["${ovndb_vip_resource_name}-with-${ovndb_servers_resource_name}"]
+          -> Pacemaker::Constraint::Order["${ovndb_vip_resource_name}-with-${ovndb_servers_resource_name}"]
+            -> Pacemaker::Constraint::Colocation["${ovndb_vip_resource_name}-with-${ovndb_servers_resource_name}"]
     }
   }
 }