Correct race in static node reregistration

The static node driver has two paths that reregister nodes. The first
runs periodically as part of the cleanup process to synchronize our
internal node representation with the configuration. The second is
triggered when a node is deleted, in order to reregister it.
Unfortunately, this second path did not check whether something else
(the cleanup routine) had already reregistered a node before it
registered the node itself. The end result was duplicate registrations
for a single node.

This logging illustrates what is happening:

  17:22:03,387 [StaticNodeProvider] Static driver cleanup leaked resources beginning
  17:22:03,387 [StaticNodeProvider] Static driver cleanup leaked resources lock acquired
  17:22:03,389 [StaticNodeProvider] Static node deletion handler beginning
  17:22:03,390 [StaticNodeProvider] Provider static-provider pool main syncing nodes with config
  17:22:03,392 [StaticNodeProvider] Registered static node NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,392 [StaticNodeProvider] Provider static-provider pool main syncing nodes with config complete
  17:22:03,392 [StaticNodeProvider] Static driver cleanup leaked resources complete
  17:22:03,392 [StaticNodeProvider] Static node deletion handler lock acquired
  17:22:03,396 [StaticNodeProvider] Re-registering deleted node: NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,398 [StaticNodeProvider] Registered static node NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,398 [StaticNodeProvider] Static node deletion handler complete

With a simplified view:

    A cleanupLeakedResources begins
  L A cleanupLeakedResources lock acquired
    B nodeDeletedNotification begins
  L A syncNodeCount begins
  L A registerNodeFromConfig
  L A syncNodeCount ends
    A cleanupLeakedResources ends
  L B nodeDeletedNotification lock acquired
  L B nodeDeletedNotification re-registers node
  L B registerNodeFromConfig
    B nodeDeletedNotification ends
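The interleaving above is an unguarded check-then-act: the deletion
handler decides to reregister before acquiring the lock, and by the time
it acts the cleanup pass has already registered the node. A minimal
sequential sketch of that outcome (hypothetical names; the real driver
records registrations in ZooKeeper, not a list):

```python
# Hypothetical stand-in for the registration store.
registrations = []

def register(node):
    # Both paths register unconditionally, so each call adds an entry.
    registrations.append(node)

# A: cleanup's syncNodeCount sees the node missing and registers it.
register("fake-host-1")
# B: the deletion handler decided to reregister *before* A ran; once it
# finally acquires the lock it registers the same node again.
register("fake-host-1")

# The node now has two registrations.
assert registrations.count("fake-host-1") == 2
```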

To address this, the deleted node reregistration method now checks
whether the node being reregistered has already been updated in the
slot info. If so, that implies the cleanup process has already
reregistered it, and we can short circuit.
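A self-contained sketch of that guard (the names _getSlot and
_node_slots follow the diff below; the class scaffolding and the
dict-based nodes are assumptions for illustration):

```python
class Driver:
    """Hypothetical stand-in for StaticNodeProvider."""

    def __init__(self):
        # node_tuple -> list of nodes, indexed by slot.
        self._node_slots = {}

    def _getSlot(self, node):
        # Stand-in: the real driver derives this from registration data.
        return node["slot"]

    def should_reregister(self, node, node_tuple):
        try:
            existing_node_slot = self._getSlot(node)
        except Exception:
            # Let config synchronization correct any slot changes.
            return False
        # A different node occupying our slot means the periodic
        # cleanup already reregistered it; short circuit.
        return node == self._node_slots[node_tuple][existing_node_slot]

d = Driver()
key = ("fake-host-1", "zuul", 22022)
deleted = {"slot": 0, "id": "deleted"}
fresh = {"slot": 0, "id": "fresh"}

# Cleanup already put a fresh node in slot 0: do not reregister.
d._node_slots[key] = [fresh]
assert not d.should_reregister(deleted, key)

# Slot still holds the node we are processing: reregister.
d._node_slots[key] = [deleted]
assert d.should_reregister(deleted, key)
```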

Change-Id: I08a77ec34f0a0080d7a6482e9ec9d0620929105e
Clark Boylan 2022-12-08 10:55:34 -08:00
parent de7306b3e9
commit bcfc44b4a4
1 changed file with 13 additions and 1 deletion


@@ -484,8 +484,8 @@ class StaticNodeProvider(Provider, QuotaSupport):
                 "Cannot get registered nodes for re-registration:"
             )
             return
-        slot = node.slot
+        slot = node.slot
         if slot is None:
             return
         # It's possible we were not able to de-register nodes due to a
@@ -494,6 +494,18 @@ class StaticNodeProvider(Provider, QuotaSupport):
         if slot >= static_node["max-parallel-jobs"]:
             return
+        # The periodic cleanup process may have just reregistered
+        # this node. When that happens the node in the node slot is
+        # different than the one we are processing and we can short
+        # circuit.
+        try:
+            existing_node_slot = self._getSlot(node)
+        except Exception:
+            # We'll let config synchronization correct any slots changes
+            return
+        if node != self._node_slots[node_tuple][existing_node_slot]:
+            return
+
         self.log.debug("Re-registering deleted node: %s", node_tuple)
         try:
             pool = self.provider.pools[node.pool]