Correct race in static node reregistration

The static node driver has two paths that reregister nodes. The first
runs periodically as part of the cleanup process to synchronize our
internal node representation with the configuration. The second is
triggered when a node is deleted, in order to reregister it.
Unfortunately, this second path did not check whether something else
(the cleanup routine) had already reregistered a node before it
registered the node itself. The end result was duplicate registrations
for a single node.

This logging illustrates what is happening:

  17:22:03,387 [StaticNodeProvider] Static driver cleanup leaked resources beginning
  17:22:03,387 [StaticNodeProvider] Static driver cleanup leaked resources lock acquired
  17:22:03,389 [StaticNodeProvider] Static node deletion handler beginning
  17:22:03,390 [StaticNodeProvider] Provider static-provider pool main syncing nodes with config
  17:22:03,392 [StaticNodeProvider] Registered static node NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,392 [StaticNodeProvider] Provider static-provider pool main syncing nodes with config complete
  17:22:03,392 [StaticNodeProvider] Static driver cleanup leaked resources complete
  17:22:03,392 [StaticNodeProvider] Static node deletion handler lock acquired
  17:22:03,396 [StaticNodeProvider] Re-registering deleted node: NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,398 [StaticNodeProvider] Registered static node NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,398 [StaticNodeProvider] Static node deletion handler complete

With a simplified view:

    A cleanupLeakedResources begins
  L A cleanupLeakedResources lock acquired
    B nodeDeletedNotification begins
  L A syncNodeCount begins
  L A registerNodeFromConfig
  L A syncNodeCount ends
    A cleanupLeakedResources ends
  L B nodeDeletedNotification lock acquired
  L B nodeDeletedNotification re-registers node
  L B registerNodeFromConfig
    B nodeDeletedNotification ends
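The interleaving above is an unguarded check-then-act: the deletion
handler decides to reregister before acquiring the lock, and by the time
it acts the cleanup pass has already registered the node. A minimal
sequential sketch of that outcome (hypothetical names; the real driver
records registrations in ZooKeeper, not a list):

```python
# Hypothetical stand-in for the registration store.
registrations = []

def register(node):
    # Both paths register unconditionally, so each call adds an entry.
    registrations.append(node)

# A: cleanup's syncNodeCount sees the node missing and registers it.
register("fake-host-1")
# B: the deletion handler decided to reregister *before* A ran; once it
# finally acquires the lock it registers the same node again.
register("fake-host-1")

# The node now has two registrations.
assert registrations.count("fake-host-1") == 2
```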

To address this, the deleted node reregistration method now checks
whether the node being reregistered has already been updated in the
slot info. If so, that implies the cleanup process has already
reregistered it, and we can short circuit.
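A self-contained sketch of that guard (the names _getSlot and
_node_slots follow the diff below; the class scaffolding and the
dict-based nodes are assumptions for illustration):

```python
class Driver:
    """Hypothetical stand-in for StaticNodeProvider."""

    def __init__(self):
        # node_tuple -> list of nodes, indexed by slot.
        self._node_slots = {}

    def _getSlot(self, node):
        # Stand-in: the real driver derives this from registration data.
        return node["slot"]

    def should_reregister(self, node, node_tuple):
        try:
            existing_node_slot = self._getSlot(node)
        except Exception:
            # Let config synchronization correct any slot changes.
            return False
        # A different node occupying our slot means the periodic
        # cleanup already reregistered it; short circuit.
        return node == self._node_slots[node_tuple][existing_node_slot]

d = Driver()
key = ("fake-host-1", "zuul", 22022)
deleted = {"slot": 0, "id": "deleted"}
fresh = {"slot": 0, "id": "fresh"}

# Cleanup already put a fresh node in slot 0: do not reregister.
d._node_slots[key] = [fresh]
assert not d.should_reregister(deleted, key)

# Slot still holds the node we are processing: reregister.
d._node_slots[key] = [deleted]
assert d.should_reregister(deleted, key)
```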

Change-Id: I08a77ec34f0a0080d7a6482e9ec9d0620929105e
Clark Boylan 2022-12-08 10:55:34 -08:00
parent de7306b3e9
commit bcfc44b4a4
1 changed file with 13 additions and 1 deletion


@@ -484,8 +484,8 @@ class StaticNodeProvider(Provider, QuotaSupport):
                 "Cannot get registered nodes for re-registration:"
             )
             return
-        slot = node.slot
+        slot = node.slot
         if slot is None:
             return
         # It's possible we were not able to de-register nodes due to a
@@ -494,6 +494,18 @@ class StaticNodeProvider(Provider, QuotaSupport):
         if slot >= static_node["max-parallel-jobs"]:
             return
+        # The periodic cleanup process may have just reregistered
+        # this node. When that happens the node in the node slot is
+        # different than the one we are processing and we can short
+        # circuit.
+        try:
+            existing_node_slot = self._getSlot(node)
+        except Exception:
+            # We'll let config synchronization correct any slots changes
+            return
+        if node != self._node_slots[node_tuple][existing_node_slot]:
+            return
+
         self.log.debug("Re-registering deleted node: %s", node_tuple)
         try:
             pool = self.provider.pools[node.pool]