Fix race condition with enabling SG on many ports at once

When many ports have security groups enabled at about the same time, there can be a race between refreshing the resource_cache with data fetched by a "pull" call to neutron-server and data received in a "push" RPC message from neutron-server.
In that case, when the "push" message arrives with information about an updated port (with port_security enabled), the port is already updated in the local cache, so the local AFTER_UPDATE notification is not triggered for that port and its firewall rules are not updated.

This happened quite often in the fullstack security groups test, because that test creates 4 ports and updates all 4, one by one, to apply a SG to them.
Here is what happens in detail:

1. port 1 is updated in neutron-server, so it sends a push notification to the L2 agent to update security groups,
2. port 1 info is saved in the resource cache on the L2 agent's side and the agent starts to configure security groups for this port,
3. as one of the steps, the L2 agent calls the SecurityGroupServerAPIShim._select_ips_for_remote_group() method; in that method RemoteResourceCache.get_resources() is called, and it asks neutron-server for details about the ports in the given security_group,
4. in the meantime neutron-server gets a port update call for the second port (with the same security group), so it sends the L2 agent information about 2 ports (as a reply to the request sent by the L2 agent in step 3),
5. the resource cache stores the information about both ports in the local cache and returns the data to SecurityGroupServerAPIShim._select_ips_for_remote_group(), and all looks fine,
6. but now the L2 agent receives a push notification saying that port 2 was updated (its security groups changed), so it checks the info about this port in the local cache,
7. in the local cache, the info about port 2 already contains the updated security group, so RemoteResourceCache doesn't trigger the local AFTER_UPDATE notification for the port and the L2 agent doesn't know that the security groups for this port should be changed.

This patch fixes it by changing how items are updated in the resource_cache.
Items are now updated through the record_resource_update() method instead of by writing new values directly to the resource_cache._type_cache dict.
As a result, if a resource is updated during a "pull" call to neutron-server, the local AFTER_UPDATE is still triggered for that resource.

Change-Id: I5a62cc5731c5ba571506a3aa26303a1b0290d37b
Closes-Bug: #1742401
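The sequence described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the neutron implementation: `ToyCache`, `pull`, `push` and `simulate` are hypothetical names, resources are plain dicts, and the toy `record_resource_update()` is assumed to skip the notification whenever the stored copy already equals the incoming one.

```python
class ToyCache:
    """Hypothetical stand-in for RemoteResourceCache (illustration only)."""

    def __init__(self, use_record_update):
        self._store = {}          # resource id -> resource dict
        self.notifications = []   # ids for which a local AFTER_UPDATE "fired"
        self._use_record_update = use_record_update

    def record_resource_update(self, resource):
        # Assumption: no notification when nothing changed in the cache's view.
        if self._store.get(resource["id"]) == resource:
            return
        self._store[resource["id"]] = resource
        self.notifications.append(resource["id"])

    def pull(self, resources):
        # Models the cache refresh done with the reply to a "pull" call.
        for resource in resources:
            if self._use_record_update:
                # Behaviour after the patch: go through the notifying path.
                self.record_resource_update(resource)
            else:
                # Behaviour before the patch: silent direct write.
                self._store[resource["id"]] = resource

    def push(self, resource):
        # Models a "push" RPC message from the server.
        self.record_resource_update(resource)


def simulate(use_record_update):
    cache = ToyCache(use_record_update)
    port1 = {"id": 1, "security_groups": ["sg-a"]}
    port2 = {"id": 2, "security_groups": ["sg-a"]}
    cache.push(port1)           # steps 1-2: push for port 1
    cache.pull([port1, port2])  # steps 3-5: pull reply already contains port 2
    cache.push(port2)           # steps 6-7: push for port 2 arrives late
    return cache.notifications


print("direct write:", simulate(use_record_update=False))
print("record_resource_update:", simulate(use_record_update=True))
```

With the direct write, port 2 never produces a notification, because the late push finds an identical copy already in the cache. Routing the pull through the notifying path makes the update visible at pull time instead.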
2018-01-22 14:01:30 +01:00 · 2018-01-22 14:01:30 +01:00 · 725df3e038
parent bab1ae8812
commit 725df3e038
2 changed files with 18 additions and 1 deletions
--- a/neutron/agent/resource_cache.py
+++ b/neutron/agent/resource_cache.py
@@ -80,7 +80,7 @@ class RemoteResourceCache(object):
                # been updated already and pushed to us in another thread.
                LOG.debug("Ignoring stale update for %s: %s", rtype, resource)
                continue
-            self._type_cache(rtype)[resource.id] = resource
+            self.record_resource_update(context, rtype, resource)
        LOG.debug("%s resources returned for queries %s", len(resources),
                  query_ids)
        self._satisfied_server_queries.update(query_ids)
--- a/neutron/tests/unit/agent/test_resource_cache.py
+++ b/neutron/tests/unit/agent/test_resource_cache.py
@@ -55,22 +55,39 @@ class RemoteResourceCacheTestCase(base.BaseTestCase):
        self.assertIsNone(self.rcache.get_resource_by_id('goose', 2))

    def test__flood_cache_for_query_pulls_once(self):
+        resources = [OVOLikeThing(66), OVOLikeThing(67)]
+        received_kw = []
+        receiver = lambda *a, **k: received_kw.append(k)
+        registry.subscribe(receiver, 'goose', events.AFTER_UPDATE)
+
+        self._pullmock.bulk_pull.side_effect = [
+            resources,
+            [resources[0]],
+            [resources[1]],
+            [resources[1]]
+        ]
+
        self.rcache._flood_cache_for_query('goose', id=(66, 67),
                                           name=('a', 'b'))
        self._pullmock.bulk_pull.assert_called_once_with(
            mock.ANY, 'goose',
            filter_kwargs={'id': (66, 67), 'name': ('a', 'b')})
+
        self._pullmock.bulk_pull.reset_mock()
        self.rcache._flood_cache_for_query('goose', id=(66, ), name=('a', ))
        self.assertFalse(self._pullmock.called)
        self.rcache._flood_cache_for_query('goose', id=(67, ), name=('b', ))
        self.assertFalse(self._pullmock.called)
+
        # querying by just ID should trigger a new call since ID+name is a more
        # specific query
        self.rcache._flood_cache_for_query('goose', id=(67, ))
        self._pullmock.bulk_pull.assert_called_once_with(
            mock.ANY, 'goose', filter_kwargs={'id': (67, )})

+        self.assertItemsEqual(
+            resources, [rec['updated'] for rec in received_kw])
+
    def test_bulk_pull_doesnt_wipe_out_newer_data(self):
        self.rcache.record_resource_update(
            self.ctx, 'goose', OVOLikeThing(1, revision_number=5))