[func test] Fix race between attachment delete and server delete

Recently the openstacksdk functional test, test_volume_attachment,
started failing frequently.
It mostly failed during the tearDown step while trying to delete
the volume, because a volume delete had already been issued by
the server delete (which it shouldn't be).

Looking into the issue, I found the problem to be a race between
the instance's BDM record being deleted (during the volume
attachment delete) and the server delete. The sequence of operations
that triggers this issue is as follows (a rough SDK-level sketch
follows the list):

1. Delete volume attachment
2. Wait for volume to become available
3. Delete server
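
In openstacksdk terms the sequence roughly corresponds to the calls
below (a minimal sketch; conn, server and volume are illustrative
names, and the attachment is created with delete_on_termination set
to true):

    # Rough sketch of the failing flow; names and exact argument
    # order are illustrative and may vary across SDK versions.
    conn.compute.delete_volume_attachment(server, volume)  # (1)
    conn.block_storage.wait_for_status(
        volume, status='available'
    )  # (2)
    conn.compute.delete_server(server)  # (3) races with BDM cleanup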

During step (2), nova sends a request to Cinder to delete the
volume attachment[1], which puts the volume into the available
state[2], BUT the operation to delete the BDM record is still
ongoing on the nova side[3].
Hence we end up in a race: while nova is still deleting the BDM
record, we issue a server delete (an overlapping request), which in
turn consumes that BDM record and sends requests (which it shouldn't) to:

1. delete attachment (which is already deleted, hence returns 404)
2. delete volume

Later, when the functional test issues another request to delete the
volume, we fail since the volume is already in the process of being
deleted (by the server delete operation -- delete_on_termination is
set to true).

This analysis suggests a number of possible fixes in nova and cinder,
namely:

1. Nova could prevent the race between the BDM record being deleted
and being consumed at the same time.
2. Cinder could detect that the volume is already being deleted and
return success for subsequent delete requests (instead of failing
with 400 BadRequest).

This patch focuses on fixing this on the SDK side, where the flow
of operations happens too fast, triggering the race condition.

We introduce a wait mechanism that waits for the VolumeAttachment
resource to be deleted and then verifies that the number of
attachments on the server is 0 before moving on to the tearDown
that deletes the server and the volume.
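
For clarity, the wait boils down to polling the resource until the
fetch raises NotFoundException, since VolumeAttachment exposes no
status field to check. A simplified sketch of the logic (not the
exact wait_for_delete implementation; the helper name is made up):

    import time

    from openstack import exceptions

    def wait_until_attachment_gone(session, resource, interval=2,
                                   wait=300):
        # Poll until the resource disappears; for resources like
        # VolumeAttachment a 404 from fetch() is the only reliable
        # deletion signal.
        deadline = time.time() + wait
        while time.time() < deadline:
            try:
                resource = resource.fetch(session, skip_cache=True)
            except exceptions.NotFoundException:
                return resource  # gone: safe to delete the server
            # Fall back to the status check when the field exists.
            status = getattr(resource, 'status', '') or ''
            if status.lower() == 'deleted':
                return resource
            time.sleep(interval)
        raise exceptions.ResourceTimeout(
            'resource was not deleted in time')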

There is a 1-second gap in which the race happens, which can be
seen here:

1. Server delete starting at 17:13:49

2024-06-05 17:13:49,892 openstack.iterate_timeout        ****Timeout is 300 --- wait is 2.0 --- start time is 1717607629.892198 ----
2024-06-05 17:13:49,892 openstack.iterate_timeout        $$$$ Count is 1 --- time difference is 299.99977254867554
2024-06-05 17:13:50,133 openstack.iterate_timeout        Waiting 2.0 seconds

2. BDM being deleted at 17:13:50
(already consumed by the server delete to issue the attachment and
volume delete calls)

*************************** 2. row ***************************
            created_at: 2024-06-05 17:13:11
            ...
            deleted_at: 2024-06-05 17:13:50
            ...
            device_name: /dev/vdb
            volume_id: c13a3070-c5ab-4c8a-bb7e-5c7527fdf0df
            attachment_id: a1280ca9-4f88-49f7-9ba2-1e796688ebcc
            instance_uuid: 98bc13b2-50fe-4681-b263-80abf08929ac
            ...

[1] 7dc4b1ea62/nova/virt/block_device.py (L553)
[2] 9f1292ad06/cinder/volume/api.py (L2685)
[3] 7dc4b1ea62/nova/compute/manager.py (L7658-L7659)

Closes-Bug: #2067869
Change-Id: Ia59df9640d778bec4b22e608d111f82b759ac610

@@ -2503,8 +2503,10 @@ def wait_for_delete(session, resource, interval, wait, callback=None):
             resource = resource.fetch(session, skip_cache=True)
             if not resource:
                 return orig_resource
-            if resource.status.lower() == 'deleted':
-                return resource
+            # Some resources like VolumeAttachment don't have a status field.
+            if hasattr(resource, 'status'):
+                if resource.status.lower() == 'deleted':
+                    return resource
         except exceptions.NotFoundException:
             return orig_resource

@@ -137,3 +137,17 @@ class TestServerVolumeAttachment(ft_base.BaseComputeTest):
             status='available',
             wait=self._wait_for_timeout,
         )
+
+        # Wait for the attachment to be deleted.
+        # This is done to prevent a race between the BDM record
+        # being deleted and us trying to delete the server.
+        self.user_cloud.compute.wait_for_delete(
+            volume_attachment,
+            wait=self._wait_for_timeout,
+        )
+
+        # Verify the server doesn't have any volume attachments.
+        volume_attachments = list(
+            self.user_cloud.compute.volume_attachments(self.server)
+        )
+        self.assertEqual(0, len(volume_attachments))