[func test] Fix race between attachment delete and server delete

Recently the openstacksdk functional test, test_volume_attachment,
started failing frequently.
It mostly failed during the tearDown step while trying to delete
the volume, because a volume delete had already been issued by
the server delete (which it shouldn't be).

Looking into the issue, I found the problem to be a race between
the instance's BDM record being deleted (during the volume
attachment delete) and the server delete. The sequence of operations
that triggers this issue is as follows (a rough SDK-level sketch
follows the list):

1. Delete volume attachment
2. Wait for volume to become available
3. Delete server
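
In openstacksdk terms the sequence roughly corresponds to the calls
below (a minimal sketch; conn, server and volume are illustrative
names, and the attachment is created with delete_on_termination set
to true):

    # Rough sketch of the failing flow; names and exact argument
    # order are illustrative and may vary across SDK versions.
    conn.compute.delete_volume_attachment(server, volume)  # (1)
    conn.block_storage.wait_for_status(
        volume, status='available'
    )  # (2)
    conn.compute.delete_server(server)  # (3) races with BDM cleanup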

During step (2), nova sends a request to Cinder to delete the
volume attachment[1], which puts the volume into the available
state[2], BUT the operation to delete the BDM record is still
ongoing on the nova side[3].
Hence we end up in a race: while nova is still deleting the BDM
record, we issue a server delete (an overlapping request), which in
turn consumes that BDM record and sends requests (which it shouldn't) to:

1. delete attachment (which is already deleted, hence returns 404)
2. delete volume

Later, when the functional test issues another request to delete the
volume, we fail since the volume is already in the process of being
deleted (by the server delete operation -- delete_on_termination is
set to true).

This analysis suggests a number of possible fixes in nova and cinder,
namely:

1. Nova could prevent the race between the BDM record being deleted
and being consumed at the same time.
2. Cinder could detect that the volume is already being deleted and
return success for subsequent delete requests (instead of failing
with 400 BadRequest).

This patch focuses on fixing this on the SDK side, where the flow
of operations happens too fast, triggering the race condition.

We introduce a wait mechanism that waits for the VolumeAttachment
resource to be deleted and then verifies that the number of
attachments on the server is 0 before moving on to the tearDown
that deletes the server and the volume.
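
For clarity, the wait boils down to polling the resource until the
fetch raises NotFoundException, since VolumeAttachment exposes no
status field to check. A simplified sketch of the logic (not the
exact wait_for_delete implementation; the helper name is made up):

    import time

    from openstack import exceptions

    def wait_until_attachment_gone(session, resource, interval=2,
                                   wait=300):
        # Poll until the resource disappears; for resources like
        # VolumeAttachment a 404 from fetch() is the only reliable
        # deletion signal.
        deadline = time.time() + wait
        while time.time() < deadline:
            try:
                resource = resource.fetch(session, skip_cache=True)
            except exceptions.NotFoundException:
                return resource  # gone: safe to delete the server
            # Fall back to the status check when the field exists.
            status = getattr(resource, 'status', '') or ''
            if status.lower() == 'deleted':
                return resource
            time.sleep(interval)
        raise exceptions.ResourceTimeout(
            'resource was not deleted in time')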

There is a 1-second gap in which the race happens, which can be
seen here:

1. Server delete starting at 17:13:49

2024-06-05 17:13:49,892 openstack.iterate_timeout        ****Timeout is 300 --- wait is 2.0 --- start time is 1717607629.892198 ----
2024-06-05 17:13:49,892 openstack.iterate_timeout        $$$$ Count is 1 --- time difference is 299.99977254867554
2024-06-05 17:13:50,133 openstack.iterate_timeout        Waiting 2.0 seconds

2. BDM being deleted at 17:13:50
(already consumed by the server delete to issue the attachment and
volume delete calls)

*************************** 2. row ***************************
            created_at: 2024-06-05 17:13:11
            ...
            deleted_at: 2024-06-05 17:13:50
            ...
            device_name: /dev/vdb
            volume_id: c13a3070-c5ab-4c8a-bb7e-5c7527fdf0df
            attachment_id: a1280ca9-4f88-49f7-9ba2-1e796688ebcc
            instance_uuid: 98bc13b2-50fe-4681-b263-80abf08929ac
            ...

[1] 7dc4b1ea62/nova/virt/block_device.py (L553)
[2] 9f1292ad06/cinder/volume/api.py (L2685)
[3] 7dc4b1ea62/nova/compute/manager.py (L7658-L7659)

Closes-Bug: #2067869
Change-Id: Ia59df9640d778bec4b22e608d111f82b759ac610

@@ -2503,8 +2503,10 @@ def wait_for_delete(session, resource, interval, wait, callback=None):
             resource = resource.fetch(session, skip_cache=True)
             if not resource:
                 return orig_resource
-            if resource.status.lower() == 'deleted':
-                return resource
+            # Some resources like VolumeAttachment don't have a status field.
+            if hasattr(resource, 'status'):
+                if resource.status.lower() == 'deleted':
+                    return resource
         except exceptions.NotFoundException:
             return orig_resource

@@ -137,3 +137,17 @@ class TestServerVolumeAttachment(ft_base.BaseComputeTest):
             status='available',
             wait=self._wait_for_timeout,
         )
+
+        # Wait for the attachment to be deleted.
+        # This is done to prevent a race between the BDM record
+        # being deleted and us trying to delete the server.
+        self.user_cloud.compute.wait_for_delete(
+            volume_attachment,
+            wait=self._wait_for_timeout,
+        )
+
+        # Verify the server doesn't have any volume attachments.
+        volume_attachments = list(
+            self.user_cloud.compute.volume_attachments(self.server)
+        )
+        self.assertEqual(0, len(volume_attachments))