From 850c99cf8b7a7cb7ec5a88196b4170d1984e512d Mon Sep 17 00:00:00 2001
From: Rajat Dhasmana <rajatdhasmana@gmail.com>
Date: Mon, 3 Jun 2024 15:14:44 +0530
Subject: [PATCH] [func test] Fix race between attachment delete and server
 delete

Recently openstacksdk functional test, test_volume_attachment,
started failing frequently.
It mostly failed during the tearDown step trying to delete
the volume as the volume delete was already issued by
server delete (which it shouldn't be).

Looking into the issue, I found out the problem to be in
a race between the BDM record of instance being deleted (during
volume attachment delete) and trying to delete the server.
The sequence of operations that trigger this issue are:

1. Delete volume attachment
2. Wait for volume to become available
3. Delete server

In step (2), nova sends a request to Cinder to delete the
volume attachment[1], making the volume in available state[2], BUT
the operation is still ongoing on nova side to delete the BDM
record[3].
Hence we end up in a race, where nova is trying to delete the
BDM record and we issue a server delete (overlapping request), which
in turn consumes that BDM record and sends request to (which it shouldn't):

1. delete attachment (which is already deleted, hence returns 404)
2. delete volume

Later when the functional test issue another request to delete the volume,
we fail since the volume is already in the process of being deleted
(by the server delete operation -- delete_on_termination is set to true).

This analysis can yield a number of fixes in nova and cinder, namely:

1. Nova to prevent the race of BDM being deleted and being used at the same time.
2. Cinder to detect the volume being deleted and return success for
subsequent delete requests (and not fail with 400 BadRequest).

This patch focuses on fixing this on the SDK side where the flow
of operations happens too fast triggering this race condition.

We introduce a wait mechanism to wait for the VolumeAttachment resource
to be deleted and later verify that the number of attachments for the
server to be 0 before moving to the tearDown that deletes the server
and the volume.

there is a 1 second gap race happening which can be seen here:

1. server delete starting at 17:13:49

2024-06-05 17:13:49,892 openstack.iterate_timeout        ****Timeout is 300 --- wait is 2.0 --- start time is 1717607629.892198 ----
2024-06-05 17:13:49,892 openstack.iterate_timeout        $$$$ Count is 1 --- time difference is 299.99977254867554
2024-06-05 17:13:50,133 openstack.iterate_timeout        Waiting 2.0 seconds

2. BDM being deleted at 17:13:50
(already used by server delete to do attachment and volume delete calls)

*************************** 2. row ***************************
            created_at: 2024-06-05 17:13:11
            ...
            deleted_at: 2024-06-05 17:13:50
            ...
            device_name: /dev/vdb
            volume_id: c13a3070-c5ab-4c8a-bb7e-5c7527fdf0df
            attachment_id: a1280ca9-4f88-49f7-9ba2-1e796688ebcc
            instance_uuid: 98bc13b2-50fe-4681-b263-80abf08929ac
            ...

[1] https://opendev.org/openstack/nova/src/commit/7dc4b1ea627d864a0ee2745cc9de4336fc0ba7b5/nova/virt/block_device.py#L553
[2] https://opendev.org/openstack/cinder/src/commit/9f1292ad066c2c69066e791f9f5b914bcf1c4425/cinder/volume/api.py#L2685
[3] https://opendev.org/openstack/nova/src/commit/7dc4b1ea627d864a0ee2745cc9de4336fc0ba7b5/nova/compute/manager.py#L7658-L7659

Closes-Bug: #2067869
Change-Id: Ia59df9640d778bec4b22e608d111f82b759ac610
---
 openstack/resource.py                              |  6 ++++--
 .../compute/v2/test_volume_attachment.py           | 14 ++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/openstack/resource.py b/openstack/resource.py
index 60f377ecf..27db32702 100644
--- a/openstack/resource.py
+++ b/openstack/resource.py
@@ -2503,8 +2503,10 @@ def wait_for_delete(session, resource, interval, wait, callback=None):
             resource = resource.fetch(session, skip_cache=True)
             if not resource:
                 return orig_resource
-            if resource.status.lower() == 'deleted':
-                return resource
+            # Some resources like VolumeAttachment don't have status field.
+            if hasattr(resource, 'status'):
+                if resource.status.lower() == 'deleted':
+                    return resource
         except exceptions.NotFoundException:
             return orig_resource
 
diff --git a/openstack/tests/functional/compute/v2/test_volume_attachment.py b/openstack/tests/functional/compute/v2/test_volume_attachment.py
index c65da7696..1e357c855 100644
--- a/openstack/tests/functional/compute/v2/test_volume_attachment.py
+++ b/openstack/tests/functional/compute/v2/test_volume_attachment.py
@@ -137,3 +137,17 @@ class TestServerVolumeAttachment(ft_base.BaseComputeTest):
             status='available',
             wait=self._wait_for_timeout,
         )
+
+        # Wait for the attachment to be deleted.
+        # This is done to prevent a race between the BDM
+        # record being deleted and we trying to delete the server.
+        self.user_cloud.compute.wait_for_delete(
+            volume_attachment,
+            wait=self._wait_for_timeout,
+        )
+
+        # Verify the server doesn't have any volume attachment
+        volume_attachments = list(
+            self.user_cloud.compute.volume_attachments(self.server)
+        )
+        self.assertEqual(0, len(volume_attachments))