cinder/cinder
Gorka Eguileor ed0be0c8fa Fix: Race between attachment and volume deletion
There are cases where requests to delete an attachment made by Nova can
race other third-party requests to delete the overall volume.

This has been observed when running cinder-csi, where it first requests
that Nova detaches a volume before itself requesting that the overall
volume is deleted once it becomes `available`.

This is a Cinder race condition and, like most race conditions, it is
not simple to explain.

Some context on the issue:

- Cinder API uses the volume "status" field as a locking mechanism to
  prevent concurrent request processing on the same volume.

- Most cinder operations are asynchronous, so the API returns before the
  operation has been completed by the cinder-volume service, but the
  attachment operations such as creating/updating/deleting an attachment
  are synchronous, so the API only returns to the caller after the
  cinder-volume service has completed the operation.

- Our current code **incorrectly** modifies the status of the volume
  both on the cinder-volume and the cinder-api services on the
  attachment delete operation.
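
The "status as a lock" idea from the first bullet can be sketched as a
compare-and-swap on the status field. This is a minimal illustration, not
Cinder's code: `FakeVolumeTable`, `conditional_update` and
`api_delete_volume` are hypothetical stand-ins, and an in-memory dict
replaces the database.

```python
import threading


class InvalidVolume(Exception):
    """Raised when the volume status does not allow the operation."""


class FakeVolumeTable:
    """In-memory stand-in for the volumes table (hypothetical)."""

    def __init__(self):
        self._rows = {}
        # Plays the role of the atomicity a real DB UPDATE ... WHERE gives.
        self._mutex = threading.Lock()

    def create(self, volume_id, status="available"):
        self._rows[volume_id] = status

    def status(self, volume_id):
        return self._rows[volume_id]

    def conditional_update(self, volume_id, expected, new_status):
        """Set new_status only if the current status is in `expected`.

        Returns True if the row was updated, False otherwise.
        """
        with self._mutex:
            if self._rows.get(volume_id) not in expected:
                return False
            self._rows[volume_id] = new_status
            return True


def api_delete_volume(table, volume_id):
    """API-side entry point: claim the volume by flipping its status."""
    claimed = table.conditional_update(
        volume_id, {"available", "error"}, "deleting")
    if not claimed:
        raise InvalidVolume(f"volume {volume_id} is busy or not deletable")
    return "accepted"


table = FakeVolumeTable()
table.create("vol-1")
first = api_delete_volume(table, "vol-1")
try:
    second = api_delete_volume(table, "vol-1")
except InvalidVolume:
    second = "rejected"
```

The second request is rejected because the first one already moved the
status away from the accepted set. The lock only protects a volume while
its status is outside that set, which is why setting it to "available"
too early matters in the sequence described next.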

The actual sequence of events that leads to the issue reported in this
bug is:

[Cinder-CSI]
- Requests Nova to detach volume (Request R1)

[Nova]
- R1: Asks cinder-api to delete the attachment and **waits**

[Cinder-API]
- R1: Checks the status of the volume
- R1: Sends terminate connection request (R1) to cinder-volume and
  **waits**

[Cinder-Volume]
- R1: Asks the driver to terminate the connection
- R1: The driver asks the backend to unmap and unexport the volume
- R1: The last attachment is removed from the DB and the status of the
      volume is changed in the DB to "available"

[Cinder-CSI]
- Checks that there are no attachments in the volume and asks Cinder to
  delete it (Request R2)

[Cinder-API]

- R2: Checks that the volume's status is valid. It doesn't have
  attachments and is available, so it can be deleted.
- R2: Tells cinder-volume to delete the volume and returns immediately.

[Cinder-Volume]
- R2: Volume is deleted and DB entry is deleted
- R1: Finish the termination of the connection

[Cinder-API]
- R1: Now that cinder-volume has finished the termination, the code
  continues
- R1: Tries to modify the volume in the DB
- R1: DB layer raises VolumeNotFound since the volume has been deleted
  from the DB
- R1: VolumeNotFound is converted to HTTP 404 status code which is
  returned to Nova

[Nova]
- R1: Cinder responds with 404 on the attachment delete request
- R1: Nova leaves the volume as attached, since the attachment delete
  failed

At this point the Cinder and Nova DBs are out of sync, because Nova
thinks that the volume is still attached, while Cinder has detached the
volume and even deleted it.
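
The interleaving above can be replayed deterministically in a few lines.
This is only a sketch under simplifying assumptions: `FakeDB`,
`VolumeNotFound` (as defined here) and `simulate_race` are hypothetical
stand-ins, the two requests are run as plain sequential steps instead of
separate services, and 200 is used as a placeholder for any success code.

```python
class VolumeNotFound(Exception):
    """Stand-in for the error the DB layer raises for a missing volume."""


class FakeDB:
    """In-memory stand-in for the Cinder DB (hypothetical)."""

    def __init__(self):
        self.volumes = {
            "vol-1": {"status": "in-use", "attachments": ["att-1"]},
        }

    def update_status(self, volume_id, status):
        if volume_id not in self.volumes:
            raise VolumeNotFound(volume_id)
        self.volumes[volume_id]["status"] = status


def simulate_race():
    db = FakeDB()
    events = []

    # [Cinder-Volume] R1: last attachment removed, volume set to available.
    db.volumes["vol-1"]["attachments"].remove("att-1")
    db.update_status("vol-1", "available")
    events.append("R1: volume detached and available")

    # [Cinder-API] R2: available and without attachments, so delete passes.
    vol = db.volumes["vol-1"]
    assert vol["status"] == "available" and not vol["attachments"]
    # [Cinder-Volume] R2: the volume and its DB entry are deleted.
    del db.volumes["vol-1"]
    events.append("R2: volume deleted")

    # [Cinder-API] R1: the call returned; the API touches the volume again.
    try:
        db.update_status("vol-1", "available")
        http_status = 200  # placeholder for "success"
    except VolumeNotFound:
        http_status = 404
    events.append(f"R1: attachment delete returns {http_status}")

    # [Nova] R1: a failed attachment delete leaves the volume as attached.
    nova_thinks_attached = http_status >= 400
    return events, http_status, nova_thinks_attached


events, http_status, nova_thinks_attached = simulate_race()
```

In this replay the only step that fails is the API-side update after the
volume service had already exposed the volume as "available", which
matches the description of the status being modified on both services.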

Hardening is also being done on the Nova side [2] to accept that the
volume attachment may be gone.
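
The general idea behind that kind of hardening can be sketched as a
delete that treats "already gone" as success. This is not Nova's actual
code (the review topic in [2] is the authoritative source);
`AttachmentNotFound` and `detach_tolerating_missing` are hypothetical
names used only for illustration.

```python
class AttachmentNotFound(Exception):
    """Hypothetical error for an HTTP 404 from the volume service."""


def detach_tolerating_missing(delete_attachment, attachment_id):
    """Delete an attachment, accepting that it may already be gone.

    `delete_attachment` is any callable that raises AttachmentNotFound
    when the volume service answers 404.
    """
    try:
        delete_attachment(attachment_id)
    except AttachmentNotFound:
        # The desired end state (no attachment) is already reached.
        return "already-gone"
    return "deleted"


def fake_delete_missing(attachment_id):
    raise AttachmentNotFound(attachment_id)


def fake_delete_ok(attachment_id):
    return None


result_missing = detach_tolerating_missing(fake_delete_missing, "att-1")
result_ok = detach_tolerating_missing(fake_delete_ok, "att-1")
```

Either outcome lets the caller clear its own record of the attachment,
so the two databases do not drift apart on a 404.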

This patch fixes the issue mentioned above, but there is a request on
Cinder-CSI [1] to use Nova as the source of truth regarding its
attachments that, when implemented, would also fix the issue.

[1]: https://github.com/kubernetes/cloud-provider-openstack/issues/1645
[2]: https://review.opendev.org/q/topic:%2522bug/1937084%2522+project:openstack/nova

Closes-Bug: #1937084
Change-Id: Iaf149dadad5791e81a3c0efd089d0ee66a1a5614
(cherry picked from commit 2ec2222841)
2022-01-17 16:24:27 +01:00
api Reject bad img formats for uploaded encrypted vols 2022-01-11 18:09:57 +00:00
backup Merge "db: Remove 'db' argument from various managers" 2021-09-15 22:53:02 +00:00
brick LVM: Retry lvextend commands on code 139 2021-08-18 12:29:01 -04:00
cmd Merge "db: Vendor 'oslo_db.sqlalchemy.migration'" 2021-09-10 18:11:04 +00:00
common Merge "Change 'host' option from HostAddressOpt to StrOpt" 2021-09-08 18:27:48 +00:00
compute db: Remove 'db' argument from various managers 2021-08-27 15:13:21 +01:00
db Fix: Race between attachment and volume deletion 2022-01-17 16:24:27 +01:00
group db: Remove 'db_driver' option 2021-08-27 15:13:21 +01:00
image mypy: image cache 2021-08-10 10:26:39 -04:00
interface Replace getargspec with getfullargspec 2021-05-13 09:22:14 +08:00
keymgr Introduce flake8-import-order extension 2020-01-06 09:59:35 -06:00
locale Imported Translations from Zanata 2021-03-24 06:25:01 +00:00
message Add user messages for backup operations 2021-08-27 05:44:42 -04:00
objects Expose volume_attachments in Volume OVO 2022-01-17 16:24:20 +01:00
policies Fix typo in message policy deprecations 2021-10-14 17:19:38 +00:00
privsep Enable flake8-logging-format extension 2020-01-09 14:35:20 -06:00
scheduler Merge "Add user messages for backup operations" 2021-09-04 00:53:58 +00:00
tests Fix: Race between attachment and volume deletion 2022-01-17 16:24:27 +01:00
transfer db: Remove 'db_driver' option 2021-08-27 15:13:21 +01:00
volume Fix: Race between attachment and volume deletion 2022-01-17 16:24:27 +01:00
wsgi Introduce flake8-import-order extension 2020-01-06 09:59:35 -06:00
zonemanager Brocade: Fix lookup UnboundLocalError 2020-08-07 15:24:44 +02:00
__init__.py
context.py Merge "Add infrastructure for testing new RBAC policies" 2021-09-06 17:44:47 +00:00
coordination.py Remove file locks once we delete a resource 2021-08-04 10:41:33 -04:00
exception.py mypy: annotate volume_utils / utils / exc 2021-04-30 10:41:30 -04:00
flow_utils.py mypy: create_volume flows 2021-08-10 10:26:39 -04:00
i18n.py
manager.py mypy: Fix unused type: ignore in manager.py 2021-09-16 10:25:39 -04:00
opts.py Merge "db: Remove 'db_driver' option" 2021-09-15 21:32:03 +00:00
policy.py Add infrastructure for testing new RBAC policies 2021-08-31 15:41:17 -07:00
quota.py Merge "Remove six from quota.py" 2021-04-20 07:49:02 +00:00
quota_utils.py Modify/Move project validation methods to api_utils 2021-04-05 08:00:40 -04:00
rpc.py mypy: continued manager, scheduler, rpcapi 2021-08-11 08:36:09 -04:00
service.py Fix typo on service cluster change method 2020-05-06 19:36:07 -05:00
service_auth.py
ssh_utils.py Remove six in files under cinder/* 2020-10-08 14:00:14 +08:00
utils.py db: Vendor 'oslo_db.sqlalchemy.migration' 2021-08-27 15:13:21 +01:00
version.py