Merge "Add spec to use cinder's new attachment API"

2017-09-07 20:29:46 +00:00
parent 9c11c4acf8 e64a344edc
commit 019285bbd7
1 changed files with 462 additions and 0 deletions
--- a/specs/queens/approved/cinder-new-attach-apis.rst
+++ b/specs/queens/approved/cinder-new-attach-apis.rst
@@ -0,0 +1,462 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+===================================
+Use Cinder's new Attach/Detach APIs
+===================================
+
+https://blueprints.launchpad.net/nova/+spec/cinder-new-attach-apis
+
+Make Nova use Cinder's new attach/detach APIs.
+
+Problem description
+===================
+
+In attempting to implement Cinder multi-attach and trying to get live
+migration working with all drivers, it has become clear that the interaction
+between Cinder and Nova is not well understood. That leads both to bugs and
+to problems when trying to evolve the interaction between the two projects.
+
+Let's create a new, clean interface between Nova and Cinder.
+
+You can see details on the new Cinder API here:
+http://specs.openstack.org/openstack/cinder-specs/specs/ocata/add-new-attach-apis.html
+
+Use Cases
+---------
+
+The main API actions to consider are:
+
+* Attach a volume to an instance, including during spawning an instance,
+  and calling os-brick to (optionally) connect the volume backend to the
+  hypervisor.
+  The connect is optional because when there is a shared connection from the
+  host to the volume backend, the backend may already be attached.
+* Detach volume from an instance, including (optionally) calling os-brick to
+  disconnect the volume from the hypervisor host.
+* Live-migrate an instance, which involves setting up the volume connection
+  on the destination host before kicking off the live-migrate, then removing
+  the source host connections once the live-migrate has completed. If there
+  is a rollback, the destination host connection is removed.
+* Migrate and resize are very similar to live-migrate, from this new view of
+  the world.
+* Evacuate, we know the old host is no longer running, and we need to attach
+  the volume to a new host.
+* Shelve, we want the volume to stay logically attached to the instance, but
+  we also need to detach it from the host when the instance is offloaded.
+* For the shelved-offloaded case, the volume is in a 'reserved' state and is
+  not physically attached.
+* Attach/Detach a volume to/from a shelved instance
+* Use swap volume to migrate a volume between two different Cinder backends.
+
+In particular, please note:
+
+* Volume attachment is specific to a host uuid, instance uuid, and volume uuid
+* You can have multiple attachments to the same volume, to different instances
+  (on the same host or different hosts), when the volume is marked
+  multi_attach=True
+* For the same instance uuid and volume uuid, you can have connections on two
+  different hosts, even when multi_attach=False. This is generally used when
+  moving a VM.
+* Volume connections on a host can be shared with other volumes that are
+  connected to the same volume backend, depending on the chosen driver.
+  As such, we need to take care when managing that connection: we must not
+  add two connections by mistake, or remove an in-use connection too early.
+  Cinder needs to provide extra information to Nova, in particular, for each
+  attachment, if the connection is shared, and if so, who that connection is
+  currently shared with.
+
+Proposed change
+===============
+
+Cinder now has two different API flows for attach/detach. We need a way to
+switch from the old API to the new API without affecting any existing
+instances.
+
+Firstly, we need to decide when it is safe to use the new API. We need to have
+the Cinder v3 API configured, and that endpoint should have the micro-version
+v3.44 available. In addition we should only use the new API when all of the
+nova-compute nodes have been upgraded. We can detect that by looking up the
+minimum service version relating to when we add the support for the new
+Cinder API. Note, this means we will need to increment the service version so
+we can explicitly detect the support for the new Cinder API.
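+
+A minimal sketch of the gating check described above. The function names and
+the ``MIN_COMPUTE_SERVICE_VERSION`` value are hypothetical; this spec does not
+state the actual service version number, and real Nova code may structure the
+check differently.
+
```python
# Microversion named in the text above.
REQUIRED_CINDER_MICROVERSION = (3, 44)

# Assumption: placeholder for the nova-compute service version that first
# supports the new attach flow. The spec does not give the real number.
MIN_COMPUTE_SERVICE_VERSION = 24


def parse_microversion(text):
    """Turn a string such as '3.44' into a comparable (major, minor) tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_use_new_attach_api(cinder_max_microversion, min_compute_version):
    """Return True only when both conditions from the spec hold.

    cinder_max_microversion: highest microversion the configured Cinder v3
        endpoint reports, or None when no v3 endpoint is configured.
    min_compute_version: lowest service version across all nova-compute
        services in the deployment.
    """
    if cinder_max_microversion is None:
        return False
    cinder_ok = (parse_microversion(cinder_max_microversion)
                 >= REQUIRED_CINDER_MICROVERSION)
    computes_ok = min_compute_version >= MIN_COMPUTE_SERVICE_VERSION
    return cinder_ok and computes_ok
```
+
+Comparing parsed tuples rather than strings matters here: as strings, "3.9"
+would sort after "3.44".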
+
+If we allow the use of the new API, we can use that for all new attachments.
+When adding a new attachment we:
+
+* (api) call attachment_create, with no connector, before API call returns.
+  BDM record is updated with attachment_id.
+  Note, if the volume is not multi_attach=True, it will only allow one
+  instance_uuid to be associated with each volume. While the long term aim
+  is to enable multi-attach, this spec will not attach to any volume that has
+  multi_attach=True. While we could still make a single attachment to the
+  volume, as we rely on cinder to restrict the number of attachments to the
+  volume, for safety we shouldn't allow any attachments if multi_attach=True
+  until we have that support fully implemented in Nova.
+* (compute) get connector info and use that to call attachment_update.
+  The API now returns with all the information that needs to be given to
+  os-brick to attach the volume backend, and how to attach the VM to that
+  connection to the volume backend.
+* (compute) Before we can actually connect to the volume we need to wait for
+  the volume to be ready and fully provisioned. If we time out waiting for the
+  volume to be ready, we fail here and delete the attachment. If this is the
+  first boot of the instance, that will put the instance into the ERROR state.
+  If the volume is ready, we can continue with the attach process.
+* (compute) use os-brick to connect to the volume backend.
+  If there are any errors, attempt to call os-brick disconnect
+  (to double check it is fully cleaned up) and then remove the attachment
+  in Cinder. If there are any issues in the rollback, put instance into the
+  ERROR state.
+* (compute) now the backend is connected, and the volume is ready, we can
+  attach the backend connection to the VM in the usual way.
+* (compute) we call attachment_complete to mark the attachment and volume
+  'attached' when all the above operations are successfully completed.
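+
+The ordering of the steps above can be sketched as follows. This is an
+illustration only: ``FakeCinder`` is a hypothetical in-memory stand-in whose
+method names mirror the calls named in this spec, the BDM is a plain dict, and
+the os-brick and hypervisor operations are passed in as callables. Failures
+during rollback simply propagate here; setting the instance to ERROR is left
+to the caller.
+
```python
import uuid


class FakeCinder:
    """Hypothetical in-memory stand-in for the Cinder attachment calls."""

    def __init__(self):
        self.attachments = {}

    def attachment_create(self, volume_id, instance_uuid):
        attachment_id = str(uuid.uuid4())
        self.attachments[attachment_id] = {
            "volume_id": volume_id, "instance_uuid": instance_uuid,
            "connector": None, "status": "reserved"}
        return attachment_id

    def attachment_update(self, attachment_id, connector):
        self.attachments[attachment_id]["connector"] = connector
        # Made-up connection info; a real backend returns driver details.
        return {"driver_volume_type": "fake", "data": {}}

    def attachment_complete(self, attachment_id):
        self.attachments[attachment_id]["status"] = "attached"

    def attachment_remove(self, attachment_id):
        del self.attachments[attachment_id]


def attach_volume(cinder, bdm, connector, wait_until_ready,
                  connect, disconnect, attach_to_vm):
    # (api) reserve the volume for this instance, with no connector yet.
    bdm["attachment_id"] = cinder.attachment_create(
        bdm["volume_id"], bdm["instance_uuid"])
    # (compute) hand over the connector and get the connection info back.
    connection_info = cinder.attachment_update(bdm["attachment_id"], connector)
    # (compute) wait for the volume; on timeout delete the attachment.
    if not wait_until_ready(bdm["volume_id"]):
        cinder.attachment_remove(bdm["attachment_id"])
        bdm["attachment_id"] = None
        raise RuntimeError("timed out waiting for volume")
    # (compute) connect to the backend; on error disconnect and roll back.
    try:
        connect(connection_info)
    except Exception:
        disconnect(connection_info)
        cinder.attachment_remove(bdm["attachment_id"])
        bdm["attachment_id"] = None
        raise
    # (compute) attach to the VM, then mark the attachment as complete.
    attach_to_vm(connection_info)
    cinder.attachment_complete(bdm["attachment_id"])
    return connection_info
```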
+
+For a detach:
+
+* (compute) if attachment_id is set in the BDM, we use the new detach flow,
+  otherwise we fall back to the old detach flow. The new flow is...
+* (api) usual checks to see if request is valid
+* (compute) detach volume from VM; if that fails, stop the request here
+* (compute) call os-brick to disconnect from the volume backend
+* (compute) if success, attachment_remove is called.
+  If there was an error, we add an instance fault
+  and set the instance into the error state.
+
+As above, we can use the presence of the attachment_id in the BDM to decide
+if the attachment was made using the new or old flow. Long term we want to
+migrate all existing attachments to a new style attachment, but this is left
+for a later spec.
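+
+A minimal sketch of the detach decision and ordering, assuming the BDM and
+instance are plain dicts and that the hypervisor, os-brick and Cinder
+operations are supplied as callables. All names are hypothetical, and the
+old flow is represented only by a return value.
+
```python
def detach_volume(bdm, instance, detach_from_vm, disconnect,
                  attachment_remove):
    """Return which path was taken: 'old-flow', 'new-flow' or 'error'."""
    # No attachment_id in the BDM means the attachment was made the old way.
    if bdm.get("attachment_id") is None:
        return "old-flow"

    # Detach from the VM first; an exception here stops the request
    # before anything is changed on the host or in Cinder.
    detach_from_vm(bdm)

    # Disconnect from the volume backend. On error, record a fault and
    # put the instance into the error state without touching Cinder.
    try:
        disconnect(bdm)
    except Exception as exc:
        instance["fault"] = str(exc)
        instance["vm_state"] = "error"
        return "error"

    # Only after a successful disconnect is the attachment removed.
    attachment_remove(bdm["attachment_id"])
    bdm["attachment_id"] = None
    return "new-flow"
```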
+
+Live-migrate
+------------
+
+During live-migration, we start the process by ensuring the volume is attached
+on both the source and destination. When a volume is multi_attach=False, and
+we are about to start live-migrating VM1, you get a situation like this ::
+
+    +-------------------+   +-------------------+
+    |                   |   |                   |
+    | +------------+    |   | +--------------+  |
+    | |VM1 (active)|    |   | |VM1 (inactive)|  |
+    | +---+--------+    |   | +--+-----------+  |
+    |     |             |   |    |              |
+    |     | Host 1      |   |    |  Host 2      |
+    +-------------------+   +-------------------+
+          |                      |
+          +-----------+----------+
+                      |
+                      |
+         +---------------------------+
+         |            |              |
+         |  +---------+---------+    |
+         |  | VolA              |    |
+         |  +-------------------+    |
+         |                           |
+         |    Cinder Backend 1       |
+         |                           |
+         +---------------------------+
+
+Note, in cinder we end up with two attachments for this multi_attach=False
+volume:
+
+* attachment 1: VolA, VM1, Host 1
+* attachment 2: VolA, VM1, Host 2
+
+Logically we have two attachments to the one non-multi-attach volume. Both
+attachments are related to VM1, but there is an attachment for both the
+source and destination host for the duration of the live-migration.
+Note both attachments are associated with the same instance uuid,
+which is why the two attachments are allowed even though multi_attach=False.
+
+Should the live-migration succeed, we will delete attachment 1 (i.e. source
+host attachment, host 1) and we are left with just attachment 2
+(i.e. destination host attachment, host 2). If there are any failures with
+os-brick disconnect on the source host, we put the instance into the ERROR
+state and don't delete the attachment in Cinder. We do this to signal to the
+operator that something needs manually fixing. We also put the migration into
+the error state, as we would even if a failure had a clean rollback.
+
+If we have any failures in the live-migration such that the instance is still
+running on host 1, we do the opposite of the above. We attempt os-brick
+disconnect on host 2. If that succeeds, we delete attachment 2; otherwise we
+put the instance into the ERROR state. If the rollback succeeds we are back
+to one attachment again, but in this case it's attachment 1.
+
+So for volumes that have an attachment_id in their BDM, we follow this new
+flow of API calls to Cinder:
+
+* (destination) get connector, and create new attachment
+* (destination) attach the volume backend
+* (source) kicks off live-migration
+
+If live-migration succeeds:
+
+* (source) call os-brick to disconnect
+* (source) if success, delete the attachment, otherwise put the
+  instance into an ERROR state
+
+If live-migration rolls back due to an abort or similar:
+
+* (destination) call os-brick to disconnect
+* (destination) if success, delete the attachment, otherwise put the
+  instance into an ERROR state
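+
+The clean-up rule for both outcomes can be sketched as below, assuming
+attachments are represented as dicts with a ``host`` key and that the
+os-brick disconnect is supplied as a callable. The function name and the
+returned state strings are hypothetical.
+
```python
def finish_live_migration(attachments, source_host, dest_host,
                          succeeded, disconnect):
    """Return (remaining_attachments, instance_state) after a migration.

    On success the source host attachment is stale; on rollback the
    destination host attachment is stale.
    """
    stale_host = source_host if succeeded else dest_host
    stale = next(a for a in attachments if a["host"] == stale_host)
    try:
        disconnect(stale)
    except Exception:
        # Keep the attachment so the operator can see what needs fixing.
        return list(attachments), "ERROR"
    remaining = [a for a in attachments if a is not stale]
    return remaining, "ACTIVE"
```
+
+For example, with attachments for VolA on Host 1 and Host 2, a successful
+migration leaves only the Host 2 attachment, and a clean rollback leaves only
+the Host 1 attachment.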
+
+Migrate
+-------
+
+Similar to live-migrate, at the start of the migration we have attachments
+for both the source and destination node. On confirm resize we detach on the
+source; on revert resize we detach on the destination.
+
+Evacuate
+--------
+
+When you call evacuate, and there is a volume that has an attachment_id in its
+BDM, we follow this new flow:
+
+* (source) Nothing happens on the source; it is assumed the administrator
+  has already fenced the host, and confirmed that by calling force host down.
+* (destination) Create a second attachment for this instance_uuid for
+  any attached volumes
+* (destination) Follow the usual volume attach flow
+* (destination) Now delete the old attachment to ensure Cinder cleans up any
+  resources relating to that connection. It is similar to how we call
+  terminate_connection today, except we must call this after creating the
+  new attachment to ensure the volume is always reserved to this instance
+  during the whole of the evacuate process.
+* (operator) should the source host ever be started again, the instances
+  that have been evacuated are detected in the usual way (using the
+  migration record created when evacuate is called). This may leave some
+  things not cleaned up by os-brick, but that is fairly safe, and we are in
+  a no worse situation than we are today.
+
+Shelve and Unshelve
+-------------------
+
+When a volume attached to an instance has an attachment_id in the BDM, we
+follow this new flow of calls to the Cinder API.
+Note: it is possible to have both old flow and new flow volumes attached to
+the one instance that is getting shelved.
+
+When offloading from an old host, we first add a new attachment (with no
+connector set) then perform a disconnect of the old attachment in the
+usual way. This ensures the volume is still attached to the instance,
+but is safely detached from the host we are offloading from. Should that
+detach fail, the instance should be moved into an ERROR state.
+
+Similarly, when it comes to unshelve, we update the existing attachments
+with the connector, before continuing with the usual attach volume flow.
+
+Swap Volume
+-----------
+
+For swap volume, we have one host, one instance, one device path, but
+multiple volumes.
+
+In this section, we talk about what happens should the volume being swapped
+have the attachment_id present in the BDM, and as such we follow the new flow.
+
+Firstly, there is the flow when cinder calls our API, secondly when a
+user calls our API. Both flows are covered here:
+
+* The Nova swap volume API is called to swap uuid-old with uuid-new
+
+    * The new volume may have been created by the user in cinder, and the
+      user may have made the Nova API call.
+    * Alternatively, the user may have called Cinder's migrate volume API.
+      That means cinder has created the new volume, and calls the Nova API on
+      the user's behalf.
+
+* (api) create new attachment for the volume uuid-new, fail API call if we
+  can't create that attachment
+* (compute) update cinder attachment with connector for uuid-new
+* (compute) os-brick connect the new volume. If there is an error we
+  deal with this like a failure during attach, and delete the
+  attachment to the new volume
+* (compute) Nova copies content of volume uuid-old to volume uuid-new,
+  in libvirt this is via a rebase operation
+* (compute) once the copy is complete, we detach uuid-old from instance
+* (compute) update BDM so the attachment_id now points to the attachment
+  associated with uuid-new
+* (compute) once the old volume is detached, we do an os-brick disconnect
+* (compute) for a Nova initiated swap we don't call cinder's
+  migrate_volume_completion callback. We check the state of the volume in this
+  one case to ensure it's not 'retyping' or 'migrating'.
+* (compute) Update the BDM with a new volume-uuid, based on what
+  migrate_volume_completion has returned (when called). Note if cinder called
+  swap, it will have deleted the old volume, but renamed the new volume to have
+  the same uuid as the old volume had. If someone called Nova, we get back
+  uuid-new, and we update the BDM to reflect the change.
+* so on success we have created a new attachment to the new volume
+  and deleted the attachment to the old volume.
+
+Note: it is assumed if a volume is multi-attach, the swap operation will fail
+and not be allowed. That will be true in either the Cinder or Nova started
+case. In time we will likely move to Cinder's migrate_volume_completion API
+using attachment_ids instead of volume ids. This spec does not look at what is
+needed to support multi-attach, but this problem seemed worth noting here.
+
+Alternatives
+------------
+
+We could struggle on fixing bugs in a "whack a mole" way.
+
+There are several ways we could structure the API interactions. One of the
+key alternatives is to add lots of state machine complexity into the API so
+the shared connection related locking is handled by Cinder in the API layer.
+While it makes the clients more complex, it seemed simpler for Nova and other
+clients to do the locking discussed above.
+
+Nova could look up the attachment uuid rather than store it in the BDM.
+However, there is a period where the host uuid is not set, so it seems safer
+to store the attachment uuid to stop any possible confusion around which
+attachment is associated with each BDM.
+
+During live-migration we could store the additional attachment_ids in the
+migrate data, rather than as part of the BDM.
+
+We could continue to save the connection_info in the BDM to be used when we
+detach the volume. While this seems like it might help avoid issues with
+changes in the connection info that Nova hasn't been notified of, this is
+really a premature optimization. We should instead work with Cinder and
+os-brick to properly fix any such interaction problems in a way that helps
+all systems that work with Cinder.
+
+Data model impact
+-----------------
+
+When using the new API flow, we no longer need to store the connection_info,
+as we don't need to pass that back to Cinder. Instead we just store the
+attachment_id for each host the volume is attached to, and any time we need
+the connection_info we fetch that from Cinder.
+
+When an attachment_id is populated, we use the new flow to do all attach or
+detach operations. When not present, we use the old flow.
+
+REST API impact
+---------------
+
+No changes to Nova's REST API.
+
+Security impact
+---------------
+
+Nova no longer needs to store the volume connection information, however it is
+now available at any time from the Cinder API.
+
+Notifications impact
+--------------------
+
+None.
+
+Other end user impact
+---------------------
+
+None.
+
+Performance Impact
+------------------
+
+There should be no impact to performance. The focus here is stability across
+all drivers. There may be slightly more API calls between Nova and Cinder,
+but this is not expected to significantly impact performance.
+
+Other deployer impact
+---------------------
+
+To use this more stable API interaction, and the new features that will depend
+on this effort, deployers must upgrade Cinder to a version that supports the
+new API.
+
+It is expected we will drop support for older versions of Cinder within
+two release cycles of this work being completed.
+
+Developer impact
+----------------
+
+Nova and Cinder interactions should be better understood.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Ildiko Vancsa
+
+Other contributors:
+  Matt Riedemann
+  John Griffith
+  Steve Noyes
+
+Work Items
+----------
+
+To make progress in the previous cycle and this one, we needed to split this
+work into small patches. The overall strategy is that we implement new style
+attach last, because all the other operations depend on the attachment_id
+being in the BDM, which will not be true until the attach code is merged.
+
+* use Cinder v3 API
+* detect if the microversion that includes the new BDM support is present
+* detach a new style BDM/volume attach - Merged in Pike
+* reboot / rebuild (get connection info from cinder using attachment_id)
+* live-migration
+* migration
+* evacuate
+* shelve and unshelve
+* swap volume - Merged in Pike
+* attach (this means we now expose all the previous features)
+
+Note there are more steps before we can support multi-attach, but these are
+left for future specs:
+
+* migrate old BDMs to the new BDM flow
+* add explicit support for shared backend connections
+
+Dependencies
+============
+
+Depends on the Cinder work to add the new API.
+This was completed in Ocata.
+
+Testing
+=======
+
+We need to functionally test both old and new Cinder interactions. A new case
+was added to grenade that creates and attaches a volume to an instance before
+the upgrade, and detaches it after the upgrade. There is also an addition in
+Tempest to check the volume attachments after live migration. Beyond this,
+unit and functional tests are added in Nova to reach proper test coverage
+for the new flow.
+
+Documentation Impact
+====================
+
+We need to add good developer documentation around the updated
+Nova and Cinder interactions.
+
+References
+==========
+
+* Cinder API spec:
+  http://specs.openstack.org/openstack/cinder-specs/specs/ocata/add-new-attach-apis.html
+* Merged and open reviews:
+  https://review.openstack.org/#/q/topic:bp/cinder-new-attach-apis
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Pike
+     - Introduced
+   * - Queens
+     - Re-proposed