diff --git a/specs/queens/approved/cinder-new-attach-apis.rst b/specs/queens/approved/cinder-new-attach-apis.rst new file mode 100644 index 000000000..b85675b60 --- /dev/null +++ b/specs/queens/approved/cinder-new-attach-apis.rst @@ -0,0 +1,462 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +=================================== +Use Cinder's new Attach/Detach APIs +=================================== + +https://blueprints.launchpad.net/nova/+spec/cinder-new-attach-apis + +Make Nova use Cinder's new attach/detach APIs. + +Problem description +=================== + +In attempting to implement Cinder multi-attach and trying to get live +migration working with all drivers, it has become clear Cinder and Nova +interaction is not well understood, and that is leading to both bugs +and issues when trying to evolve the interaction between the two projects. + +Let's create a new clean interface between Nova and Cinder. + +You can see details on the new Cinder API here: +http://specs.openstack.org/openstack/cinder-specs/specs/ocata/add-new-attach-apis.html + +Use Cases +--------- + +The main API actions to consider are: + +* Attach a volume to an instance, including during spawning an instance, + and calling os-brick to (optionally) connect the volume backend to the + hypervisor. + The connect is optional because when there is a shared connection from the + host to the volume backend, the backend may already be attached. +* Detach a volume from an instance, including (optionally) calling os-brick to + disconnect the volume from the hypervisor host. +* Live-migrate an instance, which involves setting up the volume connection on the + destination host, before kicking off the live-migrate, then removing source + host connections once the live-migrate has completed. If there is a rollback + the destination host connection is removed. 
+* Migrate and resize are very similar to live-migrate, from this new view of + the world. +* Evacuate, we know the old host is no longer running, and we need to attach + the volume to a new host. +* Shelve, we want the volume to stay logically attached to the instance, but + we also need to detach it from the host when the instance is offloaded. +* For the shelved-offloaded case the volume is in a 'reserved' state and not + physically attached +* Attach/Detach a volume to/from a shelved instance +* Use swap volume to migrate a volume between two different Cinder backends. + +In particular, please note: + +* Volume attachment is specific to a host uuid, instance uuid, and volume uuid +* You can have multiple attachments to the same volume, to different instances + (on the same host or different hosts), when the volume is marked + multi_attach=True +* For the same instance uuid and volume uuid, you can have connections on two + different hosts, even when multi_attach=False. This is generally used when + moving a VM. +* Volume connections on a host can be shared with other volumes that are + connected to the same volume backend, depending on the chosen driver. + As such, we need to take care when removing that connection: we must not add + two connections by mistake, nor remove an in-use connection too early. + Cinder needs to provide extra information to Nova, in particular, for each + attachment, if the connection is shared, and if so, who that connection is + currently shared with. + +Proposed change +=============== + +Cinder now has two different API flows for attach/detach. We need a way to +switch from the old API to the new API without affecting any existing +instances. + +Firstly, we need to decide when it is safe to use the new API. We need to have +the Cinder v3 API configured, and that endpoint should have the micro-version +v3.44 available. In addition we should only use the new API when all of the +nova-compute nodes have been upgraded. 
We can detect that by looking up the +minimum service version relating to when we add the support for the new +Cinder API. Note, this means we will need to increment the service version so +we can explicitly detect the support for the new Cinder API. + +If we allow the use of the new API, we can use that for all new attachments. +When adding a new attachment we: + +* (api) call attachment_create, with no connector, before the API call returns. + BDM record is updated with attachment_id. + Note, if the volume is not multi_attach=True, it will only allow one + instance_uuid to be associated with each volume. While the long term aim + is to enable multi-attach, this spec will not attach to any volume that has + multi_attach=True. While we could still make a single attachment to the + volume, as we rely on cinder to restrict the number of attachments to the + volume, for safety we shouldn't allow any attachments if multi_attach=True + until we have that support fully implemented in Nova. +* (compute) get connector info and use that to call attachment_update. + The API now returns with all the information that needs to be given to + os-brick to attach the volume backend, and how to attach the VM to that + connection to the volume backend. +* (compute) Before we can actually connect to the volume we need to wait for + the volume to be ready and fully provisioned. If we time out waiting for the + volume to be ready, we fail here and delete the attachment. If this is the + first boot of the instance, that will put the instance into the ERROR state. + If the volume is ready, we can continue with the attach process. +* (compute) use os-brick to connect to the volume backend. + If there are any errors, attempt to call os-brick disconnect + (to double check it is fully cleaned up) and then remove the attachment + in Cinder. If there are any issues in the rollback, put the instance into the + ERROR state. 
+* (compute) now the backend is connected, and the volume is ready, we can + attach the backend connection to the VM in the usual way. +* (compute) we call attachment_complete to mark the attachment and volume + 'attached' when all the above operations are successfully completed. + +For a detach: + +* (compute) if attachment_id is set in the BDM, we use the new detach flow, + otherwise we fall back to the old detach flow. The new flow is... +* (api) usual checks to see if request is valid +* (compute) detach volume from VM, if fails stop request here +* (compute) call os-brick to disconnect from the volume backend +* (compute) if success, attachment_remove is called. + If there was an error, we add an instance fault + and set the instance into the error state. + +As above, we can use the presence of the attachment_id in the BDM to decide +if the attachment was made using the new or old flow. Long term we want to +migrate all existing attachments to a new style attachment, but this is left +for a later spec. + +Live-migrate +------------ + +During live-migration, we start the process by ensuring the volume is attached +on both the source and destination. 
When a volume is multi_attach=False, and +we are about to start live-migrating VM1, you get a situation like this :: + + +-------------------+ +-------------------+ + | | | | + | +------------+ | | +--------------+ | + | |VM1 (active)| | | |VM1 (inactive)| | + | +---+--------+ | | +--+-----------+ | + | | | | | | + | | Host 1 | | | Host 2 | + +-------------------+ +-------------------+ + | | + +-----------+----------+ + | + | + +---------------------------+ + | | | + | +---------+---------+ | + | | VolA | | + | +-------------------+ | + | | + | Cinder Backend 1 | + | | + +---------------------------+ + +Note, in cinder we end up with two attachments for this multi_attach=False +volume: + +* attachment 1: VolA, VM1, Host 1 +* attachment 2: VolA, VM1, Host 2 + +Logically we have two attachments to the one non-multi-attach volume. Both +attachments are related to vm1, but there is an attachment for both the +source and destination host for the duration of the live-migration. +Note both attachments are associated with the same instance uuid, +which is why the two attachments are allowed even though multi_attach=False. + +Should the live-migration succeed, we will delete attachment 1 (i.e. source +host attachment, host 1) and we are left with just attachment 2 +(i.e. destination host attachment, host 2). If there are any failures with +os-brick disconnect on the source host, we put the instance into the ERROR +state and don't delete the attachment in Cinder. We do this to signal to the +operator that something needs manually fixing. We also put the migration into +the error state, as we would even if a failure had a clean rollback. + +If we have any failures in the live-migration such that the instance is still +running on host 1, we do the opposite of the above. We attempt os-brick +disconnect on host 2. If success we delete attachment 2, otherwise put the +instance into the ERROR state. 
If the rollback succeeds we are back to one +attachment again, but in this case it's attachment 1. + +So for volumes that have an attachment_id in their BDM, we follow this new +flow of API calls to Cinder: + +* (destination) get connector, and create new attachment +* (destination) attach the volume backend +* (source) kicks off live-migration + +If live-migration succeeds: + +* (source) call os-brick to disconnect +* (source) if success, delete the attachment, otherwise put the + instance into an ERROR state + +If live-migration rolls back due to an abort or similar: + +* (destination) call os-brick to disconnect +* (destination) if success, delete the attachment, otherwise put the + instance into an ERROR state + +Migrate +------- + +Similar to live-migrate, at the start of the migration we have attachments +for both the source and destination node. On calling confirm resize we do +a detach on the source; on calling revert resize we detach on the destination. + +Evacuate +-------- + +When you call evacuate, and there is a volume that has an attachment_id in its +BDM, we follow this new flow: + +* (source) Nothing happens on the source, it is assumed the administrator + has already fenced the host, and confirmed that by calling force host down. +* (destination) Create a second attachment for this instance_uuid for + any attached volumes +* (destination) Follow the usual volume attach flow +* (destination) Now delete the old attachment to ensure Cinder cleans up any + resources relating to that connection. It is similar to how we call + terminate_connection today, except we must call this after creating the + new attachment to ensure the volume is always reserved to this instance + during the whole of the evacuate process. +* (operator) should the source host ever be started again, the instances that + have been evacuated are detected in the usual way (using the migration + record created when evacuate is called). 
This may leave some things not + cleaned up by os-brick, but that is fairly safe, and we are in a no worse + situation than we are today. + +Shelve and Unshelve +------------------- + +When a volume attached to an instance has an attachment_id in the BDM, we +follow this new flow of calls to the Cinder API. +Note: it is possible to have both old flow and new flow volumes attached to +the one instance that is getting shelved. + +When offloading from an old host, we first add a new attachment (with no +connector set) then perform a disconnect of the old attachment in the +usual way. This ensures the volume is still attached to the instance, +but is safely detached from the host we are offloading from. Should that +detach fail, the instance should be moved into an ERROR state. + +Similarly, when it comes to unshelve, we update the existing attachments +with the connector, before continuing with the usual attach volume flow. + +Swap Volume +----------- + +For swap volume, we have one host, one instance, one device path, but +multiple volumes. + +In this section, we talk about what happens should the volume being swapped +have the attachment_id present in the BDM, and as such we follow the new flow. + +Firstly, there is the flow when cinder calls our API, secondly when a +user calls our API. Both flows are covered here: + +* The Nova swap volume API is called to swap uuid-old with uuid-new + + * The new volume may have been created by the user in cinder, and the + user may have made the Nova API call. + * Alternatively, the user may have called Cinder's migrate volume API. + That means cinder has created the new volume, and calls the Nova API on + the user's behalf. + +* (api) create new attachment for the volume uuid-new, fail API call if we + can't create that attachment +* (compute) update cinder attachment with connector for uuid-new +* (compute) os-brick connect the new volume. 
If there is an error we + deal with this like a failure during attach, and delete the + attachment to the new volume +* (compute) Nova copies content of volume uuid-old to volume uuid-new, + in libvirt this is via a rebase operation +* (compute) once the copy is complete, we detach uuid-old from instance +* (compute) update BDM so the attachment_id now points to the attachment + associated with uuid-new +* (compute) once the old volume is detached, we do an os-brick disconnect +* (compute) for a Nova initiated swap we don't call cinder's + migrate_volume_completion callback. We check the state of the volume in this + one case to ensure it's not 'retyping' or 'migrating'. +* (compute) Update the BDM with a new volume-uuid, based on what + migrate_volume_completion has returned (when called). Note if cinder called + swap, it will have deleted the old volume, but renamed the new volume to have + the same uuid as the old volume had. If someone called Nova, we get back + uuid-new, and we update the BDM to reflect the change. +* so on success we have created a new attachment to the new volume + and deleted the attachment to the old volume. + +Note: it is assumed if a volume is multi-attach, the swap operation will fail +and not be allowed. That will be true in either the Cinder or Nova started +case. In time we will likely move to Cinder's migrate_volume_completion API +using attachment_ids instead of volume ids. This spec does not look at what is +needed to support multi-attach, but this problem seemed worth noting here. + +Alternatives +------------ + +We could struggle on fixing bugs in a "whack a mole" way. + +There are several ways we could structure the API interactions. One of the +key alternatives is to add lots of state machine complexity into the API so +the shared connection related locking is handled by Cinder in the API layer. +While it makes the clients more complex, it seemed simpler for Nova and other +clients to do the locking discussed above. 
+ +Nova could look up the attachment uuid rather than store it in the BDM, but there +is a period where the host uuid is not set, so it seems safer to store the +attachment uuid to stop any possible confusion around which attachment is +associated with each BDM. + +During live-migration we could store the additional attachment_ids in the +migrate data, rather than as part of the BDM. + +We could continue to save the connection_info in the BDM to be used when we +detach the volume. While it seems like this might help avoid issues with changes +in the connection info that Nova hasn't been notified of, this is really a +premature optimization. We should instead work with Cinder and os-brick to +properly fix any such interaction problems in a way that helps all systems +that work with Cinder. + +Data model impact +----------------- + +When using the new API flow, we no longer need to store the connection_info, +as we don't need to pass that back to Cinder. Instead we just store the +attachment_id for each host the volume is attached to, and any time we need +the connection_info we fetch that from Cinder. + +When an attachment_id is populated, we use the new flow to do all attach or +detach operations. When not present, we use the old flow. + +REST API impact +--------------- + +No changes to Nova's REST API. + +Security impact +--------------- + +Nova no longer needs to store the volume connection information, however it is +now available at any time from the Cinder API. + +Notifications impact +-------------------- + +None. + +Other end user impact +--------------------- + +None. + +Performance Impact +------------------ + +There should be no significant impact on performance. The focus here is stability across +all drivers. There may be slightly more API calls between Nova and Cinder, but this +is not expected to significantly impact performance. 
+ +Other deployer impact +--------------------- + +To use this more stable API interaction, and the new features that will depend +on this effort, deployers must upgrade Cinder to a version that supports the new API. + +It is expected we will drop support for older versions of Cinder within +two release cycles of this work being completed. + +Developer impact +---------------- + +Nova and Cinder interactions should be better understood. + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + Ildiko Vancsa + +Other contributors: + Matt Riedemann + John Griffith + Steve Noyes + +Work Items +---------- + +To make progress in the previous and this cycle we needed to split this work +into small patches. The overall strategy is that we implement new style attach +last, and all the other operations depend on the attachment_id being in the +BDM, which will not be true until the attach code is merged. + +* use Cinder v3 API +* detect if the microversion that includes the new BDM support is present +* detach a new style BDM/volume attach - Merged in Pike +* reboot / rebuild (get connection info from cinder using attachment_id) +* live-migration +* migration +* evacuate +* shelve and unshelve +* swap volume - Merged in Pike +* attach (this means we now expose all the previous features) + +Note there are more steps before we can support multi-attach, but these are +left for future specs: + +* migrate old BDMs to the new BDM flow +* add explicit support for shared backend connections + +Dependencies +============ + +Depends on the Cinder work to add the new API. +This was completed in Ocata. + +Testing +======= + +We need to functionally test both old and new Cinder interactions. A new case +was added to grenade that creates and attaches a volume to an instance before +the upgrade, and detaches it after the upgrade. There is also an addition in +Tempest to check the volume attachments after live migration. 
Beyond this, unit +and functional tests are added in Nova to reach proper test coverage for the +new flow. + +Documentation Impact +==================== + +We need to add good developer documentation around the updated +Nova and Cinder interactions. + +References +========== + +* Cinder API spec: + http://specs.openstack.org/openstack/cinder-specs/specs/ocata/add-new-attach-apis.html +* Merged and open reviews: + https://review.openstack.org/#/q/topic:bp/cinder-new-attach-apis + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Pike + - Introduced + * - Queens + - Re-proposed