neutron/doc/source/contributor/internals/live_migration.rst
Akihiro Motoki 2a47ffd96d Rearrange existing documentation to fit the new standard layout
For more detail, see the doc migration spec.
http://specs.openstack.org/openstack/docs-specs/specs/pike/os-manuals-migration.html

Change-Id: I142a686a3abbe65138a9f3296cd21fc21fbd763a
2017-07-08 05:49:56 +00:00

191 lines
7.9 KiB
ReStructuredText

..
Licensed under the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License. You may obtain
a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License.
Convention for heading levels in Neutron devref:
======= Heading 0 (reserved for the title in a document)
------- Heading 1
~~~~~~~ Heading 2
+++++++ Heading 3
''''''' Heading 4
(Avoid deeper levels because they do not render well.)
Live-migration
==============
Let's consider a VM with one port migrating from host1 with nova-compute1,
neutron-l2-agent1 and neutron-l3-agent1 to host2 with nova-compute2 and
neutron-l2-agent2 and neutron-l3agent2.
Since the VM that is about to migrate is hosted by nova-compute1, nova sends
the live-migration order to nova-compute1 through RPC.
Nova Live Migration consists of the following stages:
* Pre-live-migration
* Live-migration-operation
* Post-live-migration
Pre-live-migration actions
--------------------------
Nova-compute1 will first ask nova-compute2 to perform pre-live-migration
actions with a synchronous RPC call.
Nova-compute2 will use neutron REST API to retrieve the list of VM's ports.
Then, it calls its vif driver to create the VM's port (VIF) using plug_vifs().
In the case Open vSwitch Hybrid plug is used, Neutron-l2-agent2 will detect
this new VIF, request the device details from the neutron server and configure
it accordingly. However, port's status won't change, since this port is
not bound to nova-compute2.
Nova-compute1 calls setup_networks_on_hosts. This updates the Neutron ports
binding:profile with the information of the target host. The port update RPC
message sent out by Neutron server will be received by neutron-l3-agent2,
which proactively sets up the DVR router.
If pre-live-migration fails, nova rollbacks and port is removed from host2.
If pre-live-migration succeeds, nova proceeds with live-migration-operation.
Potential error cases related to networking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Plugging the VIFs on host2 fails
As Live migration operation was not yet started, the instance resides
active on host1.
.. _live_mig_operation:
Live-migration-operation
------------------------
Once nova-compute2 has performed pre-live-migration actions, nova-compute1 can
start the live-migration. This results in the creation of the VM and its
corresponding tap interface on node 2.
In the case Open vSwitch normal plug, linux bridge or MacVTap is being used,
Neutron-l2-agent2 will detect this new tap device and configure it accordingly.
However, port's status won't change, since this port is not bound to
nova-compute2.
As soon as the instance is active on host2, the original instance on host1
gets removed and with it the corresponding tap device. Assuming OVS-hybrid plug
is NOT used, Neutron-l2-agent1 detects the removal and tells the neutron
server to set the port's status to DOWN state with RPC messages.
There is no rollback if failure happens in live-migration-operation stage.
TBD: Error are handled by the post-live-migration stage.
Potential error cases related to networking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Some host devices that are specified in the instance definition are not
present on the target host. Migration fails before it really started.
This can happen with MacVTap agent. See bug
https://bugs.launchpad.net/bugs/1550400
Post-live-migration actions
---------------------------
Once live-migration succeeded, both nova-compute1 and nova-compute2
perform post-live-migration actions. Nova-compute1 which is aware of
the success will send a RPC cast to nova-compute2 to tell it to perform
post-live-migration actions.
On host2, nova-compute2 sends a REST call
"update_port(binding=host2, profile={})" to the neutron server to tell it to
update the port's binding. This will clear the port binding information and
move the port's status to DOWN. The ML2 plugin will then try to rebind the
port according to its new host. This update_port REST call always triggers a
port-update RPC fanout message to every neutron-l2-agent. Since
neutron-l2-agent2 is now hosting the port, it will take this message into
account and re-synchronize the port by asking the neutron server details about
it through RPC messages. This will move the port from DOWN status to BUILD,
and then back to ACTIVE.
This update also removes the 'migrating_to' value from the portbindng
dictionary. It's not clearing it totally, like indicated by {}, but just
removing the 'migrating_to' key and value.
On host1, nova-compute1 calls its vif driver to unplug the VM's port.
Assuming, Open vSwitch Hybrid plug is used, Neutron-l2-agent1 detects the
removal and tells the neutron server to set the port's status to DOWN state
with RPC messages. For all other cases this happens as soon as the instance
and its tap device got destroyed on host1, like described in
:ref:`live_mig_operation`.
If neutron didn't already processed the REST call "update_port(binding=host2)",
the port status will effectively move to BUILD and then to DOWN. Otherwise,
the port is bound to host2, and neutron won't change the port status since
it's not bound the host that is sending RPC messages.
There is no rollback if failure happens in post-live-migration stage. In the
case of an error, the instance is set into "ERROR" state.
Potential error cases related to networking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Portbinding for host2 fails
If this happens, the vif_type of the port is set to 'binding_failed'.
When Nova tries to recreated the domain.xml on the migration target
it will stumble over this invalid vif_type and fail. The instance is put
into "ERROR" state.
Post-Copy Migration
-------------------
Usually, Live Migration is executed as pre-copy migration. The instance is
active on host1 until nearly all memory has been copied to host2. If a
certain threshold of copied memory is met, the instance on the source get's
paused, the rest of the memory copied over and the instance started on
the target. The challenge with this approach is, that migration might take
a infinite amount of time, when the instance is heavily writing to memory.
This issue gets solved with post-copy migration. At some point in time,
the instance on host2 will be set to active, although still
a huge amount of memory pages reside only on host1. The phase that starts
now is called the post_copy phase. If the instance tries to access a memory
page that has not yet been transferred, libvirt/qemu takes care of moving this
page to the target immediately. New pages will only be written to the source.
With this approach the migration operation takes a finite amount of time.
Today, the rebinding of the port from host1 to host2 happens
in the post_live_migration phase, after migration finished. This is fine for
the pre-copy case, as the time windows between the activation of the instance
on the target and the binding of the port to the target is pretty small. This
becomes more problematic for the post-copy migration case. The instance
becomes active on the target pretty early but the portbinding still happens
after migration finished. During this time window, the instance might not be
reachable via the network. This should be solved with bug
https://bugs.launchpad.net/nova/+bug/1605016
Flow Diagram
------------
OVS Normal plug, Linux bridge, MacVTap, SR-IOV
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: images/live-mig.png
OVS-Hybrid plug
~~~~~~~~~~~~~~~
The sequence with RPC messages from neutron-l2-agent processed first is
described in the following UML sequence diagram
.. image:: images/live-mig-ovs-hybrid.png