Deprecate CONF.workarounds.enable_numa_live_migration

Once a deployment has been fully upgraded to Train, the
CONF.workarounds.enable_numa_live_migration config option is no longer
necessary. This patch deprecates it and changes the conductor check to apply
only if the cell's minimum service version is old (the check is scoped to a
single cell because cross-cell live migration isn't supported).

Implements blueprint numa-aware-live-migration
Change-Id: If649218db86a04db744990ec0139b4f0b1e79ad6
Artom Lifshitz 2019-02-28 06:38:05 -05:00
parent b335d0c157
commit 083bafc353
5 changed files with 123 additions and 25 deletions
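
Read alongside the conductor diff below: the new behavior boils down to a
cell-wide minimum-service-version gate. A condensed sketch of that pattern
(objects.Service.get_minimum_version is Nova's real API; the constant and
helper names are illustrative, not Nova source):

    from nova import objects

    # Service version at which every nova-compute in the cell understands
    # NUMA-aware live migration (compute RPC 5.3; see the release note at
    # the end of this commit). Constant name is made up for this sketch.
    NUMA_LM_MIN_SERVICE_VERSION = 40

    def cell_supports_numa_live_migration(context):
        # get_minimum_version() returns the lowest nova-compute service
        # version recorded in the cell database, so a single not-yet-upgraded
        # compute keeps the legacy check (and the workaround option) in
        # effect for the whole cell.
        minver = objects.Service.get_minimum_version(context, 'nova-compute')
        return minver >= NUMA_LM_MIN_SERVICE_VERSION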


@@ -1,10 +1,12 @@
 .. important::

-   Unless :oslo.config:option:`specifically enabled
-   <workarounds.enable_numa_live_migration>`, live migration is not currently
-   possible for instances with a NUMA topology when using the libvirt driver.
-   A NUMA topology may be specified explicitly or can be added implicitly due
-   to the use of CPU pinning or huge pages. Refer to `bug #1289064`__ for more
-   information.
+   In deployments older than Train, or in mixed Stein/Train deployments with a
+   rolling upgrade in progress, unless :oslo.config:option:`specifically
+   enabled <workarounds.enable_numa_live_migration>`, live migration is not
+   possible for instances with a NUMA topology when using the libvirt
+   driver. A NUMA topology may be specified explicitly or can be added
+   implicitly due to the use of CPU pinning or huge pages. Refer to `bug
+   #1289064`__ for more information. As of Train, live migration of instances
+   with a NUMA topology when using the libvirt driver is fully supported.

 __ https://bugs.launchpad.net/nova/+bug/1289064
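
Operators on a pre-Train or mixed cell opt in by setting the flag in the
[workarounds] section of nova.conf. A minimal sketch of the equivalent
programmatic override using the standard oslo.config API (it assumes the
option is already registered, as Nova does at startup):

    from oslo_config import cfg

    CONF = cfg.CONF

    # Equivalent to the following in nova.conf:
    #   [workarounds]
    #   enable_numa_live_migration = True
    # set_override() raises NoSuchOptError if the option isn't registered.
    CONF.set_override('enable_numa_live_migration', True, group='workarounds')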


@@ -175,7 +175,9 @@ class LiveMigrationTask(base.TaskBase):
                 method='live migrate')

     def _check_instance_has_no_numa(self):
-        """Prevent live migrations of instances with NUMA topologies."""
+        """Prevent live migrations of instances with NUMA topologies.
+
+        TODO(artom) Remove this check in compute RPC 6.0.
+        """
         if not self.instance.numa_topology:
             return
@@ -189,17 +191,32 @@ class LiveMigrationTask(base.TaskBase):
         if hypervisor_type.lower() != obj_fields.HVType.QEMU:
             return

-        msg = ('Instance has an associated NUMA topology. '
-               'Instance NUMA topologies, including related attributes '
-               'such as CPU pinning, huge page and emulator thread '
-               'pinning information, are not currently recalculated on '
-               'live migration. See bug #1289064 for more information.'
-               )
+        # We're fully upgraded to a version that supports NUMA live
+        # migration, carry on.
+        if objects.Service.get_minimum_version(
+                self.context, 'nova-compute') >= 40:
+            return

         if CONF.workarounds.enable_numa_live_migration:
-            LOG.warning(msg, instance=self.instance)
+            LOG.warning(
+                'Instance has an associated NUMA topology, cell contains '
+                'compute nodes older than train, but the '
+                'enable_numa_live_migration workaround is enabled. Live '
+                'migration will not be NUMA-aware. The instance NUMA '
+                'topology, including related attributes such as CPU pinning, '
+                'huge page and emulator thread pinning information, will not '
+                'be recalculated. See bug #1289064 for more information.',
+                instance=self.instance)
         else:
-            raise exception.MigrationPreCheckError(reason=msg)
+            raise exception.MigrationPreCheckError(
+                reason='Instance has an associated NUMA topology, cell '
+                       'contains compute nodes older than train, and the '
+                       'enable_numa_live_migration workaround is disabled. '
+                       'Refusing to perform the live migration, as the '
+                       'instance NUMA topology, including related attributes '
+                       'such as CPU pinning, huge page and emulator thread '
+                       'pinning information, cannot be recalculated. See '
+                       'bug #1289064 for more information.')

     def _check_can_migrate_pci(self, src_host, dest_host):
         """Checks that an instance can migrate with PCI requests.


@@ -157,14 +157,25 @@ Related options:
     cfg.BoolOpt(
         'enable_numa_live_migration',
         default=False,
+        deprecated_for_removal=True,
+        deprecated_since='20.0.0',
+        deprecated_reason="""This option was added to mitigate known issues
+when live migrating instances with a NUMA topology with the libvirt driver.
+Those issues are resolved in Train. Clouds using the libvirt driver and fully
+upgraded to Train support NUMA-aware live migration. This option will be
+removed in a future release.
+""",
         help="""
 Enable live migration of instances with NUMA topologies.

-Live migration of instances with NUMA topologies is disabled by default
-when using the libvirt driver. This includes live migration of instances with
-CPU pinning or hugepages. CPU pinning and huge page information for such
-instances is not currently re-calculated, as noted in `bug #1289064`_. This
-means that if instances were already present on the destination host, the
+Live migration of instances with NUMA topologies when using the libvirt driver
+is only supported in deployments that have been fully upgraded to Train. In
+previous versions, or in mixed Stein/Train deployments with a rolling upgrade
+in progress, live migration of instances with NUMA topologies is disabled by
+default when using the libvirt driver. This includes live migration of
+instances with CPU pinning or hugepages. CPU pinning and huge page information
+for such instances is not currently re-calculated, as noted in `bug #1289064`_.
+This means that if instances were already present on the destination host, the
 migrated instance could be placed on the same dedicated cores as these
 instances or use hugepages allocated for another instance. Alternately, if the
 host platforms were not homogeneous, the instance could be assigned to
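
The deprecation kwargs above are standard oslo.config; 20.0.0 is Nova's Train
release number. A self-contained toy that exercises the same machinery (the
option definition mirrors the diff; the rest is not Nova code):

    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opts([
        cfg.BoolOpt('enable_numa_live_migration',
                    default=False,
                    deprecated_for_removal=True,
                    deprecated_since='20.0.0',
                    deprecated_reason='Resolved in Train; see bug #1289064.'),
    ], group='workarounds')

    # Reads keep working for the whole deprecation period; oslo.config logs
    # a deprecation warning when the option is actually set in a config file.
    print(CONF.workarounds.enable_numa_live_migration)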


@@ -187,7 +187,7 @@ class LiveMigrationTaskTestCase(test.NoDBTestCase):
         self.flags(enable_numa_live_migration=False, group='workarounds')
         self.task.instance.numa_topology = None
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='kvm')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task._check_instance_has_no_numa()

     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
@@ -201,25 +201,47 @@ class LiveMigrationTaskTestCase(test.NoDBTestCase):
         self.task._check_instance_has_no_numa()

     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
-    def test_check_instance_has_no_numa_passes_workaround(self, mock_get):
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=39)
+    def test_check_instance_has_no_numa_passes_workaround(
+            self, mock_get_min_ver, mock_get):
         self.flags(enable_numa_live_migration=True, group='workarounds')
         self.task.instance.numa_topology = objects.InstanceNUMATopology(
             cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
                                             memory=1024)])
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='kvm')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task._check_instance_has_no_numa()
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')

     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
-    def test_check_instance_has_no_numa_fails(self, mock_get):
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=39)
+    def test_check_instance_has_no_numa_fails(self, mock_get_min_ver,
+                                              mock_get):
         self.flags(enable_numa_live_migration=False, group='workarounds')
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='QEMU')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task.instance.numa_topology = objects.InstanceNUMATopology(
             cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
                                             memory=1024)])
         self.assertRaises(exception.MigrationPreCheckError,
                           self.task._check_instance_has_no_numa)
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')
+
+    @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=40)
+    def test_check_instance_has_no_numa_new_svc_passes(self, mock_get_min_ver,
+                                                       mock_get):
+        self.flags(enable_numa_live_migration=False, group='workarounds')
+        mock_get.return_value = objects.ComputeNode(
+            uuid=uuids.cn1, hypervisor_type='qemu')
+        self.task.instance.numa_topology = objects.InstanceNUMATopology(
+            cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
+                                            memory=1024)])
+        self.task._check_instance_has_no_numa()
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')

     @mock.patch.object(objects.Service, 'get_by_compute_host')
     @mock.patch.object(servicegroup.API, 'service_is_up')
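
A note on the new test signatures: unittest.mock applies stacked patch
decorators bottom-up, so the decorator closest to the function supplies the
first mock argument; that is why mock_get_min_ver precedes mock_get above.
A standalone illustration (generic names, not from Nova):

    from unittest import mock

    class Thing:
        def a(self):
            return 'a'

        def b(self):
            return 'b'

    @mock.patch.object(Thing, 'a')   # outermost patch -> last argument
    @mock.patch.object(Thing, 'b')   # innermost patch -> first argument
    def check(mock_b, mock_a):
        assert Thing().a() is mock_a.return_value
        assert Thing().b() is mock_b.return_value

    check()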


@@ -0,0 +1,46 @@
+---
+features:
+  - |
+    With the libvirt driver, live migration now works correctly for instances
+    that have a NUMA topology. Previously, the instance was naively moved to
+    the destination host, without updating any of the underlying NUMA guest to
+    host mappings or the resource usage. With the new NUMA-aware live migration
+    feature, if the instance cannot fit on the destination the live migration
+    will be attempted on an alternate destination if the request is
+    setup to have alternates. If the instance can fit on the destination, the
+    NUMA guest to host mappings will be re-calculated to reflect its new
+    host, and its resource usage updated.
+upgrade:
+  - |
+    For the libvirt driver, the NUMA-aware live migration feature requires the
+    conductor, source compute, and destination compute to be upgraded to Train.
+    It also requires the conductor and source compute to be able to send RPC
+    5.3 - that is, their ``[upgrade_levels]/compute`` configuration option must
+    not be set to less than 5.3 or a release older than "train".
+
+    In other words, NUMA-aware live migration with the libvirt driver is not
+    supported until:
+
+    * All compute and conductor services are upgraded to Train code.
+    * The ``[upgrade_levels]/compute`` RPC API pin is removed (or set to
+      "auto") and services are restarted.
+
+    If any of these requirements are not met, live migration of instances with
+    a NUMA topology with the libvirt driver will revert to the legacy naive
+    behavior, in which the instance was simply moved over without updating its
+    NUMA guest to host mappings or its resource usage.
+
+    .. note:: The legacy naive behavior is dependent on the value of the
+              ``[workarounds]/enable_numa_live_migration`` option. Refer to the
+              Deprecations sections for more details.
+deprecations:
+  - |
+    With the introduction of the NUMA-aware live migration feature for the
+    libvirt driver, ``[workarounds]/enable_numa_live_migration`` is
+    deprecated. Once a cell has been fully upgraded to Train, its value is
+    ignored.
+
+    .. note:: Even in a cell fully upgraded to Train, RPC pinning via
+              ``[upgrade_levels]/compute`` can cause live migration of
+              instances with a NUMA topology to revert to the legacy naive
+              behavior. For more details refer to the Upgrade section.
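
As a rough illustration of the RPC pin discussed in the upgrade note above
(the [upgrade_levels]/compute option is real; registering it standalone here
is only to keep the sketch self-contained):

    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opts([cfg.StrOpt('compute')], group='upgrade_levels')

    # A pin like this in nova.conf caps compute RPC below 5.3 and therefore
    # forces the legacy naive behavior even in a fully upgraded Train cell:
    #
    #   [upgrade_levels]
    #   compute = 5.2
    #
    # Removing the pin or setting it to "auto" lets services negotiate the
    # newest mutually supported RPC version once they are restarted.
    CONF.set_override('compute', 'auto', group='upgrade_levels')
    print(CONF.upgrade_levels.compute)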