Deprecate CONF.workarounds.enable_numa_live_migration
Once a deployment has been fully upgraded to Train, the CONF.workarounds.enable_numa_live_migration config option is no longer necessary. This patch changes the conductor check so that it only applies when the cell's minimum service version is old (cross-cell live migration isn't supported).

Implements blueprint numa-aware-live-migration

Change-Id: If649218db86a04db744990ec0139b4f0b1e79ad6
Parent: b335d0c157
Commit: 083bafc353
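The gate this commit leaves in the conductor can be condensed as follows. This is a hedged sketch, not Nova's actual code: check_numa_live_migration and its plain parameters are illustrative stand-ins, while the version threshold (40), the warn-or-raise split, and the workaround semantics come straight from the diff below.

# Hedged sketch of the conductor's post-change NUMA live migration gate.
# Plain values stand in for Nova objects; in Nova the version comes from
# objects.Service.get_minimum_version() and the error raised is
# exception.MigrationPreCheckError, as shown in the diff below.

class MigrationPreCheckError(Exception):
    """Stand-in for nova.exception.MigrationPreCheckError."""


def check_numa_live_migration(has_numa_topology: bool,
                              min_compute_version: int,
                              workaround_enabled: bool) -> None:
    if not has_numa_topology:
        return  # Nothing to check for non-NUMA instances.
    if min_compute_version >= 40:
        # Cell fully upgraded to Train: NUMA-aware live migration is
        # supported and the workaround option is ignored entirely.
        return
    if workaround_enabled:
        # Pre-Train computes present, but the operator opted in: allow a
        # legacy, non-NUMA-aware live migration with a warning.
        print('warning: live migration will not be NUMA-aware')
        return
    raise MigrationPreCheckError('cell contains pre-Train computes and the '
                                 'enable_numa_live_migration workaround is '
                                 'disabled')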
@@ -1,10 +1,12 @@
 .. important::

-   Unless :oslo.config:option:`specifically enabled
-   <workarounds.enable_numa_live_migration>`, live migration is not currently
-   possible for instances with a NUMA topology when using the libvirt driver.
-   A NUMA topology may be specified explicitly or can be added implicitly due
-   to the use of CPU pinning or huge pages. Refer to `bug #1289064`__ for more
-   information.
+   In deployments older than Train, or in mixed Stein/Train deployments with a
+   rolling upgrade in progress, unless :oslo.config:option:`specifically
+   enabled <workarounds.enable_numa_live_migration>`, live migration is not
+   possible for instances with a NUMA topology when using the libvirt
+   driver. A NUMA topology may be specified explicitly or can be added
+   implicitly due to the use of CPU pinning or huge pages. Refer to `bug
+   #1289064`__ for more information. As of Train, live migration of instances
+   with a NUMA topology when using the libvirt driver is fully supported.

 __ https://bugs.launchpad.net/nova/+bug/1289064
@@ -175,7 +175,9 @@ class LiveMigrationTask(base.TaskBase):
                 method='live migrate')

     def _check_instance_has_no_numa(self):
-        """Prevent live migrations of instances with NUMA topologies."""
+        """Prevent live migrations of instances with NUMA topologies.
+        TODO(artom) Remove this check in compute RPC 6.0.
+        """
         if not self.instance.numa_topology:
             return
@@ -189,17 +191,32 @@ class LiveMigrationTask(base.TaskBase):
         if hypervisor_type.lower() != obj_fields.HVType.QEMU:
             return

-        msg = ('Instance has an associated NUMA topology. '
-               'Instance NUMA topologies, including related attributes '
-               'such as CPU pinning, huge page and emulator thread '
-               'pinning information, are not currently recalculated on '
-               'live migration. See bug #1289064 for more information.'
-               )
+        # We're fully upgraded to a version that supports NUMA live
+        # migration, carry on.
+        if objects.Service.get_minimum_version(
+                self.context, 'nova-compute') >= 40:
+            return

         if CONF.workarounds.enable_numa_live_migration:
-            LOG.warning(msg, instance=self.instance)
+            LOG.warning(
+                'Instance has an associated NUMA topology, cell contains '
+                'compute nodes older than train, but the '
+                'enable_numa_live_migration workaround is enabled. Live '
+                'migration will not be NUMA-aware. The instance NUMA '
+                'topology, including related attributes such as CPU pinning, '
+                'huge page and emulator thread pinning information, will not '
+                'be recalculated. See bug #1289064 for more information.',
+                instance=self.instance)
         else:
-            raise exception.MigrationPreCheckError(reason=msg)
+            raise exception.MigrationPreCheckError(
+                reason='Instance has an associated NUMA topology, cell '
+                       'contains compute nodes older than train, and the '
+                       'enable_numa_live_migration workaround is disabled. '
+                       'Refusing to perform the live migration, as the '
+                       'instance NUMA topology, including related attributes '
+                       'such as CPU pinning, huge page and emulator thread '
+                       'pinning information, cannot be recalculated. See '
+                       'bug #1289064 for more information.')

     def _check_can_migrate_pci(self, src_host, dest_host):
         """Checks that an instance can migrate with PCI requests.
@@ -157,14 +157,25 @@ Related options:
     cfg.BoolOpt(
         'enable_numa_live_migration',
         default=False,
+        deprecated_for_removal=True,
+        deprecated_since='20.0.0',
+        deprecated_reason="""This option was added to mitigate known issues
+when live migrating instances with a NUMA topology with the libvirt driver.
+Those issues are resolved in Train. Clouds using the libvirt driver and fully
+upgraded to Train support NUMA-aware live migration. This option will be
+removed in a future release.
+""",
         help="""
 Enable live migration of instances with NUMA topologies.

-Live migration of instances with NUMA topologies is disabled by default
-when using the libvirt driver. This includes live migration of instances with
-CPU pinning or hugepages. CPU pinning and huge page information for such
-instances is not currently re-calculated, as noted in `bug #1289064`_. This
-means that if instances were already present on the destination host, the
+Live migration of instances with NUMA topologies when using the libvirt driver
+is only supported in deployments that have been fully upgraded to Train. In
+previous versions, or in mixed Stein/Train deployments with a rolling upgrade
+in progress, live migration of instances with NUMA topologies is disabled by
+default when using the libvirt driver. This includes live migration of
+instances with CPU pinning or hugepages. CPU pinning and huge page information
+for such instances is not currently re-calculated, as noted in `bug #1289064`_.
+This means that if instances were already present on the destination host, the
 migrated instance could be placed on the same dedicated cores as these
 instances or use hugepages allocated for another instance. Alternately, if the
 host platforms were not homogeneous, the instance could be assigned to
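For reference, a minimal sketch of how a deprecated boolean option like this behaves with oslo.config. The option definition mirrors the diff above; the ConfigOpts instance and the demo.conf file name are illustrative, not Nova's wiring.

# Minimal sketch: registering the deprecated option and reading it back.
# Requires oslo.config (pip install oslo.config); the config file name
# below is an assumption for illustration.
from oslo_config import cfg

CONF = cfg.ConfigOpts()
CONF.register_opts([
    cfg.BoolOpt('enable_numa_live_migration',
                default=False,
                deprecated_for_removal=True,
                deprecated_since='20.0.0',
                deprecated_reason='Resolved in Train; see bug #1289064.'),
], group='workarounds')

# With "[workarounds]\nenable_numa_live_migration = True" in demo.conf,
# oslo.config logs a deprecation warning for the option when parsing.
CONF(['--config-file', 'demo.conf'])
print(CONF.workarounds.enable_numa_live_migration)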
@@ -187,7 +187,7 @@ class LiveMigrationTaskTestCase(test.NoDBTestCase):
         self.flags(enable_numa_live_migration=False, group='workarounds')
         self.task.instance.numa_topology = None
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='kvm')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task._check_instance_has_no_numa()

     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
@@ -201,25 +201,47 @@ class LiveMigrationTaskTestCase(test.NoDBTestCase):
         self.task._check_instance_has_no_numa()

     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
-    def test_check_instance_has_no_numa_passes_workaround(self, mock_get):
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=39)
+    def test_check_instance_has_no_numa_passes_workaround(
+            self, mock_get_min_ver, mock_get):
         self.flags(enable_numa_live_migration=True, group='workarounds')
         self.task.instance.numa_topology = objects.InstanceNUMATopology(
             cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
                                             memory=1024)])
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='kvm')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task._check_instance_has_no_numa()
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')

     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
-    def test_check_instance_has_no_numa_fails(self, mock_get):
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=39)
+    def test_check_instance_has_no_numa_fails(self, mock_get_min_ver,
+                                              mock_get):
         self.flags(enable_numa_live_migration=False, group='workarounds')
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='QEMU')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task.instance.numa_topology = objects.InstanceNUMATopology(
             cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
                                             memory=1024)])
         self.assertRaises(exception.MigrationPreCheckError,
                           self.task._check_instance_has_no_numa)
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')
+
+    @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=40)
+    def test_check_instance_has_no_numa_new_svc_passes(self, mock_get_min_ver,
+                                                       mock_get):
+        self.flags(enable_numa_live_migration=False, group='workarounds')
+        mock_get.return_value = objects.ComputeNode(
+            uuid=uuids.cn1, hypervisor_type='qemu')
+        self.task.instance.numa_topology = objects.InstanceNUMATopology(
+            cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
+                                            memory=1024)])
+        self.task._check_instance_has_no_numa()
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')

     @mock.patch.object(objects.Service, 'get_by_compute_host')
     @mock.patch.object(servicegroup.API, 'service_is_up')
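One detail worth noting in these tests: stacked mock.patch.object decorators inject mocks bottom-up, which is why mock_get_min_ver (the decorator nearest the function) precedes mock_get in the signatures. A standalone sketch of that ordering, with illustrative class and test names:

# Standalone sketch of mock.patch.object decorator ordering, as used in the
# tests above: the decorator closest to the function supplies the first
# mock argument. All names here are illustrative.
from unittest import mock


class Service:
    @staticmethod
    def get_minimum_version():
        return 0


class ComputeNode:
    @staticmethod
    def get_by_host_and_nodename():
        return None


@mock.patch.object(ComputeNode, 'get_by_host_and_nodename')
@mock.patch.object(Service, 'get_minimum_version', return_value=39)
def test(mock_get_min_ver, mock_get):
    # Innermost decorator (Service) -> first argument.
    assert Service.get_minimum_version() == 39
    mock_get_min_ver.assert_called_once_with()


test()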
@@ -0,0 +1,46 @@
+---
+features:
+  - |
+    With the libvirt driver, live migration now works correctly for instances
+    that have a NUMA topology. Previously, the instance was naively moved to
+    the destination host, without updating any of the underlying NUMA guest
+    to host mappings or the resource usage. With the new NUMA-aware live
+    migration feature, if the instance cannot fit on the destination the live
+    migration will be attempted on an alternate destination if the request is
+    set up to have alternates. If the instance can fit on the destination, the
+    NUMA guest to host mappings will be re-calculated to reflect its new
+    host, and its resource usage updated.
+upgrade:
+  - |
+    For the libvirt driver, the NUMA-aware live migration feature requires
+    the conductor, source compute, and destination compute to be upgraded to
+    Train. It also requires the conductor and source compute to be able to
+    send RPC 5.3 - that is, their ``[upgrade_levels]/compute`` configuration
+    option must not be set to less than 5.3 or a release older than "train".
+
+    In other words, NUMA-aware live migration with the libvirt driver is not
+    supported until:
+
+    * All compute and conductor services are upgraded to Train code.
+    * The ``[upgrade_levels]/compute`` RPC API pin is removed (or set to
+      "auto") and services are restarted.
+
+    If any of these requirements are not met, live migration of instances
+    with a NUMA topology with the libvirt driver will revert to the legacy
+    naive behavior, in which the instance was simply moved over without
+    updating its NUMA guest to host mappings or its resource usage.
+
+    .. note:: The legacy naive behavior is dependent on the value of the
+              ``[workarounds]/enable_numa_live_migration`` option. Refer to
+              the Deprecations section for more details.
+deprecations:
+  - |
+    With the introduction of the NUMA-aware live migration feature for the
+    libvirt driver, ``[workarounds]/enable_numa_live_migration`` is
+    deprecated. Once a cell has been fully upgraded to Train, its value is
+    ignored.
+
+    .. note:: Even in a cell fully upgraded to Train, RPC pinning via
+              ``[upgrade_levels]/compute`` can cause live migration of
+              instances with a NUMA topology to revert to the legacy naive
+              behavior. For more details refer to the Upgrade section.
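As a rough illustration of the upgrade requirement described in the release note above, here is a hedged sketch that inspects a nova.conf for an RPC pin that would force the legacy behavior. This is not a Nova utility: the helper, the parsing, and the file path are all illustrative assumptions; only the rule itself (pin unset, "auto", "train", or >= 5.3 allows NUMA-aware live migration) comes from the note.

# Hedged sketch: would a nova.conf RPC pin force legacy (non-NUMA-aware)
# live migration, per the release note above? Illustrative helper only;
# the path below is an assumption.
import configparser


def pin_allows_numa_live_migration(conf_path: str) -> bool:
    parser = configparser.ConfigParser()
    parser.read(conf_path)
    pin = parser.get('upgrade_levels', 'compute', fallback=None)
    if pin is None or pin.lower() == 'auto':
        return True  # No pin: conductor and computes negotiate freely.
    if pin.lower() == 'train':
        return True  # Train corresponds to compute RPC >= 5.3.
    try:
        major, minor = (int(part) for part in pin.split('.'))
    except ValueError:
        return False  # Older named release (e.g. "stein"): legacy behavior.
    return (major, minor) >= (5, 3)


print(pin_allows_numa_live_migration('/etc/nova/nova.conf'))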