NUMA-aware live migration
https://blueprints.launchpad.net/nova/+spec/numa-aware-live-migration
When an instance with NUMA characteristics is live-migrated, those characteristics are not recalculated on the destination compute host. In the CPU pinning case, using the source host's pin mappings on the destination can lead to multiple instances being pinned to the same pCPUs. In the case of hugepage-backed instances, which are NUMA-localized, an instance needs to have its NUMA mapping recalculated on the destination compute host during a live migration.
Problem description
In the following paragraphs, the term NUMA is used loosely to mean any guest characteristic expressed in the InstanceNUMATopology object, for example CPU pinning and hugepages. CPU pinning can be achieved without a guest NUMA topology, but because no better term than NUMA is available, it will continue to be used.
The problem can best be described with three examples.
The first example is live migration with CPU pinning. An instance with a dedicated CPU policy and pinned CPUs is live-migrated. Its pin mappings are naively copied over to the destination host. This creates two problems. First, its pinned pCPUs aren't properly claimed on the destination, meaning that, should a second instance with pinned CPUs land on the destination, both instances' vCPUs could be pinned to the same pCPUs. Second, any existing pin mappings on the destination are ignored: if another instance already exists on the destination, both instances' vCPUs could be pinned to the same pCPUs. In both cases, the dedicated CPU policy is violated, potentially leading to unpredictable performance degradation.
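A toy illustration of the overlap that naive copying can produce (plain Python with made-up values, not Nova data structures):

    # vCPU -> pCPU pin mappings; illustrative values only.
    incoming_pinning = {0: 2, 1: 3}   # copied verbatim from the source host
    existing_pinning = {0: 2, 1: 3}   # already in use on the destination

    overlap = set(incoming_pinning.values()) & set(existing_pinning.values())
    assert overlap == {2, 3}  # both instances' vCPUs now share pCPUs 2 and 3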
The second example is instances with hugepages. There are two hosts, each with two NUMA nodes and 8 1GB hugepages per node. Two identical instances are booted on the two hosts. Their virtual NUMA topology is one virtual NUMA node and 8 1GB memory pages. They land on their respective host's NUMA node 0, consuming all 8 of its pages. One instance is live-migrated to the other host. The libvirt driver enforces strict NUMA affinity and does not regenerate the instance XML. Both instances end up on the host's NUMA node 0, and the live-migrated instance fails to run.
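The page accounting in this example, written out as a toy snippet using the numbers from the text:

    pages_per_node = 8            # 1GB hugepages available on each NUMA node
    existing_instance_pages = 8   # instance already occupying node 0
    migrated_instance_pages = 8   # incoming instance, XML still targets node 0

    # Node 0 would need 16 pages but only has 8, so the migrated instance
    # fails to run.
    assert existing_instance_pages + migrated_instance_pages > pages_per_node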
The third example is an instance with a virtual NUMA topology (but without hugepages). If an instance affined to its host's NUMA node 2 is live-migrated to a host with only two NUMA nodes, and thus without a NUMA node 2, it will fail to run.
The first two of these examples are known bugs [1] [2].
Use Cases
As a cloud administrator, I want to live migrate instances with CPU pinning without the pin mappings overlapping on the destination compute host.
As a cloud administrator, I want live migration of hugepage-backed instances to work and for the instances to successfully run on the destination compute host.
As a cloud administrator, I want live migration of instances with an explicit NUMA topology to work and for the instances to successfully run on the destination compute host.
Proposed change
Currently, the scheduler does not claim any NUMA resources. While work has started to model NUMA topologies as resource providers in placement [3], this spec intentionally ignores that work and does not depend on it. Instead, the current method of claiming NUMA resources will continue to be used. Specifically, NUMA resources will continue to be claimed by the compute host's resource tracker.
At the cell conductor (live migration isn't supported between cells, so the superconductor is not involved) and compute level, the relevant parts of the current live migration flow can be summarized by the following oversimplified pseudo sequence diagram:
+-----------+ +---------+ +-------------+ +---------+
| Conductor | | Source | | Destination | | Driver |
+-----------+ +---------+ +-------------+ +---------+
| | | |
| check_can_live_migrate_destination| | |
|---------------------------------------------------------------------------->| |
| | | |
| | check_can_live_migrate_source | |
| |<----------------------------------------| |
| | | |
| | migrate_data | |
| |---------------------------------------->| |
| | | |
| | migrate_data | |
|<----------------------------------------------------------------------------| |
| | | |
| live_migration(migrate_data) | | |
|---------------------------------->| | |
| | | |
| | pre_live_migration(migrate_data) | |
| |---------------------------------------->| |
| | | |
| | migrate_data | |
| |<----------------------------------------| |
| | | |
| | live_migration(migrate_data) | |
| |------------------------------------------------------>|
| | | |
migrate_data is a LiveMigrateData object. This spec proposes to add an object field containing an InstanceNUMATopology object. The source will include the instance's existing NUMA topology in the migrate_data that its check_can_live_migrate_source returns to the destination. The destination's virt driver will fit this InstanceNUMATopology to the destination's NUMATopology and claim the resources using the resource tracker. It will then send the updated InstanceNUMATopology back to the conductor as part of the existing migrate_data that check_can_live_migrate_destination returns. The updated InstanceNUMATopology will continue to be propagated as part of migrate_data, eventually reaching the source. The source's libvirt driver will use this updated InstanceNUMATopology when generating the instance XML to be sent to the destination for the live migration. The proposed flow is summarized in the following diagram:
+-----------+ +---------+ +-------------+ +---------+
| Conductor | | Source | | Destination | | Driver |
+-----------+ +---------+ +-------------+ +---------+
| | | |
| check_can_live_migrate_destination | | |
|---------------------------------------------------------------------------------------------------------->| |
| | | |
| | check_can_live_migrate_source | |
| |<-------------------------------------------| |
| | | |
| | migrate_data + InstanceNUMATopology | |
| |------------------------------------------->| |
| | | --------------------------------------------\ |
| | |-| Fit InstanceNUMATopology to NUMATopology, | |
| | | | fail live migration if unable | |
| | | |-------------------------------------------| |
| | migrate_data + new InstanceNUMATopology | |
|<----------------------------------------------------------------------------------------------------------| |
| | | |
| live_migration(migrate_data + new InstanceNUMATopology) | | |
|------------------------------------------------------------->| | |
| --------------------------\ | | |
| | pre_live_migration call |-| | |
| |-------------------------| | | |
| | | |
| | live_migration(migrate_data + new InstanceNUMATopology) |
| |---------------------------------------------------------------------------------------------->|
| | | ------------------------------------\ |
| | | | generate NUMA XML for destination |-|
| | | |-----------------------------------| |
| | | |
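To make the handshake concrete, the following is a minimal, standalone Python sketch of it. It uses plain dataclasses rather than Nova's versioned objects; the names dst_numa_topology, DestinationCompute and fit_and_claim, as well as the simplistic free-pCPU model, are illustrative assumptions only, not the proposed implementation:

    from dataclasses import dataclass, field
    from typing import Dict, Optional


    @dataclass
    class InstanceNUMATopology:
        # vCPU -> pCPU pin mapping; real Nova tracks this per NUMA cell, it is
        # flattened here to keep the sketch short.
        cpu_pinning: Dict[int, int] = field(default_factory=dict)


    @dataclass
    class LiveMigrateData:
        # Existing fields elided; the spec adds a field along these lines.
        dst_numa_topology: Optional[InstanceNUMATopology] = None


    class DestinationCompute:
        """Destination-side half of the check_can_live_migrate_destination step."""

        def __init__(self, free_pcpus):
            self.free_pcpus = set(free_pcpus)

        def fit_and_claim(self, source_topology, migrate_data):
            # Refit the instance's pinning to this host and claim the pCPUs,
            # failing the live migration early if the instance does not fit.
            needed = len(source_topology.cpu_pinning)
            if needed > len(self.free_pcpus):
                raise RuntimeError('instance NUMA topology does not fit the '
                                   'destination; aborting the live migration')
            new_pcpus = sorted(self.free_pcpus)[:needed]
            new_pinning = dict(zip(sorted(source_topology.cpu_pinning), new_pcpus))
            self.free_pcpus -= set(new_pcpus)  # the resource tracker "claim"
            migrate_data.dst_numa_topology = InstanceNUMATopology(new_pinning)
            return migrate_data


    # Usage: the destination computes new pinning, and the source's libvirt
    # driver later uses migrate_data.dst_numa_topology (not the instance's
    # current pinning) when generating the XML sent over for the migration.
    dest = DestinationCompute(free_pcpus={4, 5, 6, 7})
    md = dest.fit_and_claim(InstanceNUMATopology({0: 2, 1: 3}), LiveMigrateData())
    print(md.dst_numa_topology.cpu_pinning)  # {0: 4, 1: 5}

In the real flow the claim would go through the resource tracker and would cover memory pages and other NUMA resources as well, not just pCPUs.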
Exchanging instance NUMA topologies is done early (in check_can_live_migrate_source rather than pre_live_migration) in order to fail as fast as possible if the destination cannot fit the instance. What happens when the compute hosts are not both running the updated handshake code is discussed in the Upgrade impact section below.
Currently, only placement allocations are updated during a live migration. The proposed resource tracker claims mechanism will become obsolete once NUMA resource providers are implemented [3]. Therefore, as a stopgap error handling method, the live migration can be failed if the resource claim does not succeed on the destination compute host. Once NUMA is handled by placement, the compute host will not have to do any resource claims.
It would also be possible for another instance to steal NUMA resources from a live migrated instance before the latter's destination compute host has a chance to claim them. Until NUMA resource providers are implemented [3] and allow for an essentially atomic schedule+claim operation, scheduling and claiming will keep being done at different times on different nodes. Thus, the potential for races will continue to exist.
Alternatives
It would be possible to reuse the result of numa_fit_instance_to_host as called from the scheduler before the live migration reaches the conductor. select_destinations in the scheduler returns a list of Selection objects to the conductor's live migrate task. The Selection object could be modified to include InstanceNUMATopology. The NUMA topology filter could add an InstanceNUMATopology for every host that passes. That topology would eventually reach the conductor, which would put it in migrate_data. The destination compute host would then claim the resources as previously described.
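A rough sketch of this alternative, again using plain dataclasses as stand-ins for Nova's versioned objects; the instance_numa_topology field is a hypothetical addition, not an existing attribute of Selection:

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class InstanceNUMATopology:
        """Stand-in for Nova's InstanceNUMATopology object."""
        cpu_pinning: Optional[dict] = None


    @dataclass
    class Selection:
        """Stand-in for the scheduler's Selection object."""
        service_host: str
        nodename: str
        # Hypothetical new field: the topology as fitted by the NUMA topology
        # filter for this host, later copied into migrate_data by the
        # conductor's live migrate task so the destination can claim it.
        instance_numa_topology: Optional[InstanceNUMATopology] = None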
Data model impact
InstanceNUMATopology is added to LiveMigrateData.
REST API impact
None.
Security impact
None.
Notifications impact
None.
Other end user impact
None.
Performance Impact
None.
Other deployer impact
None.
Developer impact
None.
Upgrade impact
Hypothetically, how NUMA-aware live migration could be supported between version-mismatched compute hosts depends on which of the two hosts is older.
If the destination is older than the source, the source does not get an InstanceNUMATopology in migrate_data and can therefore choose to run an old-style live migration.
If the source is older than the destination, the new field in LiveMigrateData is ignored and the source's old live migration runs without issues. However, the destination has already claimed NUMA resources that the source never uses when generating the instance XML. The destination could conceivably check the source's compute service version and fail the migration before claiming resources if the source doesn't support NUMA live migration.
Given the current broken state of NUMA live migration, though, a simpler solution is to refuse to perform a NUMA live migration unless both the source and destination compute hosts have been upgraded to a version that supports it. To achieve this, the conductor can check the source and destination computes' service versions and fail the migration if either one is too old.
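A minimal sketch of such a conductor-side gate, assuming the service versions of both hosts have already been looked up; the constant value and the function name are hypothetical:

    # Hypothetical constant; the real value would be the compute service
    # version that first ships NUMA-aware live migration.
    MIN_NUMA_LM_SERVICE_VERSION = 40


    def ensure_numa_live_migration_supported(source_version, dest_version):
        """Fail the migration up front if either compute host is too old."""
        if min(source_version, dest_version) < MIN_NUMA_LM_SERVICE_VERSION:
            raise RuntimeError(
                'NUMA-aware live migration requires both the source and '
                'destination compute services to be upgraded; refusing to '
                'live migrate this instance.')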
Implementation
Assignee(s)
- Primary assignee: notartom
Work Items
- Add InstanceNUMATopology to LiveMigrateData.
- Modify the libvirt driver to generate live migration instance XML based on the InstanceNUMATopology in the migrate_data it receives from the destination (a rough sketch follows this list).
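As a rough illustration of the second work item, the standalone snippet below renders libvirt <cputune>/<vcpupin> elements from a vCPU-to-pCPU mapping. The real libvirt driver builds this through its own config object classes rather than raw ElementTree, so this only sketches the data flow:

    from xml.etree import ElementTree as ET


    def build_cputune_xml(cpu_pinning):
        """Render <cputune> vCPU pinning elements from a vCPU -> pCPU mapping."""
        cputune = ET.Element('cputune')
        for vcpu, pcpu in sorted(cpu_pinning.items()):
            ET.SubElement(cputune, 'vcpupin', vcpu=str(vcpu), cpuset=str(pcpu))
        return ET.tostring(cputune, encoding='unicode')


    # Example: the pinning computed on the destination, not the source's mapping.
    print(build_cputune_xml({0: 2, 1: 3}))
    # roughly: <cputune><vcpupin vcpu="0" cpuset="2" /><vcpupin vcpu="1" cpuset="3" /></cputune>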
Dependencies
None.
Testing
The libvirt/qemu driver used in the gate does not currently support NUMA features (though work is in progress [4]). Therefore, testing NUMA-aware live migration in the upstream gate would require nested virt. In addition, the only assertable outcome of a NUMA-aware live migration test (if it ever becomes possible) would be that the live migration succeeded. Examining the instance XML to assert things about its NUMA affinity or CPU pin mapping is explicitly out of Tempest's scope. For these reasons, NUMA-aware live migration is best tested in third party CI [5] or other downstream test scenarios [6].
Documentation Impact
Current live migration documentation does not mention the NUMA limitations anywhere. Therefore, a release note explaining the new NUMA capabilities of live migration should be enough.
References
History
Release Name | Description
------------ | -----------
Rocky        | Introduced