8.3 KiB
Expose auto converge and post copy
https://blueprints.launchpad.net/nova/+spec/expose-auto-converge-post-copy
Problem description
Currently auto converge and post copy can only be enabled/disabled via configuration, which is somewhat inflexible. If an application sensitive to reduced performance (some scientific computing applications may be more sensitive to memory access latency) is on a host with these options enabled, live migration may cause the application to raise an error. Therefore, the user wants to control whether to enable/disable auto converge or post copy during live migration.
Use Cases
- Some applications do not want increased risk of being rebooted due to a network failure or memory page access failure during post-copy live-migration.
- Some applications are performance sensitive (such as some scientific computing applications); such applications do not want performance throttled back by the auto-converge feature during live-migration.
- Some applications would like to avoid reboot risk and performance throttling. If the network between two compute nodes is interrupted during post-copy live-migration, the live-migration will fail and the user will need to reset the instance to make it available. Therefore such applications do not want use both features during live-migration.
- For the above problems, the operator wants to control whether a single instance enables auto converge or post copy during live migration. But currently the minimum unit that can be controlled is the compute node.
Proposed change
Support for auto converge and post copy requires QEMU version >=
2.5.0. Since the Rocky release, the minimum required version of QEMU is
2.5.01. Therefore, all compute nodes using
the libvirt driver should support these features. There are flags from
the libvirt virDomainMigrateFlags enum 2:
...
VIR_MIGRATE_AUTO_CONVERGE = 8192
VIR_MIGRATE_POSTCOPY = 32768
...
The configurations live_migration_permit_auto_converge
and live_migration_permit_post_copy can only affect the
hypervisor by modifying the configuration, but traits can affect a
single instance.
In order to request the feature (scheduling an instance to nodes that provide the feature) we propose defining two new traits. The traits are reported by the libvirt driver, regardless of the conf:
COMPUTE_MIGRATE_AUTO_CONVERGECOMPUTE_MIGRATE_POST_COPY
Introduce two new flavor extra specs:
compute:live_migration_auto_converge=true/falsecompute:live_migration_post_copy=true/false
And introduce two new image properties:
compute_live_migration_auto_converge=true/falsecompute_live_migration_post_copy=true/false
Use these properties, instead of asking the operator to set
required/forbidden on the traits. Before
calling placement, when
compute:live_migration_auto_converge=true or
compute:live_migration_post_copy=true, we add required
traits for the corresponding feature to the placement request. When
compute:live_migration_auto_converge=false and
compute:live_migration_post_copy=false, we just add nothing
to the placement request. Thus we still can schedule an instance on a
host with the features but we disable these two features for that
instance. We use these keys in the scheduler to optionally add required
traits to ensure that the instance can land on a host that is capable of
the requested behavior. The libvirt driver will then interpret the
values to decide whether to use the features during live migration. For
example, if the flavor says "false":
- We don't add the trait to the scheduling request, so the instance can land anywhere.
- The driver will not use the feature for live-migrate, regardless of what the compute's config says.
By default, when the operator creates an instance without any related
metadata, the scheduler will not care whether the host supports
auto-converge or post-copy. If the configurations
live_migration_permit_auto_converge or
live_migration_permit_post_copy are True, the libvirt
driver will prefer to use auto-converge or post-copy. These can be used
when the operator wants all instances on a given
compute node to use auto-converge/post-copy. For example:
- If an instance that has not requested related metadata is scheduled
to a host that enabled
live_migration_permit_auto_convergeorlive_migration_permit_post_copy, then libvirt will try to use auto-converge or post-copy during live migration.
If the operator creates instance with
compute:live_migration_auto_converge`=true/false or
compute:live_migration_post_copy=true/false, these metadata
will override the configurations:
live_migration_permit_auto_converge or
live_migration_permit_post_copy.
When compute:live_migration_auto_converge and
compute_live_migration_post_copy are both true or flavor
extra specs is in conflict with image properties, the 'create' API call
will raise an exception.
When using auto-converge during live migration, if the operator calls the force complete API, libvirt will not be converted to use post-copy because it's not required in flavor extra specs or image properties.
According to this spec3, if post-copy is enabled during live
migration, the abort API call will be rejected by libvirt driver. Now we
can reject the request in the API by checking
hw_live_migration_permit_reboot_risk properties.
Alternatives
Another method is to use traits in flavor extra_specs/image
properties. This could work well when the operators need
auto-converge/post-copy. But it can't be used to disable
auto-converge/post-copy. Since the Rocky release, all libvirt hypervisor
hosts support auto-converge/post-copy, which means every libvirt
hypervisor host would have traits
COMPUTE_MIGRATE_AUTO_CONVERGE and
COMPUTE_MIGRATE_POST_COPY. If operators want to not use
auto-converge or post-copy, they would use forbidden traits:
traits:COMPUTE_MIGRATE_AUTO_CONVERGE=forbidden or
traits:COMPUTE_MIGRATE_POST_COPY=forbidden. Which means
don't schedule my vm to the hosts who support
auto-converge/post-copy, as the above says, this means that all libvirt
compute nodes will be ignored. The result will be that the vm creation
failed because the compute node can't be scheduled.
Data model impact
Add the two image properties to the ImageMeta object:
- compute_live_migration_auto_converge
- compute_live_migration_post_copy
The ImageMeta is stored in table instance_system_metadata, no schema modification is needed.
REST API impact
None
Security impact
None
Notifications impact
None
Other end user impact
None
Performance Impact
None
Other deployer impact
None
Developer impact
None
Upgrade impact
None
Implementation
Assignee(s)
- Primary assignee:
-
Ya Wang
Work Items
- Support for new placement traits.
- Libvirt driver changes to report traits to placement, the traits
will be reported by the libvirt driver as part of
update_provider_tree. This will not be added to the generic compute capabilities dict inherited by all the virt drivers because these traits are libvirt-specific. - Scheduler changes to translate metadata to traits.
- Recalculate
_live_migration_flagsbefore live migration start in the libvirt driver. - Add functional tests and unit tests.
Dependencies
None
Testing
Unit tests and functional tests will be included to test the new functionality.
Documentation Impact
- The live migration document should be changed to introduce this new feature.
References
History
| Release Name | Description |
|---|---|
| Train | Introduced |