Libvirt driver emulator threads placement policy

The Nova scheduler determines CPU resource utilization and instance CPU placement based on the number of vCPUs in the flavor. A number of hypervisors perform work in the host OS on behalf of a guest instance which does not take place in association with a vCPU. This is currently unaccounted for in Nova scheduling and cannot have any placement policy controls applied.

Blueprint: libvirt-emulator-threads-policy
Change-Id: I069cd14ea89045136ae8a29c2d2f6c8c17157533

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Libvirt driver emulator threads placement policy
================================================

https://blueprints.launchpad.net/nova/+spec/libvirt-emulator-threads-policy

The Nova scheduler determines CPU resource utilization and instance
CPU placement based on the number of vCPUs in the flavor. A number of
hypervisors perform operations on behalf of the guest instance in the
host OS. These operations should be accounted for and scheduled
separately, as well as have their own placement policy controls
applied.

Problem description
===================

The Nova scheduler determines CPU resource utilization by counting the
number of vCPUs allocated for each guest. When doing overcommit, as
opposed to dedicated resources, this vCPU count is multiplied by an
overcommit ratio. This utilization is then used to determine optimal
guest placement across compute nodes, or within NUMA nodes.

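As a rough illustration, the accounting described above amounts to the
following sketch (the names are invented for this example, not Nova's
actual code)::

    def host_cpu_capacity(pcpus, cpu_allocation_ratio):
        # With overcommit, capacity is scaled by the allocation ratio;
        # with dedicated resources the ratio is effectively 1.0.
        return pcpus * cpu_allocation_ratio

    def host_cpu_usage(instances):
        # Only flavor vCPUs are counted; work the hypervisor performs
        # outside the vCPUs is invisible to this accounting.
        return sum(instance.flavor.vcpus for instance in instances)
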
A number of hypervisors, however, perform work on behalf of a guest
instance in an execution context that is not associated with the
virtual instance vCPUs. With KVM / QEMU, there are one or more threads
associated with the QEMU process which are used for the QEMU main
event loop, asynchronous I/O operation completion, migration data
transfer, SPICE display I/O and more. With Xen, if the stub-domain
feature is in use, there is an entire domain used to provide I/O
backends for the main domain.

Nova currently has no mechanism either to track this extra guest
instance compute requirement in order to measure utilization, or to
place any control over its execution policy.

The libvirt driver has implemented a generic placement policy for KVM
whereby the QEMU emulator threads are allowed to float across the same
pCPUs that the instance vCPUs are running on. In other words, the
emulator threads will steal some time from the vCPUs whenever they
have work to do. This is just about acceptable in the case where CPU
overcommit is being used. However, when guests want dedicated vCPU
allocation, there is a desire to be able to express other placement
policies, for example, to allocate one or more pCPUs to be dedicated
to a guest's emulator threads. This becomes critical as Nova continues
to implement support for real-time workloads, as it will not be
acceptable to allow emulator threads to steal time from real-time
vCPUs.

While it would be possible for the libvirt driver to add different
placement policies, unless the concept of emulator threads is exposed
to the scheduler in some manner, CPU usage cannot be expressed in a
satisfactory way. Thus there needs to be a way to describe to the
scheduler what other CPU usage may be associated with a guest, and to
account for that during placement.

Use Cases
---------

With current Nova real-time support in libvirt, there is a requirement
to reserve one vCPU for running non-realtime workloads. The QEMU
emulator threads are pinned to run on the same host pCPU as this
vCPU. While this requirement is just about acceptable for Linux
guests, it prevents use of Nova to run other real-time operating
systems which require realtime response for all vCPUs. To broaden the
realtime support it is necessary to pin emulator threads separately
from vCPUs, which requires that the scheduler be able to account for
extra pCPU usage per guest.

Project Priority
----------------

None

Proposed change
===============

A prerequisite for enabling the emulator threads placement policy
feature on a flavor is that it must also have 'hw:cpu_policy' set to
'dedicated'.

Each hypervisor has a different architecture; for example, QEMU has
emulator threads, while Xen has stub-domains. To avoid favoring any
specific implementation, the idea is to extend
`estimate_instance_overhead` to return 1 additional host CPU to take
into account during the claim. A user who wishes to isolate emulator
threads must use a flavor configured with:

* hw:cpu_emulator_threads=isolate

This says that the instance is to be considered to consume 1
additional host CPU. The pCPU used to run the emulator threads will
always be attached to guest NUMA node ID 0, to make its placement
predictable for users. Currently there is no desire to make the number
of host CPUs running emulator threads customizable, since one should
work for almost every use case. If in the future there is a desire to
isolate more than one host CPU to run emulator threads, we would
instead implement I/O threads to add granularity when dedicating host
CPU resources to run guests.

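A minimal sketch of how the libvirt driver could extend this hook (the
attribute and constant names here are illustrative, not a committed
interface)::

    ISOLATE = 'isolate'

    def estimate_instance_overhead(instance_info):
        """Estimate host resources consumed beyond the flavor request."""
        overhead = {'memory_mb': 0, 'vcpus': 0}
        extra_specs = instance_info.flavor.extra_specs
        if (extra_specs.get('hw:cpu_policy') == 'dedicated' and
                extra_specs.get('hw:cpu_emulator_threads') == ISOLATE):
            # One additional host CPU is claimed for the emulator
            # threads of this guest.
            overhead['vcpus'] = 1
        return overhead

The resource tracker would add this overhead to the claim in the same
way it already accounts for memory overhead today.
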
As noted above, an additional pCPU is going to be consumed, but this
first implementation is not going to update the user quotas. This is
in the spirit of simplicity, since quotas already leak in various
scenarios.

Alternatives
------------

We could use a host level tunable to just reserve a set of host pCPUs
for running emulator threads globally, instead of trying to account
for it per instance. This would work in the simple case, but when NUMA
is used, it is highly desirable to have more fine grained config to
control emulator thread placement. When real-time or dedicated CPUs
are used, it will be critical to separate emulator threads for
different KVM instances.

Another option is to hardcode an assumption that the vCPU count set
against the flavor implicitly includes 1 vCPU for the emulator,
e.g. a vCPU value of 5 would imply 4 actual vCPUs and 1 system
pseudo-vCPU. This would likely be extremely confusing to tenant users
and developers alike.

Doing nothing is always an option. If we did nothing, it would limit
the types of workload that can be run on Nova. This would have a
negative impact in particular on users making use of the dedicated
vCPU feature, as there would be no way to guarantee their vCPUs are
not pre-empted by emulator threads. It can be worked around to some
degree with realtime by setting a fixed policy that the emulator
threads only run on the vCPUs that have the non-realtime policy. This
requires that all guest OSes using realtime are SMP, but some guest
OSes want realtime and are only UP.

Data model impact
-----------------

The InstanceNUMATopology object will be extended to have a new field:

* cpu_emulator_threads=CPUEmulatorThreadsField()

This field will be implemented as an enum with two options:

* shared - The emulator threads float across the pCPUs associated
  with the guest.
* isolate - The emulator threads are isolated on a single pCPU.

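A plausible shape for the new field, following the enum field pattern
already used by Nova's versioned objects (names beyond those given in
this spec are illustrative)::

    from nova.objects import fields


    class CPUEmulatorThreadsPolicy(fields.Enum):
        SHARED = 'shared'
        ISOLATE = 'isolate'
        ALL = (SHARED, ISOLATE)

        def __init__(self):
            super(CPUEmulatorThreadsPolicy, self).__init__(
                valid_values=CPUEmulatorThreadsPolicy.ALL)


    class CPUEmulatorThreadsField(fields.BaseEnumField):
        AUTO_TYPE = CPUEmulatorThreadsPolicy()
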
By default 'shared' will be used. It is important to note that since
kernel commit [1], load balancing across CPUs isolated via the
'isolcpus=' kernel command line option has been removed. This means
the emulator threads will not float across the union of pCPUs
dedicated to the guest, but will instead be constrained to the pCPU
running vCPU 0.

[1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/47b8ea7186aae7f474ec4c98f43eaa8da719cd83%5E%21/#F0

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

For end users, using the option 'hw:cpu_emulator_threads=isolate' will
consume one additional host CPU beyond the guest vCPUs allocated.

Performance Impact
------------------

The NUMA and compute scheduler filters will have some changes to them,
but it is not anticipated that they will become more computationally
expensive to any measurable degree.

Other deployer impact
---------------------

Deployers who want to use this new feature will have to configure
their flavors to use a dedicated CPU policy (hw:cpu_policy=dedicated)
and at the same time set 'hw:cpu_emulator_threads' to 'isolate'.

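For example, a flavor could be tagged through python-novaclient roughly
as follows (the credentials, endpoint and flavor name are placeholders
for a real cloud)::

    from keystoneauth1 import loading
    from keystoneauth1 import session as ks_session
    from novaclient import client

    # Build an authenticated session; all values here are illustrative.
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(auth_url='http://keystone:5000/v3',
                                    username='admin', password='secret',
                                    project_name='admin',
                                    user_domain_name='Default',
                                    project_domain_name='Default')
    nova = client.Client('2', session=ks_session.Session(auth=auth))

    flavor = nova.flavors.find(name='rt-guest')  # hypothetical flavor
    flavor.set_keys({
        # Dedicated pinning is a prerequisite for isolating the
        # emulator threads on their own pCPU.
        'hw:cpu_policy': 'dedicated',
        'hw:cpu_emulator_threads': 'isolate',
    })
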
Developer impact
----------------

* Developers of other virtualization drivers may wish to make use of
  the new flavor extra spec property and scheduler accounting. This
  will be of particular interest to the Xen hypervisor, if using the
  stub domain feature.

* Developers of metrics or GUI systems have to take into account the
  host CPU overhead which is going to be consumed by instances with
  `hw:cpu_emulator_threads` set to `isolate`.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  sahid-ferdjaoui

Other contributors:
  berrange

Work Items
----------

* Enhance the flavor extra specs to take into account
  hw:cpu_emulator_threads
* Enhance InstanceNUMATopology to take into account cpu_emulator_threads
* Make the resource tracker handle 'estimate_instance_overhead' with
  vCPUs (see the sketch below)
* Extend estimate_instance_overhead for libvirt
* Make libvirt correctly pin emulator threads if requested.

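A minimal sketch of the resource tracker change referenced above (the
helper below is illustrative; the real change would live in the claim
and usage-accounting paths)::

    def cpu_usage_with_overhead(flavor_vcpus, overhead):
        # 'overhead' is the dict returned by the virt driver's
        # estimate_instance_overhead(); today it reports only memory,
        # this work adds a 'vcpus' key to it.
        return flavor_vcpus + overhead.get('vcpus', 0)
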
Dependencies
============

The realtime spec is not a pre-requisite, but is complementary to
this work:

* https://blueprints.launchpad.net/nova/+spec/libvirt-real-time
* https://review.openstack.org/#/c/139688/

Testing
=======

This can be tested in any CI system that is capable of testing the
current NUMA and dedicated CPUs policy, i.e. it requires the ability
to use KVM and not merely QEMU. Functional tests for the scheduling
and driver bits (libvirt) are going to be added.

Documentation Impact
====================

The documentation detailing NUMA and dedicated CPU policy usage will need
to be extended to also describe the new options this work introduces.

References
==========

History
=======