Libvirt driver emulator threads placement policy

The Nova scheduler determines CPU resource utilization and instance
CPU placement based on the number of vCPUs in the flavor. A number
of hypervisors perform work in the host OS on behalf of a guest
instance, which does not take place in association with a vCPU. This
work is currently unaccounted for in Nova scheduling and cannot have
any placement policy controls applied.

Blueprint: libvirt-emulator-threads-policy
Change-Id: I069cd14ea89045136ae8a29c2d2f6c8c17157533
Author: Daniel P. Berrange
Date: 2015-09-21 16:20:59 +01:00
Committed by: Sahid Orentino Ferdjaoui
parent 5e5b627aee
commit ad39cf6b87


@@ -0,0 +1,256 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Libvirt driver emulator threads placement policy
================================================

https://blueprints.launchpad.net/nova/+spec/libvirt-emulator-threads-policy

The Nova scheduler determines CPU resource utilization and instance
CPU placement based on the number of vCPUs in the flavor. A number of
hypervisors perform work on behalf of the guest instance in the host
OS. This work should be accounted for and scheduled separately, and
should have its own placement policy controls applied.

Problem description
===================

The Nova scheduler determines CPU resource utilization by counting the
number of vCPUs allocated for each guest. When doing overcommit, as
opposed to dedicated resources, this vCPU count is multiplied by an
overcommit ratio. This utilization is then used to determine optimal
guest placement across compute nodes, or within NUMA nodes.
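
As an illustration of the arithmetic above, a minimal sketch (the
function and values below are illustrative, not Nova's actual code)::

    # A host with 16 pCPUs and a 4.0 overcommit ratio can accept 64
    # vCPUs worth of guests; dedicated resources use a ratio of 1.0.
    def remaining_vcpus(pcpus, cpu_allocation_ratio, used_vcpus):
        return pcpus * cpu_allocation_ratio - used_vcpus

    print(remaining_vcpus(16, 4.0, 60))  # => 4.0
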
A number of hypervisors, however, perform work on behalf of a guest
instance in an execution context that is not associated with the
virtual instance vCPUs. With KVM / QEMU, there are one or more threads
associated with the QEMU process which are used for the QEMU main
event loop, asynchronous I/O operation completion, migration data
transfer, SPICE display I/O and more. With Xen, if the stub-domain
feature is in use, there is an entire domain used to provide I/O
backends for the main domain.

Nova currently has no mechanism either to track this extra guest
instance compute requirement in order to measure utilization, or to
apply any control over its execution policy.

The libvirt driver has implemented a generic placement policy for KVM
whereby the QEMU emulator threads are allowed to float across the same
pCPUs that the instance vCPUs are running on. In other words, the
emulator threads will steal some time from the vCPUs whenever they
have work to do. This is just about acceptable when CPU overcommit is
being used. When guests want dedicated vCPU allocation, however,
there is a desire to be able to express other placement policies, for
example, to allocate one or more pCPUs dedicated to a guest's
emulator threads. This becomes critical as Nova
continues to implement support for real-time workloads, as it will not
be acceptable to allow emulator threads to steal time from real-time
vCPUs.

While it would be possible for the libvirt driver to add different
placement policies, unless the concept of emulator threads is exposed
to the scheduler in some manner, CPU usage cannot be expressed
satisfactorily. Thus there needs to be a way to describe to the
scheduler what other CPU usage may be associated with a guest, and
account for that during placement.

Use Cases
---------

With the current Nova real-time support in libvirt, there is a
requirement to reserve one vCPU for running non-real-time workloads.
The QEMU emulator threads are pinned to run on the same host pCPU as
this vCPU. While this requirement is just about acceptable for Linux
guests, it prevents use of Nova to run other real-time operating
systems which require real-time response for all vCPUs. To broaden
the real-time support it is necessary to pin emulator threads
separately from vCPUs, which requires that the scheduler be able to
account for extra pCPU usage per guest.

Project Priority
----------------

None

Proposed change
===============

A prerequisite for enabling the emulator threads placement policy
feature on a flavor is that it must also have hw:cpu_policy set to
'dedicated'.

Each hypervisor has a different architecture: for example, QEMU has
emulator threads, while Xen has stub-domains. To avoid favoring any
specific implementation, the idea is to extend
`estimate_instance_overhead` to return 1 additional host CPU to take
into account during the claim. A user who wants to isolate emulator
threads must use a flavor configured with:

* hw:cpu_emulator_threads=isolate

This indicates that the instance is to be considered to consume 1
additional host CPU. The pCPU used to run the emulator threads will
always be associated with guest NUMA node ID 0, to make its placement
predictable for users. Currently there is no desire to make the
number of host CPUs running emulator threads customizable, since one
should suffice for almost every use case. If in the future there is a
desire to isolate more than one host CPU for emulator threads, we
would instead implement I/O threads to add granularity in dedicating
host CPUs to running guests.
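
As a sketch of how an operator might configure such a flavor with
python-novaclient (the flavor name and credentials below are purely
illustrative)::

    from novaclient import client

    # Placeholder credentials; any auth mechanism supported by
    # novaclient would do.
    nova = client.Client('2', 'admin', 'secret', 'admin',
                         'http://keystone:5000/v2.0')

    flavor = nova.flavors.find(name='rt.small')  # hypothetical flavor
    flavor.set_keys({'hw:cpu_policy': 'dedicated',
                     'hw:cpu_emulator_threads': 'isolate'})
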

As noted above, an additional pCPU is going to be consumed, but this
first implementation is not going to update the user quotas. This is
in the spirit of simplicity, since quotas already leak in various
scenarios.
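
A minimal sketch of how the libvirt driver could report this
overhead, assuming `estimate_instance_overhead` grows a 'vcpus' key
alongside its existing 'memory_mb' one (the structure shown is
illustrative, not the final code)::

    def estimate_instance_overhead(self, instance_info):
        # Report one extra host CPU for guests that request isolated
        # emulator threads; no overhead otherwise.
        overhead = {'memory_mb': 0, 'vcpus': 0}
        extra_specs = instance_info.flavor.extra_specs
        if extra_specs.get('hw:cpu_emulator_threads') == 'isolate':
            overhead['vcpus'] = 1
        return overhead
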

Alternatives
------------

We could use a host-level tunable to reserve a set of host pCPUs for
running emulator threads globally, instead of trying to account for
it per instance. This would work in the simple case, but when NUMA is
used, it is highly desirable to have more fine-grained configuration
to control emulator thread placement. When real-time or dedicated
CPUs are used, it will be critical to separate emulator threads for
different KVM instances.

Another option is to hardcode an assumption that the vCPU count set
against the flavor implicitly includes 1 vCPU for the emulator, e.g.
a vCPU value of 5 would imply 4 actual vCPUs and 1 system
pseudo-vCPU. This would likely be extremely confusing to tenant users
and developers alike.

Doing nothing is always an option. If we did nothing, it would limit
the types of workload that can be run on Nova. This would have a
negative impact in particular on users making use of the dedicated
vCPU feature, as there would be no way to guarantee their vCPUs are
not pre-empted by emulator threads. It can be worked around to some
degree for real-time guests by setting a fixed policy that the
emulator threads only run on the vCPUs that have a non-real-time
policy. However, this requires that all guest OSes using real-time
are SMP, whereas some guest OSes want real-time but are only UP.

Data model impact
-----------------

The InstanceNUMATopology object will be extended to have a new field:

* cpu_emulator_threads=CPUEmulatorThreadsField()

This field will be implemented as an enum with two options:

* shared - The emulator threads float across the pCPUs associated
  with the guest.
* isolate - The emulator threads are isolated on a single pCPU.
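
A minimal sketch of the new field, following the
oslo.versionedobjects enum pattern already used in
nova/objects/fields.py (the class names are illustrative and may
differ in the final code)::

    from oslo_versionedobjects import fields


    class CPUEmulatorThreadsPolicy(fields.Enum):
        # 'shared' is the default; 'isolate' claims a dedicated pCPU.
        SHARED = 'shared'
        ISOLATE = 'isolate'
        ALL = (SHARED, ISOLATE)

        def __init__(self):
            super(CPUEmulatorThreadsPolicy, self).__init__(
                valid_values=CPUEmulatorThreadsPolicy.ALL)


    class CPUEmulatorThreadsField(fields.BaseEnumField):
        AUTO_TYPE = CPUEmulatorThreadsPolicy()
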

By default 'shared' will be used. It is important to note that since
[1], the kernel no longer load-balances tasks across CPUs isolated
via the 'isolcpus=' kernel command line option. This means that the
emulator threads are not going to float across the union of pCPUs
dedicated to the guest, but will instead be constrained to the pCPU
running vCPU 0.

[1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/47b8ea7186aae7f474ec4c98f43eaa8da719cd83%5E%21/#F0

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

For end users, setting 'hw:cpu_emulator_threads' to 'isolate' is
going to consume one additional host CPU beyond the guest vCPUs
allocated by the flavor: for example, a flavor with 4 dedicated vCPUs
and the isolate policy occupies 5 host CPUs in total.

Performance Impact
------------------

The NUMA and compute scheduler filters will have some changes to
them, but it is not anticipated that they will become more
computationally expensive to any measurable degree.

Other deployer impact
---------------------

Deployers who want to use this new feature will have to configure
their flavors to use a dedicated CPU policy (hw:cpu_policy=dedicated)
and, at the same time, set 'hw:cpu_emulator_threads' to 'isolate'.

Developer impact
----------------

* Developers of other virtualization drivers may wish to make use of
  the new flavor extra spec property and scheduler accounting. This
  will be of particular interest to the Xen hypervisor, if using the
  stub domain feature.

* Developers of metrics or GUI systems have to take into account the
  host CPU overhead that is going to be consumed by instances with
  `cpu_emulator_threads` set to `isolate`.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  sahid-ferdjaoui

Other contributors:
  berrange

Work Items
----------

* Enhance the flavor extra specs to take into account
  hw:cpu_emulator_threads.
* Enhance InstanceNUMATopology to take into account
  cpu_emulator_threads.
* Make the resource tracker handle 'estimate_instance_overhead'
  returning vCPUs.
* Extend estimate_instance_overhead for libvirt.
* Make libvirt correctly pin emulator threads when requested (see the
  example below).
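
For reference, libvirt already exposes emulator thread placement via
the <emulatorpin> element of <cputune>, so a guest whose flavor
requests the isolate policy could end up with domain XML along these
lines (the cpuset values are purely illustrative)::

    <cputune>
      <vcpupin vcpu='0' cpuset='2'/>
      <vcpupin vcpu='1' cpuset='3'/>
      <!-- emulator threads pinned to a dedicated host CPU -->
      <emulatorpin cpuset='1'/>
    </cputune>
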

Dependencies
============

The real-time spec is not a prerequisite, but is complementary to
this work:

* https://blueprints.launchpad.net/nova/+spec/libvirt-real-time
* https://review.openstack.org/#/c/139688/

Testing
=======

This can be tested in any CI system that is capable of testing the
current NUMA and dedicated CPU policy, i.e. it requires the ability
to use KVM and not merely QEMU. Functional tests for the scheduling
and driver (libvirt) bits are going to be added.

Documentation Impact
====================

The documentation detailing NUMA and dedicated CPU policy usage will
need to be extended to also describe the new options this work
introduces.

References
==========

History
=======