Libvirt driver emulator threads placement policy

The Nova scheduler determines CPU resource utilization and instance CPU placement based on the number of vCPUs in the flavor. A number of hypervisors perform work in the host OS on behalf of a guest instance which does not take place in association with a vCPU. This is currently unaccounted for in Nova scheduling and cannot have any placement policy controls applied.

Blueprint: libvirt-emulator-threads-policy
Change-Id: I069cd14ea89045136ae8a29c2d2f6c8c17157533

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Libvirt driver emulator threads placement policy
================================================

https://blueprints.launchpad.net/nova/+spec/libvirt-emulator-threads-policy

The Nova scheduler determines CPU resource utilization and instance
CPU placement based on the number of vCPUs in the flavor. A number of
hypervisors perform operations on behalf of the guest instance in the
host OS. These operations should be accounted for and scheduled
separately, as well as have their own placement policy controls
applied.

Problem description
===================

The Nova scheduler determines CPU resource utilization by counting the
number of vCPUs allocated for each guest. When doing overcommit, as
opposed to dedicated resources, this vCPU count is multiplied by an
overcommit ratio. This utilization is then used to determine optimal
guest placement across compute nodes, or within NUMA nodes.

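As a rough illustration, the accounting described above amounts to the
following sketch (the names are invented for this example, not Nova's
actual code)::

    def host_cpu_capacity(pcpus, cpu_allocation_ratio):
        # With overcommit, capacity is scaled by the allocation ratio;
        # with dedicated resources the ratio is effectively 1.0.
        return pcpus * cpu_allocation_ratio

    def host_cpu_usage(instances):
        # Only flavor vCPUs are counted; work the hypervisor performs
        # outside the vCPUs is invisible to this accounting.
        return sum(instance.flavor.vcpus for instance in instances)
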
A number of hypervisors, however, perform work on behalf of a guest
instance in an execution context that is not associated with the
virtual instance vCPUs. With KVM / QEMU, there are one or more threads
associated with the QEMU process which are used for the QEMU main
event loop, asynchronous I/O operation completion, migration data
transfer, SPICE display I/O and more. With Xen, if the stub-domain
feature is in use, there is an entire domain used to provide I/O
backends for the main domain.

Nova currently has no mechanism either to track this extra guest
instance compute requirement in order to measure utilization, or to
place any control over its execution policy.

The libvirt driver has implemented a generic placement policy for KVM
whereby the QEMU emulator threads are allowed to float across the same
pCPUs that the instance vCPUs are running on. In other words, the
emulator threads will steal some time from the vCPUs whenever they
have work to do. This is just about acceptable in the case where CPU
overcommit is being used. However, when guests want dedicated vCPU
allocation, there is a desire to be able to express other placement
policies, for example, to allocate one or more pCPUs to be dedicated
to a guest's emulator threads. This becomes critical as Nova continues
to implement support for real-time workloads, as it will not be
acceptable to allow emulator threads to steal time from real-time
vCPUs.

While it would be possible for the libvirt driver to add different
placement policies, unless the concept of emulator threads is exposed
to the scheduler in some manner, CPU usage cannot be expressed in a
satisfactory way. Thus there needs to be a way to describe to the
scheduler what other CPU usage may be associated with a guest, and to
account for that during placement.

Use Cases
---------

With current Nova real-time support in libvirt, there is a requirement
to reserve one vCPU for running non-realtime workloads. The QEMU
emulator threads are pinned to run on the same host pCPU as this
vCPU. While this requirement is just about acceptable for Linux
guests, it prevents use of Nova to run other real-time operating
systems which require realtime response for all vCPUs. To broaden the
realtime support it is necessary to pin emulator threads separately
from vCPUs, which requires that the scheduler be able to account for
extra pCPU usage per guest.

Project Priority
----------------

None

Proposed change
===============

A prerequisite for enabling the emulator threads placement policy
feature on a flavor is that it must also have 'hw:cpu_policy' set to
'dedicated'.

Each hypervisor has a different architecture; for example, QEMU has
emulator threads, while Xen has stub-domains. To avoid favoring any
specific implementation, the idea is to extend
`estimate_instance_overhead` to return 1 additional host CPU to take
into account during the claim. A user who wishes to isolate emulator
threads must use a flavor configured with:

* hw:cpu_emulator_threads=isolate

This says that the instance is to be considered to consume 1
additional host CPU. The pCPU used to run the emulator threads will
always be attached to guest NUMA node ID 0, to make its placement
predictable for users. Currently there is no desire to make the number
of host CPUs running emulator threads customizable, since one should
work for almost every use case. If in the future there is a desire to
isolate more than one host CPU to run emulator threads, we would
instead implement I/O threads to add granularity when dedicating host
CPU resources to run guests.

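A minimal sketch of how the libvirt driver could extend this hook (the
attribute and constant names here are illustrative, not a committed
interface)::

    ISOLATE = 'isolate'

    def estimate_instance_overhead(instance_info):
        """Estimate host resources consumed beyond the flavor request."""
        overhead = {'memory_mb': 0, 'vcpus': 0}
        extra_specs = instance_info.flavor.extra_specs
        if (extra_specs.get('hw:cpu_policy') == 'dedicated' and
                extra_specs.get('hw:cpu_emulator_threads') == ISOLATE):
            # One additional host CPU is claimed for the emulator
            # threads of this guest.
            overhead['vcpus'] = 1
        return overhead

The resource tracker would add this overhead to the claim in the same
way it already accounts for memory overhead today.
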
As noted above, an additional pCPU is going to be consumed, but this
first implementation is not going to update the user quotas. This is
in the spirit of simplicity, since quotas already leak in various
scenarios.

Alternatives
------------

We could use a host level tunable to just reserve a set of host pCPUs
for running emulator threads globally, instead of trying to account
for it per instance. This would work in the simple case, but when NUMA
is used, it is highly desirable to have more fine grained config to
control emulator thread placement. When real-time or dedicated CPUs
are used, it will be critical to separate emulator threads for
different KVM instances.

Another option is to hardcode an assumption that the vCPU count set
against the flavor implicitly includes 1 vCPU for the emulator,
e.g. a vCPU value of 5 would imply 4 actual vCPUs and 1 system
pseudo-vCPU. This would likely be extremely confusing to tenant users
and developers alike.

Doing nothing is always an option. If we did nothing, it would limit
the types of workload that can be run on Nova. This would have a
negative impact in particular on users making use of the dedicated
vCPU feature, as there would be no way to guarantee their vCPUs are
not pre-empted by emulator threads. It can be worked around to some
degree with realtime by setting a fixed policy that the emulator
threads only run on the vCPUs that have the non-realtime policy. This
requires that all guest OSes using realtime are SMP, but some guest
OSes want realtime and are only UP.

Data model impact
-----------------

The InstanceNUMATopology object will be extended to have a new field:

* cpu_emulator_threads=CPUEmulatorThreadsField()

This field will be implemented as an enum with two options:

* shared - The emulator threads float across the pCPUs associated
  with the guest.
* isolate - The emulator threads are isolated on a single pCPU.

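A plausible shape for the new field, following the enum field pattern
already used by Nova's versioned objects (names beyond those given in
this spec are illustrative)::

    from nova.objects import fields


    class CPUEmulatorThreadsPolicy(fields.Enum):
        SHARED = 'shared'
        ISOLATE = 'isolate'
        ALL = (SHARED, ISOLATE)

        def __init__(self):
            super(CPUEmulatorThreadsPolicy, self).__init__(
                valid_values=CPUEmulatorThreadsPolicy.ALL)


    class CPUEmulatorThreadsField(fields.BaseEnumField):
        AUTO_TYPE = CPUEmulatorThreadsPolicy()
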
By default 'shared' will be used. It is important to note that since
kernel commit [1], load balancing across CPUs isolated via the
'isolcpus=' kernel command line option has been removed. This means
the emulator threads will not float across the union of pCPUs
dedicated to the guest, but will instead be constrained to the pCPU
running vCPU 0.

[1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/47b8ea7186aae7f474ec4c98f43eaa8da719cd83%5E%21/#F0

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

For end users, using the option 'hw:cpu_emulator_threads=isolate' will
consume one additional host CPU beyond the guest vCPUs allocated.

Performance Impact
------------------

The NUMA and compute scheduler filters will have some changes to them,
but it is not anticipated that they will become more computationally
expensive to any measurable degree.

Other deployer impact
---------------------

Deployers who want to use this new feature will have to configure
their flavors to use a dedicated CPU policy (hw:cpu_policy=dedicated)
and at the same time set 'hw:cpu_emulator_threads' to 'isolate'.

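For example, a flavor could be tagged through python-novaclient roughly
as follows (the credentials, endpoint and flavor name are placeholders
for a real cloud)::

    from keystoneauth1 import loading
    from keystoneauth1 import session as ks_session
    from novaclient import client

    # Build an authenticated session; all values here are illustrative.
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(auth_url='http://keystone:5000/v3',
                                    username='admin', password='secret',
                                    project_name='admin',
                                    user_domain_name='Default',
                                    project_domain_name='Default')
    nova = client.Client('2', session=ks_session.Session(auth=auth))

    flavor = nova.flavors.find(name='rt-guest')  # hypothetical flavor
    flavor.set_keys({
        # Dedicated pinning is a prerequisite for isolating the
        # emulator threads on their own pCPU.
        'hw:cpu_policy': 'dedicated',
        'hw:cpu_emulator_threads': 'isolate',
    })
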
Developer impact
----------------

* Developers of other virtualization drivers may wish to make use of
  the new flavor extra spec property and scheduler accounting. This
  will be of particular interest to the Xen hypervisor, if using the
  stub domain feature.

* Developers of metrics or GUI systems have to take into account the
  host CPU overhead which is going to be consumed by instances with
  `hw:cpu_emulator_threads` set to `isolate`.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  sahid-ferdjaoui

Other contributors:
  berrange

Work Items
----------

* Enhance the flavor extra specs to take into account
  hw:cpu_emulator_threads
* Enhance InstanceNUMATopology to take into account cpu_emulator_threads
* Make the resource tracker handle 'estimate_instance_overhead' with
  vCPUs (see the sketch below)
* Extend estimate_instance_overhead for libvirt
* Make libvirt correctly pin emulator threads if requested.

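A minimal sketch of the resource tracker change referenced above (the
helper below is illustrative; the real change would live in the claim
and usage-accounting paths)::

    def cpu_usage_with_overhead(flavor_vcpus, overhead):
        # 'overhead' is the dict returned by the virt driver's
        # estimate_instance_overhead(); today it reports only memory,
        # this work adds a 'vcpus' key to it.
        return flavor_vcpus + overhead.get('vcpus', 0)
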
Dependencies
============

The realtime spec is not a pre-requisite, but is complementary to
this work:

* https://blueprints.launchpad.net/nova/+spec/libvirt-real-time
* https://review.openstack.org/#/c/139688/

Testing
=======

This can be tested in any CI system that is capable of testing the
current NUMA and dedicated CPUs policy, i.e. it requires the ability
to use KVM and not merely QEMU. Functional tests for the scheduling
and driver bits (libvirt) are going to be added.

Documentation Impact
====================

The documentation detailing NUMA and dedicated CPU policy usage will need
to be extended to also describe the new options this work introduces.

References
==========

History
=======