..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================
Virt driver pinning guest vCPUs to host pCPUs
=============================================

https://blueprints.launchpad.net/nova/+spec/virt-driver-cpu-pinning

This feature aims to improve the libvirt driver so that it is able to strictly
pin guest vCPUs to host pCPUs. This provides the concept of "dedicated CPU"
guest instances.

Problem description
===================

If a host is permitting overcommit of CPUs, there can be prolonged time
periods where a guest vCPU is not scheduled by the host, if another guest is
competing for the CPU time. This means that workloads executing in a guest can
have unpredictable latency, which may be unacceptable for the type of
application being run.

Depending on the workload being executed, the end user or admin may wish to
have control over how the guest uses hyperthreads. To maximise cache
efficiency, the guest may wish to be pinned to thread siblings. Conversely,
the guest may wish to avoid thread siblings (ie only pin to 1 sibling)
or even avoid hosts with threads entirely.

Proposed change
===============

The flavor extra specs will be enhanced to support two new parameters (see the
parsing sketch after this list):

* hw:cpu_policy=shared|dedicated
* hw:cpu_threads_policy=avoid|separate|isolate|prefer
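
As a rough illustration of the intended semantics, the virt driver or
scheduler could interpret these extra specs along the following lines,
defaulting to the current behaviour when they are absent. The function and
constant names below are hypothetical, not the actual Nova implementation::

  # Hypothetical sketch of how the new extra specs might be interpreted.
  CPU_POLICIES = ("shared", "dedicated")
  THREADS_POLICIES = ("avoid", "separate", "isolate", "prefer")

  def parse_cpu_policies(extra_specs):
      """Return (cpu_policy, cpu_threads_policy) from flavor extra specs."""
      policy = extra_specs.get("hw:cpu_policy", "shared")
      if policy not in CPU_POLICIES:
          raise ValueError("Unknown hw:cpu_policy '%s'" % policy)

      threads_policy = extra_specs.get("hw:cpu_threads_policy")
      if threads_policy is not None and threads_policy not in THREADS_POLICIES:
          raise ValueError(
              "Unknown hw:cpu_threads_policy '%s'" % threads_policy)

      # The threads policy is only meaningful for dedicated (pinned) guests.
      if policy != "dedicated":
          threads_policy = None
      return policy, threads_policy

  # Example: a flavor requesting dedicated CPUs and isolated thread siblings.
  print(parse_cpu_policies({"hw:cpu_policy": "dedicated",
                            "hw:cpu_threads_policy": "isolate"}))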

If the policy is set to 'shared' no change will be made compared to the current
default guest CPU placement policy. The guest vCPUs will be allowed to freely
float across host pCPUs, albeit potentially constrained by NUMA policy. If the
policy is set to 'dedicated' then the guest vCPUs will be strictly pinned to a
set of host pCPUs. In the absence of an explicit vCPU topology request, the
virt drivers typically expose all vCPUs as sockets with 1 core and 1 thread.
When strict CPU pinning is in effect the guest CPU topology will be set up to
match the topology of the CPUs to which it is pinned, ie if a 2 vCPU guest is
pinned to a single host core with 2 threads, then the guest will get a topology
of 1 socket, 1 core, 2 threads.
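
As a minimal sketch (not the actual libvirt driver code), the end result of
the 'dedicated' policy would be guest XML along the lines of libvirt's
<cputune>/<vcpupin> and <cpu>/<topology> elements, with one <vcpupin> entry
per guest vCPU. The helper below simply renders such a fragment from an
already computed vCPU to pCPU pinning map, which is assumed as input::

  def render_pinning_xml(pinning, sockets, cores, threads):
      """Render libvirt-style pinning XML for a vCPU->pCPU map (sketch only)."""
      lines = ["<cputune>"]
      for vcpu in sorted(pinning):
          lines.append("  <vcpupin vcpu='%d' cpuset='%d'/>"
                       % (vcpu, pinning[vcpu]))
      lines.append("</cputune>")
      lines.append("<cpu>")
      lines.append("  <topology sockets='%d' cores='%d' threads='%d'/>"
                   % (sockets, cores, threads))
      lines.append("</cpu>")
      return "\n".join(lines)

  # A 2 vCPU guest pinned to one host core with 2 threads (pCPUs 4 and 12)
  # gets a matching 1 socket, 1 core, 2 thread topology.
  print(render_pinning_xml({0: 4, 1: 12}, sockets=1, cores=1, threads=2))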

The threads policy will control how the scheduler / virt driver place guests
with respect to CPU threads. It will only apply if the CPU policy is
'dedicated'. A placement sketch for each policy follows the list.

- avoid: the scheduler will not place the guest on a host which has
  hyperthreads.
- separate: if the host has threads, each vCPU will be placed on a
  different core, ie no two vCPUs will be placed on thread siblings.
- isolate: if the host has threads, each vCPU will be placed on a
  different core and no vCPUs from other guests will be able to be
  placed on the same core, ie one thread sibling is guaranteed to
  always be unused.
- prefer: if the host has threads, vCPUs will be placed on the same
  core, so they are thread siblings.
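
How the scheduler might evaluate these policies against a host can be
sketched as follows. This is illustrative logic only: the helper name, the
representation of thread siblings as sets of pCPU ids, and the used_pcpus
bookkeeping are assumptions for this sketch, not the final design::

  def host_fits_threads_policy(policy, sibling_sets, used_pcpus, num_vcpus):
      """Can num_vcpus dedicated vCPUs be placed under the given policy?

      sibling_sets: list of sets, each set holding the pCPU ids of one core.
      used_pcpus:   set of pCPU ids already pinned to other guests.
      """
      has_threads = any(len(core) > 1 for core in sibling_sets)
      free_cores = [core - used_pcpus for core in sibling_sets]

      if policy == "avoid":
          # Reject any host that has hyperthreads at all.
          return not has_threads
      if policy == "separate":
          # One vCPU per core: enough cores need at least one free thread.
          return sum(1 for core in free_cores if core) >= num_vcpus
      if policy == "isolate":
          # One vCPU per core, and the whole core must be otherwise unused.
          fully_free = [core for core, orig in zip(free_cores, sibling_sets)
                        if core == orig]
          return len(fully_free) >= num_vcpus
      if policy == "prefer":
          # Pack vCPUs onto thread siblings where possible; any free threads
          # suffice here, the actual packing being a placement detail.
          return sum(len(core) for core in free_cores) >= num_vcpus
      raise ValueError("Unknown threads policy '%s'" % policy)

  # Host with two 2-thread cores, pCPUs 0/4 and 1/5, where pCPU 0 is taken.
  siblings = [{0, 4}, {1, 5}]
  print(host_fits_threads_policy("isolate", siblings, {0}, 1))   # True
  print(host_fits_threads_policy("isolate", siblings, {0}, 2))   # False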

The image metadata properties will also allow specification of the
threads policy

* hw_cpu_threads_policy=avoid|separate|isolate|prefer

This will only be honoured if the flavor does not already have a threads
policy set. This ensures the cloud administrator can have absolute control
over the threads policy if desired.
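
The precedence rule is simple: the flavor always wins. A sketch of the
resolution, using a hypothetical helper rather than the actual
implementation::

  def effective_threads_policy(flavor_extra_specs, image_properties):
      """Flavor setting overrides the image; image applies only as fallback."""
      return (flavor_extra_specs.get("hw:cpu_threads_policy")
              or image_properties.get("hw_cpu_threads_policy"))

  # Image asks to avoid hyperthreads, but the flavor mandates 'isolate'.
  print(effective_threads_policy({"hw:cpu_threads_policy": "isolate"},
                                 {"hw_cpu_threads_policy": "avoid"}))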

The scheduler will have to be enhanced so that it considers the usage of CPUs
by existing guests. Use of a dedicated CPU policy will have to be accompanied
by the setup of aggregates to split the hosts into two groups, one allowing
overcommit of shared pCPUs and the other only allowing dedicated CPU guests,
ie we do not want a situation with dedicated CPU and shared CPU guests on the
same host. It is likely that the administrator will already need to set up
host aggregates for the purpose of using huge pages for guest RAM. The same
grouping will be usable for both dedicated RAM (via huge pages) and dedicated
CPUs (via pinning). A sketch of the per-host accounting follows.
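
For the "considers the usage of CPUs by existing guests" part, the host state
would need to track which pCPUs are already claimed by pinned guests. The
class below is a toy model of that bookkeeping; its name and structure are
assumptions for illustration, not the proposed data model::

  class PinnedCPUTracker(object):
      """Toy model of per-host accounting for dedicated (pinned) pCPUs."""

      def __init__(self, available_pcpus):
          self.available = set(available_pcpus)   # pCPUs usable for pinning
          self.claimed = set()                    # pCPUs pinned to guests

      def free_pcpus(self):
          return self.available - self.claimed

      def claim(self, num_vcpus):
          """Claim pCPUs for a guest, or None if it does not fit."""
          free = sorted(self.free_pcpus())
          if len(free) < num_vcpus:
              return None
          chosen = free[:num_vcpus]
          self.claimed.update(chosen)
          return chosen

  # Two 4-vCPU dedicated guests on a host with 6 pCPUs set aside for pinning.
  tracker = PinnedCPUTracker(range(2, 8))
  print(tracker.claim(4))   # [2, 3, 4, 5]
  print(tracker.claim(4))   # None - the second guest does not fit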

The compute host already has a notion of CPU sockets which are reserved for
execution of base operating system services. This facility will be preserved
unchanged, ie dedicated CPU guests will only be placed on CPUs which are not
marked as reserved for the base OS.
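
In other words, the set of pCPUs eligible for pinning is simply the host's
online pCPUs minus those reserved for the base OS; something along these
lines, with the configuration source for the reserved set left abstract and
the helper name purely illustrative::

  def pinnable_pcpus(online_pcpus, reserved_for_host_os):
      """pCPUs that dedicated guests may be pinned to."""
      return sorted(set(online_pcpus) - set(reserved_for_host_os))

  # An 8-pCPU host keeping pCPUs 0 and 1 for base OS services.
  print(pinnable_pcpus(range(8), [0, 1]))   # [2, 3, 4, 5, 6, 7]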

Alternatives
------------

There is no alternative way to ensure that a guest has predictable execution
latency, free of cache effects from other guests running on the host, that
does not involve CPU pinning.

The proposed solution is to use host aggregates for grouping compute hosts
into those for dedicated vs overcommit CPU policy. An alternative would be to
allow compute hosts to have both dedicated and overcommit guests, splitting
them onto separate sockets, ie if there were four sockets, two sockets could
be used for dedicated CPU guests while two sockets could be used for
overcommit guests, with usage determined on a first-come, first-served basis.
A problem with this approach is that there is no strict workload isolation
even if separate sockets are used. Cache effects can be observed, and the
guests will also contend for memory access, so the overcommit guests can
negatively impact performance of the dedicated CPU guests even if on separate
sockets. So while this would be simpler from an administrative POV, it would
not give the same performance guarantees that are important for NFV use
cases. It would nonetheless be possible to enhance the design in the future,
so that overcommit & dedicated CPU guests could co-exist on the same host for
those use cases where admin simplicity is more important than perfect
performance isolation. It is believed that it is better to start off with the
simpler-to-implement design based on host aggregates for the first iteration
of this feature.

Data model impact
-----------------

No impact.

The new data items are stored in the existing flavor extra specs data model
and in the host state metadata model.

REST API impact
---------------

No impact.

The existing APIs already support arbitrary data in the flavor extra specs.

Security impact
---------------

No impact.

Notifications impact
--------------------

No impact.

The notifications system is not used by this change.

Other end user impact
---------------------

There are no changes that directly impact the end user, other than the fact
that their guest should have more predictable CPU execution latency.

Performance Impact
------------------

The scheduler will incur a small further overhead if a threads policy is set
on the image or flavor. This overhead will be negligible compared to that
implied by the enhancements to support NUMA policy and huge pages. It is
anticipated that dedicated CPU guests will typically be used in conjunction
with huge pages.

Other deployer impact
---------------------

The cloud administrator will gain the ability to define flavors which offer
dedicated CPU resources. The administrator will have to place hosts into
groups using aggregates such that the scheduler can separate placement of
guests with dedicated vs shared CPUs. Although not required by this design,
it is expected that the administrator will commonly use the same host
aggregates to group hosts for both CPU pinning and large page usage, since
these concepts are complementary and expected to be used together. This will
minimise the administrative burden of configuring host aggregates.

Developer impact
----------------

It is expected that most hypervisors will have the ability to set up
dedicated pCPUs for guests vs shared pCPUs. The flavor parameter is simple
enough that any Nova driver would be able to support it.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  berrange

Other contributors:
  ndipanov

Work Items
----------

* Enhance libvirt to support setup of strict CPU pinning for guests when the
  appropriate policy is set in the flavor.

* Enhance the scheduler to take account of the threads policy when choosing
  which host to place the guest on.

Dependencies
============

* Virt driver guest NUMA node placement & topology

  https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

Testing
=======

It is unknown at this time if the gate hosts have sufficient pCPUs available
to allow this feature to be effectively tested by tempest.

Documentation Impact
====================

The new flavor parameter available to the cloud administrator needs to be
documented along with recommendations about effective usage. The docs will
also need to mention the compute host deployment pre-requisites, such as the
need to set up aggregates.

References
==========

Current "big picture" research and design for the topic of CPU and memory
resource utilization and placement. vCPU topology is a subset of this
work.

* https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement