..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================
Virt driver pinning guest vCPUs to host pCPUs
=============================================

https://blueprints.launchpad.net/nova/+spec/virt-driver-cpu-pinning

This feature aims to improve the libvirt driver so that it is able to strictly
pin guest vCPUs to host pCPUs. This provides the concept of "dedicated CPU"
guest instances.

Problem description
===================

If a host permits overcommit of CPUs, there can be prolonged periods during
which a guest vCPU is not scheduled by the host because another guest is
competing for the CPU time. This means that workloads executing in a guest can
have unpredictable latency, which may be unacceptable for the type of
application being run.

Use Cases
---------

Depending on the workload being executed, the end user or cloud admin may
wish to have control over which physical CPUs (pCPUs) are utilized by the
virtual CPUs (vCPUs) of any given instance.

Project Priority
----------------

None

Proposed change
===============

The flavor extra specs will be enhanced to support one new parameter:

* hw:cpu_policy=shared|dedicated

If the policy is set to 'shared', no change will be made compared to the
current default guest CPU placement policy: the guest vCPUs will be allowed
to float freely across host pCPUs, albeit potentially constrained by NUMA
policy. If the policy is set to 'dedicated', the guest vCPUs will be strictly
pinned to a set of host pCPUs. In the absence of an explicit vCPU topology
request, the virt drivers typically expose all vCPUs as sockets with 1 core
and 1 thread. When strict CPU pinning is in effect, the guest CPU topology
will be set up to match the topology of the CPUs to which it is pinned, i.e.
if a 2 vCPU guest is pinned to a single host core with 2 threads, then the
guest will get a topology of 1 socket, 1 core, 2 threads.
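
The following is a minimal, self-contained sketch (not the actual Nova or
libvirt driver code) of the topology-matching rule just described. The
function name and data shapes are illustrative only; the 'siblings' input
mirrors the 'sib' lists shown later in the data model section.

::

  def guest_topology_for_pinning(pinned_pcpus, siblings):
      # pinned_pcpus: set of host CPU ids the guest vCPUs are pinned to.
      # siblings: list of sets, each holding the thread ids of one host core.
      # Returns (sockets, cores, threads); a single socket/NUMA cell is
      # assumed for simplicity.
      used_per_core = [len(pinned_pcpus & core) for core in siblings
                       if pinned_pcpus & core]
      cores = len(used_per_core)
      threads = max(used_per_core) if used_per_core else 1
      return (1, cores, threads)

  # The example from the text: a 2 vCPU guest pinned to one host core
  # that has 2 threads gets a 1 socket, 1 core, 2 thread topology.
  print(guest_topology_for_pinning({0, 1}, [{0, 1}, {2, 3}]))  # (1, 1, 2)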

The image metadata properties will also allow specification of the pinning
policy:

* hw_cpu_policy=shared|dedicated
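
As a purely illustrative sketch (not Nova code), the effective policy could
be read from the two sources as follows; the function, the example values and
the flavor-over-image precedence are assumptions of the sketch, not something
defined by this spec.

::

  def cpu_policy(flavor_extra_specs, image_properties):
      # Prefer the flavor extra spec, fall back to the image property,
      # and default to the current 'shared' behaviour.
      policy = (flavor_extra_specs.get('hw:cpu_policy')
                or image_properties.get('hw_cpu_policy')
                or 'shared')
      if policy not in ('shared', 'dedicated'):
          raise ValueError('invalid CPU policy: %s' % policy)
      return policy

  print(cpu_policy({'hw:cpu_policy': 'dedicated'}, {}))  # dedicated
  print(cpu_policy({}, {'hw_cpu_policy': 'shared'}))     # shared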

.. NOTE::
    The original definition of this specification included support for
    configurable CPU thread policies. However, this part of the spec was not
    implemented in OpenStack "Kilo" and has since been extracted into a
    separate proposal attached to
    https://blueprints.launchpad.net/nova/+spec/virt-driver-cpu-thread-pinning.

The scheduler will have to be enhanced so that it considers the usage of CPUs
by existing guests. Use of a dedicated CPU policy will have to be accompanied
by the setup of aggregates to split the hosts into two groups, one allowing
overcommit of shared pCPUs and the other only allowing dedicated CPU guests,
i.e. we do not want a situation with dedicated CPU and shared CPU guests on
the same host. It is likely that the administrator will already need to set
up host aggregates for the purpose of using huge pages for guest RAM. The
same grouping will be usable for both dedicated RAM (via huge pages) and
dedicated CPUs (via pinning).
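
To make the grouping concrete, here is a minimal, self-contained sketch (not
the Nova scheduler) of the host split described above; the aggregate names,
host names and metadata key are placeholders invented for the example.

::

  # Hosts are partitioned via aggregate metadata into a dedicated-CPU group
  # and a shared/overcommit group; a request is only considered against
  # hosts whose group matches its CPU policy.
  AGGREGATES = {
      'dedicated-cpus': {'metadata': {'pinned': 'true'},
                         'hosts': {'compute1', 'compute2'}},
      'shared-cpus': {'metadata': {'pinned': 'false'},
                      'hosts': {'compute3', 'compute4'}},
  }

  def candidate_hosts(cpu_policy):
      wanted = 'true' if cpu_policy == 'dedicated' else 'false'
      return set().union(*(agg['hosts'] for agg in AGGREGATES.values()
                           if agg['metadata'].get('pinned') == wanted))

  print(sorted(candidate_hosts('dedicated')))  # ['compute1', 'compute2']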

The compute host already has a notion of CPU sockets which are reserved for
execution of base operating system services. This facility will be preserved
unchanged, i.e. dedicated CPU guests will only be placed on CPUs which are not
marked as reserved for the base OS.

Alternatives
------------

There is no alternative way, short of CPU pinning, to ensure that a guest has
predictable execution latency, free of cache effects from other guests
running on the host.

The proposed solution is to use host aggregates for grouping compute hosts
into those for dedicated vs. overcommit CPU policy. An alternative would be to
allow compute hosts to have both dedicated and overcommit guests, splitting
them onto separate sockets, i.e. if there were four sockets, two sockets could
be used for dedicated CPU guests while two sockets could be used for
overcommit guests, with usage determined on a first-come, first-served basis.
A problem with this approach is that there is no strict workload isolation
even if separate sockets are used. Cache effects can still be observed, and
the guests will also contend for memory access, so the overcommit guests can
negatively impact performance of the dedicated CPU guests even when on
separate sockets. So while this would be simpler from an administrative POV,
it would not give the same performance guarantees that are important for NFV
use cases. It would nonetheless be possible to enhance the design in the
future, so that overcommit and dedicated CPU guests could co-exist on the same
host for those use cases where administrative simplicity is more important
than perfect performance isolation. It is believed that it is better to start
off with the simpler-to-implement design, based on host aggregates, for the
first iteration of this feature.

Data model impact
-----------------

The 'compute_node' table will gain a new field to record information about
which host CPUs are available and which are in use by guest instances with
dedicated CPU resources assigned. Similar to the 'numa_topology' field, this
will be a structured data field containing something like

::

  {'cells': [
      {
          'cpuset': '0,1,2,3',
          'sib': ['0,1', '2,3'],
          'pin': '0,2',
          'id': 0
      },
      {
          'cpuset': '4,5,6,7',
          'sib': ['4,5', '6,7'],
          'pin': '4',
          'id': 1
      }
  ]}
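
As an illustration of how this data could be consumed (a sketch only, not the
actual scheduler or resource tracker code), the free dedicated pCPUs of a
host can be derived per cell by subtracting the pinned CPUs from the
available set:

::

  def free_pcpus(cell):
      # 'cpuset' lists the available host CPUs, 'pin' the ones already
      # claimed by dedicated-CPU guests; both use the comma-separated
      # string form shown above.
      def parse(csv):
          return {int(c) for c in csv.split(',') if c != ''}
      return parse(cell['cpuset']) - parse(cell['pin'])

  cell = {'cpuset': '0,1,2,3', 'sib': ['0,1', '2,3'], 'pin': '0,2', 'id': 0}
  print(sorted(free_pcpus(cell)))  # [1, 3]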

The 'instance_extra' table will gain a new field to record information
about which host CPU each guest vCPU is being pinned to, which will also
contain structured data similar to that used in the 'numa_topology' field
of the same table.

::

  {'cells': [
      {
          'id': 0,
          'pin': {0: 0, 1: 3},
          'topo': {'sock': 1, 'core': 1, 'th': 2}
      },
      {
          'id': 1,
          'pin': {2: 1, 3: 2},
          'topo': {'sock': 1, 'core': 1, 'th': 2}
      }
  ]}
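
In the libvirt driver, the per-instance 'pin' mapping would ultimately be
expressed as <vcpupin> elements inside the guest's <cputune> configuration.
The following is a hedged sketch of that translation, not the actual driver
code; the function name and output formatting are illustrative only.

::

  def cputune_xml(cells):
      # Each cell's 'pin' dict maps guest vCPU id -> host pCPU id; emit one
      # <vcpupin> element per guest vCPU.
      lines = ['<cputune>']
      for cell in cells:
          for vcpu, pcpu in sorted(cell['pin'].items()):
              lines.append("  <vcpupin vcpu='%d' cpuset='%d'/>" % (vcpu, pcpu))
      lines.append('</cputune>')
      return '\n'.join(lines)

  instance_pinning = [
      {'id': 0, 'pin': {0: 0, 1: 3}, 'topo': {'sock': 1, 'core': 1, 'th': 2}},
      {'id': 1, 'pin': {2: 1, 3: 2}, 'topo': {'sock': 1, 'core': 1, 'th': 2}},
  ]
  print(cputune_xml(instance_pinning))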

REST API impact
---------------

No impact.

The existing APIs already support arbitrary data in the flavor extra specs.

Security impact
---------------

No impact.

Notifications impact
--------------------

No impact.

The notifications system is not used by this change.

Other end user impact
---------------------

There are no changes that directly impact the end user, other than the fact
that their guest should have more predictable CPU execution latency.

Performance Impact
------------------

No impact.

Other deployer impact
---------------------

The cloud administrator will gain the ability to define flavors which offer
dedicated CPU resources. The administrator will have to place hosts into
groups using aggregates such that the scheduler can separate placement of
guests with dedicated vs shared CPUs. Although not required by this design, it
is expected that the administrator will commonly use the same host aggregates
to group hosts for both CPU pinning and large page usage, since these concepts
are complementary and expected to be used together. This will minimise the
administrative burden of configuring host aggregates.

Developer impact
----------------

It is expected that most hypervisors will have the ability to set up dedicated
pCPUs for guests vs shared pCPUs. The flavor parameter is simple enough that
any Nova driver would be able to support it.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  ndipanov

Other contributors:
  berrange
  vladik

Work Items
----------

* Enhance libvirt to support setup of strict CPU pinning for guests when the
  appropriate policy is set in the flavor

Dependencies
============

* Virt driver guest NUMA node placement & topology

  https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

Testing
=======

It is not practical to test this feature using the gate and tempest at this
time, since effective testing will require that the guests running the tests
be provided with multiple NUMA nodes, each in turn with multiple CPUs.

The Nova docs/source/devref documentation will be updated to include a
detailed set of instructions for manually testing the feature. This will
include testing of the previously developed NUMA and huge pages features
too. This doc will serve as the basis for later writing further automated
tests, as well as a useful basis for writing end user documentation on
the feature.

Documentation Impact
====================

The new flavor parameter available to the cloud administrator needs to be
documented along with recommendations about effective usage. The docs will
also need to mention the compute host deployment pre-requisites, such as the
need to set up aggregates. The testing guide mentioned in the previous section
will provide useful material for updating the docs.

References
==========

Current "big picture" research and design for the topic of CPU and memory
resource utilization and placement. vCPU topology is a subset of this
work:

* https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement

Previously approved for Juno but implementation not completed:

* https://review.openstack.org/93652

Blueprint for pinning guest vCPU threads to host pCPU threads:

* https://blueprints.launchpad.net/nova/+spec/virt-driver-cpu-thread-pinning