..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Virt driver guest NUMA node placement & topology
================================================

https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

This feature aims to enhance the libvirt driver to be able to do intelligent
NUMA node placement for guests. This will increase the effective utilization
of compute resources and decrease latency by avoiding cross-node memory
accesses by guests.

Problem description
===================

The vast majority of hardware used for virtualization compute nodes will
exhibit NUMA characteristics. When running workloads on NUMA hosts it is
important that the CPUs executing the processes are on the same node as the
memory used. This ensures that all memory accesses are local to the NUMA node
and thus do not consume the very limited cross-node memory bandwidth, which
adds latency to memory accesses. PCI devices are directly associated with
specific NUMA nodes for the purposes of DMA, so when using PCI device
assignment it is also desirable that the guest be placed on the same NUMA
node as any PCI device that is assigned to it.

The libvirt driver does not currently attempt any NUMA placement: guests are
free to float across any host pCPUs, and their RAM is allocated from any NUMA
node. This is very wasteful of compute resources and increases memory access
latency, which is harmful for NFV use cases.

If the RAM/vCPUs associated with a flavor are larger than any single NUMA
node, it is important to expose NUMA topology to the guest so that the OS in
the guest can intelligently schedule workloads it runs. For this to work the
guest NUMA nodes must be directly associated with host NUMA nodes.

Some guest workloads have very demanding requirements for memory access
latency and/or bandwidth, which exceed that which is available from a
single NUMA node. For such workloads, it will be beneficial to spread
the guest across multiple host NUMA nodes, even if the guest RAM/vCPUs
could theoretically fit in a single NUMA node.

Forward planning to maximise the choice of target hosts for use with live
migration may also cause an administrator to prefer splitting a guest
across multiple nodes, even if it could potentially fit in a single node
on some hosts.

For these two reasons it is desirable to be able to explicitly indicate
how many NUMA nodes to set up in a guest, and to specify how much RAM or
how many vCPUs to place in each node.

Proposed change
===============

The libvirt driver will be enhanced so that it looks at the resources available
in each NUMA node and decides which is best able to run the guest. When
launching the guest, it will tell libvirt to confine the guest to the chosen
NUMA node.
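
For illustration, this kind of confinement is expressed in the libvirt guest
XML via the vcpu and numatune elements. A minimal sketch, assuming the driver
picked host NUMA node 0 whose pCPUs are 0-3 (the exact XML generated is an
implementation detail of the driver)::

  <domain>
    ...
    <!-- pin the guest vCPUs to the pCPUs of the chosen host node -->
    <vcpu placement='static' cpuset='0-3'>4</vcpu>
    <numatune>
      <!-- allocate guest RAM strictly from host NUMA node 0 -->
      <memory mode='strict' nodeset='0'/>
    </numatune>
    ...
  </domain>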

The compute driver host stats data will be extended to include information
about the NUMA topology of the host and the availability of resources in the
nodes.

The scheduler will be enhanced such that it can consider the availability of
NUMA resources when choosing the host to schedule on. The algorithm that the
scheduler uses to decide whether a host can run a guest will need to closely
match, if not be identical to, the algorithm used by the libvirt driver
itself. This will involve the creation of a new scheduler filter to match the
flavor/image config specification against the NUMA resource availability
reported by the compute hosts.
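
As a sketch of the shape such a filter might take (the class name, property
field and host attribute names are illustrative assumptions, not the final
implementation)::

  from nova.scheduler import filters


  class NUMATopologyFilter(filters.BaseHostFilter):
      """Reject hosts whose NUMA nodes cannot fit the requested topology.

      Sketch only: the real filter must share its fitting logic with the
      libvirt driver so that both reach the same placement decision.
      """

      def host_passes(self, host_state, filter_properties):
          # Per-guest-node (vcpus, mem_mb) requests, synthesized from the
          # flavor/image hw:numa_* properties (field name is hypothetical).
          requested = filter_properties.get('numa_topology')
          if not requested:
              return True  # no NUMA constraints, any host will do

          # Host NUMA inventory as reported in the host stats
          # (attribute name is hypothetical).
          available = list(host_state.numa_nodes)

          # Greedily place each guest node on a distinct host node.
          for vcpus, mem_mb in requested:
              for node in available:
                  if (node['free_vcpus'] >= vcpus
                          and node['free_mem_mb'] >= mem_mb):
                      available.remove(node)
                      break
              else:
                  return False  # this guest node cannot be placed
          return True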

The flavor extra specs will support the specification of guest NUMA topology.
This is important when the RAM / vCPU count associated with a flavor is larger
than any single NUMA node in compute hosts, by making it possible to have guest
instances that span NUMA nodes. The compute driver will ensure that guest NUMA
nodes are directly mapped to host NUMA nodes. It is expected that the default
setup would be to not list any NUMA properties and just let the compute host
and scheduler apply a sensible default placement logic. These properties would
only need to be set in the subset of scenarios which require more precise
control over the NUMA topology / fit characteristics.

* hw:numa_nodes=NN - number of NUMA nodes to expose to the guest
* hw:numa_mempolicy=preferred|strict - memory allocation policy
* hw:numa_cpus.0=<cpu-list> - mapping of vCPUs N-M to NUMA node 0
* hw:numa_cpus.1=<cpu-list> - mapping of vCPUs N-M to NUMA node 1
* hw:numa_mem.0=<ram-size> - mapping N MB of RAM to NUMA node 0
* hw:numa_mem.1=<ram-size> - mapping N MB of RAM to NUMA node 1

The most common case will be that the admin only sets 'hw:numa_nodes' and then
the flavor vCPUs and RAM will be divided equally across the NUMA nodes.
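
A short sketch of that equal-division default (illustrative only; the real
synthesis is performed by the scheduler as described below)::

  def divide_equally(flavor_vcpus, flavor_mem_mb, numa_nodes):
      """Split a flavor's vCPUs and RAM evenly across guest NUMA nodes.

      Sketch of the documented default; the handling of any remainder
      (given here to the earlier nodes) is an assumption.
      """
      nodes = []
      for n in range(numa_nodes):
          vcpus = flavor_vcpus // numa_nodes
          mem = flavor_mem_mb // numa_nodes
          # spread any remainder over the first nodes so totals still match
          if n < flavor_vcpus % numa_nodes:
              vcpus += 1
          if n < flavor_mem_mb % numa_nodes:
              mem += 1
          nodes.append({'id': n, 'vcpus': vcpus, 'mem_mb': mem})
      return nodes


  # An 8 vCPU / 4096 MB flavor with hw:numa_nodes=2 yields two guest
  # nodes of 4 vCPUs and 2048 MB each.
  print(divide_equally(8, 4096, 2))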

The 'hw:numa_mempolicy' option allows specification of whether it is mandatory
for the instance's RAM allocations to come from the NUMA nodes to which it is
bound, or whether the kernel is free to fall back to using an alternative
node. If 'hw:numa_nodes' is specified, then 'hw:numa_mempolicy' is assumed to
default to 'strict'. It is useful to change it to 'preferred' when the
'hw:numa_nodes' parameter is set to '1' in order to forcibly disable use of
NUMA by image property overrides.

It should only be required to use the 'hw:numa_cpus.N' and 'hw:numa_mem.N'
settings if the guest NUMA nodes should have an asymmetrical allocation of
CPUs and RAM. This is important for some NFV workloads, but in general these
will be rarely used tunables. If the 'hw:numa_cpus' or 'hw:numa_mem' settings
are provided and their values do not sum to the total vCPU count / memory
size, this is considered to be a configuration error. An exception will be
raised by the compute driver when attempting to boot the instance. As an
enhancement it might be possible to validate some of the data at the API
level to allow for earlier error reporting to the user. Such checking is not
a functional prerequisite for this work though, so it can be done out-of-band
to the main development effort.
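
A sketch of that consistency check (the function and argument names are
illustrative)::

  def validate_numa_extra_specs(flavor_vcpus, flavor_mem_mb,
                                numa_cpus, numa_mem):
      """Raise if explicit per-node settings disagree with flavor totals.

      numa_cpus maps node id -> set of vCPU ids and numa_mem maps node
      id -> MB of RAM, as parsed from the hw:numa_cpus.N / hw:numa_mem.N
      extra specs.
      """
      all_cpus = set()
      for cpus in numa_cpus.values():
          all_cpus |= cpus
      if all_cpus != set(range(flavor_vcpus)):
          raise ValueError("hw:numa_cpus.N entries must cover every "
                           "flavor vCPU exactly once")
      if sum(numa_mem.values()) != flavor_mem_mb:
          raise ValueError("hw:numa_mem.N entries must sum to the "
                           "flavor memory size")


  # Passes for the example flavor used below: 8 vCPUs, 4096 MB of RAM
  # split as 3072 + 1024.
  validate_numa_extra_specs(
      8, 4096,
      numa_cpus={0: {0, 1, 2, 3, 4, 5}, 1: {6, 7}},
      numa_mem={0: 3072, 1: 1024})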

When scheduling, if only the hw:numa_nodes=NN property is set, the scheduler
will synthesize hw:numa_cpus.N and hw:numa_mem.N properties such that the
flavor allocation is equally spread across the desired number of NUMA nodes.
It will then consider the available NUMA resources on hosts to find one
that exactly matches the requirements of the guest. So, given an example
config:

* vcpus=8
* mem=4096 (i.e. 4 GB of RAM)
* hw:numa_nodes=2 - number of NUMA nodes to expose to the guest
* hw:numa_cpus.0=0,1,2,3,4,5
* hw:numa_cpus.1=6,7
* hw:numa_mem.0=3072
* hw:numa_mem.1=1024

The scheduler will look for a host with 2 NUMA nodes with the ability to run
6 CPUs + 3 GB of RAM on one node, and 2 CPUs + 1 GB of RAM on another node.
If a host has a single NUMA node with the capability to run 8 CPUs and 4 GB
of RAM it will not be considered a valid match. The same logic will be
applied in the scheduler regardless of the hw:numa_mempolicy option setting.
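
To make the example concrete, a small sketch of the exact-fit test, showing
why the single-node host above is rejected even though its aggregate
resources would suffice::

  def fits(host_nodes, request):
      """Can each requested (vcpus, mem_mb) pair be placed on a distinct
      host NUMA node with enough capacity? Illustrative sketch."""
      available = list(host_nodes)
      for vcpus, mem_mb in request:
          for node in available:
              if node['cpus'] >= vcpus and node['mem_mb'] >= mem_mb:
                  available.remove(node)
                  break
          else:
              return False
      return True


  request = [(6, 3072), (2, 1024)]  # from the example flavor above

  # Two-node host: each guest node gets its own host node -> True
  print(fits([{'cpus': 8, 'mem_mb': 8192},
              {'cpus': 8, 'mem_mb': 8192}], request))

  # Single-node host with 8 CPUs / 4096 MB in one node -> False
  print(fits([{'cpus': 8, 'mem_mb': 4096}], request))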

All of the properties described against the flavor could also be set against
the image, with the ':' separator replaced by '_', as is normal for image
property naming conventions:

* hw_numa_nodes=NN - number of NUMA nodes to expose to the guest
* hw_numa_mempolicy=strict|preferred - memory allocation policy
* hw_numa_cpus.0=<cpu-list> - mapping of vCPUs N-M to NUMA node 0
* hw_numa_cpus.1=<cpu-list> - mapping of vCPUs N-M to NUMA node 1
* hw_numa_mem.0=<ram-size> - mapping N MB of RAM to NUMA node 0
* hw_numa_mem.1=<ram-size> - mapping N MB of RAM to NUMA node 1

This is useful if the application in the image requires very specific NUMA
topology characteristics, which is expected to be a frequent need with NFV
images. The properties can only be set against the image, however, if they
are not already set against the flavor. So, for example, if the flavor sets
'hw:numa_nodes=2' but does not set any 'hw:numa_cpus' / 'hw:numa_mem' values,
then the image can optionally set those. If, however, the flavor has set a
specific property, the image cannot override it. This allows the flavor
admin to strictly lock down what is permitted if desired. They can force a
non-NUMA topology by setting hw:numa_nodes=1 against the flavor.
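
A sketch of this precedence rule, assuming the request is rejected outright
when an image attempts to override a flavor-set property (the helper and
exception names are hypothetical)::

  class ImageOverrideForbidden(Exception):
      """An image tried to override a NUMA property set on the flavor."""


  def merge_numa_properties(flavor_extra_specs, image_properties):
      """Merge flavor and image NUMA settings; the flavor always wins."""
      merged = dict((k, v) for k, v in flavor_extra_specs.items()
                    if k.startswith('hw:numa_'))
      for key, value in image_properties.items():
          if not key.startswith('hw_numa_'):
              continue
          flavor_key = key.replace('hw_', 'hw:', 1)
          if flavor_key in merged:
              # The flavor admin has locked this property down.
              raise ImageOverrideForbidden(key)
          merged[flavor_key] = value
      return merged


  # The image may fill in per-node detail the flavor left unset ...
  print(merge_numa_properties(
      {'hw:numa_nodes': '2'},
      {'hw_numa_mem.0': '3072', 'hw_numa_mem.1': '1024'}))
  # ... but setting hw_numa_nodes in the image would raise
  # ImageOverrideForbidden because the flavor already sets it.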

Alternatives
------------

Libvirt supports integration with a daemon called numad. This daemon can be
given a RAM size + vCPU count and tells libvirt what NUMA node to place a
guest on. It is also capable of shifting running guests between NUMA nodes to
rebalance utilization. This is insufficient for Nova, since Nova needs to have
intelligence in the scheduler to pick hosts. The compute driver then needs to
be able to use the same logic when actually launching the guests. The numad
system is not portable to other compute hypervisors. It does not deal with the
problem of placing guests which span NUMA nodes. Finally, it does not
address the needs of NFV workloads, which require guaranteed NUMA topology
and placement policies, not merely dynamic best effort.

Another alternative is to just do nothing, as we do today, and rely on the
Linux kernel scheduler being enhanced to automatically place guests on
appropriate NUMA nodes and rebalance them on demand. This shares most of the
problems seen with using numad.

Data model impact
-----------------

No impact.

The reporting of NUMA topology will be integrated in the existing data
structure used for host state reporting. This already supports arbitrary
fields, so no data model changes are anticipated for this part. The topology
would appear as structured data::

    hw_numa = {
        "nodes": [
            {
                "id": 0,
                "cpus": [0, 2, 4, 6],
                "mem": {
                    "total": 10737418240,
                    "free": 3221225472,
                },
                "distances": [10, 20],
            },
            {
                "id": 1,
                "cpus": [1, 3, 5, 7],
                "mem": {
                    "total": 10737418240,
                    "free": 5368709120,
                },
                "distances": [20, 10],
            },
        ],
    }

REST API impact
---------------

No impact.

The API for host state reporting already supports arbitrary data fields, so
no change is anticipated from that POV. No new API calls will be required.

Security impact
---------------

No impact.

There are no new APIs involved which would imply a new security risk.

Notifications impact
--------------------

No impact.

There is no need for any use of the notification system.

Other end user impact
---------------------

Depending on the flavor chosen, the guest OS may see NUMA nodes backing its
RAM allocation.

There is no end user interaction in setting up NUMA placement policies.

The cloud administrator will gain the ability to set NUMA policies on
flavors.

Performance Impact
------------------

The new scheduler features will imply increased performance overhead when
determining whether a host is able to fit the memory and vCPU needs of the
flavor, i.e. the current logic, which just checks the vCPU count and RAM
requirement against the host free memory, will need to take account of the
availability of resources in specific NUMA nodes.

Other deployer impact
---------------------

If the deployment has flavors whose RAM + vCPU allocations are larger than
the size of the NUMA nodes in the compute hosts, the cloud administrator
should strongly consider defining guest NUMA nodes in the flavor. This will
enable the compute hosts to achieve better NUMA utilization and improve
performance of the guest OS.

Developer impact
----------------

The new flavor attributes could be used by any full machine virtualization
hypervisor driver; however, it is not mandatory that they do so.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  berrange

Other contributors:
  ndipanov

Work Items
----------

* Enhance libvirt driver to report NUMA node resources & availability
* Enhance libvirt driver to support setup of guest NUMA nodes.
* Enhance libvirt driver to look at NUMA node availability when launching
  guest instances and pin all guests to the best NUMA node
* Add support to scheduler for picking hosts based on the NUMA availability
  instead of simply considering the total RAM/vCPU availability.

Dependencies
============

* The driver vCPU topology feature is a pre-requisite:

  https://blueprints.launchpad.net/nova/+spec/virt-driver-vcpu-topology

* Supporting guest NUMA nodes will require completion of work in QEMU and
  libvirt, to enable guest NUMA nodes to be pinned to specific host NUMA
  nodes. In the absence of libvirt/QEMU support, guest NUMA nodes can still
  be used, but they would not have any performance benefit, and may even
  hurt performance.

  https://www.redhat.com/archives/libvir-list/2014-June/msg00201.html

Testing
=======

There are various discrete parts of the work that can be tested in isolation
from each other, fairly effectively, using unit tests.

The main area where unit tests might not be sufficient is the scheduler
integration, where performance/scalability would be a concern. Testing the
scalability of the scheduler in tempest, though, is not practical, since the
issues would only become apparent with many compute hosts and many guests,
i.e. a scale beyond that which tempest sets up.

Documentation Impact
====================

The cloud administrator docs need to describe the new flavor parameters
and make recommendations on how to effectively use them.

The end user needs to be made aware of the fact that some flavors will cause
the guest OS to see NUMA topology.

References
==========

Current "big picture" research and design for the topic of CPU and memory
resource utilization and placement. vCPU topology is a subset of this
work:

* https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement

OpenStack NFV team:

* https://wiki.openstack.org/wiki/Teams/NFV