Fix additional mistakes in the NUMA spec document

The 'hw:numa_mempolicy' parameter was never implemented and will likely never
be; however, it is located in the 'implemented' folder, suggesting otherwise.
Seeing as specs are as much a reference as anything, this invalid information
should not be retained. Correct this, taking the opportunity to fix some other
typos, line wrapping and general formatting.

Change-Id: I4aca0073f1fa26ff0c3a34407370f6ba6d916879

parent d79a01ff87
commit 45252df4c5

@@ -11,9 +11,9 @@ Virt driver guest NUMA node placement & topology

https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

This feature aims to enhance the libvirt driver to be able to do intelligent
NUMA node placement for guests. This will increase the effective utilization of
compute resources and decrease latency by avoiding cross-node memory accesses
by guests.

Problem description
===================

@@ -28,30 +28,30 @@ NUMA nodes for the purposes of DMA, so when using PCI device assignment it is

also desirable that the guest be placed on the same NUMA node as any PCI device
that is assigned to it.

The libvirt driver does not currently attempt any NUMA placement; the guests
are free to float across any host pCPUs, and their memory is allocated from any
NUMA node. This is very wasteful of compute resources and increases memory
access latency, which is harmful for NFV use cases.

If the memory/vCPUs associated with a flavor are larger than any single NUMA
node, it is important to expose NUMA topology to the guest so that the OS in
the guest can intelligently schedule the workloads it runs. For this to work,
the guest NUMA nodes must be directly associated with host NUMA nodes.

Some guest workloads have very demanding requirements for memory access latency
and/or bandwidth, which exceed what is available from a single NUMA node. For
such workloads, it will be beneficial to spread the guest across multiple host
NUMA nodes, even if the guest memory/vCPUs could theoretically fit in a single
NUMA node.

Forward planning to maximize the choice of target hosts for use with live
migration may also cause an administrator to prefer splitting a guest across
multiple nodes, even if it could potentially fit in a single node on some
hosts.

For these two reasons it is desirable to be able to explicitly indicate how
many NUMA nodes to set up in a guest, and to specify how much memory or how
many vCPUs to place in each node.

Proposed change
===============

@@ -68,111 +68,112 @@ nodes.

The scheduler will be enhanced such that it can consider the availability of
NUMA resources when choosing the host to schedule on. The algorithm that the
scheduler uses to decide whether a host can run a guest will need to closely
match, if not be identical to, the algorithm used by the libvirt driver itself.
This will involve the creation of a new scheduler filter to match the
flavor/image config specification against the NUMA resource availability
reported by the compute hosts.
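
As an illustration only, a minimal sketch of the shape such a filter could
take (class and field names here are hypothetical, not the actual Nova code):

::

  class GuestNUMAFitFilter(object):
      """Reject hosts whose free NUMA cells cannot hold the guest's cells."""

      def host_passes(self, host_state, filter_properties):
          wanted = filter_properties.get("guest_numa_cells")
          if not wanted:
              return True  # no NUMA topology was requested
          free = list(host_state.get("numa_cells", []))
          # First-fit sketch: each guest cell must fit entirely inside a
          # distinct host cell (a real matcher may need to try harder).
          for cell in wanted:
              for i, host_cell in enumerate(free):
                  if (host_cell["free_vcpus"] >= cell["vcpus"] and
                          host_cell["free_mem_mb"] >= cell["mem_mb"]):
                      free.pop(i)
                      break
              else:
                  return False
          return True

Sharing a predicate like this between the scheduler and the libvirt driver
would keep the two views of "fits" aligned.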

The flavor extra specs will support the specification of guest NUMA topology.
This is important when the memory / vCPU count associated with a flavor is
larger than any single NUMA node in compute hosts, by making it possible to
have guest instances that span NUMA nodes. The compute driver will ensure that
guest NUMA nodes are directly mapped to host NUMA nodes. It is expected that
the default setup would be to not list any NUMA properties and just let the
compute host and scheduler apply a sensible default placement logic. These
properties would only need to be set in the subset of scenarios which require
more precise control over the NUMA topology / fit characteristics.

* ``hw:numa_nodes=NN`` - number of NUMA nodes to expose to the guest.
* ``hw:numa_cpus.NN=<cpu-list>`` - mapping of guest vCPUs to a given guest NUMA
  node.
* ``hw:numa_mem.NN=<ram-size>`` - mapping of guest MB of memory to a given
  guest NUMA node.

.. important::

   The NUMA nodes, CPUs and memory referred to above are guest NUMA nodes,
   guest CPUs, and guest memory. It is not possible to define specific host
   nodes, CPUs or memory that should be assigned to a guest.
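
The spec does not pin down the ``<cpu-list>`` grammar; assuming it follows the
usual comma-and-range convention (for example ``0,1,2,3,4,5`` or ``0-5``), a
parser sketch could look like this (a hypothetical helper, not part of the
proposed code):

::

  def parse_cpu_list(spec):
      """Turn a string such as "0-5" or "0,2,4-7" into a set of vCPU ids."""
      cpus = set()
      for part in spec.split(","):
          if "-" in part:
              lo, hi = part.split("-", 1)
              cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
          else:
              cpus.add(int(part))
      return cpus

  assert parse_cpu_list("0,1,2,3,4,5") == parse_cpu_list("0-5")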

The most common case will be that the admin only sets ``hw:numa_nodes`` and
then the flavor vCPUs and memory will be divided equally across the NUMA nodes.
When a NUMA policy is in effect, it is mandatory for the instance's memory
allocations to come from the NUMA nodes to which it is bound, except where
overridden by ``hw:numa_mem.NN``.
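
As a sketch of that equal split (a hypothetical helper, not the actual
implementation), dividing a flavor's resources over ``hw:numa_nodes`` nodes:

::

  def divide_evenly(total, nodes):
      """Split an integer total across nodes, front-loading any remainder."""
      base, extra = divmod(total, nodes)
      return [base + (1 if n < extra else 0) for n in range(nodes)]

  # A flavor with vcpus=8, 4096 MB of memory and hw:numa_nodes=2 would yield
  # two guest cells of 4 vCPUs and 2048 MB each.
  assert divide_evenly(8, 2) == [4, 4]
  assert divide_evenly(4096, 2) == [2048, 2048]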

It should only be required to use the ``hw:numa_cpus.N`` and ``hw:numa_mem.N``
settings if the guest NUMA nodes should have asymmetrical allocation of CPUs
and memory. This is important for some NFV workloads, but in general these will
be rarely used tunables. If the ``hw:numa_cpus`` or ``hw:numa_mem`` settings
are provided and their values do not sum to the total vCPU count / memory size,
this is considered to be a configuration error. An exception will be raised by
the compute driver when attempting to boot the instance. As an enhancement it
might be possible to validate some of the data at the API level to allow for
earlier error reporting to the user. Such checking is not a functional
prerequisite for this work, though, so it can be done out-of-band from the
main development effort.
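
A minimal sketch of that consistency check (hypothetical names; per the text
above, the real check would live in the compute driver and, later, perhaps
the API):

::

  def validate_numa_specs(vcpus, mem_mb, numa_cpus, numa_mem):
      """numa_cpus maps node id to a set of vCPU ids; numa_mem maps to MB."""
      seen = set()
      for cpus in numa_cpus.values():
          if cpus & seen:
              raise ValueError("vCPU assigned to more than one NUMA node")
          seen |= cpus
      if seen != set(range(vcpus)):
          raise ValueError("hw:numa_cpus.* does not cover exactly all vCPUs")
      if sum(numa_mem.values()) != mem_mb:
          raise ValueError("hw:numa_mem.* does not sum to the flavor memory")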

If only the ``hw:numa_nodes=NNN`` property is set, the ``hw:numa_cpus.NN`` and
``hw:numa_mem.NN`` properties will be synthesized such that the flavor
allocation is equally spread across the desired number of NUMA nodes. This will
happen twice: once when scheduling, to ensure the guest will fit on the host,
and once during claiming, when the resources are actually allocated. Both
processes will consider the available NUMA resources on hosts to find one that
exactly matches the requirements of the guest. For example, given the following
config:

* ``vcpus=8``
* ``mem=4``
* ``hw:numa_nodes=2``
* ``hw:numa_cpus.0=0,1,2,3,4,5``
* ``hw:numa_cpus.1=6,7``
* ``hw:numa_mem.0=3072``
* ``hw:numa_mem.1=1024``

The scheduler will look for a host with 2 NUMA nodes with the ability to run 6
CPUs + 3 GB of memory on one node, and 2 CPUs + 1 GB of memory on another node.
If a host has a single NUMA node with the capability to run 8 CPUs and 4 GB of
memory, it will not be considered a valid match.
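
The same example written out as data (illustrative structures only):

::

  guest_cells = [{"vcpus": 6, "mem_mb": 3072},  # hw:numa_cpus.0 / numa_mem.0
                 {"vcpus": 2, "mem_mb": 1024}]  # hw:numa_cpus.1 / numa_mem.1

  one_big_cell = [{"free_vcpus": 8, "free_mem_mb": 4096}]
  two_cells = [{"free_vcpus": 8, "free_mem_mb": 4096},
               {"free_vcpus": 8, "free_mem_mb": 4096}]

  # The totals (8 vCPUs, 4096 MB) fit one_big_cell, but the two guest cells
  # cannot be placed on distinct host NUMA nodes, so that host is rejected;
  # two_cells can hold one guest cell per host cell and is a valid match.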

All of the properties described against the flavor could also be set against
the image, with the leading ':' replaced by '_', as is normal for image
property naming conventions:

* ``hw_numa_nodes=NN`` - number of NUMA nodes to expose to the guest.
* ``hw_numa_cpus.NN=<cpu-list>`` - mapping of guest vCPUs to a given guest NUMA
  node.
* ``hw_numa_mem.NN=<ram-size>`` - mapping of guest MB of memory to a given
  guest NUMA node.

This is useful if the application in the image requires very specific NUMA
topology characteristics, which is expected to be used frequently with NFV
images. The properties can only be set against the image, however, if they are
not already set against the flavor. So, for example, if the flavor sets
``hw:numa_nodes=2`` but does not set any ``hw:numa_cpus`` or ``hw:numa_mem``
values, then the image can optionally set those. If the flavor has, however,
set a specific property, the image cannot override that. This allows the
flavor admin to strictly lock down what is permitted if desired. They can
force a non-NUMA topology by setting ``hw:numa_nodes=1`` against the flavor.
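
A sketch of that precedence rule (a hypothetical helper; key normalization
between the ``hw:`` and ``hw_`` forms is assumed to happen first, and whether
a conflicting image value is silently ignored or rejected with an error is a
policy detail this sketch glosses over by ignoring it):

::

  def effective_numa_properties(flavor_specs, image_props):
      """Image values apply only where the flavor left a property unset."""
      merged = dict(image_props)
      merged.update(flavor_specs)  # the flavor always wins on conflicts
      return merged

  flavor = {"numa_nodes": "2"}                       # from hw:numa_nodes
  image = {"numa_nodes": "4", "numa_mem.0": "1024"}  # from hw_numa_* props
  assert effective_numa_properties(flavor, image) == {
      "numa_nodes": "2",     # flavor setting cannot be overridden
      "numa_mem.0": "1024",  # image may fill in what the flavor left unset
  }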

Alternatives
------------

Libvirt supports integration with a daemon called numad. This daemon can be
given a memory size + vCPU count and tells libvirt what NUMA node to place a
guest on. It is also capable of shifting running guests between NUMA nodes to
rebalance utilization. This is insufficient for Nova, since Nova needs to have
intelligence in the scheduler to pick hosts, and the compute driver then needs
to be able to use the same logic when actually launching the guests. The numad
system is not portable to other compute hypervisors. It does not deal with the
problem of placing guests which span across NUMA nodes. Finally, it does not
address the needs of NFV workloads which require guaranteed NUMA topology and
placement policies, not merely dynamic best effort.

Another alternative is to just do nothing, as we do today, and rely on the
Linux kernel scheduler being enhanced to automatically place guests on
appropriate NUMA nodes and rebalance them on demand. This shares most of the
problems seen with using numad.

Data model impact
-----------------

@@ -180,9 +181,9 @@ Data model impact

No impact.

The reporting of NUMA topology will be integrated in the existing data
structure used for host state reporting. This already supports arbitrary
fields, so no data model changes are anticipated for this part. This would
appear as structured data

::

@@ -215,8 +216,8 @@ REST API impact

No impact.

The API for host state reporting already supports arbitrary data fields, so no
change is anticipated from that POV. No new API calls will be required.

Security impact
---------------

@@ -236,7 +237,7 @@ Other end user impact

---------------------

Depending on the flavor chosen, the guest OS may see NUMA nodes backing its
memory allocation.

There is no end user interaction in setting up NUMA usage policies.

@@ -247,18 +248,17 @@ Performance Impact

The new scheduler features will imply increased performance overhead when
determining whether a host is able to fit the memory and vCPU needs of the
flavor, i.e. the current logic, which just checks the vCPU count and memory
requirement against the host free memory, will need to take account of the
availability of resources in specific NUMA nodes.

Other deployer impact
---------------------

If the deployment has flavors whose memory + vCPU allocations are larger than
the size of the NUMA nodes in the compute hosts, the cloud administrator should
strongly consider defining guest NUMA nodes in the flavor. This will enable the
compute hosts to have better NUMA utilization and improve performance of the
guest OS.

Developer impact
----------------

@@ -286,7 +286,7 @@ Work Items

* Enhance libvirt driver to look at NUMA node availability when launching
  guest instances and pin all guests to the best NUMA node
* Add support to scheduler for picking hosts based on the NUMA availability
  instead of simply considering the total memory/vCPU availability.

Dependencies
============

@@ -306,20 +306,20 @@ Dependencies

Testing
=======

There are various discrete parts of the work that can be tested in isolation
from each other fairly effectively using unit tests.

The main area where unit tests might not be sufficient is the scheduler
integration, where performance/scalability would be a concern. Testing the
scalability of the scheduler in tempest, though, is not practical, since the
issues would only become apparent with many compute hosts and many guests, i.e.
a scale beyond that which tempest sets up.

Documentation Impact
====================

The cloud administrator docs need to describe the new flavor parameters and
make recommendations on how to effectively use them.

The end user needs to be made aware of the fact that some flavors will cause
the guest OS to see NUMA topology.

@@ -328,8 +328,7 @@ References

==========

Current "big picture" research and design for the topic of CPU and memory
resource utilization and placement. vCPU topology is a subset of this work:

* https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement