From 45252df4c54674d2ac71cd88154af476c4d510e1 Mon Sep 17 00:00:00 2001 From: Stephen Finucane Date: Tue, 31 May 2016 10:31:07 +0100 Subject: [PATCH] Fix additional mistakes in the NUMA spec document The 'hw:numa_mempolicy' parameter was never implemented and will likely never be, however, it's located in the 'implemented' folder suggesting otherwise. Seeing as specs are as much a reference as anything, this invalid information should not be retained. Correct this, taking the opportunity to fix some other typo, line wrapping and general formatting. Change-Id: I4aca0073f1fa26ff0c3a34407370f6ba6d916879 --- .../virt-driver-numa-placement.rst | 217 +++++++++--------- 1 file changed, 108 insertions(+), 109 deletions(-) diff --git a/specs/juno/implemented/virt-driver-numa-placement.rst b/specs/juno/implemented/virt-driver-numa-placement.rst index 0f6fa3a8d..7460a18af 100644 --- a/specs/juno/implemented/virt-driver-numa-placement.rst +++ b/specs/juno/implemented/virt-driver-numa-placement.rst @@ -11,9 +11,9 @@ Virt driver guest NUMA node placement & topology https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement This feature aims to enhance the libvirt driver to be able to do intelligent -NUMA node placement for guests. This will increase the effective utilization -of compute resources and decrease latency by avoiding cross-node memory -accesses by guests. +NUMA node placement for guests. This will increase the effective utilization of +compute resources and decrease latency by avoiding cross-node memory accesses +by guests. Problem description =================== @@ -28,30 +28,30 @@ NUMA nodes for the purposes of DMA, so when using PCI device assignment it is also desirable that the guest be placed on the same NUMA node as any PCI device that is assigned to it. -The libvirt driver does not currently attempt any NUMA placement, the guests -are free to float across any host pCPUs and their RAM is allocated from any +The libvirt driver does not currently attempt any NUMA placement; the guests +are free to float across any host pCPUs and their memory is allocated from any NUMA node. This is very wasteful of compute resources and increases memory access latency which is harmful for NFV use cases. -If the RAM/vCPUs associated with a flavor are larger than any single NUMA +If the memory/vCPUs associated with a flavor are larger than any single NUMA node, it is important to expose NUMA topology to the guest so that the OS in the guest can intelligently schedule workloads it runs. For this to work the guest NUMA nodes must be directly associated with host NUMA nodes. -Some guest workloads have very demanding requirements for memory access -latency and/or bandwidth, which exceed that which is available from a -single NUMA node. For such workloads, it will be beneficial to spread -the guest across multiple host NUMA nodes, even if the guest RAM/vCPUs -could theoretically fit in a single NUMA node. +Some guest workloads have very demanding requirements for memory access latency +and/or bandwidth, which exceed that which is available from a single NUMA node. +For such workloads, it will be beneficial to spread the guest across multiple +host NUMA nodes, even if the guest memory/vCPUs could theoretically fit in a +single NUMA node. -Forward planning to maximise the choice of target hosts for use with live -migration may also cause an administrator to prefer splitting a guest -across multiple nodes, even if it could potentially fit in a single node -on some hosts. +Forward planning to maximize the choice of target hosts for use with live +migration may also cause an administrator to prefer splitting a guest across +multiple nodes, even if it could potentially fit in a single node on some +hosts. -For these two reasons it is desirable to be able to explicitly indicate -how many NUMA nodes to setup in a guest, and to specify how much RAM or -how many vCPUs to place in each node. +For these two reasons it is desirable to be able to explicitly indicate how +many NUMA nodes to setup in a guest, and to specify how much memory or how many +vCPUs to place in each node. Proposed change =============== @@ -68,111 +68,112 @@ nodes. The scheduler will be enhanced such that it can consider the availability of NUMA resources when choosing the host to schedule on. The algorithm that the scheduler uses to decide if the host can run will need to be closely matched, -if not identical to, the algorithm used by the libvirt driver itself. This -will involve the creation of a new scheduler filter to match the flavor/image -config specification against the NUMA resource availability reported by the -compute hosts. +if not identical to, the algorithm used by the libvirt driver itself. This will +involve the creation of a new scheduler filter to match the flavor/image config +specification against the NUMA resource availability reported by the compute +hosts. The flavor extra specs will support the specification of guest NUMA topology. -This is important when the RAM / vCPU count associated with a flavor is larger -than any single NUMA node in compute hosts, by making it possible to have guest -instances that span NUMA nodes. The compute driver will ensure that guest NUMA -nodes are directly mapped to host NUMA nodes. It is expected that the default -setup would be to not list any NUMA properties and just let the compute host -and scheduler apply a sensible default placement logic. These properties would -only need to be set in the sub-set of scenarios which require more precise -control over the NUMA topology / fit characteristics. +This is important when the memory / vCPU count associated with a flavor is +larger than any single NUMA node in compute hosts, by making it possible to +have guest instances that span NUMA nodes. The compute driver will ensure that +guest NUMA nodes are directly mapped to host NUMA nodes. It is expected that +the default setup would be to not list any NUMA properties and just let the +compute host and scheduler apply a sensible default placement logic. These +properties would only need to be set in the sub-set of scenarios which require +more precise control over the NUMA topology / fit characteristics. -* hw:numa_nodes=NN - numa of NUMA nodes to expose to the guest. -* hw:numa_mempolicy=preferred|strict - memory allocation policy -* hw:numa_cpus.0= - mapping of vCPUS N-M to NUMA node 0 -* hw:numa_cpus.1= - mapping of vCPUS N-M to NUMA node 1 -* hw:numa_mem.0= - mapping N MB of RAM to NUMA node 0 -* hw:numa_mem.1= - mapping N MB of RAM to NUMA node 1 +* ``hw:numa_nodes=NN`` - number of NUMA nodes to expose to the guest. -The most common case will be that the admin only sets 'hw:numa_nodes' and then -the flavor vCPUs and RAM will be divided equally across the NUMA nodes. +* ``hw:numa_cpus.NN=`` - mapping of guest vCPUS to a given guest NUMA + node. -The 'hw:numa_mempolicy' option allows specification of whether it is mandatory -for the instance's RAM allocations to come from the NUMA nodes to which it is -bound, or whether the kernel is free to fallback to using an alternative node. -If 'hw:numa_nodes' is specified, then 'hw:numa_mempolicy' is assumed to default -to 'strict'. It is useful to change it to 'preferred' when the 'hw:numa_nodes' -parameter is being set to '1' to force disable use of NUMA by image property -overrides. +* ``hw:numa_mem.NN=`` - mapping of guest MB of memory to a given + guest NUMA node. -It should only be required to use the 'hw:numa_cpu.N' and 'hw:numa_mem.N' -settings if the guest NUMA nodes should have asymetrical allocation of CPUs -and RAM. This is important for some NFV workloads, but in general these will -be rarely used tunables. If the 'hw:numa_cpu' or 'hw:numa_mem' settings are -provided and their values do not sum to the total vcpu count / memory size, -this is considered to be a configuration error. An exception will be raised -by the compute driver when attempting to boot the instance. As an enhancement -it might be possible to validate some of the data at the API level to allow -for earlier error reporting to the user. Such checking is not a functional -prerequisite for this work though so such work can be done out-of-band to -the main development effort. +.. important :: + The NUMA nodes, CPUs and memory referred to above are guest NUMA nodes, + guest CPUs, and guest memory. It is not possible to define specific host + nodes, CPUs or memory that should be assigned to a guest. -When scheduling, if only the hw:numa_nodes=NNN property is set the scheduler -will synthesize hw:numa_cpus.NN and hw:numa_mem.NN properties such that the -flavor allocation is equally spread across the desired number of NUMA nodes. -It will then look consider the available NUMA resources on hosts to find one -that exactly matches the requirements of the guest. So, given an example +The most common case will be that the admin only sets ``hw:numa_nodes`` and +then the flavor vCPUs and memory will be divided equally across the NUMA nodes. +When a NUMA policy is in effect, it is mandatory for the instance's memory +allocations to come from the NUMA nodes to which it is bound except where +overriden by ``hw:numa_mem.NN``. + +It should only be required to use the ``hw:numa_cpus.N`` and ``hw:numa_mem.N`` +settings if the guest NUMA nodes should have asymmetrical allocation of CPUs +and memory. This is important for some NFV workloads, but in general these will +be rarely used tunables. If the ``hw:numa_cpus`` or ``hw:numa_mem`` settings +are provided and their values do not sum to the total vcpu count / memory size, +this is considered to be a configuration error. An exception will be raised by +the compute driver when attempting to boot the instance. As an enhancement it +might be possible to validate some of the data at the API level to allow for +earlier error reporting to the user. Such checking is not a functional +prerequisite for this work though so such work can be done out-of-band to the +main development effort. + +If only the ``hw:numa_nodes=NNN`` property is set the ``hw:numa_cpus.NN`` and +``hw:numa_mem.NN`` properties will be synthesized such that the flavor +allocation is equally spread across the desired number of NUMA nodes. This will +happen twice: once when scheduling, to ensure the guest will fit on the host, +and once during claiming, when the resources are actually allocated. Both +processes will consider the available NUMA resources on hosts to find one that +exactly matches the requirements of the guest. For example, given the following config: -* vcpus=8 -* mem=4 -* hw:numa_nodes=2 - numa of NUMA nodes to expose to the guest. -* hw:numa_cpus.0=0,1,2,3,4,5 -* hw:numa_cpus.1=6,7 -* hw:numa_mem.0=3072 -* hw:numa_mem.1=1024 +* ``vcpus=8`` +* ``mem=4`` +* ``hw:numa_nodes=2`` +* ``hw:numa_cpus.0=0,1,2,3,4,5`` +* ``hw:numa_cpus.1=6,7`` +* ``hw:numa_mem.0=3072`` +* ``hw:numa_mem.1=1024`` -The scheduler will look for a host with 2 NUMA nodes with the ability to run -6 CPUs + 3 GB of RAM on one node, and 2 CPUS + 1 GB of RAM on another node. +The scheduler will look for a host with 2 NUMA nodes with the ability to run 6 +CPUs + 3 GB of memory on one node, and 2 CPUS + 1 GB of RAM on another node. If a host has a single NUMA node with capability to run 8 CPUs and 4 GB of -RAM it will not be considered a valid match. The same logic will be applied -in the scheduler regardless of the hw:numa_mempolicy option setting. +memory it will not be considered a valid match. All of the properties described against the flavor could also be set against the image, with the leading ':' replaced by '_', as is normal for image property naming conventions: -* hw_numa_nodes=NN - numa of NUMA nodes to expose to the guest. -* hw_numa_mempolicy=strict|preferred - memory allocation policy -* hw_numa_cpus.0= - mapping of vCPUS N-M to NUMA node 0 -* hw_numa_cpus.1= - mapping of vCPUS N-M to NUMA node 1 -* hw_numa_mem.0= - mapping N MB of RAM to NUMA node 0 -* hw_numa_mem.1= - mapping N MB of RAM to NUMA node 1 +* ``hw_numa_nodes=NN`` - numa of NUMA nodes to expose to the guest. +* ``hw_numa_cpus.NN=`` - mapping of guest vCPUS to a given guest NUMA + node. +* ``hw_numa_mem.NN=`` - mapping of guest MB of memory to a given + guest NUMA node. This is useful if the application in the image requires very specific NUMA topology characteristics, which is expected to be used frequently with NFV -images. The properties can only be set against the image, however, if they -are not already set against the flavor. So for example, if the flavor sets -'hw:numa_nodes=2' but does not set any 'hw:numa_cpus' / 'hw:numa_mem' values -then the image can optionally set those. If the flavor has, however, set a -specific property the image cannot override that. This allows the flavor +images. The properties can only be set against the image, however, if they are +not already set against the flavor. So for example, if the flavor sets +``hw:numa_nodes=2`` but does not set any ``hw:numa_cpus`` or ``hw:numa_mem`` +values then the image can optionally set those. If the flavor has, however, set +a specific property the image cannot override that. This allows the flavor admin to strictly lock down what is permitted if desired. They can force a -non-NUMA topology by setting hw:numa_nodes=1 against the flavor. +non-NUMA topology by setting ``hw:numa_nodes=1`` against the flavor. Alternatives ------------ Libvirt supports integration with a daemon called numad. This daemon can be -given a RAM size + vCPU count and tells libvirt what NUMA node to place a +given a memory size + vCPU count and tells libvirt what NUMA node to place a guest on. It is also capable of shifting running guests between NUMA nodes to rebalance utilization. This is insufficient for Nova since it needs to have intelligence in the scheduler to pick hosts. The compute drivers then needs to be able to use the same logic when actually launching the guests. The numad system is not portable to other compute hypervisors. It does not deal with the problem of placing guests which span across NUMA nodes. Finally, it does not -address the needs for NFV workloads which require guaranteed NUMA topology -and placement policies, not merely dynamic best effort. +address the needs for NFV workloads which require guaranteed NUMA topology and +placement policies, not merely dynamic best effort. Another alternative is to just do nothing, as we do today, and rely on the Linux kernel scheduler being enhanced to automatically place guests on appropriate NUMA nodes and rebalance them on demand. This shares most of the -problems seen with using numad. +problems seen with using NUMA. Data model impact ----------------- @@ -180,9 +181,9 @@ Data model impact No impact. The reporting of NUMA topology will be integrated in the existing data -structure used for host state reporting. This already supports arbitrary -fields so no data model changes are anticipated for this part. This would -appear as structured data +structure used for host state reporting. This already supports arbitrary fields +so no data model changes are anticipated for this part. This would appear as +structured data :: @@ -215,8 +216,8 @@ REST API impact No impact. -The API for host state reporting already supports arbitrary data fields, so -no change is anticipated from that POV. No new API calls will be required. +The API for host state reporting already supports arbitrary data fields, so no +change is anticipated from that POV. No new API calls will be required. Security impact --------------- @@ -236,7 +237,7 @@ Other end user impact --------------------- Depending on the flavor chosen, the guest OS may see NUMA nodes backing its -RAM allocation. +memory allocation. There is no end user interaction in setting up NUMA policies of usage. @@ -247,18 +248,17 @@ Performance Impact The new scheduler features will imply increased performance overhead when determining whether a host is able to fit the memory and vCPU needs of the -flavor. ie the current logic which just checks the vCPU count and RAM +flavor. ie the current logic which just checks the vCPU count and memory requirement against the host free memory will need to take account of the availability of resources in specific NUMA nodes. Other deployer impact --------------------- -If the deployment has flavors whose RAM + vCPU allocations are larger than -the size of the NUMA nodes in the compute hosts, the cloud administrator -should strongly consider defining guest NUMA nodes in the flavor. This will -enable the compute hosts to have better NUMA utilization and improve perf of -the guest OS. +If the deployment has flavors whose memory + vCPU allocations are larger than +the size of the NUMA nodes in the compute hosts, the cloud administrator should +strongly consider defining guest NUMA nodes in the flavor. This will enable the +compute hosts to have better NUMA utilization and improve perf of the guest OS. Developer impact ---------------- @@ -286,7 +286,7 @@ Work Items * Enhance libvirt driver to look at NUMA node availability when launching guest instances and pin all guests to best NUMA node * Add support to scheduler for picking hosts based on the NUMA availability - instead of simply considering the total RAM/vCPU availability. + instead of simply considering the total memory/vCPU availability. Dependencies ============ @@ -306,20 +306,20 @@ Dependencies Testing ======= -There are various discrete parts of the work that can be tested in isolation -of each other, fairly effectively using unit tests. +There are various discrete parts of the work that can be tested in isolation of +each other, fairly effectively using unit tests. The main area where unit tests might not be sufficient is the scheduler integration, where performance/scalability would be a concern. Testing the scalability of the scheduler in tempest though is not practical, since the -issues would only become apparent with many compute hosts and many guests. -ie a scale beyond that which tempest sets up. +issues would only become apparent with many compute hosts and many guests, i.e. +a scale beyond that which tempest sets up. Documentation Impact ==================== -The cloud administrator docs need to describe the new flavor parameters -and make recommendations on how to effectively use them. +The cloud administrator docs need to describe the new flavor parameters and +make recommendations on how to effectively use them. The end user needs to be made aware of the fact that some flavors will cause the guest OS to see NUMA topology. @@ -328,8 +328,7 @@ References ========== Current "big picture" research and design for the topic of CPU and memory -resource utilization and placement. vCPU topology is a subset of this -work +resource utilization and placement. vCPU topology is a subset of this work: * https://wiki.openstack.org/wiki/VirtDriverGuestCPUMemoryPlacement