Merge "PCI NUMA Policies"

Added file: specs/queens/approved/share-pci-between-numa-nodes.rst (265 lines)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=================
PCI NUMA Policies
=================

https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes

In the Juno release, the "I/O based NUMA scheduling" spec was implemented
[1]_. This modified the scheduling algorithm such that users were only
allowed to boot instances with PCI devices if the instance was scheduled on
at least one of the NUMA nodes associated with those PCI devices, or if no
NUMA affinity information was available for the PCI devices. Before this,
nova booted instances with PCI devices without checking NUMA affinity.
However, such hard-coded behaviour causes problems if not every NUMA node
has its own PCI device: in this case, nova won't allow booting an instance
on the NUMA nodes without PCI devices.

Problem description
===================

In its current iteration, nova boots instances with PCI devices on the same
NUMA nodes that those PCI devices are associated with. This is good for
performance, as it ensures there is limited cross-NUMA node memory traffic.
However, if a user has an environment with two NUMA nodes and only one PCI
device (for example, an SR-IOV card associated with the first NUMA node),
they would only be able to boot instances with *one* NUMA node and SR-IOV
ports on the first NUMA node. In this case, the user cannot use half of the
CPUs and RAM because these resources are placed on the second NUMA node. The
user should be able to boot instances on different NUMA nodes, even if it
makes performance worse.

In addition, the current behavior doesn't always provide the best performance
because an instance can use a PCI device even when there is no information
about the affinity between NUMA nodes and that PCI device. This can lead to a
situation where the PCI device is not on the NUMA node that the CPU and RAM
are on. The scheduling mechanism should be more flexible: the user should be
able to choose between maximum performance and the maximum chance of
successfully launching the instance.

Of course, this ability should be configurable and the current scheduling
behaviour must remain the default.

Use Cases
---------

- As an operator who cares about obtaining maximum performance from my PCI
  devices, I want to ensure my PCI devices are always NUMA affinitized, even
  if this results in lower resource usage.

- As an operator who cares about maximum usage of resources, I want to ensure
  that an instance has the best chance of being scheduled successfully, even
  if this results in slightly lower performance for some instances.

- As an operator of a deployment with a mix of NUMA-aware and non-NUMA-aware
  hosts, I want to ensure my PCI devices are always NUMA affinitized *if NUMA
  information is available*. However, I still want to be able to schedule
  instances on the non-NUMA-aware hosts.

  Alternatively, as an operator with an existing deployment using PCI
  devices, I don't want nova to pull the rug from under my feet and suddenly
  refuse to schedule to hosts with no NUMA information when it used to.

Proposed change
===============

This spec concerns how nova decides which PCI devices will be used by a new
instance. To this end, we will add a new flavor extra spec,
``hw:pci_numa_affinity_policy``, and image metadata property,
``hw_pci_numa_affinity_policy``. These will have one of three values.

**required**

This value will mean that nova will boot instances with PCI devices *only* if
at least one of the instance's NUMA nodes is associated with those PCI
devices. This means that if NUMA node information could not be determined for
some PCI devices, those PCI devices would not be consumable by the instance.
This provides maximum performance.

**preferred**

This value will mean that `nova-scheduler` will choose a compute host with
minimal consideration for the NUMA affinity of PCI devices. `nova-compute`
will attempt a best-effort selection of PCI devices based on NUMA affinity;
however, if this is not possible, `nova-compute` will fall back to scheduling
on a NUMA node that is not associated with the PCI device.

Note that even though the ``NUMATopologyFilter`` will not consider NUMA
affinity, the weigher proposed in the *Reserve NUMA Nodes with PCI Devices
Attached* spec [2]_ can be used to maximize the chance that a chosen host
will have NUMA-affinitized PCI devices.

**legacy**

This is the default value and it describes the current nova behavior. Usually
we have information about the association of PCI devices with NUMA nodes.
However, some PCI devices do not provide such information. The ``legacy``
value will mean that nova will boot instances with PCI devices if either:

* The PCI device is associated with at least one of the NUMA nodes on which
  the instance will be booted

* There is no information about PCI-NUMA affinity available

This is required because the configuration option will apply globally to an
instance which may have multiple devices attached, and not all of these
devices may have NUMA affinity. An example of such a device is the FPGAs
integrated onto the dies of recent Intel Xeon chips, which hook into the QPI
bus and therefore have no NUMA affinity [3]_.
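
To make the semantics of the three policies concrete, the following is a
minimal sketch only, not the actual nova implementation: the function name,
the pool representation (a dict with a ``numa_node`` key) and the arguments
are illustrative assumptions.

.. code-block:: python

    def filter_pools(pools, instance_numa_nodes, policy):
        """Illustrative only: select acceptable PCI device pools.

        ``pools`` is a list of dicts with a ``numa_node`` key (an int, or
        ``None`` when the device reports no NUMA affinity) and
        ``instance_numa_nodes`` is the set of host NUMA nodes the instance
        is placed on.
        """
        if policy == 'required':
            # Only devices with known affinity to one of the instance's nodes.
            return [p for p in pools
                    if p['numa_node'] in instance_numa_nodes]
        if policy == 'legacy':
            # Devices affined to the instance's nodes, plus devices that
            # report no NUMA information at all (the current behaviour).
            return [p for p in pools
                    if p['numa_node'] is None
                    or p['numa_node'] in instance_numa_nodes]
        # 'preferred': every device is acceptable; affinitized devices are
        # simply tried first, falling back to the rest on a best-effort
        # basis.
        return sorted(
            pools, key=lambda p: p['numa_node'] not in instance_numa_nodes)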

If neither the image nor the flavor property is set (both equal ``None``),
the ``legacy`` policy will be used. If only one of the image or flavor
properties is set, the value of the set property will be used. In the case of
a conflict between the flavor and image properties (both properties are set
and they are not equal), an exception will be raised.
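
The precedence rules above could be expressed roughly as follows; this is an
illustrative sketch only (the function name and exception type are
placeholders, not the actual nova code):

.. code-block:: python

    def resolve_policy(flavor_policy, image_policy):
        """Illustrative only: pick the effective PCI NUMA affinity policy.

        Either argument is ``None`` when the corresponding flavor extra
        spec or image metadata property is not set.
        """
        if flavor_policy is None and image_policy is None:
            return 'legacy'
        if flavor_policy is not None and image_policy is not None:
            if flavor_policy != image_policy:
                # Both set but conflicting: reject the request.
                raise ValueError('Conflicting PCI NUMA affinity policies: '
                                 'flavor=%s, image=%s'
                                 % (flavor_policy, image_policy))
            return flavor_policy
        # Exactly one of the two is set: use it.
        return flavor_policy if flavor_policy is not None else image_policy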

Alternatives
------------

- Change placement behavior to *not* boot instances which do not need PCI
  devices on NUMA nodes with PCI devices. This would maximize the possibility
  that an instance that requires PCI devices could find a suitable host to
  boot on. However, it would severely limit our flexibility, as attempting to
  boot many instances without PCI devices would result in a large number of
  unused, PCI device-having hosts. Furthermore, once all non-PCI-having NUMA
  nodes are saturated, deploys of non-PCI-needing instances would fail.

- Change placement behavior to *avoid* booting instances without PCI devices
  on NUMA nodes with PCI devices *if possible*. This is a softer version of
  the first alternative and has actually been addressed by the
  'reserve-numa-with-pci' spec [4]_.

- Make the PCI NUMA strictness part of the individual PCI device request.
  This would allow us to represent requests like "I need to be strictly
  affined to this NIC, but I don't need to be strictly affined to this FPGA".
  It is very unlikely that this level of granularity would be required. In
  addition, it's difficult to see how this would fit into the resource
  provider world in the future, as the problem is transformed from a
  scheduling one (at the host level) to a placement one.

Data model impact
-----------------

A new field, ``pci_numa_affinity_policy``, will be added to the
``InstanceNUMACell`` object. As this object is stored as a JSON blob in the
database, no DB migrations are necessary to add the new field to this object.
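
As a rough sketch (assuming an enum-style field from oslo.versionedobjects;
the actual field type, valid values and object version handling would be
settled during implementation), the new field might look something like this:

.. code-block:: python

    from oslo_versionedobjects import fields

    # Sketch only: the new, nullable field as it might appear in the
    # ``fields`` dict of the ``InstanceNUMACell`` object. Adding it would
    # also require bumping the object's version.
    pci_numa_affinity_policy = fields.EnumField(
        valid_values=['required', 'preferred', 'legacy'], nullable=True)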

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

If the ``required`` policy is selected, the performance of instances with PCI
devices will be more consistent in deployments with non-NUMA-aware compute
hosts present. This is because nova would no longer use these hosts. However,
this will also result in a smaller number of hosts available on which to
schedule instances. If all hosts correctly provide NUMA information,
performance will be unchanged.

If the ``preferred`` policy is selected, the performance of instances with
PCI devices may be worse for some instances. This is because nova can now
schedule an instance on a host with non-NUMA-affinitized PCI devices.
However, this will also result in a larger number of hosts available on which
to schedule instances, maximizing flexibility for operators who don't require
maximum performance. The PCI weigher proposed in the *Reserve NUMA Nodes with
PCI Devices Attached* spec [2]_ can be used to minimize the risk of
performance impacts.

If the ``legacy`` policy is selected, the existing nova behaviour will be
retained and performance will remain unchanged.

From a scheduling perspective, this may introduce a delay if the ``required``
policy is selected and there are a large number of hosts with PCI devices
that do not report NUMA affinity. On the other hand, using the ``preferred``
policy will result in improved scheduling performance, as the ability to
schedule is no longer tied to the availability of free CPUs on a NUMA node
associated with the PCI device.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Stephen Finucane (stephenfinucane)

Other contributors:
  Sergey Nikitin (snikitin)

Work Items
----------

* Add the new extra spec to the flavor
* Add the new field to the ``InstanceNUMACell`` object
* Change the NUMA node selection process to take the new policy into account
* Update user docs

Dependencies
============

None

Testing
=======

Scenario tests will be added to validate these modifications.

Documentation Impact
====================

This feature will not add a new scheduling filter, but it will change the
behaviour of the ``NUMATopologyFilter``. We should add documentation
describing the new flavor extra spec and image metadata property.

References
==========

.. [1] https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/input-output-based-numa-scheduling.html
.. [2] https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/reserve-numa-with-pci.html
.. [3] https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
.. [4] https://blueprints.launchpad.net/nova/+spec/reserve-numa-with-pci

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced