Merge "Standardize CPU resource tracking"

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
=====================
CPU resource tracking
=====================
https://blueprints.launchpad.net/nova/+spec/cpu-resources
We would like to both simplify the configuration of a compute node with regards
to CPU resource inventory as well as make the quantitative tracking of
dedicated CPU resources consistent with the tracking of shared CPU resources
via the placement API.
Problem description
===================
The way that CPU resources are currently tracked in Nova is overly complex
and, due to the coupling of CPU pinning with NUMA-related concepts inside the
``InstanceNUMATopology`` and ``NUMATopology`` (host) objects, difficult to
reason about in terms that are consistent with other classes of resource in
nova.
Tracking of dedicated CPU resources is not done using the placement API,
therefore there is no way to view the physical processor usage in the system.
The CONF options and extra specs / image properties surrounding host CPU
inventory and guest CPU pinning are difficult to understand, and despite
efforts to document them, there are only a few individuals who even know how to
"properly" configure a compute node for hosting certain workloads.
Definitions
-----------
**physical processor**
  A single logical processor on the host machine that is associated with a
  physical CPU core or hyperthread

**dedicated CPU**
  A physical processor that has been marked to be used for a single guest
  only

**shared CPU**
  A physical processor that has been marked to be used for multiple guests

**guest CPU**
  A logical processor configured in a guest

**VCPU**
  Resource class representing a unit of CPU resources for a single guest
  approximating the processing power of a single physical processor

**PCPU**
  Resource class representing a number of dedicated CPUs for a single guest

**CPU pinning**
  The process of deciding which guest CPU should be assigned to which
  dedicated CPU

**pinset**
  A set of physical processors

**pinset string**
  A specially-encoded string that indicates a set of specific physical
  processors

**NUMA-configured host system**
  A host computer that has multiple physical processors arranged in a
  non-uniform memory access architecture.

**guest virtual NUMA topology**
  When a guest wants its CPU resources arranged in a specific non-uniform
  memory architecture layout. A guest's virtual NUMA topology may or may not
  match an underlying host system's physical NUMA topology.

**emulator thread**
  An operating system thread created by QEMU to perform certain maintenance
  activities on a guest VM

**I/O thread**
  An operating system thread created by QEMU to perform disk or network I/O
  on behalf of a guest VM

**vCPU thread**
  An operating system thread created by QEMU to execute CPU instructions on
  behalf of a guest VM
Use Cases
---------
As an NFV orchestration system, I want to be able to differentiate between CPU
resources that require stable performance and CPU resources that can tolerate
inconsistent performance.

As an edge cloud deployer, I want to specify which physical processors should
be used for dedicated CPUs and which should be used for shared CPUs.

As a VNF vendor, I wish to specify to the infrastructure whether my VNF can use
hyperthread siblings as dedicated CPUs.
Proposed change
===============
Add ``PCPU`` resource class
---------------------------
In order to track dedicated CPU resources in the placement service, we need a
new resource class to differentiate guest CPU resources that are provided by a
host CPU that is shared among many guests (or many guest vCPU threads) from
guest CPU resources that are provided by a single host CPU.
A new ``PCPU`` resource class will be created for this purpose. It will
represent a unit of guest CPU resources that is provided by a dedicated host
CPU. In addition, a new config option, ``[compute] cpu_dedicated_set`` will be
added to track the host CPUs that will be allocated to the ``PCPU`` inventory.
This will complement the existing ``[compute] cpu_shared_set`` config option,
which will now be used to track the host CPUs that will be allocated to the
``VCPU`` inventory. These two sets must be disjoint; if they overlap, the
compute service will fail to start with an error. Otherwise, any host CPUs not
included in either set will be considered reserved for the host.
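
The intended relationship between the two options can be summarized with a
small sketch. ``parse_cpu_set`` and ``partition_host_cpus`` below are
illustrative helpers, not nova code, and the parser only handles simple range
strings::

    # Sketch only: illustrates the disjointness check and host-reserved
    # calculation described above. parse_cpu_set() handles plain ranges such
    # as "2-17"; it is a hypothetical helper, not a nova API.
    def parse_cpu_set(spec):
        cpus = set()
        for chunk in spec.split(','):
            if '-' in chunk:
                start, end = chunk.split('-')
                cpus.update(range(int(start), int(end) + 1))
            else:
                cpus.add(int(chunk))
        return cpus

    def partition_host_cpus(all_cpus, dedicated_spec, shared_spec):
        dedicated = parse_cpu_set(dedicated_spec)    # -> PCPU inventory
        shared = parse_cpu_set(shared_spec)          # -> VCPU inventory
        if dedicated & shared:
            # "Fail to start with an error" when the two sets overlap.
            raise ValueError('cpu_dedicated_set and cpu_shared_set overlap: '
                             '%s' % sorted(dedicated & shared))
        reserved = set(all_cpus) - (dedicated | shared)  # left to the host
        return dedicated, shared, reserved
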
The ``Flavor.vcpus`` field will continue to represent the combined number of
CPUs used by the instance, be they dedicated (``PCPU``) or shared (``VCPU``).
In addition, the ``cpu_allocation_ratio`` will apply only to ``VCPU`` resources
since overcommit for dedicated resources does not make sense.
.. note::

   This has significant implications for existing config options like
   ``vcpu_pin_set`` and ``[compute] cpu_shared_set``. These are discussed
   :ref:`below <cpu-resources_upgrade>`.
Add ``HW_CPU_HYPERTHREADING`` trait
-----------------------------------
Nova exposes hardware threads as individual "cores", meaning a host with, for
example, two Intel Xeon E5-2620 v3 CPUs will report 24 cores - 2 sockets * 6
cores * 2 threads. However, hardware threads aren't real CPUs as they share
many components with each other. As a result, processes running on these
cores can suffer from contention. This can be problematic for workloads that
require no contention (think: real-time workloads).
We support a feature called "CPU thread policies", first added in `Mitaka`__,
which provides a way for users to control how these threads are used by
instances. One of the policies supported by this feature, ``isolate``, allows
users to mark thread sibling(s) for a given CPU as reserved, avoiding resource
contention at the expense of not being able to use these cores for any other
workload. However, on a typical x86-based platform with hyperthreading enabled,
this can result in an instance consuming 2x more cores than expected, based on
the value of ``Flavor.vcpus``. These untracked allocations cannot be supported
in a placement world as we need to know how many ``PCPU`` resources to request
at scheduling time, and we can't inflate this number (to account for the
hyperthread sibling) without being absolutely sure that *every single host* has
hyperthreading enabled. As a result, we need to provide another way to track
whether hosts have hyperthreading or not. To this end, we will add the new
``HW_CPU_HYPERTHREADING`` trait, which will be reported for hosts where
hyperthreading is detected.
.. note::

   The ``HW_CPU_HYPERTHREADING`` trait will need to be among the traits that
   the virt driver cannot always override, since the operator may want to
   indicate that a single NUMA node on a multi-NUMA-node host is meant for
   guests that tolerate hyperthread siblings as dedicated CPUs.

.. note::

   This has significant implications for the existing CPU thread policies
   feature. These are discussed :ref:`below <cpu-resources_upgrade>`.
__ https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/virt-driver-cpu-thread-pinning.html
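
How the presence of hyperthreading is detected is left to the virt driver. As
an illustration only, a Linux host could be checked via sysfs thread-sibling
data; this sketch is an assumption about one possible check, not the driver's
actual detection logic::

    import glob

    def host_has_hyperthreading():
        """Return True if any host CPU reports a thread sibling."""
        pattern = ('/sys/devices/system/cpu/cpu[0-9]*/topology/'
                   'thread_siblings_list')
        for path in glob.glob(pattern):
            with open(path) as f:
                siblings = f.read().strip()
            # A value such as "2,26" or "2-3" means this CPU shares a core.
            if ',' in siblings or '-' in siblings:
                return True
        return False

    # A host where this returns True would report the HW_CPU_HYPERTHREADING
    # trait on its compute node resource provider.
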
Example host configuration
--------------------------
Consider a compute node with a total of 24 host physical CPU cores with
hyperthreading enabled. The operator wishes to reserve 1 physical CPU core and
its thread sibling for host processing (not for guest instance use).
Furthermore, the operator wishes to use 8 host physical CPU cores and their
thread siblings for dedicated guest CPU resources. The remaining 15 host
physical CPU cores and their thread siblings will be used for shared guest vCPU
usage, with an 8:1 allocation ratio for those physical processors used for
shared guest CPU resources.
The operator could configure ``nova.conf`` like so::
    [DEFAULT]
    cpu_allocation_ratio=8.0

    [compute]
    cpu_dedicated_set=2-17
    cpu_shared_set=18-47
The virt driver will construct a provider tree containing a single resource
provider representing the compute node and report inventory of ``PCPU`` and
``VCPU`` for this single provider accordingly::
    COMPUTE NODE provider
        PCPU:
            total: 18
            reserved: 2
            min_unit: 1
            max_unit: 16
            step_size: 1
            allocation_ratio: 1.0
        VCPU:
            total: 30
            reserved: 0
            min_unit: 1
            max_unit: 30
            step_size: 1
            allocation_ratio: 8.0
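
To make the arithmetic behind these figures explicit, the same numbers can be
reproduced with plain set operations; this is a worked reading of the example
above, not additional behaviour::

    all_cpus = set(range(48))                   # 24 cores x 2 threads
    dedicated = set(range(2, 18))               # cpu_dedicated_set=2-17
    shared = set(range(18, 48))                 # cpu_shared_set=18-47
    reserved = all_cpus - (dedicated | shared)  # CPUs 0-1, kept for the host

    print(len(dedicated))                  # 16 -> usable PCPU (max_unit)
    print(len(reserved))                   # 2  -> PCPU "reserved"
    print(len(dedicated) + len(reserved))  # 18 -> PCPU total
    print(len(shared))                     # 30 -> VCPU total
    print(int(len(shared) * 8.0))          # 240 schedulable VCPU at 8:1
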
Example flavor configurations
-----------------------------
Consider the following example flavor/image configurations, in increasing order
of complexity.
1) A simple web application server workload requires a couple CPU resources.
   The workload does not require any dedicated CPU resources::

     resources:VCPU=2

   For example::

     $ openstack flavor create --vcpus 2 ... example-1
     $ openstack flavor set --property resources:VCPU=2 example-1

   Alternatively, you can skip the explicit resource request and this will be
   provided by default. This is the current behavior::

     $ openstack flavor create --vcpus 2 ... example-1

2) A database server requires 8 CPU resources, and the workload needs dedicated
   CPU resources to minimize effects of other workloads hosted on the same
   hardware. The deployer wishes to ensure that those dedicated CPU resources
   are all served by the same resource provider::

     resources:PCPU=8

   For example::

     $ openstack flavor create --vcpus 8 ... example-2
     $ openstack flavor set --property resources:PCPU=8 example-2

   Alternatively, you can skip the explicit resource request and use the legacy
   ``hw:cpu_policy`` flavor extra spec instead::

     $ openstack flavor create --vcpus 8 ... example-2
     $ openstack flavor set --property hw:cpu_policy=dedicated example-2

   In this legacy case, ``hw:cpu_policy`` acts as an alias for
   ``resources=PCPU:${flavor.vcpus}`` as discussed :ref:`later
   <cpu-resources_upgrade>`.

3) A virtual network function running a packet-core processing application
   requires 8 CPU resources. The VNF specifies that the dedicated CPUs it
   receives should **not** be hyperthread siblings (in other words, it wants
   full cores for its dedicated CPU resources)::

     resources:PCPU=8
     trait:HW_CPU_HYPERTHREADING=forbidden

   For example::

     $ openstack flavor create --vcpus 8 ... example-3
     $ openstack flavor set --property resources:PCPU=8 \
         --property trait:HW_CPU_HYPERTHREADING=forbidden example-3

   Alternatively, you can skip the explicit resource request and trait request
   and use the legacy ``hw:cpu_policy`` and ``hw:cpu_thread_policy`` flavor
   extra specs instead::

     $ openstack flavor create --vcpus 8 ... example-3
     $ openstack flavor set --property hw:cpu_policy=dedicated \
         --property hw:cpu_thread_policy=isolate example-3

   In this legacy case, ``hw:cpu_policy`` acts as an alias for
   ``resources=PCPU:${flavor.vcpus}`` and ``hw:cpu_thread_policy`` acts as an
   alias for ``required=!HW_CPU_HYPERTHREADING``, as discussed :ref:`later
   <cpu-resources_upgrade>`.
.. note::

   The use of the legacy extra specs won't give the exact same behavior as
   previously as hosts that have hyperthreads will be excluded, rather than
   used but have their thread siblings isolated. This is unavoidable, as
   discussed :ref:`below <cpu-resources_upgrade>`.

.. note::

   It will not initially be possible to request both ``PCPU`` and ``VCPU`` in
   the same request. This functionality may be added later but such requests
   will be rejected until that happens.

.. note::

   You will note that the resource requests only include the total amount of
   ``PCPU`` and ``VCPU`` resources needed by an instance. It is entirely up to
   the ``nova.virt.hardware`` module to **pin** the guest CPUs to the host
   CPUs appropriately, doing things like taking NUMA affinity into account.
   The placement service will return those provider trees that match the
   required amount of requested PCPU resources. But placement does not do
   assignment of specific CPUs, only allocation of CPU resource amounts to
   particular providers of those resources.
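
As a purely illustrative sketch of that division of labour, a naive selection
of host CPUs on the compute side might look like the following; the helper and
its NUMA preference are assumptions, not the ``nova.virt.hardware``
implementation::

    def pick_dedicated_cpus(numa_nodes, free_dedicated, count):
        """Pick `count` free dedicated host CPUs, preferring one NUMA node.

        numa_nodes: {numa node id: set of host CPU ids on that node}
        free_dedicated: dedicated host CPUs not yet pinned to any guest
        """
        for node, cpus in numa_nodes.items():
            candidates = sorted(cpus & free_dedicated)
            if len(candidates) >= count:
                # Guest CPU i is pinned to candidates[i].
                return candidates[:count]
        raise RuntimeError('no single NUMA node can satisfy the request')

    # Example: two NUMA nodes with 8 dedicated CPUs each; 4 CPUs on node 0
    # are already pinned, so a 6-CPU guest lands entirely on node 1.
    numa_nodes = {0: set(range(2, 10)), 1: set(range(10, 18))}
    print(pick_dedicated_cpus(numa_nodes, set(range(6, 18)), 6))
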
Alternatives
------------
There's definitely going to be some confusion around ``Flavor.vcpus``
referring to both ``VCPU`` and ``PCPU`` resource classes. To avoid this, we
could call the ``PCPU`` resource class ``CPU_DEDICATED`` to more explicitly
indicate its purpose. However, we will continue to use the ``VCPU`` resource
class to represent shared CPU resources and ``PCPU`` seemed a better logical
counterpart to the existing ``VCPU`` resource class.
Another option is to call the ``PCPU`` resource class ``VCPU_DEDICATED``. This
doubles down on the idea that the term *vCPU* refers to an instance's CPUs (as
opposed to the host CPUs) but the name is clunky and it's still somewhat
confusing.
Data model impact
-----------------
None.
REST API impact
---------------
None.
Security impact
---------------
None.
Notifications impact
--------------------
None.
Other end user impact
---------------------
This proposal should actually make the CPU resource tracking easier to reason
about and understand for end users by making the inventory of both shared and
dedicated CPU resources consistent.
Performance Impact
------------------
There should be a positive impact on performance due to the placement service
being able to perform a good portion of the work that the
``NUMATopologyFilter`` currently does. The ``NUMATopologyFilter`` would be
trimmed down to only handling questions about whether a particular thread
allocation policy (tolerance of hyperthreads) could be met by a compute node.
The number of ``HostInfo`` objects passed to the ``NUMATopologyFilter`` will
have already been reduced to only those hosts which have the required number of
dedicated and shared CPU resources.
Note that the ``NUMATopologyFilter`` will still need to contain the more
esoteric and complex logic surrounding CPU pinning and understanding NUMA node
CPU amounts before compute nodes are given the ability to represent NUMA nodes
as child resource providers in provider tree.
Other deployer impact
---------------------
Primarily, the impact on deployers will be documentation-related. Good
documentation needs to be provided that, like the above example flavor
configurations, shows operators what resources and traits extra specs to
configure in order to get a particular behavior and which configuration options
have changed.
Developer impact
----------------
None.
.. _cpu-resources_upgrade:
Upgrade impact
--------------
The upgrade impact of this feature is large and while we will endeavour to
minimize impacts to the end user, there will be some disruption. The various
impacts are described below. Before reading these, it may be worth reading the
following articles which describe the current behavior of nova in various
situations:
* `NUMA, CPU Pinning and 'vcpu_pin_set'
<https://that.guru/blog/cpu-resources/>`__
Configuration options
~~~~~~~~~~~~~~~~~~~~~
We will deprecate the ``vcpu_pin_set`` config option in Train. If both the
``[compute] cpu_dedicated_set`` and ``[compute] cpu_shared_set`` config options
are set in Train, this option will be ignored entirely and ``[compute]
cpu_shared_set`` will be used in place of ``vcpu_pin_set`` to calculate the
amount of ``VCPU`` resources to report for each compute node. If the
``[compute] cpu_dedicated_set`` option is not set in Train, we will issue a
warning and fall back to using ``vcpu_pin_set`` as the set of host logical
processors to allocate for ``PCPU`` resources. These CPUs **will not** be
excluded from the list of host logical processors used to generate the
inventory of ``VCPU`` resources since ``vcpu_pin_set`` is useful for all
NUMA-based instances, not just those with pinned CPUs, and we therefore cannot
assume that these will be used exclusively by pinned instances. However, this
double reporting of inventory is not considered an issue as our long-standing
advice has been to use host aggregates to group pinned and unpinned instances.
As a result, we should not encounter the two types of instance on the same host
and either the ``VCPU`` or ``PCPU`` inventory will be unused. If host
aggregates are not used and both pinned and unpinned instances exist in the
cloud, the user will already be seeing overallocation issues: namely, unpinned
instances do not respect the pinning constraints of pinned instances and may
float across the cores that are supposed to be "dedicated" to the pinned
instances.
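
A condensed sketch of the option handling described above, with invented
helper names and several edge cases omitted, may help summarize the intended
Train behaviour::

    import warnings

    def pick_cpu_sets(cpu_dedicated_set, cpu_shared_set, vcpu_pin_set,
                      all_host_cpus):
        """Return (CPUs for PCPU inventory, CPUs for VCPU inventory)."""
        if cpu_dedicated_set and cpu_shared_set:
            # New-style configuration: vcpu_pin_set is ignored entirely.
            return cpu_dedicated_set, cpu_shared_set
        if vcpu_pin_set:
            # Legacy configuration: warn and use vcpu_pin_set for *both*
            # inventories, since these CPUs may serve pinned or unpinned
            # instances; host aggregates are expected to keep the two apart.
            warnings.warn('vcpu_pin_set is deprecated; set [compute] '
                          'cpu_dedicated_set and [compute] cpu_shared_set')
            return vcpu_pin_set, vcpu_pin_set
        # Nothing set: every host CPU is reported as shared VCPU inventory
        # only (assumed to match the existing default behaviour).
        return set(), all_host_cpus
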
We will also deprecate the ``reserved_host_cpus`` config option in Train. If
both the ``[compute] cpu_dedicated_set`` and ``[compute] cpu_shared_set``
config options are set in Train, the value of the ``reserved_host_cpus`` config
option will be ignored and the virt driver will calculate the ``PCPU``
inventory reserved amount using the following formula::
    (set(all_cpus) - (set(dedicated) | set(shared)))
If the ``[compute] cpu_dedicated_set`` config option is not set, a warning will
be logged stating that ``reserved_host_cpus`` is deprecated and that the
operator should set both ``[compute] cpu_shared_set`` and ``[compute]
cpu_dedicated_set``.
The meaning of ``[compute] cpu_shared_set`` will change with this feature, from
being a list of host CPUs used for emulator threads to a list of host CPUs used
for both emulator threads and ``VCPU`` resources. Note that because this option
already exists, we can't rely on its presence to do things like ignore
``vcpu_pin_set``, as outlined previously, and must rely on ``[compute]
cpu_dedicated_set`` instead.
Finally, we will change documentation for the ``cpu_allocation_ratio`` config
option to make it abundantly clear that this option ONLY applies to ``VCPU``
and not ``PCPU`` resources.
Flavor extra specs and image metadata properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We will alias the ``hw:cpu_policy`` flavor extra spec and ``hw_cpu_policy``
image metadata option to ``resources=(V|P)CPU:${flavor.vcpus}`` using a
scheduler prefilter. For flavors/images using the ``shared`` policy, we will
replace this with the ``resources=VCPU:${flavor.vcpus}`` extra spec, and for
flavors/images using the ``dedicated`` policy, we will replace this with the
``resources=PCPU:${flavor.vcpus}`` extra spec. Note that this is similar,
though not identical, to how we currently translate ``Flavor.vcpus`` into a
placement request for ``VCPU`` resources during scheduling.
In addition, we will alias the ``hw:cpu_thread_policy`` flavor extra spec and
``hw_cpu_thread_policy`` image metadata option to
``trait:HW_CPU_HYPERTHREADING`` using a scheduler prefilter. For flavors/images
using the ``isolate`` policy, we will replace this with
``trait:HW_CPU_HYPERTHREADING=forbidden``, and for flavors/images using the
``require`` policy, we will replace this with the
``trait:HW_CPU_HYPERTHREADING=required`` extra spec.
Placement inventory
~~~~~~~~~~~~~~~~~~~
For existing compute nodes that have guests which use dedicated CPUs, the virt
driver will need to move inventory of existing ``VCPU`` resources (which are
actually using dedicated host CPUs) to the new ``PCPU`` resource class.
Furthermore, existing allocations for guests on those compute nodes will need
to have their allocation records updated from the ``VCPU`` to ``PCPU`` resource
class.
In addition, for existing compute nodes that have guests which use dedicated
CPUs **and** the ``isolate`` CPU thread policy, the number of allocated
``PCPU`` resources may need to be increased to account for the additional CPUs
"reserved" by the host. On an x86 host with hyperthreading enabled, this will
result in 2x the number of ``PCPU``\ s being reserved (N ``PCPU`` resources
for the instance itself and N ``PCPU`` allocated to avoid another instance
using them). This will be considered legacy behavior and won't be supported for
new instances.
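
For illustration, the shape of the allocation part of this reshape might
resemble the sketch below. The dictionaries are simplified stand-ins for
placement allocation records, not the virt driver's actual reshape code, and
the inventory totals themselves would be recomputed from the new ``[compute]
cpu_dedicated_set`` and ``[compute] cpu_shared_set`` options::

    def reshape_allocations(allocations, pinned_instance_uuids):
        """Rewrite allocation records for pinned guests from VCPU to PCPU.

        allocations: {instance uuid: {resource class: amount}}
        """
        for uuid in pinned_instance_uuids:
            usage = allocations.get(uuid, {})
            amount = usage.pop('VCPU', 0)
            if amount:
                usage['PCPU'] = usage.get('PCPU', 0) + amount
                # Legacy guests using the "isolate" thread policy on a
                # hyperthreaded host may additionally need this amount
                # increased, as described above.
        return allocations
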
Implementation
==============
Assignee(s)
-----------
Primary assignees:
* stephenfin
* tetsuro nakamura
* jaypipes
* cfriesen
* bauzas
Work Items
----------
* Create ``PCPU`` resource class
* Create ``[compute] cpu_dedicated_set`` and ``[compute] cpu_shared_set``
options
* Modify virt code to calculate the set of host CPUs that will be used for
dedicated and shared CPUs by using the above new config options
* Modify the code that creates the request group from the flavor's extra specs
and image properties to construct a request for ``PCPU`` resources when the
``hw:cpu_policy=dedicated`` spec is found (smooth transition from legacy)
* Modify the code that currently looks at the
``hw:cpu_thread_policy=isolate|share`` extra spec / image property to add a
``required=HW_CPU_HYPERTHREADING`` or ``required=!HW_CPU_HYPERTHREADING`` to
the request to placement
* Modify virt code to reshape resource allocations for instances with dedicated
CPUs to consume ``PCPU`` resources instead of ``VCPU`` resources
Dependencies
============
None.
Testing
=======
Lots of functional testing for the various scenarios listed in the use cases
above will be required.
Documentation Impact
====================
* Docs for admin guide about configuring flavors for dedicated and shared CPU
resources
* Docs for user guide explaining difference between shared and dedicated CPU
resources
* Docs for how the operator can configure a single host to support guests that
tolerate thread siblings as dedicated CPUs along with guests that cannot
References
==========
* `Support shared and dedicated VMs on same host`_
* `Support shared/dedicated vCPU in one instance`_
* `Emulator threads policy`_
.. _Support shared and dedicated VMs on same host: https://review.openstack.org/#/c/543805/
.. _Support shared/dedicated vCPU in one instance: https://review.openstack.org/#/c/545734/
.. _Emulator threads policy: https://review.openstack.org/#/c/511188/
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Rocky
     - Originally proposed, not accepted
   * - Stein
     - Proposed again, not accepted
   * - Train
     - Proposed again