Merge "re-propose numa with placement"

Zuul 2020-05-22 09:24:44 +00:00 committed by Gerrit Code Review
commit ad3192d320
1 changed files with 652 additions and 0 deletions

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
=====================================
NUMA Topology with Resource Providers
=====================================

https://blueprints.launchpad.net/nova/+spec/numa-topology-with-rps

Now that `Nested Resource Providers`_ are a thing in both the Placement API
and Nova compute nodes, we could use the Resource Providers tree to express
the relationship between a root Resource Provider (root RP), i.e. a compute
node, and one or more Non-Uniform Memory Access (NUMA) nodes (aka. cells),
each of them having separate resources, like memory or PCI devices.
.. note::

   This spec only aims to model resource capabilities for NUMA nodes in a
   general and quite abstract manner. We won't address in this spec how we
   should model NUMA-affinized hardware like PCI devices or GPUs; those
   relationships will be discussed in a later spec.
Problem description
===================
The NUMATopologyFilter checks a number of resources, including emulator
thread policies, CPU-pinned instances and memory page sizes. Additionally,
it performs two different verifications:

- *whether* a host can fit the request because it has enough capacity
- *which* resource(s) should be used for this request (e.g. which pCPUs or
  NUMA node)
With NUMA topologies modeled as Placement resources, those two questions
could be answered by the Placement service returning potential allocation
candidates, with the filter *only* being responsible for choosing between
them in some very specific cases (e.g. PCI device NUMA affinity, CPU
pinning and NUMA anti-affinity).
Accordingly, we could model the host memory and the CPU topologies as a set of
resource providers arranged in a tree, and just directly allocate resources for
a specific instance from a resource provider subtree representing a NUMA node
and its resources.
That said, non resource-related features (like `choosing a specific CPU pin
within a NUMA node for a vCPU`_) would still be only done by the virt driver,
and are not covered by this spec.
Use Cases
---------
Consider the following NUMA topology for a 2-NUMA-node, 4-core host with no
Hyper-Threading:
.. code::

    +--------------------------------------+
    |                  CN1                 |
    +-+---------------+--+---------------+-+
      |     NUMA1     |  |     NUMA2     |
      +-+----+-+----+-+  +-+----+-+----+-+
        |CPU1| |CPU2|      |CPU3| |CPU4|
        +----+ +----+      +----+ +----+
Here, CPU1 and CPU2 would share the same memory through a common memory
controller, while CPU3 and CPU4 would share their own memory.
Ideally, applications that require low-latency memory access from multiple
vCPUs on the same instance (for parallel computing reasons) would like to
ensure that those CPU resources are provided by the same NUMA node, or some
performance penalties would occur (if your application is CPU-bound or
I/O-bound of course). For the moment, if you're an operator, you can use flavor
extra specs to indicate a desired guest NUMA topology for your instance like:
.. code::

    $ openstack flavor set FLAVOR-NAME \
        --property hw:numa_nodes=FLAVOR-NODES \
        --property hw:numa_cpus.N=FLAVOR-CORES \
        --property hw:numa_mem.N=FLAVOR-MEMORY
See all the `NUMA possible extra specs`_ for a flavor.
.. note::

   The example above is only needed when you don't want to divide your
   virtual CPUs and memory evenly between NUMA nodes, of course.
Proposed change
===============
Given there are a lot of NUMA concerns, let's take an iterative approach to
the model we agree on.
NUMA nodes being nested Resource Providers
------------------------------------------
Since virt drivers can amend the provider tree given by the compute node
ResourceTracker, the libvirt driver could create a child provider for each
of the two sockets representing a separate NUMA node.
Since CPU resources are tied to a specific NUMA node, it makes sense to
model the corresponding resource classes as part of the child NUMA Resource
Providers. In order to facilitate querying NUMA resources, we propose to
decorate each NUMA child resource provider with a specific trait named
``HW_NUMA_ROOT``. That would help to know which hosts are *NUMA-aware* and
which are not.
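To make this concrete, here is a minimal sketch of how a virt driver could
populate such child providers through the existing ``update_provider_tree()``
interface; the ``_get_numa_cells()`` helper and the ``cell`` attributes are
hypothetical stand-ins for driver-specific discovery code:

.. code:: python

    import os_resource_classes as orc


    def update_provider_tree(self, provider_tree, nodename, allocations=None):
        """Sketch: add one child RP per NUMA cell under the compute node."""
        for cell in self._get_numa_cells():  # hypothetical discovery helper
            # Naming convention proposed below: <nodename>_NUMA<cell id>
            numa_rp = '%s_NUMA%d' % (nodename, cell.id)
            if not provider_tree.exists(numa_rp):
                provider_tree.new_child(numa_rp, nodename)
            provider_tree.update_inventory(numa_rp, {
                orc.VCPU: {'total': cell.vcpus},
                orc.PCPU: {'total': cell.pcpus},
            })
            # The trait proposed above, marking the RP as a NUMA root
            provider_tree.update_traits(numa_rp, ['HW_NUMA_ROOT'])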
Memory is a bit tougher to represent. Giving each NUMA node an amount of
attached memory is a first approximation, but it misses the point that the
smallest allocatable unit of memory in Nova is really a page. Accordingly,
we should rather model our NUMA subtree with child Resource Providers that
represent the smallest unit of memory you can allocate, i.e. a page size.
Since a page size is not a *consumable* amount but rather *qualitative*
information that helps us allocate ``MEMORY_MB`` resources, we propose
three traits:
- ``MEMORY_PAGE_SIZE_SMALL`` and ``MEMORY_PAGE_SIZE_LARGE`` would allow us
  to know whether the memory page size is the default or optionally
  configured.
- ``CUSTOM_MEMORY_PAGE_SIZE_<X>``, where ``<X>`` is an integer, would allow
  us to know the size of the page in KB. To be clear, even if the trait is
  a custom one, it's important to have a naming convention for it so the
  scheduler can ask about page sizes without knowing all the traits.
.. code::

                 +-------------------------------+
                 |           <CN_NAME>           |
                 |          DISK_GB: 5           |
                 +-------------------------------+
                 |     (no specific traits)      |
                 +--+---------------------------++
                    |                           |
                    |                           |
       +------------+------------+  +-----------+------------+
       |      <NUMA_NODE_0>      |  |     <NUMA_NODE_1>      |
       |         VCPU: 8         |  |        VCPU: 8         |
       |        PCPU: 16         |  |        PCPU: 8         |
       +-------------------------+  +------------------------+
       |       HW_NUMA_ROOT      |  |      HW_NUMA_ROOT      |
       +---+---------+-------+---+  +-----------+------------+
           |          \       \                /+\
           |           \       \________________________________  .......
           |            \____________________                   \
    +------+------------------+  +-----------+----------------+  +-------------------------------+
    |        <RP_UUID>        |  |         <RP_UUID>          |  |           <RP_UUID>           |
    |     MEMORY_MB: 1024     |  |      MEMORY_MB: 1024       |  |       MEMORY_MB: 10240        |
    |       step_size=1       |  |        step_size=2         |  |        step_size=1024         |
    +-------------------------+  +----------------------------+  +-------------------------------+
    |MEMORY_PAGE_SIZE_SMALL   |  |MEMORY_PAGE_SIZE_LARGE      |  |MEMORY_PAGE_SIZE_LARGE         |
    |CUSTOM_MEMORY_PAGE_SIZE_4|  |CUSTOM_MEMORY_PAGE_SIZE_2048|  |CUSTOM_MEMORY_PAGE_SIZE_1048576|
    +-------------------------+  +----------------------------+  +-------------------------------+
.. note::

   As we said above, we don't want to support child PCI device providers
   for Ussuri at the moment. Other current child RPs of a root compute
   node, like the ones for VGPU resources or bandwidth resources, would
   still have the compute node as their parent.
NUMA RP
-------
Resource Provider names for NUMA nodes shall follow the convention
``nodename_NUMA#``, where nodename is the hypervisor hostname (given by the
virt driver) and NUMA# is literally the string 'NUMA' suffixed with the
NUMA cell ID provided by the virt driver.

Each NUMA node would then be a child Resource Provider having two resource
classes:

* ``VCPU``: telling how many virtual cores (which cannot be pinned) the
  NUMA node has.
* ``PCPU``: telling how many pinnable cores the NUMA node has.

As explained above, a specific trait should decorate it: ``HW_NUMA_ROOT``.
Memory pagesize RP
------------------
Each `NUMA RP`_ should have a child RP for each possible memory page size
on the host, having a single resource class:

* ``MEMORY_MB``: telling how much memory the NUMA node has in that specific
  page size.

This RP would be decorated with two traits:

- either ``MEMORY_PAGE_SIZE_SMALL`` (default if not configured) or
  ``MEMORY_PAGE_SIZE_LARGE`` (if large pages are configured)
- the page size itself: ``CUSTOM_MEMORY_PAGE_SIZE_#`` (where # is the size
  in KB, defaulting to 4 as the kernel defaults to 4KB pages)
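Continuing the earlier sketch, a hypothetical helper creating those
per-page-size grand-children could look like this; the RP naming and the
``mempages`` objects are assumptions, while the traits and ``step_size``
values follow the proposal above:

.. code:: python

    import os_resource_classes as orc

    SMALL_PAGE_KB = 4  # kernel default page size


    def _add_pagesize_rps(provider_tree, numa_rp, mempages):
        """Sketch: one child RP per page size under a NUMA RP."""
        for pages in mempages:  # hypothetical per-cell page size inventory
            rp_name = '%s_PAGE_%dKB' % (numa_rp, pages.size_kb)  # assumed name
            if not provider_tree.exists(rp_name):
                provider_tree.new_child(rp_name, numa_rp)
            provider_tree.update_inventory(rp_name, {
                orc.MEMORY_MB: {
                    # Total memory of this page size, expressed in MB
                    'total': pages.total * pages.size_kb // 1024,
                    # Allocations must be multiples of the page size
                    'step_size': max(pages.size_kb // 1024, 1),
                },
            })
            small_or_large = ('MEMORY_PAGE_SIZE_SMALL'
                              if pages.size_kb == SMALL_PAGE_KB
                              else 'MEMORY_PAGE_SIZE_LARGE')
            provider_tree.update_traits(
                rp_name,
                [small_or_large,
                 'CUSTOM_MEMORY_PAGE_SIZE_%d' % pages.size_kb])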
Compute node RP
---------------
The root Resource Provider (i.e. the compute node) would only provide
resources for classes that are not NUMA-related. Existing child RPs for
vGPUs or bandwidth-aware resources would still have this parent (until we
discuss NUMA affinity for PCI devices).
Optionally configured NUMA resources
------------------------------------
Given there are NUMA workloads but also non-NUMA workloads, it's also
important for operators to have compute nodes that just accept the latter.
That said, having the compute node resources split between multiple NUMA
nodes could be a problem for those non-NUMA workloads if we want to keep
the existing behaviour.

For example, take an instance asking for 2 vCPUs and a host with 2 NUMA
nodes, each of them only able to accept one VCPU: the Placement API
wouldn't return that host (given each nested RP only provides one VCPU).
For that reason, we need a configuration option saying which resources
should be nested.

To reinforce the above, a host would be either NUMA-aware or
non-NUMA-aware, since on a NUMA-aware host a non-NUMA workload would
otherwise end up confined to a specific NUMA node.
The proposal we make here is:

.. code::

    [compute]
    enable_numa_reporting_to_placement = <bool> (default None for Ussuri)
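For illustration, a minimal sketch of how such an option could be registered
with ``oslo.config`` (the help text is ours; a ``BoolOpt`` with a ``None``
default gives the tri-state behaviour described below):

.. code:: python

    from oslo_config import cfg

    compute_group = cfg.OptGroup('compute')
    numa_opts = [
        cfg.BoolOpt('enable_numa_reporting_to_placement',
                    default=None,
                    help='Whether this host reports its NUMA topology as '
                         'nested resource providers to Placement. Leaving '
                         'it unset (None) keeps the legacy behaviour '
                         'during the transition period.'),
    ]


    def register_opts(conf):
        conf.register_group(compute_group)
        conf.register_opts(numa_opts, group=compute_group)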
In what follows, we call hosts that have this option set to ``True``
"NUMA-aware". Hosts that have this option set to ``False`` are explicitly
asked to keep the legacy behaviour and are called "non-NUMA-aware".

Depending on the value of the option, Placement will or will not return a
host for the corresponding request. The resulting matrix is::
    +----------------------------------------+----------+-----------+----------+
    | ``enable_numa_reporting_to_placement`` | ``None`` | ``False`` | ``True`` |
    +========================================+==========+===========+==========+
    | NUMA-aware flavors                     | Yes      | No        | Yes      |
    +----------------------------------------+----------+-----------+----------+
    | NUMA-agnostic flavors                  | Yes      | Yes       | No       |
    +----------------------------------------+----------+-----------+----------+
where ``Yes`` means that there could be allocation candidates from this host,
while ``No`` means that no allocation candidates will be returned.
In order to distinguish compute nodes that have the ``False`` value from
those with ``None``, we will decorate the former with a specific trait named
``HW_NON_NUMA``. Accordingly, we will add this forbidden trait to Placement
queries so as *not* to get nodes whose operators explicitly don't want them
to support NUMA-aware flavors.
.. note::

   By default, the value of that configuration option will be ``None`` for
   upgrade reasons. During the Ussuri timeframe, operators will have to
   decide which hosts should support NUMA-aware instances and which should
   be dedicated to non-NUMA-aware instances. A `nova-status pre-upgrade
   check` command will be provided that warns them to decide before
   upgrading to Victoria, as the default value could change later in this
   cycle. Once we stop supporting ``None`` (in Victoria or later), the
   ``HW_NON_NUMA`` trait would no longer be needed, so we could stop
   querying it.
.. note::

   Since we allow a transition period to help operators decide, we will
   also make clear that this is a one-way change and that we won't provide
   backwards support for turning a NUMA-aware host back into a
   non-NUMA-aware host.
See the `Upgrade impact`_ section for further details.
.. note::

   Since the discovery of a NUMA topology is made by virt drivers, the
   population of those nested Resource Providers necessarily has to be done
   by each virt driver. Consequently, while the above configuration option
   is said to be generic, the use of this option for populating the
   Resource Providers tree will only be done by the virt drivers. Of
   course, a shared module could be imagined for the sake of consistency
   between drivers, but this is an implementation detail.
The very simple case: I don't care about a NUMA-aware instance
--------------------------------------------------------------
For flavors just asking for, say, vCPUs and memory without asking for them
to be NUMA-aware, we will make a single Placement call asking to *not* land
them on a NUMA-aware host::

    resources=VCPU:<X>,MEMORY_MB:<Y>
    &required=!HW_NUMA_ROOT
In this case, even if NUMA-aware hosts have enough resources for this
request, the Placement API won't return them, only non-NUMA-aware ones
(given the forbidden ``HW_NUMA_ROOT`` trait).
We're giving operators the possibility to shard their clouds between
NUMA-aware and non-NUMA-aware hosts, but that's not really a change from
the current behaviour, where operators create aggregates to make sure
non-NUMA-aware instances can't land on NUMA-aware hosts.

See the `Upgrade impact`_ section for rolling upgrade situations where
clouds are partially upgraded to Ussuri and only a very few nodes are
reshaped.
Asking for NUMA-aware vCPUs
---------------------------
As NUMA-aware hosts have a specific topology, with memory being in a
grand-child RP, we basically need to ensure we can translate the existing
expressiveness of the flavor extra specs into a Placement allocation
candidates query that ties together the NUMA RP containing the ``VCPU``
resources and the memory pagesize RP containing the ``MEMORY_MB``
resources.

Accordingly, here are some examples:
* for a flavor of 8 VCPUs, 8GB of RAM and ``hw:numa_nodes=2``::

    resources_MEM1=MEMORY_MB:4096
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:4
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    &resources_MEM2=MEMORY_MB:4096
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC2=VCPU:4
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_MEM2,_PROC2,_NUMA2
    &group_policy=none
  .. note::

     We use ``none`` as the value for ``group_policy``, which means that in
     this example, all allocation candidates could come from the ``PROC1``
     group, defeating the purpose of having the resources separated into
     different NUMA nodes (which is the point of ``hw:numa_nodes=2``). This
     is OK as we will also modify the ``NUMATopologyFilter`` to only accept
     allocation candidates for a host that are in different NUMA nodes.
     It will probably be implemented in the ``nova.virt.hardware`` module,
     but that's an implementation detail.
* for a flavor of 8 VCPUs, 8GB of RAM and ``hw:numa_nodes=1``::

    resources_MEM1=MEMORY_MB:8192
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:8
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
* for a flavor of 8 VCPUs, 8GB of RAM and
  ``hw:numa_nodes=2&hw:numa_cpus.0=0,1&hw:numa_cpus.1=2,3,4,5,6,7``::

    resources_MEM1=MEMORY_MB:4096
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:2
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    &resources_MEM2=MEMORY_MB:4096
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC2=VCPU:6
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_MEM2,_PROC2,_NUMA2
    &group_policy=none
* for a flavor of 8 VCPUs, 8GB of RAM and
  ``hw:numa_nodes=2&hw:numa_cpus.0=0,1&hw:numa_mem.0=1024
  &hw:numa_cpus.1=2,3,4,5,6,7&hw:numa_mem.1=7168``::

    resources_MEM1=MEMORY_MB:1024
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:2
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    &resources_MEM2=MEMORY_MB:7168
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC2=VCPU:6
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_MEM2,_PROC2,_NUMA2
    &group_policy=none
As you can see, the ``VCPU`` and ``MEMORY_MB`` values result from dividing,
respectively, the flavor's vCPUs and memory by the value of
``hw:numa_nodes`` (a split which is actually already calculated and
provided as NUMATopology object information in the RequestSpec object).
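A minimal sketch of that translation, assuming the per-NUMA-node splits have
already been computed from the flavor (the function name is ours):

.. code:: python

    def numa_request_groups(vcpus_per_node, mem_mb_per_node):
        """Build allocation candidates query parameters: one numbered
        MEM/PROC group pair per guest NUMA node, tied by same_subtree.
        """
        params = []
        for i, (vcpus, mem_mb) in enumerate(
                zip(vcpus_per_node, mem_mb_per_node), start=1):
            params += [
                ('resources_MEM%d' % i, 'MEMORY_MB:%d' % mem_mb),
                ('required_MEM%d' % i, 'MEMORY_PAGE_SIZE_SMALL'),
                ('resources_PROC%d' % i, 'VCPU:%d' % vcpus),
                ('required_NUMA%d' % i, 'HW_NUMA_ROOT'),
                ('same_subtree', '_MEM%d,_PROC%d,_NUMA%d' % (i, i, i)),
            ]
        if len(vcpus_per_node) > 1:
            params.append(('group_policy', 'none'))
        return params

For instance, ``numa_request_groups([4, 4], [4096, 4096])`` reproduces the
first example above.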
.. note::

   The translation mechanism from a flavor-based request into a Placement
   query will be handled by the scheduler service.
.. note::

   Since memory is provided by a grand-child RP, we always need to ask for
   ``MEMORY_PAGE_SIZE_SMALL``, which is the default.
Asking for specific memory page sizes
-------------------------------------
Operators defining a flavor with 2 vCPUs, 4GB of RAM and
``hw:mem_page_size=2MB,hw:numa_nodes=2`` will see the Placement query
become::
    resources_PROC1=VCPU:1
    &resources_MEM1=MEMORY_MB:2048
    &required_MEM1=CUSTOM_MEMORY_PAGE_SIZE_2048
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_PROC1,_MEM1,_NUMA1
    &resources_PROC2=VCPU:1
    &resources_MEM2=MEMORY_MB:2048
    &required_MEM2=CUSTOM_MEMORY_PAGE_SIZE_2048
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_PROC2,_MEM2,_NUMA2
    &group_policy=none
If you only want large page support without specifying a particular size
(e.g. by specifying ``hw:mem_page_size=large`` instead of, say, ``2MB``),
then the same request would translate into::
    resources_PROC1=VCPU:1
    &resources_MEM1=MEMORY_MB:2048
    &required_MEM1=MEMORY_PAGE_SIZE_LARGE
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_PROC1,_MEM1,_NUMA1
    &resources_PROC2=VCPU:1
    &resources_MEM2=MEMORY_MB:2048
    &required_MEM2=MEMORY_PAGE_SIZE_LARGE
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_PROC2,_MEM2,_NUMA2
    &group_policy=none
Asking the same with ``hw:mem_page_size=small`` would translate into::
    resources_PROC1=VCPU:1
    &resources_MEM1=MEMORY_MB:2048
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_PROC1,_MEM1,_NUMA1
    &resources_PROC2=VCPU:1
    &resources_MEM2=MEMORY_MB:2048
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_PROC2,_MEM2,_NUMA2
    &group_policy=none
And eventually, asking with ``hw:mem_page_size=any`` would mean::
    resources_PROC1=VCPU:1
    &resources_MEM1=MEMORY_MB:2048
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_PROC1,_MEM1,_NUMA1
    &resources_PROC2=VCPU:1
    &resources_MEM2=MEMORY_MB:2048
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_PROC2,_MEM2,_NUMA2
    &group_policy=none
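To summarize the translations above, a sketch of how the scheduler could map
a flavor's ``hw:mem_page_size`` value onto the required traits of the
``MEM<N>`` groups (the helper parsing explicit sizes into KB is
hypothetical):

.. code:: python

    def page_size_required_traits(mem_page_size):
        """Return the traits to put in required_MEM<N> for a flavor.

        An empty list means no page size constraint (the ``any`` case).
        """
        if mem_page_size is None:
            # Memory lives in a grand-child RP, so the default small page
            # size must still be requested explicitly.
            return ['MEMORY_PAGE_SIZE_SMALL']
        if mem_page_size == 'any':
            return []
        if mem_page_size == 'small':
            return ['MEMORY_PAGE_SIZE_SMALL']
        if mem_page_size == 'large':
            return ['MEMORY_PAGE_SIZE_LARGE']
        # Explicit size, e.g. '2MB' -> 2048KB
        size_kb = _parse_page_size_kb(mem_page_size)  # hypothetical helper
        return ['CUSTOM_MEMORY_PAGE_SIZE_%d' % size_kb]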
.. note::

   As we said for vCPUs, given we query with ``group_policy=none``,
   allocation candidates could be within the same NUMA node, but that's
   fine since, as also said above, the scheduler filter would then reject
   them if ``hw:numa_nodes=X`` is requested.
The fallback case for NUMA-aware flavors
----------------------------------------
In the `Optionally configured NUMA resources`_ section, we said that we
want NUMA-aware flavors to also land on hosts that have the
``enable_numa_reporting_to_placement`` option set to ``None``. Since we
can't yet build an ``OR`` query for allocation candidates, we propose to
make another call to Placement.
In this specific call (we name it the fallback call), we want to get all
non-reshaped nodes that are *not* explicitly said to not support NUMA.
The request is fairly trivial since we decorated the latter with the
``HW_NON_NUMA`` trait::

    resources=VCPU:<X>,MEMORY_MB:<Y>
    &required=!HW_NON_NUMA,!HW_NUMA_ROOT
Then we would get all compute nodes that have the ``None`` value (including
nodes that are still running the Train release in a rolling upgrade
fashion).

Of course, we could get nodes that can potentially *not* accept the
NUMA-aware flavor, but we rely on the ``NUMATopologyFilter`` to not select
them, exactly like what we do in Train.
There is an open question about whether we should make the fallback call
only if the NUMA-specific call returns no candidates, or whether we should
issue the two calls either way and merge the results. The former is better
for performance reasons, since we avoid a potentially unnecessary call, but
it could generate some spread/pack affinity issues. We all agree we can
leave the question unresolved for now and defer the resolution to the
implementation phase.
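Whichever way that question is resolved, the "only if needed" variant could
look like this sketch (the client method name stands in for whatever the
scheduler report client exposes):

.. code:: python

    def get_numa_candidates(placement, numa_query, fallback_query):
        """Sketch of the fallback flow for NUMA-aware flavors."""
        # First, ask for candidates from reshaped, NUMA-aware hosts.
        candidates = placement.get_allocation_candidates(numa_query)
        if candidates:
            return candidates
        # Fall back to unreshaped hosts that did not explicitly opt out
        # (required=!HW_NON_NUMA,!HW_NUMA_ROOT); the NUMATopologyFilter
        # remains responsible for rejecting hosts that cannot actually
        # fit the guest topology, exactly as in Train.
        return placement.get_allocation_candidates(fallback_query)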
Alternatives
------------
Modeling of NUMA resources could be done by using specific NUMA resource
classes, like ``NUMA_VCPU`` or ``NUMA_MEMORY_MB``, that would only be set
on child NUMA resource providers, while the ``VCPU`` and ``MEMORY_MB``
resource classes would only be set on the root Resource Provider (here the
compute node).
If the Placement allocation candidates API were also able to provide a way
to say 'you can split the resources between resource providers', we
wouldn't need to carry a specific configuration option for a long time. All
hosts would then be reshaped to be NUMA-aware, but non-NUMA-aware instances
could potentially land on those hosts. That wouldn't change the fact that,
for optimal capacity, operators need to shard their clouds between NUMA
workloads and non-NUMA ones, but from a Placement perspective all hosts
would be equal. This alternative has largely been discussed already in the
`Placement can_split`_ spec, but the consensus was that it was very
difficult to implement and potentially not worth the difficulty.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None; flavors won't need to be modified since we will provide a translation
mechanism. That said, we will explicitly explain in the documentation that
we won't support any Placement-like extra specs in flavors.
Performance Impact
------------------
A reshape is only performed when the configuration option is changed to
``True``.
Other deployer impact
---------------------
Operators may want to migrate some instances between hosts before
explicitly enabling or disabling NUMA awareness on their nodes, since they
will have to consider capacity usage accordingly as they shard their cloud.
This said, this would only be necessary for clouds that weren't already
dividing NUMA-aware and non-NUMA-aware workloads between hosts through
aggregates.
Developer impact
----------------
None, except virt driver maintainers.
Upgrade impact
--------------
As described above, in order to avoid flavor updates during upgrade, we
will provide a translation mechanism that takes the existing flavor extra
spec properties and transforms them into a Placement numbered-groups query.

Since there will be a configuration option telling whether a host becomes
NUMA-aware, the corresponding allocations have to change accordingly; virt
drivers are therefore responsible for providing a reshape mechanism that
will eventually call the `Placement API /reshaper endpoint`_ when the
compute service starts. This reshape implementation will absolutely need to
consider the Fast Forward Upgrade (FFU) strategy, where the whole control
plane is down, and should document any extra step required for FFU, with an
eventual removal in a couple of releases once deployers no longer need this
support.
Last but not least, we will provide a transition period (at least during
the Ussuri timeframe) during which operators can decide which hosts to
dedicate to NUMA-aware workloads. A specific ``nova-status pre-upgrade
check`` command will warn them to do so before upgrading to Victoria.
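A sketch of what that check could look like, built on the
``oslo.upgradecheck`` primitives ``nova-status`` already uses; how undecided
hosts are counted is left as an assumption:

.. code:: python

    from oslo_upgradecheck import upgradecheck


    def _check_numa_reporting_decided(self):
        """Sketch: warn when hosts have not set the new NUMA option."""
        undecided = self._count_hosts_with_unset_numa_option()  # hypothetical
        if undecided:
            return upgradecheck.Result(
                upgradecheck.Code.WARNING,
                '%d compute host(s) have not set [compute]'
                'enable_numa_reporting_to_placement; decide whether they '
                'should be NUMA-aware before upgrading to Victoria.'
                % undecided)
        return upgradecheck.Result(upgradecheck.Code.SUCCESS)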
Implementation
==============
Assignee(s)
-----------
* sean-k-mooney
* bauzas
Feature Liaison
---------------
bauzas
Work Items
----------
* libvirt driver passing NUMA topology through ``update_provider_tree()`` API
* Hyper-V driver passing NUMA topology through ``update_provider_tree()`` API
* Possible work on the NUMATopologyFilter to look at the candidates
* Scheduler translating flavor extra specs for NUMA properties into Placement
queries
* ``nova-status pre-upgrade check`` command
Dependencies
============
None.
Testing
=======
Functional tests and unit tests.
Documentation Impact
====================
None.
References
==========
* _`Nested Resource Providers`: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/nested-resource-providers.html
* _`choosing a specific CPU pin within a NUMA node for a vCPU`: https://docs.openstack.org/nova/latest/admin/cpu-topologies.html#customizing-instance-cpu-pinning-policies
* _`NUMA possible extra specs`: https://docs.openstack.org/nova/latest/admin/flavors.html#extra-specs-numa-topology
* _`Huge pages`: https://docs.openstack.org/nova/latest/admin/huge-pages.html
* _`Placement API /reshaper endpoint`: https://developer.openstack.org/api-ref/placement/?expanded=id84-detail#reshaper
* _`Placement can_split`: https://review.opendev.org/#/c/658510/
* _`physical CPU resources`: https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Ussuri
     - Introduced
   * - Victoria
     - Re-proposed