Re-propose nested resource providers spec

The only change from the Pike version is that I've added wording that the
call to `GET /allocation_candidates` will also need to be modified to include
the root filter. I've also updated the History section.

Previously-approved: Ocata
Previously-approved: Pike
Blueprint: nested-resource-providers
Change-Id: I6b5531a0fe8c1aa5056d25ae47901b05e241eb9e
specs/queens/approved/nested-resource-providers.rst
@@ -0,0 +1,281 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=========================
Nested Resource Providers
=========================

https://blueprints.launchpad.net/nova/+spec/nested-resource-providers

We propose changing the database schema, object model and REST API of resource
providers to allow a hierarchical relationship among different resource
providers to be represented.

Problem description
===================

With the addition of the new placement API, we now have a new way to account
for quantitative resources in the system. Resource providers contain
inventories of various resource classes. These inventories are simple integer
amounts and, along with the concept of allocation records, are designed to
answer the questions:

* "how many of a type of resource does this provider have available?"
* "how much of a type of resource is being consumed in the system?"
* "what level of over-commit does each provider expose for each type of
  resource?"

In the initial version of the resource provider schema in the placement API, we
stuck with a simple world-view that resource providers could be related to each
other only via an aggregate relationship. In other words, a resource provider
"X" may provide shared resources to a set of other resource providers "S" if
and only if "X" was associated with an aggregate "A" that all members of "S"
were also associated with.

This relationship works perfectly fine for things like shared storage or IP
pools. However, certain classes of resource require a parent->child
relationship rather than the many-to-many relationship that the aggregate
association offers. Two examples of where a parent->child relationship is more
appropriate are when handling VCPU/MEMORY_MB resources on NUMA nodes on a
compute host and when handling SRIOV_NET_VF resources for NICs on a compute
host.

In the case of NUMA nodes, the system must be able to track how many VCPU and
MEMORY_MB have been allocated from each individual NUMA node on the host.
Allocating memory to a guest and having that memory span address space across
two banks of DIMMs attached to different NUMA nodes results in sub-optimal
performance, and for certain high-performance guest workloads this penalty is
not acceptable.

Another example is the SRIOV_NET_VF resource class, which is provided by
SRIOV-enabled network interface cards. In the case of multiple SRIOV-enabled
NICs on a compute host, different qualitative traits may be tagged to each NIC.
For example, the NIC called enp2s0 might have a trait "CUSTOM_PHYSNET_PUBLIC"
indicating that the NIC is attached to a physical network called "public". The
NIC enp2s1 might have a trait "CUSTOM_PHYSNET_INTRANET" that indicates the NIC
is attached to the physical network called "Intranet". We need a way of
representing that these NICs each provide SRIOV_NET_VF resources but those
virtual functions are associated with different physical networks. In the
resource providers data modeling, the entity which is associated with
qualitative traits is the **resource provider** object. Therefore, we require a
way of representing that the SRIOV-enabled NICs are themselves resource
providers with inventories of SRIOV_NET_VF resources. Those resource providers
are contained on a compute host which is a resource provider that has inventory
records for *other* types of resources such as VCPU, MEMORY_MB or DISK_GB.
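
The modeling described above can be sketched in a few lines of Python. This is
only an illustration: the `Provider` class, the `physnet_trait` and
`providers_with` helpers, and the inventory amounts are all made up for this
example and are not part of the placement API.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical stand-in for a resource provider record."""
    name: str
    traits: set = field(default_factory=set)
    inventory: dict = field(default_factory=dict)  # resource class -> total


def physnet_trait(physnet):
    """Derive a custom trait name from a physical network name (assumed scheme)."""
    return "CUSTOM_PHYSNET_" + physnet.upper()


# The compute host keeps the "other" resource classes (amounts are invented).
compute = Provider("compute1", inventory={"VCPU": 16, "MEMORY_MB": 65536})

# Each SRIOV-enabled NIC is its own provider with its own traits and VFs.
nics = [
    Provider("enp2s0", {physnet_trait("public")}, {"SRIOV_NET_VF": 8}),
    Provider("enp2s1", {physnet_trait("Intranet")}, {"SRIOV_NET_VF": 8}),
]


def providers_with(providers, resource_class, trait):
    """Return names of providers that hold the resource class and the trait."""
    return [p.name for p in providers
            if resource_class in p.inventory and trait in p.traits]


public_nics = providers_with(nics, "SRIOV_NET_VF", "CUSTOM_PHYSNET_PUBLIC")
```

Because the trait lives on the NIC provider rather than on the compute host,
the same SRIOV_NET_VF resource class can be used for both NICs while still
telling the physical networks apart.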

This spec proposes that nested resource providers be created to allow for
distinguishing details of complex components of some resource providers. During
review the question came up about "rolling up" amounts of these nested
providers to the root level. Imagine this scenario: I have a NIC with two PFs,
each of which has only 1 VF available, and I get a request for 2 VFs without
any traits to distinguish them. Since there is no single resource provider that
can satisfy this request, it will not select this root provider, even though
the root provider "owns" 2 VFs. This spec does not propose any sort of "rolling
up" of inventory, but this may be something to consider in the future. If it is
an idea that has support, another BP/spec can be created then to add this
behavior.
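
The two-PF scenario can be worked through with a small sketch. The provider
names and both helper functions are hypothetical and only mirror the reasoning
in the paragraph above; they do not describe how placement is implemented.

```python
# Free SRIOV_NET_VF count per physical function; names are illustrative only.
free_vfs = {"pf0": 1, "pf1": 1}


def single_provider_candidates(free_by_provider, requested):
    """Providers that can satisfy the whole request on their own."""
    return [name for name, free in free_by_provider.items()
            if free >= requested]


def rolled_up_total(free_by_provider):
    """What a hypothetical roll-up to the root provider would report."""
    return sum(free_by_provider.values())


one_vf = single_provider_candidates(free_vfs, 1)   # either PF can serve this
two_vfs = single_provider_candidates(free_vfs, 2)  # no single PF can
total = rolled_up_total(free_vfs)                  # the root "owns" 2 VFs
```

The request for 2 VFs finds no candidate even though the rolled-up total is 2,
which is exactly the gap that a future roll-up proposal would have to address.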

Use Cases
---------

As an NFV cloud operator, I wish to specify that my VNF workload needs an SRIOV
virtual function on a NIC that is tagged to the physical network "public" and I
want to be able to view the resource consumption of SRIOV virtual functions on
a per-physical-network basis.

As an NFV cloud operator, I wish to ensure that the memory and vCPU assigned to
my workload is local to a particular NUMA topology and that those resources are
represented in unique inventories per NUMA node and reported as separate
allocations.

Proposed change
===============

We will add two new attributes to the resource provider data model:

* `parent_provider_uuid`: Indicates the UUID of the immediate parent provider.
  This will be None for the vast majority of providers, and for nested resource
  providers, this will most likely be the compute host's UUID. To be clear,
  a resource provider can have 0 or 1 parents. We will not support multiple
  parents for a resource provider.
* `root_provider_uuid`: Indicates the UUID of the resource provider that is at
  the "root" of the tree of providers. This field allows us to implement
  efficient tree-access queries and avoid use of recursive queries to follow
  child->parent relations.
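
To make the role of the two attributes concrete, here is a minimal in-memory
sketch. The `ResourceProvider` dataclass, the `new_provider` and
`root_by_walking` helpers, and the third-level "leaf" provider are assumptions
made for illustration; only the two attribute names come from this spec. The
sketch treats a provider without a parent as its own root, which matches the
data migration described under "Data model impact".

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceProvider:
    name: str
    uuid: str
    parent_provider_uuid: Optional[str] = None
    root_provider_uuid: Optional[str] = None


def new_provider(name, parent=None):
    """Create a provider; a provider without a parent is its own root."""
    rp_uuid = str(uuid.uuid4())
    if parent is None:
        return ResourceProvider(name, rp_uuid, None, rp_uuid)
    # A child has exactly one parent and inherits that parent's root.
    return ResourceProvider(
        name, rp_uuid, parent.uuid, parent.root_provider_uuid)


def root_by_walking(provider, by_uuid):
    """Find the root by following child->parent links one hop at a time."""
    while provider.parent_provider_uuid is not None:
        provider = by_uuid[provider.parent_provider_uuid]
    return provider.uuid


compute = new_provider("compute1")
pf0 = new_provider("pf0", parent=compute)
leaf = new_provider("leaf", parent=pf0)  # purely illustrative third level
by_uuid = {p.uuid: p for p in (compute, pf0, leaf)}

walked = root_by_walking(leaf, by_uuid)  # one lookup per level of the tree
stored = leaf.root_provider_uuid         # a single attribute read
```

Both approaches arrive at the same UUID, but the stored attribute does not
depend on the depth of the tree, which is the motivation given above for
carrying `root_provider_uuid` on every provider.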

A new microversion will be added to the placement REST API that adds the above
attributes to the appropriate request and response payloads.

The scheduler reporting client shall be modified to track NUMA nodes and
SRIOV-enabled NICs as child resource providers to a parent compute host
resource provider.

The `VCPU` and `MEMORY_MB` resource classes will continue to be inventoried on
the parent resource provider (i.e. the compute node resource provider) and not
the NUMA node child providers. The NUMA node child providers will have
inventory records populated for the `NUMA_CORE`, `NUMA_THREAD` and
`NUMA_MEMORY_MB` resource classes. When a boot request is received, the Nova
API service will need to determine whether the request (flavor and image)
specifies a particular NUMA topology and, if so, construct the request to the
placement service for the appropriate `NUMA_XXX` resources. This is currently
out of scope for this spec. This spec is only about the inventorying of the
various child providers with appropriate resource classes.

On the CPU-pinning side of the equation, we do not plan to allow a compute node
to serve as *both* a general-purpose compute node *and* as a target for
NUMA-specific (pinned) workloads. A compute node will be either a target for
pinned workloads or it will be a target for generic (floating CPU) workloads.
It is not yet clear what we will use to indicate that a compute node targets
floating workloads or not. Initial thoughts were to use the
pci_passthrough_whitelist CONF option to determine this; however, this still
needs to be debated.

This spec will simply ensure that if a virt driver returns a NUMATopology
object in the result of its get_available_resource() call, then we will create
child resource providers representing those NUMA nodes. Similarly, if the PCI
device manager returns a set of SR-IOV physical functions on the compute host,
we will create child resource provider records for those SR-IOV PFs.
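
The inventorying step described above might look roughly like the following.
This is a sketch under stated assumptions: plain dicts stand in for the
NUMATopology object and for the PCI device manager's output, and the dict keys,
the `child_provider_records` function, the naming scheme and all amounts are
invented for the example rather than taken from Nova.

```python
def child_provider_records(compute_name, numa_cells, sriov_pfs):
    """Build child provider records for one compute host.

    numa_cells: dicts with hypothetical keys "id", "cores", "threads"
        and "memory_mb".
    sriov_pfs: dicts with hypothetical keys "name" and "total_vfs".
    """
    records = []
    for cell in numa_cells:
        # NUMA children carry the NUMA_* classes; VCPU and MEMORY_MB
        # stay on the parent compute node provider.
        records.append({
            "name": f"{compute_name}_numa{cell['id']}",
            "parent": compute_name,
            "inventory": {
                "NUMA_CORE": cell["cores"],
                "NUMA_THREAD": cell["threads"],
                "NUMA_MEMORY_MB": cell["memory_mb"],
            },
        })
    for pf in sriov_pfs:
        records.append({
            "name": f"{compute_name}_{pf['name']}",
            "parent": compute_name,
            "inventory": {"SRIOV_NET_VF": pf["total_vfs"]},
        })
    return records


records = child_provider_records(
    "compute1",
    numa_cells=[{"id": 0, "cores": 8, "threads": 16, "memory_mb": 32768}],
    sriov_pfs=[{"name": "enp2s0", "total_vfs": 8}],
)
```

A host that reports neither NUMA cells nor PFs yields an empty list, so it
continues to be represented by a single compute node provider.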

Alternatives
------------

We could try hiding the `root_provider_uuid` attribute from the GET
/resource-provider[s] REST API response payload to reduce complexity of the
API. We will still, however, need a REST API call that "gets all resource
providers in a tree" where the user would pass a UUID and we'd look up all
resource providers having that UUID as their root provider UUID.

Instead of having a concept of nested resource providers, we could force
deployers to create custom resource classes for every permutation of physical
network trait. For instance, assuming the example above, the operator would
need to create an SRIOV_NET_VF_PUBLIC_NET and an SRIOV_NET_VF_INTRANET_NET
custom resource class and then manually set the inventory of the compute node
resource provider to the number of VFs each PF exposes. The problem with this
approach is two-fold. First, we no longer have any standardization on the
SRIOV_NET_VF resource class. Second, we are coupling the qualitative and
quantitative aspects of a provider together again, which is part of the problem
with the existing Nova codebase and why it has been hard to standardize the
tracking and scheduling of resources in the first place.

Data model impact
-----------------

Two new fields will be added to the `resource_providers` DB table:

* `root_provider_uuid`: This will be populated using an online data migration
  that sets `root_provider_uuid` to the value of the `resource_providers.uuid`
  field for all existing resource providers.
* `parent_provider_uuid`: This will be a NULLable field and default to NULL
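
The backfill described above can be sketched with an in-memory SQLite
database. This is not the actual migration: the table is reduced to the columns
needed for the example, the `migrate_root_provider_uuid` function name and the
sample UUID strings are invented, and processing rows in batches is an
assumption about how an online migration would typically be run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified table: only the columns relevant to this example.
conn.execute(
    "CREATE TABLE resource_providers ("
    " id INTEGER PRIMARY KEY,"
    " uuid TEXT NOT NULL,"
    " root_provider_uuid TEXT,"
    " parent_provider_uuid TEXT DEFAULT NULL)"
)
conn.executemany(
    "INSERT INTO resource_providers (uuid) VALUES (?)",
    [("uuid-a",), ("uuid-b",), ("uuid-c",)],
)


def migrate_root_provider_uuid(conn, max_count):
    """Backfill root_provider_uuid for at most max_count rows.

    Returns the number of rows updated so a caller can loop until it is 0.
    """
    cur = conn.execute(
        "UPDATE resource_providers SET root_provider_uuid = uuid "
        "WHERE id IN (SELECT id FROM resource_providers "
        "WHERE root_provider_uuid IS NULL LIMIT ?)",
        (max_count,),
    )
    conn.commit()
    return cur.rowcount


batches = []
while True:
    done = migrate_root_provider_uuid(conn, 2)
    if done == 0:
        break
    batches.append(done)

rows = conn.execute(
    "SELECT uuid, root_provider_uuid, parent_provider_uuid "
    "FROM resource_providers ORDER BY id"
).fetchall()
```

After the loop finishes, every pre-existing provider is its own root and has no
parent, which is the starting state the rest of this spec assumes.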

REST API impact
---------------

`root_provider_uuid` and `parent_provider_uuid` fields will be added to the
corresponding request and response payloads of appropriate placement REST APIs.

The `GET /resource_providers` call will get a new filter on `root={uuid}` that,
when present, will return all resource provider records, inclusive of the root,
having a `root_provider_uuid` equal to `{uuid}`.
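
As a sketch of how the filter is meant to behave, the example below builds the
request path and applies the same equality test to a list of records. The
`list_providers_url` and `filter_by_root` helpers, the short placeholder UUIDs
and the record layout are assumptions for illustration; the real response
payload is defined by the placement API microversion, not by this example.

```python
from urllib.parse import urlencode


def list_providers_url(root_uuid=None):
    """Build the request path, adding the root filter only when given."""
    path = "/resource_providers"
    if root_uuid is None:
        return path
    return path + "?" + urlencode({"root": root_uuid})


def filter_by_root(providers, root_uuid):
    """Keep the records whose root_provider_uuid matches, root included."""
    return [p for p in providers if p["root_provider_uuid"] == root_uuid]


# Placeholder records; "u1" and "u3" stand in for real UUIDs.
providers = [
    {"name": "compute1", "uuid": "u1", "root_provider_uuid": "u1"},
    {"name": "compute1_pf0", "uuid": "u2", "root_provider_uuid": "u1"},
    {"name": "compute2", "uuid": "u3", "root_provider_uuid": "u3"},
]

url = list_providers_url("u1")
matched = [p["name"] for p in filter_by_root(providers, "u1")]
```

The root itself appears in the result because its own `root_provider_uuid`
equals its UUID, which is what "inclusive of the root" relies on.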

The filter parameter `root={uuid}` will *not* be added to
`GET /allocation_candidates`, as that call serves a specific use case for the
Nova scheduler, which has no need for the filter.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

None.

Other deployer impact
---------------------

None. The setting and getting of provider tree information will be entirely
handled in the `nova-compute` worker with no changes needed by the deployer.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  jaypipes

Other contributors:
  cdent

Work Items
----------

* Add DB schema and object model changes
* Add REST API microversion adding new attributes for resource providers and
  allocation candidates
* Add REST API microversion adding new `root={uuid}` filter on `GET
  /resource_providers`
* Add code in scheduler reporting client to track NUMA nodes as child resource
  providers on the parent compute host resource provider
* Add code in scheduler reporting client to track SRIOV PFs as child resource
  providers on the parent compute host resource provider

Please note that not all of this spec is expected to be implemented in a single
release cycle. At the Queens PTG we agreed that fully supporting NUMA will
probably have to be deferred to the next release.

Dependencies
============

None.

Testing
=======

Most of the focus will be on functional tests for the DB/server and the REST
API with new functional tests added for the specific NUMA and SRIOV PF child
provider scenarios described in this spec.

Documentation Impact
====================

Some devref content should be written.

References
==========

http://etherpad.openstack.org/p/nested-resource-providers

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Ocata
     - Introduced
   * - Pike
     - Re-proposed
   * - Queens
     - Re-proposed