Re-propose nested resource providers spec

The only change from the Pike version is that I've added wording that the call to `GET /allocation_candidates` will also need to be modified to include the root filter. I've also updated the History section. Previously-approved: Ocata Previously-approved: Pike Blueprint: nested-resource-providers Change-Id: I6b5531a0fe8c1aa5056d25ae47901b05e241eb9e
2017-09-19 12:44:56 +00:00
parent f49397be06
commit 5648432cda
1 changed files with 281 additions and 0 deletions
--- a/specs/queens/approved/nested-resource-providers.rst
+++ b/specs/queens/approved/nested-resource-providers.rst
@@ -0,0 +1,281 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=========================
+Nested Resource Providers
+=========================
+
+https://blueprints.launchpad.net/nova/+spec/nested-resource-providers
+
+We propose changing the database schema, object model and REST API of resource
+providers to allow a hierarchical relationship among different resource
+providers to be represented.
+
+Problem description
+===================
+
+With the addition of the new placement API, we now have a new way to account
+for quantitative resources in the system. Resource providers contain
+inventories of various resource classes. These inventories are simple integer
+amounts and, along with the concept of allocation records, are designed to
+answer the questions:
+
+* "how many of a type of resource does this provider have available?"
+* "how much of a type of resource is being consumed in the system?"
+* "what level of over-commit does each provider expose for each type of
+  resource?"
+
+In the initial version of the resource provider schema in the placement API, we
+stuck with a simple world-view that resource providers could be related to each
+other only via an aggregate relationship. In other words, a resource provider
+"X" may provide shared resources to a set of other resource providers "S" if
+and only if "X" was associated with an aggregate "A" that all members of "S"
+were also associated with.
+
+This relationship works perfectly fine for things like shared storage or IP
+pools. However, certain classes of resource require a more parent->child
+relationship than a many-to-many relationship that the aggregate association
+offers. Two examples of where a parent->child relationship is more appropriate
+are when handling VCPU/MEMORY_MB resources on NUMA nodes on a compute host and
+when handling SRIOV_NET_VF resources for NICs on a compute host.
+
+In the case of NUMA nodes, the system must be able to track how many VCPU and
+MEMORY_MB have been allocated from each individual NUMA node on the host.
+Allocating memory to a guest and having that memory span address space across
+two banks of DIMMs attached to different NUMA nodes results in sub-optimal
+performance, and for certain high-performance guest workloads this penalty is
+not acceptable.
+
+Another example is the SRIOV_NET_VF resource class, which is provided by
+SRIOV-enabled network interface cards. In the case of multiple SRIOV-enabled
+NICs on a compute host, different qualitative traits may be tagged to each NIC.
+For example, the NIC called enp2s0 might have a trait "CUSTOM_PHYSNET_PUBLIC"
+indicating that the NIC is attached to a physical network called "public". The
+NIC enp2s1 might have a trait "CUSTOM_PHYSNET_PRIVATE" that indicates the NIC
+is attached to the physical network called "Intranet". We need a way of
+representing that these NICs each provide SRIOV_NET_VF resources but those
+virtual functions are associated with different physical networks. In the
+resource providers data modeling, the entity which is associated with
+qualitative traits is the **resource provider** object. Therefore, we require a
+way of representing that the SRIOV-enabled NICs are themselves resource
+providers with inventories of SRIOV_NET_VF resources. Those resource providers
+are contained on a compute host which is a resource provider that has inventory
+records for *other* types of resources such as VCPU, MEMORY_MB or DISK_GB.
+
+This spec proposes that nested resource providers be created to allow for
+distinguishing details of complex components of some resource providers. During
+review the question came up about "rolling up" amounts of these nested
+providers to the root level. Imagine this scenario: I have a NIC with two PFs,
+each of which has only 1 VF available, and I get a request for 2 VFs without
+any traits to distinguish them. Since there is no single resource provider that
+can satisfy this request, it will not select this root provider, even though
+the root provider "owns" 2 VFs. This spec does not propose any sort of "rolling
+up" of inventory, but this may be something to consider in the future. If it is
+an idea that has support, another BP/spec can be created then to add this
+behavior.
+
+Use Cases
+---------
+
+As an NFV cloud operator, I wish to request that my VNF workload needs an SRIOV
+virtual function on a NIC that is tagged to the physical network "public" and I
+want to be able to view the resource consumption of SRIOV virtual functions on
+a per-physical-network basis.
+
+As an NFV cloud operator, I wish to ensure that the memory and vCPU assigned to
+my workload is local to a particular NUMA topology and that those resources are
+represented in unique inventories per NUMA node and reported as separate
+allocations.
+
+Proposed change
+===============
+
+We will add two new attributes to the resource provider data model:
+
+* `parent_provider_uuid`: Indicates the UUID of the immediate parent provider.
+  This will be None for the vast majority of providers, and for nested resource
+  providers, this will most likely be the compute host's UUID. To be clear,
+  a resource provider can have 0 or 1 parents. We will not support multiple
+  parents for a resource provider.
+* `root_provider_uuid`: Indicates the UUID of the resource provider that is at
+  the "root" of the tree of providers. This field allows us to implement
+  efficient tree-access queries and avoid use of recursive queries to follow
+  child->parent relations.
+
+A new microversion will be added to the placement REST API that adds the above
+attributes to the appropriate request and response payloads.
+
+The scheduler reporting client shall be modified to track NUMA nodes and
+SRIOV-enabled NICs as child resource providers to a parent compute host
+resource provider.
+
+The `VCPU` and `MEMORY_MB` resource classes will continue to be inventoried on
+the parent resource provider (i.e the compute node resource provider) and not
+the NUMA node child providers. The NUMA node child providers will have
+inventory records populated for the `NUMA_CORE`, `NUMA_THREAD` and
+`NUMA_MEMORY_MB` resource classes. When a boot request is received, the Nova
+API service will need to determine whether the request (flavor and image)
+specifies a particular NUMA topology and, if so, construct the request to the
+placement service for the appropriate `NUMA_XXX` resources. This is currently
+out of scope for this spec. This spec is only about the inventorying of the
+various child providers with appropriate resource classes.
+
+On the CPU-pinning side of the equation, we do not plan to allow a compute node
+to serve as *either* a general-purpose compute node *or* as a target for
+NUMA-specific (pinned) workloads. A compute node will be either a target for
+pinned workloads or it will be a target for generic (floating CPU) workloads.
+It is not yet clear what we will use to indicate that a compute node targets
+floating workloads or not. Initial thoughts were to use the
+pci_passthrough_whitelist CONF option to determine this however this still
+needs to be debated.
+
+This spec will simply ensure that if a virt driver returns a NUMATopology
+object in the result of its get_available_resource() call, then we will create
+child resource providers representing those NUMA nodes. Similarly, if the PCI
+device manager returns a set of SR-IOV physical functions on the compute host,
+we will create child resource provider records for those SR-IOV PFs.
+
+Alternatives
+------------
+
+We could try hiding the `root_provider_uuid` attribute from the GET
+/resource-provider[s] REST API response payload to reduce complexity of the
+API. We will still, however, need a REST API call that "gets all resource
+providers in a tree" where the user would pass a UUID and we'd look up all
+resource providers having that UUID as their root provider UUID.
+
+Instead of having a concept of nested resource providers, we could force
+deployers to create custom resource classes for every permutation of physical
+network trait. For instance, assuming the example above, the operator would
+need to create an SRIOV_NET_VF_PUBLIC_NET and a SRIOV_NET_VF_INTRANET_NET
+custom resource class and then manually set the inventory of the compute node
+resource provider to an amount of VFs each PF exposed. The problem with this
+approach is two-fold. First, we no longer have any standardization on the
+SRIOV_NET_VF resource class. Secondly, we are coupling the qualitative and
+quantitative aspects of a provider together again, which is part of the problem
+with the existing Nova codebase and why it has been hard to standardize the
+tracking and scheduling of resources in the first place.
+
+Data model impact
+-----------------
+
+Two new fields will be added to the `resource_providers` DB table:
+
+* `root_provider_uuid`: This will be populated using an online data migration
+  that sets `root_provider_uuid` to the value of the `resource_providers.uuid`
+  field for all existing resource providers.
+* `parent_provider_uuid`: This will be a NULLable field and default to NULL
+
+REST API impact
+---------------
+
+`root_provider_uuid` and `parent_provider_uuid` fields will be added to the
+corresponding request and response payloads of appropriate placement REST APIs.
+
+The `GET /resource_providers` call will get a new filter on `root={uuid}` that,
+when present, will return all resource provider records, inclusive of the root,
+having a `root_provider_uuid` equal to `{uuid}`.
+
+The filter parameter `root={uuid}` will *not* be added to
+`GET /allocation_candidates`, as this call is for a specific use case for the
+Nova scheduler, and there is no use case for it.
+
+Security impact
+---------------
+
+None.
+
+Notifications impact
+--------------------
+
+None.
+
+Other end user impact
+---------------------
+
+None.
+
+Performance Impact
+------------------
+
+None.
+
+Other deployer impact
+---------------------
+
+None. The setting and getting of provider tree information will be entirely
+handled in the `nova-compute` worker with no changes needed by the deployer.
+
+Developer impact
+----------------
+
+None.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  jaypipes
+
+Other contributors:
+  cdent
+
+Work Items
+----------
+
+* Add DB schema and object model changes
+* Add REST API microversion adding new attributes for resource providers and
+  allocation candidates
+* Add REST API microversion adding new `root={uuid}` filter on `GET
+  /resource_providers`
+* Add code in scheduler reporting client to track NUMA nodes as child resource
+  providers on the parent compute host resource provider
+* Add code in scheduler reporting client to track SRIOV PFs as child resource
+  providers on the parent compute host resource provider
+
+Please note that not all of this spec is expected to be implemented in a single
+release cycle. At the Queens PTG we agreed that fully suppporting NUMA will
+probably have to be deferred to the next release.
+
+Dependencies
+============
+
+None.
+
+Testing
+=======
+
+Most of the focus will be on functional tests for the DB/server and the REST
+API with new functional tests added for the specific NUMA and SRIOV PF child
+provider scenarios described in this spec.
+
+Documentation Impact
+====================
+
+Some devref content should be written.
+
+References
+==========
+
+http://etherpad.openstack.org/p/nested-resource-providers
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Ocata
+     - Introduced
+   * - Pike
+     - Re-proposed
+   * - Queens
+     - Re-proposed