CPU resource tracking
https://blueprints.launchpad.net/nova/+spec/cpu-resources
We would like to both simplify the configuration of a compute node with regards to CPU resource inventory as well as make the quantitative tracking of dedicated CPU resources consistent with the tracking of shared CPU resources via the placement API.
Problem description
The way CPU resources are currently tracked in Nova is overly
complex and, due to the coupling of CPU pinning with NUMA-related
concepts inside the InstanceNUMATopology and
NUMATopology (host) objects, difficult to reason about in
terms that are consistent with other classes of resource in nova.
Tracking of dedicated CPU resources is not done using the placement API, so there is no way to view physical processor usage in the system. The CONF options and extra specs / image properties surrounding host CPU inventory and guest CPU pinning are difficult to understand and, despite efforts to document them, only a few individuals know how to "properly" configure a compute node for hosting certain workloads.
We would like to both simplify the configuration of a compute node with regards to CPU resource inventory as well as make the quantitative tracking of dedicated CPU resources consistent with the tracking of shared CPU resources via the placement API.
Definitions
- physical processor: A single logical processor on the host machine that is associated with a physical CPU core or hyperthread
- dedicated CPU: A physical processor that has been marked to be used for a single guest only
- shared CPU: A physical processor that has been marked to be used for multiple guests
- guest CPU: A logical processor configured in a guest
- VCPU: Resource class representing a unit of CPU resources for a single guest approximating the processing power of a single physical processor
- PCPU: Resource class representing an amount of dedicated CPUs for a single guest
- CPU pinning: The process of deciding which guest CPU should be assigned to which dedicated CPU
- pinset: A set of physical processors
- pinset string: A specially-encoded string that indicates a set of specific physical processors
- NUMA-configured host system: A host computer that has multiple physical processors arranged in a non-uniform memory access architecture
- guest virtual NUMA topology: When a guest wants its CPU resources arranged in a specific non-uniform memory architecture layout. A guest's virtual NUMA topology may or may not match an underlying host system's physical NUMA topology
- emulator thread: An operating system thread created by QEMU to perform certain maintenance activities on a guest VM
- I/O thread: An operating system thread created by QEMU to perform disk or network I/O on behalf of a guest VM
- vCPU thread: An operating system thread created by QEMU to execute CPU instructions on behalf of a guest VM
Use Cases
- As an NFV orchestration system, I want to be able to differentiate between CPU resources that require stable performance and CPU resources that can tolerate inconsistent performance
- As an edge cloud deployer, I want to specify which physical processors should be used for dedicated CPU and which should be used for shared CPU
- As a VNF vendor, I wish to specify to the infrastructure whether my VNF can use hyperthread siblings as dedicated CPUs
Proposed change
Add PCPU resource class
In order to track dedicated CPU resources in the placement service, we need a new resource class to differentiate guest CPU resources that are provided by a host CPU that is shared among many guests (or many guest vCPU threads) from guest CPU resources that are provided by a single host CPU.
A new PCPU resource class will be created for this
purpose. It will represent a unit of guest CPU resources that is
provided by a dedicated host CPU. In addition, a new config option,
[compute] cpu_dedicated_set will be added to track the host
CPUs that will be allocated to the PCPU inventory. This
will complement the existing [compute] cpu_shared_set
config option, which will now be used to track the host CPUs that will
be allocated to the VCPU inventory. These sets must be
disjoint. If the two values are not disjoint, the compute service will
fail to start with an error. If they are disjoint, any host CPUs not
included in the combined set will be considered reserved for the host.
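To make the disjointness requirement concrete, the following is a minimal sketch (not nova's actual implementation; the helper names are hypothetical) of how the two pinset strings could be parsed and validated:

```python
# Illustrative sketch only: shows the intended validation, not nova's
# actual implementation. parse_pinset/validate_cpu_sets are hypothetical.

def parse_pinset(pinset_str):
    """Parse a pinset string like '2-17,20,22' into a set of CPU IDs."""
    cpus = set()
    for chunk in pinset_str.split(','):
        chunk = chunk.strip()
        if not chunk:
            continue
        if '-' in chunk:
            start, end = chunk.split('-')
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def validate_cpu_sets(dedicated_str, shared_str, host_cpus):
    dedicated = parse_pinset(dedicated_str)
    shared = parse_pinset(shared_str)
    if dedicated & shared:
        # The two sets must be disjoint; refuse to start otherwise.
        raise ValueError('cpu_dedicated_set and cpu_shared_set overlap: %s'
                         % sorted(dedicated & shared))
    # Anything not listed in either set is left for the host itself.
    reserved = host_cpus - dedicated - shared
    return dedicated, shared, reserved
```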
The Flavor.vcpus field will continue to represent the
combined number of CPUs used by the instance, be they dedicated
(PCPU) or shared (VCPU). In addition, the
cpu_allocation_ratio will apply only to VCPU
resources since overcommit for dedicated resources does not make
sense.
Note
This has significant implications for existing config options like
vcpu_pin_set and [compute] cpu_shared_set.
These are discussed below.
Add HW_CPU_HYPERTHREADING trait
Nova exposes hardware threads as individual "cores", meaning a host with, for example, two Intel Xeon E5-2620 v3 CPUs will report 24 cores - 2 sockets * 6 cores * 2 threads. However, hardware threads aren't real CPUs as they share many components with each other. As a result, processes running on these cores can suffer from contention. This can be problematic for workloads that require no contention (think: real-time workloads).
We support a feature called "CPU thread policies", first added in Mitaka,
which provides a way for users to control how these threads are used by
instances. One of the policies supported by this feature,
isolate, allows users to mark thread sibling(s) for a given
CPU as reserved, avoiding resource contention at the expense of not
being able to use these cores for any other workload. However, on a
typical x86-based platform with hyperthreading enabled, this can result
in an instance consuming 2x more cores than expected, based on the value
of Flavor.vcpus. These untracked allocations cannot be
supported in a placement world as we need to know how many
PCPU resources to request at scheduling time, and we can't
inflate this number (to account for the hyperthread sibling) without
being absolutely sure that every single host has hyperthreading
enabled. As a result, we need to provide another way to track whether
hosts have hyperthreading or not. To this end, we will add the new
HW_CPU_HYPERTHREADING trait, which will be reported for
hosts where hyperthreading is detected.
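As an illustration only (the libvirt driver derives this from its own view of the host topology; the helper below is hypothetical), hyperthreading can be detected on a Linux host by checking whether any CPU reports more than one thread sibling in sysfs:

```python
# Illustrative only: one way to detect SMT/hyperthreading on a Linux host
# by reading the standard sysfs topology files.
import glob


def host_has_hyperthreading():
    for path in glob.glob(
            '/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list'):
        with open(path) as f:
            siblings = f.read().strip()
        # More than one entry (e.g. "0,24" or "0-1") means this CPU has
        # thread siblings, i.e. hyperthreading is enabled.
        if ',' in siblings or '-' in siblings:
            return True
    return False
```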
Note
This has significant implications for the existing CPU thread
policies feature. These are discussed below.
Example host configuration
Consider a compute node with a total of 24 host physical CPU cores with hyperthreading enabled. The operator wishes to reserve 1 physical CPU core and its thread sibling for host processing (not for guest instance use). Furthermore, the operator wishes to use 8 host physical CPU cores and their thread siblings for dedicated guest CPU resources. The remaining 15 host physical CPU cores and their thread siblings will be used for shared guest vCPU usage, with an 8:1 allocation ratio for those physical processors used for shared guest CPU resources.
The operator could configure nova.conf like so:
[DEFAULT]
cpu_allocation_ratio=8.0
[compute]
cpu_dedicated_set=2-17
cpu_shared_set=18-47
The virt driver will construct a provider tree containing a single
resource provider representing the compute node and report inventory of
PCPU and VCPU for this single provider
accordingly:
COMPUTE NODE provider
PCPU:
total: 16
reserved: 0
min_unit: 1
max_unit: 16
step_size: 1
allocation_ratio: 1.0
VCPU:
total: 30
reserved: 0
min_unit: 1
max_unit: 30
step_size: 1
allocation_ratio: 8.0
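To make the arithmetic explicit, here is a small sketch (illustrative only) showing how those inventory totals follow from the configuration above:

```python
# Illustrative only: how the example inventory follows from the config.
dedicated = set(range(2, 18))   # cpu_dedicated_set = 2-17 -> 16 host CPUs
shared = set(range(18, 48))     # cpu_shared_set = 18-47   -> 30 host CPUs

inventory = {
    'PCPU': {'total': len(dedicated), 'allocation_ratio': 1.0},  # 16, no overcommit
    'VCPU': {'total': len(shared), 'allocation_ratio': 8.0},     # 30, 8:1 overcommit
}
```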
Example flavor configurations
Consider the following example flavor/image configurations, in increasing order of complexity.
A simple web application server workload requires a couple of CPU resources. The workload does not require any dedicated CPU resources:
resources:VCPU=2

For example:

$ openstack flavor create --vcpus 2 ... example-1
$ openstack flavor set --property resources:VCPU=2 example-1

Alternatively, you can skip the explicit resource request and this will be provided by default. This is the current behavior:

$ openstack flavor create --vcpus 2 ... example-1

A database server requires 8 CPU resources, and the workload needs dedicated CPU resources to minimize effects of other workloads hosted on the same hardware:

resources:PCPU=8

For example:

$ openstack flavor create --vcpus 8 ... example-2
$ openstack flavor set --property resources:PCPU=8 example-2

Alternatively, you can skip the explicit resource request and use the legacy hw:cpu_policy flavor extra spec instead:

$ openstack flavor create --vcpus 8 ... example-2
$ openstack flavor set --property hw:cpu_policy=dedicated example-2

In this legacy case, hw:cpu_policy acts as an alias for resources=PCPU:${flavor.vcpus}, as discussed later.

A virtual network function running a packet-core processing application requires 8 CPU resources. The VNF specifies that the dedicated CPUs it receives should not be hyperthread siblings (in other words, it wants full cores for its dedicated CPU resources):

resources:PCPU=8
trait:HW_CPU_HYPERTHREADING=forbidden

For example:

$ openstack flavor create --vcpus 8 ... example-3
$ openstack flavor set --property resources:PCPU=8 \
    --property trait:HW_CPU_HYPERTHREADING=forbidden example-3

Alternatively, you can skip the explicit resource and trait requests and use the legacy hw:cpu_policy and hw:cpu_thread_policy flavor extra specs instead:

$ openstack flavor create --vcpus 8 ... example-3
$ openstack flavor set --property hw:cpu_policy=dedicated \
    --property hw:cpu_thread_policy=isolate example-3

In this legacy case, hw:cpu_policy acts as an alias for resources=PCPU:${flavor.vcpus} and hw:cpu_thread_policy acts as an alias for required=!HW_CPU_HYPERTHREADING, as discussed later.

Note

The use of the legacy extra specs won't give the exact same behavior as previously, as hosts that have hyperthreads will be excluded rather than used with their thread siblings isolated. This is unavoidable, as discussed
below.
Note
It will not initially be possible to request both PCPU
and VCPU in the same request. This functionality may be
added later but such requests will be rejected until that happens.
Note
You will note that the resource requests only include the total
amount of PCPU and VCPU resources needed by an
instance. It is entirely up to the nova.virt.hardware
module to pin the guest CPUs to the host CPUs
appropriately, doing things like taking NUMA affinity into account. The
placement service will return those provider trees that match the
required amount of requested PCPU resources. But placement does not do
assignment of specific CPUs, only allocation of CPU resource amounts to
particular providers of those resources.
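As a toy illustration of that division of labour (nova.virt.hardware's real pinning logic is far more involved, handling NUMA affinity and thread policies; the function below is hypothetical), the driver's job reduces to choosing which free dedicated host CPUs back each guest CPU once placement has allocated the amount:

```python
# Toy illustration of the split in responsibilities: placement has already
# allocated an *amount* of PCPU on this host; the virt layer then picks
# *which* dedicated host CPUs back each guest CPU.

def pin_guest_cpus(requested_pcpus, free_dedicated_cpus):
    free = sorted(free_dedicated_cpus)
    if len(free) < requested_pcpus:
        raise RuntimeError('placement allocated PCPUs but too few are free here')
    # Map guest CPU index -> host CPU id.
    return {guest_cpu: free[guest_cpu] for guest_cpu in range(requested_pcpus)}

# e.g. pin_guest_cpus(4, {2, 3, 6, 7, 10}) -> {0: 2, 1: 3, 2: 6, 3: 7}
```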
Alternatives
There's definitely going to be some confusion around
Flavor.vcpus referring to both VCPU and
PCPU resource classes. To avoid this, we could call the
PCPU resource class CPU_DEDICATED to more
explicitly indicate its purpose. However, we will continue to use the
VCPU resource class to represent shared CPU resources and
PCPU seemed a better logical counterpart to the existing
VCPU resource class.
Another option is to call the PCPU resource class
VCPU_DEDICATED. This doubles down on the idea that the term
vCPU refers to an instance's CPUs (as opposed to the host CPUs)
but the name is clunky and it's still somewhat confusing.
Data model impact
The NUMATopology object will need to be updated to
include a new pcpuset field, which complements the existing
cpuset field. In the future, we may wish to rename these to
e.g. cpu_shared_set and cpu_dedicated_set.
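A minimal sketch of the shape of this change, assuming the same oslo.versionedobjects field style used for the existing cpuset field (the class shown, the version bump and the exact object touched are illustrative, not the actual implementation):

```python
# Illustrative sketch only: a new 'pcpuset' field alongside the existing
# 'cpuset' field, following nova's versioned-object conventions.
from nova.objects import base
from nova.objects import fields


@base.NovaObjectRegistry.register
class NUMATopology(base.NovaObject):
    fields = {
        'cpuset': fields.SetOfIntegersField(),   # host CPUs backing VCPU (shared)
        'pcpuset': fields.SetOfIntegersField(),  # host CPUs backing PCPU (dedicated)
        # ... all existing fields remain unchanged ...
    }
```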
REST API impact
None.
Security impact
None.
Notifications impact
None.
Other end user impact
This proposal should actually make the CPU resource tracking easier to reason about and understand for end users by making the inventory of both shared and dedicated CPU resources consistent.
Performance Impact
There should be a positive impact on performance due to the placement
service being able to perform a good portion of the work that the
NUMATopologyFilter currently does. The
NUMATopologyFilter would be trimmed down to only handling
questions about whether a particular thread allocation policy (tolerance
of hyperthreads) could be met by a compute node. The number of
HostInfo objects passed to the
NUMATopologyFilter will have already been reduced to only
those hosts which have the required number of dedicated and shared CPU
resources.
Note that the NUMATopologyFilter will still need to
contain the more esoteric and complex logic surrounding CPU pinning and
understanding NUMA node CPU amounts before compute nodes are given the
ability to represent NUMA nodes as child resource providers in provider
tree.
Other deployer impact
Primarily, the impact on deployers will be documentation-related. Good documentation needs to be provided that, like the above example flavor configurations, shows operators what resources and traits extra specs to configure in order to get a particular behavior and which configuration options have changed.
Developer impact
None.
Upgrade impact
The upgrade impact of this feature is large and, while we will endeavour to minimize impacts to the end user, there will be some disruption. The various impacts are described below. Before reading these, it may be worth reading the existing articles that describe the current behavior of nova in these situations.
A key point here is that the new behavior must be opt-in during
Train. We recognize that operators may need time to upgrade a critical
number of compute nodes so that they are reporting PCPU
classes. This is reflected at numerous points below.
Configuration options
Summary: A user must unset the vcpu_pin_set and reserved_host_cpus config options and set one or both of the existing [compute] cpu_shared_set and new [compute] cpu_dedicated_set options.
We will deprecate the vcpu_pin_set config option in
Train. If both the [compute] cpu_dedicated_set and
[compute] cpu_shared_set config options are set in Train,
the vcpu_pin_set option will be ignored entirely and
[compute] cpu_shared_set will be used instead to calculate
the amount of VCPU resources to report for each compute
node. If the [compute] cpu_dedicated_set option is not set
in Train, we will issue a warning and fall back to using
vcpu_pin_set as the set of host logical processors to
allocate for PCPU resources. These CPUs will
not be excluded from the list of host logical processors used
to generate the inventory of VCPU resources since
vcpu_pin_set is useful for all NUMA-based instances, not
just those with pinned CPUs, and we therefore cannot assume that these
will be used exclusively by pinned instances. However, this double
reporting of inventory is not considered an issue as our long-standing
advice has been to use host aggregates to group pinned and unpinned
instances. As a result, we should not encounter the two types of
instance on the same host and either the VCPU or
PCPU inventory will be unused. If host aggregates are not
used and both pinned and unpinned instances exist in the cloud, the user
will already be seeing overallocation issues: namely, unpinned instances
do not respect the pinning constraints of pinned instances and may float
across the cores that are supposed to be "dedicated" to the pinned
instances.
We will also deprecate the reserved_host_cpus config
option in Train. If either the [compute] cpu_dedicated_set
or [compute] cpu_shared_set config options are set in
Train, the value of the reserved_host_cpus config option
will be ignored and neither the VCPU nor PCPU
inventories will have a reserved value unless explicitly set via the
placement API.
If neither the [compute] cpu_dedicated_set nor the
[compute] cpu_shared_set config option is set, a warning
will be logged stating that reserved_host_cpus is
deprecated and that the operator should set one or both of
[compute] cpu_shared_set and
[compute] cpu_dedicated_set.
The meaning of [compute] cpu_shared_set will change with
this feature, from being a list of host CPUs used for emulator threads
to a list of host CPUs used for both emulator threads and
VCPU resources. Note that because this option already
exists, we can't rely on its presence to do things like ignore
vcpu_pin_set, as outlined previously, and must rely on
[compute] cpu_dedicated_set instead. For this same reason,
we will only use [compute] cpu_shared_set to determine the
number of VCPU resources if vcpu_pin_set is
unset. If vcpu_pin_set is set, a warning will be logged and
vcpu_pin_set will continue to be used to calculate the
number of VCPU resources available, while
[compute] cpu_shared_set will continue to be used only for
emulator threads.
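The precedence rules above can be summarised with a rough pseudocode sketch (illustrative only, not nova's actual code):

```python
# Rough sketch of the Train-era precedence rules described above
# (illustrative only, not nova's actual code).

def resolve_cpu_inventory_sources(conf):
    dedicated = conf.compute.cpu_dedicated_set    # new in Train
    shared = conf.compute.cpu_shared_set          # meaning extended in Train
    vcpu_pin = conf.vcpu_pin_set                  # deprecated in Train

    if dedicated and shared:
        # New-style config: vcpu_pin_set is ignored entirely.
        return {'VCPU': shared, 'PCPU': dedicated}

    pcpu_source = dedicated
    if not pcpu_source and vcpu_pin:
        # Warn and fall back to vcpu_pin_set for PCPU inventory. These CPUs
        # are *not* excluded from VCPU inventory, since vcpu_pin_set also
        # applies to unpinned NUMA instances; host aggregates are expected
        # to keep pinned and unpinned instances apart.
        pcpu_source = vcpu_pin

    if vcpu_pin:
        # Legacy behaviour: vcpu_pin_set still drives VCPU inventory and
        # [compute] cpu_shared_set remains emulator-thread-only.
        vcpu_source = vcpu_pin
    else:
        vcpu_source = shared

    return {'VCPU': vcpu_source, 'PCPU': pcpu_source}
```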
Note
It is possible that there are already hosts in the wild that have
[compute] cpu_shared_set set but do not have
vcpu_pin_set set. We consider this to be exceptionally
unlikely and purposefully ignore this combination. The only reason to
define [compute] cpu_shared_set in Stein or before is to
use emulator thread offloading, which is used to isolate the additional
work the emulator needs to do from the work the guest OS is doing. It is
mainly required for real-time use cases. The use of
[compute] cpu_shared_set without vcpu_pin_set
could result in instance vCPUs being pinned to any host core including
those listed in cpu_shared_set. This would defeat the whole
purpose of the feature and is very unlikely to be configured by the
performance conscious users of this feature, hence the reason for the
scenario being ignored.
Finally, we will change documentation for the
cpu_allocation_ratio config option to make it abundantly
clear that this option ONLY applies to VCPU and not
PCPU resources.
Flavor extra specs and image metadata properties
Summary: We will attempt to rewrite legacy flavor extra specs and image metadata properties to the new resource types and traits, falling back if no matches are found.
We will alias the legacy hw:cpu_policy and
hw:cpu_thread_policy flavor extra specs and their
hw_cpu_policy and hw_cpu_thread_policy image
metadata counterparts to placement requests.
The hw:cpu_policy flavor extra spec and
hw_cpu_policy image metadata option will be aliased to
resources=(V|P)CPU:${flavor.vcpus}. For flavors/images
using the shared policy, the scheduler will replace this
with the resources=VCPU:${flavor.vcpus} extra spec, and for
flavors/images using the dedicated policy, we will replace
this with the resources=PCPU:${flavor.vcpus} extra spec.
Note that this is similar, though not identical, to how we currently
translate Flavor.vcpus into a placement request for
VCPU resources during scheduling.
The hw:cpu_thread_policy flavor extra spec and
hw_cpu_thread_policy image metadata option will be aliased
to trait:HW_CPU_HYPERTHREADING. For flavors/images using
the isolate policy, we will replace this with
trait:HW_CPU_HYPERTHREADING=forbidden, and for
flavors/images using the require policy, we will replace
this with the trait:HW_CPU_HYPERTHREADING=required extra
spec.
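The translation can be illustrated with a short sketch (illustrative only; the real translation happens when nova builds the placement request from the flavor and image):

```python
# Sketch of the aliasing described above (illustrative only). Given a
# flavor's vcpus count and its extra specs / image properties, build the
# equivalent resources/traits request.

def translate_cpu_extra_specs(vcpus, extra_specs):
    request = {}
    policy = extra_specs.get('hw:cpu_policy', 'shared')
    if policy == 'dedicated':
        request['resources:PCPU'] = vcpus
    else:
        request['resources:VCPU'] = vcpus

    thread_policy = extra_specs.get('hw:cpu_thread_policy')
    if thread_policy == 'isolate':
        request['trait:HW_CPU_HYPERTHREADING'] = 'forbidden'
    elif thread_policy == 'require':
        request['trait:HW_CPU_HYPERTHREADING'] = 'required'
    return request

# translate_cpu_extra_specs(8, {'hw:cpu_policy': 'dedicated',
#                               'hw:cpu_thread_policy': 'isolate'})
# -> {'resources:PCPU': 8, 'trait:HW_CPU_HYPERTHREADING': 'forbidden'}
```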
If the request for placement inventory matching these translated specs fails, we will revert to the legacy behavior and query placement once more. This second request may return hosts that have already been upgraded, but such requests will fail once the instance reaches the compute node, as the libvirt driver will reject them.
Placement inventory
Summary: We will automatically reshape inventory of existing instances using pinned CPUs to use inventory of the PCPU resource class instead of VCPU. This will happen once the [compute] cpu_dedicated_set config option is set.
For existing compute nodes that have guests which use dedicated CPUs,
the virt driver will need to move inventory of existing
VCPU resources (which are actually using dedicated host
CPUs) to the new PCPU resource class. Furthermore, existing
allocations for guests on those compute nodes will need to have their
allocation records updated from the VCPU to
PCPU resource class.
In addition, for existing compute nodes that have guests which use
dedicated CPUs and the isolate CPU thread
policy, the number of allocated PCPU resources may need to
be increased to account for the additional CPUs "reserved" by the host.
On an x86 host with hyperthreading enabled, this will result in 2x the
number of PCPUs being allocated (N PCPU
resources for the instance itself and N PCPU resources allocated to
prevent another instance from using the thread siblings). This will be considered legacy
behavior and won't be supported for new instances.
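Conceptually, the reshape amounts to the following transformation of each pinned instance's allocations (a hedged sketch, not the actual virt driver or placement reshaper code):

```python
# Conceptual sketch of the reshape (illustrative only): move dedicated-CPU
# usage from the VCPU resource class to PCPU, doubling the allocation for
# legacy instances that used the 'isolate' thread policy on a
# hyperthreaded host.

def reshape_allocation(allocation, pinned, isolate_on_smt_host):
    vcpus = allocation['resources'].pop('VCPU', 0)
    if not pinned:
        # Unpinned instances keep their VCPU allocation unchanged.
        allocation['resources']['VCPU'] = vcpus
        return allocation
    pcpus = vcpus * 2 if isolate_on_smt_host else vcpus
    allocation['resources']['PCPU'] = pcpus
    return allocation
```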
Summary
The final upgrade process will look similar to standard upgrades, though there are some slight changes necessary:

1. Upgrade controllers
2. Update compute nodes in batches

For compute nodes hosting pinned instances:

- If vcpu_pin_set is set, unset it and set [compute] cpu_dedicated_set instead. If it is unset, set [compute] cpu_dedicated_set to the entire range of host CPUs.

For compute nodes hosting unpinned instances:

- If vcpu_pin_set is set, unset it and set [compute] cpu_shared_set instead. If it is unset, no action is necessary unless reserved_host_cpus is also in use.
- If reserved_host_cpus is set, unset it and set [compute] cpu_shared_set to the entire range of host cores minus the number of host cores you wish to reserve.
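For example (illustrative values only), a compute node that previously hosted pinned instances might change its configuration as follows:

# Before (Stein and earlier)
[DEFAULT]
vcpu_pin_set=4-15

# After (Train)
[compute]
cpu_dedicated_set=4-15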
Implementation
Assignee(s)
Primary assignees:
- stephenfin
- tetsuro nakamura
- jaypipes
- cfriesen
- bauzas
Work Items
- Create PCPU resource class
- Create [compute] cpu_dedicated_set and [compute] cpu_shared_set options
- Modify virt code to calculate the set of host CPUs that will be used for dedicated and shared CPUs by using the above new config options
- Modify the code that creates the request group from the flavor's extra specs and image properties to construct a request for PCPU resources when the hw:cpu_policy=dedicated spec is found (smooth transition from legacy)
- Modify the code that currently looks at the hw:cpu_thread_policy=isolate|share extra spec / image property to add a required=HW_CPU_HYPERTHREADING or required=!HW_CPU_HYPERTHREADING to the request to placement
- Modify virt code to reshape resource allocations for instances with dedicated CPUs to consume PCPU resources instead of VCPU resources
Dependencies
None.
Testing
Lots of functional testing for the various scenarios listed in the use cases above will be required.
Documentation Impact
- Docs for admin guide about configuring flavors for dedicated and shared CPU resources
- Docs for user guide explaining difference between shared and dedicated CPU resources
- Docs for how the operator can configure a single host to support guests that tolerate thread siblings as dedicated CPUs along with guests that cannot
References
- Support shared and dedicated VMs on same host
- Support shared/dedicated vCPU in one instance
- Emulator threads policy
History
| Release Name | Description |
|---|---|
| Rocky | Originally proposed, not accepted |
| Stein | Proposed again, not accepted |
| Train | Proposed again |
| Ussuri | Updated, based on final implementation |