enable nova to dynamically partition cpus
This change introduces a backlog spec to explore extending nova to support
dynamically partitioning host cpus based on workload needs without updating
existing instances.

Change-Id: I95cd0ed5f10f1ea84cb8edb09015be07f95e836c

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=========================================
Dynamic CPU Pinning for libvirt instances
=========================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Nova currently supports static partitioning of CPUs for instances.
This blueprint proposes to add support for dynamic cpu pinning for
instances with the shared and dedicated CPU policies.


Problem description
===================

Currently it is possible to divide the cpu resources on a host into shared and
dedicated cpus. The shared cpus are available to all instances on the host
(either for vcpus or emulator threads) and the dedicated cpus are available
only to instances with the dedicated cpu policy. Shared cpus are declared via
the cpu_shared_set config option and dedicated cpus are declared via the
cpu_dedicated_set config option.

While this works well for static partitioning of cpus, it does not work well
for dynamic partitioning of cpus. For example, if a host has 8 cpus, 4 of them
dedicated to instances with the dedicated cpu policy and the remaining 4
available to all instances with the shared cpu policy, we can have
underutilisation of the platform. If there are 4 instances with the dedicated
cpu policy, each with 1 vcpu, then the remaining 4 cpus are not available to
further instances with the dedicated cpu policy even if they are idle.
Conversely, if we have 4 instances with the shared cpu policy, each with
1 vcpu, then the dedicated cpus are idle and not available to the instances
with the shared cpu policy. This can lead to a bin packing problem where we
have underutilisation of the platform.

Use Cases
---------

As an operator, I want to be able to dynamically partition the cpus on a host
into shared and dedicated cpus so that I can maximise the utilisation of the
platform based on the workload requirements.

As an operator, I want to be able to use dynamic cpu partitioning without
having to modify my existing flavor definitions. I want to be able to use the
existing flavor definitions and have the system partition the cpus based on
the workload, so that existing workloads can benefit from dynamic cpu
partitioning if moved to a host with dynamic cpu partitioning enabled.

As an operator, when using dynamic cpu partitioning, I want unused cpus to be
able to use the recently added cpu_state power management feature so that idle
cpus will be put into a low power state.

Proposed change
===============

High level design constraints:

* No new resource classes will be added for dynamic cpu partitioning. The
  existing VCPU and PCPU resource classes will be used.
* The existing cpu_shared_set and cpu_dedicated_set config options will not be
  used for dynamic cpu partitioning. Instead, a new config option called
  cpu_dynamic_set will be used. This will enable coexistence of static and
  dynamic cpu partitioning on the same or a different host in the future;
  coexistence on the same host is out of scope of this blueprint.
* The current roles of placement, the resource tracker and the scheduler will
  not change. The resource tracker will continue to be the source of truth for
  the cpu resources on a host. The placement service will continue to track
  the cpu capacity on a host using the VCPU and PCPU resource classes.
  The scheduler will continue to select a host for an instance based on the
  cpu resources on the host, but the assignment of cpus to an instance will be
  done by the resource tracker.

The role of placement in selecting a host
------------------------------------------

While it may seem from an outside perspective that the placement service is
the single source of truth for the cpu resources on a host, this is not, and
has never been, the case. As with all other resources, the placement service
acts as a consistent, atomic, distributed cache of a summary view of the
resources on a host.

Placement is not the source of truth for the cpu resources on a host. The
resource tracker is the source of truth for the cpu resources on a host and is
responsible for updating the placement service with the capacity and
capabilities of the host.

As placement is not aware of the topology or assignment of cpus to instances
on a host, it is not possible for placement to select a host for an instance
based on the cpu resources on the host with respect to numa affinity, cpu
pinning or other affinity requirements. Placement's role is to find hosts that
have enough capacity to host a VM; the topology considerations are enforced by
the scheduler and the resource tracker.

Put concretely, the placement service today can only say, "this host has
8 vcpus and the instance requires 4 vcpus, therefore it has capacity to host
the instance." It cannot say "this host has 8 vcpus and the instance requires
4 vcpus, therefore it can use cpus 0,1,2,3 on the host to host the instance."

This spec does not change the role of placement in selecting a host for an
instance. Placement will only be aware of the total number of vCPUs and pCPUs
on a host, and the assignment of cpus to instances will be done by the
resource tracker. The implication of this is that, as is the case today, it
will be possible for 2 VMs to atomically create an allocation in placement and
then both be scheduled to the same host, where the assignment of cpus may only
be valid for one of the VMs.

This is not a new problem and is not introduced by this spec. It is a problem
that already exists today and could be mitigated by introducing a new rpc to
the compute node and a new placement rest api to allow updating an allocation
in placement and a resource provider in a single atomic call. Doing that is
out of scope of this spec.

The meaning of cpu_policy=shared and cpu_policy=dedicated
----------------------------------------------------------

Contrary to how we commonly think of the meaning of cpu_policy=shared and
cpu_policy=dedicated, the meaning of these cpu policies is not that the cpus
are floating or pinned. The cpu dedicated policy is often referred to as cpu
pinning. While it is true that the dedicated policy does pin cpus to
instances, that is a side effect of the policy and not its meaning. Similarly,
the cpu shared policy does not mean that the cpus are floating. The cpu shared
policy means that the cpus are shared with other instances and not reserved
for this instance.

cpu_policy=shared and cpu_policy=dedicated are not the only cpu policies; nova
also supports cpu_policy=mixed. The cpu_policy=mixed policy is a combination
of the shared and dedicated policies, where some cpus are mapped to the
cpu_shared_set and some cpus are mapped to the cpu_dedicated_set.

Since the original introduction of the vcpu_pin_set and cpu_shared_set config
options, if either is defined all instances with cpu_policy=shared (or unset)
will be pinned to the cpus in the vcpu_pin_set/cpu_shared_set.
The difference between cpu_policy=shared and cpu_policy=dedicated is that, for
each vcpu in the instance, cpu_policy=shared will pin that vcpu to the range
of cpus defined by the cpu_shared_set or the vcpu_pin_set, while
cpu_policy=dedicated will pin each vcpu to a single cpu defined by the
cpu_dedicated_set/vcpu_pin_set.

This is important to understand as it means that cpu_policy=shared vcpus are
not unpinned; they are just not pinned 1:1, and other instances can share the
same cores up to the cpu_allocation_ratio.

Mechanically we cannot change the meaning of cpu_policy=shared and
cpu_policy=dedicated (shared with other instances, and dedicated to this
instance) as that would break backwards compatibility. Nor can we change the
fact that cpu_policy=dedicated pins the instance to specific cpus and does not
change that assignment without user intervention. What we can change is the
relationship between cpu_policy=shared, the cpu_allocation_ratio, and how the
cpus are pinned to host cpus.

cpu_allocation_ratio and cpu_policy=shared
-------------------------------------------

The cpu_allocation_ratio config option is used to define the over subscription
ratio for the cpus on a host. The cpu_allocation_ratio only applies to VCPU
inventories. Regardless of the cpu_policy of an instance, a given nova
instance will always be mapped to at least 1 host cpu per flavor vcpu. That
means that if you have a host with 2 cpus and an allocation ratio of 4.0 then
you can have 8 instances with 1 vcpu. You could also have 4 instances with
2 vcpus, but not 2 instances with 4 vcpus. In other words, an instance can
never over subscribe against itself. Put a different way, if a host has 2 cpus
it can never boot an instance with more than 2 vcpus regardless of the
cpu_allocation_ratio.

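That rule can be summarised with a small illustrative sketch (plain Python,
not nova code; all names are made up for the example):

```
def fits(flavor_vcpus, host_cpus, allocation_ratio, vcpus_already_allocated):
    """Check whether a shared-cpu instance can fit on a host.

    An instance may never oversubscribe against itself, so it can never
    request more vcpus than the host has physical cpus, even if the
    allocation ratio would otherwise allow it.
    """
    if flavor_vcpus > host_cpus:
        return False
    # Oversubscription only applies across instances, up to the ratio.
    return vcpus_already_allocated + flavor_vcpus <= host_cpus * allocation_ratio


# A 2 cpu host with a 4.0 ratio can take eight 1 vcpu instances ...
assert fits(1, 2, 4.0, 7)
# ... but can never take a single 4 vcpu instance.
assert not fits(4, 2, 4.0, 0)
```
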
cpu_shared_set and emulator threads
-----------------------------------

The cpu_shared_set config option is used to define the cpus that are available
to instances with cpu_policy=shared. The cpu_shared_set config option is also
used to define the cpus that are available to emulator threads. This is
inconvenient, so while we are here we can fix it.

While not strictly required, it makes the code simpler and easier to
understand, so this spec will also add a new config option called
cpu_overhead_set. If this config option is defined then nova will be modified
to use the cpu_overhead_set when pinning the qemu emulator thread (and I/O
threads, should we add support for iothreads in the future). When
cpu_overhead_set is not defined and cpu_shared_set is, we will fall back to
cpu_shared_set to preserve backwards compatibility. This spec will deprecate
that fallback so that it can be removed in a future release.

The cpu_overhead_set cpus will not be reported to placement and may not
overlap with the other cpu set config options.

vcpu_pin_set
------------

The vcpu_pin_set config option is deprecated for removal in a future release.
It is not compatible with dedicated cpu resource tracking in placement, and
none of the features described in this spec will be supported when
vcpu_pin_set is defined. In the rest of the document we will only discuss the
cpu_shared_set, cpu_dedicated_set, cpu_dynamic_set and cpu_overhead_set config
options.

Dynamic cpu partitioning
------------------------

The proposed solution is to add a new config option called cpu_dynamic_set.
This config option will be used to define the cpus that are available to
instance vcpus.

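As a sketch, a host opted into this feature might be configured roughly as
follows. cpu_dynamic_set and cpu_overhead_set are the new options proposed by
this spec (they do not exist today), and placing them in the ``[compute]``
group alongside the existing cpu_shared_set and cpu_dedicated_set options is
an assumption of this example:

```
[DEFAULT]
cpu_allocation_ratio = 4.0

[compute]
# proposed option: cpus usable for instance vcpus (shared or dedicated)
cpu_dynamic_set = 0-7
# proposed option: cpus reserved for emulator/io threads
cpu_overhead_set = 8-9
```
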
All cpus in the cpu_dynamic_set will be reported to placement as both VCPUs
and PCPUs. For example, if cpu_dynamic_set=0-7 then placement will report
8 VCPUs and 8 PCPUs:

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
     |
     ---- vcpu
     |    |
     |    ---- total: 8
     |    ---- reserved: 0
     |    ---- min_unit: 1
     |    ---- max_unit: 8
     |    ---- step_size: 1
     |    ---- allocation_ratio: 4.0
     |
     ---- pcpu
          |
          ---- total: 8
          ---- reserved: 0
          ---- min_unit: 1
          ---- max_unit: 8
          ---- step_size: 1
```

With this new feature. If cpu_dynamic_set=0-7 and cpu_allocation_ratio=4.0 then
|
||||
given a flaovr with 1 vcpu and cpu_policy=shared, if we boot an instance on the host
|
||||
with the cpu_dynamic_set=0-7 then the instance will be pinned to a single host core.
|
||||
i.e. the instance will be pinned to a single host core and not a range of host cores.
|
||||
when this core assignment is done, we will update the placement resource provider
|
||||
reduce the max_unit of the pcpu and set the reserved value of the pcpu to 1.
|
||||
|
||||
```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 0
|    |    ---- min_unit: 1
|    |    ---- max_unit: 8
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 7
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
          |
          ---- resources
               |
               ---- vcpu: 1
```

If we boot a second instance with the same flavor on the same host then the second
|
||||
will be pinned to the same host core as the first instance. This is because the first
|
||||
instance has already reserved the host core for shared use and we have not reached
|
||||
the cpu_allocation_ratio.
|
||||
|
||||
```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 0
|    |    ---- min_unit: 1
|    |    ---- max_unit: 8
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 7
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-2
          |
          ---- resources
               |
               ---- vcpu: 1
```

If we have a second flavor that requests 1 vcpu with cpu_policy=dedicated and
we boot an instance with that flavor on the same host, then the instance will
be pinned to a different host core. This will result in the following
placement resource provider:

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 4
|    |    ---- min_unit: 1
|    |    ---- max_unit: 7
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 6
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-2
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-3
          |
          ---- resources
               |
               ---- pcpu: 1
```

Note that because we have allocated a host core for dedicated use, the VCPU
max_unit is reduced to 7 and the PCPU max_unit is reduced to 6. The VCPU
max_unit is reduced because the host core that is reserved for dedicated use
is not available for shared use. The PCPU max_unit is reduced because neither
the host core reserved for dedicated use nor the host core reserved for shared
use is available for dedicated use. Given that shared cpus allow over
subscription, when a host core is reserved for dedicated use the VCPU reserved
value is increased by (1 * allocation_ratio) instead of 1.

If we boot two more instances with the shared cpu policy then they will be
pinned to the same host core as the first two shared instances. This is
because that host core is already reserved for shared use and we have not
reached the cpu_allocation_ratio, so no change to the placement resource
provider inventories is required:

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 4
|    |    ---- min_unit: 1
|    |    ---- max_unit: 7
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 6
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-2
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-3
     |    |
     |    ---- resources
     |         |
     |         ---- pcpu: 1
     |
     ---- instance-uuid-4
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-5
          |
          ---- resources
               |
               ---- vcpu: 1
```

The general pattern is that when an instance is booted, the resource tracker
will select a host core for the instance and update the placement resource
provider inventories to reflect the change.

If the instance requests a shared cpu and no host cores are currently reserved
for shared use, then the resource tracker will reserve a host core for shared
use and update the placement resource provider by reducing the PCPU max_unit
by 1 and increasing the PCPU reserved value by 1. This reflects that the host
core is no longer available for dedicated use even though there is no
allocation against the PCPU inventory.

Similarly, if the instance requests a dedicated cpu, the resource tracker will
reserve a host core for dedicated use and update the placement resource
provider by reducing the VCPU max_unit by 1 and increasing the VCPU reserved
value by (1 * allocation_ratio). This reflects that the host core is no longer
available for shared use even though there is no allocation against the VCPU
inventory. Since each host core used as a shared core allows up to
allocation_ratio vcpus to be allocated, when a host core is reserved for
dedicated use the VCPU reserved value is increased by (1 * allocation_ratio)
instead of 1.

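A minimal sketch of that bookkeeping, in Python. This is illustrative only:
the reserve_core helper and the inventory dicts are hypothetical, and the
dedicated branch also drops the PCPU max_unit so that the numbers match the
worked example above:

```
def reserve_core(inventories, use, allocation_ratio):
    """Update VCPU/PCPU inventory fields when a host core is newly reserved.

    `inventories` maps a resource class name ("VCPU"/"PCPU") to a dict with
    `reserved` and `max_unit` keys, mirroring the placement inventory fields.
    """
    vcpu, pcpu = inventories["VCPU"], inventories["PCPU"]
    if use == "shared":
        # The core now backs oversubscribed vcpus, so it can no longer be
        # handed out as a dedicated pcpu.
        pcpu["reserved"] += 1
        pcpu["max_unit"] -= 1
    elif use == "dedicated":
        # The core is pinned 1:1, removing allocation_ratio worth of vcpu
        # capacity and one pcpu from the largest possible request.
        vcpu["reserved"] += int(1 * allocation_ratio)
        vcpu["max_unit"] -= 1
        pcpu["max_unit"] -= 1
    return inventories


inv = {"VCPU": {"reserved": 0, "max_unit": 8},
       "PCPU": {"reserved": 0, "max_unit": 8}}
reserve_core(inv, "shared", 4.0)     # pcpu: reserved 1, max_unit 7
reserve_core(inv, "dedicated", 4.0)  # vcpu: reserved 4, max_unit 7; pcpu: max_unit 6
```
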
resource provider updates and concurrent allocation creations
---------------------------------------------------------------

As mentioned above, the placement service is not the source of truth for the
cpu resources on a host; the resource tracker is. When we convert an
allocation_candidate to an allocation today, that is done in the conductor
prior to the assignment of cpus to the instance by the resource tracker.

That means it is possible for two instances to atomically create an allocation
in placement and then both be scheduled to the same host, where the assignment
of cpus may only be valid for one of the VMs. Today when this happens, because
of the locking we have in the resource tracker, the second instance will fail
to build because the requested numa topology will not fit on the host.

When we increase the reserved amount of cpus in the placement resource
provider we may temporarily over subscribe the host based on allocations for
instances that have not yet been created. This is not a problem, as the
resource tracker will not allow the over subscription when the instance is
created and will reject the build request. The conductor will then free the
allocation in placement for the rejected instance and an alternative host will
be selected.

Atomicity of allocation creation and resource provider updates
---------------------------------------------------------------

As mentioned above, the placement service is not the source of truth for the
cpu resources on a host. It is a cached view with incomplete knowledge that,
within the limits of that knowledge, provides an atomic view of the capacity
and capabilities of the host.

The model proposed in this spec reflects the fact that, since the placement
service has incomplete knowledge and nova is a distributed system, we are
operating in an eventually consistent model. While placement can definitively
say that a host does not have capacity to host an instance, it cannot
definitively say that a host has capacity to host an instance. Like a bloom
filter, placement can only say that there is a high probability that a host
has capacity to host an instance based on the information it has.

This spec does not attempt to address the problem of atomicity of allocation
creation and resource provider updates, but it does attempt to ensure that it
can be addressed in the future.

One possible way to address this in the future is to add a new rpc to the
compute node to assign cpus to an instance, update the placement resource
provider and convert the allocation candidate to an allocation in a single
atomic call. By delegating the creation of the allocation to the compute node
we can ensure that the allocation is only created if the resource tracker is
able to assign cpus to the instance. This has the disadvantage that it
requires a new rpc to the compute node and a new placement rest api, and the
placement request could be delayed while we wait for the compute node to
respond.

Another possible way to address this in the future is for the assignment of
the cpus to be done by the conductor instead. Today the numa topology filter
generates a possible assignment using a copy of the host numa topology and
then discards it. Instead of discarding the possible assignment we could embed
it in the allocation candidate and pass it to the conductor. The conductor
could then directly update the host numa topology blob in the db and pass the
instance numa topology to the compute node when creating the instance. We
would also need to extend the host numa topology blob to have a generation
number so that we can detect when the host numa topology has changed on the
conductor or compute node, as it is now a shared resource which must be
updated atomically. This has the advantage that it does not require a new rpc
to the compute node, and the placement rest api call can continue to be done
in the conductor, reducing latency.

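A sketch of how such a generation check could work. The helper names are
hypothetical; the pattern simply mirrors the optimistic, generation-based
updates placement already uses for resource providers:

```
class ConcurrentUpdate(Exception):
    """Raised when the host numa topology changed under us."""


def apply_assignment(db, host, assignment, expected_generation):
    """Write a cpu assignment only if nobody else updated the topology.

    `db.update_host_numa_topology` is a hypothetical conditional update
    that only succeeds when the stored generation still matches.
    """
    updated = db.update_host_numa_topology(
        host=host,
        new_topology=assignment.host_topology,
        expected_generation=expected_generation,
    )
    if not updated:
        # Another conductor or the compute node won the race; the caller
        # should refetch the topology and recompute the assignment.
        raise ConcurrentUpdate(host)
    return expected_generation + 1
```
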
cpu_pinning, numa topology, and numa aware memory allocation
--------------------------------------------------------------

The linux kernel OOM killer operates on a per numa node basis. This means that
if a numa node is over committed and the OOM killer is invoked, it will kill
processes on the numa node until the memory usage on that node is below the
threshold. If an instance does not request numa aware memory assignment by
setting hw:mem_page_size=small|large|any|<pagesize> then the instance memory
will not be checked against the numa node and the instance can be killed by
the OOM killer.

With the introduction of
https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
nova gained the ability to have VMs with cpu_policy=shared and
cpu_policy=dedicated on the same host. That capability was later extended to
allow mixing shared and dedicated cpus in the same instance:
https://specs.openstack.org/openstack/nova-specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html

Nova however has never supported, and still does not support, mixing numa and
non numa instances on the same host. That means if you want to support
cpu_policy=shared and cpu_policy=dedicated instances on the same host, all VMs
must have a numa topology.

To address this we need to add a new flavor extra spec
hw:cpu_partitioning=dynamic|static to opt into this feature. This can be
automatically converted into a required or forbidden trait by the scheduler to
select hosts that are configured for dynamic or static partitioning.

The libvirt driver will be updated to report the appropriate trait based on
whether the cpu_dynamic_set is defined. If cpu_dynamic_set is defined then the
libvirt driver will report the COMPUTE_CPU_PARTITIONING_DYNAMIC trait;
otherwise it will report the COMPUTE_CPU_PARTITIONING_STATIC trait.

To prevent OOM issues, hw:cpu_partitioning=dynamic will also imply
hw:mem_page_size=small or hw:mem_page_size=any.

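As an illustration, an operator opting a flavor into this behaviour might do
something like the following. hw:cpu_partitioning is the extra spec proposed
above and does not exist today; the flavor name and sizes are made up:

```
openstack flavor create --vcpus 2 --ram 2048 --disk 20 dyn.small
openstack flavor set dyn.small \
    --property hw:cpu_policy=shared \
    --property hw:cpu_partitioning=dynamic \
    --property hw:mem_page_size=any
```
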
hw:mem_page_size=any
--------------------

The hw:mem_page_size=any flavor extra spec is used to indicate that the
instance can be booted on a host with any page size. As a secondary effect it
also allows the exact page size to be requested by the image. The api does not
specify how the virt driver will select the page size when
hw:mem_page_size=any is requested. Today the libvirt driver will select the
largest page size available that can fit the instance memory. This is done to
reduce the memory overhead of the instance by preferring hugepages. This is
not ideal when cpu_policy=shared, as it is likely that the hugepages should be
preferentially used by the dedicated instances. hw:mem_page_size=any would
however be a better default than hw:mem_page_size=small as it allows the image
to request the page size it wants.

We could address this by adding a new mem_page_size value,
hw:mem_page_size=prefer_smallest, which would indicate that the smallest page
size should be preferred but still allow the image to request a larger page
size.

We could also just change the meaning of hw:mem_page_size=any to mean that the
smallest page size should be preferred, since the selection is not part of the
api contract. For backwards compatibility we could make the preference depend
on whether static or dynamic cpu partitioning is enabled: if static cpu
partitioning is enabled then hw:mem_page_size=any will prefer hugepages; if
dynamic cpu partitioning is enabled then hw:mem_page_size=any will prefer
small pages.

numa and memory overcommit
--------------------------

The linux kernel OOM killer operates on a per numa node basis: if a numa node
is over committed and the OOM killer is invoked, it will kill processes on
that numa node until its memory usage is below the threshold. Numa affined
instances therefore cannot safely be over committed, as the OOM killer will
kill the instance if the numa node is over committed. This is not a new
problem and is why we do not support numa and non numa instances on the same
host: either all instances must be numa aware or none of them can be. That
means that dynamic cpu partitioning cannot be used with non numa aware
instances.

Alternatives
------------

Nova by design never modifies an instance without the user requesting it. This
is to ensure that the user has full control over the instance and that the
instance is not modified in unexpected ways. Nova is also a multi tenant
system, and we must not allow one tenant to modify the resources of another
tenant even indirectly, i.e. by the side effect of a nova operation on a
resource owned by one tenant altering the resources of another tenant. If this
were to happen it would be a security issue. If we did not have these
constraints then we could allow shared cpus to float as we do today and, when
an instance is booted with the dedicated cpu policy, update the pinning of the
shared cpu instances to ensure that the dedicated cpu instance has the cpus it
requires. This would violate our multi tenant design constraints and is not an
option.


Data model impact
-----------------

The existing host numa topology blob does not have the concept of mapping
instances to cpus directly. Instead we have split the availability and the
usage into two data structures: the host numa topology just tracks which cpus
are available to instances and which of them have been used as pinned cpus,
while the association of which instance is using a given cpu is tracked by the
instance numa topology object.

The correlation between the two is computed by the resource tracker and is not
stored in the db. This is because the scheduler only needs to know which cpus
are available, not which instance is using which cpus. The resource tracker
needs to know both which cpus are available and which instance is using which
cpus so that it can free the cpus when the instance is deleted.

This model works fine for static cpu partitioning but does not work for
dynamic cpu partitioning.

A cleaner model would be to have a single data structure that tracks which
cpus are available and which instance is using which cpus. The instance numa
topology object could then be removed and the resource tracker would not need
to compute the correlation between the host numa topology and the instance
numa topology when freeing cpus.

To support this, the host numa topology blob would need to be extended to
include the instance uuid for each cpu. This would allow the resource tracker
to directly update the host numa topology blob when assigning cpus to an
instance and when freeing cpus from an instance.

Additionally we need to track whether a cpu is reserved for shared or
dedicated use, and the allocation ratio for the host. This can be done by
modeling the assignable cpus as "slot" objects. Each slot object would contain
the guest cpu number and an instance uuid. Each host cpu would have n slots,
where n is the allocation ratio. If the host cpu is assigned as a dedicated
cpu then it will have only one slot, holding the instance uuid of the instance
that is using the cpu.

This could look something like this:

```
host_numa_topology:
  generation: 42
  numa_node:
    "0":
      cpus:
        "0":
          slots:
            "0":
              instance_uuid: "instance-uuid-1"
              guest_cpu: 0
            "1":
              instance_uuid: "instance-uuid-2"
              guest_cpu: 0
            "2":
              instance_uuid: "instance-uuid-3"
              guest_cpu: 0
            "3":
              instance_uuid: "instance-uuid-4"
              guest_cpu: 0
        "1":
          slots:
            "0":
              instance_uuid: "instance-uuid-1"
              guest_cpu: 1
            "1":
              instance_uuid: "instance-uuid-2"
              guest_cpu: 1
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "2":
          slots:
            "0":
              instance_uuid: "instance-uuid-1"
              guest_cpu: 2
            "1":
              instance_uuid: "instance-uuid-2"
              guest_cpu: 2
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "3":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
      memory_pages:
        "4":
          total: 1024
          used: 0
          reserved: 0
        "2048":
          total: 1024
          used: 0
          reserved: 0
        "1048576":
          total: 1024
          used: 0
          reserved: 0
    "1":
      cpus:
        "0":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "1":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "2":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "3":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
      memory_pages:
        "4":
          total: 1024
          used: 0
          reserved: 0
        "2048":
          total: 1024
          used: 0
          reserved: 0
        "1048576":
          total: 1024
          used: 0
          reserved: 0
```

The migration data objects may also need to be updated to reflect the new data
model, as instead of pinning shared cpus to a range of host cpus we will be
pinning shared cpus to individual host cpus.

As the behavior of cpu_policy=shared will vary based on whether the host is
using static or dynamic cpu partitioning, we will need a new placement trait
to indicate which mode the host is using. This will allow the operator to
select hosts that are using static or dynamic cpu partitioning using required
or forbidden traits. In combination with the filtering hosts by isolating
aggregates feature,
https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html,
this will allow the operator to prevent existing workloads from being moved to
a host with dynamic cpu partitioning enabled if desired. I.e. if the operator
does not want to move existing workloads to a host with dynamic cpu
partitioning enabled, they can add the required trait to the aggregate that
the host is in and create a new flavor that has the required trait. This will
prevent existing workloads from being moved to the host with dynamic cpu
partitioning enabled while allowing new workloads to be created on it.


REST API impact
---------------

None

The existing flavor extra specs will continue to be used to define the
cpu_policy, and the same placement resource classes will be used, so no api
changes are required.


Security impact
---------------

None

As this proposal does not change the multi tenant design constraints of nova,
there is no security impact.


Notifications impact
--------------------

None


Other end user impact
---------------------

None

Performance Impact
------------------

This will potentially make the scheduler slower as it will have to consider
more constraints in the numa topology filter. However, since this feature will
be opted into via the hw:cpu_partitioning=dynamic flavor extra spec, the
scheduler impact will only be seen when the operator has opted into this
feature.

If we modify the numa topology data model and remove the instance numa
topology object, the new data structure will be smaller and more efficient to
process than the current one. This might result in a small performance
improvement, but it is unlikely to be noticeable.

Moving the cpu assignment from the resource tracker to the conductor should
not have a noticeable impact on performance, as the scheduler is already doing
the cpu assignment calculation; however it will increase the size of the
result of the select_destinations call, so we would need to consider that
carefully. Given that is out of scope for this spec, we will not consider it
further.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

By default this feature will be disabled and will not impact existing
deployments.

For existing workloads, the operator will need to create a new flavor with
hw:cpu_partitioning=dynamic and resize existing instances to the new flavor if
they want to use this feature.

For new workloads, the operator will need to create a new flavor with
hw:cpu_partitioning=dynamic and define the cpu_dynamic_set config option to
enable this feature. This will not be supported in place on hosts with
existing workloads.

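For illustration, moving an existing instance onto such a flavor would be an
ordinary resize, followed by the usual resize confirmation (flavor and server
names here are made up):

```
openstack server resize --flavor dyn.small my-existing-server
```

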
Implementation
==============

Assignee(s)
-----------

Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the
primary author and contact.

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Feature Liaison
---------------

Feature liaison:
  sean-k-mooney

Work Items
----------

To be filled in in a later revision of this spec.


Dependencies
============

None, unless we first want to add a placement rest api to update an allocation
and a resource provider in a single atomic call, or change how cpus are
assigned to instances so that it is done in the conductor instead of the
resource tracker.


Testing
=======

This needs at least unit and functional tests. We might also be able to test
this in tempest, using the serial tests feature to ensure that no other tests
are running at the same time.


Documentation Impact
====================

We need to document all the restrictions and the change in behavior for
cpu_policy=shared, i.e. that such cpus will not float and will be pinned
instead. This is technically not a change in the api contract, but it is a
change in the libvirt driver internals that some may not be aware of.


References
==========

None


History
=======

Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2024.1 Caracal
     - Introduced