enable nova to dynamically partition cpus

This change introduces a backlog spec to explore
extending nova to support dynamically partitioning
host cpus based on workload needs without updating
existing instances.

Change-Id: I95cd0ed5f10f1ea84cb8edb09015be07f95e836c
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=========================================
Dynamic Cpu Pinning for libvirt instances
=========================================
Include the URL of your launchpad blueprint:
https://blueprints.launchpad.net/nova/+spec/example

Nova currently supports static partitioning of CPUs for instances.
This blueprint proposes to add support for dynamic cpu pinning for
instances with the shared and dedicated CPU policies.

Problem description
===================

Currently it is possible to divide the cpu resources on a host into shared and
dedicated cpus. The shared cpus are available to all instances on the host
(either for vcpus or emulator threads) and the dedicated cpus are available only
to instances with the dedicated cpu policy. Shared cpus are declared via the
cpu_shared_set config option and dedicated cpus are declared via the
cpu_dedicated_set config option.

While this works well for static partitioning of cpus, it does not work well for
dynamic partitioning of cpus. For example, if a host has 8 cpus, 4 of them
dedicated to instances with the dedicated cpu policy and the remaining 4
available to instances with the shared cpu policy, we can end up underutilising
the platform. If there are 4 instances with the dedicated cpu policy, each with
1 vcpu, then the remaining 4 cpus are not available to further instances with
the dedicated cpu policy. Conversely, if we have 4 instances with the shared cpu
policy, each with 1 vcpu, then the dedicated cpus are idle and not available to
the instances with the shared cpu policy. This creates a bin packing problem
that leads to underutilisation of the platform.

Use Cases
---------

As an operator, I want to be able to dynamically partition the cpus on a host
into shared and dedicated cpus so that I can maximise the utilisation of the
platform based on the workload requirements.

As an operator, I want to be able to use dynamic cpu partitioning without having
to modify my existing flavor definitions. I want to keep the existing flavor
definitions and have the system partition the cpus based on the workload, so
that existing workloads can benefit from dynamic cpu partitioning if moved to a
host with dynamic cpu partitioning enabled.

As an operator, when using dynamic cpu partitioning, I want unused cpus to be
able to use the recently added cpu_state power management feature so that idle
cpus are put into a low power state.

Proposed change
===============

High level design constraints:

* No new resource classes will be added for dynamic cpu partitioning. The
  existing VCPU and PCPU resource classes will be used.
* The existing cpu_shared_set and cpu_dedicated_set config options will not be
  used for dynamic cpu partitioning. Instead, a new config option called
  cpu_dynamic_set will be used. This will enable coexistence of static and
  dynamic cpu partitioning on the same or different hosts in the future;
  coexistence on the same host is out of scope of this blueprint.
* The current roles of placement, the resource tracker and the scheduler will
  not change. The resource tracker will continue to be the source of truth for
  the cpu resources on a host. The placement service will continue to track the
  cpu capacity on a host using the VCPU and PCPU resource classes.
  The scheduler will continue to select a host for an instance based on the
  cpu resources on the host, but the assignment of cpus to an instance will be
  done by the resource tracker.

The role of placement in selecting a host
------------------------------------------

While it may seem from an outside perspective that the placement service is the
single source of truth for the cpu resources on a host, this is not, and has
never been, the case. As with all other resources, the placement service acts as
a consistent, atomic, distributed cache of a summary view of the resources on a
host. Placement is not the source of truth for the cpu resources on a host. The
resource tracker is the source of truth and is responsible for updating the
placement service with the capacity and capabilities of the host.

As placement is not aware of the topology or assignment of cpus to instances on
a host, it is not possible for placement to select a host for an instance based
on the cpu resources on the host with respect to numa affinity, cpu pinning or
other affinity requirements. Placement's role is to find hosts that have enough
capacity to host a VM; the topology considerations are enforced by the scheduler
and the resource tracker.

Put concretely, the placement service today can only say, "this host has 8 vcpus
and the instance requires 4 vcpus, therefore it has capacity to host the
instance." It cannot say "this host has 8 vcpus and the instance requires 4
vcpus, therefore it can use cpus 0,1,2,3 on the host to host the instance."

This spec does not change the role of placement in selecting a host for an
instance. Placement will only be aware of the total number of vCPUs and pCPUs on
a host, and the assignment of cpus to instances will be done by the resource
tracker. The implication of this is that, as is the case today, it will be
possible for two VMs to atomically create an allocation in placement and then
both be scheduled to the same host, where the assignment of cpus may only be
valid for one of the VMs. This is not a new problem and is not introduced by
this spec. It already exists today and could be mitigated by introducing a new
rpc to the compute node and a new placement rest api to allow updating an
allocation in placement and a resource provider in a single atomic call. Doing
that is out of scope of this spec.

The meaning of cpu_policy=shared and cpu_policy=dedicated
---------------------------------------------------------

Contrary to how we commonly think of the meaning of cpu_policy=shared and
cpu_policy=dedicated, the meaning of these cpu policies is not that the cpus are
floating or pinned. The cpu dedicated policy is often referred to as cpu
pinning. While it is true that the dedicated policy does pin cpus to instances,
that is a side effect of the policy and not its meaning. Similarly, the cpu
shared policy does not mean that the cpus are floating; it means that the cpus
are shared with other instances and not reserved for this instance.

cpu_policy=shared and cpu_policy=dedicated are not the only cpu policies; nova
also supports cpu_policy=mixed. The cpu_policy=mixed policy is a combination of
the shared and dedicated policies where some cpus are mapped to the
cpu_shared_set and some cpus are mapped to the cpu_dedicated_set.

Since the original introduction of the vcpu_pin_set and cpu_shared_set config
options, if either is defined all instances with cpu_policy=shared (or unset)
will be pinned to the cpus in the vcpu_pin_set/cpu_shared_set. The difference
between cpu_policy=shared and cpu_policy=dedicated is that for each vcpu in the
instance, cpu_policy=shared will pin that vcpu to a range of cpus defined by the
cpu_shared_set or the vcpu_pin_set, while cpu_policy=dedicated will pin each
vcpu to a single cpu defined by the cpu_dedicated_set/vcpu_pin_set. This is
important to understand as it means that cpu_policy=shared instances are not
unpinned; they are just not pinned 1:1, and other instances can share the same
cores up to the cpu_allocation_ratio.

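The difference can be illustrated with a small sketch (not nova code; the cpu
sets and the two vcpu guest are invented purely for this example):

```
# with cpu_shared_set=0-3 and cpu_dedicated_set=4-7, a 2 vcpu guest is
# pinned as follows under each policy
shared_set = [0, 1, 2, 3]
dedicated_set = [4, 5, 6, 7]

# cpu_policy=shared: every guest vcpu floats over the whole shared range
shared_pinning = {vcpu: set(shared_set) for vcpu in range(2)}
# cpu_policy=dedicated: every guest vcpu is pinned 1:1 to one host cpu
dedicated_pinning = {vcpu: {dedicated_set[vcpu]} for vcpu in range(2)}

assert shared_pinning == {0: {0, 1, 2, 3}, 1: {0, 1, 2, 3}}
assert dedicated_pinning == {0: {4}, 1: {5}}
```
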
Mechanically we cannot change the meaning of cpu_policy=shared and
cpu_policy=dedicated (shared with other instances, and dedicated to this
instance) as that would break backwards compatibility. Nor can we change the
fact that cpu_policy=dedicated pins instance cpus to specific host cpus, and
that those assignments do not change without user intervention. What we can
change is the relationship between cpu_policy=shared, the cpu_allocation_ratio,
and how the cpus are pinned to host cpus.

cpu_allocation_ratio and cpu_policy=shared
------------------------------------------

The cpu_allocation_ratio config option defines the over subscription ratio for
the cpus on a host; it only applies to VCPU inventories. Regardless of the
cpu_policy of an instance, a given nova instance will always be mapped to at
least 1 host cpu per flavor vcpu. That means that if you have a host with 2 cpus
and an allocation ratio of 4.0 then you can have 8 instances with 1 vcpu each.
You could also have 4 instances with 2 vcpus, but not 2 instances with 4 vcpus.
In other words, an instance can never over subscribe against itself: if a host
has 2 cpus it can never boot an instance with more than 2 vcpus regardless of
the cpu_allocation_ratio.
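
A minimal sketch of those two rules (illustrative only, not nova code; the
function name and parameters are invented for this example):

```
def fits_on_host(flavor_vcpus, host_cpus, used_vcpus, cpu_allocation_ratio):
    # an instance can never over subscribe against itself
    if flavor_vcpus > host_cpus:
        return False
    # total vcpu capacity is host_cpus * cpu_allocation_ratio
    return used_vcpus + flavor_vcpus <= host_cpus * cpu_allocation_ratio


# host with 2 cpus and an allocation ratio of 4.0:
assert fits_on_host(1, 2, 7, 4.0)        # the 8th single-vcpu instance fits
assert fits_on_host(2, 2, 6, 4.0)        # the 4th two-vcpu instance fits
assert not fits_on_host(4, 2, 0, 4.0)    # a 4-vcpu instance never fits
```
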
cpu_shared_set and emulator threads
-----------------------------------

The cpu_shared_set config option is used to define the cpus that are available
to instances with cpu_policy=shared. The cpu_shared_set config option is also
used to define the cpus that are available to emulator threads. This is
unfortunate, so while we are here we can fix it. While this is not strictly
required, it makes the code simpler and easier to understand. This spec will
also add a new config option called cpu_overhead_set. If this config option is
defined then nova will be modified to use the cpu_overhead_set when pinning the
qemu emulator threads (and io threads, if we add support for iothreads in the
future). When cpu_overhead_set is not defined and cpu_shared_set is, we will
fall back to cpu_shared_set to preserve backwards compatibility. This spec will
deprecate that fallback so it can be removed in a future release. The
cpu_overhead_set cpus will not be reported to placement and may not overlap
with the other cpu_*_set config options.
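
A rough sketch of the proposed fallback behaviour, using hypothetical helper
names (the real implementation would reuse nova's existing cpu set parsing):

```
def parse_cpu_set(spec):
    # tiny parser for strings like "0-3,8,10-11"
    cpus = set()
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def emulator_thread_cpus(cpu_overhead_set, cpu_shared_set):
    # prefer the new cpu_overhead_set proposed by this spec
    if cpu_overhead_set:
        return parse_cpu_set(cpu_overhead_set)
    # deprecated fallback: keep pinning overhead threads to cpu_shared_set
    # as nova does today, preserving backwards compatibility
    if cpu_shared_set:
        return parse_cpu_set(cpu_shared_set)
    return None


assert emulator_thread_cpus("8-9", "0-7") == {8, 9}
assert emulator_thread_cpus(None, "0-3") == {0, 1, 2, 3}
```
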
vcpu_pin_set
------------

The vcpu_pin_set config option is deprecated for removal in a future release.
It is not compatible with dedicated cpu resource tracking in placement and none
of the features described in this spec will be supported when vcpu_pin_set is
defined. In the rest of this document we will only discuss the cpu_shared_set,
cpu_dedicated_set, cpu_dynamic_set and cpu_overhead_set config options.

Dynamic cpu partitioning
------------------------

The proposed solution is to add a new config option called cpu_dynamic_set.
This config option will be used to define the cpus that are available to
instance vcpus. All cpus in the cpu_dynamic_set will be reported to placement
as both VCPUs and PCPUs, e.g. if cpu_dynamic_set=0-7 then placement will report
8 VCPUs and 8 PCPUs.

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|
---- vcpu
| |
| ---- total: 8
| ---- reserved: 0
| ---- min_unit: 1
| ---- max_unit: 8
| ---- step_size: 1
| ---- allocation_ratio: 4.0
|
---- pcpu
|
---- total: 8
---- reserved: 0
---- min_unit: 1
---- max_unit: 8
---- step_size: 1
```

With this new feature, if cpu_dynamic_set=0-7 and cpu_allocation_ratio=4.0 and
we boot an instance on the host with a flavor that has 1 vcpu and
cpu_policy=shared, then the instance will be pinned to a single host core, not
a range of host cores. When this core assignment is done, we will update the
placement resource provider to reduce the max_unit of the PCPU inventory and
set the reserved value of the PCPU inventory to 1.

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|
---- vcpu
| |
| ---- total: 8
| ---- reserved: 0
| ---- min_unit: 1
| ---- max_unit: 8
| ---- step_size: 1
| ---- allocation_ratio: 4.0
|
---- pcpu
| |
| ---- total: 8
| ---- reserved: 1
| ---- min_unit: 1
| ---- max_unit: 7
| ---- step_size: 1
---- allocations
|
---- instance-uuid-1
|
---- resources
|
---- vcpu: 1
```

If we boot a second instance with the same flavor on the same host, the second
instance will be pinned to the same host core as the first. This is because the
first instance has already reserved the host core for shared use and we have
not reached the cpu_allocation_ratio.

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|
---- vcpu
| |
| ---- total: 8
| ---- reserved: 0
| ---- min_unit: 1
| ---- max_unit: 8
| ---- step_size: 1
| ---- allocation_ratio: 4.0
|
---- pcpu
| |
| ---- total: 8
| ---- reserved: 1
| ---- min_unit: 1
| ---- max_unit: 7
| ---- step_size: 1
---- allocations
|
---- instance-uuid-1
| |
| ---- resources
| |
| ---- vcpu: 1
|
---- instance-uuid-2
|
---- resources
|
---- vcpu: 1
```

If we have a second flavor that requests 1 vcpu with cpu_policy=dedicated and
we boot an instance with that flavor on the same host, then the instance will
be pinned to a different host core. This results in the following placement
resource provider.

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|
---- vcpu
| |
| ---- total: 8
| ---- reserved: 4
| ---- min_unit: 1
| ---- max_unit: 7
| ---- step_size: 1
| ---- allocation_ratio: 4.0
|
---- pcpu
| |
| ---- total: 8
| ---- reserved: 1
| ---- min_unit: 1
| ---- max_unit: 6
| ---- step_size: 1
---- allocations
|
---- instance-uuid-1
| |
| ---- resources
| |
| ---- vcpu: 1
|
---- instance-uuid-2
| |
| ---- resources
| |
| ---- vcpu: 1
|
---- instance-uuid-3
|
---- resources
|
---- pcpu: 1
```

Note that because we have allocated a host core for dedicated use, the VCPU
max_unit is reduced to 7 and the PCPU max_unit is reduced to 6. The VCPU
max_unit is reduced because the host core that is reserved for dedicated use is
not available for shared use. The PCPU max_unit is reduced because the host
core that is reserved for dedicated use and the host core that is reserved for
shared use are not available for dedicated use. Given that shared cpus allow
over subscription, when a host core is reserved for dedicated use the VCPU
reserved value is increased by (1 * allocation_ratio) instead of 1.

If we boot two more instances with the shared cpu policy then they will be
pinned to the same host core as the first two instances. This is because the
first instance has already reserved that host core for shared use and we have
not reached the cpu_allocation_ratio, so no change to the placement resource
provider inventories is required.

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|
---- vcpu
| |
| ---- total: 8
| ---- reserved: 4
| ---- min_unit: 1
| ---- max_unit: 7
| ---- step_size: 1
| ---- allocation_ratio: 4.0
|
---- pcpu
| |
| ---- total: 8
| ---- reserved: 1
| ---- min_unit: 1
| ---- max_unit: 6
| ---- step_size: 1
---- allocations
|
---- instance-uuid-1
| |
| ---- resources
| |
| ---- vcpu: 1
|
---- instance-uuid-2
| |
| ---- resources
| |
| ---- vcpu: 1
|
---- instance-uuid-3
| |
| ---- resources
| |
| ---- pcpu: 1
|
---- instance-uuid-4
| |
| ---- resources
| |
| ---- vcpu: 1
|
---- instance-uuid-5
|
---- resources
|
---- vcpu: 1
```

The general pattern is that when an instance is booted, the resource tracker
will select a host core for the instance and update the placement resource
provider inventories to reflect the change. If the instance requests a shared
cpu and no host cores are currently reserved for shared use, the resource
tracker will reserve a host core for shared use and update the placement
resource provider by reducing the PCPU max_unit by 1 and increasing the PCPU
reserved value by 1. This reflects that the host core is no longer available
for dedicated use even though there is no allocation against the PCPU
inventory. Similarly, if the instance requests a dedicated cpu, the resource
tracker will reserve a host core for dedicated use and update the placement
resource provider by reducing the VCPU max_unit by 1 and increasing the VCPU
reserved value by (1 * allocation_ratio). This reflects that the host core is
no longer available for shared use even though there is no allocation against
the VCPU inventory. Since each host core used as a shared core allows up to
allocation_ratio vcpus to be allocated, when a host core is reserved for
dedicated use the VCPU reserved value is increased by (1 * allocation_ratio)
instead of 1.
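
The accounting above can be summarised in a small sketch (illustrative only,
not nova code) that reproduces the inventories shown in the worked example:

```
def dynamic_inventories(host_cpus, ratio, shared_cores, dedicated_cores):
    # shared_cores:    host cores currently reserved for shared (VCPU) use
    # dedicated_cores: host cores currently pinned to dedicated (PCPU) guests
    return {
        'VCPU': {
            'total': host_cpus,
            # each dedicated core removes allocation_ratio worth of
            # oversubscribable VCPU capacity, not just one unit
            'reserved': int(ratio) * dedicated_cores,
            'min_unit': 1,
            # a single guest can use at most the cores not pinned as dedicated
            'max_unit': host_cpus - dedicated_cores,
            'step_size': 1,
            'allocation_ratio': ratio,
        },
        'PCPU': {
            'total': host_cpus,
            # cores used for shared guests carry no PCPU allocation, so they
            # are modelled as reserved to keep them out of dedicated use
            'reserved': shared_cores,
            'min_unit': 1,
            # a single guest can use at most the cores that are neither
            # shared nor already pinned to another guest
            'max_unit': host_cpus - shared_cores - dedicated_cores,
            'step_size': 1,
            'allocation_ratio': 1.0,
        },
    }


# matches the worked example above: 8 cpus, ratio 4.0, one core carrying the
# shared instances and one core pinned to the dedicated instance
inv = dynamic_inventories(8, 4.0, shared_cores=1, dedicated_cores=1)
assert inv['VCPU']['reserved'] == 4 and inv['VCPU']['max_unit'] == 7
assert inv['PCPU']['reserved'] == 1 and inv['PCPU']['max_unit'] == 6
```
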
resource provider updates and concurrent allocation creations
--------------------------------------------------------------

As mentioned above, the placement service is not the source of truth for the
cpu resources on a host; the resource tracker is. Today, converting an
allocation candidate into an allocation is done in the conductor, prior to the
assignment of cpus to the instance by the resource tracker. That means it is
possible for two instances to atomically create an allocation in placement and
then both be scheduled to the same host where the assignment of cpus may only
be valid for one of the VMs. Today when this happens, because of the locking we
have in the resource tracker, the second instance will fail to build because
the requested numa topology cannot be fitted to the host.

When we increase the reserved amount of cpus in the placement resource provider
we may temporarily over subscribe the host based on allocations for instances
that have not yet been created. This is not a problem as the resource tracker
will not allow the over subscription when the instance is created and will
reject the build request. The conductor will then free the allocation in
placement for the rejected instance and an alternative host will be selected.

Atomicity of allocation creation and resource provider updates
--------------------------------------------------------------

As mentioned above, the placement service is not the source of truth for the
cpu resources on a host. It is a cached view with incomplete knowledge that,
within the limits of its knowledge, provides an atomic view of the capacity and
capabilities of the host. The model proposed in this spec reflects that, given
the placement service has incomplete knowledge and nova is a distributed
system, we are operating in an eventually consistent model. While placement can
definitively say that a host does not have capacity to host an instance, it
cannot definitively say that a host does have that capacity. Like a bloom
filter, placement can only say there is a high probability that a host has
capacity to host an instance based on the information it has.

This spec does not attempt to address the problem of atomicity of allocation
creation and resource provider updates, but it does attempt to ensure that it
can be addressed in the future. One possible way to address this in the future
is to add a new rpc to the compute node to assign cpus to an instance, update
the placement resource provider and convert the allocation candidate to an
allocation in a single atomic call. By delegating the creation of the
allocation to the compute node we can ensure that the allocation is only
created if the resource tracker is able to assign cpus to the instance. This
has the disadvantage that it requires a new rpc to the compute node and a new
placement rest api, and the boot request could be delayed while we wait for the
compute node to respond.

Another possible way to address this in the future is for the assignment of
cpus to be done by the conductor instead. Today the numa topology filter
generates a possible assignment using a copy of the host numa topology and then
discards it. Instead of discarding the possible assignment we could embed it in
the allocation candidate and pass it to the conductor. The conductor could then
directly update the host numa topology blob in the db and pass the instance
numa topology to the compute node when creating the instance. We would also
need to extend the host numa topology blob with a generation number so that we
can detect when the host numa topology has changed on the conductor or compute
node, as it is now a shared resource which must be updated atomically. This has
the advantage that it does not require a new rpc to the compute node, and the
placement rest api call can continue to be done in the conductor, reducing the
latency.
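
The generation check could follow the usual optimistic concurrency pattern. The
sketch below uses a plain dict in place of the host numa topology blob, and all
names are invented; nothing like this exists in nova today:

```
class ConcurrentUpdate(Exception):
    pass


def apply_cpu_assignment(stored, expected_generation, new_blob):
    # only persist the assignment if the blob has not changed since the
    # allocation candidate (and its embedded assignment) was generated
    if stored['generation'] != expected_generation:
        raise ConcurrentUpdate('host numa topology changed, reschedule')
    stored['blob'] = new_blob
    stored['generation'] += 1


stored = {'generation': 42, 'blob': {}}
apply_cpu_assignment(stored, 42, {'cpu0': 'instance-uuid-1'})
assert stored['generation'] == 43
try:
    apply_cpu_assignment(stored, 42, {'cpu0': 'instance-uuid-2'})
except ConcurrentUpdate:
    pass  # a stale generation is rejected and the build is rescheduled
```
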
cpu_pinning, numa topology, and numa aware memory allocation
------------------------------------------------------------

The linux kernel OOM killer operates on a per numa node basis. This means that
if a numa node is over committed and the OOM killer is invoked, it will kill
processes on the numa node until the memory usage on the numa node is below the
threshold. If an instance does not request numa aware memory assignment by
setting hw:mem_page_size=small|large|any|<pagesize>, then the instance memory
will not be checked against the numa node and the instance can be killed by the
kernel OOM killer.

With the introduction of
https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
nova gained the ability to have VMs with cpu_policy=shared and
cpu_policy=dedicated on the same host. That capability was later extended to
allow mixing shared and dedicated cpus in the same instance:
https://specs.openstack.org/openstack/nova-specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html
Nova, however, has never supported and still does not support mixing numa and
non numa instances on the same host. That means that if you want to support
cpu_policy=shared and cpu_policy=dedicated instances on the same host, all VMs
must have a numa topology.

To address this we need to add a new flavor extra spec
hw:cpu_partitioning=dynamic|static to opt into this feature. This can be
automatically converted into a required or forbidden trait by the scheduler to
select hosts that are configured for dynamic or static partitioning. The
libvirt driver will be updated to report the appropriate trait based on whether
the cpu_dynamic_set is defined: if it is defined the driver will report the
COMPUTE_CPU_PARTITIONING_DYNAMIC trait, otherwise it will report the
COMPUTE_CPU_PARTITIONING_STATIC trait. To prevent OOM issues,
hw:cpu_partitioning=dynamic will also imply hw:mem_page_size=small or
hw:mem_page_size=any.
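
A sketch of the proposed trait reporting and extra spec translation (the traits
and the extra spec are new ones proposed by this spec and do not exist yet):

```
def host_cpu_partitioning_traits(cpu_dynamic_set):
    # reported by the libvirt driver based on host configuration
    if cpu_dynamic_set:
        return {'COMPUTE_CPU_PARTITIONING_DYNAMIC'}
    return {'COMPUTE_CPU_PARTITIONING_STATIC'}


def required_traits_for_flavor(extra_specs):
    # the scheduler translates hw:cpu_partitioning into a required trait
    mode = extra_specs.get('hw:cpu_partitioning')
    if mode == 'dynamic':
        return {'COMPUTE_CPU_PARTITIONING_DYNAMIC'}
    if mode == 'static':
        return {'COMPUTE_CPU_PARTITIONING_STATIC'}
    return set()


# a dynamic flavor only lands on hosts that define cpu_dynamic_set
assert required_traits_for_flavor({'hw:cpu_partitioning': 'dynamic'}) \
    <= host_cpu_partitioning_traits('0-7')
```
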
hw:mem_page_size=any
--------------------

The hw:mem_page_size=any flavor extra spec indicates that the instance can be
booted on a host with any page size. As a secondary effect it also indicates
that the exact page size can be requested by the image. The api does not
specify how the virt driver will select the page size when hw:mem_page_size=any
is requested. Today the libvirt driver selects the largest page size available
that can fit the instance memory. This is done to reduce the memory overhead of
the instance by preferring hugepages. This is not ideal when cpu_policy=shared,
as it is likely that the hugepages should preferentially be used by the
dedicated instances. hw:mem_page_size=any would, however, be a better default
than hw:mem_page_size=small as it allows the image to request the page size it
wants.

We could address this by adding a new value, hw:mem_page_size=prefer_smallest,
which would indicate that the smallest page size should be preferred while
still allowing the image to request a larger page size. We could also just
change the meaning of hw:mem_page_size=any to prefer the smallest page size,
since this is not part of the api contract. For backwards compatibility we
could make the preference depend on whether static or dynamic cpu partitioning
is enabled: if static cpu partitioning is enabled then hw:mem_page_size=any
will prefer hugepages; if dynamic cpu partitioning is enabled then
hw:mem_page_size=any will prefer small pages.
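
One possible shape of that page size selection is sketched below (hypothetical
helper, not the libvirt driver's actual logic; the fit check is deliberately
simplistic):

```
def pick_page_size_any(available_kb, mem_mb, dynamic_partitioning):
    # very rough "fits" check for illustration: the guest memory must be a
    # whole number of pages of the candidate size
    mem_kb = mem_mb * 1024
    fitting = [p for p in available_kb if mem_kb % p == 0]
    # static partitioning keeps today's behaviour and prefers hugepages;
    # dynamic partitioning leaves hugepages for dedicated guests
    return min(fitting) if dynamic_partitioning else max(fitting)


# 4 KiB, 2 MiB and 1 GiB pages available for a 2048 MiB guest
assert pick_page_size_any([4, 2048, 1048576], 2048, False) == 1048576
assert pick_page_size_any([4, 2048, 1048576], 2048, True) == 4
```
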
numa and memory overcommit
--------------------------

The linux kernel OOM killer operates on a per numa node basis. This means that
if a numa node is over committed and the OOM killer is invoked, it will kill
processes on the numa node until the memory usage on the numa node is below the
threshold. Numa affined instances cannot safely be over committed, as the OOM
killer will kill the instance if the numa node is over committed. This is not a
new problem and is why we do not support numa and non numa instances on the
same host: either all instances must be numa aware or none of them can be. That
means that dynamic cpu partitioning cannot be used with non numa aware
instances.

Alternatives
------------

Nova by design never modifies an instance without the user requesting it. This
is to ensure that the user has full control over the instance and that the
instance is not modified in unexpected ways. Nova is also a multi tenant system
and we must not allow one tenant to modify the resources of another tenant,
even indirectly, i.e. by the side effect of a nova operation on a resource
owned by one tenant altering the resources of another tenant. If this were to
happen it would be a security issue.

If we did not have these constraints then we could allow shared cpus to float
as we do today and, when an instance is booted with the dedicated cpu policy,
update the pinning of the shared cpu instances to ensure that the dedicated cpu
instance has the cpus it requires. This would violate our multi tenant design
constraints and is not an option.

Data model impact
-----------------

The existing host numa topology blob does not have the concept of mapping
instances to cpus directly. Instead we have split the availability and the
usage into two data structures: the host numa topology just tracks which cpus
are available to instances and which ones have been used as pinned cpus, while
the association of which instance is using a given cpu is tracked by the
instance numa topology object. The correlation between the two is computed by
the resource tracker and is not stored in the db. This is because the scheduler
only needs to know which cpus are available, not which instance is using which
cpus. The resource tracker needs to know both so that it can free the cpus when
the instance is deleted.

This model works fine for static cpu partitioning but does not work for dynamic
cpu partitioning. A cleaner model would be to have a single data structure that
tracks which cpus are available and which instance is using which cpus. The
instance numa topology object could then be removed and the resource tracker
would not need to compute the correlation between the host numa topology and
the instance numa topology when freeing cpus.

To support this, the host numa topology blob would need to be extended to
include the instance uuid for each cpu. This would allow the resource tracker
to directly update the host numa topology blob when assigning cpus to an
instance and when freeing cpus from an instance. Additionally, we need to track
whether a cpu is reserved for shared or dedicated use, and the allocation ratio
for the host. This can be done by modelling the assignable cpus as "slot"
objects. Each slot object would contain the guest cpu number and an instance
uuid. Each host cpu would have n slots where n is the allocation ratio. If the
host cpu is assigned as a dedicated cpu then it will have only one slot, with
the instance uuid of the instance that is using the cpu.

This could look something like this:

```
host_numa_topology:
generation: 42
numa_node:
"0":
cpus:
"0":
slots:
"0":
instance_uuid: "instance-uuid-1"
guest_cpu: 0
"1":
instance_uuid: "instance-uuid-2"
guest_cpu: 0
"2":
instance_uuid: "instance-uuid-3"
guest_cpu: 0
"3":
instance_uuid: "instance-uuid-4"
guest_cpu: 0
"1":
slots:
"0":
instance_uuid: "instance-uuid-1"
guest_cpu: 1
"1":
instance_uuid: "instance-uuid-2"
guest_cpu: 1
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
"2":
slots:
"0":
instance_uuid: "instance-uuid-1"
guest_cpu: 2
"1":
instance_uuid: "instance-uuid-2"
guest_cpu: 2
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
"3":
slots:
"0":
instance_uuid: None
guest_cpu: None
"1":
instance_uuid: None
guest_cpu: None
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
memory_pages:
"4":
total: 1024
used: 0
reserved: 0
"2048":
total: 1024
used: 0
reserved: 0
"1048576":
total: 1024
used: 0
reserved: 0
"1":
cpus:
"0":
slots:
"0":
instance_uuid: None
guest_cpu: None
"1":
instance_uuid: None
guest_cpu: None
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
"1":
slots:
"0":
instance_uuid: None
guest_cpu: None
"1":
instance_uuid: None
guest_cpu: None
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
"2":
slots:
"0":
instance_uuid: None
guest_cpu: None
"1":
instance_uuid: None
guest_cpu: None
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
"3":
slots:
"0":
instance_uuid: None
guest_cpu: None
"1":
instance_uuid: None
guest_cpu: None
"2":
instance_uuid: None
guest_cpu: None
"3":
instance_uuid: None
guest_cpu: None
memory_pages:
"4":
total: 1024
used: 0
reserved: 0
"2048":
total: 1024
used: 0
reserved: 0
"1048576":
total: 1024
used: 0
reserved: 0
```
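
The slot model could be expressed as simple objects along these lines (an
illustrative sketch, not the actual nova object model):

```
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    instance_uuid: Optional[str] = None
    guest_cpu: Optional[int] = None


@dataclass
class HostCpu:
    # one slot per unit of allocation ratio; a cpu handed out as a
    # dedicated cpu would instead hold a single occupied slot
    slots: List[Slot] = field(default_factory=list)

    def is_free(self):
        return all(s.instance_uuid is None for s in self.slots)


def make_host_cpu(allocation_ratio):
    return HostCpu(slots=[Slot() for _ in range(allocation_ratio)])


cpu0 = make_host_cpu(4)
assert cpu0.is_free() and len(cpu0.slots) == 4
cpu0.slots[0] = Slot(instance_uuid='instance-uuid-1', guest_cpu=0)
assert not cpu0.is_free()
```
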
The migration data objects will also need to be updated to reflect the new data
model, as instead of pinning shared cpus to a range of host cpus we will be
pinning shared cpus to individual host cpus.

As the behaviour of cpu_policy=shared will vary based on whether the host is
using static or dynamic cpu partitioning, we will need to add a new placement
trait to indicate which mode the host is using. This will allow the operator to
select hosts that are using static or dynamic cpu partitioning using required
or forbidden traits. In combination with the "filtering hosts by isolating
aggregates" feature,
https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html
this will allow the operator to prevent existing workloads from being moved to
a host with dynamic cpu partitioning enabled, if desired. I.e. if the operator
does not want to move existing workloads to a host with dynamic cpu
partitioning enabled, they can add the required trait to the aggregate that the
host is in and create a new flavor that has the required trait. This prevents
existing workloads from being moved to the host with dynamic cpu partitioning
enabled, but allows new workloads to be created on it.

REST API impact
---------------
None

The existing flavor extra specs will continue to be used to define the
cpu_policy and the same placement resource classes will be used, so no api
changes are required.

Security impact
---------------
None
As this proposal does not change the multi tenant design constraints of nova
there is no security impact.
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------

This will potentially make the scheduler slower as it will have to consider
more constraints in the numa topology filter. However, since this feature is
opt in via the hw:cpu_partitioning=dynamic flavor extra spec, the scheduler
impact will only be seen when the operator has opted into it.

If we modify the host numa topology and remove the instance numa topology
object, the new data structure will be smaller and more efficient to process
than the current one. This might result in a small performance improvement, but
it is unlikely to be noticeable.

Moving the cpu assignment from the resource tracker to the conductor should not
have a noticeable impact on performance, as the scheduler is already doing the
cpu assignment calculation; however it would increase the size of the result of
the select_destinations call, so we would need to consider that carefully.
Given that is out of scope for this spec we will not consider it further.

Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------

By default this feature will be disabled and will not impact existing
deployments. For existing workloads the operator will need to create new
flavors with hw:cpu_partitioning=dynamic and resize existing instances to the
new flavors if they want to use this feature. For new workloads the operator
will need to create new flavors with hw:cpu_partitioning=dynamic and define the
cpu_dynamic_set config option to enable this feature. Enabling this feature in
place on hosts with existing workloads will not be supported.

Implementation
==============
Assignee(s)
-----------
Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?
If more than one person is working on the implementation, please designate the
primary author and contact.
Primary assignee:
<launchpad-id or None>
Other contributors:
<launchpad-id or None>
Feature Liaison
---------------
Feature liaison:
sean-k-mooney
Work Items
----------

To be determined. As this is a backlog spec, the work items will be filled in
when the spec is proposed for a specific release.

Dependencies
============

None, unless we want to first add a placement rest api to update an allocation
and a resource provider in a single atomic call, or first change how cpus are
assigned to instances so that it is done in the conductor instead of the
resource tracker.

Testing
=======

This needs at least unit and functional tests. We might also be able to test
this in tempest, using the serial tests feature to ensure that no other tests
are running at the same time.

Documentation Impact
====================

We need to document all the restrictions and the change in behaviour for
cpu_policy=shared, i.e. that such cpus will not float and will be pinned
instead. This is technically not a change in the api contract, but it is a
change in libvirt driver internals that some may not be aware of.

References
==========
None
History
=======
Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2024.1 Caracal
     - Introduced