diff --git a/specs/backlog/nova-dynamic-cpus.rst b/specs/backlog/nova-dynamic-cpus.rst
new file mode 100644
index 000000000..53e9c2d99
--- /dev/null
+++ b/specs/backlog/nova-dynamic-cpus.rst
@@ -0,0 +1,918 @@
+..
+  This work is licensed under a Creative Commons Attribution 3.0 Unported
+  License.
+
+  http://creativecommons.org/licenses/by/3.0/legalcode
+
+=========================================
+Dynamic CPU pinning for libvirt instances
+=========================================
+
+Include the URL of your launchpad blueprint:
+
+https://blueprints.launchpad.net/nova/+spec/example
+
+Nova currently supports static partitioning of CPUs for instances.
+This blueprint proposes adding support for dynamic CPU pinning for
+instances with the shared and dedicated CPU policies.
+
+Problem description
+===================
+
+Today it is possible to divide the CPU resources on a host into shared and
+dedicated CPUs. The shared CPUs are available to all instances on the host
+(either for vCPUs or emulator threads) and the dedicated CPUs are available
+only to instances with the dedicated CPU policy. Shared CPUs are declared via
+the cpu_shared_set config option and dedicated CPUs are declared via the
+cpu_dedicated_set config option.
+
+While this works well for static partitioning of CPUs, it does not work well
+when the mix of workloads changes over time. For example, if a host has 8
+CPUs, 4 of them dedicated to instances with the dedicated CPU policy and the
+remaining 4 available to instances with the shared CPU policy, the platform
+can be underutilised. If there are 4 instances with the dedicated CPU policy,
+each with 1 vCPU, the remaining 4 CPUs are not available to further instances
+with the dedicated CPU policy. Conversely, if there are 4 instances with the
+shared CPU policy, each with 1 vCPU, the dedicated CPUs sit idle and are not
+available to the instances with the shared CPU policy. This creates a
+bin-packing problem that leads to underutilisation of the platform.
+
+Use Cases
+---------
+
+As an operator, I want to be able to dynamically partition the CPUs on a host
+into shared and dedicated CPUs so that I can maximise the utilisation of the
+platform based on the workload requirements.
+
+As an operator, I want to be able to use dynamic CPU partitioning without
+having to modify my existing flavor definitions. I want to be able to use the
+existing flavor definitions and have the system partition the CPUs based on
+the workload, so that existing workloads can benefit from dynamic CPU
+partitioning if moved to a host with dynamic CPU partitioning enabled.
+
+As an operator, when using dynamic CPU partitioning, I want unused CPUs to be
+able to use the recently added cpu_state power management feature so that
+idle CPUs will be put into a low power state.
+
+Proposed change
+===============
+
+High level design constraints:
+
+* No new resource classes will be added for dynamic CPU partitioning. The
+  existing VCPU and PCPU resource classes will be used.
+* The existing cpu_shared_set and cpu_dedicated_set config options will not
+  be used for dynamic CPU partitioning. Instead, a new config option called
+  cpu_dynamic_set will be used. This will enable coexistence of static and
+  dynamic CPU partitioning on the same or different hosts in the future;
+  coexistence on the same host is out of scope of this blueprint.
+* The current roles of placement, the resource tracker and the scheduler
+  will not change. The resource tracker will continue to be the source of
+  truth for the CPU resources on a host. The placement service will continue
+  to track the CPU capacity on a host using the VCPU and PCPU resource
+  classes. The scheduler will continue to select a host for an instance based
+  on the CPU resources on the host, but the assignment of CPUs to an instance
+  will be done by the resource tracker.
+
+The role of placement in selecting a host
+------------------------------------------
+
+While it may seem from an outside perspective that the placement service is
+the single source of truth for the CPU resources on a host, this is not, and
+has never been, the case. As with all other resources, the placement service
+acts as a consistent, atomic, distributed cache of a summary view of the
+resources on a host.
+
+Placement is not the source of truth for the CPU resources on a host. The
+resource tracker is the source of truth for the CPU resources on a host and
+is responsible for updating the placement service with the capacity and
+capabilities of the host.
+
+As placement is not aware of the topology or assignment of CPUs to instances
+on a host, it is not possible for placement to select a host for an instance
+based on the CPU resources on the host with respect to NUMA affinity, CPU
+pinning or other affinity requirements. Placement's role is to find hosts
+that have enough capacity to host a VM; the topology considerations are
+enforced by the scheduler and the resource tracker.
+
+Put concretely, the placement service today can only say, "this host has 8
+vCPUs and the instance requires 4 vCPUs, therefore it has capacity to host
+the instance." It cannot say "this host has 8 vCPUs and the instance requires
+4 vCPUs, therefore it can use CPUs 0,1,2,3 on the host to host the instance."
+
+This spec does not change the role of placement in selecting a host for an
+instance. Placement will only be aware of the total number of vCPUs and
+pCPUs on a host, and the assignment of CPUs to instances will be done by the
+resource tracker. The implication of this is that, as it is today, it will be
+possible for two VMs to atomically create an allocation in placement and then
+both be scheduled to the same host, where the assignment of CPUs may only be
+valid for one of the VMs.
+
+This is not a new problem and is not introduced by this spec. It already
+exists today and could be mitigated by introducing a new RPC to the compute
+node and a new placement REST API to allow updating an allocation and a
+resource provider in a single atomic call. Doing that is out of scope of this
+spec.
+
+The meaning of cpu_policy=shared and cpu_policy=dedicated
+---------------------------------------------------------
+
+Contrary to how we commonly think of the meaning of cpu_policy=shared and
+cpu_policy=dedicated, the meaning of these CPU policies is not that the CPUs
+are floating or pinned. The dedicated CPU policy is often referred to as CPU
+pinning. While it is true that the dedicated CPU policy does pin CPUs to
+instances, that is a side effect of the policy and not its meaning. Similarly,
+the shared CPU policy does not mean that the CPUs are floating. The shared
+CPU policy means that the CPUs are shared with other instances and not
+reserved for this instance.
+
+cpu_policy=shared and cpu_policy=dedicated are not the only CPU policies;
+nova also supports cpu_policy=mixed.
+The cpu_policy=mixed policy is a combination of the shared and dedicated
+policies, where some of the instance CPUs are mapped to the cpu_shared_set
+and some are mapped to the cpu_dedicated_set.
+
+Since the original introduction of the vcpu_pin_set and cpu_shared_set config
+options, if either is defined all instances with cpu_policy=shared (or unset)
+will be pinned to the CPUs in the vcpu_pin_set/cpu_shared_set. The difference
+between cpu_policy=shared and cpu_policy=dedicated is that, for each vCPU in
+the instance, cpu_policy=shared pins that vCPU to the range of CPUs defined
+by the cpu_shared_set or the vcpu_pin_set, while cpu_policy=dedicated pins
+each vCPU to a single CPU taken from the cpu_dedicated_set/vcpu_pin_set.
+
+This is important to understand, as it means that cpu_policy=shared vCPUs are
+not unpinned; they are just not pinned 1:1, and other instances can share the
+same cores up to the cpu_allocation_ratio.
+
+Mechanically we cannot change the meaning of cpu_policy=shared and
+cpu_policy=dedicated (shared with other instances, and dedicated to this
+instance) as that would break backwards compatibility. Nor can we change the
+fact that cpu_policy=dedicated pins instance CPUs to specific host CPUs and
+that this mapping does not change without user intervention. What we can
+change is the relationship between cpu_policy=shared, cpu_allocation_ratio
+and how the CPUs are pinned to host CPUs.
+
+cpu_allocation_ratio and cpu_policy=shared
+------------------------------------------
+
+The cpu_allocation_ratio config option defines the oversubscription ratio for
+the CPUs on a host. The cpu_allocation_ratio only applies to VCPU
+inventories. Regardless of the cpu_policy of an instance, a given nova
+instance will always be mapped to at least 1 host CPU per flavor.vcpu. That
+means that if you have a host with 2 CPUs and an allocation ratio of 4.0 then
+you can have 8 instances with 1 vCPU. You could also have 4 instances with 2
+vCPUs, but not 2 instances with 4 vCPUs. In other words, an instance can
+never oversubscribe against itself; put differently, a host with 2 CPUs can
+never boot an instance with more than 2 vCPUs regardless of the
+cpu_allocation_ratio.
+
+cpu_shared_set and emulator threads
+-----------------------------------
+
+The cpu_shared_set config option is used to define the CPUs that are
+available to instances with cpu_policy=shared. It is also used to define the
+CPUs that are available to emulator threads. Overloading one option for both
+purposes is annoying, so while we are here we can fix that.
+
+While this is not strictly required, it makes the code simpler and easier to
+understand, so this spec will also add a new config option called
+cpu_overhead_set. If this config option is defined then nova will be modified
+to use the cpu_overhead_set when pinning the QEMU emulator threads (and the
+I/O threads if we add support for iothreads in the future). When
+cpu_overhead_set is not defined and cpu_shared_set is, we will fall back to
+cpu_shared_set to preserve backwards compatibility. This spec will deprecate
+that fallback so that it can be removed in a future release.
+
+The cpu_overhead_set CPUs will not be reported to placement and may not
+overlap with any of the other CPU set config options.
+
+vcpu_pin_set
+------------
+
+The vcpu_pin_set config option is deprecated for removal in a future release.
+It is not compatible with dedicated CPU resource tracking in placement, and
+none of the features described in this spec will be supported when
+vcpu_pin_set is defined.
+
+In the rest of this document we will only discuss the cpu_shared_set,
+cpu_dedicated_set, cpu_dynamic_set and cpu_overhead_set config options.
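+
+The following is a minimal sketch, not a final implementation, of how the two
+new options could be declared with oslo.config alongside the existing compute
+options. The option names match this spec; the help text, the register_opts
+hook and the assumption that they live with the other compute options are
+illustrative only.
+
+```
+from oslo_config import cfg
+
+dynamic_cpu_opts = [
+    cfg.StrOpt(
+        'cpu_dynamic_set',
+        help='Comma-separated list or range of host CPUs (e.g. "0-7,^2") '
+             'that can be used for either shared (VCPU) or dedicated (PCPU) '
+             'guest CPUs. Mutually exclusive with cpu_shared_set and '
+             'cpu_dedicated_set on the same host.'),
+    cfg.StrOpt(
+        'cpu_overhead_set',
+        help='Host CPUs used to pin QEMU emulator threads (and I/O threads '
+             'if supported in the future). Not reported to placement and '
+             'must not overlap with the other CPU set options. Falls back '
+             'to cpu_shared_set when unset (deprecated behaviour).'),
+]
+
+
+def register_opts(conf):
+    # assumed registration hook; the real change would add these options to
+    # the existing compute option group in nova.conf
+    conf.register_opts(dynamic_cpu_opts, group='compute')
+```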
+
+Dynamic CPU partitioning
+------------------------
+
+The proposed solution is to add a new config option called cpu_dynamic_set.
+This config option will be used to define the CPUs that are available to
+instance vCPUs.
+
+All CPUs in the cpu_dynamic_set will be reported to placement as both VCPU
+and PCPU inventory.
+
+e.g. if cpu_dynamic_set=0-7 then placement will report an inventory of 8
+VCPU and 8 PCPU:
+
+```
+resource_provider
+ |
+ ---- name:
+ ---- uuid:
+ |
+ ---- inventories
+       |
+       ---- vcpu
+       |     |
+       |     ---- total: 8
+       |     ---- reserved: 0
+       |     ---- min_unit: 1
+       |     ---- max_unit: 8
+       |     ---- step_size: 1
+       |     ---- allocation_ratio: 4.0
+       |
+       ---- pcpu
+             |
+             ---- total: 8
+             ---- reserved: 0
+             ---- min_unit: 1
+             ---- max_unit: 8
+             ---- step_size: 1
+```
+
+With this new feature, given cpu_dynamic_set=0-7, cpu_allocation_ratio=4.0
+and a flavor with 1 vCPU and cpu_policy=shared, if we boot an instance on the
+host then the instance will be pinned to a single host core, not to a range
+of host cores. When this core assignment is done, we will update the
+placement resource provider to reduce the max_unit of the PCPU inventory and
+set the reserved value of the PCPU inventory to 1.
+
+```
+resource_provider
+ |
+ ---- name:
+ ---- uuid:
+ |
+ ---- inventories
+ |     |
+ |     ---- vcpu
+ |     |     |
+ |     |     ---- total: 8
+ |     |     ---- reserved: 0
+ |     |     ---- min_unit: 1
+ |     |     ---- max_unit: 8
+ |     |     ---- step_size: 1
+ |     |     ---- allocation_ratio: 4.0
+ |     |
+ |     ---- pcpu
+ |           |
+ |           ---- total: 8
+ |           ---- reserved: 1
+ |           ---- min_unit: 1
+ |           ---- max_unit: 7
+ |           ---- step_size: 1
+ |
+ ---- allocations
+       |
+       ---- instance-uuid-1
+             |
+             ---- resources
+                   |
+                   ---- vcpu: 1
+```
+
+If we boot a second instance with the same flavor on the same host then the
+second instance will be pinned to the same host core as the first instance.
+This is because the first instance has already reserved the host core for
+shared use and we have not reached the cpu_allocation_ratio.
+
+```
+resource_provider
+ |
+ ---- name:
+ ---- uuid:
+ |
+ ---- inventories
+ |     |
+ |     ---- vcpu
+ |     |     |
+ |     |     ---- total: 8
+ |     |     ---- reserved: 0
+ |     |     ---- min_unit: 1
+ |     |     ---- max_unit: 8
+ |     |     ---- step_size: 1
+ |     |     ---- allocation_ratio: 4.0
+ |     |
+ |     ---- pcpu
+ |           |
+ |           ---- total: 8
+ |           ---- reserved: 1
+ |           ---- min_unit: 1
+ |           ---- max_unit: 7
+ |           ---- step_size: 1
+ |
+ ---- allocations
+       |
+       ---- instance-uuid-1
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- vcpu: 1
+       |
+       ---- instance-uuid-2
+             |
+             ---- resources
+                   |
+                   ---- vcpu: 1
+```
+
+If we have a second flavor that requests 1 vCPU with cpu_policy=dedicated,
+and we boot an instance with that flavor on the same host, then the instance
+will be pinned to a different host core. This will result in the following
+placement resource provider:
+
+```
+resource_provider
+ |
+ ---- name:
+ ---- uuid:
+ |
+ ---- inventories
+ |     |
+ |     ---- vcpu
+ |     |     |
+ |     |     ---- total: 8
+ |     |     ---- reserved: 4
+ |     |     ---- min_unit: 1
+ |     |     ---- max_unit: 7
+ |     |     ---- step_size: 1
+ |     |     ---- allocation_ratio: 4.0
+ |     |
+ |     ---- pcpu
+ |           |
+ |           ---- total: 8
+ |           ---- reserved: 1
+ |           ---- min_unit: 1
+ |           ---- max_unit: 6
+ |           ---- step_size: 1
+ |
+ ---- allocations
+       |
+       ---- instance-uuid-1
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- vcpu: 1
+       |
+       ---- instance-uuid-2
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- vcpu: 1
+       |
+       ---- instance-uuid-3
+             |
+             ---- resources
+                   |
+                   ---- pcpu: 1
+```
+
+Note that because we have allocated a host core for dedicated use, the VCPU
+max_unit is reduced to 7 and the PCPU max_unit is reduced to 6. The VCPU
+max_unit is reduced because the host core that is reserved for dedicated use
+is not available for shared use. The PCPU max_unit is reduced because neither
+the host core reserved for dedicated use nor the host core reserved for
+shared use is available for dedicated use. Since shared CPUs allow
+oversubscription, when a host core is reserved for dedicated use the VCPU
+reserved value is increased by (1 * allocation_ratio) instead of 1.
+
+If we boot two more instances with the shared CPU policy then they will be
+pinned to the same host core as the first two instances. This is because the
+first instance has already reserved that host core for shared use and we have
+not reached the cpu_allocation_ratio, so no change to the placement resource
+provider inventories is required.
+
+```
+resource_provider
+ |
+ ---- name:
+ ---- uuid:
+ |
+ ---- inventories
+ |     |
+ |     ---- vcpu
+ |     |     |
+ |     |     ---- total: 8
+ |     |     ---- reserved: 4
+ |     |     ---- min_unit: 1
+ |     |     ---- max_unit: 7
+ |     |     ---- step_size: 1
+ |     |     ---- allocation_ratio: 4.0
+ |     |
+ |     ---- pcpu
+ |           |
+ |           ---- total: 8
+ |           ---- reserved: 1
+ |           ---- min_unit: 1
+ |           ---- max_unit: 6
+ |           ---- step_size: 1
+ |
+ ---- allocations
+       |
+       ---- instance-uuid-1
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- vcpu: 1
+       |
+       ---- instance-uuid-2
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- vcpu: 1
+       |
+       ---- instance-uuid-3
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- pcpu: 1
+       |
+       ---- instance-uuid-4
+       |     |
+       |     ---- resources
+       |           |
+       |           ---- vcpu: 1
+       |
+       ---- instance-uuid-5
+             |
+             ---- resources
+                   |
+                   ---- vcpu: 1
+```
+
+The general pattern is that when an instance is booted, the resource tracker
+will select a host core for the instance and update the placement resource
+provider inventories to reflect the change.
+
+If the instance requests a shared CPU and no host cores are currently
+reserved for shared use, the resource tracker will reserve a host core for
+shared use and update the placement resource provider by reducing the PCPU
+max_unit by 1 and increasing the PCPU reserved by 1. This reflects that the
+host core is no longer available for dedicated use even though there is no
+allocation against the PCPU inventory.
+
+Similarly, if the instance requests a dedicated CPU, the resource tracker
+will reserve a host core for dedicated use and update the placement resource
+provider by reducing the VCPU max_unit by 1 and increasing the VCPU reserved
+by (1 * allocation_ratio). This reflects that the host core is no longer
+available for shared use even though there is no allocation against the VCPU
+inventory. Since each host core used as a shared core allows up to
+allocation_ratio vCPUs to be allocated, when a host core is reserved for
+dedicated use the VCPU reserved is increased by (1 * allocation_ratio)
+instead of 1.
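+
+To make the bookkeeping above concrete, here is a minimal sketch of how the
+resource tracker could recompute the two inventories from the per-core
+reservations. The function name and signature are illustrative, not existing
+Nova code; the asserts reproduce the cpu_dynamic_set=0-7,
+cpu_allocation_ratio=4.0 example above with one shared and one dedicated
+core in use.
+
+```
+def build_cpu_inventories(total_cores, dedicated_cores, shared_cores,
+                          allocation_ratio):
+    """Recompute VCPU/PCPU inventories for a dynamically partitioned host.
+
+    total_cores:     number of CPUs in cpu_dynamic_set
+    dedicated_cores: host cores currently reserved for dedicated (PCPU) use
+    shared_cores:    host cores currently reserved for shared (VCPU) use
+    """
+    return {
+        'VCPU': {
+            'total': total_cores,
+            # each core taken for dedicated use removes allocation_ratio
+            # worth of shareable capacity, hence reserved is scaled
+            'reserved': int(dedicated_cores * allocation_ratio),
+            'max_unit': total_cores - dedicated_cores,
+            'allocation_ratio': allocation_ratio,
+        },
+        'PCPU': {
+            'total': total_cores,
+            # a core reserved for shared use cannot be pinned exclusively
+            'reserved': shared_cores,
+            'max_unit': total_cores - dedicated_cores - shared_cores,
+            'allocation_ratio': 1.0,
+        },
+    }
+
+
+inv = build_cpu_inventories(8, dedicated_cores=1, shared_cores=1,
+                            allocation_ratio=4.0)
+assert inv['VCPU']['reserved'] == 4 and inv['VCPU']['max_unit'] == 7
+assert inv['PCPU']['reserved'] == 1 and inv['PCPU']['max_unit'] == 6
+```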
+
+Resource provider updates and concurrent allocation creations
+--------------------------------------------------------------
+
+As mentioned above, the placement service is not the source of truth for the
+CPU resources on a host; the resource tracker is. Today, the conversion of an
+allocation candidate into an allocation is done in the conductor, prior to
+the assignment of CPUs to the instance by the resource tracker.
+
+That means it is possible for two instances to atomically create an
+allocation in placement and then both be scheduled to the same host, where
+the assignment of CPUs may only be valid for one of the VMs. Today, when this
+happens, the locking we have in the resource tracker means the second
+instance will fail to build because the requested NUMA topology will not fit
+on the host.
+
+When we increase the reserved amount of CPUs in the placement resource
+provider we may temporarily oversubscribe the host based on allocations for
+instances that have not yet been created. This is not a problem, as the
+resource tracker will not allow the oversubscription when the instance is
+created and we will reject the build request. The conductor will then free
+the allocation in placement for the rejected instance and an alternative host
+will be selected.
+
+Atomicity of allocation creation and resource provider updates
+--------------------------------------------------------------
+
+As mentioned above, the placement service is not the source of truth for the
+CPU resources on a host. It is a cached view with incomplete knowledge that,
+within the limits of its knowledge, provides an atomic view of the capacity
+and capabilities of the host.
+
+The model proposed in this spec reflects that, given the placement service
+has incomplete knowledge and nova is a distributed system, we are operating
+in an eventually consistent model. While placement can definitively say that
+a host does not have capacity to host an instance, it cannot definitively say
+that a host does have capacity. Like a bloom filter, placement can only say
+that there is a high probability that a host has capacity to host an
+instance, based on the information it has.
+
+This spec does not attempt to address the problem of atomicity of allocation
+creation and resource provider updates, but it does attempt to ensure that it
+can be addressed in the future.
+
+One possible way to address this in the future is to add a new RPC to the
+compute node that assigns CPUs to an instance, updates the placement resource
+provider and converts the allocation candidate to an allocation in a single
+atomic call. By delegating the creation of the allocation to the compute node
+we can ensure that the allocation is only created if the resource tracker is
+able to assign CPUs to the instance. This has the disadvantage that it
+requires a new RPC to the compute node and a new placement REST API, and that
+the boot request could be delayed while we wait for the compute node to
+respond.
+
+Another possible way to address this in the future is for the assignment of
+the CPUs to be done by the conductor instead. Today the NUMA topology filter
+generates a possible assignment using a copy of the host NUMA topology and
+then discards it. Instead of discarding the possible assignment we could
+embed it in the allocation candidate and pass it to the conductor. The
+conductor could then directly update the host NUMA topology blob in the DB
+and pass the instance NUMA topology to the compute node when creating the
+instance. We would also need to extend the host NUMA topology blob to have a
+generation number so that we can detect when the host NUMA topology has
+changed on the conductor or compute node, as it is now a shared resource
+which must be updated atomically. This has the advantage that it does not
+require a new RPC to the compute node, and the placement REST API call can
+continue to be done in the conductor, reducing the latency.
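+
+As an illustration of the generation-number idea (and only that; none of the
+helpers below exist today), a compare-and-swap style update of the host NUMA
+topology blob could look roughly like this, assuming a hypothetical
+"save if the generation is unchanged" DB helper:
+
+```
+class ConcurrentUpdate(Exception):
+    """Raised when another writer bumped the generation first."""
+
+
+def claim_cpus_in_host_topology(context, host_topology, instance_topology):
+    # work on a copy so a failed claim leaves the cached blob untouched
+    proposed = host_topology.copy()
+    proposed.pin_instance(instance_topology)   # hypothetical helper
+    proposed.generation = host_topology.generation + 1
+
+    # hypothetical DB call: update the row only if the stored generation
+    # still matches the one we read, mirroring placement's resource
+    # provider generations
+    updated = db_update_host_topology_if_generation(
+        context, proposed, expected_generation=host_topology.generation)
+    if not updated:
+        # the conductor or a compute raced with us; the caller re-reads the
+        # blob and retries, or falls back to an alternate host
+        raise ConcurrentUpdate()
+    return proposed
+```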
+
+cpu_pinning, numa topology, and numa aware memory allocation
+------------------------------------------------------------
+
+The Linux kernel OOM reaper operates on a per NUMA node basis. This means
+that if a NUMA node is overcommitted and the kernel OOM reaper is invoked, it
+will kill processes on that NUMA node until the memory usage on the node is
+below the threshold. If an instance does not request NUMA aware memory
+assignment by setting hw:mem_page_size=small|large|any or an explicit page
+size, then the instance memory will not be checked against the NUMA node and
+the instance can be killed by the kernel OOM reaper.
+
+With the introduction of
+https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
+nova gained the ability to have VMs with cpu_policy=shared and
+cpu_policy=dedicated on the same host. That capability was later extended to
+allow mixing shared and dedicated CPUs within the same instance:
+https://specs.openstack.org/openstack/nova-specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html
+
+Nova, however, has never supported and still does not support mixing NUMA
+and non-NUMA instances on the same host. That means that if you want to
+support cpu_policy=shared and cpu_policy=dedicated instances on the same
+host, all VMs must have a NUMA topology.
+
+To address this we need to add a new flavor extra spec
+hw:cpu_partitioning=dynamic|static to opt into this feature. This can be
+automatically converted into a required or forbidden trait by the scheduler
+to select hosts that are configured for dynamic or static partitioning.
+
+The libvirt driver will be updated to report the appropriate trait based on
+whether cpu_dynamic_set is defined. If cpu_dynamic_set is defined then the
+libvirt driver will report the COMPUTE_CPU_PARTITIONING_DYNAMIC trait,
+otherwise it will report the COMPUTE_CPU_PARTITIONING_STATIC trait.
+
+To prevent OOM issues, hw:cpu_partitioning=dynamic will also imply
+hw:mem_page_size=small or hw:mem_page_size=any.
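+
+A minimal sketch of how the scheduler could translate the new extra spec into
+a placement trait follows. It is written in the style of nova's request
+filters, but the function itself, the trait names and the implied page size
+handling are proposals of this spec, not existing code.
+
+```
+PARTITIONING_TRAITS = {
+    'dynamic': 'COMPUTE_CPU_PARTITIONING_DYNAMIC',
+    'static': 'COMPUTE_CPU_PARTITIONING_STATIC',
+}
+
+
+def require_cpu_partitioning(ctxt, request_spec):
+    """Turn hw:cpu_partitioning=dynamic|static into a required trait."""
+    mode = request_spec.flavor.extra_specs.get('hw:cpu_partitioning')
+    if mode is None:
+        # no opt-in: behave exactly as today
+        return False
+    # add the trait to the unnumbered request group sent to placement
+    request_spec.root_required.add(PARTITIONING_TRAITS[mode])
+    if mode == 'dynamic':
+        # dynamic partitioning implies a NUMA aware memory request so the
+        # instance is protected from the per NUMA node OOM reaper
+        request_spec.flavor.extra_specs.setdefault('hw:mem_page_size', 'any')
+    return True
+```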
+
+hw:mem_page_size=any
+--------------------
+
+The hw:mem_page_size=any flavor extra spec indicates that the instance can be
+booted on a host with any page size. As a secondary effect it also indicates
+that the exact page size can be requested by the image. The API does not
+specify how the virt driver will select the page size when
+hw:mem_page_size=any is requested. Today the libvirt driver selects the
+largest page size available that can fit the instance memory. This is done to
+reduce the memory overhead of the instance by preferring hugepages. This is
+not ideal when cpu_policy=shared, as it is likely that the hugepages should
+preferentially be used by the dedicated instances. hw:mem_page_size=any
+would, however, be a better default than hw:mem_page_size=small as it allows
+the image to request the page size it wants.
+
+We could address this by adding a new page size value,
+hw:mem_page_size=prefer_smallest, which would indicate that the smallest page
+size should be preferred while still allowing the image to request a larger
+page size.
+
+We could also just change the meaning of hw:mem_page_size=any to mean that
+the smallest page size should be preferred, since the selection is not part
+of the API contract. For backwards compatibility we could make the preference
+depend on whether static or dynamic CPU partitioning is enabled: if static
+CPU partitioning is enabled then hw:mem_page_size=any will prefer hugepages;
+if dynamic CPU partitioning is enabled then hw:mem_page_size=any will prefer
+small pages.
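+
+The backwards-compatible behaviour described above could be expressed as a
+small selection helper. This is a sketch only, with an assumed signature
+(page sizes in KiB), not the actual libvirt driver code:
+
+```
+def select_page_size(available_sizes_kib, dynamic_partitioning):
+    """Pick the page size for hw:mem_page_size=any.
+
+    available_sizes_kib: page sizes on the chosen NUMA node that can fit
+    the instance memory, e.g. [4, 2048, 1048576].
+    """
+    if dynamic_partitioning:
+        # leave hugepages for dedicated instances; prefer small pages
+        return min(available_sizes_kib)
+    # static partitioning keeps today's behaviour: prefer the largest page
+    # size to reduce memory management overhead
+    return max(available_sizes_kib)
+```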
+
+numa and memory overcommit
+--------------------------
+
+The Linux kernel OOM reaper operates on a per NUMA node basis. This means
+that if a NUMA node is overcommitted and the kernel OOM reaper is invoked, it
+will kill processes on that NUMA node until the memory usage on the node is
+below the threshold. NUMA affined instances cannot safely be overcommitted,
+as the kernel OOM reaper will kill the instance if the NUMA node is
+overcommitted. This is not a new problem and is why we do not support NUMA
+and non-NUMA instances on the same host: either all instances must be NUMA
+aware or none of them can be. That means that dynamic CPU partitioning cannot
+be used with non NUMA aware instances.
+
+Alternatives
+------------
+
+Nova by design never modifies an instance without the user requesting it.
+This is to ensure that the user has full control over the instance and that
+the instance is not modified in unexpected ways. Nova is also a multi tenant
+system and we must not allow one tenant to modify the resources of another
+tenant, even indirectly, i.e. by the side effect of a nova operation on a
+resource owned by one tenant altering the resources of another tenant. If
+this were to happen it would be a security issue. If we did not have these
+constraints then we could allow shared CPUs to float as we do today and, when
+an instance is booted with the dedicated CPU policy, update the pinning of
+the shared CPU instances to ensure that the dedicated CPU instance has the
+CPUs it requires. This would violate our multi tenant design constraints and
+is not an option.
+
+
+Data model impact
+-----------------
+
+The existing host NUMA topology blob does not have the concept of mapping
+instances to CPUs directly. Instead, availability and per-instance usage are
+split into two data structures. The host NUMA topology only tracks which
+CPUs are available to instances and which ones have been used, in the context
+of pinned CPUs. The association of which instance is using a given CPU is
+tracked by the instance NUMA topology object.
+
+The correlation between the two is computed by the resource tracker and is
+not stored in the DB. This is because the scheduler only needs to know which
+CPUs are available, not which instance is using which CPUs. The resource
+tracker needs to know both, so that it can free the CPUs when the instance is
+deleted.
+
+This model works fine for static CPU partitioning but does not work for
+dynamic CPU partitioning.
+
+A cleaner model would be a single data structure that tracks both which CPUs
+are available and which instance is using which CPUs. The instance NUMA
+topology object could then be removed and the resource tracker would not need
+to compute the correlation between the host NUMA topology and the instance
+NUMA topology when freeing CPUs.
+
+To support this, the host NUMA topology blob would need to be extended to
+include the instance uuid for each CPU. This would allow the resource tracker
+to directly update the host NUMA topology blob when assigning CPUs to an
+instance and when freeing CPUs from an instance.
+
+Additionally, we need to track whether a CPU is reserved for shared or
+dedicated use, and the allocation ratio for the host. This can be done by
+modelling the assignable CPUs as "slot" objects. Each slot object would
+contain the guest CPU number and an instance uuid. Each host CPU would have n
+slots, where n is the allocation ratio. If the host CPU is assigned as a
+dedicated CPU then it will have only one slot, holding the uuid of the
+instance that is using the CPU. A sketch of how slots could be claimed is
+shown after the example below.
+
+This could look something like this:
+
+```
+host_numa_topology:
+  generation: 42
+  numa_node:
+    "0":
+      cpus:
+        "0":
+          slots:
+            "0":
+              instance_uuid: "instance-uuid-1"
+              guest_cpu: 0
+            "1":
+              instance_uuid: "instance-uuid-2"
+              guest_cpu: 0
+            "2":
+              instance_uuid: "instance-uuid-3"
+              guest_cpu: 0
+            "3":
+              instance_uuid: "instance-uuid-4"
+              guest_cpu: 0
+        "1":
+          slots:
+            "0":
+              instance_uuid: "instance-uuid-1"
+              guest_cpu: 1
+            "1":
+              instance_uuid: "instance-uuid-2"
+              guest_cpu: 1
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+        "2":
+          slots:
+            "0":
+              instance_uuid: "instance-uuid-1"
+              guest_cpu: 2
+            "1":
+              instance_uuid: "instance-uuid-2"
+              guest_cpu: 2
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+        "3":
+          slots:
+            "0":
+              instance_uuid: None
+              guest_cpu: None
+            "1":
+              instance_uuid: None
+              guest_cpu: None
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+      memory_pages:
+        "4":
+          total: 1024
+          used: 0
+          reserved: 0
+        "2048":
+          total: 1024
+          used: 0
+          reserved: 0
+        "1048576":
+          total: 1024
+          used: 0
+          reserved: 0
+    "1":
+      cpus:
+        "0":
+          slots:
+            "0":
+              instance_uuid: None
+              guest_cpu: None
+            "1":
+              instance_uuid: None
+              guest_cpu: None
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+        "1":
+          slots:
+            "0":
+              instance_uuid: None
+              guest_cpu: None
+            "1":
+              instance_uuid: None
+              guest_cpu: None
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+        "2":
+          slots:
+            "0":
+              instance_uuid: None
+              guest_cpu: None
+            "1":
+              instance_uuid: None
+              guest_cpu: None
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+        "3":
+          slots:
+            "0":
+              instance_uuid: None
+              guest_cpu: None
+            "1":
+              instance_uuid: None
+              guest_cpu: None
+            "2":
+              instance_uuid: None
+              guest_cpu: None
+            "3":
+              instance_uuid: None
+              guest_cpu: None
+      memory_pages:
+        "4":
+          total: 1024
+          used: 0
+          reserved: 0
+        "2048":
+          total: 1024
+          used: 0
+          reserved: 0
+        "1048576":
+          total: 1024
+          used: 0
+          reserved: 0
+```
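+
+To illustrate how the slot model could be used by the resource tracker, here
+is a rough sketch of claiming a slot for one guest CPU. The dict layout
+matches the example above; the function and its arguments are assumptions of
+this spec, not existing code.
+
+```
+def claim_slot(host_cpu, instance_uuid, guest_cpu, dedicated):
+    """Claim a slot on one host CPU from the topology blob shown above.
+
+    host_cpu is the dict stored under numa_node -> cpus -> "<id>".
+    Returns True if the claim succeeded, False otherwise.
+    """
+    slots = host_cpu['slots']
+    used = [s for s in slots.values() if s['instance_uuid'] is not None]
+
+    if dedicated:
+        # a dedicated claim needs a completely free core; per the model
+        # above the core is then collapsed to a single slot owned by the
+        # instance, which also removes it from the shared pool
+        if used:
+            return False
+        host_cpu['slots'] = {
+            '0': {'instance_uuid': instance_uuid, 'guest_cpu': guest_cpu},
+        }
+        return True
+
+    # a shared claim takes any free slot; a core that was claimed as
+    # dedicated has no free slots left, so it is skipped automatically
+    for slot in slots.values():
+        if slot['instance_uuid'] is None:
+            slot['instance_uuid'] = instance_uuid
+            slot['guest_cpu'] = guest_cpu
+            return True
+    return False
+```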
+
+The migration data objects will also need to be updated to reflect the new
+data model, as instead of pinning shared CPUs to a range of host CPUs we will
+be pinning shared CPUs to individual host CPUs.
+
+As the behaviour of cpu_policy=shared will vary based on whether the host is
+using static or dynamic CPU partitioning, we will need a placement trait to
+indicate which mode the host is using. This will allow the operator to select
+hosts that are using static or dynamic CPU partitioning via required or
+forbidden traits. In combination with the filtering hosts by isolating
+aggregates feature,
+https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html,
+this will allow the operator to prevent existing workloads from being moved
+to a host with dynamic CPU partitioning enabled, if desired. I.e. if the
+operator does not want to move existing workloads to a host with dynamic CPU
+partitioning enabled, they can add the required trait to the aggregate that
+the host is in and create a new flavor that has the required trait. This will
+prevent existing workloads from being moved to the host with dynamic CPU
+partitioning enabled, while still allowing new workloads to be created on it.
+
+
+REST API impact
+---------------
+
+None
+
+The existing flavor extra specs will continue to be used to define the
+cpu_policy and the same placement resource classes will be used, so no API
+changes are required.
+
+
+Security impact
+---------------
+
+None
+
+As this proposal does not change the multi tenant design constraints of nova,
+there is no security impact.
+
+
+Notifications impact
+--------------------
+
+None
+
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+This may make the scheduler slower, as it will have to consider more
+constraints in the NUMA topology filter. However, since this feature is opt
+in via the hw:cpu_partitioning=dynamic flavor extra spec, the scheduler
+impact will only be seen when the operator has opted into this feature.
+
+If we modify the host NUMA topology object and remove the instance NUMA
+topology object, the new data structure will be smaller and more efficient to
+process than the current one. This might result in a small performance
+improvement but it is unlikely to be noticeable.
+
+Moving the CPU assignment from the resource tracker to the conductor should
+not have a noticeable impact on performance, as the scheduler already does
+the CPU assignment calculation; however, it would increase the size of the
+result of the select_destinations call, so we would need to consider that
+carefully. Given that is out of scope for this spec, we will not consider it
+further.
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Upgrade impact
+--------------
+
+By default this feature will be disabled and will not impact existing
+deployments.
+
+For existing workloads, the operator will need to create new flavors with
+hw:cpu_partitioning=dynamic and resize existing instances to the new flavors
+if they want to use this feature.
+
+For new workloads, the operator will need to create new flavors with
+hw:cpu_partitioning=dynamic and define the cpu_dynamic_set config option to
+enable this feature. Enabling it in place on hosts with existing workloads
+will not be supported.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Who is leading the writing of the code? Or is this a blueprint where you're
+throwing it out there to see who picks it up?
+
+If more than one person is working on the implementation, please designate
+the primary author and contact.
+
+Primary assignee:
+
+
+Other contributors:
+
+
+Feature Liaison
+---------------
+
+Feature liaison:
+  sean-k-mooney
+
+Work Items
+----------
+
+To be filled in when this spec is moved out of the backlog and proposed for a
+specific release.
+
+
+Dependencies
+============
+
+None, unless we first want to add a placement REST API to update an
+allocation and a resource provider in a single atomic call, or change how
+CPUs are assigned to instances so that it is done in the conductor instead of
+the resource tracker.
+
+
+Testing
+=======
+
+This needs at least unit and functional tests. We might also be able to test
+this in tempest using the serial tests feature to ensure that no other tests
+are running at the same time.
+
+Documentation Impact
+====================
+
+We need to document all the restrictions and the change in behaviour for
+cpu_policy=shared, i.e. that such vCPUs will no longer float and will be
+pinned instead. This is technically not a change in the API contract, but it
+is a change in the libvirt driver internals that some may not be aware of.
+
+References
+==========
+
+None
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to
+describe new design, API or any database schema updated. Useful to let the
+reader understand what has happened over time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - 2024.1 Caracal
+     - Introduced