enable nova to dynamically partition cpus
This change introduces a backlog spec to explore extending nova to support
dynamically partitioning host cpus based on workload needs without updating
existing instances.

Change-Id: I95cd0ed5f10f1ea84cb8edb09015be07f95e836c

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=========================================
Dynamic CPU Pinning for libvirt instances
=========================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Nova currently supports static partitioning of CPUs for instances.
This blueprint proposes to add support for dynamic cpu pinning for
instances with the shared and dedicated CPU policies.


Problem description
===================

Currently it is possible to divide the cpu resources on a host into shared and
dedicated cpus. The shared cpus are available to all instances on the host
(either for vcpus or emulator threads) and the dedicated cpus are available
only to instances with the dedicated cpu policy. Shared cpus are declared via
the cpu_shared_set config option and dedicated cpus are declared via the
cpu_dedicated_set config option.

While this works well for static partitioning of cpus, it does not work well
for dynamic partitioning of cpus. For example, if a host has 8 cpus, 4 of them
dedicated to instances with the dedicated cpu policy and the remaining 4
available to all instances with the shared cpu policy, we can have
underutilisation of the platform. If there are 4 instances with the dedicated
cpu policy, each with 1 vcpu, then the remaining 4 cpus are not available to
further instances with the dedicated cpu policy even if they are idle.
Conversely, if we have 4 instances with the shared cpu policy, each with
1 vcpu, then the dedicated cpus are idle and not available to the instances
with the shared cpu policy. This can lead to a bin packing problem where we
have underutilisation of the platform.

Use Cases
---------

As an operator, I want to be able to dynamically partition the cpus on a host
into shared and dedicated cpus so that I can maximise the utilisation of the
platform based on the workload requirements.

As an operator, I want to be able to use dynamic cpu partitioning without
having to modify my existing flavor definitions. I want to be able to use the
existing flavor definitions and have the system partition the cpus based on
the workload, so that existing workloads can benefit from dynamic cpu
partitioning if moved to a host with dynamic cpu partitioning enabled.

As an operator, when using dynamic cpu partitioning, I want unused cpus to be
able to use the recently added cpu_state power management feature so that idle
cpus will be put into a low power state.

Proposed change
===============

High level design constraints:

* No new resource classes will be added for dynamic cpu partitioning. The
  existing VCPU and PCPU resource classes will be used.
* The existing cpu_shared_set and cpu_dedicated_set config options will not be
  used for dynamic cpu partitioning. Instead, a new config option called
  cpu_dynamic_set will be used. This will enable coexistence of static and
  dynamic cpu partitioning on the same or a different host in the future;
  coexistence on the same host is out of scope of this blueprint.
* The current roles of placement, the resource tracker and the scheduler will
  not change. The resource tracker will continue to be the source of truth for
  the cpu resources on a host. The placement service will continue to track
  the cpu capacity on a host using the VCPU and PCPU resource classes.
  The scheduler will continue to select a host for an instance based on the
  cpu resources on the host, but the assignment of cpus to an instance will be
  done by the resource tracker.

The role of placement in selecting a host
------------------------------------------

While it may seem from an outside perspective that the placement service is
the single source of truth for the cpu resources on a host, this is not, and
has never been, the case. As with all other resources, the placement service
acts as a consistent, atomic, distributed cache of a summary view of the
resources on a host.

Placement is not the source of truth for the cpu resources on a host. The
resource tracker is the source of truth for the cpu resources on a host and is
responsible for updating the placement service with the capacity and
capabilities of the host.

As placement is not aware of the topology or assignment of cpus to instances
on a host, it is not possible for placement to select a host for an instance
based on the cpu resources on the host with respect to numa affinity, cpu
pinning or other affinity requirements. Placement's role is to find hosts that
have enough capacity to host a VM; the topology considerations are enforced by
the scheduler and the resource tracker.

Put concretely, the placement service today can only say, "this host has
8 vcpus and the instance requires 4 vcpus, therefore it has capacity to host
the instance." It cannot say "this host has 8 vcpus and the instance requires
4 vcpus, therefore it can use cpus 0,1,2,3 on the host to host the instance."

This spec does not change the role of placement in selecting a host for an
instance. Placement will only be aware of the total number of vCPUs and pCPUs
on a host, and the assignment of cpus to instances will be done by the
resource tracker. The implication of this is that, as is the case today, it
will be possible for 2 VMs to atomically create an allocation in placement and
then both be scheduled to the same host, where the assignment of cpus may only
be valid for one of the VMs.

This is not a new problem and is not introduced by this spec. It is a problem
that already exists today and could be mitigated by introducing a new rpc to
the compute node and a new placement rest api to allow updating an allocation
in placement and a resource provider in a single atomic call. Doing that is
out of scope of this spec.

The meaning of cpu_policy=shared and cpu_policy=dedicated
----------------------------------------------------------

Contrary to how we commonly think of the meaning of cpu_policy=shared and
cpu_policy=dedicated, the meaning of these cpu policies is not that the cpus
are floating or pinned. The cpu dedicated policy is often referred to as cpu
pinning. While it is true that the dedicated policy does pin cpus to
instances, that is a side effect of the policy and not its meaning. Similarly,
the cpu shared policy does not mean that the cpus are floating. The cpu shared
policy means that the cpus are shared with other instances and not reserved
for this instance.

cpu_policy=shared and cpu_policy=dedicated are not the only cpu policies; nova
also supports cpu_policy=mixed. The cpu_policy=mixed policy is a combination
of the shared and dedicated policies, where some cpus are mapped to the
cpu_shared_set and some cpus are mapped to the cpu_dedicated_set.

Since the original introduction of the vcpu_pin_set and cpu_shared_set config
options, if either is defined all instances with cpu_policy=shared (or unset)
will be pinned to the cpus in the vcpu_pin_set/cpu_shared_set.
The difference between cpu_policy=shared and cpu_policy=dedicated is that, for
each vcpu in the instance, cpu_policy=shared will pin that vcpu to the range
of cpus defined by the cpu_shared_set or the vcpu_pin_set, while
cpu_policy=dedicated will pin each vcpu to a single cpu defined by the
cpu_dedicated_set/vcpu_pin_set.

This is important to understand as it means that cpu_policy=shared vcpus are
not unpinned; they are just not pinned 1:1, and other instances can share the
same cores up to the cpu_allocation_ratio.

Mechanically we cannot change the meaning of cpu_policy=shared and
cpu_policy=dedicated (shared with other instances, and dedicated to this
instance) as that would break backwards compatibility. Nor can we change the
fact that cpu_policy=dedicated pins the instance to specific cpus and does not
change that assignment without user intervention. What we can change is the
relationship between cpu_policy=shared, the cpu_allocation_ratio, and how the
cpus are pinned to host cpus.

cpu_allocation_ratio and cpu_policy=shared
-------------------------------------------

The cpu_allocation_ratio config option is used to define the over subscription
ratio for the cpus on a host. The cpu_allocation_ratio only applies to VCPU
inventories. Regardless of the cpu_policy of an instance, a given nova
instance will always be mapped to at least 1 host cpu per flavor vcpu. That
means that if you have a host with 2 cpus and an allocation ratio of 4.0 then
you can have 8 instances with 1 vcpu. You could also have 4 instances with
2 vcpus, but not 2 instances with 4 vcpus. In other words, an instance can
never over subscribe against itself. Put a different way, if a host has 2 cpus
it can never boot an instance with more than 2 vcpus regardless of the
cpu_allocation_ratio.

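That rule can be summarised with a small illustrative sketch (plain Python,
not nova code; all names are made up for the example):

```
def fits(flavor_vcpus, host_cpus, allocation_ratio, vcpus_already_allocated):
    """Check whether a shared-cpu instance can fit on a host.

    An instance may never oversubscribe against itself, so it can never
    request more vcpus than the host has physical cpus, even if the
    allocation ratio would otherwise allow it.
    """
    if flavor_vcpus > host_cpus:
        return False
    # Oversubscription only applies across instances, up to the ratio.
    return vcpus_already_allocated + flavor_vcpus <= host_cpus * allocation_ratio


# A 2 cpu host with a 4.0 ratio can take eight 1 vcpu instances ...
assert fits(1, 2, 4.0, 7)
# ... but can never take a single 4 vcpu instance.
assert not fits(4, 2, 4.0, 0)
```
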
cpu_shared_set and emulator threads
-----------------------------------

The cpu_shared_set config option is used to define the cpus that are available
to instances with cpu_policy=shared. The cpu_shared_set config option is also
used to define the cpus that are available to emulator threads. This is
inconvenient, so while we are here we can fix it.

While not strictly required, it makes the code simpler and easier to
understand, so this spec will also add a new config option called
cpu_overhead_set. If this config option is defined then nova will be modified
to use the cpu_overhead_set when pinning the qemu emulator thread (and I/O
threads, should we add support for iothreads in the future). When
cpu_overhead_set is not defined and cpu_shared_set is, we will fall back to
cpu_shared_set to preserve backwards compatibility. This spec will deprecate
that fallback so that it can be removed in a future release.

The cpu_overhead_set cpus will not be reported to placement and may not
overlap with the other cpu set config options.

vcpu_pin_set
------------

The vcpu_pin_set config option is deprecated for removal in a future release.
It is not compatible with dedicated cpu resource tracking in placement, and
none of the features described in this spec will be supported when
vcpu_pin_set is defined. In the rest of the document we will only discuss the
cpu_shared_set, cpu_dedicated_set, cpu_dynamic_set and cpu_overhead_set config
options.

Dynamic cpu partitioning
------------------------

The proposed solution is to add a new config option called cpu_dynamic_set.
This config option will be used to define the cpus that are available to
instance vcpus.

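As a sketch, a host opted into this feature might be configured roughly as
follows. cpu_dynamic_set and cpu_overhead_set are the new options proposed by
this spec (they do not exist today), and placing them in the ``[compute]``
group alongside the existing cpu_shared_set and cpu_dedicated_set options is
an assumption of this example:

```
[DEFAULT]
cpu_allocation_ratio = 4.0

[compute]
# proposed option: cpus usable for instance vcpus (shared or dedicated)
cpu_dynamic_set = 0-7
# proposed option: cpus reserved for emulator/io threads
cpu_overhead_set = 8-9
```
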
All cpus in the cpu_dynamic_set will be reported to placement as both VCPUs
and PCPUs. For example, if cpu_dynamic_set=0-7 then placement will report
8 VCPUs and 8 PCPUs:

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
     |
     ---- vcpu
     |    |
     |    ---- total: 8
     |    ---- reserved: 0
     |    ---- min_unit: 1
     |    ---- max_unit: 8
     |    ---- step_size: 1
     |    ---- allocation_ratio: 4.0
     |
     ---- pcpu
          |
          ---- total: 8
          ---- reserved: 0
          ---- min_unit: 1
          ---- max_unit: 8
          ---- step_size: 1
```

With this new feature. If cpu_dynamic_set=0-7 and cpu_allocation_ratio=4.0 then
|
||||
given a flaovr with 1 vcpu and cpu_policy=shared, if we boot an instance on the host
|
||||
with the cpu_dynamic_set=0-7 then the instance will be pinned to a single host core.
|
||||
i.e. the instance will be pinned to a single host core and not a range of host cores.
|
||||
when this core assignment is done, we will update the placement resource provider
|
||||
reduce the max_unit of the pcpu and set the reserved value of the pcpu to 1.
|
||||
|
||||
```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 0
|    |    ---- min_unit: 1
|    |    ---- max_unit: 8
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 7
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
          |
          ---- resources
               |
               ---- vcpu: 1
```

If we boot a second instance with the same flavor on the same host then the second
|
||||
will be pinned to the same host core as the first instance. This is because the first
|
||||
instance has already reserved the host core for shared use and we have not reached
|
||||
the cpu_allocation_ratio.
|
||||
|
||||
```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 0
|    |    ---- min_unit: 1
|    |    ---- max_unit: 8
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 7
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-2
          |
          ---- resources
               |
               ---- vcpu: 1
```

If we have a second flavor that requests 1 vcpu with cpu_policy=dedicated and
we boot an instance with that flavor on the same host, then the instance will
be pinned to a different host core. This will result in the following
placement resource provider:

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 4
|    |    ---- min_unit: 1
|    |    ---- max_unit: 7
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 6
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-2
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-3
          |
          ---- resources
               |
               ---- pcpu: 1
```

Note that because we have allocated a host core for dedicated use, the VCPU
max_unit is reduced to 7 and the PCPU max_unit is reduced to 6. The VCPU
max_unit is reduced because the host core that is reserved for dedicated use
is not available for shared use. The PCPU max_unit is reduced because neither
the host core reserved for dedicated use nor the host core reserved for shared
use is available for dedicated use. Given that shared cpus allow over
subscription, when a host core is reserved for dedicated use the VCPU reserved
value is increased by (1 * allocation_ratio) instead of 1.

If we boot two more instances with the shared cpu policy then they will be
pinned to the same host core as the first two shared instances. This is
because that host core is already reserved for shared use and we have not
reached the cpu_allocation_ratio, so no change to the placement resource
provider inventories is required:

```
resource_provider
|
---- name: <hostname>
---- uuid: <uuid>
|
---- inventories
|    |
|    ---- vcpu
|    |    |
|    |    ---- total: 8
|    |    ---- reserved: 4
|    |    ---- min_unit: 1
|    |    ---- max_unit: 7
|    |    ---- step_size: 1
|    |    ---- allocation_ratio: 4.0
|    |
|    ---- pcpu
|         |
|         ---- total: 8
|         ---- reserved: 1
|         ---- min_unit: 1
|         ---- max_unit: 6
|         ---- step_size: 1
|
---- allocations
     |
     ---- instance-uuid-1
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-2
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-3
     |    |
     |    ---- resources
     |         |
     |         ---- pcpu: 1
     |
     ---- instance-uuid-4
     |    |
     |    ---- resources
     |         |
     |         ---- vcpu: 1
     |
     ---- instance-uuid-5
          |
          ---- resources
               |
               ---- vcpu: 1
```

The general pattern is that when an instance is booted, the resource tracker
will select a host core for the instance and update the placement resource
provider inventories to reflect the change.

If the instance requests a shared cpu and no host cores are currently reserved
for shared use, then the resource tracker will reserve a host core for shared
use and update the placement resource provider by reducing the PCPU max_unit
by 1 and increasing the PCPU reserved value by 1. This reflects that the host
core is no longer available for dedicated use even though there is no
allocation against the PCPU inventory.

Similarly, if the instance requests a dedicated cpu, the resource tracker will
reserve a host core for dedicated use and update the placement resource
provider by reducing the VCPU max_unit by 1 and increasing the VCPU reserved
value by (1 * allocation_ratio). This reflects that the host core is no longer
available for shared use even though there is no allocation against the VCPU
inventory. Since each host core used as a shared core allows up to
allocation_ratio vcpus to be allocated, when a host core is reserved for
dedicated use the VCPU reserved value is increased by (1 * allocation_ratio)
instead of 1.

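A minimal sketch of that bookkeeping, in Python. This is illustrative only:
the reserve_core helper and the inventory dicts are hypothetical, and the
dedicated branch also drops the PCPU max_unit so that the numbers match the
worked example above:

```
def reserve_core(inventories, use, allocation_ratio):
    """Update VCPU/PCPU inventory fields when a host core is newly reserved.

    `inventories` maps a resource class name ("VCPU"/"PCPU") to a dict with
    `reserved` and `max_unit` keys, mirroring the placement inventory fields.
    """
    vcpu, pcpu = inventories["VCPU"], inventories["PCPU"]
    if use == "shared":
        # The core now backs oversubscribed vcpus, so it can no longer be
        # handed out as a dedicated pcpu.
        pcpu["reserved"] += 1
        pcpu["max_unit"] -= 1
    elif use == "dedicated":
        # The core is pinned 1:1, removing allocation_ratio worth of vcpu
        # capacity and one pcpu from the largest possible request.
        vcpu["reserved"] += int(1 * allocation_ratio)
        vcpu["max_unit"] -= 1
        pcpu["max_unit"] -= 1
    return inventories


inv = {"VCPU": {"reserved": 0, "max_unit": 8},
       "PCPU": {"reserved": 0, "max_unit": 8}}
reserve_core(inv, "shared", 4.0)     # pcpu: reserved 1, max_unit 7
reserve_core(inv, "dedicated", 4.0)  # vcpu: reserved 4, max_unit 7; pcpu: max_unit 6
```
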
resource provider updates and concurrent allocation creations
---------------------------------------------------------------

As mentioned above, the placement service is not the source of truth for the
cpu resources on a host; the resource tracker is. When we convert an
allocation_candidate to an allocation today, that is done in the conductor
prior to the assignment of cpus to the instance by the resource tracker.

That means it is possible for two instances to atomically create an allocation
in placement and then both be scheduled to the same host, where the assignment
of cpus may only be valid for one of the VMs. Today when this happens, because
of the locking we have in the resource tracker, the second instance will fail
to build because the requested numa topology will not fit on the host.

When we increase the reserved amount of cpus in the placement resource
provider we may temporarily over subscribe the host based on allocations for
instances that have not yet been created. This is not a problem, as the
resource tracker will not allow the over subscription when the instance is
created and will reject the build request. The conductor will then free the
allocation in placement for the rejected instance and an alternative host will
be selected.

Atomicity of allocation creation and resource provider updates
---------------------------------------------------------------

As mentioned above, the placement service is not the source of truth for the
cpu resources on a host. It is a cached view with incomplete knowledge that,
within the limits of that knowledge, provides an atomic view of the capacity
and capabilities of the host.

The model proposed in this spec reflects the fact that, since the placement
service has incomplete knowledge and nova is a distributed system, we are
operating in an eventually consistent model. While placement can definitively
say that a host does not have capacity to host an instance, it cannot
definitively say that a host has capacity to host an instance. Like a bloom
filter, placement can only say that there is a high probability that a host
has capacity to host an instance based on the information it has.

This spec does not attempt to address the problem of atomicity of allocation
creation and resource provider updates, but it does attempt to ensure that it
can be addressed in the future.

One possible way to address this in the future is to add a new rpc to the
compute node to assign cpus to an instance, update the placement resource
provider and convert the allocation candidate to an allocation in a single
atomic call. By delegating the creation of the allocation to the compute node
we can ensure that the allocation is only created if the resource tracker is
able to assign cpus to the instance. This has the disadvantage that it
requires a new rpc to the compute node and a new placement rest api, and the
placement request could be delayed while we wait for the compute node to
respond.

Another possible way to address this in the future is for the assignment of
the cpus to be done by the conductor instead. Today the numa topology filter
generates a possible assignment using a copy of the host numa topology and
then discards it. Instead of discarding the possible assignment we could embed
it in the allocation candidate and pass it to the conductor. The conductor
could then directly update the host numa topology blob in the db and pass the
instance numa topology to the compute node when creating the instance. We
would also need to extend the host numa topology blob to have a generation
number so that we can detect when the host numa topology has changed on the
conductor or compute node, as it is now a shared resource which must be
updated atomically. This has the advantage that it does not require a new rpc
to the compute node, and the placement rest api call can continue to be done
in the conductor, reducing latency.

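A sketch of how such a generation check could work. The helper names are
hypothetical; the pattern simply mirrors the optimistic, generation-based
updates placement already uses for resource providers:

```
class ConcurrentUpdate(Exception):
    """Raised when the host numa topology changed under us."""


def apply_assignment(db, host, assignment, expected_generation):
    """Write a cpu assignment only if nobody else updated the topology.

    `db.update_host_numa_topology` is a hypothetical conditional update
    that only succeeds when the stored generation still matches.
    """
    updated = db.update_host_numa_topology(
        host=host,
        new_topology=assignment.host_topology,
        expected_generation=expected_generation,
    )
    if not updated:
        # Another conductor or the compute node won the race; the caller
        # should refetch the topology and recompute the assignment.
        raise ConcurrentUpdate(host)
    return expected_generation + 1
```
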
cpu_pinning, numa topology, and numa aware memory allocation
--------------------------------------------------------------

The linux kernel OOM killer operates on a per numa node basis. This means that
if a numa node is over committed and the OOM killer is invoked, it will kill
processes on the numa node until the memory usage on that node is below the
threshold. If an instance does not request numa aware memory assignment by
setting hw:mem_page_size=small|large|any|<pagesize> then the instance memory
will not be checked against the numa node and the instance can be killed by
the OOM killer.

With the introduction of
https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
nova gained the ability to have VMs with cpu_policy=shared and
cpu_policy=dedicated on the same host. That capability was later extended to
allow mixing shared and dedicated cpus in the same instance:
https://specs.openstack.org/openstack/nova-specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html

Nova however has never supported, and still does not support, mixing numa and
non numa instances on the same host. That means if you want to support
cpu_policy=shared and cpu_policy=dedicated instances on the same host, all VMs
must have a numa topology.

To address this we need to add a new flavor extra spec
hw:cpu_partitioning=dynamic|static to opt into this feature. This can be
automatically converted into a required or forbidden trait by the scheduler to
select hosts that are configured for dynamic or static partitioning.

The libvirt driver will be updated to report the appropriate trait based on
whether the cpu_dynamic_set is defined. If cpu_dynamic_set is defined then the
libvirt driver will report the COMPUTE_CPU_PARTITIONING_DYNAMIC trait;
otherwise it will report the COMPUTE_CPU_PARTITIONING_STATIC trait.

To prevent OOM issues, hw:cpu_partitioning=dynamic will also imply
hw:mem_page_size=small or hw:mem_page_size=any.

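As an illustration, an operator opting a flavor into this behaviour might do
something like the following. hw:cpu_partitioning is the extra spec proposed
above and does not exist today; the flavor name and sizes are made up:

```
openstack flavor create --vcpus 2 --ram 2048 --disk 20 dyn.small
openstack flavor set dyn.small \
    --property hw:cpu_policy=shared \
    --property hw:cpu_partitioning=dynamic \
    --property hw:mem_page_size=any
```
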
hw:mem_page_size=any
--------------------

The hw:mem_page_size=any flavor extra spec is used to indicate that the
instance can be booted on a host with any page size. As a secondary effect it
also allows the exact page size to be requested by the image. The api does not
specify how the virt driver will select the page size when
hw:mem_page_size=any is requested. Today the libvirt driver will select the
largest page size available that can fit the instance memory. This is done to
reduce the memory overhead of the instance by preferring hugepages. This is
not ideal when cpu_policy=shared, as it is likely that the hugepages should be
preferentially used by the dedicated instances. hw:mem_page_size=any would
however be a better default than hw:mem_page_size=small as it allows the image
to request the page size it wants.

We could address this by adding a new mem_page_size value,
hw:mem_page_size=prefer_smallest, which would indicate that the smallest page
size should be preferred but still allow the image to request a larger page
size.

We could also just change the meaning of hw:mem_page_size=any to mean that the
smallest page size should be preferred, since the selection is not part of the
api contract. For backwards compatibility we could make the preference depend
on whether static or dynamic cpu partitioning is enabled: if static cpu
partitioning is enabled then hw:mem_page_size=any will prefer hugepages; if
dynamic cpu partitioning is enabled then hw:mem_page_size=any will prefer
small pages.

numa and memory overcommit
--------------------------

The linux kernel OOM killer operates on a per numa node basis: if a numa node
is over committed and the OOM killer is invoked, it will kill processes on
that numa node until its memory usage is below the threshold. Numa affined
instances therefore cannot safely be over committed, as the OOM killer will
kill the instance if the numa node is over committed. This is not a new
problem and is why we do not support numa and non numa instances on the same
host: either all instances must be numa aware or none of them can be. That
means that dynamic cpu partitioning cannot be used with non numa aware
instances.

Alternatives
------------

Nova by design never modifies an instance without the user requesting it. This
is to ensure that the user has full control over the instance and that the
instance is not modified in unexpected ways. Nova is also a multi tenant
system, and we must not allow one tenant to modify the resources of another
tenant even indirectly, i.e. by the side effect of a nova operation on a
resource owned by one tenant altering the resources of another tenant. If this
were to happen it would be a security issue. If we did not have these
constraints then we could allow shared cpus to float as we do today and, when
an instance is booted with the dedicated cpu policy, update the pinning of the
shared cpu instances to ensure that the dedicated cpu instance has the cpus it
requires. This would violate our multi tenant design constraints and is not an
option.


Data model impact
-----------------

The existing host numa topology blob does not have the concept of mapping
instances to cpus directly. Instead we have split the availability and the
usage into two data structures: the host numa topology just tracks which cpus
are available to instances and which of them have been used as pinned cpus,
while the association of which instance is using a given cpu is tracked by the
instance numa topology object.

The correlation between the two is computed by the resource tracker and is not
stored in the db. This is because the scheduler only needs to know which cpus
are available, not which instance is using which cpus. The resource tracker
needs to know both which cpus are available and which instance is using which
cpus so that it can free the cpus when the instance is deleted.

This model works fine for static cpu partitioning but does not work for
dynamic cpu partitioning.

A cleaner model would be to have a single data structure that tracks which
cpus are available and which instance is using which cpus. The instance numa
topology object could then be removed and the resource tracker would not need
to compute the correlation between the host numa topology and the instance
numa topology when freeing cpus.

To support this, the host numa topology blob would need to be extended to
include the instance uuid for each cpu. This would allow the resource tracker
to directly update the host numa topology blob when assigning cpus to an
instance and when freeing cpus from an instance.

Additionally we need to track whether a cpu is reserved for shared or
dedicated use, and the allocation ratio for the host. This can be done by
modeling the assignable cpus as "slot" objects. Each slot object would contain
the guest cpu number and an instance uuid. Each host cpu would have n slots,
where n is the allocation ratio. If the host cpu is assigned as a dedicated
cpu then it will have only one slot, holding the instance uuid of the instance
that is using the cpu.

This could look something like this:

```
host_numa_topology:
  generation: 42
  numa_node:
    "0":
      cpus:
        "0":
          slots:
            "0":
              instance_uuid: "instance-uuid-1"
              guest_cpu: 0
            "1":
              instance_uuid: "instance-uuid-2"
              guest_cpu: 0
            "2":
              instance_uuid: "instance-uuid-3"
              guest_cpu: 0
            "3":
              instance_uuid: "instance-uuid-4"
              guest_cpu: 0
        "1":
          slots:
            "0":
              instance_uuid: "instance-uuid-1"
              guest_cpu: 1
            "1":
              instance_uuid: "instance-uuid-2"
              guest_cpu: 1
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "2":
          slots:
            "0":
              instance_uuid: "instance-uuid-1"
              guest_cpu: 2
            "1":
              instance_uuid: "instance-uuid-2"
              guest_cpu: 2
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "3":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
      memory_pages:
        "4":
          total: 1024
          used: 0
          reserved: 0
        "2048":
          total: 1024
          used: 0
          reserved: 0
        "1048576":
          total: 1024
          used: 0
          reserved: 0
    "1":
      cpus:
        "0":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "1":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "2":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
        "3":
          slots:
            "0":
              instance_uuid: None
              guest_cpu: None
            "1":
              instance_uuid: None
              guest_cpu: None
            "2":
              instance_uuid: None
              guest_cpu: None
            "3":
              instance_uuid: None
              guest_cpu: None
      memory_pages:
        "4":
          total: 1024
          used: 0
          reserved: 0
        "2048":
          total: 1024
          used: 0
          reserved: 0
        "1048576":
          total: 1024
          used: 0
          reserved: 0
```

The migration data objects may also need to be updated to reflect the new data
model, as instead of pinning shared cpus to a range of host cpus we will be
pinning shared cpus to individual host cpus.

As the behavior of cpu_policy=shared will vary based on whether the host is
using static or dynamic cpu partitioning, we will need a new placement trait
to indicate which mode the host is using. This will allow the operator to
select hosts that are using static or dynamic cpu partitioning using required
or forbidden traits. In combination with the filtering hosts by isolating
aggregates feature,
https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html,
this will allow the operator to prevent existing workloads from being moved to
a host with dynamic cpu partitioning enabled if desired. I.e. if the operator
does not want to move existing workloads to a host with dynamic cpu
partitioning enabled, they can add the required trait to the aggregate that
the host is in and create a new flavor that has the required trait. This will
prevent existing workloads from being moved to the host with dynamic cpu
partitioning enabled while allowing new workloads to be created on it.


REST API impact
---------------

None

The existing flavor extra specs will continue to be used to define the
cpu_policy, and the same placement resource classes will be used, so no api
changes are required.


Security impact
---------------

None

As this proposal does not change the multi tenant design constraints of nova,
there is no security impact.


Notifications impact
--------------------

None


Other end user impact
---------------------

None

Performance Impact
------------------

This will potentially make the scheduler slower as it will have to consider
more constraints in the numa topology filter. However, since this feature will
be opted into via the hw:cpu_partitioning=dynamic flavor extra spec, the
scheduler impact will only be seen when the operator has opted into this
feature.

If we modify the numa topology data model and remove the instance numa
topology object, the new data structure will be smaller and more efficient to
process than the current one. This might result in a small performance
improvement, but it is unlikely to be noticeable.

Moving the cpu assignment from the resource tracker to the conductor should
not have a noticeable impact on performance, as the scheduler is already doing
the cpu assignment calculation; however it will increase the size of the
result of the select_destinations call, so we would need to consider that
carefully. Given that is out of scope for this spec, we will not consider it
further.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

By default this feature will be disabled and will not impact existing
deployments.

For existing workloads, the operator will need to create a new flavor with
hw:cpu_partitioning=dynamic and resize existing instances to the new flavor if
they want to use this feature.

For new workloads, the operator will need to create a new flavor with
hw:cpu_partitioning=dynamic and define the cpu_dynamic_set config option to
enable this feature. This will not be supported in place on hosts with
existing workloads.

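For illustration, moving an existing instance onto such a flavor would be an
ordinary resize, followed by the usual resize confirmation (flavor and server
names here are made up):

```
openstack server resize --flavor dyn.small my-existing-server
```

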
Implementation
==============

Assignee(s)
-----------

Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the
primary author and contact.

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Feature Liaison
---------------

Feature liaison:
  sean-k-mooney

Work Items
----------

To be filled in in a later revision of this spec.


Dependencies
============

None, unless we first want to add a placement rest api to update an allocation
and a resource provider in a single atomic call, or change how cpus are
assigned to instances so that it is done in the conductor instead of the
resource tracker.


Testing
=======

This needs at least unit and functional tests. We might also be able to test
this in tempest, using the serial tests feature to ensure that no other tests
are running at the same time.


Documentation Impact
====================

We need to document all the restrictions and the change in behavior for
cpu_policy=shared, i.e. that such cpus will not float and will be pinned
instead. This is technically not a change in the api contract, but it is a
change in the libvirt driver internals that some may not be aware of.


References
==========

None


History
=======

Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2024.1 Caracal
     - Introduced