diff --git a/specs/newton/approved/resource-providers.rst b/specs/newton/approved/resource-providers.rst
new file mode 100644
index 000000000..e4b07306a
--- /dev/null
+++ b/specs/newton/approved/resource-providers.rst
@@ -0,0 +1,387 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================
Resource Providers - Base Models
================================

https://blueprints.launchpad.net/nova/+spec/resource-providers

This blueprint partially addresses the problem of Nova assuming all resources
are provided by a single compute node. It introduces a new concept -- a
resource provider -- that allows Nova to accurately track and reserve
resources regardless of whether the resource is exposed by a single compute
node, a shared resource pool or an external resource-providing service.

.. note:: The majority of the work described here was completed in Mitaka.
   The single remaining work item is the creation of the `AllocationItem`
   object.

Problem description
===================

Within a cloud deployment, there are a number of resources that may be
consumed by a user. Some resource types are provided by a compute node; these
include CPU, memory, PCI devices and local ephemeral disk. Other types of
resources, however, are not provided by a compute node but by some external
resource pool, for example a shared storage pool such as Ceph or an NFS share.

Unfortunately, for legacy reasons, Nova only thinks of resources as being
provided by a compute node. The tracking of resources assumes that it is the
compute node that provides the resource, and therefore when reporting usage of
certain resources, Nova naively calculates usage and availability by simply
summing amounts across all compute nodes in its database. This causes a number
of problems [1] where usage and capacity amounts are incorrect.

Use Cases
---------

As a deployer that has chosen to use a shared storage solution for storing
instance ephemeral disks, I want Nova and Horizon to report the correct usage
and capacity information.

Proposed change
===============

We propose to introduce new database tables and object models in Nova that
store inventory/capacity information for generic providers of various
resources, along with a table structure that can store usage/allocation
information against that inventory.

**This blueprint intentionally does NOT insert records into these new database
tables**. The tables will be populated by the work in the follow-up
`compute-node-inventory`, `compute-node-allocations`, and
`generic-resource-pools` blueprints.

We are also going to need a lookup table for the IDs of the various resource
providers in the system.
We'll call this lookup table `resource_providers`::

    CREATE TABLE resource_providers (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        uuid CHAR(36) NOT NULL,
        name VARCHAR(200) CHARACTER SET utf8 NOT NULL,
        generation INT NOT NULL,
        can_host INT NOT NULL,
        UNIQUE INDEX (uuid)
    );

The `generation` and `can_host` fields are internal implementation fields.
They respectively allow for atomic allocation operations and tell the
scheduler whether the resource provider can be a destination for an instance
to land on (a shared resource pool can never be the target of an instance).

An `inventories` table records the amount of a particular resource that is
provided by a particular resource provider::

    CREATE TABLE inventories (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        resource_provider_id INT UNSIGNED NOT NULL,
        resource_class_id INT UNSIGNED NOT NULL,
        total INT UNSIGNED NOT NULL,
        reserved INT UNSIGNED NOT NULL,
        min_unit INT UNSIGNED NOT NULL,
        max_unit INT UNSIGNED NOT NULL,
        step_size INT UNSIGNED NOT NULL,
        allocation_ratio FLOAT NOT NULL,
        INDEX (resource_provider_id),
        INDEX (resource_class_id)
    );

The `reserved` field shall store the amount of the resource that the provider
"sets aside" for unmanaged consumption. By "unmanaged", we refer here to Nova
(or the eventual broken-out scheduler) not being involved in the allocation of
some of the resources from the provider. As an example, a compute node may
want to reserve some amount of RAM for use by the host itself, and therefore
reduce the amount of RAM that it advertises as capacity. As another example,
imagine a shared resource pool that has some amount of disk space consumed by
things other than Nova instances. Or consider a Neutron routed network
containing a pool of IPv4 addresses where Nova instances may not be assigned
the first 5 IP addresses in the pool.

The `allocation_ratio` field shall store the "overcommit" ratio for a
particular class of resource that the provider is willing to tolerate. This
information is currently stored only for CPU and RAM, in the
`cpu_allocation_ratio` and `ram_allocation_ratio` fields of the
`compute_nodes` table.

The `min_unit` and `max_unit` fields shall store "limits" information for the
type of resource. This information is necessary to ensure that a request for
more or fewer resources than can be provided as a single unit will not be
accepted.

.. note::

    **How min_unit, max_unit, and allocation_ratio work together**

    As an example, let us say that a particular compute node has two quad-core
    Xeon processors, providing 8 physical cores in total. Even though the
    cloud administrator may have set the `cpu_allocation_ratio` to 16 (the
    default), the compute node cannot accept requests for instances needing
    more than 8 vCPUs. So, while there may be 128 total vCPUs available on the
    compute node, the `min_unit` would be set to 1 and the `max_unit` would be
    set to 8 in order to prevent unacceptable matching of resources to
    requests.

The `step_size` field is a representation of the divisible unit amount of the
resource that may be requested, *if the requested amount is greater than the
`min_unit` value*.

For instance, let's say that an operator wants to ensure that a user can only
request disk resources in 10G increments, with nothing less than 5G and
nothing more than 1TB. For the `DISK_GB` resource class, the operator would
set the inventory of the shared storage pool to a `min_unit` of 5, a
`max_unit` of 1000, and a `step_size` of 10. This would allow a request for 5G
of disk space, as well as 10G or 20G of disk space, but not 6, 7, or 8G of
disk space. As another example, if an operator set their `VCPU` inventory
record on a particular compute node to a `min_unit` of 1, a `max_unit` of 16,
and a `step_size` of 2, a user could request an instance that consumes only 1
vCPU, but any request for more than a single vCPU would have to be evenly
divisible by 2, up to a maximum of 16.
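
To make the interplay of `min_unit`, `max_unit`, `step_size`, `reserved` and
`allocation_ratio` concrete, here is a small illustrative Python sketch. It is
not Nova code; the function name, the capacity formula and the example numbers
are assumptions made purely for this example::

    # Illustrative sketch only, not Nova code. Checks a single requested
    # amount against one inventory record's limits and usable capacity.

    def request_is_valid(requested, total, reserved, min_unit, max_unit,
                         step_size, allocation_ratio, current_used):
        """Return True if `requested` units fit within the inventory record."""
        # The request must respect the per-request limits...
        if requested < min_unit or requested > max_unit:
            return False
        # ...and, above min_unit, must be an even multiple of step_size.
        if requested > min_unit and requested % step_size != 0:
            return False
        # Usable capacity is the overcommit-adjusted total minus the amount
        # set aside for unmanaged consumption.
        capacity = (total - reserved) * allocation_ratio
        return current_used + requested <= capacity


    # Example: the DISK_GB inventory described above (total and reserved are
    # made-up numbers).
    assert request_is_valid(5, total=2000, reserved=100, min_unit=5,
                            max_unit=1000, step_size=10, allocation_ratio=1.0,
                            current_used=0)
    assert not request_is_valid(7, total=2000, reserved=100, min_unit=5,
                                max_unit=1000, step_size=10,
                                allocation_ratio=1.0, current_used=0)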

In order to track resources that have been assigned to and used by some
consumer of a resource, we need an `allocations` table. Records in this table
indicate the amount of a particular resource that has been allocated to a
given consumer of that resource from a particular resource provider::

    CREATE TABLE allocations (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        resource_provider_id INT UNSIGNED NOT NULL,
        consumer_id VARCHAR(64) NOT NULL,
        resource_class_id INT UNSIGNED NOT NULL,
        used INT UNSIGNED NOT NULL,
        INDEX (resource_provider_id, resource_class_id, used),
        INDEX (consumer_id),
        INDEX (resource_class_id)
    );

When a consumer of a particular resource claims resources from a provider, a
record is inserted into the `allocations` table.

.. note::

    The `consumer_id` field will be the UUID of the entity that is consuming
    this resource. This will always be the Nova instance UUID until some
    future point when the Nova scheduler may be broken out to support more
    than just compute resources. The `allocations` table is populated by logic
    outlined in the `compute-node-allocations` specification.

The process of claiming a set of resources in the `allocations` table will
look something like this::

    BEGIN TRANSACTION;
    FOR $RESOURCE_CLASS, $REQUESTED_AMOUNT IN requested_resources:
        INSERT INTO allocations (
            resource_provider_id,
            resource_class_id,
            consumer_id,
            used
        ) VALUES (
            $RESOURCE_PROVIDER_ID,
            $RESOURCE_CLASS,
            $INSTANCE_UUID,
            $REQUESTED_AMOUNT
        );
    COMMIT TRANSACTION;

The problem with the above is that if two threads run a query and select the
same resource provider to place an instance on, they will have selected the
resource provider after taking a point-in-time view of the available inventory
on that resource provider. By the time the `COMMIT TRANSACTION` occurs, one
thread may have claimed resources on that resource provider and invalidated
the point-in-time view held by the other thread. If the other thread just
proceeds and adds records to the `allocations` table, we could end up with
more resources consumed on the host than can actually fit on the host. The
traditional way of solving this problem was to use a `SELECT FOR UPDATE` query
when retrieving the point-in-time view of the resource provider's inventory.
However, the `SELECT FOR UPDATE` statement is not properly supported when
running MySQL Galera Cluster in multi-writer mode. In addition, it uses a
heavyweight pessimistic locking algorithm which locks the selected records for
a (relatively) long period of time.

To solve this particular problem, applications can use a "compare and update"
strategy. In this approach, reader threads save some information about the
point-in-time view and, when sending writes to the database, include a `WHERE`
condition containing that piece of data from the point-in-time view. The write
will only succeed (return >0 rows affected) if the original condition still
holds, that is, if no other thread has updated the viewed rows between the
time of the initial point-in-time read and the attempt to write to the same
rows in the table.

The `resource_providers.generation` field enables atomic writes to the
`allocations` table using this "compare and update" strategy.

Essentially, in pseudo-code, this is how the `generation` field is used in a
"compare and update" approach to claiming resources on a provider::

    deadlock_retry:

    $ID, $GENERATION = SELECT id, generation FROM resource_providers
                       WHERE ( );

    BEGIN TRANSACTION;
    FOR $RESOURCE_CLASS, $REQUESTED_AMOUNT IN requested_resources:
        INSERT INTO allocations (
            resource_provider_id,
            resource_class_id,
            consumer_id,
            used
        ) VALUES (
            $RESOURCE_PROVIDER_ID,
            $RESOURCE_CLASS,
            $INSTANCE_UUID,
            $REQUESTED_AMOUNT
        );
    $ROWS_AFFECTED = UPDATE resource_providers
                     SET generation = $GENERATION + 1
                     WHERE id = $ID AND generation = $GENERATION;
    IF $ROWS_AFFECTED == 0:
        ROLLBACK TRANSACTION;
        GO TO deadlock_retry;
    COMMIT TRANSACTION;
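
For illustration, the same claim loop could look roughly like the following in
Python against a plain DB-API connection (MySQL-style `%s` placeholders,
autocommit disabled). This is a sketch of the "compare and update" technique,
not the actual Nova implementation; the function name, retry count and lack of
error handling are assumptions made for the example::

    # Illustrative sketch of the "compare and update" claim loop above.
    # Assumes a DB-API connection with autocommit disabled, so the inserts
    # and the generation bump form a single transaction.

    MAX_RETRIES = 5


    def claim_resources(conn, provider_id, instance_uuid, requested):
        """Claim `requested` ({resource_class_id: amount}) on a provider."""
        for _ in range(MAX_RETRIES):
            cur = conn.cursor()
            # Point-in-time read of the provider's current generation.
            cur.execute(
                "SELECT generation FROM resource_providers WHERE id = %s",
                (provider_id,))
            generation = cur.fetchone()[0]

            # Write the allocation records for every requested resource class.
            for resource_class_id, amount in requested.items():
                cur.execute(
                    "INSERT INTO allocations (resource_provider_id, "
                    "resource_class_id, consumer_id, used) "
                    "VALUES (%s, %s, %s, %s)",
                    (provider_id, resource_class_id, instance_uuid, amount))

            # The "compare": only bump the generation if nobody else has.
            cur.execute(
                "UPDATE resource_providers SET generation = %s "
                "WHERE id = %s AND generation = %s",
                (generation + 1, provider_id, generation))
            if cur.rowcount == 0:
                # Another thread claimed against this provider first; our view
                # is stale, so undo our inserts and retry with a fresh read.
                conn.rollback()
                continue
            conn.commit()
            return True
        return False

Because the generation check and the allocation inserts share one transaction,
a failed compare simply rolls everything back and the claim is retried against
a fresh view of the provider.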

Alternatives
------------

Continue to use the `compute_nodes` table to store all resource usage and
capacity information. The problems with this are as follows:

* Any new resource type requires changes to the database schema
* We have nowhere in the database to indicate that some resource is shared
  among compute nodes

Data model impact
-----------------

A number of data model changes will be needed.

* New models for:

  * `ResourceProvider`
  * `InventoryItem`
  * `AllocationItem`

* New database tables for all of the above

* Database migrations needed:

  * Addition of the following tables to the schema:

    * `resource_providers`
    * `inventories`
    * `allocations`

REST API impact
---------------

None.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

None.

Other deployer impact
---------------------

None.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  dstepanenko

Other contributors:
  jaypipes

Work Items
----------

* Create a database migration that creates the `resource_providers`,
  `inventories`, and `allocations` tables
* Create the new `nova.objects` models for `ResourceProvider`,
  `InventoryItem`, and `AllocationItem`

In Mitaka, all of this work was completed except for the creation of the
`AllocationItem`, which will be completed in Newton.
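
To make the remaining work item a little more concrete, here is a minimal,
purely hypothetical sketch of what the `AllocationItem` versioned object could
look like if its fields simply mirror the columns of the `allocations` table.
The actual field names, field types and object version will be settled during
the Newton implementation::

    # Hypothetical sketch only; the real AllocationItem will be defined as
    # part of the Newton work. Field names here mirror the allocations table.
    from nova.objects import base
    from nova.objects import fields


    @base.NovaObjectRegistry.register
    class AllocationItem(base.NovaObject):
        # Version 1.0: Initial version
        VERSION = '1.0'

        fields = {
            'id': fields.IntegerField(read_only=True),
            'resource_provider_id': fields.IntegerField(),
            'consumer_id': fields.UUIDField(),
            'resource_class_id': fields.IntegerField(),
            'used': fields.IntegerField(),
        }

The real object would presumably also carry the usual database interaction
methods (create, destroy, list-by-consumer and so on); those are omitted from
this sketch.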

Dependencies
============

* The `resource-classes` blueprint work is a foundation for this work, since
  the `resource_class_id` field in the `inventories` and `allocations` tables
  refers (logically, not via a foreign key constraint) to the resource class
  concept introduced in that blueprint spec.

Testing
=======

New unit tests for the migrations and the new object models should suffice for
this spec.

Documentation Impact
====================

None.

References
==========

[1] Bugs related to resource usage reporting and calculation:

* Hypervisor summary shows incorrect total storage (Ceph)
  https://bugs.launchpad.net/nova/+bug/1387812
* rbd backend reports wrong 'local_gb_used' for compute node
  https://bugs.launchpad.net/nova/+bug/1493760
* nova hypervisor-stats shows wrong disk usage with shared storage
  https://bugs.launchpad.net/nova/+bug/1414432
* report disk consumption incorrect in nova-compute
  https://bugs.launchpad.net/nova/+bug/1315988
* VMWare: available disk spaces (hypervisor-list) only based on a single
  datastore instead of all available datastores from cluster
  https://bugs.launchpad.net/nova/+bug/1347039

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Mitaka
     - Introduced
   * - Mitaka (M3)
     - Added name, generation and can_host fields to the `resource_providers`
       table
   * - Newton
     - Re-proposed