Re-propose resource-providers for Newton
This blueprint partially addresses the problem of Nova assuming all resources are provided by a single compute node by introducing a new concept -- a resource provider -- that will allow Nova to accurately track and reserve resources regardless of whether the resource is being exposed by a single compute node, some shared resource pool or an external resource-providing service of some sort.

The spec has been annotated to reflect which work has been completed and which remains to be done in Newton.

Change-Id: Id0c030159fa741a5eb51d52332d6fdbce05000ae
Previously-approved: Mitaka
specs/newton/approved/resource-providers.rst (new file, 387 lines)
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================
Resource Providers - Base Models
================================

https://blueprints.launchpad.net/nova/+spec/resource-providers

This blueprint partially addresses the problem of Nova assuming all resources
are provided by a single compute node by introducing a new concept -- a
resource provider -- that will allow Nova to accurately track and reserve
resources regardless of whether the resource is being exposed by a single
compute node, some shared resource pool or an external resource-providing
service of some sort.

.. note:: The majority of the work described here was completed in
   Mitaka. The single remaining work item is the creation of an
   `AllocationItem` object.

Problem description
===================

Within a cloud deployment, there are a number of resources that may be consumed
by a user. Some resource types are provided by a compute node; these types of
resources include CPU, memory, PCI devices and local ephemeral disk. Other
types of resources, however, are not provided by a compute node, but instead
are provided by some external resource pool. An example of such a resource
would be a shared storage pool like that provided by Ceph or an NFS share.

Unfortunately, due to legacy reasons, Nova only thinks of resources as being
provided by a compute node. The tracking of resources assumes that it is the
compute node that provides the resource, and therefore when reporting usage of
certain resources, Nova naively calculates resource usage and availability by
simply summing amounts across all compute nodes in its database. This ends up
causing a number of problems [1] with usage and capacity amounts being
incorrect.

Use Cases
---------

As a deployer that has chosen to use a shared storage solution for storing
instance ephemeral disks, I want Nova and Horizon to report the correct
usage and capacity information.

Proposed change
===============

We propose to introduce new database tables and object models in Nova that
store inventory/capacity information for generic providers of various
resources, along with a table structure that can store usage/allocation
information for that inventory.

**This blueprint intentionally does NOT insert records into these new database
tables**. The tables will be populated by the work in the follow-up
`compute-node-inventory`, `compute-node-allocations`, and
`generic-resource-pools` blueprints.

We are going to need a lookup table for the IDs of various resource
providers in the system, too. We'll call this lookup table
`resource_providers`::

    CREATE TABLE resource_providers (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        uuid CHAR(36) NOT NULL,
        name VARCHAR(200) NOT NULL CHARACTER SET utf8,
        generation INT NOT NULL,
        can_host INT NOT NULL,
        UNIQUE INDEX (uuid)
    );

The `generation` and `can_host` fields are internal implementation fields that
respectively allow for atomic allocation operations and tell the scheduler
whether the resource provider can be a destination for an instance to land on
(hint: a resource pool can never be the target for an instance).

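Although this blueprint intentionally leaves population of the table to the
follow-up blueprints, a pair of invented rows illustrates how the two kinds of
provider would be distinguished; the names and UUIDs below are made up for
this example::

    -- A hypothetical compute node: a valid target for instances.
    INSERT INTO resource_providers (uuid, name, generation, can_host)
    VALUES ('11111111-2222-3333-4444-555555555555', 'compute-node-01', 0, 1);

    -- A hypothetical shared storage pool: provides inventory, but can
    -- never be the target for an instance.
    INSERT INTO resource_providers (uuid, name, generation, can_host)
    VALUES ('66666666-7777-8888-9999-000000000000', 'ceph-ephemeral-pool', 0, 0);

Only the `can_host` value differs: the scheduler may land instances on the
first provider but never on the second.
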
An `inventories` table records the amount of a particular resource that is
provided by a particular resource provider::

    CREATE TABLE inventories (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        resource_provider_id INT UNSIGNED NOT NULL,
        resource_class_id INT UNSIGNED NOT NULL,
        total INT UNSIGNED NOT NULL,
        reserved INT UNSIGNED NOT NULL,
        min_unit INT UNSIGNED NOT NULL,
        max_unit INT UNSIGNED NOT NULL,
        step_size INT UNSIGNED NOT NULL,
        allocation_ratio FLOAT NOT NULL,
        INDEX (resource_provider_id),
        INDEX (resource_class_id)
    );

The `reserved` field shall store the amount the resource provider "sets aside"
for unmanaged consumption of its resources. By "unmanaged", we refer here to
Nova (or the eventual broken-out scheduler) not being involved in the
allocation of some of the resources from the provider. As an example, let's say
that a compute node wants to reserve some amount of RAM for use by the host,
and therefore reduce the amount of RAM that the compute node advertises as its
capacity. As another example, imagine a shared resource pool that has some
amount of disk space consumed by things other than Nova instances. Or, further,
a Neutron routed network containing a pool of IPv4 addresses, where Nova
instances may not be assigned the first 5 IP addresses in the pool.

The `allocation_ratio` field shall store the "overcommit" ratio for a
particular class of resource that the provider is willing to tolerate. This
information is currently stored only for CPU and RAM in the
`cpu_allocation_ratio` and `ram_allocation_ratio` fields in the `compute_nodes`
table.

The `min_unit` and `max_unit` fields shall store "limits" information for the
type of resource. This information is necessary to ensure that a request for
more or less of a resource than can be provided as a single unit will not be
accepted.

.. note::

   **How min_unit, max_unit, and allocation_ratio work together**

   As an example, let us say that a particular compute node has two
   quad-core Xeon processors, providing 8 total physical cores. Even though
   the cloud administrator may have set the `cpu_allocation_ratio` to 16
   (the default), the compute node cannot accept requests for instances
   needing more than 8 vCPUs. So, while there may be 128 total vCPUs
   available on the compute node, the `min_unit` would be set to 1 and the
   `max_unit` would be set to 8 in order to prevent unacceptable matching of
   resources to requests.

The `step_size` field is a representation of the divisible unit amount of the
resource that may be requested, *if the requested amount is greater than
the `min_unit` value*.

For instance, let's say that an operator wants to ensure that a user can only
request disk resources in 10G increments, with nothing less than 5G and nothing
more than 1TB. For the `DISK_GB` resource class, the operator would set the
inventory of the shared storage pool to a `min_unit` of 5, a `max_unit` of
1000, and a `step_size` of 10. This would allow a request for 5G of disk space,
as well as 10G and 20G of disk space, but not 6, 7, or 8G of disk space. As
another example, let's say an operator set their `VCPU` inventory record on a
particular compute node to a `min_unit` of 1, a `max_unit` of 16, and a
`step_size` of 2. That would mean a user can request an instance that consumes
only 1 vCPU, but if the user requests more than a single vCPU, that number must
be evenly divisible by 2, up to a maximum of 16.

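Expressed as hypothetical rows in the `inventories` table described above (the
`resource_provider_id` and `resource_class_id` values are invented for this
sketch; the real class identifiers come from the `resource-classes` blueprint),
those two examples would look like this::

    -- Hypothetical DISK_GB inventory on a shared storage pool: 100TB
    -- total, 500G set aside for unmanaged consumption, requests allowed
    -- between 5G and 1TB, in 10G increments above the minimum.
    INSERT INTO inventories (resource_provider_id, resource_class_id,
        total, reserved, min_unit, max_unit, step_size, allocation_ratio)
    VALUES (2, 2, 102400, 500, 5, 1000, 10, 1.0);

    -- Hypothetical VCPU inventory on a compute node: amounts above a
    -- single vCPU must be evenly divisible by 2, up to 16.
    INSERT INTO inventories (resource_provider_id, resource_class_id,
        total, reserved, min_unit, max_unit, step_size, allocation_ratio)
    VALUES (1, 0, 8, 0, 1, 16, 2, 16.0);
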
In order to track resources that have been assigned and used by some consumer
of that resource, we need an `allocations` table. Records in this table
will indicate the amount of a particular resource that has been allocated to a
given consumer of that resource from a particular resource provider::

    CREATE TABLE allocations (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        resource_provider_id INT UNSIGNED NOT NULL,
        consumer_id VARCHAR(64) NOT NULL,
        resource_class_id INT UNSIGNED NOT NULL,
        used INT UNSIGNED NOT NULL,
        INDEX (resource_provider_id, resource_class_id, used),
        INDEX (consumer_id),
        INDEX (resource_class_id)
    );

When a consumer of a particular resource claims resources from a provider,
a record is inserted into the `allocations` table.

.. note::

   The `consumer_id` field will be the UUID of the entity that is consuming
   this resource. This will always be the Nova instance UUID until some future
   point when the Nova scheduler may be broken out to support more than just
   compute resources. The `allocations` table is populated by logic outlined
   in the `compute-node-allocations` specification.

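As an invented example, an instance consuming 4 vCPUs and 8G of RAM from a
compute node, with a 100G ephemeral disk on a shared storage pool, could be
represented by rows like the following (provider and resource class IDs as in
the earlier sketches)::

    -- Hypothetical allocations for a single instance; consumer_id is
    -- the instance UUID. VCPU and MEMORY_MB come from the compute node
    -- (provider 1), DISK_GB from the shared storage pool (provider 2).
    INSERT INTO allocations (resource_provider_id, consumer_id,
        resource_class_id, used)
    VALUES (1, 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee', 0, 4),
           (1, 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee', 1, 8192),
           (2, 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee', 2, 100);
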
The process of claiming a set of resources in the `allocations` table will look
something like this::

    BEGIN TRANSACTION;
    FOR $RESOURCE_CLASS, $REQUESTED_AMOUNT IN requested_resources:
        INSERT INTO allocations (
            resource_provider_id,
            resource_class_id,
            consumer_id,
            used
        ) VALUES (
            $RESOURCE_PROVIDER_ID,
            $RESOURCE_CLASS,
            $INSTANCE_UUID,
            $REQUESTED_AMOUNT
        );
    COMMIT TRANSACTION;

The problem with the above is that if two threads run a query and select the
same resource provider to place an instance on, they will have selected the
resource provider after making a point-in-time view of the available inventory
on that resource provider. By the time the `COMMIT TRANSACTION` occurs, one
thread may have claimed resources on that resource provider and changed that
point-in-time view in the other thread. If the other thread just proceeds and
adds records to the `allocations` table, we could end up with more resources
consumed on the host than can actually fit on the host. The traditional way of
solving this problem was to use a `SELECT FOR UPDATE` query when retrieving the
point-in-time view of the resource provider's inventory. However, the `SELECT
FOR UPDATE` statement is not supported properly when running MySQL Galera
Cluster in multi-writer mode. In addition, it uses a heavy pessimistic
locking algorithm which locks the selected records for a (relatively) long
period of time.

To solve this particular problem, applications can use a "compare and update"
strategy. In this approach, reader threads save some information about the
point-in-time view and, when sending writes to the database, include a `WHERE`
condition containing that piece of data from the point-in-time view. The write
will only succeed (return >0 rows affected) if the original condition holds and
another thread has not updated the viewed rows between the time of the
initial point-in-time read and the attempt to write to the same rows in the
table.

The `resource_providers.generation` field enables atomic writes to the
`allocations` table using this "compare and update" strategy.

Essentially, in pseudo-code, this is how the `generation` field is used in a
"compare and update" approach to claiming resources on a provider::

    deadlock_retry:

      $ID, $GENERATION = SELECT id, generation FROM resource_providers
                         WHERE ( <QUERY_TO_IDENTIFY_AVAILABLE_INVENTORY> );

      BEGIN TRANSACTION;
      FOR $RESOURCE_CLASS, $REQUESTED_AMOUNT IN requested_resources:
          INSERT INTO allocations (
              resource_provider_id,
              resource_class_id,
              consumer_id,
              used
          ) VALUES (
              $ID,
              $RESOURCE_CLASS,
              $INSTANCE_UUID,
              $REQUESTED_AMOUNT
          );
      $ROWS_AFFECTED = UPDATE resource_providers
                       SET generation = $GENERATION + 1
                       WHERE id = $ID AND generation = $GENERATION;
      IF $ROWS_AFFECTED == 0:
          ROLLBACK TRANSACTION;
          GO TO deadlock_retry;
      COMMIT TRANSACTION;

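The `<QUERY_TO_IDENTIFY_AVAILABLE_INVENTORY>` placeholder above is deliberately
left unspecified. One possible shape for it -- shown here only as a sketch,
with `$VCPU_CLASS_ID` and `$AMOUNT` as stand-ins -- joins `inventories` against
summed `allocations` and compares the requested amount to capacity adjusted
for `reserved` and `allocation_ratio`::

    -- A sketch only: find providers with enough unused VCPU capacity
    -- for a request of $AMOUNT, honoring reserved and overcommit.
    SELECT rp.id, rp.generation
    FROM resource_providers rp
    JOIN inventories inv
      ON inv.resource_provider_id = rp.id
     AND inv.resource_class_id = $VCPU_CLASS_ID
    LEFT JOIN (
        SELECT resource_provider_id, SUM(used) AS total_used
        FROM allocations
        WHERE resource_class_id = $VCPU_CLASS_ID
        GROUP BY resource_provider_id
    ) alloc ON alloc.resource_provider_id = rp.id
    WHERE $AMOUNT BETWEEN inv.min_unit AND inv.max_unit
      AND (inv.total - inv.reserved) * inv.allocation_ratio
          - COALESCE(alloc.total_used, 0) >= $AMOUNT;
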
Alternatives
------------

Continue to use the `compute_nodes` table to store all resource usage and
capacity information. The problems with this approach are as follows:

* Any new resources require changes to the database schema
* We have nowhere in the database to indicate that some resource is shared
  among compute nodes

Data model impact
-----------------

A number of data model changes will be needed.

* New models for:

  * `ResourceProvider`
  * `InventoryItem`
  * `AllocationItem`

* New database tables for all of the above

* Database migrations needed:

  * Addition of the following tables to the schema:

    * `resource_providers`
    * `inventories`
    * `allocations`

REST API impact
---------------

None.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

None.

Other deployer impact
---------------------

None.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  dstepanenko

Other contributors:
  jaypipes

Work Items
----------

* Create a database migration that creates the `resource_providers`,
  `inventories`, and `allocations` tables
* Create the new `nova.objects` models for `ResourceProvider`, `InventoryItem`,
  and `AllocationItem`

In Mitaka, all of this work was completed except for the creation of
the `AllocationItem` object, which will be completed in Newton.

Dependencies
============

* The `resource-classes` blueprint work is a foundation for this work, since
  the `resource_class_id` field in the `inventories` and `allocations` tables
  refers (logically, not via a foreign key constraint) to the resource class
  concept introduced in that blueprint spec.

Testing
=======

New unit tests for the migrations and new object models should suffice for this
spec.

Documentation Impact
====================

None.

References
==========

[1] Bugs related to resource usage reporting and calculation:

* Hypervisor summary shows incorrect total storage (Ceph):
  https://bugs.launchpad.net/nova/+bug/1387812
* rbd backend reports wrong 'local_gb_used' for compute node:
  https://bugs.launchpad.net/nova/+bug/1493760
* nova hypervisor-stats shows wrong disk usage with shared storage:
  https://bugs.launchpad.net/nova/+bug/1414432
* report disk consumption incorrect in nova-compute:
  https://bugs.launchpad.net/nova/+bug/1315988
* VMWare: available disk spaces (hypervisor-list) only based on a single
  datastore instead of all available datastores from cluster:
  https://bugs.launchpad.net/nova/+bug/1347039

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Mitaka
     - Introduced
   * - Mitaka (M3)
     - Added name, generation and can_host fields to the `resource_providers`
       table
   * - Newton
     - Re-proposed