..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================
Resource Providers - Base Models
================================

https://blueprints.launchpad.net/nova/+spec/resource-providers

This blueprint partially addresses the problem of Nova assuming all resources
are provided by a single compute node by introducing a new concept -- a
resource provider -- that will allow Nova to accurately track and reserve
resources regardless of whether the resource is being exposed by a single
compute node, some shared resource pool or an external resource-providing
service of some sort.

Problem description
===================

Within a cloud deployment, there are a number of resources that may be consumed
by a user. Some resource types are provided by a compute node; these types of
resources include CPU, memory, PCI devices and local ephemeral disk. Other
types of resources, however, are not provided by a compute node, but instead
are provided by some external resource pool. An example of such a resource
would be a shared storage pool like that provided by Ceph or an NFS share.

Unfortunately, due to legacy reasons, Nova only thinks of resources as being
provided by a compute node. The tracking of resources assumes that it is the
compute node that provides the resource, and therefore when reporting usage of
certain resources, Nova naively calculates resource usage and availability by
simply summing amounts across all compute nodes in its database. This ends up
causing a number of problems [1] with usage and capacity amounts being
incorrect.

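As a toy illustration (the numbers are made up, not taken from any
deployment), naive summation across compute nodes triple-counts a single
shared pool::

```python
# Three compute nodes back their ephemeral disks with the same 1000 GB
# Ceph pool, and each reports the full pool as its own local capacity.
reported_disk_gb = [1000, 1000, 1000]

naive_total = sum(reported_disk_gb)  # what summing per-node amounts yields
shared_pool_gb = 1000                # what is actually available

print(naive_total, shared_pool_gb)
```
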
Use Cases
----------

As a deployer that has chosen to use a shared storage solution for storing
instance ephemeral disks, I want Nova and Horizon to report the correct
usage and capacity information.

Proposed change
===============

We propose to introduce new database tables and object models in Nova that
store information about the inventory/capacity information of generic providers
of various resources, along with a table structure that can store
usage/allocation information for that inventory.

**This blueprint intentionally does NOT insert records into these new database
tables**. The tables will be populated with the work in the follow-up
`compute-node-inventory`, `compute-node-allocations`, and
`generic-resource-pools` blueprints.

We are going to need a lookup table for the IDs of various resource
providers in the system, too. We'll call this lookup table
`resource_providers`::

    CREATE TABLE resource_providers (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        uuid CHAR(36) NOT NULL,
        name VARCHAR(200) CHARACTER SET utf8 NOT NULL,
        generation INT NOT NULL,
        can_host INT NOT NULL,
        UNIQUE INDEX (uuid)
    );

The `generation` and `can_host` fields are internal implementation fields that
respectively allow for atomic allocation operations and tell the scheduler
whether the resource provider can be a destination for an instance to land on
(a resource pool can never be the target of an instance).

An `inventories` table records the amount of a particular resource that is
provided by a particular resource provider::

    CREATE TABLE inventories (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        resource_provider_id INT UNSIGNED NOT NULL,
        resource_class_id INT UNSIGNED NOT NULL,
        total INT UNSIGNED NOT NULL,
        reserved INT UNSIGNED NOT NULL,
        min_unit INT UNSIGNED NOT NULL,
        max_unit INT UNSIGNED NOT NULL,
        step_size INT UNSIGNED NOT NULL,
        allocation_ratio FLOAT NOT NULL,
        INDEX (resource_provider_id),
        INDEX (resource_class_id)
    );

The `reserved` field shall store the amount the resource provider "sets aside"
for unmanaged consumption of its resources. By "unmanaged", we refer here to
Nova (or the eventual broken-out scheduler) not being involved in the
allocation of some of the resources from the provider. As an example, let's say
that a compute node wants to reserve some amount of RAM for use by the host,
and therefore reduce the amount of RAM that the compute node advertises as its
capacity. As another example, imagine a shared resource pool that has some
amount of disk space consumed by things other than Nova instances. Or consider
a Neutron routed network containing a pool of IPv4 addresses where Nova
instances may not be assigned the first 5 IP addresses in the pool.

The `allocation_ratio` field shall store the "overcommit" ratio for a
particular class of resource that the provider is willing to tolerate. This
information is currently stored only for CPU and RAM in the
`cpu_allocation_ratio` and `ram_allocation_ratio` fields in the `compute_nodes`
table.

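Taken together, the `total`, `reserved`, and `allocation_ratio` fields
determine the effective capacity a provider advertises for a resource class.
As a minimal sketch (the function name and example numbers are illustrative,
not part of this spec), the reserved amount is excluded before the overcommit
ratio is applied::

```python
def capacity(total, reserved, allocation_ratio):
    # Effective capacity advertised for one resource class: the reserved
    # amount is set aside for unmanaged consumption first, and only the
    # remainder is subject to overcommit.
    return (total - reserved) * allocation_ratio


# A compute node with 32768 MB of RAM that reserves 512 MB for the host
# and tolerates a 1.5x RAM overcommit advertises 48384 MB.
print(capacity(32768, 512, 1.5))
```

Whether overcommit applies before or after subtracting `reserved` is an
assumption of this sketch; the spec itself only defines the fields.
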
The `min_unit` and `max_unit` fields shall store "limits" information for the
type of resource. This information is necessary to ensure that a request for
more or less of a resource than can be provided as a single unit will not be
accepted.

.. note::

  **How min_unit, max_unit, and allocation_ratio work together**

  As an example, let us say that a particular compute node has two
  quad-core Xeon processors, providing 8 total physical cores. Even though
  the cloud administrator may have set the `cpu_allocation_ratio` to 16
  (the default), the compute node cannot accept requests for instances
  needing more than 8 vCPUs. So, while there may be 128 total vCPUs
  available on the compute node, the `min_unit` would be set to 1 and the
  `max_unit` would be set to 8 in order to prevent unacceptable matching
  of resources to requests.

  The `step_size` is a representation of the divisible unit amount of the
  resource that may be requested, *if the requested amount is greater than
  the `min_unit` value*.

  For instance, let's say that an operator wants to ensure that a user can
  only request disk resources in 10G increments, with nothing less than 5G
  and nothing more than 1TB. For the `DISK_GB` resource class, the operator
  would set the inventory of the shared storage pool to a `min_unit` of 5,
  a `max_unit` of 1000, and a `step_size` of 10. This would allow a request
  for 5G of disk space as well as 10G and 20G of disk space, but not 6, 7,
  or 8G of disk space. As another example, if an operator set the `VCPU`
  inventory record on a particular compute node to a `min_unit` of 1, a
  `max_unit` of 16, and a `step_size` of 2, a user could request an
  instance that consumes only 1 vCPU, but any request for more than a
  single vCPU would have to be evenly divisible by 2, up to a maximum
  of 16.

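The two examples above reduce to a small validity check. The helper below is a
sketch (the function is illustrative; this spec defines no such helper) of how
`min_unit`, `max_unit`, and `step_size` could be applied to a requested
amount::

```python
def satisfies_unit_constraints(requested, min_unit, max_unit, step_size):
    # A request must fall within [min_unit, max_unit]; amounts greater
    # than min_unit must additionally be an even multiple of step_size.
    if requested < min_unit or requested > max_unit:
        return False
    return requested == min_unit or requested % step_size == 0


# DISK_GB example: min_unit=5, max_unit=1000, step_size=10
assert satisfies_unit_constraints(5, 5, 1000, 10)      # exactly min_unit
assert satisfies_unit_constraints(20, 5, 1000, 10)     # a 10G increment
assert not satisfies_unit_constraints(7, 5, 1000, 10)  # not an increment

# VCPU example: min_unit=1, max_unit=16, step_size=2
assert satisfies_unit_constraints(1, 1, 16, 2)
assert satisfies_unit_constraints(6, 1, 16, 2)
assert not satisfies_unit_constraints(3, 1, 16, 2)
```
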
In order to track resources that have been assigned and used by some consumer
of that resource, we need an `allocations` table. Records in this table
will indicate the amount of a particular resource that has been allocated to a
given consumer of that resource from a particular resource provider::

    CREATE TABLE allocations (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        resource_provider_id INT UNSIGNED NOT NULL,
        consumer_id VARCHAR(64) NOT NULL,
        resource_class_id INT UNSIGNED NOT NULL,
        used INT UNSIGNED NOT NULL,
        INDEX (resource_provider_id, resource_class_id, used),
        INDEX (consumer_id),
        INDEX (resource_class_id)
    );

When a consumer of a particular resource claims resources from a provider,
a record is inserted into the `allocations` table.

.. note::

  The `consumer_id` field will be the UUID of the entity that is consuming
  this resource. This will always be the Nova instance UUID until some future
  point when the Nova scheduler may be broken out to support more than just
  compute resources. The `allocations` table is populated by logic outlined
  in the `compute-node-allocations` specification.

The process of claiming a set of resources in the `allocations` table will look
something like this::

    BEGIN TRANSACTION;
    FOR $RESOURCE_CLASS, $REQUESTED_AMOUNT IN requested_resources:
        INSERT INTO allocations (
            resource_provider_id,
            resource_class_id,
            consumer_id,
            used
        ) VALUES (
            $RESOURCE_PROVIDER_ID,
            $RESOURCE_CLASS,
            $INSTANCE_UUID,
            $REQUESTED_AMOUNT
        );
    COMMIT TRANSACTION;

The problem with the above is that if two threads run a query and select the
same resource provider to place an instance on, they will have selected the
resource provider after making a point-in-time view of the available inventory
on that resource provider. By the time the `COMMIT TRANSACTION` occurs, one
thread may have claimed resources on that resource provider and changed that
point-in-time view in the other thread. If the other thread just proceeds and
adds records to the `allocations` table, we could end up with more resources
consumed on the host than can actually fit on the host. The traditional way of
solving this problem was to use a `SELECT FOR UPDATE` query when retrieving the
point-in-time view of the resource provider's inventory. However, the `SELECT
FOR UPDATE` statement is not supported properly when running MySQL Galera
Cluster in multi-writer mode. In addition, it uses a heavy pessimistic
locking algorithm which locks the selected records for a (relatively) long
period of time.

To solve this particular problem, applications can use a "compare and update"
strategy. In this approach, reader threads save some information about the
point-in-time view and when sending writes to the database, include a `WHERE`
condition containing the piece of data from the point-in-time view. The write
will only succeed (return >0 rows affected) if the original condition holds and
another thread hasn't updated the viewed rows in between the time of the
initial point-in-time read and the attempt to write to the same rows in the
table.

The `resource_providers.generation` field enables atomic writes to the
`allocations` table using this "compare and update" strategy.

Essentially, in pseudo-code, this is how the `generation` field is used in a
"compare and update" approach to claiming resources on a provider::

    deadlock_retry:

        $ID, $GENERATION = SELECT id, generation FROM resource_providers
        WHERE ( <QUERY_TO_IDENTIFY_AVAILABLE_INVENTORY> );

        BEGIN TRANSACTION;
        FOR $RESOURCE_CLASS, $REQUESTED_AMOUNT IN requested_resources:
            INSERT INTO allocations (
                resource_provider_id,
                resource_class_id,
                consumer_id,
                used
            ) VALUES (
                $RESOURCE_PROVIDER_ID,
                $RESOURCE_CLASS,
                $INSTANCE_UUID,
                $REQUESTED_AMOUNT
            );
        $ROWS_AFFECTED = UPDATE resource_providers
            SET generation = $GENERATION + 1
            WHERE id = $ID AND generation = $GENERATION;
        IF $ROWS_AFFECTED == 0:
            ROLLBACK TRANSACTION;
            GO TO deadlock_retry;
        COMMIT TRANSACTION;

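The pseudo-code above can be exercised end to end with an in-memory database.
The sketch below is not Nova's actual implementation -- SQLite stands in for
MySQL, and the inventory-availability query is omitted -- but it demonstrates
how the `generation` compare-and-update causes the second of two writers that
read the same point-in-time view to fail and retry::

```python
import sqlite3

# In-memory stand-ins for the two tables this spec introduces.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource_providers (
        id INTEGER PRIMARY KEY,
        generation INTEGER NOT NULL
    );
    CREATE TABLE allocations (
        resource_provider_id INTEGER NOT NULL,
        consumer_id TEXT NOT NULL,
        resource_class_id INTEGER NOT NULL,
        used INTEGER NOT NULL
    );
    INSERT INTO resource_providers (id, generation) VALUES (1, 0);
""")


def claim(conn, provider_id, consumer_id, resources, seen_generation):
    # Attempt a claim; return True on success, False if the provider's
    # generation moved underneath us (the caller should re-read and retry).
    cur = conn.cursor()
    for resource_class_id, amount in resources.items():
        cur.execute(
            "INSERT INTO allocations VALUES (?, ?, ?, ?)",
            (provider_id, consumer_id, resource_class_id, amount))
    cur.execute(
        "UPDATE resource_providers SET generation = ? "
        "WHERE id = ? AND generation = ?",
        (seen_generation + 1, provider_id, seen_generation))
    if cur.rowcount == 0:
        conn.rollback()  # discard this thread's allocation records
        return False
    conn.commit()
    return True


# The first writer saw generation 0 and wins; a second writer that also
# saw generation 0 loses, and its allocations are rolled back.
print(claim(conn, 1, "inst-a", {2: 4}, 0))
print(claim(conn, 1, "inst-b", {2: 4}, 0))
```

A real implementation would re-read the provider's generation (and available
inventory) before retrying, as the `deadlock_retry` loop above does.
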
Alternatives
------------

Continue to use the `compute_nodes` table to store all resource usage and
capacity information. The problems with this approach are as follows:

* Any new resources require changes to the database schema
* We have nowhere in the database to indicate that some resource is shared
  among compute nodes

Data model impact
-----------------

A number of data model changes will be needed.

* New models for:

  * `ResourceProvider`
  * `InventoryItem`
  * `AllocationItem`

* New database tables for all of the above

* Database migrations needed:

  * Addition of the following tables into the schema:

    * `resource_providers`
    * `inventories`
    * `allocations`

REST API impact
---------------

None.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

None.

Other deployer impact
---------------------

None.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  dstepanenko

Other contributors:
  jaypipes

Work Items
----------

* Create a database migration that creates the `resource_providers`,
  `inventories`, and `allocations` tables
* Create the new `nova.objects` models for `ResourceProvider`,
  `InventoryItem`, and `AllocationItem`

Dependencies
============

* The `resource-classes` blueprint work is a foundation for this work, since
  the `resource_class_id` field in the `inventories` and `allocations` tables
  refers (logically, not via a foreign key constraint) to the resource class
  concept introduced in that blueprint spec.

Testing
=======

New unit tests for the migrations and new object models should suffice for
this spec.

Documentation Impact
====================

None.

References
==========

[1] Bugs related to resource usage reporting and calculation:

* Hypervisor summary shows incorrect total storage (Ceph)
  https://bugs.launchpad.net/nova/+bug/1387812
* rbd backend reports wrong 'local_gb_used' for compute node
  https://bugs.launchpad.net/nova/+bug/1493760
* nova hypervisor-stats shows wrong disk usage with shared storage
  https://bugs.launchpad.net/nova/+bug/1414432
* report disk consumption incorrect in nova-compute
  https://bugs.launchpad.net/nova/+bug/1315988
* VMWare: available disk spaces(hypervisor-list) only based on a single
  datastore instead of all available datastores from cluster
  https://bugs.launchpad.net/nova/+bug/1347039

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Mitaka
     - Introduced
   * - Mitaka (M3)
     - Added name, generation and can_host fields to the
       `resource_providers` table