Move Stein implemented specs
This moves the completed Stein specs to the implemented directory and
adds the redirects. This was done using:

    $ tox -e move-implemented-specs stein

This also removes the stein-template symlink from the docs, and renames
the handling-down-cell_new.rst spec to handling-down-cell.rst to match
the blueprint.

Change-Id: Id92bec8c5a2436a4053765f735d252c7c165f019
292 specs/stein/implemented/alloc-candidates-in-tree.rst (new file)
@@ -0,0 +1,292 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================
Filter Allocation Candidates by Provider Tree
=============================================

https://blueprints.launchpad.net/nova/+spec/alloc-candidates-in-tree

This blueprint proposes support for filtering allocation candidates
by provider tree.

Problem description
===================

Placement currently supports the ``in_tree`` query parameter for the
``GET /resource_providers`` endpoint. This parameter is a string representing
a resource provider uuid, and when it is present, the response is limited to
resource providers within the same tree as the provider indicated by the uuid.
See the `Nested Resource Providers`_ spec for details.

However, ``GET /allocation_candidates`` doesn't support the ``in_tree`` query
parameter to filter the allocation candidates by resource tree. This results
in inefficient post-processing in some cases where the caller has already
selected the resource provider tree before calling that API.

Use Cases
---------

This feature is useful when the caller of ``GET /allocation_candidates``
has already picked the resource providers they want to use.

As described in `Bug#1777591`_, when an admin operator creates an instance
on a specific host, nova now explicitly sets no limit when getting
allocation candidates, to prevent placement from filtering out the
pre-determined target resource provider through the random limiting. (For the
limiting feature of the API, see the `Limiting Allocation Candidates`_
spec.)

Instead of issuing that inefficient request to placement, we can use the
``in_tree`` query with the pre-determined target host resource provider uuid
when calling the ``GET /allocation_candidates`` API.

We would solve the same problem for the cases of live migration to a specified
host and rebuilding an instance on the same host.

Proposed change
===============

The ``GET /allocation_candidates`` call will accept a new query parameter
``in_tree``. This parameter is a string representing a resource provider uuid.
When this is present, the only resource providers returned will be those in
the same tree as the given resource provider.

The numbered syntax ``in_tree<N>`` is also supported. This restricts providers
satisfying the Nth granular request group to the tree of the specified
provider. This may be redundant with other ``in_tree<N>`` values specified in
other groups (including the unnumbered group). However, it can be useful in
cases where a specific resource (e.g. DISK_GB) needs to come from a specific
sharing provider (e.g. shared storage).

In the following environment,

.. code::

        +-----------------------+          +-----------------------+
        | sharing storage (ss1) |          | sharing storage (ss2) |
        |     DISK_GB: 1000     |          |     DISK_GB: 1000     |
        +-----------+-----------+          +-----------+-----------+
                    |                                  |
                    +----------------+-----------------+
                                     |
                                     | Shared via an aggregate
                    +----------------+-----------------+
                    |                                  |
     +--------------|---------------+  +---------------|-------------+
     | +------------+-------------+ |  | +-------------+-----------+ |
     | |    compute node (cn1)    | |  | |   compute node (cn2)    | |
     | |       DISK_GB: 1000      | |  | |      DISK_GB: 1000      | |
     | +-----+-------------+------+ |  | +----+-------------+------+ |
     |       | nested      | nested |  |      | nested      | nested |
     | +-----+------+ +----+------+ |  | +----+------+ +----+------+ |
     | |  numa1_1   | |  numa1_2  | |  | |  numa2_1  | |  numa2_2  | |
     | |  VCPU: 4   | |  VCPU: 4  | |  | |  VCPU: 4  | |  VCPU: 4  | |
     | +------------+ +-----------+ |  | +-----------+ +-----------+ |
     +------------------------------+  +-----------------------------+

for example::

    GET /allocation_candidates?resources=VCPU:1,DISK_GB:50&in_tree={cn1_uuid}

will return 2 combinations of allocation candidates.

result A::

    1. numa1_1 (VCPU) + cn1 (DISK_GB)
    2. numa1_2 (VCPU) + cn1 (DISK_GB)

The specified tree can be a non-root provider::

    GET /allocation_candidates?resources=VCPU:1,DISK_GB:50&in_tree={numa1_1_uuid}

will return the same result.

result B::

    1. numa1_1 (VCPU) + cn1 (DISK_GB)
    2. numa1_2 (VCPU) + cn1 (DISK_GB)

When you want to have ``VCPU`` from ``cn1`` and ``DISK_GB`` from wherever,
the request may look like::

    GET /allocation_candidates?resources=VCPU:1&in_tree={cn1_uuid}
                               &resources1=DISK_GB:10

which will return the sharing providers as well.

result C::

    1. numa1_1 (VCPU) + cn1 (DISK_GB)
    2. numa1_2 (VCPU) + cn1 (DISK_GB)
    3. numa1_1 (VCPU) + ss1 (DISK_GB)
    4. numa1_2 (VCPU) + ss1 (DISK_GB)
    5. numa1_1 (VCPU) + ss2 (DISK_GB)
    6. numa1_2 (VCPU) + ss2 (DISK_GB)

When you want to have ``VCPU`` from wherever and ``DISK_GB`` from ``ss1``,
the request may look like::

    GET /allocation_candidates?resources=VCPU:1
                               &resources1=DISK_GB:10&in_tree1={ss1_uuid}

which will stick to the first sharing provider for DISK_GB.

result D::

    1. numa1_1 (VCPU) + ss1 (DISK_GB)
    2. numa1_2 (VCPU) + ss1 (DISK_GB)
    3. numa2_1 (VCPU) + ss1 (DISK_GB)
    4. numa2_2 (VCPU) + ss1 (DISK_GB)

When you want to have ``VCPU`` from ``cn1`` and ``DISK_GB`` from ``ss1``,
the request may look like::

    GET /allocation_candidates?resources1=VCPU:1&in_tree1={cn1_uuid}
                               &resources2=DISK_GB:10&in_tree2={ss1_uuid}
                               &group_policy=isolate

which will return only 2 candidates.

result E::

    1. numa1_1 (VCPU) + ss1 (DISK_GB)
    2. numa1_2 (VCPU) + ss1 (DISK_GB)
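
To make the query shape concrete, the following is a minimal sketch, not part
of the proposed API change itself, of how a caller such as nova could assemble
the granular request used for result E above. The helper function and the uuid
values are illustrative only:

.. code-block:: python

    from urllib.parse import urlencode

    def build_allocation_candidates_query(cn1_uuid, ss1_uuid):
        """Return the query string for the result E example."""
        params = [
            # Group 1: VCPU restricted to the compute node tree.
            ('resources1', 'VCPU:1'),
            ('in_tree1', cn1_uuid),
            # Group 2: DISK_GB restricted to the sharing storage provider.
            ('resources2', 'DISK_GB:10'),
            ('in_tree2', ss1_uuid),
            # Keep the two groups on different providers.
            ('group_policy', 'isolate'),
        ]
        return '/allocation_candidates?' + urlencode(params)

    print(build_allocation_candidates_query(
        '8d09bd2e-37bf-4aee-ad29-06d1bbbb0de7',   # cn1 (illustrative uuid)
        'c3c4f3b0-1c21-4d38-b2b9-6b1b5e33dc57'))  # ss1 (illustrative uuid)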

Alternatives
------------

Alternative 1:

We could relax the restriction and also include sharing providers, treating
them as being part of the specified non-sharing tree that they share with. For
example, we could change result A to return::

    1. numa1_1 (VCPU) + cn1 (DISK_GB)
    2. numa1_2 (VCPU) + cn1 (DISK_GB)
    3. numa1_1 (VCPU) + ss1 (DISK_GB)
    4. numa1_2 (VCPU) + ss1 (DISK_GB)
    5. numa1_1 (VCPU) + ss2 (DISK_GB)
    6. numa1_2 (VCPU) + ss2 (DISK_GB)

This is possible if we assume that ``ss1`` and ``ss2`` are in "an expanded
concept of a tree" of ``cn1``, but we do not take this approach because we can
get the same result using a granular request. Getting a different result for a
different request means we support more use cases than getting the same result
for a different request.

Alternative 2:

In result B, we could exclude the ``numa1_2`` resource provider (the second
candidate), but we do not take this approach for the following reason:
it is not consistent with the existing ``in_tree`` behavior in
``GET /resource_providers``. The inconsistency, despite the same query
parameter name, could confuse users. If we need this behavior, it would be
something like a ``subtree`` query parameter, which should be symmetrically
implemented for ``GET /resource_providers`` as well. This is already proposed
in the `Support subtree filter for GET /resource_providers`_ spec.

Data model impact
-----------------

None.

REST API impact
---------------

A new microversion will be created to add the ``in_tree`` parameter to the
``GET /allocation_candidates`` API.
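
As a usage sketch only (the microversion number below is a placeholder for
whichever version this spec lands in), a caller opts in via the standard
placement microversion header:

.. code-block:: python

    import requests

    # Illustrative values; a real caller would use a keystoneauth session.
    PLACEMENT = 'https://placement.example.com'
    HEADERS = {
        'X-Auth-Token': 'gAAAA...',                # keystone token (placeholder)
        'OpenStack-API-Version': 'placement 1.XX', # microversion added here
    }

    resp = requests.get(
        PLACEMENT + '/allocation_candidates',
        params={'resources': 'VCPU:1,DISK_GB:50',
                'in_tree': '8d09bd2e-37bf-4aee-ad29-06d1bbbb0de7'},
        headers=HEADERS)
    resp.raise_for_status()
    candidates = resp.json()['allocation_requests']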

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

If the callers of ``GET /allocation_candidates`` have already picked the
resource providers they want to use, they would get improved performance
using this new ``in_tree`` query, because we don't need to get all the
candidates from the database.

Other deployer impact
---------------------

This feature enables us to develop efficient queries in nova for the cases
described in the `Use Cases`_ section.

Developer impact
----------------

None.

Upgrade impact
--------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Tetsuro Nakamura (nakamura.tetsuro@lab.ntt.co.jp)

Work Items
----------

* Update the ``AllocationCandidates.get_by_requests`` method to change the
  database queries to filter on the specified provider tree.
* Update the placement API handlers for ``GET /allocation_candidates`` in
  a new microversion to pass the new ``in_tree`` parameter to the methods
  changed in the steps above, including input validation adjustments.
* Add functional tests of the modified database queries.
* Add gabbi tests that express the new queries, both successful queries and
  those that should cause a 400 response.
* Release note for the API change.
* Update the microversion documents to indicate the new version.
* Update placement-api-ref to show the new query handling.

Dependencies
============

None.

Testing
=======

Normal functional and unit testing.

Documentation Impact
====================

Document the REST API microversion in the appropriate reference docs.

References
==========

* `Nested Resource Providers`_ spec
* `Bug#1777591`_ reported in launchpad
* `Limiting Allocation Candidates`_ spec

.. _`Nested Resource Providers`: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/nested-resource-providers.html
.. _`Bug#1777591`: https://bugs.launchpad.net/nova/+bug/1777591
.. _`Limiting Allocation Candidates`: https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/allocation-candidates-limit.html
.. _`Support subtree filter for GET /resource_providers`: https://review.openstack.org/#/c/595236/

624 specs/stein/implemented/bandwidth-resource-provider.rst (new file)
@@ -0,0 +1,624 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===================================
Network Bandwidth resource provider
===================================

https://blueprints.launchpad.net/nova/+spec/bandwidth-resource-provider

This spec proposes adding new resource classes representing network
bandwidth and modeling network backends as resource providers in
Placement, as well as adding scheduling support for the new resources in Nova.


Problem description
===================

Currently there is no method in the Nova scheduler to place a server
based on the network bandwidth available on a host. The Placement service
doesn't track the different network back-ends present on a host or their
available bandwidth.

Use Cases
---------

A user wants to spawn a server with a port associated with a specific physical
network. The user also wants a defined guaranteed minimum bandwidth for this
port. The Nova scheduler must select a host which satisfies this request.


Proposed change
===============

This spec proposes that Neutron model the bandwidth resources of the physical
NICs on a compute host and their resource providers in the Placement service,
express the bandwidth request in the Neutron port, and that Nova be modified to
consider the requested bandwidth resources during scheduling of the server,
based on the available bandwidth resources on each compute host.

This also means that this spec proposes to use Placement and the nova-scheduler
to select which bandwidth providing RP, and therefore which physical device,
will provide the bandwidth for a given Neutron port. Today selecting the
physical device happens during Neutron port binding, but after this spec is
implemented this selection will happen when an allocation candidate is selected
for the server in the nova-scheduler. Therefore Neutron needs to provide enough
information in the Networking RP model in Placement and in the resource_request
field of the port so that Nova can query Placement and receive allocation
candidates that do not conflict with the Neutron port binding logic.
The Networking RP model and the schema of the new resource_request port
attribute are described in the
`QoS minimum bandwidth allocation in Placement API`_ Neutron spec.

Please note that today Neutron port binding can fail if the nova-scheduler
selects a compute host where Neutron cannot bind the port. We are not aiming to
remove this limitation with this spec, but we also don't want to increase the
frequency of such port binding failures, as that would ruin the usability of
the system.


Separation of responsibilities
------------------------------

* Nova creates the root RP of the compute node RP tree, as today.
* Neutron creates the networking RP tree of a compute node under the compute
  node root RP and reports bandwidth inventories.
* Neutron provides the resource_request of a port in the Neutron API.
* Nova takes the ports' resource_request and includes it in the GET
  /allocation_candidates request. Nova does not need to understand or
  manipulate the actual resource request, but Nova needs to assign a unique
  granular resource request group suffix to each port's resource request.
* Nova selects one allocation candidate and claims the resources in Placement.
* Nova passes the RP UUID used to fulfill the port resource request to Neutron
  during port binding.

Scoping
-------

Due to the size and complexity of this feature the scope of the current spec
is limited. To keep backward compatibility while the feature is not fully
implemented, both new Neutron API extensions will be optional and turned off by
default. Nova will check for the extension that introduces the port's
resource_request field and fall back to the current resource handling behavior
if the extension is not loaded.

Out of scope from the Nova perspective:

* Supporting a separate proximity policy for the granular resource request
  groups created from the Neutron port's resource_request. Nova will use the
  policy defined in the flavor extra_spec for the whole request, as today such
  a policy is global for an allocation_candidates request.
* Handling Neutron mechanism driver preference order in a weigher in the
  nova-scheduler.
* Interface attach with a port or network having a QoS minimum bandwidth policy
  rule, as interface_attach does not call the scheduler today. Nova will reject
  an interface_attach request if the resource request of the port (passed in or
  created in the network that is passed in) is non empty.
* Server create with a network having a QoS minimum bandwidth policy rule, as a
  port in this network is created by the nova-compute *after* the scheduling
  decision. This spec proposes to fail such a boot in the compute-manager.
* QoS policy rule create or update on a bound port.
* QoS aware trunk subport create under a bound parent port.
* A baremetal port having a QoS bandwidth policy rule is out of scope, as
  Neutron does not own the networking devices on a baremetal compute node.

Scenarios
---------

This spec needs to consider multiple flows and scenarios detailed in the
following sections.

Neutron agent first start
~~~~~~~~~~~~~~~~~~~~~~~~~

The Neutron agent running on a given compute host uses the existing ``host``
neutron.conf variable to find the compute RP related to its host in Placement.
See `Finding the compute RP`_ for details and reasoning.

The Neutron agent creates the networking RPs under the compute RP with proper
traits, then reports resource inventories based on the discovered and / or
configured resource inventory of the compute host. See
`QoS minimum bandwidth allocation in Placement API`_ for details.

Neutron agent restart
~~~~~~~~~~~~~~~~~~~~~

During restart the Neutron agent ensures that the proper RP tree exists in
Placement with correct inventories and traits by creating / updating the RP
tree if necessary. The Neutron agent only modifies the inventory and traits of
the RPs that were created by the agent. Also, Neutron only modifies the pieces
that actually got added or deleted. Unmodified pieces should be left in place
(no delete and re-create).

Server create with pre-created Neutron ports having QoS policy rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The end user creates a Neutron port with the Neutron API and attaches a QoS
policy minimum bandwidth rule to it, either directly or indirectly by attaching
the rule to the network the port is created in. Then the end user creates a
server in Nova and passes in the port UUID in the server create request.

Nova fetches the port data from Neutron. This already happens in
create_pci_requests_for_sriov_ports in the current code base. The port contains
the requested resources and required traits. See
`Resource request in the port`_.

The create_pci_requests_for_sriov_ports() call needs to be refactored into a
more generic call that not only generates PCI requests but also collects the
requested resources from the Neutron ports.

The nova-api stores the requested resources and required traits in a new field
of the RequestSpec object called requested_resources. The new
`requested_resources` field should not be persisted in the api database, as
it is computed data based on the resource requests from different sources, in
this case from the Neutron ports, and the data in the port might change outside
of Nova.

The nova-scheduler uses this information from the RequestSpec to send an
allocation candidate request to Placement that contains the port related
resource requests besides the compute related resource requests. The requested
resources and required traits from each port will be considered to be
restricted to a single RP, with a separate, numbered request group as defined
in the `granular-resource-request`_ spec. This is necessary because mixing
requested resources and required traits from different ports (e.g. one OVS and
one SRIOV port) towards placement would cause an empty allocation candidate
response, as no RP will have both the OVS and the SRIOV traits at the same
time.
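
To illustrate the numbered group mapping described above, the sketch below
shows how the resource requests of two ports could be appended to the flavor's
request as separate numbered groups. The resource_request shapes, trait names
and the loop itself are illustrative assumptions, not the final implementation:

.. code-block:: python

    from urllib.parse import urlencode

    # Hypothetical resource_request payloads as Neutron could return them on
    # the port (shape assumed from the referenced Neutron spec).
    port_requests = [
        {'resources': {'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
                       'NET_BW_IGR_KILOBIT_PER_SEC': 1000},
         'required': ['CUSTOM_PHYSNET_PHYSNET0', 'CUSTOM_VNIC_TYPE_NORMAL']},
        {'resources': {'NET_BW_EGR_KILOBIT_PER_SEC': 2000},
         'required': ['CUSTOM_PHYSNET_PHYSNET1', 'CUSTOM_VNIC_TYPE_DIRECT']},
    ]

    # The unnumbered group comes from the flavor, as today.
    params = [('resources', 'VCPU:2,MEMORY_MB:4096,DISK_GB:20')]

    # Each port gets its own, so far unused, numbered group so that its
    # resources and traits stay together on a single RP.
    for index, request in enumerate(port_requests, start=1):
        resources = ','.join(
            '%s:%d' % (rc, amount)
            for rc, amount in sorted(request['resources'].items()))
        params.append(('resources%d' % index, resources))
        if request['required']:
            params.append(('required%d' % index, ','.join(request['required'])))

    print('GET /allocation_candidates?' + urlencode(params))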

Alternatively we could extend and use the requested_networks
(NetworkRequestList ovo) parameter of the build_instance code path to store and
communicate the resource needs coming from the Neutron ports. Then the
select_destinations() scheduler rpc call would need to be extended with a new
parameter holding the NetworkRequestList.

The `nova.scheduler.utils.resources_from_request_spec()` call needs to be
modified to use the newly introduced `requested_resources` field from the
RequestSpec object to generate the proper allocation candidate request.

Later on, the resource request in the Neutron port API can be evolved to
support the same level of granularity that the Nova flavor resource override
functionality supports.

Then Placement returns allocation candidates. After additional filtering and
weighing in the nova-scheduler, the scheduler claims the resources of the
selected candidate in a single transaction in Placement. The consumer_id of the
created allocations is the instance_uuid. See `The consumer of the port related
resources`_.

When multiple ports having QoS policy rules towards the same physical network
are attached to the server (e.g. two VFs on the same PF), the resulting
allocation is the sum of the resource amounts of each individual port request.

Delete a server with ports having QoS policy rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During normal delete, `local delete`_ and shelve_offload, Nova today deletes
the resource allocation in placement where the consumer_id is the
instance_uuid. As this allocation will include the port related resources,
those are also cleaned up.

Detach_interface with a port having QoS policy rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After the detach succeeds in Neutron and in the hypervisor, the nova-compute
needs to delete the allocation related to the detached port in Placement. The
rest of the server's allocation will not be changed.

Server move operations (cold migrate, evacuate, resize, live migrate)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During a move operation Nova makes an allocation on the destination host
with consumer_id == instance_uuid, while the allocation on the source host is
changed to have consumer_id == migration_uuid. These allocation sets will
contain the port related allocations as well. When the move operation succeeds,
Nova deletes the allocation towards the source host. If the move operation is
rolled back, Nova cleans up the allocations towards the destination host.

As the port related resource request is not persisted in the RequestSpec
object, Nova needs to re-calculate it from the ports' data before calling the
scheduler.

Move operations with the force host flag (evacuate, live-migrate) do not call
the scheduler. So to make sure that every case is handled, we have to go
through every direct or indirect call of the `reportclient.claim_resources()`
function and ensure that the port related resources are handled properly.
Today we `blindly copy the allocation from source host to destination host`_
by using the destination host as the RP. This will be a lot more complex, as
there will be more than one RP to be replaced, and Nova will have a hard time
figuring out which Network RP on the source host maps to which Network RP on
the destination host. A possible solution is to `send the move requests through
the scheduler`_ regardless of the force flag but skipping the scheduler
filters.

.. note::
    Server move operations with ports having a resource request are not
    supported in Stein.

Shelve_offload and unshelve
~~~~~~~~~~~~~~~~~~~~~~~~~~~

During shelve_offload Nova deletes the resource allocations, including the port
related resources, as those also have the same consumer_id, the instance uuid.
During unshelve a new scheduling is done in the same way as described in the
server create case.

.. note::
    Unshelve after shelve_offload operations with ports having a resource
    request are not supported in Stein.


Details
-------

Finding the compute RP
~~~~~~~~~~~~~~~~~~~~~~

Neutron already depends on the ``host`` conf variable being set to the same id
that Nova uses in the Neutron port binding. Nova uses the hostname in the port
binding. If ``host`` is not defined in the Neutron config then it defaults
to the hostname as well. This way Neutron and Nova are in sync today. The same
mechanism (i.e. the hostname) can be used in the Neutron agent to find the
compute RP created by Nova for the same compute host.

Having non fully qualified hostnames in a deployment can cause ambiguity. For
example, different cells might contain hosts with the same hostname. This
hostname ambiguity in a multicell deployment is already a problem without the
currently proposed feature, as Nova uses the hostname as the compute RP name in
Placement and the name field has a unique constraint in the Placement db model.
So in an ambiguous situation the Nova compute services having non unique
hostnames have already failed to create RPs in Placement.

The ambiguity can be fixed by enforcing that hostnames are FQDNs. However, as
this problem is not specific to the currently proposed feature, this fix is out
of scope of this spec. The `override-compute-node-uuid`_ blueprint describes a
possible solution.
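
A minimal sketch of that lookup, assuming the agent talks to placement over
HTTP with a pre-obtained token; the endpoint, token and microversion values
are placeholders and the agent's real code would use its configured clients:

.. code-block:: python

    import socket

    import requests

    PLACEMENT = 'https://placement.example.com'
    HEADERS = {'X-Auth-Token': 'gAAAA...',
               'OpenStack-API-Version': 'placement 1.30'}  # placeholder

    # The agent uses CONF.host when it is set, otherwise the hostname,
    # mirroring what is used in the port binding.
    host = socket.gethostname()

    resp = requests.get(PLACEMENT + '/resource_providers',
                        params={'name': host}, headers=HEADERS)
    resp.raise_for_status()
    providers = resp.json()['resource_providers']
    compute_rp_uuid = providers[0]['uuid'] if providers else None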

The consumer of the port related resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This spec proposes to use the instance_uuid as the consumer_id for the port
related resources as well.

During the server move operations Nova needs to handle two sets of allocations
for a single server (one for the source and one for the destination host). If
the consumer_id of the port related resources were the port_id, then during
move operations the two sets of allocations could not be distinguished,
especially in the case of a resize to the same host. Therefore the port_id is
not a good consumer_id.

Another possibility would be to use a UUID from the port binding as the
consumer_id, but the port binding does not have a UUID today. Also, today the
port binding is created after the allocations are made, which is too late.

In both cases, having multiple allocations for a single server on a single host
would make it complex to find every allocation for that server, both for Nova
and for the deployer using the Placement API.

Separating non QoS aware and QoS aware ports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If QoS aware and non QoS aware ports are mixed on the same physical port then
the minimum bandwidth rule cannot be fulfilled. The separation can be achieved
on at least two levels:

* Separating compute hosts via host aggregates. The deployer can create two
  host aggregates in Nova, one for QoS aware servers and another for non QoS
  aware servers. This separation can be done without changing either Nova or
  Neutron. This is the proposed solution for the first version of this feature.
* Separating physical ports via traits. The Neutron agent can put traits, like
  `CUSTOM_GUARANTEED_BW_ONLY` and `CUSTOM_BEST_EFFORT_BW_ONLY`, on the network
  RPs to indicate which physical port belongs to which group. Neutron can offer
  this configurability via neutron.conf. Then Neutron can add the
  `CUSTOM_GUARANTEED_BW_ONLY` trait to the resource request of a port that is
  QoS aware and add the `CUSTOM_BEST_EFFORT_BW_ONLY` trait otherwise. This
  solution would allow better granularity, as a server can request guaranteed
  bandwidth on its data port and can accept best effort connectivity on its
  control port. This solution needs additional work in Neutron but no
  additional work in Nova. It would also mean that ports without QoS policy
  rules would have at least a trait request (`CUSTOM_BEST_EFFORT_BW_ONLY`),
  and that would cause scheduling problems with a port created by the
  nova-compute. Therefore this option can only be supported
  `after nova port create is moved to the conductor`_.
* If we use \*_ONLY traits then we can never combine them, though that would be
  desirable. For example it makes perfect sense to guarantee 5 gigabits of a
  10 gigabit card to somebody and let the rest be used on a best effort
  basis. To allow this we only need to turn the logic around and use the traits
  CUSTOM_GUARANTEED_BW and CUSTOM_BEST_EFFORT_BW. If the admin still wants to
  keep guaranteed and best effort traffic fully separated then s/he never puts
  both traits on the same RP. But one can mix them if one wants to. Even the
  possible starvation of best effort traffic (next to guaranteed traffic) could
  be easily addressed by reserving some of the bandwidth inventory.

Alternatives
------------

Alternatives are discussed in their respective subsections of this spec.


Data model impact
-----------------

Two new standard Resource Classes will be defined to represent the bandwidth in
each direction, named `NET_BW_IGR_KILOBIT_PER_SEC` and
`NET_BW_EGR_KILOBIT_PER_SEC`. The kbps unit is selected because the
Neutron API already uses this unit in the `QoS minimum bandwidth rule`_ API and
we would like to keep the units in sync.
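
For illustration only (the actual inventory reporting is defined in the
referenced Neutron spec), a 10 Gbps physical NIC with nothing reserved might
be reported roughly as below; the generation value is a placeholder:

.. code-block:: python

    # Bandwidth is reported in kilobits per second to match the Neutron QoS
    # API. 10 Gbps = 10 * 1000 * 1000 kbps.
    NIC_CAPACITY_KBPS = 10 * 1000 * 1000

    inventories_payload = {
        # Generation of the network RP being updated (placeholder value).
        'resource_provider_generation': 1,
        'inventories': {
            'NET_BW_EGR_KILOBIT_PER_SEC': {
                'total': NIC_CAPACITY_KBPS,
                'reserved': 0,
                'min_unit': 1,
                'max_unit': NIC_CAPACITY_KBPS,
                'step_size': 1,
                'allocation_ratio': 1.0,
            },
            'NET_BW_IGR_KILOBIT_PER_SEC': {
                'total': NIC_CAPACITY_KBPS,
                'reserved': 0,
                'min_unit': 1,
                'max_unit': NIC_CAPACITY_KBPS,
                'step_size': 1,
                'allocation_ratio': 1.0,
            },
        },
    }
    # The agent would PUT this payload to
    # /resource_providers/{network_rp_uuid}/inventories in placement.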

A new `requested_resources` field is added to the RequestSpec versioned
object, with the ListOfObjectsField('RequestGroup') type, to store the resource
and trait requests coming from the Neutron ports. This field will not be
persisted in the database.

A new field ``requester_id`` is added to the InstancePCIRequest versioned
object to connect the PCI request created from a Neutron port to the resource
request created from the same Neutron port. Nova will populate this field with
the ``port_id`` of the Neutron port, and the ``requester_id`` field of the
RequestGroup versioned object will be used to match the PCI request with the
resource request.

The `QoS minimum bandwidth allocation in Placement API`_ Neutron spec will
propose the modeling of the Networking RP subtree in Placement. Nova will
not depend on the exact structure of that model, as Neutron will provide the
port's resource request in an opaque way and Nova will only need to blindly
include that resource request in the ``GET /allocation_candidates`` request.

Resource request in the port
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neutron needs to express the port's resource needs in the port API in a way
similar to how the resource request can be made via flavor extra_specs. For now
we assume that a single port requests resources from a single RP. Therefore
Nova will map each port's resource request to a single numbered resource
request group as defined in the `granular-resource-request`_ spec. That spec
requires that the names of the numbered resource groups have the form
`resources<integer>`. Nova will map a port's resource_request to the first
unused numbered group in the allocation_candidates request. Neutron does not
know which ports are used together in a server create request, nor which
numbered groups have already been used by the flavor extra_spec, therefore
Neutron cannot assign unique integer ids to the resource groups in these ports.

From an implementation perspective this means Nova will create one RequestGroup
instance for each Neutron port based on the port's resource_request and insert
it at the end of the list in `RequestSpec.requested_resources`.

In the case where the Neutron multi-provider extension is used and a logical
network maps to more than one physnet, the port's resource request will require
that the selected network RP has one of the physnet traits the network maps to.
This any-traits type of request is not supported by Placement today, but it can
be implemented similarly to the member_of query param used for aggregate
selection in Placement. This will be proposed in a separate spec,
`any-traits-in-allocation_candidates-query`_.

Mapping between physical resource consumption and claimed resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neutron must ensure that the resources allocated in Placement for a port are
the same as the resources consumed by that port from the physical
infrastructure. To be able to do that, Neutron needs to know the mapping
between a port's resource request and the specific RP (or RPs) in the
allocation record of the server that fulfill the request.

Nova will calculate which port is fulfilled by which RP, and the RP UUID will
be provided to Neutron during port binding.
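
A rough sketch of that calculation, assuming the per-port request groups and
one selected allocation candidate are at hand; the data shapes and the helper
are illustrative simplifications (the real mapping also has to consider traits
and groups sharing resource classes):

.. code-block:: python

    def map_ports_to_providers(request_groups, allocation_request):
        """Return {port_id: rp_uuid} for the selected allocation candidate.

        ``request_groups`` are the per-port groups (requester_id == port_id)
        and ``allocation_request`` is one entry from the placement response,
        i.e. a dict of rp_uuid -> {'resources': {...}}.
        """
        mapping = {}
        for group in request_groups:
            for rp_uuid, alloc in allocation_request['allocations'].items():
                resources = alloc['resources']
                # The RP satisfies the group if it allocates every requested
                # resource class of that group.
                if all(rc in resources for rc in group['resources']):
                    mapping[group['requester_id']] = rp_uuid
                    break
        return mapping


    # Illustrative input shapes only.
    groups = [{'requester_id': 'port-uuid-1',
               'resources': {'NET_BW_EGR_KILOBIT_PER_SEC': 1000}}]
    candidate = {'allocations': {
        'compute-rp-uuid': {'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
        'net-rp-uuid': {'resources': {'NET_BW_EGR_KILOBIT_PER_SEC': 1000}}}}
    print(map_ports_to_providers(groups, candidate))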

REST API impact
---------------

The Neutron REST API impact is discussed in the separate
`QoS minimum bandwidth allocation in Placement API`_ Neutron spec.

The Placement REST API needs to be extended to support querying allocation
candidates with an RP that has at least one of the traits from a list
of requested traits. This feature will be described in the separate
`any-traits-in-allocation_candidates-query`_ spec.

This feature also depends on the `granular-resource-request`_ and
`nested-resource-providers`_ features, which impact the Placement REST API.

A new microversion will be added to the Nova REST API to indicate that server
create supports ports with a resource request. Server operations
(e.g. create, interface_attach, move) involving ports having a resource request
will be rejected with an older microversion. However, server delete and port
detach will be supported with an old microversion for these servers too.

.. note::
    Server move operations are not supported in Stein even with the new
    microversion.


Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

* The Placement API will be used from Neutron to create RPs, and the compute
  RP tree will grow in size.

* Nova will send a more complex allocation candidate request to Placement, as
  it will include the port related resource requests as well.

* Nova will calculate the mapping between each port's resource request and the
  RP in the overall allocation that fulfills that request.

As Placement does not seem to be a bottleneck today, we do not foresee
performance degradation due to the above changes.

Other deployer impact
---------------------

This feature impacts multiple modules and creates new dependencies between
Nova, Neutron and Placement.

Also, the deployer should be aware that after this feature the server create
and move operations could fail due to bandwidth limits managed by Neutron.


Developer impact
----------------

None

Upgrade impact
--------------

Servers could exist today with SRIOV ports having a QoS minimum bandwidth
policy rule, and for them the resource allocation is not enforced in Placement
during scheduling. Upgrading to an OpenStack version that implements this
feature will make it possible to change the rule in Neutron to be placement
aware (i.e. request resources) and then (live) migrate the servers; during the
selection of the target of the migration the minimum bandwidth rule will be
enforced by the scheduler. Tools can also be provided to search for existing
instances and try to do the minimum bandwidth allocation in place. This way
the number of necessary migrations can be limited.

The end user will see a behavior change of the Nova API after such an upgrade:

* Booting a server with a network that has a QoS minimum bandwidth policy rule
  requesting bandwidth resources will fail. The current Neutron feature
  proposal introduces the possibility of a QoS policy rule to request
  resources, but in the first iteration Nova will only support such a rule on
  a pre-created port.
* Attaching a port or a network having a QoS minimum bandwidth policy rule
  requesting bandwidth resources to a running server will fail. The current
  Neutron feature proposal introduces the possibility of a QoS policy rule to
  request resources, but in the first iteration Nova will not support
  such a rule for interface_attach.

The new QoS rule API extension and the new port API extension in Neutron will
be marked experimental until the above two limitations are resolved.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

* balazs-gibizer (Balazs Gibizer)

Other contributors:

* xuhj (Alex Xu)
* minsel (Miguel Lavalle)
* bence-romsics (Bence Romsics)
* lajos-katona (Lajos Katona)

Work Items
----------

This spec does not list work items for the Neutron impact.

* Make RequestGroup an ovo and add the new `requested_resources` field to the
  RequestSpec. Then refactor `resources_from_request_spec()` to use the
  new field.

* Implement `any-traits-in-allocation_candidates-query`_ and
  `mixing-required-traits-with-any-traits`_ support in Placement.
  This work can be done in parallel with the work items below, as the
  any-traits type of query is only needed for a small subset of the use cases.

* Read the resource_request from the Neutron port in the nova-api and store
  the requests in the RequestSpec object.

* Include the port related resources in the allocation candidate request in
  nova-scheduler and nova-conductor and claim port related resources based
  on a selected candidate.

* Send the server's whole allocation to Neutron during port binding.

* Ensure that server move operations with the force flag handle port resources
  correctly by sending such operations through the scheduler.

* Delete the port related allocations from Placement after a successful
  interface detach operation.

* Reject an interface_attach request that contains a port or a network having
  a QoS policy rule attached that requests resources.

* Check in nova-compute whether a port created by the nova-compute during
  server boot has a non empty resource_request in the Neutron API and fail the
  boot if it has one.


Dependencies
============

* `any-traits-in-allocation_candidates-query`_ and
  `mixing-required-traits-with-any-traits`_ to support multi-provider
  networks. While these placement enhancements are not in place this feature
  will only support networks with a single network segment having a physnet
  defined.

* `nested-resource-providers`_ to allow modelling the networking RPs

* `granular-resource-request`_ to allow requesting each port related resource
  from a single RP

* `QoS minimum bandwidth allocation in Placement API`_ for the Neutron impacts

Testing
=======

Tempest tests as well as functional tests will be added to ensure that the
server create operation, server move operations, shelve_offload and unshelve,
and interface detach work with QoS aware ports and that the resource
allocation is correct.


Documentation Impact
====================

* User documentation about how to use QoS aware ports.


References
==========

* `nested-resource-providers`_ feature in Nova
* `granular-resource-request`_ feature in Nova
* `QoS minimum bandwidth allocation in Placement API`_ feature in Neutron
* `override-compute-node-uuid`_ proposal to avoid hostname ambiguity


.. _`nested-resource-providers`: https://review.openstack.org/556873
.. _`granular-resource-request`: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/granular-resource-requests.html
.. _`QoS minimum bandwidth allocation in Placement API`: https://review.openstack.org/#/c/508149
.. _`override-compute-node-uuid`: https://blueprints.launchpad.net/nova/+spec/override-compute-node-uuid
.. _`vnic_types are defined in the Neutron API`: https://developer.openstack.org/api-ref/network/v2/#show-port-details
.. _`blindly copy the allocation from source host to destination host`: https://github.com/openstack/nova/blob/9273b082026080122d104762ec04591c69f75a44/nova/scheduler/utils.py#L372
.. _`QoS minimum bandwidth rule`: https://docs.openstack.org/neutron/latest/admin/config-qos.html
.. _`any-traits-in-allocation_candidates-query`: https://blueprints.launchpad.net/nova/+spec/any-traits-in-allocation-candidates-query
.. _`mixing-required-traits-with-any-traits`: https://blueprints.launchpad.net/nova/+spec/mixing-required-traits-with-any-traits
.. _`local delete`: https://github.com/openstack/nova/blob/4b0d0ea9f18139d58103a520a6a4e9119e19a4de/nova/compute/api.py#L2023
.. _`send the move requests through the scheduler`: https://github.com/openstack/nova/blob/9273b082026080122d104762ec04591c69f75a44/nova/scheduler/utils.py#L339
.. _`after nova port create is moved to the conductor`: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/prep-for-network-aware-scheduling-pike.html

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced
   * - Rocky
     - Reworked after several discussions
   * - Stein
     - * Re-proposed as implementation hasn't been finished in Rocky
       * Updated based on what was implemented in Stein
@@ -0,0 +1,197 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

======================================
Boot instance specific storage backend
======================================

https://blueprints.launchpad.net/nova/+spec/boot-instance-specific-storage-backend

This blueprint proposes adding support for specifying ``volume_type`` when
booting instances.

Problem description
===================
Currently, when creating a new boot-from-volume instance, the user can only
control the type of the volume by pre-creating a bootable image-backed volume
with the desired type in cinder and providing it to nova during the boot
process. When there are multiple storage backends in the environment, this is
not friendly to a user who wants to boot the instance on a specific backend.

Use Cases
---------
As a user, I would like to specify the volume type for my instances when I
boot them, especially when I am doing a bulk boot. "Bulk boot" here means
creating multiple servers in separate requests but at the same time.

Also, if the user wants to boot an instance on a different storage backend,
they only need to specify a different backend and send the create request
again.

Proposed change
===============
Add a new microversion to the servers create API to support specifying a
volume type when booting instances.

This would only apply to BDMs with a ``source_type`` of blank, image or
snapshot. The ``volume_type`` will be passed from ``nova-api`` through to
``nova-compute`` (via the BlockDeviceMapping object), where the volume gets
created and then attached to the new server.

The ``nova-api`` service will validate that the requested ``volume_type``
actually exists in cinder so we can fail fast if it does not or the user does
not have access to it.
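
A minimal sketch of that fail-fast check, using a direct call to the cinder
volume types API; the endpoint, project id and token are placeholders, and the
real implementation would go through nova's cinder client wrapper:

.. code-block:: python

    import requests

    # Illustrative endpoint and token only.
    CINDER = 'https://cinder.example.com/v3/4dfdcf55c6f64d66b8acf1f4bdf9f2f1'
    HEADERS = {'X-Auth-Token': 'gAAAA...'}


    def volume_type_is_usable(requested_type):
        """Check that the requested volume type is visible to the user."""
        resp = requests.get(CINDER + '/types', headers=HEADERS)
        resp.raise_for_status()
        names = {vt['name'] for vt in resp.json()['volume_types']}
        return requested_type in names


    if not volume_type_is_usable('lvm_volume_type'):
        # nova-api would translate this into a 400 BadRequest response.
        raise ValueError('volume type is not available')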

Alternatives
------------
You can also combine cinder and nova to do this:

* Create the volume with the non-default type in cinder and then provide the
  volume to nova when creating the server.

Data model impact
-----------------
We'll have to store the ``volume_type`` on the BlockDeviceMapping object
(and the block_device_mapping table in the DB).

REST API impact
---------------
* URL:
    * /v2.1/servers:

* Request method:
    * POST

The volume_type data can be added to the request payload ::

    {
        "server" : {
            "name" : "device-tagging-server",
            "flavorRef" : "http://openstack.example.com/flavors/1",
            "networks" : [{
                "uuid" : "ff608d40-75e9-48cb-b745-77bb55b5eaf2",
                "tag": "nic1"
            }],
            "block_device_mapping_v2": [{
                "uuid": "70a599e0-31e7-49b7-b260-868f441e862b",
                "source_type": "image",
                "destination_type": "volume",
                "boot_index": 0,
                "volume_size": "1",
                "tag": "disk1",
                "volume_type": "lvm_volume_type"
            }]
        }
    }

Security impact
---------------
None

Notifications impact
--------------------
Add the ``volume_type`` field to the BlockDevicePayload object.

Other end user impact
---------------------
The python-novaclient and python-openstackclient will be updated.

When we snapshot a volume-backed server, the block_device_mapping_v2 image
metadata will include the volume_type from the BDM record, so if the user then
creates another server from that snapshot, the volume that nova creates from
that snapshot will use the same volume_type. If a user wishes to change that
volume type in the image metadata, they can via the image API. For more
information on image-defined BDMs, see [1]_ and [2]_.

Performance Impact
------------------
None

Other deployer impact
---------------------
None

Developer impact
----------------
None

Upgrade impact
--------------
To support rolling upgrades, the API will have to determine if the minimum
``nova-compute`` service version across the deployment (all cells) is
high enough to support user-specified volume types. If ``volume_type`` is
specified but the deployment is not new enough to handle it, a 409 error will
be returned to the user.
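
A sketch of the usual pattern for such a check, assuming a minimum compute
service version constant for this feature; the constant value and the
exception name are illustrative placeholders:

.. code-block:: python

    from nova import exception
    from nova import objects

    # Hypothetical service version that first understands BDM volume_type;
    # the real value is assigned when the feature merges.
    MIN_COMPUTE_VOLUME_TYPE = 35


    def _check_compute_supports_volume_type(context):
        min_version = objects.service.get_minimum_version_all_cells(
            context, ['nova-compute'])
        if min_version < MIN_COMPUTE_VOLUME_TYPE:
            # The API layer maps this to a 409 Conflict for the user; the
            # exception class name here is illustrative.
            raise exception.VolumeTypeSupportNotYetAvailable()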

Implementation
==============
Assignee(s)
-----------
Primary assignee:
  Brin Zhang

Work Items
----------
* Add ``volume_type`` support in the compute API
* Add related tests

Dependencies
============
None

Testing
=======
* Add Tempest integration tests for scenarios like:

  * Boot from volume with a non-default volume type.
  * Snapshot a volume-backed instance and assert the non-default volume
    type is stored in the image snapshot metadata.

* Add related unit tests for negative scenarios like:

  * Attempting to boot from volume with a specific volume type before the
    new microversion.
  * Attempting to boot from volume with a volume type that does not exist
    and/or the user does not have access to.
  * Attempting to boot from volume with a volume type with old computes that
    do not yet support volume type.

* Add related functional tests for positive scenarios:

  * The functional API samples tests will cover the positive scenario for
    boot from volume with a specific volume type when all computes in all
    cells are running the latest code.

Documentation Impact
====================
Add docs that mention that the volume type can be specified when booting
instances after the new microversion.

References
==========
For a discussion of this feature, please refer to:

* https://etherpad.openstack.org/p/nova-ptg-stein

  Stein PTG etherpad, discussion on or around line 496.

* http://lists.openstack.org/pipermail/openstack-dev/2018-July/132052.html

  Matt Riedemann's recap email to the dev list on the Stein PTG, about halfway
  down.

.. [1] https://docs.openstack.org/nova/latest/user/block-device-mapping.html
.. [2] https://github.com/openstack/tempest/blob/3674fb138/tempest/scenario/test_volume_boot_pattern.py#L210

History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced

194 specs/stein/implemented/conf-max-attach-volumes.rst (new file)
@@ -0,0 +1,194 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================
Configure maximum number of volumes to attach
=============================================

https://blueprints.launchpad.net/nova/+spec/conf-max-attach-volumes

Currently, there is a limitation in the libvirt driver restricting the maximum
number of volumes to attach to a single instance to 26. Depending on the virt
driver and operator environment, operators would like to be able to attach
more than 26 volumes to a single instance. We propose adding a configuration
option that operators can use to select the maximum number of volumes allowed
to attach to a single instance.


Problem description
===================

We've had customers ask for the ability to attach more than 26 volumes to a
single instance and we've seen launchpad bugs opened by users trying to
attach more than 26 volumes (see `References`_). Because the supportability of
any number of volumes depends heavily on which virt driver is being used and
the operator's particular environment, we propose to make the maximum
configurable by operators. Choosing an appropriate maximum number will require
tuning with the specific virt driver and deployed environment, so we expect
operators to set the maximum, test, tune, and adjust the configuration option
until the maximum is working well in their environment.

Use Cases
---------

* Operators wish to be able to attach a maximum number of volumes to a single
  instance, with the ability to choose a maximum well-tuned for their
  environments.

Proposed change
===============

When a user attempts to attach more than 26 volumes with the libvirt driver,
the attach fails in the ``reserve_block_device_name`` method in nova-compute,
which is eventually called by the ``attach_volume`` method in nova-api. The
``reserve_block_device_name`` method calls
``self.driver.get_device_name_for_instance`` to get the next available device
name for attaching the volume. If the driver has implemented the method, this
is where an attempt to go beyond the maximum allowed number of volumes to
attach will fail. The libvirt driver fails after 26 volumes have been
attached. Drivers that have not implemented ``get_device_name_for_instance``
appear to have no limit on the maximum number of volumes. The default
implementation of ``get_device_name_for_instance`` is located in the
``nova.compute.utils`` module. Only the libvirt driver has provided its own
implementation of ``get_device_name_for_instance``.

The ``reserve_block_device_name`` method is a synchronous RPC call (not cast).
This means we can have the configured allowed maximum set differently per
nova-compute and still fail fast in the API if the maximum has been exceeded.

We propose to add a new configuration option
``[compute]max_volumes_to_attach``, an IntOpt, to configure the maximum number
of volumes allowed to attach to a single instance per nova-compute. This way,
operators can set it appropriately depending on what virt driver they are
running and what their deployed environment is like. The default will be
unlimited (-1) to keep the current behavior for all drivers except the libvirt
driver.

The configuration option will be enforced in the
``get_device_name_for_instance`` methods, using the count of the number of
already attached volumes. Upon failure, an exception will be propagated to
nova-api via the synchronous RPC call to nova-compute, and the user will
receive a 403 error (as opposed to the current 500 error).
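
A minimal sketch of the proposed option and check, assuming the option lands
in the existing ``compute`` group; the enforcement hook point and the
stand-in exception are illustrative:

.. code-block:: python

    from oslo_config import cfg

    compute_opts = [
        cfg.IntOpt('max_volumes_to_attach',
                   default=-1,
                   min=-1,
                   help='Maximum number of volumes allowed to attach to a '
                        'single instance on this compute host. -1 means '
                        'unlimited.'),
    ]
    cfg.CONF.register_opts(compute_opts, group='compute')


    def check_volume_limit(attached_bdms):
        """Enforce the limit before picking the next device name."""
        limit = cfg.CONF.compute.max_volumes_to_attach
        if limit != -1 and len(attached_bdms) >= limit:
            # nova would raise a specific exception here that the API layer
            # translates into a 403 for the user; ValueError is a stand-in.
            raise ValueError('Too many volumes attached: limit is %d' % limit)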
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Other ways we could solve this include: choosing a new hard-coded maximum only
|
||||
for the libvirt driver or creating a new quota limit for "maximum volumes
|
||||
allowed to attach" (see the ML thread in `References`_).
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Deployers will be able to set the ``[compute]max_volumes_to_attach``
|
||||
configuration option to control how many volumes are allowed to be attached
|
||||
to a single instance per nova-compute in their deployment.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Upgrade impact
|
||||
--------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
melwitt
|
||||
|
||||
Other contributors:
|
||||
yukari-papa
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Add a new configuration option ``[compute]max_volumes_to_attach``, IntOpt
|
||||
* Modify (or remove) the libvirt driver's implementation of the
|
||||
``get_device_name_for_instance`` method to accommodate more than 26 volumes
|
||||
* Add enforcement of ``[compute]max_volumes_to_attach`` to the
|
||||
``get_device_name_for_instance`` methods
|
||||
* Add handling of the raised exception in the API to translate to a 403 to the
|
||||
user, if the maximum number of allowed volumes is exceeded
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The new functionality will be tested by new unit and functional tests.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The documentation for the new configuration option will be automatically
|
||||
included in generated documentation of the configuration reference.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* https://bugs.launchpad.net/nova/+bug/1770527
|
||||
|
||||
* https://bugs.launchpad.net/nova/+bug/1773941
|
||||
|
||||
* http://lists.openstack.org/pipermail/openstack-dev/2018-June/131289.html
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
Optional section intended to be used each time the spec is updated to describe
new design, API or database schema changes. Useful to let readers understand
what has happened over time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Stein
|
||||
- Introduced
|
||||
@@ -0,0 +1,350 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
======================================
|
||||
Expose virtual device tags in REST API
|
||||
======================================
|
||||
|
||||
https://blueprints.launchpad.net/nova/+spec/expose-virtual-device-tags-in-rest-api
|
||||
|
||||
The 2.32 microversion allows creating a server with tagged block devices
|
||||
and virtual interfaces (ports) and the 2.49 microversion allows specifying
|
||||
a tag when attaching volumes and ports, but there is no way to get those
|
||||
tags out of the REST API. This spec proposes to expose the block device
|
||||
mapping and virtual interface tags in the REST API when listing volume
|
||||
attachments and ports for a given server.
|
||||
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
It is possible to attach volumes and ports to a server with tags but there
|
||||
is nothing in the REST API that allows a user to read those back. The virtual
|
||||
device tags are available *within* the guest VM via the config drive or
|
||||
metadata API service, but not the front-end compute REST API.
|
||||
|
||||
Furthermore, correlating volume attachments to BlockDeviceMapping objects
|
||||
via tag has come up in the `Remove device from volume attach requests`_ spec
|
||||
as a way to replace reliance on the otherwise unused ``device`` field to
|
||||
determine ordering of block devices within the guest.
|
||||
|
||||
Using volume attachment tags was also an option discussed in the
|
||||
`Detach and attach boot volumes`_ spec as a way to indicate which volume
|
||||
was the root volume attached to the server without relying on the
|
||||
server ``OS-EXT-SRV-ATTR:root_device_name`` field.
|
||||
|
||||
.. _Remove device from volume attach requests: https://review.openstack.org/452546/
.. _Detach and attach boot volumes: https://review.openstack.org/600628/
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
As a user, I want to correlate information, based on tags, to the volumes and
|
||||
ports attached to my server.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
In a new microversion, expose virtual device tags in the REST API response
|
||||
when showing volume attachments and attached ports.
|
||||
|
||||
See the `REST API impact`_ section for details on route and response changes.
|
||||
|
||||
**Technical / performance considerations**
|
||||
|
||||
When showing attached volume tags, there would really be no additional effort
|
||||
in exposing the tag since we already query the database for a
|
||||
BlockDeviceMappingList per instance.
|
||||
|
||||
However, the ``os-interface`` port tags present a different challenge. The
|
||||
``GET /servers/{server_id}/os-interface`` and
|
||||
``GET /servers/{server_id}/os-interface/{port_id}`` APIs are today simply
|
||||
proxies to the neutron networking APIs to list ports attached to an instance
|
||||
and show details about a port attached to an instance, respectively.
|
||||
|
||||
The device tag for a port attached to an instance is not stored in neutron,
|
||||
it is stored in the nova cell database ``virtual_interfaces`` table. So the
|
||||
problem we have is one of performance when listing ports attached to a server
|
||||
and we want to show tags because we would have to query both the neutron API
|
||||
to list ports and then the ``virtual_interfaces`` table for the instance to
|
||||
get the tags. We have two options:
|
||||
|
||||
1. Accept that when listing ports for a single instance, doing one more DB
|
||||
query to get the tags is not that big of an issue.
|
||||
|
||||
2. Rather than proxy the calls to neutron, we could rely on the instance
|
||||
network info cache to provide the details. That might be OK except we
|
||||
currently do not store the tags in the info cache, so for any existing
|
||||
instance the tags would not be shown, unless we did a DB query to look
|
||||
for them and heal the cache.
|
||||
|
||||
Given the complications with option 2, this spec will just use option 1.
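
For illustration, a rough sketch of option 1 follows, assuming a small helper
in the ``os-interface`` controller (the helper name is illustrative):

.. code-block:: python

  # Sketch only: merges device tags from the cell DB into the neutron ports.
  from nova import objects


  def _add_device_tags(context, instance_uuid, ports):
      """Attach the device tag (or None) to each neutron port dict."""
      vifs = objects.VirtualInterfaceList.get_by_instance_uuid(
          context, instance_uuid)
      tags_by_port_uuid = {vif.uuid: vif.tag for vif in vifs}
      for port in ports:
          # Ports without a virtual_interfaces record simply report None.
          port['tag'] = tags_by_port_uuid.get(port['port_id'])
      return ports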
|
||||
|
||||
**Non-volume BDMs**
|
||||
|
||||
It should be noted that swap and ephemeral block devices can also have
|
||||
tags when the server is created, however there is nothing in the API
|
||||
today which exposes those types of BDMs; the API only exposes volume BDMs.
|
||||
As such, this spec does not intend to expose non-volume block device mapping
|
||||
tags. It is possible that in the future if a kind of
|
||||
``GET /servers/{server_id}/disks`` API were added we could expose swap and
|
||||
ephemeral block devices along with their tags, but that is out of scope
|
||||
for this spec.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
In addition to showing the tags in the ``os-volume_attachments`` and
|
||||
``os-interface`` APIs, we could also modify the server show/list view builder
|
||||
to provide tags in the server resource ``os-extended-volumes:volumes_attached``
|
||||
and ``addresses`` fields. This would be trivial to do for showing attached
|
||||
volume tags since we already query the DB per instance to get the BDMs, but as
|
||||
noted in the `Proposed change`_ section, it would be non-trivial for port tags
|
||||
since those are not currently stored in the instance network info cache which
|
||||
is used to build the ``addresses`` field response value. And it would be odd
|
||||
to show the attached volume tags in the server response but not the virtual
|
||||
interface tags. We could heal the network info cache over time, but that
|
||||
seems unnecessarily complicated when the proposed change already provides a
|
||||
way to get the tag information for all volumes/ports attached to a given
|
||||
server resource.
|
||||
|
||||
We could also take this opportunity to expose other fields on the
|
||||
BlockDeviceMapping which are inputs when creating a server, like
|
||||
``boot_index``, ``volume_type``, ``source_type``, ``destination_type``,
|
||||
``guest_format``, etc. For simplicity, that is omitted from the proposed
|
||||
change since it's simpler to just focus on tag exposure for multiple types
|
||||
of resources.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
There are two API resource routes which would be changed. In all cases,
|
||||
if the block device mapping or virtual interface record does not have a tag
|
||||
specified, the response value for the ``tag`` key will be ``None``.
|
||||
|
||||
os-volume_attachments
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A ``tag`` field will be added to the response for each of the following APIs.
|
||||
|
||||
* ``GET /servers/{server_id}/os-volume_attachments (list)``
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"volumeAttachments": [{
|
||||
"device": "/dev/sdd",
|
||||
"id": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
|
||||
"serverId": "4d8c3732-a248-40ed-bebc-539a6ffd25c0",
|
||||
"volumeId": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
|
||||
"tag": "os"
|
||||
}]
|
||||
}
|
||||
|
||||
* ``GET /servers/{server_id}/os-volume_attachments/{volume_id} (show)``
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"volumeAttachment": {
|
||||
"device": "/dev/sdd",
|
||||
"id": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
|
||||
"serverId": "2390fb4d-1693-45d7-b309-e29c4af16538",
|
||||
"volumeId": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
|
||||
"tag": "os"
|
||||
}
|
||||
}
|
||||
|
||||
* ``POST /servers/{server_id}/os-volume_attachments (attach)``
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"volumeAttachment": {
|
||||
"device": "/dev/vdb",
|
||||
"id": "c996dd74-44a0-4fd1-a582-a14a4007cc94",
|
||||
"serverId": "2390fb4d-1693-45d7-b309-e29c4af16538",
|
||||
"volumeId": "c996dd74-44a0-4fd1-a582-a14a4007cc94",
|
||||
"tag": "data"
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
os-interface
|
||||
~~~~~~~~~~~~
|
||||
|
||||
A ``tag`` field will be added to the response for each of the following APIs.
|
||||
|
||||
* ``GET /servers/{server_id}/os-interface (list)``
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"interfaceAttachments": [{
|
||||
"fixed_ips": [{
|
||||
"ip_address": "192.168.1.3",
|
||||
"subnet_id": "f8a6e8f8-c2ec-497c-9f23-da9616de54ef"
|
||||
}],
|
||||
"mac_addr": "fa:16:3e:4c:2c:30",
|
||||
"net_id": "3cb9bc59-5699-4588-a4b1-b87f96708bc6",
|
||||
"port_id": "ce531f90-199f-48c0-816c-13e38010b442",
|
||||
"port_state": "ACTIVE",
|
||||
"tag": "public"
|
||||
}]
|
||||
}
|
||||
|
||||
* ``GET /servers/{server_id}/os-interface/{port_id} (show)``
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"interfaceAttachment": {
|
||||
"fixed_ips": [{
|
||||
"ip_address": "192.168.1.3",
|
||||
"subnet_id": "f8a6e8f8-c2ec-497c-9f23-da9616de54ef"
|
||||
}],
|
||||
"mac_addr": "fa:16:3e:4c:2c:30",
|
||||
"net_id": "3cb9bc59-5699-4588-a4b1-b87f96708bc6",
|
||||
"port_id": "ce531f90-199f-48c0-816c-13e38010b442",
|
||||
"port_state": "ACTIVE",
|
||||
"tag": "public"
|
||||
}
|
||||
}
|
||||
|
||||
* ``POST /servers/{server_id}/os-interface (attach)``
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"interfaceAttachment": {
|
||||
"fixed_ips": [{
|
||||
"ip_address": "192.168.1.4",
|
||||
"subnet_id": "f8a6e8f8-c2ec-497c-9f23-da9616de54ef"
|
||||
}],
|
||||
"mac_addr": "fa:16:3e:4c:2c:31",
|
||||
"net_id": "3cb9bc59-5699-4588-a4b1-b87f96708bc6",
|
||||
"port_id": "ce531f90-199f-48c0-816c-13e38010b443",
|
||||
"port_state": "ACTIVE",
|
||||
"tag": "management"
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
The ``BlockDevicePayload`` object already exposes BDM tags for
|
||||
versioned notifications. The ``IpPayload`` object does not expose tags
|
||||
since they are not in the instance network info cache, but these payloads
|
||||
are only exposed via the ``InstancePayload`` and like the ``servers`` API
|
||||
we will not make additional changes to try and show the tags for the resources
|
||||
nested within the server (InstancePayload) body. This could be done in the
|
||||
future if desired, potentially with a configuration option like
|
||||
``[notifications]/bdms_in_notifications``, but it is out of scope for this
|
||||
spec.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
python-novaclient and python-openstackclient will be updated as necessary
|
||||
to support the new microversion. This likely just means adding a new ``Tag``
|
||||
column in CLI output when listing attached volumes and ports.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
There will be a new database query to the ``virtual_interfaces`` table when
|
||||
showing device tags for ports attached to a server. This should have a minimal
|
||||
impact to API response times though.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Upgrade impact
|
||||
--------------
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Matt Riedemann <mriedem.os@gmail.com> (mriedem)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Implement a new microversion and use that to determine if a new ``tag``
|
||||
field should be in the ``os-volume_attachments`` and ``os-interface`` API
|
||||
responses when listing/showing/attaching volumes/ports to a server.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Functional API samples tests should be sufficient coverage of this feature.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The compute API reference will be updated to note the ``tag`` field in the
|
||||
response for the ``os-volume_attachments`` and ``os-interface`` APIs.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
This was originally discussed at the `Ocata summit`_. It came up again at the
|
||||
`Rocky PTG`_.
|
||||
|
||||
Related specs:
|
||||
|
||||
* Remove ``device`` from volume attach API: https://review.openstack.org/452546/
|
||||
* Detach/attach boot volume: https://review.openstack.org/600628/
|
||||
|
||||
.. _Ocata summit: https://etherpad.openstack.org/p/ocata-nova-summit-api
|
||||
.. _Rocky PTG: https://etherpad.openstack.org/p/nova-ptg-rocky
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Pike
|
||||
- Originally proposed but abandoned
|
||||
* - Stein
|
||||
- Re-proposed
|
||||
@@ -0,0 +1,156 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=================================================
|
||||
Flavor Extra Spec and Image Properties Validation
|
||||
=================================================
|
||||
|
||||
https://blueprints.launchpad.net/nova/+spec/flavor-extra-spec-image-property-validation
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Currently, flavor extra-specs and image properties validation is done in
|
||||
separate places. If they are not compatible, the instance may fail to launch
|
||||
and go into an ERROR state, or may reschedule an unknown number of times
|
||||
depending on the virt driver behaviour.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
As an end user, I would like instant feedback if the flavor extra specs or
image properties are invalid or incompatible with each other, so that I can
correct my configuration and retry the operation.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
We want to validate the combination of the flavor extra-specs and image
|
||||
properties as early as possible once they're both known.
|
||||
|
||||
If validation fails, an error is returned synchronously to the user.
|
||||
|
||||
We'd need to do this anywhere the flavor or image changes, so basically
|
||||
instance creation, rebuild, and resize. More precisely, rename
|
||||
_check_requested_image() to something more generic, take it out of
|
||||
_checks_for_create_and_rebuild(), modify it to check more things and call it
|
||||
from all three operations: creation, rebuild, and resize.
|
||||
|
||||
.. note:: Only things that are not virt driver specific are validated.
|
||||
|
||||
Examples of validations to be added [1]_:
|
||||
|
||||
* Call hardware.numa_get_constraints to validate all the various numa-related
|
||||
things. This is currently done only on _create_instance(), should be done for
|
||||
resize/rebuild as well.
|
||||
* Ensure that the cpu policy, cpu thread policy and emulator thread policy
|
||||
values are valid.
|
||||
* Validate the realtime mask.
|
||||
* Validate the number of serial ports.
|
||||
* Validate the cpu topology constraints.
|
||||
* Validate the ``quota:*`` settings (that are not virt driver specific) in the
|
||||
flavor.
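
The following sketch shows the general shape this could take in
``nova/compute/api.py``; the helper name is illustrative and only the
NUMA-related check is spelled out:

.. code-block:: python

  # Sketch only: a generic check called from create, rebuild and resize.
  from nova import exception
  from nova.virt import hardware


  def _validate_flavor_image_compatibility(flavor, image_meta):
      """Fail fast if flavor extra specs and image properties clash."""
      try:
          # Covers the CPU policy, thread policy, realtime mask, CPU topology
          # and the other NUMA-related extra specs / image properties.
          hardware.numa_get_constraints(flavor, image_meta)
      except exception.Invalid as exc:
          # Surfaces to the user as a 4xx instead of an ERROR'd instance.
          raise exception.InvalidInput(reason=str(exc))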
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
Due to the new validations, users could face more 4xx errors for more cases
|
||||
than before in create/rebuild/resize operations.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Negligible.
|
||||
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Upgrade impact
|
||||
--------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
jackding
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Add validations mostly in nova/compute/api.py.
|
||||
* Add/update unit tests.
|
||||
* Update documentation/release-note if necessary depending on the new
|
||||
validations added.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Will add unit tests.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] https://docs.openstack.org/nova/latest/user/flavors.html
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Stein
|
||||
- Introduced
|
||||
465
specs/stein/implemented/generic-os-vif-offloads.rst
Normal file
@@ -0,0 +1,465 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================================
|
||||
Generic os-vif datapath offloads
|
||||
================================
|
||||
|
||||
https://blueprints.launchpad.net/nova/+spec/generic-os-vif-offloads
|
||||
|
||||
The existing method in os-vif is to pass datapath offload metadata via a
|
||||
``VIFPortProfileOVSRepresentor`` port profile object. This is currently used by
|
||||
the ``ovs`` reference plugin and the external ``agilio_ovs`` plugin. This spec
|
||||
proposes a refactor of the interface to support more VIF types and offload
|
||||
modes.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Background on Offloads
|
||||
----------------------
|
||||
|
||||
While composing this spec, it became clear that the "offloads" term had
|
||||
historical meaning that caused confusion about the scope of this spec. This
|
||||
subsection was added in order to clarify the distinctions between different
|
||||
classes of offloads.
|
||||
|
||||
Protocol Offloads
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
Network-specific computation being handled by dedicated peripherals is well
|
||||
established on many platforms. For Linux, the `ethtool man page`_ details a
|
||||
number of settings for the ``--offload`` option that are available on many
|
||||
NICs, for specific protocols.
|
||||
|
||||
``ethtool`` type offloads typically:
|
||||
|
||||
#. are available to guests (and hosts),
|
||||
#. have a strong relationship with a network endpoint,
|
||||
#. have a role with generating and consuming packets,
|
||||
#. can be modeled as capabilities of the virtual NIC on the instance.
|
||||
|
||||
Currently, Nova has little modelling for these types of offload capabilities.
|
||||
Ensuring that instances can live migrate to a compute node capable of
|
||||
providing the required features is not something Nova can currently determine
|
||||
ahead of time.
|
||||
|
||||
This spec only touches lightly on this class of offloads.
|
||||
|
||||
Datapath Offloads
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
Relatively recently, SmartNICs emerged that allow complex packet processing on
|
||||
the NIC. This allows the implementation of constructs like bridges and routers
|
||||
under control of the host. In contrast with protocol offloads, these offloads
|
||||
apply to the dataplane.
|
||||
|
||||
In Open vSwitch, the dataplane can be implemented by, for example, the kernel
|
||||
datapath (the ``openvswitch.ko`` module), the userspace datapath, or the
|
||||
``tc-flower`` classifier. In turn, portions of the ``tc-flower`` classifier can
|
||||
be delegated to a SmartNIC as described in this `TC Flower Offload paper`_.
|
||||
|
||||
.. note:: Open vSwitch refers to specific implementations of its packet
|
||||
processing pipeline as datapaths, not dataplanes. This spec follows
|
||||
the datapath terminology.
|
||||
|
||||
Datapath offloads typically have the following characteristics:
|
||||
|
||||
#. The interfaces controlling and managing these offloads are under host
|
||||
control.
|
||||
#. Network-level operations such as routing, tunneling, NAT and firewalling can
|
||||
be described.
|
||||
#. A special plugging mode could be required, since the packets might bypass
|
||||
the host hypervisor entirely.
|
||||
|
||||
The simplest case of this is an SR-IOV NIC in Virtual Ethernet Bridge (VEB)
|
||||
mode, as used by the ``sriovnicswitch`` Neutron driver. A special plugging mode
|
||||
is necessary, (namely IOMMU PCI passthrough), and the hypervisor configures the
|
||||
VEB with the required MAC ACL filters.
|
||||
|
||||
This spec focuses on this class of offloads.
|
||||
|
||||
Hybrid Offloads
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
In future, it might be possible to push out datapath offloads as a service to
|
||||
guest instances. In particular, trusted NFV instances might gain access to
|
||||
sections of the packet processing pipeline, with various levels of isolation
|
||||
and composition. This spec does not target this use case.
|
||||
|
||||
Core Problem Statement
|
||||
----------------------
|
||||
|
||||
In order to support hardware acceleration for datapath offloads, Nova
|
||||
core and os-vif need to model the datapath offload plugging metadata. The
|
||||
existing method in os-vif is to pass this via a
|
||||
``VIFPortProfileOVSRepresentor`` port profile object. This is used by the
|
||||
``ovs`` reference plugin and the external ``agilio_ovs`` plugin.
|
||||
|
||||
With ``vrouter`` being a potential third user of such metadata (proposed in the
|
||||
`blueprint for vrouter hardware offloads`_), it's worthwhile to abstract the
|
||||
interface before the pattern solidifies further.
|
||||
|
||||
This spec is limited to refactoring the interface, with future expansion in
|
||||
mind, while allowing existing plugins to remain functional.
|
||||
|
||||
SmartNICs are able to route packets directly to individual SR-IOV Virtual
|
||||
Functions. These can be connected to instances using IOMMU (vfio-pci
|
||||
passthrough) or a low-latency vhost-user `virtio-forwarder`_ running on the
|
||||
compute node.
|
||||
|
||||
In Nova, a VIF should fully describe how an instance is plugged into the
|
||||
datapath. This includes information for the hypervisor to perform the required
|
||||
plugging, and also info for the datapath control software. For the ``ovs`` VIF,
|
||||
the hypervisor is generally able to also perform the datapath control, but this
|
||||
is not the case for every VIF type (hence the existence of os-vif).
|
||||
|
||||
The VNIC type is a property of a VIF. It has taken on the semantics of
|
||||
describing a specific "plugging mode" for the VIF. In the Nova network API,
|
||||
there is a `list of VNIC types that will trigger a PCI request`_, if Neutron
|
||||
has passed a VIF to Nova with one of those VNIC types set. `Open vSwitch
|
||||
offloads`_ uses the following VNIC types to distinguish between offloaded
|
||||
modes:
|
||||
|
||||
* The ``normal`` (or default) VNIC type indicates that the Instance is plugged
|
||||
into the software bridge.
|
||||
* The ``direct`` VNIC type indicates that a VF is passed through to the
|
||||
Instance.
|
||||
|
||||
In addition, the Agilio OVS VIF type implements the following offload mode:
|
||||
|
||||
* The ``virtio-forwarder`` VNIC type indicates that a VF is attached via a
|
||||
`virtio-forwarder`_.
|
||||
|
||||
Currently, os-vif and Nova implement `switchdev SR-IOV offloads`_ for Open
|
||||
vSwitch with ``tc-flower`` offloads. In this model, a representor netdev on the
|
||||
host is associated with each Virtual Function. This representor functions like
|
||||
a handle for the corresponding virtual port on the NIC's packet processing
|
||||
pipeline.
|
||||
|
||||
Nova passes the PCI address it received from the PCI request to the os-vif
|
||||
plugin. Optionally, a netdev name can also be passed to allow for friendly
|
||||
renaming of the representor by the os-vif plugin.
|
||||
|
||||
The ``ovs`` and ``agilio_ovs`` os-vif plugins then look up the associated
|
||||
representor for the VF and perform the datapath plugging. From Nova's
|
||||
perspective the hypervisor then either passes through a VF using the data from
|
||||
the ``VIFHostDevice`` os-vif object (with the ``direct`` VNIC type), or plugs
|
||||
the Instance into a vhost-user handle with data from a ``VIFVHostUser`` os-vif
|
||||
object (with the ``virtio-forwarder`` VNIC type).
|
||||
|
||||
In both cases, the os-vif object has a port profile of
|
||||
``VIFPortProfileOVSRepresentor`` that carries the offload metadata as well as
|
||||
Open vSwitch metadata.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
Currently, switchdev VF offloads are modelled for one port profile only. Should
|
||||
a developer, using a different datapath, wish to pass offload metadata to an
|
||||
os-vif plugin, they would have to extend the object model, or pass the metadata
|
||||
using a confusingly named object. This spec aims to establish a recommended
|
||||
mechanism to extend the object model.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Use composition instead of inheritance
|
||||
--------------------------------------
|
||||
|
||||
Instead of using an inheritance based pattern to model the offload
|
||||
capabilities and metadata, use a composition pattern:
|
||||
|
||||
* Implement a ``DatapathOffloadBase`` class.
|
||||
|
||||
* Subclass this to ``DatapathOffloadRepresentor`` with the following members:
|
||||
|
||||
* ``representor_name: StringField()``
|
||||
* ``representor_address: StringField()``
|
||||
|
||||
* Add a ``datapath_offload`` member to ``VIFPortProfileBase``:
|
||||
|
||||
* ``datapath_offload: ObjectField('DatapathOffloadBase', nullable=True,
|
||||
default=None)``
|
||||
|
||||
* Update the os-vif OVS reference plugin to accept and use the new versions and
|
||||
fields.
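
A condensed sketch of the composed classes described above follows; os-vif's
registry plumbing and version bookkeeping are elided, and only the new
``datapath_offload`` member of ``VIFPortProfileBase`` is shown:

.. code-block:: python

  # Sketch only: field names and types follow the bullets above.
  from oslo_versionedobjects import base
  from oslo_versionedobjects import fields


  class DatapathOffloadBase(base.VersionedObject):
      """Marker base class for datapath offload metadata."""
      VERSION = '1.0'


  class DatapathOffloadRepresentor(DatapathOffloadBase):
      VERSION = '1.0'
      fields = {
          'representor_name': fields.StringField(),
          'representor_address': fields.StringField(),
      }


  class VIFPortProfileBase(base.VersionedObject):
      # Only the newly added member is shown here.
      fields = {
          'datapath_offload': fields.ObjectField('DatapathOffloadBase',
                                                 nullable=True,
                                                 default=None),
      }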
|
||||
|
||||
Future os-vif plugins combining an existing form of datapath offload (i.e.
|
||||
switchdev offload) with a new VIF type will not require modifications to
|
||||
os-vif. Future datapath offload methods will require subclassing
|
||||
``DatapathOffloadBase``.
|
||||
|
||||
Instead of implementing potentially brittle backlevelling code, this option
|
||||
proposes to keep two parallel interfaces alive in Nova for at least one
|
||||
overlapping release cycle, before the Open vSwitch plugin is updated in os-vif.
|
||||
|
||||
Instead of bumping object versions and creating composition version maps, this
|
||||
option proposes that versioning be deliberately ignored until the next major
|
||||
release of os-vif. Currently, version negotiation and backlevelling in os-vif
|
||||
is not used in Nova or os-vif plugins.
|
||||
|
||||
Kuryr Kubernetes is also a user of os-vif and is using object versioning in a
|
||||
manner not yet supported publicly in os-vif. There is an `ongoing discussion
|
||||
attempting to find a solution for Kuryr's use case`_.
|
||||
|
||||
Should protocol offloads also need to be modeled in os-vif, ``VIFBase`` or
|
||||
``VIFPortProfileBase`` could gain a ``protocol_offloads`` list of capabilities.
|
||||
|
||||
Summary of plugging methods affected
|
||||
------------------------------------
|
||||
|
||||
* Before changes:
|
||||
|
||||
* VIF type: ``ovs`` (os-vif plugin: ``ovs``)
|
||||
|
||||
* VNIC type: ``direct``
|
||||
* os-vif object: ``VIFHostDevice``
|
||||
* ``port_profile: VIFPortProfileOVSRepresentor``
|
||||
|
||||
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
|
||||
|
||||
* VNIC type: ``direct``
|
||||
* os-vif object: ``VIFHostDevice``
|
||||
* ``port_profile: VIFPortProfileOVSRepresentor``
|
||||
|
||||
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
|
||||
|
||||
* VNIC type: ``virtio-forwarder``
|
||||
* os-vif object: ``VIFVHostUser``
|
||||
* ``port_profile: VIFPortProfileOVSRepresentor``
|
||||
|
||||
* After this model has been adopted in Nova:
|
||||
|
||||
* VIF type: ``ovs`` (os-vif plugin: ``ovs``)
|
||||
|
||||
* VNIC type: ``direct``
|
||||
* os-vif object: ``VIFHostDevice``
|
||||
* ``port_profile: VIFPortProfileOpenVSwitch``
|
||||
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
|
||||
|
||||
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
|
||||
|
||||
* VNIC type: ``direct``
|
||||
* os-vif object: ``VIFHostDevice``
|
||||
* ``port_profile: VIFPortProfileOpenVSwitch``
|
||||
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
|
||||
|
||||
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
|
||||
|
||||
* VNIC type: ``virtio-forwarder``
|
||||
* os-vif object: ``VIFVHostUser``
|
||||
* ``port_profile: VIFPortProfileOpenVSwitch``
|
||||
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
|
||||
|
||||
|
||||
Additional Impact
|
||||
-----------------
|
||||
|
||||
os-vif needs to issue a release before these profiles will be available to
|
||||
general CI testing in Nova. Once this is done, Nova can be adapted to use the
|
||||
new generic interfaces.
|
||||
|
||||
* In Stein, os-vif's object model will gain the interfaces described in this
|
||||
spec. If needed, a major os-vif release will be issued.
|
||||
* Then, Nova will depend on the new release and use the new interfaces for new
|
||||
plugins.
|
||||
* During this time, os-vif will have two parallel interfaces supporting this
|
||||
metadata. This is expected to last at least from Stein to Train.
|
||||
* From Train onwards, existing plugins should be transitioned to the new
|
||||
model.
|
||||
* Once all plugins have been transitioned, the parallel interfaces can be
|
||||
removed in a major release of os-vif.
|
||||
* Support will be lent to Kuryr Kubernetes during this period, to transition to
|
||||
a better supported model.
|
||||
|
||||
Additional notes
|
||||
----------------
|
||||
|
||||
* No corresponding changes in Neutron are expected: currently os-vif is
|
||||
consumed by Nova and Kuryr Kubernetes.
|
||||
* Even though representor addresses are currently modeled as PCI address
|
||||
objects, it was felt that stricter type checking would be of limited
|
||||
benefit. Future networking systems might require paths, UUIDs or other
|
||||
methods of describing representors. Leaving the address member a string was
|
||||
deemed an acceptable compromise.
|
||||
* The main concern raised against composition over inheritance was the increase
|
||||
of the serialization size of the objects.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
During the development of this spec it was not immediately clear whether the
|
||||
composition or inheritance model would be the consensus solution. Because the
|
||||
two models have wildly different effects on future code, it was decided that
|
||||
both be implemented in order to compare and contrast.
|
||||
|
||||
The implementation for the inheritance model is illustrated in
|
||||
https://review.openstack.org/608693
|
||||
|
||||
Use inheritance to create a generic representor profile
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Keep using an inheritance based pattern to model the offload capabilities and
|
||||
metadata:
|
||||
|
||||
* Implement ``VIFPortProfileRepresentor`` by subclassing ``VIFPortProfileBase``
|
||||
and adding the following members:
|
||||
|
||||
* ``representor_name: StringField(nullable=True)``
|
||||
* ``representor_address: StringField()``
|
||||
|
||||
Summary of new plugging methods available in an inheritance model
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
* After os-vif changes:
|
||||
|
||||
* Generic VIF with SR-IOV passthrough:
|
||||
|
||||
* VNIC type: ``direct``
|
||||
* os-vif object: ``VIFHostDevice``
|
||||
* ``port_profile: VIFPortProfileRepresentor``
|
||||
|
||||
* Generic VIF with virtio-forwarder:
|
||||
|
||||
* VNIC type: ``virtio-forwarder``
|
||||
* os-vif object: ``VIFVHostUser``
|
||||
* ``port_profile: VIFPortProfileRepresentor``
|
||||
|
||||
Other alternatives considered
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Other alternatives proposed require much more invasive patches to Nova and
|
||||
os-vif:
|
||||
|
||||
* Create a new VIF type for every future datapath/offload combination.
|
||||
|
||||
* The inheritance based pattern could be made more generic by renaming the
|
||||
``VIFPortProfileOVSRepresentor`` class to ``VIFPortProfileRepresentor`` as
|
||||
illustrated in https://review.openstack.org/608448
|
||||
|
||||
* The versioned objects could be backleveled by using a suitable negotiation
|
||||
mechanism to provide overlap.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
os-vif plugins run with elevated privileges, but no new functionality will be
|
||||
implemented.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Extending the model in this fashion adds more bytes to the VIF objects passed
|
||||
to the os-vif plugin. At the moment, this effect is negligible, but when the
|
||||
objects are serialized and passed over the wire, this will increase the size of
|
||||
the API messages.
|
||||
|
||||
However, it's very likely that the object model would undergo a major
|
||||
version change with a redesign, before this becomes a problem.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Deployers might notice a deprecation warning in logs if Nova, os-vif or the
|
||||
os-vif plugin is out of sync.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Core os-vif semantics will be slightly changed. The details for extending
|
||||
os-vif objects would be slightly more established.
|
||||
|
||||
Upgrade impact
|
||||
--------------
|
||||
|
||||
The minimum required version of os-vif in Nova will be bumped in both
|
||||
``requirements.txt`` and ``lower-constraints.txt``. Deployers should be
|
||||
following at least those minimums.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Jan Gutter <jan.gutter@netronome.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implementation of the composition model in os-vif:
|
||||
https://review.openstack.org/572081
|
||||
|
||||
* Adopt the new os-vif interfaces in Nova. This would likely happen after a
|
||||
major version release of os-vif.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* After both options have been reviewed, and the chosen version has been
|
||||
merged, an os-vif release needs to be made.
|
||||
|
||||
* When updating Nova to use the newer release of os-vif, the corresponding
|
||||
changes should be made to move away from the deprecated classes. This change
|
||||
is expected to be minimal.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Unit tests for the os-vif changes will test the object model impact.
|
||||
|
||||
* Third-party CI is already testing the accelerated plugging modes, no new
|
||||
functionality needs to be tested.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The os-vif development documentation will be updated with the new classes.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* `ethtool man page`_
|
||||
* `TC Flower Offload paper`_
|
||||
* `virtio-forwarder`_
|
||||
* `Open vSwitch offloads`_
|
||||
* `switchdev SR-IOV offloads`_
|
||||
* `blueprint for vrouter hardware offloads`_
|
||||
* `list of VNIC types that will trigger a PCI request`_
|
||||
* `section in the API where the PCI request is triggered`_
|
||||
* `ongoing discussion attempting to find a solution for Kuryr's use case`_
|
||||
|
||||
.. _`ethtool man page`: http://man7.org/linux/man-pages/man8/ethtool.8.html
|
||||
.. _`TC Flower Offload paper`: https://www.netdevconf.org/2.2/papers/horman-tcflower-talk.pdf
|
||||
.. _`virtio-forwarder`: http://virtio-forwarder.readthedocs.io/en/latest/
|
||||
.. _`Open vSwitch offloads`: https://docs.openstack.org/neutron/queens/admin/config-ovs-offload.html
|
||||
.. _`switchdev SR-IOV offloads`: https://netdevconf.org/1.2/slides/oct6/04_gerlitz_efraim_introduction_to_switchdev_sriov_offloads.pdf
|
||||
.. _`blueprint for vrouter hardware offloads`: https://blueprints.launchpad.net/nova/+spec/vrouter-hw-offloads
|
||||
.. _`list of VNIC types that will trigger a PCI request`: https://github.com/openstack/nova/blob/e3eb5f916580a9bab8f67b0fd685c6b3b23a97b7/nova/network/model.py#L111
|
||||
.. _`section in the API where the PCI request is triggered`: https://github.com/openstack/nova/blob/e3eb5f916580a9bab8f67b0fd685c6b3b23a97b7/nova/network/neutronv2/api.py#L1921
|
||||
.. _`ongoing discussion attempting to find a solution for Kuryr's use case`: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000569.html
|
||||
454
specs/stein/implemented/handling-down-cell.rst
Normal file
@@ -0,0 +1,454 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Handling a down cell
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/nova/+spec/handling-down-cell
|
||||
|
||||
This spec aims at addressing the behavioural changes that are required to
|
||||
support some of the basic nova operations like listing of instances and
|
||||
services when a cell goes down.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Currently in nova when a cell goes down (for instance if the cell DB is not
|
||||
reachable) basic functionalities like ``nova list`` and ``nova service-list``
|
||||
do not work and return an API error message. However a single cell going down
|
||||
should not stop these operations from working for the end users and operators.
|
||||
Another issue is while calculating quotas during VM creations, the resources
|
||||
of the down cell are not taken into account and the ``nova boot`` operation is
|
||||
permitted into the cells which are up. This may result in incorrect quota
|
||||
reporting for a particular project during boot time which may have implications
|
||||
when the down cell comes back.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
The specific use cases that are being addressed in the spec include:
|
||||
|
||||
#. ``nova list`` should work even if a cell goes down. This can be partitioned
|
||||
into two use cases:
|
||||
|
||||
#. The user has no instances in the down cell: Expected behaviour would be
|
||||
for everything to work as normal. This has been fixed through
|
||||
`smart server listing`_ if used with the right config options.
|
||||
#. The user has instances in the down cell: This needs to be gracefully
|
||||
handled which can be split into two stages:
|
||||
|
||||
#. We just skip the down cell and return results from the cells that are
|
||||
available instead of returning a 500 which has been fixed through
|
||||
`resilient server listing`_.
|
||||
#. Instead of skipping the down cell, we build on (modify) the existing
|
||||
API response to return a minimalistic construct. This will be fixed in
|
||||
this spec.
|
||||
|
||||
#. ``nova show`` should also return a minimalistic construct for instances in
|
||||
the down cell similar to ``nova list``.
|
||||
|
||||
#. ``nova service-list`` should work even if a cell goes down. The solution can
|
||||
be split into two stages:
|
||||
|
||||
#. We skip the down cell and end up displaying all the services from the
|
||||
other cells as was in cells_v1 setup. This has been fixed through
|
||||
`resilient service listing`_.
|
||||
#. We handle this gracefully for the down cell. This will be fixed through
|
||||
this spec by creating a minimalistic construct.
|
||||
|
||||
#. ``nova boot`` should not succeed if that project has any living VMs in the
|
||||
down cell until an all-cell-iteration independent solution for quota
|
||||
calculation is implemented through `quotas using placement`_.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
This spec proposes to add a new ``queued_for_delete`` column in the
|
||||
``nova_api.instance_mappings`` table as discussed in the
|
||||
`cells summary in Dublin PTG`_. This column would be of type Boolean which by
|
||||
default will be False and upon the deletion (normal/local/soft) of the
|
||||
respective instance, will be set to True. In the case of soft delete, if the
|
||||
instance is restored, then the value of the column will be set to False again.
|
||||
The corresponding ``queued_for_delete`` field will be added in the
|
||||
InstanceMapping object.
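
A small sketch of the object-side change follows (the object version bump and
the api DB schema migration are elided; existing fields are abbreviated):

.. code-block:: python

  # Sketch only: shows just the new field on the existing object.
  from nova.objects import base
  from nova.objects import fields


  @base.NovaObjectRegistry.register
  class InstanceMapping(base.NovaTimestampObject, base.NovaObject):
      fields = {
          # ... existing fields such as instance_uuid, cell_mapping and
          # project_id are unchanged ...
          'queued_for_delete': fields.BooleanField(default=False),
      }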
|
||||
|
||||
Listing of instances and services from the down cell will return a
|
||||
`did_not_respond_sentinel`_ object from the scatter-gather utility. Using this
|
||||
response we can know if a cell is down or not and accordingly modify the
|
||||
listing commands to work in the following manner for those records which are
|
||||
from the down cell:
|
||||
|
||||
#. ``nova list`` should return a minimalistic construct from the available
|
||||
information in the API DB which would include:
|
||||
|
||||
#. created_at, instance_uuid and project_id from the instance_mapping table.
|
||||
#. status of the instance would be "UNKNOWN" which would be the major
|
||||
indication that the record for this instance is partial.
|
||||
#. rest of the field keys will be missing.
|
||||
|
||||
See the `Edge Cases`_ section for more info on running this command with
|
||||
filters, marker, sorting and paging.
|
||||
|
||||
#. ``nova show`` should return a minimalistic construct from the available
|
||||
information in the API DB which would be similar to ``nova list``. If
|
||||
``GET /servers/{id}`` cannot reach the cell DB, we can look into the
|
||||
instance_mapping and request_spec table for the instance details which would
|
||||
include:
|
||||
|
||||
#. instance_uuid, created_at and project_id from the instance_mapping table.
|
||||
#. status of the instance would be "UNKNOWN" which would be the major
|
||||
indication that the record for this instance is partial.
|
||||
#. user_id, flavor, image and availability_zone from the request_spec table.
|
||||
#. power_state is set to NOSTATE.
|
||||
#. rest of the field keys will be missing.
|
||||
|
||||
#. ``nova service-list`` should return a minimalistic construct from the
|
||||
available information in the API DB which would include:
|
||||
|
||||
#. host and binary from the host_mapping table for the compute services.
|
||||
#. rest of the field keys will be missing.
|
||||
|
||||
Note that if cell0 goes down the controller services will not be listed.
|
||||
|
||||
#. ``nova boot`` should not succeed if the requesting project has living VMs in
|
||||
the down cell. So if the scatter-gather utility returns a
|
||||
did_not_respond_sentinel while calculating quotas, we have to go and check
|
||||
if this project has living instances in the down cell from the
|
||||
instance_mapping table and prevent the boot request if it has. However it
|
||||
might not be desirable to block VM creation for users having VMs in multiple
|
||||
cells if a single cell goes down. Hence a new policy rule
|
||||
``os_compute_api:servers:create:cell_down`` which defaults to
|
||||
``rule:admin_api`` can be added by which the ability to create instances
|
||||
when a project has instances in a down cell can be controlled between
|
||||
users/admin. Using this, deployments can configure their setup in whichever
|
||||
way they desire.
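
For illustration, a sketch of how the minimal construct could be built when
the scatter-gather utility hands back a ``did_not_respond_sentinel`` is given
below (helper names are illustrative):

.. code-block:: python

  # Sketch only: the key names follow the bullets above.
  from nova import context as nova_context


  def _unknown_server_view(mapping):
      """Minimal construct for an instance living in a down cell."""
      return {
          'id': mapping.instance_uuid,
          'tenant_id': mapping.project_id,
          'created': mapping.created_at,
          # UNKNOWN is the marker that this record is partial.
          'status': 'UNKNOWN',
      }


  def _servers_from_down_cells(results, mappings_by_cell):
      """Yield partial records for every cell that did not respond."""
      for cell_uuid, result in results.items():
          if result is nova_context.did_not_respond_sentinel:
              for mapping in mappings_by_cell.get(cell_uuid, []):
                  yield _unknown_server_view(mapping)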
|
||||
|
||||
For the 1st, 2nd and 4th operations to work when a cell is down, we need to
|
||||
have the information regarding if an instance is in SOFT_DELETED/DELETED state
|
||||
in the API DB so that the living instances can be distinguished from the
|
||||
deleted ones which is why we add the new column ``queued_for_delete``.
|
||||
|
||||
In order to prevent the client side from complaining about missing keys, we
|
||||
would need a new microversion that merges the minimal constructs described
above for the servers in the down cells into the same list as the full
constructs of the servers in the up cells. In future we could use a caching
|
||||
mechanism to have the ability to fill in the down cell instances information.
|
||||
|
||||
Note that all other non-listing operations like create and delete will simply
|
||||
not work for the servers in the down cell since one cannot clearly do anything
|
||||
about it if the cell database is not reachable. They will continue to return
|
||||
500 as is the present scenario.
|
||||
|
||||
Edge Cases
|
||||
----------
|
||||
|
||||
* Filters: If the user is listing servers using filters the results from the
|
||||
down cell will be skipped and no minimalistic construct will be provided
|
||||
since there is no way of validating the filtered results from the down cell
|
||||
if the value of the filter key itself is missing. Note that by default
|
||||
``nova list`` uses the ``deleted=False`` and ``project_id=tenant_id``
|
||||
filters and since we know both of these values from the instance_mapping
|
||||
table, they will be the only allowed filters. Hence only doing ``nova list``
|
||||
and ``nova list --minimal`` will show minimalistic results for the down cell.
|
||||
Other filters like ``nova list --deleted`` or ``nova list --host xx`` will
|
||||
skip the results for the down cell.
|
||||
|
||||
* Marker: If the user does ``nova list --marker`` it will fail with a 500 if
|
||||
the marker is in the down cell.
|
||||
|
||||
* Sorting: We ignore the down cell just like we do for filters since there is
|
||||
no way of obtaining valid results from the down cell with missing key info.
|
||||
|
||||
* Paging: We ignore the down cell. For instance if we have three cells A (up),
|
||||
B (down) and C (up) and if the marker is half way in A, we would get the
|
||||
rest half of the results from A, all the results from C and ignore cell B.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* An alternative to adding the new column in the instance_mappings table is to
|
||||
have the deleted information in the respective RequestSpec record, however it
|
||||
was decided at the PTG to go ahead with adding the new column in the
|
||||
instance_mappings table as it is more appropriate. For the main logic there
|
||||
is no alternative solution other than having the deleted info in the API DB
|
||||
if the listing operations have to work when a cell goes down.
|
||||
|
||||
* Without a new microversion, include 'shell' servers in the response when
|
||||
listing over down cells which would have UNKNOWN values for those keys
|
||||
whose information is missing. However the client side would not be able to
|
||||
digest the response with "UNKNOWN" values. Also it is not possible to assign
|
||||
"UNKNOWN" to all the fields since not all of them are of string types.
|
||||
|
||||
* With a new microversion include the set of server uuids in the down cells
|
||||
in a new top level API response key called ``unavailable_servers`` and treat
|
||||
the two lists (one for the servers from the up cells and other for the
|
||||
servers from the down cells) separately. See `POC for unavailable_servers`_
|
||||
for more details.
|
||||
|
||||
* Using searchlight to backfill when there are down cells. Check
|
||||
`listing instances using Searchlight`_ for more details.
|
||||
|
||||
* Adding backup DBs for each cell database which would act as read-only copies
|
||||
of the original DB in times of crisis, however this would need massive
|
||||
syncing and may fetch stale results.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
A nova_api DB schema change will be required for adding the
|
||||
``queued_for_delete`` column of type Boolean to the
|
||||
``nova_api.instance_mappings`` table. This column will be set to False by
|
||||
default.
|
||||
|
||||
Also, the ``InstanceMapping`` object will have a new field called
|
||||
``queued_for_delete``. An online data migration tool will be added to populate
|
||||
this field for existing instance_mappings. This tool would basically go over
|
||||
the instance records in all the cells, and if the vm_state of the instance is
|
||||
either DELETED or SOFT_DELETED, it will set ``queued_for_delete`` to True,
otherwise it will leave it at its default value.
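
A rough sketch of the migration's core loop is shown below; batching, cell
iteration and error handling of the real tool are elided:

.. code-block:: python

  # Sketch only: operates on one batch of mappings against one cell DB.
  from nova.compute import vm_states
  from nova import objects


  def _populate_queued_for_delete(cell_context, mappings):
      """Backfill queued_for_delete for a batch of instance mappings."""
      # Deleted instances are soft-deleted rows in the cell DB, so read
      # deleted records as well.
      read_deleted_ctx = cell_context.elevated(read_deleted='yes')
      for mapping in mappings:
          instance = objects.Instance.get_by_uuid(
              read_deleted_ctx, mapping.instance_uuid)
          mapping.queued_for_delete = instance.vm_state in (
              vm_states.DELETED, vm_states.SOFT_DELETED)
          mapping.save()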
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
When a cell is down, we currently skip that cell and this spec aims at
|
||||
giving partial info for ``GET /servers``, ``GET /os-services``,
|
||||
``GET /servers/detail`` and ``GET /servers/{server_id}`` REST APIs.
|
||||
There will be a new microversion for the client to recognise missing keys and
|
||||
NULL values for certain keys in the response.
|
||||
|
||||
An example server response for ``GET /servers/detail`` is given below which
|
||||
includes one available server and one unavailable server.
|
||||
|
||||
JSON response body example::
|
||||
|
||||
{
|
||||
"servers": [
|
||||
{
|
||||
"OS-EXT-STS:task_state": null,
|
||||
"addresses": {
|
||||
"public": [
|
||||
{
|
||||
"OS-EXT-IPS-MAC:mac_addr": "fa:xx:xx:xx:xx:1a",
|
||||
"version": 4,
|
||||
"addr": "1xx.xx.xx.xx3",
|
||||
"OS-EXT-IPS:type": "fixed"
|
||||
},
|
||||
{
|
||||
"OS-EXT-IPS-MAC:mac_addr": "fa:xx:xx:xx:xx:1a",
|
||||
"version": 6,
|
||||
"addr": "2sss:sss::s",
|
||||
"OS-EXT-IPS:type": "fixed"
|
||||
}
|
||||
]
|
||||
},
|
||||
"links": [
|
||||
{
|
||||
"href": "http://1xxx.xxx.xxx.xxx/compute/v2.1/servers/b546af1e-3893-44ea-a660-c6b998a64ba7",
|
||||
"rel": "self"
|
||||
},
|
||||
{
|
||||
"href": "http://1xx.xxx.xxx.xxx/compute/servers/b546af1e-3893-44ea-a660-c6b998a64ba7",
|
||||
"rel": "bookmark"
|
||||
}
|
||||
],
|
||||
"image": {
|
||||
"id": "9da3b809-2998-4ada-8cc6-f24bc0b6dd7f",
|
||||
"links": [
|
||||
{
|
||||
"href": "http://1xx.xxx.xxx.xxx/compute/images/9da3b809-2998-4ada-8cc6-f24bc0b6dd7f",
|
||||
"rel": "bookmark"
|
||||
}
|
||||
]
|
||||
},
|
||||
"OS-EXT-SRV-ATTR:user_data": null,
|
||||
"OS-EXT-STS:vm_state": "active",
|
||||
"OS-EXT-SRV-ATTR:instance_name": "instance-00000001",
|
||||
"OS-EXT-SRV-ATTR:root_device_name": "/dev/vda",
|
||||
"OS-SRV-USG:launched_at": "2018-06-29T15:07:39.000000",
|
||||
"flavor": {
|
||||
"ephemeral": 0,
|
||||
"ram": 64,
|
||||
"original_name": "m1.nano",
|
||||
"vcpus": 1,
|
||||
"extra_specs": {},
|
||||
"swap": 0,
|
||||
"disk": 0
|
||||
},
|
||||
"id": "b546af1e-3893-44ea-a660-c6b998a64ba7",
|
||||
"security_groups": [
|
||||
{
|
||||
"name": "default"
|
||||
}
|
||||
],
|
||||
"OS-SRV-USG:terminated_at": null,
|
||||
"os-extended-volumes:volumes_attached": [],
|
||||
"user_id": "187160b0afe041368258c0b195ab9822",
|
||||
"OS-EXT-SRV-ATTR:hostname": "surya-probes-001",
|
||||
"OS-DCF:diskConfig": "MANUAL",
|
||||
"accessIPv4": "",
|
||||
"accessIPv6": "",
|
||||
"OS-EXT-SRV-ATTR:reservation_id": "r-uxbso3q4",
|
||||
"progress": 0,
|
||||
"OS-EXT-STS:power_state": 1,
|
||||
"OS-EXT-AZ:availability_zone": "nova",
|
||||
"config_drive": "",
|
||||
"status": "ACTIVE",
|
||||
"OS-EXT-SRV-ATTR:ramdisk_id": "",
|
||||
"updated": "2018-06-29T15:07:39Z",
|
||||
"hostId": "e8dcf7ab9762810efdec4307e6219f85a53d5dfe642747c75a87db06",
|
||||
"OS-EXT-SRV-ATTR:host": "cn1",
|
||||
"description": null,
|
||||
"tags": [],
|
||||
"key_name": null,
|
||||
"OS-EXT-SRV-ATTR:kernel_id": "",
|
||||
"OS-EXT-SRV-ATTR:hypervisor_hostname": "cn1",
|
||||
"locked": false,
|
||||
"name": "surya-probes-001",
|
||||
"OS-EXT-SRV-ATTR:launch_index": 0,
|
||||
"created": "2018-06-29T15:07:29Z",
|
||||
"tenant_id": "940f47b984034c7f8f9624ab28f5643c",
|
||||
"host_status": "UP",
|
||||
"trusted_image_certificates": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"created": "2018-06-29T15:07:29Z",
|
||||
"status": "UNKNOWN",
|
||||
"tenant_id": "940f47b984034c7f8f9624ab28f5643c",
|
||||
"id": "bcc6c6dd-3d0a-4633-9586-60878fd68edb",
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
When a cell DB cannot be connected, ``nova list``, ``nova show`` and
|
||||
``nova service-list`` will work with the records from the down cell not having
|
||||
all the information. When these commands are used with filters/sorting/paging,
|
||||
the output will totally skip the down cell and return only information from the
|
||||
up cells. As per default policy ``nova boot`` will not work if that tenant_id
|
||||
has any living instances in the down cell.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
There will not be any major impact on performance in normal situations. However
|
||||
when a cell is down, during show/list/boot time there will be a slight
|
||||
performance impact because of the extra check into the instance_mapping and/or
|
||||
request_spec tables and the time required for the construction of a
|
||||
minimalistic record in case a did_not_respond_sentinel is received from the
|
||||
scatter-gather utility.

Other deployer impact
---------------------

None.

Developer impact
----------------

None.

Upgrade impact
--------------

Since there will be a change in the api DB schema, the ``nova-manage api_db
sync`` command will have to be run to update the instance_mappings table. The
new online data migration tool that will be added to populate the new column
will have to be run.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  <tssurya>

Other contributors:
  <belmoreira>

Work Items
----------

#. Add a new column ``queued_for_delete`` to nova_api.instance_mappings table.
#. Add a new field ``queued_for_delete`` to InstanceMapping object.
#. Add a new online migration tool for populating ``queued_for_delete`` of
   existing instance_mappings.
#. Handle ``nova list`` gracefully on receiving a timeout from a cell `here`_.
#. Handle ``nova service-list`` gracefully on receiving a timeout from a cell.
#. Handle ``nova boot`` during quota calculation in `quota calculation code`_
   when the result is a did_not_respond_sentinel or raised_exception_sentinel.
   Implement the extra check into the instance_mapping table to see if the
   requesting project has any living instances in the down cell and block the
   request accordingly.

Dependencies
============

None.

Testing
=======

Unit and functional tests for verifying the behaviour when a
did_not_respond_sentinel is received.

Documentation Impact
====================

Update the description of the Compute API reference with regards to these
commands to include the meaning of UNKNOWN records.

References
==========

.. _smart server listing: https://review.openstack.org/#/c/509003/
.. _resilient server listing: https://review.openstack.org/#/c/575734/
.. _resilient service listing: https://review.openstack.org/#/c/568271/
.. _quotas using placement: https://review.openstack.org/#/c/509042/
.. _cells summary in Dublin PTG: http://lists.openstack.org/pipermail/openstack-dev/2018-March/128304.html
.. _did_not_respond_sentinel: https://github.com/openstack/nova/blob/f902e0d/nova/context.py#L464
.. _POC for unavailable_servers: https://review.openstack.org/#/c/575996/
.. _listing instances using Searchlight: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/list-instances-using-searchlight.html
.. _here: https://github.com/openstack/nova/blob/f902e0d/nova/compute/multi_cell_list.py#L246
.. _quota calculation code: https://github.com/openstack/nova/blob/f902e0d/nova/quota.py#L1317

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Rocky
     - Introduced
   * - Stein
     - Reproposed
313
specs/stein/implemented/initial-allocation-ratios.rst
Normal file
@@ -0,0 +1,313 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

======================================
Default allocation ratio configuration
======================================

https://blueprints.launchpad.net/nova/+spec/initial-allocation-ratios

Provide separate CONF options for specifying the initial allocation
ratio for compute nodes. Change the default values for
CONF.xxx_allocation_ratio options to None and change the behaviour of
the resource tracker to only override allocation ratios for *existing*
compute nodes if the CONF.xxx_allocation_ratio value is not None.

The primary goal of this feature is to support both the API and the config way
to set allocation ratios.

Problem description
===================

Manually set placement allocation ratios are overwritten
---------------------------------------------------------

There is currently no way for an admin to set the allocation ratio on an
individual compute node resource provider's inventory record in the placement
API without the resource tracker eventually overwriting that value the next
time it runs the ``update_available_resources`` periodic task on the
``nova-compute`` service.

The saga of the allocation ratio values on the compute host
-------------------------------------------------------------

The process by which nova determines the allocation ratio for CPU, RAM and disk
resources on a hypervisor is confusing and `error`_ `prone`_. The
``compute_nodes`` table in the nova cell DB contains three fields representing
the allocation ratio for CPU, RAM and disk resources on that hypervisor. These
fields are populated using different default values depending on the version of
nova running on the ``nova-compute`` service.

.. _error: https://bugs.launchpad.net/nova/+bug/1742747
.. _prone: https://bugs.launchpad.net/nova/+bug/1789654

Upon starting up, the resource tracker in the ``nova-compute`` service worker
`checks`_ to see if a record exists in the ``compute_nodes`` table of the nova
cell DB for itself. If it does not find one, the resource tracker `creates`_ a
record in the table, `setting`_ the associated allocation ratio values in the
``compute_nodes`` table to the value it finds in the ``cpu_allocation_ratio``,
``ram_allocation_ratio`` and ``disk_allocation_ratio`` nova.conf configuration
options, but only if the config option value is not equal to 0.0.

.. _checks: https://github.com/openstack/nova/blob/852de1e/nova/compute/resource_tracker.py#L566
.. _creates: https://github.com/openstack/nova/blob/852de1e/nova/compute/resource_tracker.py#L577-L590
.. _setting: https://github.com/openstack/nova/blob/6a68f9140/nova/compute/resource_tracker.py#L621-L645

The default values of the ``cpu_allocation_ratio``, ``ram_allocation_ratio``
and ``disk_allocation_ratio`` CONF options are `currently set`_ to ``0.0``.

.. _currently set: https://github.com/openstack/nova/blob/852de1e/nova/conf/compute.py#L400

The resource tracker saves these default ``0.0`` values to the
``compute_nodes`` table when the resource tracker calls ``save()`` on the
compute node object. However, there is `code`_ in
``ComputeNode._from_db_obj`` that, upon **reading** the record back from the
database on first save, changes the values from ``0.0`` to ``16.0``, ``1.5`` or
``1.0``.

.. _code: https://github.com/openstack/nova/blob/852de1e/nova/objects/compute_node.py#L177-L207

The ``ComputeNode`` object that was ``save()``'d by the resource tracker has
these new values for some period of time while the record in the
``compute_nodes`` table continues to have the wrong ``0.0`` values. When the
resource tracker next runs its ``update_available_resource()`` periodic task,
the new ``16.0``/``1.5``/``1.0`` values are then saved to the compute nodes
table.

There is a `fix`_ for `bug/1789654`_, which is to not persist
zero allocation ratios in the ResourceTracker to avoid initializing placement
allocation_ratio with 0.0 (due to the allocation ratio of 0.0 being multiplied
by the total amount in inventory, leading to 0 resources shown on the system).

.. _fix: https://review.openstack.org/#/c/598365/
.. _bug/1789654: https://bugs.launchpad.net/nova/+bug/1789654

Use Cases
---------

An administrator would like to set allocation ratios for individual resources
on a compute node via the placement API *without that value being overwritten*
by the compute node's resource tracker.

An administrator chooses to only use the configuration file to set allocation
ratio overrides on their compute nodes and does not want to use the placement
API to set these ratios.

Proposed change
===============

First, we propose to change the default values of the existing
``CONF.cpu_allocation_ratio``, ``CONF.ram_allocation_ratio`` and
``CONF.disk_allocation_ratio`` options from ``0.0`` to ``None``. The reason
for this change is that the ``0.0`` default is otherwise silently rewritten to
``16.0``, ``1.5`` or ``1.0`` later, which is weird and confusing.

We will also change the resource tracker to **only** overwrite the compute
node's allocation ratios to the value of the ``cpu_allocation_ratio``,
``ram_allocation_ratio`` and ``disk_allocation_ratio`` CONF options **if the
value of these options is NOT ``None``**.

In other words, if any of these CONF options is set to something *other than*
``None``, then the CONF option should be considered the complete override value
for that resource class' allocation ratio. Even if an admin manually adjusts
the allocation ratio of the resource class in the placement API, the next time
the ``update_available_resource()`` periodic task runs, it will be overwritten
to the value of the CONF option.

Second, we propose to add 3 new nova.conf configuration options:

* ``initial_cpu_allocation_ratio``
* ``initial_ram_allocation_ratio``
* ``initial_disk_allocation_ratio``

These will be used to determine how to set the *initial* allocation ratio of
the ``VCPU``, ``MEMORY_MB`` and ``DISK_GB`` resource classes when a compute
worker first starts up and creates its compute node record in the nova cell DB
and corresponding inventory records in the placement service. The value of
these new configuration options will only be used if the compute service's
resource tracker is not able to find a record in the placement service for the
compute node the resource tracker is managing.

The default value of each of these CONF options shall be ``16.0``, ``1.5``, and
``1.0`` respectively. This is to match the default values for the original
allocation ratio CONF options before they were set to ``0.0``.

These new ``initial_xxx_allocation_ratio`` CONF options shall **ONLY** be used
if the resource tracker detects no existing record in the ``compute_nodes``
nova cell DB table for that hypervisor.

Finally, we will also need to add an online data migration and continue to
apply the ``xxx_allocation_ratio`` or ``initial_xxx_allocation_ratio`` config
options when reading records from the DB whose values are ``0.0`` or ``None``.
If it is an existing record with 0.0 values, we want to do what the compute
does, which is to use the configured ``xxx_allocation_ratio`` option if it is
not None, and fall back to using the ``initial_xxx_allocation_ratio``
otherwise.
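
The fallback order described above can be sketched as follows (the helper
name is hypothetical; only the option names come from this spec)::

    def pick_allocation_ratio(conf_value, initial_value, db_value=None):
        """Return the ratio the resource tracker / migration should use."""
        if conf_value is not None:
            # An explicitly configured ratio is a complete override.
            return conf_value
        if db_value not in (None, 0.0):
            # Keep a valid value that is already persisted (possibly set
            # via the placement API).
            return db_value
        # New compute node or legacy 0.0/None record: use the initial
        # default (16.0 / 1.5 / 1.0).
        return initial_value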

We will also add an online data migration that updates all compute_nodes
table records that have ``0.0`` or ``None`` allocation ratios. Then we drop
that at some point with a blocker migration and remove the code in
``nova.objects.ComputeNode._from_db_obj`` that adjusts allocation ratios.

We propose to add a nova-status upgrade check that iterates the cells looking
for compute_nodes records with ``0.0`` or ``None`` allocation ratios and
signals that as a warning that the online data migration has not been done. We
could also check the conf options to see if they are explicitly set to 0.0
and, if so, fail the status check.

Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

We need an online data migration for any compute_nodes with existing ``0.0``
or ``None`` allocation ratios. If it is an existing record with 0.0 values, we
will replace it with the configured ``xxx_allocation_ratio`` option if that is
not None, and fall back to using the ``initial_xxx_allocation_ratio``
otherwise.

.. note:: Migrating 0.0 allocation ratios from existing ``compute_nodes`` table
   records is necessary because the ComputeNode object based on those table
   records is what gets used in the scheduler [1]_, specifically the
   ``NUMATopologyFilter`` and ``CPUWeigher`` (the ``CoreFilter``,
   ``DiskFilter`` and ``RamFilter`` also use them but those filters are
   deprecated for removal so they are not a concern here).

Clearly, in order to take advantage of the ability to manually set allocation
ratios on a compute node, that hypervisor would need to be upgraded. There is
no impact to old compute hosts.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  yikun

Work Items
----------

* Change the default values for ``CONF.xxx_allocation_ratio`` options to
  ``None``.
* Modify the resource tracker to only set allocation ratios on the compute
  node object when the CONF options are non-``None``.
* Add new ``initial_xxx_allocation_ratio`` CONF options and modify the
  resource tracker's initial compute node creation to use these values.
* Remove the code in ``ComputeNode._from_db_obj()`` that changes allocation
  ratio values.
* Add a db online migration to process all compute_nodes with existing ``0.0``
  or ``None`` allocation ratios.
* Add a nova-status upgrade check for ``0.0`` or ``None`` allocation ratios.

Dependencies
============

None

Testing
=======

No extraordinary testing outside normal unit and functional testing.

Documentation Impact
====================

A release note explaining the use of the new ``initial_xxx_allocation_ratio``
CONF options should be created along with a more detailed doc in the admin
guide explaining the following primary scenarios:

* When the deployer wants to **ALWAYS** set an override value for a resource on
  a compute node. This is where the deployer would ensure that the
  ``cpu_allocation_ratio``, ``ram_allocation_ratio`` and
  ``disk_allocation_ratio`` CONF options were set to a non-``None`` value.
* When the deployer wants to set an **INITIAL** value for a compute node's
  allocation ratio but wants to allow an admin to adjust this afterwards
  without making any CONF file changes. This scenario uses the new
  ``initial_xxx_allocation_ratio`` options for the initial ratio values and
  then shows the deployer using the osc placement commands to manually set an
  allocation ratio for a resource class on a resource provider.
* When the deployer wants to **ALWAYS** use the placement API to set allocation
  ratios, then the deployer should ensure that ``CONF.xxx_allocation_ratio``
  options are all set to ``None`` and the deployer should issue Placement
  REST API calls to
  ``PUT /resource_providers/{uuid}/inventories/{resource_class}`` [2]_ or
  ``PUT /resource_providers/{uuid}/inventories`` [3]_ to set the allocation
  ratios of their resources as needed (or use the related ``osc-placement``
  plugin commands [4]_).

References
==========

.. [1] https://github.com/openstack/nova/blob/a534ccc5a7/nova/scheduler/host_manager.py#L255
.. [2] https://developer.openstack.org/api-ref/placement/#update-resource-provider-inventory
.. [3] https://developer.openstack.org/api-ref/placement/#update-resource-provider-inventories
.. [4] https://docs.openstack.org/osc-placement/latest/

Nova Stein PTG discussion:

* https://etherpad.openstack.org/p/nova-ptg-stein

Bugs:

* https://bugs.launchpad.net/nova/+bug/1742747
* https://bugs.launchpad.net/nova/+bug/1729621
* https://bugs.launchpad.net/nova/+bug/1739349
* https://bugs.launchpad.net/nova/+bug/1789654

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Proposed
199
specs/stein/implemented/ironic-conductor-groups.rst
Normal file
@@ -0,0 +1,199 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===================================================================
Use conductor groups to partition nova-compute services for Ironic
===================================================================

https://blueprints.launchpad.net/nova/+spec/ironic-conductor-groups

Use ironic's conductor group feature to limit the subset of nodes which a
nova-compute service will manage. This allows for partitioning nova-compute
services to a particular location (building, aisle, rack, etc), and provides a
way for operators to manage the failure domain of a given nova-compute service.

Problem description
===================

As OpenStack deployments become larger, and edge compute becomes a reality,
there is a desire to be able to co-locate the nova-compute service with
some subset of ironic nodes.

There is also a desire to be able to reduce the failure domain of a
nova-compute service, and to be able to make the failure domain more
predictable in terms of which ironic nodes can no longer be scheduled to.

Use Cases
---------

Operators managing large and/or distributed ironic environments need more
control over the failure domain of a nova-compute service.

Proposed change
===============

A configuration option ``partition_key`` will be added, to tell the
nova-compute service which ``conductor_group`` (an ironic-ism) it is
responsible for managing. This will be used as a filter when querying the list
of nodes from ironic, so that only the subset of nodes which have a
``conductor_group`` matching the ``partition_key`` will be returned.

As nova-compute services have a hash ring which further partitions the subset
of nodes which a given nova-compute service is managing, we need a mechanism to
tell the service which other compute services are managing the same
``partition_key``. To do this, we will add another configuration option,
``peer_list``, which is a comma-separated list of hostnames of other compute
services managing the same subset of nodes. If set, this will be used instead
of the current code, which fetches a list of all compute services running the
ironic driver from the database. To ensure that the hash ring splits nodes only
between currently running compute services, we will check this list against the
database and filter out any inactive services (i.e. those that have not checked
in recently) listed in ``peer_list``.

``partition_key`` will default to ``None``. If the value is ``None``, this
functionality will be disabled, and the behavior will be the same as before,
where all nodes are eligible to be managed by the compute service, and all
compute services are considered as peers. Any other value will enable this
feature, limiting the nodes to the conductor group matching ``partition_key``,
and using the ``peer_list`` configuration option to determine the list of
peers.

Both options will be added to the ``[ironic]`` config group, and will be
"mutable", meaning only a SIGHUP is required to update the running service
with new config values.
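
A rough sketch of how the node list and hash ring peers could be narrowed
when ``partition_key`` is set (helper names are illustrative, not the actual
driver code)::

    def get_node_list_filters(conf):
        filters = {}
        if conf.ironic.partition_key is not None:
            filters['conductor_group'] = conf.ironic.partition_key
        return filters

    def get_hash_ring_peers(conf, active_compute_hostnames):
        if conf.ironic.partition_key is None:
            # Old behaviour: every compute service running the ironic driver.
            return set(active_compute_hostnames)
        # New behaviour: only the configured peers that are actually alive.
        peers = {host.strip() for host in conf.ironic.peer_list}
        return peers & set(active_compute_hostnames)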

Alternatives
------------

Ideally, we wouldn't need a ``peer_list`` configuration option, as we would be
able to dynamically fetch this list from the database, and the option is prone
to operator mistakes.

One option to do this is to add a field to the compute service record, to store
the partition key. Compute services running the ironic driver could then use
this field to determine their peer list. During the Stein PTG discussion
about this feature, we agreed not to do this, as adding fields or blobjects
to the service record for a single driver is a layer violation.

Another option is for the ironic driver to manage its own list of live services
in something like etcd, and the peer list could be determined from there. This
also feels like a layer violation, and requiring an etcd cluster only for a
particular driver feels confusing at best from an operator POV.

Data model impact
-----------------

None.

REST API impact
---------------

None.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

Using this feature slightly improves the performance of the resource tracker
update. Instead of iterating over the list of *all* ironic nodes to determine
which should be managed, the compute service will iterate over a subset of
ironic nodes.

Other deployer impact
---------------------

The two configuration options mentioned above are added, but are optional.
The feature isn't enabled unless ``partition_key`` is set.

It's worth noting what happens when a node's conductor group changes. If the
node has an instance, it continues being managed by the compute service
responsible for the instance, as we do today with rebalancing the hash ring.
Without an instance, the node will be picked up by a compute service managing
the new group at the next resource tracker run after the conductor group
changes.

Developer impact
----------------

None.

Upgrade impact
--------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  jroll

Work Items
----------

* Add the configuration options and the new code paths.

* Add functional tests to ensure that the compute services manage the correct
  subset of nodes when this is enabled.

* Add documentation for deployers and operators.

Dependencies
============

None.

Testing
=======

This will need to be tested in functional tests, as it would require spinning
up at least three nova-compute services to properly test the feature. While
possible in integration tests, this isn't a great use of CI resources.

Documentation Impact
====================

Deployer and operator documentation will need updates.

References
==========

This feature and its implementation were roughly agreed upon during the Stein
PTG. See line 662 or so (at the time of this writing):
https://etherpad.openstack.org/p/nova-ptg-stein

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced
184
specs/stein/implemented/live-migration-force-after-timeout.rst
Normal file
@@ -0,0 +1,184 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==================================
Live-Migration force after timeout
==================================

https://blueprints.launchpad.net/nova/+spec/live-migration-force-after-timeout

Replace the existing flawed automatic post-copy logic with the option to
force-complete live-migrations on completion timeout, instead of aborting.

Problem description
===================

In an ideal world, we could tell when a VM looks unable to move, and warn
the operator sooner than the `completion timeout`_. This was the idea with the
`progress timeout`_. Sadly we do not get enough information from QEMU and
libvirt to correctly detect this case. As we were sampling a sawtooth
wave, it was possible for us to think little progress was being made, when in
fact that was not the case. In addition, only memory was being monitored, so
large block_migrations always looked like they were making no progress. Refer
to the `References`_ section for details.

In Ocata we `deprecated`_ that progress timeout, and disabled it by default.
Given there is no quick way to make that work, it should be removed now.
The automatic post-copy is using the same flawed data, so that logic should
also be removed.

Nova currently optimizes for limited guest downtime, over ensuring the
live-migration operation always succeeds. When performing a host maintenance,
operators may want to move all VMs from the affected host to an unaffected
host. In some cases, the VM could be too busy to move before the completion
timeout, and currently that means the live-migration will fail with a timeout
error.

Automatic post-copy used to be able to help with this use case, ensuring Nova
does its best to ensure the live-migration completes, at the cost of a little
more VM downtime. We should look at a replacement for automatic post-copy.

.. _completion timeout: https://docs.openstack.org/nova/rocky/configuration/config.html#libvirt.live_migration_completion_timeout
.. _progress timeout: https://docs.openstack.org/nova/rocky/configuration/config.html#libvirt.live_migration_progress_timeout
.. _deprecated: https://review.openstack.org/#/c/431635/

Use Cases
---------

* Operators want to patch a host and want to move all the VMs out of that
  host, with minimal impact to the VMs, so they use live-migration. If the VM
  isn't live-migrated there will be significant VM downtime, so it is better
  to take a little more VM downtime during the live-migration so the VM is
  able to avoid the much larger amount of downtime should the VM not get
  moved by the live-migration.

Proposed change
===============

* The config option ``libvirt.live_migration_progress_timeout`` was deprecated
  in Ocata, and can now be removed.
* The current logic in the libvirt driver to auto-trigger post-copy will be
  removed.
* A new configuration option ``libvirt.live_migration_timeout_action`` will be
  added. This new option will have the choices ``abort`` (default) or
  ``force_complete``. This option will determine what action will be taken
  against a VM after ``live_migration_completion_timeout`` expires. Currently
  nova just aborts the LM operation after the completion timeout expires.
  By default, we keep the same behavior of aborting after the completion
  timeout.

Please note the ``abort`` and ``force_complete`` actions that are options in
the ``live_migration_timeout_action`` config option are the same as if you
were to call the existing REST APIs of the same name. In particular,
``force_complete`` will either pause the VM or trigger post-copy depending on
whether post-copy is enabled and available.
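
A minimal sketch of the dispatch described above (helper names are
hypothetical; only the configuration option names come from this spec)::

    def handle_completion_timeout(conf, guest, migration):
        if conf.libvirt.live_migration_timeout_action == 'force_complete':
            # Same semantics as the force-complete API action: switch to
            # post-copy when supported, otherwise pause the VM so the copy
            # can converge.
            if post_copy_enabled_and_available(conf, guest):
                trigger_post_copy(guest, migration)
            else:
                pause_vm(guest)
        else:
            # Default behaviour, unchanged: abort the live migration.
            abort_live_migration(guest, migration)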

Alternatives
------------

We could just remove the automatic post copy logic and not replace it, but
this stops us helping operators with the above use case.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Kevin Zheng

Other contributors:
  Yikun Jiang

Work Items
----------

* Remove ``libvirt.live_migration_progress_timeout`` and auto post copy logic.
* Add a new libvirt conf option ``live_migration_timeout_action``.

Dependencies
============

None

Testing
=======

Add in-tree functional and unit tests to test new logic. Testing these types
of scenarios in Tempest is not really possible given the unpredictable nature
of a timeout test. Therefore we can simulate and test the logic in functional
tests like those that `already exist`_.

.. _already exist: https://github.com/openstack/nova/blob/89c9127de/nova/tests/functional/test_servers.py#L3482

Documentation Impact
====================

Document new config options.

References
==========

* Live migration progress timeout bug: https://launchpad.net/bugs/1644248
* OSIC whitepaper: http://superuser.openstack.org/wp-content/uploads/2017/06/ha-livemigrate-whitepaper.pdf
* Boston summit session: https://www.openstack.org/videos/boston-2017/openstack-in-motion-live-migration

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Pike
     - Approved but not implemented
   * - Stein
     - Reproposed
195
specs/stein/implemented/per-aggregate-scheduling-weight.rst
Normal file
@@ -0,0 +1,195 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============================
Per aggregate scheduling weight
===============================

https://blueprints.launchpad.net/nova/+spec/per-aggregate-scheduling-weight

This spec proposes to add the ability to use an ``Aggregate``'s
``metadata`` to override the global config options for weights to achieve
more fine-grained control over resource weights.

Problem description
===================

In the current implementation, the weights are controlled by config options
like ``[filter_scheduler] cpu_weight_multiplier``; the total weight of a
compute node is calculated by a combination of several weighers::

    weight = w1_multiplier * norm(w1) + w2_multiplier * norm(w2) + ...

As they are controlled by config options, the weights are global across the
whole deployment, which is not convenient enough for operators and users.

Use Cases
---------

As an operator I may want to have more fine-grained control over the resource
scheduling weight configuration so that I can control my resource allocations.

Operators may divide the resource pool by hardware type and the workloads
suitable for that hardware with host aggregates. Setting an independent
scheduling weight for each aggregate can make it easier to control the
scheduling behavior (spreading or packing). For example, by default I want my
deployment to stack resources to conserve energy, but for my HPC aggregate, I
want to set ``cpu_weight_multiplier=10.0`` to spread instances across the
hosts in that aggregate because I want to avoid noisy neighbors as much as
possible.

Operators may also restrict flavors/images to host aggregates, and those
flavors/images may have preferences about the importance of CPU/RAM/DISK;
setting a suitable weight for this aggregate other than the global weight
could provide a more suitable resource allocation for the corresponding
workloads. For example, I want to deploy a big data analysis cluster (for
example Hadoop); there are different roles for each VM in this cluster.
For some of them the amount of CPU and RAM is much more important than DISK,
like the ``HDFS NameNode`` and nodes that run ``MapReduce`` tasks; for some
of them, the size of DISK is more important, like the ``HDFS DataNodes``.
Creating different flavors/images and restricting them to aggregates that
have suitable scheduling weights can give an overall better resource
allocation and performance.

Proposed change
===============

This spec proposes to add the ability for existing weighers to read the
``*_weight_multiplier`` from ``aggregate metadata`` to override the
``*_weight_multiplier`` from config files to achieve a more flexible
weight during scheduling.

This will be done by making the ``weight_multiplier()`` method take a
``HostState`` object as a parameter and get the corresponding
``weight_multiplier`` from the aggregate metadata, similar to how
``nova.scheduler.utils.aggregate_values_from_key()`` is used by the
``AggregateCoreFilter`` filter. If the host is in multiple aggregates and
there are conflicting weight values in the metadata, we will use the minimum
value among them.
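
A sketch of how a weigher could resolve its multiplier per host (the
metadata key shown and the helper name are illustrative assumptions; only
the lookup order comes from this spec)::

    def resolve_weight_multiplier(host_state, conf_multiplier,
                                  key='cpu_weight_multiplier'):
        values = []
        for aggregate in host_state.aggregates:
            raw = aggregate.metadata.get(key)
            if raw is None:
                continue
            try:
                values.append(float(raw))
            except ValueError:
                # Ignore malformed metadata values.
                continue
        if values:
            # Conflicting values across aggregates: use the minimum.
            return min(values)
        # No aggregate override: fall back to the global config option.
        return conf_multiplier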

Alternatives
------------

Add the ability to read the above mentioned multipliers from
``flavor extra_specs`` to make them per-flavor.

This alternative will not be implemented because:

- It could be very difficult to manage per-flavor weights in a
  cloud with a lot of flavors, e.g. a public cloud.

- Per-flavor weights do not help the case of an image that
  requires some kind of extra weight on the host it is used on, so
  per-flavor weights are less flexible; with the proposed solution
  we can apply the weights to aggregates which can then be used to
  restrict both flavors (AggregateInstanceExtraSpecsFilter) and
  images (AggregateImagePropertiesIsolation).

Data model impact
-----------------

None.

REST API impact
---------------

None.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

There could be a minor decrease in scheduling performance as
some data gathering and calculation will be added.

Other deployer impact
---------------------

None.

Developer impact
----------------

None.

Upgrade impact
--------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Zhenyu Zheng

Work Items
----------

#. Add the ability for existing weighers to read the
   ``*_weight_multiplier`` from ``aggregate metadata`` to override
   the ``*_weight_multiplier`` from config files to achieve a more
   flexible weight during scheduling.

#. Update the docs about the new change.

Dependencies
============

None.

Testing
=======

Unit tests for verifying the behaviour when a ``*_weight_multiplier`` is
provided in aggregate metadata.

Documentation Impact
====================

Update the weights user reference documentation here:

https://docs.openstack.org/nova/latest/user/filter-scheduler.html#weights

The aggregate metadata key/value for each weigher will be called out in
the documentation.

References
==========

None.

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced
186
specs/stein/implemented/per-instance-libvirt-sysinfo-serial.rst
Normal file
@@ -0,0 +1,186 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Per-instance serial number
==========================

Add support for providing unique per-instance serial numbers to servers.

Problem description
===================

A libvirt guest's serial number in the machine BIOS comes from the
``[libvirt]/sysinfo_serial`` configuration option [1]_, which defaults to
reading it from the compute host's ``/etc/machine-id`` file or, if that does
not exist, reading it from the libvirt host capabilities. Either way, all
guests on the same host have the same serial number in the guest BIOS.

This can be problematic for guests running licensed software that charges per
installation based on the serial number, because if the guest is migrated it
will incur new charges even though it is only running a single instance of the
software.

If the guest has a specific serial unique to itself, then the license
essentially travels with the guest.

Use Cases
---------

As a user (or cloud provider), I do not want workloads to incur license
charges simply because those workloads are migrated during normal
operation of the cloud.

Proposed change
===============

To allow users to control this behavior (if the cloud provides it), a new
flavor extra spec ``hw:unique_serial`` and corresponding image property
``hw_unique_serial`` will be introduced; when either is set to ``True``,
the guest serial number will be set to the instance UUID.

For operators that just want per-instance serial numbers, either globally or
for a set of host aggregates, a new "unique" choice will be added to the
existing ``[libvirt]/sysinfo_serial`` configuration option which, if set, will
result in the guest serial number being set to the instance UUID. Note that
the default value for the option will not change as part of this blueprint.

The flavor/image value, if set, supersedes the host configuration.
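
A minimal sketch of that selection order (helper and variable names are
assumptions)::

    def pick_guest_serial(instance, image_props, extra_specs, conf):
        wants_unique = (
            extra_specs.get('hw:unique_serial') == 'True'
            or image_props.get('hw_unique_serial', False))
        if wants_unique:
            # Per-instance serial requested via flavor or image.
            return instance.uuid
        if conf.libvirt.sysinfo_serial == 'unique':
            # New host-level choice: every guest gets its own serial.
            return instance.uuid
        # Existing behaviour: host-wide serial, e.g. from /etc/machine-id.
        return get_host_sysinfo_serial(conf)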

Alternatives
------------

We could allow users to pass through a serial number UUID when creating
a server and then pass that down through to the hypervisor, but that seems
somewhat excessive for this small change. It is also not clear that all
hypervisor backends support specifying the serial number in the guest and we
want to avoid adding API features that not all compute drivers can support.
Allowing the user to specify a serial number could also potentially be abused
for pirating software unless a unique constraint was put in place, but even
then it would have to span an entire deployment (per-cell DB restrictions would
not be good enough).

Data model impact
-----------------

None besides a new ``FlexibleBooleanField`` field being added to the
``ImageMetaProps`` object.

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None. Users can leverage the functionality by creating new servers with an
enabled flavor/image, or rebuild/resize existing servers with an enabled
flavor/image.

Performance Impact
------------------

None

Other deployer impact
---------------------

Operators that wish to expose this functionality can do so by adding the
extra spec to their flavors and/or images or setting
``[libvirt]/sysinfo_serial=unique`` in nova configuration. If they want to
restrict the functionality to a set of compute hosts, that can also be done by
restricting enabled flavors/images to host aggregates.

Developer impact
----------------

None, except maintainers of other compute drivers besides the libvirt driver
may wish to support the feature eventually.

Upgrade impact
--------------

There is not an explicit upgrade impact except that obviously older compute
code would not know about the new flavor extra spec or image property and thus
if a user was requesting a server with the property, but the serial in the
guest did not match the instance UUID, they could be confused about why it
does not work. Again, operators can control this by deciding when to enable
the feature or by restricting it to certain host aggregates.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Zhenyu Zheng <zhengzhenyu@huawei.com> (Kevin_Zheng)

Other contributors:
  Matt Riedemann <mriedem.os@gmail.com> (mriedem)

Work Items
----------

* Add the ``ImageMetaProps.hw_unique_serial`` field.
* Add a new choice, "unique", to the ``[libvirt]/sysinfo_serial`` configuration
  option.
* Check for the flavor extra spec and image property in the libvirt driver
  where the serial number config is set.
* Docs and tests.

Dependencies
============

None

Testing
=======

Unit tests should be sufficient for this relatively small feature.

Documentation Impact
====================

* The flavor extra spec will be documented: https://docs.openstack.org/nova/latest/user/flavors.html
* The image property will be documented: https://docs.openstack.org/glance/latest/admin/useful-image-properties.html
* The new configuration option choice will be documented [1]_

References
==========

.. [1] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sysinfo_serial

* Libvirt documentation: https://libvirt.org/formatdomain.html#elementsSysinfo
* Nova meeting discussion: http://eavesdrop.openstack.org/meetings/nova/2018/nova.2018-10-18-14.00.log.html#l-199

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced
@@ -0,0 +1,185 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Remove force flag from live-migrate and evacuate
================================================

https://blueprints.launchpad.net/nova/+spec/remove-force-flag-from-live-migrate-and-evacuate

Force live-migrate and evacuate operations cannot be meaningfully supported for
servers having complex resource allocations. So this spec proposes to remove
the ``force`` flag from these operations in a new REST API microversion.

Problem description
===================

Today when ``force: True`` is specified nova tries to blindly copy the resource
allocation from the source host to the target host. This only works if the
server's allocation is satisfied by the single root resource provider both
on the source host and on the destination host. As soon as the allocation
becomes more complex (e.g. it allocates from more than one provider
(including sharing providers) or allocates only from a nested provider) the
blind copy will fail.

Use Cases
---------

This change removes the following use cases from the system:

* The admin cannot force a live-migration to a specified destination host
  against the Nova scheduler and Placement agreement.
* The admin cannot force an evacuate to a specified destination host against
  the Nova scheduler and Placement agreement.

This does not affect the use cases where the operator specifies the destination
host and lets Nova and Placement verify that host before the move.

Please note that this removes the possibility to force live-migrate servers to
hosts where the nova-compute is disabled, as the ComputeFilter in the filter
scheduler will reject such hosts.

Proposed change
===============

Forcing the destination host in a complex allocation case cannot be supported
without calling Placement to get allocation candidates on the destination
host, as Nova does not know how to copy the complex allocation. The
documentation of the force flag states that Nova will not call the scheduler
to verify the destination host. This rule has already been broken since Pike
by two `bugfixes`_. Also, supporting complex allocations requires getting
allocation candidates from Placement. So this spec proposes to remove the
``force`` flag as it cannot be supported any more.

Note that fixing old microversions to fail cleanly without leaking resources
in complex allocation scenarios is not part of this spec but is handled as
part of `use-nested-allocation-candidates`_. That change will make sure that a
forced move operation on a server that either has a complex allocation on the
source host or would require a complex allocation on the destination host will
be rejected with a NoValidHost exception by the Nova conductor.
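
A rough sketch of how the API layer could treat the flag once the new
microversion lands (the version check and helper are placeholders; only the
behaviour, ``force`` no longer accepted and internals called with
``force=False``, comes from this spec)::

    def parse_live_migrate_body(body, requested_version, new_version):
        action = body['os-migrateLive']
        if requested_version >= new_version:
            # New microversion: the schema no longer allows 'force', so the
            # scheduler and Placement always validate the destination host.
            return {'host': action.get('host'), 'force': False}
        # Older microversions keep the legacy behaviour.
        return {'host': action.get('host'),
                'force': action.get('force', False)}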

Alternatives
------------

* Try to guess when the server needs a complex allocation on the destination
  host and only ignore the force flag in these cases.
* Do not manage resource allocations for forced move operations.

See more details in the `ML thread`_.

Data model impact
-----------------

None

REST API impact
---------------

In a new microversion remove the ``force`` flag from both APIs:

* POST /servers/{server_id}/action (os-migrateLive Action)
* POST /servers/{server_id}/action (evacuate Action)

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

Update python-novaclient and python-openstackclient to support the new
microversion.

As the admin cannot skip the scheduler any more when moving servers, such
moves can fail for scheduler and Placement related reasons.

Performance Impact
------------------

As the admin cannot skip the scheduler when moving a server, such a move will
take a bit more time as Nova will call the scheduler and Placement.

Other deployer impact
---------------------

Please note that this spec removes the possibility to force live-migrate
servers to hosts where the nova-compute is disabled, as the ComputeFilter in
the filter scheduler will reject such hosts.

Developer impact
----------------

Supporting the force flag has been a detriment to maintaining nova since it's
an edge case and requires workarounds like the ones made in Pike to support
it. Dropping support over time will be a benefit to maintaining the project
and improve the consistency/reliability/usability of the API.

Upgrade impact
--------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  balazs-gibizer

Work Items
----------

* Add a new microversion to the API that removes the ``force`` flag from the
  payload. If the new microversion is used in the request then default
  ``force`` to False when calling Nova internals.
* Document the new microversion.
* Add support for the new microversion in python-novaclient and
  python-openstackclient.

Dependencies
============

* Some part of `use-nested-allocation-candidates`_ is a dependency of this
  work.

Testing
=======

* Functional and unit tests will be provided.

Documentation Impact
====================

* The API reference document needs to be updated.

References
==========

.. _`use-nested-allocation-candidates`: https://blueprints.launchpad.net/nova/+spec/use-nested-allocation-candidates
.. _`ML thread`: http://lists.openstack.org/pipermail/openstack-dev/2018-October/135551.html
.. _`bugfixes`: https://review.openstack.org/#/q/I6590f0eda4ec4996543ad40d8c2640b83fc3dd9d+OR+I40b5af5e85b1266402a7e4bdeb3705e1b0bd6f3b

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced
164
specs/stein/implemented/show-server-group.rst
Normal file
@@ -0,0 +1,164 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==================================================
show which server group a server is in "nova show"
==================================================

bp link:

https://blueprints.launchpad.net/nova/+spec/show-server-group

Problem description
===================

Currently you have to loop over all server groups to find the group a server
belongs to. This spec tries to address this by proposing to show the server
group information in the API `GET /servers/{server_id}`.

Use Cases
---------

* Admin/End users want to know the server group that a server belongs to
  in a direct way.

Proposed change
===============

Proposes to add the server group UUID to ``GET /servers/{id}``,
``PUT /servers/{server_id}`` and the REBUILD API
``POST /servers/{server_id}/action``.

The server group information will not be included in the
``GET /servers/detail`` API, because the server group information
needs another DB query.
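
A rough sketch of the extra lookup (``InstanceGroup.get_by_instance_uuid``
exists in nova's objects layer, but treating this snippet as the exact
implementation is an assumption)::

    from nova import exception
    from nova import objects

    def server_group_uuids(context, server_uuid):
        try:
            group = objects.InstanceGroup.get_by_instance_uuid(
                context, server_uuid)
        except exception.InstanceGroupNotFound:
            return []
        # Returned as a list to keep the response shape extensible.
        return [group.uuid]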

Alternatives
------------

* One alternative is to support a server groups filter by server UUID, like
  "GET /os-server-groups?server=<UUID>".

* Another alternative to support the server group query is the following API:
  "GET /servers/{server_id}/server_groups".

Data model impact
-----------------

No

REST API impact
---------------

Allows the `GET /servers/{server_id}` API to show the server group's UUID.
"PUT /servers/{server_id}" and the REBUILD API
"POST /servers/{server_id}/action" also return the same information.

.. highlight:: json

The returned information for the server group::

    {
        "server": {
            "server_groups": [ # not cached
                "0b5d2c72-12cc-4ba6-a8d7-3ff5cc1d8cb8"
            ]
        }
    }

Security impact
---------------

N/A

Notifications impact
--------------------

N/A

Other end user impact
---------------------

* python-novaclient would contain the server_group information.

Performance Impact
------------------

* Another DB query is needed to retrieve the server group UUID. To reduce the
  performance impact for batch API calls, "GET /servers/detail" won't
  return server group information.

Other deployer impact
---------------------

N/A

Developer impact
----------------

N/A

Upgrade impact
--------------

N/A

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Yongli He

Work Items
----------

* Add a new microversion for this change.

Dependencies
============

N/A

Testing
=======

* Add functional api_sample tests.
* Add microversion related tests to tempest.

Documentation Impact
====================

* The API document should be changed to introduce this new feature.

References
==========

* Stein PTG discussion: https://etherpad.openstack.org/p/nova-ptg-stein

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Version
   * - Stein
     - First Version
@@ -1 +0,0 @@
|
||||
../../stein-template.rst
|
||||
219
specs/stein/implemented/support-hpet-on-guest.rst
Normal file
@@ -0,0 +1,219 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=======================================================
Support High Precision Event Timer (HPET) on x86 guests
=======================================================

https://blueprints.launchpad.net/nova/+spec/support-hpet-on-guest

Problem description
===================

Use Cases
---------

As an end user looking to migrate an existing appliance to run in a cloud
environment I would like to be able to request a guest with HPET so that I can
share common code between my virtualized and physical products.

As an operator I would like to support onboarding legacy VNFs for my telco
customers where a guest image cannot be modified to work without an HPET.

Proposed change
===============

End users can indicate their desire to have HPET in the guest by specifying an
image property ``hw_time_hpet=True``.

Setting the new image property to "True" would only be guaranteed to be valid
in combination with ``hypervisor_type=qemu`` and either ``architecture=i686``
or ``architecture=x86_64``.

.. note:: A corresponding flavor extra spec will not be introduced since
   enabling HPET is really a per-image concern rather than a resource concern
   for capacity planning.

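As an illustration of the intended behaviour only, the sketch below shows one
way the image property could be turned into the guest's libvirt clock
configuration. The helper name and argument shapes are hypothetical; the
actual change would live in the libvirt driver's guest config generation.

.. code-block:: python

    def guest_clock_xml(image_props, arch, virt_type):
        """Hypothetical helper: build a libvirt <clock> snippet.

        HPET is only switched on for qemu guests on i686/x86_64 when the
        image sets hw_time_hpet=True; otherwise it stays off, which is the
        current default behaviour.
        """
        requested = str(image_props.get("hw_time_hpet", False)).lower() == "true"
        supported = virt_type == "qemu" and arch in ("i686", "x86_64")
        present = "yes" if (requested and supported) else "no"
        return ("<clock offset='utc'>\n"
                "  <timer name='hpet' present='%s'/>\n"
                "</clock>" % present)
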
A few options to use traits were considered as described in the next section,
but we ended up choosing the simpler approach for the following reasons:

1) HPET is provided by qemu via emulation, so there are no security
   implications as there are already better clock sources available.

2) The HPET was turned off by default purely because of issues with time
   drifting on Windows guests. (See nova commit ba3fd16605.)

3) The emulated HPET device is unconditionally available on all versions of
   libvirt/qemu supported by OpenStack.

4) The HPET device is only supported for x86 architectures, so in a cloud with
   a mix of architectures the image would have to be specific to ensure the
   instance is scheduled on an x86 host.

5) Initially we would only support enabling HPET on qemu. Specifying the
   hypervisor type will ensure the instance is scheduled on a host using the
   qemu hypervisor. It would be possible to extend this to other hypervisors
   as well if applicable (vmware supports the ability to enable/disable HPET,
   I think), and which ones are supported could be documented in the "useful
   image properties" documentation.


Alternatives
------------

The following options to use traits were considered, but ultimately we chose
the simpler approach without using traits.

Explicit Trait, Implicit Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Operators can indicate their desire to have HPET in the guest by specifying a
placement trait ``trait:COMPUTE_TIME_HPET=required`` in the flavor extra specs.

End users can indicate their desire to have HPET in the guest by uploading
their own images with the same trait.

Existing nova scheduler code picks up the trait and passes it to
``GET /allocation_candidates``.

Once scheduled to a compute node, the virt driver looks for
``trait:COMPUTE_TIME_HPET=required`` in the flavor/image or
``trait*:COMPUTE_TIME_HPET=required`` for a numbered request group in the
flavor, and uses that as its cue to enable HPET on the guest.

If we do get down to the virt driver and the trait is set, and the driver for
whatever reason (e.g. value(s) wrong in the flavor; wind up on a host that
doesn't support HPET etc.) determines it's not capable of flipping the switch,
it should fail. [1]_

**CON:** We're using a trait to effect guest configuration.

Explicit Config, Implicit Trait
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Operator specifies extra spec ``hw:hpet=True`` in the flavor.
* Nova recognizes this as a known special case and adds
  ``required=COMPUTE_TIME_HPET`` to the ``GET /allocation_candidates`` query.
* The driver uses the ``hw:hpet=True`` extra spec as its cue to enable HPET on
  the guest.

**CON:** The implicit transformation of a special extra spec into
placement-isms is arcane. This wouldn't be the only instance of this; we would
need to organize the "special" extra specs in the code for maintainability, and
document them thoroughly.

Explicit Config, Explicit Trait
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Operator specifies **both** extra specs, ``hw:hpet=True`` and
  ``trait:COMPUTE_TIME_HPET=required``, in the flavor.
* Existing nova scheduler code picks up the latter and passes it to ``GET
  /allocation_candidates``.
* The driver uses the ``hw:hpet=True`` extra spec as its cue to enable HPET on
  the guest.

**CON:** The operator has to remember to set both extra specs, which is kind of
gross UX. (If she forgets ``hw:hpet=True``, she ends up with HPET off; if she
forgets ``trait:COMPUTE_TIME_HPET=required``, she can end up with late-failing
NoValidHosts.)

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Negligible.


Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

The new image property will only work reliably after all nodes have been
upgraded.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  jackding

Other contributors:
  jaypipes, efried

Work Items
----------

* libvirt driver changes to support HPET

Dependencies
============

None

Testing
=======

Will add unit tests.


Documentation Impact
====================

Update the user documentation for image properties [2]_.

References
==========

.. [1] http://lists.openstack.org/pipermail/openstack-dev/2018-October/135446.html
.. [2] https://docs.openstack.org/glance/latest/admin/useful-image-properties.html

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced

@@ -0,0 +1,189 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

========================================================
Support to query nova resources filter by changes-before
========================================================

https://blueprints.launchpad.net/nova/+spec/support-to-query-nova-resources-filter-by-changes-before

The compute API already has the changes-since filter to filter servers updated
since the given time and this spec proposes to add a changes-before filter to
filter servers updated before the given time. In addition, the two filters can
be used in conjunction to build a kind of time range filter, e.g. to get the
nova resources changed between changes-since and changes-before.

Problem description
===================
By default, nova can only query resources updated in the
``updated_at >= changes-since`` time period. Users can only query resources
operated on after a given time, not during a given period. Users may be
interested in resources operated on during a specific period for monitoring
or statistics purposes, but currently they have to retrieve and filter the
resources by themselves. This change brings convenience to users and also
improves the efficiency of timestamp based queries.

Use Cases
---------
In a large scale environment, lots of resources are created in the system.
To trace resource changes, a user or management system only needs to get the
resources that changed within some time period, instead of querying all
resources every time to see which ones changed.

For example, if you are trying to get the nova resources that were changed
before '2018-07-26T10:31:49Z', you can filter servers like:

* GET /servers/detail?changes-before=2018-07-26T10:31:49Z

Or if you want to filter servers in a time range (e.g. changes-since=
2018-07-26T10:31:49Z -> changes-before=2018-07-30T10:31:49Z), you can
filter servers like:

* GET /servers/detail?changes-since=2018-07-26T10:31:49Z&changes-before=
  2018-07-30T10:31:49Z

Proposed change
===============
Add a new microversion to the os-instance-actions, os-migrations and servers
list APIs to support changes-before.

Introduce a new changes-before filter for retrieving resources. It accepts a
timestamp and the APIs will return resources whose updated_at fields are
earlier than or equal to this timestamp, i.e. "updated_at <= changes-before".
The changes-before value is optional. If both changes-since and changes-before
are passed, the APIs will return resources whose updated_at fields are earlier
than or equal to changes-before, and later than or equal to changes-since.

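To make the intended semantics concrete, here is a small illustrative sketch;
the function and argument names are placeholders, not the actual nova code.

.. code-block:: python

    import datetime


    def matches_time_filters(updated_at, changes_since=None,
                             changes_before=None):
        """Return True if a resource's updated_at satisfies the filters.

        changes-since alone  : updated_at >= changes-since
        changes-before alone : updated_at <= changes-before
        both                 : changes-since <= updated_at <= changes-before
        """
        if changes_since and changes_before and changes_before < changes_since:
            # The REST API would reject this combination with HTTP 400.
            raise ValueError("changes-before must not be earlier than "
                             "changes-since")
        if changes_since and updated_at < changes_since:
            return False
        if changes_before and updated_at > changes_before:
            return False
        return True


    # A resource updated on 2018-07-28 falls inside the example range
    # 2018-07-26T10:31:49Z .. 2018-07-30T10:31:49Z used in this spec.
    assert matches_time_filters(
        datetime.datetime(2018, 7, 28, 12, 0, 0),
        changes_since=datetime.datetime(2018, 7, 26, 10, 31, 49),
        changes_before=datetime.datetime(2018, 7, 30, 10, 31, 49))
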
**Reading deleted resources**

Like the ``changes-since`` filter, the ``changes-before`` filter will also
return deleted servers.

This spec does not propose to change any read-deleted behavior in the
os-instance-actions or os-migrations APIs. The os-instance-actions API
with the 2.21 microversion allows retrieving instance actions for a deleted
server resource. The os-migrations API takes an optional ``instance_uuid``
filter parameter but does not support returning deleted migration records like
``changes-since`` does in the servers API.

Alternatives
------------
As discussed in the `Problem description`_ section, users can retrieve and
then filter resources by themselves, but this method is extremely
inconvenient. Having said that, services like Searchlight do exist which have
similar functionality, i.e. listening for nova notifications and storing them
in a time-series database like elasticsearch from which results can later be
queried. However, requiring Searchlight or a similar alternative solution for
this relatively small change is likely excessive.
Leaving the filtering work to the database can utilize the optimizations of
the database engine and also reduces the data transmitted from server to
client.

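As an aside, the following is a minimal sketch of what database-side
filtering looks like; the SQLAlchemy model is made up purely for
illustration and is not nova's actual DB API code.

.. code-block:: python

    import sqlalchemy as sa
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()


    class Instance(Base):
        """Stand-in model for illustration only."""
        __tablename__ = "instances"
        id = sa.Column(sa.Integer, primary_key=True)
        updated_at = sa.Column(sa.DateTime)


    def apply_time_filters(query, changes_since=None, changes_before=None):
        # Both comparisons run inside the database engine, so only the
        # matching rows are transferred back to the API service.
        if changes_since is not None:
            query = query.filter(Instance.updated_at >= changes_since)
        if changes_before is not None:
            query = query.filter(Instance.updated_at <= changes_before)
        return query
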
Data model impact
-----------------
None

REST API impact
---------------
A new microversion will be added.

The list APIs will accept the new query string parameter changes-before.
The following cases apply:

* If the user specifies changes-before < changes-since, a 400 HTTPBadRequest
  will be returned.
* If the user only specifies changes-before, all nova resources updated
  before changes-before will be returned, including deleted servers.
* If the user specifies both changes-since and changes-before, the changes
  from that specific period will be returned, including deleted servers.
* When the user only specifies changes-since, the original behavior
  remains unchanged.

Users can pass a time to the list API URLs to retrieve resources operated on
before a specific time.

* GET /servers?changes-before=2018-07-26T10:31:49Z
* GET /servers/detail?changes-before=2018-07-26T10:31:49Z
* GET /servers/{server_id}/os-instance-actions?changes-before=
  2018-07-26T10:31:49Z
* GET /os-migrations?changes-before=2018-07-26T10:31:49Z

Security impact
---------------
None

Notifications impact
--------------------
None

Other end user impact
---------------------
The python client may add help text to inform users of this new filter.
Add support for the changes-before filter in python-novaclient
for the 'nova list', 'nova migration-list' and
'nova instance-action-list' commands.

Performance Impact
------------------
None

Other deployer impact
---------------------
None

Developer impact
----------------
None

Upgrade impact
--------------
None

Implementation
==============

Assignee(s)
-----------
Primary assignee:
  Brin Zhang

Work Items
----------
* Add query support in SQL
* Add the API filter
* Add related tests
* Add support for changes-before to the 'nova list' operation in novaclient
* Add support for changes-before to the 'nova instance-action-list'
  operation in novaclient
* Add support for changes-before to the 'nova migration-list' operation in
  novaclient

Dependencies
============
None

Testing
=======
* Add related unit tests
* Add related functional tests

Documentation Impact
====================
The nova API documentation will need to be updated to reflect the
REST API changes and to add microversion instructions.

References
==========
None

History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Stein
     - Introduced

161
specs/stein/implemented/vmware-live-migration.rst
Normal file
@@ -0,0 +1,161 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================
VMware live migration
=====================

https://blueprints.launchpad.net/nova/+spec/vmware-live-migration

This is a proposal for adding support for live migration in the VMware
driver. When the VMware driver is used, each nova-compute is managing a
single vCenter cluster. For the purposes of this proposal we assume that
all nova-computes are managing clusters under the same vCenter server. If
migration across different vCenter servers is attempted, an error message
will be generated and no migration will occur.

Problem description
===================

Live migration is not supported when the VMware driver is used.

Use Cases
---------

As an Operator I want to live migrate instances from one compute cluster
(nova-compute host) to another compute cluster (nova-compute host) in the
same vCenter server.

Proposed change
===============

Relocating VMs to another cluster/datastore is a simple matter of calling the
RelocateVM_Task() vSphere API. The source compute host needs to know the
cluster name and the datastore regex of the target compute host. If the
instance is located on a datastore shared between the two clusters, it will
remain there. Otherwise we will choose a datastore that matches the
datastore_regex of the target host and migrate the instance there. There will
be a pre live-migration check that will verify that both source and
destination compute nodes correspond to clusters in the same vCenter server.

A new object will be introduced (``VMwareLiveMigrateData``) which will carry
the host IP, the cluster name and the datastore regex of the target compute
host. All of them are obtained from the nova config (CONF.vmware.host_ip,
CONF.vmware.cluster_name and CONF.vmware.datastore_regex).

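A rough, illustrative sketch of the checks described above follows; the
helper names and the dict standing in for the migrate data object are
placeholders, not the final driver code.

.. code-block:: python

    import re


    def check_can_live_migrate_destination(src_vcenter, dst_vcenter,
                                           dst_cluster, dst_datastore_regex):
        """Pre live-migration check: both nodes must be in one vCenter."""
        if src_vcenter != dst_vcenter:
            raise RuntimeError("live migration across different vCenter "
                               "servers is not supported")
        # Fields the proposed VMwareLiveMigrateData object would carry,
        # taken from the destination's nova config.
        return {"cluster_name": dst_cluster,
                "datastore_regex": dst_datastore_regex}


    def pick_datastore(current_datastore, shared_datastores,
                       candidate_datastores, datastore_regex):
        """Keep a shared datastore, else pick one matching the regex."""
        if current_datastore in shared_datastores:
            return current_datastore
        for ds in candidate_datastores:
            if re.match(datastore_regex, ds):
                return ds
        raise RuntimeError("no datastore matches %s" % datastore_regex)
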
Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

None

Implementation
==============

https://review.openstack.org/#/c/270116/

Assignee(s)
-----------

Primary assignee:
  rgerganov

Work Items
----------

* Add ``VMwareLiveMigrateData`` object
* Implement pre live-migration checks
* Implement methods for selecting target ESX host and datastore
* Ensure CI coverage for live-migration
* Update support-matrix

Dependencies
============

None

Testing
=======

The VMware CI will provision two nova-computes and will execute the live
migration tests from tempest.

Documentation Impact
====================

The feature support matrix should be updated to indicate that live migration
is supported with the VMware driver.

References
==========

http://pubs.vmware.com/vsphere-60/topic/com.vmware.wssdk.apiref.doc/vim.VirtualMachine.html#relocate


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Newton
     - Introduced
   * - Ocata
     - Reproposed
   * - Pike
     - Reproposed
   * - Queens
     - Reproposed
   * - Rocky
     - Reproposed
   * - Stein
     - Reproposed

409
specs/stein/implemented/vrouter-hw-offloads.rst
Normal file
@@ -0,0 +1,409 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===================================
vRouter Hardware Offload Enablement
===================================

https://blueprints.launchpad.net/nova/+spec/vrouter-hw-offloads

SmartNICs allow complex packet processing on the NIC. In order to support
hardware acceleration for them, Nova core and os-vif need modifications to
support the combination of VIF and vRouter plugging that these NICs provide.
This spec proposes a hybrid SR-IOV and vRouter model to enable acceleration.

.. note:: In this spec, `Juniper Contrail`_, `OpenContrail`_ and
   `Tungsten Fabric`_ will be used interchangeably.

Problem description
===================

SmartNICs are able to route packets directly to individual SR-IOV Virtual
Functions. These can be connected to instances using IOMMU (vfio-pci
passthrough) or a low-latency vhost-user `virtio-forwarder`_ running on the
compute node. The `vRouter packet processing pipeline`_ is managed by a
`Contrail Agent`_. If `Offload hooks in kernel vRouter`_ are present, then
datapath match/action rules can be fully offloaded to the SmartNIC instead of
executed on the hypervisor.

For a deeper discussion on datapath offloads, it is highly recommended
to read the `Generic os-vif datapath offloads spec`_.

The ``vrouter`` VIF type has not been converted to the os-vif plugin model.
This spec proposes completing the conversion to an os-vif plugin as the first
stage.

Currently, Nova supports multiple types of Contrail plugging: TAP plugs,
vhost-user socket plugs or VEB SR-IOV plugs. Neutron and the Contrail
controller decide what VIF type to pass to Nova based on the Neutron port
semantics and the configuration of the compute node. This VIF type is then
passed to Nova:

* The ``vrouter`` VIF type plugs a TAP device into the kernel vrouter.ko
  datapath.
* The ``vhostuser`` VIF type with the ``vhostuser_vrouter_plug`` mode plugs
  into the DPDK-based vRouter datapath.
* The ``hw_veb`` VIF type plugs a VM into the VEB datapath of a NIC using
  vfio-pci passthrough.

In order to enable full datapath offloads for SmartNICs, Nova needs to support
additional VNIC types when plugging a VM with the ``vrouter`` VIF type, while
consuming a PCIe Virtual Function resource.

`Open vSwitch offloads`_ recognises the following VNIC types:

* The ``normal`` (or default) VNIC type indicates that the Instance is plugged
  into the software bridge. The ``vrouter`` VIF type currently supports only
  this VNIC type.
* The ``direct`` VNIC type indicates that a VF is passed through to the
  Instance.

In addition, the Agilio OVS VIF type implements the following offload mode:

* The ``virtio-forwarder`` VNIC type indicates that a VF is attached via a
  `virtio-forwarder`_.

Use Cases
---------

* Currently, an end user is able to attach a port to an Instance, running on a
  hypervisor with support for plugging vRouter VIFs, by using one of the
  following methods:

  * Normal: Standard kernel based plugging, or vhost-user based plugging
    depending on the datapath running on the hypervisor.
  * Direct: PCI passthrough plugging into the VEB of an SR-IOV NIC.

* In addition, an end user should be able to attach a port to an Instance
  running on a properly configured hypervisor, equipped with a SmartNIC, using
  one of the following methods:

  * Passthrough: Accelerated IOMMU passthrough to an offloaded vRouter
    datapath, ideal for NFV-like applications.
  * Virtio Forwarder: Accelerated vhost-user passthrough, maximum
    software compatibility with standard virtio drivers and with support for
    live migration.

* This enables Juniper, Tungsten Fabric (and partners like Netronome) to
  achieve functional parity with the existing OVS VF Representor datapath
  offloads for vRouter.

Proposed change
===============

* Stage 1: vRouter migration to os-vif.

  * The `vRouter os-vif plugin`_ has been updated with the required code on
    the master branch. Changes in Nova for this stage are gated on a release
    being issued on that project in order to reflect the specific version
    required in the release notes.

    Progress on this task is tracked on the `vRouter os-vif conversion
    blueprint`_.

  * In ``nova/virt/libvirt/vif.py``:

    Remove the legacy vRouter config generation code,
    ``LibvirtGenericVIFDriver.get_config_vrouter()``, and migrate the plugging
    code, ``LibvirtGenericVIFDriver.{plug,unplug}_vrouter()``, to an external
    os-vif plugin.

    For kernel-based plugging, VIFGeneric will be used.

  * In ``privsep/libvirt.py``

    Remove privsep code, ``{plug,unplug}_contrail_vif()``:

    The call to ``vrouter-port-control`` will be migrated to the external
    os-vif plugin, and further changes will be beyond the scope of Nova.

* Stage 2: Extend os-vif with better abstraction for representors.

  os-vif's object model needs to be updated with a better abstraction model
  to allow representors to be applicable to the ``vrouter`` datapath.

  This stage will be covered by implementing the `Generic os-vif datapath
  offloads spec`_.

* Stage 3: Extend the ``vrouter`` VIF type in Nova.

  Modify ``_nova_to_osvif_vif_vrouter`` to support two additional VNIC types:

  * ``VNIC_TYPE_DIRECT``: os-vif ``VIFHostDevice`` will be used.

  * ``VNIC_TYPE_VIRTIO_FORWARDER``: os-vif ``VIFVHostUser`` will be used.

  Code impact to Nova will be to pass through the representor information to
  the os-vif plugin using the extensions developed in Stage 2. An
  illustrative sketch of this mapping follows the list below.

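The following is a minimal, illustrative sketch of that VNIC-type dispatch.
It glosses over how the Stage 2 representor information is attached, the
input is assumed to be a plain dict, and the constant values simply mirror
the Neutron VNIC type names used above; it is not the actual Nova code.

.. code-block:: python

    # os-vif object model classes from the os_vif library.
    from os_vif.objects import vif as osv_vifs

    VNIC_TYPE_NORMAL = "normal"
    VNIC_TYPE_DIRECT = "direct"
    VNIC_TYPE_VIRTIO_FORWARDER = "virtio-forwarder"


    def nova_to_osvif_vif_vrouter_sketch(vif):
        """Illustration only: map the VNIC type to an os-vif object."""
        vnic_type = vif.get("vnic_type", VNIC_TYPE_NORMAL)
        if vnic_type == VNIC_TYPE_NORMAL:
            # Kernel vRouter datapath: a plain generic VIF (TAP device).
            return osv_vifs.VIFGeneric(id=vif["id"])
        if vnic_type == VNIC_TYPE_DIRECT:
            # IOMMU passthrough of the VF; the representor details would be
            # carried in port_profile.datapath_offload (Stage 2).
            return osv_vifs.VIFHostDevice(id=vif["id"])
        if vnic_type == VNIC_TYPE_VIRTIO_FORWARDER:
            # vhost-user socket served by the virtio-forwarder.
            return osv_vifs.VIFVHostUser(id=vif["id"])
        raise ValueError("unsupported vnic_type: %s" % vnic_type)
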
Summary of plugging methods
---------------------------

* Existing methods supported by Contrail:

  * VIF type: ``hw_veb`` (legacy)

    * VNIC type: ``direct``

  * VIF type: ``vhostuser`` (os-vif plugin: ``contrail_vrouter``)

    * VNIC type: ``normal``
    * ``details: vhostuser_vrouter_plug: True``
    * os-vif object: ``VIFVHostUser``

  * VIF type: ``vrouter`` (legacy)

    * VNIC type: ``normal``

* After migration to os-vif (Stage 1):

  * VIF type: ``hw_veb`` (legacy)

    * VNIC type: ``direct``

  * VIF type: ``vhostuser`` (os-vif plugin: ``contrail_vrouter``)

    * VNIC type: ``normal``
    * ``details: vhostuser_vrouter_plug: True``
    * os-vif object: ``VIFVHostUser``

  * VIF type: ``vrouter`` (os-vif plugin: ``vrouter``)

    * VNIC type: ``normal``
    * os-vif object: ``VIFGeneric``

* Additional accelerated plugging modes (Stage 3):

  * VIF type: ``vrouter`` (os-vif plugin: ``vrouter``)

    * VNIC type: ``direct``
    * os-vif object: ``VIFHostDevice``
    * ``port_profile.datapath_offload: DatapathOffloadRepresentor``

  * VIF type: ``vrouter`` (os-vif plugin: ``vrouter``)

    * VNIC type: ``virtio-forwarder``
    * os-vif object: ``VIFVHostUser``
    * ``port_profile.datapath_offload: DatapathOffloadRepresentor``

Additional notes
----------------

* Stage 1 and Stage 2 can be completed and verified in parallel. The
  abstraction layer will be tested on the Open vSwitch offloads.

* Selecting between the VEB passthrough mode and the offloaded vRouter
  datapath passthrough mode happens at the `Contrail Controller`_. This is
  keyed on the provider network associated with the Neutron port.

* The `vRouter os-vif plugin`_ has been updated to adopt ``vrouter`` as the new
  os-vif plugin name. ``contrail_vrouter`` is kept as a backwards compatible
  alias. This prevents namespace fragmentation. `Tungsten Fabric`_,
  `OpenContrail`_ and `Juniper Contrail`_ can use a single os-vif plugin
  for the vRouter datapath.

* No corresponding changes in Neutron are expected. The Contrail Neutron
  plugin and agent require minimal changes in order to allow the semantics
  to propagate correctly.

* This change is agnostic to the SmartNIC datapath: should Contrail switch
  to TC based offloads, eBPF or a third-party method, the Nova plugging
  logic will remain the same for full offloads.

* A deployer/administrator still has to register the PCI devices on the
  hypervisor with ``pci_passthrough_whitelist`` in ``nova.conf``.

* SmartNIC-enabled nodes and standard compute nodes can run side-by-side.
  Standard scheduling filters allocate and place Instances according to port
  types and driver capabilities.

Alternatives
------------

The alternatives proposed require much more invasive patches to Nova:

* Create a new VIF type:

  * This would add three VIF types for Contrail to maintain. This is not
    ideal.

* Add glance or flavor annotations:

  * This would force an Instance to have one type of acceleration. Code would
    possibly move out to more VIF types and Virtual Function reservation would
    still need to be updated.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

os-vif plugins run with elevated privileges.

Notifications impact
--------------------

None

Other end user impact
---------------------

End users will be able to plug VIFs into Instances with either ``normal``,
``direct`` or ``virtio-forwarder`` VNIC types on hardware enabled Nova nodes
running Contrail.

Performance Impact
------------------

This code is likely to be called at VIF plugging and unplugging. Performance
is not expected to regress.

On accelerated ports, dataplane performance between Instances is expected to
increase.

Other deployer impact
---------------------

A deployer would still need to configure the SmartNIC components of Contrail
and configure the PCI whitelist in Nova at deployment. This would not require
core OpenStack changes.

Developer impact
----------------

Core Nova semantics will be slightly changed. ``vrouter`` VIFs will support
more VNIC types.

Upgrade impact
--------------

New VNIC type semantics will be available on compute nodes with this patch.

A deployer would be mandated to install the os-vif plugin to retain existing
functionality in Nova. This is expected to be handled by minimum required
versions in Contrail.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Jan Gutter <jan.gutter@netronome.com>

Work Items
----------

* The contrail-controller review implementing the semantics has been merged
  and is awaiting a release tag:
  https://review.opencontrail.org/42850

* The OpenContrail os-vif reference plugin has been updated and is awaiting a
  release tag:
  https://review.opencontrail.org/43399

* Stage 1: os-vif porting for the vRouter VIF has been submitted:
  https://review.openstack.org/571325

* Stage 2: The `Generic os-vif datapath offloads spec`_ needs to be
  implemented.

* Stage 3: The OpenContrail os-vif reference plugin needs to be amended with
  the interfaces added to os-vif in Stage 2.

* Stage 3: The ``vrouter`` VNIC support needs to be added in Nova:
  https://review.openstack.org/572082

Dependencies
============

The following dependencies on Tungsten Fabric have been merged on the master
branch and are awaiting a release tag:

* The Contrail/Tungsten Fabric controller required minor updates to enable the
  proposed semantics. This was merged in:
  https://review.opencontrail.org/42850

* The os-vif reference plugin has been updated in:
  https://review.opencontrail.org/43399

The following items can occur in parallel:

* os-vif extensions for accelerated datapath plugin modes need to be released.
  Consult the `Generic os-vif datapath offloads spec`_ for more details. The
  os-vif library update is planned for the Stein release.

* Pending release tags on the Contrail os-vif plugin, the `vRouter os-vif
  conversion blueprint`_ can be completed. This is currently planned for the
  Tungsten Fabric 5.1 release.

Once both of the preceding tasks have been implemented, the following items
can occur in parallel:

* Nova can implement the VNIC support for the ``contrail`` os-vif plugin.

* The ``contrail`` os-vif plugin can be updated to use the new os-vif
  interfaces.

Testing
=======

* Unit tests have been refreshed and now cover the VIF operations more
  completely.

* Third-party CI testing will be necessary to validate the Contrail and
  Tungsten Fabric compatibility.


Documentation Impact
====================

Since this spec affects a non-reference Neutron plugin, a release note in Nova
should suffice. Specific versions of Contrail / Tungsten Fabric need to be
mentioned when a new plugin is required to provide existing functionality. The
external documentation to configure and use the new plugging modes should be
driven from the Contrail / Tungsten Fabric side.

References
==========

* `Juniper Contrail`_
* `OpenContrail`_
* `Tungsten Fabric`_
* `virtio-forwarder`_
* `vRouter packet processing pipeline`_
* `Offload hooks in kernel vRouter`_
* `Open vSwitch offloads`_
* `Generic os-vif datapath offloads spec`_
* `Contrail Agent`_
* `Contrail Controller`_
* `vRouter os-vif plugin`_
* `vRouter os-vif conversion blueprint`_
* `Contrail Controller to Neutron translation unit`_
* `Nova review implementing offloads for legacy plugging <https://review.openstack.org/567147>`_
  (this review serves as an example and has been obsoleted)

.. _`Juniper Contrail`: https://www.juniper.net/us/en/products-services/sdn/contrail/
.. _`OpenContrail`: http://www.opencontrail.org/
.. _`Tungsten Fabric`: https://tungsten.io/
.. _`virtio-forwarder`: http://virtio-forwarder.readthedocs.io/en/latest/
.. _`vRouter packet processing pipeline`: https://github.com/Juniper/contrail-vrouter
.. _`Offload hooks in kernel vRouter`: https://github.com/Juniper/contrail-vrouter/blob/R4.1/include/vr_offloads.h
.. _`Open vSwitch offloads`: https://docs.openstack.org/neutron/queens/admin/config-ovs-offload.html
.. _`Contrail Agent`: https://github.com/Juniper/contrail-controller/tree/R4.1/src/vnsw/agent
.. _`Contrail Controller`: https://github.com/Juniper/contrail-controller
.. _`vRouter os-vif plugin`: https://github.com/Juniper/contrail-nova-vif-driver/blob/master/vif_plug_vrouter/
.. _`Generic os-vif datapath offloads spec`: https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/generic-os-vif-offloads.html
.. _`vRouter os-vif conversion blueprint`: https://blueprints.launchpad.net/nova/+spec/vrouter-os-vif-conversion
.. _`Contrail Controller to Neutron translation unit`: https://github.com/Juniper/contrail-controller/blob/R4.1/src/config/api-server/vnc_cfg_types.py