Move Stein implemented specs

This moves the completed Stein specs to the implemented directory and
adds the redirects. This was done using:

  $ tox -e move-implemented-specs stein

This also removes the stein-template symlink from the docs.

And renames the handling-down-cell_new.rst spec to handling-down-cell.rst
to match the blueprint.

Change-Id: Id92bec8c5a2436a4053765f735d252c7c165f019
melanie witt
2019-03-22 16:48:34 +00:00
parent 22da2a8021
commit b7677c59ac
23 changed files with 19 additions and 9 deletions


@@ -0,0 +1,292 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=============================================
Filter Allocation Candidates by Provider Tree
=============================================
https://blueprints.launchpad.net/nova/+spec/alloc-candidates-in-tree
This blueprint proposes adding support for filtering allocation candidates
by provider tree.
Problem description
===================
Placement currently supports the ``in_tree`` query parameter for the
``GET /resource_providers`` endpoint. This parameter is a string representing
a resource provider uuid, and when it is present, the response is limited to
resource providers within the same tree as the provider indicated by the uuid.
See `Nested Resource Providers`_ spec for details.
However, ``GET /allocation_candidates`` doesn't support the ``in_tree`` query
parameter to filter the allocation candidates by resource tree. This results
in inefficient post-processing in some cases where the caller has already
selected the resource provider tree before calling that API.
Use Cases
---------
This feature is useful when the caller of ``GET /allocation_candidates``
has already picked the resource providers they want to use.
As described in `Bug#1777591`_, when an admin operator creates an instance
on a specific host, nova currently requests allocation candidates with no
limit so that placement does not filter out the pre-determined target resource
provider through the random limiting of results. (For the limit feature of the
API, see the `Limiting Allocation Candidates`_ spec.)
Instead of issuing that inefficient request to placement, we can pass the
pre-determined target host's resource provider uuid in the ``in_tree`` query
when calling the ``GET /allocation_candidates`` API.
The same approach solves the analogous problem for live migration to a
specified host and for rebuilding an instance on the same host.
Proposed change
===============
The ``GET /allocation_candidates`` call will accept a new query parameter
``in_tree``. This parameter is a string representing a resource provider uuid.
When this is present, the only resource providers returned will be those in
the same tree as the given resource provider.
The numbered syntax ``in_tree<N>`` is also supported. This restricts providers
satisfying the Nth granular request group to the tree of the specified
provider. This may be redundant with other ``in_tree<N>`` values specified in
other groups (including the unnumbered group). However, it can be useful in
cases where a specific resource (e.g. DISK_GB) needs to come from a specific
sharing provider (e.g. shared storage).
In the following environment,
.. code::
+-----------------------+ +-----------------------+
| sharing storage (ss1) | | sharing storage (ss2) |
| DISK_GB: 1000 | | DISK_GB: 1000 |
+-----------+-----------+ +-----------+-----------+
| |
+-----------------+----------------+
|
| Shared via an aggregate
+-----------------+----------------+
| |
+--------------|---------------+ +--------------|--------------+
| +------------+-------------+ | | +------------+------------+ |
| | compute node (cn1) | | | |compute node (cn2) | |
| | DISK_GB: 1000 | | | | DISK_GB: 1000 | |
| +-----+-------------+------+ | | +----+-------------+------+ |
| | nested | nested | | | nested | nested |
| +-----+------+ +----+------+ | | +----+------+ +----+------+ |
| | numa1_1 | | numa1_2 | | | | numa2_1 | | numa2_2 | |
| | VCPU: 4 | | VCPU: 4 | | | | VCPU: 4 | | VCPU: 4 | |
| +------------+ +-----------+ | | +-----------+ +-----------+ |
+------------------------------+ +-----------------------------+
for example::
GET /allocation_candidates?resources=VCPU:1,DISK_GB:50&in_tree={cn1_uuid}
will return 2 combinations of allocation candidates.
result A::
1. numa1_1 (VCPU) + cn1 (DISK_GB)
2. numa1_2 (VCPU) + cn1 (DISK_GB)
The specified tree can be a non-root provider::
GET /allocation_candidates?resources=VCPU:1,DISK_GB:50&in_tree={numa1_1_uuid}
will return the same result.
result B::
1. numa1_1 (VCPU) + cn1 (DISK_GB)
2. numa1_2 (VCPU) + cn1 (DISK_GB)
When you want to have ``VCPU`` from ``cn1`` and ``DISK_GB`` from wherever,
the request may look like::
GET /allocation_candidates?resources=VCPU:1&in_tree={cn1_uuid}
&resources1=DISK_GB:10
which will return the sharing providers as well.
result C::
1. numa1_1 (VCPU) + cn1 (DISK_GB)
2. numa1_2 (VCPU) + cn1 (DISK_GB)
3. numa1_1 (VCPU) + ss1 (DISK_GB)
4. numa1_2 (VCPU) + ss1 (DISK_GB)
5. numa1_1 (VCPU) + ss2 (DISK_GB)
6. numa1_2 (VCPU) + ss2 (DISK_GB)
When you want to have ``VCPU`` from wherever and ``DISK_GB`` from ``ss1``,
the request may look like::
GET: /allocation_candidates?resources=VCPU:1
&resources1=DISK_GB:10&in_tree1={ss1_uuid}
which will stick to the first sharing provider for DISK_GB.
result D::
1. numa1_1 (VCPU) + ss1 (DISK_GB)
2. numa1_2 (VCPU) + ss1 (DISK_GB)
3. numa2_1 (VCPU) + ss1 (DISK_GB)
4. numa2_2 (VCPU) + ss1 (DISK_GB)
When you want to have ``VCPU`` from ``cn1`` and ``DISK_GB`` from ``ss1``,
the request may look like::
GET: /allocation_candidates?resources1=VCPU:1&in_tree1={cn1_uuid}
&resources2=DISK_GB:10&in_tree2={ss1_uuid}
&group_policy=isolate
which will return only 2 candidates.
result E::
1. numa1_1 (VCPU) + ss1 (DISK_GB)
2. numa1_2 (VCPU) + ss1 (DISK_GB)
Alternatives
------------
Alternative 1:
We could relax the restriction to also include sharing providers, assuming
that they belong to the specified non-sharing tree that shares them. For
example, we could change result A to return::
1. numa1_1 (VCPU) + cn1 (DISK_GB)
2. numa1_2 (VCPU) + cn1 (DISK_GB)
3. numa1_1 (VCPU) + ss1 (DISK_GB)
4. numa1_2 (VCPU) + ss1 (DISK_GB)
5. numa1_1 (VCPU) + ss2 (DISK_GB)
6. numa1_2 (VCPU) + ss2 (DISK_GB)
This is possible if we assume that ``ss1`` and ``ss2`` are in "an expanded
concept of a tree" of ``cn1``, but we don't take this approach because the
same result can already be obtained with a granular request. Returning a
different result for a different request supports more use cases than
returning the same result for a different request.
Alternative 2:
In result B, we could exclude the ``numa1_2`` resource provider (the second
candidate), but we don't take this approach for the following reason: it would
not be consistent with the existing ``in_tree`` behavior in
``GET /resource_providers``, and the inconsistency despite the same query
parameter name could confuse users. If we need this behavior, it should be
something like a ``subtree`` query parameter, implemented symmetrically for
``GET /resource_providers`` as well. This is already proposed in the
`Support subtree filter for GET /resource_providers`_ spec.
Data model impact
-----------------
None.
REST API impact
---------------
A new microversion will be created to add the ``in_tree`` parameter to
the ``GET /allocation_candidates`` API.
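For illustration, a client written against the new microversion could issue
the query roughly as follows. This is a minimal sketch using the ``requests``
library; the endpoint URL, token, uuid and the exact microversion value are
placeholders, not defined by this spec.

.. code-block:: python

    import requests

    PLACEMENT_ENDPOINT = 'http://placement.example.com'  # placeholder
    HEADERS = {
        'X-Auth-Token': '<token>',  # placeholder
        # The actual microversion number is assigned at implementation time.
        'OpenStack-API-Version': 'placement 1.X',
    }

    # Restrict all returned candidates to the tree of the pre-selected host.
    params = {
        'resources': 'VCPU:1,DISK_GB:50',
        'in_tree': '<cn1_uuid>',  # placeholder
    }

    resp = requests.get(PLACEMENT_ENDPOINT + '/allocation_candidates',
                        headers=HEADERS, params=params)
    resp.raise_for_status()
    candidates = resp.json()['allocation_requests']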
Security impact
---------------
None.
Notifications impact
--------------------
None.
Other end user impact
---------------------
None.
Performance Impact
------------------
If the caller of ``GET /allocation_candidates`` has already picked the
resource providers they want to use, this new ``in_tree`` query improves
performance because placement does not need to fetch all the candidates from
the database.
Other deployer impact
---------------------
This feature enables nova to issue more efficient queries for the cases
described in the `Use Cases`_ section.
Developer impact
----------------
None.
Upgrade impact
--------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Tetsuro Nakamura (nakamura.tetsuro@lab.ntt.co.jp)
Work Items
----------
* Update the ``AllocationCandidates.get_by_requests`` method to change the
database queries to filter on the specified provider tree.
* Update the placement API handlers for ``GET /allocation_candidates`` in
a new microversion to pass the new ``in_tree`` parameter to the methods
changed in the steps above, including input validation adjustments.
* Add functional tests of the modified database queries.
* Add gabbi tests that express the new queries, both successful queries and
those that should cause a 400 response.
* Release note for the API change.
* Update the microversion documents to indicate the new version.
* Update placement-api-ref to show the new query handling.
Dependencies
============
None.
Testing
=======
Normal functional and unit testing.
Documentation Impact
====================
Document the REST API microversion in the appropriate reference docs.
References
==========
* `Nested Resource Providers`_ spec
* `Bug#1777591`_ reported in the launchpad
* `Limiting Allocation Candidates`_ spec
.. _`Nested Resource Providers`: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/nested-resource-providers.html
.. _`Bug#1777591`: https://bugs.launchpad.net/nova/+bug/1777591
.. _`Limiting Allocation Candidates`: https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/allocation-candidates-limit.html
.. _`Support subtree filter for GET /resource_providers`: https://review.openstack.org/#/c/595236/


@@ -0,0 +1,624 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
===================================
Network Bandwidth resource provider
===================================
https://blueprints.launchpad.net/nova/+spec/bandwidth-resource-provider
This spec proposes adding new resource classes representing network
bandwidth and modeling network backends as resource providers in
Placement, as well as adding scheduling support for the new resources in Nova.
Problem description
===================
Currently there is no method in the Nova scheduler to place a server
based on the network bandwidth available in a host. The Placement service
doesn't track the different network back-ends present in a host and their
available bandwidth.
Use Cases
---------
A user wants to spawn a server with a port associated with a specific physical
network. The user also wants a defined guaranteed minimum bandwidth for this
port. The Nova scheduler must select a host which satisfies this request.
Proposed change
===============
This spec proposes that Neutron model the bandwidth resources of the physical
NICs on a compute host as resource providers in the Placement service and
express the bandwidth request in the Neutron port, and that Nova be modified
to consider the requested bandwidth during scheduling of the server, based on
the available bandwidth resources on each compute host.
This also means that this spec proposes to use Placement and the nova-scheduler
to select which bandwidth-providing RP, and therefore which physical device,
will provide the bandwidth for a given Neutron port. Today the physical device
is selected during Neutron port binding, but after this spec is implemented
this selection will happen when an allocation candidate is selected for the
server in the nova-scheduler. Therefore Neutron needs to provide enough
information in the Networking RP model in Placement and in the resource_request
field of the port so that Nova can query Placement and receive allocation
candidates that do not conflict with the Neutron port binding logic.
The Networking RP model and the schema of the new resource_request port
attribute is described in `QoS minimum bandwidth allocation in Placement API`_
Neutron spec.
Please note that today Neutron port binding could fail if the nova-scheduler
selects a compute host where Neutron cannot bind the port. We are not aiming to
remove this limitation with this spec, but we also don't want to increase the
frequency of such port binding failures, as that would hurt the usability of
the
system.
Separation of responsibilities
------------------------------
* Nova creates the root RP of the compute node RP tree as today
* Neutron creates the networking RP tree of a compute node under the compute
node root RP and reports bandwidth inventories
* Neutron provides the resource_request of a port in the Neutron API
* Nova takes each port's resource_request and includes it in the GET
  /allocation_candidates request. Nova does not need to understand or
  manipulate the actual resource request, but it needs to assign a unique
  granular resource request group suffix to each port's resource request.
* Nova selects one allocation candidate and claims the resources in Placement.
* Nova passes the RP UUID used to fulfill the port resource request to Neutron
during port binding
Scoping
-------
Due to the size and complexity of this feature the scope of the current spec
is limited. To keep backward compatibility while the feature is not fully
implemented both new Neutron API extensions will be optional and turned off by
default. Nova will check for the extension that introduces the port's
resource_request field and fall back to the current resource handling behavior
if the extension is not loaded.
Out of scope from Nova perspective:
* Supporting separate proximity policy for the granular resource request groups
created from the Neutron port's resource_request. Nova will use the policy
defined in the flavor extra_spec for the whole request as today such policy
is global for an allocation_candidate request.
* Handling Neutron mechanism driver preference order in a weigher in the
nova-scheduler
* Interface attach with a port or network having a QoS minimum bandwidth
  policy rule, as interface_attach does not call the scheduler today. Nova
  will reject an interface_attach request if the resource request of the port
  (passed in, or created in the network that is passed in) is non-empty.
* Server create with a network having a QoS minimum bandwidth policy rule, as
  the port in this network is created by nova-compute *after* the scheduling
  decision. This spec proposes to fail such a boot in the compute manager.
* QoS policy rule create or update on bound port
* QoS aware trunk subport create under a bound parent port
* Baremetal port having a QoS bandwidth policy rule is out of scope as Neutron
does not own the networking devices on a baremetal compute node.
Scenarios
---------
This spec needs to consider multiple flows and scenarios detailed in the
following sections.
Neutron agent first start
~~~~~~~~~~~~~~~~~~~~~~~~~
The Neutron agent running on a given compute host uses the existing ``host``
neutron.conf variable to find the compute RP related to its host in Placement.
See `Finding the compute RP`_ for details and reasoning.
The Neutron agent creates the networking RPs under the compute RP with proper
traits then reports resource inventories based on the discovered and / or
configured resource inventory of the compute host. See
`QoS minimum bandwidth allocation in Placement API`_ for details.
Neutron agent restart
~~~~~~~~~~~~~~~~~~~~~
During restart the Neutron agent ensures that the proper RP tree exists in
Placement with correct inventories and traits by creating / updating the RP
tree if necessary. The Neutron agent only modifies the inventory and traits of
the RPs that were created by the agent. Also Neutron only modifies the pieces
that actually got added or deleted. Unmodified pieces should be left in place
(no delete and re-create).
Server create with pre-created Neutron ports having QoS policy rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The end user creates a Neutron port with the Neutron API and attaches a QoS
policy minimum bandwidth rule to it, either directly or indirectly by attaching
the rule to the network the port is created in. Then the end user creates a
server in Nova and passes in the port UUID in the server create request.
Nova fetches the port data from Neutron. This already happens in
create_pci_requests_for_sriov_ports in the current code base. The port contains
the requested resources and required traits. See
`Resource request in the port`_.
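For illustration only (the authoritative schema is defined in the
`QoS minimum bandwidth allocation in Placement API`_ Neutron spec), a port
with a minimum bandwidth rule might carry a ``resource_request`` along these
lines; the amounts and trait names below are made up:

.. code-block:: python

    # Illustrative resource_request as Nova might receive it from Neutron.
    resource_request = {
        'resources': {
            'NET_BW_IGR_KILOBIT_PER_SEC': 1000,
            'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
        },
        # Traits restricting which network RP may fulfill the request, e.g.
        # the physnet of the network and the vnic_type of the port.
        'required': ['CUSTOM_PHYSNET_PHYSNET0', 'CUSTOM_VNIC_TYPE_NORMAL'],
    }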
The create_pci_requests_for_sriov_ports() call needs to be refactored to a more
generic call that not just generates PCI requests but also collects the
requested resources from the Neutron ports.
The nova-api stores the requested resources and required traits in a new field
of the RequestSpec object called requested_resources. The new
`requested_resources` field should not be persisted in the api database as
it is computed from the resource requests of different sources (in this case
the Neutron ports), and the data in the port might change outside of Nova.
The nova-scheduler uses this information from the RequestSpec to send an
allocation candidate request to Placement that contains the port related
resource requests besides the compute related resource requests. The requested
resources and required traits from each port will be considered to be
restricted to a single RP with a separate, numbered request group as defined in
the `granular-resource-request`_ spec. This is necessary because mixing the
requested resources and required traits from different ports (i.e. one OVS and
one SRIOV) in a single placement request group would cause an empty allocation
candidate response, as no RP has both the OVS and SRIOV traits at the same
time.
Alternatively we could extend and use the requested_networks
(NetworkRequestList ovo) parameter of the build_instance code path to store and
communicate the resource needs coming from the Neutron ports. Then the
select_destinations() scheduler rpc call needs to be extended with a new
parameter holding the NetworkRequestList.
The `nova.scheduler.utils.resources_from_request_spec()` call needs to be
modified to use the newly introduced `requested_resources` field from the
RequestSpec object to generate the proper allocation candidate request.
Later on the resource request in the Neutron port API can be evolved to support
the same level of granularity that the Nova flavor resource override
functionality supports.
Then Placement returns allocation candidates. After additional filtering and
weighing in the nova-scheduler, the scheduler claims the resources in the
selected candidate in a single transaction in Placement. The consumer_id of the
created allocations is the instance_uuid. See `The consumer of the port related
resources`_.
When multiple ports, having QoS policy rules towards the same physical network,
are attached to the server (e.g. two VFs on the same PF) then the resulting
allocation is the sum of the resource amounts of each individual port request.
Delete a server with ports having QoS policy rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
During normal delete, `local delete`_ and shelve_offload, Nova today deletes
the resource allocation in placement where the consumer_id is the
instance_uuid. As this allocation includes the port related resources, those
are also cleaned up.
Detach_interface with a port having QoS policy rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After the detach succeeds in Neutron and in the hypervisor, the nova-compute
needs to delete the allocation related to the detached port in Placement. The
rest of the server's allocation will not be changed.
Server move operations (cold migrate, evacuate, resize, live migrate)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
During a move operation Nova makes an allocation on the destination host
with consumer_id == instance_uuid while the allocation on the source host is
changed to have consumer_id == migration_uuid. These allocation sets will
contain the port related allocations as well. When the move operation succeeds
Nova deletes the allocation towards the source host. If the move operation is
rolled back Nova cleans up the allocations towards the destination host.
As the port related resource request is not persisted in the RequestSpec object
Nova needs to re-calculate that from the ports' data before calling the
scheduler.
Move operations with force host flag (evacuate, live-migrate) do not call the
scheduler. So to make sure that every case is handled we have to go through
every direct or indirect call of the `reportclient.claim_resources()` function and
ensure that the port related resources are handled properly. Today we `blindly
copy the allocation from source host to destination host`_ by using the
destination host as the RP. This will be a lot more complex as there will be
more than one RP to replace and Nova will have a hard time figuring out
which Network RP on the source host maps to which Network RP on the
destination host. A possible solution is to `send the move requests through
the scheduler`_ regardless of the force flag but skipping the scheduler
filters.
.. note::
Server move operations with ports having resource request are not
supported in Stein.
Shelve_offload and unshelve
~~~~~~~~~~~~~~~~~~~~~~~~~~~
During shelve_offload Nova deletes the resource allocations including the port
related resources as those also have the same consumer_id, the instance uuid.
During unshelve a new scheduling is done in the same way as described in the
server create case.
.. note::
Unshelve after Shelve offload operations with ports having resource
request are not supported in Stein.
Details
-------
Finding the compute RP
~~~~~~~~~~~~~~~~~~~~~~
Neutron already depends on the ``host`` conf variable to be set to the same id
that Nova uses in the Neutron port binding. Nova uses the hostname in the port
binding. If the ``host`` is not defined in the Neutron config then it defaults
to the hostname as well. This way Neutron and Nova are in sync today. The same
mechanism (i.e. the hostname) can be used by the Neutron agent to find the compute
RP created by Nova for the same compute host.
Having non-fully-qualified hostnames in a deployment can cause ambiguity. For
example different cells might contain hosts with the same hostname. This
hostname ambiguity in a multicell deployment is already a problem without the
currently proposed feature as Nova uses the hostname as the compute RP name in
Placement and the name field has a unique constraint in the Placement db model.
So in an ambiguous situation, the Nova compute services with non-unique
hostnames have already failed to create RPs in Placement.
The ambiguity can be fixed by enforcing that hostnames are FQDNs. However, as
this problem is not specific to the currently proposed feature, the fix is out
of scope of this spec. The `override-compute-node-uuid`_ blueprint describes a
possible solution.
The consumer of the port related resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This spec proposes to use the instance_uuid as the consumer_id for the port
related resource as well.
During the server move operations Nova needs to handle two sets of allocations
for a single server (one for the source and one for the destination host). If
the consumer_id of the port related resources were the port_id then during
move operations the two sets of allocations could not be distinguished,
especially in the case of resize to the same host. Therefore the port_id is
not a good consumer_id.
Another possibility would be to use a UUID from the port binding as consumer_id
but the port binding does not have a UUID today. Also today the port binding
is created after the allocations are made which is too late.
In both cases having multiple allocations for a single server on a single host
would make it complex to find every allocation for that server both for Nova
and for the deployer using the Placement API.
Separating non QoS aware and QoS aware ports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If QoS aware and non QoS aware ports are mixed on the same physical port then
the minimum bandwidth rule cannot be fulfilled. The separation can be achieved
on at least two levels:
* Separating compute hosts via host aggregates. The deployer can create two
host aggregates in Nova, one for QoS aware server and another for non QoS
aware servers. This separation can be done without changing either Nova or
Neutron. This is the proposed solution for the first version of this feature.
* Separating physical ports via traits. The Neutron agent can put traits, like
  `CUSTOM_GUARANTEED_BW_ONLY` and `CUSTOM_BEST_EFFORT_BW_ONLY`, on the network
  RPs to indicate which physical port belongs to which group. Neutron can
  offer this configurability via neutron.conf. Then Neutron can add the
  `CUSTOM_GUARANTEED_BW_ONLY` trait to the resource request of a port that is
  QoS aware and the `CUSTOM_BEST_EFFORT_BW_ONLY` trait otherwise. This solution
would allow better granularity as a server can request guaranteed bandwidth
on its data port and can accept best effort connectivity on its control port.
This solution needs additional work in Neutron but no additional work in
Nova. Also this would mean that ports without QoS policy rules would also
have at least a trait request (`CUSTOM_BEST_EFFORT_BW_ONLY`) and it would
cause scheduling problems with a port created by the nova-compute.
Therefore this option can only be supported
`after nova port create is moved to the conductor`_.
* If we use \*_ONLY traits then we can never combine them, though that would be
desirable. For example it makes perfect sense to guarantee 5 gigabits of a
10 gigabit card to somebody and let the rest to be used on a best effort
basis. To allow this we only need to turn the logic around and use traits
CUSTOM_GUARANTEED_BW and CUSTOM_BEST_EFFORT_BW. If the admin still wants to
keep guaranteed and best effort traffic fully separated then s/he never puts
both traits on the same RP. But one can mix them if one wants to. Even the
possible starvation of best effort traffic (next to guaranteed traffic) could
be easily addressed by reserving some of the bandwidth inventory.
Alternatives
------------
Alternatives are discussed in their respective sub chapters in this spec.
Data model impact
-----------------
Two new standard Resource Classes will be defined to represent the bandwidth in
each direction, named `NET_BW_IGR_KILOBIT_PER_SEC` and
`NET_BW_EGR_KILOBIT_PER_SEC`. The kbps unit is selected because the
Neutron API already uses this unit in the `QoS minimum bandwidth rule`_ API and
we would like to keep the units in sync.
A new `requested_resources` field is added to the RequestSpec versioned
object with ListOfObjectField('RequestGroup') type to store the resource and
trait requests coming from the Neutron ports. This field will not be persisted
in the database.
A new field ``requester_id`` is added to the InstancePCIRequest versioned
object to connect the PCI request created from a Neutron port to the resource
request created from the same Neutron port. Nova will populate this field with
the ``port_id`` of the Neutron port and the ``requester_id`` field of the
RequestGroup versioned object will be used to match the PCI request with the
resource request.
The `QoS minimum bandwidth allocation in Placement API`_ Neutron spec will
propose the modeling of the Networking RP subtree in Placement. Nova will
not depend on the exact structure of such model as Neutron will provide the
port's resource request in an opaque way and Nova will only need to blindly
include that resource request to the ``GET allocation_candidates`` request.
Resource request in the port
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Neutron needs to express the port's resource needs in the port API in a similar
way to how resources can be requested via the flavor extra_spec. For now we assume
that a single port requests resources from a single RP. Therefore Nova will map
each port's resource request to a single numbered resource request group as
defined in `granular-resource-request`_ spec. That spec requires that the name
of the numbered resource groups has the form `resources<integer>`. Nova will
map a port's resource_request to the first unused numbered group in the
allocation_candidate request. Neutron does not know which ports are used
together in a server create request, and which numbered groups have already
been used by the flavor extra_spec; therefore Neutron cannot assign unique
integer ids to the resource groups in these ports.
From an implementation perspective this means Nova will create one RequestGroup
instance for each Neutron port based on the port's resource_request and insert
it to the end of the list in `RequestSpec.requested_resources`.
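As a rough sketch of that mapping at the query-string level (the names below
are illustrative; the real code works on RequestGroup objects as described
above), each port's request is appended as the next unused numbered group of
the ``GET /allocation_candidates`` query:

.. code-block:: python

    def add_port_request_groups(port_resource_requests, query_params,
                                next_suffix):
        """Append each port's resource_request as its own numbered group.

        ``port_resource_requests`` is a list of dicts as found in the ports'
        ``resource_request`` field; ``query_params`` already holds the groups
        built from the flavor extra_specs.
        """
        for request in port_resource_requests:
            suffix = str(next_suffix)
            query_params['resources' + suffix] = ','.join(
                '%s:%d' % (rc, amount)
                for rc, amount in sorted(
                    request.get('resources', {}).items()))
            if request.get('required'):
                query_params['required' + suffix] = ','.join(
                    request['required'])
            next_suffix += 1
        return next_suffix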
When the Neutron multi-provider extension is used and a logical network
maps to more than one physnet, the port's resource request will require
that the selected network RP has one of the physnet traits the network maps to.
This any-traits type of request is not supported by Placement today but can be
implemented similarly to member_of query param used for aggregate selection in
Placement. This will be proposed in a separate spec
`any-traits-in-allocation_candidates-query`_.
Mapping between physical resource consumption and claimed resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Neutron must ensure that the resources allocated in Placement for a port are
the same as the resources consumed by that port from the physical
infrastructure. To be able to do that Neutron needs to know the mapping between
a port's resource request and a specific RP (or RPs) in the allocation record
of the server that are fulfilling the request.
Nova will calculate which port is fulfilled by which RP and the RP UUID will be
provided to Neutron during the port binding.
REST API impact
---------------
Neutron REST API impact is discussed in the separate
`QoS minimum bandwidth allocation in Placement API`_ Neutron spec.
The Placement REST API needs to be extended to support querying allocation
candidates with an RP that has at least one of the traits from a list
of requested traits. This feature will be described in the separate
`any-traits-in-allocation_candidates-query`_ spec.
This feature also depends on the `granular-resource-request`_ and
`nested-resource-providers`_ features which impact the Placement REST API.
A new microversion will be added to the Nova REST API to indicate that server
create supports ports with resource request. Server operations
(e.g. create, interface_attach, move) involving ports having resource request
will be rejected with older microversions. However, server delete and port
detach will be supported with older microversions for these servers too.
.. note::
Server move operations are not supported in Stein even with the new
microversion.
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
* Placement API will be used from Neutron to create RPs and the compute RP tree
will grow in size.
* Nova will send more complex allocation candidate requests to Placement as it
will include the port related resource request as well.
* Nova will calculate the mapping between each port's resource request and the
RP in the overall allocation that fulfills such request.
As Placement does not seem to be a bottleneck today, we do not foresee
performance degradation due to the above changes.
Other deployer impact
---------------------
This feature impacts multiple modules and creates new dependencies between
Nova, Neutron and Placement.
Also the deployer should be aware that after this feature the server create and
move operations could fail due to bandwidth limits managed by Neutron.
Developer impact
----------------
None
Upgrade impact
--------------
Servers could exist today with SRIOV ports having QoS minimum bandwidth policy
rule and for them the resource allocation is not enforced in Placement during
scheduling. Upgrading to an OpenStack version that implements this feature
will make it possible to change the rule in Neutron to be placement aware (i.e.
request resources) and then (live) migrate the servers; during the selection
of the migration target the minimum bandwidth rule will be enforced by the
scheduler. Tools can also be provided to search for existing instances and try
to do the minimum bandwidth allocation in place. This way the number of
necessary migrations can be limited.
The end user will see the following behavior changes in the Nova API after such an upgrade:
* Booting a server with a network that has QoS minimum bandwidth policy rule
requesting bandwidth resources will fail. The current Neutron feature
proposal introduces the possibility of a QoS policy rule to request
resources but in the first iteration Nova will only support such rule on
a pre-created port.
* Attaching a port or a network having QoS minimum bandwidth policy rule
requesting bandwidth resources to a running server will fail. The current
Neutron feature proposal introduces the possibility of a QoS policy rule to
request resources but in the first iteration Nova will not support
such rule for interface_attach.
The new QoS rule API extension and the new port API extension in Neutron will
be marked experimental until the above two limitations are resolved.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
* balazs-gibizer (Balazs Gibizer)
Other contributors:
* xuhj (Alex Xu)
* minsel (Miguel Lavalle)
* bence-romsics (Bence Romsics)
* lajos-katona (Lajos Katona)
Work Items
----------
This spec does not list work items for the Neutron impact.
* Make RequestGroup an ovo and add the new `requested_resources` field to the
RequestSpec. Then refactor the `resources_from_request_spec()` to use the
new field.
* Implement `any-traits-in-allocation_candidates-query`_ and
`mixing-required-traits-with-any-traits`_ support in Placement.
This work can be done in parallel with the below work items as any-traits
type of query only needed for a small subset of the use cases.
* Read the resource_request from the Neutron port in the nova-api and store
the requests in the RequestSpec object.
* Include the port related resources in the allocation candidate request in
nova-scheduler and nova-conductor and claim port related resources based
on a selected candidate.
* Send the server's whole allocation to Neutron during port binding
* Ensure that server move operations with force flag handles port resource
correctly by sending such operations through the scheduler.
* Delete the port related allocations from Placement after successful interface
detach operation
* Reject an interface_attach request that contains a port or a network having
a QoS policy rule attached that requests resources.
* Check in nova-compute that a port created by nova-compute during server
  boot has a non-empty resource_request in the Neutron API and fail the boot
  if it does
Dependencies
============
* `any-traits-in-allocation_candidates-query`_ and
`mixing-required-traits-with-any-traits`_ to support multi-provider
networks. While these placement enhancements are not in place this feature
will only support networks with a single network segment having a physnet
defined.
* `nested-resource-providers`_ to allow modelling the networking RPs
* `granular-resource-request`_ to allow requesting each port related resource
from a single RP
* `QoS minimum bandwidth allocation in Placement API`_ for the Neutron impacts
Testing
=======
Tempest tests as well as functional tests will be added to ensure that server
create operation, server move operations, shelve_offload and unshelve and
interface detach work with QoS aware ports and the resource allocation is
correct.
Documentation Impact
====================
* User documentation about how to use the QoS aware ports.
References
==========
* `nested-resource-providers`_ feature in Nova
* `granular-resource-request`_ feature in Nova
* `QoS minimum bandwidth allocation in Placement API`_ feature in Neutron
* `override-compute-node-uuid`_ proposal to avoid hostname ambiguity
.. _`nested-resource-providers`: https://review.openstack.org/556873
.. _`granular-resource-request`: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/granular-resource-requests.html
.. _`QoS minimum bandwidth allocation in Placement API`: https://review.openstack.org/#/c/508149
.. _`override-compute-node-uuid`: https://blueprints.launchpad.net/nova/+spec/override-compute-node-uuid
.. _`vnic_types are defined in the Neutron API`: https://developer.openstack.org/api-ref/network/v2/#show-port-details
.. _`blindly copy the allocation from source host to destination host`: https://github.com/openstack/nova/blob/9273b082026080122d104762ec04591c69f75a44/nova/scheduler/utils.py#L372
.. _`QoS minimum bandwidth rule`: https://docs.openstack.org/neutron/latest/admin/config-qos.html
.. _`any-traits-in-allocation_candidates-query`: https://blueprints.launchpad.net/nova/+spec/any-traits-in-allocation-candidates-query
.. _`mixing-required-traits-with-any-traits`: https://blueprints.launchpad.net/nova/+spec/mixing-required-traits-with-any-traits
.. _`local delete`: https://github.com/openstack/nova/blob/4b0d0ea9f18139d58103a520a6a4e9119e19a4de/nova/compute/api.py#L2023
.. _`send the move requests through the scheduler`: https://github.com/openstack/nova/blob/9273b082026080122d104762ec04591c69f75a44/nova/scheduler/utils.py#L339
.. _`after nova port create is moved to the conductor`: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/prep-for-network-aware-scheduling-pike.html
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Queens
- Introduced
* - Rocky
- Reworked after several discussions
* - Stein
- * Re-proposed as implementation hasn't been finished in Rocky
* Updated based on what was implemented in Stein


@@ -0,0 +1,197 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
======================================
Boot instance specific storage backend
======================================
https://blueprints.launchpad.net/nova/+spec/boot-instance-specific-storage-backend
This blueprint proposes adding support for specifying ``volume_type`` when
booting instances.
Problem description
===================
Currently, when creating a new boot-from-volume instance, the user can only
control the type of the volume by pre-creating a bootable image-backed volume
with the desired type in cinder and providing it to nova during the boot
process. This is not friendly to users who want to boot an instance on a
specific storage backend when there are multiple storage backends in the
environment.
Use Cases
---------
As a user, I would like to specify a volume type for my instances when I boot
them, especially when I am doing a bulk boot. "Bulk boot" here means creating
multiple servers in separate requests but at the same time.
With this feature, if the user wants to boot an instance on a different
storage backend, they only need to specify a different volume type and send
the create request again.
Proposed change
===============
Add a new microversion to the servers create API to support specifying volume
type when booting instances.
This would only apply to BDMs with ``source_type`` of blank, image and
snapshot. The ``volume_type`` will be passed from ``nova-api`` through to
``nova-compute`` (via the BlockDeviceMapping object) where the volume gets
created and then attached to the new server.
The ``nova-api`` service will validate that the requested ``volume_type``
actually exists in cinder so we can fail fast if it does not or the user does
not have access to it.
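A minimal sketch of that fail-fast check in ``nova-api`` follows; the Cinder
client call and the exception are stand-ins for whatever the implementation
actually uses:

.. code-block:: python

    class VolumeTypeNotFound(Exception):
        """Stand-in for the error the API would translate into a 400."""


    def validate_volume_type(cinder_client, requested_type):
        # Hypothetical lookup: fetch the volume types visible to the
        # requesting user and accept the requested type by name or id.
        types = cinder_client.list_volume_types()
        known = {vt['name'] for vt in types} | {vt['id'] for vt in types}
        if requested_type not in known:
            raise VolumeTypeNotFound(
                'Volume type %s could not be found' % requested_type)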
Alternatives
------------
You can also combine cinder and nova to do this.
* Create the volume with the non-default type in cinder and then provide the
volume to nova when creating the server.
Data model impact
-----------------
We'll have to store the ``volume_type`` on the BlockDeviceMapping object
(and block_device_mapping table in the DB).
REST API impact
---------------
* URL:
* /v2.1/servers:
* Request method:
* POST
The volume_type data can be added to the request payload::
{
"server" : {
"name" : "device-tagging-server",
"flavorRef" : "http://openstack.example.com/flavors/1",
"networks" : [{
"uuid" : "ff608d40-75e9-48cb-b745-77bb55b5eaf2",
"tag": "nic1"
}],
"block_device_mapping_v2": [{
"uuid": "70a599e0-31e7-49b7-b260-868f441e862b",
"source_type": "image",
"destination_type": "volume",
"boot_index": 0,
"volume_size": "1",
"tag": "disk1",
"volume_type": "lvm_volume_type"
}]
}
}
Security impact
---------------
None
Notifications impact
--------------------
Add a ``volume_type`` field to the ``BlockDevicePayload`` object.
Other end user impact
---------------------
The python-novaclient and python-openstackclient will be updated.
When we snapshot a volume-backed server, the block_device_mapping_v2 image
metadata will include the volume_type from the BDM record so if the user then
creates another server from that snapshot, the volume that nova creates from
that snapshot will use the same volume_type. If a user wishes to change that
volume type in the image metadata, they can do so via the image API. For more
information on image-defined BDMs, see [1]_ and [2]_.
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
To support rolling upgrades, the API will have to determine if the minimum
``nova-compute`` service version across the deployment (all cells) is
high enough to support user-specified volume types. If ``volume_type`` is
specified but the deployment is not new enough to handle it, a 409 error will
be returned to the user.
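A rough sketch of that check, reusing nova's existing minimum-service-version
lookup across cells; the version constant and the exception are placeholders
decided at implementation time:

.. code-block:: python

    from nova.objects import service as service_obj


    class VolumeTypeSupportNotYetAvailable(Exception):
        """Stand-in; nova would define an exception mapped to HTTP 409."""


    # Placeholder: the compute service version that adds volume_type support.
    MIN_COMPUTE_VOLUME_TYPE = 99


    def ensure_volume_type_supported(context):
        min_version = service_obj.get_minimum_version_all_cells(
            context, ['nova-compute'])
        if min_version < MIN_COMPUTE_VOLUME_TYPE:
            raise VolumeTypeSupportNotYetAvailable()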
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Brin Zhang
Work Items
----------
* Add ``volume_type`` support in compute API
* Add related tests
Dependencies
============
None
Testing
=======
* Add Tempest integration tests for scenarios like:
* Boot from volume with a non-default volume type.
* Snapshot a volume-backed instance and assert the non-default volume
type is stored in the image snapshot metadata.
* Add related unit test for negative scenarios like:
* Attempting to boot from volume with a specific volume type before the
new microversion.
* Attempting to boot from volume with a volume type that does not exist
and/or the user does not have access to.
* Attempting to boot from volume with a volume type with old computes that
do not yet support volume type.
* Add related functional tests for positive scenarios
* The functional API samples tests will cover the positive scenario for
boot from volume with a specific volume type and all computes in all
cells are running the latest code.
Documentation Impact
====================
Add docs that mention the volume type can be specified when booting instances
with the new microversion.
References
==========
For a discussion of this feature, please refer to:
* https://etherpad.openstack.org/p/nova-ptg-stein
Stein PTG etherpad, discussion on or around line 496.
* http://lists.openstack.org/pipermail/openstack-dev/2018-July/132052.html
Matt Riedemann's recap email to the dev list on Stein PTG, about halfway
down.
.. [1] https://docs.openstack.org/nova/latest/user/block-device-mapping.html
.. [2] https://github.com/openstack/tempest/blob/3674fb138/tempest/scenario/test_volume_boot_pattern.py#L210
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced


@@ -0,0 +1,194 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=============================================
Configure maximum number of volumes to attach
=============================================
https://blueprints.launchpad.net/nova/+spec/conf-max-attach-volumes
Currently, there is a limitation in the libvirt driver restricting the maximum
number of volumes to attach to a single instance to 26. Depending on virt
driver and operator environment, operators would like to be able to attach
more than 26 volumes to a single instance. We propose adding a configuration
option that operators can use to select the maximum number of volumes allowed
to attach to a single instance.
Problem description
===================
We've had customers ask for the ability to attach more than 26 volumes to a
single instance and we've seen launchpad bugs opened from users trying to
attach more than 26 volumes (see `References`_). Because the supportability of
any number of volumes depends heavily on which virt driver is being used and
the operator's particular environment, we propose to make the maximum
configurable by operators. Choosing an appropriate maximum number will require
tuning with the specific virt driver and deployed environment, so we expect
operators to set the maximum, test, tune, and adjust the configuration option
until the maximum is working well in their environment.
Use Cases
---------
* Operators wish to be able to attach a maximum number of volumes to a single
instance, with the ability to choose a maximum well-tuned for their
environments.
Proposed change
===============
When a user attempts to attach more than 26 volumes with the libvirt driver,
the attach fails in the ``reserve_block_device_name`` method in nova-compute,
which is eventually called by the ``attach_volume`` method in nova-api. The
``reserve_block_device_name`` method calls
``self.driver.get_device_name_for_instance`` to get the next available device
name for attaching the volume. If the driver has implemented the method, this
is where an attempt to go beyond the maximum allowed number of volumes to
attach, will fail. The libvirt driver fails after 26 volumes have been
attached. Drivers that have not implemented ``get_device_name_for_instance``
appear to have no limit on the maximum number of volumes. The default
implementation of ``get_device_name_for_instance`` is located in the
``nova.compute.utils`` module. Only the libvirt driver has provided its own
implementation of ``get_device_name_for_instance``.
The ``reserve_block_device_name`` method is a synchronous RPC call (not cast).
This means we can have the configured allowed maximum set differently per
nova-compute and still fail fast in the API if the maximum has been exceeded.
We propose to add a new integer configuration option,
``[compute]max_volumes_to_attach``, to configure the maximum number of volumes
allowed to attach to a single instance per nova-compute. This way, operators
can set it appropriately
depending on what virt driver they are running and what their deployed
environment is like. The default will be unlimited (-1) to keep the current
behavior for all drivers except the libvirt driver.
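A sketch of how the option might be registered with ``oslo.config``; the help
text is illustrative, while the option name and the ``-1`` default come from
this spec:

.. code-block:: python

    from oslo_config import cfg

    compute_group = cfg.OptGroup('compute')

    max_volumes_to_attach = cfg.IntOpt(
        'max_volumes_to_attach',
        default=-1,
        min=-1,
        help='Maximum number of volumes allowed to attach to a single '
             'instance on this compute host. -1 means unlimited.')


    def register_opts(conf):
        conf.register_group(compute_group)
        conf.register_opt(max_volumes_to_attach, group=compute_group)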
The configuration option will be enforced in the
``get_device_name_for_instance`` methods, using the count of the number of
already attached volumes. Upon failure, an exception will be propagated to
nova-api via the synchronous RPC call to nova-compute, and the user will
receive a 403 error (as opposed to the current 500 error).
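An illustrative version of that enforcement inside
``get_device_name_for_instance``; the exception class below is a stand-in for
whatever nova defines, and nova-api would translate it into a 403:

.. code-block:: python

    class TooManyVolumes(Exception):
        """Stand-in for the exception nova-api translates into a 403."""


    def check_volume_limit(attached_bdms, max_volumes):
        # ``attached_bdms`` are the volume BDMs already attached to the
        # instance; ``max_volumes`` is CONF.compute.max_volumes_to_attach.
        if max_volumes != -1 and len(attached_bdms) >= max_volumes:
            # Raised before a device name is generated; the error travels
            # back over the synchronous reserve_block_device_name RPC call.
            raise TooManyVolumes(
                'Cannot attach more than %d volumes' % max_volumes)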
Alternatives
------------
Other ways we could solve this include: choosing a new hard-coded maximum only
for the libvirt driver or creating a new quota limit for "maximum volumes
allowed to attach" (see the ML thread in `References`_).
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
Deployers will be able to set the ``[compute]max_volumes_to_attach``
configuration option to control how many volumes are allowed to be attached
to a single instance per nova-compute in their deployment.
Developer impact
----------------
None
Upgrade impact
--------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
melwitt
Other contributors:
yukari-papa
Work Items
----------
* Add a new configuration option ``[compute]max_volumes_to_attach``, IntOpt
* Modify (or remove) the libvirt driver's implementation of the
``get_device_name_for_instance`` method to accommodate more than 26 volumes
* Add enforcement of ``[compute]max_volumes_to_attach`` to the
``get_device_name_for_instance`` methods
* Add handling of the raised exception in the API to translate to a 403 to the
user, if the maximum number of allowed volumes is exceeded
Dependencies
============
None
Testing
=======
The new functionality will be tested by new unit and functional tests.
Documentation Impact
====================
The documentation for the new configuration option will be automatically
included in generated documentation of the configuration reference.
References
==========
* https://bugs.launchpad.net/nova/+bug/1770527
* https://bugs.launchpad.net/nova/+bug/1773941
* http://lists.openstack.org/pipermail/openstack-dev/2018-June/131289.html
History
=======
Optional section intended to be used each time the spec is updated to describe
new designs, API changes or database schema updates. It is useful to let the
reader understand what has happened over time.
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced


@@ -0,0 +1,350 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
======================================
Expose virtual device tags in REST API
======================================
https://blueprints.launchpad.net/nova/+spec/expose-virtual-device-tags-in-rest-api
The 2.32 microversion allows creating a server with tagged block devices
and virtual interfaces (ports) and the 2.49 microversion allows specifying
a tag when attaching volumes and ports, but there is no way to get those
tags out of the REST API. This spec proposes to expose the block device
mapping and virtual interface tags in the REST API when listing volume
attachments and ports for a given server.
Problem description
===================
It is possible to attach volumes and ports to a server with tags but there
is nothing in the REST API that allows a user to read those back. The virtual
device tags are available *within* the guest VM via the config drive or
metadata API service, but not the front-end compute REST API.
Furthermore, correlating volume attachments to BlockDeviceMapping objects
via tag has come up in the `Remove device from volume attach requests`_ spec
as a way to replace reliance on the otherwise unused ``device`` field to
determine ordering of block devices within the guest.
Using volume attachment tags was also an option discussed in the
`Detach and attach boot volumes`_ spec as a way to indicate which volume
was the root volume attached to the server without relying on the
server ``OS-EXT-SRV-ATTR:root_device_name`` field.
.. _Remove device from volume attach requests: https://review.openstack.org/452546/
.. _Detach and attach boot volumes: https://review.openstack.org/600628/
Use Cases
---------
As a user, I want to correlate information, based on tags, to the volumes and
ports attached to my server.
Proposed change
===============
In a new microversion, expose virtual device tags in the REST API response
when showing volume attachments and attached ports.
See the `REST API impact`_ section for details on route and response changes.
**Technical / performance considerations**
When showing attached volume tags, there would really be no additional effort
in exposing the tag since we already query the database for a
BlockDeviceMappingList per instance.
However, the ``os-interface`` port tags present a different challenge. The
``GET /servers/{server_id}/os-interface`` and
``GET /servers/{server_id}/os-interface/{port_id}`` APIs are today simply
proxies to the neutron networking APIs to list ports attached to an instance
and show details about a port attached to an instance, respectively.
The device tag for a port attached to an instance is not stored in neutron,
it is stored in the nova cell database ``virtual_interfaces`` table. So the
problem we have is one of performance when listing ports attached to a server
and we want to show tags because we would have to query both the neutron API
to list ports and then the ``virtual_interfaces`` table for the instance to
get the tags. We have two options:
1. Accept that when listing ports for a single instance, doing one more DB
query to get the tags is not that big of an issue.
2. Rather than proxy the calls to neutron, we could rely on the instance
network info cache to provide the details. That might be OK except we
currently do not store the tags in the info cache, so for any existing
instance the tags would not be shown, unless we did a DB query to look
for them and heal the cache.
Given the complications with option 2, this spec will just use option 1.
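A sketch of option 1 (names illustrative): after listing the ports from
Neutron, look up the tags from the instance's ``virtual_interfaces`` records,
keyed by port uuid, and merge them into the response:

.. code-block:: python

    def merge_port_tags(neutron_ports, virtual_interfaces):
        """Add a ``tag`` key to each attached-port dict in the response.

        ``neutron_ports`` is the list of port dicts returned by Neutron for
        the server; ``virtual_interfaces`` are the records loaded from the
        cell database for the same instance (the one extra DB query).
        """
        tags_by_port_uuid = {vif.uuid: vif.tag for vif in virtual_interfaces}
        for port in neutron_ports:
            # Ports without a matching record or tag get None.
            port['tag'] = tags_by_port_uuid.get(port['id'])
        return neutron_ports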
**Non-volume BDMs**
It should be noted that swap and ephemeral block devices can also have
tags when the server is created; however, there is nothing in the API
today which exposes those types of BDMs; the API only exposes volume BDMs.
As such, this spec does not intend to expose non-volume block device mapping
tags. It is possible that in the future if a kind of
``GET /servers/{server_id}/disks`` API were added we could expose swap and
ephemeral block devices along with their tags, but that is out of scope
for this spec.
Alternatives
------------
In addition to showing the tags in the ``os-volume_attachments`` and
``os-interface`` APIs, we could also modify the server show/list view builder
to provide tags in the server resource ``os-extended-volumes:volumes_attached``
and ``addresses`` fields. This would be trivial to do for showing attached
volume tags since we already query the DB per instance to get the BDMs, but as
noted in the `Proposed change`_ section, it would be non-trivial for port tags
since those are not currently stored in the instance network info cache which
is used to build the ``addresses`` field response value. And it would be odd
to show the attached volume tags in the server response but not the virtual
interface tags. We could heal the network info cache over time, but that
seems unnecessarily complicated when the proposed change already provides a
way to get the tag information for all volumes/ports attached to a given
server resource.
We could also take this opportunity to expose other fields on the
BlockDeviceMapping which are inputs when creating a server, like
``boot_index``, ``volume_type``, ``source_type``, ``destination_type``,
``guest_format``, etc. For simplicity, that is omitted from the proposed
change since it's simpler to just focus on tag exposure for multiple types
of resources.
Data model impact
-----------------
None.
REST API impact
---------------
There are two API resource routes which would be changed. In all cases,
if the block device mapping or virtual interface record does not have a tag
specified, the response value for the ``tag`` key will be ``None``.
os-volume_attachments
~~~~~~~~~~~~~~~~~~~~~
A ``tag`` field will be added to the response for each of the following APIs.
* ``GET /servers/{server_id}/os-volume_attachments (list)``
.. code-block:: json
{
"volumeAttachments": [{
"device": "/dev/sdd",
"id": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
"serverId": "4d8c3732-a248-40ed-bebc-539a6ffd25c0",
"volumeId": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
"tag": "os"
}]
}
* ``GET /servers/{server_id}/os-volume_attachments/{volume_id} (show)``
.. code-block:: json
{
"volumeAttachment": {
"device": "/dev/sdd",
"id": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
"serverId": "2390fb4d-1693-45d7-b309-e29c4af16538",
"volumeId": "a26887c6-c47b-4654-abb5-dfadf7d3f803",
"tag": "os"
}
}
* ``POST /servers/{server_id}/os-volume_attachments (attach)``
.. code-block:: json
{
"volumeAttachment": {
"device": "/dev/vdb",
"id": "c996dd74-44a0-4fd1-a582-a14a4007cc94",
"serverId": "2390fb4d-1693-45d7-b309-e29c4af16538",
"volumeId": "c996dd74-44a0-4fd1-a582-a14a4007cc94",
"tag": "data"
}
}
os-interface
~~~~~~~~~~~~
A ``tag`` field will be added to the response for each of the following APIs.
* ``GET /servers/{server_id}/os-interface (list)``
.. code-block:: json
{
"interfaceAttachments": [{
"fixed_ips": [{
"ip_address": "192.168.1.3",
"subnet_id": "f8a6e8f8-c2ec-497c-9f23-da9616de54ef"
}],
"mac_addr": "fa:16:3e:4c:2c:30",
"net_id": "3cb9bc59-5699-4588-a4b1-b87f96708bc6",
"port_id": "ce531f90-199f-48c0-816c-13e38010b442",
"port_state": "ACTIVE",
"tag": "public"
}]
}
* ``GET /servers/{server_id}/os-interface/{port_id} (show)``
.. code-block:: json
{
"interfaceAttachment": {
"fixed_ips": [{
"ip_address": "192.168.1.3",
"subnet_id": "f8a6e8f8-c2ec-497c-9f23-da9616de54ef"
}],
"mac_addr": "fa:16:3e:4c:2c:30",
"net_id": "3cb9bc59-5699-4588-a4b1-b87f96708bc6",
"port_id": "ce531f90-199f-48c0-816c-13e38010b442",
"port_state": "ACTIVE",
"tag": "public"
}
}
* ``POST /servers/{server_id}/os-interface (attach)``
.. code-block:: json
{
"interfaceAttachment": {
"fixed_ips": [{
"ip_address": "192.168.1.4",
"subnet_id": "f8a6e8f8-c2ec-497c-9f23-da9616de54ef"
}],
"mac_addr": "fa:16:3e:4c:2c:31",
"net_id": "3cb9bc59-5699-4588-a4b1-b87f96708bc6",
"port_id": "ce531f90-199f-48c0-816c-13e38010b443",
"port_state": "ACTIVE",
"tag": "management"
}
}
Security impact
---------------
None.
Notifications impact
--------------------
The ``BlockDevicePayload`` object already exposes BDM tags for versioned
notifications. The ``IpPayload`` object does not expose tags since they are
not in the instance network info cache, but these payloads are only exposed
via the ``InstancePayload``, and, as with the ``servers`` API, we will not
make additional changes to show the tags for resources nested within the
server (``InstancePayload``) body. This could be done in the future if
desired, potentially with a configuration option like
``[notifications]/bdms_in_notifications``, but it is out of scope for this
spec.
Other end user impact
---------------------
python-novaclient and python-openstackclient will be updated as necessary
to support the new microversion. This likely just means adding a new ``Tag``
column in CLI output when listing attached volumes and ports.
Performance Impact
------------------
There will be a new database query to the ``virtual_interfaces`` table when
showing device tags for ports attached to a server. This should have a minimal
impact on API response times, though.
Other deployer impact
---------------------
None.
Developer impact
----------------
None.
Upgrade impact
--------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Matt Riedemann <mriedem.os@gmail.com> (mriedem)
Work Items
----------
Implement a new microversion and use that to determine if a new ``tag``
field should be in the ``os-volume_attachments`` and ``os-interface`` API
responses when listing/showing/attaching volumes/ports to a server.
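For illustration only, a rough sketch of how the response-building code could
gate the new field on the negotiated microversion; the helper name, the
``show_tag`` parameter and the microversion value are placeholders, not the
final implementation:

.. code-block:: python

    TAG_MICROVERSION = '2.70'  # placeholder; assigned for real when merged


    def translate_volume_attachment(bdm, show_tag):
        """Build one volumeAttachment entry from a BlockDeviceMapping.

        ``show_tag`` would be derived from the request, i.e. whether the
        requested microversion is at least TAG_MICROVERSION.
        """
        attachment = {
            'id': bdm.volume_id,
            'volumeId': bdm.volume_id,
            'serverId': bdm.instance_uuid,
            'device': bdm.device_name,
        }
        if show_tag:
            # The tag may legitimately be unset, in which case null is
            # returned in the JSON response.
            attachment['tag'] = bdm.tag
        return attachment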
Dependencies
============
None.
Testing
=======
Functional API samples tests should be sufficient coverage of this feature.
Documentation Impact
====================
The compute API reference will be updated to note the ``tag`` field in the
response for the ``os-volume_attachments`` and ``os-interface`` APIs.
References
==========
This was originally discussed at the `Ocata summit`_. It came up again at the
`Rocky PTG`_.
Related specs:
* Remove ``device`` from volume attach API: https://review.openstack.org/452546/
* Detach/attach boot volume: https://review.openstack.org/600628/
.. _Ocata summit: https://etherpad.openstack.org/p/ocata-nova-summit-api
.. _Rocky PTG: https://etherpad.openstack.org/p/nova-ptg-rocky
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Pike
- Originally proposed but abandoned
* - Stein
- Re-proposed

View File

@@ -0,0 +1,156 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=================================================
Flavor Extra Spec and Image Properties Validation
=================================================
https://blueprints.launchpad.net/nova/+spec/flavor-extra-spec-image-property-validation
Problem description
===================
Currently, flavor extra-spec and image property validation is done in
separate places. If they are not compatible, the instance may fail to launch
and go into an ERROR state, or may be rescheduled an unknown number of times
depending on the virt driver behaviour.
Use Cases
---------
As an end user, I would like instant feedback if the flavor extra specs or
image properties are not valid or not compatible with each other, so that I
can correct my configuration and retry the operation.
Proposed change
===============
We want to validate the combination of the flavor extra-specs and image
properties as early as possible once they're both known.
If validation fails, an error is returned synchronously to the user.
We need to do this anywhere the flavor or image changes, which basically means
instance creation, rebuild, and resize. More precisely, rename
``_check_requested_image()`` to something more generic, take it out of
``_checks_for_create_and_rebuild()``, modify it to check more things and call
it from all three operations: create, rebuild, and resize.
.. note:: Only things that are not virt driver specific are validated.
Examples of validations to be added [1]_:
* Call ``hardware.numa_get_constraints`` to validate all the various
  NUMA-related things. This is currently done only in ``_create_instance()``;
  it should be done for resize/rebuild as well (see the sketch after this
  list).
* Ensure that the cpu policy, cpu thread policy and emulator thread policy
values are valid.
* Validate the realtime mask.
* Validate the number of serial ports.
* Validate the cpu topology constraints.
* Validate the ``quota:*`` settings (that are not virt driver specific) in
  the flavor.
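As a rough illustration (the helper name and exact call sites are assumptions,
not the final code), the generic validation routine could simply invoke the
existing virt-driver-agnostic parsers in ``nova.virt.hardware``, which already
raise on invalid combinations:

.. code-block:: python

    from nova.virt import hardware


    def _validate_flavor_image_compatibility(image_meta, flavor):
        # Hypothetical helper name. Parses and validates the NUMA-related
        # extra specs and image properties (CPU policy, thread policy,
        # realtime mask, CPU topology, ...) and raises an Invalid-type
        # exception on bad combinations.
        hardware.numa_get_constraints(flavor, image_meta)

        # Validates the requested number of serial ports from the flavor
        # and image, raising on conflicting or invalid values.
        hardware.get_number_of_serial_ports(flavor, image_meta)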
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
Due to the new validations, users could face 4xx errors in more cases than
before for create/rebuild/resize operations.
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
Negligible.
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
jackding
Work Items
----------
* Add validations mostly in nova/compute/api.py.
* Add/update unit tests.
* Update documentation/release-note if necessary depending on the new
validations added.
Dependencies
============
None
Testing
=======
Will add unit tests.
Documentation Impact
====================
None
References
==========
.. [1] https://docs.openstack.org/nova/latest/user/flavors.html
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced

@@ -0,0 +1,465 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================
Generic os-vif datapath offloads
================================
https://blueprints.launchpad.net/nova/+spec/generic-os-vif-offloads
The existing method in os-vif is to pass datapath offload metadata via a
``VIFPortProfileOVSRepresentor`` port profile object. This is currently used by
the ``ovs`` reference plugin and the external ``agilio_ovs`` plugin. This spec
proposes a refactor of the interface to support more VIF types and offload
modes.
Problem description
===================
Background on Offloads
----------------------
While composing this spec, it became clear that the "offloads" term had
historical meaning that caused confusion about the scope of this spec. This
subsection was added in order to clarify the distinctions between different
classes of offloads.
Protocol Offloads
~~~~~~~~~~~~~~~~~
Network-specific computation being handled by dedicated peripherals is well
established on many platforms. For Linux, the `ethtool man page`_ details a
number of settings for the ``--offload`` option that are available on many
NICs, for specific protocols.
``ethtool`` type offloads typically:
#. are available to guests (and hosts),
#. have a strong relationship with a network endpoint,
#. have a role with generating and consuming packets,
#. can be modeled as capabilities of the virtual NIC on the instance.
Currently, Nova has little modelling for these types of offload capabilities.
Ensuring that instances can live migrate to a compute node capable of
providing the required features is not something Nova can currently determine
ahead of time.
This spec only touches lightly on this class of offloads.
Datapath Offloads
~~~~~~~~~~~~~~~~~
Relatively recently, SmartNICs emerged that allow complex packet processing on
the NIC. This allows the implementation of constructs like bridges and routers
under control of the host. In contrast with protocol offloads, these offloads
apply to the dataplane.
In Open vSwitch, the dataplane can be implemented by, for example, the kernel
datapath (the ``openvswitch.ko`` module), the userspace datapath, or the
``tc-flower`` classifier. In turn, portions of the ``tc-flower`` classifier can
be delegated to a SmartNIC as described in this `TC Flower Offload paper`_.
.. note:: Open vSwitch refers to specific implementations of its packet
processing pipeline as datapaths, not dataplanes. This spec follows
the datapath terminology.
Datapath offloads typically have the following characteristics:
#. The interfaces controlling and managing these offloads are under host
control.
#. Network-level operations such as routing, tunneling, NAT and firewalling can
be described.
#. A special plugging mode could be required, since the packets might bypass
the host hypervisor entirely.
The simplest case of this is an SR-IOV NIC in Virtual Ethernet Bridge (VEB)
mode, as used by the ``sriovnicswitch`` Neutron driver. A special plugging mode
is necessary, (namely IOMMU PCI passthrough), and the hypervisor configures the
VEB with the required MAC ACL filters.
This spec focuses on this class of offloads.
Hybrid Offloads
~~~~~~~~~~~~~~~
In future, it might be possible to push out datapath offloads as a service to
guest instances. In particular, trusted NFV instances might gain access to
sections of the packet processing pipeline, with various levels of isolation
and composition. This spec does not target this use case.
Core Problem Statement
----------------------
In order to support hardware acceleration for datapath offloads, Nova
core and os-vif need to model the datapath offload plugging metadata. The
existing method in os-vif is to pass this via a
``VIFPortProfileOVSRepresentor`` port profile object. This is used by the
``ovs`` reference plugin and the external ``agilio_ovs`` plugin.
With ``vrouter`` being a potential third user of such metadata (proposed in the
`blueprint for vrouter hardware offloads`_), it's worthwhile to abstract the
interface before the pattern solidifies further.
This spec is limited to refactoring the interface, with future expansion in
mind, while allowing existing plugins to remain functional.
SmartNICs are able to route packets directly to individual SR-IOV Virtual
Functions. These can be connected to instances using IOMMU (vfio-pci
passthrough) or a low-latency vhost-user `virtio-forwarder`_ running on the
compute node.
In Nova, a VIF should fully describe how an instance is plugged into the
datapath. This includes information for the hypervisor to perform the required
plugging, and also info for the datapath control software. For the ``ovs`` VIF,
the hypervisor is generally able to also perform the datapath control, but this
is not the case for every VIF type (hence the existence of os-vif).
The VNIC type is a property of a VIF. It has taken on the semantics of
describing a specific "plugging mode" for the VIF. In the Nova network API,
there is a `list of VNIC types that will trigger a PCI request`_, if Neutron
has passed a VIF to Nova with one of those VNIC types set. `Open vSwitch
offloads`_ uses the following VNIC types to distinguish between offloaded
modes:
* The ``normal`` (or default) VNIC type indicates that the Instance is plugged
into the software bridge.
* The ``direct`` VNIC type indicates that a VF is passed through to the
Instance.
In addition, the Agilio OVS VIF type implements the following offload mode:
* The ``virtio-forwarder`` VNIC type indicates that a VF is attached via a
`virtio-forwarder`_.
Currently, os-vif and Nova implement `switchdev SR-IOV offloads`_ for Open
vSwitch with ``tc-flower`` offloads. In this model, a representor netdev on the
host is associated with each Virtual Function. This representor functions like
a handle for the corresponding virtual port on the NIC's packet processing
pipeline.
Nova passes the PCI address it received from the PCI request to the os-vif
plugin. Optionally, a netdev name can also be passed to allow for friendly
renaming of the representor by the os-vif plugin.
The ``ovs`` and ``agilio_ovs`` os-vif plugins then look up the associated
representor for the VF and perform the datapath plugging. From Nova's
perspective the hypervisor then either passes through a VF using the data from
the ``VIFHostDevice`` os-vif object (with the ``direct`` VNIC type), or plugs
the Instance into a vhost-user handle with data from a ``VIFVHostUser`` os-vif
object (with the ``virtio-forwarder`` VNIC type).
In both cases, the os-vif object has a port profile of
``VIFPortProfileOVSRepresentor`` that carries the offload metadata as well as
Open vSwitch metadata.
Use Cases
---------
Currently, switchdev VF offloads are modelled for one port profile only. Should
a developer, using a different datapath, wish to pass offload metadata to an
os-vif plugin, they would have to extend the object model, or pass the metadata
using a confusingly named object. This spec aims to establish a recommended
mechanism to extend the object model.
Proposed change
===============
Use composition instead of inheritance
--------------------------------------
Instead of using an inheritance based pattern to model the offload
capabilities and metadata, use a composition pattern:
* Implement a ``DatapathOffloadBase`` class.
* Subclass this to ``DatapathOffloadRepresentor`` with the following members:
* ``representor_name: StringField()``
* ``representor_address: StringField()``
* Add a ``datapath_offload`` member to ``VIFPortProfileBase``:
* ``datapath_offload: ObjectField('DatapathOffloadBase', nullable=True,
default=None)``
* Update the os-vif OVS reference plugin to accept and use the new versions
  and fields (a sketch of the proposed classes follows this list).
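For illustration, a minimal sketch of the proposed classes, written here
against plain oslo.versionedobjects; the real implementation would live in
``os_vif.objects`` and follow its own registration and versioning
conventions:

.. code-block:: python

    from oslo_versionedobjects import base
    from oslo_versionedobjects import fields


    @base.VersionedObjectRegistry.register
    class DatapathOffloadBase(base.VersionedObject):
        # Carries no fields itself; concrete offload metadata subclasses it.
        VERSION = '1.0'
        fields = {}


    @base.VersionedObjectRegistry.register
    class DatapathOffloadRepresentor(DatapathOffloadBase):
        # switchdev-style metadata: the representor netdev name and a string
        # address locating the representor (e.g. a PCI address).
        VERSION = '1.0'
        fields = {
            'representor_name': fields.StringField(),
            'representor_address': fields.StringField(),
        }

    # VIFPortProfileBase would then gain:
    #     'datapath_offload': fields.ObjectField('DatapathOffloadBase',
    #                                            nullable=True, default=None)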
Future os-vif plugins combining an existing form of datapath offload (i.e.
switchdev offload) with a new VIF type will not require modifications to
os-vif. Future datapath offload methods will require subclassing
``DatapathOffloadBase``.
Instead of implementing potentially brittle backlevelling code, this option
proposes to keep two parallel interfaces alive in Nova for at least one
overlapping release cycle, before the Open vSwitch plugin is updated in os-vif.
Instead of bumping object versions and creating composition version maps, this
option proposes that versioning be deliberately ignored until the next major
release of os-vif. Currently, version negotiation and backlevelling in os-vif
is not used in Nova or os-vif plugins.
Kuryr Kubernetes is also a user of os-vif and is using object versioning in a
manner not yet supported publicly in os-vif. There is an `ongoing discussion
attempting to find a solution for Kuryr's use case`_.
Should protocol offloads also need to be modeled in os-vif, ``VIFBase`` or
``VIFPortProfileBase`` could gain a ``protocol_offloads`` list of capabilities.
Summary of plugging methods affected
------------------------------------
* Before changes:
* VIF type: ``ovs`` (os-vif plugin: ``ovs``)
* VNIC type: ``direct``
* os-vif object: ``VIFHostDevice``
* ``port_profile: VIFPortProfileOVSRepresentor``
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
* VNIC type: ``direct``
* os-vif object: ``VIFHostDevice``
* ``port_profile: VIFPortProfileOVSRepresentor``
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
* VNIC type: ``virtio-forwarder``
* os-vif object: ``VIFVHostUser``
* ``port_profile: VIFPortProfileOVSRepresentor``
* After this model has been adopted in Nova:
* VIF type: ``ovs`` (os-vif plugin: ``ovs``)
* VNIC type: ``direct``
* os-vif object: ``VIFHostDevice``
* ``port_profile: VIFPortProfileOpenVSwitch``
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
* VNIC type: ``direct``
* os-vif object: ``VIFHostDevice``
* ``port_profile: VIFPortProfileOpenVSwitch``
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
* VIF type: ``agilio_ovs`` (os-vif plugin: ``agilio_ovs``)
* VNIC type: ``virtio-forwarder``
* os-vif object: ``VIFVHostUser``
* ``port_profile: VIFPortProfileOpenVSwitch``
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
Additional Impact
-----------------
os-vif needs to issue a release before these profiles will be available to
general CI testing in Nova. Once this is done, Nova can be adapted to use the
new generic interfaces.
* In Stein, os-vif's object model will gain the interfaces described in this
spec. If needed, a major os-vif release will be issued.
* Then, Nova will depend on the new release and use the new interfaces for new
plugins.
* During this time, os-vif will have two parallel interfaces supporting this
metadata. This is expected to last at least from Stein to Train.
* From Train onwards, existing plugins should be transitioned to the new
model.
* Once all plugins have been transitioned, the parallel interfaces can be
removed in a major release of os-vif.
* Support will be lent to Kuryr Kubernetes during this period, to transition to
a better supported model.
Additional notes
----------------
* No corresponding changes in Neutron are expected: currently os-vif is
consumed by Nova and Kuryr Kubernetes.
* Even though representor addresses are currently modeled as PCI address
objects, it was felt that stricter type checking would be of limited
benefit. Future networking systems might require paths, UUIDs or other
methods of describing representors. Leaving the address member a string was
deemed an acceptable compromise.
* The main concern raised against composition over inheritance was the
  increase in the serialized size of the objects.
Alternatives
------------
During the development of this spec it was not immediately clear whether the
composition or inheritance model would be the consensus solution. Because the
two models have wildly different effects on future code, it was decided that
both be implemented in order to compare and contrast.
The implementation for the inheritance model is illustrated in
https://review.openstack.org/608693
Use inheritance to create a generic representor profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Keep using an inheritance based pattern to model the offload capabilities and
metadata:
* Implement ``VIFPortProfileRepresentor`` by subclassing ``VIFPortProfileBase``
and adding the following members:
* ``representor_name: StringField(nullable=True)``
* ``representor_address: StringField()``
Summary of new plugging methods available in an inheritance model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* After os-vif changes:
* Generic VIF with SR-IOV passthrough:
* VNIC type: ``direct``
* os-vif object: ``VIFHostDevice``
* ``port_profile: VIFPortProfileRepresentor``
* Generic VIF with virtio-forwarder:
* VNIC type: ``virtio-forwarder``
* os-vif object: ``VIFVHostUser``
* ``port_profile: VIFPortProfileRepresentor``
Other alternatives considered
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Other alternatives proposed require much more invasive patches to Nova and
os-vif:
* Create a new VIF type for every future datapath/offload combination.
* The inheritance based pattern could be made more generic by renaming the
``VIFPortProfileOVSRepresentor`` class to ``VIFPortProfileRepresentor`` as
illustrated in https://review.openstack.org/608448
* The versioned objects could be backleveled by using a suitable negotiation
mechanism to provide overlap.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
os-vif plugins run with elevated privileges, but no new functionality will be
implemented.
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
Extending the model in this fashion adds more bytes to the VIF objects passed
to the os-vif plugin. At the moment, this effect is negligible, but when the
objects are serialized and passed over the wire, this will increase the size of
the API messages.
However, it's very likely that the object model would undergo a major
version change with a redesign, before this becomes a problem.
Other deployer impact
---------------------
Deployers might notice a deprecation warning in logs if Nova, os-vif or the
os-vif plugin is out of sync.
Developer impact
----------------
Core os-vif semantics will be slightly changed. The details for extending
os-vif objects would be slightly more established.
Upgrade impact
--------------
The minimum required version of os-vif in Nova will be bumped in both
``requirements.txt`` and ``lower-constraints.txt``. Deployers should be
following at least those minimums.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Jan Gutter <jan.gutter@netronome.com>
Work Items
----------
* Implementation of the composition model in os-vif:
https://review.openstack.org/572081
* Adopt the new os-vif interfaces in Nova. This would likely happen after a
major version release of os-vif.
Dependencies
============
* After both options have been reviewed, and the chosen version has been
merged, an os-vif release needs to be made.
* When updating Nova to use the newer release of os-vif, the corresponding
changes should be made to move away from the deprecated classes. This change
is expected to be minimal.
Testing
=======
* Unit tests for the os-vif changes will test the object model impact.
* Third-party CI is already testing the accelerated plugging modes; no new
  functionality needs to be tested.
Documentation Impact
====================
The os-vif development documentation will be updated with the new classes.
References
==========
* `ethtool man page`_
* `TC Flower Offload paper`_
* `virtio-forwarder`_
* `Open vSwitch offloads`_
* `switchdev SR-IOV offloads`_
* `blueprint for vrouter hardware offloads`_
* `list of VNIC types that will trigger a PCI request`_
* `section in the API where the PCI request is triggered`_
* `ongoing discussion attempting to find a solution for Kuryr's use case`_
.. _`ethtool man page`: http://man7.org/linux/man-pages/man8/ethtool.8.html
.. _`TC Flower Offload paper`: https://www.netdevconf.org/2.2/papers/horman-tcflower-talk.pdf
.. _`virtio-forwarder`: http://virtio-forwarder.readthedocs.io/en/latest/
.. _`Open vSwitch offloads`: https://docs.openstack.org/neutron/queens/admin/config-ovs-offload.html
.. _`switchdev SR-IOV offloads`: https://netdevconf.org/1.2/slides/oct6/04_gerlitz_efraim_introduction_to_switchdev_sriov_offloads.pdf
.. _`blueprint for vrouter hardware offloads`: https://blueprints.launchpad.net/nova/+spec/vrouter-hw-offloads
.. _`list of VNIC types that will trigger a PCI request`: https://github.com/openstack/nova/blob/e3eb5f916580a9bab8f67b0fd685c6b3b23a97b7/nova/network/model.py#L111
.. _`section in the API where the PCI request is triggered`: https://github.com/openstack/nova/blob/e3eb5f916580a9bab8f67b0fd685c6b3b23a97b7/nova/network/neutronv2/api.py#L1921
.. _`ongoing discussion attempting to find a solution for Kuryr's use case`: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000569.html

@@ -0,0 +1,454 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==========================================
Handling a down cell
==========================================
https://blueprints.launchpad.net/nova/+spec/handling-down-cell
This spec aims at addressing the behavioural changes that are required to
support some of the basic nova operations like listing of instances and
services when a cell goes down.
Problem description
===================
Currently in nova, when a cell goes down (for instance if the cell DB is not
reachable), basic functionality like ``nova list`` and ``nova service-list``
does not work and returns an API error message. However, a single cell going
down should not stop these operations from working for end users and
operators.
Another issue is that while calculating quotas during VM creation, the
resources in the down cell are not taken into account and the ``nova boot``
operation is permitted into the cells which are up. This may result in
incorrect quota reporting for a particular project at boot time, which may
have implications when the down cell comes back.
Use Cases
---------
The specific use cases that are being addressed in the spec include:
#. ``nova list`` should work even if a cell goes down. This can be partitioned
into two use cases:
#. The user has no instances in the down cell: Expected behaviour would be
for everything to work as normal. This has been fixed through
`smart server listing`_ if used with the right config options.
#. The user has instances in the down cell: This needs to be gracefully
handled which can be split into two stages:
#. We just skip the down cell and return results from the cells that are
available instead of returning a 500 which has been fixed through
`resilient server listing`_.
#. Instead of skipping the down cell, we build on (modify) the existing
API response to return a minimalistic construct. This will be fixed in
this spec.
#. ``nova show`` should also return a minimalistic construct for instances in
the down cell similar to ``nova list``.
#. ``nova service-list`` should work even if a cell goes down. The solution can
be split into two stages:
#. We skip the down cell and end up displaying all the services from the
other cells as was in cells_v1 setup. This has been fixed through
`resilient service listing`_.
#. We handle this gracefully for the down cell. This will be fixed through
this spec by creating a minimalistic construct.
#. ``nova boot`` should not succeed if that project has any living VMs in the
down cell until an all-cell-iteration independent solution for quota
calculation is implemented through `quotas using placement`_.
Proposed change
===============
This spec proposes to add a new ``queued_for_delete`` column in the
``nova_api.instance_mappings`` table as discussed in the
`cells summary in Dublin PTG`_. This column would be of type Boolean which by
default will be False and upon the deletion (normal/local/soft) of the
respective instance, will be set to True. In the case of soft delete, if the
instance is restored, then the value of the column will be set to False again.
The corresponding ``queued_for_delete`` field will be added in the
InstanceMapping object.
Listing of instances and services from the down cell will return a
`did_not_respond_sentinel`_ object from the scatter-gather utility. Using this
response we can know if a cell is down or not and accordingly modify the
listing commands to work in the following manner for those records which are
from the down cell:
#. ``nova list`` should return a minimalistic construct from the available
information in the API DB which would include:
#. created_at, instance_uuid and project_id from the instance_mapping table.
#. status of the instance would be "UNKNOWN" which would be the major
indication that the record for this instance is partial.
#. rest of the field keys will be missing.
See the `Edge Cases`_ section for more info on running this command with
filters, marker, sorting and paging.
#. ``nova show`` should return a minimalistic construct from the available
information in the API DB which would be similar to ``nova list``. If
``GET /servers/{id}`` cannot reach the cell DB, we can look into the
instance_mapping and request_spec table for the instance details which would
include:
#. instance_uuid, created_at and project_id from the instance_mapping table.
#. status of the instance would be "UNKNOWN" which would be the major
indication that the record for this instance is partial.
#. user_id, flavor, image and availability_zone from the request_spec table.
#. power_state is set to NOSTATE.
#. rest of the field keys will be missing.
#. ``nova service-list`` should return a minimalistic construct from the
available information in the API DB which would include:
#. host and binary from the host_mapping table for the compute services.
#. rest of the field keys will be missing.
Note that if cell0 goes down the controller services will not be listed.
#. ``nova boot`` should not succeed if the requesting project has living VMs in
the down cell. So if the scatter-gather utility returns a
did_not_respond_sentinel while calculating quotas, we have to go and check
if this project has living instances in the down cell from the
instance_mapping table and prevent the boot request if it has. However it
might not be desirable to block VM creation for users having VMs in multiple
cells if a single cell goes down. Hence a new policy rule
``os_compute_api:servers:create:cell_down``, defaulting to
``rule:admin_api``, can be added to control whether regular users or only
admins can create instances when a project has instances in a down cell.
Using this, deployments can be configured in whichever way operators desire
(a sketch of the check follows this list).
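A rough sketch of the extra check from the last item above; the names are
simplified placeholders (``mappings`` stands in for the project's
``instance_mappings`` rows) and this is not the final code:

.. code-block:: python

    def has_live_instances_in_down_cells(mappings, down_cell_uuids):
        # Block the boot only if the project still owns an instance that is
        # mapped to an unreachable cell and is not queued for deletion.
        return any(mapping.cell_mapping.uuid in down_cell_uuids
                   and not mapping.queued_for_delete
                   for mapping in mappings)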
For the 1st, 2nd and 4th operations to work when a cell is down, the API DB
needs to record whether an instance is in the SOFT_DELETED/DELETED state so
that living instances can be distinguished from deleted ones, which is why we
add the new ``queued_for_delete`` column.
In order to prevent the client side from complaining about missing keys, we
need a new microversion that merges the above minimal constructs for servers
in down cells into the same list as the full constructs of servers in up
cells. In the future we could use a caching mechanism to fill in the
information for instances in down cells.
Note that all other non-listing operations like create and delete will simply
not work for the servers in the down cell since one cannot clearly do anything
about it if the cell database is not reachable. They will continue to return
500 as is the present scenario.
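The minimal construct itself can be assembled purely from the API database; a
sketch (not the final view-builder code) for the ``nova list`` case:

.. code-block:: python

    def build_unknown_server_view(instance_mapping):
        # Partial record for a server whose cell did not respond; the
        # UNKNOWN status is the caller's cue that the remaining keys are
        # intentionally missing.
        return {
            'id': instance_mapping.instance_uuid,
            'created': instance_mapping.created_at.isoformat(),
            'tenant_id': instance_mapping.project_id,
            'status': 'UNKNOWN',
        }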
Edge Cases
----------
* Filters: If the user is listing servers using filters the results from the
down cell will be skipped and no minimalistic construct will be provided
since there is no way of validating the filtered results from the down cell
if the value of the filter key itself is missing. Note that by default
``nova list`` uses the ``deleted=False`` and ``project_id=tenant_id``
filters and since we know both of these values from the instance_mapping
table, they will be the only allowed filters. Hence only doing ``nova list``
and ``nova list --minimal`` will show minimalistic results for the down cell.
Other filters like ``nova list --deleted`` or ``nova list --host xx`` will
skip the results for the down cell.
* Marker: If the user does ``nova list --marker`` it will fail with a 500 if
the marker is in the down cell.
* Sorting: We ignore the down cell just like we do for filters since there is
no way of obtaining valid results from the down cell with missing key info.
* Paging: We ignore the down cell. For instance, if we have three cells A
  (up), B (down) and C (up) and the marker is halfway through A, we would get
  the remaining half of the results from A, all the results from C, and
  ignore cell B.
Alternatives
------------
* An alternative to adding the new column to the instance_mappings table is
  to store the deleted information in the respective RequestSpec record;
  however, it was decided at the PTG to add the new column to the
  instance_mappings table as it is more appropriate. For the main logic there
  is no alternative other than having the deleted info in the API DB if the
  listing operations have to work when a cell goes down.
* Without a new microversion, include 'shell' servers in the response when
listing over down cells which would have UNKNOWN values for those keys
whose information is missing. However the client side would not be able to
digest the response with "UNKNOWN" values. Also it is not possible to assign
"UNKNOWN" to all the fields since not all of them are of string types.
* With a new microversion include the set of server uuids in the down cells
in a new top level API response key called ``unavailable_servers`` and treat
the two lists (one for the servers from the up cells and other for the
servers from the down cells) separately. See `POC for unavailable_servers`_
for more details.
* Using searchlight to backfill when there are down cells. Check
`listing instances using Searchlight`_ for more details.
* Adding backup DBs for each cell database which would act as read-only copies
of the original DB in times of crisis, however this would need massive
syncing and may fetch stale results.
Data model impact
-----------------
A nova_api DB schema change will be required for adding the
``queued_for_delete`` column of type Boolean to the
``nova_api.instance_mappings`` table. This column will be set to False by
default.
Also, the ``InstanceMapping`` object will have a new field called
``queued_for_delete``. An online data migration tool will be added to populate
this field for existing instance_mappings. This tool would basically go over
the instance records in all the cells, and if the vm_state of the instance is
either DELETED or SOFT_DELETED, it will set ``queued_for_delete`` to True;
otherwise it leaves the field at its default value.
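A simplified sketch of the per-instance migration step (the surrounding cell
iteration and batching plumbing is omitted and the function name is a
placeholder):

.. code-block:: python

    from nova.compute import vm_states


    def migrate_queued_for_delete(instance_mapping, instance):
        # Only deleted/soft-deleted instances need the flag flipped; all
        # other mappings keep the default value of False.
        if instance.vm_state in (vm_states.DELETED, vm_states.SOFT_DELETED):
            instance_mapping.queued_for_delete = True
            instance_mapping.save()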
REST API impact
---------------
When a cell is down, we currently skip that cell and this spec aims at
giving partial info for ``GET /servers``, ``GET /os-services``,
``GET /servers/detail`` and ``GET /servers/{server_id}`` REST APIs.
There will be a new microversion for the client to recognise missing keys and
NULL values for certain keys in the response.
An example server response for ``GET /servers/detail`` is given below which
includes one available server and one unavailable server.
JSON response body example::
{
"servers": [
{
"OS-EXT-STS:task_state": null,
"addresses": {
"public": [
{
"OS-EXT-IPS-MAC:mac_addr": "fa:xx:xx:xx:xx:1a",
"version": 4,
"addr": "1xx.xx.xx.xx3",
"OS-EXT-IPS:type": "fixed"
},
{
"OS-EXT-IPS-MAC:mac_addr": "fa:xx:xx:xx:xx:1a",
"version": 6,
"addr": "2sss:sss::s",
"OS-EXT-IPS:type": "fixed"
}
]
},
"links": [
{
"href": "http://1xxx.xxx.xxx.xxx/compute/v2.1/servers/b546af1e-3893-44ea-a660-c6b998a64ba7",
"rel": "self"
},
{
"href": "http://1xx.xxx.xxx.xxx/compute/servers/b546af1e-3893-44ea-a660-c6b998a64ba7",
"rel": "bookmark"
}
],
"image": {
"id": "9da3b809-2998-4ada-8cc6-f24bc0b6dd7f",
"links": [
{
"href": "http://1xx.xxx.xxx.xxx/compute/images/9da3b809-2998-4ada-8cc6-f24bc0b6dd7f",
"rel": "bookmark"
}
]
},
"OS-EXT-SRV-ATTR:user_data": null,
"OS-EXT-STS:vm_state": "active",
"OS-EXT-SRV-ATTR:instance_name": "instance-00000001",
"OS-EXT-SRV-ATTR:root_device_name": "/dev/vda",
"OS-SRV-USG:launched_at": "2018-06-29T15:07:39.000000",
"flavor": {
"ephemeral": 0,
"ram": 64,
"original_name": "m1.nano",
"vcpus": 1,
"extra_specs": {},
"swap": 0,
"disk": 0
},
"id": "b546af1e-3893-44ea-a660-c6b998a64ba7",
"security_groups": [
{
"name": "default"
}
],
"OS-SRV-USG:terminated_at": null,
"os-extended-volumes:volumes_attached": [],
"user_id": "187160b0afe041368258c0b195ab9822",
"OS-EXT-SRV-ATTR:hostname": "surya-probes-001",
"OS-DCF:diskConfig": "MANUAL",
"accessIPv4": "",
"accessIPv6": "",
"OS-EXT-SRV-ATTR:reservation_id": "r-uxbso3q4",
"progress": 0,
"OS-EXT-STS:power_state": 1,
"OS-EXT-AZ:availability_zone": "nova",
"config_drive": "",
"status": "ACTIVE",
"OS-EXT-SRV-ATTR:ramdisk_id": "",
"updated": "2018-06-29T15:07:39Z",
"hostId": "e8dcf7ab9762810efdec4307e6219f85a53d5dfe642747c75a87db06",
"OS-EXT-SRV-ATTR:host": "cn1",
"description": null,
"tags": [],
"key_name": null,
"OS-EXT-SRV-ATTR:kernel_id": "",
"OS-EXT-SRV-ATTR:hypervisor_hostname": "cn1",
"locked": false,
"name": "surya-probes-001",
"OS-EXT-SRV-ATTR:launch_index": 0,
"created": "2018-06-29T15:07:29Z",
"tenant_id": "940f47b984034c7f8f9624ab28f5643c",
"host_status": "UP",
"trusted_image_certificates": null,
"metadata": {}
},
{
"created": "2018-06-29T15:07:29Z",
"status": "UNKNOWN",
"tenant_id": "940f47b984034c7f8f9624ab28f5643c",
"id": "bcc6c6dd-3d0a-4633-9586-60878fd68edb",
}
]
}
Security impact
---------------
None.
Notifications impact
--------------------
None.
Other end user impact
---------------------
When a cell DB cannot be reached, ``nova list``, ``nova show`` and
``nova service-list`` will still work, with the records from the down cell
not having all the information. When these commands are used with
filters/sorting/paging, the output will skip the down cell entirely and
return only information from the up cells. Per the default policy,
``nova boot`` will not work if the tenant has any living instances in the
down cell.
Performance Impact
------------------
There will not be any major impact on performance in normal situations.
However, when a cell is down, show/list/boot operations will see a slight
performance impact because of the extra check against the instance_mapping
and/or request_spec tables and the time required to construct a minimalistic
record when a did_not_respond_sentinel is received from the scatter-gather
utility.
Other deployer impact
---------------------
None.
Developer impact
----------------
None.
Upgrade impact
--------------
Since there will be a change in the api DB schema, the ``nova-manage api_db
sync`` command will have to be run to update the instance_mappings table. The
new online data migration tool that will be added to populate the new column
will have to be run.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
<tssurya>
Other contributors:
<belmoreira>
Work Items
----------
#. Add a new column ``queued_for_delete`` to nova_api.instance_mappings table.
#. Add a new field ``queued_for_delete`` to InstanceMapping object.
#. Add a new online migration tool for populating ``queued_for_delete`` of
existing instance_mappings.
#. Handle ``nova list`` gracefully on receiving a timeout from a cell `here`_.
#. Handle ``nova service-list`` gracefully on receiving a timeout from a cell.
#. Handle ``nova boot`` during quota calculation in `quota calculation code`_
when the result is a did_not_respond_sentinel or raised_exception_sentinel.
Implement the extra check into the instance_mapping table to see if the
requesting project has any living instances in the down cell and block the
request accordingly.
Dependencies
============
None.
Testing
=======
Unit and functional tests for verifying the working when a
did_not_respond_sentinel is received.
Documentation Impact
====================
Update the description of the Compute API reference with regards to these
commands to include the meaning of UNKNOWN records.
References
==========
.. _smart server listing: https://review.openstack.org/#/c/509003/
.. _resilient server listing: https://review.openstack.org/#/c/575734/
.. _resilient service listing: https://review.openstack.org/#/c/568271/
.. _quotas using placement: https://review.openstack.org/#/c/509042/
.. _cells summary in Dublin PTG: http://lists.openstack.org/pipermail/openstack-dev/2018-March/128304.html
.. _did_not_respond_sentinel: https://github.com/openstack/nova/blob/f902e0d/nova/context.py#L464
.. _POC for unavailable_servers: https://review.openstack.org/#/c/575996/
.. _listing instances using Searchlight: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/list-instances-using-searchlight.html
.. _here: https://github.com/openstack/nova/blob/f902e0d/nova/compute/multi_cell_list.py#L246
.. _quota calculation code: https://github.com/openstack/nova/blob/f902e0d/nova/quota.py#L1317
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Rocky
- Introduced
* - Stein
- Reproposed

@@ -0,0 +1,313 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
======================================
Default allocation ratio configuration
======================================
https://blueprints.launchpad.net/nova/+spec/initial-allocation-ratios
Provide separate CONF options for specifying the initial allocation
ratio for compute nodes. Change the default values for
CONF.xxx_allocation_ratio options to None and change the behaviour of
the resource tracker to only override allocation ratios for *existing*
compute nodes if the CONF.xxx_allocation_ratio value is not None.
The primary goal of this feature is to support setting allocation ratios both
via the API and via configuration.
Problem description
===================
Manually set placement allocation ratios are overwritten
--------------------------------------------------------------------
There is currently no way for an admin to set the allocation ratio on an
individual compute node resource provider's inventory record in the placement
API without the resource tracker eventually overwriting that value the next
time it runs the ``update_available_resources`` periodic task on the
``nova-compute`` service.
The saga of the allocation ratio values on the compute host
-----------------------------------------------------------
The process by which nova determines the allocation ratio for CPU, RAM and disk
resources on a hypervisor is confusing and `error`_ `prone`_. The
``compute_nodes`` table in the nova cell DB contains three fields representing
the allocation ratio for CPU, RAM and disk resources on that hypervisor. These
fields are populated using different default values depending on the version of
nova running on the ``nova-compute`` service.
.. _error: https://bugs.launchpad.net/nova/+bug/1742747
.. _prone: https://bugs.launchpad.net/nova/+bug/1789654
Upon starting up, the resource tracker in the ``nova-compute`` service worker
`checks`_ to see if a record exists in the ``compute_nodes`` table of the nova
cell DB for itself. If it does not find one, the resource tracker `creates`_ a
record in the table, `setting`_ the associated allocation ratio values in the
``compute_nodes`` table to the value it finds in the ``cpu_allocation_ratio``,
``ram_allocation_ratio`` and ``disk_allocation_ratio`` nova.conf configuration
options but only if the config option value is not equal to 0.0.
.. _checks: https://github.com/openstack/nova/blob/852de1e/nova/compute/resource_tracker.py#L566
.. _creates: https://github.com/openstack/nova/blob/852de1e/nova/compute/resource_tracker.py#L577-L590
.. _setting: https://github.com/openstack/nova/blob/6a68f9140/nova/compute/resource_tracker.py#L621-L645
The default values of the ``cpu_allocation_ratio``, ``ram_allocation_ratio``
and ``disk_allocation_ratio`` CONF options are `currently set`_ to ``0.0``.
.. _currently set: https://github.com/openstack/nova/blob/852de1e/nova/conf/compute.py#L400
The resource tracker saves these default ``0.0`` values to the
``compute_nodes`` table when the resource tracker calls ``save()`` on the
compute node object. However, there is `code`_ in the
``ComputeNode._from_db_obj`` that, upon **reading** the record back from the
database on first save, changes the values from ``0.0`` to ``16.0``, ``1.5`` or
``1.0``.
.. _code: https://github.com/openstack/nova/blob/852de1e/nova/objects/compute_node.py#L177-L207
The ``ComputeNode`` object that was ``save()``'d by the resource tracker has
these new values for some period of time while the record in the
``compute_nodes`` table continues to have the wrong ``0.0`` values. When the
resource tracker next runs its ``update_available_resource()`` periodic task,
the new ``16.0``/``1.5``/``1.0`` values are then saved to the
``compute_nodes`` table.
There is a `fix`_ for `bug/1789654`_, which is to not persist
zero allocation ratios in ResourceTracker to avoid initializing placement
allocation_ratio with 0.0 (due to the allocation ratio of 0.0 being multiplied
by the total amount in inventory, leading to 0 resources shown on the system).
.. _fix: https://review.openstack.org/#/c/598365/
.. _bug/1789654: https://bugs.launchpad.net/nova/+bug/1789654
Use Cases
---------
An administrator would like to set allocation ratios for individual resources
on a compute node via the placement API *without that value being overwritten*
by the compute node's resource tracker.
An administrator chooses to only use the configuration file to set allocation
ratio overrides on their compute nodes and does not want to use the placement
API to set these ratios.
Proposed change
===============
First, we propose to change the default values of the existing
``CONF.cpu_allocation_ratio``, ``CONF.ram_allocation_ratio`` and
``CONF.disk_allocation_ratio`` options from ``0.0`` to ``None``. The reason
for this change is that the ``0.0`` default is later changed to ``16.0``,
``1.5`` or ``1.0``, which is weird and confusing.
We will also change the resource tracker to **only** overwrite the compute
node's allocation ratios to the value of the ``cpu_allocation_ratio``,
``ram_allocation_ratio`` and ``disk_allocation_ratio`` CONF options **if the
value of these options is NOT ``None``**.
In other words, if any of these CONF options is set to something *other than*
``None``, then the CONF option should be considered the complete override value
for that resource class' allocation ratio. Even if an admin manually adjusts
the allocation ratio of the resource class in the placement API, the next time
the ``update_available_resource()`` periodic task runs, it will be overwritten
to the value of the CONF option.
Second, we propose to add 3 new nova.conf configuration options:
* ``initial_cpu_allocation_ratio``
* ``initial_ram_allocation_ratio``
* ``initial_disk_allocation_ratio``
These will be used to determine how to set the *initial* allocation ratio of
``VCPU``, ``MEMORY_MB`` and ``DISK_GB`` resource classes when a compute worker
first starts up and creates its compute node record in the nova cell DB and
corresponding inventory records in the placement service. The value of these
new configuration options will only be used if the compute service's resource
tracker is not able to find a record in the placement service for the compute
node the resource tracker is managing.
The default value of each of these CONF options shall be ``16.0``, ``1.5``, and
``1.0`` respectively. This is to match the default values for the original
allocation ratio CONF options before they were set to ``0.0``.
These new ``initial_xxx_allocation_ratio`` CONF options shall **ONLY** be used
if the resource tracker detects no existing record in the ``compute_nodes``
nova cell DB for that hypervisor.
Finally, we will also need to add an online data migration and to continue
applying the ``xxx_allocation_ratio`` or ``initial_xxx_allocation_ratio``
config options when reading records from the DB whose values are ``0.0`` or
``None``. If it's an existing record with 0.0 values, we want to do what the
compute does, which is use the configured ``xxx_allocation_ratio`` option if
it's not None, and fall back to using the ``initial_xxx_allocation_ratio``
otherwise.
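A minimal sketch of that read-time fallback for a single resource class (the
function name is illustrative, not the actual ``ComputeNode`` code):

.. code-block:: python

    def effective_ratio(stored, configured, initial):
        # ``stored`` is the value from the compute_nodes record,
        # ``configured`` is CONF.xxx_allocation_ratio (default None) and
        # ``initial`` is the proposed CONF.initial_xxx_allocation_ratio.
        if stored not in (0.0, None):
            return stored        # an explicit stored value always wins
        if configured is not None:
            return configured    # operator override from the config file
        return initial           # e.g. 16.0 for VCPU

    # effective_ratio(0.0, None, 16.0) -> 16.0
    # effective_ratio(0.0, 4.0, 16.0)  -> 4.0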
This online data migration updates all ``compute_nodes`` table records that
have ``0.0`` or ``None`` allocation ratios. At some point we then add a
blocker migration and remove the code in
``nova.objects.ComputeNode._from_db_obj`` that adjusts allocation ratios.
We propose to add a nova-status upgrade check that iterates the cells looking
for compute_nodes records with ``0.0`` or ``None`` allocation ratios and
emits a warning that the online data migration has not been run. We could
also check whether the conf options are explicitly set to 0.0 and, if so,
fail the status check.
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
We need an online data migration for any compute_nodes records with existing
``0.0`` or ``None`` allocation ratios. If it's an existing record with 0.0
values, we will replace it with the configured ``xxx_allocation_ratio`` option
if it's not None, and fall back to using the ``initial_xxx_allocation_ratio``
otherwise.
.. note:: Migrating 0.0 allocation ratios from existing ``compute_nodes`` table
records is necessary because the ComputeNode object based on those table
records is what gets used in the scheduler [1]_, specifically the
``NUMATopologyFilter`` and ``CPUWeigher`` (the ``CoreFilter``,
``DiskFilter`` and ``RamFilter`` also use them but those filters are
deprecated for removal so they are not a concern here).
Clearly, in order to take advantage of the ability to manually set allocation
ratios on a compute node, that hypervisor would need to be upgraded. There is
no impact to old compute hosts.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
yikun
Work Items
----------
* Change the default values for ``CONF.xxx_allocation_ratio`` options to
``None``.
* Modify the resource tracker to only set allocation ratios on the compute
  node object when the CONF options are non-``None``
* Add new ``initial_xxx_allocation_ratio`` CONF options and modify resource
tracker's initial compute node creation to use these values
* Remove code in the ``ComputeNode._from_db_obj()`` that changes allocation
ratio values
* Add a db online migration to process all compute_nodes with existing ``0.0``
and ``None`` allocation ratio.
* Add a nova-status upgrade check for ``0.0`` or ``None`` allocation ratio.
Dependencies
============
None
Testing
=======
No extraordinary testing outside normal unit and functional testing
Documentation Impact
====================
A release note explaining the use of the new ``initial_xxx_allocation_ratio``
CONF options should be created along with a more detailed doc in the admin
guide explaining the following primary scenarios:
* When the deployer wants to **ALWAYS** set an override value for a resource on
a compute node. This is where the deployer would ensure that the
``cpu_allocation_ratio``, ``ram_allocation_ratio`` and
``disk_allocation_ratio`` CONF options were set to a non-``None`` value.
* When the deployer wants to set an **INITIAL** value for a compute node's
allocation ratio but wants to allow an admin to adjust this afterwards
without making any CONF file changes. This scenario uses the new
``initial_xxx_allocation_ratio`` options for the initial ratio values and then shows
the deployer using the osc placement commands to manually set an allocation
ratio for a resource class on a resource provider.
* When the deployer wants to **ALWAYS** use the placement API to set allocation
ratios, then the deployer should ensure that ``CONF.xxx_allocation_ratio``
options are all set to ``None`` and the deployer should issue Placement
REST API calls to
``PUT /resource_providers/{uuid}/inventories/{resource_class}`` [2]_ or
``PUT /resource_providers/{uuid}/inventories`` [3]_ to set the allocation
ratios of their resources as needed (or use the related ``osc-placement``
plugin commands [4]_).
References
==========
.. [1] https://github.com/openstack/nova/blob/a534ccc5a7/nova/scheduler/host_manager.py#L255
.. [2] https://developer.openstack.org/api-ref/placement/#update-resource-provider-inventory
.. [3] https://developer.openstack.org/api-ref/placement/#update-resource-provider-inventories
.. [4] https://docs.openstack.org/osc-placement/latest/
Nova Stein PTG discussion:
* https://etherpad.openstack.org/p/nova-ptg-stein
Bugs:
* https://bugs.launchpad.net/nova/+bug/1742747
* https://bugs.launchpad.net/nova/+bug/1729621
* https://bugs.launchpad.net/nova/+bug/1739349
* https://bugs.launchpad.net/nova/+bug/1789654
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Proposed

@@ -0,0 +1,199 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==================================================================
Use conductor groups to partition nova-compute services for Ironic
==================================================================
https://blueprints.launchpad.net/nova/+spec/ironic-conductor-groups
Use ironic's conductor group feature to limit the subset of nodes which a
nova-compute service will manage. This allows for partitioning nova-compute
services to a particular location (building, aisle, rack, etc), and provides a
way for operators to manage the failure domain of a given nova-compute service.
Problem description
===================
As OpenStack deployments become larger, and edge compute becomes a reality,
there is a desire to be able to co-locate the nova-compute service with
some subset of ironic nodes.
There is also a desire to be able to reduce the failure domain of a
nova-compute service, and to be able to make the failure domain more
predictable in terms of which ironic nodes can no longer be scheduled to.
Use Cases
---------
Operators managing large and/or distributed ironic environments need more
control over the failure domain of a nova-compute service.
Proposed change
===============
A configuration option ``partition_key`` will be added, to tell the
nova-compute service which ``conductor_group`` (an ironic-ism) it is
responsible for managing. This will be used as a filter when querying the list
of nodes from ironic, so that only the subset of nodes which have a
``conductor_group`` matching the ``partition_key`` will be returned.
As nova-compute services have a hash ring which further partitions the subset
of nodes which a given nova-compute service is managing, we need a mechanism to
tell the service which other compute services are managing the same
``partition_key``. To do this, we will add another configuration option,
``peer_list``, which is a comma-separated list of hostnames of other compute
services managing the same subset of nodes. If set, this will be used instead
of the current code, which fetches a list of all compute services running the
ironic driver from the database. To ensure that the hash ring splits nodes only
between currently running compute services, we will check this list against the
database and filter out any inactive services (i.e. has not checked in
recently) listed in ``peer_list``.
``partition_key`` will default to ``None``. If the value is ``None``, this
functionality will be disabled, and the behavior will be the same as before,
where all nodes are eligible to be managed by the compute service, and all
compute services are considered as peers. Any other value will enable this
service, limiting the nodes to the conductor group matching ``partition_key``,
and using the ``peer_list`` configuration option to determine the list of
peers.
Both options will be added to the ``[ironic]`` config group, and will be
"mutable", meaning it only requires a SIGHUP to update the running service with
new config values.
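As a sketch only (the record shapes and function name are simplified
placeholders, not existing nova code), the peer selection for the hash ring
could look like this:

.. code-block:: python

    def select_hash_ring_members(partition_key, peer_list, services):
        # ``services`` is a list of (hostname, is_up) pairs for compute
        # services running the ironic driver, as fetched from the database.
        if partition_key is None:
            # Feature disabled: every ironic compute service is a peer.
            return [host for host, _is_up in services]
        # Only peers that are explicitly listed *and* currently reporting
        # in may claim a slice of the conductor group's nodes.
        active = {host for host, is_up in services if is_up}
        return [host for host in peer_list if host in active]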
Alternatives
------------
Ideally, we wouldn't need a ``peer_list`` configuration option, as we would
be able to dynamically fetch this list from the database; a manually
maintained list is prone to operator mistakes.
One option to do this is to add a field to the compute service record, to store
the partition key. Compute services running the ironic driver could then use
this field to determine their peer list. During the Stein PTG discussion
about this feature, we agreed not to do this, as adding fields or blobjects
in the service record for a single driver is a layer violation.
Another option is for the ironic driver to manage its own list of live services
in something like etcd, and the peer list could be determined from here. This
also feels like a layer violation, and requiring an etcd cluster only for a
particular driver feels confusing at best from an operator POV.
Data model impact
-----------------
None.
REST API impact
---------------
None.
Security impact
---------------
None.
Notifications impact
--------------------
None.
Other end user impact
---------------------
None.
Performance Impact
------------------
Using this feature slightly improves the performance of the resource tracker
update. Instead of iterating over the list of *all* ironic nodes to determine
which should be managed, the compute service will iterate over a subset of
ironic nodes.
Other deployer impact
---------------------
The two configuration options mentioned above are added, but are optional.
The feature isn't enabled unless ``partition_key`` is set.
It's worth noting what happens when a node's conductor group changes. If the
node has an instance, it continues being managed by the compute service
responsible for the instance, as we do today with rebalancing the hash ring.
Without an instance, the node will be picked up by a compute service managing
the new group at the next resource tracker run after the conductor group
changes.
Developer impact
----------------
None.
Upgrade impact
--------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
jroll
Work Items
----------
* Add the configuration options and the new code paths.
* Add functional tests to ensure that the compute services manage the correct
subset of nodes when this is enabled.
* Add documentation for deployers and operators.
Dependencies
============
None.
Testing
=======
This will need to be tested in functional tests, as it would require spinning
up at least three nova-compute services to properly test the feature. While
possible in integration tests, this isn't a great use of CI resources.
Documentation Impact
====================
Deployer and operator documentation will need updates.
References
==========
This feature and its implementation were roughly agreed upon during the Stein
PTG. See line 662 or so (at the time of this writing):
https://etherpad.openstack.org/p/nova-ptg-stein
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced
@@ -0,0 +1,184 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==================================
Live-Migration force after timeout
==================================
https://blueprints.launchpad.net/nova/+spec/live-migration-force-after-timeout
Replace the existing flawed automatic post-copy logic with the option to
force-complete live-migrations on completion timeout, instead of aborting.
Problem description
===================
In an ideal world, we could tell when a VM looks unable to move, and warn
the operator sooner than the `completion timeout`_. This was the idea with the
`progress timeout`_. Sadly we do not get enough information from QEMU and
libvirt to correctly detect this case. As we were sampling a saw tooth
wave, it was possible for us to think little progress was being made, when in
fact that was not the case. In addition, only memory was being monitored, so
large block_migrations always looked like they were making no progress. Refer
to the `References`_ section for details.
In Ocata we `deprecated`_ that progress timeout, and disabled it by default.
Given there is no quick way to make that work, it should be removed now.
The automatic post-copy is using the same flawed data, so that logic should
also be removed.
Nova currently optimizes for limited guest downtime, over ensuring the
live-migration operation always succeeds. When performing a host maintenance,
operators may want to move all VMs from the affected host to an unaffected
host. In some cases, the VM could be too busy to move before the completion
timeout, and currently that means the live-migration will fail with a timeout
error.
Automatic post-copy used to be able to help with this use case, making Nova do
its best to ensure the live-migration completes, at the cost of a little more
VM downtime. We should look at a replacement for automatic post-copy.
.. _completion timeout: https://docs.openstack.org/nova/rocky/configuration/config.html#libvirt.live_migration_completion_timeout
.. _progress timeout: https://docs.openstack.org/nova/rocky/configuration/config.html#libvirt.live_migration_progress_timeout
.. _deprecated: https://review.openstack.org/#/c/431635/
Use Cases
---------
* Operators want to patch a host, so they use live-migration to move all the
  VMs off that host with minimal impact to them. If a VM isn't live-migrated
  there will be significant VM downtime, so it's better to take a little more
  VM downtime during the live-migration than the much larger amount of
  downtime the VM would suffer if it were not moved at all.
Proposed change
===============
* Config option ``libvirt.live_migration_progress_timeout`` was deprecated in
Ocata, and can now be removed.
* The current logic in the libvirt driver to auto-trigger post-copy will be
  removed.
* A new configuration option ``libvirt.live_migration_timeout_action`` will be
  added. This new option will have the choices ``abort`` (default) or
  ``force_complete``. This option determines what action will be taken
  against a VM after ``live_migration_completion_timeout`` expires. Currently
  nova just aborts the live-migration operation after the completion timeout
  expires; by default, we keep that same behavior.
Please note that the ``abort`` and ``force_complete`` actions available in the
``live_migration_timeout_action`` config option behave the same as calling the
existing REST APIs of the same name. In particular, ``force_complete`` will
either pause the VM or trigger post-copy, depending on whether post-copy is
enabled and available.
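As a rough illustration only (the function and helper names below are
assumptions, not Nova's real code), the decision at completion timeout would
look something like:

.. code:: python

    def handle_completion_timeout(guest, timeout_action, post_copy_available):
        """Sketch of the behaviour selected by live_migration_timeout_action."""
        if timeout_action == 'abort':
            guest.abort_migration()          # hypothetical helper: today's behaviour
        elif timeout_action == 'force_complete':
            if post_copy_available:
                guest.switch_to_post_copy()  # hypothetical helper
            else:
                guest.pause()                # hypothetical helper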
Alternatives
------------
We could just remove the automatic post-copy logic and not replace it, but
that would prevent us from helping operators with the above use case.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Kevin Zheng
Other contributors:
Yikun Jiang
Work Items
----------
* Remove ``libvirt.live_migration_progress_timeout`` and auto post copy logic.
* Add a new libvirt conf option ``live_migration_timeout_action``.
Dependencies
============
None
Testing
=======
Add in-tree functional and unit tests to test new logic. Testing these types
of scenarios in Tempest is not really possible given the unpredictable nature
of a timeout test. Therefore we can simulate and test the logic in functional
tests like those that `already exist`_.
.. _already exist: https://github.com/openstack/nova/blob/89c9127de/nova/tests/functional/test_servers.py#L3482
Documentation Impact
====================
Document new config options.
References
==========
* Live migration progress timeout bug: https://launchpad.net/bugs/1644248
* OSIC whitepaper: http://superuser.openstack.org/wp-content/uploads/2017/06/ha-livemigrate-whitepaper.pdf
* Boston summit session: https://www.openstack.org/videos/boston-2017/openstack-in-motion-live-migration
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Pike
- Approved but not implemented
* - Stein
- Reproposed
@@ -0,0 +1,195 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
===============================
Per aggregate scheduling weight
===============================
https://blueprints.launchpad.net/nova/+spec/per-aggregate-scheduling-weight
This spec proposes to add the ability to use an ``Aggregate``'s ``metadata`` to
override the global config options for weights, to achieve more fine-grained
control over resource weights.
Problem description
===================
In the current implementation, the weights are controlled by config options
like ``[filter_scheduler] cpu_weight_multiplier``, and the total weight of a
compute node is calculated as a combination of several weighers::

    weight = w1_multiplier * norm(w1) + w2_multiplier * norm(w2) + ...

As they are controlled by config options, the weights are global across the
whole deployment, which is not flexible enough for operators and users.
Use Cases
---------
As an operator I may want to have a more fine-grained control over resource
scheduling weight configuration so that I can control my resource allocations.
Operators may divide the resource pool into host aggregates by hardware type
and the workloads that hardware suits. Setting an independent scheduling weight
for each aggregate makes it easier to control the scheduling behavior
(spreading or packing). For example, by default I want my deployment to
stack resources to conserve energy, but for my HPC aggregate, I want to set
``cpu_weight_multiplier=10.0`` to spread instances across the hosts in that
aggregate because I want to avoid noisy neighbors as much as possible.
Operators may also restrict flavors/images to host aggregates, and those
flavors/images may have preferences about the relative importance of CPU, RAM
and disk. Setting a suitable weight for such an aggregate, rather than relying
on the global weight, can provide a more suitable resource allocation for the
corresponding workloads. For example, when deploying a big data analysis
cluster (e.g. Hadoop), the VMs in the cluster play different roles: for some of
them the amount of CPU and RAM is much more important than disk, like the
``HDFS NameNode`` and the nodes that run ``MapReduce`` tasks; for others, the
size of the disk is more important, like the ``HDFS DataNodes``. Creating
different flavors/images and restricting them to aggregates that have suitable
scheduling weights can give an overall better resource allocation and
performance.
Proposed change
===============
This spec proposes to add the ability for the existing weighers to read the
``*_weight_multiplier`` from ``aggregate metadata``, overriding the
``*_weight_multiplier`` from the config files, to achieve more flexible
weighting during scheduling.
This will be done by making the ``weight_multiplier()`` method take a
``HostState`` object as a parameter and get the corresponding
``weight_multiplier`` from the aggregate metadata similar to how
``nova.scheduler.utils.aggregate_values_from_key()`` is used by the
``AggregateCoreFilter`` filter. If the host is in multiple aggregates and
there are conflicting weight values in the metadata, we will use the minimum
value among them.
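A rough sketch of how a weigher could resolve its multiplier, assuming the
aggregate metadata values have already been collected with the
``aggregate_values_from_key()`` helper mentioned above (this is illustrative,
not the final implementation):

.. code:: python

    def resolve_multiplier(aggregate_values, config_value):
        """aggregate_values: the metadata values for e.g.
        'cpu_weight_multiplier' gathered from the host's aggregates."""
        parsed = []
        for value in aggregate_values:
            try:
                parsed.append(float(value))
            except (TypeError, ValueError):
                pass  # ignore malformed metadata and fall back to the config
        # If the host is in multiple aggregates with conflicting values,
        # the minimum wins, as described above.
        return min(parsed) if parsed else config_value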
Alternatives
------------
Add abilities to read the above mentioned multipliers from
``flavor extra_specs`` to make them per-flavor.
This alternative will not be implemented because:
- It could be very difficult to manage per-flavor weights in a
cloud with a lot of flavors, e.g. public cloud.
- Per-flavor weights do not help the case of an image that
  requires some kind of extra weight on the host it is used on, so
  per-flavor weights are less flexible. With the proposed solution
  we can apply the weights to aggregates, which can then be used to
  restrict both flavors (AggregateInstanceExtraSpecsFilter) and
  images (AggregateImagePropertiesIsolation).
Data model impact
-----------------
None.
REST API impact
---------------
None.
Security impact
---------------
None.
Notifications impact
--------------------
None.
Other end user impact
---------------------
None.
Performance Impact
------------------
There could be a minor decrease in the scheduling performance as
some data gathering and calculation will be added.
Other deployer impact
---------------------
None.
Developer impact
----------------
None.
Upgrade impact
--------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Zhenyu Zheng
Work Items
----------
#. Add the ability to existing weigher to read the
``*_weight_multiplier`` from ``aggregate metadata`` to override
the ``*_weight_multiplier`` from config files to achieve a more
flexible weight during scheduling
#. Update docs about the new change
Dependencies
============
None.
Testing
=======
Unit tests for verifying when a ``*_weight_multiplier`` is provided in
aggregate metadata.
Documentation Impact
====================
Update the weights user reference documentation here:
https://docs.openstack.org/nova/latest/user/filter-scheduler.html#weights
The aggregate metadata key/value for each weigher will be called out in
the documentation.
References
==========
None.
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced
@@ -0,0 +1,186 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==========================
Per-instance serial number
==========================
Add support for providing unique per-instance serial numbers to servers.
Problem description
===================
A libvirt guest's serial number in the machine BIOS comes from the
``[libvirt]/sysinfo_serial`` configuration option [1]_, which defaults to
reading it from the compute host's ``/etc/machine-id`` file or if that does
not exist, reading it from the libvirt host capabilities. Either way, all
guests on the same host have the same serial number in the guest BIOS.
This can be problematic for guests running licensed software that charges per
installation based on the serial number because if the guest is migrated, it
will incur new charges even though it is only running a single instance of the
software.
If the guest has a specific serial unique to itself, then the license
essentially travels with the guest.
Use Cases
---------
As a user (or cloud provider), I do not want workloads to incur license
charges simply because of those workloads being migrated during normal
operation of the cloud.
Proposed change
===============
To allow users to control this behavior (if the cloud provides it), a new
flavor extra spec ``hw:unique_serial`` and corresponding image property
``hw_unique_serial`` will be introduced which when either is set to ``True``
will result in the guest serial number being set to the instance UUID.
For operators that just want per-instance serial numbers either globally
or for a set of host aggregates, a new "unique" choice will be added to the
existing ``[libvirt]/sysinfo_serial`` configuration which if set will result
in the guest serial number being set to the instance UUID. Note that the
default value for the option will not change as part of this blueprint.
The flavor/image value, if set, supersedes the host configuration.
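The selection logic described above could look roughly like the following
sketch (names are placeholders, not the actual libvirt driver code):

.. code:: python

    def choose_sysinfo_serial(instance_uuid, flavor_extra_specs, image_props,
                              conf_sysinfo_serial, host_serial):
        flavor_val = flavor_extra_specs.get('hw:unique_serial')
        image_val = image_props.get('hw_unique_serial')
        if 'true' in (str(flavor_val).lower(), str(image_val).lower()):
            return instance_uuid      # flavor/image supersedes host config
        if conf_sysinfo_serial == 'unique':
            return instance_uuid      # new "unique" config choice
        return host_serial            # existing host-level behaviour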
Alternatives
------------
We could allow users to pass through a serial number UUID when creating
a server and then pass that down through to the hypervisor, but that seems
somewhat excessive for this small change. It is also not clear that all
hypervisor backends support specifying the serial number in the guest and we
want to avoid adding API features that not all compute drivers can support.
Allowing the user to specify a serial number could also potentially be abused
for pirating software unless a unique constraint was put in place, but even
then it would have to span an entire deployment (per-cell DB restrictions would
not be good enough).
Data model impact
-----------------
None besides a new ``FlexibleBooleanField`` field being added to the
``ImageMetaProps`` object.
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None. Users can leverage the functionality by creating new servers with an
enabled flavor/image, or by rebuilding/resizing existing servers with an
enabled flavor/image.
Performance Impact
------------------
None
Other deployer impact
---------------------
Operators that wish to expose this functionality can do so by adding the
extra spec to their flavors and/or images or setting
``[libvirt]/sysinfo_serial=unique`` in nova configuration. If they want to
restrict the functionality to a set of compute hosts, that can also be done by
restricting enabled flavors/images to host aggregates.
Developer impact
----------------
None, except maintainers of other compute drivers besides the libvirt driver
may wish to support the feature eventually.
Upgrade impact
--------------
There is no explicit upgrade impact except that older compute code would not
know about the new flavor extra spec or image property. Thus, if a user
requested a server with the property but the serial in the guest did not match
the instance UUID, they could be confused about why it does not work. Again,
operators can control this by deciding when to enable the feature or by
restricting it to certain host aggregates.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Zhenyu Zheng <zhengzhenyu@huawei.com> (Kevin_Zheng)
Other contributors:
Matt Riedemann <mriedem.os@gmail.com> (mriedem)
Work Items
----------
* Add the ``ImageMetaProps.hw_unique_serial`` field.
* Add a new choice, "unique", to the ``[libvirt]/sysinfo_serial`` configuration
option.
* Check for the flavor extra spec and image property in the libvirt driver
where the serial number config is set.
* Docs and tests.
Dependencies
============
None
Testing
=======
Unit tests should be sufficient for this relatively small feature.
Documentation Impact
====================
* The flavor extra spec will be documented: https://docs.openstack.org/nova/latest/user/flavors.html
* The image property will be documented: https://docs.openstack.org/glance/latest/admin/useful-image-properties.html
* The new configuration option choice will be documented [1]_
References
==========
.. [1] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sysinfo_serial
* Libvirt documentation: https://libvirt.org/formatdomain.html#elementsSysinfo
* Nova meeting discussion: http://eavesdrop.openstack.org/meetings/nova/2018/nova.2018-10-18-14.00.log.html#l-199
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced
@@ -0,0 +1,185 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================================
Remove force flag from live-migrate and evacuate
================================================
https://blueprints.launchpad.net/nova/+spec/remove-force-flag-from-live-migrate-and-evacuate
Force live-migrate and evacuate operations cannot be meaningfully supported for
servers having complex resource allocations. So this spec proposes to remove
the ``force`` flag from these operations in a new REST API microversion.
Problem description
===================
Today when ``force: True`` is specified nova tries to blindly copy the resource
allocation from the source host to the target host. This only works if the
server's allocation is satisfied by the single root resource provider both
on the source host and on the destination host. As soon as the allocation
becomes more complex (e.g. it allocates from more than one provider
(including sharing providers) or allocates only from a nested provider) the
blind copy will fail.
Use Cases
---------
This change removes the following use cases from the system:
* The admin can no longer force a live-migration to a specified destination
  host against the agreement of the Nova scheduler and Placement.
* The admin can no longer force an evacuation to a specified destination host
  against the agreement of the Nova scheduler and Placement.
This does not affect the use cases where the operator specifies the destination
host and lets Nova and Placement verify that host before the move.
Please note that this removes the possibility of force live-migrating servers
to hosts where the nova-compute service is disabled, as the ComputeFilter in
the filter scheduler will reject such hosts.
Proposed change
===============
Forcing the destination host in a complex allocation case cannot be supported
without calling Placement to get allocation candidates on the destination host,
as Nova does not know how to copy the complex allocation. The documentation
of the force flag states that Nova will not call the scheduler to verify the
destination host. This rule has already been broken since Pike by two
`bugfixes`_. Also, supporting complex allocations requires getting allocation
candidates from Placement. So this spec proposes to remove the ``force`` flag
as it cannot be supported any more.
Note that fixing old microversions to fail cleanly without leaking resources
in complex allocation scenarios is not part of this spec but is handled as part
of `use-nested-allocation-candidates`_. That change will make sure that a
forced move operation on a server that either has a complex allocation on the
source host or would require a complex allocation on the destination host is
rejected with a NoValidHost exception by the Nova conductor.
Alternatives
------------
* Try to guess when the server needs a complex allocation on the destination
host and only ignore the force flag in these cases.
* Do not manage resource allocations for forced move operations.
See more details in the `ML thread`_
Data model impact
-----------------
None
REST API impact
---------------
In a new microversion remove the ``force`` flag from both APIs:
* POST /servers/{server_id}/action (os-migrateLive Action)
* POST /servers/{server_id}/action (evacuate Action)
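For illustration, an ``os-migrateLive`` request body at the new microversion
would simply omit the flag; the values below are hypothetical:

.. code:: python

    # Request body for POST /servers/{server_id}/action at the new
    # microversion: the ``force`` key is no longer accepted.
    body = {
        "os-migrateLive": {
            "host": "target-compute",      # hypothetical destination host
            "block_migration": "auto",
        }
    }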
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
Update python-novaclient and python-openstackclient to support the new
microversion.
As the admin cannot skip the scheduler any more when moving servers, such a
move can fail for scheduler- and Placement-related reasons.
Performance Impact
------------------
As the admin cannot skip the scheduler when moving a server, such a move will
take a bit more time as Nova will call the scheduler and Placement.
Other deployer impact
---------------------
Please note that this spec removes the possibility to force live-migrate
servers to hosts where the nova-compute is disabled as the ComputeFilter in
the filter scheduler will reject such hosts.
Developer impact
----------------
Supporting the force flag has been a detriment to maintaining nova since it's
an edge case and requires workarounds like the ones made in Pike to support it.
Dropping support over time will be a benefit to maintaining the project and
improve consistency/reliability/usability of the API.
Upgrade impact
--------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
balazs-gibizer
Work Items
----------
* Add a new microversion to the API that removes the ``force`` flag from the
payload. If the new microversion is used in the request then default
``force`` to False when calling Nova internals.
* Document the new microversion
* Add support for the new microversion in the python-novaclient and in the
python-openstackclient
Dependencies
============
* Some part of `use-nested-allocation-candidates`_ is a dependency of this
  work.
Testing
=======
* Functional and unit test will be provided
Documentation Impact
====================
* API reference document needs to be updated
References
==========
.. _`use-nested-allocation-candidates`: https://blueprints.launchpad.net/nova/+spec/use-nested-allocation-candidates
.. _`ML thread`: http://lists.openstack.org/pipermail/openstack-dev/2018-October/135551.html
.. _`bugfixes`: https://review.openstack.org/#/q/I6590f0eda4ec4996543ad40d8c2640b83fc3dd9d+OR+I40b5af5e85b1266402a7e4bdeb3705e1b0bd6f3b
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced
@@ -0,0 +1,164 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==================================================
show which server group a server is in "nova show"
==================================================
bp link:
https://blueprints.launchpad.net/nova/+spec/show-server-group
Problem description
===================
Currently you have to loop over all server groups to find the group a server
belongs to. This spec addresses that by proposing to show the server
group information in the `GET /servers/{server_id}` API.
Use Cases
---------
* Admins and end users want to know the server group that a server belongs to
  in a direct way.
Proposed change
===============
Proposes to add the server-group UUID to ``GET /servers/{id}``,
``PUT /servers/{server_id}`` and REBUILD API
``POST /servers/{server_id}/action``.
The server-group information will not be included in
``GET /servers/detail`` API, because the server-group information
needs another DB query.
Alternatives
------------
* One alternative is to support filtering server groups by server UUID, like
  "GET /os-server-groups?server=<UUID>".
* Another alternative to support the server group query is the following API:
  "GET /servers/{server_id}/server_groups".
Data model impact
-----------------
None
REST API impact
---------------
Allows the `GET /servers/{server_id}` API to show the server group's UUID.
"PUT /servers/{server_id}" and the REBUILD API "POST /servers/{server_id}/action"
also return the same information.
.. highlight:: json
The returned information for server group::
{
"server": {
"server_groups": [ # not cached
"0b5d2c72-12cc-4ba6-a8d7-3ff5cc1d8cb8"
]
}
}
Security impact
---------------
N/A
Notifications impact
--------------------
N/A
Other end user impact
---------------------
* python-novaclient output would contain the server_group information.
Performance Impact
------------------
* Another DB query is needed to retrieve the server group UUID. To reduce the
  performance impact of batch API calls, "GET /servers/detail" won't
  return server group information.
Other deployer impact
---------------------
N/A
Developer impact
----------------
N/A
Upgrade impact
--------------
N/A
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Yongli He
Work Items
----------
* Add new microversion for this change.
Dependencies
============
N/A
Testing
=======
* Add functional api_sample tests.
* Add microversion related tests to tempest.
Documentation Impact
====================
* The API document should be changed to introduce this new feature.
References
==========
* Stein PTG discussion: https://etherpad.openstack.org/p/nova-ptg-stein
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Version
* - Stein
- First Version
@@ -1 +0,0 @@
../../stein-template.rst
@@ -0,0 +1,219 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=======================================================
Support High Precision Event Timer (HPET) on x86 guests
=======================================================
https://blueprints.launchpad.net/nova/+spec/support-hpet-on-guest
Problem description
===================
Use Cases
---------
As an end user looking to migrate an existing appliance to run in a cloud
environment I would like to be able to request a guest with HPET so that I can
share common code between my virtualized and physical products.
As an operator I would like to support onboarding legacy VNFs for my telco
customers where a guest image cannot be modified to work without a HPET.
Proposed change
===============
End users can indicate their desire to have HPET in the guest by specifying an
image property ``hw_time_hpet=True``.
Setting the new image property to "True" would only be guaranteed to be valid
in combination with ``hypervisor_type=qemu`` and either ``architecture=i686``
or ``architecture=x86_64``.
.. note:: A corresponding flavor extra spec will not be introduced since
enabling HPET is really a per-image concern rather than a resource concern
for capacity planning.
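As a purely illustrative sketch (the helper below is an assumption, not the
actual driver code), the libvirt driver would key off the image property and
emit the corresponding libvirt timer element:

.. code:: python

    def wants_hpet(image_props):
        """image_props: the image metadata properties dict."""
        return str(image_props.get('hw_time_hpet', False)).lower() == 'true'

    # When this returns True for a qemu/x86 guest, the generated domain XML
    # would carry a clock timer such as: <timer name='hpet' present='yes'/>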
A few options to use traits were considered as described in the next section,
but we ended up choosing the simpler approach for the following reasons:
1) HPET is provided by qemu via emulation, so there are no security
implications as there are already better clock sources available.
2) The HPET was turned off by default purely because of issues with time
drifting on Windows guests. (See nova commit ba3fd16605.)
3) The emulated HPET device is unconditionally available on all versions of
libvirt/qemu supported by OpenStack.
4) The HPET device is only supported for x86 architectures, so in a cloud with
   a mix of architectures the image would have to specify the architecture to
   ensure the instance is scheduled on an x86 host.
5) Initially we would only support enabling HPET on qemu. Specifying the
hypervisor type will ensure the instance is scheduled on a host using the
qemu hypervisor. It would be possible to extend this to other hypervisors
as well if applicable (vmware supports the ability to enable/disable HPET,
I think), and which ones are supported could be documented in the "useful
image properties" documentation.
Alternatives
------------
The following options to use traits were considered, but ultimately we chose
a simpler approach without using traits.
Explicit Trait, Implicit Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Operators can indicate their desire to have HPET in the guest by specifying a
placement trait ``trait:COMPUTE_TIME_HPET=required`` in the flavor extra-specs.
End users can indicate their desire to have HPET in the guest by uploading
their own images with the same trait.
Existing nova scheduler code picks up the trait and passes it to
``GET /allocation_candidates``.
Once scheduled to a compute node, the virt driver looks for
``trait:COMPUTE_TIME_HPET=required`` in the flavor/image or
``trait*:COMPUTE_TIME_HPET=required`` for numbered request group in flavor and
uses that as its cue to enable HPET on the guest.
If we do get down to the virt driver and the trait is set, and the driver for
whatever reason (e.g. value(s) wrong in the flavor; wind up on a host that
doesn't support HPET etc.) determines it's not capable of flipping the switch,
it should fail. [1]_
**CON:** We're using a trait to effect guest configuration.
Explicit Config, Implicit Trait
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Operator specifies extra spec ``hw:hpet=True`` in the flavor.
* Nova recognizes this as a known special case and adds
``required=COMPUTE_TIME_HPET`` to the ``GET /allocation_candidates`` query.
* The driver uses the ``hw:hpet=True`` extra spec as its cue to enable HPET on
the guest.
**CON:** The implicit transformation of a special extra spec into
placement-isms is arcane. This wouldn't be the only instance of this; we would
need to organize the "special" extra specs in the code for maintainability, and
document them thoroughly.
Explicit Config, Explicit Trait
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Operator specifies **both** extra specs, ``hw:hpet=True`` and
``trait:COMPUTE_TIME_HPET=required``, in the flavor.
* Existing nova scheduler code picks up the latter and passes it to ``GET
/allocation_candidates``.
* The driver uses the ``hw:hpet=True`` extra spec as its cue to enable HPET on
the guest.
**CON:** The operator has to remember to set both extra specs, which is kind of
gross UX. (If she forgets ``hw:hpet=True``, she ends up with HPET off; if she
forgets ``trait:COMPUTE_TIME_HPET=required``, she can end up with late-failing
NoValidHosts.)
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
Negligible.
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
The new image property will only work reliably after all nodes have been
upgraded.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
jackding
Other contributors:
jaypipes, efried
Work Items
----------
* libvirt driver changes to support HPET
Dependencies
============
None
Testing
=======
Will add unit tests.
Documentation Impact
====================
Update User Documentation for image properties [2]_.
References
==========
.. [1] http://lists.openstack.org/pipermail/openstack-dev/2018-October/135446.html
.. [2] https://docs.openstack.org/glance/latest/admin/useful-image-properties.html
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced
@@ -0,0 +1,189 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
========================================================
Support to query nova resources filter by changes-before
========================================================
https://blueprints.launchpad.net/nova/+spec/support-to-query-nova-resources-filter-by-changes-before
The compute API already has the changes-since filter to filter servers updated
since the given time and this spec proposes to add a changes-before filter to
filter servers updated before the given time. In addition, the filters could
be used in conjunction to build a kind of time range filter, e.g. to get the
nova resources between changes-since and changes-before.
Problem description
===================
By default, nova can only query instance resources in the
``updated_at >= changes-since`` time period. Users can only query resources
changed since a given time, not during a given period. Users may be interested
in resources changed during a specific period for monitoring or statistics
purposes, but currently they have to retrieve and filter the resources by
themselves. This change brings convenience to users and also improves the
efficiency of timestamp based queries.
Use Cases
---------
In a large scale environment, lots of resources are created in the system.
To trace resource changes, a user or management system only needs to get
the resources that changed within some time period, instead of querying
all resources every time to see which ones changed.
For example, if you are trying to get the nova resources that were changed
before '2018-07-26T10:31:49Z', you can filter servers like:
* GET /servers/detail?changes-before=2018-07-26T10:31:49Z
Or if you want to filter servers in a time range (e.g. changes-since=
2018-07-26T10:31:49Z -> changes-before=2018-07-30T10:31:49Z), you can
filter servers like:
* GET /servers/detail?changes-since=2018-07-26T10:31:49Z&changes-before=
2018-07-30T10:31:49Z
Proposed change
===============
Add a new microversion to os-instance-actions, os-migrations and servers
list APIs to support changes-before.
Introduce a new changes-before filter for retrieving resources. It accepts a
timestamp and the APIs will return resources whose updated_at fields are
earlier than or equal to this timestamp, i.e. ``updated_at <= changes-before``.
The changes-before value is optional. If both changes-since and changes-before
are given, the APIs will return resources whose updated_at fields are
earlier than or equal to changes-before, and later than or equal
to changes-since.
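The time-range semantics can be illustrated with the following minimal sketch
(applied to already-loaded objects; Nova's real implementation filters in the
DB layer):

.. code:: python

    def in_requested_range(updated_at, changes_since=None, changes_before=None):
        """Return True if a resource's updated_at falls in the range
        [changes-since, changes-before]; either bound may be omitted."""
        if changes_since is not None and updated_at < changes_since:
            return False
        if changes_before is not None and updated_at > changes_before:
            return False
        return True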
**Reading deleted resources**
Like the ``changes-since`` filter, the ``changes-before`` filter will also
return deleted servers.
This spec does not propose to change any read-deleted behavior in the
os-instance-actions or os-migrations APIs. The os-instance-actions API
with the 2.21 microversion allows retrieving instance actions for a deleted
server resource. The os-migrations API takes an optional ``instance_uuid``
filter parameter but does not support returning deleted migration records like
``changes-since`` does in the servers API.
Alternatives
------------
As discussed in the `Problem description`_ section, users can retrieve and then
filter resources by themselves, but this method is extremely inconvenient.
Having said that, services like Searchlight do exist which have similar
functionality, i.e. listening for nova notifications and storing them in
a time-series database like elasticsearch from which results can later be
queried. However, requiring Searchlight or a similar alternative solution for
this relatively small change is likely excessive.
Leaving the filtering work to the database can utilize the optimizations of the
database engine and also reduce the data transmitted from server to client.
Data model impact
-----------------
None
REST API impact
---------------
A new microversion will be added.
The list APIs will accept a new query string parameter changes-before.
The following cases apply (a small validation sketch follows the list):
* If the user specifies changes-before < changes-since, a 400
  HTTPBadRequest will be returned.
* If the user only specifies changes-before, all nova resources updated before
  changes-before will be returned, including deleted servers.
* If the user specifies both changes-since and changes-before, the changes from
  that specific period will be returned, including deleted servers.
* When the user only specifies changes-since, the original behavior
  remains unchanged.
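A minimal sketch of the range validation described in the first case (the real
API layer uses JSON schema validation and will differ in detail):

.. code:: python

    def validate_changes_range(changes_since, changes_before):
        if (changes_since is not None and changes_before is not None
                and changes_before < changes_since):
            # The API would translate this into a 400 HTTPBadRequest.
            raise ValueError('changes-before must not be earlier than '
                             'changes-since')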
Users can pass a time to the list API URLs to retrieve resources updated
before a specific time.
* GET /servers?changes-before=2018-07-26T10:31:49Z
* GET /servers/detail?changes-before=2018-07-26T10:31:49Z
* GET /servers/{server_id}/os-instance-actions?changes-before=
2018-07-26T10:31:49Z
* GET /os-migrations?changes-before=2018-07-26T10:31:49Z
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
The python client may add help text to inform users about this new filter.
Add support for the changes-before filter in python-novaclient
for the 'nova list', 'nova migration-list' and
'nova instance-action-list' commands.
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Brin Zhang
Work Items
----------
* Add querying support in sql
* Add API filter
* Add related test
* Add support for changes-before to the 'nova list' operation in novaclient
* Add support for changes-before to the 'nova instance-action-list'
in novaclient
* Add support for changes-before to the 'nova migration-list' in novaclient
Dependencies
============
None
Testing
=======
* Add related unittest
* Add related functional test
Documentation Impact
====================
The nova API documentation will need to be updated to reflect the
REST API changes and to add the microversion instructions.
References
==========
None
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Stein
- Introduced
@@ -0,0 +1,161 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================
VMware live migration
=====================
https://blueprints.launchpad.net/nova/+spec/vmware-live-migration
This is a proposal for adding support for live migration in the VMware
driver. When the VMware driver is used, each nova-compute is managing a
single vCenter cluster. For the purposes of this proposal we assume that
all nova-computes are managing clusters under the same vCenter server. If
migration across different vCenter servers is attempted, an error message
will be generated and no migration will occur.
Problem description
===================
Live migration is not supported when the VMware driver is used.
Use Cases
---------
As an Operator I want to live migrate instances from one compute cluster
(nova-compute host) to another compute cluster (nova-compute host) in the
same vCenter server.
Proposed change
===============
Relocating VMs to another cluster/datastore is a simple matter of calling the
RelocateVM_Task() vSphere API. The source compute host needs to know the
cluster name and the datastore regex of the target compute host. If the
instance is located on a datastore shared between the two clusters, it will
remain there. Otherwise we will choose a datastore that matches the
datastore_regex of the target host and migrate the instance there. There will
be a pre live-migration check that will verify that both source and
destination compute nodes correspond to clusters in the same vCenter server.
A new object will be introduced (VMwareLiveMigrateData) which will carry the
host IP, the cluster name and the datastore regex of the target compute host.
All of them are obtained from the nova config (CONF.vmware.host_ip,
CONF.vmware.cluster_name and CONF.vmware.datastore_regex).
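Purely as an illustration of the data the new object would carry (the real
object would be a Nova versioned object, not a dataclass):

.. code:: python

    import dataclasses


    @dataclasses.dataclass
    class VMwareLiveMigrateData:
        """Target-side information, all taken from the destination host's
        nova.conf as described above."""
        host_ip: str          # CONF.vmware.host_ip
        cluster_name: str     # CONF.vmware.cluster_name
        datastore_regex: str  # CONF.vmware.datastore_regex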
Alternatives
------------
None
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Upgrade impact
--------------
None
Implementation
==============
https://review.openstack.org/#/c/270116/
Assignee(s)
-----------
Primary assignee:
rgerganov
Work Items
----------
* Add ``VMwareLiveMigrateData`` object
* Implement pre live-migration checks
* Implement methods for selecting target ESX host and datastore
* Ensure CI coverage for live-migration
* Update support-matrix
Dependencies
============
None
Testing
=======
The VMware CI will provision two nova-computes and will execute the live
migration tests from tempest.
Documentation Impact
====================
The feature support matrix should be updated to indicate that live migration
is supported with the VMware driver.
References
==========
http://pubs.vmware.com/vsphere-60/topic/com.vmware.wssdk.apiref.doc/vim.VirtualMachine.html#relocate
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Newton
- Introduced
* - Ocata
- Reproposed
* - Pike
- Reproposed
* - Queens
- Reproposed
* - Rocky
- Reproposed
* - Stein
- Reproposed
@@ -0,0 +1,409 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
===================================
vRouter Hardware Offload Enablement
===================================
https://blueprints.launchpad.net/nova/+spec/vrouter-hw-offloads
SmartNICs allow complex packet processing on the NIC. In order to support
hardware acceleration for them, Nova core and os-vif need modifications to
support the combination of VIF and vRouter plugging these NICs require. This
spec proposes a hybrid SR-IOV and vRouter model to enable acceleration.
.. note:: In this spec, `Juniper Contrail`_, `OpenContrail`_ and
`Tungsten Fabric`_ will be used interchangeably.
Problem description
===================
SmartNICs are able to route packets directly to individual SR-IOV Virtual
Functions. These can be connected to instances using IOMMU (vfio-pci
passthrough) or a low-latency vhost-user `virtio-forwarder`_ running on the
compute node. The `vRouter packet processing pipeline`_ is managed by a
`Contrail Agent`_. If `Offload hooks in kernel vRouter`_ are present, then
datapath match/action rules can be fully offloaded to the SmartNIC instead of
executed on the hypervisor.
For a deeper discussion on datapath offloads, it is highly recommended
to read the `Generic os-vif datapath offloads spec`_.
The ``vrouter`` VIF type has not been converted to the os-vif plugin model.
This spec proposes completing the conversion to an os-vif plugin as the first
stage.
Currently, Nova supports multiple types of Contrail plugging: TAP plugs,
vhost-user socket plugs or VEB SR-IOV plugs. Neutron and the Contrail
controller decides what VIF type to pass to Nova based on the Neutron port
semantics and the configuration of the compute node. This VIF type is then
passed to Nova:
* The ``vrouter`` VIF type plugs a TAP device into the kernel vrouter.ko
datapath.
* The ``vhostuser`` VIF type with the ``vhostuser_vrouter_plug`` mode plugs
into the DPDK-based vRouter datapath.
* The ``hw_veb`` VIF type plugs a VM into the VEB datapath of a NIC using
vfio-pci passthrough.
In order to enable full datapath offloads for SmartNICs, Nova needs to support
additional VNIC types when plugging a VM with the ``vrouter`` VIF type, while
consuming a PCIe Virtual Function resource.
`Open vSwitch offloads`_ recognises the following VNIC types:
* The ``normal`` (or default) VNIC type indicates that the Instance is plugged
into the software bridge. The ``vrouter`` VIF type currently supports only
this VNIC type.
* The ``direct`` VNIC type indicates that a VF is passed through to the
Instance.
In addition, the Agilio OVS VIF type implements the following offload mode:
* The ``virtio-forwarder`` VNIC type indicates that a VF is attached via a
`virtio-forwarder`_.
Use Cases
---------
* Currently, an end user is able to attach a port to an Instance, running on a
hypervisor with support for plugging vRouter VIFs, by using one of the
following methods:
* Normal: Standard kernel based plugging, or vhost-user based plugging
depending on the datapath running on the hypervisor.
* Direct: PCI passthrough plugging into the VEB of an SR-IOV NIC.
* In addition, an end user should be able to attach a port to an Instance
running on a properly configured hypervisor, equipped with a SmartNIC, using
one of the following methods:
* Passthrough: Accelerated IOMMU passthrough to an offloaded vRouter
datapath, ideal for NFV-like applications.
* Virtio Forwarder: Accelerated vhost-user passthrough, maximum
software compatibility with standard virtio drivers and with support for
live migration.
* This enables Juniper, Tungsten Fabric (and partners like Netronome) to
achieve functional parity with the existing OVS VF Representor datapath
offloads for vRouter.
Proposed change
===============
* Stage 1: vRouter migration to os-vif.
* The `vRouter os-vif plugin`_ has been updated with the required code on the
master branch. Changes in Nova for this stage are gated on a release being
issued on that project in order to reflect the specific version required
in the release notes.
Progress on this task is tracked on the `vRouter os-vif conversion
blueprint`_.
* In ``nova/virt/libvirt/vif.py``:
Remove the Legacy vRouter config generation code,
``LibvirtGenericVIFDriver.get_config_vrouter()``, and migrate the plugging
code, ``LibvirtGenericVIFDriver.{plug,unplug}_vrouter()``, to an external
os-vif plugin.
For kernel-based plugging, VIFGeneric will be used.
* In ``privsep/libvirt.py``
Remove privsep code, ``{plug,unplug}_contrail_vif()``:
The call to ``vrouter-port-control`` will be migrated to the external
os-vif plugin, and further changes will be beyond the scope of Nova.
* Stage 2: Extend os-vif with better abstraction for representors.
os-vif's object model needs to be updated with a better abstraction model
to allow representors to be applicable to the ``vrouter`` datapath.
This stage will be covered by implementing the `Generic os-vif datapath
offloads spec`_.
* Stage 3: Extend the ``vrouter`` VIF type in Nova.
Modify ``_nova_to_osvif_vif_vrouter`` to support two additional VNIC types:
* ``VNIC_TYPE_DIRECT``: os-vif ``VIFHostDevice`` will be used.
* ``VNIC_TYPE_VIRTIO_FORWARDER``: os-vif ``VIFVHostUser`` will be used.
Code impact to Nova will be to pass through the representor information to
the os-vif plugin using the extensions developed in Stage 2.
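A hedged sketch of the Stage 3 dispatch described above, for the ``vrouter``
VIF type only (the real ``_nova_to_osvif_vif_vrouter`` code will differ in
detail):

.. code:: python

    # Assumes the os-vif object classes named in this spec.
    from os_vif.objects import vif as osv_vif

    def osvif_class_for_vrouter(vnic_type):
        if vnic_type == 'direct':
            return osv_vif.VIFHostDevice
        if vnic_type == 'virtio-forwarder':
            return osv_vif.VIFVHostUser
        return osv_vif.VIFGeneric   # 'normal' (default) VNIC type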
Summary of plugging methods
---------------------------
* Existing methods supported by Contrail:
* VIF type: ``hw_veb`` (legacy)
* VNIC type: ``direct``
* VIF type: ``vhostuser`` (os-vif plugin: ``contrail_vrouter``)
* VNIC type: ``normal``
* ``details: vhostuser_vrouter_plug: True``
* os-vif object ``VIFVHostUser``
* VIF type: ``vrouter`` (legacy)
* VNIC type: ``normal``
* After migration to os-vif (Stage 1):
* VIF type: ``hw_veb`` (legacy)
* VNIC type: ``direct``
* VIF type: ``vhostuser`` (os-vif plugin: ``contrail_vrouter``)
* VNIC type: ``normal``
* ``details: vhostuser_vrouter_plug: True``
* os-vif object: ``VIFVHostUser``
* VIF type: ``vrouter`` (os-vif plugin: ``vrouter``)
* VNIC type: ``normal``
* os-vif object: ``VIFGeneric``
* Additional accelerated plugging modes (Stage 3):
* VIF type: ``vrouter`` (os-vif plugin: ``vrouter``)
* VNIC type: ``direct``
* os-vif object: ``VIFHostDevice``
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
* VIF type: ``vrouter`` (os-vif plugin: ``vrouter``)
* VNIC type: ``virtio-forwarder``
* os-vif object: ``VIFVHostUser``
* ``port_profile.datapath_offload: DatapathOffloadRepresentor``
Additional notes
----------------
* Stage 1 and Stage 2 can be completed and verified in parallel. The
abstraction layer will be tested on the Open vSwitch offloads.
* Selecting between the VEB passthrough mode and the offloaded vRouter
datapath passthrough mode happens at the `Contrail Controller`_. This is
keyed on the provider network associated with the Neutron port.
* The `vRouter os-vif plugin`_ has been updated to adopt ``vrouter`` as the new
os-vif plugin name. ``contrail_vrouter``, is kept as a backwards compatible
alias. This prevents namespace fragmentation. `Tungsten Fabric`_,
`OpenContrail`_ and `Juniper Contrail`_ can use a single os-vif plugin
for the vRouter datapath.
* No corresponding changes in Neutron are expected. The Contrail Neutron
plugin and agent require minimal changes in order to allow the semantics
to propagate correctly.
* This change is agnostic to the SmartNIC datapath: should Contrail switch
to TC based offloads, eBPF or a third-party method, the Nova plugging
logic will remain the same for full offloads.
* A deployer/administrator still has to register the PCI devices on the
hypervisor with ``pci_passthrough_whitelist`` in ``nova.conf``.
* SmartNIC-enabled nodes and standard compute nodes can run side-by-side.
Standard scheduling filters allocate and place Instances according to port
types and driver capabilities.
Alternatives
------------
Alternatives proposed require much more invasive patches to Nova:
* Create a new VIF type:
* This would add three VIF types for Contrail to maintain. This is not
ideal.
* Add glance or flavor annotations:
* This would force an Instance to have one type of acceleration. Code would
possibly move out to more VIF types and Virtual Function reservation would
still need to be updated.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
os-vif plugins run with elevated privileges.
Notifications impact
--------------------
None
Other end user impact
---------------------
End users will be able to plug VIFs into Instances with either ``normal``,
``direct`` or ``virtio-forwarder`` VNIC types on hardware enabled Nova nodes
running Contrail.
Performance Impact
------------------
This code is likely to be called at VIF plugging and unplugging. Performance
is not expected to regress.
On accelerated ports, dataplane performance between Instances is expected to
increase.
Other deployer impact
---------------------
A deployer would still need to configure the SmartNIC components of Contrail
and configure the PCI whitelist in Nova at deployment. This would not require
core OpenStack changes.
Developer impact
----------------
Core Nova semantics will be slightly changed. ``vrouter`` VIFs will support
more VNIC types.
Upgrade impact
--------------
New VNIC type semantics will be available on compute nodes with this patch.
A deployer would be mandated to install the os-vif plugin to retain existing
functionality in Nova. This is expected to be handled by minimum required
versions in Contrail.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Jan Gutter <jan.gutter@netronome.com>
Work Items
----------
* contrail-controller review implementing the semantics has been merged and is
awaiting a release tag:
https://review.opencontrail.org/42850
* The OpenContrail os-vif reference plugin has been updated and is awaiting a
release tag:
https://review.opencontrail.org/43399
* Stage 1: os-vif porting for vRouter VIF has been submitted:
https://review.openstack.org/571325
* Stage 2: `Generic os-vif datapath offloads spec`_ needs to be implemented.
* Stage 3: The OpenContrail os-vif reference plugin needs to be amended with
the interfaces added to os-vif in Stage 2.
* Stage 3: The ``vrouter`` VNIC support needs to be added in Nova:
https://review.openstack.org/572082
Dependencies
============
The following dependencies on Tungsten Fabric have been merged on the master
branch and are awaiting a release tag:
* The Contrail/Tungsten Fabric controller required minor updates to enable the
proposed semantics. This was merged in:
https://review.opencontrail.org/42850
* The os-vif reference plugin has been updated in:
https://review.opencontrail.org/43399
The following items can occur in parallel:
* os-vif extensions for accelerated datapath plugin modes need to be released.
Consult the `Generic os-vif datapath offloads spec`_ for more details. The
os-vif library update is planned for the Stein release.
* Pending release tags on the Contrail os-vif plugin, the `vRouter os-vif
conversion blueprint`_ can be completed. This is currently planned for the
Tungsten Fabric 5.1 release.
Once both of the preceding tasks have been implemented, the following items
can occur in parallel:
* Nova can implement the VNIC support for the ``contrail`` os-vif plugin.
* The ``contrail`` os-vif plugin can be updated to use the new os-vif
interfaces.
Testing
=======
* Unit tests have been refreshed and now cover the VIF operations more
completely.
* Third-party CI testing will be necessary to validate the Contrail and
Tungsten Fabric compatibility.
Documentation Impact
====================
Since this spec affects a non-reference Neutron plugin, a release note in Nova
should suffice. Specific versions of Contrail / Tungsten Fabric need to be
mentioned when a new plugin is required to provide existing functionality. The
external documentation to configure and use the new plugging modes should be
driven from the Contrail / Tungsten Fabric side.
References
==========
* `Juniper Contrail`_
* `OpenContrail`_
* `Tungsten Fabric`_
* `virtio-forwarder`_
* `vRouter packet processing pipeline`_
* `Offload hooks in kernel vRouter`_
* `Open vSwitch offloads`_
* `Generic os-vif datapath offloads spec`_
* `Contrail Agent`_
* `Contrail Controller`_
* `vRouter os-vif plugin`_
* `vRouter os-vif conversion blueprint`_
* `Contrail Controller to Neutron translation unit`_
* `Nova review implementing offloads for legacy plugging <https://review.openstack.org/567147>`_
(this review serves as an example and has been obsoleted)
.. _`Juniper Contrail`: https://www.juniper.net/us/en/products-services/sdn/contrail/
.. _`OpenContrail`: http://www.opencontrail.org/
.. _`Tungsten Fabric`: https://tungsten.io/
.. _`virtio-forwarder`: http://virtio-forwarder.readthedocs.io/en/latest/
.. _`vRouter packet processing pipeline`: https://github.com/Juniper/contrail-vrouter
.. _`Offload hooks in kernel vRouter`: https://github.com/Juniper/contrail-vrouter/blob/R4.1/include/vr_offloads.h
.. _`Open vSwitch offloads`: https://docs.openstack.org/neutron/queens/admin/config-ovs-offload.html
.. _`Contrail Agent`: https://github.com/Juniper/contrail-controller/tree/R4.1/src/vnsw/agent
.. _`Contrail Controller`: https://github.com/Juniper/contrail-controller
.. _`vRouter os-vif plugin`: https://github.com/Juniper/contrail-nova-vif-driver/blob/master/vif_plug_vrouter/
.. _`Generic os-vif datapath offloads spec`: https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/generic-os-vif-offloads.html
.. _`vRouter os-vif conversion blueprint`: https://blueprints.launchpad.net/nova/+spec/vrouter-os-vif-conversion
.. _`Contrail Controller to Neutron translation unit`: https://github.com/Juniper/contrail-controller/blob/R4.1/src/config/api-server/vnc_cfg_types.py