From 11c5cca922fae4f31ce1d0ba391007a4b6a95b9b Mon Sep 17 00:00:00 2001
From: Matt Riedemann
Date: Wed, 8 May 2019 16:28:30 -0400
Subject: [PATCH] Spec to pre-filter disabled computes with placement

This blueprint proposes to make nova report a trait to placement when a
compute service is disabled and to add a request filter in the scheduler
which will use that trait to filter out allocation candidates with that
forbidden trait.

Spec for blueprint pre-filter-disabled-computes

Change-Id: Ia8252028c04b93bbcc783a30baf7d282269dcbc5
---
 .../approved/pre-filter-disabled-computes.rst | 354 ++++++++++++++++++
 1 file changed, 354 insertions(+)
 create mode 100644 specs/train/approved/pre-filter-disabled-computes.rst

diff --git a/specs/train/approved/pre-filter-disabled-computes.rst b/specs/train/approved/pre-filter-disabled-computes.rst
new file mode 100644
index 000000000..e445dd998
--- /dev/null
+++ b/specs/train/approved/pre-filter-disabled-computes.rst
@@ -0,0 +1,354 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
Pre-filter disabled computes
============================

https://blueprints.launchpad.net/nova/+spec/pre-filter-disabled-computes

This blueprint proposes to make nova report a trait to placement when a
compute service is disabled and to add a request filter in the scheduler
which will use that trait to filter out allocation candidates with that
forbidden trait.


Problem description
===================

In a large deployment with several thousand compute nodes, the
``[scheduler]/max_placement_results`` configuration option may be limited
such that placement returns allocation candidates which are mostly (or all)
disabled compute nodes, which can lead to a NoValidHost error during
scheduling.

Use Cases
---------

As an operator, I want to limit ``max_placement_results`` to improve scheduler
throughput but not suffer NoValidHost errors because placement only gives
back disabled computes.

As a developer, I want to pre-filter disabled computes in placement, which
should be faster (in SQL) than the legacy ``ComputeFilter`` running over the
results in python. In other words, I want to ask placement better questions
to get back more targeted results.

As a user, I want to be able to create and resize servers without hitting
NoValidHost errors because the cloud is performing a rolling upgrade and has
disabled computes.

Proposed change
===============

Summary
-------

Nova will start reporting a ``COMPUTE_STATUS_DISABLED`` trait to placement
for any compute node resource provider managed by a disabled compute service
host. When the service is enabled, the trait will be removed.

A scheduler `request filter`_ will be added which will modify the RequestSpec
to filter out providers with the new trait using `forbidden trait`_ filtering
syntax.

.. _request filter: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/scheduler/request_filter.py
.. _forbidden trait: https://docs.openstack.org/nova/latest/user/flavors.html#extra-specs-forbidden-traits

Compute changes
---------------

For the compute service there are two changes.

set_host_enabled
~~~~~~~~~~~~~~~~

The compute service already has a `set_host_enabled`_ method which is a
synchronous RPC call. Historically this was only implemented by the `xenapi
driver`_ for use with the (now deprecated) `Update Host Status API`_.

This blueprint proposes to use that compute method to generically add/remove
the ``COMPUTE_STATUS_DISABLED`` trait on the compute nodes managed by that
service (note that for ironic a compute service host can manage multiple
nodes). The trait will be managed on only the root compute node resource
provider in placement, not any nested providers.
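
As a rough illustration only, the reconciliation the compute service would
perform amounts to adding or removing the trait from the set reported for
each root provider. The helper below is a sketch for this spec, not the
final implementation; how the traits are read from and written back to
placement (via the report client) is intentionally left out:

.. code-block:: python

    # Sketch only: the helper name is an illustrative assumption for this
    # spec, and the trait does not exist in os-traits yet.
    COMPUTE_STATUS_DISABLED = 'COMPUTE_STATUS_DISABLED'

    def reconcile_disabled_trait(current_traits, service_disabled):
        """Return the traits a root compute node provider should report."""
        traits = set(current_traits)
        if service_disabled:
            traits.add(COMPUTE_STATUS_DISABLED)
        else:
            traits.discard(COMPUTE_STATUS_DISABLED)
        return traits

The same reconciliation applies whether the change is driven by the
``set_host_enabled`` RPC call or by the periodic ``update_provider_tree``
sync described below.
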
The actual implementation will be part of the `ComputeVirtAPI`_ so that
the libvirt driver has access to it when it automatically disables or enables
the compute node based on events from the hypervisor. [1]_

update_provider_tree
~~~~~~~~~~~~~~~~~~~~

During the ``update_available_resource`` operation which is called during
service start and periodically, the `update_provider_tree`_ flow will sync
the ``COMPUTE_STATUS_DISABLED`` trait based on the current disabled status
of the service. This is useful to:

1. Sync the trait on older disabled computes during the upgrade.
2. Sync the trait in case the API<>compute interaction fails for some reason,
   like a dropped RPC call.

API changes
-----------

When the `os-services`_ API(s) are used to enable or disable a compute service,
the API will synchronously call the compute service via the
``set_host_enabled`` RPC method to reflect the trait on the
related compute node resource providers in placement appropriately. For
example, if compute service A is disabled, the trait will be added. When
compute service A is enabled, the trait will be removed.

See the `Upgrade impact`_ section for dealing with old computes during a
rolling upgrade.

Down computes
~~~~~~~~~~~~~

It is possible to disable a down compute service since currently that disable
operation is just updating the ``services.disabled`` value in the cell
database. With this change, the API will have to check if the compute service
is up using the `service group API`_. If the service is down, the API will not
call the ``set_host_enabled`` compute method and instead just update the
``services.disabled`` value in the DB as today and return. When the compute
service is restarted, the `update_provider_tree`_ flow will sync the trait.

Scheduler changes
-----------------

A request filter will be added which will modify the RequestSpec to forbid
providers with the ``COMPUTE_STATUS_DISABLED`` trait. The changes to the
RequestSpec will not be persisted.

There will *not* be a new configuration option for the request filter, meaning
it will always be enabled.

.. note:: In addition to filtering based on the disabled status of a node,
   the ``ComputeFilter`` also performs an `is_up check`_ using the
   service group API. The result of the "is up" check depends on whether
   or not the service was `forced down`_ or has not "reported in" within
   some configurable interval meaning the service might be down. This
   blueprint is *not* going to try to report the up/down status of a
   compute service using the new trait since it gets fairly complicated
   and is more of an edge case for unexpected outages.

.. _set_host_enabled: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/compute/rpcapi.py#L891
.. _xenapi driver: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/virt/xenapi/host.py#L121
.. _Update Host Status API: https://developer.openstack.org/api-ref/compute/?expanded=#update-host-status
.. _ComputeVirtAPI: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/compute/manager.py#L419
.. _update_provider_tree: https://docs.openstack.org/nova/latest/reference/update-provider-tree.html
.. _os-services: https://developer.openstack.org/api-ref/compute/#compute-services-os-services
.. _service group API: https://docs.openstack.org/nova/latest/admin/service-groups.html
.. _is_up check: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/scheduler/filters/compute_filter.py#L44
.. _forced down: https://developer.openstack.org/api-ref/compute/#update-forced-down
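
To illustrate the request filter described above, a minimal sketch is shown
below. It follows the general shape of the existing filters in
``request_filter.py`` and the documented ``trait:<name>=forbidden`` extra
spec syntax, but the function name and exact field handling are assumptions
rather than the final implementation:

.. code-block:: python

    def compute_status_filter(ctxt, request_spec):
        """Forbid allocation candidates from disabled compute providers."""
        trait_name = 'COMPUTE_STATUS_DISABLED'
        # Add the forbidden trait to the in-memory flavor so the resulting
        # GET /allocation_candidates query excludes providers with the trait.
        request_spec.flavor.extra_specs['trait:%s' % trait_name] = 'forbidden'
        # Reset the change tracking so the modified flavor is never persisted
        # with the RequestSpec.
        request_spec.obj_reset_changes(fields=['flavor'], recursive=True)
        return True

Since there is no configuration option, the filter would run unconditionally
for every scheduling request.
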
Alternatives
------------

1. Rather than using a forbidden trait, we could hard-code a resource provider
   aggregate UUID in nova and add/remove compute node resource providers
   to/from that aggregate in placement as the service is disabled/enabled.

   * Pros: Aggregates may be more natural since they are a grouping of
     providers.

   * Cons: Using an aggregate would be harder to debug from an operational
     perspective since provider aggregates do not have any name or metadata,
     so an operator might wonder why a certain provider is not a candidate
     for scheduling but is in an aggregate they did not create (or do not
     see in the nova host aggregates API). Using a trait per provider with
     a clear name like ``COMPUTE_STATUS_DISABLED`` should make it obvious
     to a human that the provider is not a scheduling candidate because it
     is disabled.

2. Rather than using a forbidden trait or aggregate, nova could set the
   reserved inventory on each provider equal to the total inventory for each
   resource class on that provider, like what the ironic driver does when a
   node is undergoing maintenance and should be taken out of scheduling
   consideration. [2]_

   * Pros: No new traits, can just follow the ironic driver pattern.

   * Cons: Ironic node resource providers are expected to have a single
     resource class in inventory so it is easier to manage changing the
     reserved value on just that one class, but non-baremetal providers
     report at least three resource classes (VCPU, MEMORY_MB and
     DISK_GB) so it would be more complicated to set reserved = total on all
     of those classes. Furthermore, changing the inventory is not configurable
     like a request filter is.

   Long-term, we could consider changing the ironic driver node maintenance
   code to just set/unset the ``COMPUTE_STATUS_DISABLED`` trait.

3. Rather than the ``os-services`` API synchronously calling the
   ``set_host_enabled`` method on the compute service, the API could just
   toggle the trait on the affected providers directly.

   * Pros: No blocking calls from the API to the compute service when changing
     the disabled status of the service - although one could argue the blocking
     nature proposed in the spec is advantageous so the admin gets confirmation
     that the service is disabled and will be pre-filtered properly during
     scheduling.

   * Cons: Potential duplication of the code that manages the trait which could
     violate the principle of single responsibility.

4. Do nothing and instead focus efforts on optimizing the performance of the
   nova scheduler, which is likely the root cause that large deployments need
   to severely limit ``max_placement_results`` [3]_. However, regardless of
   optimizing the scheduler (which is something we should do anyway), part of
   making scheduling faster in nova is dependent on nova asking placement
   more informed questions and placement providing a smaller set of allocation
   candidates, i.e. filter in SQL (placement) rather than in python (nova);
   see the example request below.
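
For reference, with the pre-filter in place the allocation candidates request
to placement would carry the forbidden trait along these lines (the resource
amounts are illustrative and a placement microversion supporting forbidden
traits is assumed)::

    GET /allocation_candidates?resources=VCPU:1,MEMORY_MB:2048,DISK_GB:20&required=!COMPUTE_STATUS_DISABLED
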

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None. Operators can use the `osc-placement`_ CLI to view and manage provider
traits directly.

.. _osc-placement: https://docs.openstack.org/osc-placement/latest/index.html

Performance Impact
------------------

In one respect this should improve scheduler performance during an upgrade
or maintenance of a large cloud which has many disabled compute services
since placement would be returning fewer allocation candidates for the nova
scheduler to filter.

On the other hand, this would add overhead to the ``os-services`` API when
changing the disabled status on a compute service.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

There are a few upgrade considerations for this change.

1. The API will check the RPC API version of the target compute service and if
   it is old the ``set_host_enabled`` method will not be called. When the
   compute service is upgraded and restarted, the ``update_provider_tree`` call
   will sync the trait.

2. Existing disabled computes need to have the trait reported
   on upgrade, which will happen via the ``update_available_resource`` flow
   (update_provider_tree) called on start of the compute after it is upgraded.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Matt Riedemann (mriedem)

Other contributors:
  None

Work Items
----------

* Make the changes to the compute service:

  * The ``set_host_enabled`` method
  * The ``update_provider_tree`` flow
  * The libvirt driver callback to add/remove the trait when it is notified
    of the hypervisor going down or up

* Plumb the ``os-services`` API to call the ``set_host_enabled`` compute
  service method when the disabled status changes on a compute service.

* Add a request filter which will add a forbidden trait to the
  RequestSpec to filter out disabled compute node resource providers during
  the GET /allocation_candidates call to placement.


Dependencies
============

The ``COMPUTE_STATUS_DISABLED`` trait would need to be added to the
`os-traits`_ library.

.. _os-traits: https://docs.openstack.org/os-traits/latest/


Testing
=======

Unit and functional tests should be sufficient for this feature.


Documentation Impact
====================

The new scheduler request filter will be documented in the admin docs. [4]_

References
==========

Footnotes
---------

.. [1] https://opendev.org/openstack/nova/src/tag/19.0.0/nova/virt/libvirt/driver.py#L3802

.. [2] https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/allow-reserved-equal-total-inventory.html

.. [3] https://bugs.launchpad.net/nova/+bug/1737465

.. [4] https://docs.openstack.org/nova/latest/admin/configuration/schedulers.html

Other
-----

* The original bug reported by CERN: https://bugs.launchpad.net/nova/+bug/1805984

* Initial proof of concept: https://review.opendev.org/654596/

* Train PTG mailing list mention: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005908.html


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Train
     - Introduced