From 11c5cca922fae4f31ce1d0ba391007a4b6a95b9b Mon Sep 17 00:00:00 2001
From: Matt Riedemann
Date: Wed, 8 May 2019 16:28:30 -0400
Subject: [PATCH] Spec to pre-filter disabled computes with placement

This blueprint proposes to make nova report a trait to placement when a
compute service is disabled and to add a request filter in the scheduler
which will use that trait to filter out allocation candidates with that
forbidden trait.

Spec for blueprint pre-filter-disabled-computes

Change-Id: Ia8252028c04b93bbcc783a30baf7d282269dcbc5
---
 .../approved/pre-filter-disabled-computes.rst | 354 ++++++++++++++++++
 1 file changed, 354 insertions(+)
 create mode 100644 specs/train/approved/pre-filter-disabled-computes.rst

diff --git a/specs/train/approved/pre-filter-disabled-computes.rst b/specs/train/approved/pre-filter-disabled-computes.rst
new file mode 100644
index 000000000..e445dd998
--- /dev/null
+++ b/specs/train/approved/pre-filter-disabled-computes.rst
@@ -0,0 +1,354 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
Pre-filter disabled computes
============================

https://blueprints.launchpad.net/nova/+spec/pre-filter-disabled-computes

This blueprint proposes to make nova report a trait to placement when a
compute service is disabled and to add a request filter in the scheduler
which will use that trait to filter out allocation candidates with that
forbidden trait.


Problem description
===================

In a large deployment with several thousand compute nodes, the
``[scheduler]/max_placement_results`` configuration option may be limited
such that placement returns allocation candidates which are mostly (or all)
disabled compute nodes, which can lead to a NoValidHost error during
scheduling.

Use Cases
---------

As an operator, I want to limit ``max_placement_results`` to improve scheduler
throughput but not suffer NoValidHost errors because placement only gives
back disabled computes.

As a developer, I want to pre-filter disabled computes in placement, which
should be faster (in SQL) than the legacy ``ComputeFilter`` running over the
results in python. In other words, I want to ask placement better questions
to get back more targeted results.

As a user, I want to be able to create and resize servers without hitting
NoValidHost errors because the cloud is performing a rolling upgrade and has
disabled computes.

Proposed change
===============

Summary
-------

Nova will start reporting a ``COMPUTE_STATUS_DISABLED`` trait to placement
for any compute node resource provider managed by a disabled compute service
host. When the service is enabled, the trait will be removed.

A scheduler `request filter`_ will be added which will modify the RequestSpec
to filter out providers with the new trait using `forbidden trait`_ filtering
syntax.

.. _request filter: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/scheduler/request_filter.py
.. _forbidden trait: https://docs.openstack.org/nova/latest/user/flavors.html#extra-specs-forbidden-traits

Compute changes
---------------

For the compute service there are two changes.

set_host_enabled
~~~~~~~~~~~~~~~~

The compute service already has a `set_host_enabled`_ method which is a
synchronous RPC call. Historically this was only implemented by the `xenapi
driver`_ for use with the (now deprecated) `Update Host Status API`_.

This blueprint proposes to use that compute method to generically add/remove
the ``COMPUTE_STATUS_DISABLED`` trait on the compute nodes managed by that
service (note that for ironic a compute service host can manage multiple
nodes). The trait will be managed on only the root compute node resource
provider in placement, not any nested providers.
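
As a rough illustration only, the reconciliation the compute service would
perform amounts to adding or removing the trait from the set reported for
each root provider. The helper below is a sketch for this spec, not the
final implementation; how the traits are read from and written back to
placement (via the report client) is intentionally left out:

.. code-block:: python

    # Sketch only: the helper name is an illustrative assumption for this
    # spec, and the trait does not exist in os-traits yet.
    COMPUTE_STATUS_DISABLED = 'COMPUTE_STATUS_DISABLED'

    def reconcile_disabled_trait(current_traits, service_disabled):
        """Return the traits a root compute node provider should report."""
        traits = set(current_traits)
        if service_disabled:
            traits.add(COMPUTE_STATUS_DISABLED)
        else:
            traits.discard(COMPUTE_STATUS_DISABLED)
        return traits

The same reconciliation applies whether the change is driven by the
``set_host_enabled`` RPC call or by the periodic ``update_provider_tree``
sync described below.
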
The actual implementation will be part of the `ComputeVirtAPI`_ so that
the libvirt driver has access to it when it automatically disables or enables
the compute node based on events from the hypervisor. [1]_

update_provider_tree
~~~~~~~~~~~~~~~~~~~~

During the ``update_available_resource`` operation which is called during
service start and periodically, the `update_provider_tree`_ flow will sync
the ``COMPUTE_STATUS_DISABLED`` trait based on the current disabled status
of the service. This is useful to:

1. Sync the trait on older disabled computes during the upgrade.
2. Sync the trait in case the API<>compute interaction fails for some reason,
   like a dropped RPC call.

API changes
-----------

When the `os-services`_ API(s) are used to enable or disable a compute service,
the API will synchronously call the compute service via the
``set_host_enabled`` RPC method to reflect the trait on the
related compute node resource providers in placement appropriately. For
example, if compute service A is disabled, the trait will be added. When
compute service A is enabled, the trait will be removed.

See the `Upgrade impact`_ section for dealing with old computes during a
rolling upgrade.

Down computes
~~~~~~~~~~~~~

It is possible to disable a down compute service since currently that disable
operation is just updating the ``services.disabled`` value in the cell
database. With this change, the API will have to check if the compute service
is up using the `service group API`_. If the service is down, the API will not
call the ``set_host_enabled`` compute method and instead just update the
``services.disabled`` value in the DB as today and return. When the compute
service is restarted, the `update_provider_tree`_ flow will sync the trait.

Scheduler changes
-----------------

A request filter will be added which will modify the RequestSpec to forbid
providers with the ``COMPUTE_STATUS_DISABLED`` trait. The changes to the
RequestSpec will not be persisted.

There will *not* be a new configuration option for the request filter, meaning
it will always be enabled.

.. note:: In addition to filtering based on the disabled status of a node,
   the ``ComputeFilter`` also performs an `is_up check`_ using the
   service group API. The result of the "is up" check depends on whether
   or not the service was `forced down`_ or has not "reported in" within
   some configurable interval meaning the service might be down. This
   blueprint is *not* going to try to report the up/down status of a
   compute service using the new trait since it gets fairly complicated
   and is more of an edge case for unexpected outages.

.. _set_host_enabled: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/compute/rpcapi.py#L891
.. _xenapi driver: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/virt/xenapi/host.py#L121
.. _Update Host Status API: https://developer.openstack.org/api-ref/compute/?expanded=#update-host-status
.. _ComputeVirtAPI: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/compute/manager.py#L419
.. _update_provider_tree: https://docs.openstack.org/nova/latest/reference/update-provider-tree.html
.. _os-services: https://developer.openstack.org/api-ref/compute/#compute-services-os-services
.. _service group API: https://docs.openstack.org/nova/latest/admin/service-groups.html
.. _is_up check: https://opendev.org/openstack/nova/src/tag/19.0.0/nova/scheduler/filters/compute_filter.py#L44
.. _forced down: https://developer.openstack.org/api-ref/compute/#update-forced-down
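
To illustrate the request filter described above, a minimal sketch is shown
below. It follows the general shape of the existing filters in
``request_filter.py`` and the documented ``trait:<name>=forbidden`` extra
spec syntax, but the function name and exact field handling are assumptions
rather than the final implementation:

.. code-block:: python

    def compute_status_filter(ctxt, request_spec):
        """Forbid allocation candidates from disabled compute providers."""
        trait_name = 'COMPUTE_STATUS_DISABLED'
        # Add the forbidden trait to the in-memory flavor so the resulting
        # GET /allocation_candidates query excludes providers with the trait.
        request_spec.flavor.extra_specs['trait:%s' % trait_name] = 'forbidden'
        # Reset the change tracking so the modified flavor is never persisted
        # with the RequestSpec.
        request_spec.obj_reset_changes(fields=['flavor'], recursive=True)
        return True

Since there is no configuration option, the filter would run unconditionally
for every scheduling request.
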
Alternatives
------------

1. Rather than using a forbidden trait, we could hard-code a resource provider
   aggregate UUID in nova and add/remove compute node resource providers
   to/from that aggregate in placement as the service is disabled/enabled.

   * Pros: Aggregates may be more natural since they are a grouping of
     providers.

   * Cons: Using an aggregate would be harder to debug from an operational
     perspective since provider aggregates do not have any name or metadata,
     so an operator might wonder why a certain provider is not a candidate
     for scheduling but is in an aggregate they did not create (or do not
     see in the nova host aggregates API). Using a trait per provider with
     a clear name like ``COMPUTE_STATUS_DISABLED`` should make it obvious
     to a human that the provider is not a scheduling candidate because it
     is disabled.

2. Rather than using a forbidden trait or aggregate, nova could set the
   reserved inventory on each provider equal to the total inventory for each
   resource class on that provider, like what the ironic driver does when a
   node is undergoing maintenance and should be taken out of scheduling
   consideration. [2]_

   * Pros: No new traits, can just follow the ironic driver pattern.

   * Cons: Ironic node resource providers are expected to have a single
     resource class in inventory so it is easier to manage changing the
     reserved value on just that one class, but non-baremetal providers
     report at least three resource classes (VCPU, MEMORY_MB and
     DISK_GB) so it would be more complicated to set reserved = total on all
     of those classes. Furthermore, changing the inventory is not configurable
     like a request filter is.

   Long-term, we could consider changing the ironic driver node maintenance
   code to just set/unset the ``COMPUTE_STATUS_DISABLED`` trait.

3. Rather than the ``os-services`` API synchronously calling the
   ``set_host_enabled`` method on the compute service, the API could just
   toggle the trait on the affected providers directly.

   * Pros: No blocking calls from the API to the compute service when changing
     the disabled status of the service - although one could argue the blocking
     nature proposed in the spec is advantageous so the admin gets confirmation
     that the service is disabled and will be pre-filtered properly during
     scheduling.

   * Cons: Potential duplication of the code that manages the trait which could
     violate the principle of single responsibility.

4. Do nothing and instead focus efforts on optimizing the performance of the
   nova scheduler, which is likely the root cause that large deployments need
   to severely limit ``max_placement_results`` [3]_. However, regardless of
   optimizing the scheduler (which is something we should do anyway), part of
   making scheduling faster in nova is dependent on nova asking placement
   more informed questions and placement providing a smaller set of allocation
   candidates, i.e. filter in SQL (placement) rather than in python (nova);
   see the example request below.
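
For reference, with the pre-filter in place the allocation candidates request
to placement would carry the forbidden trait along these lines (the resource
amounts are illustrative and a placement microversion supporting forbidden
traits is assumed)::

    GET /allocation_candidates?resources=VCPU:1,MEMORY_MB:2048,DISK_GB:20&required=!COMPUTE_STATUS_DISABLED
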

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None. Operators can use the `osc-placement`_ CLI to view and manage provider
traits directly.

.. _osc-placement: https://docs.openstack.org/osc-placement/latest/index.html

Performance Impact
------------------

In one respect this should improve scheduler performance during an upgrade
or maintenance of a large cloud which has many disabled compute services
since placement would be returning fewer allocation candidates for the nova
scheduler to filter.

On the other hand, this would add overhead to the ``os-services`` API when
changing the disabled status on a compute service.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

There are a few upgrade considerations for this change.

1. The API will check the RPC API version of the target compute service and if
   it is old the ``set_host_enabled`` method will not be called. When the
   compute service is upgraded and restarted, the ``update_provider_tree`` call
   will sync the trait.

2. Existing disabled computes need to have the trait reported
   on upgrade, which will happen via the ``update_available_resource`` flow
   (update_provider_tree) called on start of the compute after it is upgraded.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Matt Riedemann (mriedem)

Other contributors:
  None

Work Items
----------

* Make the changes to the compute service:

  * The ``set_host_enabled`` method
  * The ``update_provider_tree`` flow
  * The libvirt driver callback to add/remove the trait when it is notified
    of the hypervisor going down or up

* Plumb the ``os-services`` API to call the ``set_host_enabled`` compute
  service method when the disabled status changes on a compute service.

* Add a request filter which will add a forbidden trait to the
  RequestSpec to filter out disabled compute node resource providers during
  the GET /allocation_candidates call to placement.


Dependencies
============

The ``COMPUTE_STATUS_DISABLED`` trait would need to be added to the
`os-traits`_ library.

.. _os-traits: https://docs.openstack.org/os-traits/latest/


Testing
=======

Unit and functional tests should be sufficient for this feature.


Documentation Impact
====================

The new scheduler request filter will be documented in the admin docs. [4]_

References
==========

Footnotes
---------

.. [1] https://opendev.org/openstack/nova/src/tag/19.0.0/nova/virt/libvirt/driver.py#L3802

.. [2] https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/allow-reserved-equal-total-inventory.html

.. [3] https://bugs.launchpad.net/nova/+bug/1737465

.. [4] https://docs.openstack.org/nova/latest/admin/configuration/schedulers.html

Other
-----

* The original bug reported by CERN: https://bugs.launchpad.net/nova/+bug/1805984

* Initial proof of concept: https://review.opendev.org/654596/

* Train PTG mailing list mention: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005908.html


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Train
     - Introduced