Return Alternate Hosts

Sometimes when a request to build or resize a VM is attempted, the build
can fail for a variety of reasons. At recent PTG and Forum events we
discussed a request from operators to still be able to retry failed
builds. With the cells V2 design, though, the communication from the
cell conductor to start the retry process will no longer be possible,
as cells cannot call up to the API level. We propose to have the
scheduler return some alternate hosts along with the selected host, so
that in the event of a failed build or resize, another host can be
tried without having to go through the entire scheduling process again.

Blueprint: return-alternate-hosts

Change-Id: I3b6fcfa67425c0865e174c5a921fbfa9596c76d9
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
======================
Return Alternate Hosts
======================
https://blueprints.launchpad.net/nova/+spec/return-alternate-hosts
Sometimes when a request to build a VM is attempted, the build can fail for a
variety of reasons. At recent PTGs and Forums we discussed a request from
operators to have the scheduler return some alternate hosts along with the
selected host, so that in the event of a failed build, another host could be
tried without having to go through the entire scheduling process again.
Problem description
===================
When a request to build a VM is received, a suitable host must be found. This
selection process can take a non-trivial amount of time. Occasionally the build
of an instance on a host fails, for any of a number of reasons. When that
happens, the process has to start all over again, but because the failure
happens in the cell, and cells cannot call back up to the API layer where the
scheduler lives, there is no way to restart it. Operators stated that they
wanted to preserve the ability to retry a failed build, but the design of the
current retry system doesn't work in a cells V2 world, as it would require an
up call from the cell conductor to the superconductor to request a retry.
Similarly, resize operations currently also need to call up to the
superconductor in order to retry a failed resize.
Use Cases
---------
As an operator of an OpenStack deployment, I want to ensure that both VM and
Ironic builds and resizes are successful as often as possible, and take as
little time as possible.
Proposed change
===============
We propose to have the scheduler's select_destinations() return N hosts per
requested instance, where N is the value in CONF.scheduler.max_attempts.
When all the hosts for a request are successfully claimed, the scheduler will
scan the remaining sorted list of hosts to find additional hosts in the same
cell as each of the selected hosts, until the total number of hosts for each
requested instance equals the configured amount, or the list of hosts is
exhausted. This means that even if an operator configures their deployment for,
say, 5 max_attempts, fewer than that may be returned if there are not a
sufficient number of qualified hosts.
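As a minimal sketch of that scan, with names that are illustrative assumptions
(``_get_alternates``, ``cell_uuid``, and the shape of the sorted host list are
not the final implementation):

.. code-block:: python

    def _get_alternates(selected, sorted_hosts, max_attempts):
        """Return the selected host plus same-cell alternates.

        ``selected`` is the host already claimed for one requested
        instance; ``sorted_hosts`` is the remaining weighed, sorted
        host list from the scheduling pass.
        """
        selections = [selected]
        for host in sorted_hosts:
            if len(selections) >= max_attempts:
                break
            # Alternates must be in the same cell as the selected host,
            # because the retry happens in that cell's conductor.
            if host.cell_uuid == selected.cell_uuid and host is not selected:
                selections.append(host)
        # Fewer than max_attempts entries are returned when the cell
        # does not have enough qualified hosts.
        return selections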
The RPC interfaces between conductor, scheduler, and the cell computes will
have to be changed to reflect this modification.
After calling the scheduler's select_destinations(), the superconductor will
have a list of one or more Selection objects for each requested instance. It
will process each instance in turn, as it does today. The difference is that
for each instance, it will pop the first Selection object from the list, and
use that to determine the host to which it will cast the call to
build_and_run_instance(). This RPC cast will have to be changed to add the list
of remaining Selection objects as an additional parameter.
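A rough sketch of that dispatch step, with assumed names for the Selection
fields and the new RPC parameter:

.. code-block:: python

    def _build_instances(compute_rpcapi, context, instances, host_lists):
        # One list of Selection objects per requested instance.
        for instance, host_list in zip(instances, host_lists):
            # The first Selection is the host whose resources the
            # scheduler has already claimed; the rest are alternates.
            selection = host_list.pop(0)
            compute_rpcapi.build_and_run_instance(
                context, instance, host=selection.service_host,
                # New RPC parameter carrying the remaining alternates
                # for use by the cell conductor on a failed build.
                host_list=host_list)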
The compute will not use the list of Selection objects in any way; all the
information it needs to build the instance is contained in the current
parameters. If the build succeeds, the process ends. If, however, the build
fails, compute will call its delete_allocation_for_instance() as it currently
does, and then call the ComputeTaskAPI's build_instances() to perform the
retry. This call will be modified to pass the Selection object list back to
the conductor. The conductor will then inspect the list of Selection objects:
if it is empty, then all possible retries have been exhausted, and the process
stops. Otherwise, the conductor pops the first Selection object, and the
process repeats until either the build is successful, or all hosts have failed.
The only difference during the retries is that the conductor will first have to
verify that the host a Selection object represents still has sufficient
resources for the instance, by calling Placement to claim them using the value
in the Selection object's ``allocation`` field. An empty ``allocation`` field
marks the initially selected host, whose resources have already been claimed.
A populated field means that we are in a retry, so the conductor will first
attempt to claim the resources with that value; if the claim fails, that
Selection object is discarded, and the next is popped from the list.
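Putting that retry loop together, roughly (``claim_resources`` stands in for
the Placement claim call, and all names here are illustrative):

.. code-block:: python

    def _retry_build(context, instance, host_list, claim_resources,
                     cast_build):
        while host_list:
            selection = host_list.pop(0)
            if selection.allocation:
                # A populated allocation marks an alternate host, so its
                # resources must be claimed in Placement first; an empty
                # allocation means the scheduler already claimed the
                # initially selected host.
                if not claim_resources(context, instance,
                                       selection.allocation):
                    # Claim failed: discard and try the next alternate.
                    continue
            cast_build(context, instance, selection, host_list)
            return
        # The list is empty: all possible retries have been exhausted.
        # RuntimeError is a stand-in for whatever exception Nova raises.
        raise RuntimeError('Exceeded maximum number of build attempts '
                           'for instance %s' % instance.uuid)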
The logic flow for resize operations can be similarly modified to allow for
retries within the cell, too. Live migrations, in contrast, have a retry
process that is handled in the superconductor, so it will only need to be
modified to work with the new values returned from select_destinations().
Note that in the event of a burst of requests for similarly-sized instances,
the list of alternates returned for each request will likely have some overlap.
If retries become necessary, the earlier retry may allocate resources that
would make that host unsuitable for a slightly later retry. This claiming code
will ensure that we don't attempt to build on a host that doesn't have
sufficient resources, but that also means that we might run out of alternates
for the later requests. Operators will need to increase
CONF.scheduler.max_attempts if they find that exhausting the pool of alternates
is happening often in their deployment.
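For illustration, an operator who sees alternates being exhausted might raise
that option in ``nova.conf``:

.. code-block:: ini

    [scheduler]
    # Larger values give the cell conductor more alternates to fall
    # back on, at the cost of slightly larger RPC payloads.
    max_attempts = 10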
As this proposal will change the structure of what is returned from the call to
select_destinations(), every operation that consumes it, such as evacuate,
unshelve, or a migration, will have to be modified to accept the new data
structure. They will not be
required to change how they work with this information. In other words, while
the build and resize processes in a cell will be changed as noted above to
retry failed builds using these alternates, these other consumers of
select_destinations() will not change how they use the result, because they do
not handle retries from within the cell conductor. We may decide to change them
at a later date, but that is not in the scope of this spec.
Alternatives
------------
Continue returning a single host per instance. This is simpler from the
scheduler/conductor side of things, but will make failed builds more common
than with this change, since retries won't be possible. Since we are now
pre-claiming in the scheduler, resource races, which were a major contributor
to failed builds, should no longer happen, making the number of failed builds
much lower even without this change.
Instead of passing these Selection objects around, store this information,
keyed by instance_uuid, in a distributed datastore like etcd, and have the
conductor access that information as needed. The calls involved in building
instances already contain nearly a dozen parameters, and continuing to add
more feels like accumulating further tech debt.
Allow the cell conductor to call back up to the superconductor when a build
fails to initiate a retry. We have already decided that such callups will not
be allowed, making this option not possible without abandoning that design
tenet.
Data model impact
-----------------
None, because none of this alternate host information will be persisted.
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
This will slightly increase the amount of data sent between the scheduler,
superconductor, cell conductor, and compute, but not to any degree that should
be impactful. It will have a positive performance impact when an instance build
fails, as the cell conductor can retry on a different host right away.
Other deployer impact
---------------------
None
Developer impact
----------------
This change will not make the workflow for the whole scheduling/building
process any more complex, but it will make the data being sent among the
services a little more complex.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
ed-leafe
Other contributors:
None
Work Items
----------
* Modify the scheduler's select_destinations() method to find additional hosts
in the same cell as the selected host, and return these as a list of
Selection objects to the superconductor.
* Modify the superconductor to pass this new data to the selected compute host.
* Modify all the calls that comprise the retry pathway in compute and conductor
to properly handle the list of Selection objects.
* Modify all other methods that call select_destinations() to properly handle
the list of Selection objects.
Dependencies
============
This depends on the work to implement Selection objects being completed. The
spec for Selection objects is at https://review.openstack.org/#/c/498830/.
Testing
=======
Each of the modified RPC interfaces will have to be tested to verify that the
new data structures are being correctly passed. Tests will have to be added to
ensure that the retry loop in the cell conductor properly handles build
failures.
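One possible shape for such a test, reusing the ``_retry_build`` sketch from
the proposal above (all names are assumptions):

.. code-block:: python

    from unittest import mock

    def test_retry_claims_and_casts_to_alternate():
        # An alternate Selection carries a non-empty allocation.
        alternate = mock.Mock(allocation={'rp_uuid': {'VCPU': 1}})
        claim = mock.Mock(return_value=True)
        cast = mock.Mock()
        instance = mock.Mock(uuid='fake-uuid')

        _retry_build(mock.sentinel.ctxt, instance, [alternate],
                     claim, cast)

        # The alternate's resources were claimed exactly once, and the
        # build was cast to it with no further alternates remaining.
        claim.assert_called_once_with(mock.sentinel.ctxt, instance,
                                      alternate.allocation)
        cast.assert_called_once_with(mock.sentinel.ctxt, instance,
                                     alternate, [])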
Documentation Impact
====================
The documentation for CONF.scheduler.max_attempts will need to be updated to
let operators know that if they are seeing cases where a burst of requests have
led to builds failing because none of the alternates has enough resources left,
they should increase that value to provide a larger pool of alternates to
retry.
Any documentation of the scheduler workflow will need to be updated to reflect
these changes.
References
==========
None
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced