..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Stochastic Weighing Scheduler
=============================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/manila/+spec/stochastic-weighing-scheduler

The filter scheduler is the de-facto standard scheduler for Manila and has
a lot of desirable properties. However, there are certain scenarios where it's
hard or impossible to get it to do the right thing. I think some small tweaks
could help make admins' lives easier.

Problem description
===================

I'm concerned about 2 specific problems:

1. When there are multiple backends that are not identical, it can be hard to
   ensure that load is spread across all the backends. Consider the case of a
   few "large" backends mixed with some "small" backends. Even if they're from
   the same vendor, by default new shares will go exclusively to the large
   backends until free space decreases to the same level as the small backends.
   This can be worked around by using something other than free space to weigh
   hosts, but no matter what you choose, you'll have a similar issue whenever
   the backends aren't homogeneous.
2. Even if the admin is able to ensure that all the backends are identical in
   every way, at some point the cloud will probably need to grow, by adding
   new storage backends. When this happens there will be a mix of brand new
   empty backends and mostly full backends. No matter what kind of weighing
   function you use, initially 100% of new requests will be scheduled on the
   new backends. Depending on how good or bad the weighing function is, it
   could take a long time before the old backends start receiving new requests
   and during this period system performance is likely to drop dramatically.
   The problem is particularly bad if the upgrade is a small one: consider
   adding 1 new backend to a system with 10 existing backends. If 100% of
   new shares go to the new backend, then for some period, there will be 10x
   load on the single backend.

There is one existing partial solution to the above problems -- the goodness
weigher -- but that has some limitations worth mentioning. Consider an ideal
goodness function -- an oracle that always returns the right value such
that the best backend for new shares is sorted to the top. Because the inputs
to the goodness function (other than available space) are only evaluated every
60 seconds, bursts of creation requests will nearly always go to the same
backend within a 60 second window. While we could shrink the time window of
this problem by sending more frequent updates, that has its own costs and also
has diminishing returns. In the more realistic case of a goodness function
that's non-ideal, it may take longer than 60 seconds for the goodness function
output to reflect changes based on recent creation requests.

Use Cases
=========

The existing scheduler handles homogeneous backends best, and relies on a
relatively low rate of creation requests compared to the capacity of the whole
system, so that it can keep up-to-date information with which to make
optimal decisions. It also deals best with cases when you don't add capacity
over time.

I'm interested in making the scheduler perform well across a broad range of
deployment scenarios:

1. Mixed vendor scenarios
2. A mix of generations of hardware from a single vendor
3. A mix of capacities of hardware (small vs. large configs)
4. Adding new capacity to a running cloud to deal with growth

These are all deployer/administrator concerns. Part of the proposed solution
is to enable certain things which are impossible today, but mostly the goal
is to make the average case "just work" so that administrators don't have to
babysit the system to get reasonable behavior.

Proposed change
===============

Currently the filter scheduler does 3 things:

1. Takes a list of all pools and filters out the ones that are unsuitable for
   a given creation request.
2. Generates a weight for each pool based on one of the available weighers.
3. Sorts the pools and chooses the one with the highest weight.

I call the above system "winner-take-all" because whether the top 2 weights
are 100 and 0 or 49 and 51, the winner gets the request 100% of the
time.
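
For illustration only, here is a minimal sketch of the winner-take-all
selection in step 3 (this is not the actual manila code; the helper name and
data shapes are assumptions):

.. code-block:: python

    # Illustrative sketch only -- not the real filter scheduler code.
    def winner_take_all(weighed_pools):
        """Always pick the single pool with the highest weight."""
        # 'weighed_pools' is assumed to be a list of (pool, weight) pairs
        # produced by the configured weigher after filtering.
        return max(weighed_pools, key=lambda entry: entry[1])[0]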

I propose renaming the existing HostWeightHandler to OrderedHostWeightHandler
and adding a new weight handler to the filter scheduler called
StochasticHostWeightHandler. The OrderedHostWeightHandler would continue to
be the default and would not change behavior. The new weight handler would
implement different behavior as follows:

In step 3 above, rather than simply selecting the highest weight, the
weight handler would sum up the weights of all choices to form a range, assign
each pool a subset of that range with a size equal to that pool's weight, then
generate a random number across the whole range and choose the pool whose
subset contains that number.

An easier way to visualize the above algorithm is to imagine a raffle drawing.
Each pool is given a number of raffle tickets equal to the pool's weight
(assume weights normalized from 0-100). The winning pool is chosen by a raffle
drawing. Every creation request results in a new raffle being held.
Pools with higher weights get more raffle tickets and thus have a higher
chance to win, but any pool with a weight higher than 0 has some chance to
win.
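
As a rough sketch of what the stochastic selection could look like (purely
illustrative; the helper name, data shapes, and use of Python's ``random``
module are assumptions, not the final implementation):

.. code-block:: python

    import random

    def stochastic_pick(weighed_pools):
        """Pick a pool with probability proportional to its weight."""
        # 'weighed_pools' is assumed to be a list of (pool, weight) pairs
        # with non-negative weights.
        total = sum(weight for _pool, weight in weighed_pools)
        if total <= 0:
            # Degenerate case: every pool has weight 0, so draw uniformly.
            return random.choice([pool for pool, _weight in weighed_pools])
        ticket = random.uniform(0, total)   # draw the winning raffle ticket
        running = 0.0
        for pool, weight in weighed_pools:
            running += weight
            if ticket <= running:
                return pool
        return weighed_pools[-1][0]         # guard against float rounding
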
The advantage of the above algorithm is that it distinguishes between weights
that are close (49 and 51) and weights that are far apart (0 and 100), so just
because one pool is slightly better than another pool, it doesn't always win.
Also, it can give different results within a 60 second window of time when the
inputs to the weigher aren't updated, significantly decreasing the pain of
slow share stats updates.

It should be pointed out that this algorithm not only requires that weights
are properly normalized (the current goodness weigher also requires this) but
also that the weights be roughly linear across the range of possible values.
Any deviation from linear "goodness" can result in bad decisions being made,
due to the randomness inherent in this approach.
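
To make the linearity requirement concrete: assuming the weights stay fixed
between requests, the expected fraction of creation requests that pool *i*
receives under this scheme is simply its share of the total weight::

    P(pool i wins a given raffle) = w_i / (w_1 + w_2 + ... + w_n)

so a weigher that overstates how much "better" one pool is than another will
send it a correspondingly outsized share of new requests.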

Alternatives
------------

There aren't many good options to deal with the problem of bursty requests
relative to the update frequency of share stats. You can update stats faster,
but there's a limit: the extreme would be to have the scheduler synchronously
request the absolute latest share stats from every backend for every request,
and clearly that approach won't scale.

To deal with the heterogeneous backends problem, we have the goodness
function, but it's challenging to pick a goodness function that yields
acceptable results across all types of variation in backends. This proposal
keeps the goodness function and builds upon it to make it both stronger and
more tolerant of imperfection.

Data model impact
-----------------

No database changes.

REST API impact
---------------

No REST API changes.

Security impact
---------------

No security impact.

Notifications impact
--------------------

No notification impact.

Other end user impact
---------------------

End users may indirectly experience better (or conceivably worse) scheduling
choices made by the modified scheduler.

Performance Impact
------------------

No performance impact. In fact this approach is proposed expressly because
alternative solutions would have a performance impact and I want to avoid
that.

Other deployer impact
---------------------

I propose a single new weight handler class for the scheduler. The default
would continue to be the existing handler. An administrator would need to
intentionally modify the weight handler class config option to observe the
changed behavior.

Developer impact
----------------

Developers wouldn't be directly impacted, but anyone working on goodness
functions or other weighers would need to be aware of the linearity
requirement for getting good behavior out of this new scheduler mode.

In order to avoid accidentally feeding nonlinear goodness values into the
stochastic weighing scheduler, we may want to create alternatively-named
versions of the various weights or weighers, forcing driver authors to
explicitly opt in to the new scheme and thus indicate that the weights
they're returning are suitably linear.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bswartz

Work Items
----------

This should be doable in a single patch.

Dependencies
============

* Filter scheduler (manila)
* Goodness weigher (manila)

Testing
=======

Testing this feature will require a multibackend configuration (otherwise
scheduling is just a no-op).

Because randomness is inherently required for the correctness of the
algorithm, it will be challenging to write automated functional test cases
without subverting the random number generation. I propose that we rely on
unit tests to ensure correctness because it's easy to "fake" random numbers
in unit tests.
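
As a rough illustration of how a unit test could pin down the random draw
(the names, values, and import path below are assumptions, not the final
implementation):

.. code-block:: python

    import unittest
    from unittest import mock

    # Hypothetical import of the stochastic_pick() sketch shown earlier.
    from stochastic_sketch import stochastic_pick

    class StochasticPickTestCase(unittest.TestCase):
        def test_high_draw_selects_heavier_pool(self):
            pools = [('pool_a', 25.0), ('pool_b', 75.0)]
            # Force the "raffle ticket" into pool_b's portion of the range.
            with mock.patch('random.uniform', return_value=90.0):
                self.assertEqual('pool_b', stochastic_pick(pools))

        def test_low_draw_can_select_lighter_pool(self):
            pools = [('pool_a', 25.0), ('pool_b', 75.0)]
            # A low draw lands in pool_a's portion, so it wins this raffle.
            with mock.patch('random.uniform', return_value=10.0):
                self.assertEqual('pool_a', stochastic_pick(pools))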

Documentation Impact
====================

Dev docs need to be updated to explain to driver authors what the expectations
are for goodness functions.

The config reference needs to explain to deployers what the new config option
does.

References
==========

This spec is a copy of an idea accepted by the Cinder community:
https://github.com/openstack/cinder-specs/blob/master/specs/newton/stochastic-weighing-scheduler.rst