Add Stochastic Weighing Scheduler
This spec proposes some small tweaks to the manila scheduler to make it
perform better in real-world circumstances.

Implements blueprint stochastic-weighing-scheduler

Change-Id: Icf64c00afca60fc3a0557ad6070c20c3163b2aca
specs/ocata/stochastic-weighing-scheduler.rst (new file, 258 lines)
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Stochastic Weighing Scheduler
=============================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/manila/+spec/stochastic-weighing-scheduler

The filter scheduler is the de facto standard scheduler for Manila and has
a lot of desirable properties. However, there are certain scenarios where it's
hard or impossible to get it to do the right thing. I think some small tweaks
could help make admins' lives easier.


Problem description
===================

I'm concerned about two specific problems:

1. When there are multiple backends that are not identical, it can be hard to
   ensure that load is spread across all of them. Consider the case of a
   few "large" backends mixed with some "small" backends. Even if they're
   from the same vendor, by default new shares will go exclusively to the
   large backends until their free space decreases to the same level as the
   small backends. This can be worked around by using something other than
   free space to weigh hosts, but no matter what you choose, you'll have a
   similar issue whenever the backends aren't homogeneous.

2. Even if the admin is able to ensure that all the backends are identical in
   every way, at some point the cloud will probably need to grow by adding
   new storage backends. When this happens there will be a mix of brand new
   empty backends and mostly full backends. No matter what kind of weighing
   function you use, initially 100% of new requests will be scheduled on the
   new backends. Depending on how good or bad the weighing function is, it
   could take a long time before the old backends start receiving new
   requests, and during this period system performance is likely to drop
   dramatically. The problem is particularly bad if the upgrade is a small
   one: consider adding 1 new backend to a system with 10 existing backends.
   If 100% of new shares go to the new backend, then for some period, that
   single backend will see 10x the load of its peers.

There is one existing partial solution to the above problems -- the goodness
weigher -- but it has some limitations worth mentioning. Consider an ideal
goodness function -- an oracle that always returns the right value, such
that the best backend for new shares is sorted to the top. Because the inputs
to the goodness function (other than available space) are only evaluated every
60 seconds, bursts of creation requests will nearly always go to the same
backend within a 60-second window. While we could shrink the time window of
this problem by sending more frequent updates, that has its own costs and
diminishing returns. In the more realistic case of a goodness function
that's non-ideal, it may take longer than 60 seconds for the goodness function
output to reflect changes based on recent creation requests.


Use Cases
=========

The existing scheduler handles homogeneous backends best, and it relies on a
relatively low rate of creation requests compared to the capacity of the whole
system, so that it can keep up-to-date information with which to make optimal
decisions. It also deals best with deployments that don't add capacity over
time.

I'm interested in making the scheduler perform well across a broad range of
deployment scenarios:

1. Mixed-vendor scenarios
2. A mix of generations of hardware from a single vendor
3. A mix of capacities of hardware (small vs. large configs)
4. Adding new capacity to a running cloud to deal with growth

These are all deployer/administrator concerns. Part of the proposed solution
is to enable certain things that are impossible today, but mostly the goal
is to make the average case "just work" so that administrators don't have to
babysit the system to get reasonable behavior.
Proposed change
===============

Currently the filter scheduler does three things:

1. Takes a list of all pools and filters out the ones that are unsuitable for
   a given creation request.
2. Generates a weight for each pool based on one of the available weighers.
3. Sorts the pools and chooses the one with the highest weight.

I call the above system "winner-take-all" because whether the top two weights
are 100 and 0 or 49 and 51, the winner gets the request 100% of the time.
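
A minimal sketch of this winner-take-all selection (the function and variable
names here are illustrative, not the actual manila classes):

.. code-block:: python

    def winner_take_all(weighted_pools):
        """Pick the single pool with the highest weight.

        weighted_pools: list of (pool_name, weight) tuples.
        """
        # The top-weighted pool wins 100% of the time, even when the
        # runner-up is only marginally worse (e.g. 51 vs. 49).
        return max(weighted_pools, key=lambda entry: entry[1])[0]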

I propose renaming the existing HostWeightHandler to OrderedHostWeightHandler
and adding a new weight handler to the filter scheduler called
StochasticHostWeightHandler. The OrderedHostWeightHandler would continue to
be the default and would not change behavior. The new weight handler would
implement different behavior as follows:

In step 3 above, rather than simply selecting the highest weight, the
weight handler would sum the weights of all choices, assign each pool a
sub-range of that total with a size equal to that pool's weight, then generate
a random number across the whole range and choose the pool whose sub-range
contains it.
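
A sketch of that selection in Python (names are illustrative; the real
handler would operate on weighed host objects):

.. code-block:: python

    import random

    def stochastic_choice(weighted_pools):
        """Pick a pool at random, with probability proportional to weight.

        weighted_pools: list of (pool_name, weight) tuples, weight >= 0.
        """
        total = sum(weight for _, weight in weighted_pools)
        # Each pool owns a contiguous sub-range of [0, total) sized by its
        # weight; one uniform draw picks the pool whose sub-range contains
        # the drawn point.
        point = random.uniform(0, total)
        cumulative = 0.0
        for pool_name, weight in weighted_pools:
            cumulative += weight
            if point < cumulative:
                return pool_name
        # Guard against floating-point edge cases at the top of the range.
        return weighted_pools[-1][0]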

An easier way to visualize the above algorithm is to imagine a raffle drawing.
Each pool is given a number of raffle tickets equal to the pool's weight
(assume weights are normalized to the range 0-100). The winning pool is chosen
by a raffle drawing, and every creation request results in a new raffle being
held.

Pools with higher weights get more raffle tickets and thus have a higher
chance of winning, but any pool with a weight higher than 0 has some chance to
win.
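
A rough usage example of the sketch above, showing that close weights split
the draws rather than letting the higher weight win every time:

.. code-block:: python

    from collections import Counter

    pools = [("pool_a", 49), ("pool_b", 51)]
    results = Counter(stochastic_choice(pools) for _ in range(10000))
    # Approximately proportional to the weights, e.g.:
    # Counter({'pool_b': 5088, 'pool_a': 4912})
    print(results)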

The advantage of the above algorithm is that it distinguishes between weights
that are close (49 and 51) and weights that are far apart (0 and 100), so a
pool that is only slightly better than another pool doesn't always win. Also,
it can give different results within a 60-second window during which the
inputs to the weigher aren't updated, significantly decreasing the pain of
slow share stats updates.

It should be pointed out that this algorithm not only requires that weights
are properly normalized (the current goodness weigher also requires this) but
also that the weights be roughly linear across the range of possible values.
Any deviation from linear "goodness" can result in bad decisions being made,
due to the randomness inherent in this approach.
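
As a hypothetical illustration of the linearity requirement: a weigher that
squares its inputs gives a pool that is twice as good four times the raffle
tickets, so it wins far more often than intended:

.. code-block:: python

    # Hypothetical numbers: "large" is twice as good as "small".
    linear = [("small", 25), ("large", 50)]             # large wins ~2x as often
    squared = [("small", 25 ** 2), ("large", 50 ** 2)]  # large wins ~4x as often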


Alternatives
------------

There aren't many good options to deal with the problem of bursty requests
relative to the update frequency of share stats. You can update stats faster,
but only up to a point: the extreme would be to have the scheduler
synchronously request the absolute latest share stats from every backend for
every request, and clearly that approach won't scale.

To deal with the heterogeneous backends problem, we have the goodness
function, but it's challenging to pick a goodness function that yields
acceptable results across all types of variation in backends. This proposal
keeps the goodness function and builds upon it to make it both stronger and
more tolerant of imperfection.


Data model impact
-----------------

No database changes.


REST API impact
---------------

No REST API changes.


Security impact
---------------

No security impact.


Notifications impact
--------------------

No notification impact.


Other end user impact
---------------------

End users may indirectly experience better (or conceivably worse) scheduling
choices made by the modified scheduler.


Performance Impact
------------------

No performance impact. In fact, this approach is proposed expressly because
alternative solutions would have a performance impact and I want to avoid
that.


Other deployer impact
---------------------

I propose a single new weight handler class for the scheduler. The default
would continue to be the existing ordered handler. An administrator would
need to intentionally modify the weight handler class config option to
observe changed behavior.
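
Opting in might look something like the following in ``manila.conf``; the
option name and class path shown here are placeholders to be settled during
implementation:

.. code-block:: ini

    [DEFAULT]
    # Hypothetical option and class path, not yet implemented.
    scheduler_weight_handler = manila.scheduler.weights.StochasticHostWeightHandler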


Developer impact
----------------

Developers wouldn't be directly impacted, but anyone working on goodness
functions or other weighers would need to be aware of the linearity
requirement for getting good behavior out of this new scheduler mode.

In order to avoid accidentally feeding nonlinear goodness values into the
stochastic weighing scheduler, we may want to create alternatively-named
versions of the various weights or weighers, forcing driver authors to
explicitly opt in to the new scheme and thus indicate that the weights
they're returning are suitably linear.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bswartz

Work Items
----------

This should be doable in a single patch.


Dependencies
============

* Filter scheduler (manila)
* Goodness weigher (manila)


Testing
=======

Testing this feature will require a multibackend configuration (otherwise
scheduling is just a no-op).

Because randomness is inherently required for the correctness of the
algorithm, it will be challenging to write automated functional test cases
without subverting the random number generation. I propose that we rely on
unit tests to ensure correctness, because it's easy to "fake" random numbers
in unit tests.
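
For example, a unit test might patch the random draw so the selection becomes
deterministic (pytest-style sketch, reusing the illustrative
``stochastic_choice`` above):

.. code-block:: python

    from unittest import mock

    @mock.patch('random.uniform', return_value=0.0)
    def test_lowest_draw_selects_first_pool(mock_uniform):
        pools = [("pool_a", 49), ("pool_b", 51)]
        # A draw of 0.0 always falls in the first pool's sub-range.
        assert stochastic_choice(pools) == "pool_a"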


Documentation Impact
====================

Dev docs need to be updated to explain to driver authors what the
expectations are for goodness functions.

The config reference needs to explain to deployers what the new config option
does.


References
==========

This spec is a copy of an idea accepted by the Cinder community:
https://github.com/openstack/cinder-specs/blob/master/specs/newton/stochastic-weighing-scheduler.rst