Merge "Scheduler Introduce lightwieght transactional model for HostState"
This commit is contained in:
241
specs/liberty/approved/host-state-level-locking.rst
Normal file
241
specs/liberty/approved/host-state-level-locking.rst
Normal file
@@ -0,0 +1,241 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Scheduler: Introduce HostState level locking
============================================

https://blueprints.launchpad.net/nova/+spec/host-state-level-locking

The Nova FilterScheduler implementation, even though inherently
multi-threaded, uses no locking for access to the in-memory HostState data
structures that are shared between all active threads. Even though this means
that many of the decisions the scheduler makes under load are not internally
consistent, this is not necessarily a huge issue for the basic use case, as
Nova makes sure that the configured resource usage policy is maintained even
in the face of races by using the retry mechanism [1]_. It can, however,
cause issues in several more complex use cases. A non-exhaustive list of
examples: high resource utilization, high load, and specific types of hosts
and resources (e.g. Ironic nodes [2]_ and complex resources such as NUMA
topology or PCI devices).

We propose to change the scheduler code to use a lightweight transactional
approach that avoids full-blown locking while still mitigating some of the
race conditions.

Problem description
===================

Our scheduler service is inherently multi-threaded, as it currently runs an
oslo.messaging RpcServer using an eventlet executor. This means that every
incoming RPC message for select_destinations will be dispatched in its own
green thread.

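For context, the dispatch model looks roughly like the following sketch. It
is illustrative only; the topic, server name, and the way the endpoint is
instantiated are simplified, but get_rpc_server() with the eventlet executor
is the mechanism that gives each call its own green thread::

    from oslo_config import cfg
    import oslo_messaging as messaging

    from nova.scheduler import manager

    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='scheduler', server='scheduler-host')
    endpoints = [manager.SchedulerManager()]  # exposes select_destinations()

    # The eventlet executor dispatches every incoming message, including
    # each select_destinations() call, in its own green thread.
    server = messaging.get_rpc_server(transport, target, endpoints,
                                      executor='eventlet')
    server.start()
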
Upon receiving the message, every green thread will read all ComputeNode
states from the database and potentially [3]_ populate the internal global
data structure that holds the host states used for filtering.

Further along, after choosing a host, each thread will call the
HostState.consume_from_instance() method on the chosen object, which will
"consume" the resources for the instance being scheduled from the chosen
HostState object. This is the equivalent of what the Claims code does once
the request makes it to a nova-compute service, except that instead of
updating the ComputeNode table, it updates the scheduler service's in-memory
HostState object.

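A minimal sketch of the kind of bookkeeping this involves (field names are
simplified from the real HostState class; note the complete absence of
synchronisation)::

    class HostState(object):
        """Simplified, illustrative view of the scheduler's per-host state."""

        def __init__(self, free_ram_mb, free_disk_gb, vcpus_total, vcpus_used):
            self.free_ram_mb = free_ram_mb
            self.free_disk_gb = free_disk_gb
            self.vcpus_total = vcpus_total
            self.vcpus_used = vcpus_used

        def consume_from_instance(self, instance):
            # Deduct the instance's resources from the in-memory view.
            # Nothing stops concurrent green threads from interleaving
            # these updates with filtering decisions based on stale values.
            self.free_ram_mb -= instance['memory_mb']
            self.free_disk_gb -= instance['root_gb']
            self.vcpus_used += instance['vcpus']
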
However, there is no mutual exclusion of threads between the time a filter
function runs and decides that a host passes, and the time a single host
state is finally chosen. A number of other concurrent threads could have
already updated the same host state in the meantime: a classic race
condition. Once we consider this, some obvious avenues for improvement arise.

1. When calling consume_from_instance() we are basically making a claim of
   resources on the host state, which may have changed since the filter
   function that decided to pass the host ran. At that point we have all the
   information needed to know early whether a claim is going to fail, and we
   can try to choose a different host. This is roughly equivalent to retrying
   a transaction (see the sketch after this list).

   It is worth noting here that even though a host may look like it is going
   to fail the claim, we may still want to choose it: we never drop the
   resources consumed on the HostState, even after we register a retry from
   an already chosen compute host in this refresh cycle, so the failure may
   in fact be a false negative.

2. There needs to be some kind of locking that is granular enough not to
   cause too much unnecessary overhead, but also to allow for more consistent
   handling of HostState.

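To make the first point concrete, here is a rough sketch of an early claim
check (ClaimFailed and claim_resources() are illustrative names, not existing
Nova APIs)::

    class ClaimFailed(Exception):
        """Raised when a host no longer fits the requested resources."""


    def claim_resources(host_state, instance):
        # Re-test the request against the *current* host state, which other
        # green threads may have changed since the filters passed this host.
        # Failing here, before consuming, is the "retry a transaction" step.
        if host_state.free_ram_mb < instance['memory_mb']:
            raise ClaimFailed()
        if host_state.free_disk_gb < instance['root_gb']:
            raise ClaimFailed()
        host_state.consume_from_instance(instance)
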
Use Cases
---------

There is no specific use case that this is aimed at. It is an internal
refactoring aimed at improving data consistency in the scheduler, and thus
the overall effectiveness of placement decisions.

Project Priority
----------------

Yes. This is work related to the scheduler, one of the priority topics for
Liberty.

Proposed change
===============

Firstly, it would be very useful to use the Claim logic instead of (or
inside) HostState.consume_from_instance(), as there is almost complete
duplication between the two.

The next change in scope for this blueprint would be adding synchronisation
primitives around accessing and updating HostState fields. A lightweight
approach would be to not use any synchronisation primitives in the filters,
as access to the host state there is a) read-only and b) usually per
resource. consume_from_instance() is the place where we want to make sure
access is synchronised: once the host is chosen, it will need to have
resources consumed (remember, many concurrent threads could be trying to
consume resources from the same HostState), and if any of the "claims" fails,
no resources should be consumed at all. Updating the host state with fresh
values after a DB read should also be synchronised.

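A minimal sketch of what this could look like, folding the claim check from
the earlier sketch inside consume_from_instance() and assuming a
per-HostState lock (using a plain threading.Lock here is an illustrative
choice, not a settled design)::

    import threading


    class HostState(object):

        def __init__(self, free_ram_mb, free_disk_gb, vcpus_total, vcpus_used):
            self.free_ram_mb = free_ram_mb
            self.free_disk_gb = free_disk_gb
            self.vcpus_total = vcpus_total
            self.vcpus_used = vcpus_used
            # One lock per HostState instance: granular enough that threads
            # scheduling onto different hosts never contend with each other.
            # Under eventlet monkey-patching this is green-thread safe.
            self._lock = threading.Lock()

        def consume_from_instance(self, instance):
            with self._lock:
                # Test every "claim" against the current values first
                # (ClaimFailed as sketched in the problem description)...
                if self.free_ram_mb < instance['memory_mb']:
                    raise ClaimFailed()
                if self.free_disk_gb < instance['root_gb']:
                    raise ClaimFailed()
                if self.vcpus_used + instance['vcpus'] > self.vcpus_total:
                    raise ClaimFailed()
                # ...and only consume once we know all of them pass, so a
                # failed claim leaves the host state untouched.
                self.free_ram_mb -= instance['memory_mb']
                self.free_disk_gb -= instance['root_gb']
                self.vcpus_used += instance['vcpus']

        def update_from_compute_node(self, compute):
            # Refreshing from a DB read takes the same lock, so a refresh
            # can never interleave with a partially applied claim.
            with self._lock:
                self.free_ram_mb = compute['free_ram_mb']
                self.free_disk_gb = compute['free_disk_gb']
                self.vcpus_used = compute['vcpus_used']
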
The final piece of the puzzle is modifying the FilterScheduler._schedule()
method to take into account a claim failure in consume_from_instance() and
try the next host that passed the filters, or to choose to ignore the local
in-memory failure and risk a retry from the compute host.

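Roughly, the selection could change along these lines. This is a simplified
stand-in for FilterScheduler._schedule(), which in reality handles lists of
instances; only the claim-failure handling is the point here::

    from nova import exception


    def _schedule_one(self, weighed_hosts, instance):
        # Walk the weighed hosts in order; on a local claim failure fall
        # through to the next best host, instead of returning a host whose
        # in-memory state we already know cannot fit the instance.
        for weighed_host in weighed_hosts:
            try:
                weighed_host.obj.consume_from_instance(instance)
            except ClaimFailed:
                # Alternatively (possibly behind a config option) we could
                # still pick this host and risk a retry from the compute
                # node.
                continue
            return weighed_host
        raise exception.NoValidHost(reason='all local claims failed')
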
It is worth noting that this proposal only looks at fixing data consistency
among threads of a single nova-scheduler process. Running several workers
still means that their internal states are going to be inconsistent between
updates from the database. Fixing this is outside the scope of this proposal.

Alternatives
------------

There are a number of ways we could re-design the scheduler so that the
issues discussed in this spec become irrelevant. This blueprint aims to
improve some obvious issues with the current implementation of the scheduler
without changing its basic design.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Even though after this change there will be synchronisation overhead on every
request, which may increase the average response time for basic workloads, I
fully expect this to massively improve performance under a large number of
concurrent requests or low overall cloud capacity (or scarcity of specific
resources such as Ironic hosts), as it will significantly cut down on issued
retries.

Other deployer impact
---------------------

There may be several config options deployers would need to consider.
Defaults may be chosen in such a way as to not change the previous behaviour.

Developer impact
----------------

Developers would need to understand that there is now locking going on in the
scheduler, and consider this when making changes to the code, especially when
adding additional resource types.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  <ndipanov>

Work Items
----------

* Refactor the Claim classes to not be directly dependent on the
  resource_tracker, so that they can be used in the scheduler code and
  possibly moved out of the compute/ subtree.

* Modify HostState.consume_from_instance() to use the Claim logic and acquire
  a HostState instance-wide lock while doing so.

* Modify HostState.update_from_compute_node() to acquire a HostState
  instance-wide lock for updating the host state.

* Modify the FilterScheduler._schedule() method to expect a claim transaction
  failure and take appropriate action.

Dependencies
============

None

Testing
=======

As is usually the case with race conditions, it is notoriously difficult to
come up with deterministic tests. Testing will be limited to unit tests
making sure that the proper synchronisation primitives are called as
expected.

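For example, a unit test along these lines could assert that consuming
resources takes the per-instance lock (it tests the locked HostState sketch
from the proposed change section; the class and attribute names are
illustrative)::

    import mock
    import testtools


    class HostStateLockingTestCase(testtools.TestCase):

        def test_consume_from_instance_takes_lock(self):
            host_state = HostState(free_ram_mb=2048, free_disk_gb=20,
                                   vcpus_total=4, vcpus_used=0)
            instance = {'memory_mb': 512, 'root_gb': 10, 'vcpus': 1}

            # Replace the per-instance lock with a MagicMock and check that
            # consuming resources enters and exits it exactly once.
            with mock.patch.object(host_state, '_lock') as fake_lock:
                host_state.consume_from_instance(instance)
            fake_lock.__enter__.assert_called_once_with()
            self.assertEqual(1, fake_lock.__exit__.call_count)
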
Documentation Impact
====================

There may be an additional config option to turn on the transactional nature
of consume_from_instance(), and possibly another one to tell the scheduler to
go ahead and attempt to land an instance even though a local claim failed.

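Hypothetically, such options could look like the following oslo.config
definitions; both option names and defaults are made up for illustration, as
the spec does not settle on them::

    from oslo_config import cfg

    opts = [
        cfg.BoolOpt('scheduler_transactional_consume',
                    default=False,
                    help='Hypothetical: treat consume_from_instance() as a '
                         'claim that can fail, instead of blindly consuming '
                         'resources.'),
        cfg.BoolOpt('scheduler_ignore_local_claim_failure',
                    default=False,
                    help='Hypothetical: still schedule to a host whose '
                         'local claim failed, risking a retry from the '
                         'compute node.'),
    ]

    cfg.CONF.register_opts(opts)
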
References
==========

.. [1] The retry mechanism works somewhat like a 2PC, where the instance
   resource usage is consumed on the in-memory view the scheduler has, but is
   only committed to the DB when the request makes it to the chosen compute
   host, and under a global resource lock.
.. [2] This `bug <https://bugs.launchpad.net/nova/+bug/1341420>`_ shows that
   the problem is pretty bad in the case of Ironic.
.. [3] "Potentially" because a timestamp is checked to see whether the
   HostState has actually been updated more recently than the ComputeNode
   record (with in-flight requests not yet claimed on their compute hosts).

History
=======

Optional section for Liberty, intended to be used each time the spec is
updated to describe a new design, API, or database schema update. Useful to
let the reader understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Liberty
     - Introduced