..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Scheduler: Introduce HostState level locking
============================================

https://blueprints.launchpad.net/nova/+spec/host-state-level-locking

The Nova FilterScheduler implementation, even though it is inherently
multi-threaded, uses no locking for access to the in-memory HostState data
structures that are shared between all active threads. Although this means
that many of the decisions the scheduler makes under load are not internally
consistent, it is not necessarily a huge issue for the basic use case, as Nova
makes sure that the configured resource usage policy is maintained, even in
the face of races, by using the retry mechanism [1]_. It can, however, cause
issues in several more complex use cases. A non-exhaustive list of examples:
high resource utilization, high load, and specific types of hosts and
resources (e.g. Ironic nodes [2]_ and complex resources such as NUMA topology
or PCI devices).

We propose to change the scheduler code to use a lightweight transactional
approach that avoids full-blown locking while still mitigating some of the
race conditions.

Problem description
===================

The scheduler service is inherently multi-threaded, as it currently runs an
oslo.messaging RpcServer using an eventlet executor. This means that every
incoming RPC message for select_destinations is dispatched in its own green
thread.

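For illustration only, here is a minimal sketch of how such a server is
typically wired up with oslo.messaging; the endpoint class and the
topic/server names below are stand-ins, not the actual nova code::

    import oslo_messaging as messaging
    from oslo_config import cfg


    class SchedulerEndpoint(object):
        def select_destinations(self, ctxt, request_spec,
                                filter_properties):
            # Each call runs in its own green thread, concurrently with
            # every other in-flight select_destinations request.
            pass


    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='scheduler', server='scheduler-host')
    server = messaging.get_rpc_server(transport, target,
                                      [SchedulerEndpoint()],
                                      executor='eventlet')
    server.start()
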
Upon receiving the message, every green thread reads all ComputeNode states
from the database, and potentially [3]_ populates the internal global data
structure that holds the host states used for filtering.

Further along, after choosing a host, each thread calls the
HostState.consume_from_instance() method on the chosen object, which
"consumes" the resources for the instance being scheduled from the chosen
HostState object. This is the equivalent of what the Claims code does once
the request makes it to a nova-compute service, except that instead of
updating the ComputeNode table, it updates the scheduler service's in-memory
HostState object.

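To make the race concrete, here is a small self-contained sketch (not nova
code; FakeHostState and schedule() are stand-ins) in which two green threads
both pass the "filter" check before either consumes, leaving the host
oversubscribed::

    import eventlet
    eventlet.monkey_patch()


    class FakeHostState(object):
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb


    def schedule(host, ram_mb):
        # "Filter": the host looks acceptable right now.
        if host.free_ram_mb >= ram_mb:
            eventlet.sleep(0)  # yield, as any real I/O (e.g. a DB read) would
            # "Consume": another thread may have consumed in the meantime.
            host.free_ram_mb -= ram_mb
            return True
        return False


    host = FakeHostState(free_ram_mb=1024)
    pool = eventlet.GreenPool()
    results = list(pool.imap(lambda _: schedule(host, 1024), range(2)))
    print(results, host.free_ram_mb)  # typically: [True, True] -1024
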
However, there is no mutual exclusion of threads between the time a filter
function runs and decides that a host passes, and the time a single host
state is finally chosen, so any number of other concurrent threads could have
already updated the same host state - a classic race condition. Once we
consider this, some obvious avenues for improvement arise:

1. When calling consume_from_instance() we are basically doing a claim of
   resources on the host state, which may have changed since the filter
   function that decided to pass the host ran. At that point we have all the
   information needed to know early whether a claim is going to fail, and to
   try to choose a different host. This is roughly equivalent to retrying a
   transaction.

   It is worth noting that even though we may find that a host looks like it
   will fail, we may still want to choose it anyway: we never drop the
   resources consumed on the HostState, even after we register a retry from
   an already chosen compute host in this refresh cycle, so the failure may
   in fact be a false negative.

2. There needs to be some kind of locking that is granular enough not to
   cause too much unnecessary overhead, but that also allows for more
   consistent handling of HostState.

Use Cases
---------

There is no specific use case this is aimed at. It is an internal refactoring
aimed at improving data consistency in the scheduler, and thus the overall
effectiveness of placement decisions.

Project Priority
----------------

Yes - this work is related to the scheduler, one of the priority topics for
Liberty.

Proposed change
===============

Firstly, it would be very useful to use the Claim logic instead of (or
inside) HostState.consume_from_instance(), as the two are almost complete
duplicates.

The next change in scope for this blueprint is adding synchronisation
primitives around accessing and updating HostState fields. A lightweight
approach is to use no synchronisation primitives in the filters, as their
access to the host state is (a) read-only and (b) usually per resource.
consume_from_instance() is where we want to make sure access is synchronized:
once a host is chosen, it needs to have resources consumed (remember - many
concurrent threads could be trying to consume resources from the same
HostState), and if any of the "claims" fails, no resources should be
consumed. Updating the host state with fresh values after a DB read should
also be synchronized.

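A minimal sketch of what this could look like, assuming a per-instance
semaphore and a ClaimFailed exception (both hypothetical names, not the final
interface)::

    import eventlet.semaphore


    class ClaimFailed(Exception):
        pass


    class HostState(object):
        def __init__(self, free_ram_mb=0):
            self.free_ram_mb = free_ram_mb
            self._lock = eventlet.semaphore.Semaphore()

        def update_from_compute_node(self, compute):
            # Refreshing from a DB read is synchronized as well.
            with self._lock:
                self.free_ram_mb = compute.free_ram_mb

        def consume_from_instance(self, instance):
            with self._lock:
                # Test the whole "claim" first; consume only if every check
                # passes, so a failed claim leaves the state untouched.
                if instance.memory_mb > self.free_ram_mb:
                    raise ClaimFailed()
                self.free_ram_mb -= instance.memory_mb

Note that filters keep reading HostState fields without taking the lock; only
the mutating paths are serialized.
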
The final piece of the puzzle is modifying the FilterScheduler._schedule()
method to take into account a failure to claim in consume_from_instance(),
and either try the next host that passed the filters, or choose to ignore the
local in-memory failure and risk a retry from the compute host.

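Roughly, the selection loop would change along these lines (a sketch using
the hypothetical ClaimFailed from above, not the final code)::

    def _choose_host(weighed_hosts, instance):
        # Walk the hosts best-first; fall through to the next candidate
        # when the in-memory claim fails because the host state changed
        # under us between filtering and consuming.
        for weighed_host in weighed_hosts:
            try:
                weighed_host.obj.consume_from_instance(instance)
            except ClaimFailed:
                continue
            return weighed_host
        return None  # no host could satisfy the claim
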
It is worth noting that this proposal only looks at fixing data consistency
among threads of a single nova-scheduler process. Running several workers
still means that their internal state is going to be inconsistent between
updates from the database. Fixing this is outside the scope of this proposal.

Alternatives
------------

There are a number of ways we could re-design the scheduler so that the
issues discussed in this spec become irrelevant. This blueprint aims to
improve some obvious issues with the current implementation of the scheduler
without changing the basic design.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

After this change there will be synchronisation overhead in every request,
which may increase the average response time for basic workloads. However, we
fully expect this to massively improve performance under a large number of
requests, or when overall cloud capacity (or the capacity of specific
resources such as Ironic hosts) is low, as it will significantly cut down on
issued retries.

Other deployer impact
---------------------

There may be several config options deployers would need to consider.
Defaults may be chosen in such a way as to not change previous behaviour.

Developer impact
----------------

Developers would need to understand that there is now locking going on in the
scheduler, and consider this when making changes to the code, especially when
adding additional resources.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  <ndipanov>

Work Items
----------

* Refactor the Claim classes to not depend directly on the resource_tracker,
  so that they can be used in the scheduler code, and possibly move them out
  of the compute/ subtree.

* Modify HostState.consume_from_instance() to use the Claim logic, acquiring
  a HostState instance-wide lock while doing so.

* Modify HostState.update_from_compute_node() to acquire a HostState
  instance-wide lock for updating the host state.

* Modify the FilterScheduler._schedule() method to expect a claim transaction
  failure and take appropriate action.

Dependencies
============

None

Testing
=======

As is usually the case with race problems, it is notoriously difficult to
come up with deterministic tests. Testing will be limited to unit tests
making sure that the proper synchronisation primitives are called as
expected.

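For example, one possible shape for such a test (hypothetical names, written
against the HostState sketch above) is to assert that consume_from_instance()
takes the per-HostState lock before mutating any fields::

    import mock


    def test_consume_from_instance_takes_lock():
        host_state = HostState(free_ram_mb=2048)
        instance = mock.Mock(memory_mb=1024)
        with mock.patch.object(host_state, '_lock') as lock:
            host_state.consume_from_instance(instance)
        # The mutation must have happened under the lock.
        lock.__enter__.assert_called_once_with()
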
Documentation Impact
====================

There may be an additional config option to turn on the transactional nature
of consume_from_instance(), and possibly another one to tell the scheduler to
go ahead and attempt to land an instance even though a local claim failed.

References
==========

.. [1] The retry mechanism works somewhat like a 2PC, where the instance
   resource usage is consumed from the in-memory view the scheduler has, but
   is only committed to the DB when the request makes it to the chosen
   compute host, and under a global resource lock.

.. [2] This `bug <https://bugs.launchpad.net/nova/+bug/1341420>`_ shows that
   this is pretty bad in the case of Ironic.

.. [3] "Potentially" because a timestamp is checked to see whether the
   HostState has actually been updated more recently than the ComputeNode
   record (with in-flight requests not yet claimed on their compute hosts).

History
=======

Optional section for Liberty, intended to be used each time the spec is
updated to describe a new design, API, or database schema change. Useful to
let the reader understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Liberty
     - Introduced