From f8b167a2951f82f8320795afff607c1e8abc8cf4 Mon Sep 17 00:00:00 2001
From: Nikola Dipanov
Date: Mon, 21 Mar 2016 14:59:58 +0000
Subject: [PATCH] Introduce host state level locking

Change-Id: Idbefac509ab7a1bfb417a0debc16fd8203736f16
Previously-approved: Mitaka
---
 .../approved/host-state-level-locking.rst | 219 ++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 specs/newton/approved/host-state-level-locking.rst

diff --git a/specs/newton/approved/host-state-level-locking.rst b/specs/newton/approved/host-state-level-locking.rst
new file mode 100644
index 000000000..849dcf7a7
--- /dev/null
+++ b/specs/newton/approved/host-state-level-locking.rst
@@ -0,0 +1,219 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Scheduler: Introduce HostState level locking
============================================

https://blueprints.launchpad.net/nova/+spec/host-state-level-locking

The Nova FilterScheduler implementation, even though inherently
multi-threaded, uses no locking for access to the in-memory HostState data
structures that are shared between all active threads. Even though this means
that many of the decisions the scheduler makes under load are not internally
consistent, it is not necessarily a huge issue for the basic use case, as Nova
makes sure that the configured resource usage policy is maintained despite
races by using the retry mechanism [1]_. It can, however, cause issues in
several more complex use cases. A non-exhaustive list of examples: high
resource utilization, high load, and specific types of hosts and resources
(e.g. Ironic nodes [2]_, and complex resources such as NUMA topology or PCI
devices).

We propose to change the scheduler code to use a lightweight transactional
approach that avoids full-blown locking while still mitigating some of the
race conditions.


Problem description
===================

The scheduler service is inherently multi-threaded, as it currently runs an
oslo-messaging RpcServer using an EventletExecutor. This means that every
incoming RPC message for select_destinations will be dispatched in its own
green thread.

Upon receiving the message, every green thread will read all ComputeNode
states from the database and potentially [3]_ populate the internal global
data structure that holds the host states used for filtering.

Further along, after choosing a host, each thread will call the
HostState.consume_from_instance() method on the chosen object, which will
"consume" the resources for the instance being scheduled from the chosen
HostState object. This is the equivalent of what the Claims code does once
the request makes it to a nova-compute service, except that instead of
updating the ComputeNode table, it updates the scheduler service's in-memory
HostState object.

However, there is no mutual exclusion of threads between the time a filter
function runs and decides that a host passes and the time that host's state
is finally consumed from. A number of other concurrent threads could have
already updated the same host state in the meantime: a classic race
condition (a sketch of the race follows the list below). Once we consider
this, some obvious avenues for improvement arise.

1. When calling consume_from_instance() we are basically doing a claim of
   resources on the host state, which may have changed since the filter
   function that decided to pass the host ran. At that point we have all the
   information needed to know early that a claim is going to fail, and to
   try to choose a different host. This is roughly equivalent to retrying a
   transaction.

   It is worth noting here that even though a host may seem like it is going
   to fail the claim, we may still want to choose it: we never drop the
   resources consumed on a HostState, even after we register a retry from an
   already chosen compute host in the same refresh cycle, so the failure may
   in fact be a false negative.

2. There needs to be some kind of locking that is granular enough not to
   cause too much unnecessary overhead, but that still allows for more
   consistent handling of HostState.
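
To make the race concrete, below is a minimal, illustrative sketch; it is
not Nova code. Plain threads and a sleep stand in for eventlet green threads
yielding on I/O, and HostState is reduced to a single RAM counter. Both
requests pass the read-only filter check against the same state, and the
host ends up overcommitted::

    import threading
    import time


    class HostState(object):
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb


    def schedule(host, ram_request_mb):
        # Filter step: a read-only check that the host has room.
        if host.free_ram_mb >= ram_request_mb:
            # A green thread may yield here (e.g. on a DB read), letting
            # a concurrent request pass the same filter on stale state.
            time.sleep(0.01)
            # consume_from_instance(): an unsynchronized check-then-act.
            host.free_ram_mb -= ram_request_mb


    host = HostState(free_ram_mb=1024)
    threads = [threading.Thread(target=schedule, args=(host, 1024))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Both requests passed the filter, so the host is now overcommitted.
    print(host.free_ram_mb)  # -1024
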

Use Cases
---------

There is no specific use case that this is aimed at. It is an internal
refactoring aimed at improving data consistency in the scheduler, and thus
the overall effectiveness of placement decisions.


Proposed change
===============

Firstly, it would be very useful to use the Claim logic instead of (or
inside) HostState.consume_from_instance(), as the two are almost complete
duplicates of each other.

The next change in scope for this blueprint is adding synchronisation
primitives around accessing and updating HostState fields. A lightweight
approach would be to use no synchronisation primitives in the filters, as
access to the host state there is a) read-only and b) usually per resource.
consume_from_instance() is the place where we want to make sure access is
synchronized: once a host is chosen, it needs to have resources consumed
(remember, many concurrent threads could be trying to consume resources from
the same HostState), and if any of the "claims" fails, no resources should
be consumed at all. Updating the host state with fresh values after a DB
read should also be synchronized.

The final piece of the puzzle is modifying the FilterScheduler._schedule()
method to take into account a failed claim in consume_from_instance() and
either try the next host that passed the filters, or choose to ignore the
local in-memory failure and risk a retry from the compute host. Both pieces
are sketched below.
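
Below is a minimal sketch of the locking and retry pieces taken together; it
is illustrative, not the Nova implementation. A plain threading.Lock stands
in for Nova's synchronisation helpers, HostState is again reduced to a
single RAM counter, and ClaimFailed and schedule_one() are hypothetical
names used only in this sketch::

    import threading


    class ClaimFailed(Exception):
        """Raised when a host no longer has room for the request."""


    class HostState(object):
        def __init__(self, host, free_ram_mb):
            self.host = host
            self.free_ram_mb = free_ram_mb
            # One lock per HostState instance keeps the locking granular.
            self._lock = threading.Lock()

        def consume_from_instance(self, ram_request_mb):
            # All-or-nothing "claim": re-check under the lock, because
            # the state may have changed since the filters ran.
            with self._lock:
                if self.free_ram_mb < ram_request_mb:
                    raise ClaimFailed()
                self.free_ram_mb -= ram_request_mb

        def update_from_compute_node(self, free_ram_mb):
            # Refreshing from a DB read takes the same lock, so
            # concurrent claims never observe a half-updated state.
            with self._lock:
                self.free_ram_mb = free_ram_mb


    def schedule_one(filtered_hosts, ram_request_mb):
        # Roughly the shape of the proposed FilterScheduler._schedule()
        # change: on a failed local claim, fall through to the next host
        # that passed the filters instead of returning a stale choice.
        for host in filtered_hosts:
            try:
                host.consume_from_instance(ram_request_mb)
                return host
            except ClaimFailed:
                continue
        return None  # no valid host; leads to the usual NoValidHost path

As noted in the problem description, a failed local claim may be a false
negative, so a deployment could also be allowed to keep the chosen host
anyway and accept the risk of a retry from the compute node (see the config
options mentioned below).
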
It is worth noting that this proposal only looks at fixing data consistency
among the threads of a single nova-scheduler process. Running several
workers still means that their internal state is going to be inconsistent
between updates from the database. Fixing that is outside the scope of this
proposal.

Alternatives
------------

There are a number of ways we could re-design the scheduler so that the
issues discussed in this spec become irrelevant. This blueprint aims to
improve some obvious issues with the current implementation of the scheduler
without changing the basic design.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Even though after this change every request will incur some synchronisation
overhead, which may increase the average response time for basic workloads,
I fully expect it to massively improve performance under a large number of
requests or low overall cloud capacity (or scarce specific resources such as
Ironic hosts), as it will significantly cut down on issued retries.

Other deployer impact
---------------------

There may be several config options deployers would need to consider.
Defaults may be chosen in such a way as not to change previous behaviour.

Developer impact
----------------

Developers would need to understand that there is now locking going on in
the scheduler, and consider this when making changes to the code, especially
when adding additional resources.

Implementation
==============

Assignee(s)
-----------

Primary assignee:


Work Items
----------

* Modify HostState.consume_from_instance() to use the Claim logic and
  acquire a HostState instance-wide lock for doing so (merged in Mitaka).

* Modify HostState.update_from_compute_node() to acquire a HostState
  instance-wide lock for updating the host state (merged in Mitaka).

* Refactor the Claim classes to not depend directly on the resource_tracker,
  so that they can be used in the scheduler code and possibly move out of
  the compute/ subtree.

* Modify the FilterScheduler._schedule() method to expect a claim
  transaction failure and take appropriate action.


Dependencies
============

None

Testing
=======

As is usually the case with race problems, it is notoriously difficult to
come up with deterministic tests. Testing will be limited to unit tests
making sure that the proper synchronisation primitives are called as
expected. A sketch of such a test is included at the end of this document.

Documentation Impact
====================

There may be an additional config option to turn on the transactional nature
of consume_from_instance(), and possibly another one to tell the scheduler
to go ahead and attempt to land an instance even though a local claim
failed.

References
==========

.. [1] The Retry mechanism works kind of like a 2PC: the instance resource
   usage is consumed on the in-memory view the scheduler has, but is only
   committed to the DB when the request makes it to the chosen compute host,
   and under a global resource lock.
.. [2] This `bug ` shows that this is pretty bad in the case of Ironic.
.. [3] I say potentially because there is a check of a timestamp to see
   whether the HostState has actually been updated more recently than the
   ComputeNode record (with in-flight requests not yet claimed on their
   compute hosts).
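
As a follow-up to the Testing section above, below is a minimal sketch of
the kind of unit test envisioned there; it exercises the simplified,
illustrative HostState from this spec rather than Nova's, and assumes
unittest.mock from the standard library (the external mock package on
Python 2). A MagicMock supports the context manager protocol, so it can
stand in for the real lock and record that consume_from_instance() entered
it::

    import threading
    import unittest
    from unittest import mock


    class HostState(object):
        # Reduced copy of the illustrative HostState sketched earlier.
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb
            self._lock = threading.Lock()

        def consume_from_instance(self, ram_request_mb):
            with self._lock:
                self.free_ram_mb -= ram_request_mb


    class TestHostStateLocking(unittest.TestCase):
        def test_consume_enters_lock(self):
            host = HostState(free_ram_mb=2048)
            # Swap the real lock for a mock that records its use.
            host._lock = mock.MagicMock()
            host.consume_from_instance(1024)
            host._lock.__enter__.assert_called_once_with()


    if __name__ == '__main__':
        unittest.main()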