From f8b167a2951f82f8320795afff607c1e8abc8cf4 Mon Sep 17 00:00:00 2001
From: Nikola Dipanov
Date: Mon, 21 Mar 2016 14:59:58 +0000
Subject: [PATCH] Introduce host state level locking

Change-Id: Idbefac509ab7a1bfb417a0debc16fd8203736f16
Previously-approved: Mitaka
---
 .../approved/host-state-level-locking.rst | 219 ++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 specs/newton/approved/host-state-level-locking.rst

diff --git a/specs/newton/approved/host-state-level-locking.rst b/specs/newton/approved/host-state-level-locking.rst
new file mode 100644
index 000000000..849dcf7a7
--- /dev/null
+++ b/specs/newton/approved/host-state-level-locking.rst
@@ -0,0 +1,219 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Scheduler: Introduce HostState level locking
============================================

https://blueprints.launchpad.net/nova/+spec/host-state-level-locking

The Nova FilterScheduler implementation, even though inherently
multi-threaded, uses no locking for access to the in-memory HostState data
structures that are shared between all active threads. Even though this means
that many of the decisions the scheduler makes under load are not internally
consistent, it is not necessarily a huge issue for the basic use case, as Nova
makes sure that the configured resource usage policy is maintained despite
races by using the retry mechanism [1]_. It can, however, cause issues in
several more complex use cases. A non-exhaustive list of examples: high
resource utilization, high load, and specific types of hosts and resources
(e.g. Ironic nodes [2]_, and complex resources such as NUMA topology or PCI
devices).

We propose to change the scheduler code to use a lightweight transactional
approach that avoids full-blown locking while still mitigating some of the
race conditions.


Problem description
===================

The scheduler service is inherently multi-threaded, as it currently runs an
oslo-messaging RpcServer using an EventletExecutor. This means that every
incoming RPC message for select_destinations will be dispatched in its own
green thread.

Upon receiving the message, every green thread will read all ComputeNode
states from the database and potentially [3]_ populate the internal global
data structure that holds the host states used for filtering.

Further along, after choosing a host, each thread will call the
HostState.consume_from_instance() method on the chosen object, which will
"consume" the resources for the instance being scheduled from the chosen
HostState object. This is the equivalent of what the Claims code does once
the request makes it to a nova-compute service, except that instead of
updating the ComputeNode table, it updates the scheduler service's in-memory
HostState object.

However, there is no mutual exclusion of threads between the time a filter
function runs and decides that a host passes and the time that host's state
is finally consumed from. A number of other concurrent threads could have
already updated the same host state in the meantime: a classic race
condition (a sketch of the race follows the list below). Once we consider
this, some obvious avenues for improvement arise.

1. When calling consume_from_instance() we are basically doing a claim of
   resources on the host state, which may have changed since the filter
   function that decided to pass the host ran. At that point we have all the
   information needed to know early that a claim is going to fail, and to
   try to choose a different host. This is roughly equivalent to retrying a
   transaction.

   It is worth noting here that even though a host may seem like it is going
   to fail the claim, we may still want to choose it: we never drop the
   resources consumed on a HostState, even after we register a retry from an
   already chosen compute host in the same refresh cycle, so the failure may
   in fact be a false negative.

2. There needs to be some kind of locking that is granular enough not to
   cause too much unnecessary overhead, but that still allows for more
   consistent handling of HostState.
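
To make the race concrete, below is a minimal, illustrative sketch; it is
not Nova code. Plain threads and a sleep stand in for eventlet green threads
yielding on I/O, and HostState is reduced to a single RAM counter. Both
requests pass the read-only filter check against the same state, and the
host ends up overcommitted::

    import threading
    import time


    class HostState(object):
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb


    def schedule(host, ram_request_mb):
        # Filter step: a read-only check that the host has room.
        if host.free_ram_mb >= ram_request_mb:
            # A green thread may yield here (e.g. on a DB read), letting
            # a concurrent request pass the same filter on stale state.
            time.sleep(0.01)
            # consume_from_instance(): an unsynchronized check-then-act.
            host.free_ram_mb -= ram_request_mb


    host = HostState(free_ram_mb=1024)
    threads = [threading.Thread(target=schedule, args=(host, 1024))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Both requests passed the filter, so the host is now overcommitted.
    print(host.free_ram_mb)  # -1024
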

Use Cases
---------

There is no specific use case that this is aimed at. It is an internal
refactoring aimed at improving data consistency in the scheduler, and thus
the overall effectiveness of placement decisions.


Proposed change
===============

Firstly, it would be very useful to use the Claim logic instead of (or
inside) HostState.consume_from_instance(), as the two are almost complete
duplicates of each other.

The next change in scope for this blueprint is adding synchronisation
primitives around accessing and updating HostState fields. A lightweight
approach would be to use no synchronisation primitives in the filters, as
access to the host state there is a) read-only and b) usually per resource.
consume_from_instance() is the place where we want to make sure access is
synchronized: once a host is chosen, it needs to have resources consumed
(remember, many concurrent threads could be trying to consume resources from
the same HostState), and if any of the "claims" fails, no resources should
be consumed at all. Updating the host state with fresh values after a DB
read should also be synchronized.

The final piece of the puzzle is modifying the FilterScheduler._schedule()
method to take into account a failed claim in consume_from_instance() and
either try the next host that passed the filters, or choose to ignore the
local in-memory failure and risk a retry from the compute host. Both pieces
are sketched below.
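
Below is a minimal sketch of the locking and retry pieces taken together; it
is illustrative, not the Nova implementation. A plain threading.Lock stands
in for Nova's synchronisation helpers, HostState is again reduced to a
single RAM counter, and ClaimFailed and schedule_one() are hypothetical
names used only in this sketch::

    import threading


    class ClaimFailed(Exception):
        """Raised when a host no longer has room for the request."""


    class HostState(object):
        def __init__(self, host, free_ram_mb):
            self.host = host
            self.free_ram_mb = free_ram_mb
            # One lock per HostState instance keeps the locking granular.
            self._lock = threading.Lock()

        def consume_from_instance(self, ram_request_mb):
            # All-or-nothing "claim": re-check under the lock, because
            # the state may have changed since the filters ran.
            with self._lock:
                if self.free_ram_mb < ram_request_mb:
                    raise ClaimFailed()
                self.free_ram_mb -= ram_request_mb

        def update_from_compute_node(self, free_ram_mb):
            # Refreshing from a DB read takes the same lock, so
            # concurrent claims never observe a half-updated state.
            with self._lock:
                self.free_ram_mb = free_ram_mb


    def schedule_one(filtered_hosts, ram_request_mb):
        # Roughly the shape of the proposed FilterScheduler._schedule()
        # change: on a failed local claim, fall through to the next host
        # that passed the filters instead of returning a stale choice.
        for host in filtered_hosts:
            try:
                host.consume_from_instance(ram_request_mb)
                return host
            except ClaimFailed:
                continue
        return None  # no valid host; leads to the usual NoValidHost path

As noted in the problem description, a failed local claim may be a false
negative, so a deployment could also be allowed to keep the chosen host
anyway and accept the risk of a retry from the compute node (see the config
options mentioned below).
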
It is worth noting that this proposal only looks at fixing data consistency
among the threads of a single nova-scheduler process. Running several
workers still means that their internal state is going to be inconsistent
between updates from the database. Fixing that is outside the scope of this
proposal.

Alternatives
------------

There are a number of ways we could re-design the scheduler so that the
issues discussed in this spec become irrelevant. This blueprint aims to
improve some obvious issues with the current implementation of the scheduler
without changing the basic design.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Even though after this change every request will incur some synchronisation
overhead, which may increase the average response time for basic workloads,
I fully expect it to massively improve performance under a large number of
requests or low overall cloud capacity (or scarce specific resources such as
Ironic hosts), as it will significantly cut down on issued retries.

Other deployer impact
---------------------

There may be several config options deployers would need to consider.
Defaults may be chosen in such a way as not to change previous behaviour.

Developer impact
----------------

Developers would need to understand that there is now locking going on in
the scheduler, and consider this when making changes to the code, especially
when adding additional resources.

Implementation
==============

Assignee(s)
-----------

Primary assignee:


Work Items
----------

* Modify HostState.consume_from_instance() to use the Claim logic and
  acquire a HostState instance-wide lock for doing so (merged in Mitaka).

* Modify HostState.update_from_compute_node() to acquire a HostState
  instance-wide lock for updating the host state (merged in Mitaka).

* Refactor the Claim classes to not depend directly on the resource_tracker,
  so that they can be used in the scheduler code and possibly move out of
  the compute/ subtree.

* Modify the FilterScheduler._schedule() method to expect a claim
  transaction failure and take appropriate action.


Dependencies
============

None

Testing
=======

As is usually the case with race problems, it is notoriously difficult to
come up with deterministic tests. Testing will be limited to unit tests
making sure that the proper synchronisation primitives are called as
expected. A sketch of such a test is included at the end of this document.

Documentation Impact
====================

There may be an additional config option to turn on the transactional nature
of consume_from_instance(), and possibly another one to tell the scheduler
to go ahead and attempt to land an instance even though a local claim
failed.

References
==========

.. [1] The Retry mechanism works kind of like a 2PC: the instance resource
   usage is consumed on the in-memory view the scheduler has, but is only
   committed to the DB when the request makes it to the chosen compute host,
   and under a global resource lock.
.. [2] This `bug ` shows that this is pretty bad in the case of Ironic.
.. [3] I say potentially because there is a check of a timestamp to see
   whether the HostState has actually been updated more recently than the
   ComputeNode record (with in-flight requests not yet claimed on their
   compute hosts).
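
As a follow-up to the Testing section above, below is a minimal sketch of
the kind of unit test envisioned there; it exercises the simplified,
illustrative HostState from this spec rather than Nova's, and assumes
unittest.mock from the standard library (the external mock package on
Python 2). A MagicMock supports the context manager protocol, so it can
stand in for the real lock and record that consume_from_instance() entered
it::

    import threading
    import unittest
    from unittest import mock


    class HostState(object):
        # Reduced copy of the illustrative HostState sketched earlier.
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb
            self._lock = threading.Lock()

        def consume_from_instance(self, ram_request_mb):
            with self._lock:
                self.free_ram_mb -= ram_request_mb


    class TestHostStateLocking(unittest.TestCase):
        def test_consume_enters_lock(self):
            host = HostState(free_ram_mb=2048)
            # Swap the real lock for a mock that records its use.
            host._lock = mock.MagicMock()
            host.consume_from_instance(1024)
            host._lock.__enter__.assert_called_once_with()


    if __name__ == '__main__':
        unittest.main()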