7b71f096b9
This patch delivers the first working version of a distributed scheduler implementation based on local and persistent job queues. The idea is inspired by the parallel computing pattern known as "work stealing", although it doesn't follow that pattern exactly because of the nature of Mistral. See https://en.wikipedia.org/wiki/Work_stealing for details.

Advantages of this scheduler implementation:

* No job processing delays caused by DB polling intervals while the cluster topology is stable. A job gets scheduled in memory and is also saved into persistent storage for reliability. A persistent job can be picked up only after a configured allowed period of time, so in practice that happens only after the node responsible for local processing has crashed.
* Low DB load. DB polling still exists, but it is no longer the primary scheduling mechanism. It is now a protection against node crashes. That means the polling interval can be made large, for example 30 seconds instead of 1-2 seconds. Less DB load leads to fewer DB deadlocks between scheduler instances and fewer retries on MySQL.
* Better scalability as a result of the lower DB load. A larger number of engines no longer leads to much higher contention, because the DB polling intervals are long.
* Protection from jobs hanging forever in the processing state. In the existing implementation, if a scheduler captured a job for processing (set its "processing" flag to True) and then crashed, the job stayed in the processing state in the DB forever. Instead of a boolean "processing" flag, the new implementation uses a timestamp showing when a job was captured. That makes it possible to make such jobs eligible for recapturing and further processing after a certain configured timeout.
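The capture rules described above (a delayed pickup of persistent jobs, and a capture timestamp in place of a boolean flag) can be illustrated with a minimal sketch. This is not the code from this patch: `ScheduledJob`, `is_eligible`, `capture`, `pickup_delay` and `capture_timeout` are hypothetical names chosen for illustration, and a real implementation would have to perform the capture as an atomic conditional update in the DB rather than on an in-memory object.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScheduledJob:
    """Hypothetical stand-in for a row in the scheduled jobs table."""
    job_id: str
    execute_at: datetime
    # Replaces a boolean "processing" flag; None means the job is not captured.
    captured_at: Optional[datetime] = None


def is_eligible(job: ScheduledJob, now: datetime,
                pickup_delay: timedelta, capture_timeout: timedelta) -> bool:
    """Return True if a polling scheduler may take this persistent job."""
    if job.captured_at is None:
        # Give the node that scheduled the job in memory the first chance:
        # the persistent copy becomes visible only after the allowed delay.
        return now >= job.execute_at + pickup_delay
    # A captured job becomes eligible again once the capture is stale,
    # which is assumed to mean that the capturing scheduler crashed.
    return now - job.captured_at >= capture_timeout


def capture(job: ScheduledJob, now: datetime,
            pickup_delay: timedelta, capture_timeout: timedelta) -> bool:
    """Try to capture the job; return True on success."""
    if not is_eligible(job, now, pickup_delay, capture_timeout):
        return False
    job.captured_at = now
    return True
```

With a 30 second pickup delay and a 60 second capture timeout, a job is invisible to pollers for the first 30 seconds after its execution time, and a job captured by a scheduler that then crashed can be captured again 60 seconds after the original capture.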

TODO:

* More testing
* DB migration for the new scheduled jobs table
* Benchmarks and testing under load
* Standardize the scheduler interface and write an adapter for the existing scheduler so that we can choose between scheduler implementations. It is highly desirable to make the transition to the new scheduler smooth in production: we always need to be able to roll back to the existing scheduler.

Partial blueprint: mistral-redesign-scheduler
Partial blueprint: mistral-eliminate-scheduler-delays

Change-Id: If7d06b64ac14d01e80d31242e1640cb93f2aa6fe |
||
---|---|---|
.. | ||
__init__.py | ||
test_scheduler.py |