diff --git a/doc/source/images/Rally_Distributed_Runner.png b/doc/source/images/Rally_Distributed_Runner.png
new file mode 100644
index 00000000..5899a5d1
Binary files /dev/null and b/doc/source/images/Rally_Distributed_Runner.png differ
diff --git a/doc/specs/in-progress/distributed_runner.rst b/doc/specs/in-progress/distributed_runner.rst
new file mode 100644
index 00000000..3a14521e
--- /dev/null
+++ b/doc/specs/in-progress/distributed_runner.rst
@@ -0,0 +1,153 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+..
+  This template should be in ReSTructured text. The filename in the git
+  repository should match the launchpad URL, for example a URL of
+  https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
+  awesome-thing.rst. Please do not delete any of the sections in this
+  template. If you have nothing to say for a whole section, just write: None
+  For help with syntax, see http://sphinx-doc.org/rest.html
+  To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html
+
+
+============================
+Implement Distributed Runner
+============================
+
+We need a Distributed Runner in Rally that will run tasks on many nodes
+simultaneously.
+
+Problem description
+===================
+
+Currently there are several runners in Rally, but they can all run only on
+the same host that Rally itself runs on. This limits the test load that
+Rally can generate: in some cases the required load cannot be generated
+from a single host.
+
+In the current implementation the Runner object runs the actual subtask and
+generates test results, while TaskEngine, via ResultConsumer, retrieves
+these results, checks them against the specified SLA and stores them in
+the DB.
+
+Several aspects should be kept in mind when reasoning about distributed
+load generation:
+
+- Even a single active runner can produce so much result data that
+  TaskEngine can barely process it in time. We assume that a single
+  TaskEngine instance will not be able to process several streams of raw
+  test result data from several simultaneous runners.
+
+- We need test results to be checked against the SLA as soon as possible,
+  so that on an SLA violation we can stop load generation immediately
+  (or close to it) and protect the environment being tested. On the other
+  hand, we need results from all runners to be analysed, i.e. checking
+  the SLA on a single runner is not enough.
+
+- Since we expect long task durations, we want to provide the user with at
+  least partial information about task execution as soon as possible.
+
+
+Proposed change
+===============
+
+It is proposed to introduce two new components, a RunnerAgent and a new
+plugin of runner type, DistributedRunner, and to refactor the existing
+components TaskEngine, Runner and SLA, so that the overall interaction
+looks as follows.
+
+.. image:: ../../source/images/Rally_Distributed_Runner.png
+   :align: center
+
+
+1. TaskEngine
+
+   - creates the subtask context
+   - creates an instance of Runner
+   - runs Runner.run() with the context object and info about the scenario
+   - in a separate thread, consumes iteration result chunks and SLA data
+     from Runner
+   - deletes the context
+
+2. RunnerAgent
+
+   - is executed on agent nodes
+   - runs Runner for the received task iterations with the given context
+     and args
+   - collects iteration result chunks, stores them on the local filesystem
+     and sends them to DistributedRunner on request
+   - aggregates SLA data and periodically sends it to DistributedRunner
+   - stops Runner on receipt of the corresponding message
+
+3. DistributedRunner
+
+   - is a regular plugin of Runner type
+   - communicates with remote RunnerAgents via a message queue (ZeroMQ)
+   - provides context, args and SLA to RunnerAgents
+   - distributes task iterations to RunnerAgents
+   - aggregates SLA data from RunnerAgents
+   - merges chunks of task result data
+
+Separate communication channels are used for task results and SLA data
+(a sketch of the exchange follows the list below):
+
+- SLA data is sent periodically (e.g. once per second) for iterations
+  that have already finished.
+- Task results are collected into chunks, stored locally by RunnerAgent
+  and only sent on request.
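+
+The following sketch shows roughly what this two-channel exchange could
+look like on the agent side. It is illustrative only: the socket types,
+endpoints and helper names (``next_sla_aggregate``, ``chunks``) are
+assumptions made for the example, not part of the design.
+
+.. code-block:: python
+
+    import zmq
+
+    def agent_loop(next_sla_aggregate, chunks):
+        ctx = zmq.Context()
+
+        # Channel 1: aggregated SLA data, pushed periodically.
+        sla = ctx.socket(zmq.PUSH)
+        sla.connect("tcp://distributed-runner:5557")
+
+        # Channel 2: result chunks, served only when requested.
+        results = ctx.socket(zmq.REP)
+        results.bind("tcp://*:5558")
+
+        poller = zmq.Poller()
+        poller.register(results, zmq.POLLIN)
+
+        while True:
+            sla.send_json(next_sla_aggregate())  # e.g. once per second
+            # Wait up to a second for a chunk request before the next SLA
+            # push; reply with the oldest locally stored chunk, if any.
+            if dict(poller.poll(timeout=1000)).get(results):
+                results.recv_json()
+                results.send_json(chunks.pop(0) if chunks else None)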
+
+
+Alternatives
+------------
+
+None
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Illia Khudoshyn
+
+
+Work Items
+----------
+
+- Refactor the current SLA mechanism to support aggregated SLA data
+
+- Refactor the current Runner base class:
+
+  - collect iteration results into chunks, ordered by timestamp
+  - perform local SLA checks
+  - aggregate SLA data
+
+- Refactor TaskEngine to reflect the changes in Runner:
+
+  - operate on chunks of ordered test results rather than a stream of raw
+    result items
+  - apply SLA checks to aggregated SLA data
+  - analyze SLA data and consume test results in separate threads
+
+- Develop infrastructure that will allow multi-node Rally configuration
+  and runs
+
+- Implement RunnerAgent:
+
+  - run Runner
+  - cache prepared chunks of iteration results
+  - communicate via ZMQ with DistributedRunner (send task results
+    and SLA data on separate channels)
+  - terminate Runner on the 'stop' command from TaskEngine
+
+- Implement DistributedRunner that will:
+
+  - feed tasks to RunnerAgents
+  - receive chunks of result data from RunnerAgents, merge them and
+    provide the merged data to TaskEngine (a merge sketch is given at
+    the end of this spec)
+  - receive aggregated SLA data from RunnerAgents, merge it
+    and provide the merged data to TaskEngine
+  - translate the 'stop' command from TaskEngine to RunnerAgents
+
+Dependencies
+============
+
+- DB model refactoring (boris-42)
+- Report generation refactoring (amaretsky)
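+
+As a non-normative illustration of the chunk merging mentioned in the Work
+Items: since every RunnerAgent already orders its chunks by timestamp, the
+DistributedRunner side can restore a single, globally ordered stream with a
+plain k-way merge. The field name ``timestamp`` is an assumption made for
+the example.
+
+.. code-block:: python
+
+    import heapq
+
+    def merge_agent_results(per_agent_results):
+        # Each agent delivers its results already sorted by timestamp,
+        # so a k-way merge yields one globally ordered list.
+        return list(heapq.merge(*per_agent_results,
+                                key=lambda result: result["timestamp"]))
+
+    merged = merge_agent_results([
+        [{"timestamp": 1.0}, {"timestamp": 3.0}],  # agent A
+        [{"timestamp": 2.0}, {"timestamp": 4.0}],  # agent B
+    ])
+    # merged is ordered: timestamps 1.0, 2.0, 3.0, 4.0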