rally/doc/specs/in-progress/distributed_runner.rst
Illia Khudoshyn 335d2d5ead [Spec]Add a spec for distiributed load generation
Change-Id: If2462a1142bbf1ce49cd61ac114ab0ced7394ed8
2015-11-19 14:45:06 +00:00


..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..
 This template should be in ReSTructured text. The filename in the git
 repository should match the launchpad URL, for example a URL of
 https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
 awesome-thing.rst . Please do not delete any of the sections in this
 template. If you have nothing to say for a whole section, just write: None
 For help with syntax, see http://sphinx-doc.org/rest.html
 To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html

============================
Implement Distributed Runner
============================
We need a Distributed Runner in Rally that will run tasks on many nodes
simultaneously.
Problem description
===================
Rally currently provides several runners, but all of them can run only on
the host that Rally itself runs on. This limits the test load that Rally can
generate; in some cases the required load cannot be generated from one host.

In the current implementation, the Runner object runs the actual subtask and
generates test results, while TaskEngine, via ResultConsumer, retrieves these
results, checks them against the specified SLA and stores them in the DB.

There are several aspects that should be kept in mind when reasoning about
distributed load generation:

- Even one active runner can produce so much result data that TaskEngine
  can barely process it in time. We assume that a single TaskEngine instance
  will not be able to process several streams of raw test result data from
  several simultaneous runners.

- We need test results to be checked against the SLA as soon as possible, so
  that we can stop load generation immediately (or close to it) on an SLA
  violation and protect the environment being tested. On the other hand, the
  results from all runners need to be analysed, i.e. checking the SLA on a
  single runner is not enough.

- Since we expect long task durations, we want to provide the user with at
  least partial information about task execution as soon as possible.

Proposed change
===============
It is proposed to introduce two new components, RunnerAgent and
DistributedRunner (a new plugin of the runner type), and to refactor the
existing components TaskEngine, Runner and SLA, so that the overall
interaction looks as follows.

.. image:: ../../source/images/Rally_Distributed_Runner.png
   :align: center

1. TaskEngine

   - creates the subtask context
   - creates an instance of Runner
   - calls Runner.run() with the context object and information about the
     scenario
   - consumes iteration result chunks and SLA data from Runner in a separate
     thread
   - deletes the context

2. RunnerAgent

   - is executed on agent nodes
   - runs Runner for the received task iterations with the given context and
     args
   - collects iteration result chunks, stores them on the local filesystem
     and sends them to DistributedRunner on request
   - aggregates SLA data and periodically sends it to DistributedRunner
   - stops Runner on receipt of the corresponding message

3. DistributedRunner

   - is a regular plugin of the Runner type
   - communicates with remote RunnerAgents via a message queue (ZeroMQ)
   - provides context, args and SLA to RunnerAgents
   - distributes task iterations to RunnerAgents
   - aggregates SLA data from RunnerAgents
   - merges chunks of task result data

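The spec does not say how DistributedRunner divides task iterations among
RunnerAgents. As a minimal sketch of one possible policy, the hypothetical
helper below (``split_iterations`` and the agent identifiers are illustrative
names, not part of Rally) assigns each agent a contiguous, near-equal range of
iteration indexes.

```python
def split_iterations(total_iterations, agent_ids):
    """Assign each agent a contiguous range of iteration indexes.

    Hypothetical helper: an even split is an assumption made for this
    sketch; the spec leaves the distribution policy open.
    """
    if not agent_ids:
        raise ValueError("at least one agent is required")
    base, extra = divmod(total_iterations, len(agent_ids))
    assignments = {}
    start = 0
    for position, agent_id in enumerate(agent_ids):
        # The first `extra` agents take one additional iteration each,
        # so every iteration is assigned exactly once.
        size = base + (1 if position < extra else 0)
        assignments[agent_id] = range(start, start + size)
        start += size
    return assignments


if __name__ == "__main__":
    for agent, iterations in split_iterations(10, ["agent-1", "agent-2"]).items():
        print(agent, list(iterations))
```

A static split like this is the simplest option. A runner that hands out
iterations on demand would cope better with agents of unequal speed, at the
cost of more messages.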
Separate communication channels are to be used for task results and SLA
data:

- SLA data is sent periodically (e.g. once per second) for iterations
  that have already finished.
- Task results are collected into chunks, stored locally by RunnerAgent
  and sent only on request.

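To illustrate why SLA data has to be aggregated across agents, here is a
minimal sketch of a mergeable SLA summary. The ``SLAAggregate`` structure, its
fields and the ``sla_violated`` check are assumptions made for this example.
The spec does not define the format of the aggregated SLA data.

```python
from dataclasses import dataclass


@dataclass
class SLAAggregate:
    """Mergeable summary of finished iterations (hypothetical structure)."""

    iterations: int = 0
    failures: int = 0
    total_duration: float = 0.0
    max_duration: float = 0.0

    def add(self, duration, failed):
        """Record one finished iteration on the agent side."""
        self.iterations += 1
        self.failures += int(failed)
        self.total_duration += duration
        self.max_duration = max(self.max_duration, duration)

    def merge(self, other):
        """Combine two summaries without access to the raw results."""
        return SLAAggregate(
            self.iterations + other.iterations,
            self.failures + other.failures,
            self.total_duration + other.total_duration,
            max(self.max_duration, other.max_duration),
        )

    @property
    def failure_rate(self):
        return self.failures / self.iterations if self.iterations else 0.0


def sla_violated(aggregates, max_failure_rate):
    """Check a failure-rate limit against the data from all agents."""
    total = SLAAggregate()
    for aggregate in aggregates:
        total = total.merge(aggregate)
    return total.failure_rate > max_failure_rate
```

With a limit of 0.2, an agent reporting 1 failure in 10 iterations passes on
its own, but merged with an agent reporting 5 failures in 10 iterations the
combined rate is 6/20 = 0.3 and the check fails. Only counters and extrema
travel on the SLA channel, so the periodic messages stay small regardless of
how many iterations have finished.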
Alternatives
------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Illia Khudoshyn
Work Items
----------
- Refactor the current SLA mechanism to support aggregated SLA data

- Refactor the current Runner base class

  - collect iteration results into chunks, ordered by timestamp
  - perform local SLA checks
  - aggregate SLA data

- Refactor TaskEngine to reflect the changes in Runner

  - operate on chunks of ordered test results rather than on a stream of
    raw result items
  - apply SLA checks to aggregated SLA data
  - analyze SLA data and consume test results in separate threads

- Develop infrastructure that allows Rally to be configured and run on
  multiple nodes

- Implement RunnerAgent

  - run Runner
  - cache prepared chunks of iteration results
  - communicate via ZMQ with DistributedRunner (send task results
    and SLA data on separate channels)
  - terminate Runner on a 'stop' command from TaskEngine

- Implement DistributedRunner that will

  - feed tasks to RunnerAgents
  - receive chunks of result data from RunnerAgents, merge them and
    provide the merged data to TaskEngine
  - receive aggregated SLA data from RunnerAgents, merge it
    and provide the data to TaskEngine
  - relay the 'stop' command from TaskEngine to RunnerAgents

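Because each Runner produces chunks that are already ordered by timestamp,
merging them in DistributedRunner can be a streaming k-way merge rather than a
full sort. The sketch below assumes each result item is a dict with a
``timestamp`` key and that chunks from one agent arrive in order. Both are
assumptions made for illustration, and ``merge_chunks`` is a hypothetical name.

```python
import heapq


def merge_chunks(chunks_by_agent):
    """Merge per-agent chunks into one list ordered by timestamp.

    `chunks_by_agent` maps an agent id to a list of chunks; each chunk is a
    list of result items sorted by their "timestamp" key (assumed layout).
    """
    streams = []
    for chunks in chunks_by_agent.values():
        # Flatten one agent's chunks lazily; order within an agent is
        # assumed to be preserved across consecutive chunks.
        streams.append(item for chunk in chunks for item in chunk)
    # heapq.merge relies on every input stream being sorted already.
    return list(heapq.merge(*streams, key=lambda item: item["timestamp"]))


if __name__ == "__main__":
    merged = merge_chunks({
        "agent-1": [[{"timestamp": 1.0}, {"timestamp": 3.0}], [{"timestamp": 5.0}]],
        "agent-2": [[{"timestamp": 2.0}], [{"timestamp": 4.0}]],
    })
    print([item["timestamp"] for item in merged])
```

``heapq.merge`` keeps only one pending item per agent in memory, which suits
the requirement that TaskEngine operates on ordered chunks instead of a raw
stream of result items.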
Dependencies
============
- DB model refactoring (boris-42)
- Report generation refactoring (amaretsky)