[Spec]Add a spec for distiributed load generation
Change-Id: If2462a1142bbf1ce49cd61ac114ab0ced7394ed8
This commit is contained in:
parent
42138277ce
commit
58b8859f43
BIN
doc/source/images/Rally_Distributed_Runner.png
Normal file
BIN
doc/source/images/Rally_Distributed_Runner.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 164 KiB |
153
doc/specs/in-progress/distributed_runner.rst
Normal file
153
doc/specs/in-progress/distributed_runner.rst
Normal file
@ -0,0 +1,153 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
This template should be in ReSTructured text. The filename in the git
|
||||
repository should match the launchpad URL, for example a URL of
|
||||
https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
|
||||
awesome-thing.rst . Please do not delete any of the sections in this
|
||||
template. If you have nothing to say for a whole section, just write: None
|
||||
For help with syntax, see http://sphinx-doc.org/rest.html
|
||||
To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html
|
||||
|
||||
|
||||
============================
|
||||
Implement Distributed Runner
|
||||
============================
|
||||
|
||||
We need a Distributed Runner in Rally that will run tasks on many nodes
|
||||
simultaneously.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Currently there are several runners in Rally, but they all can only run on
|
||||
the same host that Rally itself runs on. It limits test load that Rally can
|
||||
generate. In some cases required load can not be generated from one host.
|
||||
|
||||
In current implementation Runner object runs actual subtask and generates test
|
||||
results while TaskEngine via ResultConsumer retrieves these results,
|
||||
checks them against specified SLA and stores in DB.
|
||||
|
||||
There are several aspects that should be kept in mind when reasoning about
|
||||
distributed load generation:
|
||||
|
||||
- Even one active runner is able to produce significant amounts of result data
|
||||
so that TaskEngine could barely process it in time. We assume that
|
||||
the single TaskEngine instance definitely will not be able to process
|
||||
several streams of raw test result data from several simultaneous runners.
|
||||
|
||||
- We need test results to be checked against SLA as soon as possible so that
|
||||
we could stop load generation on SLA violation immediately (or close to)
|
||||
and protect the environment being tested. On the other hand we need results
|
||||
from all runners to be analysed, i.e. checking SLA on a single runner is not
|
||||
enough.
|
||||
|
||||
- Since we expect long task duration we want to provide to user at least
|
||||
partial information about task execution as soon as possible.
|
||||
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
It is proposed to introduce two new component, RunnerAgent and a new plugin
|
||||
of runner type, DistributedRunner, and refactor existing components,
|
||||
TaskEngine, Runner and SLA, so that overall interaction will look as follows.
|
||||
|
||||
.. image:: ../../source/images/Rally_Distributed_Runner.png
|
||||
:align: center
|
||||
|
||||
|
||||
1. TaskEngine
|
||||
|
||||
- create subtask context
|
||||
- create instance of Runner
|
||||
- run Runner.run() with context object and info about sceanario
|
||||
- in separated thread consume iteration result chunks & SLA from Runner
|
||||
- delete context
|
||||
|
||||
2. RunnerAgent
|
||||
- is executed on agent nodes
|
||||
- runs Runner for received task iterations with given context and args
|
||||
- collects iteration result chunks, stores them on local filesystem,
|
||||
sends them on request to DistributedRunner
|
||||
- aggregates SLA data and periodically sends it to DistributedRunner
|
||||
- stops Runner on receive of corresponding message
|
||||
|
||||
3. DistributedRunner
|
||||
- is a regular plugin of Runner type
|
||||
- communicates with remote RunnerAgents wia message queue (ZeroMQ)
|
||||
- provides context, args and SLA to RunnerAgents
|
||||
- distributes task iterations to RunnerAgents
|
||||
- aggregates SLA data from RunnerAgents
|
||||
- merges chunks of task result data
|
||||
|
||||
It is supposed to use separate communication channels for task results
|
||||
and SLA data.
|
||||
- SLA data is sent periodically (e.g. once per second) for iterations
|
||||
that are already finished.
|
||||
- Task results are collected into chunks and stored locally by
|
||||
RunnerAgent and only send on request.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
No way
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Illia Khudoshyn
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
- Refactor current SLA mechanism to support aggregated SLA data
|
||||
|
||||
- Refactor current Runner base class
|
||||
- collect iteration results into chunks, ordered by timestamp
|
||||
- perform local SLA checks
|
||||
- aggregate SLA data
|
||||
|
||||
- Refactor TaskEngine to reflect changes in Runner
|
||||
- operate chunks of ordered test results rather then stream of raw
|
||||
result items
|
||||
- apply SLA checks to aggregated SLA data
|
||||
- analyze SLA data and consume test results in separate threads
|
||||
|
||||
- Develop infrastructure that will allow multi-node Rally configuration
|
||||
and run
|
||||
|
||||
- Implement RunnerAgent
|
||||
- run Runner
|
||||
- cache prepared chunks of iteration results
|
||||
- comunicate via ZMQ with DistributedRunner(send task results
|
||||
and SLA on separate channels)
|
||||
- terminate Runner on 'stop' command from TaskEngine
|
||||
|
||||
- Implement DistributedRunner that will
|
||||
- feed tasks to RunnerAgents
|
||||
- receive chunks of result data from RunnerAgents, merge it and
|
||||
provide merged data to TaskEngine
|
||||
- receive aggregated SLA data from RunnerAgents, merge it
|
||||
and provide data to TaskEngine
|
||||
- translate 'stop' command from TaskEngine to RunnerAgents
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- DB model refactoring (boris-42)
|
||||
- Report generation refactoring (amaretsky)
|
Loading…
Reference in New Issue
Block a user