..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..
 This template should be in ReSTructured text. The filename in the git
 repository should match the launchpad URL, for example a URL of
 https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
 awesome-thing.rst . Please do not delete any of the sections in this
 template. If you have nothing to say for a whole section, just write: None

 For help with syntax, see http://sphinx-doc.org/rest.html
 To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html

============================
Implement Distributed Runner
============================

We need a Distributed Runner in Rally that will run tasks on many nodes
simultaneously.

Problem description
===================

Currently there are several runners in Rally, but they can all run only on
the same host as Rally itself. This limits the test load that Rally can
generate; in some cases the required load cannot be generated from a single
host.

In the current implementation the Runner object runs the actual subtask and
generates test results, while the TaskEngine, via the ResultConsumer,
retrieves these results, checks them against the specified SLA and stores
them in the DB.

There are several aspects that should be kept in mind when reasoning about
distributed load generation:

- Even one active runner can produce so much result data that the TaskEngine
  barely processes it in time. We assume that a single TaskEngine instance
  will certainly not be able to process several streams of raw test result
  data from several simultaneous runners.

- Test results need to be checked against the SLA as soon as possible, so
  that we can stop load generation immediately (or nearly so) on an SLA
  violation and protect the environment under test. On the other hand, we
  need the results from all runners to be analysed, i.e. checking the SLA on
  a single runner is not enough.

- Since we expect long task durations, we want to provide the user with at
  least partial information about task execution as soon as possible.

Proposed change
===============

It is proposed to introduce two new components: a RunnerAgent, and a new
plugin of runner type, the DistributedRunner. The existing components
(TaskEngine, Runner and SLA) will be refactored so that the overall
interaction looks as follows; a rough Python sketch of the flow appears
after the list.

.. image:: ../../source/images/Rally_Distributed_Runner.png
   :align: center

1. TaskEngine

   - creates the subtask context
   - creates an instance of Runner
   - runs Runner.run() with the context object and info about the scenario
   - consumes iteration result chunks & SLA data from the Runner in a
     separate thread
   - deletes the context

2. RunnerAgent

   - is executed on agent nodes
   - runs a Runner for the received task iterations with the given context
     and args
   - collects iteration result chunks, stores them on the local filesystem
     and sends them on request to the DistributedRunner
   - aggregates SLA data and periodically sends it to the DistributedRunner
   - stops the Runner on receipt of the corresponding message

3. DistributedRunner

   - is a regular plugin of Runner type
   - communicates with remote RunnerAgents via a message queue (ZeroMQ)
   - provides context, args and SLA to the RunnerAgents
   - distributes task iterations to the RunnerAgents
   - aggregates SLA data from the RunnerAgents
   - merges chunks of task result data
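
The following is a minimal, in-process sketch of this flow. It is an
illustration only: the class shapes, the method names (``setup``,
``submit_iteration``, ``fetch_sla``, ``request_chunk``) and the stub agent
are all hypothetical and do not reproduce Rally's actual runner plugin
interface.

.. code-block:: python

    import queue
    import threading


    class StubAgent:
        """In-process stand-in for a remote RunnerAgent."""

        def __init__(self):
            self.iterations = []

        def setup(self, context, args):
            self.context, self.args = context, args

        def submit_iteration(self, i):
            self.iterations.append(i)

        def fetch_sla(self):
            # Aggregated SLA data for finished iterations.
            return {"finished": len(self.iterations), "errors": 0}

        def request_chunk(self):
            # A chunk of iteration results for this agent.
            return [{"iteration": i, "duration": 0.1}
                    for i in self.iterations]


    class DistributedRunner:
        """Runner-type plugin that fans task iterations out to agents."""

        def __init__(self, agents):
            self.agents = agents
            self.result_chunks = queue.Queue()
            self.sla_updates = queue.Queue()

        def run(self, context, scenario_args, iterations):
            for agent in self.agents:            # provide context and args
                agent.setup(context, scenario_args)
            for i in range(iterations):          # distribute iterations
                self.agents[i % len(self.agents)].submit_iteration(i)
            t = threading.Thread(target=self._collect)  # consume in a thread
            t.start()
            t.join()

        def _collect(self):
            for agent in self.agents:
                self.sla_updates.put(agent.fetch_sla())
                self.result_chunks.put(agent.request_chunk())


    runner = DistributedRunner([StubAgent(), StubAgent()])
    runner.run(context={}, scenario_args={}, iterations=10)
    print(runner.sla_updates.get())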

It is proposed to use separate communication channels for task results and
SLA data (a pyzmq sketch of this split follows the list):

- SLA data is sent periodically (e.g. once per second) for iterations that
  have already finished.

- Task results are collected into chunks and stored locally by the
  RunnerAgent, and are only sent on request.
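
Below is a hedged sketch of this channel split on the RunnerAgent side,
using pyzmq. The port numbers, message shapes and the in-memory chunk store
are assumptions made for illustration; the spec only fixes the design
(periodic SLA publishing plus result chunks served on request).

.. code-block:: python

    import time

    import zmq


    def agent_loop(chunk_store, sla_state, run_seconds=10):
        ctx = zmq.Context()

        pub = ctx.socket(zmq.PUB)      # channel 1: periodic SLA data
        pub.bind("tcp://*:5561")

        rep = ctx.socket(zmq.REP)      # channel 2: result chunks on request
        rep.bind("tcp://*:5562")

        poller = zmq.Poller()
        poller.register(rep, zmq.POLLIN)

        deadline = time.time() + run_seconds
        while time.time() < deadline:
            # Publish aggregated SLA data for finished iterations.
            pub.send_json(sla_state)

            # Wait up to one second for a chunk request, which also paces
            # the SLA publishing to roughly once per second.
            events = dict(poller.poll(timeout=1000))
            if rep in events:
                request = rep.recv_json()   # e.g. {"chunk": 0}
                rep.send_json(chunk_store.get(request["chunk"], []))


    if __name__ == "__main__":
        agent_loop(chunk_store={0: [{"iteration": 0, "duration": 0.1}]},
                   sla_state={"finished": 1, "errors": 0})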

Alternatives
------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Illia Khudoshyn

Work Items
----------

- Refactor the current SLA mechanism to support aggregated SLA data

- Refactor the current Runner base class to

  - collect iteration results into chunks, ordered by timestamp (see the
    merging sketch after this list)
  - perform local SLA checks
  - aggregate SLA data

- Refactor the TaskEngine to reflect the changes in Runner:

  - operate on chunks of ordered test results rather than a stream of raw
    result items
  - apply SLA checks to aggregated SLA data
  - analyze SLA data and consume test results in separate threads

- Develop the infrastructure that will allow multi-node Rally configuration
  and runs

- Implement the RunnerAgent:

  - run the Runner
  - cache prepared chunks of iteration results
  - communicate via ZMQ with the DistributedRunner (send task results
    and SLA data on separate channels)
  - terminate the Runner on a 'stop' command from the TaskEngine

- Implement the DistributedRunner, which will

  - feed tasks to the RunnerAgents
  - receive chunks of result data from the RunnerAgents, merge them and
    provide the merged data to the TaskEngine
  - receive aggregated SLA data from the RunnerAgents, merge it and
    provide the data to the TaskEngine
  - relay the 'stop' command from the TaskEngine to the RunnerAgents
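
The chunk-merging and SLA-aggregation items admit a short sketch. The data
shapes below (dicts with ``timestamp``, ``finished``, ``errors`` and
``max_duration`` keys) are assumptions for illustration, not Rally's actual
result schema.

.. code-block:: python

    import heapq
    from operator import itemgetter


    def merge_chunks(*chunks):
        """Merge per-agent chunks, each already ordered by timestamp, into
        one ordered stream for the TaskEngine."""
        return list(heapq.merge(*chunks, key=itemgetter("timestamp")))


    def merge_sla(a, b):
        """Combine two aggregated SLA states; only mergeable statistics
        (counts, maxima) survive aggregation across agents."""
        return {
            "finished": a["finished"] + b["finished"],
            "errors": a["errors"] + b["errors"],
            "max_duration": max(a["max_duration"], b["max_duration"]),
        }


    agent1 = [{"timestamp": 1.0, "duration": 0.2},
              {"timestamp": 3.0, "duration": 0.1}]
    agent2 = [{"timestamp": 2.0, "duration": 0.4}]
    print(merge_chunks(agent1, agent2))
    print(merge_sla({"finished": 2, "errors": 0, "max_duration": 0.2},
                    {"finished": 1, "errors": 1, "max_duration": 0.4}))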

Dependencies
============

- DB model refactoring (boris-42)
- Report generation refactoring (amaretsky)