diff --git a/doc/source/images/Rally_Distributed_Runner.png b/doc/source/images/Rally_Distributed_Runner.png
new file mode 100644
index 00000000..5899a5d1
Binary files /dev/null and b/doc/source/images/Rally_Distributed_Runner.png differ
diff --git a/doc/specs/in-progress/distributed_runner.rst b/doc/specs/in-progress/distributed_runner.rst
new file mode 100644
index 00000000..3a14521e
--- /dev/null
+++ b/doc/specs/in-progress/distributed_runner.rst
@@ -0,0 +1,153 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+..
+  This template should be in ReSTructured text. The filename in the git
+  repository should match the launchpad URL, for example a URL of
+  https://blueprints.launchpad.net/heat/+spec/awesome-thing should be named
+  awesome-thing.rst. Please do not delete any of the sections in this
+  template. If you have nothing to say for a whole section, just write: None
+  For help with syntax, see http://sphinx-doc.org/rest.html
+  To test out your formatting, see http://www.tele3.cz/jbar/rest/rest.html
+
+
+============================
+Implement Distributed Runner
+============================
+
+We need a Distributed Runner in Rally that will run tasks on many nodes
+simultaneously.
+
+Problem description
+===================
+
+Currently there are several runners in Rally, but they can all run only on
+the same host that Rally itself runs on. This limits the test load that
+Rally can generate: in some cases the required load cannot be generated
+from a single host.
+
+In the current implementation the Runner object runs the actual subtask and
+generates test results, while TaskEngine, via ResultConsumer, retrieves
+these results, checks them against the specified SLA and stores them in
+the DB.
+
+Several aspects should be kept in mind when reasoning about distributed
+load generation:
+
+- Even a single active runner can produce so much result data that
+  TaskEngine can barely process it in time. We assume that a single
+  TaskEngine instance will not be able to process several streams of raw
+  test result data from several simultaneous runners.
+
+- We need test results to be checked against the SLA as soon as possible,
+  so that on an SLA violation we can stop load generation immediately
+  (or close to it) and protect the environment being tested. On the other
+  hand, we need results from all runners to be analysed, i.e. checking
+  the SLA on a single runner is not enough.
+
+- Since we expect long task durations, we want to provide the user with at
+  least partial information about task execution as soon as possible.
+
+
+Proposed change
+===============
+
+It is proposed to introduce two new components, a RunnerAgent and a new
+plugin of runner type, DistributedRunner, and to refactor the existing
+components TaskEngine, Runner and SLA, so that the overall interaction
+looks as follows.
+
+.. image:: ../../source/images/Rally_Distributed_Runner.png
+   :align: center
+
+
+1. TaskEngine
+
+   - creates the subtask context
+   - creates an instance of Runner
+   - runs Runner.run() with the context object and info about the scenario
+   - in a separate thread, consumes iteration result chunks and SLA data
+     from Runner
+   - deletes the context
+
+2. RunnerAgent
+
+   - is executed on agent nodes
+   - runs Runner for the received task iterations with the given context
+     and args
+   - collects iteration result chunks, stores them on the local filesystem
+     and sends them to DistributedRunner on request
+   - aggregates SLA data and periodically sends it to DistributedRunner
+   - stops Runner on receipt of the corresponding message
+
+3. DistributedRunner
+
+   - is a regular plugin of Runner type
+   - communicates with remote RunnerAgents via a message queue (ZeroMQ)
+   - provides context, args and SLA to RunnerAgents
+   - distributes task iterations to RunnerAgents
+   - aggregates SLA data from RunnerAgents
+   - merges chunks of task result data
+
+Separate communication channels are used for task results and SLA data
+(a sketch of the exchange follows the list below):
+
+- SLA data is sent periodically (e.g. once per second) for iterations
+  that have already finished.
+- Task results are collected into chunks, stored locally by RunnerAgent
+  and only sent on request.
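+
+The following sketch shows roughly what this two-channel exchange could
+look like on the agent side. It is illustrative only: the socket types,
+endpoints and helper names (``next_sla_aggregate``, ``chunks``) are
+assumptions made for the example, not part of the design.
+
+.. code-block:: python
+
+    import zmq
+
+    def agent_loop(next_sla_aggregate, chunks):
+        ctx = zmq.Context()
+
+        # Channel 1: aggregated SLA data, pushed periodically.
+        sla = ctx.socket(zmq.PUSH)
+        sla.connect("tcp://distributed-runner:5557")
+
+        # Channel 2: result chunks, served only when requested.
+        results = ctx.socket(zmq.REP)
+        results.bind("tcp://*:5558")
+
+        poller = zmq.Poller()
+        poller.register(results, zmq.POLLIN)
+
+        while True:
+            sla.send_json(next_sla_aggregate())  # e.g. once per second
+            # Wait up to a second for a chunk request before the next SLA
+            # push; reply with the oldest locally stored chunk, if any.
+            if dict(poller.poll(timeout=1000)).get(results):
+                results.recv_json()
+                results.send_json(chunks.pop(0) if chunks else None)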
+
+
+Alternatives
+------------
+
+None
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Illia Khudoshyn
+
+
+Work Items
+----------
+
+- Refactor the current SLA mechanism to support aggregated SLA data
+
+- Refactor the current Runner base class:
+
+  - collect iteration results into chunks, ordered by timestamp
+  - perform local SLA checks
+  - aggregate SLA data
+
+- Refactor TaskEngine to reflect the changes in Runner:
+
+  - operate on chunks of ordered test results rather than a stream of raw
+    result items
+  - apply SLA checks to aggregated SLA data
+  - analyze SLA data and consume test results in separate threads
+
+- Develop infrastructure that will allow multi-node Rally configuration
+  and runs
+
+- Implement RunnerAgent:
+
+  - run Runner
+  - cache prepared chunks of iteration results
+  - communicate via ZMQ with DistributedRunner (send task results
+    and SLA data on separate channels)
+  - terminate Runner on the 'stop' command from TaskEngine
+
+- Implement DistributedRunner that will:
+
+  - feed tasks to RunnerAgents
+  - receive chunks of result data from RunnerAgents, merge them and
+    provide the merged data to TaskEngine (a merge sketch is given at
+    the end of this spec)
+  - receive aggregated SLA data from RunnerAgents, merge it
+    and provide the merged data to TaskEngine
+  - translate the 'stop' command from TaskEngine to RunnerAgents
+
+Dependencies
+============
+
+- DB model refactoring (boris-42)
+- Report generation refactoring (amaretsky)
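+
+As a non-normative illustration of the chunk merging mentioned in the Work
+Items: since every RunnerAgent already orders its chunks by timestamp, the
+DistributedRunner side can restore a single, globally ordered stream with a
+plain k-way merge. The field name ``timestamp`` is an assumption made for
+the example.
+
+.. code-block:: python
+
+    import heapq
+
+    def merge_agent_results(per_agent_results):
+        # Each agent delivers its results already sorted by timestamp,
+        # so a k-way merge yields one globally ordered list.
+        return list(heapq.merge(*per_agent_results,
+                                key=lambda result: result["timestamp"]))
+
+    merged = merge_agent_results([
+        [{"timestamp": 1.0}, {"timestamp": 3.0}],  # agent A
+        [{"timestamp": 2.0}, {"timestamp": 4.0}],  # agent B
+    ])
+    # merged is ordered: timestamps 1.0, 2.0, 3.0, 4.0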