Overview
========

This is an engine that schedules tasks to **workers**: separate processes
dedicated to executing certain atoms, possibly running on other machines,
connected via `amqp`_ (or other supported `kombu`_ transports).

.. note::

    This engine is under active development. It is usable and **does** work,
    but it is missing some features that would make it more production ready
    (please check the `blueprint page`_ for known issues and plans).

.. _blueprint page: https://blueprints.launchpad.net/taskflow?searchtext=wbe

Terminology
-----------

Client
  Code, program, service (or user) that uses this library to define
  flows and run them via engines.

Transport + protocol
  Mechanism (and `protocol`_ on top of that mechanism) used to pass information
  between the client and worker (for example amqp as a transport and a
  JSON-encoded message format as the protocol).

Executor
  Part of the worker-based engine that publishes task requests so that
  these requests can be accepted and processed by remote workers.

Worker
  Workers are started on remote hosts and each has a list of tasks it can
  perform (on request). Workers accept and process task requests that are
  published by an executor. Several requests can be processed simultaneously
  in separate threads (or processes...). For example, an `executor`_ can be
  passed to the worker and configured to run in as many threads (green or
  not) as desired.

Proxy
  Executors interact with workers via a proxy. The proxy maintains the
  underlying transport and publishes messages (and invokes callbacks on message
  reception).

Requirements
------------

* **Transparent:** it should work as a drop-in replacement for existing
  *(local)* engines with minimal, if any, refactoring (e.g. it should be
  possible to run the same flows on it without changing client code if
  everything is set up and configured properly).
* **Transport-agnostic:** the means of transport should be abstracted so that
  we can use `oslo.messaging`_, `gearmand`_, `amqp`_, `zookeeper`_, `marconi`_,
  `websockets`_ or anything else that allows for passing information between a
  client and a worker.
* **Simple:** it should be simple to write and deploy.
* **Non-uniformity:** it should support non-uniform workers, allowing
  different workers to execute different sets of atoms depending on each
  worker's published capabilities.

.. _marconi: https://wiki.openstack.org/wiki/Marconi
.. _zookeeper: http://zookeeper.org/
.. _gearmand: http://gearman.org/
.. _oslo.messaging: https://wiki.openstack.org/wiki/Oslo/Messaging
.. _websockets: http://en.wikipedia.org/wiki/WebSocket
.. _amqp: http://www.amqp.org/
.. _executor: https://docs.python.org/dev/library/concurrent.futures.html#executor-objects
.. _protocol: http://en.wikipedia.org/wiki/Communications_protocol

Design
======

There are two communicating sides, the *executor* (and its associated engine
derivative) and the *worker*, which communicate using a proxy component. The
proxy is designed to accept/publish messages from/into a named exchange.

High level architecture
-----------------------

.. image:: img/worker-engine.svg
   :height: 340px
   :align: right

Executor and worker communication
---------------------------------

Let's consider how communication between an executor and a worker happens.
First, the engine resolves all atom dependencies and schedules the atoms that
can be performed at the moment. This uses the same scheduling and dependency
resolution logic that is used for every other engine type. Then the atoms that
can be executed immediately (atoms that depend on the outputs of other tasks
are executed once those outputs are ready) are executed by the worker-based
engine executor in the following manner:

1. The executor initiates task execution/reversion using a proxy object.
2. :py:class:`~taskflow.engines.worker_based.proxy.Proxy` publishes a task
   request (the format is described below) into a named exchange using a
   routing key that delivers the request to a particular worker's topic. The
   executor then waits for the task request to be accepted and confirmed by
   a worker. If the executor doesn't get a task confirmation from a worker
   within the given timeout, the task is considered timed out and a timeout
   exception is raised.
3. A worker receives a request message and starts a new thread for processing
   it.

   1. The worker dispatches the request (gets the desired endpoint that
      actually executes the task).
   2. If the dispatch succeeded, the worker sends a confirmation response
      to the executor; otherwise the worker sends a failed response along with
      a serialized :py:class:`failure <taskflow.types.failure.Failure>` object
      that contains what has failed (and why).
   3. The worker executes the task and, once it is finished, sends the result
      back to the originating executor (every time a task progress event is
      triggered it sends a progress notification to the executor, where it is
      handled by the engine, dispatching to listeners and so on).

4. The executor gets the task request confirmation from the worker and the task
   request state changes from the ``PENDING`` to the ``RUNNING`` state. Once a
   task request is in the ``RUNNING`` state it can no longer time out (since
   the task execution process may take an unpredictable amount of time).
5. The executor gets the task execution result from the worker and passes it
   back to the worker-based engine to finish task processing (this repeats for
   subsequent tasks).

.. note::

    :py:class:`~taskflow.types.failure.Failure` objects are not directly
    JSON-serializable (they contain references to tracebacks, which are not
    serializable), so they are converted to dicts before sending and converted
    from dicts after receiving on both the executor and worker sides (this
    translation is lossy since the traceback can't be fully retained, due
    to its contents containing internal interpreter references and
    details).

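The lossy conversion described in the note above can be illustrated with a
minimal, standard-library-only sketch. The helper name ``failure_to_dict`` is
hypothetical (it is not part of taskflow) and the dict keys simply mirror the
ones shown in the message examples in this document; the real
:py:class:`~taskflow.types.failure.Failure` class performs its own conversion.

.. code:: python

    import json
    import traceback


    def failure_to_dict(exc):
        """Flatten an exception into a JSON-friendly dict (lossy)."""
        return {
            # Exception class names, most specific first.
            'exc_type_names': [cls.__name__ for cls in type(exc).__mro__
                               if issubclass(cls, Exception)],
            'exception_str': str(exc),
            # Only the *text* of the traceback survives; the traceback
            # object itself cannot be serialized.
            'traceback_str': ''.join(traceback.format_tb(exc.__traceback__)),
            'version': 1,
        }


    try:
        raise RuntimeError('Woot!')
    except RuntimeError as e:
        wire = json.dumps(failure_to_dict(e))

    # The receiving side only gets plain data back, not a live exception.
    restored = json.loads(wire)

After the round trip ``restored`` is an ordinary dict, which is why the
receiving side can describe the failure but cannot re-create the original
traceback object.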
Protocol
~~~~~~~~

.. automodule:: taskflow.engines.worker_based.protocol

Examples
~~~~~~~~

Request (execute)
"""""""""""""""""

* **task_name** - full task name to be performed
* **task_cls** - full task class name to be performed
* **action** - task action to be performed (e.g. execute, revert)
* **arguments** - arguments the task action is to be called with
* **result** - task execution result (result or
  :py:class:`~taskflow.types.failure.Failure`) *[passed to revert only]*

Additionally, the following parameters are added to the request message:

* **reply_to** - the executor's named exchange that workers send responses
  back to
* **correlation_id** - executor request id (since there can be multiple
  requests being processed simultaneously)

**Example:**

.. code:: json

    {
        "action": "execute",
        "arguments": {
            "x": 111
        },
        "task_cls": "taskflow.tests.utils.TaskOneArgOneReturn",
        "task_name": "taskflow.tests.utils.TaskOneArgOneReturn",
        "task_version": [
            1,
            0
        ]
    }

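As a rough illustration of how the fields listed above fit together, the
following sketch assembles a request in plain Python. The function
``build_request`` is hypothetical, and keeping ``reply_to`` and
``correlation_id`` in a separate properties dict is an assumption made for
this sketch, not a statement about how taskflow lays out the message.

.. code:: python

    import uuid


    def build_request(task_name, task_cls, action, arguments, reply_to,
                      result=None):
        """Return a (body, properties) pair for one task request."""
        body = {
            'task_name': task_name,
            'task_cls': task_cls,
            'action': action,
            'arguments': arguments,
        }
        if action == 'revert':
            # The prior result (or failure) is only passed when reverting.
            body['result'] = result
        properties = {
            'reply_to': reply_to,
            # Unique per request, so responses can be matched to requests
            # when several requests are in flight at once.
            'correlation_id': str(uuid.uuid4()),
        }
        return body, properties


    body, properties = build_request(
        'taskflow.tests.utils.TaskOneArgOneReturn',
        'taskflow.tests.utils.TaskOneArgOneReturn',
        'execute', {'x': 111}, reply_to='executor-exchange')

A response carrying the same ``correlation_id`` can then be routed back to the
request that produced it.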
Request (revert)
""""""""""""""""

When **reverting:**

.. code:: json

    {
        "action": "revert",
        "arguments": {},
        "failures": {
            "taskflow.tests.utils.TaskWithFailure": {
                "exc_type_names": [
                    "RuntimeError",
                    "StandardError",
                    "Exception"
                ],
                "exception_str": "Woot!",
                "traceback_str": "  File \"/homes/harlowja/dev/os/taskflow/taskflow/engines/action_engine/executor.py\", line 56, in _execute_task\n    result = task.execute(**arguments)\n  File \"/homes/harlowja/dev/os/taskflow/taskflow/tests/utils.py\", line 165, in execute\n    raise RuntimeError('Woot!')\n",
                "version": 1
            }
        },
        "result": [
            "failure",
            {
                "exc_type_names": [
                    "RuntimeError",
                    "StandardError",
                    "Exception"
                ],
                "exception_str": "Woot!",
                "traceback_str": "  File \"/homes/harlowja/dev/os/taskflow/taskflow/engines/action_engine/executor.py\", line 56, in _execute_task\n    result = task.execute(**arguments)\n  File \"/homes/harlowja/dev/os/taskflow/taskflow/tests/utils.py\", line 165, in execute\n    raise RuntimeError('Woot!')\n",
                "version": 1
            }
        ],
        "task_cls": "taskflow.tests.utils.TaskWithFailure",
        "task_name": "taskflow.tests.utils.TaskWithFailure",
        "task_version": [
            1,
            0
        ]
    }

Worker response(s)
""""""""""""""""""

When **running:**

.. code:: json

    {
        "data": {},
        "state": "RUNNING"
    }

When **progressing:**

.. code:: json

    {
        "details": {
            "progress": 0.5
        },
        "event_type": "update_progress",
        "state": "EVENT"
    }

When **succeeded:**

.. code:: json

    {
        "data": {
            "result": 666
        },
        "state": "SUCCESS"
    }

When **failed:**

.. code:: json

    {
        "data": {
            "result": {
                "exc_type_names": [
                    "RuntimeError",
                    "StandardError",
                    "Exception"
                ],
                "exception_str": "Woot!",
                "traceback_str": "  File \"/homes/harlowja/dev/os/taskflow/taskflow/engines/action_engine/executor.py\", line 56, in _execute_task\n    result = task.execute(**arguments)\n  File \"/homes/harlowja/dev/os/taskflow/taskflow/tests/utils.py\", line 165, in execute\n    raise RuntimeError('Woot!')\n",
                "version": 1
            }
        },
        "state": "FAILURE"
    }

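One way to picture how an executor might react to the four response shapes
above is a small dispatch on the ``state`` field. This is only a sketch: the
``handle_response`` function and the plain-dict request record are
hypothetical and do not reflect taskflow's internal classes.

.. code:: python

    def handle_response(request, response):
        """Update a request record (a plain dict) from one worker response."""
        state = response['state']
        if state == 'RUNNING':
            # The worker confirmed the request and started executing it.
            request['state'] = 'RUNNING'
        elif state == 'EVENT':
            # Progress (or other) notifications do not change the state.
            request['events'].append(
                (response['event_type'], response['details']))
        elif state in ('SUCCESS', 'FAILURE'):
            request['state'] = state
            request['result'] = response['data']['result']
        else:
            raise ValueError('Unknown response state: %r' % state)
        return request


    request = {'state': 'PENDING', 'events': [], 'result': None}
    handle_response(request, {'data': {}, 'state': 'RUNNING'})
    handle_response(request, {'details': {'progress': 0.5},
                              'event_type': 'update_progress',
                              'state': 'EVENT'})
    handle_response(request, {'data': {'result': 666}, 'state': 'SUCCESS'})

For a ``FAILURE`` response the ``result`` value would be the dict form of the
failure rather than a task return value.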
Request state transitions
-------------------------

.. image:: img/wbe_request_states.svg
   :width: 520px
   :align: center
   :alt: WBE request state transitions

**WAITING** - Request placed on queue (or other `kombu`_ message bus/transport)
but not *yet* consumed.

**PENDING** - Worker accepted request and is pending to run using its
executor (threads, processes, or other).

**FAILURE** - Worker failed after running request (due to task exception) or
no worker moved/started executing (by placing the request into ``RUNNING``
state) within the specified time span (this defaults to 60 seconds unless
overridden).

**RUNNING** - Worker's executor (using threads, processes...) has started to
run the requested task (once a request transitions to this state the request
timeout no longer applies, since at this point it cannot be determined
whether a task is just taking a long time or has failed).

**SUCCESS** - Worker finished running task without exception.

.. note::

    During the ``WAITING`` and ``PENDING`` stages the engine keeps track
    of how long the request has been *alive* for and if a timeout is reached
    the request will automatically transition to ``FAILURE`` and any further
    transitions from a worker will be disallowed (for example, if a worker
    accepts the request in the future and sets the task to ``PENDING`` this
    transition will be logged and ignored). This timeout can be adjusted and/or
    removed by setting the engine ``transition_timeout`` option to a
    higher/lower value or by setting it to ``None`` (to remove the timeout
    completely). In the future this will be improved to be more dynamic
    by implementing the blueprints associated with `failover`_ and
    `info/resilience`_.

.. _failover: https://blueprints.launchpad.net/taskflow/+spec/wbe-worker-failover
.. _info/resilience: https://blueprints.launchpad.net/taskflow/+spec/wbe-worker-info

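The timeout behaviour described above can be sketched as a tiny state machine.
Everything here is illustrative: the ``Request`` class, the ``ALLOWED`` table
and the injectable ``clock`` are hypothetical, and the transition table is
inferred from the state descriptions in this section rather than copied from
taskflow.

.. code:: python

    import time

    # Assumed transition table, inferred from the state descriptions.
    ALLOWED = {
        'WAITING': {'PENDING', 'FAILURE'},
        'PENDING': {'RUNNING', 'FAILURE'},
        'RUNNING': {'SUCCESS', 'FAILURE'},
        'SUCCESS': set(),
        'FAILURE': set(),
    }


    class Request(object):
        def __init__(self, transition_timeout=60, clock=time.monotonic):
            self.state = 'WAITING'
            self._timeout = transition_timeout
            self._clock = clock
            self._created = clock()

        def expired(self):
            """The timeout only applies before the request is RUNNING."""
            if self._timeout is None:
                return False
            if self.state not in ('WAITING', 'PENDING'):
                return False
            return self._clock() - self._created >= self._timeout

        def transition(self, new_state):
            """Apply a transition; return False if it had to be ignored."""
            if self.expired():
                self.state = 'FAILURE'
            if new_state not in ALLOWED[self.state]:
                return False
            self.state = new_state
            return True


    now = [0.0]
    late = Request(transition_timeout=60, clock=lambda: now[0])
    now[0] = 61.0
    accepted = late.transition('PENDING')  # arrives after the timeout

Here ``accepted`` is ``False`` and ``late.state`` is ``FAILURE``, matching the
"logged and ignored" behaviour for late worker transitions. Passing
``transition_timeout=None`` disables the check entirely.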
Usage
=====

Workers
-------

To use the worker-based engine a set of workers must first be established on
remote machines. These workers must be provided a list of task objects, task
names, module names (or entrypoints that can be examined for valid tasks) they
can respond to (this is done so that arbitrary code execution is not possible).

For complete parameters and object usage please visit
:py:class:`~taskflow.engines.worker_based.worker.Worker`.

**Example:**

.. code:: python

    from taskflow.engines.worker_based import worker as w

    config = {
        # Transport connection url, exchange and topic this worker uses.
        'url': 'amqp://guest:guest@localhost:5672//',
        'exchange': 'test-exchange',
        'topic': 'test-tasks',
        # The only tasks this worker is allowed to run.
        'tasks': ['tasks:TestTask1', 'tasks:TestTask2'],
    }
    worker = w.Worker(**config)
    worker.run()

Engines
-------

To use the worker-based engine a flow must be constructed (which contains tasks
that are visible on remote machines) and the specific worker-based engine
entrypoint must be selected. Certain configuration options must also be
provided so that the transport backend can be configured and initialized
correctly. Otherwise the usage should be mostly transparent (and is nearly
identical to using any other engine type).

For complete parameters and object usage please see
:py:class:`~taskflow.engines.worker_based.engine.WorkerBasedActionEngine`.

**Example with amqp transport:**

.. code:: python

    import taskflow.engines
    from taskflow.patterns import linear_flow as lf

    flow = lf.Flow('simple-linear').add(...)
    eng = taskflow.engines.load(flow, engine='worker-based',
                                url='amqp://guest:guest@localhost:5672//',
                                exchange='test-exchange',
                                topics=['topic1', 'topic2'])
    eng.run()

**Example with filesystem transport:**

.. code:: python

    import taskflow.engines
    from taskflow.patterns import linear_flow as lf

    flow = lf.Flow('simple-linear').add(...)
    eng = taskflow.engines.load(flow, engine='worker-based',
                                exchange='test-exchange',
                                topics=['topic1', 'topic2'],
                                transport='filesystem',
                                transport_options={
                                    'data_folder_in': '/tmp/in',
                                    'data_folder_out': '/tmp/out',
                                })
    eng.run()

Additional supported keyword arguments:

* ``executor``: a class that provides a
  :py:class:`~taskflow.engines.worker_based.executor.WorkerTaskExecutor`
  interface; it will be used for executing, reverting and waiting for remote
  tasks.

Limitations
===========

* Atoms inside a flow must receive and accept parameters only through the
  mechanisms defined in :doc:`persistence <persistence>`. In other words, the
  task that is created when a workflow is constructed will not be the same task
  that is executed on a remote worker (and any internal state not passed via
  the :doc:`input and output <inputs_and_outputs>` mechanism can not be
  transferred). This means resource objects (database handles, file
  descriptors, sockets, ...) can **not** be directly sent across to remote
  workers (the configuration that defines how to fetch/create these objects
  must be sent instead).
* Worker-based engines will in the future be able to run lightweight tasks
  locally to avoid transport overhead for very simple tasks (currently even
  lightweight tasks are run remotely, which may perform poorly).
* Fault detection: currently, once a worker acknowledges a task the engine
  waits for the task result indefinitely (a task may take an indeterminate
  amount of time to finish). In the future there needs to be a way to limit
  the duration of a remote worker's execution (and track its liveness) and
  possibly spawn the task on a secondary worker if a timeout is reached (i.e.
  the first worker has died or has stopped responding).

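The first limitation can be made concrete with a short sketch that uses only
the standard library (no taskflow imports). The ``run_query`` function and the
``db_config`` argument name are hypothetical; the point is that a plain
configuration dict survives JSON serialization while a live connection object
does not, so the remote side has to open its own resource.

.. code:: python

    import json
    import sqlite3


    def run_query(db_config):
        """Stand-in for a task body: opens its own connection from config."""
        conn = sqlite3.connect(db_config['path'])
        try:
            return conn.execute('SELECT 1 + 1').fetchone()[0]
        finally:
            conn.close()


    # Plain configuration can be serialized and sent to a worker...
    wire = json.dumps({'db_config': {'path': ':memory:'}})

    # ...but a live resource object can not.
    conn = sqlite3.connect(':memory:')
    try:
        json.dumps({'conn': conn})
        sendable = True
    except TypeError:
        sendable = False
    finally:
        conn.close()

    # The "remote" side rebuilds the resource from the configuration.
    result = run_query(json.loads(wire)['db_config'])

The same idea applies to file descriptors and sockets: send the path or
address as an input and let the task create the object where it runs.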
Implementations
===============

.. automodule:: taskflow.engines.worker_based.engine

Components
----------

.. automodule:: taskflow.engines.worker_based.proxy
.. automodule:: taskflow.engines.worker_based.worker

.. _kombu: http://kombu.readthedocs.org/