Add a conductor considerations section

Add a small section in the conductor docs about
the cycling issue and give some resolutions that
can be applied as well as link to the better solution
which is garbage collection for jobs that are not
working out.

Also includes some tiny tweaks to other docs.

Change-Id: I73e9f8f5a8888eaf967d62723f6ffb45b02887c9
Joshua Harlow 2014-06-27 16:13:24 -07:00 committed by Joshua Harlow
parent 95cb0625f4
commit c2ec0b2e49
4 changed files with 38 additions and 6 deletions

@@ -24,9 +24,41 @@ They are responsible for the following:
 .. note::
-    They are inspired by and have similar responsiblities
+    They are inspired by and have similar responsibilities
     as `railroad conductors`_.
+Considerations
+==============
+Some considerations should be kept in mind when using a conductor to make
+sure it is used in a safe and reliable manner. Eventually we hope to make
+these non-issues but for now they are worth mentioning.
+Endless cycling
+---------------
+**What:** A job that fails (due to some type of internal error) on one
+conductor will be abandoned by that conductor, and then another conductor may
+experience those same errors and abandon it (and repeat). This will create a
+job abandonment cycle that will continue for as long as the job exists in a
+claimable state.
+**Example:**
+.. image:: img/conductor_cycle.png
+   :scale: 70%
+   :alt: Conductor cycling
+**Alleviate by:**
+#. Forcefully delete jobs that have been failing continuously after a given
+   number of conductor attempts. This can be done either manually or
+   automatically via scripts (or other associated monitoring).
+#. Resolve the internal error's cause (storage backend failure, other...).
+#. Help implement `jobboard garbage binning`_.
+.. _jobboard garbage binning: https://blueprints.launchpad.net/taskflow/+spec/jobboard-garbage-bin
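As a rough illustration of the first suggestion above (automatically deleting a job after a given number of conductor attempts), here is a minimal sketch. The ``AbandonmentTracker`` class, the ``sweep`` function, and the plain ``dict`` standing in for a jobboard are all hypothetical and are not part of the taskflow API; a real script would need to be wired into whatever monitoring reports abandoned jobs.

```python
class AbandonmentTracker:
    """Counts abandonments per job (hypothetical helper, not a taskflow class)."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self._attempts = {}

    def record_abandonment(self, job_id):
        """Note one more failed conductor attempt and return the new count."""
        self._attempts[job_id] = self._attempts.get(job_id, 0) + 1
        return self._attempts[job_id]

    def should_delete(self, job_id):
        """True once the job has been abandoned at least max_attempts times."""
        return self._attempts.get(job_id, 0) >= self.max_attempts

    def forget(self, job_id):
        """Drop the counter for a job that no longer exists."""
        self._attempts.pop(job_id, None)


def sweep(board, tracker):
    """Delete over-limit jobs from ``board`` (a dict standing in for a jobboard).

    Returns the list of deleted job ids so the caller can log or alert on them.
    """
    doomed = [job_id for job_id in board if tracker.should_delete(job_id)]
    for job_id in doomed:
        del board[job_id]
        tracker.forget(job_id)
    return doomed
```

Keeping the counter outside the job itself is only one possible design; it assumes the script observes every abandonment, which may not hold if several monitors run independently.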
 Interfaces
 ==========

Binary file not shown (size: 36 KiB).

@@ -214,7 +214,7 @@ the engine can immediately stop doing further work. The effect that this causes
 is that when a claim is lost another engine can immediately attempt to acquire
 the claim that was previously lost and it *could* begin working on the
 unfinished tasks that the later engine may also still be executing (since that
-engine is not yet aware that it has lost the claim).
+engine is not yet aware that it has *lost* the claim).
 **TLDR:** not `preemptable`_, possible to become aware of losing a claim
 after the fact (at the next state change), another engine could have acquired
@@ -235,8 +235,8 @@ the claim by then, therefore both would be *working* on a job.
 #. Delay claiming partially completed work by adding a wait period (to allow
    the previous engine to coalesce) before working on a partially completed job
-   (combine this with the prior suggestions and dual-engine issues should be
-   avoided).
+   (combine this with the prior suggestions and *most* dual-engine issues
+   should be avoided).
 .. _idempotent: http://en.wikipedia.org/wiki/Idempotence
 .. _preemptable: http://en.wikipedia.org/wiki/Preemption_%28computing%29
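The wait-period suggestion above could be sketched roughly as follows. The ``ready_to_claim`` function and the dictionary keys ``partially_completed`` and ``claim_lost_at`` are assumptions made for illustration only; they do not correspond to any taskflow interface.

```python
import time


def ready_to_claim(job, wait_period, now=None):
    """Decide whether a job may be claimed yet (hypothetical helper).

    ``job`` is assumed to be a dict with two illustrative keys:
    ``partially_completed`` (bool) and ``claim_lost_at`` (seconds, on the
    same clock as ``now``). Fresh jobs are claimable immediately; partially
    completed ones only after ``wait_period`` seconds have passed, giving
    the previous engine time to notice it lost the claim.
    """
    if not job["partially_completed"]:
        return True
    if now is None:
        now = time.monotonic()
    return (now - job["claim_lost_at"]) >= wait_period
```

A wait period only narrows the window in which two engines work on the same job; it does not close it, which is why the text recommends combining it with the other suggestions.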

@@ -7,8 +7,7 @@ Overview
 This is engine that schedules tasks to **workers** -- separate processes
 dedicated for certain atoms execution, possibly running on other machines,
-connected via `amqp`_ (or other supported `kombu
-<http://kombu.readthedocs.org/>`_ transports).
+connected via `amqp`_ (or other supported `kombu`_ transports).
 .. note::
@@ -18,6 +17,7 @@ connected via `amqp`_ (or other supported `kombu
     production ready.
 .. _blueprint page: https://blueprints.launchpad.net/taskflow?searchtext=wbe
+.. _kombu: http://kombu.readthedocs.org/
 Terminology
 -----------