Add a conductor considerations section

Add a small section to the conductor docs that describes the job cycling issue, lists some mitigations that can be applied, and links to the better long-term solution (garbage collection of jobs that keep failing). Also includes some tiny tweaks to other docs.

Change-Id: I73e9f8f5a8888eaf967d62723f6ffb45b02887c9
2014-06-27 16:13:24 -07:00
commit c2ec0b2e49
parent 95cb0625f4
4 changed files with 38 additions and 6 deletions
--- a/doc/source/conductors.rst
+++ b/doc/source/conductors.rst
@@ -24,9 +24,41 @@ They are responsible for the following:

 .. note::

-     They are inspired by and have similar responsiblities
+     They are inspired by and have similar responsibilities
     as `railroad conductors`_.

+Considerations
+==============
+
+A few considerations should be kept in mind when using a conductor to make sure
+it is used in a safe and reliable manner. Eventually we hope to make these
+non-issues, but for now they are worth mentioning.
+
+Endless cycling
+---------------
+
+**What:** A job that fails (due to some type of internal error) on one
+conductor will be abandoned by that conductor; another conductor may then
+claim it, hit the same error and abandon it as well (and so on). This creates
+a job abandonment cycle that continues for as long as the job exists in a
+claimable state.
+
+**Example:**
+
+.. image:: img/conductor_cycle.png
+   :scale: 70%
+   :alt: Conductor cycling
+
+**Alleviate by:**
+
+#. Forcefully delete jobs that keep failing after a given number of conductor
+   attempts. This can be done either manually or automatically via scripts
+   (or other associated monitoring).
+#. Resolve the cause of the internal error (such as a storage backend failure).
+#. Help implement `jobboard garbage binning`_.
+
+.. _jobboard garbage binning: https://blueprints.launchpad.net/taskflow/+spec/jobboard-garbage-bin
+
 Interfaces
 ==========

--- a/doc/source/img/conductor_cycle.png
+++ b/doc/source/img/conductor_cycle.png
--- a/doc/source/jobs.rst
+++ b/doc/source/jobs.rst
@@ -214,7 +214,7 @@ the engine can immediately stop doing further work. The effect that this causes
 is that when a claim is lost another engine can immediately attempt to acquire
 the claim that was previously lost and it *could* begin working on the
 unfinished tasks that the later engine may also still be executing (since that
-engine is not yet aware that it has lost the claim).
+engine is not yet aware that it has *lost* the claim).

 **TLDR:** not `preemptable`_, possible to become aware of losing a claim
 after the fact (at the next state change), another engine could have acquired
@@ -235,8 +235,8 @@ the claim by then, therefore both would be *working* on a job.

 #. Delay claiming partially completed work by adding a wait period (to allow
   the previous engine to coalesce) before working on a partially completed job
-   (combine this with the prior suggestions and dual-engine issues should be
-   avoided).
+   (combine this with the prior suggestions and *most* dual-engine issues
+   should be avoided).

 .. _idempotent: http://en.wikipedia.org/wiki/Idempotence
 .. _preemptable: http://en.wikipedia.org/wiki/Preemption_%28computing%29
--- a/doc/source/workers.rst
+++ b/doc/source/workers.rst
@@ -7,8 +7,7 @@ Overview

 This is an engine that schedules tasks to **workers** -- separate processes
 dedicated to executing certain atoms, possibly running on other machines,
-connected via `amqp`_ (or other supported `kombu
-<http://kombu.readthedocs.org/>`_ transports).
+connected via `amqp`_ (or other supported `kombu`_ transports).

 .. note::

@@ -18,6 +17,7 @@ connected via `amqp`_ (or other supported `kombu
    production ready.

 .. _blueprint page: https://blueprints.launchpad.net/taskflow?searchtext=wbe
+.. _kombu: http://kombu.readthedocs.org/

 Terminology
 -----------
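
Note on the "Endless cycling" section added to conductors.rst above: its first mitigation (delete a job once it has been abandoned a given number of times) could be sketched roughly as follows. This is a self-contained simulation, not TaskFlow code. The names `AbandonTracker` and `run_conductors`, and the plain dict standing in for a jobboard, are hypothetical and exist only for illustration.

```python
from collections import Counter


class AbandonTracker:
    """Counts abandonments per job (hypothetical helper, not a TaskFlow API)."""

    def __init__(self, max_attempts):
        if max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")
        self.max_attempts = max_attempts
        self._abandons = Counter()

    def record_abandon(self, job_id):
        """Record one abandonment; return True once the limit is reached."""
        self._abandons[job_id] += 1
        return self._abandons[job_id] >= self.max_attempts

    def forget(self, job_id):
        """Drop the counter for a job that is finished or deleted."""
        self._abandons.pop(job_id, None)


def run_conductors(jobs, work, tracker):
    """Simulate conductors claiming jobs until none remain claimable.

    jobs: dict of job_id -> payload (a stand-in for a jobboard).
    work: callable(payload) that raises on an internal error.
    Returns (completed, deleted) lists of job ids.
    """
    completed, deleted = [], []
    while jobs:
        for job_id in list(jobs):
            try:
                work(jobs[job_id])
            except Exception:
                # The conductor abandons the job. Without the check below
                # the job would stay claimable and cycle forever.
                if tracker.record_abandon(job_id):
                    del jobs[job_id]
                    tracker.forget(job_id)
                    deleted.append(job_id)
            else:
                del jobs[job_id]
                tracker.forget(job_id)
                completed.append(job_id)
    return completed, deleted
```

In a real deployment the counter would have to live somewhere every conductor can see (the sketch keeps it in process memory), and the deletion step would go through whatever removal mechanism the jobboard backend provides.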