Add a resumption strategy doc
Move docs from the wiki into the developer docs, and adjust them to reflect the current state of things.

Change-Id: I50ab1ebeb33074d1fbc7493749d0d518b66de69e
@@ -18,6 +18,7 @@ Contents
    inputs_and_outputs
    notifications
    persistence
+   resumption
    exceptions
    utils
    states
@@ -58,7 +58,7 @@ objects for each atom in the workflow the engine will be executing.
 **Execution:** When an engine begins to execute it will examine any previously existing
 :py:class:`~taskflow.persistence.logbook.AtomDetail` objects to see if they can be used
-for resuming; see `big picture`_ for more details on this subject. For atoms which have not
+for resuming; see :doc:`resumption <resumption>` for more details on this subject. For atoms which have not
 finished (or did not finish correctly from a previous run) they will begin executing
 only after any dependent inputs are ready. This is done by analyzing the execution
 graph and looking at predecessor :py:class:`~taskflow.persistence.logbook.AtomDetail`
@@ -88,7 +88,6 @@ A few scenarios come to mind:
 of map-reduce jobs on them.

 .. _hdfs: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
-.. _big picture: https://wiki.openstack.org/wiki/TaskFlow/Patterns_and_Engines/Persistence#Big_Picture

 .. note::
doc/source/resumption.rst (new file, 156 lines)
@@ -0,0 +1,156 @@
----------
Resumption
----------

Overview
========

**Question:** *How can we persist the flow so that it can be resumed, restarted
or rolled-back on engine failure?*

**Answer:** Since a flow is a set of :doc:`atoms <atoms>` and relations between
atoms we need to create a model and corresponding information that allows us to
persist the *right* amount of information to preserve, resume, and rollback a
flow on software or hardware failure.

To allow for resumption taskflow must be able to re-create the flow and
re-connect the links between atoms (and between atoms and atom details and so
on) in order to revert or resume those atoms in the correct ordering. Taskflow
provides a pattern that can help in automating this process (it does **not**
prohibit the user from creating their own strategies for doing this).
Factories
=========

The default provided way is to provide a `factory`_ function which will create
(or recreate) your workflow. This function can be provided when loading a flow
and corresponding engine via the provided
:py:meth:`load_from_factory() <taskflow.engines.helpers.load_from_factory>`
method. This `factory`_ function is expected to be a function (or
``staticmethod``) which is reimportable (aka has a well defined name that can
be located by the ``__import__`` function in Python; this excludes ``lambda``
style functions and ``instance`` methods). The `factory`_ function name will be
saved into the logbook and it will be imported and called to create the
workflow objects (or recreate them if resumption happens). This allows for the
flow to be recreated if and when that is needed (even on remote machines, as
long as the reimportable name can be located).

.. _factory: https://en.wikipedia.org/wiki/Factory_%28object-oriented_programming%29
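The save-a-name-then-reimport mechanism can be sketched in a few lines. This is
a minimal illustration using only the standard library, not taskflow's actual
storage format; the ``module:function`` naming convention and the
``make_flow``/``resolve`` helpers are hypothetical stand-ins:

```python
import importlib

def make_flow():
    # Hypothetical flow factory; in taskflow this would build and return a
    # flow of atoms. Its reimportable name (e.g. "myapp.flows:make_flow")
    # is what would get saved into the logbook.
    return ["fetch", "transform", "store"]

def resolve(saved_name):
    # On resumption the saved "module:function" name is split apart, the
    # module is imported, and the attribute looked up -- which is why
    # ``lambda`` functions and instance methods (which cannot be located
    # by import) are excluded.
    module_name, func_name = saved_name.split(":")
    return getattr(importlib.import_module(module_name), func_name)

# Demonstrate the lookup mechanism with a stdlib name, since this sketch
# is not itself an importable module.
joiner = resolve("posixpath:join")
```

The key property is that nothing but the string name needs to be persisted; any
process (even on another machine) that can import the module can rebuild the
flow.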
Names
=====

When a flow is created it is expected that each atom has a unique name; this
name serves a special purpose in the resumption process (as well as serving a
useful purpose when running, allowing for atom identification in the
:doc:`notification <notifications>` process). The reason for having names is
that an atom in a flow needs to be somehow matched with a (potentially)
existing :py:class:`~taskflow.persistence.logbook.AtomDetail` during engine
resumption & subsequent running.

The match should be:

* stable if atoms are added or removed
* unchanged when the service is restarted or upgraded
* the same across all server instances in HA setups

Names provide this although they do have weaknesses:

* the names of atoms must be unique in a flow
* it becomes hard to change the name of an atom since a name change causes
  other side-effects

.. note::

   Even though they have these weaknesses, names were selected as a *good
   enough* solution for the above matching requirements (until something better
   is invented/created that can satisfy those same requirements).
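The name-based matching described above can be sketched as a simple
match-or-create pass. This is an illustrative model only (taskflow's real
details are :py:class:`~taskflow.persistence.logbook.AtomDetail` objects, not
dicts, and the atom/state names here are made up):

```python
def match_or_create(atom_names, saved_details):
    """Pair each atom name with its saved detail, creating fresh ones."""
    matched = {}
    for name in atom_names:
        if name in saved_details:
            # Same name as a previous run: resume from the saved detail.
            matched[name] = saved_details[name]
        else:
            # No detail with this name yet: create a fresh one.
            matched[name] = {"state": "PENDING", "result": None}
    return matched

# One atom finished in a prior run; one atom is new in this run.
saved = {"boot-vm": {"state": "SUCCESS", "result": "vm-1"}}
details = match_or_create(["boot-vm", "attach-volume"], saved)
```

Because the lookup key is the name, adding or removing other atoms does not
disturb the match, which is exactly the stability property listed above.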
Scenarios
=========

When a new flow is loaded into an engine, there is no persisted data for it
yet, so a corresponding :py:class:`~taskflow.persistence.logbook.FlowDetail`
object will be created, as well as a
:py:class:`~taskflow.persistence.logbook.AtomDetail` object for each atom that
is contained in it. These will be immediately saved into the persistence
backend that is configured. If no persistence backend is configured, then as
expected nothing will be saved and the atoms and flow will be run in a
non-persistent manner.

**Subsequent run:** When we resume the flow from a persistent backend (for
example, if the flow was interrupted and the engine destroyed to save resources
or if the service was restarted), we need to re-create the flow. For that, we
will call the function that was saved on first-time loading that builds the
flow for us (aka the flow factory function described above) and the engine will
run. The following scenarios explain some expected structural changes and how
they can be accommodated (and what the effect will be when resuming & running).
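A subsequent run can be modeled as: rebuild the flow from the factory, consult
the saved details, and (re)execute only atoms that did not previously finish.
The factory, atom names, and dict-based details below are hypothetical
simplifications of what the engine actually does:

```python
def flow_factory():
    # Stand-in for the reimportable factory recorded on first load.
    return ["download", "process", "upload"]

def resume(saved_details):
    results = {}
    for name in flow_factory():
        detail = saved_details.get(name, {"state": "PENDING"})
        if detail.get("state") == "SUCCESS":
            # Finished before the interruption: reuse the saved result.
            results[name] = detail["result"]
        else:
            # Unfinished (or newly added): (re)execute the atom.
            results[name] = name + "-done"
    return results

# Only "download" completed before the engine was interrupted.
saved = {"download": {"state": "SUCCESS", "result": "cached-bytes"}}
outcome = resume(saved)
```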
Same atoms
----------

When the factory function mentioned above returns the exact same flow and
atoms (no changes are performed).

**Runtime change:** Nothing should be done -- the engine will re-associate
atoms with :py:class:`~taskflow.persistence.logbook.AtomDetail` objects by name
and then the engine resumes.

Atom was added
--------------

When the factory function mentioned above alters the flow by adding a new atom
(for example, to change the runtime structure of what was previously run in
the first run).

**Runtime change:** By default when the engine resumes it will notice that a
corresponding :py:class:`~taskflow.persistence.logbook.AtomDetail` does not
exist and one will be created and associated.
Atom was removed
----------------

When the factory function mentioned above alters the flow by removing an
existing atom (for example, to change the runtime structure of what was
previously run in the first run).

**Runtime change:** Nothing should be done -- the flow structure is reloaded
from the factory function, and the removed atom is not in it -- so the flow
will be run as if it was not there, and any results it returned (if it was
completed before) will be ignored.
Atom code was changed
---------------------

When the factory function mentioned above alters the flow by deciding that a
newer version of a previously existing atom should be run (possibly to perform
some kind of upgrade or to fix a bug in a prior atom's code).

**Factory change:** The atom name & version will have to be altered. The
factory should replace this name where it was being used previously.

**Runtime change:** This will fall under the same runtime adjustments that
exist when a new atom is added. In the future taskflow could make this easier
by providing an ``upgrade()`` function that can be used to give users the
ability to upgrade atoms before running (manual introspection & modification of
a :py:class:`~taskflow.persistence.logbook.LogBook` can be done before engine
loading and running to accomplish this in the meantime).
Atom was split in two atoms or merged from two (or more) to one atom
--------------------------------------------------------------------

When the factory function mentioned above alters the flow by deciding that a
previously existing atom should be split into N atoms or that N atoms should
be merged into fewer than N atoms (typically occurring during refactoring).

**Runtime change:** This will fall under the same runtime adjustments that
exist when a new atom is added or removed. In the future taskflow could make
this easier by providing a ``migrate()`` function that can be used to give
users the ability to migrate an atom's previous data before running (manual
introspection & modification of a
:py:class:`~taskflow.persistence.logbook.LogBook` can be done before engine
loading and running to accomplish this in the meantime).
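The manual workaround described above (introspecting and adjusting saved data
before the engine loads) can be sketched as a small migration pass. The
``migrate_details`` helper, atom names, and dict-based detail format are
hypothetical; a real migration would operate on logbook objects:

```python
def migrate_details(saved_details, renames):
    """Carry saved details over to renamed or split replacement atoms."""
    migrated = {}
    for old_name, detail in saved_details.items():
        # An old atom may map to one new name (rename) or several (split);
        # names not listed in the mapping are carried over unchanged.
        for new_name in renames.get(old_name, [old_name]):
            migrated[new_name] = dict(detail)
    return migrated

saved = {"resize": {"state": "SUCCESS", "result": 42}}
# "resize" was split into two atoms during a refactoring.
migrated = migrate_details(saved, {"resize": ["shrink", "grow"]})
```

Running such a pass before engine loading lets the replacement atoms start from
the old atom's saved state instead of from scratch.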
Flow structure was changed
--------------------------

If manual links were added or removed from the graph, or task requirements
were changed, or the flow was refactored (an atom moved into or out of
subflows, a linear flow was replaced with a graph flow, tasks were reordered
in a linear flow, etc.).

**Runtime change:** Nothing should be done.