From 7f6ef479c19baeddd876384c9f8a935f59f9ccbb Mon Sep 17 00:00:00 2001
From: Joshua Harlow <harlowja@gmail.com>
Date: Sat, 26 Apr 2014 13:47:15 -0700
Subject: [PATCH] Add a resumption strategy doc

Move docs from wiki to developer docs and
add on and adjust to reflect the current
state of things.

Change-Id: I50ab1ebeb33074d1fbc7493749d0d518b66de69e
---
 doc/source/index.rst       |   1 +
 doc/source/persistence.rst |   3 +-
 doc/source/resumption.rst  | 156 +++++++++++++++++++++++++++++++++++++
 3 files changed, 158 insertions(+), 2 deletions(-)
 create mode 100644 doc/source/resumption.rst
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 84075223..a59575c6 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -18,6 +18,7 @@ Contents
    inputs_and_outputs
    notifications
    persistence
+   resumption
    exceptions
    utils
    states
diff --git a/doc/source/persistence.rst b/doc/source/persistence.rst
index f7fe810c..0a1f84b6 100644
--- a/doc/source/persistence.rst
+++ b/doc/source/persistence.rst
@@ -58,7 +58,7 @@ objects for each atom in the workflow the engine will be executing.
 
 **Execution:** When an engine beings to execute it will examine any previously existing
 :py:class:`~taskflow.persistence.logbook.AtomDetail` objects to see if they can be used
-for resuming; see `big picture`_ for more details on this subject. For atoms which have not
+for resuming; see :doc:`resumption <resumption>` for more details on this subject. For atoms which have not
 finished (or did not finish correctly from a previous run) they will begin executing
 only after any dependent inputs are ready. This is done by analyzing the execution
 graph and looking at predecessor :py:class:`~taskflow.persistence.logbook.AtomDetail`
@@ -88,7 +88,6 @@ A few scenarios come to mind:
   of map-reduce jobs on them.
 
 .. _hdfs: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
-.. _big picture: https://wiki.openstack.org/wiki/TaskFlow/Patterns_and_Engines/Persistence#Big_Picture
 
 .. note::
 
diff --git a/doc/source/resumption.rst b/doc/source/resumption.rst
new file mode 100644
index 00000000..b80fa909
--- /dev/null
+++ b/doc/source/resumption.rst
@@ -0,0 +1,156 @@
+----------
+Resumption
+----------
+
+Overview
+========
+
+**Question**: *How can we persist the flow so that it can be resumed, restarted or
+rolled-back on engine failure?*
+
+**Answer:** Since a flow is a set of :doc:`atoms <atoms>` and relations between atoms we
+need to create a model and corresponding information that allows us to persist
+the *right* amount of information to preserve, resume, and rollback a flow on
+software or hardware failure.
+
+To allow for resumption taskflow must be able to re-create the flow and re-connect
+the links between atom (and between atoms->atom details and so on) in order to
+revert those atoms or resume those atoms in the correct ordering. Taskflow provides
+a pattern that can help in automating this process (it does **not** prohibit the user
+from creating their own strategies for doing this).
+
+Factories
+=========
+
+The default provided way is to provide a `factory`_ function which will create (or
+recreate your workflow). This function can be provided when loading
+a flow and corresponding engine via the provided
+:py:meth:`load_from_factory() <taskflow.engines.helpers.load_from_factory>` method. This
+`factory`_ function is expected to be a function (or ``staticmethod``) which is reimportable (aka
+has a well defined name that can be located by the ``__import__`` function in python, this
+excludes ``lambda`` style functions and ``instance`` methods). The `factory`_ function
+name will be saved into the logbook and it will be imported and called to create the
+workflow objects (or recreate it if resumption happens). This allows for the flow
+to be recreated if and when that is needed (even on remote machines, as long as the
+reimportable name can be located).
+
+.. _factory: https://en.wikipedia.org/wiki/Factory_%28object-oriented_programming%29
+
+Names
+=====
+
+When a flow is created it is expected that each atom has a unique name, this
+name serves a special purpose in the resumption process (as well as serving
+a useful purpose when running, allowing for atom identification in the
+:doc:`notification <notifications>` process). The reason for having names is that
+an atom in a flow needs to be somehow  matched with (a potentially)
+existing :py:class:`~taskflow.persistence.logbook.AtomDetail` during engine
+resumption & subsequent running.
+
+The match should be:
+
+* stable if atoms are added or removed
+* should not change when service is restarted, upgraded...
+* should be the same across all server instances in HA setups
+
+Names provide this although they do have weaknesses:
+
+* the names of atoms must be unique in flow
+* it becomes hard to change the name of atom since a name change causes other
+  side-effects
+
+.. note::
+
+    Even though these weaknesses names were selected as a *good enough* solution for the above
+    matching requirements (until something better is invented/created that can satisfy those
+    same requirements).
+
+Scenarios
+=========
+
+When new flow is loaded into engine, there is no persisted data
+for it yet, so a corresponding :py:class:`~taskflow.persistence.logbook.FlowDetail` object
+will be created, as well as a :py:class:`~taskflow.persistence.logbook.AtomDetail` object for
+each atom that is contained in it. These will be immediately saved into the persistence backend
+that is configured. If no persistence backend is configured, then as expected nothing will be
+saved and the atoms and flow will be ran in a non-persistent manner.
+
+**Subsequent run:** When we resume the flow from a persistent backend (for example,
+if the flow was interrupted and engine destroyed to save resources or if the
+service was restarted), we need to re-create the flow. For that, we will call
+the function that was saved on first-time loading that builds the flow for
+us (aka; the flow factory function described above) and the engine will run. The
+following scenarios explain some expected structural changes and how they can
+be accommodated (and what the effect will be when resuming & running).
+
+Same atoms
+----------
+
+When the factory function mentioned above returns the exact same the flow and
+atoms (no changes are performed).
+
+**Runtime change:** Nothing should be done -- the engine will re-associate
+atoms with :py:class:`~taskflow.persistence.logbook.AtomDetail` objects by name
+and then the engine resumes.
+
+Atom was added
+--------------
+
+When the factory function mentioned above alters the flow by adding
+a new atom in (for example for changing the runtime structure of what was previously
+ran in the first run).
+
+**Runtime change:** By default when the engine resumes it will notice that
+a corresponding :py:class:`~taskflow.persistence.logbook.AtomDetail` does not
+exist and one will be created and associated.
+
+Atom was removed
+----------------
+
+When the factory function mentioned above alters the flow by removing
+a new atom in (for example for changing the runtime structure of what was previously
+ran in the first run).
+
+**Runtime change:** Nothing should be done -- flow structure is reloaded from factory
+function, and removed atom is not in it -- so, flow will be ran as if it was
+not there, and any results it returned if it was completed before will be ignored.
+
+Atom code was changed
+---------------------
+
+When the factory function mentioned above alters the flow by deciding that a newer
+version of a previously existing atom should be ran (possibly to perform some
+kind of upgrade or to fix a bug in a prior atoms code).
+
+**Factory change:** The atom name & version will have to be altered. The
+factory should replace this name where it was being used previously.
+
+**Runtime change:** This will fall under the same runtime adjustments that exist
+when a new atom is added. In the future taskflow could make this easier by
+providing a ``upgrade()`` function that can be used to give users the ability
+to upgrade atoms before running (manual introspection & modification of a
+:py:class:`~taskflow.persistence.logbook.LogBook` can be done before engine loading
+and running to accomplish this in the meantime).
+
+Atom was split in two atoms or merged from two (or more) to one atom
+--------------------------------------------------------------------
+
+When the factory function mentioned above alters the flow by deciding that a previously
+existing atom should be split into N atoms or the factory function decides that N atoms
+should be merged in <N atoms (typically occurring during refactoring).
+
+**Runtime change:** This will fall under the same runtime adjustments that exist
+when a new atom is added or removed. In the future taskflow could make this easier by
+providing a ``migrate()`` function that can be used to give users the ability
+to migrate atoms previous data before running (manual introspection & modification of a
+:py:class:`~taskflow.persistence.logbook.LogBook` can be done before engine loading
+and running to accomplish this in the meantime).
+
+Flow structure was changed
+--------------------------
+
+If manual links were added or removed from graph, or task requirements were changed, or
+flow was refactored (atom moved into or out of subflows, linear flow was replaced with
+graph flow, tasks were reordered in linear flow, etc).
+
+**Runtime change:** Nothing should be done.