From 7f6ef479c19baeddd876384c9f8a935f59f9ccbb Mon Sep 17 00:00:00 2001
From: Joshua Harlow
Date: Sat, 26 Apr 2014 13:47:15 -0700
Subject: [PATCH] Add a resumption strategy doc

Move docs from wiki to developer docs and add on and adjust to reflect the current state of things.

Change-Id: I50ab1ebeb33074d1fbc7493749d0d518b66de69e
---
 doc/source/index.rst       |   1 +
 doc/source/persistence.rst |   3 +-
 doc/source/resumption.rst  | 156 +++++++++++++++++++++++++++++++++++++
 3 files changed, 158 insertions(+), 2 deletions(-)
 create mode 100644 doc/source/resumption.rst

diff --git a/doc/source/index.rst b/doc/source/index.rst
index 84075223..a59575c6 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -18,6 +18,7 @@ Contents
    inputs_and_outputs
    notifications
    persistence
+   resumption
    exceptions
    utils
    states
diff --git a/doc/source/persistence.rst b/doc/source/persistence.rst
index f7fe810c..0a1f84b6 100644
--- a/doc/source/persistence.rst
+++ b/doc/source/persistence.rst
@@ -58,7 +58,7 @@ objects for each atom in the workflow the engine will be executing.

 **Execution:** When an engine beings to execute it will examine any previously existing
 :py:class:`~taskflow.persistence.logbook.AtomDetail` objects to see if they can be used
-for resuming; see `big picture`_ for more details on this subject. For atoms which have not
+for resuming; see :doc:`resumption ` for more details on this subject. For atoms which have not
 finished (or did not finish correctly from a previous run) they will begin executing
 only after any dependent inputs are ready. This is done by analyzing the execution
 graph and looking at predecessor :py:class:`~taskflow.persistence.logbook.AtomDetail`
@@ -88,7 +88,6 @@ A few scenarios come to mind:
   of map-reduce jobs on them.

 .. _hdfs: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
-.. _big picture: https://wiki.openstack.org/wiki/TaskFlow/Patterns_and_Engines/Persistence#Big_Picture

 .. 
note::
diff --git a/doc/source/resumption.rst b/doc/source/resumption.rst
new file mode 100644
index 00000000..b80fa909
--- /dev/null
+++ b/doc/source/resumption.rst
@@ -0,0 +1,156 @@
+----------
+Resumption
+----------
+
+Overview
+========
+
+**Question**: *How can we persist the flow so that it can be resumed, restarted or
+rolled-back on engine failure?*
+
+**Answer:** Since a flow is a set of :doc:`atoms ` and relations between atoms we
+need to create a model and corresponding information that allows us to persist
+the *right* amount of information to preserve, resume, and roll back a flow on
+software or hardware failure.
+
+To allow for resumption taskflow must be able to re-create the flow and re-connect
+the links between atoms (and between atoms->atom details and so on) in order to
+revert those atoms or resume those atoms in the correct ordering. Taskflow provides
+a pattern that can help in automating this process (it does **not** prohibit the user
+from creating their own strategies for doing this).
+
+Factories
+=========
+
+The default provided way is to provide a `factory`_ function which will create (or
+recreate) your workflow. This function can be provided when loading
+a flow and corresponding engine via the provided
+:py:meth:`load_from_factory() ` method. This
+`factory`_ function is expected to be a function (or ``staticmethod``) which is reimportable (aka
+has a well defined name that can be located by the ``__import__`` function in Python; this
+excludes ``lambda`` style functions and ``instance`` methods). The `factory`_ function
+name will be saved into the logbook and it will be imported and called to create the
+workflow objects (or recreate them if resumption happens). This allows for the flow
+to be recreated if and when that is needed (even on remote machines, as long as the
+reimportable name can be located).
+
+.. 
_factory: https://en.wikipedia.org/wiki/Factory_%28object-oriented_programming%29
+
+Names
+=====
+
+When a flow is created it is expected that each atom has a unique name; this
+name serves a special purpose in the resumption process (as well as serving
+a useful purpose when running, allowing for atom identification in the
+:doc:`notification ` process). The reason for having names is that
+an atom in a flow needs to be somehow matched with (a potentially)
+existing :py:class:`~taskflow.persistence.logbook.AtomDetail` during engine
+resumption & subsequent running.
+
+The match should be:
+
+* stable if atoms are added or removed
+* unchanged when the service is restarted, upgraded...
+* the same across all server instances in HA setups
+
+Names provide this although they do have weaknesses:
+
+* the names of atoms must be unique in a flow
+* it becomes hard to change the name of an atom since a name change causes other
+  side-effects
+
+.. note::
+
+    Even with these weaknesses, names were selected as a *good enough* solution for the above
+    matching requirements (until something better is invented/created that can satisfy those
+    same requirements).
+
+Scenarios
+=========
+
+When a new flow is loaded into an engine, there is no persisted data
+for it yet, so a corresponding :py:class:`~taskflow.persistence.logbook.FlowDetail` object
+will be created, as well as a :py:class:`~taskflow.persistence.logbook.AtomDetail` object for
+each atom that is contained in it. These will be immediately saved into the persistence backend
+that is configured. If no persistence backend is configured, then as expected nothing will be
+saved and the atoms and flow will be run in a non-persistent manner.
+
+**Subsequent run:** When we resume the flow from a persistent backend (for example,
+if the flow was interrupted and the engine destroyed to save resources or if the
+service was restarted), we need to re-create the flow. 
For that, we will call
+the function that was saved on first-time loading that builds the flow for
+us (aka the flow factory function described above) and the engine will run. The
+following scenarios explain some expected structural changes and how they can
+be accommodated (and what the effect will be when resuming & running).
+
+Same atoms
+----------
+
+When the factory function mentioned above returns the exact same flow and
+atoms (no changes are performed).
+
+**Runtime change:** Nothing should be done -- the engine will re-associate
+atoms with :py:class:`~taskflow.persistence.logbook.AtomDetail` objects by name
+and then the engine resumes.
+
+Atom was added
+--------------
+
+When the factory function mentioned above alters the flow by adding
+a new atom (for example, to change the runtime structure of what was previously
+run in the first run).
+
+**Runtime change:** By default when the engine resumes it will notice that
+a corresponding :py:class:`~taskflow.persistence.logbook.AtomDetail` does not
+exist and one will be created and associated.
+
+Atom was removed
+----------------
+
+When the factory function mentioned above alters the flow by removing
+an existing atom (for example, to change the runtime structure of what was previously
+run in the first run).
+
+**Runtime change:** Nothing should be done -- the flow structure is reloaded from the factory
+function, and the removed atom is not in it -- so the flow will be run as if it was
+not there, and any results it returned if it was completed before will be ignored.
+
+Atom code was changed
+---------------------
+
+When the factory function mentioned above alters the flow by deciding that a newer
+version of a previously existing atom should be run (possibly to perform some
+kind of upgrade or to fix a bug in a prior atom's code).
+
+**Factory change:** The atom name & version will have to be altered. The
+factory should replace this name where it was being used previously.
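The name-based matching used in these scenarios can be illustrated with a toy model. This sketch is not taskflow's implementation: ``ToyAtomDetail`` and ``reassociate`` are hypothetical names invented for illustration, and it assumes saved details are looked up purely by atom name. Under that assumption a renamed atom behaves like a newly added one, and the detail saved under the old name is left unused.

```python
# Toy model only: ToyAtomDetail and reassociate are hypothetical and are
# not part of taskflow's API.
from dataclasses import dataclass


@dataclass
class ToyAtomDetail:
    """Stand-in for a persisted record of one atom's prior run."""
    name: str
    version: str
    state: str  # for example "SUCCESS" or "PENDING"


def reassociate(atoms, saved_details):
    """Match (name, version) atoms from a factory to saved details by name.

    Returns a tuple of (matched, created, ignored):
      matched: dict mapping atom name to its (existing or new) detail
      created: names for which no saved detail existed
      ignored: names of saved details that no current atom uses
    """
    by_name = {detail.name: detail for detail in saved_details}
    matched = {}
    created = []
    for name, version in atoms:
        detail = by_name.get(name)
        if detail is None:
            # No prior record under this name: treat it as a new atom.
            detail = ToyAtomDetail(name, version, "PENDING")
            created.append(name)
        matched[name] = detail
    # Saved records whose atom was removed (or renamed) are left unused.
    ignored = sorted(set(by_name) - set(matched))
    return matched, created, ignored


# Details saved by a first run.
saved = [
    ToyAtomDetail("fetch", "1.0", "SUCCESS"),
    ToyAtomDetail("resize", "1.0", "SUCCESS"),
]
# On resumption the factory returns "resize" under a new name and version.
matched, created, ignored = reassociate(
    [("fetch", "1.0"), ("resize-v2", "2.0")], saved)
# created == ["resize-v2"] and ignored == ["resize"]
```

In this model ``fetch`` keeps its saved ``SUCCESS`` record, ``resize-v2`` starts from a fresh ``PENDING`` record, and the record saved for ``resize`` is never consulted.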
+
+**Runtime change:** This will fall under the same runtime adjustments that exist
+when a new atom is added. In the future taskflow could make this easier by
+providing an ``upgrade()`` function that can be used to give users the ability
+to upgrade atoms before running (manual introspection & modification of a
+:py:class:`~taskflow.persistence.logbook.LogBook` can be done before engine loading
+and running to accomplish this in the meantime).
+
+Atom was split in two atoms or merged from two (or more) to one atom
+--------------------------------------------------------------------
+
+When the factory function mentioned above alters the flow by deciding that a previously
+existing atom should be split into N atoms or the factory function decides that N atoms
+should be merged in