Persistence
Overview
To receive inputs and create outputs from atoms (or other engine processes) in a fault-tolerant way, what atoms output must be placed in some kind of location where it can be re-used by other atoms (or used for other purposes). To accommodate this type of usage TaskFlow provides an abstraction (provided by pluggable stevedore backends) that is similar in concept to a running program's memory.
This abstraction serves the following major purposes:
- Tracking of what was done (introspection).
- Saving memory, which allows for restarting from the last saved state; this is a critical feature for restarting and resuming workflows (checkpointing).
- Associating additional metadata with atoms while running (without requiring those atoms to save this data themselves). This makes it possible to add new metadata in the future without having to change the atoms themselves. For example the following can be saved:
  - Timing information (how long a task took to run).
  - User information (who the task ran as).
  - When an atom/workflow was run (and why).
- Saving historical data (failures, successes, intermediary results...) to allow retry atoms to decide whether they should continue or stop.
- Something you create...
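As a rough illustration of associating metadata with an atom without the atom saving it itself, the sketch below wraps a callable and records timing and user information in a separate store. The names `run_with_metadata` and `meta_store` are hypothetical and are not part of TaskFlow; a plain dictionary stands in for persisted metadata.

```python
import os
import time


def run_with_metadata(atom_name, func, meta_store):
    """Run func() and record metadata about the run in meta_store.

    meta_store is a plain dict standing in for persisted atom metadata
    (an assumption for this sketch); func never touches it.
    """
    started = time.monotonic()
    result = func()
    meta_store[atom_name] = {
        # How long the callable took to run.
        "duration_seconds": time.monotonic() - started,
        # Who the callable ran as (falls back when USER is unset).
        "ran_as": os.environ.get("USER", "unknown"),
    }
    return result


meta = {}
value = run_with_metadata("add", lambda: 1 + 2, meta)
```

Because the wrapper owns the bookkeeping, new metadata keys can be added later without changing the wrapped callable.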
How it is used
On engine construction a backend is typically provided (it is optional) which satisfies the taskflow.persistence.base.Backend abstraction. Along with the backend object, a taskflow.persistence.logbook.FlowDetail object is also created and passed to the engine constructor (or to the associated load() helper functions, see taskflow.engines.helpers.load); this object contains the details about the flow to be run. Typically a taskflow.persistence.logbook.FlowDetail object is created from a taskflow.persistence.logbook.LogBook object (the book acts as a type of container for taskflow.persistence.logbook.FlowDetail and taskflow.persistence.logbook.AtomDetail objects).
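The containment relationship described above (a book holds flow details, which hold atom details) can be pictured with a small stand-in model. This is only a sketch: the `Toy*` classes, their fields and their `add()` methods are hypothetical and are not TaskFlow's actual LogBook, FlowDetail or AtomDetail classes.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ToyAtomDetail:
    """Stand-in for the per-atom record (state and results)."""
    name: str
    state: str = "PENDING"
    results: object = None


@dataclass
class ToyFlowDetail:
    """Stand-in for the per-flow record; holds atom details by name."""
    name: str
    uuid: str = field(default_factory=lambda: str(uuid4()))
    atoms: dict = field(default_factory=dict)

    def add(self, atom_detail):
        self.atoms[atom_detail.name] = atom_detail


@dataclass
class ToyLogBook:
    """Stand-in for the book; holds flow details by uuid."""
    name: str
    uuid: str = field(default_factory=lambda: str(uuid4()))
    flows: dict = field(default_factory=dict)

    def add(self, flow_detail):
        self.flows[flow_detail.uuid] = flow_detail


book = ToyLogBook("nightly-run")
flow_detail = ToyFlowDetail("resize-images")
book.add(flow_detail)
flow_detail.add(ToyAtomDetail("download"))
```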
Preparation: Once an engine starts to run it will create a taskflow.storage.Storage object which will act as the engine's interface to the underlying backend storage objects (it provides helper functions that are commonly used by the engine, avoiding repeated code when interacting with the provided taskflow.persistence.logbook.FlowDetail and taskflow.persistence.base.Backend objects). As an engine initializes it will extract (or create) taskflow.persistence.logbook.AtomDetail objects for each atom in the workflow the engine will be executing.
Execution: When an engine begins to execute (see the engines documentation for more details about how an engine goes about this process) it will examine any previously existing taskflow.persistence.logbook.AtomDetail objects to see if they can be used for resuming; see the resumption documentation for more details on this subject. Atoms which have not finished (or did not finish correctly in a previous run) will begin executing only after the inputs they depend on are ready. This is done by analyzing the execution graph and looking at predecessor taskflow.persistence.logbook.AtomDetail outputs and states (which may have been persisted in a past run). This will result in either using their previous information or running those predecessors and saving their output to the taskflow.persistence.logbook.FlowDetail and taskflow.persistence.base.Backend objects. This execution, analysis and interaction with the storage objects continues (what is described here is a simplification of what really happens, which is quite a bit more complex) until the engine has finished running (at which point the engine will have succeeded or failed in its attempt to run the workflow).
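The resume behaviour described above can be sketched in a much simplified form: atoms whose results were persisted by a past run are skipped, and the remaining atoms run once their predecessors' results are available. Everything here is hypothetical (the `run_flow` function, the tuple layout, the state strings and the `saved` dictionary standing in for persisted atom details); a real engine analyzes an execution graph rather than an ordered list.

```python
def run_flow(atoms, saved):
    """Run atoms in order, skipping any whose result was already saved.

    atoms: list of (name, func, requires) tuples, ordered so that every
           atom comes after the atoms named in its requires list.
    saved: dict mapping atom name -> {"state": ..., "result": ...};
           stands in for persisted atom details.
    Returns the names of the atoms that actually executed.
    """
    executed = []
    for name, func, requires in atoms:
        detail = saved.get(name)
        if detail is not None and detail["state"] == "SUCCESS":
            continue  # Reuse the result persisted by a previous run.
        # Inputs come from predecessor results, persisted or fresh.
        inputs = [saved[dep]["result"] for dep in requires]
        try:
            result = func(*inputs)
        except Exception as exc:
            # Record the failure before propagating it.
            saved[name] = {"state": "FAILURE", "result": repr(exc)}
            raise
        saved[name] = {"state": "SUCCESS", "result": result}
        executed.append(name)
    return executed


atoms = [
    ("load", lambda: 10, []),
    ("double", lambda x: x * 2, ["load"]),
    ("report", lambda x: "total=%d" % x, ["double"]),
]
# Pretend a previous run finished "load" and then crashed.
saved = {"load": {"state": "SUCCESS", "result": 10}}
executed = run_flow(atoms, saved)
```

In this run only "double" and "report" execute, because "load" is satisfied from the saved state.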
Post-execution: Typically when an engine is done running the logbook would be discarded (to avoid creating a stockpile of useless data) and the backend storage would be told to delete any contents for a given execution. For certain use-cases though it may be advantageous to retain logbooks and their contents.
A few scenarios come to mind:
- Post runtime failure analysis and triage (saving what failed and why).
- Metrics (saving timing information associated with each atom and using it to perform offline performance analysis, which enables tuning tasks and/or isolating and fixing slow tasks).
- Data mining logbooks to find trends (in failures for example).
- Saving logbooks for further forensic analysis.
- Exporting logbooks to HDFS (or other NoSQL storage) and running some type of map-reduce jobs on them.
Note
It should be emphasized that the logbook is the authoritative, and, preferably, the only (see inputs and outputs) source of run-time state information (breaking this principle makes it hard/impossible to restart or resume in any type of automated fashion). When an atom returns a result, it should be written directly to a logbook. When atom or flow state changes in any way, the logbook is the first to know (see notifications for how a user may also get notified of those same state changes). The logbook, a backend and the associated storage helper class are responsible for storing the actual data. Used together, these components specify the persistence mechanism (how data is saved and where: memory, database, whatever...) and the persistence policy (when data is saved: every time it changes, at some particular moments, or simply never).
Usage
To select which persistence backend to use you should use the fetch() function (taskflow.persistence.backends.fetch), which uses entrypoints (internally using stevedore) to fetch and configure your backend. This is simpler than accessing the backend data types directly and provides a common function from which a backend can be fetched.
Using this function to fetch a backend might look like:
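from taskflow.persistence import backends

...
# Fetch and configure the backend named by the 'connection' key.
persistence = backends.fetch(conf={
    "connection": "mysql",
    "user": ...,
    "password": ...,
})
# make_and_save_logbook() is a placeholder for your own helper.
book = make_and_save_logbook(persistence)
...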
As can be seen above, the conf parameter is a dictionary that is used to fetch and configure your backend. The restrictions on it are the following:
- It must be a dictionary (or dictionary-like type) holding the backend type under the key 'connection', and possibly type-specific backend parameters as other keys.
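To make the role of the 'connection' key concrete, here is a minimal sketch of dispatching on it. It is not how fetch() is implemented (that goes through stevedore entrypoints); the `toy_fetch` function, the `_REGISTRY` table, the `Toy*` classes and the "path" parameter are all hypothetical and only illustrate that one key selects the backend type while the remaining keys are passed along as type-specific parameters.

```python
class ToyMemoryBackend:
    """Hypothetical stand-in for an in-memory backend."""

    def __init__(self, conf):
        self.conf = conf


class ToyDirBackend:
    """Hypothetical stand-in for a directory/file backend."""

    def __init__(self, conf):
        self.conf = conf


# Hypothetical lookup table; a real plugin system would discover
# backends dynamically instead of hard-coding them.
_REGISTRY = {
    "memory": ToyMemoryBackend,
    "dir": ToyDirBackend,
    "file": ToyDirBackend,
}


def toy_fetch(conf):
    """Pick a backend class from conf['connection'] and configure it."""
    try:
        kind = conf["connection"]
    except KeyError:
        raise ValueError("conf must contain a 'connection' key") from None
    try:
        backend_cls = _REGISTRY[kind]
    except KeyError:
        raise ValueError("unknown backend type: %r" % kind) from None
    # The whole dictionary is handed over so the backend can read its
    # own type-specific keys.
    return backend_cls(conf)


backend = toy_fetch({"connection": "dir", "path": "/tmp/toy-logbooks"})
```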
Types
Memory
Connection: 'memory'
Retains all data in local memory (not persisted to reliable storage). Useful for scenarios where persistence is not required (and also in unit tests).
Note
See taskflow.persistence.backends.impl_memory.MemoryBackend for implementation details.
Files
Connection: 'dir' or 'file'
Retains all data in a directory & file based structure on local disk. Will be persisted locally in the case of system failure (allowing for resumption from the same local machine only). Useful for cases where a more reliable persistence is desired along with the simplicity of files and directories (a concept everyone is familiar with).
Note
See taskflow.persistence.backends.impl_dir.DirBackend for implementation details.
Sqlalchemy
Connection: 'mysql' or 'postgres' or 'sqlite'
Retains all data in an ACID compliant database using the sqlalchemy library for schemas, connections, and database interaction functionality. Useful when you need a higher level of durability than offered by the previous solutions. When using these connection types it is possible to resume an engine from a peer machine (this does not apply when using sqlite).
Schema
Logbooks
Name | Type | Primary Key |
---|---|---|
created_at | DATETIME | False |
updated_at | DATETIME | False |
uuid | VARCHAR | True |
name | VARCHAR | False |
meta | TEXT | False |
Flow details
Name | Type | Primary Key |
---|---|---|
created_at | DATETIME | False |
updated_at | DATETIME | False |
uuid | VARCHAR | True |
name | VARCHAR | False |
meta | TEXT | False |
state | VARCHAR | False |
parent_uuid | VARCHAR | False |
Atom details
Name | Type | Primary Key |
---|---|---|
created_at | DATETIME | False |
updated_at | DATETIME | False |
uuid | VARCHAR | True |
name | VARCHAR | False |
meta | TEXT | False |
atom_type | VARCHAR | False |
state | VARCHAR | False |
intention | VARCHAR | False |
results | TEXT | False |
failure | TEXT | False |
version | TEXT | False |
parent_uuid | VARCHAR | False |
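For readers who prefer DDL, the three tables above can be sketched with the standard library sqlite3 module. The column names, types and primary keys come from the tables; the table names (`logbooks`, `flowdetails`, `atomdetails`) and the idea that `parent_uuid` holds the parent row's `uuid` are assumptions made for this illustration, and the actual backend defines its schema through sqlalchemy.

```python
import sqlite3

# Column lists mirror the schema tables; table names are assumed.
SCHEMA = """
CREATE TABLE logbooks (
    created_at DATETIME,
    updated_at DATETIME,
    uuid VARCHAR PRIMARY KEY,
    name VARCHAR,
    meta TEXT
);
CREATE TABLE flowdetails (
    created_at DATETIME,
    updated_at DATETIME,
    uuid VARCHAR PRIMARY KEY,
    name VARCHAR,
    meta TEXT,
    state VARCHAR,
    parent_uuid VARCHAR
);
CREATE TABLE atomdetails (
    created_at DATETIME,
    updated_at DATETIME,
    uuid VARCHAR PRIMARY KEY,
    name VARCHAR,
    meta TEXT,
    atom_type VARCHAR,
    state VARCHAR,
    intention VARCHAR,
    results TEXT,
    failure TEXT,
    version TEXT,
    parent_uuid VARCHAR
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO logbooks (uuid, name) VALUES (?, ?)",
    ("book-1", "demo"),
)
conn.execute(
    "INSERT INTO flowdetails (uuid, name, state, parent_uuid) "
    "VALUES (?, ?, ?, ?)",
    ("flow-1", "demo-flow", "RUNNING", "book-1"),
)
# Walk from a flow detail up to its (assumed) parent logbook.
rows = conn.execute(
    "SELECT f.name, b.name FROM flowdetails f "
    "JOIN logbooks b ON f.parent_uuid = b.uuid"
).fetchall()
```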
Note
See taskflow.persistence.backends.impl_sqlalchemy.SQLAlchemyBackend for implementation details.
Warning
Currently there is a size limit (not applicable for sqlite) on what the results column can contain. This size limit restricts how many prior failures a retry atom can contain. More information and a future fix will be posted to bug 1416088 (in the meantime try to ensure that your retry unit's history does not grow beyond ~80 prior results).
Zookeeper
Connection: 'zookeeper'
Retains all data in a zookeeper backend (zookeeper exposes operations on files and directories, similar to the above 'dir' or 'file' connection types). Internally the kazoo library is used to interact with zookeeper to perform reliable, distributed and atomic operations on the contents of a logbook represented as znodes. Since zookeeper is also distributed, it is also possible to resume an engine from a peer machine (similar to the database connection types listed previously).
Note
See taskflow.persistence.backends.impl_zookeeper.ZkBackend for implementation details.
Interfaces
taskflow.persistence.backends
taskflow.persistence.base
taskflow.persistence.logbook
Implementations
taskflow.persistence.backends.impl_dir
taskflow.persistence.backends.impl_memory
taskflow.persistence.backends.impl_sqlalchemy
taskflow.persistence.backends.impl_zookeeper
Storage
taskflow.storage
Hierarchy
taskflow.persistence.base taskflow.persistence.backends.impl_dir taskflow.persistence.backends.impl_memory taskflow.persistence.backends.impl_sqlalchemy taskflow.persistence.backends.impl_zookeeper