Better scoping for atoms and flows

Change-Id: Ie2493e2b3f99b8aa657b1638de5343c4bc6f4c6c
This commit is contained in:
Joshua Harlow 2014-06-13 21:37:49 -07:00 committed by Joshua Harlow
parent 613920da8b
commit 2fd15a69e1
1 changed files with 397 additions and 0 deletions

View File

@ -0,0 +1,397 @@
=====================================
Better scoping for atoms and flows.
=====================================
Include the URL of your launchpad blueprint:
https://blueprints.launchpad.net/taskflow/+spec/improved-scoping
Better scoping for atoms and flows.
Problem description
===================
Currently `atoms`_ in taskflow have the ability to declare, rebind, and receive
automatically needed runtime symbols (see `arguments and results`_) and provide
named output symbols. This feature is required to enable an external entity to
arrange execution ordering as well as to allow for state transfer/retainment
to be performed by a workflow runtime (in this case the `engine`_
concept). This is quite useful to allow for automated resuming (and other
features, such as parallel execution) since when an atom does not
maintain state (or maintains very little) internally, the workflow runtime
can track the execution flow and the output symbols and input
symbols (with associated values) of atoms using
various `persistence strategies`_ (it also allows the engine to notify others
about state transitions and other nifty features...).
**Note:** In a way this externalized state is equivalent to a
workflows *memory* (without actually retaining that state on the runtime stack
and/or heap). This externalization of workflow execution & state enables
many innovative strategies that can be explored in the future and is one of the
key design patterns that taskflow was built in-mind with.
In general though we need to increase the usefulness & usability of the current
mechanism. It currently has some of the following drawbacks (included with
each is an example to make this more clear as to why it's a drawback):
* Overly complicated ``rebinding/requires/provides`` manipulation to avoid
symbol naming conflicts. For example, when one atom in a `flow`_ produces an
output that has a name of another atoms input and a third atom in a subflow
also produces the same output there is required to be
a `rebinding`_ application to ensure that the right output goes into the
right input. This results in a bad (not horrible, just bad) user
interaction & experience.
**Example:**
::
from taskflow import task
from taskflow.patterns import linear_flow
class Dumper(task.Task):
def execute(self, *args, **kwargs):
print(self)
print(args)
print(kwargs)
r = linear_flow.Flow("root")
r.add(
Dumper(provides=['a']),
Dumper(requires=['a'],
provides=['c']),
Dumper(requires=['c']),
)
sr = linear_flow.Flow("subroot")
sr.add(Dumper(provides=['c']))
sr.add(Dumper(requires=['c']))
# This line fails with the following message:
#
# subroot provides ['c'] but is already being provided by root and
# duplicate producers are disallowed
#
# It can be resolved by renaming provides=['c'] -> provides=['c_1'] (and
# subsequently renaming the following requires in the subroot flow) which
# instead should be resolved by proper lookup & scoping of inputs and
# outputs.
r.add(sr)
As can be seen in this example we have created a scoping *like* mechanism via
the nested flow concept & implementation but we have not made it as easy as it
should be to nest flows that have *conflicting* symbols for inputs (aka,
required bound symbols) and outputs (aka, provided bound symbols). This should
be resolved so that it becomes much easier to combine arbitrary flows together
without having to worry about symbol naming errors & associated issues.
**Note:** we can (and certain folks are) using the ability to inject symbols
that are requirements of atoms before running to create a atom local scope.
This helps in avoiding some of the runtime naming issues but does not solve the
full problem.
* Loss of nesting post-compilation/post-runtime. This makes it hard to do
extraction of results post-execution since it limits how those results can
be fetched (making it unintuitive to users why they can not extract results
for a given nesting hierarchy). This results in a bad user experience (and
likely is not what users expect).
**Example:**
::
from taskflow.engines.action_engine import engine
from taskflow.patterns import linear_flow
from taskflow.persistence.backends import impl_memory
from taskflow import task
from taskflow.utils import persistence_utils as pu
class Dumper(task.Task):
def execute(self, *args, **kwargs):
print("Executing: %s" % self)
print(args)
print(kwargs)
return (self.name,)
r = linear_flow.Flow("root")
r.add(
Dumper(name="r-0", provides=['a']),
Dumper(name="r-1", requires=['a'], provides=['c']),
Dumper(name="r-2", requires=['c']),
)
sr = linear_flow.Flow("subroot")
sr.add(Dumper(name="sr-0", provides=['c_1']))
sr.add(Dumper(name="sr-1", requires=['c_1']))
r.add(sr)
# Create needed persistence layers/backends...
storage_backend = impl_memory.MemoryBackend()
detail_book, flow_detail = pu.temporary_flow_detail(storage_backend)
# Create an engine and run.
engine_conf = {}
e = engine.SingleThreadedActionEngine(
r, flow_detail, storage_backend, engine_conf)
e.compile()
e.run()
print("Done:")
print e.storage.fetch_all()
# Output produced is the following:
#
# Executing: r-0==1.0
# ()
# {}
# Executing: r-1==1.0
# ()
# {'a': 'r-0'}
# Executing: r-2==1.0
# ()
# {'c': 'r-1'}
# Executing: sr-0==1.0
# ()
# {}
# Executing: sr-1==1.0
# ()
# {'c_1': 'sr-0'}
# Done:
# {'a': 'r-0', 'c_1': 'sr-0', 'c': 'r-1'}
#
# No exposed API to get just the results of 'subroot', the only exposed
# API is to get by atom name or all, this makes it hard for users that just
# want to extract individual results from a given segment of the
# overall hierarchy.
To increase the usefulness of the storage, persistence and workflow concept
we need to expand the inference, validation, input and output, storage and
runtime lookup mechanism to better account for the `scope`_ a atom resides
in.
.. _atoms: http://docs.openstack.org/developer/taskflow/atoms.html#atom
.. _arguments and results: http://docs.openstack.org/developer/taskflow/arguments_and_results.html#arguments-specification
.. _engine: http://docs.openstack.org/developer/taskflow/engines.html
.. _scope: https://en.wikipedia.org/wiki/Scope_%28computer_science%29
.. _rebinding: http://docs.openstack.org/developer/taskflow/arguments_and_results.html#rebinding
.. _flow: http://docs.openstack.org/developer/taskflow/patterns.html#taskflow.flow.Flow
.. _persistence strategies: http://docs.openstack.org/developer/taskflow/persistence.html
Proposed user facing change
===========================
To ensure the case where a subflow produces output symbols that conflict with a
contained parent flow we will allow for a subflow to provide the same output
as a prior sibling/parent instead of denying that addition. This means that if
a parent flow contains a atom/flow ``X`` that produces symbol ``a`` and it
contains another atom or subflow ``Y`` that also produces ``a`` the ``a`` which
will be visible to items following ``Y`` will be the ``a`` produced
by ``Y`` and not by ``X``. For the items inside ``Y`` the ``a`` that will be
visible will be determined by the location in ``Y`` where ``a`` is
produced (the items that use ``a`` before ``a`` is produced in ``Y`` will use
the ``a`` produced by ``X`` and the items after ``a`` is produced in ``Y`` will
use the new ``a``). This type of *shadowing* reflects a concept how people
familiar with programming already use (`variable name shadowing`_).
To allow a flow to retain even *more* control of its exposed input and output
symbols we will introduce the following new flow constructor parameter.
* ``contain=<CONSTANT>``: when set on a flow object this attribute will cause
the flow to behave differently when intermixed with other flows. One of the
constants to be will be ``contain=REQUIRES`` which will denote that this
flow will use only requirements that are produced by the atoms contained
in itself and **not** try to require any symbols from its parent or prior
sibling flows or atoms. This attribute literally means the scope of
the flow will be completly self contained. A second constant (these
constants can be *ORed* together to combine them in various ways) will
be ``contain=PROVIDES`` which will denote that the symbols this
flow *may* produce will **not** be consumable by any subsequent sibling
flows or atoms. This attribute literally means that the scope of the flow
will be restricted to **only** using requirements from prior sibling or
parent flows and the produced output symbols will **not** be visible to
subsequent sibling flows or atoms.
When no constant is provided we will assume the standard routine of not
restricting input and output symbols and only applying the shadowing rule
defined previously.
**Note:** depending on time constraints we have the ability to just skip the
different ``contain`` constants and just do the shadowing approach (and later
add in the other various constants as time permits).
.. _variable name shadowing: https://en.wikipedia.org/wiki/Variable_shadowing
Proposed runtime change
=======================
During runtime we will be required to create a logical structure which retains
the same user facing constraints. To do this we will retain information about
the atom and flow `symbol table`_ like hierarchy at runtime in a secondary
tree structure (so now instead of *just* retaining a directed graph of the
atoms and flows prior structure we will retain a directed graph and a tree
hierarchical structure).
This tree structure will contain a representation of the hierarchy that
atoms were composed in and the symbols being produced at the different levels.
For example an atom in a top level flow will be at a higher level in that tree
and a atom in a subflow will be at a lower level in that tree. The leaf nodes
of the tree will be the individual atom objects + any associated metadata and
the non-leaf nodes will be the flow objects + any associated metadata (the main
piece of metadata in flow nodes will be a symbol table, also known as a
dictionary). This structure & associated metadata will be constructed
at compilation time where we presently construct the directed graph of
nodes to run.
This approach allows the lookup of an atoms requirements to become a symbol
table & tree traversal problem where the atoms (now a node in the tree) parents
will be traversed until an atom that produces a needed symbol is located (this
information is verified at preparation time, which happens right before
execution, so it can be assumed there are no atoms that have symbols that are
*not* provided by some other atom).
At compilation time the ``contain=<CONSTANT>`` attribute will also be examined
and metadata will be associated with the created tree node to signify what the
visiblity of the symbol table for that node is. This metadata will be used
during the runtime symbol name lookup process to ensure we restrict the lookup
of symbols to the constraints imposed by the selected attribute/s.
At runtime when a symbol is needed for an atom we will locate the node that
is associated with the atom in the tree and walk upwards until we find the
correct symbol (obeying the ``contain`` constraints as needed) and value. When
saving we will save values into the parent flow nodes symbol table instead of
into the single symbol table that is saved into currently.
Finally, this addition makes it possible for post-execution extraction of
individual tree segments (by allowing for fetching a tree nodes symbol table
and allowing for users to traverse it as they desire). This is often useful
for examining the results flows and atoms produced after the workflow runtime
has finished executed (and doing any further function/method... calls that an
application may wish to do with those results).
.. _symbol table: http://en.wikipedia.org/wiki/Symbol_table
Alternatives
------------
The alternative is not to change anything and require that users go through
a painful symbol renaming (and extraction) process. This works *ok* for
workflows that are controlled and where it is possible to define the flow in a
single function where all the various symbol names can be adjusted at flow
creation time. It does not work well for arbitrary gluing of various workflows
together from arbitrary sources (a use-case that would be expected to be common
in the OpenStack projects, where drivers *could* provide components of an
overall workflow). Without this change it would likely mean that there would be
various functions created by users that would have *messy* and
*complicated* symbol renaming algorithms to resolve the issue that taskflow
should instead resolve itself. This results in a bad user experience (and
likely is not what users expect).
Impact on Existing APIs
-----------------------
The existing API's will continue operating as before, when the new options
are set the functionalty will change accordingly to be less strict. Now instead
of duplicate names causing errors a new mode will be enabled by default, the
variable shadowing mode. This will allow flows that would have not been
allowed to be created before now to be created. In general this will be an
additive change that enables new usage that errored out before this change.
Security impact
---------------
N/A
Performance Impact
------------------
N/A
Configuration Impact
--------------------
N/A
Developer Impact
----------------
This scoping should make it easier to implement flows in a manner that
conceptually makes sense for programmers used to the standard scoping
strategies that programming languages come built-in with.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
* harlowja
Other contributors:
* dkrause?
Milestones
----------
J/3
Work Items
----------
* Add a tree type [https://review.openstack.org/#/c/97325/]
* Add ``contains`` constraints to flows and adjust pattern ``add()`` methods
to accept and verify those constraints at atom/subflow addition time.
* Retain symbol hierarchy at compilation time by constructing a tree instance
and during the directed graph creation routine adding nodes to this tree as
needed (along with any other metadata needed).
* Adjust the compilation routine to retain this ``contains`` attribute in the
tree nodes metadata so that it can be using at runtime.
* Adjust the action engine implementation to use this new source of information
during symbol lookup so that this new information is used during runtime.
* Expose the results of running via a new api that allows for fetching a named
atom/flows storage resultant ``node`` (this allows for traversing over the
symbol tables for children nodes contained there-in).
* Test like crazy.
Future ideas
------------
* In a future change we could support the ability to have automatic symbol
names that would be populated at compilation time. This would allow the flow
creator to associate a ``<anonymous>`` like object as the symbol that will
be transferred between tasks/atoms (which right now is required to be
a string). The ``<anonymous>`` object instance will be translated into
a *actual* generated symbol name at compilation time (the runtime symbol
lookup mechanism will then be unaffected by this change). This would help
those users that can not use the above new capabilities. It would allow those
users to have a way to transfer symbols between scopes without
being *as* restricted by literal string names.
Incubation
==========
N/A
Documentation Impact
====================
Developer docs, examples will be updated to explain the new change and provide
examples of how this new change can be used.
Dependencies
============
N/A
References
==========
N/A
.. note::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode