Adding documentation and a DB field for tracking the executor UUID.

Change-Id: I86157efec59c5b9eabb537f8fed124c157f51d43 blueprint: recover-transfers-on-restart
2013-08-30 17:05:05 -10:00 · 2013-08-30 17:05:05 -10:00 · 41c2032662
parent 3fcdd2eebe
commit 41c2032662
2 changed files with 69 additions and 0 deletions
--- a/doc/source/executor_consitency.rst
+++ b/doc/source/executor_consitency.rst
@ -0,0 +1,67 @@
+
+Only One Executor At A Time
+===========================
+
+It is important that only 1 process is ever working on a transfer at a time.
+It is further important that no one transfer is stalled out because the
+executor that began it failed part of the way through.  This document describes
+the plan to achieve this.
+
+The Problem
+-----------
+
+Say that Staccato is configured with multiple executors.  Partially through
+a transfer one of the executors fails, however the other executor is running
+perfectly well.  We need something to detect that the transfer which was
+running (and which is still in the running state) is no longer active and
+needs to be placed back into a pending state (NEW or ERROR at the time of
+this writing).  Once placed in such a state it will be active for scheduling
+again.
+
+We also want to avoid the situation where there are two executors and they
+both select the same transfer and thus work is redundantly done.  This could
+happen via a race, or it could happen when it appears that an executor has
+died but in reality it is (or will soon be) transferring data.
+
+Redundant Transfer
+------------------
+
+At the time of this writing staccato does allow for the possibility of
+redundant transfers.  The contract with the user is that some (or all)
+of the data set may be transfers twice.  This contract is there to release
+the staccato implementation and architecture from complicated and slow
+inter-process locking mechanisms which would needed to avoid every single
+case.  However, this contract is not there to allow staccato to entirely
+ignore the problem.  Redundant transfers are unwelcome because they use
+resource.  By the very nature of this problem, redudnant transfer will only
+happen when the system is unaware of all but one of the unneeded transfers
+(if we know about them, we would kill them) thus staccato cannot properly
+manage the resources.
+
+Solution
+--------
+
+Each executor will be associated with a UUID that lives in the process space
+of that executor (it is not written to the database).  Each row in the database
+represents a requested transfer.  When the state column in that row moves to
+RUNNING the executor ID will be recorded in another column in that row.  As
+the executor is performing a transfer it periodically checks the row on which
+it is working to verify that its UUID is still the one in the database.  If it
+is not, it must immediately terminate its workload without making any further
+updates to the database.  The time window in which it checks the database is
+configurable and will define the window of possibility for redundant transfers.
+
+The executor UUID will be some combination of hostname and pid.  This will make
+it easier for an operator to determine what is happening.
+
+Clean Up
+~~~~~~~~
+
+We also need to determine if a transfer is marked as running, but the executor
+has unexpectedly died.  In order to determine this we will look at the
+'updated_at' time stamp that is associated with every transfer row.  If the
+row has not been updated in N times the configurable update window then
+staccato assumes that the executor is dead, it clears the executor UUID from
+the row, and moves the transfer back to a pending state.
+
+
--- a/staccato/db/models.py
+++ b/staccato/db/models.py
@ -45,6 +45,8 @@ class XferRequest(BASE, ModelBase):
    # TODO add protocol specific json documents
    source_opts = Column(PickleType())
    dest_opts = Column(PickleType())
+    executor_uuid = Column(String(512), nullable=True)
+


 def register_models(engine):