Adding documentation and a DB field for tracking the executor UUID.

Change-Id: I86157efec59c5b9eabb537f8fed124c157f51d43
blueprint: recover-transfers-on-restart
John Bresnahan 2013-08-30 17:05:05 -10:00
parent 3fcdd2eebe
commit 41c2032662
2 changed files with 69 additions and 0 deletions


@ -0,0 +1,67 @@
Only One Executor At A Time
===========================
It is important that only one process is ever working on a transfer at a time.
It is equally important that no transfer stalls because the executor that
began it failed partway through. This document describes the plan to achieve
both goals.
The Problem
-----------
Say that Staccato is configured with multiple executors. Partway through
a transfer one of the executors fails, while the other executor keeps running
perfectly well. We need something to detect that the transfer which was
running (and which is still in the running state) is no longer active and
needs to be placed back into a pending state (NEW or ERROR at the time of
this writing). Once placed in such a state it will be eligible for scheduling
again.
We also want to avoid the situation where there are two executors and they
both select the same transfer and thus work is redundantly done. This could
happen via a race, or it could happen when it appears that an executor has
died but in reality it is (or will soon be) transferring data.
Redundant Transfer
------------------
At the time of this writing staccato does allow for the possibility of
redundant transfers. The contract with the user is that some (or all)
of the data set may be transferred twice. This contract is there to release
the staccato implementation and architecture from the complicated and slow
inter-process locking mechanisms which would be needed to avoid every single
case. However, this contract is not there to allow staccato to entirely
ignore the problem. Redundant transfers are unwelcome because they use
resources. By the very nature of this problem, redundant transfers will only
happen when the system is unaware of all but one of the unneeded transfers
(if we knew about them, we would kill them), and thus staccato cannot properly
manage the resources.
Solution
--------
Each executor will be associated with a UUID that lives in the process space
of that executor (it is not written to the database). Each row in the database
represents a requested transfer. When the state column in that row moves to
RUNNING the executor ID will be recorded in another column in that row. As
the executor is performing a transfer it periodically checks the row on which
it is working to verify that its UUID is still the one in the database. If it
is not, it must immediately terminate its workload without making any further
updates to the database. The interval at which it checks the database is
configurable and bounds the window in which redundant transfers can occur.
The executor UUID will be some combination of hostname and pid. This will make
it easier for an operator to determine what is happening.
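The claim-and-verify cycle described above could be sketched roughly as follows. This is an illustration only, not the staccato implementation: it uses the standard library ``sqlite3`` module instead of the project's SQLAlchemy models, and the table name ``xfer_requests``, the ``hostname-pid`` identifier format, and the helper names ``make_executor_id``, ``claim_transfer`` and ``still_owner`` are all assumptions made for the example.

```python
import os
import socket
import sqlite3


def make_executor_id():
    # Assumed format: the design only says "some combination of hostname and pid".
    return "%s-%d" % (socket.gethostname(), os.getpid())


def claim_transfer(conn, xfer_id, executor_id):
    """Move a pending row to RUNNING and record the owner in one statement.

    Returns True only if this call changed the row, so a second executor
    that lost the race sees False.
    """
    cur = conn.execute(
        "UPDATE xfer_requests SET state = 'RUNNING', executor_uuid = ? "
        "WHERE id = ? AND state IN ('NEW', 'ERROR')",
        (executor_id, xfer_id))
    conn.commit()
    return cur.rowcount == 1


def still_owner(conn, xfer_id, executor_id):
    """Periodic check: is our ID still the one recorded on the row?"""
    row = conn.execute(
        "SELECT executor_uuid FROM xfer_requests WHERE id = ?",
        (xfer_id,)).fetchone()
    return row is not None and row[0] == executor_id


# Hypothetical schema, reduced to the columns the example needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE xfer_requests "
    "(id TEXT PRIMARY KEY, state TEXT, executor_uuid TEXT)")
conn.execute("INSERT INTO xfer_requests VALUES ('x1', 'NEW', NULL)")

me = make_executor_id()
claimed = claim_transfer(conn, "x1", me)
second_claim = claim_transfer(conn, "x1", "otherhost-1")
owner_before = still_owner(conn, "x1", me)

# Simulate the row being handed to a different executor.
conn.execute(
    "UPDATE xfer_requests SET executor_uuid = 'otherhost-1' WHERE id = 'x1'")
owner_after = still_owner(conn, "x1", me)
```

In this sketch an executor would call ``still_owner`` once per check interval and stop work, without touching the row again, as soon as it returns False.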
Clean Up
~~~~~~~~
We also need to determine if a transfer is marked as running, but the executor
has unexpectedly died. In order to determine this we will look at the
'updated_at' time stamp that is associated with every transfer row. If the
row has not been updated in N times the configurable update window, then
staccato assumes that the executor is dead, clears the executor UUID from
the row, and moves the transfer back to a pending state.
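A minimal sketch of such a clean-up pass is shown below, again using ``sqlite3`` rather than the project's own database layer. The table name ``xfer_requests``, the function ``reap_stale_transfers``, storing ``updated_at`` as seconds since the epoch, and resetting to NEW (rather than ERROR) are assumptions chosen for illustration.

```python
import sqlite3
import time


def reap_stale_transfers(conn, window_seconds, n, now=None):
    """Reset RUNNING rows not updated within n * window_seconds.

    Clears the recorded executor and returns the number of rows reset.
    """
    now = time.time() if now is None else now
    cutoff = now - n * window_seconds
    cur = conn.execute(
        "UPDATE xfer_requests SET state = 'NEW', executor_uuid = NULL "
        "WHERE state = 'RUNNING' AND updated_at < ?",
        (cutoff,))
    conn.commit()
    return cur.rowcount


# Hypothetical schema and data: one stale row and one recently updated row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE xfer_requests "
    "(id TEXT PRIMARY KEY, state TEXT, executor_uuid TEXT, updated_at REAL)")
conn.executemany("INSERT INTO xfer_requests VALUES (?, ?, ?, ?)", [
    ("stale", "RUNNING", "hostA-100", 1000.0),
    ("fresh", "RUNNING", "hostB-200", 1095.0),
])

# With a 10 second window and n = 3, anything older than 30 seconds is reset.
reaped = reap_stale_transfers(conn, window_seconds=10, n=3, now=1100.0)
rows = dict(
    (r[0], (r[1], r[2]))
    for r in conn.execute("SELECT id, state, executor_uuid FROM xfer_requests"))
```

If the original executor turns out to be alive after all, its next ownership check would find its UUID gone from the row and it would stop, which is the redundant-transfer window described earlier.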


@ -45,6 +45,8 @@ class XferRequest(BASE, ModelBase):
# TODO add protocol specific json documents
source_opts = Column(PickleType())
dest_opts = Column(PickleType())
executor_uuid = Column(String(512), nullable=True)
def register_models(engine):