Adding documentation and a DB field for tracking the executor UUID.
Change-Id: I86157efec59c5b9eabb537f8fed124c157f51d43 blueprint: recover-transfers-on-restart
This commit is contained in:
parent
3fcdd2eebe
commit
41c2032662
67
doc/source/executor_consitency.rst
Normal file
67
doc/source/executor_consitency.rst
Normal file
@ -0,0 +1,67 @@
|
||||
|
||||
Only One Executor At A Time
|
||||
===========================
|
||||
|
||||
It is important that only 1 process is ever working on a transfer at a time.
|
||||
It is further important that no one transfer is stalled out because the
|
||||
executor that began it failed part of the way through. This document describes
|
||||
the plan to achieve this.
|
||||
|
||||
The Problem
|
||||
-----------
|
||||
|
||||
Say that Staccato is configured with multiple executors. Partially through
|
||||
a transfer one of the executors fails, however the other executor is running
|
||||
perfectly well. We need something to detect that the transfer which was
|
||||
running (and which is still in the running state) is no longer active and
|
||||
needs to be placed back into a pending state (NEW or ERROR at the time of
|
||||
this writing). Once placed in such a state it will be active for scheduling
|
||||
again.
|
||||
|
||||
We also want to avoid the situation where there are two executors and they
|
||||
both select the same transfer and thus work is redundantly done. This could
|
||||
happen via a race, or it could happen when it appears that an executor has
|
||||
died but in reality it is (or will soon be) transferring data.
|
||||
|
||||
Redundant Transfer
|
||||
------------------
|
||||
|
||||
At the time of this writing staccato does allow for the possibility of
|
||||
redundant transfers. The contract with the user is that some (or all)
|
||||
of the data set may be transfers twice. This contract is there to release
|
||||
the staccato implementation and architecture from complicated and slow
|
||||
inter-process locking mechanisms which would needed to avoid every single
|
||||
case. However, this contract is not there to allow staccato to entirely
|
||||
ignore the problem. Redundant transfers are unwelcome because they use
|
||||
resource. By the very nature of this problem, redudnant transfer will only
|
||||
happen when the system is unaware of all but one of the unneeded transfers
|
||||
(if we know about them, we would kill them) thus staccato cannot properly
|
||||
manage the resources.
|
||||
|
||||
Solution
|
||||
--------
|
||||
|
||||
Each executor will be associated with a UUID that lives in the process space
|
||||
of that executor (it is not written to the database). Each row in the database
|
||||
represents a requested transfer. When the state column in that row moves to
|
||||
RUNNING the executor ID will be recorded in another column in that row. As
|
||||
the executor is performing a transfer it periodically checks the row on which
|
||||
it is working to verify that its UUID is still the one in the database. If it
|
||||
is not, it must immediately terminate its workload without making any further
|
||||
updates to the database. The time window in which it checks the database is
|
||||
configurable and will define the window of possibility for redundant transfers.
|
||||
|
||||
The executor UUID will be some combination of hostname and pid. This will make
|
||||
it easier for an operator to determine what is happening.
|
||||
|
||||
Clean Up
|
||||
~~~~~~~~
|
||||
|
||||
We also need to determine if a transfer is marked as running, but the executor
|
||||
has unexpectedly died. In order to determine this we will look at the
|
||||
'updated_at' time stamp that is associated with every transfer row. If the
|
||||
row has not been updated in N times the configurable update window then
|
||||
staccato assumes that the executor is dead, it clears the executor UUID from
|
||||
the row, and moves the transfer back to a pending state.
|
||||
|
||||
|
@ -45,6 +45,8 @@ class XferRequest(BASE, ModelBase):
|
||||
# TODO add protocol specific json documents
|
||||
source_opts = Column(PickleType())
|
||||
dest_opts = Column(PickleType())
|
||||
executor_uuid = Column(String(512), nullable=True)
|
||||
|
||||
|
||||
|
||||
def register_models(engine):
|
||||
|
Loading…
x
Reference in New Issue
Block a user