=======================================
Enhance Reliability of Asynchronous Job
=======================================

Background
==========

Currently we use the cast method in our RPC client to trigger asynchronous
jobs in the XJob daemon. After one of the worker threads receives the RPC
message from the message broker, it registers the job in the database and
starts to run the handle function. The registration guarantees that an
asynchronous job will not be lost after it fails, so the failed job can be
redone. The detailed discussion of the asynchronous job process in the XJob
daemon is covered in our design document [1]_.

Though asynchronous jobs are correctly saved once worker threads get the RPC
message, we still risk losing jobs. The cast method only guarantees that the
message is received by the message broker; there is no guarantee that the
message reaches the message consumer, i.e., the RPC server thread running in
the XJob daemon. According to the RabbitMQ documentation, undelivered messages
are lost if the RabbitMQ server stops [2]_. Message persistence or publisher
confirms [3]_ can be used to increase reliability, but they sacrifice
performance. On the other hand, we cannot assume that message brokers other
than RabbitMQ provide similar persistence or confirmation functionality.
Therefore, Tricircle itself should handle the asynchronous job reliability
problem as far as possible. Since we already have a framework to register, run
and redo asynchronous jobs in the XJob daemon, we propose a cheaper way to
improve reliability.

Proposal
========

One straightforward way to make sure that the RPC server has received the RPC
message is to use the call method. The RPC client is blocked until the RPC
server replies to the message when the call method is used to send the RPC
request, so if something goes wrong before the reply, the RPC client is aware
of it. Of course we cannot make the RPC client wait too long, thus the RPC
handlers on the RPC server side need to be simple and quick to run. Thanks to
the asynchronous job framework we already have, migrating from the cast method
to the call method is easy.

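As an illustration, the difference between the two invocation styles can be
sketched with plain Python primitives. This is a minimal sketch, not the
oslo.messaging API: a stdlib queue stands in for the message broker and a
thread stands in for the RPC server worker.

```python
import queue
import threading

# Minimal sketch: "cast" is fire-and-forget, while "call" blocks until
# the server replies, so the client learns the message was consumed.
request_q = queue.Queue()

def rpc_server():
    # The handler stays simple and quick: it only acknowledges requests.
    while True:
        msg, reply_q = request_q.get()
        if msg is None:
            break
        if reply_q is not None:   # a "call" carries a reply queue
            reply_q.put('received: %s' % msg)

def cast(msg):
    # No confirmation that any consumer got the message.
    request_q.put((msg, None))

def call(msg, timeout=5):
    # Blocks until the server replies, or raises queue.Empty on timeout.
    reply_q = queue.Queue()
    request_q.put((msg, reply_q))
    return reply_q.get(timeout=timeout)

server = threading.Thread(target=rpc_server, daemon=True)
server.start()
cast('create_network')              # returns immediately, no guarantee
result = call('create_router')      # waits until the server has replied
request_q.put((None, None))         # stop the server
server.join()
print(result)                       # -> received: create_router
```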
Here is the flow of the current process::

  +--------+ +--------+ +---------+ +---------------+ +----------+
  |        | |        | |         | |               | |          |
  |  API   | |  RPC   | | Message | |  RPC Server   | | Database |
  | Server | | client | | Broker  | | Handle Worker | |          |
  |        | |        | |         | |               | |          |
  +---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+
      |          |           |              |              |
      | call RPC API         |              |              |
      +---------->           |              |              |
      |          | send cast message        |              |
      |          +----------->              |              |
      | call return          | dispatch message            |
      <----------+           +-------------->              |
      |          |           |              | register job |
      |          |           |              +-------------->
      |          |           |              |              |
      |          |           |              | obtain lock  |
      |          |           |              +-------------->
      |          |           |              |              |
      |          |           |              | run job      |
      |          |           |              +----+         |
      |          |           |              |    |         |
      |          |           |              |    |         |
      |          |           |              <----+         |
      |          |           |              |              |
      |          |           |              |              |
      +          +           +              +              +

We can just leave the **register job** phase in the RPC handler and put the
**obtain lock** and **run job** phases in a separate thread, so the RPC
handler is simple enough for the call method to invoke it. Here is the
proposed flow::

  +--------+ +--------+ +---------+ +---------------+ +----------+ +-------------+ +-------+
  |        | |        | |         | |               | |          | |             | |       |
  |  API   | |  RPC   | | Message | |  RPC Server   | | Database | |  RPC Server | |  Job  |
  | Server | | client | | Broker  | | Handle Worker | |          | | Loop Worker | | Queue |
  |        | |        | |         | |               | |          | |             | |       |
  +---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+ +------+------+ +---+---+
      |          |           |              |              |              |            |
      | call RPC API         |              |              |              |            |
      +---------->           |              |              |              |            |
      |          | send call message        |              |              |            |
      |          +----------->              |              |              |            |
      |          |           | dispatch message            |              |            |
      |          |           +-------------->              |              |            |
      |          |           |              | register job |              |            |
      |          |           |              +-------------->              |            |
      |          |           |              |              |              |            |
      |          |           |              | job enqueue  |              |            |
      |          |           |              +------------------------------------------->
      |          |           |              |              |              |            |
      |          |           | reply message|              |              | job dequeue|
      |          |           <--------------+              |              +----------->
      |          | send reply message       |              | obtain lock  |            |
      |          <-----------+              |              <--------------+            |
      | call return          |              |              |              |            |
      <----------+           |              |              | run job      |            |
      |          |           |              |              |              +----+       |
      |          |           |              |              |              |    |       |
      |          |           |              |              |              |    |       |
      |          |           |              |              |              <----+       |
      |          |           |              |              |              |            |
      +          +           +              +              +              +            +

In the above graph, **Loop Worker** is a newly introduced thread that does the
actual work. **Job Queue** is an eventlet queue [4]_ used to coordinate
**Handle Worker**, which produces job entries, and **Loop Worker**, which
consumes job entries. When accessing an empty queue, **Loop Worker** is
blocked until some job entries are put into the queue. **Loop Worker**
retrieves a job entry from the job queue and then starts to run it. Similar to
the original flow, since multiple workers may get the same type of job for the
same resource at the same time, a worker needs to obtain the lock before it
can run the job. One problem occurs whenever the XJob daemon stops before it
finishes all the jobs in the job queue: all unfinished jobs are lost. To solve
it, we change the original periodical task that redoes failed jobs, and let it
also handle jobs which have been registered for a certain time but haven't
been started. In this way both failed jobs and "orphan" new jobs can be picked
up and redone.

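The producer/consumer split described above can be sketched as follows. This
is an illustrative sketch only: stdlib ``queue`` and ``threading`` stand in
for the eventlet queue and green threads, and the worker names and in-memory
job table are hypothetical, not the real Tricircle code.

```python
import queue
import threading

# Stand-ins for the eventlet job queue, the database job table and the
# per-job database lock; all illustrative.
job_queue = queue.Queue()
job_table = {}
lock = threading.Lock()

def handle_worker(job_id):
    # RPC handler: register the job, enqueue it, reply right away, so
    # it is cheap enough to be invoked with the call method.
    job_table[job_id] = 'NEW'
    job_queue.put(job_id)
    return 'ack'

def loop_worker():
    # Blocks on the empty queue; obtains the lock before running a job.
    while True:
        job_id = job_queue.get()
        if job_id is None:
            break
        with lock:
            if job_table.get(job_id) == 'NEW':
                job_table[job_id] = 'RUNNING'
                # ... the real work would happen here ...
                job_table[job_id] = 'SUCCESS'

worker = threading.Thread(target=loop_worker, daemon=True)
worker.start()
print(handle_worker('job-1'))   # the RPC "call" returns quickly: ack
job_queue.put(None)             # let the loop worker exit
worker.join()
print(job_table['job-1'])       # -> SUCCESS
```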
You can see that **Handle Worker** doesn't do much work: it just consumes RPC
messages, registers jobs, then puts job items into the job queue. So one
extreme solution would be to register new jobs on the API server side and
start worker threads to retrieve jobs from the database and run them. In this
way, we can remove all the RPC processes and use the database to coordinate.
The drawback of this solution is that we don't dispatch jobs: all the workers
query jobs from the database, so there is a high probability that some of the
workers obtain the same job and races occur. In the first solution, the
message broker helps us to dispatch messages, and thus to dispatch jobs.

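The race described above is typically resolved by claiming the job atomically
in the database. The sketch below is illustrative (the schema is not
Tricircle's real job table); it shows how a conditional ``UPDATE`` lets
exactly one racing worker obtain the job lock.

```python
import sqlite3

# Obtaining the job lock through the database with a conditional
# UPDATE: when several workers race for the same job, exactly one
# claim succeeds. Schema and names are illustrative.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE job (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO job VALUES ('job-1', 'NEW')")
conn.commit()

def try_claim(job_id):
    # rowcount is 1 only for the worker whose UPDATE matched the row.
    cur = conn.execute(
        "UPDATE job SET status = 'RUNNING' "
        "WHERE id = ? AND status = 'NEW'", (job_id,))
    conn.commit()
    return cur.rowcount == 1

first = try_claim('job-1')    # first worker claims the job
second = try_claim('job-1')   # a racing worker fails to claim it
print(first, second)          # -> True False
```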
Considering that job dispatch is important, we can make some changes to the
second solution and move to a third one: we also register new jobs on the API
server side, but we still use the cast method to trigger asynchronous jobs in
the XJob daemon. Since job registration is done on the API server side, we are
not afraid that jobs will be lost if cast messages are lost. If the API server
side fails to register the job, it returns a failure response; if the
registration succeeds, the job will eventually be done by the XJob daemon. By
using RPC, we dispatch jobs with the help of message brokers. One thing that
makes the cast method better than the call method is that retrieving RPC
messages and running job handles are done in the same thread, so if one XJob
daemon is busy handling jobs, RPC messages will not be dispatched to it.
However, when using the call method, RPC messages are retrieved by one thread
(the **Handle Worker**) and job handles are run by another thread (the **Loop
Worker**), so the XJob daemon may accumulate many jobs in the queue while it
is busy handling jobs. This solution has the same problem as the call method
solution: if cast messages are lost, the new jobs are registered in the
database but no XJob daemon is aware of them. We solve it the same way, using
the periodical task to pick up these "orphan" jobs. Here is the flow::

  +--------+ +--------+ +---------+ +---------------+ +----------+
  |        | |        | |         | |               | |          |
  |  API   | |  RPC   | | Message | |  RPC Server   | | Database |
  | Server | | client | | Broker  | | Handle Worker | |          |
  |        | |        | |         | |               | |          |
  +---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+
      |          |           |              |              |
      | call RPC API         |              |              |
      +---------->           |              |              |
      |          | register job             |              |
      |          +----------------------------------------->
      |          |           |              |              |
      |          | [if succeed to           |              |
      |          | register job]            |              |
      |          | send cast message        |              |
      |          +----------->              |              |
      | call return          | dispatch message            |
      <----------+           +-------------->              |
      |          |           |              | obtain lock  |
      |          |           |              +-------------->
      |          |           |              |              |
      |          |           |              | run job      |
      |          |           |              +----+         |
      |          |           |              |    |         |
      |          |           |              |    |         |
      |          |           |              <----+         |
      |          |           |              |              |
      |          |           |              |              |
      +          +           +              +              +

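The register-then-cast flow, including the periodical pickup of orphan jobs,
can be sketched as follows. All names here are illustrative, a stdlib queue
stands in for the message broker, and a lost cast message is simulated by
simply not sending it.

```python
import queue
import threading

# Sketch of the chosen flow: register the job on the API server side
# first, then cast; a periodical task later re-dispatches "orphan"
# jobs whose cast message was lost.
job_table = {}
message_broker = queue.Queue()

def api_handler(job_id, lose_message=False):
    job_table[job_id] = 'NEW'          # registration happens API side
    if not lose_message:               # simulate a lost cast message
        message_broker.put(job_id)     # cast: fire-and-forget
    return 'success'                   # registration itself succeeded

def xjob_worker():
    while True:
        job_id = message_broker.get()
        if job_id is None:
            break
        job_table[job_id] = 'SUCCESS'

def periodic_redo():
    # Pick up jobs that were registered but never started.
    for job_id, status in list(job_table.items()):
        if status == 'NEW':
            message_broker.put(job_id)

worker = threading.Thread(target=xjob_worker, daemon=True)
worker.start()
api_handler('job-1')                      # normal path
api_handler('job-2', lose_message=True)   # cast message "lost"
periodic_redo()                           # orphan job-2 is re-dispatched
message_broker.put(None)
worker.join()
print(job_table)
```

After the run, both jobs end up done even though one cast message never
reached the broker.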
Discussion
==========

In this section we discuss the pros and cons of the above three solutions.

.. list-table:: **Solution Comparison**
   :header-rows: 1

   * - Solution
     - Pros
     - Cons
   * - API server uses call
     - no RPC message lost
     - downtime of unfinished jobs in the job queue when the XJob daemon
       stops, job dispatch not based on XJob daemon workload
   * - API server registers jobs + no RPC
     - no requirement on RPC (message broker), no downtime
     - no job dispatch, conflicts cost time
   * - API server registers jobs + uses cast
     - job dispatch based on XJob daemon workload
     - downtime of jobs lost due to lost cast messages

Downtime means that after a job is dispatched to a worker, other workers need
to wait for a certain time to determine that the job has expired before taking
it over.

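The expiration check behind this definition can be sketched as below; the
threshold value and function name are illustrative, not the ones Tricircle
uses.

```python
import time

# A worker treats a dispatched job as expired, and eligible for
# takeover, only after a waiting threshold has passed.
JOB_EXPIRE_SECONDS = 180   # illustrative threshold

def is_expired(dispatched_at, now=None):
    now = time.time() if now is None else now
    return now - dispatched_at > JOB_EXPIRE_SECONDS

t0 = 1000.0
print(is_expired(t0, now=t0 + 60))    # still within threshold: False
print(is_expired(t0, now=t0 + 600))   # expired, may be taken over: True
```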
Conclusion
==========

We decide to implement the third solution (API server registers jobs and uses
cast) since it improves the asynchronous job reliability and at the same time
achieves better workload dispatch.

Data Model Impact
=================

None

Dependencies
============

None

Documentation Impact
====================

None

References
==========

.. [1] https://docs.google.com/document/d/1zcxwl8xMEpxVCqLTce2-dUOtB-ObmzJTbV1uSQ6qTsY
.. [2] https://www.rabbitmq.com/tutorials/tutorial-two-python.html
.. [3] https://www.rabbitmq.com/confirms.html
.. [4] http://eventlet.net/doc/modules/queue.html