From 2161a18f028b7ec3f1ebb0b881598b6e582869b7 Mon Sep 17 00:00:00 2001 From: zhiyuan_cai Date: Tue, 20 Dec 2016 14:03:26 +0800 Subject: [PATCH] Enhance XJob reliability spec 1. What is the problem? Currently we use cast method to trigger XJob daemon which we have risk to lose jobs. 2. What is the solution to the problem? Determine a solution to improve the reliability of XJob daemon. The detailed discussion of the problems and the proposals to solve the problems are covered in this specification. 3. What the features need to be implemented to the Tricircle to realize the solution? The proposed changes in this specification needs to be implemented. Change-Id: Idf7ec52098393d668d1425a79807b235a1fc9960 --- specs/ocata/enhance-xjob-reliability.rst | 234 +++++++++++++++++++++++ 1 file changed, 234 insertions(+) create mode 100644 specs/ocata/enhance-xjob-reliability.rst diff --git a/specs/ocata/enhance-xjob-reliability.rst b/specs/ocata/enhance-xjob-reliability.rst new file mode 100644 index 00000000..1d467441 --- /dev/null +++ b/specs/ocata/enhance-xjob-reliability.rst @@ -0,0 +1,234 @@ +======================================= +Enhance Reliability of Asynchronous Job +======================================= + +Background +========== + +Currently we are using cast method in our RPC client to trigger asynchronous +job in XJob daemon. After one of the worker threads receives the RPC message +from the message broker, it registers the job in the database and starts to +run the handle function. The registration guarantees that asynchronous job will +not be lost after the job fails and the failed job can be redone. The detailed +discussion of the asynchronous job process in XJob daemon is covered in our +design document [1]_. + +Though asynchronous jobs are correctly saved after worker threads get the RPC +message, we still have risk to lose jobs. By using cast method, it's only +guaranteed that the message is received by the message broker, but there's no +guarantee that the message can be received by the message consumer, i.e., the +RPC server thread running in XJob daemon. According to the RabbitMQ document, +undelivered messages will be lost if RabbitMQ server stops [2]_. Message +persistence or publisher confirm [3]_ can be used to increase reliability, but +they sacrifice performance. On the other hand, we can not assume that message +brokers other than RabbitMQ will provide similar persistence or confirmation +functionality. Therefore, Tricircle itself should handle the asynchronous job +reliability problem as far as possible. Since we already have a framework to +register, run and redo asynchronous jobs in XJob daemon, we propose a cheaper +way to improve reliability. + +Proposal +======== + +One straightforward way to make sure that the RPC server has received the RPC +message is to use call method. RPC client will be blocked until the RPC server +replies the message if it uses call method to send the RPC request. So if +something wrong happens before the reply, RPC client can be aware of it. Of +course we cannot make RPC client wait too long, thus RPC handlers in the RPC +server side need to be simple and quick to run. Thanks to the asynchronous job +framework we already have, migrating from cast method to call method is easy. + +Here is the flow of the current process:: + + +--------+ +--------+ +---------+ +---------------+ +----------+ + | | | | | | | | | | + | API | | RPC | | Message | | RPC Server | | Database | + | Server | | client | | Broker | | Handle Worker | | | + | | | | | | | | | | + +---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+ + | | | | | + | call RPC API | | | | + +--------------> | | | + | | send cast message | | | + | +-------------------> | | + | call return | | dispatch message | | + <--------------+ +------------------> | + | | | | register job | + | | | +----------------> + | | | | | + | | | | obtain lock | + | | | +----------------> + | | | | | + | | | | run job | + | | | +----+ | + | | | | | | + | | | | | | + | | | <----+ | + | | | | | + | | | | | + + + + + + + +We can just leave **register job** phase in the RPC handle and put **obtain +lock** and **run job** phase in a separate thread, so the RPC handle is simple +enough to use call method to invoke it. Here is the proposed flow:: + + +--------+ +--------+ +---------+ +---------------+ +----------+ +-------------+ +-------+ + | | | | | | | | | | | | | | + | API | | RPC | | Message | | RPC Server | | Database | | RPC Server | | Job | + | Server | | client | | Broker | | Handle Worker | | | | Loop Worker | | Queue | + | | | | | | | | | | | | | | + +---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+ +------+------+ +---+---+ + | | | | | | | + | call RPC API | | | | | | + +--------------> | | | | | + | | send call message | | | | | + | +--------------------> | | | | + | | | dispatch message | | | | + | | +------------------> | | | + | | | | register job | | | + | | | +----------------> | | + | | | | | | | + | | | | job enqueue | | | + | | | +------------------------------------------------> + | | | | | | | + | | | reply message | | | job dequeue | + | | <------------------+ | |--------------> + | | send reply message | | | obtain lock | | + | <--------------------+ | <----------------+ | + | call return | | | | | | + <--------------+ | | | run job | | + | | | | | +----+ | + | | | | | | | | + | | | | | | | | + | | | | | +----> | + | | | | | | | | | + | | | | | | | + + + + + + + + + +In the above graph, **Loop Worker** is a new-introduced thread to do the actual +work. **Job Queue** is an eventlet queue [4]_ used to coordinate **Handle +Worker** who produces job entries and **Loop Worker** who consumes job entries. +While accessing an empty queue, **Loop Worker** will be blocked until some job +entries are put into the queue. **Loop Worker** retrieves job entries from the +job queue then start to run it. Similar to the original flow, since multiple +workers may get the same type of job for the same resource at the same time, +workers need to obtain the lock before it can run the job. One problem occurs +whenever XJob daemon stops before it finishes all the jobs in the job queue; +all unfinished jobs are lost. To solve it, we make changes to the original +periodical task that is used to redo failed job, and let it also handle the +jobs which have been registered for a certain time but haven't been started. +So both failed jobs and "orphan" new jobs can be picked up and redone. + +You can see that **Handle Worker** doesn't do many works, it just consumes RPC +messages, register jobs then put job items in the job queue. So one extreme +solution here, will be to register new jobs in the API server side and start +worker threads to retrieve jobs from the database and run them. In this way, we +can remove all the RPC processes and use database to coordinate. The drawback +of this solution is that we don't dispatch jobs. All the workers query jobs +from the database so there is high probability that some of the workers obtain +the same job and thus race occurs. In the first solution, message broker +helps us to dispatch messages, and so dispatch jobs. + +Considering job dispatch is important, we can make some changes to the second +solution and move to the third one, that is to also register new jobs in the +API server side, but we still use cast method to trigger asynchronous job in +XJob daemon. Since job registration is done in the API server side, we are not +afraid that the jobs will be lost if cast messages are lost. If API server side +fails to register the job, it will return response of failure; If registration +of job succeeds, the job will be done by XJob daemon at last. By using RPC, we +dispatch jobs with the help of message brokers. One thing which makes cast +method better than call method is that retrieving RPC messages and running job +handles are done in the same thread so if one XJob daemon is busy handling +jobs, RPC messages will not be dispatched to it. However when using call +method, RPC messages are retrieved by one thread(the **Handle Worker**) and job +handles are run by another thread(the **Loop Worker**), so XJob daemon may +accumulate many jobs in the queue and at the same time it's busy handling jobs. +This solution has the same problem with the call method solution. If cast +messages are lost, the new jobs are registered in the database but no XJob +daemon is aware of these new jobs. Same way to solve it, use periodical task to +pick up these "orphan" jobs. Here is the flow:: + + +--------+ +--------+ +---------+ +---------------+ +----------+ + | | | | | | | | | | + | API | | RPC | | Message | | RPC Server | | Database | + | Server | | client | | Broker | | Handle Worker | | | + | | | | | | | | | | + +---+----+ +---+----+ +----+----+ +-------+-------+ +----+-----+ + | | | | | + | call RPC API | | | | + +--------------> | | | + | | register job | | | + | +-------------------------------------------------------> + | | | | | + | | [if succeed to | | | + | | register job] | | | + | | send cast message | | | + | +-------------------> | | + | call return | | dispatch message | | + <--------------+ +------------------> | + | | | | obtain lock | + | | | +----------------> + | | | | | + | | | | run job | + | | | +----+ | + | | | | | | + | | | | | | + | | | <----+ | + | | | | | + | | | | | + + + + + + + +Discussion +========== + +In this section we discuss the pros and cons of the above three solutions. + +.. list-table:: **Solution Comparison** + :header-rows: 1 + + * - Solution + - Pros + - Cons + * - API server uses call + - no RPC message lost + - downtime of unfinished jobs in the job queue when XJob daemon stops, + job dispatch not based on XJob daemon workload + * - API server register jobs + no RPC + - no requirement on RPC(message broker), no downtime + - no job dispatch, conflict costs time + * - API server register jobs + uses cast + - job dispatch based on XJob daemon workload + - downtime of lost jobs due to cast messages lost + +Downtime means that after a job is dispatched to a worker, other workers need +to wait for a certain time to determine that job is expired and take over it. + +Conclusion +========== + +We decide to implement the third solution(API server register jobs + uses cast) +since it improves the asynchronous job reliability and at the mean time has +better work load dispatch. + +Data Model Impact +================= + +None + +Dependencies +============ + +None + +Documentation Impact +==================== + +None + +References +========== + +.. [1] https://docs.google.com/document/d/1zcxwl8xMEpxVCqLTce2-dUOtB-ObmzJTbV1uSQ6qTsY +.. [2] https://www.rabbitmq.com/tutorials/tutorial-two-python.html +.. [3] https://www.rabbitmq.com/confirms.html +.. [4] http://eventlet.net/doc/modules/queue.html