diff --git a/specs/async-container-operation.rst b/specs/async-container-operation.rst
new file mode 100644
index 0000000000..4438cf35c7
--- /dev/null
+++ b/specs/async-container-operation.rst
@@ -0,0 +1,452 @@
+=================================
+Asynchronous Container Operations
+=================================
+
+Launchpad blueprint:
+
+https://blueprints.launchpad.net/magnum/+spec/async-container-operations
+
+At present, container operations are done in a synchronous way, end-to-end.
+This model does not scale well, and it forces the client to block until the
+operation completes.
+
+Problem Description
+-------------------
+
+At present Magnum-Conductor executes the container operation as part of
+processing the request forwarded from Magnum-API. For
+container-create, if the image needs to be pulled down, it may take
+a while depending on the responsiveness of the registry, which can be a
+substantial delay. At the same time, experiments suggest that even for
+a pre-pulled image, the time taken by each operation, namely
+create/start/delete, is of the same order, as it involves a complete round
+trip between the magnum-client and the COE-API, via Magnum-API and
+Magnum-Conductor[1].
+
+Use Cases
+---------
+
+For wider enterprise adoption of Magnum, we need it to scale better.
+For that we need to replace some of these synchronous behaviors with
+a suitable asynchronous implementation.
+
+To understand the use-case better, we can have a look at the average
+time spent during container operations, as noted at[1].
+
+Proposed Changes
+----------------
+
+The design has been discussed over the ML[6]. The conclusions have been kept
+on the 'whiteboard' of the Blueprint.
+
+The amount of code change is expected to be significant. To ease
+adoption, code review, and functional testing, a phased implementation
+may be required.
+We can define the scope of the three phases of
+the implementation as follows -
+
+* Phase-0 will bring in the basic feature of asynchronous mode of operation in
+  Magnum - (A) from API to Conductor and (B) from Conductor to COE-API. During
+  phase-0, this mode will be optional through configuration.
+
+  Both the communications of (A) and (B) are proposed to be made asynchronous
+  to get the most benefit. If we do (A) alone, it does not gain us much, as
+  (B) takes up most of the time of the operation. If we do (B) alone, it does
+  not make sense, as (A) will synchronously wait for no meaningful data.
+
+* Phase-1 will concentrate on making the feature persistent to address various
+  scenarios of conductor restart, worker failure etc. We will support this
+  feature for multiple Conductor-workers in this phase.
+
+* Phase-2 will select asynchronous mode of operation as the default mode. At
+  the same time, we can evaluate dropping the code for synchronous mode, too.
+
+
+Phase-0 is required as a meaningful temporary step, to establish the
+importance and tangible benefits of phase-1. This is also to serve as a
+proof-of-concept at a lower cost of code changes with a configurable option.
+This will enable developers and operators to have a taste of the feature,
+before bringing in the heavier dependencies and changes proposed in phase-1.
+
+A reference implementation for the phase-0 items has been put up for review[2].
+
+Following is the summary of the design -
+
+1. Configurable mode of operation - async
+-----------------------------------------
+
+For ease of adoption, the async_mode of communication between API-conductor,
+conductor-COE in magnum, can be controlled using a configuration option. So
+the code-path for sync mode and async mode would co-exist for now. To achieve
+this with minimal/no code duplication and a cleaner interface, we are using
+openstack/futurist[4]. The futurist interface hides the details of the type of
+executor being used.
+In case of async configuration, a greenthreadpool of configured
+poolsize gets created. Here is a sample of how the config would look
+like: ::
+
+  [DEFAULT]
+  async_enable = False
+
+  [conductor]
+  async_threadpool_max_workers = 64
+
+Futurist library is used in oslo.messaging. Thus, it is used by almost all
+OpenStack projects, in effect. Futurist is very useful for running the same
+code under different execution models, and hence saves potential duplication
+of code.
+
+
+2. Type of operations
+---------------------
+
+There are two classes of container operations - one that can be made async,
+namely create/delete/start/stop/pause/unpause/reboot, which do not need data
+about the container in return. The other type requires data, namely
+container-logs. For async-type container-operations, magnum-API will be
+using 'cast' instead of 'call' from oslo_messaging[5].
+
+'cast' from oslo.messaging.rpcclient is used to invoke a method and return
+immediately, whereas 'call' invokes a method and waits for a reply. While
+operating in asynchronous mode, it is intuitive to use the cast method, as the
+result may not be available immediately.
+
+Magnum-api first fetches the details of a container, by doing
+'get_rpc_resource'. This function uses magnum objects. Hence, this function
+uses a 'call' method underneath. Once magnum-api gets back the details,
+it issues the container operation, using another 'call' method.
+The above proposal is to replace the second 'call' with 'cast'.
+
+If a user issues a container operation when there is no listening
+conductor (because of process failure), there will be an RPC timeout at the
+first 'call' method. In this case, the user will observe the request being
+blocked at the client and finally failing with an HTTP 500 error after the RPC
+timeout, which is 60 seconds by default. This behavior is independent of the
+usage of 'cast' or 'call' for the second message, mentioned above.
+This behavior does not influence our design, but it is documented here for
+clarity.
+
+
+3. Ensuring the order of execution - Phase-0
+--------------------------------------------
+
+Magnum-conductor needs to ensure that for a given bay and given container,
+the operations are executed in sequence. In phase-0, we want to demonstrate
+how asynchronous behavior helps scaling. Asynchronous mode of container
+operations would be supported for single magnum-conductor scenario, in
+phase-0. If magnum-conductor crashes, there will be no recovery for the
+operations accepted earlier - which means no persistence in phase-0, for
+operations accepted by magnum-conductor. Multiple conductor scenario and
+persistence will be addressed in phase-1 [please refer to the next section
+for further details]. If COE crashes or does not respond, the error will be
+detected, as it happens in sync mode, and reflected on the container-status.
+
+Magnum-conductor will maintain a job-queue. Job-queue is indexed by bay-id and
+container-id. A job-queue entry would contain the sequence of operations
+requested for a given bay-id and container-id, in temporal order. A
+greenthread will execute the tasks/operations in order for a given job-queue
+entry, till the queue empties. Using a greenthread in this fashion saves us
+from the cost and complexity of locking, while preserving functional
+correctness. When a request for a new operation comes in, it gets appended to
+the corresponding queue entry.
+
+For a sequence of container operations, if an intermediate operation fails,
+we will stop executing the rest of the sequence. The community feels more
+confident to start with this strictly defensive policy[17]. The failure will
+be logged and saved into the container-object, which will help an operator be
+better informed about the result of the sequence of container operations. We
+may revisit this policy later, if we think it is too restrictive.
+
+4. Ensuring the order of execution - Phase-1
+--------------------------------------------
+
+The goal is to execute requests for a given bay and a given container in
+sequence. In phase-1, we want to address persistence and the capability to
+support multiple magnum-conductor processes. To achieve this, we will
+reuse the concepts laid out in phase-0 and use a standard library.
+
+We propose to use taskflow[7] for this implementation. Magnum-conductors
+will consume the AMQP message and post a task[8] on a taskflow jobboard[9].
+Greenthreads from magnum-conductors would subscribe to the taskflow
+jobboard as taskflow-conductors[10]. Taskflow jobboard is maintained with
+a choice of persistent backend[11]. This will help address the concern of
+persistence for accepted operations, when a conductor crashes. Taskflow
+will ensure that tasks, namely container operations, in a job, namely a
+sequence of operations for a given bay and container, would execute in
+sequence. We can easily notice that some of the concepts used in phase-0
+are reused as they are. For example, job-queue maps to jobboard here, and the
+use of greenthreads maps to the conductor concept of taskflow. Hence, we
+expect easier migration from phase-0 to phase-1, with the choice of taskflow.
+
+For taskflow jobboard[11], the available choices of backend are Zookeeper and
+Redis. But, we plan to use MySQL as the default choice of backend, for the
+magnum conductor jobboard use-case. This support will be added to taskflow.
+Later, we may choose to support the flexibility of other backends like
+ZK/Redis via configuration. But, phase-1 will keep the implementation simple
+with MySQL backend and revisit this, if required.
+
+Let's consider the scenarios of Conductor crashing -
+  - If a task is added to jobboard, and conductor crashes after that,
+    taskflow can assign a particular job to any available greenthread agent
+    from other conductor instances.
+    If the system was running with a single
+    magnum-conductor, it will wait for the conductor to come back and join.
+  - A task is picked up and magnum-conductor crashes. In this case, the task
+    is not complete from jobboard point-of-view. As taskflow detects the
+    conductor going away, it assigns another available conductor.
+  - When a conductor picks up a message from AMQP, it will acknowledge it to
+    AMQP only after persisting it to jobboard. This will prevent losing the
+    message, if conductor crashes after picking up the message from AMQP. The
+    explicit acknowledgement from the application to AMQP may use
+    NotificationResult.HANDLED[12]. We may use the at-least-once-guarantee[13]
+    feature in oslo.messaging[14], as it becomes available.
+
+To summarize some of the important outcomes of this proposal -
+  - A taskflow job represents the sequence of container operations on a given
+    bay and given container. At a given point of time, the sequence may contain
+    a single or multiple operations.
+  - There will be a single jobboard for all conductors.
+  - Taskflow-conductors are multiple greenthreads from a given
+    magnum-conductor.
+  - Taskflow-conductor will run in 'blocking' mode[15], as those greenthreads
+    have no other job than claiming and executing the jobs from jobboard.
+  - Individual jobs are supposed to maintain a temporal sequence. So the
+    taskflow-engine would be 'serial'[16].
+  - The proposed model for a 'job' is to consist of a temporal sequence of
+    'tasks' - operations on a given bay and a given container. Hence,
+    it is expected that when a given operation, namely container-create, is in
+    progress, a request for container-start may come in. Adding the task to
+    the existing job is intuitive to maintain the sequence of operations.
+
+To fit taskflow exactly into our use-case, we may need to do two enhancements
+in taskflow -
+
+- Supporting mysql plugin as a DB backend for jobboard. Support for redis
+  exists, so it will be similar.
+  We do not see any technical roadblock for adding mysql support for taskflow
+  jobboard. If the proposal does not get approved by taskflow team, we may have
+  to use redis, as an alternative option.
+- Support for dynamically adding tasks to a job on jobboard. This also looks
+  feasible, as discussed over the #openstack-state-management [Unfortunately,
+  this channel is not logged, but if we agree in this direction, we can
+  initiate discussion over ML, too].
+  If taskflow team does not allow adding this feature, even though they have
+  agreed now, we will use the dependency feature in taskflow. We will explore
+  and elaborate this further, if required.
+
+
+5. Status of progress
+---------------------
+
+The progress of execution of a container operation is reflected on the status
+of a container as - 'create-in-progress', 'delete-in-progress' etc.
+
+Alternatives
+------------
+
+Without an asynchronous implementation, Magnum will suffer from complaints
+about poor scalability and slowness.
+
+In this design, stack-lock[3] has been considered as an alternative to
+taskflow. Following are the reasons for preferring taskflow over
+stack-lock, as of now -
+
+- Stack-lock used in Heat is not a library, so it will require making a copy
+  for Magnum, which is not desirable.
+- Taskflow is a relatively mature, well-supported, feature-rich library.
+- Taskflow has an in-built capacity to scale out[in] as multiple conductors
+  can join in[out] the cluster.
+- Taskflow has a failure detection and recovery mechanism. If a process
+  crashes, then worker threads from other conductors may continue the
+  execution.
+
+In this design, we describe futurist[4] as a choice of implementation. The
+choice was to prevent duplication of code for async and sync mode. For this
+purpose, we could not find any other solution to compare against.
+
+Data model impact
+-----------------
+
+Phase-0 has no data model impact. But phase-1 may introduce an additional
+table into the Magnum database.
+As per the present proposal for using taskflow
+in phase-1, we have to introduce a new table for jobboard under magnum db.
+This table will be exposed to taskflow library as a persistent db plugin.
+Alternatively, an implementation with stack-lock will also require the
+introduction of a new table for stack-lock objects.
+
+REST API impact
+---------------
+
+None.
+
+Security impact
+---------------
+
+None.
+
+Notifications impact
+--------------------
+
+None.
+
+Other end user impact
+---------------------
+
+None.
+
+Performance impact
+------------------
+
+Asynchronous mode of operation helps in scalability. Hence, it improves
+responsiveness and significantly reduces the turnaround time. A small test
+on devstack, comparing both the modes, demonstrates this with numbers[1].
+
+Other deployer impact
+---------------------
+
+None.
+
+Developer impact
+----------------
+
+None.
+
+Implementation
+--------------
+
+Assignee(s)
+-----------
+
+Primary assignee
+  suro-patz(Surojit Pathak)
+
+Work Items
+----------
+
+For phase-0
+
+* Introduce config knob for asynchronous mode of container operations.
+
+* Changes for Magnum-API to use CAST instead of CALL for operations eligible
+  for asynchronous mode.
+
+* Implement the in-memory job-queue in Magnum conductor, and integrate futurist
+  library.
+
+* Unit tests and functional tests for async mode.
+
+* Documentation changes.
+
+For phase-1
+
+* Get the dependencies on taskflow resolved.
+
+* Introduce jobboard table into Magnum DB.
+
+* Integrate taskflow in Magnum conductor to replace the in-memory job-queue
+  with taskflow jobboard. Also, we need conductor greenthreads to subscribe
+  as workers to the taskflow jobboard.
+
+* Add unit tests and functional tests for persistence and multiple conductor
+  scenario.
+
+* Documentation changes.
+
+For phase-2
+
+* We will promote asynchronous mode of operation as the default mode of
+  operation.
+
+* We may decide to drop the code for synchronous mode and corresponding config.
+
+* Documentation changes.
+
+
+Dependencies
+------------
+
+For phase-1, if we choose to implement using taskflow, we need to get the
+following two features added to taskflow first -
+
+* Ability to add a new task to an existing job on jobboard.
+* mysql plugin support as persistent DB.
+
+Testing
+-------
+
+All the existing test cases are run to ensure async mode does not break them.
+Additionally, more functional tests and unit tests specific to async mode
+will be added.
+
+Documentation Impact
+--------------------
+
+Magnum documentation will include a description of the option for asynchronous
+mode of container operations and its benefits. We will also add guidelines to
+the developer documentation for implementing a container operation in
+both the modes - sync and async. We will add a section on 'how to debug
+container operations in async mode'. The phase-0 and phase-1 implementations
+and their support for single or multiple conductors will be clearly documented
+for the operators.
+
+References
+----------
+
+[1] - Execution time comparison between sync and async modes
+
+https://gist.github.com/surojit-pathak/2cbdad5b8bf5b569e755
+
+[2] - Proposed change under review
+
+https://review.openstack.org/#/c/267134/
+
+[3] - Heat's use of stacklock
+
+http://docs.openstack.org/developer/heat/_modules/heat/engine/stack_lock.html
+
+[4] - openstack/futurist
+
+http://docs.openstack.org/developer/futurist/
+
+[5] - openstack/oslo.messaging
+
+http://docs.openstack.org/developer/oslo.messaging/rpcclient.html
+
+[6] - ML discussion on the design
+
+http://lists.openstack.org/pipermail/openstack-dev/2015-December/082524.html
+
+[7] - Taskflow library
+
+http://docs.openstack.org/developer/taskflow/
+
+[8] - task in taskflow
+
+http://docs.openstack.org/developer/taskflow/atoms.html#task
+
+[9] - job and jobboard in taskflow
+
+http://docs.openstack.org/developer/taskflow/jobs.html
+
+[10] - conductor in taskflow
+
+http://docs.openstack.org/developer/taskflow/conductors.html
+
+[11] - persistent backend support in taskflow
+
+http://docs.openstack.org/developer/taskflow/persistence.html
+
+[12] - oslo.messaging notification handler
+
+http://docs.openstack.org/developer/oslo.messaging/notification_listener.html
+
+[13] - Blueprint for at-least-once-guarantee, oslo.messaging
+
+https://blueprints.launchpad.net/oslo.messaging/+spec/at-least-once-guarantee
+
+[14] - Patchset under review for at-least-once-guarantee, oslo.messaging
+
+https://review.openstack.org/#/c/229186/
+
+[15] - Taskflow blocking mode for conductor
+
+http://docs.openstack.org/developer/taskflow/conductors.html#taskflow.conductors.backends.impl_executor.ExecutorConductor
+
+[16] - Taskflow serial engine
+
+http://docs.openstack.org/developer/taskflow/engines.html
+
+[17] - Community feedback on policy to handle failure within a sequence
+
+http://eavesdrop.openstack.org/irclogs/%23openstack-containers/%23openstack-containers.2016-03-08.log.html#t2016-03-08T20:41:17