Merge "Spec for asynchronous container operations"

2016-03-22 17:23:33 +00:00 · 2016-03-22 17:23:33 +00:00 · 28aa5c86d9
parent 01bc09849b a0b1a15521
commit 28aa5c86d9
1 changed files with 452 additions and 0 deletions
--- a/specs/async-container-operation.rst
+++ b/specs/async-container-operation.rst
@ -0,0 +1,452 @@
+=================================
+Asynchronous Container Operations
+=================================
+
+Launchpad blueprint:
+
+https://blueprints.launchpad.net/magnum/+spec/async-container-operations
+
+At present, container operations are done in a synchronous way, end-to-end.
+This model does not scale well, and incurs a penalty on the client to be
+stuck till the end of completion of the operation.
+
+Problem Description
+-------------------
+
+At present Magnum-Conductor executes the container operation as part of
+processing the request forwarded from Magnum-API. For
+container-create, if the image needs to be pulled down, it may take
+a while depending on the responsiveness of the registry, which can be a
+substantial delay. At the same time, experiments suggest that even for
+pre-pulled image, the time taken by each operations, namely
+create/start/delete, are in the same order, as it involves complete turn
+around between the magnum-client and the COE-API, via Magnum-API and
+Magnum-Conductor[1].
+
+Use Cases
+---------
+
+For wider enterprise adoption of Magnum, we need it to scale better.
+For that we need to replace some of these synchronous behaviors with
+suitable alternative of asynchronous implementation.
+
+To understand the use-case better, we can have a look at the average
+time spent during container operations, as noted at[1].
+
+Proposed Changes
+----------------
+
+The design has been discussed over the ML[6]. The conclusions have been kept
+on the 'whiteboard' of the Blueprint.
+
+The amount of code change is expected to be significant. To ease the
+process of adoption, code review, functional tests, an approach of phased
+implementation may be required. We can define the scope of the three phases of
+the implementation as follows -
+
+* Phase-0 will bring in the basic feature of asynchronous mode of operation in
+  Magnum - (A) from API to Conductor and (B) from Conductor to COE-API. During
+  phase-0, this mode will be optional through configuration.
+
+  Both the communications of (A) and (B) are proposed to be made asynchronous
+  to achieve the best of it. If we do (A) alone, it does not gain us much, as
+  (B) takes up the higher cycles of the operation. If we do (B) alone, it does
+  not make sense, as (A) will synchronously wait for no meaningful data.
+
+* Phase-1 will concentrate on making the feature persistent to address various
+  scenarios of conductor restart, worker failure etc. We will support this
+  feature for multiple Conductor-workers in this phase.
+
+* Phase-2 will select asynchronous mode of operation as the default mode. At
+  the same time, we can evaluate to drop the code for synchronous mode, too.
+
+
+Phase-0 is required as a meaningful temporary step, to establish the
+importance and tangible benefits of phase-1. This is also to serve as a
+proof-of-concept at a lower cost of code changes with a configurable option.
+This will enable developers and operators to have a taste of the feature,
+before bringing in the heavier dependencies and changes proposed in phase-1.
+
+A reference implemetation for the phase-0 items, has been put for review[2].
+
+Following is the summary of the design -
+
+1. Configurable mode of operation - async
+-----------------------------------------
+
+For ease of adoption, the async_mode of communication between API-conductor,
+conductor-COE in magnum, can be controlled using a configuration option. So
+the code-path for sync mode and async mode would co-exist for now. To achieve
+this with minimal/no code duplication and cleaner interface, we are using
+openstack/futurist[4]. Futurist interface hides the details of type of executor
+being used. In case of async configuration, a greenthreadpool of configured
+poolsize gets created. Here is a sample of how the config would look
+like: ::
+
+      [DEFAULT]
+      async_enable = False
+
+      [conductor]
+      async_threadpool_max_workers = 64
+
+Futurist library is used in oslo.messaging. Thus, it is used by almost all
+OpenStack projects, in effect. Futurist is very useful to run same code
+under different execution model and hence saving potential duplication of
+code.
+
+
+2. Type of operations
+---------------------
+
+There are two classes of container operations - one that can be made async,
+namely create/delete/start/stop/pause/unpause/reboot, which do not need data
+about the container in return. The other type requires data, namely
+container-logs. For async-type container-operations, magnum-API will be
+using 'cast' instead of 'call' from oslo_messaging[5].
+
+'cast' from oslo.messaging.rpcclient is used to invoke a method and return
+immediately, whereas 'call' invokes a method and waits for a reply. While
+operating in asynchronous mode, it is intuitive to use cast method, as the
+result of the response may not be available immediately.
+
+Magnum-api first fetches the details of a container, by doing
+'get_rpc_resource'. This function uses magnum objects. Hence, this function
+uses a 'call' method underneath. Once, magnum-api gets back the details,
+it issues the container operation next, using another 'call' method.
+The above proposal is to replace the second 'call' with 'cast'.
+
+If user issues a container operation, when there is no listening
+conductor (because of process failure), there will be a RPC timeout at the
+first 'call' method. In this case, user will observe the request to
+get blocked at client and finally fail with HTTP 500 ERROR, after the RPC
+timeout, which is 60 seconds by default. This behavior is independent of the
+usage of 'cast' or 'call' for the second message, mentioned above. This
+behavior does not influence our design, but it is documented here for clarity
+of understanding.
+
+
+3. Ensuring the order of execution - Phase-0
+--------------------------------------------
+
+Magnum-conductor needs to ensure that for a given bay and given container,
+the operations are executed in sequence. In phase-0, we want to demonstrate
+how asynchronous behavior helps scaling. Asynchronous mode of container
+operations would be supported for single magnum-conductor scenario, in
+phase-0. If magnum-conductor crashes, there will be no recovery for the
+operations accepted earlier - which means no persistence in phase-0, for
+operations accepted by magnum-conductor. Multiple conductor scenario and
+persistence will be addressed in phase-1 [please refer to the next section
+for further details]. If COE crashes or does not respond, the error will be
+detected, as it happens in sync mode, and reflected on the container-status.
+
+Magnum-conductor will maintain a job-queue. Job-queue is indexed by bay-id and
+container-id. A job-queue entry would contain the sequence of operations
+requested for a given bay-id and container-id, in temporal order. A
+greenthread will execute the tasks/operations in order for a given job-queue
+entry, till the queue empties. Using a greethread in this fashion saves us
+from the cost and complexity of locking, along with functional correctness.
+When request for new operation comes in, it gets appended to the corresponding
+queue entry.
+
+For a sequence of container operations, if an intermediate operation fails,
+we will stop continuing the sequence. The community feels more confident to
+start with this strictly defensive policy[17]. The failure will be logged
+and saved into the container-object, which will help an operator be informed
+better about the result of the sequence of container operations. We may revisit
+this policy later, if we think it is too restrictive.
+
+4. Ensuring the order of execution - phase-1
+--------------------------------------------
+
+The goal is to execute requests for a given bay and a given container in
+sequence. In phase-1, we want to address persistence and capability of
+supporting multiple magnum-conductor processes. To achieve this, we will
+reuse the concepts laid out in phase-0 and use a standard library.
+
+We propose to use taskflow[7] for this implementation. Magnum-conductors
+will consume the AMQP message and post a task[8] on a taskflow jobboard[9].
+Greenthreads from magnum-conductors would subscribe to the taskflow
+jobboard as taskflow-conductors[10]. Taskflow jobboard is maintained with
+a choice of persistent backend[11]. This will help address the concern of
+persistence for accepted operations, when a conductor crashes. Taskflow
+will ensure that tasks, namely container operations, in a job, namely a
+sequence of operations for a given bay and container, would execute in
+sequence. We can easily notice that some of the concepts used in phase-0
+are reused as it is. For example, job-queue maps to jobboard here, use of
+greenthread maps to the conductor concept of taskflow. Hence, we expect easier
+migration from phase-0 to phase-1, with the choice of taskflow.
+
+For taskflow jobboard[11], the available choices of backend are Zookeeper and
+Redis. But, we plan to use MySQL as default choice of backend, for magnum
+conductor jobboard use-case. This support will be added to taskflow. Later,
+we may choose to support the flexibility of other backends like ZK/Redis via
+configuration. But, phase-1 will keep the implementation simple with MySQL
+backend and revisit this, if required.
+
+Let's consider the scenarios of Conductor crashing -
+ - If a task is added to jobboard, and conductor crashes after that,
+   taskflow can assign a particular job to any available greenthread agents
+   from other conductor instances. If the system was running with single
+   magnum-conductor, it will wait for the conductor to come back and join.
+ - A task is picked up and magnum-conductor crashes. In this case, the task
+   is not complete from jobboard point-of-view. As taskflow detects the
+   conductor going away, it assigns another available conductor.
+ - When conductor picks up a message from AMQP, it will acknowledge AMQP,
+   only after persisting it to jobboard. This will prevent losing the message,
+   if conductor crashes after picking up the message from AMQP. Explicit
+   acknowledgement from application may use NotificationResult.HANDLED[12]
+   to AMQP. We may use the at-least-one-guarantee[13] feature in
+   oslo.messaging[14], as it becomes available.
+
+To summarize some of the important outcomes of this proposal -
+ - A taskflow job represents the sequence of container operations on a given
+   bay and given container. At a given point of time, the sequence may contain
+   a single or multiple operations.
+ - There will be a single jobboard for all conductors.
+ - Task-flow conductors are multiple greenthreads from a given
+   magnum-conductor.
+ - Taskflow-conductor will run in 'blocking' mode[15], as those greenthreads
+   have no other job than claiming and executing the jobs from jobboard.
+ - Individual jobs are supposed to maintain a temporal sequence. So the
+   taskflow-engine would be 'serial'[16].
+ - The proposed model for a 'job' is to consist of a temporal sequence of
+   'tasks' - operations on a given bay and a given container. Henceforth,
+   it is expected that when a given operation, namely container-create is in
+   progress, a request for container-start may come in. Adding the task to
+   the existing job is intuitive to maintain the sequence of operations.
+
+To fit taskflow exactly into our use-case, we may need to do two enhancements
+in taskflow -
+- Supporting mysql plugin as a DB backend for jobboard. Support for redis
+exists, so it will be similar.
+We do not see any technical roadblock for adding mysql support for taskflow
+jobboard. If the proposal does not get approved by taskflow team, we may have
+to use redis, as an alternative option.
+- Support for dynamically adding tasks to a job on jobboard. This also looks
+feasible, as discussed over the #openstack-state-management [Unfortunately,
+this channel is not logged, but if we agree in this direction, we can initiate
+discussion over ML, too]
+If taskflow team does not allow adding this feature, even though they have
+agreed now, we will use the dependency feature in taskflow. We will explore
+and elaborate this further, if it requires.
+
+
+5. Status of progress
+---------------------
+
+The progress of execution of a container operation is reflected on the status
+of a container as - 'create-in-progress', 'delete-in-progress' etc.
+
+Alternatives
+------------
+
+Without an asynchronous implementation, Magnum will suffer from complaints
+about poor scalability and slowness.
+
+In this design, stack-lock[3] has been considered as an alternative to
+taskflow. Following are the reasons for preferring taskflow over
+stack-lock, as of now,
+- Stack-lock used in Heat is not a library, so it will require making a copy
+for Magnum, which is not desirable.
+- Taskflow is relatively mature, well supported, feature-rich library.
+- Taskflow has in-built capacity to scale out[in] as multiple conductors
+can join in[out] the cluster.
+- Taskflow has a failure detection and recovery mechanism. If a process
+crashes, then worker threads from other conductor may continue the execution.
+
+In this design, we describe futurist[4] as a choice of implementation. The
+choice was to prevent duplication of code for async and sync mode. For this
+purpose, we could not find any other solution to compare.
+
+Data model impact
+-----------------
+
+Phase-0 has no data model impact. But phase-1 may introduce an additional
+table into the Magnum database. As per the present proposal for using taskflow
+in phase-1, we have to introduce a new table for jobboard under magnum db.
+This table will be exposed to taskflow library as a persistent db plugin.
+Alternatively, an implementation with stack-lock will also require an
+introduction of a new table for stack-lock objects.
+
+REST API impact
+---------------
+
+None.
+
+Security impact
+---------------
+
+None.
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance impact
+------------------
+
+Asynchrnous mode of operation helps in scalability. Hence, it improves
+responsiveness and reduces the turn around time in a significant
+proportion. A small test on devstack, comparing both the modes,
+demonstrate this with numbers.[1]
+
+Other deployer impact
+---------------------
+
+None.
+
+Developer impact
+----------------
+
+None
+
+Implementation
+--------------
+
+Assignee(s)
+-----------
+
+Primary assignee
+  suro-patz(Surojit Pathak)
+
+Work Items
+----------
+
+For phase-0
+* Introduce config knob for asynchronous mode of container operations.
+
+* Changes for Magnum-API to use CAST instead of CALL for operations eligible
+  for asynchronous mode.
+
+* Implement the in-memory job-queue in Magnum conductor, and integrate futurist
+  library.
+
+* Unit tests and functional tests for async mode.
+
+* Documentation changes.
+
+For phase-1
+* Get the dependencies on taskflow being resolved.
+
+* Introduce jobboard table into Magnum DB.
+
+* Integrate taskflow in Magnum conductor to replace the in-memory job-queue
+  with taskflow jobboard. Also, we need conductor greenthreads to subscribe
+  as workers to the taskflow jobboard.
+
+* Add unit tests and functional tests for persistence and multiple conductor
+  scenario.
+
+* Documentation changes.
+
+For phase-2
+* We will promote asynchronous mode of operation as the default mode of
+operation.
+
+* We may decide to drop the code for synchronous mode and corresponding config.
+
+* Documentation changes.
+
+
+Dependencies
+------------
+
+For phase-1, if we choose to implement using taskflow, we need to get
+following two features added to taskflow first -
+* Ability to add new task to an existing job on jobboard.
+* mysql plugin support as persistent DB.
+
+Testing
+-------
+
+All the existing test cases are run to ensure async mode does not break them.
+Additionally more functional tests and unit tests will be added specific to
+async mode.
+
+Documentation Impact
+--------------------
+
+Magnum documentation will include a description of the option for asynchronous
+mode of container operations and its benefits. We will also add to
+developer documentation on guideline for implementing a container operation in
+both the modes - sync and async. We will add a section on 'how to debug
+container operations in async mode'. The phase-0 and phase-1 implementation
+and their support for single or multiple conductors will be clearly documented
+for the operators.
+
+References
+----------
+
+[1] - Execution time comparison between sync and async modes:
+
+https://gist.github.com/surojit-pathak/2cbdad5b8bf5b569e755
+
+[2] - Proposed change under review:
+
+https://review.openstack.org/#/c/267134/
+
+[3] - Heat's use of stacklock
+
+http://docs.openstack.org/developer/heat/_modules/heat/engine/stack_lock.html
+
+[4] - openstack/futurist
+
+http://docs.openstack.org/developer/futurist/
+
+[5] - openstack/oslo.messaging
+
+http://docs.openstack.org/developer/oslo.messaging/rpcclient.html
+
+[6] - ML discussion on the design
+
+http://lists.openstack.org/pipermail/openstack-dev/2015-December/082524.html
+
+[7] - Taskflow library
+
+http://docs.openstack.org/developer/taskflow/
+
+[8] - task in taskflow
+
+http://docs.openstack.org/developer/taskflow/atoms.html#task
+
+[9] - job and jobboard in taskflow
+
+http://docs.openstack.org/developer/taskflow/jobs.html
+
+[10] - conductor in taskflow
+
+http://docs.openstack.org/developer/taskflow/conductors.html
+
+[11] - persistent backend support in taskflow
+
+http://docs.openstack.org/developer/taskflow/persistence.html
+
+[12] - oslo.messaging notification handler
+
+http://docs.openstack.org/developer/oslo.messaging/notification_listener.html
+
+[13] - Blueprint for at-least-once-guarantee, oslo.messaging
+
+https://blueprints.launchpad.net/oslo.messaging/+spec/at-least-once-guarantee
+
+[14] - Patchset under review for at-least-once-guarantee, oslo.messaging
+
+https://review.openstack.org/#/c/229186/
+
+[15] - Taskflow blocking mode for conductor
+
+http://docs.openstack.org/developer/taskflow/conductors.html#taskflow.conductors.backends.impl_executor.ExecutorConductor
+
+[16] - Taskflow serial engine
+
+http://docs.openstack.org/developer/taskflow/engines.html
+
+[17] - Community feedback on policy to handle failure within a sequence
+
+http://eavesdrop.openstack.org/irclogs/%23openstack-containers/%23openstack-containers.2016-03-08.log.html#t2016-03-08T20:41:17