mistral-specs/specs/pike/approved/ha.rst

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Mistral HA and Scalability
==========================

Launchpad blueprints:

https://blueprints.launchpad.net/mistral/+spec/mistral-ha-spec
https://blueprints.launchpad.net/mistral/+spec/mistral-ha-gate
https://blueprints.launchpad.net/mistral/+spec/mistral-multinode-tests
https://blueprints.launchpad.net/mistral/+spec/mistral-fault-tolerance
https://blueprints.launchpad.net/mistral/+spec/mistral-scale-up-down

Mistral needs to be highly available (HA) and scalable.

This specification doesn't attempt to describe all steps and measures that
need to be taken in order to achieve these goals. It rather aims to set a
common ground for understanding what tasks and challenges the team needs to
solve and proposes approximate approaches for that.

Problem description
===================

In a nutshell, being highly available and scalable means that we should
be able to run more than one Mistral cluster component of any type (api,
engine, executor) and gain more reliability and performance.
Originally, Mistral was designed to be fully asynchronous and thus easily
scalable, so we know that Mistral, in fact, works with multiple APIs, engines
and executors and its performance becomes higher in this case.
High availability though is a complex non functional requirement that is
usually achieved by making lots of changes to account for various failure
situations that can happen with multiple nodes of the systems. Particularly,
the system should gracefully handle cluster topology changes: node addition
and node departure. These events should not affect normal work of workflows
that have already been started.
So far, the Mistral team hasn't spent significant time on HA. Although Mistral
partially implements these two non functional requirements we need to make
sure that all of their corresponding aspects are implemented and tested. This
specification aims to clearly define these aspects and roughly propose
approaches for taking care of them.

Use Cases
---------

* Scalability is needed when performance of a Mistral cluster component,
  such as Mistral engine, limited by resources (I/O, RAM, CPU) of a single
  machine, is not sufficient. Those are any use cases that assume significant
  load on Mistral, essentially lots of tasks processed by Mistral per time
  unit.
  Example: hosted workflow service in a public cloud.
* High availability is critically important for use cases where workflows
  managed by Mistral should live even in case of infrastructure failures.
  To deal with these failures the system needs to have redundancy, i.e.
  extra instances of cluster components. Losing part of them doesn't affect
  normal work of the system, all functionality continues to be accessible.


Proposed change
===============

**Testing infrastructure**

Currently, Mistral team doesn't have any infrastructure that would allow to
automate testing of multi-node Mistral, whether we want to test scalability
properties or reliability and availability. The first step towards achieving
the goals defined by the spec is to create such an infrastructure.

Currently, the requirements for this infrastructure are seen as follows:

* New gate in OpenStack CI that will allow to run any number of Mistral
  cluster components as separate processes
* Python utility methods that will allow to start, stop and find running
  Mistral cluster components. This is required because we should be able to
  have different number of cluster components for different tests within the
  same gate. It will also allow to test addition and departure of cluster
  nodes (scale up and down).
* Means that will allow to imitate infrastructure failures. One of the most
  important is lost messages in RPC.
* A tool for benchmarking Mistral performance when it's installed in multi-node
  mode. Ideally, it needs to be a separate gate (most likely based on Rally)


**Mistral stable multi-node topology**

This work item mostly assumes testing Mistral with multiple nodes when the
number of node is stable. The most important part is to test multiple engines
because engine is the part of the system that creates complex database
transactions and there used to be known issues related to that like deadlocks.
These known issues seem to have been fixed but we need to create test harness
for this and make sure it's tested automatically on every commit. Potentially,
we might still have some issues like that.

**Mistral variable multi-node topology (scale up/down)**

Once multi-node Mistral is tested with a stable cluster topology we need to
make sure to create tests where nodes are added and removed while workflows
are running. These workflows must finish successfully regardless of cluster
topology changes, only if there's at least one running component of every
type (engine, executor, api).
In this part we expect issues related to lost RPC messages for communications
where message acknowledgements are not used. It's possible that component A
will send a message component B, and if component B polls the message from
a message queue and goes down before the message is fully processed and
result is sent back, then state of a corresponding object in the database
won't be updated (action execution, task execution, workflow execution).
Having that said, as of now, there's no opportunity to fix this when using
oslo.messaging based RPC implementation because it doesn't support message
acknowledgement needed for the "Work queue" pattern, i.e. "at-least-once"
delivery. See https://www.rabbitmq.com/tutorials/tutorial-two-python.html for
details. For now, we have to use Kombu-based Mistral RPC which supports this
mechanism. However, it's known that message acknowledgements should be used
in more cases where the mistral services communicate, than currently enabled.
For example, it is enabled on the executor side so that a message sent to an
executor (to run an action) is acknowledged after it's fully processed. But
it's known that it's not enabled on the engine side when, for example,
executor sends a result of the action back to Mistral. Part of this work
item will be identifying all such gaps in the Mistral communication protocol
and fix them.
Another visible problem is graceful scale down, when the node that is
intentionally leaving the cluster should stop handling new incoming RPC
requests but complete requests that are already being processed. If this
concern is not taken care of, we'll be getting repeated message processing
problem, i.e. when in fact a message was processed more than once that in turn
leads to breaking normal logic of workflow processing. The example: an engine
sends a "run action" request to an executor, executor polls it, runs an action
and immediately after that goes down w/o having a chance to send the result
back to the engine acknowledging the initial message. So, in fact, the acton
already ran once. Then, since the message is not acknowledged, a message queue
broker will resend it to a different executor and it'll be processed again.
The solution: graceful shutdown when executor fully processes the message,
sends the result back, acknowledges the message and only then is allowed to
shut down. On this matter, we need to distinguish between idempotent and
non-idempotent actions. The described problem applies only to non-idempotent
actions since, from their definition, every execution of such action changes
the state of the system.

**Handling infrastructure failures**

Mistral should be able to gracefully handle the following infrastructure
outages:

* Temporary lost connectivity with the database
* Temporary lost connectivity with the message broker

Currently, there aren't automatic tests that would guarantee that Mistral
handles these situations gracefully w/o losing ability to complete workflows
that are already running. So this work stream assumes creating tests that
imitate these failures and fixing possible issues that may occur in this case.
The most important part is execution objects stuck in a non-terminal state
like RUNNING due to a temporary outage of the message broker or the database.
It is known that there can be situations when an RPC message is completely
lost. For example, RabbitMQ does not guarantee that if it accepted a message
it will be delivered to consumers. Automatic recovery looks impossible in this
case but we need to come up with approaches how to help operators identify
these situations and recover from them. This should be figured out during the
implementation phase.

**Benchmarking of multi-node Mistral**

We already have a CI gate based on Rally that allows benchmarking a single-node
Mistral. We need to create a gate and a tool set that will allow benchmarking
Mistral that has different number of nodes so that we see how scaling up/down
changes Mistral performance. When we're talking about Mistral performance,
until not explained, it's not clear what it means because if we mean workflow
execution time (from when it was started to when it was completed) then it
totally depends on what kind of workflow it is, whether it's highly parallel
or has only sequences, whether it has joins, "with-items" tasks, task policies
etc.
The proposal is to come up with several workflows that could be called
"reference workflows" and to measure execution time of these workflows in
seconds (minimal, average, highest) as well as "processing time per task" for
each one of them that would show how much time is needed to process an
individual task for all such workflows. These reference workflows together
should describe all kinds of elementary workflow blocks that any more complex
workflow can consist of. Thus these workflows will give us understanding about
performance of all most important workflow topologies. Plus we can have one or
more complex workflow built of these elementary blocks and measure its
execution time as well.

Currently, the proposed reference workflows are:

* Sequence of N tasks where each task is started only after the previous one.
  Without parallelism at all.
* N parallel tasks w/o connections between each other. Fully parallel workflow.
* X parallel sequences where each contains N tasks. Mixed sequential
  and parallel workflows.
* N parallel tasks joined by a task marked as "join: all"
* X parallel sequences where each contains N tasks joined by a task marked as
  "join: all"
* One task configured as "with-items" that processes N elements
* Complex workflow where all the previous items are combined together

Alternatives
------------

None.

Data model impact
-----------------

None.

REST API impact
---------------

None.

End user impact
---------------

Higher uptime of Mistral service (if by end user we mean those who call
Mistral API).

Performance Impact
------------------

* Performance of Mistral with more nodes will be higher than performance of
  Mistral with less nodes.

Deployer impact
---------------

Deployers will be able to run more Mistral nodes to increase performance and
redundancy/availability of the service.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  rakhmerov

Other contributors:
  melisha

Work Items
----------

* OpenStack CI job for testing Mistral in HA
* Make multi-node Mistral work reliably with a stable cluster topology
  (includes automatic testing)
* Make multi-node Mistral work reliably with a variable cluster topology
  (includes automatic testing)
* Make Mistral handle infrastructure failures
  (includes automatic testing)
* Fix known RPC communication issues (acknowledgements etc.)
* Create a set of automatic benchmarks (most likely with Rally) that will
  show how Mistral cluster scale up / scale down influence performance

Dependencies
============

None.


Testing
=======

* Use a new CI gate that will allow to run multiple cluster nodes and test
  how Mistral scales up and down and how it reacts on various infrastructure
  failures (more details above)

References
==========

None.