Update the overview of keepalived and pacemaker HA

High level descriptions of the keepalived and pacemaker HA architectures

Change-Id: Idb838b08d730895135ef92335777ca92b5951c54
Andrew Beekhof 2015-11-16 12:53:56 +11:00 committed by KATO Tomoyuki
parent 9ac92531d0
commit 0a4abf4db9
6 changed files with 317 additions and 11 deletions

Three binary image files added (223 KiB, 215 KiB, and 52 KiB).

@ -0,0 +1,96 @@
============================
The keepalived architecture
============================
High availability strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following diagram shows a very simplified view of the different
strategies used to achieve high availability for the OpenStack
services:
.. image:: /figures/keepalived-arch.jpg
:width: 100%
Depending on the method used to communicate with the service, one of
the following availability strategies is used:
- Keepalived, for the HAProxy instances (see the configuration sketch
  below).
- Access via an HAProxy virtual IP, for services such as HTTPd that
  are accessed via a TCP socket that can be load balanced.
- Built-in application clustering, when available from the application.
Galera is one example of this.
- Starting up one instance of the service on several controller nodes,
when they can coexist and coordinate by other means. RPC in
``nova-conductor`` is one example of this.
- No high availability, when the service can only work in
active/passive mode.
Known issues with cinder-volume currently make it advisable to run it
in active/passive mode; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
While there will be multiple neutron LBaaS agents running, each agent
manages a set of load balancers that cannot be failed over to
another node.
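To make the keepalived strategy above more concrete, the following is a
minimal sketch of a ``keepalived.conf`` VRRP instance holding the HAProxy
virtual IP. The interface name, router ID, priority, and addresses are
placeholders and will differ in a real deployment.

.. code-block:: none

   # Sketch of a keepalived.conf VRRP instance holding the HAProxy VIP.
   # The interface, virtual_router_id, priority and the VIP itself are
   # deployment-specific placeholders.
   vrrp_script check_haproxy {
       script "killall -0 haproxy"   # succeeds while an haproxy process exists
       interval 2
   }
   vrrp_instance haproxy_vip {
       state MASTER                  # the other controllers use state BACKUP
       interface eth0
       virtual_router_id 51
       priority 101                  # the highest priority wins the VRRP election
       virtual_ipaddress {
           192.0.2.10
       }
       track_script {
           check_haproxy
       }
   }

If the node holding the VIP fails, or its HAProxy process dies, VRRP
elects the next highest-priority node and the VIP moves there.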
Architecture limitations
~~~~~~~~~~~~~~~~~~~~~~~~
This architecture has some inherent limitations that should be kept in
mind during deployment and daily operations.
The following sections describe these limitations.
#. Keepalived and network partitions
In case of a network partition, there is a chance that two or
more nodes running keepalived claim to hold the same VIP, which may
lead to undesired behaviour. Since keepalived uses VRRP over
multicast to elect a master (VIP owner), a network partition in
which keepalived nodes cannot communicate will result in the VIPs
existing on two nodes. When the network partition is resolved, the
duplicate VIPs should also be resolved. Note that this network
partition problem with VRRP is a known limitation for this
architecture.
#. Cinder-volume as a single point of failure
There are currently concerns over the cinder-volume service's ability
to run as a fully active/active service. During the Mitaka
timeframe, this is being worked on; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
Thus, cinder-volume will only be running on one of the controller
nodes, even though it is configured on all of them. If the node
running cinder-volume fails, the service must be started on a
surviving controller node.
#. Neutron-lbaas-agent as a single point of failure
The current design of the neutron LBaaS agent using the HAProxy
driver does not allow high availability for the tenant load
balancers. The neutron-lbaas-agent service will be enabled and
running on all controllers, allowing for load balancers to be
distributed across all nodes. However, a controller node failure
stops all load balancers running on that node until the service
is recovered or the load balancers are manually removed and
recreated.
#. Service monitoring and recovery required
An external service monitoring infrastructure is required to check
the OpenStack service health and notify operators in case of any
failure. This architecture does not provide any facility for that,
so it would be necessary to integrate the OpenStack deployment with
any existing monitoring environment.
#. Manual recovery after a full cluster restart
Some support services used by RDO or RHEL OSP use their own form of
application clustering. Usually, these services maintain a cluster
quorum that may be lost in case of a simultaneous restart of all
cluster nodes, for example during a power outage. Each service will
require its own procedure to regain quorum.
If you find any or all of these limitations concerning, you are
encouraged to refer to the `Pacemaker HA architecture
<intro-ha-arch-pacemaker.html>`_ instead.


@ -0,0 +1,198 @@
==========================
The Pacemaker architecture
==========================
What is a cluster manager
~~~~~~~~~~~~~~~~~~~~~~~~~
At its core, a cluster is a distributed finite state machine capable
of co-ordinating the startup and recovery of inter-related services
across a set of machines.
Even a distributed and/or replicated application that is able to
survive failures on one or more machines can benefit from a
cluster manager:
#. Awareness of other applications in the stack
While SysV init replacements like systemd can provide
deterministic recovery of a complex stack of services, the
recovery is limited to one machine and lacks the context of what
is happening on other machines - context that is crucial to
determine the difference between a local failure, a clean startup,
and recovery after a total site failure.
#. Awareness of instances on other machines
Services like RabbitMQ and Galera have complicated boot-up
sequences that require co-ordination, and often serialization, of
startup operations across all machines in the cluster. This is
especially true after site-wide failure or shutdown where we must
first determine the last machine to be active.
#. A shared implementation and calculation of `quorum
<http://en.wikipedia.org/wiki/Quorum_(Distributed_Systems)>`_.
It is very important that all members of the system share the same
view of who their peers are and whether or not they are in the
majority. Failure to do this leads very quickly to an internal
`split-brain <http://en.wikipedia.org/wiki/Split-brain_(computing)>`_
state - where different parts of the system are pulling in
different and incompatible directions.
#. Data integrity through fencing (a non-responsive process does not
imply it is not doing anything)
A single application does not have sufficient context to know the
difference between failure of a machine and failure of the
application on a machine. The usual practice is to assume the
machine is dead and carry on; however, this is highly risky - a
rogue process or machine could still be responding to requests and
generally causing havoc. The safer approach is to make use of
remotely accessible power switches and/or network switches and SAN
controllers to fence (isolate) the machine before continuing.
#. Automated recovery of failed instances
While the application can still run after the failure of several
instances, it may not have sufficient capacity to serve the
required volume of requests. A cluster can automatically recover
failed instances to prevent additional load-induced failures.
For this reason, the use of a cluster manager like `Pacemaker
<http://clusterlabs.org>`_ is highly recommended.
Deployment flavors
~~~~~~~~~~~~~~~~~~
It is possible to deploy three different flavors of the Pacemaker
architecture. The two extremes are **Collapsed** (where every
component runs on every node) and **Segregated** (where every
component runs in its own 3+ node cluster).
Regardless of which flavor you choose, it is recommended that the
clusters contain at least three nodes so that we can take advantage of
`quorum <quorum_>`_.
Quorum becomes important when a failure causes the cluster to split in
two or more partitions. In this situation, you want the majority to
ensure the minority are truly dead (through fencing) and continue to
host resources. For a two-node cluster, no side has the majority and
you can end up in a situation where both sides fence each other, or
both sides are running the same services - leading to data corruption.
Clusters with an even number of hosts suffer from similar issues - a
single network failure could easily cause an N:N split where neither
side retains a majority. For this reason, we recommend an odd number
of cluster members when scaling up.
You can have up to 16 cluster members (this is currently limited by
the ability of corosync to scale higher). In extreme cases, 32 and
even up to 64 nodes could be possible; however, this is not well tested.
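As an illustration of how the membership and quorum described above are
expressed, the following is a minimal sketch of the ``nodelist`` and
``quorum`` sections of ``corosync.conf`` for a three-node cluster. The
controller host names are placeholders.

.. code-block:: none

   # Sketch of the membership and quorum sections of corosync.conf for a
   # three-node cluster; the controller host names are placeholders.
   nodelist {
       node {
           ring0_addr: controller1
           nodeid: 1
       }
       node {
           ring0_addr: controller2
           nodeid: 2
       }
       node {
           ring0_addr: controller3
           nodeid: 3
       }
   }
   quorum {
       # votequorum gives each node one vote; a partition keeps hosting
       # resources only while it holds a majority of the votes (2 of 3 here)
       provider: corosync_votequorum
   }

With three votes in total, a single isolated node loses quorum, while the
remaining pair retains the majority and continues to host resources.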
Collapsed
---------
In this configuration, there is a single cluster of 3 or more
nodes on which every component is running.
This scenario has the advantage of requiring far fewer, if more
powerful, machines. Additionally, being part of a single cluster
allows us to accurately model the ordering dependencies between
components.
This scenario can be visualized as below.
.. image:: /figures/Cluster-deployment-collapsed.png
:width: 100%
You would choose this option if you prefer to have fewer but more
powerful boxes.
This is the most common option and the one we document here.
Segregated
----------
In this configuration, each service runs in a dedicated cluster of
3 or more nodes.
The benefits of this approach are the physical isolation between
components and the ability to add capacity to specific components.
You would also choose this option if you prefer to have more but
less powerful boxes.
This scenario can be visualized as below, where each box
represents a cluster of three or more guests.
.. image:: /figures/Cluster-deployment-segregated.png
:width: 100%
Mixed
-----
It is also possible to follow a segregated approach for one or more
components that are expected to be a bottleneck and use a collapsed
approach for the remainder.
Proxy server
~~~~~~~~~~~~
Almost all services in this stack benefit from being proxied.
Using a proxy server provides:
#. Load distribution
Many services can act in an active/active capacity; however, they
usually require an external mechanism for distributing requests to
one of the available instances. The proxy server can serve this
role.
#. API isolation
By sending all API access through the proxy, we can clearly
identify service interdependencies. We can also move them to
locations other than ``localhost`` to increase capacity if the
need arises.
#. Simplified process for adding/removing nodes
Since all API access is directed to the proxy, adding or removing
nodes has no impact on the configuration of other services. This
can be very useful in upgrade scenarios where an entirely new set
of machines can be configured and tested in isolation before
telling the proxy to direct traffic there instead.
#. Enhanced failure detection
The proxy can be configured as a secondary mechanism for detecting
service failures. It can even be configured to look for nodes in
a degraded state (such as being 'too far' behind in the
replication) and take them out of circulation.
The following components are currently unable to benefit from the use
of a proxy server:
* RabbitMQ
* Memcached
* MongoDB
However, the reasons vary and are discussed under each component's
heading.
We recommend HAProxy as the load balancer; however, there are many
alternatives in the marketplace.
We use a check interval of 1 second; however, the timeouts vary by service.
Generally, we use round-robin to distribute load amongst instances of
active/active services; however, Galera uses the ``stick-table`` options
to ensure that incoming connections to the virtual IP (VIP) are
directed to only one of the available back ends.
In Galera's case, although it can run active/active, this helps avoid
lock contention and prevent deadlocks. It is used in combination with
the ``httpchk`` option, which ensures that only nodes in sync with their
peers are allowed to handle requests.
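As a sketch of how these options fit together, an HAProxy section for
Galera along the lines described above might look like the following.
The addresses, ports, and the ``clustercheck``-style health-check
listener assumed on port 9200 are illustrative values, not requirements.

.. code-block:: none

   # Sketch of an HAProxy section for Galera: httpchk health checks plus a
   # stick table so that all connections to the VIP land on one back end.
   # Addresses and the port-9200 clustercheck listener are assumptions.
   listen galera_cluster
       bind 192.0.2.10:3306
       mode tcp
       option httpchk                  # poll the clustercheck HTTP listener
       stick-table type ip size 2
       stick on dst                    # pin the VIP to a single back end
       server controller1 10.0.0.11:3306 check port 9200 inter 1s rise 2 fall 5
       server controller2 10.0.0.12:3306 check port 9200 inter 1s rise 2 fall 5 backup
       server controller3 10.0.0.13:3306 check port 9200 inter 1s rise 2 fall 5 backup

With ``stick on dst``, every connection to the VIP is sent to the same
Galera node, which avoids the cross-node lock contention mentioned above;
the ``backup`` servers only take over when the active node fails its
health check.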


@ -23,17 +23,6 @@ In general we can divide all the OpenStack components into three categories:
- :term:`Advanced Message Queuing Protocol (AMQP)` provides OpenStack
internal stateful communication service.
Assuming that every single OpenStack controller runs the full set of
the elementary services (symmetric controller), the common good practice
is to have a small odd number of controllers.
Most of the time, it means three OpenStack controllers.
[TODO Discuss SLA (Service Level Agreement), if this is the measure we use.
Other possibilities include MTTR (Mean Time To Recover),
RTO (Recovery Time Objective),
and ETR (Expected Time of Repair).]
Network components
~~~~~~~~~~~~~~~~~~
@ -50,6 +39,29 @@ and expected SLA.]
See [TODO link] for more information about configuring networking
for high availability.
Common deployment architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are primarily two HA architectures in use today.
One uses a cluster manager such as Pacemaker or Veritas to co-ordinate
the actions of the various services across a set of machines. Since
we are focused on FOSS, we will refer to this as the Pacemaker
architecture.
The other is optimized for Active/Active services that do not require
any inter-machine coordination. In this setup, services are started by
your init system (systemd in most modern distributions) and a tool is
used to move IP addresses between the hosts. The most common package
for doing this is keepalived.
.. toctree::
:maxdepth: 1
intro-ha-arch-pacemaker.rst
intro-ha-arch-keepalived.rst
Load balancing (HAProxy)
~~~~~~~~~~~~~~~~~~~~~~~~