Update the overview of keepalived and pacemaker HA
High-level descriptions of the keepalived and Pacemaker HA architectures.

Change-Id: Idb838b08d730895135ef92335777ca92b5951c54

============================
The keepalived architecture
============================

High availability strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following diagram shows a very simplified view of the different
strategies used to achieve high availability for the OpenStack
services:

.. image:: /figures/keepalived-arch.jpg
   :width: 100%

Depending on the method used to communicate with the service, one of
the following availability strategies is used:

- Keepalived, for the HAProxy instances (a minimal configuration
  sketch follows this list).
- Access via an HAProxy virtual IP, for services such as HTTPd that
  are accessed via a TCP socket that can be load balanced.
- Built-in application clustering, when available from the application.
  Galera is one example of this.
- Starting up one instance of the service on several controller nodes,
  when they can coexist and coordinate by other means. RPC in
  ``nova-conductor`` is one example of this.
- No high availability, when the service can only work in
  active/passive mode.
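
As a minimal sketch of the keepalived strategy, each HAProxy node
would run a VRRP instance similar to the following. The interface
name, router ID, priority, and addresses below are hypothetical:

.. code-block:: none

   # /etc/keepalived/keepalived.conf (illustrative sketch)
   vrrp_instance haproxy_vip {
       state MASTER            # BACKUP on the other controller nodes
       interface eth0          # interface that carries the VIP
       virtual_router_id 51    # must match on all nodes in the group
       priority 101            # the highest priority wins the election
       virtual_ipaddress {
           192.168.1.10        # the HAProxy virtual IP
       }
   }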

Due to known issues with cinder-volume, it is recommended to keep it
active-passive for now; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

While there will be multiple neutron LBaaS agents running, each agent
manages a set of load balancers that cannot be failed over to
another node.

Architecture limitations
~~~~~~~~~~~~~~~~~~~~~~~~

This architecture has some inherent limitations that should be kept in
mind during deployment and daily operations.
The following sections describe these limitations.

#. Keepalived and network partitions

   In case of a network partition, there is a chance that two or
   more nodes running keepalived claim to hold the same VIP, which may
   lead to undesired behaviour. Since keepalived uses VRRP over
   multicast to elect a master (VIP owner), a network partition in
   which the keepalived nodes cannot communicate will result in the VIPs
   existing on two nodes. When the network partition is resolved, the
   duplicate VIPs should also be resolved. Note that this network
   partition problem with VRRP is a known limitation of this
   architecture.
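
   To see which nodes hold a given VIP during or after a partition,
   you can inspect the configured interface on each node (hypothetical
   address and interface name):

   .. code-block:: console

      $ ip addr show dev eth0 | grep 192.168.1.10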

#. Cinder-volume as a single point of failure

   There are currently concerns over the ability of the cinder-volume
   service to run as a fully active-active service. This is being
   worked on during the Mitaka timeframe; see:
   https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
   Thus, cinder-volume will only be running on one of the controller
   nodes, even though it will be configured on all nodes. In case of a
   failure in the node running cinder-volume, it should be started on
   a surviving controller node.
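
   A minimal sketch of that manual failover, assuming the RDO/RHEL OSP
   service name ``openstack-cinder-volume``, run on a surviving node:

   .. code-block:: console

      $ sudo systemctl start openstack-cinder-volume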

#. Neutron-lbaas-agent as a single point of failure

   The current design of the neutron LBaaS agent using the HAProxy
   driver does not allow high availability for the tenant load
   balancers. The neutron-lbaas-agent service will be enabled and
   running on all controllers, allowing load balancers to be
   distributed across all nodes. However, a controller node failure
   will stop all load balancers running on that node until the service
   is recovered or the load balancer is manually removed and created
   again.

#. Service monitoring and recovery required

   An external service monitoring infrastructure is required to check
   the OpenStack service health and notify operators in case of any
   failure. This architecture does not provide any facility for that,
   so it is necessary to integrate the OpenStack deployment with
   an existing monitoring environment.

#. Manual recovery after a full cluster restart

   Some support services used by RDO or RHEL OSP use their own form of
   application clustering. Usually, these services maintain a cluster
   quorum that may be lost in case of a simultaneous restart of all
   cluster nodes, for example during a power outage. Each service will
   require its own procedure to regain quorum.

If you find any or all of these limitations concerning, you are
encouraged to refer to the `Pacemaker HA architecture
<intro-ha-arch-pacemaker.html>`_ instead.

==========================
The Pacemaker architecture
==========================

What is a cluster manager
~~~~~~~~~~~~~~~~~~~~~~~~~

At its core, a cluster is a distributed finite state machine capable
of co-ordinating the startup and recovery of inter-related services
across a set of machines.

Even a distributed and/or replicated application that is able to
survive failures on one or more machines can benefit from a
cluster manager:

#. Awareness of other applications in the stack

   While SYS-V init replacements like systemd can provide
   deterministic recovery of a complex stack of services, the
   recovery is limited to one machine and lacks the context of what
   is happening on other machines - context that is crucial to
   determine the difference between a local failure, a clean startup,
   and recovery after a total site failure.

#. Awareness of instances on other machines

   Services like RabbitMQ and Galera have complicated boot-up
   sequences that require co-ordination, and often serialization, of
   startup operations across all machines in the cluster. This is
   especially true after a site-wide failure or shutdown, where we must
   first determine the last machine to have been active.

#. A shared implementation and calculation of `quorum
   <http://en.wikipedia.org/wiki/Quorum_(Distributed_Systems)>`_.

   It is very important that all members of the system share the same
   view of who their peers are and whether or not they are in the
   majority. Failure to do this leads very quickly to an internal
   `split-brain <http://en.wikipedia.org/wiki/Split-brain_(computing)>`_
   state - where different parts of the system are pulling in
   different and incompatible directions.

#. Data integrity through fencing (a non-responsive process does not
   imply it is not doing anything)

   A single application does not have sufficient context to know the
   difference between failure of a machine and failure of the
   application on a machine. The usual practice is to assume the
   machine is dead and carry on; however, this is highly risky - a
   rogue process or machine could still be responding to requests and
   generally causing havoc. The safer approach is to make use of
   remotely accessible power switches and/or network switches and SAN
   controllers to fence (isolate) the machine before continuing.
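
   A minimal sketch of configuring one IPMI-based fencing device with
   ``pcs``, assuming the ``fence_ipmilan`` agent and hypothetical host
   names, addresses, and credentials:

   .. code-block:: console

      $ pcs stonith create fence-node1 fence_ipmilan \
            pcmk_host_list="node1" ipaddr="10.0.0.101" \
            login="admin" passwd="secret"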

#. Automated recovery of failed instances

   While the application can still run after the failure of several
   instances, it may not have sufficient capacity to serve the
   required volume of requests. A cluster can automatically recover
   failed instances to prevent additional load-induced failures.

For these reasons, the use of a cluster manager like `Pacemaker
<http://clusterlabs.org>`_ is highly recommended.
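
As a minimal, illustrative example of the kind of resource a cluster
manager recovers automatically, a virtual IP can be defined as a
Pacemaker resource (the address and netmask below are hypothetical):

.. code-block:: console

   $ pcs resource create vip ocf:heartbeat:IPaddr2 \
         ip="192.168.1.10" cidr_netmask="24" op monitor interval="30s"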

Deployment flavors
~~~~~~~~~~~~~~~~~~

It is possible to deploy three different flavors of the Pacemaker
architecture. The two extremes are **Collapsed** (where every
component runs on every node) and **Segregated** (where every
component runs in its own 3+ node cluster).

Regardless of which flavor you choose, it is recommended that the
clusters contain at least three nodes so that we can take advantage of
`quorum <quorum_>`_.

Quorum becomes important when a failure causes the cluster to split into
two or more partitions. In this situation, you want the majority to
ensure the minority are truly dead (through fencing) and continue to
host resources. For a two-node cluster, neither side has the majority, and
you can end up in a situation where both sides fence each other, or
both sides are running the same services - leading to data corruption.

Clusters with an even number of hosts suffer from similar issues - a
single network failure could easily cause an N:N split where neither
side retains a majority. For this reason, we recommend an odd number
of cluster members when scaling up.
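
As a quick sanity check, assuming a corosync-based Pacemaker cluster,
the current membership and quorum state can be inspected on any node:

.. code-block:: console

   $ sudo corosync-quorumtool -s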

You can have up to 16 cluster members (this is currently limited by
the ability of corosync to scale higher). In extreme cases, 32 and
even up to 64 nodes could be possible; however, this is not well tested.

Collapsed
---------

In this configuration, there is a single cluster of 3 or more
nodes on which every component is running.

This scenario has the advantage of requiring far fewer, if more
powerful, machines. Additionally, being part of a single cluster
allows us to accurately model the ordering dependencies between
components.

This scenario can be visualized as below.

.. image:: /figures/Cluster-deployment-collapsed.png
   :width: 100%

You would choose this option if you prefer to have fewer but more
powerful boxes.

This is the most common option and the one we document here.

Segregated
----------

In this configuration, each service runs in a dedicated cluster of
3 or more nodes.

The benefits of this approach are the physical isolation between
components and the ability to add capacity to specific components.

You would choose this option if you prefer to have more but
less powerful boxes.

This scenario can be visualized as below, where each box
represents a cluster of three or more guests.

.. image:: /figures/Cluster-deployment-segregated.png
   :width: 100%

Mixed
-----

It is also possible to follow a segregated approach for one or more
components that are expected to be a bottleneck and use a collapsed
approach for the remainder.

Proxy server
~~~~~~~~~~~~

Almost all services in this stack benefit from being proxied.
Using a proxy server provides:

#. Load distribution

   Many services can act in an active/active capacity; however, they
   usually require an external mechanism for distributing requests to
   one of the available instances. The proxy server can serve this
   role.

#. API isolation

   By sending all API access through the proxy, we can clearly
   identify service interdependencies. We can also move them to
   locations other than ``localhost`` to increase capacity if the
   need arises.

#. Simplified process for adding/removing nodes

   Since all API access is directed to the proxy, adding or removing
   nodes has no impact on the configuration of other services. This
   can be very useful in upgrade scenarios where an entirely new set
   of machines can be configured and tested in isolation before
   telling the proxy to direct traffic there instead.

#. Enhanced failure detection

   The proxy can be configured as a secondary mechanism for detecting
   service failures. It can even be configured to look for nodes in
   a degraded state (such as being 'too far' behind in the
   replication) and take them out of circulation.

The following components are currently unable to benefit from the use
of a proxy server:

* RabbitMQ
* Memcached
* MongoDB

However, the reasons vary and are discussed under each component's
heading.

We recommend HAProxy as the load balancer; however, there are many
alternatives in the marketplace.

We use a check interval of 1 second; however, the timeouts vary by service.

Generally, we use round-robin to distribute load amongst instances of
active/active services; however, Galera uses the ``stick-table`` options
to ensure that incoming connections to the virtual IP (VIP) are
directed to only one of the available back ends.

In Galera's case, although it can run active/active, this helps avoid
lock contention and prevent deadlocks. It is used in combination with
the ``httpchk`` option, which ensures that only nodes that are in sync
with their peers are allowed to handle requests.
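
A minimal sketch of such a Galera front end in ``haproxy.cfg``, with
hypothetical addresses and with an external health-check service
assumed to listen on port 9200 on each node:

.. code-block:: none

   listen galera
       bind 192.168.1.10:3306            # the virtual IP (VIP)
       option httpchk                    # only use nodes passing the check
       stick-table type ip size 2        # remember the elected back end
       stick on dst                      # pin all VIP traffic to one node
       server controller1 10.0.0.11:3306 check port 9200
       server controller2 10.0.0.12:3306 check port 9200
       server controller3 10.0.0.13:3306 check port 9200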

In general, we can divide all the OpenStack components into three
categories:

- :term:`Advanced Message Queuing Protocol (AMQP)` provides OpenStack
  internal stateful communication service.

Assuming that every single OpenStack controller runs the full set of
the elementary services (a symmetric controller), the common good
practice is to have a small odd number of controllers.
Most of the time, this means three OpenStack controllers.

[TODO: Discuss SLA (Service Level Agreement), if this is the measure we use.
Other possibilities include MTTR (Mean Time To Recover),
RTO (Recovery Time Objective),
and ETR (Expected Time of Repair).]

Network components
~~~~~~~~~~~~~~~~~~

See [TODO link] for more information about configuring networking
for high availability.

Common deployment architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are primarily two HA architectures in use today.

One uses a cluster manager such as Pacemaker or Veritas to co-ordinate
the actions of the various services across a set of machines. Since
we are focused on FOSS, we will refer to this as the Pacemaker
architecture.

The other is optimized for active/active services that do not require
any inter-machine coordination. In this setup, services are started by
your init system (systemd in most modern distributions) and a tool is
used to move IP addresses between the hosts. The most common package
for doing this is keepalived.

.. toctree::
   :maxdepth: 1

   intro-ha-arch-pacemaker.rst
   intro-ha-arch-keepalived.rst

Load balancing (HAProxy)
~~~~~~~~~~~~~~~~~~~~~~~~