Update the overview of keepalived and pacemaker HA

High level descriptions of the keepalived and pacemaker HA architectures

Change-Id: Idb838b08d730895135ef92335777ca92b5951c54
Andrew Beekhof 2015-11-16 12:53:56 +11:00 committed by KATO Tomoyuki
parent 9ac92531d0
commit 0a4abf4db9
6 changed files with 317 additions and 11 deletions

Three binary image files added (223 KiB, 215 KiB, and 52 KiB).

@ -0,0 +1,96 @@
============================
The keepalived architecture
============================
High availability strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following diagram shows a very simplified view of the different
strategies used to achieve high availability for the OpenStack
services:
.. image:: /figures/keepalived-arch.jpg
:width: 100%
Depending on the method used to communicate with the service, one of
the following availability strategies is used:
- Keepalived, for the HAProxy instances (see the configuration sketch
  below).
- Access via an HAProxy virtual IP, for services such as HTTPd that
  are accessed via a TCP socket that can be load balanced.
- Built-in application clustering, when available from the application.
Galera is one example of this.
- Starting up one instance of the service on several controller nodes,
when they can coexist and coordinate by other means. RPC in
``nova-conductor`` is one example of this.
- No high availability, when the service can only work in
active/passive mode.
Known issues with cinder-volume currently make it advisable to run it
in active/passive mode; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
While there will be multiple neutron LBaaS agents running, each agent
manages a set of load balancers that cannot be failed over to
another node.
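To make the keepalived strategy above more concrete, the following is a
minimal sketch of a ``keepalived.conf`` VRRP instance holding the HAProxy
virtual IP. The interface name, router ID, priority, and addresses are
placeholders and will differ in a real deployment.

.. code-block:: none

   # Sketch of a keepalived.conf VRRP instance holding the HAProxy VIP.
   # The interface, virtual_router_id, priority and the VIP itself are
   # deployment-specific placeholders.
   vrrp_script check_haproxy {
       script "killall -0 haproxy"   # succeeds while an haproxy process exists
       interval 2
   }
   vrrp_instance haproxy_vip {
       state MASTER                  # the other controllers use state BACKUP
       interface eth0
       virtual_router_id 51
       priority 101                  # the highest priority wins the VRRP election
       virtual_ipaddress {
           192.0.2.10
       }
       track_script {
           check_haproxy
       }
   }

If the node holding the VIP fails, or its HAProxy process dies, VRRP
elects the next highest-priority node and the VIP moves there.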
Architecture limitations
~~~~~~~~~~~~~~~~~~~~~~~~
This architecture has some inherent limitations that should be kept in
mind during deployment and daily operations.
The following sections describe these limitations.
#. Keepalived and network partitions
In case of a network partition, there is a chance that two or
more nodes running keepalived claim to hold the same VIP, which may
lead to undesired behaviour. Since keepalived uses VRRP over
multicast to elect a master (VIP owner), a network partition in
which keepalived nodes cannot communicate will result in the VIPs
existing on two nodes. When the network partition is resolved, the
duplicate VIPs should also be resolved. Note that this network
partition problem with VRRP is a known limitation for this
architecture.
#. Cinder-volume as a single point of failure
There are currently concerns over the cinder-volume service's ability
to run as a fully active/active service. During the Mitaka
timeframe, this is being worked on; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
Thus, cinder-volume will only be running on one of the controller
nodes, even though it is configured on all of them. If the node
running cinder-volume fails, the service must be started on a
surviving controller node.
#. Neutron-lbaas-agent as a single point of failure
The current design of the neutron LBaaS agent using the HAProxy
driver does not allow high availability for the tenant load
balancers. The neutron-lbaas-agent service will be enabled and
running on all controllers, allowing for load balancers to be
distributed across all nodes. However, a controller node failure
stops all load balancers running on that node until the service
is recovered or the load balancers are manually removed and
recreated.
#. Service monitoring and recovery required
An external service monitoring infrastructure is required to check
the OpenStack service health and notify operators in case of any
failure. This architecture does not provide any facility for that,
so it would be necessary to integrate the OpenStack deployment with
any existing monitoring environment.
#. Manual recovery after a full cluster restart
Some support services used by RDO or RHEL OSP use their own form of
application clustering. Usually, these services maintain a cluster
quorum that may be lost in case of a simultaneous restart of all
cluster nodes, for example during a power outage. Each service will
require its own procedure to regain quorum.
If you find any or all of these limitations concerning, you are
encouraged to refer to the `Pacemaker HA architecture
<intro-ha-arch-pacemaker.html>`_ instead.


@ -0,0 +1,198 @@
==========================
The Pacemaker architecture
==========================
What is a cluster manager
~~~~~~~~~~~~~~~~~~~~~~~~~
At its core, a cluster is a distributed finite state machine capable
of co-ordinating the startup and recovery of inter-related services
across a set of machines.
Even a distributed and/or replicated application that is able to
survive failures on one or more machines can benefit from a
cluster manager:
#. Awareness of other applications in the stack
While SysV init replacements like systemd can provide
deterministic recovery of a complex stack of services, the
recovery is limited to one machine and lacks the context of what
is happening on other machines - context that is crucial to
determine the difference between a local failure, a clean startup,
and recovery after a total site failure.
#. Awareness of instances on other machines
Services like RabbitMQ and Galera have complicated boot-up
sequences that require co-ordination, and often serialization, of
startup operations across all machines in the cluster. This is
especially true after site-wide failure or shutdown where we must
first determine the last machine to be active.
#. A shared implementation and calculation of `quorum
<http://en.wikipedia.org/wiki/Quorum_(Distributed_Systems)>`_.
It is very important that all members of the system share the same
view of who their peers are and whether or not they are in the
majority. Failure to do this leads very quickly to an internal
`split-brain <http://en.wikipedia.org/wiki/Split-brain_(computing)>`_
state - where different parts of the system are pulling in
different and incompatible directions.
#. Data integrity through fencing (a non-responsive process does not
imply it is not doing anything)
A single application does not have sufficient context to know the
difference between failure of a machine and failure of the
application on a machine. The usual practice is to assume the
machine is dead and carry on; however, this is highly risky - a
rogue process or machine could still be responding to requests and
generally causing havoc. The safer approach is to make use of
remotely accessible power switches and/or network switches and SAN
controllers to fence (isolate) the machine before continuing.
#. Automated recovery of failed instances
While the application can still run after the failure of several
instances, it may not have sufficient capacity to serve the
required volume of requests. A cluster can automatically recover
failed instances to prevent additional load-induced failures.
For this reason, the use of a cluster manager like `Pacemaker
<http://clusterlabs.org>`_ is highly recommended.
Deployment flavors
~~~~~~~~~~~~~~~~~~
It is possible to deploy three different flavors of the Pacemaker
architecture. The two extremes are **Collapsed** (where every
component runs on every node) and **Segregated** (where every
component runs in its own 3+ node cluster).
Regardless of which flavor you choose, it is recommended that the
clusters contain at least three nodes so that we can take advantage of
`quorum <quorum_>`_.
Quorum becomes important when a failure causes the cluster to split in
two or more partitions. In this situation, you want the majority to
ensure the minority are truly dead (through fencing) and continue to
host resources. For a two-node cluster, no side has the majority and
you can end up in a situation where both sides fence each other, or
both sides are running the same services - leading to data corruption.
Clusters with an even number of hosts suffer from similar issues - a
single network failure could easily cause an N:N split where neither
side retains a majority. For this reason, we recommend an odd number
of cluster members when scaling up.
You can have up to 16 cluster members (this is currently limited by
the ability of corosync to scale higher). In extreme cases, 32 and
even up to 64 nodes could be possible; however, this is not well tested.
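As an illustration of how the membership and quorum described above are
expressed, the following is a minimal sketch of the ``nodelist`` and
``quorum`` sections of ``corosync.conf`` for a three-node cluster. The
controller host names are placeholders.

.. code-block:: none

   # Sketch of the membership and quorum sections of corosync.conf for a
   # three-node cluster; the controller host names are placeholders.
   nodelist {
       node {
           ring0_addr: controller1
           nodeid: 1
       }
       node {
           ring0_addr: controller2
           nodeid: 2
       }
       node {
           ring0_addr: controller3
           nodeid: 3
       }
   }
   quorum {
       # votequorum gives each node one vote; a partition keeps hosting
       # resources only while it holds a majority of the votes (2 of 3 here)
       provider: corosync_votequorum
   }

With three votes in total, a single isolated node loses quorum, while the
remaining pair retains the majority and continues to host resources.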
Collapsed
---------
In this configuration, there is a single cluster of 3 or more
nodes on which every component is running.
This scenario has the advantage of requiring far fewer, if more
powerful, machines. Additionally, being part of a single cluster
allows us to accurately model the ordering dependencies between
components.
This scenario can be visualized as below.
.. image:: /figures/Cluster-deployment-collapsed.png
:width: 100%
You would choose this option if you prefer to have fewer but more
powerful boxes.
This is the most common option and the one we document here.
Segregated
----------
In this configuration, each service runs in a dedicated cluster of
3 or more nodes.
The benefits of this approach are the physical isolation between
components and the ability to add capacity to specific components.
You would also choose this option if you prefer to have more but
less powerful boxes.
This scenario can be visualized as below, where each box
represents a cluster of three or more guests.
.. image:: /figures/Cluster-deployment-segregated.png
:width: 100%
Mixed
-----
It is also possible to follow a segregated approach for one or more
components that are expected to be a bottleneck and use a collapsed
approach for the remainder.
Proxy server
~~~~~~~~~~~~
Almost all services in this stack benefit from being proxied.
Using a proxy server provides:
#. Load distribution
Many services can act in an active/active capacity; however, they
usually require an external mechanism for distributing requests to
one of the available instances. The proxy server can serve this
role.
#. API isolation
By sending all API access through the proxy, we can clearly
identify service interdependencies. We can also move them to
locations other than ``localhost`` to increase capacity if the
need arises.
#. Simplified process for adding/removing nodes
Since all API access is directed to the proxy, adding or removing
nodes has no impact on the configuration of other services. This
can be very useful in upgrade scenarios where an entirely new set
of machines can be configured and tested in isolation before
telling the proxy to direct traffic there instead.
#. Enhanced failure detection
The proxy can be configured as a secondary mechanism for detecting
service failures. It can even be configured to look for nodes in
a degraded state (such as being 'too far' behind in the
replication) and take them out of circulation.
The following components are currently unable to benefit from the use
of a proxy server:
* RabbitMQ
* Memcached
* MongoDB
However, the reasons vary and are discussed under each component's
heading.
We recommend HAProxy as the load balancer; however, there are many
alternatives in the marketplace.
We use a check interval of 1 second; however, the timeouts vary by service.
Generally, we use round-robin to distribute load amongst instances of
active/active services; however, Galera uses the ``stick-table`` options
to ensure that incoming connections to the virtual IP (VIP) are
directed to only one of the available back ends.
In Galera's case, although it can run active/active, this helps avoid
lock contention and prevent deadlocks. It is used in combination with
the ``httpchk`` option, which ensures that only nodes in sync with their
peers are allowed to handle requests.
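As a sketch of how these options fit together, an HAProxy section for
Galera along the lines described above might look like the following.
The addresses, ports, and the ``clustercheck``-style health-check
listener assumed on port 9200 are illustrative values, not requirements.

.. code-block:: none

   # Sketch of an HAProxy section for Galera: httpchk health checks plus a
   # stick table so that all connections to the VIP land on one back end.
   # Addresses and the port-9200 clustercheck listener are assumptions.
   listen galera_cluster
       bind 192.0.2.10:3306
       mode tcp
       option httpchk                  # poll the clustercheck HTTP listener
       stick-table type ip size 2
       stick on dst                    # pin the VIP to a single back end
       server controller1 10.0.0.11:3306 check port 9200 inter 1s rise 2 fall 5
       server controller2 10.0.0.12:3306 check port 9200 inter 1s rise 2 fall 5 backup
       server controller3 10.0.0.13:3306 check port 9200 inter 1s rise 2 fall 5 backup

With ``stick on dst``, every connection to the VIP is sent to the same
Galera node, which avoids the cross-node lock contention mentioned above;
the ``backup`` servers only take over when the active node fails its
health check.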


@ -23,17 +23,6 @@ In general we can divide all the OpenStack components into three categories:
- :term:`Advanced Message Queuing Protocol (AMQP)` provides OpenStack
internal stateful communication service.
Assuming that every single OpenStack controller runs the full set of
the elementary services (symmetric controller), the common good practice
is to have a small odd number of controllers.
Most of the time, it means three OpenStack controllers.
[TODO Discuss SLA (Service Level Agreement), if this is the measure we use.
Other possibilities include MTTR (Mean Time To Recover),
RTO (Recovery Time Objective),
and ETR (Expected Time of Repair).]
Network components
~~~~~~~~~~~~~~~~~~
@ -50,6 +39,29 @@ and expected SLA.]
See [TODO link] for more information about configuring networking
for high availability.
Common deployment architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are primarily two HA architectures in use today.
One uses a cluster manager such as Pacemaker or Veritas to co-ordinate
the actions of the various services across a set of machines. Since
we are focused on FOSS, we will refer to this as the Pacemaker
architecture.
The other is optimized for Active/Active services that do not require
any inter-machine coordination. In this setup, services are started by
your init system (systemd in most modern distributions) and a tool is
used to move IP addresses between the hosts. The most common package
for doing this is keepalived.
.. toctree::
:maxdepth: 1
intro-ha-arch-pacemaker.rst
intro-ha-arch-keepalived.rst
Load balancing (HAProxy)
~~~~~~~~~~~~~~~~~~~~~~~~