Merge "[HA Guide] Update for the current predominant architectures"
commit 73286dd87e

Binary file not shown (before: 52 KiB).
@@ -1,96 +0,0 @@
============================
The keepalived architecture
============================

High availability strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following diagram shows a very simplified view of the different
strategies used to achieve high availability for the OpenStack
services:

.. image:: /figures/keepalived-arch.jpg
   :width: 100%

Depending on the method used to communicate with the service, one of
the following availability strategies is used (a configuration sketch
follows this list):

- Keepalived, for the HAProxy instances.
- Access via an HAProxy virtual IP, for services such as HTTPd that
  are accessed via a TCP socket that can be load balanced.
- Built-in application clustering, when available from the
  application. Galera is one example of this.
- Starting up one instance of the service on several controller nodes,
  when they can coexist and coordinate by other means. RPC in
  ``nova-conductor`` is one example of this.
- No high availability, when the service can only work in
  active/passive mode.
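
The following is only an illustrative sketch of the first two
strategies; the interface name, addresses, and service shown are
placeholders rather than values prescribed by this guide. Keepalived
keeps a virtual IP alive on one of the HAProxy nodes, and HAProxy
balances a TCP service across the controllers behind that VIP:

.. code-block:: none

   # /etc/keepalived/keepalived.conf (sketch)
   vrrp_instance haproxy_vip {
       state MASTER
       interface eth0
       virtual_router_id 51
       priority 101
       advert_int 1
       virtual_ipaddress {
           192.0.2.10
       }
   }

   # /etc/haproxy/haproxy.cfg (sketch)
   listen dashboard
       bind 192.0.2.10:443
       balance source
       server controller1 192.0.2.11:443 check inter 2000 rise 2 fall 5
       server controller2 192.0.2.12:443 check inter 2000 rise 2 fall 5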

There are known issues with cinder-volume that recommend running it as
active/passive for now; see:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

While there will be multiple neutron LBaaS agents running, each agent
manages a set of load balancers that cannot be failed over to another
node.

Architecture limitations
~~~~~~~~~~~~~~~~~~~~~~~~

This architecture has some inherent limitations that should be kept in
mind during deployment and daily operations.
The following list describes these limitations.

#. Keepalived and network partitions

   In the case of a network partition, there is a chance that two or
   more nodes running keepalived claim to hold the same VIP, which may
   lead to undesired behaviour. Since keepalived uses VRRP over
   multicast to elect a master (VIP owner), a network partition in
   which keepalived nodes cannot communicate will result in the VIPs
   existing on two nodes. When the network partition is resolved, the
   duplicate VIPs should also be resolved. Note that this network
   partition problem with VRRP is a known limitation for this
   architecture. (A simple check for duplicate VIPs is sketched after
   this list.)

#. Cinder-volume as a single point of failure

   There are currently concerns over the ability of the cinder-volume
   service to run as a fully active/active service. During the Mitaka
   timeframe, this is being worked on; see:
   https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
   Thus, cinder-volume will only be running on one of the controller
   nodes, even if it is configured on all of them. If the node running
   cinder-volume fails, the service should be started on a surviving
   controller node.

#. Neutron-lbaas-agent as a single point of failure

   The current design of the neutron LBaaS agent using the HAProxy
   driver does not allow high availability for the project load
   balancers. The neutron-lbaas-agent service will be enabled and
   running on all controllers, allowing load balancers to be
   distributed across all nodes. However, a controller node failure
   stops all load balancers running on that node until the service is
   recovered or the load balancer is manually removed and created
   again.

#. Service monitoring and recovery required

   An external service monitoring infrastructure is required to check
   the OpenStack service health and to notify operators in case of any
   failure. This architecture does not provide any facility for that,
   so it is necessary to integrate the OpenStack deployment with an
   existing monitoring environment.

#. Manual recovery after a full cluster restart

   Some support services used by RDO or RHEL OSP use their own form of
   application clustering. Usually, these services maintain a cluster
   quorum that may be lost in the case of a simultaneous restart of
   all cluster nodes, for example during a power outage. Each service
   will require its own procedure to regain quorum.
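
As a quick way to spot the duplicate-VIP situation described in the
first limitation above, check which nodes currently hold the virtual
IP address. This is only an illustrative check; the interface name and
address are placeholders:

.. code-block:: console

   # Run on every controller node; more than one node reporting the
   # address indicates a VRRP split brain that needs attention.
   $ ip addr show dev eth0 | grep 192.0.2.10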

If you find any or all of these limitations concerning, you are
encouraged to refer to the
:doc:`Pacemaker HA architecture<intro-ha-arch-pacemaker>` instead.

@@ -42,21 +42,37 @@ Networking for high availability.
 Common deployment architectures
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-There are primarily two HA architectures in use today.
+There are primarily two recommended architectures for making OpenStack
+highly available.

-One uses a cluster manager such as Pacemaker or Veritas to co-ordinate
-the actions of the various services across a set of machines. Since
-we are focused on FOSS, we will refer to this as the Pacemaker
-architecture.
+Both use a cluster manager such as Pacemaker or Veritas to
+orchestrate the actions of the various services across a set of
+machines. Since we are focused on FOSS, we will refer to these as
+Pacemaker architectures.

-The other is optimized for Active/Active services that do not require
-any inter-machine coordination. In this setup, services are started by
-your init system (systemd in most modern distributions) and a tool is
-used to move IP addresses between the hosts. The most common package
-for doing this is keepalived.
+The architectures differ in the sets of services managed by the
+cluster.
+
+Traditionally, Pacemaker has been positioned as an all-encompassing
+solution. However, as OpenStack services have matured, they are
+increasingly able to run in an active/active configuration and
+gracefully tolerate the disappearance of the APIs on which they
+depend.
+
+With this in mind, some vendors are restricting Pacemaker's use to
+services that must operate in an active/passive mode (such as
+cinder-volume), those with multiple states (for example, Galera), and
+those with complex bootstrapping procedures (such as RabbitMQ).
+
+The majority of services, needing no real orchestration, are handled
+by systemd on each node. This approach avoids the need to coordinate
+service upgrades or location changes with the cluster and has the
+added advantage of more easily scaling beyond Corosync's 16-node
+limit. However, it will generally require the addition of an
+enterprise monitoring solution such as Nagios or Sensu for those
+wanting centralized failure reporting.

 .. toctree::
    :maxdepth: 1

    intro-ha-arch-pacemaker.rst
-   intro-ha-arch-keepalived.rst
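
To make the division of labour described in the updated text concrete,
here is a minimal, illustrative sketch of delegating an active/passive
service to Pacemaker with the ``pcs`` command-line tool while leaving a
stateless API service to systemd. The unit and resource names are
assumptions based on typical RDO packaging, not values mandated by the
guide:

.. code-block:: console

   # Active/passive service handed to the cluster manager
   $ sudo pcs resource create cinder-volume systemd:openstack-cinder-volume

   # Stateless API service left to systemd on every controller
   $ sudo systemctl enable openstack-glance-api
   $ sudo systemctl start openstack-glance-api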