Merge "Introduction to ha services on the controller"

=====================================
Installing high availability packages
=====================================

[TODO -- write intro to this section]

.. toctree::
   :maxdepth: 2

========================================
Overview of highly available controllers
========================================

A highly available OpenStack environment
must have a controller cluster with three or more nodes.
The following components are normally included in the cluster.

[TODO Discuss SLA (Service Level Agreement), if this is the measure we use.
Other possibilities include MTTR (Mean Time To Recover),
RTO (Recovery Time Objective),
and ETR (Expected Time of Repair).]

Network components
------------------

[TODO Need discussion of network hardware, bonding interfaces,
intelligent Layer 2 switches, routers, and Layer 3 switches.]

The configuration uses static routing;
Virtual Router Redundancy Protocol (VRRP)
or similar techniques are not implemented.

[TODO Need description of VIP failover inside Linux namespaces
and expected SLA.]

See [TODO link] for more information about configuring networking
for high availability.

Load Balancing (HAProxy)
------------------------

HAProxy is a load balancer that runs on each controller in the cluster
but does not synchronize state between the instances.
Each instance of HAProxy configures its front end to accept connections
only on the Virtual IP (VIP) address and configures its back end
as a list of all the instances of the corresponding load-balanced service,
for example, an OpenStack API service.
This makes the HAProxy instances act independently,
fail over transparently together with the network endpoints
(VIP addresses), and share the same SLA.

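As an illustration only, a minimal HAProxy front end and back end pair
for a single API service might look like the following sketch;
the VIP address, ports, and controller addresses are hypothetical
placeholders, not values prescribed by this guide:

.. code-block:: none

   frontend keystone-api
       # Accept connections only on the VIP address.
       bind 192.0.2.10:5000
       default_backend keystone-api-nodes

   backend keystone-api-nodes
       # The back end lists every instance of the service.
       balance roundrobin
       server controller1 10.0.0.11:5000 check
       server controller2 10.0.0.12:5000 check
       server controller3 10.0.0.13:5000 check
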
See [TODO link] for information about configuring HAProxy.

Database (MySQL/Galera)
-----------------------

MySQL with Galera can be configured
using one of the following strategies:

- Each instance has its own IP address;
  OpenStack services are configured with the list of these IP addresses
  so they can select one of the addresses from those available.

- The MySQL/Galera cluster runs behind HAProxy.
  HAProxy load balances incoming requests
  and exposes just one IP address for all the clients.

  Galera synchronous replication guarantees zero slave lag.
  The failover procedure completes once HAProxy detects
  that the active back end has gone down and switches to the backup one,
  which is then marked as 'UP'.
  If no back ends are up (in other words,
  the Galera cluster is not ready to accept connections),
  the failover procedure finishes only when
  the Galera cluster has been successfully reassembled.
  The SLA is normally no more than 5 minutes.

- Use MySQL/Galera in active/passive mode
  to avoid deadlocks on ``SELECT ... FOR UPDATE`` type queries
  (used, for example, by nova and neutron),
  as shown in the sketch after this list.
  This issue is discussed in more detail in the following:

  - http://lists.openstack.org/pipermail/openstack-dev/2014-May/035264.html
  - http://www.joinfu.com/

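As a hedged sketch of the active/passive strategy, an HAProxy section for
Galera might look like the following; it assumes the commonly used
``clustercheck`` health-check script listening on port 9200, and all
addresses are hypothetical placeholders. Only one back end receives
traffic; the others are marked ``backup``:

.. code-block:: none

   listen galera
       bind 192.0.2.10:3306
       balance source
       option httpchk
       # Active/passive: controller1 takes all traffic; the backup
       # servers are used only after the active back end goes down.
       server controller1 10.0.0.11:3306 check port 9200
       server controller2 10.0.0.12:3306 check port 9200 backup
       server controller3 10.0.0.13:3306 check port 9200 backup
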
AMQP (RabbitMQ)
---------------

RabbitMQ nodes fail over both on the application
and the infrastructure layers.
The application layer is controlled by
the ``oslo.messaging`` configuration options
for multiple AMQP hosts. If an AMQP node fails,
the application reconnects to the next one configured
within the specified reconnect interval.
The specified reconnect interval constitutes its SLA.
On the infrastructure layer,
the SLA is the time it takes the RabbitMQ cluster to reassemble.
Several cases are possible.
The Mnesia keeper node is the master
of the corresponding Pacemaker resource for RabbitMQ;
when it fails, the result is a full AMQP cluster downtime interval.
Normally, its SLA is no more than several minutes.
Failure of another node that is a slave
of the corresponding Pacemaker resource for RabbitMQ
results in no AMQP cluster downtime at all.

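As a hedged sketch of the application-layer settings described above,
the RabbitMQ driver options for ``oslo.messaging`` could be set in a
service configuration file as follows; the host addresses are
hypothetical placeholders:

.. code-block:: ini

   [DEFAULT]
   # All RabbitMQ cluster members; on failure, the client
   # reconnects to the next host in the list.
   rabbit_hosts = 10.0.0.11:5672,10.0.0.12:5672,10.0.0.13:5672
   # Reconnect interval in seconds; this interval is what the
   # text above describes as the application-layer SLA.
   rabbit_retry_interval = 1
   rabbit_retry_backoff = 2
   # Mirror queues across the RabbitMQ cluster nodes.
   rabbit_ha_queues = true
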
Memcached back end
------------------

Memcached is a memory cache daemon that can be used
by most OpenStack services to store ephemeral data, such as tokens.
Although Memcached does not support
typical forms of redundancy such as clustering,
OpenStack services can use almost any number of instances
by configuring multiple hostnames or IP addresses.
The Memcached client implements hashing
to balance objects among the instances.
Failure of an instance impacts only a percentage of the objects,
and the client automatically removes the failed instance
from the list of instances.
The SLA is several minutes.

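As a hedged illustration of configuring multiple instances, many
OpenStack services accept a list of Memcached servers in their
configuration file; the option name and addresses below are
representative placeholders, so check the specific service's
documentation:

.. code-block:: ini

   [DEFAULT]
   # All Memcached instances; the client hashes objects across
   # this list and drops an instance when it stops responding.
   memcached_servers = 10.0.0.11:11211,10.0.0.12:11211,10.0.0.13:11211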