Configuring the stateful services
Database for high availability
Galera
The first step is to install the database that sits at the heart of the cluster. To implement high availability, run an instance of the database on each controller node and use Galera Cluster to provide replication between them. Galera Cluster is a synchronous multi-master database cluster, based on MySQL and the InnoDB storage engine. It is a high-availability service that provides high system uptime, no data loss, and scalability for growth.
You can achieve high availability for the OpenStack database in many different ways, depending on the type of database that you want to use. There are three implementations of Galera Cluster available to you:
- Galera Cluster for MySQL: The MySQL reference implementation from Codership, Oy.
- MariaDB Galera Cluster: The MariaDB implementation of Galera Cluster, which is commonly supported in environments based on Red Hat distributions.
- Percona XtraDB Cluster: The XtraDB implementation of Galera Cluster from Percona.
In addition to Galera Cluster, you can also achieve high availability through other database options, such as PostgreSQL, which has its own replication system.
Pacemaker active/passive with HAproxy
Replicated storage
For example: DRBD
Shared storage
Messaging service for high availability
RabbitMQ
An AMQP (Advanced Message Queuing Protocol) compliant message bus is required for most OpenStack components in order to coordinate the execution of jobs entered into the system.
The most popular AMQP implementation used in OpenStack installations is RabbitMQ.
RabbitMQ nodes fail over on the application and the infrastructure layers.
The application layer is controlled by the oslo.messaging configuration options for multiple AMQP hosts. If the AMQP node fails, the application reconnects to the next one configured within the specified reconnect interval. The specified reconnect interval constitutes its SLA.
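As a rough illustration of the application-layer failover described above, the following Python sketch cycles through a list of hosts and waits a fixed interval between attempts. It is not the oslo.messaging implementation. The names `connect_with_failover`, `fake_connect`, and `max_attempts` are hypothetical, and the brokers are simulated so that the example runs without RabbitMQ.

```python
import itertools
import time


def connect_with_failover(hosts, try_connect, retry_interval=1.0,
                          max_attempts=0, sleep=time.sleep):
    """Cycle through hosts until try_connect(host) succeeds.

    max_attempts=0 means keep trying forever (an assumption of this
    sketch, chosen to resemble an "infinite retries" setting).
    """
    attempts = 0
    for host in itertools.cycle(hosts):
        try:
            return try_connect(host)
        except ConnectionError:
            attempts += 1
            if max_attempts and attempts >= max_attempts:
                raise
            sleep(retry_interval)  # wait, then move on to the next host


# Simulated brokers: rabbit1 is down, rabbit2 accepts connections.
def fake_connect(host):
    if host == "rabbit1:5672":
        raise ConnectionError(host + " is unreachable")
    return "connected to " + host


waits = []  # record the waits instead of actually sleeping
result = connect_with_failover(
    ["rabbit1:5672", "rabbit2:5672"], fake_connect,
    retry_interval=1.0, sleep=waits.append)
print(result)
```

In this sketch the client waits one retry interval after each failed host, so the interval bounds how long it takes to notice a failed node and move on.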
On the infrastructure layer, the SLA is the time that the RabbitMQ cluster takes to reassemble. Several cases are possible. The Mnesia keeper node is the master of the corresponding Pacemaker resource for RabbitMQ. When it fails, the result is a full AMQP cluster downtime interval. Normally, its SLA is no more than several minutes. Failure of another node that is a slave of the corresponding Pacemaker resource for RabbitMQ results in no AMQP cluster downtime at all.
Making the RabbitMQ service highly available involves the following steps:
- Install RabbitMQ
- Configure RabbitMQ for HA queues
- Configure OpenStack services to use RabbitMQ HA queues
Note
Access to RabbitMQ is not normally handled by HAProxy. Instead, consumers must be supplied with the full list of hosts running RabbitMQ through rabbit_hosts and must turn on the rabbit_ha_queues option. For more information, read the core issue. For more detail, read the history and solution.
Install RabbitMQ
The commands for installing RabbitMQ are specific to the Linux distribution you are using.
For Ubuntu or Debian:
# apt-get install rabbitmq-server
For RHEL, Fedora, or CentOS:
# yum install rabbitmq-server
For openSUSE:
# zypper install rabbitmq-server
For SLES 12:
# zypper addrepo -f obs://Cloud:OpenStack:Kilo/SLE_12 Kilo
[Verify the fingerprint of the imported GPG key. See below.]
# zypper install rabbitmq-server
Note
For SLES 12, the packages are signed by GPG key 893A90DAD85F9316. You should verify the fingerprint of the imported GPG key before using it.
Key ID: 893A90DAD85F9316
Key Name: Cloud:OpenStack OBS Project <Cloud:OpenStack@build.opensuse.org>
Key Fingerprint: 35B34E18ABC1076D66D5A86B893A90DAD85F9316
Key Created: Tue Oct 8 13:34:21 2013
Key Expires: Thu Dec 17 13:34:21 2015
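The fingerprint comparison can be done by eye, but a small helper avoids mistakes caused by spacing or letter case. The sketch below is only an illustration: `normalize` and `fingerprint_matches` are hypothetical helper names, and the `reported` string stands in for whatever your package tool prints.

```python
# Values copied from the note above.
EXPECTED_FINGERPRINT = "35B34E18ABC1076D66D5A86B893A90DAD85F9316"
EXPECTED_KEY_ID = "893A90DAD85F9316"


def normalize(fingerprint):
    """Strip spaces and colons and upper-case the hex digits."""
    return fingerprint.replace(" ", "").replace(":", "").upper()


def fingerprint_matches(reported, expected=EXPECTED_FINGERPRINT):
    """Compare two fingerprints, ignoring grouping and letter case."""
    return normalize(reported) == normalize(expected)


# The 16-digit key ID is the tail of the 40-digit fingerprint.
key_id_consistent = normalize(EXPECTED_FINGERPRINT).endswith(EXPECTED_KEY_ID)

# Hypothetical tool output, grouped in blocks of four digits.
reported = "35b3 4e18 abc1 076d 66d5  a86b 893a 90da d85f 9316"
print(key_id_consistent, fingerprint_matches(reported))
```

Compare the full fingerprint rather than only the key ID, because the key ID is just a short suffix of it.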
For more information, see the official installation manual for the distribution:
- Debian and Ubuntu
- RPM based (RHEL, Fedora, CentOS, openSUSE)
Configure RabbitMQ for HA queues
The following components/services can work with HA queues:
- OpenStack Compute
- OpenStack Block Storage
- OpenStack Networking
- Telemetry
Consider that, while exchanges and bindings survive the loss of individual nodes, queues and their messages do not because a queue and its contents are located on one node. If we lose this node, we also lose the queue.
Mirrored queues in RabbitMQ improve the availability of the service because they are resilient to node failures.
Production environments should run at least three RabbitMQ servers. For testing and demonstration purposes, however, it is possible to run only two servers. In this section, we configure two nodes, called rabbit1 and rabbit2. To build a broker, ensure that all nodes have the same Erlang cookie file.
Stop RabbitMQ and copy the cookie from the first node to each of the other node(s):
# scp /var/lib/rabbitmq/.erlang.cookie root@NODE:/var/lib/rabbitmq/.erlang.cookie
On each target node, verify the correct owner, group, and permissions of the file .erlang.cookie:
# chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
# chmod 400 /var/lib/rabbitmq/.erlang.cookie
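To double-check the permission bits after running chmod, a short script can read them back. The sketch below is a minimal illustration that assumes a POSIX system. `cookie_mode_ok` is a hypothetical helper, and the demonstration uses a temporary stand-in file rather than the real cookie. It does not check the owner or group.

```python
import os
import stat
import tempfile


def cookie_mode_ok(path, expected_mode=0o400):
    """Return True if the permission bits of path equal expected_mode."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected_mode


# Demonstrate on a temporary stand-in file instead of the real cookie.
with tempfile.TemporaryDirectory() as tmp:
    demo_cookie = os.path.join(tmp, ".erlang.cookie")
    with open(demo_cookie, "w") as handle:
        handle.write("EXAMPLECOOKIE")
    os.chmod(demo_cookie, 0o644)
    before = cookie_mode_ok(demo_cookie)  # too permissive
    os.chmod(demo_cookie, 0o400)
    after = cookie_mode_ok(demo_cookie)   # read-only for the owner

print(before, after)
```

On a real node you would call `cookie_mode_ok("/var/lib/rabbitmq/.erlang.cookie")` as a user that is allowed to stat the file.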
Start the message queue service on all nodes and configure it to start when the system boots. On Ubuntu, it is configured by default.
On CentOS, RHEL, openSUSE, and SLES:
# systemctl enable rabbitmq-server.service
# systemctl start rabbitmq-server.service
Verify that the nodes are running:
# rabbitmqctl cluster_status
Cluster status of node rabbit@NODE...
[{nodes,[{disc,[rabbit@NODE]}]},
 {running_nodes,[rabbit@NODE]},
 {partitions,[]}]
...done.
Run the following commands on each node except the first one:
# rabbitmqctl stop_app
Stopping node rabbit@NODE...
...done.
# rabbitmqctl join_cluster --ram rabbit@rabbit1
# rabbitmqctl start_app
Starting node rabbit@NODE ...
...done.
Note
The default node type is a disc node. In this guide, nodes join the cluster as RAM nodes.
Verify the cluster status:
# rabbitmqctl cluster_status
Cluster status of node rabbit@NODE...
[{nodes,[{disc,[rabbit@rabbit1]},{ram,[rabbit@NODE]}]}, \
 {running_nodes,[rabbit@NODE,rabbit@rabbit1]}]
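If you want to check the status output from a script, one option is to extract the running_nodes list with a regular expression. The sketch below assumes the output format shown above, with unquoted node names. Other RabbitMQ versions may print the status differently. The function names and the extra node rabbit@rabbit3 are hypothetical.

```python
import re

# Sample text modelled on the output shown above.
STATUS = """Cluster status of node rabbit@NODE...
[{nodes,[{disc,[rabbit@rabbit1]},{ram,[rabbit@NODE]}]},
 {running_nodes,[rabbit@NODE,rabbit@rabbit1]}]"""


def running_nodes(status_text):
    """Return the node names listed in the running_nodes tuple."""
    match = re.search(r"\{running_nodes,\[([^\]]*)\]\}", status_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",")
            if name.strip()]


def missing_nodes(status_text, expected):
    """Return the expected nodes that are not reported as running."""
    return sorted(set(expected) - set(running_nodes(status_text)))


nodes = running_nodes(STATUS)
missing = missing_nodes(
    STATUS, ["rabbit@rabbit1", "rabbit@NODE", "rabbit@rabbit3"])
print(nodes)
print(missing)
```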
If the cluster is working, you can create usernames and passwords for the queues.
To ensure that all queues except those with auto-generated names are mirrored across all running nodes, set the ha-mode policy key to all by running the following command on one of the nodes:
# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
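The pattern in the policy uses a negative lookahead to skip names that begin with amq., the prefix of auto-generated names. The sketch below tries the same pattern against a few queue names using Python's re module. RabbitMQ evaluates the pattern with its own regular expression engine, so treat this only as an approximation. The queue names are made up for illustration.

```python
import re

# Same pattern as in the set_policy command above.
POLICY_PATTERN = re.compile(r"^(?!amq\.).*")

# Hypothetical queue names.
queue_names = [
    "notifications.info",
    "compute.node1",
    "amq.gen-abc123",   # auto-generated style name, should be skipped
    "amqp_reply",       # starts with "amq" but not "amq.", so it matches
]

mirrored = [name for name in queue_names if POLICY_PATTERN.match(name)]
print(mirrored)
```

Only the name that starts with the literal prefix amq. is excluded. A name such as amqp_reply still falls under the policy.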
More information is available in the RabbitMQ documentation:
Note
As another option to make RabbitMQ highly available, RabbitMQ has included OCF scripts for the Pacemaker cluster resource agents since version 3.5.7. They provide an active/active RabbitMQ cluster with mirrored queues. For more information, see Auto-configuration of a cluster with a Pacemaker.
Configure OpenStack services to use RabbitMQ HA queues
Configure the OpenStack components to use at least two RabbitMQ nodes.
Use these steps to configure all services that use RabbitMQ:
RabbitMQ HA cluster host:port pairs:
rabbit_hosts=rabbit1:5672,rabbit2:5672,rabbit3:5672
How frequently to retry connecting with RabbitMQ:
rabbit_retry_interval=1
How long to back-off for between retries when connecting to RabbitMQ:
rabbit_retry_backoff=2
Maximum number of retries when trying to connect to RabbitMQ (the default value of 0 means infinite):
rabbit_max_retries=0
Use durable queues in RabbitMQ:
rabbit_durable_queues=true
Use HA queues in RabbitMQ (x-ha-policy: all):
rabbit_ha_queues=true
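The options above can be collected into a single configuration fragment. The following sketch renders them with Python's configparser and reads them back as a basic sanity check. The section name oslo_messaging_rabbit is an assumption, because this guide does not say which section the options belong to. Check the configuration reference for your release before copying the output into a service configuration file.

```python
import configparser
import io

# Assumption: the section name is not given in this guide.
SECTION = "oslo_messaging_rabbit"

# Option values exactly as listed in the steps above.
options = {
    "rabbit_hosts": "rabbit1:5672,rabbit2:5672,rabbit3:5672",
    "rabbit_retry_interval": "1",
    "rabbit_retry_backoff": "2",
    "rabbit_max_retries": "0",
    "rabbit_durable_queues": "true",
    "rabbit_ha_queues": "true",
}

config = configparser.ConfigParser()
config[SECTION] = options

# Render the fragment in key=value form, as used in this guide.
buffer = io.StringIO()
config.write(buffer, space_around_delimiters=False)
rendered = buffer.getvalue()
print(rendered)

# Read it back and check the two settings that HA queues depend on.
parsed = configparser.ConfigParser()
parsed.read_string(rendered)
hosts = parsed[SECTION]["rabbit_hosts"].split(",")
ha_enabled = parsed.getboolean(SECTION, "rabbit_ha_queues")
print(len(hosts), ha_enabled)
```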
Note
If you change the configuration from an old set-up that did not use HA queues, restart the service:
# rabbitmqctl stop_app
# rabbitmqctl reset
# rabbitmqctl start_app