=======================
Pacemaker cluster stack
=======================

The `Pacemaker <http://clusterlabs.org/>`_ cluster stack is a state-of-the-art
high availability and load balancing stack for the Linux platform.
Pacemaker is used to make OpenStack infrastructure highly available.

.. note::

   Pacemaker is storage and application-agnostic, and in no way specific
   to OpenStack.

Pacemaker relies on the
`Corosync <https://corosync.github.io/corosync/>`_ messaging layer
for reliable cluster communications. Corosync implements the Totem single-ring
ordering and membership protocol. It also provides UDP and InfiniBand based
messaging, quorum, and cluster membership to Pacemaker.

Pacemaker does not inherently understand the applications it manages.
Instead, it relies on resource agents (RAs): scripts that encapsulate
the knowledge of how to start, stop, and check the health of each application
managed by the cluster.

These agents must conform to one of the `OCF <https://github.com/ClusterLabs/
OCF-spec/blob/master/ra/resource-agent-api.md>`_,
`SysV Init <http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/
LSB-Core-generic/iniscrptact.html>`_, Upstart, or Systemd standards.

Pacemaker ships with a large set of OCF agents (such as those managing
MySQL databases, virtual IP addresses, and RabbitMQ), but can also use
any agents already installed on your system and can be extended with
your own (see the
`developer guide <http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html>`_).

The steps to implement the Pacemaker cluster stack are:

- :ref:`pacemaker-install`
- :ref:`pacemaker-corosync-setup`
- :ref:`pacemaker-corosync-start`
- :ref:`pacemaker-start`
- :ref:`pacemaker-cluster-properties`

.. _pacemaker-install:

Install packages
~~~~~~~~~~~~~~~~

On any host that is meant to be part of a Pacemaker cluster, establish cluster
communications through the Corosync messaging layer.
This involves installing the following packages (and their dependencies, which
your package manager usually installs automatically):

- `pacemaker`
- `pcs` (CentOS or RHEL) or `crmsh`
- `corosync`
- `fence-agents` (CentOS or RHEL) or `cluster-glue`
- `resource-agents`
- `libqb0`
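
For example, a minimal installation sketch (the exact package names and the
package manager vary by distribution and release, so adjust as needed):

.. code-block:: console

   # yum install pacemaker pcs corosync fence-agents resource-agents

On Ubuntu, the equivalent would be along the lines of:

.. code-block:: console

   # apt-get install pacemaker crmsh corosync cluster-glue resource-agents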

.. _pacemaker-corosync-setup:

Set up the cluster with pcs
~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Make sure `pcs` is running and configured to start at boot time:

   .. code-block:: console

      $ systemctl enable pcsd
      $ systemctl start pcsd

#. Set a password for the ``hacluster`` user on each host:

   .. code-block:: console

      $ echo my-secret-password-no-dont-use-this-one \
        | passwd --stdin hacluster

   .. note::

      Since the cluster is a single administrative domain, it is
      acceptable to use the same password on all nodes.

#. Use that password to authenticate to the nodes that will
   make up the cluster:

   .. code-block:: console

      $ pcs cluster auth controller1 controller2 controller3 \
        -u hacluster -p my-secret-password-no-dont-use-this-one --force

   .. note::

      The ``-p`` option is used to give the password on the command
      line, which makes it easier to script.

#. Create and name the cluster. Then, start it and enable all components to
   auto-start at boot time:

   .. code-block:: console

      $ pcs cluster setup --force --name my-first-openstack-cluster \
        controller1 controller2 controller3
      $ pcs cluster start --all
      $ pcs cluster enable --all

   .. note::

      In Red Hat Enterprise Linux or CentOS environments, this is the
      recommended path for configuring the cluster. For more information,
      see the `RHEL docs
      <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/ch-clusteradmin-HAAR.html#s1-clustercreate-HAAR>`_.
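
Once the cluster is up, a quick sanity check (a sketch; the output format
varies by version) is to confirm that all nodes are online:

.. code-block:: console

   $ pcs status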

Set up the cluster with `crmsh`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After installing the Corosync package, you must create
the :file:`/etc/corosync/corosync.conf` configuration file.

.. note::

   For Ubuntu, you should also enable the Corosync service in the
   ``/etc/default/corosync`` configuration file.
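
   On older Ubuntu releases this typically means setting ``START=yes``
   (an assumption; check your release's packaging):

   .. code-block:: none

      # /etc/default/corosync
      START=yes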

Corosync can be configured to work with either multicast or unicast IP
addresses or to use the votequorum library.

- :ref:`corosync-multicast`
- :ref:`corosync-unicast`
- :ref:`corosync-votequorum`

.. _corosync-multicast:

Set up Corosync with multicast
------------------------------

Most distributions ship an example configuration file
(:file:`corosync.conf.example`) as part of the documentation bundled with
the Corosync package. An example Corosync configuration file is shown below:

**Example Corosync configuration file for multicast (``corosync.conf``)**

.. code-block:: none

   totem {
      version: 2

      # Time (in ms) to wait for a token (1)
      token: 10000

      # How many token retransmits before forming a new
      # configuration
      token_retransmits_before_loss_const: 10

      # Turn off the virtual synchrony filter
      vsftype: none

      # Enable encryption (2)
      secauth: on

      # How many threads to use for encryption/decryption
      threads: 0

      # This specifies the redundant ring protocol, which may be
      # none, active, or passive. (3)
      rrp_mode: active

      # The following is a two-ring multicast configuration. (4)
      interface {
         ringnumber: 0
         bindnetaddr: 10.0.0.0
         mcastaddr: 239.255.42.1
         mcastport: 5405
      }
      interface {
         ringnumber: 1
         bindnetaddr: 10.0.42.0
         mcastaddr: 239.255.42.2
         mcastport: 5405
      }
   }

   amf {
      mode: disabled
   }

   service {
      # Load the Pacemaker Cluster Resource Manager (5)
      ver: 1
      name: pacemaker
   }

   aisexec {
      user: root
      group: root
   }

   logging {
      fileline: off
      to_stderr: yes
      to_logfile: no
      to_syslog: yes
      syslog_facility: daemon
      debug: off
      timestamp: on
      logger_subsys {
         subsys: AMF
         debug: off
         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
      }
   }

Note the following:

- The ``token`` value specifies the time, in milliseconds,
  during which the Corosync token is expected
  to be transmitted around the ring.
  When this timeout expires, the token is declared lost,
  and after ``token_retransmits_before_loss_const`` lost tokens,
  the non-responding processor (cluster node) is declared dead.
  ``token × token_retransmits_before_loss_const``
  is the maximum time a node is allowed to not respond to cluster messages
  before being considered dead.
  The default for token is 1000 milliseconds (1 second),
  with 4 allowed retransmits.
  These defaults are intended to minimize failover times,
  but can cause frequent false alarms and unintended failovers
  in case of short network interruptions. The values used here are safer,
  albeit with slightly extended failover times.

- With ``secauth`` enabled,
  Corosync nodes mutually authenticate using a 128-byte shared secret
  stored in the :file:`/etc/corosync/authkey` file.
  This can be generated with the :command:`corosync-keygen` utility
  (see the sketch after this list).
  Cluster communications are encrypted when using ``secauth``.

- In Corosync, configurations with redundant networking (more than one
  interface) must select a Redundant Ring Protocol (RRP) mode other than
  ``none``. We recommend ``active`` as the RRP mode.

Note the following about the recommended interface configuration:

- Each configured interface must have a unique ``ringnumber``,
  starting with 0.

- The ``bindnetaddr`` is the network address of the interfaces to bind to.
  The example uses two network addresses of /24 IPv4 subnets.

- Multicast groups (``mcastaddr``) must not be reused
  across cluster boundaries. No two distinct clusters
  should ever use the same multicast group.
  Be sure to select multicast addresses compliant with
  `RFC 2365, "Administratively Scoped IP Multicast"
  <http://www.ietf.org/rfc/rfc2365.txt>`_.

- For firewall configurations, Corosync communicates over UDP only,
  and uses ``mcastport`` (for receives) and ``mcastport - 1`` (for sends).

- The service declaration for the Pacemaker service
  may be placed in the :file:`corosync.conf` file directly
  or in its own separate file, :file:`/etc/corosync/service.d/pacemaker`.

  .. note::

     If you are using Corosync version 2 on Ubuntu 14.04,
     remove or comment out lines under the service stanza;
     this enables Pacemaker to start up. Another potential
     problem is the boot and shutdown order of Corosync and
     Pacemaker. To force Pacemaker to start after Corosync and
     stop before Corosync, fix the start and kill symlinks manually:

     .. code-block:: console

        # update-rc.d pacemaker start 20 2 3 4 5 . stop 00 0 1 6 .

     The Pacemaker service also requires an additional
     configuration file ``/etc/corosync/uidgid.d/pacemaker``
     to be created with the following content:

     .. code-block:: none

        uidgid {
           uid: hacluster
           gid: haclient
        }

- Once created, synchronize the :file:`corosync.conf` file
  (and the :file:`authkey` file if the ``secauth`` option is enabled)
  across all cluster nodes.
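
A minimal sketch of generating the key and distributing it together with the
configuration (the hostnames are assumptions; :command:`corosync-keygen` may
take a while as it gathers entropy):

.. code-block:: console

   # corosync-keygen
   # scp /etc/corosync/authkey /etc/corosync/corosync.conf controller2:/etc/corosync/
   # scp /etc/corosync/authkey /etc/corosync/corosync.conf controller3:/etc/corosync/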

.. _corosync-unicast:

Set up Corosync with unicast
----------------------------

For environments that do not support multicast, Corosync should be configured
for unicast. An example fragment of the :file:`corosync.conf` file
for unicast is shown below:

**Corosync configuration file fragment for unicast (``corosync.conf``)**

.. code-block:: none

   totem {
      #...
      interface {
         ringnumber: 0
         bindnetaddr: 10.0.0.0
         broadcast: yes (1)
         mcastport: 5405
      }
      interface {
         ringnumber: 1
         bindnetaddr: 10.0.42.0
         broadcast: yes
         mcastport: 5405
      }
      transport: udpu (2)
   }

   nodelist { (3)
      node {
         ring0_addr: 10.0.0.12
         ring1_addr: 10.0.42.12
         nodeid: 1
      }
      node {
         ring0_addr: 10.0.0.13
         ring1_addr: 10.0.42.13
         nodeid: 2
      }
      node {
         ring0_addr: 10.0.0.14
         ring1_addr: 10.0.42.14
         nodeid: 3
      }
   }
   #...

Note the following:

- If the ``broadcast`` parameter is set to ``yes``, the broadcast address is
  used for communication. If this option is set, the ``mcastaddr`` parameter
  should not be set.

- The ``transport`` directive controls the transport mechanism.
  To avoid the use of multicast entirely, specify the ``udpu`` unicast
  transport parameter. This requires specifying the list of members in the
  ``nodelist`` directive, which in effect fixes the cluster membership
  before deployment. The default is ``udp``. The transport type can also
  be set to ``udpu`` or ``iba``.

- Within the ``nodelist`` directive, it is possible to specify specific
  information about the nodes in the cluster. The directive can contain only
  the ``node`` sub-directive, which specifies every node that should be a
  member of the membership, and where non-default options are needed. Every
  node must have at least the ``ring0_addr`` field filled.

  .. note::

     For UDPU, every node that should be a member of the membership must be
     specified.

  Possible options are:

  - ``ring{X}_addr`` specifies the IP address of one of the nodes.
    ``{X}`` is the ring number.

  - ``nodeid`` is optional when using IPv4 and required when using IPv6.
    This is a 32-bit value specifying the node identifier delivered to the
    cluster membership service. If this is not specified with IPv4,
    the node ID is determined from the 32-bit IP address of the interface
    bound to ring number 0. The node identifier value of
    zero is reserved and should not be used.
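
If a host firewall is in place, remember (per the earlier note on
``mcastport``) to allow the Corosync UDP ports on every node. A sketch with
``iptables``, assuming the default receive port of 5405 and send port 5404:

.. code-block:: console

   # iptables -A INPUT -p udp --dport 5404:5405 -j ACCEPT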

.. _corosync-votequorum:

Set up Corosync with votequorum library
---------------------------------------

The votequorum library is part of the Corosync project. It provides an
interface to the vote-based quorum service and it must be explicitly enabled
in the Corosync configuration file. The main role of the votequorum library
is to avoid split-brain situations, but it also provides a mechanism to:

- Query the quorum status

- List the nodes known to the quorum service

- Receive notifications of quorum state changes

- Change the number of votes assigned to a node

- Change the number of expected votes for a cluster to be quorate

- Connect an additional quorum device to allow small clusters to remain
  quorate during node outages

The votequorum library was created to replace and eliminate ``qdisk``, the
disk-based quorum daemon for CMAN, from advanced cluster configurations.

A sample votequorum service configuration in the :file:`corosync.conf` file is:

.. code-block:: none

   quorum {
      provider: corosync_votequorum (1)
      expected_votes: 7 (2)
      wait_for_all: 1 (3)
      last_man_standing: 1 (4)
      last_man_standing_window: 10000 (5)
   }

Note the following:

- Specifying ``corosync_votequorum`` enables the votequorum library.
  This is the only required option.

- The cluster is fully operational with ``expected_votes`` set to 7 nodes
  (each node has 1 vote), with a quorum of 4. If a list of nodes is specified
  as ``nodelist``, the ``expected_votes`` value is ignored.

- When you start up a cluster (all nodes down) and set ``wait_for_all`` to 1,
  the cluster quorum is held until all nodes are online and have joined the
  cluster for the first time. This parameter is new in Corosync 2.0.

- Setting ``last_man_standing`` to 1 enables the Last Man Standing (LMS)
  feature. By default, it is disabled (set to 0).
  If a cluster is on the quorum edge (``expected_votes:`` set to 7;
  ``online nodes:`` set to 4) for longer than the time specified
  for the ``last_man_standing_window`` parameter, the cluster can recalculate
  quorum and continue operating even if the next node is lost.
  This logic is repeated until the number of online nodes in the cluster
  reaches 2. In order to allow the cluster to step down from 2 members to only
  1, the ``auto_tie_breaker`` parameter needs to be set.
  We do not recommend this for production environments.

- ``last_man_standing_window`` specifies the time, in milliseconds,
  required to recalculate quorum after one or more hosts
  have been lost from the cluster. To perform a new quorum recalculation,
  the cluster must have quorum for at least the interval
  specified for ``last_man_standing_window``. The default is 10000 ms.
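
Once the cluster is running, you can query the quorum status described above
with the :command:`corosync-quorumtool` utility (a sketch; the output format
varies by version):

.. code-block:: console

   # corosync-quorumtool -s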

.. _pacemaker-corosync-start:

Start Corosync
--------------

Corosync is started as a regular system service. Depending on your
distribution, it may ship with an LSB init script, an upstart job, or
a systemd unit file.

- Start ``corosync`` with the LSB init script:

  .. code-block:: console

     # /etc/init.d/corosync start

  Alternatively:

  .. code-block:: console

     # service corosync start

- Start ``corosync`` with upstart:

  .. code-block:: console

     # start corosync

- Start ``corosync`` with the systemd unit file:

  .. code-block:: console

     # systemctl start corosync

You can now check ``corosync`` connectivity with one of these tools.

Use the :command:`corosync-cfgtool` utility with the ``-s`` option
to get a summary of the health of the communication rings:

.. code-block:: console

   # corosync-cfgtool -s
   Printing ring status.
   Local node ID 435324542
   RING ID 0
           id      = 10.0.0.82
           status  = ring 0 active with no faults
   RING ID 1
           id      = 10.0.42.100
           status  = ring 1 active with no faults

Use the :command:`corosync-objctl` utility to dump the Corosync cluster
member list:

.. code-block:: console

   # corosync-objctl runtime.totem.pg.mrp.srp.members
   runtime.totem.pg.mrp.srp.435324542.ip=r(0) ip(10.0.0.82) r(1) ip(10.0.42.100)
   runtime.totem.pg.mrp.srp.435324542.join_count=1
   runtime.totem.pg.mrp.srp.435324542.status=joined
   runtime.totem.pg.mrp.srp.983895584.ip=r(0) ip(10.0.0.87) r(1) ip(10.0.42.254)
   runtime.totem.pg.mrp.srp.983895584.join_count=1
   runtime.totem.pg.mrp.srp.983895584.status=joined

You should see a ``status=joined`` entry for each of your constituent
cluster nodes.

.. note::

   If you are using Corosync version 2, use the :command:`corosync-cmapctl`
   utility instead of :command:`corosync-objctl`; it is a direct replacement.
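
   For example, the equivalent member dump would be along the lines of
   (a sketch; key names can vary between Corosync 2 releases):

   .. code-block:: console

      # corosync-cmapctl | grep members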

.. _pacemaker-start:

Start Pacemaker
---------------

After the ``corosync`` service has been started and you have verified that the
cluster is communicating properly, you can start :command:`pacemakerd`, the
Pacemaker master control process. Choose one of the following ways to
start it:

#. Start ``pacemaker`` with the LSB init script:

   .. code-block:: console

      # /etc/init.d/pacemaker start

   Alternatively:

   .. code-block:: console

      # service pacemaker start

#. Start ``pacemaker`` with upstart:

   .. code-block:: console

      # start pacemaker

#. Start ``pacemaker`` with the systemd unit file:

   .. code-block:: console

      # systemctl start pacemaker

After the ``pacemaker`` service has started, Pacemaker creates a default empty
cluster configuration with no resources. Use the :command:`crm_mon` utility to
observe the status of ``pacemaker``:

.. code-block:: console

   # crm_mon -1
   Last updated: Sun Oct 7 21:07:52 2012
   Last change: Sun Oct 7 20:46:00 2012 via cibadmin on controller2
   Stack: openais
   Current DC: controller2 - partition with quorum
   Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
   3 Nodes configured, 3 expected votes
   0 Resources configured.


   Online: [ controller3 controller2 controller1 ]
   ...

.. _pacemaker-cluster-properties:

Set basic cluster properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After you set up your Pacemaker cluster, set a few basic cluster properties:

- ``crmsh``

  .. code-block:: console

     $ crm configure property pe-warn-series-max="1000" \
       pe-input-series-max="1000" \
       pe-error-series-max="1000" \
       cluster-recheck-interval="5min"

- ``pcs``

  .. code-block:: console

     $ pcs property set pe-warn-series-max=1000 \
       pe-input-series-max=1000 \
       pe-error-series-max=1000 \
       cluster-recheck-interval=5min

Note the following:

- Setting the ``pe-warn-series-max``, ``pe-input-series-max``,
  and ``pe-error-series-max`` parameters to 1000
  instructs Pacemaker to keep a longer history of the inputs processed
  and errors and warnings generated by its Policy Engine.
  This history is useful if you need to troubleshoot the cluster.

- Pacemaker uses an event-driven approach to cluster state processing.
  The ``cluster-recheck-interval`` parameter (which defaults to 15 minutes)
  defines the interval at which certain Pacemaker actions occur.
  It is usually prudent to reduce this to a shorter interval,
  such as 5 or 3 minutes.

By default, STONITH is enabled in Pacemaker, but STONITH mechanisms (to
shut down a node via IPMI or ssh) are not configured. In this case, Pacemaker
will refuse to start any resources.
For production clusters, it is recommended to configure appropriate STONITH
mechanisms. For demo or testing purposes, however, STONITH can be disabled
completely as follows:

- ``crmsh``

  .. code-block:: console

     $ crm configure property stonith-enabled=false

- ``pcs``

  .. code-block:: console

     $ pcs property set stonith-enabled=false

After you make these changes, commit the updated configuration.
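
To review the resulting settings, you can dump the configuration (a sketch,
using whichever tool matches your setup):

.. code-block:: console

   $ crm configure show

Or, with pcs:

.. code-block:: console

   $ pcs property list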