19 KiB

Raw Blame History

Pacemaker cluster stack

Pacemaker cluster stack is the state-of-the-art high availability and load balancing stack for the Linux platform. Pacemaker is useful to make OpenStack infrastructure highly available. Also, it is storage and application-agnostic, and in no way specific to OpenStack.

Pacemaker relies on the Corosync messaging layer for reliable cluster communications. Corosync implements the Totem single-ring ordering and membership protocol. It also provides UDP and InfiniBand based messaging, quorum, and cluster membership to Pacemaker.

Pacemaker does not inherently (need or want to) understand the applications it manages. Instead, it relies on resource agents (RAs), scripts that encapsulate the knowledge of how to start, stop, and check the health of each application managed by the cluster.

These agents must conform to one of the OCF <https://github.com/ClusterLabs/ OCF-spec/blob/master/ra/resource-agent-api.md>, SysV Init <http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/ LSB-Core-generic/iniscrptact.html>, Upstart, or Systemd standards.

Pacemaker ships with a large set of OCF agents (such as those managing MySQL databases, virtual IP addresses, and RabbitMQ), but can also use any agents already installed on your system and can be extended with your own (see the developer guide).

The steps to implement the Pacemaker cluster stack are:

pacemaker-install
pacemaker-corosync-setup
pacemaker-corosync-start
pacemaker-start
pacemaker-cluster-properties

Install packages

On any host that is meant to be part of a Pacemaker cluster, you must first establish cluster communications through the Corosync messaging layer. This involves installing the following packages (and their dependencies, which your package manager usually installs automatically):

pacemaker
pcs (CentOS or RHEL) or crmsh
corosync
fence-agents (CentOS or RHEL) or cluster-glue
resource-agents
libqb0

Set up the cluster with pcs

Make sure pcs is running and configured to start at boot time:
```
$ systemctl enable pcsd
$ systemctl start pcsd
```
Set a password for hacluster user on each host.

Since the cluster is a single administrative domain, it is generally accepted to use the same password on all nodes.
```
$ echo my-secret-password-no-dont-use-this-one \
  | passwd --stdin hacluster
```
Use that password to authenticate to the nodes which will make up the cluster. The -p option is used to give the password on command line and makes it easier to script.
```
$ pcs cluster auth controller1 controller2 controller3 \
  -u hacluster -p my-secret-password-no-dont-use-this-one --force
```

Create the cluster, giving it a name, and start it:

$ pcs cluster setup --force --name my-first-openstack-cluster \
  controller1 controller2 controller3
$ pcs cluster start --all

In Red Hat Enterprise Linux or CentOS environments, this is a recommended path to perform configuration. For more information, see the RHEL docs.

Set up the cluster with crmsh

After installing the Corosync package, you must create the /etc/corosync/corosync.conf configuration file.

Note

For Ubuntu, you should also enable the Corosync service in the /etc/default/corosync configuration file.

Corosync can be configured to work with either multicast or unicast IP addresses or to use the votequorum library.

corosync-multicast
corosync-unicast
corosync-votequorum

Set up Corosync with multicast

Most distributions ship an example configuration file (corosync.conf.example) as part of the documentation bundled with the Corosync package. An example Corosync configuration file is shown below:

Example Corosync configuration file for multicast (corosync.conf)

totem {
      version: 2

      # Time (in ms) to wait for a token (1)
      token: 10000

     # How many token retransmits before forming a new
     # configuration
     token_retransmits_before_loss_const: 10

     # Turn off the virtual synchrony filter
     vsftype: none

     # Enable encryption (2)
     secauth: on

     # How many threads to use for encryption/decryption
     threads: 0

     # This specifies the redundant ring protocol, which may be
     # none, active, or passive. (3)
     rrp_mode: active

     # The following is a two-ring multicast configuration. (4)
     interface {
             ringnumber: 0
             bindnetaddr: 10.0.0.0
             mcastaddr: 239.255.42.1
             mcastport: 5405
     }
     interface {
             ringnumber: 1
             bindnetaddr: 10.0.42.0
             mcastaddr: 239.255.42.2
             mcastport: 5405
     }
}

amf {
     mode: disabled
}

service {
        # Load the Pacemaker Cluster Resource Manager (5)
        ver:       1
        name:      pacemaker
}

aisexec {
        user:   root
        group:  root
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }}

Note the following:

The token value specifies the time, in milliseconds, during which the Corosync token is expected to be transmitted around the ring. When this timeout expires, the token is declared lost, and after token_retransmits_before_loss_const lost tokens, the non-responding processor (cluster node) is declared dead. In other words, token × token_retransmits_before_loss_const is the maximum time a node is allowed to not respond to cluster messages before being considered dead. The default for token is 1000 milliseconds (1 second), with 4 allowed retransmits. These defaults are intended to minimize failover times, but can cause frequent "false alarms" and unintended failovers in case of short network interruptions. The values used here are safer, albeit with slightly extended failover times.
With secauth enabled, Corosync nodes mutually authenticate using a 128-byte shared secret stored in the /etc/corosync/authkey file, which may be generated with the corosync-keygen utility. When using secauth, cluster communications are also encrypted.
In Corosync configurations using redundant networking (with more than one interface), you must select a Redundant Ring Protocol (RRP) mode other than none. active is the recommended RRP mode.

Note the following about the recommended interface configuration:
- Each configured interface must have a unique ringnumber, starting with 0.
- The bindnetaddr is the network address of the interfaces to bind to. The example uses two network addresses of /24 IPv4 subnets.
- Multicast groups (mcastaddr) must not be reused across cluster boundaries. In other words, no two distinct clusters should ever use the same multicast group. Be sure to select multicast addresses compliant with RFC 2365, "Administratively Scoped IP Multicast".
- For firewall configurations, note that Corosync communicates over UDP only, and uses mcastport (for receives) and mcastport - 1 (for sends).
The service declaration for the pacemaker service may be placed in the corosync.conf file directly or in its own separate file, /etc/corosync/service.d/pacemaker.
Note

If you are using Corosync version 2 on Ubuntu 14.04, remove or comment out lines under the service stanza, which enables Pacemaker to start up. Another potential problem is the boot and shutdown order of Corosync and Pacemaker. To force Pacemaker to start after Corosync and stop before Corosync, fix the start and kill symlinks manually:
```
# update-rc.d pacemaker start 20 2 3 4 5 . stop 00 0 1 6 .
```
The Pacemaker service also requires an additional configuration file /etc/corosync/uidgid.d/pacemaker to be created with the following content:
```
uidgid {
  uid: hacluster
  gid: haclient
}
```
Once created, the corosync.conf file (and the authkey file if the secauth option is enabled) must be synchronized across all cluster nodes.

Set up Corosync with unicast

For environments that do not support multicast, Corosync should be configured for unicast. An example fragment of the corosync.conf file for unicastis shown below:

Corosync configuration file fragment for unicast (corosync.conf)

totem {
        #...
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                broadcast: yes (1)
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.42.0
                broadcast: yes
                mcastport: 5405
        }
        transport: udpu (2)
}

nodelist { (3)
        node {
                ring0_addr: 10.0.0.12
                ring1_addr: 10.0.42.12
                nodeid: 1
        }
        node {
                ring0_addr: 10.0.0.13
                ring1_addr: 10.0.42.13
                nodeid: 2
        }
        node {
                ring0_addr: 10.0.0.14
                ring1_addr: 10.0.42.14
                nodeid: 3
        }
}
#...

Note the following:

If the broadcast parameter is set to yes, the broadcast address is used for communication. If this option is set, the mcastaddr parameter should not be set.
The transport directive controls the transport mechanism used. To avoid the use of multicast entirely, specify the udpu unicast transport parameter. This requires specifying the list of members in the nodelist directive; this could potentially make up the membership before deployment. The default is udp. The transport type can also be set to udpu or iba.
Within the nodelist directive, it is possible to specify specific information about the nodes in the cluster. The directive can contain only the node sub-directive, which specifies every node that should be a member of the membership, and where non-default options are needed. Every node must have at least the ring0_addr field filled.

Note

For UDPU, every node that should be a member of the membership must be specified.

Possible options are:
- ring{X}_addr specifies the IP address of one of the nodes. {X} is the ring number.
- nodeid is optional when using IPv4 and required when using IPv6. This is a 32-bit value specifying the node identifier delivered to the cluster membership service. If this is not specified with IPv4, the node id is determined from the 32-bit IP address of the system to which the system is bound with ring identifier of 0. The node identifier value of zero is reserved and should not be used.

Set up Corosync with votequorum library

The votequorum library is part of the corosync project. It provides an interface to the vote-based quorum service and it must be explicitly enabled in the Corosync configuration file. The main role of votequorum library is to avoid split-brain situations, but it also provides a mechanism to:

Query the quorum status
Get a list of nodes known to the quorum service
Receive notifications of quorum state changes
Change the number of votes assigned to a node
Change the number of expected votes for a cluster to be quorate
Connect an additional quorum device to allow small clusters remain quorate during node outages

The votequorum library has been created to replace and eliminate qdisk, the disk-based quorum daemon for CMAN, from advanced cluster configurations.

A sample votequorum service configuration in the corosync.conf file is:

quorum {
        provider: corosync_votequorum (1)
        expected_votes: 7 (2)
        wait_for_all: 1 (3)
        last_man_standing: 1 (4)
        last_man_standing_window: 10000 (5)
       }

Note the following:

Specifying corosync_votequorum enables the votequorum library; this is the only required option.
The cluster is fully operational with expected_votes set to 7 nodes (each node has 1 vote), quorum: 4. If a list of nodes is specified as nodelist, the expected_votes value is ignored.
Setting wait_for_all to 1 means that, When starting up a cluster (all nodes down), the cluster quorum is held until all nodes are online and have joined the cluster for the first time. This parameter is new in Corosync 2.0.
Setting last_man_standing to 1 enables the Last Man Standing (LMS) feature; by default, it is disabled (set to 0). If a cluster is on the quorum edge (expected_votes: set to 7; online nodes: set to 4) for longer than the time specified for the last_man_standing_window parameter, the cluster can recalculate quorum and continue operating even if the next node will be lost. This logic is repeated until the number of online nodes in the cluster reaches 2. In order to allow the cluster to step down from 2 members to only 1, the auto_tie_breaker parameter needs to be set; this is not recommended for production environments.
last_man_standing_window specifies the time, in milliseconds, required to recalculate quorum after one or more hosts have been lost from the cluster. To do the new quorum recalculation, the cluster must have quorum for at least the interval specified for last_man_standing_window; the default is 10000ms.

Start Corosync

Corosync is started as a regular system service. Depending on your distribution, it may ship with an LSB init script, an upstart job, or a systemd unit file. Either way, the service is usually named corosync:

To start corosync with the LSB init script:
```
# /etc/init.d/corosync start
```
Alternatively:
```
# service corosync start
```
To start corosync with upstart:
```
# start corosync
```
To start corosync with systemd unit file:
```
# systemctl start corosync
```

You can now check the corosync connectivity with one of these tools.

Use the corosync-cfgtool utility with the -s option to get a summary of the health of the communication rings:

# corosync-cfgtool -s
Printing ring status.
Local node ID 435324542
RING ID 0
        id      = 10.0.0.82
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.0.42.100
        status  = ring 1 active with no faults

Use the corosync-objctl utility to dump the Corosync cluster member list:

# corosync-objctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.435324542.ip=r(0) ip(10.0.0.82) r(1) ip(10.0.42.100)
runtime.totem.pg.mrp.srp.435324542.join_count=1
runtime.totem.pg.mrp.srp.435324542.status=joined
runtime.totem.pg.mrp.srp.983895584.ip=r(0) ip(10.0.0.87) r(1) ip(10.0.42.254)
runtime.totem.pg.mrp.srp.983895584.join_count=1
runtime.totem.pg.mrp.srp.983895584.status=joined

You should see a status=joined entry for each of your constituent cluster nodes.

[TODO: Should the main example now use corosync-cmapctl and have the note give the command for Corosync version 1?]

Note

If you are using Corosync version 2, use the corosync-cmapctl utility instead of corosync-objctl; it is a direct replacement.

Start Pacemaker

After the corosync service have been started and you have verified that the cluster is communicating properly, you can start pacemakerd, the Pacemaker master control process. Choose one from the following four ways to start it:

To start pacemaker with the LSB init script:
```
# /etc/init.d/pacemaker start
```
Alternatively:
```
# service pacemaker start
```
To start pacemaker with upstart:
```
# start pacemaker
```
To start pacemaker with the systemd unit file:
```
# systemctl start pacemaker
```

After the pacemaker service have started, Pacemaker creates a default empty cluster configuration with no resources. Use the crm_mon utility to observe the status of pacemaker:

============
Last updated: Sun Oct  7 21:07:52 2012
Last change: Sun Oct  7 20:46:00 2012 via cibadmin on controller2
Stack: openais
Current DC: controller2 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
3 Nodes configured, 3 expected votes
0 Resources configured.
============

Online: [ controller3 controller2 controller1 ]

Set basic cluster properties

After you set up your Pacemaker cluster, you should set a few basic cluster properties:

crmsh

$ crm configure property pe-warn-series-max="1000" \
  pe-input-series-max="1000" \
  pe-error-series-max="1000" \
  cluster-recheck-interval="5min"

pcs

$ pcs property set pe-warn-series-max=1000 \
  pe-input-series-max=1000 \
  pe-error-series-max=1000 \
  cluster-recheck-interval=5min

Note the following:

Setting the pe-warn-series-max, pe-input-series-max and pe-error-series-max parameters to 1000 instructs Pacemaker to keep a longer history of the inputs processed and errors and warnings generated by its Policy Engine. This history is useful if you need to troubleshoot the cluster.
Pacemaker uses an event-driven approach to cluster state processing. The cluster-recheck-interval parameter (which defaults to 15 minutes) defines the interval at which certain Pacemaker actions occur. It is usually prudent to reduce this to a shorter interval, such as 5 or 3 minutes.

After you make these changes, you may commit the updated configuration.

19 KiB Raw Blame History Unescape Escape

Pacemaker cluster stack

Install packages

Set up the cluster with pcs

Set up the cluster with crmsh

Set up Corosync with multicast

Set up Corosync with unicast

Set up Corosync with votequorum library

Start Corosync

Start Pacemaker

Set basic cluster properties

19 KiB

Raw Blame History