Boss drum, motivating rhythm of life with the healing, rhythmic synergy. More seriously, this patch re-arranges the documentation structure to conform to the structure outlined in [1]. With it, some changes are made to effectively transition the links and simplify the sphinx configuration. The Mitaka/Liberty documentation links are removed as they are no longer available. [1] http://specs.openstack.org/openstack/docs-specs/specs/pike/os-manuals-migration.html Change-Id: Icc985de3af4de5ea7a5aa01b6e6f6e524c67f11b
16 KiB
Galera cluster maintenance
Routine maintenance includes gracefully adding or removing nodes from the cluster without impacting operation and also starting a cluster after gracefully shutting down all nodes.
MySQL instances are restarted when creating a cluster, when adding a
node, when the service is not running, or when changes are made to the
/etc/mysql/my.cnf
configuration file.
Verify cluster status
Compare the output of the following command with the following output. It should give you information about the status of your cluster.
# ansible galera_container -m shell -a "mysql -h 127.0.0.1 \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)
node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 7
wsrep_cluster_size 1
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
In the previous example, all but one node was responding.
Gracefully shutting down the MariaDB service on all but one node allows the remaining operational node to continue processing SQL requests. When gracefully shutting down multiple nodes, perform the actions sequentially to retain operation.
Start a cluster
Gracefully shutting down all nodes destroys the cluster. Starting or restarting a cluster from zero nodes requires creating a new cluster on one of the nodes.
Start a new cluster on the most advanced node. Change to the
playbooks
directory and check theseqno
value in thegrastate.dat
file on all of the nodes:# ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat" node2_galera_container-49a47d25 | success | rc=0 >> # GALERA saved state version: 2.1 uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1 seqno: 31 cert_index: node3_galera_container-3ea2cbd3 | success | rc=0 >> # GALERA saved state version: 2.1 uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1 seqno: 31 cert_index: node4_galera_container-76275635 | success | rc=0 >> # GALERA saved state version: 2.1 uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1 seqno: 31 cert_index:
In this example, all nodes in the cluster contain the same positive
seqno
values as they were synchronized just prior to graceful shutdown. If allseqno
values are equal, any node can start the new cluster.## for init # /etc/init.d/mysql start --wsrep-new-cluster ## for systemd # systemctl start mysql --wsrep-new-cluster
Please also have a look at upstream starting a cluster page
This can also be done with the help of ansible using the shell module:
# ansible galera_container -m shell -a "/etc/init.d/mysql start --wsrep-new-cluster" --limit galera_container[0]
This command results in a cluster containing a single node. The
wsrep_cluster_size
value shows the number of nodes in the cluster.node2_galera_container-49a47d25 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) node3_galera_container-3ea2cbd3 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 1 wsrep_cluster_size 1 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary
Restart MariaDB on the other nodes (replace [0] from previous ansible command with [1:]) and verify that they rejoin the cluster.
node2_galera_container-49a47d25 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 3 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node3_galera_container-3ea2cbd3 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 3 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 3 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary
Galera cluster recovery
Run the galera-install
playbook using the
galera-bootstrap
tag to automatically recover a node or an
entire environment.
Run the following Ansible command to show the failed nodes:
# openstack-ansible galera-install.yml --tags galera-bootstrap
The cluster comes back online after completion of this command.
Recover a single-node failure
If a single node fails, the other nodes maintain quorum and continue to process SQL requests.
Change to the
playbooks
directory and run the following Ansible command to determine the failed node:# ansible galera_container -m shell -a "mysql -h 127.0.0.1 \ -e 'show status like \"%wsrep_cluster_%\";'" node3_galera_container-3ea2cbd3 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) node2_galera_container-49a47d25 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 17 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 17 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary
In this example, node 3 has failed.
Restart MariaDB on the failed node and verify that it rejoins the cluster.
If MariaDB fails to start, run the
mysqld
command and perform further analysis on the output. As a last resort, rebuild the container for the node.
Recover a multi-node failure
When all but one node fails, the remaining node cannot achieve quorum and stops processing SQL requests. In this situation, failed nodes that recover cannot join the cluster because it no longer exists.
Run the following Ansible command to show the failed nodes:
# ansible galera_container -m shell -a "mysql \ -h 127.0.0.1 -e 'show status like \"%wsrep_cluster_%\";'" node2_galera_container-49a47d25 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) node3_galera_container-3ea2cbd3 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 18446744073709551615 wsrep_cluster_size 1 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status non-Primary
In this example, nodes 2 and 3 have failed. The remaining operational server indicates
non-Primary
because it cannot achieve quorum.Run the following command to rebootstrap the operational node into the cluster:
# mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';" node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 15 wsrep_cluster_size 1 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node3_galera_container-3ea2cbd3 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111) node2_galera_container-49a47d25 | FAILED | rc=1 >> ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
The remaining operational node becomes the primary node and begins processing SQL requests.
Restart MariaDB on the failed nodes and verify that they rejoin the cluster:
# ansible galera_container -m shell -a "mysql \ -h 127.0.0.1 -e 'show status like \"%wsrep_cluster_%\";'" node3_galera_container-3ea2cbd3 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 17 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node2_galera_container-49a47d25 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 17 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 17 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary
If MariaDB fails to start on any of the failed nodes, run the
mysqld
command and perform further analysis on the output. As a last resort, rebuild the container for the node.
Recover a complete environment failure
Restore from backup if all of the nodes in a Galera cluster fail (do
not shutdown gracefully). Change to the playbook
directory
and run the following command to determine if all nodes in the cluster
have failed:
# ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: -1
cert_index:
node2_galera_container-49a47d25 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: -1
cert_index:
node4_galera_container-76275635 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: -1
cert_index:
All the nodes have failed if mysqld
is not running on
any of the nodes and all of the nodes contain a seqno
value
of -1.
If any single node has a positive seqno
value, then that
node can be used to restart the cluster. However, because there is no
guarantee that each node has an identical copy of the data, we do not
recommend to restart the cluster using the
--wsrep-new-cluster
command on one node.
Rebuild a container
Recovering from certain failures require rebuilding one or more containers.
Disable the failed node on the load balancer.
Note
Do not rely on the load balancer health checks to disable the node. If the node is not disabled, the load balancer sends SQL requests to it before it rejoins the cluster and cause data inconsistencies.
Destroy the container and remove MariaDB data stored outside of the container:
# lxc-stop -n node3_galera_container-3ea2cbd3 # lxc-destroy -n node3_galera_container-3ea2cbd3 # rm -rf /openstack/node3_galera_container-3ea2cbd3/*
In this example, node 3 failed.
Run the host setup playbook to rebuild the container on node 3:
# openstack-ansible setup-hosts.yml -l node3 \ -l node3_galera_container-3ea2cbd3
The playbook restarts all other containers on the node.
Run the infrastructure playbook to configure the container specifically on node 3:
# openstack-ansible setup-infrastructure.yml \ --limit node3_galera_container-3ea2cbd3
Warning
The new container runs a single-node Galera cluster, which is a dangerous state because the environment contains more than one active database with potentially different data.
# ansible galera_container -m shell -a "mysql \ -h 127.0.0.1 -e 'show status like \"%wsrep_cluster_%\";'" node3_galera_container-3ea2cbd3 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 1 wsrep_cluster_size 1 wsrep_cluster_state_uuid da078d01-29e5-11e4-a051-03d896dbdb2d wsrep_cluster_status Primary node2_galera_container-49a47d25 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 4 wsrep_cluster_size 2 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 4 wsrep_cluster_size 2 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary
Restart MariaDB in the new container and verify that it rejoins the cluster.
Note
In larger deployments, it may take some time for the MariaDB daemon to start in the new container. It will be synchronizing data from the other MariaDB servers during this time. You can monitor the status during this process by tailing the
/var/log/mysql_logs/galera_server_error.log
log file.Lines starting with
WSREP_SST
will appear during the sync process and you should see a line withWSREP: SST complete, seqno: <NUMBER>
if the sync was successful.# ansible galera_container -m shell -a "mysql \ -h 127.0.0.1 -e 'show status like \"%wsrep_cluster_%\";'" node2_galera_container-49a47d25 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 5 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node3_galera_container-3ea2cbd3 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 5 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary node4_galera_container-76275635 | success | rc=0 >> Variable_name Value wsrep_cluster_conf_id 5 wsrep_cluster_size 3 wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1 wsrep_cluster_status Primary
Enable the previously failed node on the load balancer.