MySQL Galera Guide

This guide provides an overview of the Galera implementation in CCP.

Overview

Galera Cluster is a synchronous multi-master database cluster, based on synchronous replication and MySQL/InnoDB. When Galera Cluster is in use, you can direct reads and writes to any node, and you can lose any individual node without interruption in operations and without the need to handle complex failover procedures.

CCP implementation details

Entrypoint script

To handle all required logic, CCP has a dedicated entrypoint script for Galera and its side-containers. Because of that, Galera pods are slightly different from the rest of the CCP pods. For example, the Galera container still uses the CCP global entrypoint, but that entrypoint executes the Galera entrypoint, which starts MySQL and handles all the required logic, such as bootstrapping and failure detection.

Galera pod

Each Galera pod consists of 3 containers:

galera
galera-checker
galera-haproxy

galera - a container which runs Galera itself.

galera-checker - a container with the galera-checker script. It is used to check the readiness and liveness of the Galera node.

galera-haproxy - a container with a haproxy instance.

Note

More info about each container is available in the "Galera containers" section.

Etcd usage

The current implementation uses etcd to store the cluster state. The default etcd root directory is /galera/k8scluster.

Additional keys and directories are:

leader - key with the IP address of the current leader. The leader is a single, randomly chosen Galera node that the haproxy instances use as their backend.
nodes/ - directory with the current Galera nodes. Each node key is named after the node's IP address, and its value is the Unix time of the key creation.
queue/ - directory with the Galera nodes currently waiting in the recovery queue. This is needed to ensure that all nodes are ready before looking for the node with the highest seqno. Each node key is named after the node's IP address, and its value is the Unix time of the key creation.
seqno/ - directory with the seqnos of the current Galera nodes. Each node key is named after the node's IP address, and its value is the seqno of the node's data.
state - key with the current cluster state. Can be "STEADY", "BUILDING" or "RECOVERY".
uuid - key with the current UUID of the Galera cluster. If a new node has a different UUID, this indicates a split brain situation. Nodes with the wrong UUID will be destroyed.
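As a rough illustration of the layout above, the following Python sketch builds the key paths under the default root. The helper names (key_path, node_keys, CLUSTER_KEYS) and the sample IP address are hypothetical and only serve the example; they are not taken from the CCP code.

```python
import posixpath

# Default root directory as stated in the text above.
ETCD_ROOT = "/galera/k8scluster"


def key_path(*parts, root=ETCD_ROOT):
    """Join path components under the Galera etcd root."""
    return posixpath.join(root, *parts)


def node_keys(ip, root=ETCD_ROOT):
    """Return the per-node keys that a node with the given IP may own."""
    return {
        "nodes": key_path("nodes", ip, root=root),
        "queue": key_path("queue", ip, root=root),
        "seqno": key_path("seqno", ip, root=root),
    }


# Cluster-wide keys that are not tied to a single node.
CLUSTER_KEYS = {name: key_path(name) for name in ("leader", "state", "uuid")}

if __name__ == "__main__":
    print(CLUSTER_KEYS["state"])
    print(node_keys("10.0.0.5")["seqno"])
```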

Galera containers

galera

This container runs the Galera daemon and handles all the bootstrapping, reconnecting and recovery logic.

At the start of the container, it checks for the init.ok file in the Galera data directory. If this file doesn't exist, it removes all files from the data directory and runs the MySQL init to create the base MySQL data files. After that, it starts the mysqld daemon without networking and sets the needed permissions for the expected users.

If the init.ok file is found, it runs mysqld_safe --wsrep-recover to recover the Galera-related information and write it to the grastate.dat file.

After that, it checks the cluster state and, depending on the current state, chooses the required scenario.
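The startup decision described above can be sketched in Python as follows. This is a minimal sketch, not the actual entrypoint code: the function names are hypothetical, and the parser assumes the simple "key: value" layout commonly seen in grastate.dat files.

```python
import os


def startup_action(data_dir):
    """Return "init" for a fresh data directory, "recover" otherwise.

    The decision is based only on the presence of the init.ok marker file.
    """
    marker = os.path.join(data_dir, "init.ok")
    return "recover" if os.path.isfile(marker) else "init"


def parse_grastate(text):
    """Parse "key: value" lines of a grastate.dat file into a dict.

    Comment lines (starting with "#") and blank lines are ignored.
    Values are returned as strings.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields
```

For example, parse_grastate(open(path).read()).get("seqno") would give the recovered seqno as a string, where path points at the grastate.dat file in the data directory.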

galera-checker

This container is used for liveness and readiness checks of the Galera pod.

To check if this Galera pod is ready, it checks the following things:

wsrep_local_state_comment = "Synced"
wsrep_evs_state = "OPERATIONAL"
wsrep_connected = "ON"
wsrep_ready = "ON"
wsrep_cluster_state_uuid = uuid in the etcd
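The readiness conditions listed above can be expressed as a small Python function. This is only a sketch: it assumes the wsrep status variables have already been read into a dict (for example from SHOW STATUS LIKE 'wsrep_%'), and the names EXPECTED_STATUS and is_ready are hypothetical.

```python
# Status variables and the values required for readiness, as listed above.
EXPECTED_STATUS = {
    "wsrep_local_state_comment": "Synced",
    "wsrep_evs_state": "OPERATIONAL",
    "wsrep_connected": "ON",
    "wsrep_ready": "ON",
}


def is_ready(status, etcd_uuid):
    """Return True only if every readiness condition holds.

    status    -- dict of wsrep status variable names to their values
    etcd_uuid -- cluster UUID currently stored in etcd
    """
    for name, expected in EXPECTED_STATUS.items():
        if status.get(name) != expected:
            return False
    return status.get("wsrep_cluster_state_uuid") == etcd_uuid
```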

To check if this Galera pod is alive, it checks the following things:

If the current cluster state is not "STEADY" - it skips the liveness check.
If it detects that an SST sync is in progress - it skips the liveness check.
If it detects that there is no MySQL pid file yet - it skips the liveness check.
If the node's "wsrep_cluster_state_uuid" differs from the one in etcd - it kills the Galera container, since it's a "split brain" situation.
If "wsrep_local_state_comment" is "Joined", and the previous state was "Joined" too - it kills the Galera container, since it can't finish joining the cluster for some reason.
If it catches any exception during the checks - it kills the Galera container.

If all checks pass, the Galera pod is considered alive.
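The order of the liveness checks above can be sketched as a pure Python function. The function and parameter names are hypothetical, and the inputs are assumed to have been gathered beforehand; how the real script detects SST or remembers the previous state is not shown here.

```python
def liveness_decision(cluster_state, sst_in_progress, pid_file_exists,
                      node_uuid, etcd_uuid, local_state, previous_local_state):
    """Return "skip", "kill" or "alive", following the checks in order."""
    if cluster_state != "STEADY":
        return "skip"
    if sst_in_progress:
        return "skip"
    if not pid_file_exists:
        return "skip"
    if node_uuid != etcd_uuid:
        return "kill"  # split brain: the node belongs to a different cluster
    if local_state == "Joined" and previous_local_state == "Joined":
        return "kill"  # the node is stuck and cannot finish joining
    return "alive"


def safe_liveness_decision(**facts):
    """Treat any exception raised during the checks as a reason to kill."""
    try:
        return liveness_decision(**facts)
    except Exception:
        return "kill"
```

Keeping the decision separate from the data gathering makes each branch easy to exercise without a running cluster.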

galera-haproxy

This container runs the haproxy daemon, which sends all traffic to a single Galera pod.

This is needed to avoid deadlocks and stale reads. It chooses the "leader" out of all available Galera pods, and once the leader is chosen, all haproxy instances update their configuration with the new leader.
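One way to picture the leader choice is the Python sketch below. It rests on an assumption that is not stated in this guide: the current leader is kept while it is still among the available nodes, and a random node is picked only when it is gone. The function name choose_leader is hypothetical.

```python
import random


def choose_leader(nodes, current_leader=None, rng=random):
    """Return the IP of the node that haproxy should use as its backend.

    nodes          -- iterable of IP addresses of available Galera nodes
    current_leader -- IP stored in the leader key, or None
    rng            -- object with a choice() method (the random module by default)
    """
    nodes = sorted(nodes)
    if current_leader in nodes:
        return current_leader  # assumption: a healthy leader is not replaced
    if not nodes:
        return None  # nothing to route traffic to
    return rng.choice(nodes)
```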

Supported scenarios

Initial bootstrap

In this scenario, there is no working Galera cluster yet. Each node tries to get the lock in etcd, and the first one to get it starts the cluster bootstrapping. After it's done, the next node gets the lock and connects to the existing cluster.

Note

During the bootstrap, the state of the cluster will be "BUILDING". It will be changed to "STEADY" after the last node connects.
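The lock-based ordering of the initial bootstrap can be simulated in Python. In this sketch an in-process threading.Lock stands in for the etcd lock, and the FakeCluster class and its fields are hypothetical; it only illustrates the sequence of states, not the real implementation.

```python
import threading


class FakeCluster:
    """In-memory stand-in for the cluster state kept in etcd."""

    def __init__(self, cluster_size):
        self.lock = threading.Lock()  # stands in for the etcd lock
        self.cluster_size = cluster_size
        self.state = None
        self.members = []

    def start_node(self, ip):
        """Bootstrap the cluster if it is empty, otherwise join it."""
        with self.lock:
            if not self.members:
                self.state = "BUILDING"
                action = "bootstrap"
            else:
                action = "join"
            self.members.append(ip)
            if len(self.members) == self.cluster_size:
                self.state = "STEADY"  # the last node has connected
            return action


cluster = FakeCluster(cluster_size=3)
actions = [cluster.start_node(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
print(actions, cluster.state)
```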

Re-connecting to the existing cluster

In this scenario, the Galera cluster is already available. In most cases, it will be a node re-connection after some failure, such as a node reboot. Each node tries to get the lock in etcd, and once the lock is acquired, the node connects to the existing cluster.

Note

During this scenario, the state of the cluster will be "STEADY".

Recovery

This scenario can be triggered in two possible ways:

The operator manually sets the cluster state in etcd to "RECOVERY".
A new node does a few checks before bootstrapping. If it finds that the cluster state is "STEADY", but there are zero nodes in the cluster, it assumes that the cluster has been destroyed somehow and that recovery is needed. In that case, it sets the state to "RECOVERY" and starts the recovery scenario.

During the recovery scenario, cluster bootstrapping is different from the "Initial bootstrap". In this scenario, each node looks for its "seqno", which is basically the number of registered transactions. The node with the highest seqno bootstraps the cluster and the other nodes join it, so in the end we have the latest data that was available before the cluster destruction.
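The two triggers above reduce to a short predicate. The Python sketch below uses a hypothetical function name and assumes the state key and the contents of the nodes/ directory have already been read from etcd.

```python
def needs_recovery(state, live_nodes):
    """Return True if the recovery scenario should run.

    state      -- value of the state key ("STEADY", "BUILDING" or "RECOVERY")
    live_nodes -- collection of node IPs currently registered under nodes/
    """
    if state == "RECOVERY":
        return True  # set manually by the operator or by another node
    # A "STEADY" cluster with no nodes is assumed to have been destroyed.
    return state == "STEADY" and len(live_nodes) == 0
```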
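Choosing the node to bootstrap from can be sketched in Python as below. The function names are hypothetical, the seqno values are assumed to be stored as integer strings, and the tie-break (the lexicographically smallest IP among equal seqnos) is an assumption made only to keep the example deterministic.

```python
def queue_is_complete(queue, cluster_size):
    """Return True once every expected node waits in the recovery queue."""
    return len(queue) >= cluster_size


def pick_recovery_node(seqnos):
    """Return the IP of the node with the highest seqno.

    seqnos -- dict mapping node IP to its seqno (an int or an integer string)
    """
    if not seqnos:
        raise ValueError("no seqno has been reported yet")
    # max() keeps the first maximum it sees, so sorting the keys first makes
    # ties resolve to the smallest IP string.
    return max(sorted(seqnos), key=lambda ip: int(seqnos[ip]))
```

With seqnos of 10, 42 and 7 reported by three nodes, the node that reported 42 would bootstrap the cluster and the other two would join it.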

Note

During the bootstrap, the state of the cluster will be "RECOVERY". It will be changed to "STEADY" after the last node connects.

There is an option to manually choose the node to recover data from. For details, please see the "Force bootstrap" section in "Advanced features".

Advanced features

Cluster size

By default, the Galera cluster size is 3 nodes. This is optimal for most cases. If you want to change it to a custom number, you need to override the cluster_size variable in the percona tree, for example:

configs:
  percona:
    cluster_size: 5

Note

Cluster size should be an odd number. A cluster size of more than 5 nodes will lead to high latency for write operations.
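The preference for odd sizes follows from majority quorum arithmetic, which the Python sketch below works through. It assumes all nodes carry equal weight, and the function names are hypothetical.

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Number of nodes that can be lost while a majority remains."""
    return cluster_size - quorum(cluster_size)


for size in (3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
```

Under this assumption, a 4-node cluster tolerates the same single failure as a 3-node one, so the extra even node adds replication work without adding fault tolerance.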

Force bootstrap

Sometimes operators may want to manually specify the Galera node from which recovery should be done. In that case, you need to override the force_bootstrap variable in the percona tree, for example:

configs:
  percona:
    force_bootstrap:
      enabled: true
      node: NODE_NAME

NODE_NAME should be the name of the k8s node that will run the Galera node with the required data.

Troubleshooting

Galera operation requires some advanced knowledge of MySQL and of some general clustering concepts. In most cases, we expect that Galera will "self-heal", in the worst case via a restart, full resync and reconnection to the cluster.

Our readiness and liveness scripts should cover this and not allow a misconfigured or non-operational node to receive production traffic.

Yet it's possible that some failure scenarios are not covered, and fixing them could require some manual actions.

Check the logs

Each container of the Galera pod writes detailed logs to stdout. You can read them via kubectl logs POD_NAME -c CONT_NAME. Make sure you check both the galera and the galera-checker container logs.

Additionally, you should check the MySQL logs in /var/log/ccp/mysql/mysql.log.

Check the etcd state

Galera keeps its state in etcd, and it can be useful to check what is going on in etcd right now. Assuming that you're using the ccp namespace, you can check the etcd state using these commands:

etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -r -p --sort /galera
etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/state
etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/leader
etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/uuid
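If the deployment uses a namespace other than ccp, the same commands can be generated with a small Python helper. This sketch assumes the etcd service keeps the name etcd and the default cluster.local cluster domain; the function names are hypothetical, and the helper only builds the command line without running it.

```python
import shlex


def etcd_endpoint(namespace="ccp", port=2379):
    """Build the in-cluster etcd URL for the given namespace."""
    return "http://etcd.{}.svc.cluster.local:{}".format(namespace, port)


def etcdctl_get(key, namespace="ccp", root="/galera/k8scluster"):
    """Return the argv list that reads one key under the Galera root."""
    return ["etcdctl", "--endpoints", etcd_endpoint(namespace),
            "get", "{}/{}".format(root, key)]


for key in ("state", "leader", "uuid"):
    print(shlex.join(etcdctl_get(key)))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting issues.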

Node restart

In most cases, it should be safe to restart a single Galera node. If you need to do it for some reason, just delete the pod via kubectl:

kubectl delete pod POD_NAME

Full cluster restart

In some cases, you may need to restart the whole cluster. Make sure you have a backup before doing this. To do this, set the cluster state to "RECOVERY":

etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 set /galera/k8scluster/state RECOVERY

After that, restart all Galera pods:

kubectl delete pod POD1_NAME POD2_NAME POD3_NAME

Once that is done, the Galera cluster will be rebuilt and should be operational.

Note

For more info about cluster recovery please refer to the "Supported scenarios" section.
