Add Galera documentation
Change-Id: I72222a1c20622ad0304ba2c6ab8984ac0ea01093

@@ -0,0 +1,286 @@
.. _galera:

==================
MySQL Galera Guide
==================

This guide provides an overview of the Galera implementation in CCP.

Overview
~~~~~~~~

Galera Cluster is a synchronous multi-master database cluster, based on
synchronous replication and MySQL/InnoDB. When Galera Cluster is in use, you
can direct reads and writes to any node, and you can lose any individual node
without interruption in operations and without the need to handle complex
failover procedures.

CCP implementation details
~~~~~~~~~~~~~~~~~~~~~~~~~~

Entrypoint script
-----------------

To handle all required logic, CCP has a dedicated entrypoint script for
Galera and its side-containers. Because of that, Galera pods are slightly
different from the rest of the CCP pods. For example, the Galera container
still uses the CCP global entrypoint, but that entrypoint executes the Galera
entrypoint, which runs MySQL and handles all required logic, such as
bootstrapping and failure detection.

Galera pod
----------

Each Galera pod consists of 3 containers:

* galera
* galera-checker
* galera-haproxy

**galera** - a container which runs Galera itself.

**galera-checker** - a container with the galera-checker script. It is used to
check readiness and liveness of the Galera node.

**galera-haproxy** - a container with a haproxy instance.

.. NOTE:: More info about each container is available in the
   "Galera containers" section.

Etcd usage
----------

The current implementation uses etcd to store the cluster state. The default
etcd root directory is ``/galera/k8scluster``.

Additional keys and directories are:

* **leader** - key with the IP address of the current leader. The leader is
  just a single, random Galera node, which haproxy will use as a backend.
* **nodes/** - directory with the current Galera nodes. Each node key will be
  named after the IP address of the node and its value will be the Unix time
  of the key creation.
* **queue/** - directory with the current Galera nodes waiting in the
  recovery queue. This is needed to ensure that all nodes are ready before
  looking for the node with the highest seqno. Each node key will be named
  after the IP address of the node and its value will be the Unix time of the
  key creation.
* **seqno/** - directory with the current Galera nodes' seqnos. Each node key
  will be named after the IP address of the node and its value will be the
  seqno of the node's data.
* **state** - key with the current cluster state. Can be "STEADY", "BUILDING"
  or "RECOVERY".
* **uuid** - key with the current uuid of the Galera cluster. If a new node
  has a different uuid, this indicates a split brain situation. Nodes with
  the wrong uuid will be destroyed.
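
For illustration, this is roughly how the layout can be inspected with the
etcdctl v2 client (the node IP address below is a placeholder):

::

    $ etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -r -p /galera/k8scluster
    /galera/k8scluster/leader
    /galera/k8scluster/state
    /galera/k8scluster/uuid
    /galera/k8scluster/nodes/
    /galera/k8scluster/nodes/10.233.64.10
    /galera/k8scluster/seqno/
    /galera/k8scluster/queue/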

Galera containers
~~~~~~~~~~~~~~~~~

galera
------

This container runs the Galera daemon and handles all the bootstrapping,
reconnecting and recovery logic.

At the start of the container, it checks for the ``init.ok`` file in the
Galera data directory. If this file doesn't exist, it removes all files from
the data directory and runs the MySQL init to create the base MySQL data
files; after that, it starts the mysqld daemon without networking and sets
the needed permissions for the expected users.

If the ``init.ok`` file is found, it runs ``mysqld_safe --wsrep-recover``
to recover the Galera-related information and write it to the
``grastate.dat`` file.

After that, it checks the cluster state and, depending on the current state,
chooses the required scenario.
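
A simplified, illustrative sketch of this startup decision (this is not the
actual CCP entrypoint; the data directory path and the init command are
assumptions):

::

    #!/bin/bash
    # Illustrative sketch only; the real logic lives in the CCP Galera
    # entrypoint script. DATADIR and the init command are assumptions.
    DATADIR=/var/lib/mysql

    if [ ! -f "$DATADIR/init.ok" ]; then
        # First start: wipe the datadir and create the base MySQL data files.
        rm -rf "$DATADIR"/*
        mysql_install_db --datadir="$DATADIR"
        # Start mysqld without networking to set up the expected users.
        mysqld_safe --skip-networking &
        # ... set permissions for the expected users here ...
        touch "$DATADIR/init.ok"
    else
        # Subsequent starts: recover the Galera position into grastate.dat.
        mysqld_safe --wsrep-recover
    fi
    # Next: read the cluster state from etcd and choose a scenario
    # (bootstrap, join or recovery).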

galera-checker
--------------

This container is used for liveness and readiness checks of the Galera pod.

To check whether the Galera pod is ready, it checks the following things:

#. wsrep_local_state_comment = "Synced"
#. wsrep_evs_state = "OPERATIONAL"
#. wsrep_connected = "ON"
#. wsrep_ready = "ON"
#. wsrep_cluster_state_uuid = uuid in the etcd
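
You can inspect the same status variables manually from inside the ``galera``
container (depending on your credentials setup), for example:

::

    kubectl exec POD_NAME -c galera -- mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"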

To check whether the Galera pod is alive, it checks the following things:

#. If the current cluster state is not "STEADY" - it skips the liveness
   check.
#. If it detects that an SST sync is in progress - it skips the liveness
   check.
#. If it detects that there is no MySQL pid file yet - it skips the liveness
   check.
#. If the node's "wsrep_cluster_state_uuid" differs from the etcd one - it
   kills the Galera container, since it's a "split brain" situation.
#. If "wsrep_local_state_comment" is "Joined", and the previous state was
   "Joined" too - it kills the Galera container, since it can't finish
   joining the cluster for some reason.
#. If it catches any exception during the checks - it kills the Galera
   container.

If all checks pass, the Galera pod is considered alive.

galera-haproxy
--------------

This container runs the haproxy daemon, which is used to send all traffic
to a single Galera pod.

This is needed to avoid deadlocks and stale reads. A "leader" is chosen out
of all available Galera pods, and once the leader is chosen, all haproxy
instances update their configuration to point to it.
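
To see which backend the proxy currently points to, you could inspect the
generated configuration inside the container (the config path below is an
assumption; check your image for the actual location):

::

    kubectl exec POD_NAME -c galera-haproxy -- cat /etc/haproxy/haproxy.cfg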

Supported scenarios
~~~~~~~~~~~~~~~~~~~

Initial bootstrap
-----------------

In this scenario, there is no working Galera cluster yet. Each node tries to
acquire the lock in etcd; the first one to get it starts the cluster
bootstrapping. After it's done, the next node gets the lock and connects to
the existing cluster.

.. NOTE:: During the bootstrap, the state of the cluster will be "BUILDING".
   It will be changed to "STEADY" after the last node connects.
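
You can follow these state transitions as they happen, for example:

::

    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 watch /galera/k8scluster/state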

Re-connecting to the existing cluster
-------------------------------------

In this scenario, a Galera cluster is already available. In most cases, it
will be a node re-connecting after some failure, such as a node reboot. Each
node tries to get the lock in etcd; once it acquires the lock, the node
connects to the existing cluster.

.. NOTE:: During this scenario, the state of the cluster will be "STEADY".

Recovery
--------

This scenario can be triggered in two ways:

* The operator manually sets the cluster state in etcd to "RECOVERY".
* A new node does a few checks before bootstrapping; if it finds that the
  cluster state is "STEADY", but there are zero nodes in the cluster, it
  assumes that the cluster has been destroyed somehow and recovery needs to
  run. In that case, it sets the state to "RECOVERY" and starts the recovery
  scenario.

During the recovery scenario, cluster bootstrapping is different from the
"Initial bootstrap". In this scenario, each node looks for its "seqno", which
is basically the registered number of the transactions. The node with the
highest seqno will bootstrap the cluster and the other nodes will join it, so
in the end we will have the latest data available before the cluster
destruction.
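
The seqno of a node can be seen in its ``grastate.dat`` file (the data
directory path below is an assumption; the uuid and seqno values are
placeholders):

::

    $ cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    f7aa3d89-0b3e-11e7-9d37-f7dc8d4b3a2e
    seqno:   1523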

.. NOTE:: During the recovery bootstrap, the state of the cluster will be
   "RECOVERY". It will be changed to "STEADY" after the last node connects.

There is an option to manually choose the node to recover data from. For
details, please see the "Force bootstrap" section in "Advanced features".

Advanced features
~~~~~~~~~~~~~~~~~

Cluster size
------------

By default, the Galera cluster size is 3 nodes. This is optimal for most
cases. If you want to change it to some custom number, you need to override
the **cluster_size** variable in the **percona** tree, for example:

::

    configs:
      percona:
        cluster_size: 5

.. NOTE:: Cluster size should be an odd number. A cluster size of more than 5
   nodes will lead to high latency for write operations.

Force bootstrap
---------------

Sometimes operators may want to manually specify the Galera node which
recovery should be done from. In that case, you need to override the
**force_bootstrap** variable in the **percona** tree, for example:

::

    configs:
      percona:
        force_bootstrap:
          enabled: true
          node: NODE_NAME

**NODE_NAME** should be the name of the k8s node which will run the Galera
node with the required data.
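
You can list the available k8s node names, for example:

::

    kubectl get nodes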

Troubleshooting
~~~~~~~~~~~~~~~

Operating Galera requires some advanced knowledge of MySQL and of general
clustering concepts. In most cases, we expect that Galera will "self-heal"
itself, in the worst case via a restart, a full resync and a reconnection to
the cluster.

Our readiness and liveness scripts should cover this and not allow a
misconfigured or non-operational node to receive production traffic.

Yet it's possible that some failure scenarios are not covered, and some
manual actions could be required to fix them.

Check the logs
--------------

Each container of the Galera pod writes detailed logs to stdout. You can
read them via ``kubectl logs POD_NAME -c CONT_NAME``. Make sure you check
both the ``galera`` container logs and the ``galera-checker`` ones.

Additionally, you should check the MySQL logs in
``/var/log/ccp/mysql/mysql.log``.
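
For example, to read the last lines of the MySQL log:

::

    kubectl exec POD_NAME -c galera -- tail -n 100 /var/log/ccp/mysql/mysql.log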

Check the etcd state
--------------------

Galera keeps its state in etcd, and it can be useful to check what is going
on in etcd right now. Assuming that you're using the **ccp** namespace, you
can check the etcd state using these commands:

::

    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -r -p --sort /galera
    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/state
    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/leader
    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/uuid

Node restart
------------

In most cases, it should be safe to restart a single Galera node. If you need
to do it for some reason, just delete the pod via kubectl:

::

    kubectl delete pod POD_NAME

Full cluster restart
--------------------

In some cases, you may need to restart the whole cluster. Make sure you have
a backup before doing this. To do this, set the cluster state to "RECOVERY":

::

    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 set /galera/k8scluster/state RECOVERY

After that, restart all Galera pods:

::

    kubectl delete pod POD1_NAME POD2_NAME POD3_NAME

Once that is done, the Galera cluster will be rebuilt and should be
operational.

.. NOTE:: For more info about cluster recovery, please refer to the
   "Supported scenarios" section.

@@ -25,6 +25,7 @@ Advanced topics

:maxdepth: 1

deploying_multiple_parallel_environments
galera
ceph
ceph_cluster
using_calico_instead_of_ovs