Add Galera documentation
Change-Id: I72222a1c20622ad0304ba2c6ab8984ac0ea01093

@@ -0,0 +1,286 @@
.. _galera:

==================
MySQL Galera Guide
==================

This guide provides an overview of the Galera implementation in CCP.

Overview
~~~~~~~~

Galera Cluster is a synchronous multi-master database cluster, based on
synchronous replication and MySQL/InnoDB. When Galera Cluster is in use, you
can direct reads and writes to any node, and you can lose any individual node
without interruption in operations and without the need to handle complex
failover procedures.

CCP implementation details
~~~~~~~~~~~~~~~~~~~~~~~~~~

Entrypoint script
-----------------

To handle all required logic, CCP has a dedicated entrypoint script for
Galera and its side-containers. Because of that, Galera pods are slightly
different from the rest of the CCP pods. For example, the Galera container
still uses the CCP global entrypoint, but that entrypoint executes the Galera
entrypoint, which runs MySQL and handles all required logic, such as
bootstrapping and failure detection.

Galera pod
----------

Each Galera pod consists of 3 containers:

* galera
* galera-checker
* galera-haproxy

**galera** - a container which runs Galera itself.

**galera-checker** - a container with the galera-checker script. It is used to
check readiness and liveness of the Galera node.

**galera-haproxy** - a container with a haproxy instance.

.. NOTE:: More info about each container is available in the
   "Galera containers" section.

Etcd usage
----------

The current implementation uses etcd to store the cluster state. The default
etcd root directory is ``/galera/k8scluster``.

Additional keys and directories are:

* **leader** - key with the IP address of the current leader. The leader is
  just a single, random Galera node, which haproxy will use as a backend.
* **nodes/** - directory with the current Galera nodes. Each node key will be
  named after the IP address of the node and its value will be the Unix time
  of the key creation.
* **queue/** - directory with the current Galera nodes waiting in the
  recovery queue. This is needed to ensure that all nodes are ready before
  looking for the node with the highest seqno. Each node key will be named
  after the IP address of the node and its value will be the Unix time of the
  key creation.
* **seqno/** - directory with the current Galera nodes' seqnos. Each node key
  will be named after the IP address of the node and its value will be the
  seqno of the node's data.
* **state** - key with the current cluster state. Can be "STEADY", "BUILDING"
  or "RECOVERY".
* **uuid** - key with the current uuid of the Galera cluster. If a new node
  has a different uuid, this indicates a split brain situation. Nodes with
  the wrong uuid will be destroyed.
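
For illustration, this is roughly how the layout can be inspected with the
etcdctl v2 client (the node IP address below is a placeholder):

::

    $ etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -r -p /galera/k8scluster
    /galera/k8scluster/leader
    /galera/k8scluster/state
    /galera/k8scluster/uuid
    /galera/k8scluster/nodes/
    /galera/k8scluster/nodes/10.233.64.10
    /galera/k8scluster/seqno/
    /galera/k8scluster/queue/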

Galera containers
~~~~~~~~~~~~~~~~~

galera
------

This container runs the Galera daemon and handles all the bootstrapping,
reconnecting and recovery logic.

At the start of the container, it checks for the ``init.ok`` file in the
Galera data directory. If this file doesn't exist, it removes all files from
the data directory and runs the MySQL init to create the base MySQL data
files; after that, it starts the mysqld daemon without networking and sets
the needed permissions for the expected users.

If the ``init.ok`` file is found, it runs ``mysqld_safe --wsrep-recover``
to recover the Galera-related information and write it to the
``grastate.dat`` file.

After that, it checks the cluster state and, depending on the current state,
chooses the required scenario.
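
A simplified, illustrative sketch of this startup decision (this is not the
actual CCP entrypoint; the data directory path and the init command are
assumptions):

::

    #!/bin/bash
    # Illustrative sketch only; the real logic lives in the CCP Galera
    # entrypoint script. DATADIR and the init command are assumptions.
    DATADIR=/var/lib/mysql

    if [ ! -f "$DATADIR/init.ok" ]; then
        # First start: wipe the datadir and create the base MySQL data files.
        rm -rf "$DATADIR"/*
        mysql_install_db --datadir="$DATADIR"
        # Start mysqld without networking to set up the expected users.
        mysqld_safe --skip-networking &
        # ... set permissions for the expected users here ...
        touch "$DATADIR/init.ok"
    else
        # Subsequent starts: recover the Galera position into grastate.dat.
        mysqld_safe --wsrep-recover
    fi
    # Next: read the cluster state from etcd and choose a scenario
    # (bootstrap, join or recovery).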

galera-checker
--------------

This container is used for liveness and readiness checks of the Galera pod.

To check whether the Galera pod is ready, it checks the following things:

#. wsrep_local_state_comment = "Synced"
#. wsrep_evs_state = "OPERATIONAL"
#. wsrep_connected = "ON"
#. wsrep_ready = "ON"
#. wsrep_cluster_state_uuid = uuid in the etcd
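
You can inspect the same status variables manually from inside the ``galera``
container (depending on your credentials setup), for example:

::

    kubectl exec POD_NAME -c galera -- mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"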

To check whether the Galera pod is alive, it checks the following things:

#. If the current cluster state is not "STEADY" - it skips the liveness
   check.
#. If it detects that an SST sync is in progress - it skips the liveness
   check.
#. If it detects that there is no MySQL pid file yet - it skips the liveness
   check.
#. If the node's "wsrep_cluster_state_uuid" differs from the etcd one - it
   kills the Galera container, since it's a "split brain" situation.
#. If "wsrep_local_state_comment" is "Joined", and the previous state was
   "Joined" too - it kills the Galera container, since it can't finish
   joining the cluster for some reason.
#. If it catches any exception during the checks - it kills the Galera
   container.

If all checks pass, the Galera pod is considered alive.

galera-haproxy
--------------

This container runs the haproxy daemon, which is used to send all traffic
to a single Galera pod.

This is needed to avoid deadlocks and stale reads. A "leader" is chosen out
of all available Galera pods, and once the leader is chosen, all haproxy
instances update their configuration to point to it.
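
To see which backend the proxy currently points to, you could inspect the
generated configuration inside the container (the config path below is an
assumption; check your image for the actual location):

::

    kubectl exec POD_NAME -c galera-haproxy -- cat /etc/haproxy/haproxy.cfg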

Supported scenarios
~~~~~~~~~~~~~~~~~~~

Initial bootstrap
-----------------

In this scenario, there is no working Galera cluster yet. Each node tries to
acquire the lock in etcd; the first one to get it starts the cluster
bootstrapping. After it's done, the next node gets the lock and connects to
the existing cluster.

.. NOTE:: During the bootstrap, the state of the cluster will be "BUILDING".
   It will be changed to "STEADY" after the last node connects.
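
You can follow these state transitions as they happen, for example:

::

    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 watch /galera/k8scluster/state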

Re-connecting to the existing cluster
-------------------------------------

In this scenario, a Galera cluster is already available. In most cases, it
will be a node re-connecting after some failure, such as a node reboot. Each
node tries to get the lock in etcd; once it acquires the lock, the node
connects to the existing cluster.

.. NOTE:: During this scenario, the state of the cluster will be "STEADY".

Recovery
--------

This scenario can be triggered in two ways:

* The operator manually sets the cluster state in etcd to "RECOVERY".
* A new node does a few checks before bootstrapping; if it finds that the
  cluster state is "STEADY", but there are zero nodes in the cluster, it
  assumes that the cluster has been destroyed somehow and recovery needs to
  run. In that case, it sets the state to "RECOVERY" and starts the recovery
  scenario.

During the recovery scenario, cluster bootstrapping is different from the
"Initial bootstrap". In this scenario, each node looks for its "seqno", which
is basically the registered number of the transactions. The node with the
highest seqno will bootstrap the cluster and the other nodes will join it, so
in the end we will have the latest data available before the cluster
destruction.
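
The seqno of a node can be seen in its ``grastate.dat`` file (the data
directory path below is an assumption; the uuid and seqno values are
placeholders):

::

    $ cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    f7aa3d89-0b3e-11e7-9d37-f7dc8d4b3a2e
    seqno:   1523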

.. NOTE:: During the recovery bootstrap, the state of the cluster will be
   "RECOVERY". It will be changed to "STEADY" after the last node connects.

There is an option to manually choose the node to recover data from. For
details, please see the "Force bootstrap" section in "Advanced features".

Advanced features
~~~~~~~~~~~~~~~~~

Cluster size
------------

By default, the Galera cluster size is 3 nodes. This is optimal for most
cases. If you want to change it to some custom number, you need to override
the **cluster_size** variable in the **percona** tree, for example:

::

    configs:
      percona:
        cluster_size: 5

.. NOTE:: Cluster size should be an odd number. A cluster size of more than 5
   nodes will lead to high latency for write operations.

Force bootstrap
---------------

Sometimes operators may want to manually specify the Galera node which
recovery should be done from. In that case, you need to override the
**force_bootstrap** variable in the **percona** tree, for example:

::

    configs:
      percona:
        force_bootstrap:
          enabled: true
          node: NODE_NAME

**NODE_NAME** should be the name of the k8s node which will run the Galera
node with the required data.
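
You can list the available k8s node names, for example:

::

    kubectl get nodes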

Troubleshooting
~~~~~~~~~~~~~~~

Operating Galera requires some advanced knowledge of MySQL and of general
clustering concepts. In most cases, we expect that Galera will "self-heal"
itself, in the worst case via a restart, a full resync and a reconnection to
the cluster.

Our readiness and liveness scripts should cover this and not allow a
misconfigured or non-operational node to receive production traffic.

Yet it's possible that some failure scenarios are not covered, and some
manual actions could be required to fix them.

Check the logs
--------------

Each container of the Galera pod writes detailed logs to stdout. You can
read them via ``kubectl logs POD_NAME -c CONT_NAME``. Make sure you check
both the ``galera`` container logs and the ``galera-checker`` ones.

Additionally, you should check the MySQL logs in
``/var/log/ccp/mysql/mysql.log``.
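
For example, to read the last lines of the MySQL log:

::

    kubectl exec POD_NAME -c galera -- tail -n 100 /var/log/ccp/mysql/mysql.log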

Check the etcd state
--------------------

Galera keeps its state in etcd, and it can be useful to check what is going
on in etcd right now. Assuming that you're using the **ccp** namespace, you
can check the etcd state using these commands:

::

    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 ls -r -p --sort /galera
    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/state
    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/leader
    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 get /galera/k8scluster/uuid

Node restart
------------

In most cases, it should be safe to restart a single Galera node. If you need
to do it for some reason, just delete the pod via kubectl:

::

    kubectl delete pod POD_NAME

Full cluster restart
--------------------

In some cases, you may need to restart the whole cluster. Make sure you have
a backup before doing this. To do this, set the cluster state to "RECOVERY":

::

    etcdctl --endpoints http://etcd.ccp.svc.cluster.local:2379 set /galera/k8scluster/state RECOVERY

After that, restart all Galera pods:

::

    kubectl delete pod POD1_NAME POD2_NAME POD3_NAME

Once that is done, the Galera cluster will be rebuilt and should be
operational.

.. NOTE:: For more info about cluster recovery, please refer to the
   "Supported scenarios" section.

@@ -25,6 +25,7 @@ Advanced topics

:maxdepth: 1

deploying_multiple_parallel_environments
galera
ceph
ceph_cluster
using_calico_instead_of_ovs