..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============================================
Reliable Pacemaker Galera Resource Agent Script
===============================================

https://blueprints.launchpad.net/fuel/+spec/reliable-galera-ocf-script [1]_

This document captures the problems and requirements for a Pacemaker OCF
“Resource Agent” (hereafter RA) that improves Galera Cluster management under
the Pacemaker resource manager.

Problem description
===================

* Reboot of the whole cluster (power outage scenario)

  - The RA script doesn't determine which node holds the latest Galera GTID.
    It always relies on the “primary controller” as a donor. Under some
    circumstances Pacemaker cannot assemble the Galera cluster.

* Reboot of any node in the cluster

* Addition of a new node to an active cluster

* Advanced features

  - Currently the Puppet manifests use *cs_shadow* as a method for cluster
    management. It's not possible to use *crm_attribute* to store attributes
    in the configuration, because *cs_shadow* will revert the values.

Proposed change
===============

* Write a new RA script for Galera with the following requirements

  - The RA script can bootstrap the cluster even when wsrep_cluster_address
    lists all nodes.

  - The RA script introduces a timeout: Pacemaker waits 60-120 seconds for
    all nodes specified in the CIB to come online after a reboot or outage.

  - Once the timeout expires, the RA script must start the election of the
    Primary Component, which is the node with the latest GTID. The timeout is
    specified as a node attribute and can be changed by the administrator. If
    all nodes specified in the CIB are up, the election starts immediately.

  - The RA script determines the Galera GTID state and sets it as a node
    attribute. The RA gets the GTID from **mysqld --wsrep-recover** or the
    SQL query **SHOW STATUS LIKE 'wsrep_local_state_uuid'**

  - The node with the latest GTID becomes the Galera Primary Controller. It
    is started with an empty gcomm:// string. All other nodes join the
    Galera Primary Controller to synchronize their state.

  - If a node boots after the timeout has expired, it discards its
    configuration. This usually happens when the node is stuck performing
    *fsck*.

  - When a new node is added to the cluster, it joins the cluster normally.

* Remove cs_shadow

  - Remove cs_shadow from the manifests so that node attributes can be stored.
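
The election rules above can be sketched as follows. This is only an
illustration in Python, not the RA script itself (OCF agents are usually shell
scripts). The function names, the tie-breaking rule, and the assumption that
**mysqld --wsrep-recover** logs a line of the form
``WSREP: Recovered position: <uuid>:<seqno>`` are not taken from this spec.

.. code-block:: python

    import re
    from typing import Dict, Iterable, Optional, Tuple

    # Assumed log line: "WSREP: Recovered position: <uuid>:<seqno>"
    RECOVERED_RE = re.compile(r"Recovered position:\s*([0-9a-fA-F-]+):(-?\d+)")


    def parse_recovered_gtid(log_text: str) -> Optional[Tuple[str, int]]:
        """Return (uuid, seqno) found in the recovery output, or None."""
        match = RECOVERED_RE.search(log_text)
        if match is None:
            return None
        return match.group(1), int(match.group(2))


    def ready_to_elect(expected: Iterable[str], online: Iterable[str],
                       waited_s: float, timeout_s: float) -> bool:
        """Elect at once if every CIB node is up, else only after the timeout."""
        return set(expected) <= set(online) or waited_s >= timeout_s


    def elect_primary(gtids: Dict[str, Tuple[str, int]]) -> Optional[str]:
        """Pick the node with the highest seqno.

        Ties go to the alphabetically first node name (a hypothetical rule).
        Nodes reporting different cluster UUIDs are not handled here.
        """
        if not gtids:
            return None
        return max(sorted(gtids), key=lambda node: gtids[node][1])

In this sketch the elected node would be the one started with the empty
gcomm:// string, and every other node would join it.
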

Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Upgrade impact
--------------

This change doesn't affect the master node upgrade. The OpenStack upgrade
should be disabled, as this change affects the HA logic.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Sergii Golovatiuk (sgolovatiuk@mirantis.com)

Work Items
----------

- Write the Galera OCF script
- Perform the full set of destructive tests

Dependencies
============

Testing
=======

The full set of destructive tests: reboot a single node, reboot the whole
cluster, and add a new node from the Fuel UI.

Documentation Impact
====================

The documentation should describe how to increase or decrease the bootstrap
timeout.
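
As a rough illustration of what such documentation might show, the sketch
below builds a *crm_attribute* command line that updates the timeout on one
node. The attribute name ``galera-bootstrap-timeout`` is hypothetical, since
this spec does not name the attribute, and the helper only builds the argument
list without running it.

.. code-block:: python

    from typing import List

    # Hypothetical attribute name; the spec does not define one.
    TIMEOUT_ATTR = "galera-bootstrap-timeout"


    def timeout_update_command(node: str, seconds: int) -> List[str]:
        """Build a crm_attribute call that sets the bootstrap timeout on a node."""
        if seconds <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        return [
            "crm_attribute",
            "--node", node,
            "--name", TIMEOUT_ATTR,
            "--update", str(seconds),
        ]

The resulting list could be passed to ``subprocess.run(cmd, check=True)`` on a
cluster node.
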

References
==========

.. [1] https://blueprints.launchpad.net/fuel/+spec/reliable-galera-ocf-script