
Implements: spec charm-masakari

Spec for implementing charms for masakari (api and engine) and
masakari monitors (host, instance and process)

Change-Id: Ic99169e07b8a749e93467620c23183c6e8a9dbe5
Andrew McLeod, 1 month ago
parent commit f6e4c168a7

1 changed file with 205 additions and 0 deletions:
  specs/stein/implemented/masakari.rst (+205, -0)

..
  Copyright 2019 Canonical UK

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template.  If you have nothing to say
  for a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html To test out your formatting, see
  http://www.tele3.cz/jbar/rest/rest.html

===============================
Masakari
===============================

Masakari provides an API and engine which receive messages from monitors
(agents) on compute nodes to enable:

* instance evacuation on host failure (masakari-hostmonitor) - requires shared
  storage

* instance restart on certain instance errors (masakari-instancemonitor)

* OpenStack-related process monitoring and restarting
  (masakari-processmonitor)

This spec is for two charms which will install the masakari API and engine,
and the monitors respectively.

Problem Description
===================

If an OpenStack hypervisor fails, there is no method of automatic recovery.
Masakari provides a way of migrating instances from a failed host to a
functional host, when used in conjunction with a cluster manager.
There are similar issues with instance and process failures, which masakari
also addresses.

Proposed Change
===============

A solution using corosync and pacemaker on each hypervisor, along with
masakari-hostmonitor, has been validated. Unfortunately, a cluster where each
node runs the full cluster stack is only recommended up to around 16 nodes:

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/#_overview

The scale issue with this initial solution can be mitigated by breaking
compute nodes up into smaller clusters, but this should be seen as a
workaround.

The suggested approach for scaling to a greater number of nodes is to use
pacemaker-remote on the hypervisors rather than the full cluster stack.
Unfortunately, masakari-hostmonitor does not currently work with
pacemaker-remote due to the following bug:

https://bugs.launchpad.net/masakari-monitors/+bug/1728527

This work will involve the following new charms:

* masakari (provides the API, engine and python-masakariclient).
  The masakari charm will have identity-service, amqp and shared-db relations.

* masakari-monitors (optionally provides one or all of hostmonitor,
  instancemonitor and processmonitor).
  The masakari-monitors charm will be related to nova-compute via the juju-info
  relation and related to keystone via the identity-credentials relation. The
  monitors need credentials for posting notifications to the masakari API
  service. When not using pacemaker-remote, masakari-monitors will rely on the
  hacluster charm having been related to the nova-compute charm via juju-info.

* pacemaker-remote.
  This charm simply installs the pacemaker-remote service and requires a
  relation with a fully fledged cluster.
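
As an illustration only, a deployment of these charms might be wired up with
the relations listed above roughly as follows. The application and endpoint
names are assumptions for the purpose of the sketch, not a final interface
definition.

.. code-block:: bash

    # Illustrative sketch; application and endpoint names are assumptions.
    juju deploy masakari
    juju deploy masakari-monitors
    juju deploy pacemaker-remote

    # masakari API/engine relations
    juju add-relation masakari:identity-service keystone:identity-service
    juju add-relation masakari:amqp rabbitmq-server:amqp
    juju add-relation masakari:shared-db percona-cluster:shared-db

    # masakari-monitors and pacemaker-remote run on the compute nodes as
    # subordinates (juju-info relation); the monitors obtain credentials
    # from keystone for posting notifications.
    juju add-relation masakari-monitors nova-compute
    juju add-relation masakari-monitors:identity-credentials keystone:identity-credentials
    juju add-relation pacemaker-remote nova-compute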

This work will involve the following new relations:

* hacluster <-> pacemaker-remote
  This relation will allow the pacemaker-remote charm to inform the hacluster
  charm of the location of remote nodes to be added to the cluster. It will
  also be used to expose the pacemaker-remote nodes' STONITH information and
  whether or not the pacemaker-remote nodes should run resources. In the case
  of a masakari deployment the pacemaker-remote nodes should be set to
  not run resources.
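
A rough operational sketch of this relation and its effect is shown below.
The juju and crmsh commands are illustrative assumptions (resource and node
names are placeholders), not the eventual charm implementation.

.. code-block:: bash

    # Relate the cluster to the remote nodes (endpoint names omitted; the
    # interface is the new pacemaker-remote relation proposed here).
    juju add-relation hacluster pacemaker-remote

    # One way to keep cluster resources off a remote node is a -inf
    # location constraint per resource (crmsh syntax, placeholder names):
    sudo crm configure location ban-res1-on-compute-1 res1 -inf: compute-1

    # Alternatively, with an opt-in cluster, resources only run where a
    # positive location constraint exists:
    sudo crm configure property symmetric-cluster=false
    sudo crm configure location res1-on-node1 res1 200: cluster-node-1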

The following charm updates will also be needed:

* hacluster charm.
  The hacluster charm needs to be able to support adding pacemaker-remote nodes
  to the cluster and also configuring resources such that, if requested, no
  resources are run on the remote nodes.

* hacluster charm.
  The hacluster charm needs to be able to support creating MAAS STONITH
  devices.

Additional work:

* Fix masakari-hostmonitor to work with pacemaker-remote.
  As mentioned, masakari-hostmonitor does not currently work with
  pacemaker-remote. A patch will need to be written to fix this and, ideally,
  landed upstream.

* Write a MAAS STONITH driver.
  It is important that instances using shared storage only run on one compute
  node at a time: a split-brain situation in which two copies of an instance
  write to the same storage simultaneously would result in data corruption.
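
Purely to illustrate where such a driver would plug in, a STONITH device of
this kind might eventually be configured along the following lines. The
plugin name, its parameters and the MAAS URL are hypothetical, since the
driver is still to be written.

.. code-block:: bash

    # Hypothetical example: 'stonith:external/maas' and its parameters are
    # placeholders for the driver proposed by this spec.
    sudo crm configure primitive st-maas stonith:external/maas \
        params url='http://maas.example.com:5240/MAAS' apikey='<maas-api-key>' \
        hostnames='compute-1 compute-2' \
        op monitor interval=25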

These charms will support TLS for API communications.

Alternatives
------------

Although not as feature-rich as masakari, much of the functionality can be
achieved using pacemaker/corosync resource configuration.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  gnuoy

Secondary assignee:
  admcleod

Gerrit Topic
------------

Use Gerrit topic "masakari" or "masakari-monitors" as appropriate for all
patches related to this spec.

.. code-block:: bash

    git-review -t masakari
    git-review -t masakari-monitors

Work Items
----------

* Masakari charm cleanup
* Add domain id information to the identity-credentials relation. The keystone
  charm and the identity-credentials interface will both need updating.
* Masakari-monitors charm cleanup
* Add pacemaker-remote charm
* Add pacemaker-remote interface
* Add functional and unit tests
* Mojo specs for full stack functional testing
* Write patch to fix pacemaker-remote support in masakari-hostmonitor
* Write MAAS STONITH plugin
* Extend hacluster charm to support registering pacemaker-remote nodes
* Extend hacluster charm to support only running resources on a subset
  of available nodes
* Extend hacluster charm to support creating MAAS STONITH devices
* Write deployment guide instructions
* Add new charms and interfaces to OpenStack Gerrit

Repositories
------------

git://git.openstack.org/openstack/charm-masakari

git://git.openstack.org/openstack/charm-masakari-monitors

Documentation
-------------

Both charms will contain README.md files with instructions.

Security
--------

We will have credentials for the cloud stored on the compute node. Escaping
from a guest to the host in this case could allow a user to compromise the
cloud by signalling the masakari API about one or more false compute node
failures. Keystone credentials used by the placement API are already stored
on the compute node, so this does not increase the attack surface, but it is
worth mentioning for completeness.

We will need to enable a certificate relation on the nova-compute host to
facilitate the use of the vault charm to enable masakari SSL functionality.
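
For example, the endpoint names below assume the standard 'certificates'
(tls-certificates) relation that the vault charm provides to other OpenStack
API charms; the masakari side of the relation is an assumption of this
sketch.

.. code-block:: bash

    # Illustrative only: request API certificates from vault.
    juju add-relation masakari:certificates vault:certificates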

Testing
-------

Code changes will be covered by unit tests; functional testing will be done
using a combination of Zaza and Mojo specifications.
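
Assuming the charms adopt the tox and zaza layout used by other OpenStack
charms (the tox target names below are assumptions), tests would be run along
these lines:

.. code-block:: bash

    # Unit tests and lint (target names are assumptions).
    tox -e py3
    tox -e pep8

    # Zaza functional tests, either via tox or directly, keeping the
    # model around for debugging.
    tox -e func
    functest-run-suite --keep-model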

Dependencies
============

- Requires cluster management such as corosync or pacemaker. At the very
  least, the hacluster charm is required.

- Shared storage is required.

- Some administrative intervention will be required after a host failure.
