add compute node monitoring spec

Change-Id: Ifdaa902346a4ac52dad22f13053362cb8f0bb2da
Author: Adam Spiers

specs/newton/approved/newton-instance-ha-host-monitoring-spec.rst

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===============
Host Monitoring
===============

The purpose of this spec is to describe a method for monitoring the
health of OpenStack compute nodes.

Problem description
===================

Monitoring compute node health is essential for providing high
availability for VMs. A health monitor must be able to detect crashes,
freezes, network connectivity issues, and any other OS-level errors on
the compute node which prevent it from running the services necessary
to host existing or new VMs.

Use Cases
---------

As a cloud operator, I would like to provide my users with highly
available VMs to meet high SLA requirements. Therefore, I need my
compute nodes automatically monitored for hardware failure, kernel
crashes and hangs, and other failures at the operating system level.
Any failure event detected needs to be passed to a compute host
recovery workflow service which can then take the appropriate remedial
action.

For example, if a compute host fails (or appears to have failed, as
far as the monitor can detect), the recovery service will typically
identify all VMs which were running on that compute host, and may take
any of the following actions:

- Fence the host (STONITH) to eliminate the risk of a still-running
  instance being resurrected elsewhere (see the next step) and
  simultaneously running in two places as a result, which could cause
  data corruption.

- Resurrect some or all of the VMs on other compute hosts.

- Notify the cloud operator.

- Notify affected users.

- Make the failure and recovery events available to telemetry /
  auditing systems.

Scope
-----

This spec only addresses monitoring the health of the compute node
hardware and basic operating system functions, and notifying
appropriate recovery components in the case of any failure.

Monitoring the health of ``nova-compute`` and other processes it
depends on, such as ``libvirtd`` and anything else at or above the
hypervisor layer, including individual VMs, will be covered by
separate specs, and is therefore out of scope for this spec.

Any kind of recovery workflow is also out of scope and will be covered
by separate specs.

This spec has the following goals:

1. Encourage all implementations of compute node monitoring, whether
   upstream or downstream, to output failure notifications in a
   standardized manner.  This will allow cloud vendors and operators
   to implement HA of the compute plane via a collection of compatible
   components (of which one is compute node monitoring), whilst not
   being tied to any one implementation.

2. Provide details of and recommend a specific implementation which
   for the most part already exists and is proven to work.

3. Identify gaps in that implementation and the corresponding future
   work required.

Acceptance criteria
===================

Here the words "must", "should" etc. are used with the strict meanings
defined in `RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>`_.

- Compute nodes must be automatically monitored for hardware failure,
  kernel crashes and hangs, and other failures at the operating system
  level.

- The solution must scale to hundreds of compute hosts.

- Any failure event detected must cause the component responsible for
  alerting to send a notification to a configurable endpoint so that
  it can be consumed by the cloud operator's choice of compute node
  recovery workflow controller.

- If a failure notification is not accepted by the recovery component,
  it should be persisted within the monitoring/alerting components,
  and sending of the notification should be retried periodically until
  it succeeds.  This ensures that remediation of failures is never
  dropped due to temporary failure or other unavailability of any
  component.

- The alerting component must be extensible, in order to allow
  communication with multiple types of recovery workflow controller
  via a driver abstraction layer, with drivers for each type.  At
  least one driver must be implemented initially.

- One of the drivers should send notifications to an HTTP endpoint
  using a standardized JSON format as the payload.

- Another driver should send notifications to the `masakari API server
  <https://wiki.openstack.org/wiki/Masakari#Masakari_API_Design>`_.

Implementation
==============

The implementation described here was presented at OpenStack Day
Israel, June 2017; `this diagram
<https://aspiers.github.io/openstack-day-israel-2017-compute-ha/#/no-fence_evacuate>`_
from that presentation should assist in understanding the description
below.

Running a `pacemaker_remote
<http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Remote/>`_
service on each compute host allows it to be monitored by a central
Pacemaker cluster via a straightforward TCP connection.  This is an
ideal solution to this problem for the following reasons:

- Pacemaker can scale to handle a very large number of remote nodes.

- ``pacemaker_remote`` can be used simultaneously for monitoring and
  managing services on each compute host.

- ``pacemaker_remote`` is a very lightweight service which will not
  cause any significant increase in load on each compute host.

- Pacemaker has excellent support for fencing via a wide range of
  STONITH devices, and it is easy to extend support to other devices,
  as shown by the `fence_agents repository
  <https://github.com/ClusterLabs/fence-agents>`_.

- Pacemaker is easily extensible via OCF Resource Agents, which allow
  custom design of monitoring and of the automated reaction when those
  monitors fail.

- Many clouds will already be running one or more Pacemaker clusters
  on the control plane, as recommended by the |ha-guide|_, so
  deployment complexity is not significantly increased.

- This architecture is already implemented and proven via the
  commercially supported enterprise products RHEL OpenStack Platform
  and SUSE OpenStack Cloud, and via `masakari
  <https://github.com/openstack/masakari/blob/master/README.rst>`_,
  which is used in production deployments at NTT.

Since many different tools are currently in use for deploying
OpenStack with HA, configuration of Pacemaker is currently out of
scope for upstream projects, so the exact details are left as the
responsibility of each individual deployer.  Nevertheless, examples
of partial Pacemaker configurations are given below.

Fencing
-------

Fencing is technically outside the scope of this spec, in order to
allow any cloud operator to choose their own clustering technology
whilst remaining compliant and hence compatible with the notification
standard described here.  However, Pacemaker provides a particularly
convenient approach to fencing which can also be used to trigger the
failure notification, so it is described here in full.

Pacemaker already implements effective heartbeat monitoring of its
remote nodes via the TCP connection with ``pacemaker_remote``, so it
only remains to ensure that the correct steps are taken when the
monitor detects a failure:

1. Firstly, the compute host must be fenced via an appropriate STONITH
   agent, for the reasons stated above.

2. Once the host has been fenced, the monitor must mark the host as
   needing remediation, in a manner which is persisted to disk (in
   case of changes in cluster state during handling of the failure)
   and read/write-accessible by a separate alerting component, which
   can hand over responsibility for processing the failure to a
   recovery workflow controller by sending it the appropriate
   notification.

These steps should be implemented using two features of Pacemaker.
Firstly, its ``fencing_topology`` configuration directive can be used
to implement the second step as a custom fencing agent which is
triggered after the first step is complete.  For example, the custom
fencing agent might be set up via a Pacemaker ``primitive`` resource
such as:

.. code::

    primitive fence-nova stonith:fence_compute \
        params auth-url="http://cluster.my.cloud.com:5000/v3/" \
               domain=my.cloud.com \
               tenant-name=admin \
               endpoint-type=internalURL \
               login=admin \
               passwd=s3kr1t \
        op monitor interval=10m

and then it could be configured as the second device in the fencing
sequence:

.. code::

    fencing_topology compute1: stonith-compute1,fence-nova

Secondly, the ``fence_compute`` agent here should persist the marking of
the fenced compute host via `attrd
<http://clusterlabs.org/man/pacemaker/attrd_updater.8.html>`_, so that
a separate alerting component can transfer ownership of this host's
failure to a recovery workflow controller by sending it the
appropriate notification message.
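
The marking step could look something like the following sketch, which
merely builds the ``attrd_updater`` command lines.  The attribute name
``evacuate`` follows the existing ``fence_compute`` agent, and the
exact flags should be verified against the attrd_updater(8) man page
for the deployed Pacemaker version:

```python
# Illustrative sketch only: the attrd_updater invocations a custom
# fencing agent and the alerter might use to mark, query, and clear a
# failed host.  The "evacuate" attribute name follows the existing
# fence_compute agent; verify the flags against attrd_updater(8).

ATTRIBUTE = "evacuate"

def mark_failed_cmd(host):
    # Record that `host` has been fenced and awaits remediation.
    return ["attrd_updater", "-n", ATTRIBUTE, "-U", "yes", "-N", host]

def query_cmd(host):
    # Query whether `host` is still awaiting remediation.
    return ["attrd_updater", "-n", ATTRIBUTE, "-Q", "-N", host]

def clear_cmd(host):
    # Clear the mark once a recovery controller has taken ownership.
    return ["attrd_updater", "-n", ATTRIBUTE, "-D", "-N", host]
```

A real agent would execute these with ``subprocess.check_call`` and
handle errors; building the argument lists separately keeps the sketch
easy to test.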

It is worth noting that the ``fence_compute`` fencing agent `already
exists
<https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/compute/fence_compute.py>`_
as part of an earlier architecture, so it is strongly recommended to
reuse and adapt the existing implementation rather than writing a new
one from scratch.

Sending failure notifications to a host recovery workflow controller
--------------------------------------------------------------------

There must be a highly available service responsible for taking host
failures marked in ``attrd``, notifying a recovery workflow
controller, and updating ``attrd`` accordingly once appropriate action
has been taken.  A suggested name for this service is
``nova-host-alerter``.

It should be easy to ensure this alerter service is highly available
by placing it under management of the existing Pacemaker cluster.  It
could be written as an `OCF resource agent
<http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html>`_, or as a
Python daemon which is controlled by an OCF / LSB / ``systemd`` resource
agent.

The alerter service must contain an extensible driver-based
architecture, so that it is capable of sending notifications to a
number of different recovery workflow controllers.
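
A minimal sketch of such a driver abstraction follows.  This is an
illustration only, not the actual ``nova-host-alerter``; all class and
method names are invented, and a real implementation would add retry
logic, persistence, and configuration handling:

```python
# Sketch of a driver-based alerter (all names invented for illustration).
import json
import urllib.request

class NotificationDriver:
    """Base class: one subclass per type of recovery workflow controller."""
    def notify(self, payload):
        raise NotImplementedError

class HTTPJSONDriver(NotificationDriver):
    """POSTs the failure payload as JSON to a configurable endpoint."""
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def notify(self, payload):
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # A non-2xx response raises an exception, signalling that the
        # notification must be retained and retried later.
        with urllib.request.urlopen(req) as resp:
            return resp.status

class Alerter:
    """Dispatches each failure event to every configured driver."""
    def __init__(self, drivers):
        self.drivers = drivers

    def alert(self, payload):
        failed = []
        for driver in self.drivers:
            try:
                driver.notify(payload)
            except Exception:
                failed.append(driver)  # candidates for periodic retry
        return failed
```

Returning the drivers that failed makes it straightforward to satisfy
the acceptance criterion that undelivered notifications are retried
until they succeed.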

In particular, it must have a driver for sending notifications via the
`masakari API <https://github.com/openstack/masakari>`_.  If the
service is implemented as a shell script, this could be achieved by
invoking masakari's ``notification-create`` CLI, or if in Python, via
the `python-masakariclient library
<https://github.com/openstack/python-masakariclient>`_.
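
As a rough illustration of the CLI route, a driver might build an
invocation like the following.  The argument order shown here (type,
hostname, generated time, payload) is an assumption and should be
checked against ``masakari help notification-create`` before relying
on it:

```python
# Hypothetical sketch of invoking masakari's notification-create CLI.
# The positional argument order is an assumption for illustration;
# verify it against `masakari help notification-create`.
import json

def masakari_notify_cmd(hostname, generated_time, payload):
    return [
        "masakari", "notification-create",
        "COMPUTE_HOST",       # notification type
        hostname,             # failed compute host
        generated_time,       # e.g. "2017-06-01T12:00:00Z"
        json.dumps(payload),  # event details as a JSON string
    ]
```

A driver would pass this list to ``subprocess.check_call``, treating a
non-zero exit status as a delivery failure to be retried.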

Ideally it should also have a driver for sending HTTP POST messages to
a configurable endpoint, with JSON data formatted in the following
form:

.. code-block:: json

    {
        "id": UUID,
        "event_type": "host failure",
        "version": "1.0",
        "generated_time": TIMESTAMP,
        "payload": {
            "hostname": COMPUTE_NAME,
            "on_shared_storage": [true|false],
            "failure_time": TIMESTAMP
        }
    }

``COMPUTE_NAME`` refers to the FQDN of the compute node on which the
failures have occurred.  ``on_shared_storage`` is ``true`` if and only
if the compute host's instances are backed by shared storage.
``failure_time`` provides a timestamp (in seconds since the UNIX
epoch) for when the failure occurred.
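
A sketch of constructing this payload in Python, assuming the field
semantics described above.  The helper name is invented, and since the
spec leaves the format of ``generated_time`` unspecified, epoch
seconds are assumed here for consistency with ``failure_time``:

```python
# Sketch: building the standardized "host failure" payload described
# above.  build_payload is an invented helper name for illustration.
import time
import uuid

def build_payload(hostname, on_shared_storage, failure_time=None):
    now = int(time.time())
    return {
        "id": str(uuid.uuid4()),
        "event_type": "host failure",
        "version": "1.0",
        "generated_time": now,  # format assumed: epoch seconds
        "payload": {
            "hostname": hostname,
            "on_shared_storage": on_shared_storage,
            "failure_time": failure_time if failure_time is not None else now,
        },
    }
```

The resulting dict serializes cleanly with ``json.dumps`` for use by
an HTTP driver.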

This is already implemented as `fence_evacuate.py
<https://github.com/gryf/mistral-evacuate/blob/master/fence_evacuate.py>`_,
although the message sent by that script is currently specifically
formatted for consumption by Mistral.

Alternatives
============

No alternatives to the overall architecture are apparent at this
point.  However, it is possible that `attrd
<http://clusterlabs.org/man/pacemaker/attrd_updater.8.html>`_ (which
is functional but not comprehensively documented) could be replaced
by some other highly available key/value attribute store, such as
`etcd <https://coreos.com/etcd>`_.

Impact assessment
=================

Data model impact
-----------------

None

API impact
----------

The HTTP API of the host recovery workflow service needs to be able to
receive events in the format in which they are sent by this host
monitor.
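
For illustration, a minimal endpoint capable of receiving such events
could be sketched with the Python standard library as below.  This is
not part of any real recovery workflow controller, and the response
codes chosen (202 for accepted, 400 for malformed JSON) are
assumptions:

```python
# Sketch of a minimal HTTP endpoint for receiving host-failure
# notifications (illustrative only; response codes are assumptions).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotificationHandler(BaseHTTPRequestHandler):
    received = []  # collected payloads (demo only)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:
            self.send_response(400)  # reject malformed JSON
            self.end_headers()
            return
        self.received.append(payload)
        self.send_response(202)  # accepted for asynchronous processing
        self.end_headers()

    def log_message(self, *args):
        pass  # silence default stderr logging

def make_server(port=0):
    """Bind to an ephemeral port when port=0."""
    return HTTPServer(("127.0.0.1", port), NotificationHandler)
```

Rejecting a notification (any non-2xx status) signals the alerter to
persist and periodically retry it, per the acceptance criteria above.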

Security impact
---------------

Ideally it should be possible for the host monitor to send
failure event data securely to the recovery workflow service
(e.g. via TLS), without relying on the security of the admin network
over which the data is sent.

Other end user impact
---------------------

None

Performance Impact
------------------

There will be a small amount of extra RAM and CPU required on each
compute node for running the ``pacemaker_remote`` service.  However,
it is a relatively simple service, so this should not have a
significant impact on the node.

Other deployer impact
---------------------

Distributions need to package ``pacemaker_remote``; however this is
already done for many distributions including SLES, openSUSE, RHEL,
CentOS, Fedora, Ubuntu, and Debian.

Automated deployment solutions need to deploy and configure the
``pacemaker_remote`` service on each compute node; however this is a
relatively simple task.

Developer impact
----------------

Nothing other than the work items listed below.

Documentation Impact
--------------------

The service should be documented in the |ha-guide|_.

Assignee(s)
===========

Primary assignee:

- Adam Spiers

Other contributors:

- Sampath Priyankara
- Andrew Beekhof
- Dawid Deja

Work Items
==========

- Implement ``nova-host-alerter`` (**TODO**: choose owner for this)

- If appropriate, move the existing `fence_evacuate.py
  <https://github.com/gryf/mistral-evacuate/blob/master/fence_evacuate.py>`_
  to a more suitable long-term home (**TODO**: choose owner for this)

- Add SSL support (**TODO**: choose owner for this)

- Add documentation to the |ha-guide|_ (``aspiers`` / ``beekhof``)

.. |ha-guide| replace:: OpenStack High Availability Guide
.. _ha-guide: http://docs.openstack.org/ha-guide/

Dependencies
============

- `Pacemaker <http://clusterlabs.org/>`_

Testing
=======

`Cloud99 <https://github.com/cisco-oss-eng/Cloud99>`_ could
possibly be used for testing.

References
==========

- `Architecture diagram presented at OpenStack Day Israel, June 2017
  <https://aspiers.github.io/openstack-day-israel-2017-compute-ha/#/nova-host-alerter>`_
  (see also `the video of the talk <https://youtu.be/uMCMDF9VkYk?t=20m9s>`_)

- `"High Availability for Virtual Machines" user story
  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_

- `Video of "High Availability for Instances: Moving to a Converged Upstream Solution"
  presentation at the OpenStack conference in Boston, May 2017
  <https://www.openstack.org/videos/boston-2017/high-availability-for-instances-moving-to-a-converged-upstream-solution>`_

- `Instance HA etherpad started at the Newton Design Summit in Austin, April 2016
  <https://etherpad.openstack.org/p/newton-instance-ha>`_

- `Video of "HA for Pets and Hypervisors" presentation at the OpenStack conference
  in Austin, April 2016
  <https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation>`_

- `automatic-evacuation etherpad
  <https://etherpad.openstack.org/p/automatic-evacuation>`_

- Existing `fence agent
  <https://github.com/gryf/mistral-evacuate/blob/master/fence_evacuate.py>`_
  which sends the failure notification payload as JSON over HTTP.

- `Instance auto-evacuation cross-project spec (WIP)
  <https://review.openstack.org/#/c/257809>`_


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Pike
     - Updated to decouple the alerting mechanism from the fencing process
   * - Newton
     - First introduced
