..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============
VM Monitoring
=============

The purpose of this spec is to describe a method for monitoring the
health of OpenStack VM instances without access to the VMs' internals.

Problem description
===================

Monitoring VM health is essential for providing high availability for
the VMs.  Typically cloud operators cannot access the inside of VMs in
order to monitor their health, because doing so would violate the
contract between cloud operators and users which grants users complete
autonomy over the contents of their VMs and over all actions performed
inside them.  Operators cannot assume any knowledge of the software
stack inside the VM or make any changes to it.  Therefore, VM health
monitoring must be done externally, and the monitor must be able to
detect VM crashes, hangs (e.g. due to I/O errors), and so on.

Use Cases
---------

As a cloud operator, I would like to provide my users with highly
available VMs in order to meet high SLA requirements.  Therefore, I
need my VMs automatically monitored for sudden stops, crashes, I/O
failures, and the like.  Any VM failure event detected needs to be
passed to a VM recovery workflow service which takes the appropriate
actions to recover the VM.  For example:

- If a VM crashes, the recovery service will try to restart it,
  possibly on the same host at first, and then on a different host if
  it fails to restart or if it restarts successfully but then crashes
  a second time on the original host.

- If a VM receives an I/O error, the recovery service may prefer to
  immediately contact ``nova-api`` to centrally disable the
  ``nova-compute`` service on that host (so that no new VMs are
  scheduled on the host) and restart the VM on a different host.  It
  could also potentially live-migrate all other VMs off that host, in
  order to pre-empt any further I/O errors.  (A sketch of these
  recovery actions follows this list.)
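
For illustration only, here is a minimal sketch of how a recovery
workflow service might drive the actions above via python-novaclient.
The function name, the use of environment variables for credentials,
and the shared-storage assumption are inventions of this sketch, not
part of the spec::

    # Minimal sketch, assuming python-novaclient and the usual OS_*
    # credential variables in the environment.
    import os

    from novaclient import client

    nova = client.Client('2',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    def handle_io_error(instance_uuid, failed_host):
        # Stop scheduling new VMs onto the unhealthy host ...
        nova.services.disable(failed_host, 'nova-compute')
        # ... then rebuild the affected VM elsewhere.  Evacuation
        # requires the failed host to be fenced or down first.
        nova.servers.evacuate(instance_uuid, on_shared_storage=True)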

Proposed change
===============

VM monitoring can be done at the hypervisor level, without accessing
the VMs' internals.  In particular, |libvirt|_ provides a mechanism
for monitoring its event stream via an event loop.  We need to filter
the event stream for the relevant events and pass them to a recovery
workflow service, as the sketch below illustrates.  In order to
eliminate redundancy and improve extensibility, these event filters
must be configurable.

.. |libvirt| replace:: `libvirt`
.. _libvirt: https://libvirt.org/
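
The following is a minimal sketch of such a monitor using the libvirt
Python bindings.  The set of forwarded lifecycle events is just one
example of a filter, and the notification side is elided here (see
Implementation below)::

    # Minimal sketch of a libvirt-level VM monitor.  The set of
    # forwarded events stands in for a configurable event filter.
    import libvirt

    FORWARDED_EVENTS = {libvirt.VIR_DOMAIN_EVENT_STOPPED,
                        libvirt.VIR_DOMAIN_EVENT_CRASHED}

    def lifecycle_callback(conn, domain, event, detail, opaque):
        if event in FORWARDED_EVENTS:
            # The real monitor would notify the recovery workflow
            # service over HTTP instead of printing.
            print("VM %s: lifecycle event %d (detail %d)" %
                  (domain.UUIDString(), event, detail))

    # The default event loop implementation must be registered
    # before the connection is opened.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.openReadOnly('qemu:///system')
    conn.domainEventRegisterAny(None,
                                libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                lifecycle_callback, None)
    while True:
        libvirt.virEventRunDefaultImpl()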

Potential advantages:

- Catching events at their source (the hypervisor layer) means that we
  don't have to rely on ``nova`` having knowledge of those events.
  For example, ``libvirtd`` can output errors when a VM's I/O layer
  encounters issues, but ``nova`` doesn't emit corresponding events for
  this.
- It should be relatively easy to support a configurable event filter.
- The VM instance monitor can be run on each compute node, so it should
  scale well as the number of compute nodes increases.
- The VM instance monitors could be managed by `pacemaker_remote`__ via a
  new `OCF RA (resource agent)`__.

__ http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Remote/
__ http://www.linux-ha.org/wiki/OCF_Resource_Agents

Alternatives
------------

There are three alternatives to the proposed change:

1. Listen for VM status change events on the message queue.

   Potential disadvantages:

   - It might be less reliable, if for some reason the message queue
     introduced latency or was lossy.

   - There might also be some gaps in which events are propagated to
     the queue; if so, we could submit a ``nova`` spec to plug the
     gaps.

   - Listening for events from the control plane won't scale as well
     to large numbers of compute nodes, and it would then be awkward
     to trigger recovery via Pacemaker.

2. Write a new ``nova-libvirt`` OCF RA.

   It would compare ``nova``'s expectations of which VMs should be
   running on the compute node with the reality (a minimal sketch of
   this comparison follows this list).  Any differences between the
   two would result in appropriate failure events being sent to the
   recovery workflow service.

   Potential disadvantages:

   - This is more complexity than is expected to run inside an RA.
     RAs are supposed to be lightweight components which simply start,
     stop, and monitor services, whereas this would require abusing
     that model by pretending there is a separate monitoring service
     when there isn't.  The ``monitor`` action would need to fail when
     any differences as mentioned above were detected, and then the
     ``stop`` or ``start`` action would need to send the failure
     events.

   - Within this "fake service" model, it's not clear how to avoid
     sending the same failure events over and over again until the
     failures were corrected.

   - Typically RAs are implemented in ``bash``.  This is not a hard
     requirement, but something of this complexity would be much
     better coded in Python, resulting in a mix of languages within
     the `openstack-resource-agents`_ repository.

3. Same as 2. above, but as part of the NovaCompute_ RA.

   - This has all the disadvantages of 2., but even more so, since
     the new functionality would have to be mixed alongside the
     existing NovaCompute_ functionality.

.. _openstack-resource-agents: https://launchpad.net/openstack-resource-agents
.. _NovaCompute: https://github.com/openstack/openstack-resource-agents/blob/master/ocf/NovaCompute
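
Purely for illustration, here is a minimal sketch of the comparison
that alternatives 2. and 3. would need to perform.  The function name
is invented here, and the ``nova`` client setup from the earlier
sketch is assumed::

    # Illustrative sketch only; error handling is omitted.  For the
    # libvirt driver, the libvirt domain UUID equals the nova
    # instance UUID, so the two sets are directly comparable.
    import socket

    import libvirt

    def find_divergent_vms(nova):
        host = socket.gethostname()
        # VMs which nova believes are ACTIVE on this host ...
        expected = set(s.id for s in nova.servers.list(
            search_opts={'host': host, 'all_tenants': 1})
            if s.status == 'ACTIVE')
        # ... versus the domains actually running under libvirt.
        conn = libvirt.openReadOnly('qemu:///system')
        running = set(d.UUIDString() for d in conn.listAllDomains(
            libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE))
        return expected - running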

Data model impact
-----------------

None

API impact
----------

The HTTP API of the VM recovery workflow service needs to be able to
receive events in the format in which they are sent by this instance
monitor.

Security impact
---------------

Ideally it should be possible for the instance monitor to send
instance event data securely to the recovery workflow service
(e.g. via TLS), without relying on the security of the admin network
over which the data is sent.

Other end user impact
---------------------

None

Performance Impact
------------------

There will be a small amount of extra RAM and CPU required on each
compute node for running the instance monitor.  However, it's a
relatively simple service, so this should not have a significant
impact on the node.

Other deployer impact
---------------------

Distributions need to package and deploy an extra service on each
compute node.  However, the existing `instance monitor`_ implementation
in masakari_ already provides files to simplify packaging on the Linux
distributions most commonly used for OpenStack infrastructure.

.. _masakari: https://github.com/ntt-sic/masakari
.. _`instance monitor`:
   https://github.com/ntt-sic/masakari/tree/master/masakari-instancemonitor/

Developer impact
----------------

Nothing beyond the work items listed below.

Implementation
==============

``libvirtd`` uses `QMP (QEMU Machine Protocol)`__ via a UNIX domain
socket (``/var/lib/libvirt/qemu/xxxx.monitor``) to communicate with
the VM domain.  ``libvirt`` catches the failure events and passes them
to the VM monitor.  The VM monitor filters the events and passes them
to an external recovery workflow service via HTTP, which then takes
the action required to recover the VM.

__ http://wiki.qemu.org/QMP

::

 +-----------------------+
 | +----------------+    |
 | |       VM       |    |
 | | (qemu Process) |    |
 | +---------^------+    |
 |       |   |QMP        |
 | +-----v----------+    |
 | |    libvirtd    |    |
 | +---------^------+    |
 |       |   |           |
 | +-----v----------+    |        +-----------------------+
 | |    VM Monitor  +------------>+  VM recovery workflow |
 | +----------------+    |        +-----------------------+
 |                       |
 | Compute Node          |
 +-----------------------+

We can almost certainly reuse the `instance monitor`_ provided
by masakari_.

**FIXME**:

- Need to detail how and in which format the event data should be
  sent over HTTP.  **This should allow support for other hypervisors
  not based on** ``libvirt`` **to be added in the future.**
- Need to give details of exactly how the service can be configured.

  - How should event filtering be configurable?

  - Where should the configuration live?  With masakari_, it
    lives in ``/etc/masakari-instancemonitor.conf``.
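
Purely as an illustration of the first open question, here is a sketch
of one possible notification.  Every field name and the endpoint URL
are inventions of this sketch, not a decided format; sending over
HTTPS with CA verification would also address the security impact
noted above::

    # Hypothetical sketch only: the payload format and endpoint are
    # open questions, so every field below is an assumption.
    import json
    import socket

    import requests

    def notify_recovery_service(domain_uuid, event, detail):
        payload = {
            'hostname': socket.gethostname(),
            'uuid': domain_uuid,  # libvirt domain UUID == nova UUID
            'type': 'LIFECYCLE',  # leaves room for non-libvirt monitors
            'event': event,
            'detail': detail,
        }
        # TLS with CA verification protects the event data in transit.
        requests.post('https://recovery.example.com/events',
                      data=json.dumps(payload),
                      headers={'Content-Type': 'application/json'},
                      verify='/etc/ssl/certs/recovery-ca.pem')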

Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Work Items
----------

- Package `masakari`_'s `instance monitor`_ for SLES (`aspiers`)
- Add documentation to the |ha-guide|_ (`beekhof`)
- Look into libvirt-test-API_
- Write test suite

.. |ha-guide| replace:: OpenStack High Availability Guide
.. _ha-guide: http://docs.openstack.org/ha-guide/
.. _libvirt-test-API: https://libvirt.org/testapi.html

Dependencies
============

- `libvirt <https://libvirt.org/>`_
- `libvirt's Python bindings <https://libvirt.org/python.html>`_

Testing
=======

It may be possible to write a test suite using libvirt-test-API_, or
at least some of its components.

Documentation Impact
====================

The service should be documented in the |ha-guide|_.

References
==========

- `Instance HA etherpad started at Newton Design Summit in Austin
  <https://etherpad.openstack.org/p/newton-instance-ha>`_

- `"High Availability for Virtual Machines" user story
  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_

- `Video of the "HA for Pets and Hypervisors" presentation at the
  OpenStack conference in Austin <https://youtu.be/lddtWUP_IKQ>`_

- `automatic-evacuation etherpad
  <https://etherpad.openstack.org/p/automatic-evacuation>`_

- `Instance auto-evacuation cross project spec (WIP)
  <https://review.openstack.org/#/c/257809>`_


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Newton
     - Introduced

Loading…
Cancel
Save