VM Recovery

The purpose of this spec is to describe a method for
recovering VMs from VM failures.

Change-Id: I3648aacc2cfefe2bb5981f694415ceab17b2dfb8
changes/62/387262/5
sampathP 2 years ago
parent
commit
8a4a70db74
1 changed files with 200 additions and 0 deletions
specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst (+200, -0)

@@ -0,0 +1,200 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+VM Recovery
+==========================================
+
+The purpose of this spec is to describe a method to recover
+individual virtual machines that are marked as failed by
+the VM monitoring component.
+
+Problem description
+===================
+
+VM failures can be detected by the VM monitoring method discussed in
+the `vm monitoring spec`__.
+
+__ https://review.openstack.org/#/c/352217/
+
+When a VM failure event is detected, appropriate recovery actions must
+be taken. Those recovery actions should be decided using configurable
+policies based on inputs such as the state of storage (shared or
+otherwise), the status of the VM, and the cause of the VM failure.
+
+Use Cases
+---------
+
+As a cloud operator, I would like to provide my users with highly
+available VMs to meet high SLA requirements. There are several types
+of VM failure events that can occur in OpenStack clouds.
+We need to make sure that the system can detect such events and
+recover from them. Possible VM failure events include:
+
+- VM crashes
+
+- VM hangs
+
+Possible recovery methods include:
+
+- VM restart (stop and start)
+
+- VM restart on a different host
+
+Scope
+-----
+
+This spec only addresses recovery from isolated failures of individual
+VMs. Monitoring of the VMs, and detection and recovery from wider
+failures, such as failure of a whole compute host, will be covered by
+separate specs, and are therefore out of scope for this spec.
+
+This spec has the following goals:
+
+1. Encourage all implementations of VM recovery, whether upstream or
+   downstream, to receive failure notifications in a standardized
+   manner. This will allow cloud vendors and operators to implement
+   HA of the compute plane via a collection of compatible components
+   (of which one is compute node monitoring), whilst not being tied to
+   any one implementation.
+
+2. Suggest appropriate actions which can be taken for each failure
+   case.
+
+3. Provide details of and recommend a specific implementation which
+   for the most part already exists and is proven to work.
+
+4. Identify gaps with that implementation and corresponding future
+   work required.
+
+Proposed change
+===============
+
+VM monitors send failure events to a recovery workflow service. This
+workflow service can analyze the content of the failure event message
+and execute the appropriate recovery action. It could also handle
+advanced recovery options such as enforcing a maximum restart
+threshold, moving on to the next recovery action, or executing
+multiple workflows.
+
+If a VM crashes, the first approach to recovery is to stop and start
+the VM via nova-api. The maximum restart threshold should be
+configurable. A threshold of 0 means that the VM is not restarted in
+place and the service proceeds to the next recovery method. If the
+restart fails, or the threshold is 0, the service should try to
+restart the VM on a different host. The threshold could even be set
+to -1 to indicate an infinite number of retries on this host,
+preventing the VM from ever being restarted on a different host. This
+might be desirable in certain configurations where there is no shared
+storage for ephemeral disks, and rebuilding a disk from a glance image
+during ``nova evacuate`` is undesirable.
+
+If a VM hangs due to an I/O error, the recovery service may be
+required to automatically disable the ``nova-compute`` service on that
+host and restart the VM on a different host. It could also migrate
+other VMs from the host, in order to preempt further I/O errors.
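
The restart-threshold rule described above for crashed VMs can be sketched as follows. This is only an illustration under stated assumptions: the names ``Action``, ``next_action``, ``max_restarts`` and ``failed_restarts`` are hypothetical and are not taken from Masakari, Mistral or Nova.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical recovery actions for a crashed VM."""
    RESTART_SAME_HOST = "restart_same_host"
    RESTART_OTHER_HOST = "restart_other_host"


def next_action(max_restarts: int, failed_restarts: int) -> Action:
    """Pick the next recovery action for a crashed VM.

    max_restarts: configured threshold. 0 skips in-place restarts,
        -1 means retry on the same host indefinitely.
    failed_restarts: in-place restart attempts that have already failed.
    """
    if max_restarts < -1 or failed_restarts < 0:
        raise ValueError("invalid threshold or attempt count")
    if max_restarts == -1:
        # Never leave the current host, e.g. when ephemeral disks are
        # not on shared storage.
        return Action.RESTART_SAME_HOST
    if failed_restarts < max_restarts:
        return Action.RESTART_SAME_HOST
    # Threshold is 0 or has been exhausted: try a different host.
    return Action.RESTART_OTHER_HOST
```

With a threshold of 3, the first three failures would lead to another in-place restart and the fourth decision would move the VM to a different host. A real workflow service would also need to persist the attempt count per instance.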
+
+Implementation
+==============
+
+There are at least three possible ways to implement the proposed
+change:
+
+1. Use Masakari as the recovery workflow service
+
+   VM monitors send the failure events to Masakari using Masakari's
+   notification API. Masakari will execute pre-defined recovery actions.
+
+2. Use Mistral as the recovery workflow service
+
+   VM monitors call the Mistral workflow to execute the appropriate
+   recovery actions.
+
+3. Use Masakari as the recovery engine and Mistral as the workflow service
+
+   VM monitors send the failure events to Masakari, and Masakari will
+   analyze the content of the failure event message and call a Mistral
+   workflow to execute recovery actions.
+
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+The HTTP API of the VM recovery workflow service needs to be able to
+receive events in the format in which they are sent by the VM monitor.
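
As an illustration of what a monitor-side sender might look like, the sketch below builds a JSON event and an HTTP ``POST`` request for it using only the Python standard library. The field names (``type``, ``hostname``, ``generated_time``, ``payload``), the endpoint URL and both function names are assumptions made for this example; this spec does not define the event format.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_event(instance_uuid: str, hostname: str, cause: str) -> dict:
    """Build a hypothetical VM failure event (field names are assumed)."""
    return {
        "type": "VM",
        "hostname": hostname,
        "generated_time": datetime.now(timezone.utc).isoformat(),
        "payload": {"instance_uuid": instance_uuid, "cause": cause},
    }


def build_request(url: str, event: dict) -> urllib.request.Request:
    """Wrap the event in a JSON POST request without sending it."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint; an https:// URL lets urllib verify the server
# certificate when the request is eventually sent with urlopen().
request = build_request(
    "https://recovery.example.com/events",
    build_event("hypothetical-uuid", "compute-01", "crash"),
)
```

Building the request separately from sending it keeps the event format easy to test without a running workflow service.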
+
+Security impact
+---------------
+
+Ideally it should be possible for the VM monitor to send instance
+event data securely to the recovery workflow service (e.g. via TLS),
+without relying on the security of the admin network over which the
+data is sent.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+
+Developer impact
+----------------
+
+Documentation Impact
+--------------------
+
+The service should be documented in the |ha-guide|_.
+
+.. |ha-guide| replace:: OpenStack High Availability Guide
+.. _ha-guide: http://docs.openstack.org/ha-guide/
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  <launchpad-id or None>
+
+Other contributors:
+  <launchpad-id or None>
+
+
+Work Items
+==========
+
+ WIP
+
+Dependencies
+============
+
+
+Testing
+=======
+
+
+Documentation Impact
+====================
+
+
+
+References
+==========
+
+
+
+History
+=======
