
Spec for host recovery

Change-Id: Ifa2583901cd2dff0b450d81fd7de96b27e9c315a
Dawid Deja 2 years ago
commit e243a2c545
1 changed files with 161 additions and 0 deletions
specs/newton/approved/newton-instance-ha-host-recovery.rst

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============
Host Recovery
=============

This spec describes a method for recovering all virtual machines that were
running on a host when that host fails.

Problem description
===================

When a whole compute node fails, recovering its instances is crucial for
providing high availability for the virtual machines. On the other hand,
automatic recovery of some instances may cause even more problems than the
fact that they were suddenly turned off.

Taking both arguments into account, there is a clear need for automatic
recovery that is configurable at both the instance and the host level. This
spec describes the possible actions in case of compute node failure and how
they are configured. Automatic recovery of individual instances is out of
scope for this spec and will be described in another document.

Use Cases
---------

* As a cloud operator, I would like to provide my users with highly
  available VMs to meet high SLA requirements. Therefore, I need some of my
  VMs to be resurrected automatically after a compute node failure.

Proposed change
===============

VM recovery can be performed on the control plane of an OpenStack cloud. It
would be done using the mistral workflow service and a pacemaker resource
agent. The resource agent would be responsible for starting the workflow,
whereas mistral would be responsible for performing *nova_evacuate* for each
VM and for observing the state of each evacuated VM. Using mistral would
ensure that the evacuation workflow completes even if some of the controllers
die during the process.

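The division of work described above (evacuate every VM, then observe each
one until it settles) can be sketched in plain Python. This is only an
illustration of the control flow, not the mistral workflow itself: the
function ``recover_host``, the injected ``evacuate`` and ``get_status``
callables, the polling parameters and the ``TIMEOUT`` label are all
hypothetical names introduced here, and ``ACTIVE`` and ``ERROR`` are assumed
to be the terminal statuses of interest.

```python
import time
from typing import Callable, Dict, Iterable


def recover_host(
    instance_ids: Iterable[str],
    evacuate: Callable[[str], None],
    get_status: Callable[[str], str],
    poll_interval: float = 1.0,
    max_polls: int = 30,
) -> Dict[str, str]:
    """Evacuate every instance, then poll until each one settles.

    ``evacuate`` and ``get_status`` are hypothetical callables standing in
    for whatever the real workflow would use to talk to nova.
    """
    pending = list(instance_ids)

    # Step 1: request evacuation for each VM found on the failed host.
    for instance_id in pending:
        evacuate(instance_id)

    # Step 2: observe each evacuated VM until it reaches a terminal status.
    results: Dict[str, str] = {}
    for _ in range(max_polls):
        still_pending = []
        for instance_id in pending:
            status = get_status(instance_id)
            if status in ("ACTIVE", "ERROR"):  # assumed terminal statuses
                results[instance_id] = status
            else:
                still_pending.append(instance_id)
        pending = still_pending
        if not pending:
            break
        time.sleep(poll_interval)

    # Anything that never settled is reported with a hypothetical label.
    for instance_id in pending:
        results[instance_id] = "TIMEOUT"
    return results
```

Injecting the two callables keeps the sketch independent of any particular
client library. A real workflow would additionally have to survive a
controller failure, which is the reason this spec proposes mistral.
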
Alternatives
------------

1. Skip the mistral workflow entirely and do all *nova_evacuate* related
   work in the pacemaker resource agents. This would mean implementing the
   whole HA mechanism in the agents, which would be difficult.

2. Implement a real *host-evacuate* in nova. Right now *host-evacuate*
   iterates over all instances on the given host on the client side. This
   logic could be moved into nova, but nova cores were against this change
   in the past.

Data model impact
-----------------

None

API impact
----------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Extra RAM and CPU would be needed on each controller node to run both the
pacemaker and mistral services. If they are already present on the control
plane, there would be no performance impact.

Other deployer impact
---------------------

Distributions need to package and deploy extra services on each controller
node: the mistral service and the pacemaker resource agent.

Developer impact
----------------

Nothing other than the work items listed below.

Implementation
==============

The resource agent would receive information from the host monitor that a
given host is down. It would then send a request to mistral to start the
recovery workflow. The request needs the following input parameters:

.. code-block:: json

    {
        "search_opts": {
            "host": "COMPUTE_NAME"
        },
        "on_shared_storage": true
    }

Here ``COMPUTE_NAME`` stands for the name of the failed compute host, and
``on_shared_storage`` is either ``true`` or ``false``.

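As a minimal sketch of how the resource agent might assemble these input
parameters, the following Python builds and validates the document before it
is sent. The function name ``build_recovery_input`` and the example host name
``compute-1`` are hypothetical; how the request is actually delivered to
mistral is not shown and is left to the implementation.

```python
import json
from typing import Any, Dict


def build_recovery_input(compute_name: str, on_shared_storage: bool) -> Dict[str, Any]:
    """Build the workflow input for one failed compute host.

    Hypothetical helper: only the two parameters named in this spec are
    produced; nothing is sent anywhere.
    """
    if not compute_name:
        raise ValueError("compute_name must be a non-empty host name")
    if not isinstance(on_shared_storage, bool):
        raise TypeError("on_shared_storage must be True or False")
    return {
        "search_opts": {"host": compute_name},
        "on_shared_storage": on_shared_storage,
    }


if __name__ == "__main__":
    # Print the JSON document for an example (hypothetical) host name.
    print(json.dumps(build_recovery_input("compute-1", True), indent=4))
```

Validating the input in the agent means a malformed request is rejected
before any workflow is started.
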
Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Work Items
----------

* Prepare a resource agent that would trigger mistral
* Prepare the mistral workflow
* Document the changes in the HA guide

Dependencies
============

Host monitor

Testing
=======

Documentation Impact
====================

The service should be documented in the ha-guide.

References
==========

- `Instance HA etherpad started at Newton Design Summit in Austin
  <https://etherpad.openstack.org/p/newton-instance-ha>`_

- `"High Availability for Virtual Machines" user story
  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_

- `Video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
  <https://youtu.be/lddtWUP_IKQ>`_

- `automatic-evacuation etherpad
  <https://etherpad.openstack.org/p/automatic-evacuation>`_

- `Instance auto-evacuation cross project spec (WIP)
  <https://review.openstack.org/#/c/257809>`_


History
=======
