Implement reserved_host, auto_priority and rh_priority recovery methods

Implements: bp implement-recovery-methods
Change-Id: I83ce204d8f25b240fa6ce723dc15192ae9b4e191
Abhishek Kekane 2 years ago
parent
commit
83d1a0aae1
1 changed files with 224 additions and 0 deletions

specs/ocata/approved/implement-reserved-host-action.rst

@@ -0,0 +1,224 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=================================================================
+Implement RESERVED_HOST recovery action for host failure workflow
+=================================================================
+
+https://blueprints.launchpad.net/masakari/+spec/implement-recovery-methods
+
+This spec proposes adding the RESERVED_HOST recovery action to the host
+failure workflow. In masakari, each failover segment has a recovery_method
+defined for it. If any host within that failover segment goes down, the
+recovery action selected by recovery_method is executed to evacuate either
+the "HA_Enabled" instances or all instances from that host, depending on
+the "evacuate_all_instances" configuration option.
+
+What is RESERVED_HOST?
+
+In each failover segment, operators keep some hosts in reserve by
+disabling the compute service on those hosts and setting the "reserved"
+property of each such host to "True". As a result, the nova scheduler does
+not select these hosts when provisioning new instances or performing other
+instance actions such as resize or migration. The masakari engine can use
+these hosts to evacuate instances from a failed host. Once a reserved host
+has been used for evacuation, it is no longer treated as reserved and the
+nova scheduler can schedule instances on it.
+
+Problem description
+===================
+
+Masakari provides a driver interface for implementing workflows either
+synchronously or asynchronously. Anyone who wants to implement a workflow
+can inherit from the masakari driver and implement it there.
+
+To implement the RESERVED_HOST recovery action, the masakari engine should
+pass the list of reserved hosts associated with the failover segment of the
+failed host to the driver. It is then the job of the driver to execute the
+workflow and use this list to evacuate the instances from the failed host.
+One of the tasks of the workflow is to enable the compute service on the
+reserved host so that instances can be evacuated to it. At the same time,
+the "reserved" property of that host needs to be set to False. Multiple
+host failures in one failover segment may pick the same reserved host and
+start their recovery workflows at the same time. To avoid this, each task
+acquires a lock named after the current reserved host. If the lock is
+already held, that reserved host is skipped and evacuation is attempted on
+the next reserved host in the list.
+
+Use Cases
+---------
+
+An operator may want to execute the host_failure workflow using the
+'RESERVED_HOST' recovery method.
+
+Proposed change
+===============
+
+The masakari engine should execute workflows synchronously only. The
+masakari engine will load all the drivers. Each new driver is responsible
+for getting the result of the workflow and sending it back to the masakari
+engine. A driver that executes the workflow on a host other than the one
+where the masakari engine is running must be designed so that the workflow
+can run asynchronously on any host and still send its result back to the
+masakari engine, so that the engine can set the notification status to
+"ERROR" or "FINISHED" based on the result.
+
+To implement the "reserved_host" recovery method, we need a lock mechanism
+over reserved hosts so that masakari-engine does not use the same reserved
+host for multiple failure notifications. There are two ways to implement
+the lock mechanism:
+
+1. Use the oslo_concurrency.lockutils file based lock:
+   Easy to implement, but cannot manage locks across multiple nodes.
+
+2. Implement the lock mechanism using Tooz:
+   Operators may want to deploy multiple masakari-engine services on
+   different nodes. For this purpose we recommend using the distributed
+   locking mechanism provided by the Tooz library. By default Tooz would be
+   configured to use file locks, so everything would work as it does with
+   the oslo_concurrency lock mechanism. An operator who wants to run
+   multiple masakari-engine services would need to configure a Tooz backend
+   service and set it in masakari.conf. Currently the most reliable Tooz
+   backends are ZooKeeper and Redis.
+
+As of now, masakari already uses the file based lock
+(oslo_concurrency.lockutils). The same mechanism will be used to acquire
+the lock on a reserved host. Tooz support will be added later, and all the
+existing locks will be migrated to the Tooz locking mechanism.
+
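To make the limitation of option 1 concrete, here is a rough sketch of a file based lock built directly on the standard library `fcntl` module (Unix only). It is not the oslo_concurrency implementation; the helper names and the lock file naming scheme are assumptions made for illustration. Because the lock lives in a file on the local filesystem, it only coordinates processes on the same node, which is why a distributed backend is needed once several masakari-engine services run on different nodes.

```python
import fcntl
import os


def lock_path_for(lock_dir, host_name):
    """Hypothetical naming scheme: one lock file per reserved host."""
    return os.path.join(lock_dir, "reserved-host-%s.lock" % host_name)


def try_file_lock(path):
    """Try to take an exclusive, non-blocking lock on the file at path.

    Returns the open file object (keep it open to hold the lock), or None
    if another holder already has the lock.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock; give up instead of waiting.
        handle.close()
        return None
    return handle


def release_file_lock(handle):
    """Release a lock obtained from try_file_lock and close the file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

A second `try_file_lock` call on the same path returns None until the first holder calls `release_file_lock`, but only for callers that share the same filesystem.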
+Pros:
+-----
+1. No need to change the current masakari engine implementation.
+2. It's easy to implement other recovery actions such as "AUTO_PRIORITY" and
+   "RH_PRIORITY" with this design.
+
+Alternatives
+------------
+
+Execute the workflows asynchronously, i.e. the workflow can be executed
+either on the same host where the masakari engine is running or on a
+different host altogether. In this case the masakari engine might not get
+the results of the workflow execution, and so would not be able to update
+the notification status or set the "reserved" property of the used
+reserved hosts to False.
+
+To achieve this, the masakari engine will generate a callback URL based on
+the notification_id and pass it to the driver. A sample callback URL looks
+like:
+
+http://<host>:<port>/v1/notification/<notification_id>
+
+The driver will pass this URL, along with the required information such as
+the reserved host list and the host name, on to the workflow. The workflow
+is responsible for calling masakari through the callback URL, with the
+notification status and the reserved hosts it used as the body of the
+request.
+
+When masakari receives this request, it will use the notification_id to
+look up the notification in the database table and update its status
+accordingly. Masakari will also get the list of used reserved hosts from
+the request body, loop through it, and set the "reserved" property of each
+of those hosts to False.
+
+The REST API can be::
+
+    method: PUT
+    URL: URL that is passed to the workflow (contains notification_id)
+    Body:
+    result {
+        notification_status: status of the notification, either 'error'
+                             or 'finished',
+        used_reserved_hosts: list of reserved hosts actually used by the
+                             workflow
+    }
+
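The handling of such a callback request could look roughly like the sketch below. It is only an illustration of the steps described for this alternative: plain dictionaries stand in for the notification and host database tables, and the function name `apply_workflow_result` is hypothetical rather than an existing masakari API.

```python
VALID_STATUSES = {"error", "finished"}


def apply_workflow_result(notifications, hosts, notification_id, result):
    """Apply a workflow callback body to in-memory stand-ins for the DB.

    notifications: dict mapping notification_id -> {"status": ...}
    hosts:         dict mapping host name -> {"reserved": bool}
    result:        dict with "notification_status" and "used_reserved_hosts"
    """
    status = result["notification_status"]
    if status not in VALID_STATUSES:
        raise ValueError("unexpected notification status: %r" % status)
    if notification_id not in notifications:
        raise KeyError("unknown notification: %r" % notification_id)

    # Map the callback to its notification and update the status.
    notifications[notification_id]["status"] = status

    # Every reserved host the workflow actually used stops being reserved.
    for host_name in result["used_reserved_hosts"]:
        hosts[host_name]["reserved"] = False
```

Validating the status before touching any record keeps a malformed callback from leaving the notification and the host records half updated.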
+Pros:
+-----
+
+1. Better approach to implement new workflow drivers.
+
+Cons:
+-----
+1. The other service that calls masakari through the REST API must have
+   the required admin credentials to call the API.
+2. Need to change the current driver (taskflow) implementation to adopt
+   this design.
+3. Need to modify the PUT API to incorporate this change.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Dinesh Bhor <dinesh.bhor@nttdata.com>
+  Abhishek Kekane <abhishek.kekane@nttdata.com>
+
+Work Items
+----------
+
+* Implement the RESERVED_HOST recovery_method for host_failure recovery
+  synchronously in the taskflow driver
+* Add unit tests for coverage
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+None
+
+Documentation Impact
+====================
+
+None
+
+References
+==========
+
+* http://eavesdrop.openstack.org/meetings/masakari/2016/masakari.2016-12-13-04.02.log.html
+* http://eavesdrop.openstack.org/meetings/masakari/2017/masakari.2017-02-07-04.01.log.html
