Merge "Add progress details for recovery workflow"

Zuul 1 month ago
commit d4c06a0eeb
1 changed file with 411 additions and 0 deletions

specs/stein/approved/progress-details-for-recovery-workflows.rst

@@ -0,0 +1,411 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+============================================
+Add progress details for recovery workflows
+============================================
+
+https://blueprints.launchpad.net/masakari/+spec/progress-details-recovery-workflows
+
+This blueprint proposes a feature that emits events for recovery
+workflows.
+
+Problem description
+===================
+
+Currently, Masakari does not emit any events while it processes a recovery
+request received from a Masakari monitor.
+
+It would be useful to receive an event at each stage of every task in a
+recovery workflow, along with its completion status and progress details, so
+that operators know what is happening during execution.
+
+Use Cases
+---------
+
+From the progress details captured for each recovery event, operators will
+be able to see:
+
+* The beginning and end of each task in the recovery flow
+* Errors and failures that occur during recovery
+* Progress details describing what each task did
+
+
+Proposed change
+===============
+
+A Masakari recovery workflow is a set of tasks executed to recover from a
+failure. Masakari supports three types of failure:
+
+* instance-failure
+* process-failure
+* host-failure
+
+For each of these failures, Masakari executes a workflow to recover from the
+failure. Currently Masakari uses the taskflow library to execute the
+workflow, which consists of predefined recovery actions that are executed
+linearly. This spec proposes to record these recovery actions with the help
+of the Taskflow persistence feature. Masakari will persist the flow so that
+it can be resumed, restarted or rolled back on engine failure.
+
+Taskflow supports workflow persistence, which stores the details of each
+task in the database. For more details, please refer to `persistence-doc`_.
+
+Taskflow stores workflow and task details in the following three tables:
+
+* logbooks
+* flowdetails
+* atomdetails
+
+In particular, for each flow there is a corresponding flowdetails
+record, and for each task there is a corresponding atomdetails record. These
+form the basic level of information about how a flow will be persisted.
+
+By importing the persistence package `taskflow_persistence`_ and accessing
+Masakari storage via the masakari engine, the Taskflow tables can be imported
+into Masakari. In the taskflow library, each workflow consists of tasks, and
+each task has a state and a status. The `notifier_method`_ will be used to
+update the progress details that describe the execution of each recovery
+task.
+
+The saved recovery task details (failures, successes, intermediary results)
+will be rendered in Horizon in tabular format, which helps operators
+understand the progress and status of a recovery. The execution progress of
+each flow is stored on a scale of 0 to 1, so that operators can see how far
+the recovery has progressed along with detailed information about each task.
+
+The actions/events that will be recorded for the
+‘instance-failure recovery workflow’, along with their progress details, are
+explained below:
+
+* Stop Instance Task: The possible events, along with the progress details
+  that will be recorded, are listed below:
+
+  * Start of Stop instance event::
+
+      "progress_details" = {
+        "progress": 0.50,
+        "progress_data": "Started execution of StopInstanceTask <INSTANCE_UUID>"
+      }
+
+  * Skipped recovery event if the instance is not HA_Enabled and the
+    "process_all_instances" config option is also disabled::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Skipping recovery for instance <INSTANCE_UUID> as it is not Ha_Enabled"
+      }
+
+  * Ignored recovery event if the instance VM state is either 'paused' or
+    'rescued'::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Ignoring recovery for instance <INSTANCE_UUID> as it is in paused/rescued state"
+      }
+
+  * Finish of Stop instance event::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Finished execution of StopInstanceTask <INSTANCE_UUID>"
+      }
+
+  * Failure event if the instance fails to stop::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Failed to stop instance <INSTANCE_UUID>"
+      }
+
+* Start Instance Task: The possible events, along with the progress details
+  that will be recorded, are listed below:
+
+  * Start of Start instance event::
+
+      "progress_details" = {
+        "progress": 0.5,
+        "progress_data": "Started execution of StartInstanceTask <INSTANCE_UUID>"
+      }
+
+  * Finish of Start instance event::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Finished execution of StartInstanceTask <INSTANCE_UUID>"
+      }
+
+  * Failure event if the instance fails to start or is in an invalid state::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Failed to start instance <INSTANCE_UUID>"
+      }
+
+* Confirm Instance Active Task: The possible events, along with the progress
+  details that will be recorded, are listed below:
+
+  * Start of Confirm instance event::
+
+      "progress_details" = {
+        "progress": 0.5,
+        "progress_data": "Confirming instance <INSTANCE_UUID> is Active"
+      }
+
+  * Finish of Confirm instance event::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Confirmed instance <INSTANCE_UUID> is Active"
+      }
+
+  * Failure event if confirming the instance fails::
+
+      "progress_details" = {
+        "progress": 1,
+        "progress_data": "Failed to confirm instance <INSTANCE_UUID>"
+      }
+
+.. note::
+   Events are emitted only when the masakari engine starts processing
+   received notifications by executing a recovery workflow.
+
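The event records listed above could be collected along the following lines. This is a sketch using only the standard library, not the taskflow notifier itself: `ProgressRecorder`, `register`, `update` and `events` are hypothetical names, and only the `progress` and `progress_data` keys come from the spec.

```python
from typing import Any, Callable, Dict, List

ProgressEvent = Dict[str, Any]
Listener = Callable[[str, ProgressEvent], None]


class ProgressRecorder:
    """Hypothetical helper: stores progress events per task and notifies listeners."""

    def __init__(self) -> None:
        self._events: Dict[str, List[ProgressEvent]] = {}
        self._listeners: List[Listener] = []

    def register(self, listener: Listener) -> None:
        """Add a callback invoked with (task_name, event) on every update."""
        self._listeners.append(listener)

    def update(self, task_name: str, progress: float, message: str) -> ProgressEvent:
        """Record one event; progress uses the spec's 0 to 1 scale."""
        if not 0.0 <= progress <= 1.0:
            raise ValueError("progress must be between 0 and 1")
        event = {"progress": progress, "progress_data": message}
        self._events.setdefault(task_name, []).append(event)
        for listener in self._listeners:
            listener(task_name, event)
        return event

    def events(self, task_name: str) -> List[ProgressEvent]:
        """Return a copy of the events recorded for a task, oldest first."""
        return list(self._events.get(task_name, []))


if __name__ == "__main__":
    recorder = ProgressRecorder()
    recorder.register(lambda task, event: print(task, event))
    recorder.update("StopInstanceTask", 0.5,
                    "Started execution of StopInstanceTask <INSTANCE_UUID>")
    recorder.update("StopInstanceTask", 1,
                    "Finished execution of StopInstanceTask <INSTANCE_UUID>")
```

Keeping events in arrival order per task means the last entry always reflects the task's final outcome, whether that is a finish, skip, ignore or failure message.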
+The database entries that will be recorded for the
+‘instance-failure recovery workflow’ are shown below::
+
+    LogBook: 'instance_recovery'
+    - uuid = 68e86fda-25ba-4b1d-a9fc-d999bc1c796e
+    - created_at = 2019-01-08 08:15:21
+    - updated_at = 2019-01-08 08:15:21
+    - meta: {"notification_uuid": "9ca38361-eef9-4fca-a1fe-49ef0c7e23e8"}
+    FlowDetail: 'instance_recovery_engine'
+    - uuid = 6a780ae7-9c63-42d9-8510-aa020d7ee566
+    - state = SUCCESS
+    TaskDetail: 'StopInstanceTask'
+    - uuid = c165b8c2-5123-4489-99c1-97eafff72d24
+    - state = SUCCESS
+    - version = 1.0
+    - failure = False
+    - meta: {}
+    - results: <CONTEXT_DETAILS>
+    TaskDetail: 'StopInstanceTask'
+    - uuid = c165b8c2-5123-4489-99c1-97eafff72d24
+    - state = SUCCESS
+    - version = 1.0
+    - failure = False
+    - meta:
+        + progress = 100.00%
+        + progress_details = {
+            "progress": 1,
+            "progress_details": {
+                "at_progress": 1,
+                "details": {
+                    "progress_details": [
+                        "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
+                    ]
+                }
+            }
+         }
+    - results: NULL
+    TaskDetail: 'StartInstanceTask'
+    - uuid = a4155556-fb5a-44f8-b8aa-ab8ecfe8f1ce
+    - state = SUCCESS
+    - version = 1.0
+    - failure = False
+    - meta:
+        + progress = 100.00%
+        + progress_details = {
+            "progress": 1,
+            "progress_details": {
+                "at_progress": 1,
+                "details": {
+                    "progress_details": [
+                        "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
+                    ]
+                }
+            }
+         }
+    - results: NULL
+    TaskDetail: 'ConfirmInstanceActiveTask'
+    - uuid = 0ea82633-599b-422d-8fd2-df2057efb29d
+    - state = SUCCESS
+    - version = 1.0
+    - failure = False
+    - meta:
+        + progress = 100.00%
+        + progress_details = {
+            "progress": 1,
+            "progress_details": {
+                "at_progress": 1,
+                "details": {
+                    "progress_details": [
+                        "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
+                    ]
+                }
+            }
+         }
+    - results: NULL
+
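The nested `meta` layout in the entries above can be read as the following structure. This sketch only mirrors the shape shown in the example: `build_task_meta` and `as_percent` are hypothetical helpers, and taking a task's overall progress from its most recent event is an assumption rather than something the spec states.

```python
from typing import Any, Dict, List


def build_task_meta(events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Nest a task's recorded events the way the example meta column shows."""
    # Assumption: the task's overall progress is that of its latest event.
    progress = events[-1]["progress"] if events else 0
    return {
        "progress": progress,
        "progress_details": {
            "at_progress": progress,
            "details": {"progress_details": list(events)},
        },
    }


def as_percent(progress: float) -> str:
    """Format a 0 to 1 progress value like the 'progress = 100.00%' line."""
    return f"{progress * 100:.2f}%"


if __name__ == "__main__":
    stop_events = [
        {"progress": 0.5,
         "progress_data": "Started execution of StopInstanceTask <INSTANCE_UUID>"},
        {"progress": 1,
         "progress_data": "Finished execution of StopInstanceTask <INSTANCE_UUID>"},
    ]
    meta = build_task_meta(stop_events)
    print(as_percent(meta["progress"]))
    print(meta["progress_details"]["details"]["progress_details"])
```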
+
+The recorded data will be used to render task details in tabular format for
+the ‘instance-failure recovery workflow’ in Horizon, as shown below::
+
+    * Stop Instance Task
+    ============================================  ==========================  ==========================  ====================================================
+    Request ID                                    Action                      Start Time                  Message
+    ============================================  ==========================  ==========================  ====================================================
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StopInstanceTask            Jan 10 2019, 10:40 a.m      Started execution of StopInstanceTask <INSTANCE_UUID>
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StopInstanceTask            Jan 10 2019, 10:41 a.m      Finished execution of StopInstanceTask <INSTANCE_UUID>
+    ============================================  ==========================  ==========================  ====================================================
+
+    * Start Instance Task
+    ============================================  ==========================  ==========================  ====================================================
+    Request ID                                    Action                      Start Time                  Message
+    ============================================  ==========================  ==========================  ====================================================
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StartInstanceTask           Jan 10 2019, 10:41 a.m      Starting instance <INSTANCE_UUID>
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StartInstanceTask           Jan 10 2019, 10:42 a.m      Started instance <INSTANCE_UUID>
+    ============================================  ==========================  ==========================  ====================================================
+
+    * Confirm Instance Active Task
+    ============================================  ==========================  ==========================  ====================================================
+    Request ID                                    Action                      Start Time                  Message
+    ============================================  ==========================  ==========================  ====================================================
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      ConfirmInstanceActiveTask   Jan 10 2019, 10:43 a.m      Confirming instance is Active <INSTANCE_UUID>
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      ConfirmInstanceActiveTask   Jan 10 2019, 10:43 a.m      Confirmed instance is Active <INSTANCE_UUID>
+    ============================================  ==========================  ==========================  ====================================================
+
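One way to turn recorded events into rows like those above is sketched below. It is illustrative only and is not Horizon code: `render_table` is a hypothetical function, and the idea that each stored event carries a request ID and a start time is an assumption, since the `progress_details` examples in this spec show only `progress` and `progress_data`.

```python
from typing import Dict, List, Sequence

# Column headings taken from the example tables in the spec.
COLUMNS = ("Request ID", "Action", "Start Time", "Message")


def render_table(rows: Sequence[Dict[str, str]]) -> str:
    """Render rows as a simple table with '=' rules, one column per heading."""
    # Each column is as wide as its heading or its longest cell.
    widths = [
        max([len(col)] + [len(row[col]) for row in rows]) for col in COLUMNS
    ]
    rule = "  ".join("=" * width for width in widths)

    def fmt(cells: Sequence[str]) -> str:
        padded = (cell.ljust(width) for cell, width in zip(cells, widths))
        return "  ".join(padded).rstrip()

    lines: List[str] = [rule, fmt(COLUMNS), rule]
    lines.extend(fmt([row[col] for col in COLUMNS]) for row in rows)
    lines.append(rule)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table([
        {
            "Request ID": "req-679033b7-1755-4929-bf85-eb3bfaef7e0b",
            "Action": "StopInstanceTask",
            "Start Time": "Jan 10 2019, 10:40 a.m",
            "Message": "Started execution of StopInstanceTask <INSTANCE_UUID>",
        },
    ]))
```

Sizing the rules from the data keeps long messages from overflowing the last column.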
+Alternatives
+------------
+
+Send versioned notifications for recovery workflows, similar to other
+OpenStack services.
+
+Data model impact
+-----------------
+
+The following tables will be added to the Masakari database:
+
+* alembic_version
+* logbooks
+* flowdetails
+* atomdetails
+
+.. note::
+   Here, alembic_version stores the version information of the taskflow
+   database, not of the Masakari database.
+   The Masakari database is not currently under alembic control.
+
+For example, in the case of the ‘instance-failure recovery workflow’, data
+will be stored in the following tables:
+
+* logbooks: Parent table, one entry for each notification received.
+* flowdetails: Child table for logbooks, one entry for each notification received.
+* atomdetails: Child table for flowdetails, one entry for each task of recovery.
+
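The parent/child relationship between the three tables can be pictured with plain dataclasses. This is a conceptual sketch, not taskflow's own model classes: the class layout and `new_instance_recovery_logbook` are hypothetical, while the logbook, flow and task names are taken from the example entries earlier in this spec.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AtomDetail:
    """One row per recovery task (atomdetails)."""
    name: str
    state: str = "PENDING"
    meta: Dict[str, object] = field(default_factory=dict)


@dataclass
class FlowDetail:
    """One row per notification (flowdetails); parent of the atom rows."""
    name: str
    state: str = "PENDING"
    atoms: List[AtomDetail] = field(default_factory=list)


@dataclass
class LogBook:
    """One row per notification (logbooks); parent of the flow rows."""
    name: str
    meta: Dict[str, str] = field(default_factory=dict)
    flows: List[FlowDetail] = field(default_factory=list)


def new_instance_recovery_logbook(notification_uuid: str) -> LogBook:
    """Build the hierarchy for one instance-failure notification."""
    tasks = ("StopInstanceTask", "StartInstanceTask", "ConfirmInstanceActiveTask")
    flow = FlowDetail("instance_recovery_engine",
                      atoms=[AtomDetail(task) for task in tasks])
    return LogBook("instance_recovery",
                   meta={"notification_uuid": notification_uuid},
                   flows=[flow])


if __name__ == "__main__":
    book = new_instance_recovery_logbook("9ca38361-eef9-4fca-a1fe-49ef0c7e23e8")
    for flow in book.flows:
        print(flow.name, [atom.name for atom in flow.atoms])
```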
+.. note::
+   There is no foreign key association between the taskflow persistence
+   tables. If a logbook entry is deleted, its child entries are deleted as
+   well.
+
+REST API impact
+---------------
+
+A new microversion will be created to add event details to the GET
+/notifications/<notification_uuid> API.
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+Masakari recovery workflows do not currently support event notifications.
+This spec adds that feature.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+There will be a slight performance impact due to the overhead of storing
+events in the database while each recovery workflow is processed.
+
+Other deployer impact
+---------------------
+
+None
+
+
+Developer impact
+----------------
+
+None
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+
+* Jayashri Bidwe <Jayashri.Bidwe@nttdata.com>
+* Vrushali Kamde <Vrushali.Kamde@nttdata.com>
+
+Work Items
+----------
+
+* Use the Masakari database as the persistence backend for each taskflow
+* Execute each taskflow, recording the required details at each task
+* Populate meta with progress status
+* Update the notification API for GET /notifications/<notification_uuid> in a
+  new microversion to return the stored event information for the recovery
+  workflow
+* Update unit tests for code coverage
+* Add documentation on how to use this feature in Horizon
+
+
+Dependencies
+============
+
+None
+
+
+Testing
+=======
+
+Tempest tests are not needed, as unit tests are sufficient to check
+whether events are sent for recovery operations.
+
+
+Documentation Impact
+====================
+
+None
+
+
+
+References
+==========
+
+.. _`persistence-doc`: https://docs.openstack.org/taskflow/latest/user/persistence.html
+.. _`taskflow_persistence`: https://github.com/openstack/taskflow/tree/master/taskflow/persistence
+.. _`notifier_method`: https://github.com/openstack/taskflow/blob/master/taskflow/types/notifier.py#L186
+
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Stein
+     - Introduced
