Add progress details for recovery workflow

Taskflow supports task persistence, which stores the details of each
task in the database. Using this functionality, Masakari will store
task details for failure recovery workflows.

Change-Id: I4fe394f473a93aedc9e167bbde3dd196cfc89559
Implements: bp progress-details-recovery-workflows
changes/79/632079/3
shilpa.devharakar 5 months ago
parent
commit
af589037e7
1 changed files with 411 additions and 0 deletions

specs/stein/approved/progress-details-for-recovery-workflows.rst

@@ -0,0 +1,411 @@
1
+..
2
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
3
+ License.
4
+
5
+ http://creativecommons.org/licenses/by/3.0/legalcode
6
+
7
+============================================
8
+Add progress details for recovery workflows
9
+============================================
10
+
11
+https://blueprints.launchpad.net/masakari/+spec/progress-details-recovery-workflows
12
+
13
+This blueprint proposes a feature that emits events for recovery
+workflows.
15
+
16
+Problem description
17
+===================
18
+
19
+Currently, Masakari does not emit any events while it carries out the
+recovery operations requested by a Masakari monitor.
21
+
22
+It would be useful to receive events at each stage of every task in a recovery
+workflow, along with completion status and progress details, so that operators
+know what is happening during execution.
25
+
26
+Use Cases
27
+---------
28
+
29
+From the progress details captured for each recovery event, operators will be
+able to learn the following:
+
+* The beginning and end of each task in the recovery flow
+* Errors encountered when the recovery process fails
+* Progress details describing each task
35
+
36
+
37
+Proposed change
38
+===============
39
+
40
+A Masakari recovery workflow is a defined set of tasks executed to recover
+from a failure. Masakari handles three types of failure:
42
+
43
+* instance-failure
44
+* process-failure
45
+* host-failure
46
+
47
+For each of these failures, Masakari executes a workflow to recover from the
+failure. Masakari currently uses the taskflow library to execute the workflow,
+which consists of predefined recovery actions that are executed linearly. This
+spec proposes recording these recovery actions with the help of the Taskflow
+persistence feature. Masakari will persist the flow so that it can be resumed,
+restarted or rolled back on engine failure.
53
+
54
+Taskflow supports workflow persistence, which stores the details of each task
+in the database. For more details, please refer to `persistence-doc`_.
56
+
57
+Taskflow stores workflow and task details in the following three tables:
59
+
60
+* logbooks
61
+* flowdetails
62
+* atomdetails
63
+
64
+In particular, for each flow there is a corresponding flowdetails
65
+record, and for each task there is a corresponding atomdetails record. These
66
+form the basic level of information about how a flow will be persisted.
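The relationship between the three tables can be pictured with a small, self-contained sketch. The class and field names below are hypothetical stand-ins chosen to mirror the table names; they are not the real taskflow schema or API, and the task names are taken from the instance-failure workflow described in this spec.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AtomDetail:
    """Simplified stand-in for one atomdetails row (one per task)."""
    name: str
    state: str = "PENDING"
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FlowDetail:
    """Simplified stand-in for one flowdetails row (one per flow)."""
    name: str
    state: str = "PENDING"
    atoms: List[AtomDetail] = field(default_factory=list)


@dataclass
class LogBook:
    """Simplified stand-in for one logbooks row (the top-level container)."""
    name: str
    meta: Dict[str, Any] = field(default_factory=dict)
    flows: List[FlowDetail] = field(default_factory=list)


def build_instance_recovery_logbook(notification_uuid: str) -> LogBook:
    """Build the logbook -> flow -> task hierarchy for one notification."""
    task_names = [
        "StopInstanceTask",
        "StartInstanceTask",
        "ConfirmInstanceActiveTask",
    ]
    flow = FlowDetail(
        name="instance_recovery_engine",
        atoms=[AtomDetail(name=task_name) for task_name in task_names],
    )
    return LogBook(
        name="instance_recovery",
        meta={"notification_uuid": notification_uuid},
        flows=[flow],
    )
```

In this picture a single notification yields one logbook, one flow detail and one atom detail per task, which is the shape the persisted records are expected to take.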
67
+
68
+By importing the persistence package `taskflow_persistence`_ and accessing
+Masakari storage through the Masakari engine, the Taskflow tables can be
+brought into the Masakari database. In the taskflow library, a workflow
+consists of tasks, and each task has a state and a status. The
+`notifier_method`_ will be used to update the progress details of each
+recovery task, giving a detailed record of the execution flow.
73
+
74
+The saved recovery task details (failures, successes, intermediate results)
+will be rendered in Horizon in tabular format, which helps operators
+understand the progress and status of a recovery. Execution progress is stored
+on a scale of 0 to 1, so that operators can see how far a recovery has
+progressed along with detailed information about each task.
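As a rough illustration of the 0 to 1 scale, the sketch below accumulates progress events in a plain dictionary. The helper names `record_progress` and `percent_complete` are hypothetical, the flat list under `progress_details` is a simplification of the nested structure taskflow persists, and nothing here calls the taskflow library.

```python
from typing import Any, Dict


def record_progress(meta: Dict[str, Any], progress: float,
                    message: str) -> Dict[str, Any]:
    """Append one progress event to a task's meta and update its progress."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    # Keep every event so the full history can be shown to the operator.
    events = meta.setdefault("progress_details", [])
    events.append({"progress": progress, "progress_data": message})
    # The latest event determines the overall completion of the task.
    meta["progress"] = progress
    return meta


def percent_complete(meta: Dict[str, Any]) -> str:
    """Format the stored 0-1 progress value as a percentage string."""
    return "{:.2%}".format(float(meta.get("progress", 0.0)))
```

A task that has recorded a start event at 0.5 and a finish event at 1.0 would then report `100.00%`, while keeping both messages available for display.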
79
+
80
+The actions/events that will be recorded for the
+‘instance-failure recovery workflow’, along with their progress details, are:
82
+
83
+* Stop Instance Task: The possible events, and the progress details that will
+  be recorded for them, are:
85
+
86
+  * Start of Stop instance task::
87
+
88
+      "progress_details" = {
89
+        "progress": 0.50,
90
+        "progress_data": "Started execution of StopInstanceTask <INSTANCE_UUID>"
91
+      }
92
+
93
+  * Skipping recovery event if an instance is not HA_Enabled and
94
+    "process_all_instances" config option is also disabled::
95
+
96
+      "progress_details" = {
97
+        "progress": 1,
98
+        "progress_data": "Skipping recovery for instance <INSTANCE_UUID> as it is not Ha_Enabled"
99
+      }
100
+
101
+  * Ignored recovery event if the instance VM state is either 'paused' or
+    'rescued'::
103
+
104
+      "progress_details" = {
105
+        "progress": 1,
106
+        "progress_data": "Ignoring recovery for instance <INSTANCE_UUID> as it is in paused/rescued state"
107
+      }
108
+
109
+  * Stop instance event::
110
+
111
+      "progress_details" = {
112
+        "progress": 1,
113
+        "progress_data": "Finished execution of StopInstanceTask <INSTANCE_UUID>"
114
+      }
115
+
116
+  * Failure event if the instance could not be stopped::
117
+
118
+      "progress_details" = {
119
+        "progress": 1,
120
+        "progress_data": "Failed to stop instance <INSTANCE_UUID>"
121
+      }
122
+
123
+* Start Instance Task: The possible events, and the progress details that will
+  be recorded for them, are:
125
+
126
+  * Start instance event::
127
+
128
+      "progress_details" = {
129
+        "progress": 0.5,
130
+        "progress_data": "Started execution of StartInstanceTask <INSTANCE_UUID>"
131
+      }
132
+
133
+  * Finish of Start instance event::
134
+
135
+      "progress_details" = {
136
+        "progress": 1,
137
+        "progress_data": "Finished execution of StartInstanceTask <INSTANCE_UUID>"
138
+      }
139
+
140
+  * Failure event if the instance could not be started or is in an invalid
+    state::
141
+
142
+      "progress_details" = {
143
+        "progress": 1,
144
+        "progress_data": "Failed to start instance <INSTANCE_UUID>"
145
+      }
146
+
147
+* Confirm Instance Active Task: The possible events, and the progress details
+  that will be recorded for them, are:
149
+
150
+  * Start of Confirm instance event::
151
+
152
+      "progress_details" = {
153
+        "progress": 0.5,
154
+        "progress_data": "Confirming instance <INSTANCE_UUID> is Active"
155
+      }
156
+
157
+  * Finish of Confirm instance event::
158
+
159
+      "progress_details" = {
160
+        "progress": 1,
161
+        "progress_data": "Confirmed instance <INSTANCE_UUID> is Active"
162
+      }
163
+
164
+  * Failure event if the instance could not be confirmed as active::
165
+
166
+      "progress_details" = {
167
+        "progress": 1,
168
+        "progress_data": "Failed to confirm instance <INSTANCE_UUID>"
169
+      }
170
+
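The branching among the Stop Instance Task events listed above can be sketched as a pure function. This is only an illustration: the function name `stop_instance_events` and its parameters are hypothetical, the ordering (a start event first, then exactly one outcome) is an assumption, and the failure event is left out because it depends on a real call to stop the instance.

```python
from typing import List, Tuple

# VM states for which recovery is ignored, as listed in this spec.
IGNORED_VM_STATES = ("paused", "rescued")


def stop_instance_events(instance_uuid: str, ha_enabled: bool,
                         process_all_instances: bool,
                         vm_state: str) -> List[Tuple[float, str]]:
    """Return the (progress, progress_data) events for one instance."""
    events = [
        (0.5, "Started execution of StopInstanceTask %s" % instance_uuid),
    ]
    # Skip instances that are not HA enabled unless all instances are processed.
    if not ha_enabled and not process_all_instances:
        events.append(
            (1.0, "Skipping recovery for instance %s as it is not Ha_Enabled"
             % instance_uuid))
        return events
    # Ignore instances whose VM state should not be recovered.
    if vm_state in IGNORED_VM_STATES:
        events.append(
            (1.0, "Ignoring recovery for instance %s as it is in "
                  "paused/rescued state" % instance_uuid))
        return events
    events.append(
        (1.0, "Finished execution of StopInstanceTask %s" % instance_uuid))
    return events
```

Every branch ends at progress 1, matching the events above, so an operator can distinguish a skipped or ignored recovery from a completed one only through the message text.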
171
+.. note::
172
+   Events are emitted only once the Masakari engine starts processing a
+   received notification by executing a recovery workflow.
174
+
175
+The database entries that will be recorded for the
+‘instance-failure recovery workflow’ are shown below::
177
+
178
+    LogBook: 'instance_recovery'
179
+    - uuid = 68e86fda-25ba-4b1d-a9fc-d999bc1c796e
180
+    - created_at = 2019-01-08 08:15:21
181
+    - updated_at = 2019-01-08 08:15:21
182
+    - meta: {"notification_uuid": "9ca38361-eef9-4fca-a1fe-49ef0c7e23e8"}
183
+    FlowDetail: 'instance_recovery_engine'
184
+    - uuid = 6a780ae7-9c63-42d9-8510-aa020d7ee566
185
+    - state = SUCCESS
186
+    TaskDetail: 'StopInstanceTask'
187
+    - uuid = c165b8c2-5123-4489-99c1-97eafff72d24
188
+    - state = SUCCESS
189
+    - version = 1.0
190
+    - failure = False
191
+    - meta: {}
192
+    - results: <CONTEXT_DETAILS>
193
+    TaskDetail: 'StopInstanceTask'
194
+    - uuid = c165b8c2-5123-4489-99c1-97eafff72d24
195
+    - state = SUCCESS
196
+    - version = 1.0
197
+    - failure = False
198
+    - meta:
199
+        + progress = 100.00%
200
+        + progress_details = {
201
+            "progress": 1,
202
+            "progress_details": {
203
+                "at_progress": 1,
204
+                "details": {
205
+                    "progress_details": [
206
+                        "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
207
+                    ]
208
+                }
209
+            }
210
+         }
211
+    - results: NULL
212
+    TaskDetail: 'StartInstanceTask'
213
+    - uuid = a4155556-fb5a-44f8-b8aa-ab8ecfe8f1ce
214
+    - state = SUCCESS
215
+    - version = 1.0
216
+    - failure = False
217
+    - meta:
218
+        + progress = 100.00%
219
+        + progress_details = {
220
+            "progress": 1,
221
+            "progress_details": {
222
+                "at_progress": 1,
223
+                "details": {
224
+                    "progress_details": [
225
+                        "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
226
+                    ]
227
+                }
228
+            }
229
+         }
230
+    - results: NULL
231
+    TaskDetail: 'ConfirmInstanceActiveTask'
232
+    - uuid = 0ea82633-599b-422d-8fd2-df2057efb29d
233
+    - state = SUCCESS
234
+    - version = 1.0
235
+    - failure = False
236
+    - meta:
237
+        + progress = 100.00%
238
+        + progress_details = {
239
+            "progress": 1,
240
+            "progress_details": {
241
+                "at_progress": 1,
242
+                "details": {
243
+                    "progress_details": [
244
+                        "progress_details" = {<progress_details_of_event_1>, <progress_details_of_event_2>, ..., <progress_details_of_event_n>}
245
+                    ]
246
+                }
247
+            }
248
+         }
249
+    - results: NULL
250
+
251
+
252
+The recorded data will be rendered in Horizon as task details in tabular
+format. For the ‘instance-failure recovery workflow’ this looks as follows::
254
+
255
+    * Stop Instance Task
256
+    ============================================  ==========================  ==========================  ====================================================
257
+    Request ID                                    Action                      Start Time                  Message
258
+    ============================================  ==========================  ==========================  ====================================================
259
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StopInstanceTask            Jan 10 2019, 10:40 a.m      Started execution of StopInstanceTask <INSTANCE_UUID>
260
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StopInstanceTask            Jan 10 2019, 10:41 a.m      Finished execution of StopInstanceTask <INSTANCE_UUID>
261
+    ============================================  ==========================  ==========================  ====================================================
262
+
263
+    * Start Instance Task
264
+    ============================================  ==========================  ==========================  ====================================================
265
+    Request ID                                    Action                      Start Time                  Message
266
+    ============================================  ==========================  ==========================  ====================================================
267
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StartInstanceTask           Jan 10 2019, 10:41 a.m      Starting instance <INSTANCE_UUID>
268
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      StartInstanceTask           Jan 10 2019, 10:42 a.m      Started instance <INSTANCE_UUID>
269
+    ============================================  ==========================  ==========================  ====================================================
270
+
271
+    * Confirm Instance Active Task
272
+    ============================================  ==========================  ==========================  ====================================================
273
+    Request ID                                    Action                      Start Time                  Message
274
+    ============================================  ==========================  ==========================  ====================================================
275
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      ConfirmInstanceActiveTask   Jan 10 2019, 10:43 a.m      Confirming instance is Active <INSTANCE_UUID>
276
+    req-679033b7-1755-4929-bf85-eb3bfaef7e0b      ConfirmInstanceActiveTask   Jan 10 2019, 10:43 a.m      Confirmed instance is Active <INSTANCE_UUID>
277
+    ============================================  ==========================  ==========================  ====================================================
278
+
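To make the tabular layout concrete, the sketch below formats event rows into a text table with the same four columns. It is not Horizon code: `render_table` is a hypothetical helper, and how the request ID and start time are obtained for each event is an assumption that this spec does not settle.

```python
from typing import Dict, List

# Column headings taken from the tables above.
COLUMNS = ["Request ID", "Action", "Start Time", "Message"]


def render_table(rows: List[Dict[str, str]]) -> str:
    """Render rows (dicts keyed by column name) as a simple text table."""
    # Each column is as wide as its longest cell, including the heading.
    widths = [
        max([len(col)] + [len(row[col]) for row in rows]) for col in COLUMNS
    ]
    border = "  ".join("=" * width for width in widths)

    def format_line(values: List[str]) -> str:
        cells = [value.ljust(width) for value, width in zip(values, widths)]
        return "  ".join(cells).rstrip()

    lines = [border, format_line(COLUMNS), border]
    lines += [format_line([row[col] for col in COLUMNS]) for row in rows]
    lines.append(border)
    return "\n".join(lines)
```

Sizing each column from its longest cell keeps long messages, such as those containing an instance UUID, inside the table border.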
279
+Alternatives
280
+------------
281
+
282
+Send versioned notifications for recovery workflows, similar to other
+OpenStack services.
284
+
285
+Data model impact
286
+-----------------
287
+
288
+The following tables will be added to the Masakari database:
289
+
290
+* alembic_version
291
+* logbooks
292
+* flowdetails
293
+* atomdetails
294
+
295
+.. note::
296
+   alembic_version here stores the version of the taskflow database schema,
+   not of the Masakari database. The Masakari database is not currently under
+   alembic control.
299
+
300
+For example, in the case of the ‘instance-failure recovery workflow’, data will
+be stored in the following tables:
302
+
303
+* logbooks: Parent table, one entry for each notification received.
304
+* flowdetails: Child table for logbooks, one entry for each notification received.
305
+* atomdetails: Child table for flowdetails, one entry for each task of recovery.
306
+
307
+.. note::
308
+   No foreign key association exists for the taskflow persistence tables.
+   If a logbook entry is deleted, its child entries are deleted as well.
310
+
311
+REST API impact
312
+---------------
313
+
314
+A new microversion will be created to add event details to the
+GET /notifications/<notification_uuid> API.
316
+
317
+Security impact
318
+---------------
319
+
320
+None
321
+
322
+Notifications impact
323
+--------------------
324
+
325
+Masakari recovery workflows do not currently support event notification.
+This spec adds that feature.
327
+
328
+Other end user impact
329
+---------------------
330
+
331
+None
332
+
333
+Performance Impact
334
+------------------
335
+
336
+There will be a slight performance impact from the overhead of storing events
+in the database while each failure recovery is processed.
338
+
339
+Other deployer impact
340
+---------------------
341
+
342
+None
343
+
344
+
345
+Developer impact
346
+----------------
347
+
348
+None
349
+
350
+
351
+Implementation
352
+==============
353
+
354
+Assignee(s)
355
+-----------
356
+
357
+Primary assignee:
358
+
359
+* Jayashri Bidwe <Jayashri.Bidwe@nttdata.com>
360
+* Vrushali Kamde <Vrushali.Kamde@nttdata.com>
361
+
362
+Work Items
363
+----------
364
+
365
+* Fetch the Masakari backend as the persistence backend for each taskflow
+* Execute the taskflow, recording all required details at each task
+* Populate meta with progress status
+* Update the notification API for GET /notifications/<notification_uuid> in a
+  new microversion to return the stored event information for the failure
+  recovery
+* Update unit tests for code coverage
+* Add documentation on how to use this feature in Horizon
373
+
374
+
375
+Dependencies
376
+============
377
+
378
+None
379
+
380
+
381
+Testing
382
+=======
383
+
384
+Tempest tests are not needed, as unit tests are sufficient to check whether
+events are emitted for recovery operations.
386
+
387
+
388
+Documentation Impact
389
+====================
390
+
391
+None
392
+
393
+
394
+References
395
+==========
396
+
397
+..  _`persistence-doc`: https://docs.openstack.org/taskflow/latest/user/persistence.html
398
+..  _`taskflow_persistence`: https://github.com/openstack/taskflow/tree/master/taskflow/persistence
399
+..  _`notifier_method`: https://github.com/openstack/taskflow/blob/master/taskflow/types/notifier.py#L186
400
+
401
+
402
+History
403
+=======
404
+
405
+.. list-table:: Revisions
406
+   :header-rows: 1
407
+
408
+   * - Release Name
409
+     - Description
410
+   * - Stein
411
+     - Introduced
