From af589037e73f5435e3f3231050f2d236ca78c813 Mon Sep 17 00:00:00 2001 From: "shilpa.devharakar" Date: Fri, 4 Jan 2019 16:15:36 +0000 Subject: [PATCH] Add progress details for recovery workflow Taskflow supports persistence of task which helps to persist each task details in the database. Using this functionality, Masakari will store task details for recovery failures. Change-Id: I4fe394f473a93aedc9e167bbde3dd196cfc89559 Implements: bp progress-details-recovery-workflows --- ...rogress-details-for-recovery-workflows.rst | 411 ++++++++++++++++++ 1 file changed, 411 insertions(+) create mode 100644 specs/stein/approved/progress-details-for-recovery-workflows.rst diff --git a/specs/stein/approved/progress-details-for-recovery-workflows.rst b/specs/stein/approved/progress-details-for-recovery-workflows.rst new file mode 100644 index 0000000..d5569be --- /dev/null +++ b/specs/stein/approved/progress-details-for-recovery-workflows.rst @@ -0,0 +1,411 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +============================================ +Add progress details for recovery workflows +============================================ + +https://blueprints.launchpad.net/masakari/+spec/progress-details-recovery-workflows + +This blueprint proposes to have a feature that notifies events for recovery +workflows. + +Problem description +=================== + +Currently, Masakari doesn't send any events during recovery operation request +received by Masakari monitor. + +It would be useful to receive events at each stage of task of recovery +workflow along with completion status and progress details so that operator +will come to know about what's happening during execution. + +Use Cases +--------- + +Operators will be able to know following things by detailed progress details +captured during each event of recovery: + +* Beginning/End of each task of recovery flow +* Errors of failure of process recovery +* Progress details which will contain the details of each task + + +Proposed change +=============== + +Masakari Recovery Workflow is a certain set of tasks executed to recover +from failure. Masakari supports three types of recovery failures: + +* instance-failure +* process-failure +* host-failure + +For each of these failures, Masakari executes a workflow to recover from +failure. Currently Masakari uses taskflow library to execute the workflow +which consists of recovery actions which are predefined and are executed +linearly. Proposing here to record these recovery actions with the help of +Taskflow persistence feature. Masakari will persist the flow so that it can be +resumed, restarted or rolled-back on engine failure. + +Taskflow supports persistence of workflow which helps to persist each task +details in the database. For more details please refer `persistence-doc`_ + +Taskflow has below three tables where workflow/task details are getting +stored: + +* logbooks +* flowdetails +* atomdetails + +In particular, for each flow there is a corresponding flowdetails +record, and for each task there is a corresponding atomdetails record. These +form the basic level of information about how a flow will be persisted. + +With the help of importing persistence package `taskflow_persistence`_ and by +accessing Masakari storage via masakari engine, able to import Taskflow tables +into Masakari. In taskflow library there is workflow, and each workflow has +task which has state and status. With the help of `notifier_method`_ will +update progress details for detailed execution flow for each task of recovery. + +Saved recovery task details (failures, successes, intermediary results) going +to render on Horizon on tabular format which helps operators to understand +progress/status of recovery. Each flow execution details stored with scale +0 to 1, so that operator will able to get progress completion along with +detailed information of each task. + +Explaining below the how actions/events that going to be recorded for +‘instance-failure recovery workflow’ along with progress details: + +* Stop Instance Task: Below listed are possible events along with progress + details that will be recorded: + + * Starting of Stop instance task:: + + "progress_details" = { + "progress": 0.50, + "progress_data": "Started execution of StopInstanceTask " + } + + * Skipping recovery event if an instance is not HA_Enabled and + "process_all_instances" config option is also disabled:: + + "progress_details" = { + "progress": 1, + "progress_data": "Skipping recovery for instance as it is not Ha_Enabled" + } + + * Ignored recovery event if an instance VM state is either in 'paused', + 'rescued':: + + "progress_details" = { + "progress": 1, + "progress_data": "Ignoring recovery for instance as it is in paused/rescued state" + } + + * Stop instance event:: + + "progress_details" = { + "progress": 1, + "progress_data": "Finished execution of StopInstanceTask " + } + + * Failure event in case failed to stop instance:: + + "progress_details" = { + "progress": 1, + "progress_data": "Failed to stop instance " + } + +* Start Instance Task: Below listed are possible events along with progress + details that will be recorded: + + * Start instance event:: + + "progress_details" = { + "progress": 0.5, + "progress_data": "Started execution of StartInstanceTask " + } + + * Finish of Start instance event:: + + "progress_details" = { + "progress": 1, + "progress_data": "Finished execution of StartInstanceTask " + } + + * Failure event in case failed to start instance or if invalid state of it:: + + "progress_details" = { + "progress": 1, + "progress_data": "Failed to start instance " + } + +* Confirm Instance Active Task: Below listed are possible events along with + progress details that will be recorded: + + * Start of Confirm instance event:: + + "progress_details" = { + "progress": 0.5, + "progress_data": "Confirming instance is Active" + } + + * Finish of Confirm instance started event:: + + "progress_details" = { + "progress": 1, + "progress_data": "Confirmed instance is Active" + } + + * Failure event in case failed to confirm instance:: + + "progress_details" = { + "progress": 1, + "progress_data": "Failed to confirm instance " + } + +.. note:: + Events are emitted only when masakari engine starts processing received + notifications by executing recovery workflow. + +Mentioning below the database entries that going to be recorded for +‘instance-failure recovery workflow’:: + + LogBook: 'instance_recovery' + - uuid = 68e86fda-25ba-4b1d-a9fc-d999bc1c796e + - created_at = 2019-01-08 08:15:21 + - updated_at = 2019-01-08 08:15:21 + - meta: {"notification_uuid": "9ca38361-eef9-4fca-a1fe-49ef0c7e23e8"} + FlowDetail: 'instance_recovery_engine' + - uuid = 6a780ae7-9c63-42d9-8510-aa020d7ee566 + - state = SUCCESS + TaskDetail: 'StopInstanceTask' + - uuid = c165b8c2-5123-4489-99c1-97eafff72d24 + - state = SUCCESS + - version = 1.0 + - failure = False + - meta: {} + - results: + TaskDetail: 'StopInstanceTask' + - uuid = c165b8c2-5123-4489-99c1-97eafff72d24 + - state = SUCCESS + - version = 1.0 + - failure = False + - meta: + + progress = 100.00% + + progress_details = { + "progress": 1, + "progress_details": { + "at_progress": 1, + "details": { + "progress_details": [ + "progress_details" = {, , ..., } + ] + } + } + } + - results: NULL + TaskDetail: 'StartInstanceTask' + - uuid = a4155556-fb5a-44f8-b8aa-ab8ecfe8f1ce + - state = SUCCESS + - version = 1.0 + - failure = False + - meta: + + progress = 100.00% + + progress_details = { + "progress": 1, + "progress_details": { + "at_progress": 1, + "details": { + "progress_details": [ + "progress_details" = {, , ..., } + ] + } + } + } + - results: NULL + TaskDetail: 'ConfirmInstanceActiveTask' + - uuid = 0ea82633-599b-422d-8fd2-df2057efb29d + - state = SUCCESS + - version = 1.0 + - failure = False + - meta: + + progress = 100.00% + + progress_details = { + "progress": 1, + "progress_details": { + "at_progress": 1, + "details": { + "progress_details": [ + "progress_details" = {, , ..., } + ] + } + } + } + - results: NULL + + +Mentioning below how the recorded data will be used to render task details +in tabular format for ‘instance-failure recovery workflow’ on Horizon:: + + * Stop Instance Task + ============================================ ========================== ========================== ==================================================== + Request ID Action Start Time Message + ============================================ ========================== ========================== ==================================================== + req-679033b7-1755-4929-bf85-eb3bfaef7e0b StopInstanceTask Jan 10 2019, 10:40 a.m Started execution of StopInstanceTask + req-679033b7-1755-4929-bf85-eb3bfaef7e0b StopInstanceTask Jan 10 2019, 10:41 a.m Finished execution of StopInstanceTask + ============================================ ========================== ========================== ==================================================== + + * Start Instance Task + ============================================ ========================== ========================== ==================================================== + Request ID Action Start Time Message + ============================================ ========================== ========================== ==================================================== + req-679033b7-1755-4929-bf85-eb3bfaef7e0b StartInstanceTask Jan 10 2019, 10:41 a.m Starting instance + req-679033b7-1755-4929-bf85-eb3bfaef7e0b StartInstanceTask Jan 10 2019, 10:42 a.m Started instance + ============================================ ========================== ========================== ==================================================== + + * Confirm Instance Active Task + ============================================ ========================== ========================== ==================================================== + Request ID Action Start Time Message + ============================================ ========================== ========================== ==================================================== + req-679033b7-1755-4929-bf85-eb3bfaef7e0b ConfirmInstanceActiveTask Jan 10 2019, 10:43 a.m Confirming instance is Active + req-679033b7-1755-4929-bf85-eb3bfaef7e0b ConfirmInstanceActiveTask Jan 10 2019, 10:43 a.m Confirmed instance is Active + ============================================ ========================== ========================== ==================================================== + +Alternatives +------------ + +Send Versioned notifications similar to the other OpenStack services for +recovery workflows. + +Data model impact +----------------- + +Below tables will get added into Masakari Database + +* alembic_version +* logbooks +* flowdetails +* atomdetails + +.. note:: + alembic_version here stores version information of taskflow database + version, not of Masakari database. + Masakaari database as of now is not under alembic control. + +For example in case of ‘instance-failure recovery workflow’, data will be +stored in below columns + +* logbooks: Parent table, one entry for each notification received. +* flowdetails: Child table for logbooks, one entry for each notification received. +* atomdetails: Child table for flowdetails, one entry for each task of recovery. + +.. note:: + Foreign key association is not there for taskflow persistence tables. + If we delete logbook entry, respective child entries also got deleted. + +REST API impact +--------------- + +A new microversion will be created to add event details to GET +/notifications/ API. + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +Masakari recovery failure doesn't support event notification feature. +This spec will add this feature. + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +There will be a slight performance impact due to the overhead for storing +events during processing of each recovery failure into database. + +Other deployer impact +--------------------- + +None + + +Developer impact +---------------- + +None + + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + +* Jayashri Bidwe +* Vrushali Kamde + +Work Items +---------- + +* Fetch backend as Masakari backend for each taskflow +* Execute taskflow with all details at each task that required +* Populate meta with progress status +* Update the notification API for GET /notifications/ in a + new microversion to pass the stored event related information of recovery + failure +* Update unit tests for code coverage +* Add documentation on how to use this feature at Horizon + + +Dependencies +============ + +None + + +Testing +======= + +No need to write tempest tests as unit tests are sufficient to check +whether the events are sent or not for recovery operations. + + +Documentation Impact +==================== + +None + + +References +========== + +.. _`persistence-doc`: https://docs.openstack.org/taskflow/latest/user/persistence.html +.. _`taskflow_persistence`: https://github.com/openstack/taskflow/tree/master/taskflow/persistence +.. _`notifier_method`: https://github.com/openstack/taskflow/blob/master/taskflow/types/notifier.py#L186 + + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Stein + - Introduced