From 83d1a0aae17e4e8110ac64c7975a8520592712f9 Mon Sep 17 00:00:00 2001
From: Abhishek Kekane <abhishek.kekane@nttdata.com>
Date: Fri, 20 Jan 2017 12:00:12 +0530
Subject: [PATCH] Implement reserved_host, auto_priority and rh_priority
 recovery methods

Implements: bp implement-recovery-methods
Change-Id: I83ce204d8f25b240fa6ce723dc15192ae9b4e191
---
 .../implement-reserved-host-action.rst        | 224 ++++++++++++++++++
 1 file changed, 224 insertions(+)
 create mode 100644 specs/ocata/approved/implement-reserved-host-action.rst

diff --git a/specs/ocata/approved/implement-reserved-host-action.rst b/specs/ocata/approved/implement-reserved-host-action.rst
new file mode 100644
index 0000000..93b554c
--- /dev/null
+++ b/specs/ocata/approved/implement-reserved-host-action.rst
@@ -0,0 +1,224 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=================================================================
+Implement RESERVED_HOST recovery action for host failure workflow
+=================================================================
+
+https://blueprints.launchpad.net/masakari/+spec/implement-recovery-methods
+
+This spec talks about adding RESERVED_HOST recovery action for host
+failure workflow. In masakari each failover segment has recovery_method
+defined for it, so that if any of the host within that failover segment
+goes down then recovery action will be executed to evacuate "HA_Enabled" or all
+instances depending on "evacuate_all_instances" configuration option from that
+host based on recovery_method.
+
+What is RESERVED_HOST?
+
+In each failover segment operators will keep some hosts as reserved by
+disabling the compute service on those hosts and setting "reserved"
+property of that host as "True". As a result, these hosts are not
+selected by the nova scheduler while provisioning new instances or for
+performing any other instance actions such as resize, migration etc.
+These hosts can be used by masakari engine for evacuating the instances
+from the failed host. Once the reserved host is used for evacuating
+the instances it is no longer treated as reserved and nova scheduler can
+use that host for scheduling the instances.
+
+Problem description
+===================
+
+Masakari provides a driver interface for implementing the workflows
+synchronously or asynchronously. Whoever wants to implement the
+workflow can inherit the masakari driver and implement the workflows.
+
+For implementing the RESERVED_HOST recovery action masakari engine
+should provide list of reserved hosts associated with its failover segment
+to the driver. Its then job of the driver to execute the workflow and use
+this list for evacuating the instances from failed host. One of the task of
+the workflow is to enable the compute service on reserved host so that
+instances can be evacuated on that host. At the same time "reserved" property
+of that host needs to be set to False. There is a possibility for multiple
+host failures under one failover segment may take the same reserved host and
+start the recovery workflows at the same time. To avoid this situation, lock
+with current reserved host name will be acquired on each of the task and that
+reserved host will be skipped if the lock acquired and evacuation will be done
+on the next reserved host from the list.
+
+Use Cases
+---------
+
+Operator may want to execute host_failure workflow using 'RESERVED_HOST'
+recovery method.
+
+Proposed change
+===============
+
+Masakari engine should execute the workflows synchronously only. Masakari
+engine will load all the drivers. Whoever is going to implement the new driver,
+it should be the responsibility of that driver to get the result of workflow
+and send it back to the masakari engine. If someone wants to add a driver
+which will execute the workflow on a different host and not on the same host
+where masakari engine is running then they will need to design that driver
+in such a way that workflow will execute on any host in asynchronous way and
+send back the result to the masakari engine, so that masakari engine will
+set the notification status to "ERROR" or "FINISHED" based on the results.
+
+To implement "reserved_host" recovery method, we need to implement lock
+mechanism over reserved host so that masakari-engine don't use same reserved
+host for multiple failure notifications. There are two ways to implement the
+lock mechanism:
+
+1. Use oslo_concurrency.lockutils file based lock:
+   Easy to implement, but cannot manage lock among multiple nodes.
+
+2. Implement lock mechanism using Tooz:
+   Operator would want to deploy multiple masakari-engine services on
+   different nodes. For this purpose we recommend to use distributed locking
+   mechanism provided by Tooz library. By default Tooz would be configured to
+   use file locks, so everything will work as oslo_concurrency lock mechanism.
+   If operator would want to run multiple masakari-engine services he/she
+   would need to configure Tooz backend service and set it in masakari.conf.
+   Currently most reliable Tooz backends are ZooKeeper and Redis.
+
+As of now, in masakari file based lock (oslo_concurrency.lockutils) is already
+used. Same mechanism will be used to acquire lock on reserved host. Tooz
+support will be added later and all the existing locks will be migrated to
+use Tooz locking mechanism.
+
+Pros:
+-----
+1. No need to change current masakari engine implementation.
+2. It's easy to implement other recovery actions such as "AUTO_PRIORITY" and
+   "RH_PRIORITY" with this design.
+
+Alternatives
+------------
+
+Execute the workflows asynchronously i.e. either workflow can be executed on
+same host where masakari engine is running or on a different host altogether.
+In this case masakari engine might not get the results from the workflow
+execution and will not able to update notification status and set reserved
+hosts to False.
+
+To achieve this masakari engine will generate a callback URL based on the
+notification_id and pass it to the driver. Sample callback URL will be like,
+
+http://<host>:<port>/v1/notification/<notification_id>
+
+Driver will further pass this URL and the required information such as
+reserved host list, host name etc. to the workflow. Workflow will be
+responsible to call the masakari using callback URL with notification status
+and reserved hosts used by workflow as a body of the request.
+
+Once this request is received by masakari, then based on the notification_id
+it will map it to the notification from the database table and update the
+status of the notification accordingly. Also masakari will get the list of
+used reserved hosts in request body, so it will loop through it and set those
+host's "reserved" property as False.
+
+Rest API can be::
+
+method: PUT
+URL: URL that is passed to the workflow(contains notification_id)
+Body:
+result {
+    notification_status: status of the notification, either 'error' or
+    'finished', used_reserved_hosts: list of actually used reserved_hosts
+    by workflow
+}
+
+Pros:
+-----
+
+1. Better approach to implement new workflow drivers.
+
+Cons:
+-----
+1. The other service which is going to request masakari using REST api should
+   have required admin credentials to call the API.
+2. Need to change current driver (taskflow) implementation to adopt this
+   design.
+3. Need to modify PUT api to incorporate this change.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Dinesh Bhor <dinesh.bhor@nttdata.com>
+  Abhishek Kekane <abhishek.kekane@nttdata.com>
+
+Work Items
+----------
+
+* Implement RESERVED_HOST recovery_method for host_failure recovery in
+  synchronous way for taskflow driver
+* Add unit tests for the coverage
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+None
+
+Documentation Impact
+====================
+
+None
+
+References
+==========
+
+http://eavesdrop.openstack.org/meetings/masakari/2016/masakari.2016-12-13-04.02.log.html
+http://eavesdrop.openstack.org/meetings/masakari/2017/masakari.2017-02-07-04.01.log.html