From 83d1a0aae17e4e8110ac64c7975a8520592712f9 Mon Sep 17 00:00:00 2001 From: Abhishek Kekane Date: Fri, 20 Jan 2017 12:00:12 +0530 Subject: [PATCH] Implement reserved_host, auto_priority and rh_priority recovery methods Implements: bp implement-recovery-methods Change-Id: I83ce204d8f25b240fa6ce723dc15192ae9b4e191 --- .../implement-reserved-host-action.rst | 224 ++++++++++++++++++ 1 file changed, 224 insertions(+) create mode 100644 specs/ocata/approved/implement-reserved-host-action.rst diff --git a/specs/ocata/approved/implement-reserved-host-action.rst b/specs/ocata/approved/implement-reserved-host-action.rst new file mode 100644 index 0000000..93b554c --- /dev/null +++ b/specs/ocata/approved/implement-reserved-host-action.rst @@ -0,0 +1,224 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +================================================================= +Implement RESERVED_HOST recovery action for host failure workflow +================================================================= + +https://blueprints.launchpad.net/masakari/+spec/implement-recovery-methods + +This spec talks about adding RESERVED_HOST recovery action for host +failure workflow. In masakari each failover segment has recovery_method +defined for it, so that if any of the host within that failover segment +goes down then recovery action will be executed to evacuate "HA_Enabled" or all +instances depending on "evacuate_all_instances" configuration option from that +host based on recovery_method. + +What is RESERVED_HOST? + +In each failover segment operators will keep some hosts as reserved by +disabling the compute service on those hosts and setting "reserved" +property of that host as "True". As a result, these hosts are not +selected by the nova scheduler while provisioning new instances or for +performing any other instance actions such as resize, migration etc. +These hosts can be used by masakari engine for evacuating the instances +from the failed host. Once the reserved host is used for evacuating +the instances it is no longer treated as reserved and nova scheduler can +use that host for scheduling the instances. + +Problem description +=================== + +Masakari provides a driver interface for implementing the workflows +synchronously or asynchronously. Whoever wants to implement the +workflow can inherit the masakari driver and implement the workflows. + +For implementing the RESERVED_HOST recovery action masakari engine +should provide list of reserved hosts associated with its failover segment +to the driver. Its then job of the driver to execute the workflow and use +this list for evacuating the instances from failed host. One of the task of +the workflow is to enable the compute service on reserved host so that +instances can be evacuated on that host. At the same time "reserved" property +of that host needs to be set to False. There is a possibility for multiple +host failures under one failover segment may take the same reserved host and +start the recovery workflows at the same time. To avoid this situation, lock +with current reserved host name will be acquired on each of the task and that +reserved host will be skipped if the lock acquired and evacuation will be done +on the next reserved host from the list. + +Use Cases +--------- + +Operator may want to execute host_failure workflow using 'RESERVED_HOST' +recovery method. + +Proposed change +=============== + +Masakari engine should execute the workflows synchronously only. Masakari +engine will load all the drivers. Whoever is going to implement the new driver, +it should be the responsibility of that driver to get the result of workflow +and send it back to the masakari engine. If someone wants to add a driver +which will execute the workflow on a different host and not on the same host +where masakari engine is running then they will need to design that driver +in such a way that workflow will execute on any host in asynchronous way and +send back the result to the masakari engine, so that masakari engine will +set the notification status to "ERROR" or "FINISHED" based on the results. + +To implement "reserved_host" recovery method, we need to implement lock +mechanism over reserved host so that masakari-engine don't use same reserved +host for multiple failure notifications. There are two ways to implement the +lock mechanism: + +1. Use oslo_concurrency.lockutils file based lock: + Easy to implement, but cannot manage lock among multiple nodes. + +2. Implement lock mechanism using Tooz: + Operator would want to deploy multiple masakari-engine services on + different nodes. For this purpose we recommend to use distributed locking + mechanism provided by Tooz library. By default Tooz would be configured to + use file locks, so everything will work as oslo_concurrency lock mechanism. + If operator would want to run multiple masakari-engine services he/she + would need to configure Tooz backend service and set it in masakari.conf. + Currently most reliable Tooz backends are ZooKeeper and Redis. + +As of now, in masakari file based lock (oslo_concurrency.lockutils) is already +used. Same mechanism will be used to acquire lock on reserved host. Tooz +support will be added later and all the existing locks will be migrated to +use Tooz locking mechanism. + +Pros: +----- +1. No need to change current masakari engine implementation. +2. It's easy to implement other recovery actions such as "AUTO_PRIORITY" and + "RH_PRIORITY" with this design. + +Alternatives +------------ + +Execute the workflows asynchronously i.e. either workflow can be executed on +same host where masakari engine is running or on a different host altogether. +In this case masakari engine might not get the results from the workflow +execution and will not able to update notification status and set reserved +hosts to False. + +To achieve this masakari engine will generate a callback URL based on the +notification_id and pass it to the driver. Sample callback URL will be like, + +http://:/v1/notification/ + +Driver will further pass this URL and the required information such as +reserved host list, host name etc. to the workflow. Workflow will be +responsible to call the masakari using callback URL with notification status +and reserved hosts used by workflow as a body of the request. + +Once this request is received by masakari, then based on the notification_id +it will map it to the notification from the database table and update the +status of the notification accordingly. Also masakari will get the list of +used reserved hosts in request body, so it will loop through it and set those +host's "reserved" property as False. + +Rest API can be:: + +method: PUT +URL: URL that is passed to the workflow(contains notification_id) +Body: +result { + notification_status: status of the notification, either 'error' or + 'finished', used_reserved_hosts: list of actually used reserved_hosts + by workflow +} + +Pros: +----- + +1. Better approach to implement new workflow drivers. + +Cons: +----- +1. The other service which is going to request masakari using REST api should + have required admin credentials to call the API. +2. Need to change current driver (taskflow) implementation to adopt this + design. +3. Need to modify PUT api to incorporate this change. + +Data model impact +----------------- + +None + +REST API impact +--------------- + +None + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +None + +Other deployer impact +--------------------- + +None + +Developer impact +---------------- + +None + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + Dinesh Bhor + Abhishek Kekane + +Work Items +---------- + +* Implement RESERVED_HOST recovery_method for host_failure recovery in + synchronous way for taskflow driver +* Add unit tests for the coverage + +Dependencies +============ + +None + +Testing +======= + +None + +Documentation Impact +==================== + +None + +References +========== + +http://eavesdrop.openstack.org/meetings/masakari/2016/masakari.2016-12-13-04.02.log.html +http://eavesdrop.openstack.org/meetings/masakari/2017/masakari.2017-02-07-04.01.log.html