Browse Source

Implement reserved_host, auto_priority and rh_priority recovery methods

Implements: bp implement-recovery-methods
Change-Id: I83ce204d8f25b240fa6ce723dc15192ae9b4e191
Abhishek Kekane 5 years ago
  1. 224


@ -0,0 +1,224 @@
This work is licensed under a Creative Commons Attribution 3.0 Unported
Implement RESERVED_HOST recovery action for host failure workflow
This spec talks about adding RESERVED_HOST recovery action for host
failure workflow. In masakari each failover segment has recovery_method
defined for it, so that if any of the host within that failover segment
goes down then recovery action will be executed to evacuate "HA_Enabled" or all
instances depending on "evacuate_all_instances" configuration option from that
host based on recovery_method.
In each failover segment operators will keep some hosts as reserved by
disabling the compute service on those hosts and setting "reserved"
property of that host as "True". As a result, these hosts are not
selected by the nova scheduler while provisioning new instances or for
performing any other instance actions such as resize, migration etc.
These hosts can be used by masakari engine for evacuating the instances
from the failed host. Once the reserved host is used for evacuating
the instances it is no longer treated as reserved and nova scheduler can
use that host for scheduling the instances.
Problem description
Masakari provides a driver interface for implementing the workflows
synchronously or asynchronously. Whoever wants to implement the
workflow can inherit the masakari driver and implement the workflows.
For implementing the RESERVED_HOST recovery action masakari engine
should provide list of reserved hosts associated with its failover segment
to the driver. Its then job of the driver to execute the workflow and use
this list for evacuating the instances from failed host. One of the task of
the workflow is to enable the compute service on reserved host so that
instances can be evacuated on that host. At the same time "reserved" property
of that host needs to be set to False. There is a possibility for multiple
host failures under one failover segment may take the same reserved host and
start the recovery workflows at the same time. To avoid this situation, lock
with current reserved host name will be acquired on each of the task and that
reserved host will be skipped if the lock acquired and evacuation will be done
on the next reserved host from the list.
Use Cases
Operator may want to execute host_failure workflow using 'RESERVED_HOST'
recovery method.
Proposed change
Masakari engine should execute the workflows synchronously only. Masakari
engine will load all the drivers. Whoever is going to implement the new driver,
it should be the responsibility of that driver to get the result of workflow
and send it back to the masakari engine. If someone wants to add a driver
which will execute the workflow on a different host and not on the same host
where masakari engine is running then they will need to design that driver
in such a way that workflow will execute on any host in asynchronous way and
send back the result to the masakari engine, so that masakari engine will
set the notification status to "ERROR" or "FINISHED" based on the results.
To implement "reserved_host" recovery method, we need to implement lock
mechanism over reserved host so that masakari-engine don't use same reserved
host for multiple failure notifications. There are two ways to implement the
lock mechanism:
1. Use oslo_concurrency.lockutils file based lock:
Easy to implement, but cannot manage lock among multiple nodes.
2. Implement lock mechanism using Tooz:
Operator would want to deploy multiple masakari-engine services on
different nodes. For this purpose we recommend to use distributed locking
mechanism provided by Tooz library. By default Tooz would be configured to
use file locks, so everything will work as oslo_concurrency lock mechanism.
If operator would want to run multiple masakari-engine services he/she
would need to configure Tooz backend service and set it in masakari.conf.
Currently most reliable Tooz backends are ZooKeeper and Redis.
As of now, in masakari file based lock (oslo_concurrency.lockutils) is already
used. Same mechanism will be used to acquire lock on reserved host. Tooz
support will be added later and all the existing locks will be migrated to
use Tooz locking mechanism.
1. No need to change current masakari engine implementation.
2. It's easy to implement other recovery actions such as "AUTO_PRIORITY" and
"RH_PRIORITY" with this design.
Execute the workflows asynchronously i.e. either workflow can be executed on
same host where masakari engine is running or on a different host altogether.
In this case masakari engine might not get the results from the workflow
execution and will not able to update notification status and set reserved
hosts to False.
To achieve this masakari engine will generate a callback URL based on the
notification_id and pass it to the driver. Sample callback URL will be like,
Driver will further pass this URL and the required information such as
reserved host list, host name etc. to the workflow. Workflow will be
responsible to call the masakari using callback URL with notification status
and reserved hosts used by workflow as a body of the request.
Once this request is received by masakari, then based on the notification_id
it will map it to the notification from the database table and update the
status of the notification accordingly. Also masakari will get the list of
used reserved hosts in request body, so it will loop through it and set those
host's "reserved" property as False.
Rest API can be::
method: PUT
URL: URL that is passed to the workflow(contains notification_id)
result {
notification_status: status of the notification, either 'error' or
'finished', used_reserved_hosts: list of actually used reserved_hosts
by workflow
1. Better approach to implement new workflow drivers.
1. The other service which is going to request masakari using REST api should
have required admin credentials to call the API.
2. Need to change current driver (taskflow) implementation to adopt this
3. Need to modify PUT api to incorporate this change.
Data model impact
REST API impact
Security impact
Notifications impact
Other end user impact
Performance Impact
Other deployer impact
Developer impact
Primary assignee:
Dinesh Bhor <>
Abhishek Kekane <>
Work Items
* Implement RESERVED_HOST recovery_method for host_failure recovery in
synchronous way for taskflow driver
* Add unit tests for the coverage
Documentation Impact