Implement reserved_host, auto_priority and rh_priority recovery methods
Implements: bp implement-recovery-methods Change-Id: I83ce204d8f25b240fa6ce723dc15192ae9b4e191
This commit is contained in:
parent
4e746cb5a3
commit
83d1a0aae1
|
@ -0,0 +1,224 @@
|
|||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=================================================================
|
||||
Implement RESERVED_HOST recovery action for host failure workflow
|
||||
=================================================================
|
||||
|
||||
https://blueprints.launchpad.net/masakari/+spec/implement-recovery-methods
|
||||
|
||||
This spec talks about adding RESERVED_HOST recovery action for host
|
||||
failure workflow. In masakari each failover segment has recovery_method
|
||||
defined for it, so that if any of the host within that failover segment
|
||||
goes down then recovery action will be executed to evacuate "HA_Enabled" or all
|
||||
instances depending on "evacuate_all_instances" configuration option from that
|
||||
host based on recovery_method.
|
||||
|
||||
What is RESERVED_HOST?
|
||||
|
||||
In each failover segment operators will keep some hosts as reserved by
|
||||
disabling the compute service on those hosts and setting "reserved"
|
||||
property of that host as "True". As a result, these hosts are not
|
||||
selected by the nova scheduler while provisioning new instances or for
|
||||
performing any other instance actions such as resize, migration etc.
|
||||
These hosts can be used by masakari engine for evacuating the instances
|
||||
from the failed host. Once the reserved host is used for evacuating
|
||||
the instances it is no longer treated as reserved and nova scheduler can
|
||||
use that host for scheduling the instances.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Masakari provides a driver interface for implementing the workflows
|
||||
synchronously or asynchronously. Whoever wants to implement the
|
||||
workflow can inherit the masakari driver and implement the workflows.
|
||||
|
||||
For implementing the RESERVED_HOST recovery action masakari engine
|
||||
should provide list of reserved hosts associated with its failover segment
|
||||
to the driver. Its then job of the driver to execute the workflow and use
|
||||
this list for evacuating the instances from failed host. One of the task of
|
||||
the workflow is to enable the compute service on reserved host so that
|
||||
instances can be evacuated on that host. At the same time "reserved" property
|
||||
of that host needs to be set to False. There is a possibility for multiple
|
||||
host failures under one failover segment may take the same reserved host and
|
||||
start the recovery workflows at the same time. To avoid this situation, lock
|
||||
with current reserved host name will be acquired on each of the task and that
|
||||
reserved host will be skipped if the lock acquired and evacuation will be done
|
||||
on the next reserved host from the list.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
Operator may want to execute host_failure workflow using 'RESERVED_HOST'
|
||||
recovery method.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Masakari engine should execute the workflows synchronously only. Masakari
|
||||
engine will load all the drivers. Whoever is going to implement the new driver,
|
||||
it should be the responsibility of that driver to get the result of workflow
|
||||
and send it back to the masakari engine. If someone wants to add a driver
|
||||
which will execute the workflow on a different host and not on the same host
|
||||
where masakari engine is running then they will need to design that driver
|
||||
in such a way that workflow will execute on any host in asynchronous way and
|
||||
send back the result to the masakari engine, so that masakari engine will
|
||||
set the notification status to "ERROR" or "FINISHED" based on the results.
|
||||
|
||||
To implement "reserved_host" recovery method, we need to implement lock
|
||||
mechanism over reserved host so that masakari-engine don't use same reserved
|
||||
host for multiple failure notifications. There are two ways to implement the
|
||||
lock mechanism:
|
||||
|
||||
1. Use oslo_concurrency.lockutils file based lock:
|
||||
Easy to implement, but cannot manage lock among multiple nodes.
|
||||
|
||||
2. Implement lock mechanism using Tooz:
|
||||
Operator would want to deploy multiple masakari-engine services on
|
||||
different nodes. For this purpose we recommend to use distributed locking
|
||||
mechanism provided by Tooz library. By default Tooz would be configured to
|
||||
use file locks, so everything will work as oslo_concurrency lock mechanism.
|
||||
If operator would want to run multiple masakari-engine services he/she
|
||||
would need to configure Tooz backend service and set it in masakari.conf.
|
||||
Currently most reliable Tooz backends are ZooKeeper and Redis.
|
||||
|
||||
As of now, in masakari file based lock (oslo_concurrency.lockutils) is already
|
||||
used. Same mechanism will be used to acquire lock on reserved host. Tooz
|
||||
support will be added later and all the existing locks will be migrated to
|
||||
use Tooz locking mechanism.
|
||||
|
||||
Pros:
|
||||
-----
|
||||
1. No need to change current masakari engine implementation.
|
||||
2. It's easy to implement other recovery actions such as "AUTO_PRIORITY" and
|
||||
"RH_PRIORITY" with this design.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Execute the workflows asynchronously i.e. either workflow can be executed on
|
||||
same host where masakari engine is running or on a different host altogether.
|
||||
In this case masakari engine might not get the results from the workflow
|
||||
execution and will not able to update notification status and set reserved
|
||||
hosts to False.
|
||||
|
||||
To achieve this masakari engine will generate a callback URL based on the
|
||||
notification_id and pass it to the driver. Sample callback URL will be like,
|
||||
|
||||
http://<host>:<port>/v1/notification/<notification_id>
|
||||
|
||||
Driver will further pass this URL and the required information such as
|
||||
reserved host list, host name etc. to the workflow. Workflow will be
|
||||
responsible to call the masakari using callback URL with notification status
|
||||
and reserved hosts used by workflow as a body of the request.
|
||||
|
||||
Once this request is received by masakari, then based on the notification_id
|
||||
it will map it to the notification from the database table and update the
|
||||
status of the notification accordingly. Also masakari will get the list of
|
||||
used reserved hosts in request body, so it will loop through it and set those
|
||||
host's "reserved" property as False.
|
||||
|
||||
Rest API can be::
|
||||
|
||||
method: PUT
|
||||
URL: URL that is passed to the workflow(contains notification_id)
|
||||
Body:
|
||||
result {
|
||||
notification_status: status of the notification, either 'error' or
|
||||
'finished', used_reserved_hosts: list of actually used reserved_hosts
|
||||
by workflow
|
||||
}
|
||||
|
||||
Pros:
|
||||
-----
|
||||
|
||||
1. Better approach to implement new workflow drivers.
|
||||
|
||||
Cons:
|
||||
-----
|
||||
1. The other service which is going to request masakari using REST api should
|
||||
have required admin credentials to call the API.
|
||||
2. Need to change current driver (taskflow) implementation to adopt this
|
||||
design.
|
||||
3. Need to modify PUT api to incorporate this change.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Dinesh Bhor <dinesh.bhor@nttdata.com>
|
||||
Abhishek Kekane <abhishek.kekane@nttdata.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement RESERVED_HOST recovery_method for host_failure recovery in
|
||||
synchronous way for taskflow driver
|
||||
* Add unit tests for the coverage
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
None
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
http://eavesdrop.openstack.org/meetings/masakari/2016/masakari.2016-12-13-04.02.log.html
|
||||
http://eavesdrop.openstack.org/meetings/masakari/2017/masakari.2017-02-07-04.01.log.html
|
Loading…
Reference in New Issue