host monitor by consul
This spec introduces a new host monitor by consul. It monitors host heartbeat via management, tenant and storage connectivity. Only in the case of defined HA strategy, it will trigger host recovery, for example, storage interface disconnected. Change-Id: If81a6a9543513acf199afe6a17dceb1544657272 Implements: bp host-monitor-by-consul
This commit is contained in:
parent
9631ab25d3
commit
7b6f56dd81
BIN
doc/source/images/host-monitor-by-consul.png
Normal file
BIN
doc/source/images/host-monitor-by-consul.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 41 KiB |
187
specs/xena/approved/host-monitor-by-consul.rst
Normal file
187
specs/xena/approved/host-monitor-by-consul.rst
Normal file
@ -0,0 +1,187 @@
|
|||||||
|
..
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||||
|
License.
|
||||||
|
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||||
|
|
||||||
|
=======================
|
||||||
|
host monitor by consul
|
||||||
|
=======================
|
||||||
|
|
||||||
|
https://blueprints.launchpad.net/masakari-monitors/+spec/host-monitor-by-consul
|
||||||
|
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
Usually, there are management network, tenant network and storage network
|
||||||
|
in one cloud platform. The compute nodes may have management, tenant and
|
||||||
|
storage interface to connect to these three networks.
|
||||||
|
|
||||||
|
Currently, Masakari hostmonitor uses pacemaker and pacemaker-remote to
|
||||||
|
monitor hosts' connection. Actually, it is monitoring the hosts heartbeat
|
||||||
|
through management interface. Once a host's management connectivity is
|
||||||
|
detected down, it will send notification to masakari to trigger host
|
||||||
|
failure recovery workflow.
|
||||||
|
|
||||||
|
This solution has some flaws especially when ``management`` connectivity
|
||||||
|
is down and the other two connectivity ``tenant`` and ``storage`` are up.
|
||||||
|
Users can still access their VMs without any interruptions, so there is no
|
||||||
|
need to send host failure notification in this case.
|
||||||
|
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
This spec introduces a new host monitor. Specifically, host connectivity
|
||||||
|
monitoring via management, tenant and storage interfaces by consul agent.
|
||||||
|
|
||||||
|
The low-level architecture for host monitoring is shown as below:
|
||||||
|
|
||||||
|
.. image:: /images/host-monitor-by-consul.png
|
||||||
|
|
||||||
|
Each host runs three consul agents, which respectively bind management, tenant
|
||||||
|
and storage interfaces. They make up three independent consul cluster.
|
||||||
|
|
||||||
|
Consul agent runs in server mode on controller nodes, while in client mode on
|
||||||
|
compute nodes.
|
||||||
|
|
||||||
|
For example, consul cluster via management connectivity. All agents bind
|
||||||
|
management interface, and are responsible for running checks and keeping
|
||||||
|
services in sync.
|
||||||
|
|
||||||
|
Consul is built on top of Serf which provides a full gossip protocol that is
|
||||||
|
used for multiple purposes. Serf provides membership, failure detection, and
|
||||||
|
event broadcast. Consul uses gossip protocol to manage membership. If an agent
|
||||||
|
is found disconnected, it will broadcast messages to the cluster quickly.
|
||||||
|
|
||||||
|
Host-monitor periodically retrieves all consul members heath data from local
|
||||||
|
consul agents. It picks out every nodes' management, tenant and storage health
|
||||||
|
state separately, and combines them together. Then it will send notification
|
||||||
|
depending on the HA strategy - host states and the corresponding actions.
|
||||||
|
There will be a config file for user to decide the HA strategy and the
|
||||||
|
default HA strategy is as follows:
|
||||||
|
|
||||||
|
+-----------------+-----------------+-----------------+------------------+
|
||||||
|
| management | tenant | storage | actions |
|
||||||
|
+-----------------+-----------------+-----------------+------------------+
|
||||||
|
| up | up | down | recovery |
|
||||||
|
+-----------------+-----------------+-----------------+------------------+
|
||||||
|
| up | down | down | recovery |
|
||||||
|
+-----------------+-----------------+-----------------+------------------+
|
||||||
|
| down | up | down | recovery |
|
||||||
|
+-----------------+-----------------+-----------------+------------------+
|
||||||
|
| down | down | down | recovery |
|
||||||
|
+-----------------+-----------------+-----------------+------------------+
|
||||||
|
|
||||||
|
* 'up' represents connectivity up.
|
||||||
|
* 'down' represents connectivity down.
|
||||||
|
* 'recovery' represents host recovery.
|
||||||
|
|
||||||
|
User can define the HA strategy according to the physical environment.
|
||||||
|
For example, if there is only one consul cluster of management, the HA
|
||||||
|
strategy would be the same as the existing solution based on pacemaker.
|
||||||
|
|
||||||
|
+-----------------+------------------+
|
||||||
|
| management | actions |
|
||||||
|
+-----------------+------------------+
|
||||||
|
| down | recovery |
|
||||||
|
+-----------------+------------------+
|
||||||
|
|
||||||
|
|
||||||
|
Alternatives
|
||||||
|
------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Data model impact
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
REST API impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Security impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Notifications impact
|
||||||
|
--------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Other end user impact
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Performance Impact
|
||||||
|
------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Other deployer impact
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Developer impact
|
||||||
|
----------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
Primary assignee:
|
||||||
|
|
||||||
|
* suzhengwei <sugar-2008@163.com>
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
- Masakari host-monitor driver based on consul
|
||||||
|
|
||||||
|
- masakari documentation updates
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
- Requires that consul agents are installed and running to monitor
|
||||||
|
the hosts management, tenant and storage connectivity.
|
||||||
|
|
||||||
|
Testing
|
||||||
|
=======
|
||||||
|
|
||||||
|
Unit tests and functional tests will be needed.
|
||||||
|
|
||||||
|
Documentation Impact
|
||||||
|
====================
|
||||||
|
|
||||||
|
The admin configuration documentation need to be updated.
|
||||||
|
|
||||||
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
|
https://www.consul.io/
|
||||||
|
|
||||||
|
|
||||||
|
History
|
||||||
|
=======
|
||||||
|
|
||||||
|
.. list-table:: Revisions
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Release Name
|
||||||
|
- Description
|
||||||
|
* - Victoria
|
||||||
|
- Introduced
|
||||||
|
* - Xena
|
||||||
|
- Re-proposed
|
Loading…
Reference in New Issue
Block a user