Healthchecks: a new hope
Following the "Healthcheck cleanup" spec, this one proposes a new way of implementing the healthchecks, without using the podman native things. Change-Id: Iad3f1c82e6faef7066514452d568459fbe905032
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================
Service healthchecks: a new hope
================================

https://blueprints.launchpad.net/tripleo/+spec/clean-container-healthchecks

Since the `healthcheck cleanup`_ spec was accepted, there have been some
discussions with the people in charge of Monitoring.

We are now able to propose a new way to run healthchecks, without involving
any "podman exec", while hopefully providing a better state overview.


Problem Description
===================

Since the mentioned `healthcheck cleanup`_ spec, services aren't monitored
anymore. While this might be fine under some circumstances, it may raise
concerns in production environments, where Operators want to get a proper
overview of their infrastructure.

Proposed Change
===============

The proposal relies on a collaborative effort toward actual, proper
healthchecks.

While it will require some work from any Team wanting to get their service
under check, it will ensure the services are really monitored.

There are multiple steps in order to get to this point, as described below.

Provide a /status endpoint
--------------------------
A first step is to get a dedicated endpoint in the different APIs, such as
heat, nova, neutron and so on.

Such an endpoint will run several small tests and return the status.
For instance, it will ensure the service is actually able to communicate with
the Database, the Message Queue, or even with other API endpoints.

Then, a "report" will be sent out as the body content.

This can leverage the existing healthcheck implementation provided by
`oslo.middleware`_. It has the advantage of supporting both HTML and JSON
output, which makes the overall status easier to understand for the
Operators, while keeping it easy to parse on the computer side.

Note that the endpoint name can be set to whatever we want. /status and
/healthcheck are probably the best suited names, but if those are already used
within the app, it can of course be something else.

Also note this effort must be coordinated, in order to ensure all the
endpoints provide the same kind of data, behaviour and status code. This is
also why using the `oslo.middleware`_ helper would be nice, since it provides
a generic, common way to output the status. It also allows configuring the
endpoint name.

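To make the idea of a status endpoint more concrete, here is a minimal
sketch of a WSGI application that runs a few probes and returns a JSON
report. It relies only on the Python standard library and is not the
`oslo.middleware`_ implementation. The probe functions, the backend names
and the report layout (``healthy`` and ``checks`` keys) are assumptions made
for this illustration only.

```python
import json


def check_database():
    # Hypothetical probe: a real service would run a cheap query here.
    return True


def check_message_queue():
    # Hypothetical probe: a real service would check its broker connection.
    return True


# Backend name -> probe callable. The names are illustrative only.
CHECKS = {
    "database": check_database,
    "message_queue": check_message_queue,
}


def run_checks(checks):
    """Run every probe; a probe that raises is reported as failed."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def status_app(environ, start_response, checks=CHECKS):
    """WSGI application answering on /status with a JSON report."""
    if environ.get("PATH_INFO") != "/status":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    results = run_checks(checks)
    healthy = all(results.values())
    body = json.dumps({"healthy": healthy, "checks": results}).encode("utf-8")
    # 200 when every probe passed, 503 as soon as one of them failed.
    status = "200 OK" if healthy else "503 Service Unavailable"
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

Returning a non-200 status code on failure lets a caller detect a problem
without parsing the body, while the body still tells which backend failed.
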
Get a healthcheck script and systemd units
------------------------------------------
Once we get the endpoint, a script will be needed in order to actually call it,
forward the data to some logging system and, if configured, take action on the
service.

A `health.py PoC`_ is already provided. While it still needs actual testing,
unit tests and so on, it already provides a basic handler for the healthcheck.

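As an illustration of what such a script has to do, the following sketch
calls an endpoint, logs the outcome and decides whether an action is needed.
It is not the code of the `health.py PoC`_. It assumes the endpoint returns a
JSON body of the form ``{"checks": {"<backend>": true|false}}``, which is a
hypothetical layout, and every function name below is made up for the
example.

```python
import json
import logging
import urllib.error
import urllib.request

LOG = logging.getLogger("healthcheck")


def fetch_report(url, timeout=5.0):
    """Call the endpoint; return (HTTP status code, parsed JSON dict)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code, raw = resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # Non-2xx answers (a 503, for instance) may still carry a report.
        code, raw = exc.code, exc.read()
    except OSError:
        # Connection refused, timeout, DNS failure: nothing to parse.
        return 0, {}
    try:
        report = json.loads(raw.decode("utf-8"))
    except ValueError:
        report = {}
    if not isinstance(report, dict):
        report = {}
    return code, report


def failed_critical(code, report, critical):
    """Return the sorted list of critical backends considered failed."""
    checks = report.get("checks")
    if not isinstance(checks, dict):
        # No usable report: only the status code can be trusted.
        return [] if code == 200 else sorted(critical)
    return sorted(name for name in critical if checks.get(name) is False)


def handle(service, url, critical, action):
    """Probe one service, log a JSON line, run `action` on critical failure."""
    code, report = fetch_report(url)
    failed = failed_critical(code, report, critical)
    LOG.info(json.dumps({"service": service, "code": code, "failed": failed}))
    if failed:
        action(service)
    return failed
```

Keeping the decision logic in ``failed_critical``, separate from the network
call, makes it straightforward to unit test.
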
An associated configuration file must be generated at deploy/update/upgrade
time in order to ensure all the data are provided.

It will most probably require the following information:

- hostname or IP
- service port
- endpoint name (such as /status, /healthcheck)
- whether it uses TLS or not
- what action to take in case of failure
- what backend classes are critical enough to trigger an action

This is just an example, based on the PoC. Of course, implementation details
are still open, and will depend on the actual data provided by the endpoint.

As per the PoC, the configuration is common to all the endpoints: it's a
simple INI file, with one section per service.

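The list above maps naturally onto an INI file with one section per service.
The sketch below parses such a file with the standard ``configparser``
module. The option names (``host``, ``port``, ``endpoint``, ``tls``,
``action``, ``critical``) and the sample values are assumptions for the
example, not the format used by the PoC.

```python
import configparser

# Hypothetical configuration: option names and values are illustrative.
SAMPLE = """
[nova]
host = 192.0.2.10
port = 8774
endpoint = /healthcheck
tls = true
action = restart
critical = database, message_queue

[heat]
host = heat.example.org
port = 8004
endpoint = /status
tls = false
action = none
critical = database
"""


def load_services(text):
    """Parse the INI text into one dict per service section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    services = {}
    for name in parser.sections():
        section = parser[name]
        scheme = "https" if section.getboolean("tls", fallback=False) else "http"
        critical = section.get("critical", fallback="")
        services[name] = {
            "url": "{}://{}:{}{}".format(
                scheme, section["host"], section.getint("port"),
                section["endpoint"]),
            "action": section.get("action", fallback="none"),
            # Comma-separated list, empty items dropped.
            "critical": [c.strip() for c in critical.split(",") if c.strip()],
        }
    return services
```

With the sample above, ``load_services(SAMPLE)["nova"]["url"]`` evaluates to
``https://192.0.2.10:8774/healthcheck``.
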
Systemd
.......
This script will then be called from different systemd units, and systemd
timers will be configured in order to trigger those units. Usually, one
systemd unit will be needed for each endpoint, so that we get the possibility
to configure different intervals.

The script should run as root, in order to be able to take the proper
actions (such as calling dbus in order to restart the service unit).


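One way to picture the "one unit and one timer per endpoint" layout is a
small generator producing both files from a service name and an interval.
This is only a sketch: the unit names, the script path
``/usr/local/bin/health.py`` and the way the service name is passed to the
script are hypothetical. The systemd directives themselves (``Type=oneshot``,
``OnBootSec=``, ``OnUnitActiveSec=``, ``AccuracySec=``) are standard ones.

```python
SERVICE_TEMPLATE = """\
[Unit]
Description=Healthcheck for {name}

[Service]
Type=oneshot
User=root
ExecStart={script} {name}
"""

TIMER_TEMPLATE = """\
[Unit]
Description=Run the {name} healthcheck every {interval} seconds

[Timer]
OnBootSec={interval}s
OnUnitActiveSec={interval}s
AccuracySec=1s

[Install]
WantedBy=timers.target
"""


def render_units(name, interval, script="/usr/local/bin/health.py"):
    """Return {file name: content} for the service/timer pair of `name`.

    `interval` is expressed in seconds. The timer shares its base name with
    the service, so systemd activates the matching .service by default.
    """
    unit = "healthcheck-{}".format(name)
    return {
        unit + ".service": SERVICE_TEMPLATE.format(name=name, script=script),
        unit + ".timer": TIMER_TEMPLATE.format(name=name, interval=interval),
    }
```

Calling ``render_units("nova", 30)`` and ``render_units("heat", 120)`` would
give each endpoint its own interval, which is the point of having one timer
per endpoint.
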
Alternatives
============
We won't discuss the "keep current healthchecks" alternative, since it was
already described in the `healthcheck cleanup`_ spec.

TODO (at this time, I don't see what could be done as a viable alternative)


Security Impact
===============

While the exposure of the healthcheck endpoint might be a security issue,
a proper ACL on that URI can be set in the web server directly, lowering the
actual risk of seeing this endpoint used for bad purposes.

Regarding the script and related systemd units/timers, there are no actual
security issues, since the service won't load external resources, and will
only access known sources with read-only rights.

Upgrade Impact
==============

No Upgrade impact.

Other End User Impact
=====================
The End User will be able to get an actual overview of their system, with real
healthcheck status being reported.

They will also be able to benefit from the log stream in order to gather
information within some external service, such as elasticsearch, prometheus
and so on (provided they have the infrastructure and the proper exporters).


Performance Impact
==================
The healthcheck endpoint WILL have an impact on the infrastructure. Since it
will be developed for the service it's intended to check, special care must
be taken in order to lessen the actual impact.

Getting a DB connection has a "cost", so reusable connections will probably
be preferred, for instance.

The interval between calls will also be set in order to prevent a service
overload.

Also, please note we will be able to monitor actual backend services such as
Database or Message Queue without calling them directly, since they will be
checked from within the API endpoints.

Other Deployer Impact
=====================
Deployment can take advantage of the healthchecks, ensuring a service is
properly running before going further in the deploy steps. This is already
something used in some services, and can be reused with this new
implementation.

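Gating a deploy step on a healthcheck usually comes down to polling until
the service reports healthy or a timeout expires. Here is a minimal sketch of
such a loop. The ``probe`` callable, the default timeout and the default
interval are assumptions, and nothing here reflects how the existing deploy
steps are implemented.

```python
import time


def wait_until_healthy(probe, timeout=300.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns a true value or `timeout` expires.

    `probe` is any callable returning True when the service is healthy,
    for instance a wrapper around an HTTP call to the status endpoint.
    `clock` and `sleep` are injectable to ease testing.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        # Give up if the next attempt would start after the deadline.
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

A deploy step could then stop with an explicit error when the function
returns ``False``, instead of moving on with a service that never came up.
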
Developer Impact
================
Developers will have to contribute to the effort, since the new endpoint is
needed within the service application itself.

New APIs will also need to follow this practice.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

- cjeanner
- mmagr
- mrunge

Work Items
----------


Dependencies
============


Testing
=======


Documentation Impact
====================
A documentation update will be issued in order to explain how to run the
healthchecks, how to interpret the result (in a generic way), and what can be
configured (failure action, for instance).

References
==========


.. _`healthcheck cleanup`: ./healthcheck-cleanup.html
.. _`oslo.middleware`: https://opendev.org/openstack/oslo.middleware/src/branch/master/oslo_middleware/healthcheck/__init__.py
.. _`health.py PoC`: https://github.com/cjeanner/triploe-healthcheck-poc