Healthchecks: a new hope

Following the "Healthcheck cleanup" spec, this one proposes a new way of
implementing the healthchecks, without using the podman native things.

Change-Id: Iad3f1c82e6faef7066514452d568459fbe905032
Cédric Jeanneret 2021-08-02 10:28:29 +02:00
parent 00b9f10ce5
commit c70c924924
1 changed files with 202 additions and 0 deletions


@ -0,0 +1,202 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
================================
Service healthchecks: a new hope
================================
https://blueprints.launchpad.net/tripleo/+spec/clean-container-healthchecks
Since the `healthcheck cleanup`_ spec was accepted, there have been some
discussions with the people in charge of monitoring.
We are now able to propose a new way to run healthchecks, without involving
any "podman exec", while hopefully providing a better overview of the state of
the services.
Problem Description
===================
Since the mentioned `healthcheck cleanup`_ spec, services aren't monitored
anymore. While this might be fine under some circumstances, it may raise
concerns in production environments, where Operators want to get a proper
overview of their infrastructure.
Proposed Change
===============
The proposal relies on a collaborative effort toward actual, proper
healthchecks.
While it will require some work from any Team wanting to get their service
under check, it will ensure the services are really monitored.
There are multiple steps in order to get to this point, as described below.
Provide a /status endpoint
--------------------------
A first step is to get a dedicated endpoint in the different APIs, such as
heat, nova, neutron and so on.
Such an endpoint will run several small tests and return the resulting status.
For instance, it will ensure the service is actually able to communicate with
the Database, the Message Queue, or even with other API endpoints.
A "report" will then be sent out as the response body.
This can leverage the existing healthcheck implementation provided
by `oslo.middleware`_ - it has the advantage of supporting both HTML and JSON
output, which makes the overall status easier to understand for the
Operators, while keeping it easy to parse and process on the computer side.
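As an illustration of the kind of endpoint described above, here is a minimal
sketch written against plain WSGI and the Python standard library only. It does
not use the `oslo.middleware`_ API; the check functions, the ``CHECKS``
registry, the report format and the 200/503 status codes are all assumptions
made for the example.

```python
import json


def check_database():
    # Placeholder: a real check would run a cheap query such as "SELECT 1".
    return True


def check_message_queue():
    # Placeholder: a real check would verify the broker connection.
    return True


# Hypothetical registry mapping a backend name to its check function.
CHECKS = {"database": check_database, "message_queue": check_message_queue}


def run_checks(checks):
    """Run every check; an exception is reported as a failure."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "OK" if check() else "FAILED"
        except Exception as exc:
            report[name] = "FAILED: %s" % exc
    return report


def status_app(environ, start_response, checks=CHECKS):
    """Minimal WSGI application returning the report as a JSON body."""
    report = run_checks(checks)
    healthy = all(state == "OK" for state in report.values())
    status = "200 OK" if healthy else "503 Service Unavailable"
    body = json.dumps(report).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

Returning a non-2xx code when any check fails lets a caller detect a problem
from the status line alone, while the body still tells which backend failed.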
Note that the endpoint name can be set to whatever we want. /status and
/healthcheck are probably the best suited names, but if those are already used
within the app, it can of course be something else.
Also note this effort must be coordinated, in order to ensure all the endpoints
provide the same kind of data, behaviour and status codes. This is also
why using the `oslo.middleware`_ helper would be nice, since it provides a
generic, common way to output the status. It also allows configuring the
endpoint name.
Get a healthcheck script and systemd units
------------------------------------------
Once we get the endpoint, a script will be needed in order to actually call it,
forward the data to some logging system and, if configured, take action on the
service.
A `health.py PoC`_ is already provided - while it still needs actual testing,
unit tests and so on, it provides a basic handler for the healthcheck.
An associated configuration file must be generated at deploy, update and
upgrade time in order to ensure all the data are provided.
It will most probably require the following information:
- hostname or IP
- service port
- endpoint name (such as /status, /healthcheck)
- whether it uses TLS or not
- what action to take in case of failure
- what backend classes are critical enough to trigger an action
This is just an example, based on the PoC. Of course, implementation details
are still open, and will depend on the actual data provided by the endpoint.
As per the PoC, the configuration is common for all the endpoints, since it's
a simple INI file, with one section per service.
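To make the "one section per service" layout more concrete, here is a sketch
of such an INI file and of how it could be read with the standard
``configparser`` module. The option names (``host``, ``port``, ``endpoint``,
``tls``, ``failure_action``, ``critical_backends``), the section names and the
addresses are hypothetical and are not taken from the PoC.

```python
import configparser

# Illustrative content only; the option names are assumptions.
SAMPLE = """
[nova_api]
host = 192.0.2.10
port = 8774
endpoint = /healthcheck
tls = true
failure_action = restart
critical_backends = database,message_queue

[heat_api]
host = 192.0.2.10
port = 8004
endpoint = /status
tls = false
failure_action = none
critical_backends = database
"""


def load_services(text):
    """Parse the INI text into one settings dict per service section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    services = {}
    for name in parser.sections():
        section = parser[name]
        scheme = "https" if section.getboolean("tls", fallback=False) else "http"
        backends = section.get("critical_backends", fallback="")
        services[name] = {
            "url": "%s://%s:%d%s" % (scheme, section["host"],
                                     section.getint("port"),
                                     section["endpoint"]),
            "failure_action": section.get("failure_action", fallback="none"),
            "critical_backends": {item.strip()
                                  for item in backends.split(",")
                                  if item.strip()},
        }
    return services
```

For example, ``load_services(SAMPLE)["nova_api"]["url"]`` evaluates to
``https://192.0.2.10:8774/healthcheck``.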
Systemd
.......
This script will then be called from different systemd units, and systemd
timers will be configured in order to trigger the units. Usually, one systemd
unit will be needed for each endpoint, so that we get the possibility to
configure different intervals.
The script should run as root, in order to be able to take the proper
actions (such as calling dbus in order to restart the service unit).
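As a rough idea of what the per-endpoint units could look like, the sketch
below renders a oneshot service and its matching timer from templates. The
unit naming scheme, the script path and the ``--service`` option are
hypothetical. A timer activates the service unit of the same name by default,
and a system unit runs as root unless ``User=`` is set.

```python
# Hypothetical templates; names, paths and options are assumptions.
SERVICE_TEMPLATE = """\
[Unit]
Description=Healthcheck for {name}

[Service]
Type=oneshot
ExecStart={script} --service {name}
"""

TIMER_TEMPLATE = """\
[Unit]
Description=Run the {name} healthcheck every {interval}

[Timer]
OnBootSec={interval}
OnUnitActiveSec={interval}

[Install]
WantedBy=timers.target
"""


def render_units(name, interval, script="/usr/local/bin/health.py"):
    """Return a {file name: content} dict for one endpoint."""
    unit = "healthcheck-%s" % name
    return {
        unit + ".service": SERVICE_TEMPLATE.format(name=name, script=script),
        unit + ".timer": TIMER_TEMPLATE.format(name=name, interval=interval),
    }
```

Calling ``render_units("nova_api", "30s")`` and ``render_units("heat_api",
"2min")`` would give each endpoint its own interval, as suggested above.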
Alternatives
============
We won't discuss the "keep current healthchecks" alternative, since this one
was already described in the `healthcheck cleanup`_ spec.
TODO (at that time, I don't see what could be done as a viable alternative)
Security Impact
===============
While exposing the healthcheck endpoint might be a security issue,
proper ACLs on that URI can be set in the web server directly, lowering the
actual risk of seeing this endpoint used for bad reasons.
Regarding the script and the related systemd units/timers, there aren't any
actual security issues, since the service won't load external resources, and
will only access known sources with read-only rights.
Upgrade Impact
==============
No Upgrade impact.
Other End User Impact
=====================
The End User will be able to get an actual overview of their system, with real
healthcheck status being reported.
They will also be able to benefit from the log stream in order to gather
information within some external service, such as elasticsearch, prometheus
and so on (provided they have the infrastructure and the proper exporters).
Performance Impact
==================
The healthcheck endpoint WILL have an impact on the infrastructure. Since
it will be developed for the service it's intended to check, special care
must be taken in order to lessen the actual impact.
Getting a DB connection has a "cost" - so reusable connections will probably
be preferred, for instance.
The call interval will also be set so as to prevent overloading the service.
Also, please note we will be able to monitor actual backend services such as
Database or Message Queue without actually calling them directly, since they
will be checked from within the API endpoints directly.
Other Deployer Impact
=====================
Deployment can take advantage of the healthchecks, ensuring a service is
properly running before going further in the deploy steps. This is already
something used in some services, and can be reused with this new
implementation.
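A deploy step waiting for a service to become healthy could be sketched as a
simple retry loop. The function name, its defaults and the choice of raising
``TimeoutError`` are assumptions; ``check`` stands for any callable returning
``True`` once the healthcheck passes.

```python
import time


def wait_until_healthy(check, attempts=10, delay=5.0, sleep=time.sleep):
    """Poll ``check`` until it succeeds; return the attempt number used."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            # Wait before retrying, but not after the final attempt.
            sleep(delay)
    raise TimeoutError("service still unhealthy after %d attempts" % attempts)
```

Bounding the number of attempts ensures a broken service fails the deployment
instead of blocking it forever.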
Developer Impact
================
Developers will have to contribute to the effort, since the new endpoint is
needed within the service application itself.
New APIs will also need to follow this practice.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
- cjeanner
- mmagr
- mrunge
Work Items
----------
Dependencies
============
Testing
=======
Documentation Impact
====================
A documentation update will be issued in order to explain how to run the
healthchecks, how to interpret the result (in a generic way), and what can be
configured (failure action, for instance).
References
==========
.. _`healthcheck cleanup`: ./healthcheck-cleanup.html
.. _`oslo.middleware`: https://opendev.org/openstack/oslo.middleware/src/branch/master/oslo_middleware/healthcheck/__init__.py
.. _`health.py PoC`: https://github.com/cjeanner/triploe-healthcheck-poc