Healthchecks: a new hope

Following the "Healthcheck cleanup" spec, this one proposes a new way of implementing the healthchecks, without using the podman native things. Change-Id: Iad3f1c82e6faef7066514452d568459fbe905032
2021-08-02 10:28:29 +02:00
parent 00b9f10ce5
commit c70c924924
1 changed files with 202 additions and 0 deletions
--- a/specs/xena/healthcheck-new-way.rst
+++ b/specs/xena/healthcheck-new-way.rst
@@ -0,0 +1,202 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+================================
+Service healthchecks: a new hope
+================================
+
+https://blueprints.launchpad.net/tripleo/+spec/clean-container-healthchecks
+
+Since the `healthcheck cleanup`_ spec was accepted, there have been some
+discussions with the people in charge of Monitoring.
+
+We are now able to propose a new way to run healthchecks, without involving
+any "podman exec", while hopefully providing a better state overview.
+
+
+Problem Description
+===================
+
+Since the mentioned `healthcheck cleanup`_ spec, services aren't monitored
+anymore. While this might be fine under some circumstances, it may raise
+concerns in production environments, where Operators want to get a proper
+overview of their infrastructure.
+
+Proposed Change
+===============
+
+The proposal will rely on a collaborative effort toward actual, proper
+healthchecks.
+
+While it will require some work from any Team wanting to get their service
+under check, it will ensure the services are actually monitored.
+
+There are multiple steps in order to get to this point, as described below.
+
+Provide a /status endpoint
+--------------------------
+A first step is to get a dedicated endpoint in the different APIs, such as
+heat, nova, neutron and so on.
+
+Such an endpoint will run different small tests, and return the status.
+For instance, it will ensure the service is actually able to communicate with
+the Database, Message Queue, or even with other API endpoints.
+
+Then, a "report" will be sent out as the response body.
+
+Such a thing can leverage the existing healthcheck implementation as provided
+by `oslo.middleware`_ - it has the advantage of supporting both HTML and JSON
+output, which makes the overall status easier to understand for the
+Operators, while making it easy to parse and process programmatically.
+
+Note that the endpoint name can be set to whatever we want. /status and
+/healthcheck are probably the best suited names, but if those are already used
+within the app, it can of course be something else.
+
+Also note this effort must be coordinated, in order to ensure all the
+endpoints provide the same kind of data, behaviour and status code. This is
+also why using the `oslo.middleware`_ helper would be nice, since it provides
+a generic, common way to output the status. It also allows configuring the
+endpoint name.
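+
+To make the expected behaviour more concrete, here is a minimal, stdlib-only
+sketch of such an endpoint. It is NOT the `oslo.middleware`_ API: the
+``make_status_app`` and ``call_app`` helpers, the check names, the JSON layout
+and the use of 503 for a failed check are all assumptions made for
+illustration only.
+

```python
import json


def make_status_app(checks):
    """Build a WSGI app reporting the outcome of each named check.

    `checks` maps a hypothetical backend name (e.g. "database") to a
    callable that returns on success and raises on failure.
    """
    def app(environ, start_response):
        results = {}
        for name, check in checks.items():
            try:
                check()
                results[name] = {"healthy": True, "reason": "OK"}
            except Exception as exc:
                results[name] = {"healthy": False, "reason": str(exc)}
        healthy = all(r["healthy"] for r in results.values())
        # Assumption: 200 when every check passes, 503 otherwise.
        status = "200 OK" if healthy else "503 Service Unavailable"
        body = json.dumps({"healthy": healthy, "checks": results})
        body = body.encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(body)))])
        return [body]
    return app


def call_app(app, path="/status"):
    """Call a WSGI app in-process; return (status_line, parsed_json)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

+
+A real service would plug its own Database or Message Queue probes in place
+of the callables, and would most likely reuse the shared middleware rather
+than a hand-written application like this one.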
+
+Get a healthcheck script and systemd units
+------------------------------------------
+Once we get the endpoint, a script will be needed in order to actually call it,
+forward the data to some logging system and, if configured, take action on the
+service.
+
+A `health.py PoC`_ is already provided - while it needs actual testing, unit
+tests and so on, it provides a basic handler for the healthcheck already.
+
+An associated configuration file must be generated at deploy, update and
+upgrade time in order to ensure all the data are provided.
+
+It will most likely require the following information:
+
+- hostname or IP
+- service port
+- endpoint name (such as /status, /healthcheck)
+- whether it uses TLS or not
+- what action to take in case of failure
+- what backend classes are critical enough to trigger an action
+
+This is just an example, based on the PoC. Of course, implementation details
+are still open, and will depend on the actual data provided by the endpoint.
+
+As per the PoC, the configuration is common for all the endpoints, since it's
+a simple INI file, with one section per service.
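+
+As an illustration of the "one section per service" layout, the sketch below
+parses such a file with the standard ``configparser`` module. The option
+names (``host``, ``port``, ``endpoint``, ``tls``, ``on_failure``,
+``critical_backends``) and the sample values are assumptions derived from the
+list above, not the PoC's actual format.
+

```python
import configparser

# Hypothetical sample; addresses come from the documentation range.
SAMPLE = """
[nova]
host = 192.0.2.10
port = 8774
endpoint = /healthcheck
tls = true
on_failure = restart
critical_backends = database, messaging

[heat]
host = 192.0.2.11
port = 8004
endpoint = /status
tls = false
on_failure = log
critical_backends =
"""


def load_services(text):
    """Return {service_name: settings} from the INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    services = {}
    for name in parser.sections():
        section = parser[name]
        use_tls = section.getboolean("tls", fallback=False)
        scheme = "https" if use_tls else "http"
        url = "{}://{}:{}{}".format(
            scheme,
            section["host"],
            section.getint("port"),
            section.get("endpoint", fallback="/healthcheck"),
        )
        raw_backends = section.get("critical_backends", fallback="")
        services[name] = {
            "url": url,
            "on_failure": section.get("on_failure", fallback="log"),
            # Comma-separated list; empty entries are dropped.
            "critical_backends": [
                item.strip() for item in raw_backends.split(",")
                if item.strip()
            ],
        }
    return services
```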
+
+Systemd
+.......
+This script will then be called from different systemd units, and systemd
+timers will be configured in order to trigger the units. Usually, one systemd
+unit will be needed for each endpoint, so that we get the possibility to
+configure different intervals.
+
+The script should run as root, in order to be able to perform the proper
+actions (such as calling dbus in order to restart the service unit).
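+
+A possible shape for the generated units is sketched below, as a small
+rendering helper that deployment tooling could call once per endpoint. The
+unit names, the script path and the way the service name is passed to the
+script are assumptions. Only the systemd directives themselves are standard.
+

```python
# Hypothetical templates: one oneshot service plus one timer per endpoint.
SERVICE_TEMPLATE = """\
[Unit]
Description=Healthcheck for {name}

[Service]
Type=oneshot
User=root
ExecStart={script} {name}
"""

TIMER_TEMPLATE = """\
[Unit]
Description=Run the {name} healthcheck every {interval}

[Timer]
OnBootSec={interval}
OnUnitActiveSec={interval}
Unit=healthcheck-{name}.service

[Install]
WantedBy=timers.target
"""


def render_units(name, interval, script="/usr/local/bin/health.py"):
    """Return {file_name: content} for one endpoint's service and timer."""
    return {
        "healthcheck-{}.service".format(name): SERVICE_TEMPLATE.format(
            name=name, script=script),
        "healthcheck-{}.timer".format(name): TIMER_TEMPLATE.format(
            name=name, interval=interval),
    }
```

+
+With one timer per endpoint, a busy API could for instance be checked every
+``30s`` while a quieter one is checked every ``2min``.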
+
+
+Alternatives
+============
+We won't discuss the "keep current healthchecks" alternative, since this one
+was already described in the `healthcheck cleanup`_ spec.
+
+TODO (at that time, I don't see what could be done as a viable alternative)
+
+
+Security Impact
+===============
+
+While the exposure of the healthcheck endpoint might be a security issue,
+proper ACL on that URI can be set in the web server directly, lowering the
+risk of this endpoint being abused.
+
+Regarding the script and related systemd unit/timers, there are no real
+security issues, since the service won't load external resources, and will
+only access known sources with read-only rights.
+
+Upgrade Impact
+==============
+
+No Upgrade impact.
+
+Other End User Impact
+=====================
+The End User will be able to get an actual overview of their system, with real
+healthcheck status being reported.
+
+They will also be able to benefit from the log stream in order to gather
+information within some external service, such as elasticsearch, prometheus
+and so on (provided they have the infrastructure and the proper exporters).
+
+
+Performance Impact
+==================
+The healthcheck endpoint WILL have an impact on the infrastructure, but since
+it will be developed for the service it's intended to check, special care
+must be taken in order to lessen the actual impact.
+
+Getting a DB connection has a "cost" - so reusable connections will probably
+be preferred, for instance.
+
+The interval of the call will also be set in order to prevent a service
+overload.
+
+Also, please note we will be able to monitor actual backend services such as
+Database or Message Queue without calling them directly, since they will be
+checked from within the API endpoints.
+
+Other Deployer Impact
+=====================
+Deployment can take advantage of the healthchecks, ensuring a service is
+properly running before going further in the deploy steps. This is already
+something used in some services, and can be reused with this new
+implementation.
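+
+One way a deploy step could wait on a healthcheck is a bounded retry loop, as
+in the sketch below. The ``wait_until_healthy`` helper, its defaults and the
+``probe`` callable (anything returning True once the service reports healthy)
+are assumptions for illustration, not part of any existing tooling.
+

```python
import time


def wait_until_healthy(probe, attempts=10, delay=3.0, sleep=time.sleep):
    """Call `probe` until it returns True; return the attempt number.

    Raises TimeoutError once `attempts` probes have all failed. `sleep`
    is injectable so the loop can be tested without real waiting.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(
        "service still unhealthy after {} attempts".format(attempts))
```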
+
+Developer Impact
+================
+Developers will have to contribute to the effort, since the new endpoint is
+needed within the service application itself.
+
+New APIs will also need to follow this practice.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  - cjeanner
+  - mmagr
+  - mrunge
+
+Work Items
+----------
+
+
+Dependencies
+============
+
+
+Testing
+=======
+
+
+Documentation Impact
+====================
+A documentation update will be issued in order to explain how to run the
+healthchecks, how to interpret the result (in a generic way), and what can be
+configured (failure action, for instance).
+
+References
+==========
+
+
+.. _`healthcheck cleanup`: ./healthcheck-cleanup.html
+.. _`oslo.middleware`: https://opendev.org/openstack/oslo.middleware/src/branch/master/oslo_middleware/healthcheck/__init__.py
+.. _`health.py PoC`: https://github.com/cjeanner/triploe-healthcheck-poc