diff --git a/specs/2023.1/approved/per-process-healthchecks.rst b/specs/2023.1/approved/per-process-healthchecks.rst new file mode 100644 index 000000000..5194ad419 --- /dev/null +++ b/specs/2023.1/approved/per-process-healthchecks.rst @@ -0,0 +1,517 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +================================= +Per Process Healthcheck endpoints +================================= + +https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks + +In many modern deployment frameworks, there is an expectation that +an application can expose a health-check endpoint so that the binary +status can be monitored. Nova currently does not provide a native way +to inspect the health of its binaries which doesn't help cloud monitoring +and maintenance. While limited support exists for health checks via +Oslo middleware for our WSGI based API binaries, this blueprint seeks +to expose a local HTTP health-check endpoint to address this +feature gap consistently for all Nova components. + + +Problem description +=================== + +To monitor the health of a Nova service today requires experience to +develop and implement a series of external heuristics to infer the state +of the service binaries. + +This can be as simple as checking the service status for those with heartbeats +or can comprise monitoring log output via a watchdog and restarting +the service if no output is detected after a protracted period. +Processing the logs for known error messages and executing a remediation script +or other methods that are easy to do incorrectly are also common. + +This is also quite unfriendly to new Nova users who have not gained enough +experience with operating Nova to know what warning signs they should look +for such as inability to connect to the message bus. 
Nova developers, however, +do know what some of the important health indicators are and can expose +those as a local health-check endpoint that operators can use instead. + +The existing Oslo middleware does not address this problem statement because: + +#. It can only be used by the API and metadata binaries. + +#. The middleware does not tell you the service is alive if it's hosted by a + WSGI server like Apache, since the middleware is executed independently from + the WSGI application. For example, the middleware can pass while nova-api + cannot connect to the DB and is otherwise broken. + +#. The Oslo middleware in detailed mode leaks info about the host + kernel, Python version and hostname, which can be used to determine if the + host is vulnerable to CVEs. This means it should never be exposed to the + Internet. e.g. + +:: + + platform: 'Linux-5.15.2-xanmod1-tt-x86_64-with-glibc2.2.5', + python_version: '3.8.12 (default, Aug 30 2021, 16:42:10) \n[GCC 10.3.0]' + + + +Use Cases +--------- + +As an operator, I want a simple REST endpoint I can consume to know +if a Nova process is healthy. + +As an operator, I want this health check to not impact the performance of the +service so it can be queried frequently at short intervals. + +As a deployment tool implementer, I want the health check to be local with no +dependencies on other hosts or services to function so I can integrate it with +service managers such as systemd or a container runtime like Docker. + +As a packager, I would like the use of the health check endpoints to not +require special clients or packages to consume them. cURL, socat, or netcat +should be all that is required to connect to the health check and retrieve the +service status. + +As an operator, I would like to be able to use the health check of the Nova API +and metadata services to manage the membership of endpoints in my load-balancer +or reverse proxy automatically. 
+ +Proposed change +=============== + +Definitions +----------- +``TTL``: The time interval for which a health check item is valid. + +``pass``: all health indicators are passing and their TTLs have not expired. + +``warn``: any health indicator has an expired TTL or there is +a partial or transient failure. + +``fail``: any health indicator is reporting an error or all TTLs are expired. + +.. note:: + + In line with the recommendation in the IETF RFC API health check draft + `[1]`_, + ``pass`` and ``warn`` will respond with a 200 OK and + ``fail`` will respond with a 503 Service Unavailable. + ``Content-Type: application/health+json`` will be used in all cases. + + + +Warn vs fail +------------ + +In general, if any of the health check indicators are failing then the service +should be reported as ``fail``. However, if the specific error condition is +recoverable or only a partial failure, the ``warn`` state can and should be +used. + +An example of this is a service that has lost a connection to the message bus. +When the connection is lost it should go to the ``warn`` state; if the first +attempt to reconnect fails it should go to the ``fail`` state. Transient +failures should be considered warnings but persistent errors should be escalated +to failures. + +In many cases external management systems will treat ``warn`` and ``fail`` as +equivalent and raise an alarm or restart the service. While this spec does +not specify how you should recover from a degraded state, it is +important to include a human readable description of why the ``warn`` or +``fail`` state was entered. + +Services in the ``warn`` state are still considered healthy in most cases but +they may be about to fail soon or be partially degraded. + +.. note:: + + Where no health check items are currently registered, such as during start up, + the health check status will be considered ``pass``, not ``warn`` + or ``fail``. 
This will prevent restart loops for any service managers that + treat any value other than ``pass`` as an error state. + +Code changes +------------ +A new top-level Nova health check module will be created to encapsulate the +common code and data structure required to implement this feature. + +A new health check manager class will be introduced which will maintain the +health-check state and all functions related to retrieving, updating and +summarizing that state. + +.. note:: + + All health check state will be stored in memory and reset/lost on restart + of the binary. For services that support dynamic reconfiguration via SIGHUP + the health check data will be reset as part of this process. + +The health check manager will be responsible for creating the health check +endpoint when it is enabled in the nova.conf and exposing the health check +over HTTP. + +The initial implementation will support HTTP over TCP with optional support for +UNIX domain sockets as a more secure alternative to be added later. +The HTTP endpoint in both cases will be unauthenticated and the response will +be in JSON format. + +A new HealthcheckStatusItem data class will be introduced to store an +individual health check data-point. The HealthcheckStatusItem will contain +the name of the health check, its status, the time it was recorded, +and an optional output string that should be populated if the +status is ``warn`` or ``fail``. + +A new decorator will be introduced that will automatically retrieve the +reference to the healthcheck manager from the Nova context object and update +the result based on whether the decorated function raises an exception or not. +The exception list and healthcheck item name will be specifiable. + +The decorator will accept the name of the health check as a positional argument +and include the exception message as the output of the health check on failure. 
+Note that the decorator will only support the ``pass`` or ``fail`` status for +simplicity; where ``warn`` is appropriate a manual check should be written. +If multiple functions act as indicators of the same capability the same name +should be used. + +e.g. + +.. code-block:: python + + @healthcheck('database', [SQLAlchemyError]) + def my_db_func(self): + pass + + @healthcheck('database', [SQLAlchemyError]) + def my_other_db_func(self): + pass + +By default all exceptions will be caught and re-raised by the decorator. + +The new REST health check endpoint exposed by this spec will initially only +support one URL path ``/health``. The ``/health`` endpoint will include a +``Cache-Control: max-age=`` header as part of its response which can +optionally be consumed by the client. + +The endpoint may also implement a simple incrementing etag at a later date +once the initial implementation is complete, if required. +Initially an ``etag`` will not be provided as the response is expected to be +small and cheap to query, so etags do not actually provide much benefit from +a performance perspective. + +If implemented, the ``etag`` will be incremented whenever the service state +changes and will reset to 0 when the service is restarted. + +Additional URL paths may be supported in the future, for example to retrieve +the running configuration or trigger the generation of Guru Meditation Reports +or enable debug logging. However, any endpoint beyond ``/health`` is out of +scope of this spec. ``/`` is not used for the health check response to facilitate +additional paths in the future. 
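The manager, status item and decorator described in this section could be
sketched roughly as follows. This is an illustrative sketch only, not the
proposed implementation: the ``HealthcheckManager`` class name, its ``update``
and ``summarize`` methods, and the ``healthcheck_manager`` attribute (standing
in for the lookup through the Nova context object) are assumptions made for
the example, and ``ConnectionError`` stands in for ``SQLAlchemyError``. The
summary logic follows the ``pass``, ``warn`` and ``fail`` rules given under
Definitions.

```python
import dataclasses
import datetime
import functools


@dataclasses.dataclass
class HealthcheckStatusItem:
    """One health check data point: name, status, time and optional output."""
    name: str
    status: str  # 'pass', 'warn' or 'fail'
    time: datetime.datetime
    output: str = ''


class HealthcheckManager:
    """Hypothetical in-memory store holding the latest item per check name."""

    def __init__(self, ttl=300):
        self.ttl = ttl  # seconds; 0 means items never expire
        self._items = {}

    def update(self, name, status, output=''):
        now = datetime.datetime.now(datetime.timezone.utc)
        self._items[name] = HealthcheckStatusItem(name, status, now, output)

    def _expired(self, item, now):
        if self.ttl == 0:
            return False
        return (now - item.time).total_seconds() > self.ttl

    def summarize(self, now=None):
        """Return the overall status following the Definitions section."""
        now = now or datetime.datetime.now(datetime.timezone.utc)
        items = list(self._items.values())
        if not items:
            return 'pass'  # nothing registered yet, e.g. during start up
        expired = [self._expired(item, now) for item in items]
        if any(item.status == 'fail' for item in items) or all(expired):
            return 'fail'
        if any(item.status == 'warn' for item in items) or any(expired):
            return 'warn'
        return 'pass'


def healthcheck(name, exceptions=(Exception,)):
    """Record 'pass' or 'fail' for ``name`` based on the wrapped call."""
    exceptions = tuple(exceptions)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # Assumption: the manager hangs off ``self``; the spec retrieves
            # it from the Nova context object instead.
            manager = self.healthcheck_manager
            try:
                result = func(self, *args, **kwargs)
            except exceptions as exc:
                manager.update(name, 'fail', output=str(exc))
                raise  # the exception is recorded, then re-raised
            manager.update(name, 'pass')
            return result
        return wrapper
    return decorator


class FakeService:
    """Stand-in service used only to exercise the decorator."""

    def __init__(self):
        self.healthcheck_manager = HealthcheckManager(ttl=300)

    @healthcheck('database', [ConnectionError])
    def query(self, ok=True):
        if not ok:
            raise ConnectionError('database unreachable')
        return 'rows'
```

Keeping only the most recent item per check name means two functions decorated
with the same name overwrite each other's result, which matches the guidance
that indicators of the same capability should share a name.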
+ +Example output +~~~~~~~~~~~~~~ + +:: + + GET /health HTTP/1.1 + Host: example.org + Accept: application/health+json + + HTTP/1.1 200 OK + Content-Type: application/health+json + Cache-Control: max-age=3600 + Connection: close + + { + "status": "pass", + "version": "1.0", + "serviceId": "e3c22423-cd7a-47dc-b6e9-e18d1a8b3bdf", + "description": "nova-api", + "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"}, + "checks": { + "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"}, + "api_db": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"} + } + } + + GET /health HTTP/1.1 + Host: example.org + Accept: application/health+json + + HTTP/1.1 503 Service Unavailable + Content-Type: application/health+json + Cache-Control: no-cache + Connection: close + + { + "status": "fail", + "version": "1.0", + "serviceId": "0a47dceb-11b1-4d94-8b9c-927d998be320", + "description": "nova-compute", + "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"}, + "checks":{ + "message_bus":{"status": "pass", "time": "2021-12-17T16:02:55+00:00"}, + "hypervisor":{ + "status": "fail", "time": "2021-12-17T16:05:55+00:00", + "output": "Libvirt Error: ..." + } + } + } + + +.. note:: + + ``version`` will initially be 1.0 and can be incremented following + ``SemVer`` conventions if we extend the response format. + This is not the Nova version. + Adding new checks to the Nova code base will not increment the version of + the response but adding or removing any field in the response will. + The set of check names will be closed and each new check name that is added + will be signaled by a minor version bump. The initial set of check names for + version 1.0 is left to the implementation. + ``serviceId`` should be set to the Nova service ID for this binary. + ``description`` will contain the binary name for the service. 
+ ``notes`` will contain the ``CONF.host`` value in the host field and + optionally the hypervisor_hostname in the hostname field. + ``status`` will contain the overall status of the service with details + provided in the ``checks`` dictionary. + The keys of the ``checks`` dictionary will be the name of the health check + and the value will contain the ``status`` and ``time`` in ISO datetime + format that the status was recorded at. If the ``status`` is ``warn`` or + ``fail`` an ``output`` key will be present with a message explaining the + status. + + +Alternatives +------------ + +Instead of maintaining the state of the process in a data structure and +returning the cached state, we could implement the health check as a series of +active probes such as checking the DB schema version to ensure we can access +it or making a ping RPC call to the cell conductor or our own service's RPC +endpoint. + +While this approach has some advantages it will have a negative performance +impact if the health-check is queried frequently or in a large deployment where +infrequent queries may still degrade the DB and message bus performance due to +the scale of the deployment. + +This spec initially suggested using ``OK``, ``Degraded`` and ``Faulty`` as the +values for the status field. These were updated to ``pass``, ``warn`` and +``fail`` to align with the draft IETF RFC for health check response format for +HTTP APIs `[1]`_. + + +Data model impact +----------------- + +The Nova context object will be extended to store a reference to the +health check manager. + + +REST API impact +--------------- + +None + +While this change will expose a new REST API endpoint it will not be +part of the existing Nova API. + +In the Nova API the /health check route will not initially be used to allow +those that already enable the Oslo middleware to continue to do so. 
+In a future release Nova reserves the right to add a /health check endpoint +that may or may not correspond to the response format defined in Oslo. +A translation between the Oslo response format and the health check module +may be provided in the future but it is out of the scope of this spec. + + + +Security impact +--------------- + +The new health check endpoint will be disabled by default. +When enabled it will not provide any authentication or explicit access control. +The documentation will detail that when enabled, the TCP endpoint should be +bound to ``localhost`` and that file system permissions should be used to secure +the UNIX socket. + +The TCP configuration option will not prevent binding it to a routable IP if +the operator chooses to do so. The intent is that the data returned by the +endpoint will be non-privileged. However, it may contain hostnames/FQDNs or other +infrastructure information such as service UUIDs, so it should not be +accessible from the Internet. + +Notifications impact +-------------------- + +None + +While the health checks will use the ability to send notifications as an input +to determine the health of the system, this spec will not introduce any new +notifications and as such it will not impact the Notification subsystem in +Nova. New notifications are not added as this would incur a performance +overhead. + +Other end user impact +--------------------- + +None + +At present, it is not planned to extend the Nova client or the unified client +to query the new endpoint. cURL, socat, or any other UNIX socket or TCP HTTP +client can be used to invoke the endpoint. + +Performance Impact +------------------ + +None + +We expect there to be little or no performance impact as we will be taking a +minimally invasive approach to add health indicators to key functions +which will be cached in memory. While this will slightly increase memory usage +there is no expected impact on system performance. 
+ + +Other deployer impact +--------------------- + +A new config section ``healthcheck`` will be added in the nova.conf. + +A ``uri`` config option will be introduced to enable the health check +functionality. The config option will be a string opt that supports a +comma-separated list of URIs with the following format: + +uri=<scheme>://[host:port|path],<scheme>://[host:port|path] + +e.g. + +:: + + [healthcheck] + uri=tcp://localhost:424242 + + [healthcheck] + uri=unix:///run/nova/nova-compute.sock + + [healthcheck] + uri=tcp://localhost:424242,unix:///run/nova/nova-compute.sock + +The URI should be limited to the characters ``[a-zA-Z0-9_-]``. +``,`` is reserved as a separation character, ``.`` may only be used in IPv4 +addresses, and ``:`` is reserved for port separation unless the address is an +IPv6 address. IPv6 addresses must be enclosed in ``[`` and ``]``. ``/`` may +be used with the UNIX protocol; however, relative paths are not supported. +These constraints and the parsing of the URI will be enforced and provided by +the RFC3986 lib https://pypi.org/project/rfc3986/ + +A ``ttl`` IntOpt will be added with a default value of 300 seconds. +If set to 0, the time to live of a health check item will be infinite. +If the TTL expires, the state will be considered unknown and the healthcheck +item will be discarded. + +A ``cache_control`` IntOpt will be provided to set the max-age value in the +``Cache-Control`` header. By default it will have the same max-age as the TTL +config option. Setting this to 0 will disable the reporting of the header. +Setting this to -1 will report ``Cache-Control: no-cache``. +Any other positive integer value will be used as the max-age. + + + +Developer impact +---------------- + +Developers should be aware of the new decorator and consider whether it should +be added to more functions, if that function is an indicator of the system's +health. 
Failures due to interactions with external systems such as Neutron port +binding external events should be handled with caution. While failure to +receive a port binding event will likely result in the failure to boot a VM, it +should not be used as a health indicator for the nova-compute agent. This is +because such a failure may be due to a failure in Neutron, not Nova. As such, +other operations such as VM snapshot may be unaffected and the Nova compute +service may be otherwise healthy. Any failure to connect to a non-OpenStack +service such as the message bus, hypervisor, or database should be treated as a +``warn`` or ``fail`` health indicator if it prevents the Nova binary from +functioning correctly. + + +Upgrade impact +-------------- + +None + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + sean-k-mooney + +Other contributors: + melwitt + +Feature Liaison +--------------- + +Feature liaison: + sean-k-mooney + +Work Items +---------- + +* Add new module +* Introduce decorator +* Extend context object to store a reference to health check manager +* Add config options +* Expose TCP endpoint +* Expose UNIX socket endpoint support +* Add docs + +Dependencies +============ + +None + +Testing +======= + +This can be tested entirely with unit and functional tests, however, +Devstack will be extended to expose the endpoint and use it to determine +whether the Nova services have started. + +Documentation Impact +==================== + +The config options will be documented in the config reference +and a release note will be added for the feature. + +A new health check section will be added to the admin docs describing +the current response format and how to enable the feature and its intended +usage. This document should be evolved whenever the format changes or +new functionality is added beyond the scope of this spec. 
+ +References +========== + +* Yoga PTG topic: + https://etherpad.opendev.org/p/r.e70aa851abf8644c29c8abe4bce32b81#L415 + +.. _`[1]`: https://tools.ietf.org/id/draft-inadarei-api-health-check-06.html + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Yoga + - Introduced + * - 2023.1 Antelope + - Reproposed