..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=================================
Per Process Healthcheck endpoints
=================================

https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks

In many modern deployment frameworks, there is an expectation that
an application can expose a health-check endpoint so that the binary's
status can be monitored. Nova currently does not provide a native way
to inspect the health of its binaries, which complicates cloud monitoring
and maintenance. While limited support exists for health checks via
oslo middleware for our WSGI based API binaries, this blueprint seeks
to expose a local HTTP health-check endpoint to address this
feature gap consistently for all nova components.

Problem description
===================

Monitoring the health of a nova service today requires operators to develop
and implement a series of external heuristics to infer the state of the
service binaries.
This can be as simple as checking the service status for those with heartbeats
or can comprise monitoring log output via a watchdog and restarting
the service if no output is detected after a protracted period.
Processing the logs for known error messages and executing a remediation
script, or other methods that are easy to get wrong, are also common.
This is also quite unfriendly to new nova users who have not gained enough
experience with operating nova to know what warning signs they should look
for, such as an inability to connect to the message bus. Nova developers,
however, do know what some of the important health indicators are and can
expose those as a local health-check endpoint that operators can use instead.
The existing oslo middleware does not address this problem statement because:

#. It can only be used by the API and metadata binaries.
#. The middleware does not tell you the service is alive if it is hosted by a
   WSGI server like apache, since the middleware is executed independently
   from the WSGI application, i.e. the middleware can pass while the nova-api
   cannot connect to the db and is otherwise broken.
#. The oslo middleware in detailed mode leaks info about the host kernel,
   python version and hostname, which can be used to determine if the host
   is vulnerable to CVEs, and which means it should never be exposed to the
   Internet. e.g.
::

    platform: 'Linux-5.15.2-xanmod1-tt-x86_64-with-glibc2.2.5',
    python_version: '3.8.12 (default, Aug 30 2021, 16:42:10) \n[GCC 10.3.0]'

Use Cases
---------

As an operator, I want a simple REST endpoint I can consume to know
if a nova process is healthy.

As an operator, I want this health check to not impact the performance of the
service so it can be queried frequently at short intervals.

As a deployment tool implementer, I want the health check to be local with no
dependencies on other hosts or services to function so I can integrate it with
service managers such as systemd or container runtimes like Docker.

As a packager, I would like the health check endpoint to not require
special clients or packages to consume. cURL, socat, or netcat should be
all that is required to connect to the health check and retrieve the service
status.

As an operator, I would like to be able to use the health check of the nova
API and metadata services to manage the membership of endpoints in my
load-balancer or reverse proxy automatically.

Proposed change
===============
Definitions
-----------

``TTL``: The time interval for which a health check item is valid.

``pass``: all health indicators are passing and their TTLs have not expired.

``warn``: any health indicator has an expired TTL or there is
a partial transient failure.

``fail``: any health indicator is reporting an error or all TTLs are expired.

.. note::

    In line with the recommendation in the IETF API health check draft RFC
    `[1]`_:

    * ``pass`` and ``warn`` will respond with a ``200 OK``
    * ``fail`` will respond with a ``503 Service Unavailable``

    ``Content-Type: application/health+json`` will be used in all cases.

warn vs fail
------------

In general, if any of the health check indicators are failing then the
service should be reported as ``fail``; however, if the specific error
condition is recoverable or is only a partial failure, the ``warn`` state
can and should be used.
An example of this is a service that has lost its connection to the message
bus: when the connection is lost it should go to the ``warn`` state, and if
the first attempt to reconnect fails it should go to the ``fail`` status.
Transient failures should be considered warnings but persistent errors should
be escalated to failures, as sketched below.
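
A minimal sketch of this escalation rule, where ``set_status`` is a
hypothetical method on the health check manager introduced below:

.. code-block:: python

    def on_connection_lost(manager):
        # losing the connection is initially a transient, recoverable
        # condition, so report a warning rather than a failure.
        manager.set_status('message_bus', 'warn',
                           output='connection to message bus lost')

    def on_reconnect_attempt(manager, succeeded):
        if succeeded:
            manager.set_status('message_bus', 'pass')
        else:
            # a failed reconnect indicates a persistent error: escalate.
            manager.set_status('message_bus', 'fail',
                               output='reconnect to message bus failed')
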
In many cases external management systems will treat ``warn`` and ``fail`` as
equivalent and raise an alarm or restart the service. While this spec does
not specify how you should recover from a degraded state it is
important to include a human readable description of why the ``warn`` or
``fail`` state was entered.
Services in the ``warn`` state are still considered healthy in most cases but
they may be about to fail soon or may be partially degraded.

.. note::

    Where no health check items are currently registered, such as during
    start up, the health check status will be considered ``pass``, not
    ``warn`` or ``fail``. This will prevent restart loops for any service
    managers that treat any value other than ``pass`` as an error state.

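The following sketch illustrates how the overall status could be derived
from the registered health check items under the rules above; the names and
signatures are illustrative, not the final API:

.. code-block:: python

    # per the IETF draft, ``pass`` and ``warn`` map to 200, ``fail`` to 503.
    HTTP_STATUS = {'pass': 200, 'warn': 200, 'fail': 503}

    def overall_status(items, now, ttl):
        """Summarize the registered health check items into one status."""
        if not items:
            # nothing registered yet (e.g. during start up): report pass
            # to avoid restart loops in strict service managers.
            return 'pass'
        if any(item.status == 'fail' for item in items):
            return 'fail'
        if all(item.expired(now, ttl) for item in items):
            # all TTLs expired: the service state is effectively unknown.
            return 'fail'
        if any(item.status == 'warn' or item.expired(now, ttl)
               for item in items):
            return 'warn'
        return 'pass'
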
code changes
------------
A new top-level nova health check module will be created to encapsulate the
common code and data structure required to implement this feature.
A new health check manager class will be introduced which will maintain the
health-check state and all functions related to retrieving, updating and
summarizing that state.

.. note::

    All health check state will be stored in memory and reset/lost on restart
    of the binary. For services that support dynamic reconfiguration via
    SIGHUP, the health check data will be reset as part of this process.

The health check manager will be responsible for creating the health check
endpoint when it is enabled in the nova.conf and exposing the health check
over HTTP.
The initial implementation will support HTTP over TCP with optional support for
UNIX domain sockets as a more secure alternative to be added later.
The HTTP endpoint in both cases will be unauthenticated and the response will
be in JSON format.
A new HealthcheckStatusItem data class will be introduced to store an
individual health check data-point. The HealthcheckStatusItem will contain
the name of the health check, its status, the time it was recorded,
and an optional output string that should be populated if the
status is ``warn`` or ``fail``.
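
A sketch of what this data class could look like; the ``expired`` helper is
an assumption used by the summarization sketch above:

.. code-block:: python

    import dataclasses
    import datetime
    import typing

    @dataclasses.dataclass
    class HealthcheckStatusItem:
        """A single health check data-point held by the manager."""
        name: str                            # e.g. 'message_bus', 'API_db'
        status: str                          # one of 'pass', 'warn', 'fail'
        time: datetime.datetime              # when the status was recorded
        output: typing.Optional[str] = None  # set for 'warn' and 'fail'

        def expired(self, now, ttl):
            # a ttl of 0 means health check items never expire.
            if ttl == 0:
                return False
            return (now - self.time).total_seconds() > ttl
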
A new decorator will be introduced that will automatically retrieve the
reference to the healthcheck manager from the nova context object and update
the result based on whether the decorated function raises an exception or
not. The exception list and healthcheck item name will be specifiable.
The decorator will accept the name of the health check as a positional
argument and include the exception message as the output of the health check
on failure.
Note that the decorator will only support the pass or fail status for
simplicity; where warn is appropriate a manual check should be written.
If multiple functions act as indicators of the same capability the same name
should be used, e.g.
.. code-block:: python
@healthcheck('database', [SQLAlchemyError])
def my_db_func(self):
pass
@healthcheck('database', [SQLAlchemyError])
def my_other_db_func(self):
pass
By default all exceptions will be caught and re-raised by the decorator.
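
A rough sketch of the decorator follows; it assumes the decorated method
receives the request context (which carries the manager reference) as its
first argument, and ``set_status`` is a hypothetical manager method:

.. code-block:: python

    import functools

    def healthcheck(name, exceptions=(Exception,)):
        """Record a pass/fail result for ``name`` around the wrapped call."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, context, *args, **kwargs):
                manager = context.healthcheck_manager
                try:
                    result = func(self, context, *args, **kwargs)
                except tuple(exceptions) as exc:
                    # record the failure and re-raise for normal handling.
                    manager.set_status(name, 'fail', output=str(exc))
                    raise
                manager.set_status(name, 'pass')
                return result
            return wrapper
        return decorator
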
The new REST health check endpoint exposed by this spec will initially only
support one url path: ``/health``. The ``/health`` endpoint will include a
``Cache-Control: max-age=<ttl>`` header as part of its response which can
optionally be consumed by the client.
The endpoint may also implement a simple incrementing etag at a later date,
once the initial implementation is complete, if required.
Initially an ``etag`` is not provided as the response is expected to be
small and cheap to query, so etags do not actually provide much benefit from
a performance perspective.
If implemented, the ``etag`` will be incremented whenever the service state
changes and will reset to 0 when the service is restarted.
Additional url paths may be supported in the future, for example to retrieve
the running configuration, trigger the generation of Guru Meditation Reports
or enable debug logging; however, any endpoint beyond ``/health`` is out of
scope of this spec. ``/`` is not used for the health check response to
facilitate additional paths in the future.

example output
~~~~~~~~~~~~~~

::

    GET /health HTTP/1.1
    Host: example.org
    Accept: application/health+json

    HTTP/1.1 200 OK
    Content-Type: application/health+json
    Cache-Control: max-age=3600
    Connection: close

    {
      "status": "pass",
      "version": "1.0",
      "serviceId": "e3c22423-cd7a-47dc-b6e9-e18d1a8b3bdf",
      "description": "nova-api",
      "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
      "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "API_db": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"}
      }
    }

    GET /health HTTP/1.1
    Host: example.org
    Accept: application/health+json

    HTTP/1.1 503 Service Unavailable
    Content-Type: application/health+json
    Cache-Control: no-cache
    Connection: close

    {
      "status": "fail",
      "version": "1.0",
      "serviceId": "0a47dceb-11b1-4d94-8b9c-927d998be320",
      "description": "nova-compute",
      "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
      "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "hypervisor": {
          "status": "fail", "time": "2021-12-17T16:05:55+00:00",
          "output": "Libvirt Error: ..."
        }
      }
    }

.. note::

    ``version`` will initially be 1.0 and can be incremented following
    ``SemVer`` conventions if we extend the response format.
    This is not the nova version.
    Adding new checks to the nova code base will not increment the version
    of the response but adding or removing any field of the response will.
    The set of check names will be closed and each new check name that is
    added will be signaled by a minor version bump. The initial set of check
    names for version 1.0 is left to the implementation.

``serviceId`` should be set to the nova service id for this binary.

``description`` will contain the binary name for the service.

``notes`` will contain the CONF.host value in the host field and
optionally the hypervisor_hostname in the hostname field.

``status`` will contain the overall status of the service with details
provided in the ``checks`` dictionary.

The keys of the ``checks`` dictionary will be the names of the health checks;
the values will contain the ``status`` and the ``time``, in ISO datetime
format, at which the status was recorded. If the ``status`` is ``warn`` or
``fail``, an ``output`` key will be present with a message explaining the
status.

Alternatives
------------

Instead of maintaining the state of the process in a data structure and
returning the cached state, we could implement the health check as a series
of active probes, such as checking the DB schema version to ensure we can
access it, or making a ping RPC call to the cell conductor or our own
service's RPC endpoint.
While this approach has some advantages, it would have a negative performance
impact if the health check were queried frequently, and in a large deployment
even infrequent queries may degrade DB and message bus performance due to the
scale of the deployment.
This spec initially suggested using ``OK``, ``Degraded`` and ``Faulty`` as the
values for the status field. These were updated to ``pass``, ``warn`` and
``fail`` to align with the draft IETF RFC health check response format for
HTTP APIs `[1]`_.

Data model impact
-----------------
The nova context object will be extended to store a reference to the
health check manager.
REST API impact
---------------
None
While this change will expose a new REST API endpoint, it will not be
part of the existing nova API.
In the nova API the ``/health`` route will not initially be used, to allow
those that already enable the oslo middleware to continue to do so.
In a future release nova reserves the right to add a ``/health`` endpoint
that may or may not correspond to the response format defined in oslo.
A translation between the oslo response format and the health check module
may be provided in the future but it is out of the scope of this spec.

Security impact
---------------
The new health check endpoint will be disabled by default.
When enabled, it will not provide any authentication or explicit access
control. The documentation will detail that when enabled the TCP endpoint
should be bound to ``localhost`` and that file system permissions should be
used to secure the UNIX socket.
The TCP configuration option will not prevent you from binding it to a
routable IP if the operator chooses to do so. The intent is that the data
contained in the endpoint will be non-privileged; however, it may contain
hostnames/FQDNs or other infrastructure information such as service UUIDs,
so it should not be accessible from the Internet.

Notifications impact
--------------------
None
While the health checks will use the ability to send notifications as an
input to determine the health of the system, this spec will not introduce
any new notifications, so it will not impact the notification subsystem in
nova. New notifications are not added as this would incur a performance
overhead.

Other end user impact
---------------------
None
At present, it is not planned to extend the nova client or the unified client
to query the new endpoint. cURL, socat, or any other UNIX socket or TCP HTTP
client can be used to invoke the endpoint, as shown below.
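
For example, assuming an endpoint bound as in the deployer section below
(the port and socket path are illustrative), the status could be retrieved
with:

.. code-block:: console

    # TCP endpoint
    $ curl http://localhost:4242/health

    # UNIX domain socket (curl 7.40 or newer)
    $ curl --unix-socket /run/nova/nova-compute.sock http://localhost/health

    # or with socat
    $ printf 'GET /health HTTP/1.0\r\n\r\n' | \
          socat - UNIX-CONNECT:/run/nova/nova-compute.sock
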
Performance Impact
------------------
None
We expect there to be little or no performance impact as we will be taking a
minimally invasive approach, adding health indicators to key functions whose
results will be cached in memory. While this will slightly increase memory
usage, there is no expected impact on system performance.
Other deployer impact
---------------------

A new config section ``[healthcheck]`` will be added to nova.conf.

A ``uri`` config option will be introduced to enable the health check
functionality. The config option will be a string opt that supports a
comma-separated list of URIs with the following format::

    uri=<scheme>://[host:port|path],<scheme>://[host:port|path]

e.g.

::

    [healthcheck]
    uri=tcp://localhost:4242

    [healthcheck]
    uri=unix:///run/nova/nova-compute.sock

    [healthcheck]
    uri=tcp://localhost:4242,unix:///run/nova/nova-compute.sock

The URI should be limited to the following characters: ``[a-zA-Z0-9_-]``.
``,`` is reserved as a separation character, ``.`` may only be used in IPv4
addresses and ``:`` is reserved for port separation unless the address is an
IPv6 address. IPv6 addresses must be enclosed in ``[`` and ``]``.
``/`` may be used with the UNIX protocol however relative paths are not
supported.
These constraints and the parsing of the URI will be enforced and provided by
the rfc3986 lib: https://pypi.org/project/rfc3986/
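
The parsing could look roughly like the following sketch; the helper name
and the returned tuples are illustrative assumptions, not the final API:

.. code-block:: python

    import rfc3986

    def parse_healthcheck_uris(value):
        """Split and parse the comma-separated ``uri`` option value."""
        endpoints = []
        for uri in value.split(','):
            parsed = rfc3986.urlparse(uri)
            if parsed.scheme == 'tcp':
                # e.g. tcp://localhost:4242 or tcp://[::1]:4242
                endpoints.append(('tcp', parsed.host, parsed.port))
            elif parsed.scheme == 'unix':
                # e.g. unix:///run/nova/nova-compute.sock (absolute paths
                # only; relative paths are not supported).
                endpoints.append(('unix', parsed.path))
            else:
                raise ValueError('unsupported healthcheck uri: %s' % uri)
        return endpoints
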
A ``ttl`` IntOpt will be added with a default value of 300 seconds.
If set to 0 the time to live of health check items will be infinite.
If the TTL expires the state will be considered unknown and
the healthcheck item will be discarded.

A ``cache_control`` IntOpt will be provided to set the max-age value in the
``Cache-Control`` header. By default it will have the same max-age as the
``ttl`` config option. Setting this to 0 will disable the reporting of the
header, setting this to -1 will report ``Cache-Control: no-cache`` and any
other positive integer value will be used as the max-age, as sketched below.
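
A sketch of how the header value could be derived from these options
(the helper name is illustrative; the options are as described above):

.. code-block:: python

    def cache_control_header(conf):
        """Compute the Cache-Control header value, or None to omit it."""
        max_age = conf.healthcheck.cache_control
        if max_age is None:
            # default: use the same max-age as the ttl option.
            max_age = conf.healthcheck.ttl
        if max_age == 0:
            return None  # reporting of the header is disabled
        if max_age == -1:
            return 'no-cache'
        return 'max-age=%d' % max_age
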
Developer impact
----------------

Developers should be aware of the new decorator and consider if it should be
added to more functions if that function is an indicator of the system's
health. Failures due to interactions with external systems, such as neutron
port binding external events, should be handled with caution. While a failure
to receive a port binding event will likely result in the failure to boot a
vm, it should not be used as a health indicator for the nova-compute agent.
This is because such a failure may be due to a failure in neutron, not nova.
As such, other operations such as vm snapshot may be unaffected and the nova
compute service may be otherwise healthy. Any failure to connect to a
non-OpenStack service such as the message bus, hypervisor or database,
however, should be treated as a ``warn`` or ``fail`` health indicator if it
prevents the nova binary from functioning correctly.
Upgrade impact
--------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
sean-k-mooney
Other contributors:
None
Feature Liaison
---------------
Feature liaison:
sean-k-mooney
Work Items
----------
* Add new module
* Introduce decorator
* Extend context object to store a reference to health check manager.
* Add config options
* Expose TCP endpoint
* Expose UNIX socket endpoint support
* Add docs
Dependencies
============
None
Testing
=======

This can be tested entirely with unit and functional tests; however,
devstack will be extended to expose the endpoint and use it to determine
if the nova services have started.
Documentation Impact
====================
The config options will be documented in the config reference
and a release note will be added for the feature.
A new health check section will be added to the admin docs describing
the current response format and how to enable the feature and its intended
usage. This document should be evolved whenever the format changes or
new functionality is added beyond the scope of this spec.
References
==========
* yoga ptg topic:
https://etherpad.opendev.org/p/r.e70aa851abf8644c29c8abe4bce32b81#L415
.. _`[1]`: https://tools.ietf.org/id/draft-inadarei-api-health-check-06.html
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Yoga
     - Introduced