diff --git a/specs/pike/diagnostics.rst b/specs/pike/diagnostics.rst new file mode 100644 index 000000000..892f75ee4 --- /dev/null +++ b/specs/pike/diagnostics.rst @@ -0,0 +1,353 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +============================ +Neutron Resource Diagnostics +============================ + +RFE: https://bugs.launchpad.net/neutron/+bug/1507499 + +Problem Description +=================== + +Cloud software is complex and problems are inevitable. Neutron is no exception +to this rule. To cope today we have helper scripts [1]_, blogs [2]_, diagnostic +tools [3]_, out-of-tree plugin utilities [4]_, etc. that could all benefit from +additional Neutron diagnostic data APIs. And while we certainly don't want +Neutron to become a new diagnostic system, other projects [5]_ have +found value in exposing some level of diagnostic data for users (typically +operators) needing to reach under the hood. + +By the same token, Neutron offers a broad array of resources backed by a +heterogeneous set of technologies. As a result, standardizing a set of +concrete diagnostics across this diverse domain is no simple task. +Moreover, a Neutron resource may be realized through multiple components +(plugins, agents) spanning multiple nodes (hosts). Thus, any proposal +to collect resource diagnostic data must plan accordingly to support +multiple, potentially disparate, diagnostic participants. + +Proposed Change +=============== + +This spec proposes to deliver a diagnostic framework consisting of the +following high level constructs: + +- Diagnostic checks that are effectively micro-plugins containing concrete + logic to carry out a diagnostic check for a given set of resource types + (``ports``, ``networks``, etc.) as well as check specific metadata such + as the check's name, description, etc. +- Diagnostic providers that are capable of discovering, managing and + executing diagnostic checks. These providers are the sources for diagnostics + and will allow neutron plugins and agents to plug-and-play with the + framework. +- Diagnostics API extension that supports a ``/diagnostics`` URI for existing + neutron resources and services the request by discovering and executing + checks on applicable providers and rolling up the responses. + +Each of the following constructs is discussed in greater detail in the following +sections. + +Diagnostic Checks +----------------- + +Diagnostic checks are effectively small plugins that provide the following: + +- Check specific metadata including: + + - A unique name for the check. + - (optional) A description of the check's diagnostics. + - A boolean flag indicating if the check is required or not. + - A set of resource types (``ports``, ``networks``, etc.) the check is + applicable to. + +- The actual diagnostic check logic. This logic will be invoked when collecting + diagnostics for a specific resource instance of a resource type that's + supported by the check (as per the check's metadata). Checks return a + well-formed response including a response code, (optional) text message, etc. + If a check supports multiple resource types, its implementation can key off + the resource type that'll be passed in by the framework. + +As part of this work a sample check will be provided for both a server and +agent side plugin that illustrates how to use the framework. + +Diagnostic Providers +-------------------- + +Diagnostic providers are simply 'sources' for diagnostic checks and contain +the logic to discover, manage and execute checks. Services in a deployment +wishing to contribute diagnostics will therefore implement a diagnostic +provider. For this effort we'll deliver a diagnostic provider binding +for both the neutron service as well as neutron based agents. + +Diagnostic providers have the following characteristics: + +- The means to discover and load known checks. More details in subsequent + sections. +- An internal cache of loaded checks and public APIs to: + + - Determine if the provider has a check for a given resource type or list + of resource types. + - List the resource types the provider has checks for. + - Execute all checks applicable to a specified resource type and resource + instance, collect the well-formed results and return them. + +The neutron server diagnostic provider binding delivered by this effort +will be implemented as a service plugin; consumers wishing to invoke it's +APIs can get a reference to the plugin instance via neutron manager. In +addition, other plugins can act as a diagnostic provider by supporting +the extension and implementing the respective diagnostic methods required. +This approach will also work for agent-less plugins and mimics what's typically +done today in such plugins when supporting extensions in their implementation. + +The neutron agent based diagnostic provider binding delivered by this +implementation will use the agent extension framework; plumbing the support +for all neutron based agents. However since the agent binding is remote to the +neutron API server it has the following notable differences: + +- To indicate the resource types the agent diagnostic provider supports, + a set of (string) resource types is returned in the agent state report + and stored in the agent database. On the agent, this set is built by + collecting the supported resource types from the checks it manages. On the + server side, the agent's supported resource list can be used by the API + extension controller logic to determine if the agent is applicable + to service specific diagnostic requests. +- The public APIs to invoke checks on a specific resource type and resource + instance are remoteable so they can be called via RPC. + +Diagnostic Providers: Loading Checks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Diagnostic providers are responsible for loading checks upon start-up +(e.g. when service plugin starts or the agent starts). There are multiple +ways to support check plugins including: + +- Having each check as a stevedore entry point; then loading and managing them + via a driver manager. +- Defining a specific check directory that only contains python modules where + each module is a check (plugin) and discovering + importing them from the + directory. +- Statically defining the checks in the diagnostic provider binding code. + This is just as it sounds; statically have a list of checks in the code + rather than dynamically discovering + loading. + +Each approach above has it's pros/cons. For the initial implementation we +suggest using the last approach of statically defining the checks in the code. +While this is not the most robust approach, it minimizes complexity and allows +us to get a tire-kicker more rapidly that we can then later iterate on once +we get some feedback from users. + +Diagnostic Providers: Data Model +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As mentioned previously, neutron agent based diagnostic providers report +the resource types the checks they manage support. This is a list of +unique string resource types that is transported in a new key/value of +the agent state report. This list is stored in a new database table +in a new column called ``diagnostic_resource_types`` that maps to its +corresponding agent table entry. + +None (default) or an empty list signifies the agent does not provide +diagnostic data and thus will not be called to service a diagnostic +request. Neutron (OVO) objects will be updated as necessary. + +API Extension Plugin +-------------------- + +The implementation delivered will include an API extension plugin that acts +as the REST API controller for diagnostics. As described in the REST API +section, a ``/diagnostics`` URI is dangled off existing neutron resources +that only supports ``POST`` with an empty request body. Therefore this +controller needs to service diagnostic requests for a specific resource +type and ID. + +The following pseudo code outlines the controllers logic for request handling: + +- Get a list of all plugin instances from the neutron manager that support + the diagnostic extension and filter them to only those that support + the said resource type. +- Get a list of all active agents from the DB and filter to only those that + have the said resource type in their ``diagnostic_resource_types`` column. +- From the list of plugin and agent providers that support the given resource + type, invoke their API(s) to run the checks they know about for the said + resource type and resource instance. +- Collect the diagnostic results, roll them into a nice response and return + them to the caller. + +REST API +-------- + +When enabled, this implementation dangles a ``/diagnostics`` URI off Neutron +resources. The only supported HTTP method for this URI is ``POST`` (with an +empty request body) that triggers diagnostic data collection from all +applicable registered diagnostic providers. While this spec proposes the +diagnostic collection be run synchronously for the initial iteration of the +implementation, future workings could make the collection job based run async +in the back ground. + +The generic form of the diagnostics URL is:: + + POST /v2.0/{resource}/{resource_id}/diagnostics + +The response is a list of diagnostic (dict) objects, one object per +``diagnostic``. A diagnostic is an aspect of the ``{resource}`` checked, +and contains a ``description`` as well as a diagnostic ``status`` object +and list of individual ``checks`` run for the said diagnostic. Diagnostic +``status`` is set by the diagnostics framework based on the result of +all checks for the said resource ``diagnostic``. + +The array of ``checks`` returned for each diagnostic includes high level +details about the check such as ``name``, ``description`` and ``provider``. +In addition, checks report their own ``status`` based on the result of +their check execution. If a check is not successful, the check must +return a ``remediation`` to describe potential ways to remediate the failed +check. + +The ``status`` at the diagnostic level is handled by the framework and +can be one of the following: + +- ``OK``: All checks for the diagnostic are successful. +- ``ERROR``: One or more checks failed. +- ``INACTIVE``: One or more providers is inactive and couldn't be invoked + to run the diagnostic checks. +- ``DEGRADED``: Checks can be registered as optional. If a check is optional + and fails, or is ``INACTIVE``, the diagnostic status will be ``DEGRADED``. + +For example:: + + POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics + EMPTY POST BODY + ==> All successful DHCP diagnostics + { + "diagnostics": [ + { + "diagnostic": "dhcp", + "description": "Neutron network DHCP diagnostics.", + "status": { + "code": "DS000", + "label": "OK", + "message": "All checks completed successfully." + }, + "checks": [ + { + "name:": "check1", + "description": "Check1 does this and that.", + "status": { + "code": "DS000", + "label": "OK", + "message": null + }, + "provider": { + "name": "DHCP Agent", + "host": "dhcp-host1" + }, + "remediation": {} + }, + { + "name:": "check2", + "description": "Check2 does cool stuff.", + "status": { + "code": "DS000", + "label": "OK", + "message": null, + }, + "provider": { + "name": "DHCP Agent", + "host": "dhcp-host2" + }, + "remediation": {} + } + ] + } + ] + } + + POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics + EMPTY POST BODY + ==> A failed dhcp diagnostic + { + "diagnostics": [ + { + "diagnostic": "dhcp", + "description": "Neutron network DHCP diagnostics.", + "status": { + "code": "DS002", + "label": "ERROR", + "message": "Check 'check1' failed. See the check details for more info." + }, + "checks": [ + { + "name:": "check1", + "description": "Check1 does this and that.", + "status": { + "code": "DHCPE001", + "label": "ERROR", + "message": "The dnsmasq process is not running." + }, + "provider": { + "name": "DHCP Agent", + "host": "dhcp-host1" + }, + "remediation": { + "code": "DHCPR001", + "message": "Re-enable DHCP for this network, then rerun this check." + } + }, + { + "name:": "check2", + "description": "Check2 does cool stuff.", + "status": { + "code": "DS000", + "label": "OK", + "message": null, + }, + "provider": { + "name": "DHCP Agent", + "host": "dhcp-host2" + }, + "remediation": {} + } + ] + } + ] + } + +Access control to ``/diagnostics`` is handled via standard policy definition. +The default access control is ``admin_only``, but operators can change this +in ``policy.json`` as needed. + + +Benefits +-------- + +While the main purpose of this effort is to spearhead diagnostics in Neutron and +start building out the plumbing, this functionality is immediately valuable for +consumers and out-of-tree plugins alike. + +For example: + +- The python-don project [3]_ can implement diagnostic data collection for + data used in its analysis. +- The vmware-nsx plugin can migrate some of its operator CLI functionality [4]_ + into diagnostic data. +- The implementation can be enhanced to collect interface stats similar to how + Nova diagnostics [5]_ does. + + +Future work +----------- + +- Once the API solidifies, a CLI can be added to support diagnostics. + + +References +========== + +.. [1] https://github.com/openstack/osops-tools-generic/blob/master/neutron/listorphans.py +.. [2] http://www.tuxfixer.com/openstack-how-to-manually-delete-orphaned-neutron-port/ +.. [3] https://github.com/openstack/python-don +.. [4] https://github.com/openstack/vmware-nsx/blob/master/vmware_nsx/shell/admin/README.rst +.. [5] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics +.. [6] https://bugs.launchpad.net/neutron/+bug/1563538