Neutron resource diagnostics

This spec proposes the introduction of a neutron diagnostics framework and API extension capable collecting resource diagnostics across neutron API and agent nodes. To keep the spec containable, the proposal suggests only providing a sample diagnostic check and reiterating on concrete diagnostics once we get the plumbing in place. While this spec has some inspiration from nova diagnostics [1], the approach herein is more generic and extensible supporting a broader set of use cases longer term. Finally it seeks to pave the way for supporting use case / features proposed in the related bugs. [1] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics Related-Bug: #1507499 Related-Bug: #1519537 Related-Bug: #1537686 Related-Bug: #1563538 Change-Id: Id534acb1593f1fe210c561b1451656dce69514db
2017-02-15 15:47:07 -07:00 · 2017-02-15 15:47:07 -07:00 · dc11da5109
parent 095580ec46
commit dc11da5109
1 changed files with 353 additions and 0 deletions
--- a/specs/pike/diagnostics.rst
+++ b/specs/pike/diagnostics.rst
@ -0,0 +1,353 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+============================
+Neutron Resource Diagnostics
+============================
+
+RFE: https://bugs.launchpad.net/neutron/+bug/1507499
+
+Problem Description
+===================
+
+Cloud software is complex and problems are inevitable. Neutron is no exception
+to this rule. To cope today we have helper scripts [1]_, blogs [2]_, diagnostic
+tools [3]_, out-of-tree plugin utilities [4]_, etc. that could all benefit from
+additional Neutron diagnostic data APIs. And while we certainly don't want
+Neutron to become a new diagnostic system, other projects [5]_ have
+found value in exposing some level of diagnostic data for users (typically
+operators) needing to reach under the hood.
+
+By the same token, Neutron offers a broad array of resources backed by a
+heterogeneous set of technologies. As a result, standardizing a set of
+concrete diagnostics across this diverse domain is no simple task.
+Moreover, a Neutron resource may be realized through multiple components
+(plugins, agents) spanning multiple nodes (hosts). Thus, any proposal
+to collect resource diagnostic data must plan accordingly to support
+multiple, potentially disparate, diagnostic participants.
+
+Proposed Change
+===============
+
+This spec proposes to deliver a diagnostic framework consisting of the
+following high level constructs:
+
+- Diagnostic checks that are effectively micro-plugins containing concrete
+  logic to carry out a diagnostic check for a given set of resource types
+  (``ports``, ``networks``, etc.) as well as check specific metadata such
+  as the check's name, description, etc.
+- Diagnostic providers that are capable of discovering, managing and
+  executing diagnostic checks. These providers are the sources for diagnostics
+  and will allow neutron plugins and agents to plug-and-play with the
+  framework.
+- Diagnostics API extension that supports a ``/diagnostics`` URI for existing
+  neutron resources and services the request by discovering and executing
+  checks on applicable providers and rolling up the responses.
+
+Each of the following constructs is discussed in greater detail in the following
+sections.
+
+Diagnostic Checks
+-----------------
+
+Diagnostic checks are effectively small plugins that provide the following:
+
+- Check specific metadata including:
+
+  - A unique name for the check.
+  - (optional) A description of the check's diagnostics.
+  - A boolean flag indicating if the check is required or not.
+  - A set of resource types (``ports``, ``networks``, etc.) the check is
+    applicable to.
+
+- The actual diagnostic check logic. This logic will be invoked when collecting
+  diagnostics for a specific resource instance of a resource type that's
+  supported by the check (as per the check's metadata). Checks return a
+  well-formed response including a response code, (optional) text message, etc.
+  If a check supports multiple resource types, its implementation can key off
+  the resource type that'll be passed in by the framework.
+
+As part of this work a sample check will be provided for both a server and
+agent side plugin that illustrates how to use the framework.
+
+Diagnostic Providers
+--------------------
+
+Diagnostic providers are simply 'sources' for diagnostic checks and contain
+the logic to discover, manage and execute checks. Services in a deployment
+wishing to contribute diagnostics will therefore implement a diagnostic
+provider. For this effort we'll deliver a diagnostic provider binding
+for both the neutron service as well as neutron based agents.
+
+Diagnostic providers have the following characteristics:
+
+- The means to discover and load known checks. More details in subsequent
+  sections.
+- An internal cache of loaded checks and public APIs to:
+
+  - Determine if the provider has a check for a given resource type or list
+    of resource types.
+  - List the resource types the provider has checks for.
+  - Execute all checks applicable to a specified resource type and resource
+    instance, collect the well-formed results and return them.
+
+The neutron server diagnostic provider binding delivered by this effort
+will be implemented as a service plugin; consumers wishing to invoke it's
+APIs can get a reference to the plugin instance via neutron manager. In
+addition, other plugins can act as a diagnostic provider by supporting
+the extension and implementing the respective diagnostic methods required.
+This approach will also work for agent-less plugins and mimics what's typically
+done today in such plugins when supporting extensions in their implementation.
+
+The neutron agent based diagnostic provider binding delivered by this
+implementation will use the agent extension framework; plumbing the support
+for all neutron based agents. However since the agent binding is remote to the
+neutron API server it has the following notable differences:
+
+- To indicate the resource types the agent diagnostic provider supports,
+  a set of (string) resource types is returned in the agent state report
+  and stored in the agent database. On the agent, this set is built by
+  collecting the supported resource types from the checks it manages. On the
+  server side, the agent's supported resource list can be used by the API
+  extension controller logic to determine if the agent is applicable
+  to service specific diagnostic requests.
+- The public APIs to invoke checks on a specific resource type and resource
+  instance are remoteable so they can be called via RPC.
+
+Diagnostic Providers: Loading Checks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Diagnostic providers are responsible for loading checks upon start-up
+(e.g. when service plugin starts or the agent starts). There are multiple
+ways to support check plugins including:
+
+- Having each check as a stevedore entry point; then loading and managing them
+  via a driver manager.
+- Defining a specific check directory that only contains python modules where
+  each module is a check (plugin) and discovering + importing them from the
+  directory.
+- Statically defining the checks in the diagnostic provider binding code.
+  This is just as it sounds; statically have a list of checks in the code
+  rather than dynamically discovering + loading.
+
+Each approach above has it's pros/cons. For the initial implementation we
+suggest using the last approach of statically defining the checks in the code.
+While this is not the most robust approach, it minimizes complexity and allows
+us to get a tire-kicker more rapidly that we can then later iterate on once
+we get some feedback from users.
+
+Diagnostic Providers: Data Model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As mentioned previously, neutron agent based diagnostic providers report
+the resource types the checks they manage support. This is a list of
+unique string resource types that is transported in a new key/value of
+the agent state report. This list is stored in a new database table
+in a new column called ``diagnostic_resource_types`` that maps to its
+corresponding agent table entry.
+
+None (default) or an empty list signifies the agent does not provide
+diagnostic data and thus will not be called to service a diagnostic
+request. Neutron (OVO) objects will be updated as necessary.
+
+API Extension Plugin
+--------------------
+
+The implementation delivered will include an API extension plugin that acts
+as the REST API controller for diagnostics. As described in the REST API
+section, a ``/diagnostics`` URI is dangled off existing neutron resources
+that only supports ``POST`` with an empty request body. Therefore this
+controller needs to service diagnostic requests for a specific resource
+type and ID.
+
+The following pseudo code outlines the controllers logic for request handling:
+
+- Get a list of all plugin instances from the neutron manager that support
+  the diagnostic extension and filter them to only those that support
+  the said resource type.
+- Get a list of all active agents from the DB and filter to only those that
+  have the said resource type in their ``diagnostic_resource_types`` column.
+- From the list of plugin and agent providers that support the given resource
+  type, invoke their API(s) to run the checks they know about for the said
+  resource type and resource instance.
+- Collect the diagnostic results, roll them into a nice response and return
+  them to the caller.
+
+REST API
+--------
+
+When enabled, this implementation dangles a ``/diagnostics`` URI off Neutron
+resources. The only supported HTTP method for this URI is ``POST`` (with an
+empty request body) that triggers diagnostic data collection from all
+applicable registered diagnostic providers. While this spec proposes the
+diagnostic collection be run synchronously for the initial iteration of the
+implementation, future workings could make the collection job based run async
+in the back ground.
+
+The generic form of the diagnostics URL is::
+
+    POST /v2.0/{resource}/{resource_id}/diagnostics
+
+The response is a list of diagnostic (dict) objects, one object per
+``diagnostic``. A diagnostic is an aspect of the ``{resource}`` checked,
+and contains a ``description`` as well as a diagnostic ``status`` object
+and list of individual ``checks`` run for the said diagnostic. Diagnostic
+``status`` is set by the diagnostics framework based on the result of
+all checks for the said resource ``diagnostic``.
+
+The array of ``checks`` returned for each diagnostic includes high level
+details about the check such as ``name``, ``description`` and ``provider``.
+In addition, checks report their own ``status`` based on the result of
+their check execution. If a check is not successful, the check must
+return a ``remediation`` to describe potential ways to remediate the failed
+check.
+
+The ``status`` at the diagnostic level is handled by the framework and
+can be one of the following:
+
+- ``OK``: All checks for the diagnostic are successful.
+- ``ERROR``: One or more checks failed.
+- ``INACTIVE``: One or more providers is inactive and couldn't be invoked
+  to run the diagnostic checks.
+- ``DEGRADED``: Checks can be registered as optional. If a check is optional
+  and fails, or is ``INACTIVE``, the diagnostic status will be ``DEGRADED``.
+
+For example::
+
+    POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
+    EMPTY POST BODY
+    ==> All successful DHCP diagnostics
+    {
+        "diagnostics": [
+            {
+                "diagnostic": "dhcp",
+                "description": "Neutron network DHCP diagnostics.",
+                "status": {
+                    "code": "DS000",
+                    "label": "OK",
+                    "message": "All checks completed successfully."
+                },
+                "checks": [
+                    {
+                        "name:": "check1",
+                        "description": "Check1 does this and that.",
+                        "status": {
+                            "code": "DS000",
+                            "label": "OK",
+                            "message": null
+                        },
+                        "provider": {
+                            "name": "DHCP Agent",
+                            "host": "dhcp-host1"
+                        },
+                        "remediation": {}
+                    },
+                    {
+                        "name:": "check2",
+                        "description": "Check2 does cool stuff.",
+                        "status": {
+                            "code": "DS000",
+                            "label": "OK",
+                            "message": null,
+                        },
+                        "provider": {
+                            "name": "DHCP Agent",
+                            "host": "dhcp-host2"
+                        },
+                        "remediation": {}
+                    }
+                ]
+            }
+        ]
+    }
+
+    POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
+    EMPTY POST BODY
+    ==> A failed dhcp diagnostic
+    {
+        "diagnostics": [
+            {
+                "diagnostic": "dhcp",
+                "description": "Neutron network DHCP diagnostics.",
+                "status": {
+                    "code": "DS002",
+                    "label": "ERROR",
+                    "message": "Check 'check1' failed. See the check details for more info."
+                },
+                "checks": [
+                    {
+                        "name:": "check1",
+                        "description": "Check1 does this and that.",
+                        "status": {
+                            "code": "DHCPE001",
+                            "label": "ERROR",
+                            "message": "The dnsmasq process is not running."
+                        },
+                        "provider": {
+                            "name": "DHCP Agent",
+                            "host": "dhcp-host1"
+                        },
+                        "remediation": {
+                            "code": "DHCPR001",
+                            "message": "Re-enable DHCP for this network, then rerun this check."
+                        }
+                    },
+                    {
+                        "name:": "check2",
+                        "description": "Check2 does cool stuff.",
+                        "status": {
+                            "code": "DS000",
+                            "label": "OK",
+                            "message": null,
+                        },
+                        "provider": {
+                            "name": "DHCP Agent",
+                            "host": "dhcp-host2"
+                        },
+                        "remediation": {}
+                    }
+                ]
+            }
+        ]
+    }
+
+Access control to ``/diagnostics`` is handled via standard policy definition.
+The default access control is ``admin_only``, but operators can change this
+in ``policy.json`` as needed.
+
+
+Benefits
+--------
+
+While the main purpose of this effort is to spearhead diagnostics in Neutron and
+start building out the plumbing, this functionality is immediately valuable for
+consumers and out-of-tree plugins alike.
+
+For example:
+
+- The python-don project [3]_ can implement diagnostic data collection for
+  data used in its analysis.
+- The vmware-nsx plugin can migrate some of its operator CLI functionality [4]_
+  into diagnostic data.
+- The implementation can be enhanced to collect interface stats similar to how
+  Nova diagnostics [5]_ does.
+
+
+Future work
+-----------
+
+- Once the API solidifies, a CLI can be added to support diagnostics.
+
+
+References
+==========
+
+.. [1] https://github.com/openstack/osops-tools-generic/blob/master/neutron/listorphans.py
+.. [2] http://www.tuxfixer.com/openstack-how-to-manually-delete-orphaned-neutron-port/
+.. [3] https://github.com/openstack/python-don
+.. [4] https://github.com/openstack/vmware-nsx/blob/master/vmware_nsx/shell/admin/README.rst
+.. [5] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics
+.. [6] https://bugs.launchpad.net/neutron/+bug/1563538