Neutron resource diagnostics

This spec proposes the introduction of a neutron diagnostics framework
and API extension capable collecting resource diagnostics across
neutron API and agent nodes. To keep the spec containable, the proposal
suggests only providing a sample diagnostic check and reiterating on
concrete diagnostics once we get the plumbing in place.

While this spec has some inspiration from nova diagnostics [1],
the approach herein is more generic and extensible supporting a
broader set of use cases longer term.

Finally it seeks to pave the way for supporting use case / features
proposed in the related bugs.

[1] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics

Related-Bug: #1507499
Related-Bug: #1519537
Related-Bug: #1537686
Related-Bug: #1563538

Change-Id: Id534acb1593f1fe210c561b1451656dce69514db
This commit is contained in:
Boden R 2017-02-15 15:47:07 -07:00
parent 095580ec46
commit dc11da5109
1 changed files with 353 additions and 0 deletions

353
specs/pike/diagnostics.rst Normal file
View File

@ -0,0 +1,353 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
============================
Neutron Resource Diagnostics
============================
RFE: https://bugs.launchpad.net/neutron/+bug/1507499
Problem Description
===================
Cloud software is complex and problems are inevitable. Neutron is no exception
to this rule. To cope today we have helper scripts [1]_, blogs [2]_, diagnostic
tools [3]_, out-of-tree plugin utilities [4]_, etc. that could all benefit from
additional Neutron diagnostic data APIs. And while we certainly don't want
Neutron to become a new diagnostic system, other projects [5]_ have
found value in exposing some level of diagnostic data for users (typically
operators) needing to reach under the hood.
By the same token, Neutron offers a broad array of resources backed by a
heterogeneous set of technologies. As a result, standardizing a set of
concrete diagnostics across this diverse domain is no simple task.
Moreover, a Neutron resource may be realized through multiple components
(plugins, agents) spanning multiple nodes (hosts). Thus, any proposal
to collect resource diagnostic data must plan accordingly to support
multiple, potentially disparate, diagnostic participants.
Proposed Change
===============
This spec proposes to deliver a diagnostic framework consisting of the
following high level constructs:
- Diagnostic checks that are effectively micro-plugins containing concrete
logic to carry out a diagnostic check for a given set of resource types
(``ports``, ``networks``, etc.) as well as check specific metadata such
as the check's name, description, etc.
- Diagnostic providers that are capable of discovering, managing and
executing diagnostic checks. These providers are the sources for diagnostics
and will allow neutron plugins and agents to plug-and-play with the
framework.
- Diagnostics API extension that supports a ``/diagnostics`` URI for existing
neutron resources and services the request by discovering and executing
checks on applicable providers and rolling up the responses.
Each of the following constructs is discussed in greater detail in the following
sections.
Diagnostic Checks
-----------------
Diagnostic checks are effectively small plugins that provide the following:
- Check specific metadata including:
- A unique name for the check.
- (optional) A description of the check's diagnostics.
- A boolean flag indicating if the check is required or not.
- A set of resource types (``ports``, ``networks``, etc.) the check is
applicable to.
- The actual diagnostic check logic. This logic will be invoked when collecting
diagnostics for a specific resource instance of a resource type that's
supported by the check (as per the check's metadata). Checks return a
well-formed response including a response code, (optional) text message, etc.
If a check supports multiple resource types, its implementation can key off
the resource type that'll be passed in by the framework.
As part of this work a sample check will be provided for both a server and
agent side plugin that illustrates how to use the framework.
Diagnostic Providers
--------------------
Diagnostic providers are simply 'sources' for diagnostic checks and contain
the logic to discover, manage and execute checks. Services in a deployment
wishing to contribute diagnostics will therefore implement a diagnostic
provider. For this effort we'll deliver a diagnostic provider binding
for both the neutron service as well as neutron based agents.
Diagnostic providers have the following characteristics:
- The means to discover and load known checks. More details in subsequent
sections.
- An internal cache of loaded checks and public APIs to:
- Determine if the provider has a check for a given resource type or list
of resource types.
- List the resource types the provider has checks for.
- Execute all checks applicable to a specified resource type and resource
instance, collect the well-formed results and return them.
The neutron server diagnostic provider binding delivered by this effort
will be implemented as a service plugin; consumers wishing to invoke it's
APIs can get a reference to the plugin instance via neutron manager. In
addition, other plugins can act as a diagnostic provider by supporting
the extension and implementing the respective diagnostic methods required.
This approach will also work for agent-less plugins and mimics what's typically
done today in such plugins when supporting extensions in their implementation.
The neutron agent based diagnostic provider binding delivered by this
implementation will use the agent extension framework; plumbing the support
for all neutron based agents. However since the agent binding is remote to the
neutron API server it has the following notable differences:
- To indicate the resource types the agent diagnostic provider supports,
a set of (string) resource types is returned in the agent state report
and stored in the agent database. On the agent, this set is built by
collecting the supported resource types from the checks it manages. On the
server side, the agent's supported resource list can be used by the API
extension controller logic to determine if the agent is applicable
to service specific diagnostic requests.
- The public APIs to invoke checks on a specific resource type and resource
instance are remoteable so they can be called via RPC.
Diagnostic Providers: Loading Checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Diagnostic providers are responsible for loading checks upon start-up
(e.g. when service plugin starts or the agent starts). There are multiple
ways to support check plugins including:
- Having each check as a stevedore entry point; then loading and managing them
via a driver manager.
- Defining a specific check directory that only contains python modules where
each module is a check (plugin) and discovering + importing them from the
directory.
- Statically defining the checks in the diagnostic provider binding code.
This is just as it sounds; statically have a list of checks in the code
rather than dynamically discovering + loading.
Each approach above has it's pros/cons. For the initial implementation we
suggest using the last approach of statically defining the checks in the code.
While this is not the most robust approach, it minimizes complexity and allows
us to get a tire-kicker more rapidly that we can then later iterate on once
we get some feedback from users.
Diagnostic Providers: Data Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As mentioned previously, neutron agent based diagnostic providers report
the resource types the checks they manage support. This is a list of
unique string resource types that is transported in a new key/value of
the agent state report. This list is stored in a new database table
in a new column called ``diagnostic_resource_types`` that maps to its
corresponding agent table entry.
None (default) or an empty list signifies the agent does not provide
diagnostic data and thus will not be called to service a diagnostic
request. Neutron (OVO) objects will be updated as necessary.
API Extension Plugin
--------------------
The implementation delivered will include an API extension plugin that acts
as the REST API controller for diagnostics. As described in the REST API
section, a ``/diagnostics`` URI is dangled off existing neutron resources
that only supports ``POST`` with an empty request body. Therefore this
controller needs to service diagnostic requests for a specific resource
type and ID.
The following pseudo code outlines the controllers logic for request handling:
- Get a list of all plugin instances from the neutron manager that support
the diagnostic extension and filter them to only those that support
the said resource type.
- Get a list of all active agents from the DB and filter to only those that
have the said resource type in their ``diagnostic_resource_types`` column.
- From the list of plugin and agent providers that support the given resource
type, invoke their API(s) to run the checks they know about for the said
resource type and resource instance.
- Collect the diagnostic results, roll them into a nice response and return
them to the caller.
REST API
--------
When enabled, this implementation dangles a ``/diagnostics`` URI off Neutron
resources. The only supported HTTP method for this URI is ``POST`` (with an
empty request body) that triggers diagnostic data collection from all
applicable registered diagnostic providers. While this spec proposes the
diagnostic collection be run synchronously for the initial iteration of the
implementation, future workings could make the collection job based run async
in the back ground.
The generic form of the diagnostics URL is::
POST /v2.0/{resource}/{resource_id}/diagnostics
The response is a list of diagnostic (dict) objects, one object per
``diagnostic``. A diagnostic is an aspect of the ``{resource}`` checked,
and contains a ``description`` as well as a diagnostic ``status`` object
and list of individual ``checks`` run for the said diagnostic. Diagnostic
``status`` is set by the diagnostics framework based on the result of
all checks for the said resource ``diagnostic``.
The array of ``checks`` returned for each diagnostic includes high level
details about the check such as ``name``, ``description`` and ``provider``.
In addition, checks report their own ``status`` based on the result of
their check execution. If a check is not successful, the check must
return a ``remediation`` to describe potential ways to remediate the failed
check.
The ``status`` at the diagnostic level is handled by the framework and
can be one of the following:
- ``OK``: All checks for the diagnostic are successful.
- ``ERROR``: One or more checks failed.
- ``INACTIVE``: One or more providers is inactive and couldn't be invoked
to run the diagnostic checks.
- ``DEGRADED``: Checks can be registered as optional. If a check is optional
and fails, or is ``INACTIVE``, the diagnostic status will be ``DEGRADED``.
For example::
POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
EMPTY POST BODY
==> All successful DHCP diagnostics
{
"diagnostics": [
{
"diagnostic": "dhcp",
"description": "Neutron network DHCP diagnostics.",
"status": {
"code": "DS000",
"label": "OK",
"message": "All checks completed successfully."
},
"checks": [
{
"name:": "check1",
"description": "Check1 does this and that.",
"status": {
"code": "DS000",
"label": "OK",
"message": null
},
"provider": {
"name": "DHCP Agent",
"host": "dhcp-host1"
},
"remediation": {}
},
{
"name:": "check2",
"description": "Check2 does cool stuff.",
"status": {
"code": "DS000",
"label": "OK",
"message": null,
},
"provider": {
"name": "DHCP Agent",
"host": "dhcp-host2"
},
"remediation": {}
}
]
}
]
}
POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
EMPTY POST BODY
==> A failed dhcp diagnostic
{
"diagnostics": [
{
"diagnostic": "dhcp",
"description": "Neutron network DHCP diagnostics.",
"status": {
"code": "DS002",
"label": "ERROR",
"message": "Check 'check1' failed. See the check details for more info."
},
"checks": [
{
"name:": "check1",
"description": "Check1 does this and that.",
"status": {
"code": "DHCPE001",
"label": "ERROR",
"message": "The dnsmasq process is not running."
},
"provider": {
"name": "DHCP Agent",
"host": "dhcp-host1"
},
"remediation": {
"code": "DHCPR001",
"message": "Re-enable DHCP for this network, then rerun this check."
}
},
{
"name:": "check2",
"description": "Check2 does cool stuff.",
"status": {
"code": "DS000",
"label": "OK",
"message": null,
},
"provider": {
"name": "DHCP Agent",
"host": "dhcp-host2"
},
"remediation": {}
}
]
}
]
}
Access control to ``/diagnostics`` is handled via standard policy definition.
The default access control is ``admin_only``, but operators can change this
in ``policy.json`` as needed.
Benefits
--------
While the main purpose of this effort is to spearhead diagnostics in Neutron and
start building out the plumbing, this functionality is immediately valuable for
consumers and out-of-tree plugins alike.
For example:
- The python-don project [3]_ can implement diagnostic data collection for
data used in its analysis.
- The vmware-nsx plugin can migrate some of its operator CLI functionality [4]_
into diagnostic data.
- The implementation can be enhanced to collect interface stats similar to how
Nova diagnostics [5]_ does.
Future work
-----------
- Once the API solidifies, a CLI can be added to support diagnostics.
References
==========
.. [1] https://github.com/openstack/osops-tools-generic/blob/master/neutron/listorphans.py
.. [2] http://www.tuxfixer.com/openstack-how-to-manually-delete-orphaned-neutron-port/
.. [3] https://github.com/openstack/python-don
.. [4] https://github.com/openstack/vmware-nsx/blob/master/vmware_nsx/shell/admin/README.rst
.. [5] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics
.. [6] https://bugs.launchpad.net/neutron/+bug/1563538