Neutron resource diagnostics
This spec proposes a neutron diagnostics framework and API extension capable of collecting resource diagnostics across neutron API and agent nodes. To keep the spec manageable, the proposal provides only a sample diagnostic check and defers concrete diagnostics to later iterations, once the plumbing is in place. While this spec takes some inspiration from nova diagnostics [1], the approach herein is more generic and extensible, supporting a broader set of use cases longer term. Finally, it seeks to pave the way for the use cases and features proposed in the related bugs.

[1] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics

Related-Bug: #1507499
Related-Bug: #1519537
Related-Bug: #1537686
Related-Bug: #1563538
Change-Id: Id534acb1593f1fe210c561b1451656dce69514db
parent
095580ec46
commit
dc11da5109
|
@ -0,0 +1,353 @@
|
|||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
============================
|
||||
Neutron Resource Diagnostics
|
||||
============================
|
||||
|
||||
RFE: https://bugs.launchpad.net/neutron/+bug/1507499
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Cloud software is complex and problems are inevitable. Neutron is no exception
|
||||
to this rule. To cope today we have helper scripts [1]_, blogs [2]_, diagnostic
|
||||
tools [3]_, out-of-tree plugin utilities [4]_, etc. that could all benefit from
|
||||
additional Neutron diagnostic data APIs. And while we certainly don't want
|
||||
Neutron to become a new diagnostic system, other projects [5]_ have
|
||||
found value in exposing some level of diagnostic data for users (typically
|
||||
operators) needing to reach under the hood.
|
||||
|
||||
By the same token, Neutron offers a broad array of resources backed by a
|
||||
heterogeneous set of technologies. As a result, standardizing a set of
|
||||
concrete diagnostics across this diverse domain is no simple task.
|
||||
Moreover, a Neutron resource may be realized through multiple components
|
||||
(plugins, agents) spanning multiple nodes (hosts). Thus, any proposal
|
||||
to collect resource diagnostic data must plan accordingly to support
|
||||
multiple, potentially disparate, diagnostic participants.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
This spec proposes to deliver a diagnostic framework consisting of the
|
||||
following high level constructs:
|
||||
|
||||
- Diagnostic checks that are effectively micro-plugins containing concrete
|
||||
logic to carry out a diagnostic check for a given set of resource types
|
||||
(``ports``, ``networks``, etc.) as well as check specific metadata such
|
||||
as the check's name, description, etc.
|
||||
- Diagnostic providers that are capable of discovering, managing and
|
||||
executing diagnostic checks. These providers are the sources for diagnostics
|
||||
and will allow neutron plugins and agents to plug-and-play with the
|
||||
framework.
|
||||
- Diagnostics API extension that supports a ``/diagnostics`` URI for existing
|
||||
neutron resources and services the request by discovering and executing
|
||||
checks on applicable providers and rolling up the responses.
|
||||
|
||||
Each of these constructs is discussed in greater detail in the following
sections.
|
||||
|
||||
Diagnostic Checks
|
||||
-----------------
|
||||
|
||||
Diagnostic checks are effectively small plugins that provide the following:
|
||||
|
||||
- Check specific metadata including:
|
||||
|
||||
- A unique name for the check.
|
||||
- (optional) A description of the check's diagnostics.
|
||||
- A boolean flag indicating if the check is required or not.
|
||||
- A set of resource types (``ports``, ``networks``, etc.) the check is
|
||||
applicable to.
|
||||
|
||||
- The actual diagnostic check logic. This logic will be invoked when collecting
|
||||
diagnostics for a specific resource instance of a resource type that's
|
||||
supported by the check (as per the check's metadata). Checks return a
|
||||
well-formed response including a response code, (optional) text message, etc.
|
||||
If a check supports multiple resource types, its implementation can key off
|
||||
the resource type that'll be passed in by the framework.
|
||||
|
||||
As part of this work a sample check will be provided for both a server and
|
||||
agent side plugin that illustrates how to use the framework.
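
To make the shape of a check more concrete, here is a minimal sketch of what one could look like. It is not the proposed implementation: the class name, the attribute names (``name``, ``description``, ``required``, ``resource_types``), the ``run`` signature, the ``CheckResult`` fields and the ``SAMPLE001`` code are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    """Well-formed check result (field names are assumptions)."""
    code: str
    label: str
    message: Optional[str] = None


class SamplePortCheck:
    """Hypothetical check: metadata as class attributes, logic in run()."""

    name = "sample_port_check"            # unique name for the check
    description = "Verifies that a port has a device owner."
    required = True                       # required vs. optional check
    resource_types = frozenset({"ports", "networks"})

    def run(self, resource_type, resource):
        # The framework passes in the resource type so that a check
        # supporting several types can key off it.
        if resource_type not in self.resource_types:
            raise ValueError("unsupported resource type: %s" % resource_type)
        if resource_type == "ports" and not resource.get("device_owner"):
            return CheckResult("SAMPLE001", "ERROR",
                               "Port has no device owner.")
        return CheckResult("DS000", "OK")


check = SamplePortCheck()
ok = check.run("ports", {"id": "p1", "device_owner": "compute:nova"})
failed = check.run("ports", {"id": "p2", "device_owner": ""})
```

Keeping the metadata as plain attributes would let a provider inspect a check without executing it.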
|
||||
|
||||
Diagnostic Providers
|
||||
--------------------
|
||||
|
||||
Diagnostic providers are simply 'sources' for diagnostic checks and contain
|
||||
the logic to discover, manage and execute checks. Services in a deployment
|
||||
wishing to contribute diagnostics will therefore implement a diagnostic
|
||||
provider. For this effort we'll deliver a diagnostic provider binding
|
||||
for both the neutron service as well as neutron based agents.
|
||||
|
||||
Diagnostic providers have the following characteristics:
|
||||
|
||||
- The means to discover and load known checks. More details in subsequent
|
||||
sections.
|
||||
- An internal cache of loaded checks and public APIs to:
|
||||
|
||||
- Determine if the provider has a check for a given resource type or list
|
||||
of resource types.
|
||||
- List the resource types the provider has checks for.
|
||||
- Execute all checks applicable to a specified resource type and resource
|
||||
instance, collect the well-formed results and return them.
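
The provider characteristics above could be sketched roughly as follows, with the checks defined statically in code. Every name here (``StaticCheck``, ``DiagnosticProvider`` and its methods) is hypothetical, and treating "has a check for a list of resource types" as "covers at least one of them" is an assumption, since the spec leaves that detail open.

```python
class StaticCheck:
    """Hypothetical check stand-in: metadata plus a callable with the logic."""

    def __init__(self, name, resource_types, func, required=True):
        self.name = name
        self.resource_types = frozenset(resource_types)
        self.required = required
        self._func = func

    def run(self, resource_type, resource):
        return self._func(resource_type, resource)


class DiagnosticProvider:
    """Hypothetical provider holding an internal cache of loaded checks."""

    def __init__(self, checks):
        # Cache of loaded checks, keyed by their unique name.
        self._checks = {}
        for check in checks:
            if check.name in self._checks:
                raise ValueError("duplicate check name: %s" % check.name)
            self._checks[check.name] = check

    def supported_resource_types(self):
        """List the resource types this provider has checks for."""
        types = set()
        for check in self._checks.values():
            types.update(check.resource_types)
        return sorted(types)

    def has_checks_for(self, resource_types):
        """True if at least one check covers any of the given types."""
        if isinstance(resource_types, str):
            resource_types = [resource_types]
        return bool(set(resource_types) & set(self.supported_resource_types()))

    def run_checks(self, resource_type, resource):
        """Run every applicable check and collect the results."""
        results = []
        for check in self._checks.values():
            if resource_type in check.resource_types:
                results.append({"name": check.name,
                                "required": check.required,
                                "status": check.run(resource_type, resource)})
        return results


# Checks are listed statically rather than discovered dynamically.
provider = DiagnosticProvider([
    StaticCheck("port_bound", ["ports"],
                lambda rtype, res: "OK" if res.get("binding:host_id") else "ERROR"),
    StaticCheck("has_name", ["ports", "networks"],
                lambda rtype, res: "OK" if res.get("name") else "ERROR",
                required=False),
])
results = provider.run_checks("networks", {"name": "net1"})
```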
|
||||
|
||||
The neutron server diagnostic provider binding delivered by this effort
|
||||
will be implemented as a service plugin; consumers wishing to invoke its
|
||||
APIs can get a reference to the plugin instance via neutron manager. In
|
||||
addition, other plugins can act as a diagnostic provider by supporting
|
||||
the extension and implementing the respective diagnostic methods required.
|
||||
This approach will also work for agent-less plugins and mimics what's typically
|
||||
done today in such plugins when supporting extensions in their implementation.
|
||||
|
||||
The neutron agent based diagnostic provider binding delivered by this
implementation will use the agent extension framework, which plumbs in
support for all neutron based agents. However, since the agent binding is
remote to the neutron API server, it has the following notable differences:
|
||||
|
||||
- To indicate the resource types the agent diagnostic provider supports,
|
||||
a set of (string) resource types is returned in the agent state report
|
||||
and stored in the agent database. On the agent, this set is built by
|
||||
collecting the supported resource types from the checks it manages. On the
|
||||
server side, the agent's supported resource list can be used by the API
|
||||
extension controller logic to determine if the agent is applicable
|
||||
to service specific diagnostic requests.
|
||||
- The public APIs to invoke checks on a specific resource type and resource
|
||||
instance are remoteable so they can be called via RPC.
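
As a rough illustration of the first point, the agent could derive its advertised set from its checks and the server could filter on it. This is a sketch under assumptions: the state report is modelled as a plain dict, the checks as dicts, and the report key simply reuses the ``diagnostic_resource_types`` column name from the data model section; the actual key name is not fixed by this spec.

```python
def supported_resource_types(checks):
    """Agent side: union of the resource types of all managed checks."""
    types = set()
    for check in checks:
        types.update(check["resource_types"])
    return sorted(types)


def add_diagnostics_to_state(agent_state, checks):
    """Return a copy of the state report with the advertised types added."""
    report = dict(agent_state)
    # Key name is an assumption borrowed from the proposed column name.
    report["diagnostic_resource_types"] = supported_resource_types(checks)
    return report


def agent_applies(agent_row, resource_type):
    """Server side: None or an empty list means 'no diagnostics'."""
    advertised = agent_row.get("diagnostic_resource_types") or []
    return resource_type in advertised


checks = [
    {"name": "dnsmasq_running", "resource_types": ["subnets", "networks"]},
    {"name": "lease_file", "resource_types": ["subnets"]},
]
report = add_diagnostics_to_state({"host": "dhcp-host1"}, checks)
```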
|
||||
|
||||
Diagnostic Providers: Loading Checks
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Diagnostic providers are responsible for loading checks upon start-up
|
||||
(e.g. when service plugin starts or the agent starts). There are multiple
|
||||
ways to support check plugins including:
|
||||
|
||||
- Having each check as a stevedore entry point; then loading and managing them
|
||||
via a driver manager.
|
||||
- Defining a specific check directory that only contains python modules where
|
||||
each module is a check (plugin) and discovering + importing them from the
|
||||
directory.
|
||||
- Statically defining the checks in the diagnostic provider binding code.
|
||||
This is just as it sounds; statically have a list of checks in the code
|
||||
rather than dynamically discovering + loading.
|
||||
|
||||
Each approach above has its pros and cons. For the initial implementation we
suggest the last approach: statically defining the checks in the code.
While this is not the most robust approach, it minimizes complexity and gets
us a tire-kicker more rapidly, which we can then iterate on once we get some
feedback from users.
|
||||
|
||||
Diagnostic Providers: Data Model
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
As mentioned previously, neutron agent based diagnostic providers report
the resource types supported by the checks they manage. This is a list of
unique string resource types that is transported in a new key/value of
the agent state report. The list is stored in a
``diagnostic_resource_types`` column of a new database table whose rows map
to the corresponding agent table entries.
|
||||
|
||||
None (default) or an empty list signifies the agent does not provide
|
||||
diagnostic data and thus will not be called to service a diagnostic
|
||||
request. Neutron (OVO) objects will be updated as necessary.
|
||||
|
||||
API Extension Plugin
|
||||
--------------------
|
||||
|
||||
The implementation delivered will include an API extension plugin that acts
as the REST API controller for diagnostics. As described in the REST API
section, a ``/diagnostics`` URI that only supports ``POST`` with an empty
request body is dangled off existing neutron resources. This controller
therefore needs to service diagnostic requests for a specific resource
type and ID.
|
||||
|
||||
The following pseudo code outlines the controller's logic for request handling:
|
||||
|
||||
- Get a list of all plugin instances from the neutron manager that support
|
||||
the diagnostic extension and filter them to only those that support
|
||||
the said resource type.
|
||||
- Get a list of all active agents from the DB and filter to only those that
|
||||
have the said resource type in their ``diagnostic_resource_types`` column.
|
||||
- From the list of plugin and agent providers that support the given resource
|
||||
type, invoke their API(s) to run the checks they know about for the said
|
||||
resource type and resource instance.
|
||||
- Collect the diagnostic results, roll them into a nice response and return
|
||||
them to the caller.
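
The steps above can be sketched in Python as follows. This is only an outline under stated assumptions: plugins and agents are passed in directly rather than looked up via the neutron manager and the DB, ``call_agent`` stands in for the RPC call, the ``alive`` flag models "active", and the flat response shape is a simplification of the one described in the REST API section.

```python
class FakePluginProvider:
    """Stand-in for a server-side plugin that supports the extension."""

    def __init__(self, name, resource_types):
        self.name = name
        self.resource_types = set(resource_types)

    def has_checks_for(self, resource_type):
        return resource_type in self.resource_types

    def run_checks(self, resource_type, resource_id):
        return [{"name": "%s-check" % self.name, "provider": self.name,
                 "status": "OK"}]


def collect_diagnostics(resource_type, resource_id, plugins, agents,
                        call_agent):
    results = []
    # 1. Plugins that support the extension and the resource type.
    for plugin in plugins:
        if plugin.has_checks_for(resource_type):
            results.extend(plugin.run_checks(resource_type, resource_id))
    # 2. Active agents that advertise the resource type.
    for agent in agents:
        advertised = agent.get("diagnostic_resource_types") or []
        if agent.get("alive") and resource_type in advertised:
            # 3. Invoke the agent's checks (RPC in a real deployment).
            results.extend(call_agent(agent, resource_type, resource_id))
    # 4. Roll the results up into a single response.
    return {"resource": resource_type, "id": resource_id, "checks": results}


def fake_rpc(agent, resource_type, resource_id):
    """Hypothetical replacement for the remote agent call."""
    return [{"name": "agent-check", "provider": agent["host"],
             "status": "OK"}]


response = collect_diagnostics(
    "subnets", "315ec9bb-34f5-4f7a-a44c-b13015a26803",
    plugins=[FakePluginProvider("core", ["ports"]),
             FakePluginProvider("diag", ["subnets"])],
    agents=[
        {"host": "dhcp-host1", "alive": True,
         "diagnostic_resource_types": ["subnets"]},
        {"host": "dhcp-host2", "alive": False,
         "diagnostic_resource_types": ["subnets"]},
        {"host": "l3-host1", "alive": True,
         "diagnostic_resource_types": None},
    ],
    call_agent=fake_rpc)
```

In this run only the ``diag`` plugin and the live ``dhcp-host1`` agent contribute results; the other providers are filtered out before any check is invoked.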
|
||||
|
||||
REST API
|
||||
--------
|
||||
|
||||
When enabled, this implementation dangles a ``/diagnostics`` URI off Neutron
|
||||
resources. The only supported HTTP method for this URI is ``POST`` (with an
|
||||
empty request body) that triggers diagnostic data collection from all
|
||||
applicable registered diagnostic providers. While this spec proposes that
diagnostic collection run synchronously in the initial iteration of the
implementation, future work could make collection job based and run it
asynchronously in the background.
|
||||
|
||||
The generic form of the diagnostics URL is::
|
||||
|
||||
POST /v2.0/{resource}/{resource_id}/diagnostics
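
For illustration, a client could build such a request with the Python standard library along these lines. The endpoint and token values are placeholders, the helper name is hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request


def build_diagnostics_request(endpoint, resource, resource_id, token):
    """Build (but do not send) a POST to the diagnostics URI."""
    url = "%s/v2.0/%s/%s/diagnostics" % (
        endpoint.rstrip("/"), resource, resource_id)
    # No request body is attached; only the method and auth header are set.
    return urllib.request.Request(
        url, method="POST", headers={"X-Auth-Token": token})


request = build_diagnostics_request(
    "http://controller:9696/",              # placeholder endpoint
    "subnets",
    "315ec9bb-34f5-4f7a-a44c-b13015a26803",
    "secret-token")                         # placeholder token
```

Passing the result to ``urllib.request.urlopen`` would issue the call against a real deployment.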
|
||||
|
||||
The response is a list of diagnostic (dict) objects, one object per
|
||||
``diagnostic``. A diagnostic is an aspect of the ``{resource}`` checked,
|
||||
and contains a ``description`` as well as a diagnostic ``status`` object
|
||||
and list of individual ``checks`` run for the said diagnostic. Diagnostic
|
||||
``status`` is set by the diagnostics framework based on the result of
|
||||
all checks for the said resource ``diagnostic``.
|
||||
|
||||
The array of ``checks`` returned for each diagnostic includes high level
|
||||
details about the check such as ``name``, ``description`` and ``provider``.
|
||||
In addition, checks report their own ``status`` based on the result of
|
||||
their check execution. If a check is not successful, the check must
|
||||
return a ``remediation`` to describe potential ways to remediate the failed
|
||||
check.
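
A consumer of this response could pull out the failed checks and their remediation hints roughly as shown below. The helper name is hypothetical and the sample payload is a trimmed, hand-written instance of the structure described above, not output from a real deployment.

```python
def failed_checks(response):
    """Return (diagnostic, check name, message, remediation) per failure."""
    failures = []
    for diagnostic in response.get("diagnostics", []):
        for check in diagnostic.get("checks", []):
            status = check["status"]
            if status["label"] != "OK":
                remediation = check.get("remediation") or {}
                failures.append((diagnostic["diagnostic"],
                                 check["name"],
                                 status["message"],
                                 remediation.get("message")))
    return failures


sample = {
    "diagnostics": [{
        "diagnostic": "dhcp",
        "checks": [
            {"name": "check1",
             "status": {"code": "DHCPE001", "label": "ERROR",
                        "message": "The dnsmasq process is not running."},
             "remediation": {
                 "code": "DHCPR001",
                 "message": "Re-enable DHCP for this network, "
                            "then rerun this check."}},
            {"name": "check2",
             "status": {"code": "DS000", "label": "OK", "message": None},
             "remediation": {}},
        ],
    }]
}
failures = failed_checks(sample)
```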
|
||||
|
||||
The ``status`` at the diagnostic level is handled by the framework and
|
||||
can be one of the following:
|
||||
|
||||
- ``OK``: All checks for the diagnostic are successful.
|
||||
- ``ERROR``: One or more checks failed.
|
||||
- ``INACTIVE``: One or more providers is inactive and couldn't be invoked
|
||||
to run the diagnostic checks.
|
||||
- ``DEGRADED``: Checks can be registered as optional. If a check is optional
|
||||
and fails, or is ``INACTIVE``, the diagnostic status will be ``DEGRADED``.
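
One possible way for the framework to roll check results up into a diagnostic status is sketched below. The spec does not define precedence when several conditions hold at once, so the ordering used here (``ERROR`` before ``INACTIVE`` before ``DEGRADED``) is an assumption, as is the ``(label, required)`` input shape.

```python
def rollup_status(check_results):
    """Derive a diagnostic status from (label, required) pairs.

    The precedence ERROR > INACTIVE > DEGRADED > OK is an assumption.
    """
    results = list(check_results)
    required = [label for label, is_required in results if is_required]
    optional = [label for label, is_required in results if not is_required]
    if "ERROR" in required:
        return "ERROR"        # a required check failed
    if "INACTIVE" in required:
        return "INACTIVE"     # a required check's provider was unreachable
    if any(label != "OK" for label in optional):
        return "DEGRADED"     # only optional checks failed or were inactive
    return "OK"


all_ok = rollup_status([("OK", True), ("OK", False)])
degraded = rollup_status([("OK", True), ("INACTIVE", False)])
```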
|
||||
|
||||
For example::

    POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
    EMPTY POST BODY
    ==> All successful DHCP diagnostics
    {
        "diagnostics": [
            {
                "diagnostic": "dhcp",
                "description": "Neutron network DHCP diagnostics.",
                "status": {
                    "code": "DS000",
                    "label": "OK",
                    "message": "All checks completed successfully."
                },
                "checks": [
                    {
                        "name": "check1",
                        "description": "Check1 does this and that.",
                        "status": {
                            "code": "DS000",
                            "label": "OK",
                            "message": null
                        },
                        "provider": {
                            "name": "DHCP Agent",
                            "host": "dhcp-host1"
                        },
                        "remediation": {}
                    },
                    {
                        "name": "check2",
                        "description": "Check2 does cool stuff.",
                        "status": {
                            "code": "DS000",
                            "label": "OK",
                            "message": null
                        },
                        "provider": {
                            "name": "DHCP Agent",
                            "host": "dhcp-host2"
                        },
                        "remediation": {}
                    }
                ]
            }
        ]
    }

    POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
    EMPTY POST BODY
    ==> A failed DHCP diagnostic
    {
        "diagnostics": [
            {
                "diagnostic": "dhcp",
                "description": "Neutron network DHCP diagnostics.",
                "status": {
                    "code": "DS002",
                    "label": "ERROR",
                    "message": "Check 'check1' failed. See the check details for more info."
                },
                "checks": [
                    {
                        "name": "check1",
                        "description": "Check1 does this and that.",
                        "status": {
                            "code": "DHCPE001",
                            "label": "ERROR",
                            "message": "The dnsmasq process is not running."
                        },
                        "provider": {
                            "name": "DHCP Agent",
                            "host": "dhcp-host1"
                        },
                        "remediation": {
                            "code": "DHCPR001",
                            "message": "Re-enable DHCP for this network, then rerun this check."
                        }
                    },
                    {
                        "name": "check2",
                        "description": "Check2 does cool stuff.",
                        "status": {
                            "code": "DS000",
                            "label": "OK",
                            "message": null
                        },
                        "provider": {
                            "name": "DHCP Agent",
                            "host": "dhcp-host2"
                        },
                        "remediation": {}
                    }
                ]
            }
        ]
    }
|
||||
|
||||
Access control to ``/diagnostics`` is handled via standard policy definition.
|
||||
The default access control is ``admin_only``, but operators can change this
|
||||
in ``policy.json`` as needed.
|
||||
|
||||
|
||||
Benefits
|
||||
--------
|
||||
|
||||
While the main purpose of this effort is to spearhead diagnostics in Neutron and
|
||||
start building out the plumbing, this functionality is immediately valuable for
|
||||
consumers and out-of-tree plugins alike.
|
||||
|
||||
For example:
|
||||
|
||||
- The python-don project [3]_ can implement diagnostic data collection for
|
||||
data used in its analysis.
|
||||
- The vmware-nsx plugin can migrate some of its operator CLI functionality [4]_
|
||||
into diagnostic data.
|
||||
- The implementation can be enhanced to collect interface stats similar to how
|
||||
Nova diagnostics [5]_ does.
|
||||
|
||||
|
||||
Future work
|
||||
-----------
|
||||
|
||||
- Once the API solidifies, a CLI can be added to support diagnostics.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] https://github.com/openstack/osops-tools-generic/blob/master/neutron/listorphans.py
|
||||
.. [2] http://www.tuxfixer.com/openstack-how-to-manually-delete-orphaned-neutron-port/
|
||||
.. [3] https://github.com/openstack/python-don
|
||||
.. [4] https://github.com/openstack/vmware-nsx/blob/master/vmware_nsx/shell/admin/README.rst
|
||||
.. [5] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics
|
||||
.. [6] https://bugs.launchpad.net/neutron/+bug/1563538
|