Merge "Revert "Add instrumentation devref, Part I""
This commit is contained in:
commit
e5b9ba1ea9
@@ -73,7 +73,6 @@ Neutron Internals
    external_dns_integration
    upgrade
    i18n
-   instrumentation
    address_scopes
    openvswitch_firewall
    network_ip_availability
@@ -1,336 +0,0 @@

Neutron Instrumentation
=======================

OpenStack operators require information about the status and health
of the Neutron system. While it is possible for an operator to pull
all of the interface counters from compute and network nodes, today
there is no capability to aggregate that information to provide
comprehensive counters for each project within Neutron. Neutron
instrumentation sets out to meet this need.

Neutron instrumentation can be broken down into three major pieces:

#. Data Collection (i.e. what data should be collected and how)
#. Data Aggregation (i.e. how and where raw data should be aggregated
   into project information)
#. Data Consumption (i.e. how aggregated data is consumed)

While instrumentation might also be considered to include asynchronous event
notifications, like fault detection, this is considered out of scope
for the following two reasons:

#. In Kilo, Neutron added the ProcessManager class to allow agents to
   spawn a monitor thread that would either respawn or exit the agent.
   While this is a useful feature for ensuring that the agent gets
   restarted, the only notification of this event is an error log entry.
   To ensure that this event is asynchronously passed up to an upstream
   consumer, the Neutron logger object should have its publish_errors
   option set to True and the transport URL set to point at the
   upstream consumer. As the particular URL is consumer specific, further
   discussion is outside the scope of this section.
#. For the data plane, it is necessary to have visibility into the hardware
   status of the compute and networking nodes. As some upstream consumers
   already support this (even if incompletely), it is considered to be within
   the scope of the upstream consumer and not Neutron itself.

How does Instrumentation differ from Metering Labels and Rules
---------------------------------------------------------------

The existing metering label and rule extension provides the ability to
collect traffic information on a per-CIDR basis. Therefore, a possible
implementation of instrumentation would be to use per-instance metering
rules for all IP addresses in both directions. However, the information
collected by metering rules is focused more on billing and so does not
have the desired granularity (i.e. it counts transmitted packets without
keeping track of what caused packets to fail).

What Data to Collect
--------------------

The first step is to consider what data to collect. In the absence of a
standard, it is proposed to use the information set defined in
[RFC2863]_ and [RFC4293]_. This proposal should not be read as implying
that Neutron instrumentation data will be browsable via a MIB browser, as
that would be a potential Data Consumption model.

.. [RFC2863] https://tools.ietf.org/html/rfc2863
.. [RFC4293] https://tools.ietf.org/html/rfc4293

For the reference implementation (Nova/VIF, OVS, and Linux Bridge), this
section identifies what data is already available and how it can be
mapped into the structures defined by the RFCs. Other plugins are welcome
to define their own data sets and/or their own mappings
to the data sets defined in the referenced RFCs.

The focus here is on what is available from "stock" Linux and OpenStack.
Additional statistics may become available if other items like NetFlow or
sFlow are added to the mix, but those should be covered as an addition to
the basic information discussed here.

What is Available from Nova
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within Nova, the libvirt driver makes the following host traffic statistics
available under the get_diagnostics() and get_instance_diagnostics() calls
on a per-virtual NIC basis:

* Receive bytes, packets, errors and drops
* Transmit bytes, packets, errors and drops

There continues to be a long-running effort to get these counters into
Ceilometer (the wiki page at [#]_ attempted to do this via a direct call,
while [#]_ is trying to accomplish this via notifications from Nova).
Rather than propose another way of collecting these statistics from Nova,
this devref takes the approach of declaring them out of scope until there is
an agreed-upon method for getting the counters from Nova to Ceilometer, at
which point it can be seen whether Neutron can/should piggy-back off of that.

.. [#] https://wiki.openstack.org/wiki/EfficientMetering/FutureNovaInteractionModel
.. [#] http://lists.openstack.org/pipermail/openstack-dev/2015-June/067589.html

What is Available from Linux Bridge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the Linux bridge, a check of [#]_ shows that IEEE 802.1d
mandated statistics are only a "wishlist" item. The alternative
is to use NETLINK or shell commands to list the interfaces attached to
a particular bridge and then collect statistics for each of those
interfaces. These statistics could then be mapped to appropriate
places, as discussed below.

Note: the examples below talk in terms of mapping the counters
available from the Linux operating system:

* Receive bytes, packets, errors, dropped, overrun and multicast
* Transmit bytes, packets, errors, dropped, carrier and collisions

Available counters for interfaces on other operating systems
can be mapped in a similar fashion.

.. [#] http://git.kernel.org/cgit/linux/kernel/git/shemminger/bridge-utils.git/tree/doc/WISHLIST

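To make the shell/sysfs alternative concrete, the following is a minimal
sketch (not the reference implementation) that lists the ports attached to a
Linux bridge via ``/sys/class/net/<bridge>/brif`` and reads each port's
counters from ``/sys/class/net/<port>/statistics``; ``brq-example`` is only a
placeholder bridge name:

.. code-block:: python

    import os


    def bridge_interfaces(bridge):
        """Return the names of the interfaces attached to a Linux bridge."""
        # Each port of the bridge shows up as an entry under brif/.
        return os.listdir('/sys/class/net/%s/brif' % bridge)


    def interface_counters(interface):
        """Read the raw counters the kernel exposes for one interface."""
        stats_dir = '/sys/class/net/%s/statistics' % interface
        counters = {}
        for name in os.listdir(stats_dir):
            with open(os.path.join(stats_dir, name)) as f:
                counters[name] = int(f.read())
        return counters


    if __name__ == '__main__':
        # 'brq-example' is a placeholder, not a real Neutron bridge name.
        for port in bridge_interfaces('brq-example'):
            stats = interface_counters(port)
            print(port, stats.get('rx_bytes'), stats.get('tx_bytes'))
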
Of interest are counters from each of the following (as of this writing,
Linux Bridge only supports legacy routers, so the DVR case need not be
considered):

* Compute node

  * Instance tap interface

* Network node

  * DHCP namespace tap interface (if defined)
  * Router namespace qr interface
  * Router namespace qg interface

What is Available from Open vSwitch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like Linux bridge, the Open vSwitch implementation has interface counters
that will be collected. Of interest are the receive and transmit counters
from the following:

Legacy Routing
++++++++++++++

* Compute node

  * Instance tap interface

* Network node

  * DHCP namespace tap interface (if defined)
  * Router namespace qr interface
  * Router namespace qg interface

Distributed Routing (DVR)
+++++++++++++++++++++++++

* Compute node

  * Instance tap interface
  * Router namespace qr interface
  * FIP namespace fg interface

* Network node

  * DHCP tap interface (if defined)
  * Router namespace qr interface
  * SNAT namespace qg interface

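In either mode, the receive and transmit counters for these ports can be read
from the OVS database. The following is a minimal sketch only (it shells out
directly rather than via rootwrap, and ``tap-example`` is a placeholder port
name, not a real Neutron device):

.. code-block:: python

    import subprocess


    def ovs_interface_counter(port, counter):
        """Read one statistics entry for an OVS port via ovs-vsctl."""
        out = subprocess.check_output(
            ['ovs-vsctl', 'get', 'Interface', port,
             'statistics:%s' % counter],
            universal_newlines=True)
        return int(out.strip())


    if __name__ == '__main__':
        # 'tap-example' is a hypothetical port used only for illustration.
        for name in ('rx_bytes', 'rx_packets', 'tx_bytes', 'tx_packets'):
            print(name, ovs_interface_counter('tap-example', name))
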
Mapping from Available Information to MIB Data Set
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following table summarizes how the interface counters are mapped
into each MIB Data Set. Specific details are covered in the sections
below:

+---------+--------------+----------------------+
| Node    | Interface    | Included in Data Set |
|         |              +-----------+----------+
|         |              | RFC2863   | RFC4293  |
+=========+==============+===========+==========+
| Compute | Instance tap | Yes       | No       |
|         +--------------+-----------+----------+
|         | Router qr    | Yes       | Yes      |
|         +--------------+-----------+----------+
|         | FIP fg       | No        | Yes      |
+---------+--------------+-----------+----------+
| Network | DHCP tap     | Yes       | No       |
|         +--------------+-----------+----------+
|         | Router qr    | Yes       | Yes      |
|         +--------------+-----------+----------+
|         | Router qg    | No        | Yes      |
|         +--------------+-----------+----------+
|         | SNAT sg      | No        | Yes      |
+---------+--------------+-----------+----------+

Note: because of replication of the router qg interface when running
distributed routing, aggregation of the individual counter information
will be necessary to fill in the appropriate data set entries. This
will be covered in the Data Aggregation section below.

RFC 2863 Structures
+++++++++++++++++++

For each compute host, each network will be represented with a
"switch", modeled by instances of ifTable and ifXTable. This
mapping has the advantage that for a particular network, the
view to the project or the operator is identical - the only
difference is that the operator can see all networks, while a
project will only see the networks under their project id.

The current reference implementation identifies tap interface names with
the Neutron port they are associated with. In turn, the Neutron port
identifies the Neutron network. Therefore, it is possible to take counters
from each tap interface and map them into entries in the appropriate tables,
using the following proposed assignments:

* ifTable

  * ifInOctets = low 32 bits of interface received byte count
  * ifInUcastPkts = low 32 bits of interface received packet count
  * ifInDiscards = interface received dropped count
  * ifInErrors = interface received errors count
  * ifOutOctets = low 32 bits of interface transmit byte count
  * ifOutUcastPkts = low 32 bits of interface transmit packet count
  * ifOutDiscards = interface transmit dropped count
  * ifOutErrors = interface transmit errors count

* ifXTable

  * ifHCInOctets = 64 bits of interface received byte count
  * ifHCInUcastPkts = 64 bits of interface received packet count
  * ifHCOutOctets = 64 bits of interface transmit byte count
  * ifHCOutUcastPkts = 64 bits of interface transmit packet count

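For the Linux counters shown earlier, these assignments amount to a simple
mapping. The sketch below is illustrative only: ``to_if_tables`` is a
hypothetical helper, and ``counters`` is assumed to be the dict read from
``/sys/class/net/<interface>/statistics`` as in the earlier sketch:

.. code-block:: python

    def to_if_tables(counters):
        """Map raw Linux interface counters onto ifTable/ifXTable entries."""
        mask32 = 0xffffffff
        if_table = {
            'ifInOctets': counters['rx_bytes'] & mask32,
            'ifInUcastPkts': counters['rx_packets'] & mask32,
            'ifInDiscards': counters['rx_dropped'],
            'ifInErrors': counters['rx_errors'],
            'ifOutOctets': counters['tx_bytes'] & mask32,
            'ifOutUcastPkts': counters['tx_packets'] & mask32,
            'ifOutDiscards': counters['tx_dropped'],
            'ifOutErrors': counters['tx_errors'],
        }
        if_x_table = {
            'ifHCInOctets': counters['rx_bytes'],
            'ifHCInUcastPkts': counters['rx_packets'],
            'ifHCOutOctets': counters['tx_bytes'],
            'ifHCOutUcastPkts': counters['tx_packets'],
        }
        return if_table, if_x_table
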
Section 3.1.6 of [RFC2863]_ provides the details of why 64-bit sized
counters need to be supported. The summary is that with increasing
transmission bandwidth, the use of 32-bit counters would require a
problematic increase in counter polling frequency (a 1 Gb/s stream of
full-sized packets will cause a 32-bit counter to wrap in 34 seconds).

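As a quick sanity check of that wrap time (an illustrative calculation, not
text taken from the RFC), a 32-bit octet counter at 1 Gb/s wraps after
roughly:

.. math::

   \frac{2^{32} \ \text{octets} \times 8 \ \text{bits/octet}}
        {10^{9} \ \text{bits/s}} \approx 34.4 \ \text{seconds}
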
RFC 4293 Structures
+++++++++++++++++++

Counters tracked by RFC 4293 come in two flavors: ones that are
inherited from the interface, and those that track L3 events,
such as fragmentation, re-assembly, truncations, etc. As the current
instrumentation available from the reference implementation does not
provide appropriate source information, the following counters are
declared out of scope for this devref:

* ipSystemStatsInHdrErrors, ipIfStatsInHdrErrors
* ipSystemStatsInNoRoutes, ipIfStatsInNoRoutes
* ipSystemStatsInAddrErrors, ipIfStatsInAddrErrors
* ipSystemStatsInUnknownProtos, ipIfStatsInUnknownProtos
* ipSystemStatsInTruncatedPkts, ipIfStatsInTruncatedPkts
* ipSystemStatsInForwDatagrams, ipIfStatsInForwDatagrams
* ipSystemStatsHCInForwDatagrams, ipIfStatsHCInForwDatagrams
* ipSystemStatsReasmReqds, ipIfStatsReasmReqds
* ipSystemStatsReasmOKs, ipIfStatsReasmOKs
* ipSystemStatsReasmFails, ipIfStatsReasmFails
* ipSystemStatsInDelivers, ipIfStatsInDelivers
* ipSystemStatsHCInDelivers, ipIfStatsHCInDelivers
* ipSystemStatsOutRequests, ipIfStatsOutRequests
* ipSystemStatsHCOutRequests, ipIfStatsHCOutRequests
* ipSystemStatsOutNoRoutes, ipIfStatsOutNoRoutes
* ipSystemStatsOutForwDatagrams, ipIfStatsOutForwDatagrams
* ipSystemStatsHCOutForwDatagrams, ipIfStatsHCOutForwDatagrams
* ipSystemStatsOutFragReqds, ipIfStatsOutFragReqds
* ipSystemStatsOutFragOKs, ipIfStatsOutFragOKs
* ipSystemStatsOutFragFails, ipIfStatsOutFragFails
* ipSystemStatsOutFragCreates, ipIfStatsOutFragCreates

In ipIfStatsTable, the following counters will hold the same
value as the referenced counter from RFC 2863:

* ipIfStatsInReceives :== ifInUcastPkts
* ipIfStatsHCInReceives :== ifHCInUcastPkts
* ipIfStatsInOctets :== ifInOctets
* ipIfStatsHCInOctets :== ifHCInOctets
* ipIfStatsInDiscards :== ifInDiscards
* ipIfStatsOutDiscards :== ifOutDiscards
* ipIfStatsOutTransmits :== ifOutUcastPkts
* ipIfStatsHCOutTransmits :== ifHCOutUcastPkts
* ipIfStatsOutOctets :== ifOutOctets
* ipIfStatsHCOutOctets :== ifHCOutOctets

For ipSystemStatsTable, the following counters will hold values based
on the following assignments. These summations are covered in more detail
in the Data Aggregation section below:

* ipSystemStatsInReceives :== sum of all ipIfStatsInReceives for the router
* ipSystemStatsHCInReceives :== sum of all ipIfStatsHCInReceives for the router
* ipSystemStatsInOctets :== sum of all ipIfStatsInOctets for the router
* ipSystemStatsHCInOctets :== sum of all ipIfStatsHCInOctets for the router
* ipSystemStatsInDiscards :== sum of all ipIfStatsInDiscards for the router
* ipSystemStatsOutDiscards :== sum of all ipIfStatsOutDiscards for the router
* ipSystemStatsOutTransmits :== sum of all ipIfStatsOutTransmits for the router
* ipSystemStatsHCOutTransmits :== sum of all ipIfStatsHCOutTransmits for the
  router
* ipSystemStatsOutOctets :== sum of all ipIfStatsOutOctets for the router
* ipSystemStatsHCOutOctets :== sum of all ipIfStatsHCOutOctets for the router

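Expressed as code, the summation above is straightforward. The sketch below
is illustrative only (``if_stats_list`` is assumed to be the list of
ipIfStats dicts for the interfaces that belong to one router); the actual
aggregation design is deferred to the Data Aggregation section:

.. code-block:: python

    SUMMED_COUNTERS = {
        'ipSystemStatsInReceives': 'ipIfStatsInReceives',
        'ipSystemStatsHCInReceives': 'ipIfStatsHCInReceives',
        'ipSystemStatsInOctets': 'ipIfStatsInOctets',
        'ipSystemStatsHCInOctets': 'ipIfStatsHCInOctets',
        'ipSystemStatsInDiscards': 'ipIfStatsInDiscards',
        'ipSystemStatsOutDiscards': 'ipIfStatsOutDiscards',
        'ipSystemStatsOutTransmits': 'ipIfStatsOutTransmits',
        'ipSystemStatsHCOutTransmits': 'ipIfStatsHCOutTransmits',
        'ipSystemStatsOutOctets': 'ipIfStatsOutOctets',
        'ipSystemStatsHCOutOctets': 'ipIfStatsHCOutOctets',
    }


    def ip_system_stats(if_stats_list):
        """Sum per-interface ipIfStats entries into per-router ipSystemStats."""
        return {
            system_key: sum(entry.get(if_key, 0) for entry in if_stats_list)
            for system_key, if_key in SUMMED_COUNTERS.items()
        }
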
Data Collection
---------------

There are two options for how data can be collected:

#. The Neutron L3 and ML2 agents could collect the counters themselves.
#. A separate collection agent could be started on each compute/network node
   to collect counters.

Because of the number of counters that need to be collected (for example, a
cloud running legacy routing would need to collect, for each project, three
counters from the network node and a tap counter for each running instance),
and while it would be desirable to reuse the existing L3 and ML2 agents,
the initial proof of concept will run a separate agent that uses
separate threads to isolate the effects of counter collection from
reporting. Once the performance of the collection agent is understood,
then merging the functionality into the L3 or ML2 agents can be considered.
The collection thread will initially use shell commands via rootwrap, with
the plan of moving to native Python libraries when support for them is
available.

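The following is a minimal sketch of that collector/reporter thread split.
It is illustrative only: ``collect_counters`` is a placeholder callable, and
``print`` stands in for the report a real agent would send over RPC:

.. code-block:: python

    import threading
    import time


    class CollectionAgent(object):
        """Toy split between a collection thread and a reporting thread so
        that slow counter collection cannot block reporting (and vice versa).
        """

        def __init__(self, collect_counters, interval=30):
            # collect_counters is a placeholder callable returning a dict of
            # {interface_name: counters}; a real agent would shell out via
            # rootwrap here.
            self._collect_counters = collect_counters
            self._interval = interval
            self._lock = threading.Lock()
            self._latest = {}

        def _collect_loop(self):
            while True:
                snapshot = self._collect_counters()
                with self._lock:
                    self._latest = snapshot
                time.sleep(self._interval)

        def _report_loop(self):
            while True:
                with self._lock:
                    snapshot = dict(self._latest)
                # A real agent would push this to the Neutron server via RPC.
                print('reporting counters for %d interfaces' % len(snapshot))
                time.sleep(self._interval)

        def run(self):
            threads = [threading.Thread(target=self._collect_loop),
                       threading.Thread(target=self._report_loop)]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
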
In addition, there are two options for how to report counters back to the
Neutron server: push or pull (or asynchronous notification vs. polling).
On the one hand, pull/polling eases the Neutron server's task in that it
only needs to store/aggregate the results from the current polling cycle.
However, this comes at the cost of dealing with the stale data issues that
scaling a polling cycle will entail. On the other hand, asynchronous
notification requires that the Neutron server have the capability to hold
the current results from each collector. As the L3 and ML2 agents already
use asynchronous notification to report status back to the Neutron
server, the proof of concept will follow the same model to ease a future
merging of functionality.

Data Aggregation
----------------

Will be covered in a follow-on patch set.

Data Consumption
----------------

Will be covered in a follow-on patch set.