
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Add pluggable metrics backend for Ironic and IPA
================================================

https://bugs.launchpad.net/ironic/+bug/1526219

This spec proposes adding metric data reporting to Ironic and Ironic Python
Agent (IPA). Initially, this will include a statsd reference implementation,
but the design will be generic enough to permit the creation of alternative
backends.

Problem description
===================

Software metrics are extremely useful to operators for recognizing and
diagnosing problems in running software, and can be used to monitor the
real time and historical performance of Ironic and IPA in a production
environment.

Metrics can be used to determine how quickly (or slowly) parts of the system
are running, how often errors (such as API error responses or BMC failures)
occur, or the performance impact of a given change.

Currently, neither Ironic nor IPA report any application metrics.

Proposed change
===============

* Design a shared pluggable metric reporting system.
* Implement a generic MetricsLogger which includes:

  * Gauges (generic numerical data).
  * Counters (increment/decrement a counter).
  * Timers (time something).
  * Decorators and context managers for the above.

* Implement a StatsdMetricsLogger as the reference backend [1].
* Instrument Ironic to report metric data including:

  * Counting and timing of API requests.  This may be accomplished by hooking
    into Pecan.
  * Counting and timing of RPCs.
  * Counting and timing of most worker functions in ConductorManager.
  * Counting and timing of important driver functions.
  * Counting and timing of node state changes.  By inspecting
    provision_updated_at during a state change, the time the node spent in
    the previous state can be calculated.

* Instrument IPA to report metric data including, but not limited to:

  * Image download/write counts and times.
  * Deploy/cleaning counts and times.

Example code follows (based on Python logging module naming conventions):

.. code:: python

  METRICS = metrics.getLogger(__name__)

  class Foo(object):
    def func1(self):
      # Emit gauge metric with value 1
      METRICS.send_gauge("one.fish", 1)

      # Increment counter metric by two
      METRICS.send_counter("two.fish", 2)

      # Decrement counter metric by one
      METRICS.send_counter("red.fish", -1)

      # Randomly sample the data (emit metric 10% of the time)
      METRICS.send_counter("blue.fish", 42, sample_rate=0.1)

      # Emit a timer metric with value of 125 (milliseconds)
      METRICS.send_timer("black.fish", 125)

      # Randomly sample the data (emit metric 1% of the time)
      METRICS.send_timer("blue.fish", 125, sample_rate=0.01)

    @METRICS.counter("func2.count")
    @METRICS.timer("func2.time", sample_rate=0.1)
    def func2(self):
      pass

    # Context managers for counting and timing code blocks
    def func3(self):

      with METRICS.counter("func3.thing_one.count", sample_rate=0.25):
        thing_one()

      with METRICS.timer("func3.thing_two.time"):
        thing_two()
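
As an illustration of how one object can serve as both a decorator and a
context manager, the following is a minimal sketch and not the proposed
implementation.  The ``Timer`` class and its ``send`` callable are
hypothetical names that stand in for a real backend.

.. code:: python

  import functools
  import time


  class Timer(object):
      """Hypothetical timer usable as a decorator or a context manager."""

      def __init__(self, name, send):
          self.name = name
          # send is an assumed callable(name, value_ms) standing in for a
          # backend such as a statsd logger.
          self._send = send

      def __enter__(self):
          self._start = time.monotonic()
          return self

      def __exit__(self, exc_type, exc_value, traceback):
          elapsed_ms = (time.monotonic() - self._start) * 1000.0
          self._send(self.name, elapsed_ms)
          # Returning False lets exceptions from the timed code propagate.
          return False

      def __call__(self, func):
          @functools.wraps(func)
          def wrapped(*args, **kwargs):
              # A fresh Timer per call keeps concurrent calls from sharing
              # a start time.
              with Timer(self.name, self._send):
                  return func(*args, **kwargs)
          return wrapped


  reported = []

  def record(name, value_ms):
      reported.append((name, value_ms))

  @Timer("func2.time", record)
  def func2():
      return "done"

  func2()
  with Timer("func3.thing_two.time", record):
      pass

The timer reports even when the timed code raises, because ``__exit__`` runs
on both normal and exceptional exit.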


Metric names follow this convention (optional parts indicated by []):

``[global_prefix.][host_name.]prefix.metric_name``

If ``--metrics-agent-prepend-host-reverse`` is set, then ``host.example.com``
becomes ``com.example.host`` to assist with hierarchical data
representation.

For example, using the statsd backend and the relevant config options,
``METRICS.send_timer("blue.fish", 125, sample_rate=0.25)`` is emitted to
statsd as
``globalprefix.com.example.host.moduleprefix.blue.fish:125|ms|@0.25``.
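
The naming convention can be sketched as follows.  This is a minimal
illustration under assumed names: ``build_metric_name`` and
``format_statsd_timer`` are hypothetical helpers, not part of the proposed
API.  The output line follows the statsd wire format
``name:value|ms|@sample_rate``.

.. code:: python

  def build_metric_name(metric_name, prefix, global_prefix=None,
                        host_name=None, reverse_host=False):
      """Join the optional and required name parts with dots."""
      parts = []
      if global_prefix:
          parts.append(global_prefix)
      if host_name:
          if reverse_host:
              # host.example.com -> com.example.host
              host_name = ".".join(reversed(host_name.split(".")))
          parts.append(host_name)
      parts.append(prefix)
      parts.append(metric_name)
      return ".".join(parts)


  def format_statsd_timer(name, value_ms, sample_rate=None):
      """Format a timer metric as a statsd line."""
      line = "%s:%s|ms" % (name, value_ms)
      if sample_rate is not None:
          line += "|@%s" % sample_rate
      return line


  name = build_metric_name("blue.fish", "moduleprefix",
                           global_prefix="globalprefix",
                           host_name="host.example.com",
                           reverse_host=True)
  line = format_statsd_timer(name, 125, sample_rate=0.25)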

Alternatives
------------

Alternatively, we could implement a Ceilometer backend.  Although Ironic
already reports some measurements (such as IPMI sensor data) to Ceilometer,
the metrics that are proposed in this spec do not fit with the Ceilometer
project mission, which is to "...collect measurements of the utilization of
the physical and virtual resources comprising deployed clouds..." [2]

Instead, this spec proposes instrumenting parts of the Ironic/IPA codebase
itself to report metrics and statistics about how and when the code runs,
and how well it performs.  This data is not directly
related to "physical and virtual resources comprising deployed clouds."
Therefore, we do not propose the addition of a Ceilometer backend, nor do we
propose that the existing Ceilometer measurements be converted to this
system, as they represent fundamentally different types of data.

Data model impact
-----------------

None.

State Machine Impact
--------------------

None.

REST API impact
---------------

To support agent drivers, a config field will be added to the response for
the ``/drivers/<drivername>/vendor_passthru/lookup`` endpoint in the Ironic
API.

This field will contain the agent-related config options that an agent can
use to configure itself to report metric data.  For example: statsd host and
statsd port.
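
The sketch below shows one way an agent might read such a field.  The
response layout is an assumption made for illustration only: the ``config``
and ``metrics`` keys, the option names inside them, and the
``metrics_options_from_lookup`` helper are all hypothetical, since this spec
does not define the exact shape of the field.

.. code:: python

  def metrics_options_from_lookup(response):
      """Extract metrics options from a (hypothetical) lookup response."""
      config = response.get("config", {})
      metrics = config.get("metrics", {})
      # Fall back to the defaults used elsewhere in this spec when the
      # conductor sends nothing.
      return {
          "backend": metrics.get("backend", "noop"),
          "statsd_host": metrics.get("statsd_host", "localhost"),
          "statsd_port": int(metrics.get("statsd_port", 8125)),
      }


  # Assumed response shape, for illustration only.
  example_response = {
      "config": {
          "metrics": {
              "backend": "statsd",
              "statsd_host": "192.0.2.10",
              "statsd_port": "8125",
          },
      },
  }
  options = metrics_options_from_lookup(example_response)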

Client (CLI) impact
-------------------

None.

RPC API impact
--------------

None.

Driver API impact
-----------------

None.

Nova driver impact
------------------

None.

Ramdisk impact
--------------

N/A

.. NOTE: This section was not present at the time this spec was approved.

Security impact
---------------

The statsd daemon [3] has no authentication, and consequently anyone who is
able to send UDP datagrams to the daemon can send arbitrary metric data.
However, the statsd daemon is typically configured to listen only on a local
interface, which partially mitigates security concerns.

Other end user impact
---------------------

None.

Scalability impact
------------------

Deployers must ensure that their statsd infrastructure is scaled correctly
relative to the size of their deployment.  However, even if the statsd daemon
is overloaded, Ironic will not be negatively affected (statsd UDP datagrams
are non-blocking, and will simply not be processed).
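
The fire-and-forget behaviour can be sketched with the standard ``socket``
module.  This is a minimal illustration, not the proposed implementation;
``send_statsd_line`` is a hypothetical helper.  The sender does not wait for
any acknowledgement, so an overloaded or absent daemon only results in lost
metrics.

.. code:: python

  import socket


  def send_statsd_line(line, host="localhost", port=8125):
      """Send one metric line over UDP, ignoring delivery failures."""
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      try:
          sock.sendto(line.encode("utf-8"), (host, port))
      except OSError:
          # Metrics are best-effort; a failed send must not affect the
          # caller.
          pass
      finally:
          sock.close()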

Performance Impact
------------------

By default, metrics reporting will be disabled, reducing, but not totally
eliminating, the performance impact for users who do not wish to collect
metrics.  At the very least, a conditional must be checked at each place where
a metric could be reported. Furthermore, depending on exactly how and where
the conditional checking occurs, arguments may be evaluated even if the metric
data aren't actually sent.
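
The argument-evaluation cost can be seen in a small sketch.  The
``NoopMetricLogger`` class, the ``METRICS_ENABLED`` flag, and
``expensive_value`` are hypothetical names used only for illustration.

.. code:: python

  class NoopMetricLogger(object):
      """Hypothetical backend that discards every metric."""

      def send_gauge(self, name, value):
          pass


  evaluations = []

  def expensive_value():
      # Stands in for any costly computation of a metric value.
      evaluations.append(1)
      return 42


  METRICS = NoopMetricLogger()
  METRICS_ENABLED = False

  # The argument is evaluated before the no-op method is entered, so the
  # cost of expensive_value() is paid even though nothing is sent.
  METRICS.send_gauge("queue.depth", expensive_value())

  # Guarding the call site skips the evaluation, at the cost of one
  # conditional check.
  if METRICS_ENABLED:
      METRICS.send_gauge("queue.depth", expensive_value())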

Reporting metrics via statsd affects performance minimally.  The overhead
of sending a single piece of metric data is very small.  In particular, statsd
metrics are sent via UDP (non-blocking) to a statsd daemon [3] that aggregates
the metrics before forwarding them to one of its supported backends.  Should
this backend become unresponsive or overloaded, then metric data will be lost,
but without other performance effects.

After the metric data are aggregated by a local statsd daemon, they are
periodically flushed to one of statsd's configured backends, usually Graphite
[4].

Other deployer impact
---------------------

Two sets of configuration options will be added.

The first set belongs to the ironic-lib metrics library and will be used by
any Ironic service implementing metrics:

.. code::

  [metrics]

  # Backend options are "statsd" and "noop"
  backend="noop"
  statsd_host="localhost"
  statsd_port=8125

  # See proposed changes section for detailed description of how these are used
  prepend_host=false
  prepend_host_reverse=false
  global_prefix=""


Additionally, the following options will be added to the ironic-conductor and
used to configure the ironic-python-agent for metrics on lookup:

.. code::

  # Backend options are "statsd" and "noop"
  agent_backend="noop"
  agent_statsd_host="localhost"
  agent_statsd_port=8125

  # See proposed changes section for detailed description of how these are used
  agent_prepend_host=false
  agent_prepend_host_reverse=false
  agent_prepend_uuid=false
  agent_global_prefix=""


If the statsd metrics backend is enabled, then deployers must install and
configure statsd, as well as any other metrics software that they wish to use
(such as Graphite [4]). Additionally, if deployers wish to emit metrics from
ironic-python-agent as well, the statsd backend must be accessible from
the networks that agents run on.

Developer impact
----------------

None.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  alineb

Other contributors:
  JayF

Work Items
----------

* Design/implement metric reporting into ironic-lib.

* Implement statsd backend.

* Instrument Ironic code to report metrics.

* Instrument IPA code to report metrics.

Dependencies
============

None.

Testing
=======

Additional care may be required to test the statsd network code.

Upgrades and Backwards Compatibility
====================================

None.

Documentation Impact
====================

Appropriate documentation must be written.

References
==========

For more on why metrics are useful to operators, and why the statsd project
began: https://codeascraft.com/2011/02/15/measure-anything-measure-everything/

[1] https://github.com/etsy/statsd/blob/master/docs/metric_types.md

[2] https://wiki.openstack.org/wiki/Ceilometer

[3] https://github.com/etsy/statsd/

[4] https://graphite.readthedocs.org/en/latest/faq.html