openstack-manuals/doc/admin-guide-cloud/section_object-storage-monitoring.xml

<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="ch_introduction-to-openstack-object-storage-monitoring">
    <!-- ... Based on a blog, should be replaced with original material... -->
    <title>Object Storage monitoring</title>
    <?dbhtml stop-chunking?>
    <para>Excerpted from a blog post by <link
            xlink:href="http://swiftstack.com/blog/2012/04/11/swift-monitoring-with-statsd"
            >Darrell Bishop</link></para>
    <para>An OpenStack Object Storage cluster is a collection of many
        daemons that work together across many nodes. With so many
        different components, you must be able to tell what is going
        on inside the cluster. Tracking server-level metrics like CPU
        utilization, load, memory consumption, disk usage and
        utilization, and so on is necessary, but not
        sufficient.</para>
    <para>What are different daemons are doing on each server? What is
        the volume of object replication on node8? How long is it
        taking? Are there errors? If so, when did they happen?</para>
    <para>In such a complex ecosystem, you can use multiple approaches
        to get the answers to these questions. This section describes
        several approaches.</para>
    <section xml:id="monitoring-swiftrecon">
        <title>Swift Recon</title>
        <para>The Swift Recon middleware (see <link
                xlink:href="http://swift.openstack.org/admin_guide.html#cluster-telemetry-and-monitoring"
                >http://swift.openstack.org/admin_guide.html#cluster-telemetry-and-monitoring</link>)
            provides general machine statistics, such as load average,
            socket statistics, <code>/proc/meminfo</code> contents,
            and so on, as well as Swift-specific metrics:</para>
        <itemizedlist>
            <listitem>
                <para>The MD5 sum of each ring file.</para>
            </listitem>
            <listitem>
                <para>The most recent object replication time.</para>
            </listitem>
            <listitem>
                <para>Count of each type of quarantined file: Account,
                    container, or object.</para>
            </listitem>
            <listitem>
                <para>Count of "async_pendings" (deferred container
                    updates) on disk.</para>
            </listitem>
        </itemizedlist>
        <para>Swift Recon is middleware that is installed in the
            object servers pipeline and takes one required option: A
            local cache directory. To track
                <literal>async_pendings</literal>, you must set up an
            additional cron job for each object server. You access
            data by either sending HTTP requests directly to the
            object server or using the <command>swift-recon</command>
            command-line client.</para>
        <para>There are some good Object Storage cluster statistics
            but the general server metrics overlap with existing
            server monitoring systems. To get the Swift-specific
            metrics into a monitoring system, they must be polled.
            Swift Recon essentially acts as a middleware metrics
            collector. The process that feeds metrics to your
            statistics system, such as <literal>collectd</literal> and
                <literal>gmond</literal>, probably already runs on the
            storage node. So, you can choose to either talk to Swift
            Recon or collect the metrics directly.</para>
    </section>
    <section xml:id="monitoring-swift-informant">
        <title>Swift-Informant</title>
        <para>Florian Hines developed the Swift-Informant middleware
            (see <link
                xlink:href="http://pandemicsyn.posterous.com/swift-informant-statsd-getting-realtime-telem"
                >http://pandemicsyn.posterous.com/swift-informant-statsd-getting-realtime-telem</link>)
            to get real-time visibility into Object Storage client
            requests. It sits in the pipeline for the proxy server,
            and after each request to the proxy server, sends three
            metrics to a StatsD server (see <link
                xlink:href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/"
                >http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/</link>):</para>
        <itemizedlist>
            <listitem>
                <para>A counter increment for a metric like
                        <code>obj.GET.200</code> or
                        <code>cont.PUT.404</code>.</para>
            </listitem>
            <listitem>
                <para>Timing data for a metric like
                        <code>acct.GET.200</code> or
                        <code>obj.GET.200</code>. [The README says the
                    metrics look like
                        <code>duration.acct.GET.200</code>, but I do
                    not see the <literal>duration</literal> in the
                    code. I am not sure what the Etsy server does but
                    our StatsD server turns timing metrics into five
                    derivative metrics with new segments appended, so
                    it probably works as coded. The first metric turns
                    into <code>acct.GET.200.lower</code>,
                        <code>acct.GET.200.upper</code>,
                        <code>acct.GET.200.mean</code>,
                        <code>acct.GET.200.upper_90</code>, and
                        <code>acct.GET.200.count</code>].</para>
            </listitem>
            <listitem>
                <para>A counter increase by the bytes transferred for
                    a metric like
                    <code>tfer.obj.PUT.201</code>.</para>
            </listitem>
        </itemizedlist>
        <para>This is good for getting a feel for the quality of
            service clients are experiencing with the timing metrics,
            as well as getting a feel for the volume of the various
            permutations of request server type, command, and response
            code. Swift-Informant also requires no change to core
            Object Storage code because it is implemented as
            middleware. However, it gives you no insight into the
            workings of the cluster past the proxy server. If the
            responsiveness of one storage node degrades, you can only
            see that some of your requests are bad, either as high
            latency or error status codes. You do not know exactly why
            or where that request tried to go. Maybe the container
            server in question was on a good node but the object
            server was on a different, poorly-performing node.</para>
    </section>
    <section xml:id="monitoring-statsdlog">
        <title>Statsdlog</title>
        <para>Florian's <link
                xlink:href="https://github.com/pandemicsyn/statsdlog"
                >Statsdlog</link> project increments StatsD counters
            based on logged events. Like Swift-Informant, it is also
            non-intrusive, but statsdlog can track events from all
            Object Storage daemons, not just proxy-server. The daemon
            listens to a UDP stream of syslog messages and StatsD
            counters are incremented when a log line matches a regular
            expression. Metric names are mapped to regex match
            patterns in a JSON file, allowing flexible configuration
            of what metrics are extracted from the log stream.</para>
        <para>Currently, only the first matching regex triggers a
            StatsD counter increment, and the counter is always
            incremented by one. There is no way to increment a counter
            by more than one or send timing data to StatsD based on
            the log line content. The tool could be extended to handle
            more metrics for each line and data extraction, including
            timing data. But a coupling would still exist between the
            log textual format and the log parsing regexes, which
            would themselves be more complex to support multiple
            matches for each line and data extraction. Also, log
            processing introduces a delay between the triggering event
            and sending the data to StatsD. It would be preferable to
            increment error counters where they occur and send timing
            data as soon as it is known to avoid coupling between a
            log string and a parsing regex and prevent a time delay
            between events and sending data to StatsD.</para>
        <para>The next section describes another method for gathering
            Object Storage operational metrics.</para>
    </section>
    <section xml:id="monitoring-statsD">
        <title>Swift StatsD logging</title>
        <para>StatsD (see <link
                xlink:href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/"
                >http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/</link>)
            was designed for application code to be deeply
            instrumented; metrics are sent in real-time by the code
            that just noticed or did something. The overhead of
            sending a metric is extremely low: a <code>sendto</code>
            of one UDP packet. If that overhead is still too high, the
            StatsD client library can send only a random portion of
            samples and StatsD approximates the actual number when
            flushing metrics upstream.</para>
        <para>To avoid the problems inherent with middleware-based
            monitoring and after-the-fact log processing, the sending
            of StatsD metrics is integrated into Object Storage
            itself. The submitted change set (see <link
                xlink:href="https://review.openstack.org/#change,6058"
                >https://review.openstack.org/#change,6058</link>)
            currently reports 124 metrics across 15 Object Storage
            daemons and the tempauth middleware. Details of the
            metrics tracked are in the <link
                xlink:href="http://docs.openstack.org/developer/swift/admin_guide.html"
                >Administrator's Guide</link>.</para>
        <para>The sending of metrics is integrated with the logging
            framework. To enable, configure
                <code>log_statsd_host</code> in the relevant config
            file. You can also specify the port and a default sample
            rate. The specified default sample rate is used unless a
            specific call to a statsd logging method (see the list
            below) overrides it. Currently, no logging calls override
            the sample rate, but it is conceivable that some metrics
            may require accuracy (sample_rate == 1) while others may
            not.</para>
        <literallayout class="monospaced">[DEFAULT]
     ...
log_statsd_host = 127.0.0.1
log_statsd_port = 8125
log_statsd_default_sample_rate = 1</literallayout>
        <para>Then the LogAdapter object returned by
                <code>get_logger()</code>, usually stored in
                <code>self.logger</code>, has these new
            methods:</para>
        <itemizedlist>
            <listitem>
                <para><code>set_statsd_prefix(self, prefix)</code>
                    Sets the client library stat prefix value which
                    gets prefixed to every metric. The default prefix
                    is the "name" of the logger such as "object-server",
                    "container-auditor", and so on. This is currently used to
                    turn "proxy-server" into one of "proxy-server.Account",
                    "proxy-server.Container", or "proxy-server.Object"
                    as soon as the Controller object is determined and
                    instantiated for the request.</para>
            </listitem>
            <listitem>
                <para><code>update_stats(self, metric, amount,
                        sample_rate=1)</code> Increments the supplied
                    metric by the given amount. This is used when you
                    need to add or subtract more that one from a
                    counter, like incrementing "suffix.hashes" by the
                    number of computed hashes in the object
                    replicator.</para>
            </listitem>
            <listitem>
                <para><code>increment(self, metric,
                        sample_rate=1)</code> Increments the given
                    counter metric by one.</para>
            </listitem>
            <listitem>
                <para><code>decrement(self, metric,
                        sample_rate=1)</code> Lowers the given counter
                    metric by one.</para>
            </listitem>
            <listitem>
                <para><code>timing(self, metric, timing_ms,
                        sample_rate=1)</code> Record that the given
                    metric took the supplied number of
                    milliseconds.</para>
            </listitem>
            <listitem>
                <para><code>timing_since(self, metric, orig_time,
                        sample_rate=1)</code> Convenience method to
                    record a timing metric whose value is "now" minus
                    an existing timestamp.</para>
            </listitem>
        </itemizedlist>
        <para>Note that these logging methods may safely be called
            anywhere you have a logger object. If StatsD logging has
            not been configured, the methods are no-ops. This avoids
            messy conditional logic each place a metric is recorded.
            These example usages show the new logging methods:</para>
        <programlisting language="bash"># swift/obj/replicator.py
def update(self, job):
    # ...
    begin = time.time()
    try:
        hashed, local_hash = tpool.execute(tpooled_get_hashes, job['path'],
                do_listdir=(self.replication_count % 10) == 0,
                reclaim_age=self.reclaim_age)
        # See tpooled_get_hashes "Hack".
        if isinstance(hashed, BaseException):
            raise hashed
        self.suffix_hash += hashed
        self.logger.update_stats('suffix.hashes', hashed)
        # ...
    finally:
        self.partition_times.append(time.time() - begin)
        self.logger.timing_since('partition.update.timing', begin)</programlisting>
        <programlisting language="bash"># swift/container/updater.py
def process_container(self, dbfile):
    # ...
    start_time = time.time()
    # ...
        for event in events:
            if 200 &lt;= event.wait() &lt; 300:
                successes += 1
            else:
                failures += 1
        if successes > failures:
            self.logger.increment('successes')
            # ...
        else:
            self.logger.increment('failures')
            # ...
        # Only track timing data for attempted updates:
        self.logger.timing_since('timing', start_time)
    else:
        self.logger.increment('no_changes')
        self.no_changes += 1</programlisting>
        <para>The development team of StatsD wanted to use the <link
                xlink:href="https://github.com/sivy/py-statsd"
                >pystatsd</link> client library (not to be confused
            with a <link
                xlink:href="https://github.com/sivy/py-statsd"
                >similar-looking project</link> also hosted on
            GitHub), but the released version on PyPI was missing two
            desired features the latest version in GitHub had: the
            ability to configure a metrics prefix in the client object
            and a convenience method for sending timing data between
            "now" and a "start" timestamp you already have. So they
            just implemented a simple StatsD client library from
            scratch with the same interface. This has the nice fringe
            benefit of not introducing another external library
            dependency into Object Storage.</para>
    </section>
</section>