Add logstash documentation.
* doc/source/logstash.rst: Add documentation on our Logstash system
  architecture and how to query logstash.

Change-Id: I9da3e6d6391081131d1fd852230ddac6326c01a2
Reviewed-on: https://review.openstack.org/31257
Reviewed-by: James E. Blair <corvus@inaugust.com>
Reviewed-by: Elizabeth Krumbach Joseph <lyz@princessleia.com>
Approved: Jeremy Stanley <fungi@yuggoth.org>
Reviewed-by: Jeremy Stanley <fungi@yuggoth.org>
Tested-by: Jenkins
parent c42c7acdc4 · commit 6881008de0
@@ -12,7 +12,7 @@ At a Glance

:Hosts:
  * http://logstash.openstack.org
  * logstash-worker\*.openstack.org
  * elasticsearch.openstack.org
:Puppet:
  * :file:`modules/logstash`
@@ -21,13 +21,16 @@ At a Glance

  * :file:`modules/openstack_project/manifests/elasticsearch.pp`
:Configuration:
  * :file:`modules/openstack_project/files/logstash`
  * :file:`modules/openstack_project/templates/logstash`
:Projects:
  * http://logstash.net/
  * http://kibana.org/
  * http://www.elasticsearch.org/
:Bugs:
  * http://bugs.launchpad.net/openstack-ci
  * https://logstash.jira.com/secure/Dashboard.jspa
  * https://github.com/rashidkpc/Kibana/issues
  * https://github.com/elasticsearch/elasticsearch/issues

Overview
========
@@ -38,7 +41,186 @@ sources in a single test run, searching for errors or particular

events within a test run, as well as searching for log event trends
across test runs.

System Architecture
===================

There are four major layers in our Logstash setup.

1. Log Pusher Script.
   Subscribes to the Jenkins ZeroMQ Event Publisher listening for build
   finished events. When a build finishes, this script fetches the logs
   generated by that build, chops them up, annotates them with Jenkins
   build info, and finally sends them to a Logstash indexer process.
2. Logstash Indexer.
   Reads these log events from the log pusher, filters them to remove
   unwanted lines, collapses multiline events together, and parses
   useful information out of the events before shipping them to
   ElasticSearch for storage and indexing.
3. ElasticSearch.
   Provides log storage, indexing, and search.
4. Kibana.
   A Logstash-oriented web client for ElasticSearch. You can perform
   queries on your Logstash logs in ElasticSearch through Kibana using
   the Lucene query language.

Each layer scales horizontally. As the number of logs grows we can add
more log pushers, more Logstash indexers, and more ElasticSearch nodes.
Currently we have multiple Logstash worker nodes that pair a log pusher
with a Logstash indexer. We did this because each Logstash process can
only dedicate a single thread to filtering log events, which becomes a
bottleneck very quickly. This looks something like:

::

      _ logstash-worker1 _
     /                    \
  jenkins -- logstash-worker2 -- elasticsearch -- kibana
     \_                  _/
        logstash-worker3

Log Pusher
----------

This is a simple Python script that is given a list of log files to push
to Logstash when Jenkins builds complete.

Log pushing looks like this:

* Jenkins publishes build complete notifications.
* The log pusher receives the notification from Jenkins.
* Using info in the notification, log files are retrieved.
* Log files are processed then shipped to Logstash.

In the near future this script will be modified to act as a Gearman
worker so that we can add an arbitrary number of them without needing
to partition the log files that each worker handles by hand. Instead,
each worker will be able to fetch and push any log file and will do
so as directed by Gearman.

If you are interested in the technical details, the source of this
script can be found at
:file:`modules/openstack_project/files/logstash/log-pusher.py`
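The steps above can be sketched in Python. This is an illustrative
sketch only, not the real log-pusher.py; the notification layout and the
names used here (``name``, ``build``, ``log_url``, ``FILES_TO_PUSH``)
are assumptions for the example:

```python
import json

# Hypothetical list of log files each worker is configured to push.
FILES_TO_PUSH = ["logs/syslog.txt", "logs/screen-n-api.txt"]


def logs_for_build(notification, files=FILES_TO_PUSH):
    """Turn a build-finished notification into annotated log fetch jobs."""
    build = notification["build"]
    jobs = []
    for name in files:
        jobs.append({
            # URL to fetch the log file from (assumed log server layout).
            "url": "%s/%s" % (build["log_url"].rstrip("/"), name),
            # Jenkins build info used to annotate every log line sent
            # on to the Logstash indexer.
            "fields": {
                "build_name": notification["name"],
                "build_number": str(build["number"]),
                "filename": name,
            },
        })
    return jobs


# A made-up build-finished event, as it might arrive over ZeroMQ.
notification = json.loads(
    '{"name": "gate-foo",'
    ' "build": {"number": 10, "log_url": "http://logs.example.org/10/"}}'
)
jobs = logs_for_build(notification)
```

Each resulting job would then be fetched, chopped up, and shipped to the
indexer with its ``fields`` attached.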

Logstash
--------

Logstash does the heavy lifting of squashing all of our log lines into
events with a common format. It reads the JSON log events from the log
pusher connected to it, deletes events we don't want, parses log lines
to set the timestamp, message, and other fields for the event, then
ships these processed events off to ElasticSearch where they are stored
and made queryable.

At a high level Logstash takes:

::

  {
    "fields": {
      "build_name": "gate-foo",
      "build_number": "10",
      "event_message": "2013-05-31T17:31:39.113 DEBUG Something happened"
    }
  }

And turns that into:

::

  {
    "fields": {
      "build_name": "gate-foo",
      "build_number": "10",
      "loglevel": "DEBUG"
    },
    "@message": "Something happened",
    "@timestamp": "2013-05-31T17:31:39.113Z"
  }

It flattens each log line into something that looks very much like
all of the other events regardless of the source log line format. This
makes querying your logs for lines from a specific build that failed
between two timestamps with specific message content very easy. You
don't need to write complicated greps; instead you query against a
schema.

The config file that tells Logstash how to do this flattening can be
found at
:file:`modules/openstack_project/templates/logstash/indexer.conf.erb`
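The flattening above can be sketched in Python (the real rules are
Logstash filters in indexer.conf.erb; the regular expression here is an
assumed stand-in for those filters, matching only the example line
format shown above):

```python
import re

# Assumed line format: "<ISO timestamp> <LEVEL> <message>".
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<loglevel>[A-Z]+) (?P<message>.*)$"
)


def flatten(event):
    """Parse event_message into top-level @timestamp/@message fields."""
    fields = dict(event["fields"])
    match = LINE_RE.match(fields.pop("event_message"))
    fields["loglevel"] = match.group("loglevel")
    return {
        "fields": fields,
        "@message": match.group("message"),
        "@timestamp": match.group("timestamp") + "Z",
    }


raw = {
    "fields": {
        "build_name": "gate-foo",
        "build_number": "10",
        "event_message": "2013-05-31T17:31:39.113 DEBUG Something happened",
    }
}
flat = flatten(raw)
```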

ElasticSearch
-------------

ElasticSearch is basically a REST API layer for Lucene. It provides
the storage and search engine for Logstash. It scales horizontally and
loves it when you give it more memory. Currently we run a single node
cluster on a large VM to give ElasticSearch both memory and disk space.
Per index (Logstash creates one index per day) we have one replica (on
the same node; this does not provide HA, but it speeds up searches) and
five shards (each shard is basically its own index; having multiple
shards increases indexing throughput).

As this setup grows and handles more logs we may need to add more
ElasticSearch nodes and run a proper cluster. We haven't reached that
point yet, but it will probably be necessary as disk and memory
footprints increase.
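The per-index layout described above corresponds to ElasticSearch's
``number_of_shards`` and ``number_of_replicas`` index settings. A sketch
of the settings body one might PUT when creating such an index (our
actual values are managed by puppet, not this snippet):

```python
# Settings matching the layout described above: five primary shards for
# indexing throughput, one replica of each for faster searches (not HA,
# since the replica lives on the same node).
index_settings = {
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
    }
}

# Each daily index therefore holds 5 primaries + 5 replicas:
total_shard_copies = (
    index_settings["settings"]["number_of_shards"]
    * (1 + index_settings["settings"]["number_of_replicas"])
)
```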

Kibana
------

Kibana is a Ruby app sitting behind Apache that provides a nice web UI
for querying Logstash events stored in ElasticSearch. Our install can
be reached at http://logstash.openstack.org. See
:ref:`query-logstash` for more info on using Kibana to perform
queries.

.. _query-logstash:

Querying Logstash
=================

Hop on over to http://logstash.openstack.org and by default you get the
last 15 minutes of everything Logstash knows about in chunks of 100.
We run a lot of tests, but it is possible no logs have come in over the
last 15 minutes; change the dropdown in the top left from ``Last 15m``
to ``Last 60m`` to get a better window on the logs. At this point you
should see a list of logs; if you click on a log event it will expand
and show you all of the fields associated with that event and their
values (note: Chromium and Kibana seem to have trouble with this at
times and some fields end up without values; use Firefox if this
happens). You can search based on all of these fields, and if you click
the magnifying glass next to a field in the expanded event view it will
add that field and value to your search. This is a good way of refining
searches without a lot of typing.

The above is good info for poking around in the Logstash logs, but
suppose one of your changes has a failing test and you want to know
why. We can jumpstart the refining process with a simple query.

``@fields.build_change:"$FAILING_CHANGE" AND @fields.build_patchset:"$FAILING_PATCHSET" AND @fields.build_name:"$FAILING_BUILD_NAME" AND @fields.build_number:"$FAILING_BUILD_NUMBER"``

This will show you all logs available from the patchset and build pair
that failed. Chances are that this is still a significant number of
logs and you will want to do more filtering. You can add more filters
to the query using ``AND`` and ``OR``, and parentheses can be used to
group sections of the query. Potential additions to the above query
might be:

* ``AND @fields.filename:"logs/syslog.txt"`` to get syslog events.
* ``AND @fields.filename:"logs/screen-n-api.txt"`` to get Nova API events.
* ``AND @fields.loglevel:"ERROR"`` to get ERROR level events.
* ``AND @message:"error"`` to get events with error in their message.

and so on.

General query tips:

* Don't search ``All time``. ElasticSearch is bad at trying to find all
  the things it ever knew about. Give it a window of time to look
  through. You can use the presets in the dropdown to select a window or
  use the ``foo`` to ``bar`` boxes above the frequency graph.
* Only the @message field can have fuzzy searches performed on it. Other
  fields require specific information.
* This system is growing fast and may not always keep up with the load.
  Be patient. If expected logs do not show up immediately after the
  Jenkins job completes, wait a few minutes.
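A small helper can assemble the failing-build query above without
hand-typing every term. This is a convenience sketch only; the field
names come from the examples in this section, and the sample values are
made up:

```python
def build_query(change, patchset, build_name, build_number, extra=()):
    """Build a Lucene query string for one failing patchset/build pair.

    extra is an optional sequence of additional clauses (already in
    field:"value" form) to AND onto the end.
    """
    terms = [
        '@fields.build_change:"%s"' % change,
        '@fields.build_patchset:"%s"' % patchset,
        '@fields.build_name:"%s"' % build_name,
        '@fields.build_number:"%s"' % build_number,
    ]
    terms.extend(extra)
    return " AND ".join(terms)


# Example: all ERROR-level events from a hypothetical failing build.
query = build_query("12345", "2", "gate-foo", "10",
                    extra=['@fields.loglevel:"ERROR"'])
```

The resulting string can be pasted directly into the Kibana search box.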