Merge "Add spec for creating a test metric database"
specs/test-metrics-db.rst (new file, 159 lines)

::

 This work is licensed under a Creative Commons Attribution 3.0
 Unported License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================
Test Metrics Database
=====================

https://storyboard.openstack.org/#!/story/156

Using subunit2sql, store data about gating test runs in a SQL database to
enable longer term analytics on testing in the gate. Also, construct a
dashboard for the stored data to visualize gating job trends over the period
of time for which there is data in the DB.

Problem Description
===================

Right now we capture test run artifacts and archive them for roughly one
release cycle on logs.o.o. In addition, logstash provides 10 days of
API-queryable information about the test runs. This has proved invaluable
both for ascertaining the current status of the gate and for debugging
failures. However, because no long term data is available, we are unable to
perform the analysis needed to categorize long term trends in our test suite.

This specifically becomes an issue when we want to optimize how we're using
tempest in the gate. For example, trimming down the tests we're gating on to
stay within a timing quota would be very difficult, because we don't have an
API we can use to figure out which tests never fail or which tests take the
longest to run on average.

Proposed Change
===============

Using the subunit2sql library, which was recently added to openstack-infra, a
new post-processing job (similar to how logs are processed for logstash) is
added to read the subunit file from each run and push it into a SQL DB set up
for storing the data. The SQL DB should be put on a separate server, possibly
provisioned with Trove. Having the data available in a DB allows API access
so that interesting analytics can be performed. However, to enable an
environment conducive to developing tooling around this analysis, public
read-only access to the DB will be needed; mysql-proxy will be used to
provide a public DB endpoint.
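
As a rough illustration of the post-processing step, the sketch below shows a
worker feeding a completed run's subunit file to the subunit2sql command line
tool. The connection string, file locations, and exact CLI options are
assumptions for illustration, not the final job definition::

    #!/usr/bin/env python
    # Hypothetical post-processing worker sketch: push one run's subunit
    # stream into the test metrics DB using the subunit2sql CLI. The
    # connection string, paths, and flags below are assumptions.

    import subprocess
    import sys

    # Placeholder connection string for the (assumed) Trove-hosted DB.
    DB_CONNECTION = ('mysql://subunit2sql:SECRET@'
                     'metrics-db.example.org/subunit2sql')


    def process_run(subunit_file):
        """Load a single gate run's subunit results into the SQL DB."""
        cmd = [
            'subunit2sql',
            '--database-connection', DB_CONNECTION,
            subunit_file,
        ]
        subprocess.check_call(cmd)


    if __name__ == '__main__':
        # e.g. process_run.py /path/to/run/testrepository.subunit
        process_run(sys.argv[1])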

On top of the DB a new dashboard will be created that visualizes the data
stored in the DB. Its basic functionality will be to show an analysis of the
performance and stability of individual tests over time, as well as the broad
long term trends in the jobs over the whole stored history (i.e., graphs of
test success and failure counts, and total run time for each run type over
the whole stored history).
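
For example, a dashboard graph of pass/fail counts over time could be driven
by a query along the lines of the sketch below. The table and column names
(runs, passes, fails, run_at) are assumptions about the subunit2sql schema,
used purely for illustration::

    # Hypothetical dashboard query sketch: daily pass/fail counts for the
    # stored history. Table and column names are assumed, not authoritative.

    from sqlalchemy import create_engine, text

    engine = create_engine(
        'mysql://query:SECRET@metrics-db.example.org/subunit2sql')

    DAILY_COUNTS = text("""
        SELECT DATE(run_at) AS day,
               SUM(passes)  AS passes,
               SUM(fails)   AS fails
        FROM runs
        GROUP BY DATE(run_at)
        ORDER BY day
    """)

    with engine.connect() as conn:
        for day, passes, fails in conn.execute(DAILY_COUNTS):
            print('%s: %s passed, %s failed' % (day, passes, fails))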

Additionally, storing the test timing data in a database will allow us to
eventually use the timing data with testr to perform scheduler optimizations.
Pre-seeding the timing data into testr is something which has been discussed
previously, but there was no good method of doing this before. However, this
is dependent on adding a new SQL repository type, based on subunit2sql, to
testrepository.
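
The kind of timing data the testr scheduler would want can be expressed as a
simple aggregate over the stored results. A hedged sketch, again using
assumed table and column names::

    # Hypothetical sketch: average runtime per test, the sort of data a
    # testr scheduler could be pre-seeded with. Schema names are assumed.

    from sqlalchemy import create_engine, text

    engine = create_engine(
        'mysql://query:SECRET@metrics-db.example.org/subunit2sql')

    AVG_TIMES = text("""
        SELECT test_id,
               AVG(TIMESTAMPDIFF(SECOND, start_time, stop_time)) AS avg_secs
        FROM test_runs
        WHERE status = 'success'
        GROUP BY test_id
    """)

    with engine.connect() as conn:
        # Map each test to its historical average runtime in seconds.
        timing = {test_id: float(avg)
                  for test_id, avg in conn.execute(AVG_TIMES)
                  if avg is not None}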

Alternatives
------------

An alternative approach for the data collection would be to build a log
scraping tool that collects the required information from the archived logs
instead of from the subunit file directly. However, that seems like an
error-prone and less efficient approach.

As an alternative to using a SQL DB, a different data storage mechanism could
be used. For example, Hadoop, a NoSQL DB, and Graphite have all been
considered as alternatives. However, because the parts of the subunit data we
need are already highly structured, using SQL makes sense for the storage.
Using SQL also enables us to exploit other tooling to make the task simpler.

Implementation
==============

Assignee(s)
-----------

Matthew Treinish <mtreinish@kortar.org>
Clark Boylan <cboylan@sapwetik.org>

Work Items
----------

* Add a server for the SQL DB and maybe the dashboard (possibly using Trove)
* Add post-processing workers (similar to how logstash does it) to parse the
  subunit output from each run and push it to the DB using subunit2sql
* Write a dashboard script that queries the database and visualizes the
  metrics we want from it
* Start running the dashboard on top of the DB
* Add a SQL repository type using subunit2sql to testrepository
* Switch the dsvm jobs to use the SQL repository in testr to access the
  historical test data

Repositories
------------

The only required repository is subunit2sql, which has already been created.
However, the dashboard script will need to be stored somewhere; it will be
specific enough to the CI environment that it probably doesn't belong in the
subunit2sql repo. Depending on its size we can keep it directly in
openstack-infra/config. However, if it ends up being sufficiently large we
might need to create a separate repository to store it.

Servers
-------

To enable this a new SQL database is required. While a new server is not
strictly required, based on the estimated load from running subunit2sql for
each test run in the gate it probably makes the most sense. It makes sense to
use Trove here to spin up the database. Another consideration is enabling
public read access to the database, which will be needed to allow development
on top of the data, as well as for incorporating the data into the testr
scheduler on gate slaves.
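
One way to keep the public endpoint limited to reads, whichever proxy fronts
it, is to back it with a database account that only has SELECT rights. A
minimal sketch, assuming a MySQL backend and hypothetical account and host
names::

    # Hypothetical sketch: a SELECT-only account for the public read path.
    # Account, host, and database names here are assumptions, not decisions.

    from sqlalchemy import create_engine, text

    admin = create_engine(
        'mysql://root:SECRET@metrics-db.example.org/subunit2sql')

    with admin.connect() as conn:
        # Only allow connections that come through the (assumed) mysql-proxy
        # host, and only allow reads.
        conn.execute(text(
            "CREATE USER 'query'@'proxy.example.org' IDENTIFIED BY 'query'"))
        conn.execute(text(
            "GRANT SELECT ON subunit2sql.* TO 'query'@'proxy.example.org'"))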

DNS Entries
-----------

A DNS entry will need to be created for the new DB server so that we can have
the post-processing job send the results to it. Additionally, the dashboard
view will need to be given an address; however, a separate DNS entry probably
isn't required for that, since a new path on an existing webserver is
probably sufficient (e.g., status.o.o/test-stats). There is also going to be
a mysql-proxy setup needed to enable public read access to the database,
which will also need to be addressable.

Documentation
-------------

subunit2sql is lacking substantial documentation right now. This
documentation will include both operational instructions for subunit2sql and
a DB API and schema guide. Additionally, a new doc for setting up the test
metrics workflow should also be added.

Security
--------

There shouldn't be any additional security implications besides the risks
associated with running a new server and SQL DB. Care also needs to be taken
when storing the credentials the post-processing jobs use to access the DB
server. Setting up a mysql-proxy to allow public read access to the DB does
open up a new potential attack entry point.

Testing
-------

There shouldn't be any additional testing required. Both subunit2sql and the
dashboard should have their own unit testing.

Dependencies
============

- This BP will be the first use of subunit2sql in the infra workflow
- mysql-proxy will also need to be installed and configured