From dd9a25646f1711861b1eaaf6911fcf4ee8fa5505 Mon Sep 17 00:00:00 2001
From: Robert Collins <rbtcollins@hp.com>
Date: Mon, 11 Nov 2013 09:26:53 +1300
Subject: [PATCH] Document what it takes to be a check/gate test.

This is my recollection of the consensus from some infra folk on a
train late at night: it's probably wrong, but I wanted something I can
point contributors at.

Change-Id: Ic1ad99335ce41481995322f0ee5daadb08a09c2a
---
 doc/source/index.rst                   |  1 +
 doc/source/test-infra-requirements.rst | 74 ++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)
 create mode 100644 doc/source/test-infra-requirements.rst

diff --git a/doc/source/index.rst b/doc/source/index.rst
index 3306c108de..16a5f0b1e8 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -29,6 +29,7 @@ Contents:
    :maxdepth: 2
 
    project
+   test-infra-requirements
    sysadmin
    systems
 
diff --git a/doc/source/test-infra-requirements.rst b/doc/source/test-infra-requirements.rst
new file mode 100644
index 0000000000..ba38f1b186
--- /dev/null
+++ b/doc/source/test-infra-requirements.rst
@@ -0,0 +1,74 @@
+Test Infrastructure Requirements
+################################
+
+Overview
+========
+
+Tests can be run in several ways, each with different trade-offs between
+cost, reliability and test coverage.
+
+The primary goal for OpenStack test infrastructure is to deliver highly
+reliable testing: with 500 patches successfully getting through the OpenStack
+gate on a peak day, even short service interruptions have a significant impact
+on project velocity.
+
+The same velocity makes it extremely risky to disable tests: once disabled, a
+test is likely to bitrot quickly, making it hard to re-enable.
+
+This gives the following principle:
+
+* Test runs that can stop a patch landing must be highly available - there must
+  be at least two distinct places the test can be run, with no shared failure
+  domains other than things that the infra team itself is responsible for.
+
+Test run styles
+===============
+
+Experimental
+------------
+
+Experimental jobs have low reliability requirements: they are run by hand on
+developer request, typically as part of bringing up a new job definition.
+Failures in experimental jobs are not the responsibility of openstack-infra,
+though the infra team will offer best-effort assistance to developers.
+
+Silent
+------
+
+Silent jobs are not yet ready to vote on code changes. They might not be ready
+because of known failures, a lack of redundancy in the infrastructure or some
+other reason. In all other regards they are the same as Check jobs, which
+means we learn how reliable the test is and can accurately assess whether the
+job is ready to be promoted to Check status.
+
+Third party
+-----------
+
+Third party test jobs are able to vote on code changes (+1/-1 only). These
+jobs are run by third parties on code pushes, but are not able to prevent code
+landing. (Developers of projects usually take negative votes from third party
+systems seriously, however.) Third party test jobs cannot be gates, and cannot
+set the '+2 verified' flag on review.
+
+Check
+-----
+
+Check jobs are used to verify each patch pushed to Gerrit. Like a third party
+test job they run against a single pushed patch, rather than the proposed
+merged state of the repository. A failure reported by a check job will prevent
+the patch being approved. As such, check jobs have to run in a highly available
+environment with only infra-controlled components permitted to have shared
+failure domains.
+
+Gate
+----
+
+Gate jobs are used to detect failures in patches after they are approved. They
+run against the state the OpenStack projects will have if the code is merged
+(rather than the state of the pushed code). This allows detection of
+cross-patch semantic conflicts (the usual state for OpenStack is that multiple
+patches are going through the gate at once, so this is crucial). Failures in
+the gate both prevent the patch landing and cause all the pending patches after
+it to be retested. Gate jobs, like check jobs, have to run in a highly
+available environment with only infra-controlled components permitted to have
+shared failure domains.