Document what it takes to be a check/gate test.

This is my recollection of the consensus from some infra folk on a train late at night: it's probably wrong, but I wanted something I can point contributors at.

Change-Id: Ic1ad99335ce41481995322f0ee5daadb08a09c2a
2013-11-11 09:26:53 +13:00 · dd9a25646f
commit dd9a25646f
parent c53349587b
2 changed files with 75 additions and 0 deletions
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -29,6 +29,7 @@ Contents:
   :maxdepth: 2
   project
   test-infra-requirements
   sysadmin
   systems
--- a/doc/source/test-infra-requirements.rst
+++ b/doc/source/test-infra-requirements.rst
@@ -0,0 +1,74 @@
 Test infrastructure Requirements
 ################################
 Overview
 ========
 There are several ways that tests can be run. Each has different
 trade-offs between cost, reliability and test coverage.
 The primary goal for OpenStack test infrastructure is to deliver highly
 reliable testing: with 500 patches successfully getting through the OpenStack
 gate on a peak day, even short service interruptions have a significant impact
 on project velocity.
 The same velocity makes it extremely risky to disable tests: once disabled, a
 test is likely to bitrot quickly, which makes re-enabling it hard.
 This gives the following principle:
 * Test runs that can stop a patch landing must be highly available - there must
  be at least two distinct places the test can be run, with no shared failure
  domains other than things that the infra team itself is responsible for.
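
One way to read the principle above is as a check over the places a job can
run. The following is a minimal sketch under stated assumptions, not an infra
tool: ``RunLocation``, ``is_highly_available`` and every failure domain name
in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RunLocation:
    """A place where a job can run (hypothetical model, not an infra API)."""
    name: str
    failure_domains: frozenset


def is_highly_available(locations, infra_owned):
    """Return True if at least two locations share only infra-owned domains."""
    for first, second in combinations(locations, 2):
        shared = first.failure_domains & second.failure_domains
        # Shared domains are acceptable only if the infra team owns all of them.
        if shared <= infra_owned:
            return True
    return False


# Illustrative domain names; assumptions, not real infra inventory.
INFRA_OWNED = frozenset({"gerrit", "infra-job-scheduler"})

two_clouds = [
    RunLocation("cloud-a", frozenset({"gerrit", "infra-job-scheduler", "cloud-a-network"})),
    RunLocation("cloud-b", frozenset({"gerrit", "infra-job-scheduler", "cloud-b-network"})),
]
single_lab = [
    RunLocation("lab-rack-1", frozenset({"infra-job-scheduler", "lab-power"})),
    RunLocation("lab-rack-2", frozenset({"infra-job-scheduler", "lab-power"})),
]

print(is_highly_available(two_clouds, INFRA_OWNED))  # True
print(is_highly_available(single_lab, INFRA_OWNED))  # False
```

In this sketch the two racks fail the check because they share ``lab-power``,
which the infra team does not control, even though there are two of them.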
 Test run styles
 ===============
 Experimental
 ------------
 Experimental jobs have low reliability requirements: they are run by hand on
 developer request, typically as part of bringing up a new job definition.
 Failures in experimental jobs are not the responsibility of openstack-infra,
 though the infra team will offer best-effort assistance to developers.
 Silent
 ------
 Silent jobs are not yet ready to vote on code changes. They might
 not be ready because of known failures, a lack of redundancy in the
 infrastructure or some other reason. In all other regards they are the same
 as Check jobs, which means we learn how reliable the test is and can
 accurately assess whether the job is ready to promote to Check status.
 Third party
 -----------
 Third party test jobs are able to vote on code changes (+1 or -1 only). These
 jobs are run by third parties on code pushes, but are not able to prevent code
 landing, although project developers usually take negative votes from third
 party systems seriously. Third party test jobs cannot be gates, and cannot
 set the '+2 verified' flag on review.
 Check
 -----
 Check jobs are used to verify each patch pushed to Gerrit. Like a third party
 test job, they run against a single pushed patch, rather than the proposed
 merged state of the repository. A failure reported by a check job will prevent
 the patch being approved. As such, check jobs have to run in a highly available
 environment with only infra-controlled components permitted to have shared
 failure domains.
 Gate
 ----
 Gate jobs are used to detect failures in patches after they are approved. They
 run against the state the OpenStack projects will have if the code is merged
 (rather than the state of the pushed code). This allows detection of semantic
 conflicts between patches, which is crucial because the usual state for
 OpenStack is that multiple patches are going through the gate at once. Failures
 in the gate both prevent the patch landing and cause all the pending patches
 after it to be retested. Gate jobs, like check jobs, have to run in a highly
 available environment with only infra-controlled components permitted to have
 shared failure domains.
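
The difference between check and gate can be illustrated with a toy queue.
This is a simplified sketch, not the actual gate implementation: ``run_gate``,
``passes`` and the patch names are hypothetical, and it assumes each pending
patch is tested on top of everything ahead of it in the queue.

```python
def run_gate(queue, passes):
    """Simulate a gate queue (illustrative only).

    queue:  patch names in approval order.
    passes: callable(state, patch) -> bool, where state is the list of
            patches assumed to be merged ahead of `patch`.
    Returns (merged, rejected, runs), where runs counts test runs per patch.
    """
    merged, rejected = [], []
    runs = {patch: 0 for patch in queue}
    pending = list(queue)
    while pending:
        results = []
        for i, patch in enumerate(pending):
            runs[patch] += 1
            # Test against the proposed merged state, not the patch alone.
            results.append(passes(merged + pending[:i], patch))
        if all(results):
            merged.extend(pending)
            break
        first_bad = results.index(False)
        merged.extend(pending[:first_bad])    # patches ahead of the failure land
        rejected.append(pending[first_bad])   # the failing patch does not land
        pending = pending[first_bad + 1:]     # everything behind it is retested
    return merged, rejected, runs


def passes(state, patch):
    # Hypothetical semantic conflict: one patch renames a helper that
    # another patch still calls by its old name.
    return not (patch == "use-old-helper" and "rename-helper" in state)


# Check-style run: the patch alone, against the current repository state.
print(passes([], "use-old-helper"))  # True

merged, rejected, runs = run_gate(
    ["rename-helper", "use-old-helper", "fix-typo"], passes
)
print(merged)    # ['rename-helper', 'fix-typo']
print(rejected)  # ['use-old-helper']
print(runs)      # {'rename-helper': 1, 'use-old-helper': 1, 'fix-typo': 2}
```

Here ``use-old-helper`` passes when tested alone but fails in the gate, and
``fix-typo`` is run twice because it was queued behind the failing patch.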