Tripleo CI improvements

Implement several changes to the current tripleo ci to provide more
reliable results.

Change-Id: I5ec5af169b2d2a3f8d551915fa9a949e341fdc39
Derek Higgins 2014-05-22 22:57:08 +01:00

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=======================
TripleO CI improvements
=======================

https://blueprints.launchpad.net/tripleo/+spec/tripleo-juno-ci-improvements

TripleO CI is painful at the moment: we have problems with both the
reliability of jobs and the consistency of their run times. This spec is
intended to address a number of the problems we have been facing.

Problem Description
===================

Developers should be able to depend on CI to produce reliable test results,
with a minimum number of false negatives, reported in a timely fashion; this
currently isn't the case. To date the reliability of TripleO CI has been
heavily affected by network glitches, the availability of network resources
and the reliability of the CI clouds. This spec is intended to deal with the
problems we have been seeing.

**Problem:** Reliability of hp1 (hp1_reliability_)
Intermittent failures on the hp1 cloud have been causing a large number of
job failures and have sometimes taken this region down altogether. Current
thinking is that the root cause of most of these issues is a problem with a
Mellanox driver.

**Problem:** Unreliable access to network resources (net_reliability_)
Access to various network resources has been inconsistent, causing a CI
outage whenever any one of them is unavailable. Inconsistent download speeds
for these resources also make it difficult to gauge overall speed
improvements made to TripleO.

**Problem:** (system_health_) The health of the overall CI system isn't
immediately obvious; problems often persist for hours (or occasionally days)
before we react to them.

**Problem:** (ci_run_times_) The TripleO devtest story takes a long time to
run, which uses up CI resources and developers' time. Where possible we
should reduce the time required to run devtest.

**Problem:** (inefficient_usage_) Hardware on which to run TripleO is a
finite resource. There is a spec in place to run devtest on an OpenStack
deployment [1]; this is the best way forward in order to use the resources we
have in the most efficient way possible. We also have a number of options to
explore that would help minimise resource wastage.

**Problem:** (system_feedback_) Our CI provides no feedback about trends.
A good CI system should be more than a system that reports pass or fail; we
should be getting feedback on metrics that allows us to observe degradations,
and where possible we should make use of services already provided by infra.
This will allow us to intervene proactively as CI begins to degrade.

**Problem:** (bug_frequency_) We currently have no indication of which CI
bugs are occurring most often. This frustrates efforts to make CI more
reliable.

**Problem:** (test_coverage_) Currently CI only tests a subset of what it
should.

Proposed Change
===============

There are a number of changes required in order to address the problems we
have been seeing; each is listed here in order of priority.

.. _hp1_reliability:

**Solution:**

* Temporarily scale back on CI by removing one of the overcloud jobs (so that
  rh1 has the capacity to run CI on its own).
* Remove hp1 from the configuration.
* Run burn-in tests on each hp1 host, removing (or repairing) failing hosts.
  Burn-in tests should consist of running CI on a newly deployed cloud,
  matching the load expected to run on the region. The failure rate should
  not exceed that of currently deployed regions.
* Redeploy the testing infrastructure on hp1 and test with tempest. This
  redeploy should be done with our TripleO scripts so that it can be repeated
  and we are sure of parity between ci-overcloud deployments.
* Place hp1 back into CI and monitor the situation.
* Add back any removed CI jobs.
* Ensure the burn-in / tempest tests are followed for future regions being
  deployed.
* Attempts should be made to deal with problems that develop on already
  deployed clouds; if it becomes obvious after 48 hours that they can't be
  dealt with quickly, the affected region should be temporarily removed from
  the CI infrastructure and will need to pass the burn-in tests before being
  added back into production.

.. _net_reliability:

**Solution:**

* Deploy a mirror of pypi.openstack.org in each region.
* Deploy a mirror of the Fedora and Ubuntu package repositories in each
  region.
* Deploy squid in each region and cache HTTP traffic through it. Mirroring
  should be our preference where possible, but having squid in place should
  cache any resources not mirrored (see the sketch after this list).
* Mirror other resources (e.g. github.com, Percona tarballs, etc.).
* Any new requirement added to devtest should be cacheable, with caches in
  place before the requirement is added.
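
The intent is that job traffic flows through a per-region cache rather than
straight out to the internet. A minimal sketch of what this could look like
from inside a job follows; the proxy hostname, port and URL are illustrative
only and not part of this spec::

  # Route a job's HTTP fetches through a per-region squid cache.
  import os
  import requests

  REGION_PROXY = "http://cache.example.region:3128"  # hypothetical endpoint

  # Most tools used by devtest (pip, curl, wget) honour these variables, so
  # exporting them sends cacheable traffic through the region's cache.
  os.environ["http_proxy"] = REGION_PROXY
  os.environ["https_proxy"] = REGION_PROXY

  # requests can also be pointed at the cache explicitly.
  resp = requests.get("http://pypi.openstack.org/",
                      proxies={"http": REGION_PROXY, "https": REGION_PROXY},
                      timeout=60)
  print(resp.status_code, len(resp.content))
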
.. _system_health:

**Solution:**

* Monitor our CI clouds and testenvs with Icinga; monitoring should include
  ping, starting (and connecting to) new instances, disk usage, etc.
* Monitor CI test results and trigger an alert if "X" jobs of the same type
  fail in succession (a sketch of such a check follows this list). An example
  of using logstash to monitor CI results can be found here [5].
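
A minimal sketch of such a check, assuming the job results are indexed in an
infra logstash/elasticsearch instance; the endpoint, field names and job name
below are assumptions for illustration only::

  # Alert when the last THRESHOLD results for a job are all failures.
  import json
  import requests

  LOGSTASH_URL = "http://logstash.example.org:9200/_search"  # hypothetical
  JOB = "check-tripleo-overcloud-f20"                        # hypothetical job
  THRESHOLD = 3  # alert after this many failures in a row

  query = {
      "size": THRESHOLD,
      "sort": [{"@timestamp": {"order": "desc"}}],
      "query": {"term": {"build_name": JOB}},
  }
  resp = requests.post(LOGSTASH_URL, data=json.dumps(query),
                       headers={"Content-Type": "application/json"},
                       timeout=30)
  recent = [h["_source"].get("build_status")
            for h in resp.json()["hits"]["hits"]]

  if len(recent) == THRESHOLD and all(s == "FAILURE" for s in recent):
      # Hook this into Icinga (e.g. as a passive check) or send an email.
      print("ALERT: %s has failed %d times in succession" % (JOB, THRESHOLD))
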
Once consistency is no longer a problem we will investigate improvements we
can make to the speed of CI jobs.

.. _ci_run_times:

**Solution:**

* Investigate whether unsafe disk caching strategies will speed up disk image
  creation; if an improvement is found, implement it in production CI using
  one of the following (a sketch of the libvirt setting follows this list):

  * run an "unsafe" disk caching strategy on CI cloud VMs (this would involve
    exposing this libvirt option via the nova API).
  * use "eatmydata" to noop disk sync system calls; it is not currently
    packaged for F20 but we could try to restart that process [2].
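
A minimal sketch of the libvirt side of the first option, assuming direct
access to a test VM's domain XML (the domain name is hypothetical; in
production CI the same attribute would need to be exposed via nova)::

  # Flip every disk driver in a libvirt domain definition to cache='unsafe'.
  import subprocess
  import xml.etree.ElementTree as ET

  DOMAIN = "testenv-seed"  # hypothetical libvirt domain name

  xml = subprocess.check_output(["virsh", "dumpxml", DOMAIN])
  tree = ET.fromstring(xml)
  for driver in tree.findall("./devices/disk/driver"):
      # Trades crash safety for much faster, host-cached writes; acceptable
      # for throwaway CI instances.
      driver.set("cache", "unsafe")

  with open("%s-unsafe.xml" % DOMAIN, "wb") as f:
      f.write(ET.tostring(tree))
  # The edited definition can then be re-loaded with "virsh define".
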
.. _inefficient_usage:

**Solution:**

* Abandon on failure: add a feature to zuul (or turn it on if it already
  exists) to abandon all jobs in a queue for a particular commit as soon as a
  voting job fails. This would minimize the resources spent on long running
  jobs that we already know will have to be rechecked.
* Adding the collectl element to compute nodes and testenv hosts will allow
  us to find bottlenecks and also identify places where it is safe to
  overcommit (e.g. we may find that heavily overcommitting CPU on testenv
  hosts is viable).

.. _system_feedback:

**Solution:**

* Use a combination of logstash and graphite to (a sketch of pushing metrics
  to graphite follows this list):

  * output graphs of occurrences of false negative test results.
  * output graphs of CI run times over time in order to identify trends.
  * output graphs of CI job peak memory usage over time.
  * output graphs of CI image sizes over time.
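
A minimal sketch of feeding one of these metrics into graphite over its
plaintext protocol; the graphite host and metric names are placeholders, and
in practice we would reuse whatever setup infra already provides::

  # Push a single CI measurement to graphite ("<path> <value> <timestamp>\n").
  import socket
  import time

  GRAPHITE_HOST = "graphite.example.org"  # hypothetical host
  GRAPHITE_PORT = 2003                    # default plaintext listener port

  def send_metric(path, value, timestamp=None):
      timestamp = int(timestamp or time.time())
      line = "%s %s %d\n" % (path, value, timestamp)
      sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                      timeout=10)
      try:
          sock.sendall(line.encode("ascii"))
      finally:
          sock.close()

  # e.g. record an overcloud job's run time and peak memory usage
  send_metric("tripleo.ci.overcloud.run_time_seconds", 5820)
  send_metric("tripleo.ci.overcloud.peak_memory_mb", 14336)
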
.. _bug_frequency:

**Solution:**

* In order to track the false negatives that are hurting us most, we should
  agree not to use "recheck no bug" and instead recheck with the relevant bug
  number. Adding signatures to elastic-recheck for known CI issues should
  help uptake of this.

.. _test_coverage:

**Solution:**

* Run tempest against the deployed overcloud.
* Test our upgrade story by upgrading to new images. Initially, to avoid
  having to build new images, we can edit something on the overcloud qcow
  images in place in order to get a set of images to upgrade to [3] (see the
  sketch after this list).
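
A minimal sketch of the "edit the image in place" approach, assuming the
libguestfs python bindings are available; the image filename and marker file
are illustrative only::

  # Make a visible change inside an overcloud qcow2 so it can serve as the
  # "new" image in an upgrade test.
  import guestfs

  IMAGE = "overcloud-control.qcow2"  # hypothetical image name

  g = guestfs.GuestFS(python_return_dict=True)
  g.add_drive_opts(IMAGE, format="qcow2", readonly=0)
  g.launch()

  root = g.inspect_os()[0]              # locate the image's root filesystem
  g.mount(root, "/")
  g.write("/etc/ci-upgrade-marker", "upgraded by CI\n")

  g.shutdown()
  g.close()
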
Alternatives
------------

* As an alternative to deploying our own distro mirrors we could simply point
  directly at a mirror known to be reliable. This is undesirable as a
  long-term solution because we still can't control outages.

Security Impact
---------------
None
Other End User Impact
---------------------

* No longer using "recheck no bug" places a burden on developers to
  investigate why a job failed.
* Adding coverage to our tests will increase the overall time to run a job.

Performance Impact
------------------
Performance of CI should improve overall.
Other Deployer Impact
---------------------
None
Developer Impact
----------------
None
Implementation
==============
Assignee(s)
-----------

Primary assignee:
  derekh

Other contributors:
  looking for volunteers...

Work Items
----------
* hp1 upgrade to trusty.
* Potential pypi mirror.
* Fedora Mirrors.
* Ubuntu Mirrors.
* Mirroring other non distro resources.
* Per region caching proxy.
* Document CI.
* Running an unsafe disk caching strategy on the overcloud nodes.
* Zuul abandon on failure.
* Include collectl on compute and testenv hosts and analyse output.
* Mechanism to monitor CI run times.
* Mechanism to monitor nodepool connection failures to instances.
* Remove ability to recheck no bug or at the very least discourage its use.
* Monitoring cloud/testenv health.
* Expand CI to include tempest.
* Expand CI to include upgrades.
Dependencies
============
None
Testing
=======
CI failure rate and timings will be tracked to confirm improvements.
Documentation Impact
====================
The tripleo-ci repository needs additional documentation describing the
current layout, and it should then be updated as changes are made.
References
==========
* [1] Spec to run devtest on OpenStack: https://review.openstack.org/#/c/92642/
* [2] eatmydata for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1007619
* [3] CI upgrades: https://review.openstack.org/#/c/87758/
* [4] Summit session: https://etherpad.openstack.org/p/juno-summit-tripleo-ci
* [5] Example of monitoring TripleO CI results: http://jogo.github.io/gate/tripleo.html