Add Eris whitepaper from Gautam
Converted from .docx to reStructuredText and published with permission from the original author, Gautam Divgi. This is still WIP but nevertheless is hoped to be of use to the community. Story: 2002129 Task: 22148 Change-Id: I5b9862d36ab8d4f11a764b506f955154e4e8303b
This commit is contained in:
parent
3bca09913f
commit
e28144217a
722
doc/source/eris/index.rst
Normal file
722
doc/source/eris/index.rst
Normal file
@ -0,0 +1,722 @@
|
||||
===============================================
|
||||
OpenStack Eris - an extreme testing framework
|
||||
===============================================
|
||||
|
||||
.. contents::
|
||||
:depth: 2
|
||||
:local:
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
OpenStack has been expanding at a breakneck pace. Its adoption has
|
||||
been phenomenal and it is currently the go to choice for on premise
|
||||
cloud IaaS software. From a software development perspective OpenStack
|
||||
today has approximately *nLines* of code contributed by thousands
|
||||
developers, reviewers and PTLs. There are new *mNewProjects* each year
|
||||
and *kBlueprints* under review. Taking a look at its adoption
|
||||
perspective, OpenStack clouds today power *nCPUs* cores of processors
|
||||
in *nCompanys* companies. The installations handle a variety of
|
||||
traffic anywhere from simple web hosting to extremely resource and SLA
|
||||
intensive workloads like telecom virtual network functions (VNFs) and
|
||||
scientific computing.
|
||||
|
||||
A commonly heard theme with regards to this rapid expansion in both,
|
||||
installed footprint and the OpenStack software project, is resiliency
|
||||
and performance. More specifically the questions asked are:
|
||||
|
||||
- What are the resiliency and performance characteristics of OpenStack
|
||||
from a control and data plane perspective?
|
||||
|
||||
- What sort of performance metrics can be achieved with a specific
|
||||
architecture?
|
||||
|
||||
- How resilient is the architecture to failures?
|
||||
|
||||
- How much resource scale can be achieved?
|
||||
|
||||
- What level of concurrency can resource operations handle?
|
||||
|
||||
- How operationally ready is a particular OpenStack installation?
|
||||
|
||||
- How do new releases compare to the older ones with regards to the
|
||||
above questions?
|
||||
|
||||
OpenStack Eris is an extreme testing framework and test suite that
|
||||
proposes to stress OpenStack in various different ways to address
|
||||
performance and resiliency questions about OpenStack. Eris comes out
|
||||
of `the LCOO working group <https://wiki.openstack.org/wiki/LCOO>`_'s
|
||||
efforts to derive holistic performance, reliability and availability
|
||||
characteristics for OpenStack installations at the release/QA
|
||||
gates. In addition, Eris also aims to provide capabilities for third
|
||||
party CI’s and other open source communities like OpenContrail,
|
||||
etc. to execute and publish similar characteristics.
|
||||
|
||||
|
||||
Goals and Benefits
|
||||
==================
|
||||
|
||||
The major objective of the project has been outlined in the previous
|
||||
section. To reiterate here: derive holistic performance, reliability and
|
||||
availability characteristics for OpenStack. Figure 1 below translates
|
||||
the breakup of this objective into specific goals to achieve that
|
||||
objective. The aim of this section is to discuss in fairly abstract
|
||||
terms these goals without diving into actual implementation details.
|
||||
|
||||
+-----------------------------+
|
||||
| |image0| |
|
||||
+=============================+
|
||||
| **Figure 1: Goals of Eris** |
|
||||
+-----------------------------+
|
||||
|
||||
Eris has three major goals that derive from its primary goal of deriving
|
||||
holistic performance, reliability and availability characteristics of
|
||||
OpenStack. Each of the major goals and their sub-goals are discussed in
|
||||
detail below.
|
||||
|
||||
Goal 1: Requirements
|
||||
--------------------
|
||||
|
||||
Define infrastructure architecture, realistic
|
||||
workloads for that architecture and reference KPI/SLO valid for that
|
||||
architecture.
|
||||
|
||||
- **Reference architecture(s):** Performance and resiliency
|
||||
characteristics of a system are valid for specific architectures
|
||||
they are configured for. Hence, one of our first goals is to define
|
||||
reference architectures on which tests will be run.
|
||||
|
||||
- **Reference workload(s):** When pursuing the assessment of
|
||||
performance and resiliency we should ensure that it is done under
|
||||
well-defined workloads. These workloads should be modeled on either
|
||||
normal or stressful situations that happen in real data
|
||||
centers. Unrealistic workloads skew results and provide data that is
|
||||
not useful.
|
||||
|
||||
- **Reference KPI/SLO(s):** The type of testing that Eris proposes is
|
||||
non-deterministic, i.e. performance or resiliency cannot be
|
||||
determined by the success or failure of a single transaction.
|
||||
Performance and resiliency are generally determined by using
|
||||
aggregates of certain metrics (e.g. percent success rate, mean
|
||||
transaction response times, mean time to recover, etc.) for a set of
|
||||
transactions run over an extended time period. These aggregate
|
||||
metrics are the Key Performance Indicators (KPIs) or Service Level
|
||||
Objectives (SLOs) of the test. These metrics need to be defined
|
||||
since they are will determine the pass/fail criteria for the
|
||||
testing.
|
||||
|
||||
Goal 2: Frameworks
|
||||
------------------
|
||||
|
||||
Define the elements of an extreme testing framework that encompasses
|
||||
the ability to create repeatable experiments, test creation, test
|
||||
orchestration, extensibility, automation and capabilities for
|
||||
simulation and emulation. The Eris framework is not tightly coupled to
|
||||
the test suite or the requirements. This leaves it flexible for other
|
||||
general purpose use like VNF testing as well.
|
||||
|
||||
- **Repeatable experiments:** For non-deterministic testing, the
|
||||
ability to create repeatable experiments is paramount. Such a
|
||||
capability allows parameters to be consistently verified within the
|
||||
KPI/SLO limits.
|
||||
|
||||
- **Test Creation:** Ease of test creation is a basic facility that
|
||||
should be provided by the framework. A test should be specified using
|
||||
an open specification and require minimal development (programming).
|
||||
It should maximize the capability for re-use between already
|
||||
developed components and test cases.
|
||||
|
||||
- **Test Orchestration:** Facilities for test orchestration should be
|
||||
provided by the framework. Test orchestration can span various layers
|
||||
of the reference architecture. The test orchestration mechanism
|
||||
should be able to orchestrate for the reference workloads and
|
||||
failures on the reference architecture and measure the reference
|
||||
KPI/SLO.
|
||||
|
||||
- **Extensibility:** The framework should be extensible at all layers.
|
||||
This means the plugin should be designed using a plugin/driver model
|
||||
with a significantly flexible specification to accomplish this goal.
|
||||
|
||||
- **Automation:** The entire test suite should be automated. This
|
||||
includes orchestrating various steps of the test along with computing
|
||||
a success/failure of the test based on the KPI/SLO supplied. This
|
||||
also explicitly means that good mathematics will be needed. There
|
||||
shouldn’t be eyeballing graphs to see if KPI are met or not met.
|
||||
|
||||
- **Simulation and Emulation:** Any framework that does performance and
|
||||
resiliency needs to have efficient and effective simulation and
|
||||
emulation mechanisms. These are especially useful to run experiments
|
||||
on constrained environments. Examples include – how would we know if
|
||||
OpenStack control plane components are ready for 5000 compute node
|
||||
scale? It is not possible to acquire that kind of hardware. So,
|
||||
testing will eventually need robust simulation and emulation
|
||||
components.
|
||||
|
||||
Goal 3: Test Suite
|
||||
------------------
|
||||
|
||||
The test suite is the actual set of tests that are run by the
|
||||
framework on the reference architecture with the reference workload
|
||||
and faults specified. The end result is to derive the metrics related
|
||||
to performance, reliability and availability.
|
||||
|
||||
- **Control Plane Performance:** This test suite will be responsible to
|
||||
run the reference API workload on various OpenStack components.
|
||||
|
||||
- **Data Plane Performance:** This test suite will be responsible to
|
||||
run the reference data plane workload. The expectation is that data
|
||||
and control plane performance workloads are run together to get a
|
||||
feel for realistic traffic in an install OpenStack environment.
|
||||
|
||||
- **Resiliency to Failure:** The test suites at either random or
|
||||
imperative points will inject failures into the system at various
|
||||
levels (hardware, network, etc.). The failure types could be simple
|
||||
or compounded failures. The KPI’s published will also include details
|
||||
on how OpenStack reacts and recovers from these failures.
|
||||
|
||||
- **Resource scale limits:** This test suite will seek to identify
|
||||
limits of resource scale. Examples are – how many VMs can be created,
|
||||
how many networks, how many cinder volumes, how many volumes per VM,
|
||||
etc.? The test suite will also track performance of various
|
||||
components as and how the resources are scaled. There isn’t an
|
||||
expectation of high concurrency for these tests. The primary goal
|
||||
being to flush out various “limits” as defined but not explicitly
|
||||
specified either by OpenStack or components it uses.
|
||||
|
||||
- **Resource concurrency limits:** This test suite will seek to
|
||||
identify limits of resource concurrency. Examples are – how many
|
||||
concurrent modifications can be made on a network, a subnet, a port,
|
||||
etc. As with resource scale limits, resources will need to be
|
||||
identified and concurrent transactions will need to be run against
|
||||
single resources. The test suite will track performance of various
|
||||
components during the test.
|
||||
|
||||
- **Operational readiness:** It is often times not feasible to run an
|
||||
entire gamut of long running tests as identified above. What is
|
||||
needed either for production readiness testing or for QA gates is a
|
||||
smoke test that signifies operational readiness. It is the minimal
|
||||
criteria needed to declare a code change good or a site healthy. The
|
||||
test suite will contain a “smoke test” for performance, reliability
|
||||
and availability labelled as its operational readiness test.
|
||||
|
||||
Review of Existing Projects
|
||||
===========================
|
||||
|
||||
There has been a lot of work put in disparate projects, some successful
|
||||
and some that aren’t that well known into building tools and creating
|
||||
test suites for measuring OpenStack performance, reliability and
|
||||
availability. This section will review these projects with our goals in
|
||||
perspective and provide an analysis of the tools we intend to use.
|
||||
|
||||
Summary of Projects
|
||||
-------------------
|
||||
|
||||
OpenStack/Rally
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
`Rally <https://docs.openstack.org/developer/rally/>`_ is currently
|
||||
the choice for control plane performance testing. It has a flexible
|
||||
architecture with a plugin mechanism that can be extended. It has a
|
||||
wide base of existing plugins for OpenStack scenarios and this base
|
||||
keeps on expanding. Most performance testing of OpenStack today uses
|
||||
Rally. The benchmarks it provides today are mostly related to success
|
||||
rate of the transactions and response times as it is only aware of
|
||||
what is happening on the client side of the transaction. There is
|
||||
scope for failure injection scenarios using an os-faults hook with
|
||||
triggers.
|
||||
|
||||
OpenStack/Shaker
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
`Shaker <https://opendev.org/performa/shaker>`_ is
|
||||
currently the popular choice for data plane network performance
|
||||
testing. It has a custom built image with agents and iperf/iperf3
|
||||
toolsets along with a wide array of heat templates to instantiate a
|
||||
topology. Shaker also provides various methods to measure metrics and
|
||||
enforce SLA of the tests.
|
||||
|
||||
OpenStack/os-faults
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The failure injection mechanism used within Rally and one that can
|
||||
also be used independently is `os-faults
|
||||
<https://opendev.org/performa/os-faults>`_. It consists of a CLI and
|
||||
library. It currently contains failure injections that can be run at
|
||||
either a hardware or a software level. Software failure injections are
|
||||
network and process failures while hardware faults are via IPMI to
|
||||
servers. Information about a site can be discovered via pre-defined
|
||||
drivers (fuel, tcpcloud, etc.) or provided directly via a JSON
|
||||
configuration file. The set of drivers can be extended by developers
|
||||
for more automated discovery mechanisms.
|
||||
|
||||
Cisco/cloud99
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
`Cloud99 <https://github.com/cisco-oss-eng/Cloud99>`_ is Cisco open
|
||||
source to probe high availability deployments of OpenStack. It
|
||||
consists primarily of software the runs load on the control and data
|
||||
plane, injects service disruptions and measures metrics. The load
|
||||
runner for the control plane is a wrapper around OpenStack
|
||||
Rally. There doesn’t seem to be a data plane load runner implemented
|
||||
at this point in time. The metrics gathering is via Ansible/SSH and
|
||||
the service disruptors use Paramiko/SSH to induce disruptions.
|
||||
|
||||
Other Efforts
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
There have been several other efforts that use some combination of the
|
||||
tools mentioned above with custom frameworks to achieve in part some
|
||||
of the objectives that have been set for Eris. Notable work includes:
|
||||
|
||||
- an Intel destructive scenario report using Rally and os-faults,
|
||||
|
||||
- `the Mirantis Stepler framework
|
||||
<https://github.com/Mirantis/stepler>`_ that uses os-faults for
|
||||
failure injection, and
|
||||
|
||||
- `the OSIC's ops-workload-framework
|
||||
<https://github.com/osic/ops-workload-framework>`_.
|
||||
|
||||
Most of this work focuses on control plane performance combined with
|
||||
failure injection.
|
||||
|
||||
`The ENoS framework <https://github.com/BeyondTheClouds/enos>`_
|
||||
combines Rally with a deployment of containerized OpenStack to
|
||||
generate repeatable performance experiments.
|
||||
|
||||
Gap Analysis
|
||||
------------
|
||||
|
||||
This section provides a gap analysis of the above tools with regards to
|
||||
the goals of Eris. The purpose here is not to rule out or exclude the
|
||||
tools from use in Eris. To the contrary, it is to identify the strengths
|
||||
of the existing toolset and investigate where Eris needs to focus its
|
||||
efforts.
|
||||
|
||||
Requirements Gaps
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
One of the major gaps identified above is the focus on frameworks at the
|
||||
cost of a reference requirements. For any non-deterministic testing
|
||||
mechanism that focuses on performance, reliability and availability the
|
||||
underlying architecture, workloads and SLOs are extremely important.
|
||||
Those are the references that give the numbers meaning. It is not that
|
||||
the frameworks are secondary, but in the absence of the reference
|
||||
requirements, numbers from frameworks and test suites are hard to
|
||||
interpret and use. There are also specific gaps with framework and test
|
||||
suites that are outlined below.
|
||||
|
||||
Framework Gaps
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
**Repeatable Experiments:** *ENOS* is the only tool that is geared
|
||||
towards generating repeatable performance experiments. However, it is
|
||||
only valid for container deployments. There are various other deployment
|
||||
tools like Fuel, Ansible, etc. but none that integrate deployment with
|
||||
various test suites.
|
||||
|
||||
**Test Creation:** Rally is the de-facto in Control Plane performance
|
||||
test specification. Most tools and efforts around performance and
|
||||
failure injection of OpenStack have leveraged Rally – including Cloud99
|
||||
and ENOS. Shaker is popular for network load generation and provides a
|
||||
fairly good suite of out of the box templates for creating and
|
||||
benchmarking various types of tenant network load. Although both tools
|
||||
are extensible, there are major gaps with regards to specifying combined
|
||||
control and data plane workloads – like a real IaaS would have. The gaps
|
||||
include scenarios like I/O loads, network BGP loads, DPDK, CPU, memory
|
||||
in the data plane. They include multi-scenario and distributed workload
|
||||
generation in Rally. For failure injection specifications, Shaker
|
||||
supports no failure injections. Rally supports single failure injections
|
||||
via the os-faults library with the deterministic triggers (at specific
|
||||
iteration points or times).
|
||||
|
||||
**Test Orchestration:** There are no tools today that support
|
||||
distributed test orchestration. None of the tools analyzed above have
|
||||
the ability to deploy a test suite to multiple
|
||||
nodes/locations/containers, etc. and orchestrate and manage a test.
|
||||
Further – integrating such capability into these tools would involve
|
||||
some major re-architecture and refactoring [addRef-RallyRoadmap]. The
|
||||
test orchestration SLA specifications today are fairly disparate for
|
||||
control and data plane and they lack a uniform mechanism to add new
|
||||
counters and metrics especially from Control Plane hosts or compute
|
||||
hosts. Ansible seems to be used primarily as a crutch for SSH while
|
||||
ignoring the many capabilities of Ansible that can actually solve the
|
||||
various gaps.
|
||||
|
||||
**Extensibility:** Most tools surveyed are extensible for the simpler
|
||||
changes – i.e. more failure injection scenarios, randomized triggers,
|
||||
new API call scenarios, etc. However, the bigger changes seem to need
|
||||
some fairly extensive changes. Examples includes various items in the
|
||||
Rally roadmap that are blocked by a major refactoring effort. Shaker
|
||||
also today doesn’t seem to have a failure injection mechanism plugged in
|
||||
addition to not having other data plane load generation
|
||||
tools/capabilities. They definitely do not support plugins to interface
|
||||
with other third party (or proprietary) tools and make the integration
|
||||
of different performance collection and computation counters difficult.
|
||||
|
||||
**Automation:** While there is a fair amount of thought paid today to
|
||||
test setup and test orchestration automation, there is not a lot of work
|
||||
in automating the success and failure criteria based on certain SLO.
|
||||
Rally and Shaker both incorporate specific SLA verification mechanisms
|
||||
but both are limited. Shaker is limited by what is observed on the guest
|
||||
VMs and Rally by the API response times and success rates. The overall
|
||||
health of an IaaS installation will require many more counters with more
|
||||
complex mathematics needed to calculate metrics and verify the systems
|
||||
capability to satisfy SLO.
|
||||
|
||||
**Simulation and Emulation:** Any major extreme testing framework is
|
||||
never complete without competent simulators and emulators. There needs
|
||||
to be the capability to test scale without actually having the scale. It
|
||||
is especially important for an IaaS system. As an example take the case
|
||||
of scaling an OpenStack cloud to 5000 compute nodes. Is it possible?
|
||||
Probably not. However, to test software changes to make it possible
|
||||
requesting 5000 actual computes is unrealistic. This is a major gap
|
||||
today in OpenStack with no mechanisms to test scale or resiliency
|
||||
without having “real” data centers. The only thing that comes close is
|
||||
the RabbitMQ simulator in OpenStack/oslo.
|
||||
|
||||
Test Suite Gaps
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
**Control & Data Plane Performance:** Rally contains single scenarios
|
||||
for performance testing which sample loads. Shaker contains various heat
|
||||
templates for sample configurations. Neither can be classified as a test
|
||||
suite where OpenStack runs and publishes performance related numbers.
|
||||
Again, the limitation of not having multi-scenarios and distributed
|
||||
workloads will come into play as performance numbers need to be run for
|
||||
larger clouds. In such situations, workloads where only a single
|
||||
machine/client is running orchestration may not be viable.
|
||||
|
||||
**Resiliency to Failure:** There are currently no test suites that
|
||||
measure resiliency to failure. While an os-faults plugin exists in Rally
|
||||
the library itself if out of maintenance today. There are no scenarios
|
||||
of failures to the data plane. There has been an effort to identify
|
||||
points of failure and types of failure along with executing failure
|
||||
scenarios [AddRef-Intelos-faults]. However, these scenarios are run with
|
||||
single rally workloads and its assertion that the traffic represents
|
||||
real traffic seems unrealistic.
|
||||
|
||||
**Resource Scale & Concurrency Limits:** There are currently no test
|
||||
suites that probe these limits. They are generally uncovered when
|
||||
unsuspecting (or over enthusiastic) tenants try something complete way
|
||||
out of what is “ordinary” and the operation fails. They typically end up
|
||||
as bug reports and are investigated and fixed. What is needed is a
|
||||
proactive mechanism to probe and uncover these limits.
|
||||
|
||||
**Operational Readiness:** There is currently no step in the OpenStack
|
||||
QA workflow today that can take a reference architecture, reference
|
||||
workload, reference KPI and run a battery of smoke tests that cover the
|
||||
test suites mentioned in the points above. These smoke or “operational
|
||||
readiness” tests are needed to ensure that fixes and changes to
|
||||
components are not adversely impacting its performance, reliability and
|
||||
availability. This does go back to fixing the gaps that such a test
|
||||
would need at the QA gates, but once that gap if fixed such tests should
|
||||
be a part of the workflow.
|
||||
|
||||
Eris Architecture
|
||||
=================
|
||||
|
||||
Eris is architected to achieve the goals listed in Section 2. This
|
||||
section specifies the basic components of Eris and the Eris QA workflow.
|
||||
The idea is to get Eris down to an abstract framework that can be then
|
||||
extended and implemented using a variety of tools. The QA workflow will
|
||||
identify what points to run Eris.
|
||||
|
||||
Eris Framework
|
||||
--------------
|
||||
|
||||
+------------------------------+
|
||||
| |image1| |
|
||||
+==============================+
|
||||
| **Figure 2: Eris Framework** |
|
||||
+------------------------------+
|
||||
|
||||
As depicted in Figure 2, the proposed Eris architecture is modular. The
|
||||
dark blue boxes denote existing OpenStack systems that developers and
|
||||
the community use. The CI/CD infrastructure will be responsible for
|
||||
scheduling and invoking the testing. Tests that fail SLA/KPI criteria
|
||||
will have bugs created for them in the ticketing system and the
|
||||
community developers can create either tests targeted to their
|
||||
components or tests that are cross-component.
|
||||
|
||||
**Test Manager:** The responsibility of the test manager is to invoke
|
||||
test suite orchestration, interfacing with the bugs and ticketing
|
||||
systems, storing logs and data for future reference. The underlying
|
||||
orchestration layer and orchestration plugins all pipe data and logs
|
||||
into the test manager.
|
||||
|
||||
**Orchestration:** The responsibility of the orchestration component is
|
||||
to run a test scenario that can include deployment, discovery, load
|
||||
injection, failure injection, monitoring, metrics collection and KPI
|
||||
computation. The orchestration engine should be able to take an open
|
||||
specification and turn it into concrete steps that execute the test
|
||||
scenario. The orchestration engine itself may not be the tool that runs
|
||||
all the scenarios.
|
||||
|
||||
**Zone Deployment:** The zone deployment plugin will take a reference
|
||||
architecture specification and deploy an OpenStack installation that
|
||||
complies with that reference architecture. It will also take various
|
||||
reference workload and metrics collections specifications and deploy the
|
||||
test tools in with the distribution specified. When the orchestrator
|
||||
deploys an architecture based on a specification it will not need to
|
||||
discover the zone.
|
||||
|
||||
**Zone Discovery:** In the event that the orchestration plugin operates
|
||||
on an existing deployment it will need to discover the various
|
||||
components of the reference architecture it is installed on. This will
|
||||
be the responsibility of the zone discovery plugin. The zone discovery
|
||||
plugin should also eventually be able to recognize a reference
|
||||
architecture, although initially this capability may be complex to
|
||||
incorporate.
|
||||
|
||||
**Control Plane Load Injection:** This plugin is responsible for setting
|
||||
up and running the control plane load injection. The setup may include a
|
||||
distributed multi-scenario load injection to mimic actual load into an
|
||||
OpenStack IaaS installation depending on the reference workload. Running
|
||||
load should be flexible enough to tune the load models across various
|
||||
distributed nodes and specify ramp-up, ramp-down and sustenance models.
|
||||
This plugin will run OpenStack API into the control plane services and
|
||||
depending on the scenarios executed may need admin access to the zone.
|
||||
|
||||
**Data Plane Load Injection:** This plugin is responsible for setting up
|
||||
various data plane load injection scenarios and running them. As with
|
||||
the control plane load injection this can include a distributed
|
||||
multi-scenario setup to mimic actual traffic depending on the reference
|
||||
workload. While in the case of the control plane, the setup may include
|
||||
something like creating a Rally deployment, in the data plane load
|
||||
injection scenario it will be setting up tenant resources to run stress
|
||||
on the data plane. Again, as with the control plane load injection, load
|
||||
will need to be distributed across various nodes and be tunable to
|
||||
ramp-up, ramp-down and sustenance models. Stress types should include
|
||||
storage I/O, network, CPU and memory at a minimum.
|
||||
|
||||
**Failure Injection:** The failure injection plugin will be responsible
|
||||
to inject failure into various parts of the reference architecture. The
|
||||
failures could be simple failures or compound failures. The injection
|
||||
interval can be either deterministic, i.e. based at a certain time or
|
||||
workload iteration point, randomized or event driven, i.e. based on when
|
||||
certain events are happening in the control or data plane. The nature of
|
||||
the failure injection plugin demands that it have root access (or sudo
|
||||
root) across every component in the reference architecture and tenant
|
||||
space.
|
||||
|
||||
**Data Collection & KPI Computation:** Plugins for data collection and
|
||||
SLA computation will collect various counters from API calls, tenant
|
||||
space and the underlying reference architecture. Based on the matrix of
|
||||
counters at various resource points and formulas supplied for KPI that
|
||||
operate on this matrix, key process indicators (KPI) values are
|
||||
computed. These KPI are then compared against the reference service
|
||||
level objectives for the reference architecture and reference workload
|
||||
combination to provide a pass/fail for the test. Hence, this plugin is
|
||||
the final arbiter in whether the scenario passes or fails.
|
||||
|
||||
Eris Workflow
|
||||
-------------
|
||||
|
||||
+--------------------------------+
|
||||
| |image2| |
|
||||
+================================+
|
||||
| **Figure 3: Eris QA Workflow** |
|
||||
+--------------------------------+
|
||||
|
||||
Apart from the actual Eris framework that is expected to execute the
|
||||
tests, there is a component of Eris that needs to reside in the QA
|
||||
framework. This actually has three major components identified.
|
||||
|
||||
**CI/CD Integration:** Eris test suites need to be integrated into the
|
||||
CI/CD workflow. Test suite runs need to be tagged, the results archived
|
||||
and bugs generated. Initially, there may be the capacity for all Eris
|
||||
tests to be run. However, as and how the library of test suites and
|
||||
reference architectures becomes more complex the gate QA will need to
|
||||
rely on a smoke test/operational readiness test. Initially, the
|
||||
identification of what constitutes a reasonable smoke test will have to
|
||||
be done manually. However, there should be an evolution to automatically
|
||||
identify a set of smoke tests that can be reasonably handled at the
|
||||
CI/CD gates.
|
||||
|
||||
**Test Frequency:** The tests that Eris proposes to run are long running
|
||||
tests. It may not be practical to run them at every code check-in. The
|
||||
workflow proposal is for the smoke tests to be run one a day and an
|
||||
operational readiness suite to be run one every week. This party CI’s
|
||||
can rely on more exhaustive testing that can run into multiple days.
|
||||
|
||||
**Bug Reporting:** The reporting of bugs for Eris can be tricky. Bugs
|
||||
are generated when analyzed KPI from the tests fail to meet defined
|
||||
reference SLO’s. However, these bugs need to be reproducible. The
|
||||
question becomes how many times should a test run before a KPI miss is
|
||||
considered a bug? This is an open question that will consist of some
|
||||
fairly hard mathematics to solve. It may depend on several states in the
|
||||
system and reproducing specific conditions may not be possible every
|
||||
time. A good approach to take is to create a bug but attach a frequency
|
||||
tag to the bug. As and how KPI’s keep missing reference objectives a
|
||||
frequency tag is incremented. The frequency tag can be attached to the
|
||||
criticality of the bug and every 10 counts of a frequency tag can result
|
||||
in the criticality of the bug being bumped up.
|
||||
|
||||
Eris Design
|
||||
===========
|
||||
|
||||
This is the thinnest section by far in the document since not all the
|
||||
parts of Eris have been thought about. It is good in a sense because it
|
||||
provides a lot of opportunity for the community to fine tune the project
|
||||
to its needs. There has been a fair amount of thought put forth on the
|
||||
tools to be used and some of the enhancements that are needed. The main
|
||||
focus of the design here will be to focus on a specification and
|
||||
tools/libraries. The specification can then be broken up into specific
|
||||
roadmap items for Queens and beyond. Keep in mind that the tools and
|
||||
libraries will most certainly need changes that will extend their
|
||||
current capabilities.
|
||||
|
||||
Design Components
|
||||
-----------------
|
||||
|
||||
+--------------------------------------------------------+
|
||||
| |image3| |
|
||||
+========================================================+
|
||||
| **Figure 4: Eris Implementation Components (Partial)** |
|
||||
+--------------------------------------------------------+
|
||||
|
||||
The general idea is to use Ansible to orchestrate the various test
|
||||
scenarios. Ansible is python based and therefore will fit well into the
|
||||
OpenStack community. It also has a variety of plugins already available
|
||||
to orchestrate different scenarios. New plugins can be easily created
|
||||
for specific scenarios that are needed for OpenStack Eris.
|
||||
|
||||
The use of Ansible will result in the following major benefits for the
|
||||
project:
|
||||
|
||||
- Decoupling of the orchestration (Ansible) and execution (Rally,
|
||||
Shaker, etc.).
|
||||
|
||||
- Extensive use of existing Ansible plugins for installation and
|
||||
distributed orchestration of software.
|
||||
|
||||
- Well documented and open source tool for extending and expanding the
|
||||
use of Eris.
|
||||
|
||||
- Agentless execution since agents and tools require extra installation
|
||||
but rarely bring benefits for testing.
|
||||
|
||||
As can be seen from the proposed design above Eris does not exclude the
|
||||
use of already existing tools for performance and failure injection
|
||||
testing. In fact the use of Ansible as the orchestration mechanism
|
||||
provides an incentive for re-using them.
|
||||
|
||||
The other benefit of using Ansible is the ability to include plug-ins
|
||||
for third party proprietary tools with operators and companies
|
||||
developing their own plugins that confirm to the Eris specification. As
|
||||
an example, an operator may use HP Performance Center as a performance
|
||||
testing tool, HP SiteScope for gathering metrics and IXIA for BGP load
|
||||
generation. These could be private plugins for the operator to generate
|
||||
specific load components and gather metrics while still using large
|
||||
parts of Eris to discover, inject faults and compute KPI.
|
||||
|
||||
Deployment
|
||||
----------
|
||||
|
||||
Roadmap Item – for the community to specify.
|
||||
|
||||
Discovery
|
||||
---------
|
||||
|
||||
The discovery mechanism can use any tool to discover the environment. It
|
||||
can be read from a file, use Fuel or Kubernetes, etc. However, in the
|
||||
end the discovery mechanism should confirm to an Ansible dynamic
|
||||
inventory that provides a structure that describes the site. The
|
||||
description of the site can be expanded. However, the underlying load
|
||||
injection mechanisms and metrics gathering mechanisms will depend on
|
||||
this data. In short, the reference workload, failure injection and the
|
||||
metrics gathering cannot see what the discovery cannot provide. So, if
|
||||
initially the discovery provides only server and VM information those
|
||||
are the only resources that can be probed.
|
||||
|
||||
Ideally, a site is composed of the following components:
|
||||
|
||||
- Routers
|
||||
|
||||
- Switches
|
||||
|
||||
- Servers (Control & Compute)
|
||||
|
||||
- Racks
|
||||
|
||||
- VMs (or Containers)
|
||||
|
||||
- Orchestration services (Kubernetes, Ceph, Calico, etc.)
|
||||
|
||||
- OpenStack services and components (Rabbit, MariaDB, etc.)
|
||||
|
||||
Eris will need all details related to these components – specifically
|
||||
ssh keys, IP addresses, MAC addresses and any other variables that
|
||||
describe how to induce failure and stress. It is not possible to
|
||||
provide an entire specification considering the variety of
|
||||
installations. However, an example will be provided with the Queens
|
||||
roadmap.
|
||||
|
||||
Load Injection
|
||||
--------------
|
||||
|
||||
Control Plane
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
The tool for control plane load injection is Rally. Rally is very well
|
||||
known in OpenStack and contains plenty of scenarios to stress the
|
||||
control plane. Rally does have some gaps with distributed workload
|
||||
generation and multi-scenario workloads. With respect to Eris, where the
|
||||
idea is to loosely couple components that make up a scenario, tight
|
||||
coupling with Rally is not desirable. Hence, Eris will use Rally single
|
||||
scenarios. However, Eris will use its own functions and methods for
|
||||
multi-scenario and distributed workload generation. Initially, Eris’
|
||||
focus will be on multi-scenario execution with distributed load
|
||||
generation closely following.
|
||||
|
||||
Data Plane
|
||||
~~~~~~~~~~
|
||||
|
||||
The tool for data plane load injection is Shaker. Shaker already has a
|
||||
custom image for iperf3 execution along with heat templates for
|
||||
deployment. Eris’ goals for Shaker exceed that already defined with
|
||||
Shaker and again there are some significant enhancements with Shaker
|
||||
that will need to be accomplished. A couple of primary enhancements may
|
||||
be the inclusion of various other data plane stress mechanisms and the
|
||||
use of an agentless mechanism using ssh (which Ansible has extensive use
|
||||
with) to control the load and gather metrics.
|
||||
|
||||
Fault Injection
|
||||
---------------
|
||||
|
||||
TODO
|
||||
|
||||
Metrics Gathering
|
||||
-----------------
|
||||
|
||||
TODO
|
||||
|
||||
SLA Computation
|
||||
---------------
|
||||
|
||||
TODO
|
||||
|
||||
Eris Roadmap
|
||||
============
|
||||
|
||||
TODO
|
||||
|
||||
Eris in Popular Literature
|
||||
==========================
|
||||
|
||||
TODO
|
||||
|
||||
.. |image0| image:: ./media/image1.jpg
|
||||
:width: 6.04097in
|
||||
:height: 3.13736in
|
||||
.. |image1| image:: ./media/image2.jpg
|
||||
:width: 6.36813in
|
||||
:height: 2.06361in
|
||||
.. |image2| image:: ./media/image3.png
|
||||
:width: 6.5in
|
||||
:height: 3.65625in
|
||||
.. |image3| image:: ./media/image4.jpg
|
||||
:width: 6.35165in
|
||||
:height: 2.10833in
|
BIN
doc/source/eris/media/image1.jpg
Normal file
BIN
doc/source/eris/media/image1.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 42 KiB |
BIN
doc/source/eris/media/image2.jpg
Normal file
BIN
doc/source/eris/media/image2.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 52 KiB |
BIN
doc/source/eris/media/image3.png
Normal file
BIN
doc/source/eris/media/image3.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 174 KiB |
BIN
doc/source/eris/media/image4.jpg
Normal file
BIN
doc/source/eris/media/image4.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 48 KiB |
@ -16,6 +16,7 @@ Contributions to this documentation are warmly encouraged; please see
|
||||
|
||||
use-cases
|
||||
specs
|
||||
eris/index
|
||||
|
||||
|
||||
Indices and tables
|
||||
|
Loading…
Reference in New Issue
Block a user