Improve chapter "Introduction to OpenStack High Availability"

Change-Id: Ibc10eee41eda2d69ba35b717b966d957509cd723
This commit is contained in:
Christian Berendt
2014-09-23 22:03:22 +02:00
parent 0afc42e955
commit 5f11d6d651

View File

@@ -1,74 +1,123 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="ch-intro">
<title>Introduction to OpenStack High Availability</title>
<para>High Availability systems seek to minimize two things:</para>
<itemizedlist>
<listitem>
<para><emphasis role="strong">System downtime</emphasis>occurs when a <emphasis>user-facing</emphasis> service is unavailable beyond a specified maximum amount of time.
</para>
</listitem>
<listitem>
<para><emphasis role="strong">Data loss</emphasis>accidental deletion or destruction of data.
</para>
</listitem>
</itemizedlist>
<para>Most high availability systems guarantee protection against system downtime and data loss only in the event of a single failure. However, they are also expected to protect against cascading failures, where a single failure deteriorates into a series of consequential failures.</para>
<para>A crucial aspect of high availability is the elimination of single points of failure (SPOFs). A SPOF is an individual piece of equipment or software which will cause system downtime or data loss if it fails. In order to eliminate SPOFs, check that mechanisms exist for redundancy of:</para>
<itemizedlist>
<listitem>
<para>
Network components, such as switches and routers
</para>
</listitem>
<listitem>
<para>
Applications and automatic service migration
</para>
</listitem>
<listitem>
<para>
Storage components
</para>
</listitem>
<listitem>
<para>
Facility services such as power, air conditioning, and fire protection
</para>
</listitem>
</itemizedlist>
<para>Most high availability systems will fail in the event of multiple independent (non-consequential) failures. In this case, most systems will protect data over maintaining availability.</para>
<para>High-availability systems typically achieve an uptime percentage of 99.99% or more, which roughly equates to less than an hour of cumulative downtime per year. In order to achieve this, high availability systems should keep recovery times after a failure to about one to two minutes, sometimes significantly less.</para>
<para>OpenStack currently meets such availability requirements for its own infrastructure services, meaning that an uptime of 99.99% is feasible for the OpenStack infrastructure proper. However, OpenStack <emphasis>does</emphasis> <emphasis>not</emphasis> guarantee 99.99% availability for individual guest instances.</para>
<para>Preventing single points of failure can depend on whether or not a service is stateless.</para>
<section xml:id="stateless-vs-stateful">
<title>Stateless vs. Stateful services</title>
<para>A stateless service is one that provides a response after your request, and then requires no further attention. To make a stateless service highly available, you need to provide redundant instances and load balance them. OpenStack services that are stateless include nova-api, nova-conductor, glance-api, keystone-api, neutron-api and nova-scheduler.</para>
<para>A stateful service is one where subsequent requests to the service depend on the results of the first request. Stateful services are more difficult to manage because a single action typically involves more than one request, so simply providing additional instances and load balancing will not solve the problem. For example, if the Horizon user interface reset itself every time you went to a new page, it wouldnt be very useful. OpenStack services that are stateful include the OpenStack database and message queue.</para>
<para>Making stateful services highly available can depend on whether you choose an active/passive or active/active configuration.</para>
</section>
<section xml:id="ap-intro">
<title>Active/Passive</title>
<para>In an active/passive configuration, systems are set up to bring additional resources online to replace those that have failed. For example, OpenStack would write to the main database while maintaining a disaster recovery database that can be brought online in the event that the main database fails.</para>
<para>Typically, an active/passive installation for a stateless service would maintain a redundant instance that can be brought online when required. Requests may be handled using a virtual IP address to facilitate return to service with minimal reconfiguration required.</para>
<para>A typical active/passive installation for a stateful service maintains a replacement resource that can be brought online when required. A separate application (such as Pacemaker or Corosync) monitors these services, bringing the backup online as necessary.</para>
</section>
<section xml:id="aa-intro">
<title>Active/Active</title>
<para>In an active/active configuration, systems also use a backup but will manage both the main and redundant systems concurrently. This way, if there is a failure the user is unlikely to notice. The backup system is already online, and takes on increased load while the main system is fixed and brought back online.</para>
<para>Typically, an active/active installation for a stateless service would maintain a redundant instance, and requests are load balanced using a virtual IP address and a load balancer such as HAProxy.</para>
<para>A typical active/active installation for a stateful service would include redundant services with all instances having an identical state. For example, updates to one instance of a database would also update all other instances. This way a request to one instance is the same as a request to any other. A load balancer manages the traffic to these systems, ensuring that operational systems always handle the request.</para>
<para>These are some of the more common ways to implement these high availability architectures, but they are by no means the only ways to do it. The important thing is to make sure that your services are redundant, and available; how you achieve that is up to you. This document will cover some of the more common options for highly available systems.</para>
</section>
</chapter>
<title>Introduction to OpenStack High Availability</title>
<para>High Availability systems seek to minimize two things:</para>
<variablelist>
<varlistentry>
<term>System downtime</term>
<listitem><para>Occurs when a user-facing service is unavailable
beyond a specified maximum amount of time.</para></listitem>
</varlistentry>
<varlistentry>
<term>Data loss</term>
<listitem><para>Accidental deletion or destruction of
data.</para></listitem>
</varlistentry>
</variablelist>
<para>Most high availability systems guarantee protection against system
downtime and data loss only in the event of a single failure. However,
they are also expected to protect against cascading failures, where a
single failure deteriorates into a series of consequential failures.</para>
<para>A crucial aspect of high availability is the elimination of single
points of failure (SPOFs). A SPOF is an individual piece of equipment or
software which will cause system downtime or data loss if it fails.
In order to eliminate SPOFs, check that mechanisms exist for redundancy
of:</para>
<itemizedlist>
<listitem>
<para>Network components, such as switches and routers</para>
</listitem>
<listitem>
<para>Applications and automatic service migration</para>
</listitem>
<listitem>
<para>Storage components</para>
</listitem>
<listitem>
<para>Facility services such as power, air conditioning, and fire
protection</para>
</listitem>
</itemizedlist>
<para>Most high availability systems will fail in the event of multiple
independent (non-consequential) failures. In this case, most systems will
protect data over maintaining availability.</para>
<para>High-availability systems typically achieve an uptime percentage of
99.99% or more, which roughly equates to less than an hour of cumulative
downtime per year. In order to achieve this, high availability systems
should keep recovery times after a failure to about one to two minutes,
sometimes significantly less.</para>
<para>OpenStack currently meets such availability requirements for its own
infrastructure services, meaning that an uptime of 99.99% is feasible for
the OpenStack infrastructure proper. However, OpenStack does not guarantee
99.99% availability for individual guest instances.</para>
<para>Preventing single points of failure can depend on whether or not a
service is stateless.</para>
<section xml:id="stateless-vs-stateful">
<title>Stateless vs. Stateful services</title>
<para>A stateless service is one that provides a response after your
request, and then requires no further attention. To make a stateless
service highly available, you need to provide redundant instances and
load balance them. OpenStack services that are stateless include
<systemitem class="service">nova-api</systemitem>,
<systemitem class="service">nova-conductor</systemitem>,
<systemitem class="service">glance-api</systemitem>,
<systemitem class="service">keystone-api</systemitem>,
<systemitem class="service">neutron-api</systemitem> and
<systemitem class="service">nova-scheduler</systemitem>.</para>
<para>A stateful service is one where subsequent requests to the service
depend on the results of the first request. Stateful services are more
difficult to manage because a single action typically involves more than
one request, so simply providing additional instances and load balancing
will not solve the problem. For example, if the Horizon user interface
reset itself every time you went to a new page, it wouldn't be very
useful. OpenStack services that are stateful include the OpenStack
database and message queue.</para>
<para>Making stateful services highly available can depend on whether you
choose an active/passive or active/active configuration.</para>
</section>
<section xml:id="ap-intro">
<title>Active/Passive</title>
<para>In an active/passive configuration, systems are set up to bring
additional resources online to replace those that have failed. For
example, OpenStack would write to the main database while maintaining a
disaster recovery database that can be brought online in the event that
the main database fails.</para>
<para>Typically, an active/passive installation for a stateless service
would maintain a redundant instance that can be brought online when
required. Requests may be handled using a virtual IP address to
facilitate return to service with minimal reconfiguration
required.</para>
<para>A typical active/passive installation for a stateful service
maintains a replacement resource that can be brought online when
required. A separate application (such as Pacemaker or Corosync) monitors
these services, bringing the backup online as necessary.</para>
</section>
<section xml:id="aa-intro">
<title>Active/Active</title>
<para>In an active/active configuration, systems also use a backup but will
manage both the main and redundant systems concurrently. This way, if
there is a failure the user is unlikely to notice. The backup system is
already online, and takes on increased load while the main system is
fixed and brought back online.</para>
<para>Typically, an active/active installation for a stateless service
would maintain a redundant instance, and requests are load balanced using
a virtual IP address and a load balancer such as HAProxy.</para>
<para>A typical active/active installation for a stateful service would
include redundant services with all instances having an identical state.
For example, updates to one instance of a database would also update all
other instances. This way a request to one instance is the same as a
request to any other. A load balancer manages the traffic to these
systems, ensuring that operational systems always handle the
request.</para>
<para>These are some of the more common ways to implement these high
availability architectures, but they are by no means the only ways to do
it. The important thing is to make sure that your services are redundant,
and available; how you achieve that is up to you. This document will
cover some of the more common options for highly available
systems.</para>
</section>
</chapter>