Lana Brindley 0e9f2c421a minor edits to HA guide
This patch contains edits to the high availability guide
for grammar, structure, tone and style.

Change-Id: Iedfc2fbf87abe16de99aae3c46c244651d1be8d8
2013-10-29 16:56:19 +11:00

58 lines
5.2 KiB
Plaintext

[[ch-intro]]
== Introduction to OpenStack High Availability
High Availability systems seek to minimize two things:
* *System downtime* -- occurs when a _user-facing_ service is unavailable beyond a specified maximum amount of time, and
* *Data loss* -- accidental deletion or destruction of data.
Most high availability systems guarantee protection against system downtime and data loss only in the event of a single failure. However, they are also expected to protect against cascading failures, where a single failure deteriorates into a series of consequential failures.
A crucial aspect of high availability is the elimination of single points of failure (SPOFs). A SPOF is an individual piece of equipment or software which will cause system downtime or data loss if it fails. In order to eliminate SPOFs, check that mechanisms exist for redundancy of:
* Network components, such as switches and routers
* Applications and automatic service migration
* Storage components
* Facility services such as power, air conditioning, and fire protection
Most high availability systems will fail in the event of multiple independent (non-consequential) failures. In this case, most systems will protect data over maintaining availability.
High-availability systems typically achieve uptime of 99.99% or more, which roughly equates to less than an hour of cumulative downtime per year. In order to achieve this, high availability systems should keep recovery times after a failure to about one to two minutes, sometimes significantly less.
OpenStack currently meets such availability requirements for its own infrastructure services, meaning that an uptime of 99.99% is feasible for the OpenStack infrastructure proper. However, OpenStack _does_ _not_ guarantee 99.99% availability for individual guest instances.
Preventing single points of failure can depend on whether or not a service is stateless.
[[stateless-vs-stateful]]
=== Stateless vs. Stateful services
A stateless service is one that provides a response after your request, and then requires no further attention. To make a stateless service highly available, you need to provide redundant instances and load balance them. OpenStack services that are stateless include nova-api, nova-conductor, glance-api, keystone-api, neutron-api and nova-scheduler.
A stateful service is one where subsequent requests to the service depend on the results of the first request. Stateful services are more difficult to manage because a single action typically involves more than one request, so simply providing additional instances and load balancing will not solve the problem. For example, if the Horizon user interface reset itself every time you went to a new page, it wouldn't be very useful. OpenStack services that are stateful include the OpenStack database and message queue.
Making stateful services highly available can depend on whether you choose an active/passive or active/active configuration.
[[ap-intro]]
=== Active/Passive
In an active/passive configuration, systems are set up to bring additional resources online to replace those that have failed. For example, OpenStack would write to the main database while maintaining a disaster recovery database that can be brought online in the event that the main database fails.
Typically, an active/passive installation for a stateless service would maintain a redundant instance that can be brought online when required. Requests are load balanced using a virtual IP address and a load balancer such as HAProxy.
A typical active/passive installation for a stateful service maintains a replacement resource that can be brought online when required. A separate application (such as Pacemaker or Corosync) monitors these services, bringing the backup online as necessary.
// Removed bullets and made into parallel prose. However, I'm not certain on the difference between "a redundant instance" (for stateless) and "a replacement resource" (stateful). Please FIXME.
[[aa-intro]]
=== Active/Active
In an active/active configuration, systems also use a backup but will manage both the main and redundant systems concurrently. This way, if there is a failure the user is unlikely to notice. The backup system is already online, and takes on increased load while the main system is fixed and brought back online.
Typically, an active/active installation for a stateless service would maintain a redundant instance, and requests are load balanced using a virtual IP address and a load balancer such as HAProxy.
A typical active/active installation for a stateful service would include redundant services with all instances having an identical state. For example, updates to one instance of a database would also update all other instances. This way a request to one instance is the same as a request to any other. A load balancer manages the traffic to these systems, ensuring that operational systems always handle the request.
These are some of the more common ways to implement these high availability architectures, but they are by no means the only ways to do it. The important thing is to make sure that your services are redundant, and available; how you achieve that is up to you. This document will cover some of the more common options for highly available systems.