Merge "Add content to high-availability in arch guide"
commit 62e52363c7
@@ -7,3 +7,183 @@ High availability

Data Plane and Control Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When designing an OpenStack cloud, it is important to consider the needs
dictated by the :term:`Service Level Agreement (SLA)` in terms of the core
services required to maintain availability of running Compute service
instances, networks, storage, and additional services running on top of those
resources. These services are often referred to as the Data Plane services,
and are generally expected to be available all the time.

The remaining services, responsible for CRUD operations, metering, monitoring,
and so on, are often referred to as the Control Plane. The SLA is likely to
dictate a lower uptime requirement for these services.
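
Translating availability percentages into allowed downtime per year makes the
difference between the two planes concrete. The following minimal sketch uses
illustrative SLA figures, not values from any particular agreement:

.. code-block:: python

   # Convert an SLA availability percentage into allowed downtime per year.
   MINUTES_PER_YEAR = 365 * 24 * 60

   def allowed_downtime_minutes(availability_percent):
       """Minutes per year a service may be down while still meeting the SLA."""
       return MINUTES_PER_YEAR * (1 - availability_percent / 100)

   # Illustrative targets: a stricter SLA for the Data Plane.
   for name, sla in [("Data Plane", 99.99), ("Control Plane", 99.9)]:
       print("%s at %.2f%%: %.0f minutes/year of allowed downtime"
             % (name, sla, allowed_downtime_minutes(sla)))
   # Data Plane: ~53 minutes/year; Control Plane: ~526 minutes/year.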

The services comprising an OpenStack cloud have a number of requirements that
the architect needs to understand in order to meet SLA terms. For example,
providing the Compute service requires, at a minimum, storage, message
queuing, and database services, as well as the networking between them.

Ongoing maintenance operations are made much simpler if there is logical and
physical separation of Data Plane and Control Plane systems. It then becomes
possible to, for example, reboot a controller without affecting customers.
If one service failure affects the operation of an entire server ('noisy
neighbour'), the separation between Control and Data Planes enables rapid
maintenance with a limited effect on customer operations.

Eliminating Single Points of Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within each site
----------------

OpenStack lends itself to deployment in a highly available manner, which
requires at least two servers. These can run all of the services involved,
from the message queuing service, for example ``RabbitMQ`` or ``QPID``, to an
appropriately deployed database service such as ``MySQL`` or ``MariaDB``. As
services in the cloud are scaled out, back-end services need to scale too.
Monitoring and reporting on server utilization and response times, as well as
load testing your systems, will help inform scale-out decisions.
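
As an illustration, response-time data of this kind can be gathered with a
simple poller. The endpoint URLs and the latency threshold in the following
sketch are hypothetical:

.. code-block:: python

   import time
   import urllib.request

   # Hypothetical service endpoints to sample; substitute your own.
   ENDPOINTS = {
       "identity": "http://controller:5000/v3",
       "compute": "http://controller:8774/v2.1",
   }

   def sample_response_time(url, timeout=5):
       """Return the response time in seconds for one GET, or None on failure."""
       start = time.monotonic()
       try:
           with urllib.request.urlopen(url, timeout=timeout):
               pass
       except OSError:
           return None
       return time.monotonic() - start

   for name, url in ENDPOINTS.items():
       elapsed = sample_response_time(url)
       if elapsed is None:
           print("%s: unreachable" % name)
       elif elapsed > 1.0:  # arbitrary threshold hinting at scale-out
           print("%s: slow (%.2fs)" % (name, elapsed))
       else:
           print("%s: ok (%.2fs)" % (name, elapsed))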

The OpenStack services themselves should be deployed across multiple servers
that do not represent a single point of failure. Availability can be ensured
by placing these services behind highly available load balancers that have
multiple OpenStack servers as members.

There are a small number of OpenStack services which are intended to run in
only one place at a time (for example, the ``ceilometer-agent-central``
service). In order to prevent these services from becoming a single point of
failure, they can be controlled by clustering software such as ``Pacemaker``.
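
Pacemaker is the conventional choice here; purely as an illustration of the
same active/passive pattern, a singleton service can also be guarded with a
distributed lock, for example via the ``tooz`` coordination library. The
back-end URL, member name, and ``run_agent`` body below are hypothetical:

.. code-block:: python

   import time

   from tooz import coordination

   def run_agent():
       """Placeholder for the singleton service's work loop."""
       time.sleep(30)

   # Hypothetical coordination back end and member identifier.
   coordinator = coordination.get_coordinator(
       "zookeeper://coordination-host:2181", b"agent-host-1")
   coordinator.start()

   lock = coordinator.get_lock(b"ceilometer-agent-central")
   while True:
       if lock.acquire(blocking=False):
           try:
               run_agent()  # only the lock holder does the work
           finally:
               lock.release()
       time.sleep(10)  # standby instances retry until the lock is free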

In OpenStack, the infrastructure is integral to providing services and should
always be available, especially when operating with SLAs. Ensuring network
availability is accomplished by designing the network architecture so that no
single point of failure exists. The number of switches, routes, and
redundancies of power should be factored into the core infrastructure, as
well as the associated bonding of networks to provide diverse routes to your
highly available switch infrastructure.

Care must be taken when deciding network functionality. Currently, OpenStack
supports both the legacy networking (nova-network) system and the newer,
extensible OpenStack Networking (neutron). OpenStack Networking and legacy
networking both have their advantages and disadvantages. They are both valid
and supported options that fit different network deployment models described
in the `OpenStack Operations Guide
<http://docs.openstack.org/openstack-ops/content/network_design.html#network_deployment_options>`_.

When using the Networking service, the OpenStack controller servers or
separate Networking hosts handle routing unless the dynamic virtual routers
pattern for routing is selected. Running routing directly on the controller
servers mixes the Data and Control Planes and can cause complex issues with
performance and troubleshooting. It is possible to use third-party software
and external appliances that help maintain highly available layer three
routes. Doing so allows for common application endpoints to control network
hardware, or to provide complex multi-tier web applications in a secure
manner. It is also possible to completely remove routing from Networking,
and instead rely on hardware routing capabilities. In this case, the
switching infrastructure must support layer three routing.

Application design must also be factored into the capabilities of the
underlying cloud infrastructure. If the compute hosts do not provide a
seamless live migration capability, then it must be expected that if a
compute host fails, any instances on it, along with any data local to those
instances, will be lost. However, when providing users with an expectation
that instances have a high level of uptime, the infrastructure must be
deployed in a way that eliminates any single point of failure if a compute
host disappears. This may include utilizing shared file systems on
enterprise storage or OpenStack Block Storage to provide a level of guarantee
that matches service features.

If using a storage design that includes shared access to centralized storage,
ensure that this is also designed without single points of failure, and that
the SLA for the solution matches or exceeds the expected SLA for the Data
Plane.

Between sites in a multi-region design
--------------------------------------

Some services are commonly shared between multiple regions, including the
Identity service and the Dashboard. In this case, it is necessary to ensure
that the databases backing the services are replicated, and that access to
multiple workers across each site can be maintained in the event of losing a
single region.
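
As an illustration of tolerating the loss of a region, a client can fail over
between replicated Identity endpoints. The following sketch uses the
``keystoneauth1`` library; the endpoint URLs and credentials are hypothetical:

.. code-block:: python

   from keystoneauth1 import session
   from keystoneauth1.identity import v3

   # Hypothetical Identity endpoints backed by replicated databases.
   AUTH_URLS = [
       "https://identity.region-one.example.com:5000/v3",
       "https://identity.region-two.example.com:5000/v3",
   ]

   def authenticated_session(auth_urls):
       """Return a session from the first region whose Identity endpoint works."""
       for url in auth_urls:
           auth = v3.Password(
               auth_url=url,
               username="demo", password="secret",  # hypothetical credentials
               project_name="demo",
               user_domain_id="default", project_domain_id="default",
           )
           sess = session.Session(auth=auth)
           try:
               sess.get_token()  # force authentication against this endpoint
               return sess
           except Exception:
               continue  # region unreachable, try the next one
       raise RuntimeError("no Identity endpoint reachable in any region")

   sess = authenticated_session(AUTH_URLS)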

Multiple network links should be deployed between sites to provide redundancy
for all components. This includes storage replication, which should be
isolated to a dedicated network or VLAN with the ability to assign QoS to
control the replication traffic or provide priority for this traffic. Note
that if the data store is highly changeable, the network requirements could
have a significant effect on the operational cost of maintaining the sites.
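
For a rough sense of that cost driver, the sustained WAN bandwidth consumed
by replication can be estimated from the daily change rate. The figures in
the following sketch are assumptions, not measurements:

.. code-block:: python

   # Estimate the sustained WAN bandwidth needed to keep a remote replica current.
   daily_change_gb = 500        # assumed: data rewritten per day in the store
   replication_overhead = 1.2   # assumed: protocol and retransmission overhead

   seconds_per_day = 24 * 60 * 60
   required_mbps = (daily_change_gb * 8 * 1024 / seconds_per_day
                    * replication_overhead)
   print("Sustained replication bandwidth: %.0f Mbit/s" % required_mbps)
   # ~57 Mbit/s at 500 GB/day; a highly changeable store scales this directly.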

If the design incorporates more than one site, the ability to maintain object
availability in both sites has significant implications for the object
storage design and implementation. It also has a significant impact on the
WAN network design between the sites.

If applications running in a cloud are not cloud-aware, there should be clear
measures and expectations to define what the infrastructure can and cannot
support. An example would be shared storage between sites. This is possible,
but such a solution is not native to OpenStack and requires a third-party
hardware vendor to fulfill the requirement. Another example can be seen in
applications that are able to consume resources in object storage directly.

Connecting more than two sites increases the challenges and adds more
complexity to the design considerations. Multi-site implementations require
planning to address the additional topology used for internal and external
connectivity. Some options include full mesh, hub and spoke, spine leaf, and
3D torus topologies.

For more information on high availability in OpenStack, see the `OpenStack
High Availability Guide <http://docs.openstack.org/ha-guide/>`_.

Site loss and recovery
~~~~~~~~~~~~~~~~~~~~~~

Outages can cause partial or full loss of site functionality. Strategies
should be implemented to understand and plan for recovery scenarios.

* The deployed applications need to continue to function and, more
  importantly, you must consider the impact on the performance and
  reliability of the application when a site is unavailable.

* It is important to understand what happens to the replication of
  objects and data between the sites when a site goes down. If this
  causes queues to start building up, consider how long these queues
  can safely exist until an error occurs; the sketch after this list
  shows one way to estimate that window.

* After an outage, ensure the method for resuming proper operations of
  a site is implemented when it comes back online. We recommend you
  architect the recovery to avoid race conditions.
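
A minimal sketch of that estimate, using assumed rates and capacity:

.. code-block:: python

   # Estimate how long replication queues can absorb updates during an outage.
   ingest_rate_mb_s = 40      # assumed: rate at which new updates are queued
   queue_capacity_gb = 200    # assumed: disk set aside for the queue backlog

   seconds_to_full = queue_capacity_gb * 1024 / ingest_rate_mb_s
   print("Queue fills in %.1f hours" % (seconds_to_full / 3600))
   # ~1.4 hours at these figures; longer outages start losing updates.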

Inter-site replication data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traditionally, replication has been the best method of protecting object
store implementations. A variety of replication methods exist in storage
architectures, for example synchronous and asynchronous mirroring. Most
object stores and back-end storage systems implement methods for replication
at the storage subsystem layer. Object stores also tailor replication
techniques to fit a cloud's requirements.

Organizations must find the right balance between data integrity and data
availability. Replication strategy may also influence disaster recovery
methods.

Replication across different racks, data centers, and geographical regions
increases focus on determining and ensuring data locality. The ability to
guarantee data is accessed from the nearest or fastest storage can be
necessary for applications to perform well.

.. note::

   When running embedded object store methods, ensure that you do not
   instigate extra data replication, as this can cause performance issues.