68e8c66e79
1. Edits to the multi-site chapter 2. Removed duplicated legal content which was added to a common section. See https://review.openstack.org/#/c/212299/ Change-Id: I10e3a04650548454c73024d87cbbb6fda63454e8 Implements: blueprint arch-guide
177 lines
9.4 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="user-requirements-multi-site">
<?dbhtml stop-chunking?>
<title>User requirements</title>
<section xml:id="workload-characteristics">
<title>Workload characteristics</title>
<para>Understanding the workloads expected in a multi-site
environment, and the use case they serve, is an important factor in
the decision-making process. In this context, <literal>workload</literal>
refers to the way the systems are used. A workload could be a
single application or a suite of applications that work together.
It could also be a duplicate set of applications that need to
run in multiple cloud environments. In a multi-site deployment,
the same workload often needs to work identically in more than one
physical location.</para>
<para>This multi-site scenario likely includes one or more of the
other scenarios in this book with the additional requirement
of having the workloads in two or more locations. The
following are some possible scenarios:</para>
<para>For many use cases, the proximity of users to their
workloads directly influences application performance and
should therefore be taken into consideration in the design.
Certain applications require minimal latency, which can only be
achieved by deploying the cloud in multiple locations. These
locations could be in different data centers, cities, countries,
or geographical regions, depending on the user requirements and
the location of the users.</para></section>
<section xml:id="consistency-images-templates-across-sites">
<title>Consistency of images and templates across different
sites</title>
<para>It is essential that the deployment of instances is
consistent across the different sites and that this consistency
is built into the infrastructure. If OpenStack Object Storage is
used as a back end for the Image service, it is possible to create
repositories of consistent images across multiple sites. Having
central endpoints with multiple storage nodes allows consistent
centralized storage for every site.</para>
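<para>One way to check that image repositories stay consistent is to
compare the checksum recorded for each image at each site. The
following is a minimal sketch, assuming the per-site inventory has
already been collected into a dictionary. The function name, site
names, image names, and checksum values are all hypothetical and are
not part of any OpenStack API.</para>
<programlisting language="python"><![CDATA[
from collections import defaultdict


def find_inconsistent_images(site_images):
    """Report images that differ between sites or are missing from a site.

    site_images maps a site name to a dict of image name -> checksum.
    """
    # Regroup the inventory so each image lists its checksum per site.
    checksums_by_image = defaultdict(dict)
    for site, images in site_images.items():
        for name, checksum in images.items():
            checksums_by_image[name][site] = checksum

    inconsistent = {}
    for name, per_site in checksums_by_image.items():
        missing = sorted(set(site_images) - set(per_site))
        differs = len(set(per_site.values())) > 1
        if missing or differs:
            inconsistent[name] = {
                "checksums": per_site,
                "missing_from": missing,
            }
    return inconsistent


# Hypothetical inventory, for illustration only.
inventory = {
    "site-a": {"ubuntu-14.04": "aaa111", "centos-7": "bbb222"},
    "site-b": {"ubuntu-14.04": "aaa111", "centos-7": "ccc333"},
    "site-c": {"ubuntu-14.04": "aaa111"},
}

for image, detail in find_inconsistent_images(inventory).items():
    print(image, detail)
]]></programlisting>
<para>An image is reported if any site holds a different checksum for
it or does not hold it at all.</para>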
<para>Not using a centralized object store increases the operational
overhead of maintaining a consistent image library. This
could include developing a replication mechanism to transport
images, and changes to those images, across multiple
sites.</para></section>
<section xml:id="high-availability-multi-site">
<title>High availability</title>
<para>If high availability is required to provide continuous
infrastructure operations, define a baseline high availability
requirement.</para>
<para>The OpenStack management components need at least a basic
level of redundancy. The simplest example is that the loss
of any single site should have minimal impact on the
availability of the OpenStack services.</para>
<para>The <link
xlink:href="http://docs.openstack.org/high-availability-guide/content/"><citetitle>OpenStack
High Availability Guide</citetitle></link>
contains more information on how to provide redundancy for the
OpenStack components.</para>
<para>Deploy multiple network links between sites to
provide redundancy for all components. This includes storage
replication, which should be isolated to a dedicated network
or VLAN with the ability to assign QoS to control or prioritize
the replication traffic. Note that if the data store changes
frequently, the network requirements could have a significant
effect on the operational cost of maintaining the sites.</para>
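<para>To get a rough sense of how the rate of change drives network
requirements, the average bandwidth needed for replication can be
estimated from the volume of data that changes each day. The
following is a back-of-the-envelope sketch, not a sizing tool. The
function name and the <literal>peak_factor</literal> parameter are
hypothetical, and the sample volumes are illustrative.</para>
<programlisting language="python"><![CDATA[
def required_replication_mbps(changed_gb_per_day, peak_factor=1.0):
    """Average Mbit/s needed to replicate a daily change volume.

    changed_gb_per_day is in decimal gigabytes. peak_factor is an
    assumed multiplier for bursts above the daily average.
    """
    bits_per_day = changed_gb_per_day * 1e9 * 8
    seconds_per_day = 24 * 60 * 60
    return bits_per_day / seconds_per_day / 1e6 * peak_factor


# Illustrative daily change volumes in gigabytes.
for changed_gb in (50, 500, 5000):
    mbps = required_replication_mbps(changed_gb)
    print(changed_gb, "GB/day needs about", round(mbps, 1), "Mbit/s")
]]></programlisting>
<para>The estimate covers only the average payload. Protocol overhead,
retransmissions, and catching up after an outage all add to the
bandwidth actually required.</para>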
<para>Maintaining object availability in both sites
has significant implications for the object storage design and
implementation. It also has a significant impact on the
WAN design between the sites.</para>
<para>Connecting more than two sites increases the challenges and
adds more complexity to the design considerations. Multi-site
implementations require planning to address the additional
topology used for internal and external connectivity. Some options
include full mesh, hub and spoke, spine-leaf, and 3D torus
topologies.</para>
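<para>Part of the added complexity comes from how the number of
inter-site links grows with the number of sites. The following
sketch compares a full mesh, where every site connects to every
other site, with a hub and spoke design, where every site connects
only to a central hub. It counts one link per connection, so
redundant links would multiply these figures. The function names
are illustrative.</para>
<programlisting language="python"><![CDATA[
def full_mesh_links(sites):
    """Links needed to connect every site directly to every other site."""
    return sites * (sites - 1) // 2


def hub_and_spoke_links(sites):
    """Links needed when one site acts as the hub for all the others."""
    return max(sites - 1, 0)


for n in range(2, 7):
    print(n, "sites:", full_mesh_links(n), "mesh links,",
          hub_and_spoke_links(n), "hub and spoke links")
]]></programlisting>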
<para>If applications running in a cloud are not cloud-aware, there
should be clear measures and expectations to define what the
infrastructure can and cannot support. An example would be
shared storage between sites. It is possible; however, such a
solution is not native to OpenStack and requires a third-party
hardware vendor to fulfill such a requirement. Another example
can be seen in applications that are able to consume resources
in object storage directly. These applications need to be
cloud-aware to make good use of an OpenStack Object
Store.</para></section>
<section xml:id="application-readiness">
<title>Application readiness</title>
<para>Some applications are tolerant of the lack of synchronized
object storage, while others may need those objects to be
replicated and available across regions. Understanding how
the cloud implementation impacts new and existing applications
is important for risk mitigation and the overall success of a
cloud project. Applications may have to be written or rewritten
for an infrastructure with little to no redundancy, or with the
cloud in mind.</para></section>
<section xml:id="cost-multi-site">
<title>Cost</title>
<para>A greater number of sites increases the cost and complexity
of a multi-site deployment. Costs can be broken down into the
following categories:</para>
<itemizedlist>
<listitem>
<para>Compute resources</para>
</listitem>
<listitem>
<para>Networking resources</para>
</listitem>
<listitem>
<para>Replication</para>
</listitem>
<listitem>
<para>Storage</para>
</listitem>
<listitem>
<para>Management</para>
</listitem>
<listitem>
<para>Operational costs</para>
</listitem>
</itemizedlist></section>
<section xml:id="site-loss-and-recovery">
<title>Site loss and recovery</title>
<para>Outages can cause partial or full loss of site functionality.
Implement strategies to understand and plan for recovery
scenarios.</para>
<itemizedlist>
<listitem>
<para>The deployed applications need to continue to
function and, more importantly, you must consider the
impact on the performance and reliability of the application
when a site is unavailable.</para>
</listitem>
<listitem>
<para>It is important to understand what happens to the
replication of objects and data between the sites when
a site goes down. If this causes queues to start
building up, consider how long these queues can
safely exist before an error occurs.</para>
</listitem>
<listitem>
<para>After an outage, ensure that a method for resuming proper
operation of the site is in place for when it comes back online.
We recommend you architect the recovery to avoid race conditions.</para>
</listitem>
</itemizedlist></section>
<section xml:id="compliance-and-geo-location-multi-site">
<title>Compliance and geo-location</title>
<para>An organization may have legal obligations and
regulatory compliance measures that require certain
workloads or data not to be located in certain regions.</para></section>
<section xml:id="auditing-multi-site">
<title>Auditing</title>
<para>A well-thought-out auditing strategy is important for
quickly tracking down issues. Keeping track of changes made to
security groups and tenants can be useful for rolling back
those changes if they affect production.
For example, if all security group rules for a tenant
disappeared, the ability to quickly track down the issue would
be important for operational and legal reasons.</para></section>
<section xml:id="separation-of-duties">
<title>Separation of duties</title>
<para>A common requirement is to define different roles for the
different cloud administration functions. An example would be
a requirement to segregate duties and permissions by
site.</para></section>
<section xml:id="authentication-between-sites">
<title>Authentication between sites</title>
<para>We recommend a single authentication domain
rather than a separate implementation for each
site. This requires an authentication mechanism that is highly
available and distributed to ensure continuous operation.
Authentication server locality might be required and should be
planned for.</para></section>
</section>