Remove passive voice from Chap 8 Arch Guide

Change-Id: I7b6698e24c7fcc25c78853980c5be1068e7ee002
Closes-Bug: #1431137
kallimachos 2015-03-13 17:10:44 +10:00
parent f1eb453741
commit a444c8510a
4 changed files with 161 additions and 171 deletions


@ -6,33 +6,31 @@
xml:id="massively_scalable">
<title>Massively scalable</title>
<para>A massively scalable architecture is a cloud
implementation that is either a very large deployment, such as
a commercial service provider might build, or
one that has the capability to support user requests for large
amounts of cloud resources. An example is an
infrastructure in which requests to service 500 or more instances
at a time are common. A massively scalable infrastructure
fulfills such a request without exhausting the available
cloud infrastructure resources. While the high capital cost
of implementing such a cloud architecture means that it is
currently in limited use, many organizations are planning
for massive scalability in the future.</para>
<para>A massively scalable OpenStack cloud design presents a
unique set of challenges and considerations. For the most part
it is similar to a general purpose cloud architecture, as it
is built to address a non-specific range of potential use
cases or functions. Typically, it is rare that particular
workloads determine the design or configuration of massively
scalable clouds. Like the general purpose cloud, the massively
scalable cloud is most often built as a platform for a variety
of workloads. Because private organizations rarely require
or have the resources for them, massively scalable OpenStack clouds
are generally built as commercial, public cloud offerings.</para>
<para>Services provided by a massively scalable OpenStack cloud
include:</para>
<itemizedlist>
<listitem>
<para>Virtual-machine disk image library</para>
@ -64,12 +62,12 @@
</listitem>
</itemizedlist>
<para>Like a general purpose cloud, the instances deployed in a
massively scalable OpenStack cloud do not necessarily use
any specific aspect of the cloud offering (compute, network,
or storage). As the cloud grows in scale, the number of
workloads can cause stress on all the cloud
components. This adds further stresses to supporting
infrastructure such as databases and message brokers. The
architecture design for such a cloud must account for these
performance pressures without negatively impacting user
experience.</para>


@ -6,35 +6,35 @@
xml:id="operational-considerations-massive-scale">
<?dbhtml stop-chunking?>
<title>Operational considerations</title>
<para>In order to run efficiently at massive scale, automate
as many of the operational processes as
possible. Automation includes the configuration of
provisioning, monitoring and alerting systems. Part of the
automation process includes the capability to determine when
human intervention is required and who should act. The
objective is to increase the ratio of operational staff to
running systems as much as possible in order to reduce maintenance
costs. In a massively scaled environment, it is impossible for
staff to give each system individual care.</para>
<para>Configuration management tools such as Puppet and Chef enable
operations staff to categorize systems into groups based on
their roles and thus create configurations and system states
that the provisioning system enforces. Systems
that fall out of the defined state due to errors or failures
are quickly removed from the pool of active nodes and
replaced.</para>
<para>At large scale the resource cost of diagnosing failed individual
systems is far greater than the cost of
replacement. It is more economical to replace the failed
system with a new system, provisioning and configuring it
automatically, then quickly adding it to the
pool of active nodes. By automating tasks that are labor-intensive,
repetitive, and critical to operations, cloud operations
teams can work more
efficiently because fewer resources are required for these
common tasks. Administrators are then free to tackle
tasks that are not easy to automate and that have longer-term
impacts on the business, such as capacity planning.</para>
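<para>For example, a minimal sketch of such a replacement
workflow, assuming hypothetical host names and shared instance
storage, removes a failed compute node from the scheduling pool
and moves its workloads before the node is reprovisioned:</para>
<screen><prompt>$</prompt> <userinput>nova service-disable compute-042.example.com nova-compute</userinput>
<prompt>$</prompt> <userinput>nova host-evacuate --on-shared-storage compute-042.example.com</userinput></screen>
<para>The provisioning system can then reinstall the node
automatically and return it to the pool with
<command>nova service-enable</command>.</para>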
<section xml:id="the-bleeding-edge">
<title>The bleeding edge</title>
<para>Running OpenStack at massive scale requires striking a
@ -42,49 +42,48 @@
be tempting to run an older stable release branch of OpenStack
to make deployments easier. However, when running at massive
scale, known issues that may be of some concern or only have
minimal impact in smaller deployments could become pain points.
Recent releases may address well-known issues. The OpenStack
community can help resolve reported issues by applying
the collective expertise of the OpenStack developers.</para>
<para>The number of organizations running at
massive scale is a small proportion of the
OpenStack community; therefore, it is important to share
related issues with the community and be a vocal advocate for
resolving them. Some issues only manifest when operating at
large scale, and the number of organizations able to duplicate
and validate an issue is small, so it is important to
document and dedicate resources to their resolution.</para>
<para>In some cases, the resolution to the problem is ultimately
to deploy a more recent version of OpenStack. Alternatively,
when you must resolve an issue in a production
environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific
underlying components in order to resolve issues or gain
significant performance improvements. Although this may appear
to expose the deployment to increased risk and instability,
in many cases the older components carry issues that simply
have not been discovered yet.</para>
<para>We recommend building a development and operations
organization that is responsible for creating desired
features, diagnosing and resolving issues, and building the
infrastructure for large scale continuous integration tests
and continuous deployment. This helps catch bugs early and
makes deployments faster and easier. In addition to
development resources, we also recommend the recruitment
of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.</para></section>
<section xml:id="growth-and-capacity-planning">
<title>Growth and capacity planning</title>
<para>An important consideration in running at massive scale is
projecting growth and utilization trends in order to plan capital
expenditures for the short and long term. Gather utilization
metrics for compute, network, and storage, along with historical
records of these metrics. While securing major
anchor tenants can lead to rapid jumps in the utilization
rates of all resources, the steady adoption of the cloud
inside an organization or by consumers in a public
offering also creates a steady trend of increased
utilization.</para>
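<para>As an illustrative sketch, assuming the Telemetry module
(ceilometer) is deployed, resource statistics can be exported
periodically for trending; the meter and query shown here are
examples only:</para>
<screen><prompt>$</prompt> <userinput>ceilometer statistics -m cpu_util -q 'timestamp&gt;2015-01-01T00:00:00' -p 86400</userinput></screen>
<para>Feeding such samples into a time-series store builds the
historical record needed for capacity trending
analysis.</para></section>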
<section xml:id="skills-and-training">
<title>Skills and training</title>
@ -95,8 +94,8 @@
members to OpenStack conferences, meetup events, and
encouraging active participation in the mailing lists and
committees is a very important way to maintain skills and
forge relationships in the community. For a list of OpenStack
training providers in the marketplace, see: <link
xlink:href="http://www.openstack.org/marketplace/training/">http://www.openstack.org/marketplace/training/</link>.
</para>
</section>


@ -10,119 +10,114 @@
xml:id="technical-considerations-massive-scale">
<?dbhtml stop-chunking?>
<title>Technical considerations</title>
<para>Repurposing an existing OpenStack environment to be
massively scalable is a formidable task. When building
a massively scalable environment from the ground up, ensure
you build the initial deployment with the same principles
and choices that apply as the environment grows. For example,
a good approach is to deploy the first site as a multi-site
environment. This enables you to use the same deployment
and segregation methods as the environment grows to separate
locations across dedicated links or wide area networks. In
a hyperscale cloud, scale trumps redundancy. Modify applications
with this in mind, relying on the scale and homogeneity of the
environment to provide reliability rather than redundant
infrastructure provided by non-commodity hardware
solutions.</para>
<section xml:id="infrastructure-segregation-massive-scale">
<title>Infrastructure segregation</title>
<para>OpenStack services support massive horizontal scale.
Be aware that this is not the case for the entire supporting
infrastructure. This is particularly a problem for the database
management systems and message queues that OpenStack services
use for data storage and remote procedure call communications.</para>
<para>Traditional clustering techniques typically
provide high availability and some additional scale for these
environments. In the quest for massive scale, however, you must
take additional steps to relieve the performance
pressure on these components in order to prevent them from negatively
impacting the overall performance of the environment. Ensure
that all the components are in balance so that, if the massively
scalable environment fails, all the components are at or near maximum
capacity.</para>
<para>Regions segregate completely independent
installations linked only by an Identity and Dashboard
(optional) installation. Services have separate
API endpoints for each region, and include separate database
and queue installations. This exposes some awareness of the
environment's fault domains to users and gives them the
ability to ensure some degree of application resiliency while
also imposing the requirement to specify which region to apply
their actions to.</para>
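<para>For illustration, a sketch of registering per-region
endpoints with the version 2 Identity CLI of the time follows;
the URLs and the <literal>$SERVICE_ID</literal> value are
placeholders:</para>
<screen><prompt>$</prompt> <userinput>keystone endpoint-create --region RegionOne --service-id $SERVICE_ID \
  --publicurl 'http://region-one.example.com:8774/v2/%(tenant_id)s'</userinput>
<prompt>$</prompt> <userinput>keystone endpoint-create --region RegionTwo --service-id $SERVICE_ID \
  --publicurl 'http://region-two.example.com:8774/v2/%(tenant_id)s'</userinput></screen>
<para>Users then select the fault domain per request, for
example <command>nova --os-region-name RegionTwo list</command>.</para>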
<para>Environments operating at massive scale typically need their
regions or sites subdivided further without exposing the
requirement to specify the failure domain to the user. This
provides the ability to further divide the installation into
failure domains while also providing a logical unit for
maintenance and the addition of new hardware. At hyperscale,
instead of adding single compute nodes, administrators can add
entire racks or even groups of racks at a time with each new
addition of nodes exposed via one of the segregation concepts
mentioned herein.</para>
<para><glossterm baseform="cell">Cells</glossterm> provide the ability
to subdivide the compute portion
of an OpenStack installation, including regions, while still
exposing a single endpoint. Each region has an API cell
along with a number of compute cells where the
workloads actually run. Each cell has its own database and
message queue setup (ideally clustered), providing the ability
to subdivide the load on these subsystems, improving overall
performance.</para>
<para>Each compute cell provides a complete compute installation,
including full database and queue installations,
scheduler, conductor, and multiple compute hosts. The cells
scheduler handles placement of user requests from the single
API endpoint to a specific cell from those available. The
normal filter scheduler then handles placement within the
cell.</para>
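<para>As a rough sketch, assuming the cells (v1) options
available in releases of this era, each cell declares its role
in the <literal>[cells]</literal> section of its
<filename>nova.conf</filename>; the cell names here are
illustrative:</para>
<programlisting language="ini"># nova.conf in the API cell
[cells]
enable = True
cell_type = api
name = api

# nova.conf in each compute cell
[cells]
enable = True
cell_type = compute
name = cell1</programlisting>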
<para>Unfortunately, Compute is the only OpenStack service that
provides good support for cells. In addition, cells
do not adequately support some standard
OpenStack functionality such as security groups and host
aggregates. Due to their relative newness and specialized use,
cells receive relatively little testing in the OpenStack gate.
Despite these issues, cells play an important role in
well-known OpenStack installations operating at massive scale,
such as those at CERN and Rackspace.</para></section>
<section xml:id="host-aggregates">
<title>Host aggregates</title>
<para>Host aggregates enable partitioning of OpenStack Compute
deployments into logical groups for load balancing and
instance distribution. You can also use host aggregates to
further partition an availability zone. Consider a cloud which
might use host aggregates to partition an availability zone
into groups of hosts that either share common resources, such
as storage and network, or have a special property, such as
trusted computing hardware. You cannot target host aggregates
explicitly. Instead, select instance flavors that map to host
aggregate metadata. These flavors target host aggregates
implicitly.</para>
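<para>For example, a sketch of such implicit targeting, assuming
a hypothetical <literal>m1.ssd</literal> flavor, an aggregate
created with ID 1, and the
<literal>AggregateInstanceExtraSpecsFilter</literal> scheduler
filter enabled:</para>
<screen><prompt>$</prompt> <userinput>nova aggregate-create ssd-hosts</userinput>
<prompt>$</prompt> <userinput>nova aggregate-add-host 1 compute-001.example.com</userinput>
<prompt>$</prompt> <userinput>nova aggregate-set-metadata 1 ssd=true</userinput>
<prompt>$</prompt> <userinput>nova flavor-key m1.ssd set aggregate_instance_extra_specs:ssd=true</userinput></screen>
<para>Instances booted with the <literal>m1.ssd</literal> flavor
then land only on hosts in the aggregate.</para></section>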
<section xml:id="availability-zones">
<title>Availability zones</title>
<para>Availability zones provide another mechanism for subdividing
an installation or region. They are, in effect, host
aggregates exposed for (optional) explicit targeting
by users.</para>
<para>Unlike cells, availability zones do not have their own database
server or queue broker but represent an arbitrary grouping of
compute nodes. Typically, nodes are grouped into availability
zones using a shared failure domain based on a physical
characteristic such as a shared power source or physical network
connections. Users can target exposed availability zones; however,
this is not a requirement. An alternative approach is for the
operator to set a default availability zone so that instances are
scheduled to a zone other than the built-in nova default.</para>
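<para>A brief sketch, with illustrative names: create an
aggregate that exposes an availability zone, let users target it
at boot time, or set it as the scheduling default in
<filename>nova.conf</filename>:</para>
<screen><prompt>$</prompt> <userinput>nova aggregate-create rack-a-hosts az-rack-a</userinput>
<prompt>$</prompt> <userinput>nova boot --image $IMAGE --flavor m1.small --availability-zone az-rack-a testserver</userinput></screen>
<programlisting language="ini">[DEFAULT]
# Schedule instances here when the user specifies no zone
default_schedule_zone = az-rack-a</programlisting></section>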
<section xml:id="segregation-example">
<title>Segregation example</title>
<para>In this example the cloud is divided into two regions, one
for each site, with two availability zones in each based on
the power layout of the data centers. A number of host
aggregates enable targeting of
virtual machine instances using flavors that require special
capabilities shared by the target hosts such as SSDs, 10&nbsp;GbE
networks, or GPU cards.</para>


@ -56,48 +56,47 @@
<listitem>
<para>The cloud user expects repeatable, dependable, and
deterministic processes for launching and deploying
cloud resources. You could deliver this through a
web-based interface or publicly available API
endpoints. All appropriate options for requesting
cloud resources must be available through some type
of user interface, a command-line interface (CLI), or
API endpoints.</para>
</listitem>
<listitem>
<para>Cloud users expect a fully self-service and
on-demand consumption model. When an OpenStack cloud
reaches the "massively scalable" size, it means it is
expected to be consumed "as a service" in each and
reaches the "massively scalable" size, expect
consumption "as a service" in each and
every way.</para>
</listitem>
<listitem>
<para>For a user of a massively scalable OpenStack public
cloud, there are no expectations for control over
security, performance, or availability. Users expect
only SLAs related to uptime of API services, and
very basic SLAs for services offered. It is the user's
responsibility to address
these issues on their own. The exception to this
expectation is the rare case of a massively scalable
cloud infrastructure built for a private or government
organization that has specific requirements.</para>
</listitem>
</itemizedlist>
<para>The cloud user's requirements and expectations that determine
the cloud design focus on the
consumption model. The user expects to consume cloud resources
in an automated and deterministic way,
without any need for knowledge of the capacity, scalability,
or other attributes of the cloud's underlying
infrastructure.</para></section>
<section xml:id="operator-requirements-massive-scale">
<title>Operator requirements</title>
<para>While the cloud user can be completely unaware of the
underlying infrastructure of the cloud and its attributes, the
operator must build and support the infrastructure for operating
at scale. This presents a very demanding set of requirements
for building such a cloud from the operator's perspective:</para>
<itemizedlist>
<listitem>
<para>First and foremost, everything must be capable of
@ -105,7 +104,7 @@
compute hardware, storage hardware, or networking
hardware, to the installation and configuration of the
supporting software, everything must be capable of
automation. Manual processes are impractical in
a massively scalable OpenStack design
architecture.</para>
</listitem>
@ -127,13 +126,13 @@
<listitem>
<para>Companies operating a massively scalable OpenStack
cloud also require that operational expenditures
(OpEx) be minimized as much as possible. We
recommend using cloud-optimized hardware when
managing operational overhead. Some of
the factors to consider include power,
cooling, and the physical design of the chassis. Through
customization, it is possible to optimize the hardware
and systems for this type of workload because of the
scale of these implementations.</para>
</listitem>
<listitem>
@ -144,16 +143,16 @@
infrastructure. This includes full scale metering of
the hardware and software status. A corresponding
framework of logging and alerting is also required to
store and enable operations to act on the metrics
provided by the metering and monitoring solutions.
The cloud operator also needs a solution that uses the
data provided by the metering and monitoring solution
to provide capacity planning and capacity trending
analysis.</para>
</listitem>
<listitem>
<para>Invariably, massively scalable OpenStack clouds extend
over several sites. Therefore, the user-operator
requirements for a multi-site OpenStack architecture
design are also applicable here. This includes various
legal requirements for data storage, data placement,
@ -161,18 +160,17 @@
compliance requirements; image
consistency-availability; storage replication and
availability (both block and file/object storage); and
authentication, authorization, and auditing (AAA).
See <xref linkend="multi_site"/>
for more details on requirements and considerations
for multi-site OpenStack clouds.</para>
</listitem>
<listitem>
<para>The design architecture of a massively scalable OpenStack
cloud must address considerations around physical
facilities such as space, floor weight, rack height and type,
environmental considerations, power usage and power
usage efficiency (PUE), and physical security.</para>
</listitem>
</itemizedlist></section>
</section>