[arch-guide-archive] Removing old arch guide from master

This guide still exists in the before-migration tag. Removing it
from master as it is no longer required.

Change-Id: Ie7f050518e6faca46f923ee960414f690ed59253
Alexandra Settle 2017-08-08 11:42:07 +01:00
parent e3afb934d6
commit 8454a103ac
82 changed files with 0 additions and 7299 deletions


@@ -33,6 +33,4 @@ declare -A SPECIAL_BOOKS=(
    ["contributor-guide"]="skip"
    ["releasenotes"]="skip"
    ["ha-guide-draft"]="skip"
    # Skip old arch design, will be archived
    ["arch-design-to-archive"]="skip"
)


@@ -1,27 +0,0 @@
[metadata]
name = architecturedesignguide
summary = OpenStack Architecture Design Guide
author = OpenStack
author-email = openstack-docs@lists.openstack.org
home-page = https://docs.openstack.org/
classifier =
    Environment :: OpenStack
    Intended Audience :: Information Technology
    Intended Audience :: Cloud Architects
    License :: OSI Approved :: Apache Software License
    Operating System :: POSIX :: Linux
    Topic :: Documentation

[global]
setup-hooks =
    pbr.hooks.setup_hook

[files]

[build_sphinx]
warning-is-error = 1
build-dir = build
source-dir = source

[wheel]
universal = 1


@@ -1,30 +0,0 @@
#!/usr/bin/env python
# Copyright (c) 2013 Hewlett-Packard Development Company, L.P.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# THIS FILE IS MANAGED BY THE GLOBAL REQUIREMENTS REPO - DO NOT EDIT
import setuptools
# In python < 2.7.4, a lazy loading of package `pbr` will break
# setuptools if some other modules registered functions in `atexit`.
# solution from: http://bugs.python.org/issue15881#msg170215
try:
    import multiprocessing  # noqa
except ImportError:
    pass

setuptools.setup(
    setup_requires=['pbr'],
    pbr=True)


@@ -1 +0,0 @@
../../common


@@ -1,212 +0,0 @@
============
Architecture
============
The hardware selection covers three areas:
* Compute
* Network
* Storage
Compute-focused OpenStack clouds have high demands on processor and
memory resources, and require hardware that can handle these demands.
Consider the following factors when selecting compute (server) hardware:
* Server density
* Resource capacity
* Expandability
* Cost
Weigh these considerations against each other to determine the best
design for the desired purpose. For example, increasing server density
means sacrificing resource capacity or expandability.
A compute-focused cloud should have an emphasis on server hardware that
can offer more CPU sockets, more CPU cores, and more RAM. Network
connectivity and storage capacity are less critical.
When designing a compute-focused OpenStack architecture, you must
consider whether you intend to scale up or scale out. Selecting a
smaller number of larger hosts, or a larger number of smaller hosts,
depends on a combination of factors: cost, power, cooling, physical rack
and floor space, support-warranty, and manageability.
Considerations for selecting hardware:
* Most blade servers can support dual-socket multi-core CPUs. To go
beyond this dual-socket limit, select ``full width`` or ``full height``
blades. Be aware, however, that this also decreases server density. For example,
high density blade servers such as HP BladeSystem or Dell PowerEdge
M1000e support up to 16 servers in only ten rack units. Using
half-height blades is twice as dense as using full-height blades,
which results in only eight servers per ten rack units.
* 1U rack-mounted servers that occupy only a single rack unit may offer
greater server density than a blade server solution. It is possible
to place forty 1U servers in a rack, providing space for the top of
rack (ToR) switches, compared to 32 full width blade servers.
* 2U rack-mounted servers provide quad-socket, multi-core CPU support,
but with a corresponding decrease in server density (half the density
that 1U rack-mounted servers offer).
* Larger rack-mounted servers, such as 4U servers, often provide even
greater CPU capacity, commonly supporting four or even eight CPU
sockets. These servers have greater expandability, but such servers
have much lower server density and are often more expensive.
* ``Sled servers`` are rack-mounted servers that support multiple
independent servers in a single 2U or 3U enclosure. These deliver
higher density as compared to typical 1U or 2U rack-mounted servers.
For example, many sled servers offer four independent dual-socket
nodes in 2U for a total of eight CPU sockets in 2U.
Consider these when choosing server hardware for a compute-focused
OpenStack design architecture:
* Instance density
* Host density
* Power and cooling density
Selecting networking hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of the key considerations for networking hardware selection
include:
* Port count
* Port density
* Port speed
* Redundancy
* Power requirements
We recommend designing the network architecture using a scalable network
model that makes it easy to add capacity and bandwidth. A good example
of such a model is the leaf-spine model. In this type of network
design, it is possible to easily add additional bandwidth as well as
scale out to additional racks of gear. It is important to select network
hardware that supports the required port count, port speed, and port
density while also allowing for future growth as workload demands
increase. It is also important to evaluate where in the network
architecture it is valuable to provide redundancy.
Operating system and hypervisor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The selection of operating system (OS) and hypervisor has a significant
impact on the end point design.
OS and hypervisor selection impact the following areas:
* Cost
* Supportability
* Management tools
* Scale and performance
* Security
* Supported features
* Interoperability
OpenStack components
~~~~~~~~~~~~~~~~~~~~
The selection of OpenStack components is important. There are certain
components that are required, for example the compute and image
services, but others, such as the Orchestration service, may not be
present.
For a compute-focused OpenStack design architecture, the following
components may be present:
* Identity (keystone)
* Dashboard (horizon)
* Compute (nova)
* Object Storage (swift)
* Image (glance)
* Networking (neutron)
* Orchestration (heat)
.. note::

   A compute-focused design is less likely to include OpenStack Block
   Storage. However, there may be some situations where the need for
   performance requires a block storage component to improve data I/O.
The exclusion of certain OpenStack components might also limit the
functionality of other components. If a design includes the
Orchestration service but excludes the Telemetry service, then the
design cannot take advantage of Orchestration's auto scaling
functionality as this relies on information from Telemetry.
Networking software
~~~~~~~~~~~~~~~~~~~
OpenStack Networking provides a wide variety of networking services for
instances. There are many additional networking software packages that
might be useful to manage the OpenStack components themselves. The
`OpenStack High Availability Guide <https://docs.openstack.org/ha-guide/>`_
describes some of these software packages in more detail.
For a compute-focused OpenStack cloud, the OpenStack infrastructure
components must be highly available. If the design does not include
hardware load balancing, you must add networking software packages, for
example, HAProxy.
Management software
~~~~~~~~~~~~~~~~~~~
The selected supplemental software solution affects the overall
OpenStack cloud design. This includes software for providing
clustering, logging, monitoring, and alerting.
Availability requirements in the design are the main determinant for the
inclusion of clustering software, such as Corosync or Pacemaker.
Operational considerations determine the requirements for logging,
monitoring, and alerting. Each of these sub-categories includes various
options.
Some other potential design impacts include:
OS-hypervisor combination
Ensure that the selected logging, monitoring, or alerting tools
support the proposed OS-hypervisor combination.
Network hardware
The logging, monitoring, and alerting software must support the
network hardware selection.
Database software
~~~~~~~~~~~~~~~~~
A large majority of OpenStack components require access to back-end
database services to store state and configuration information. Select
an appropriate back-end database that satisfies the availability and
fault tolerance requirements of the OpenStack services. OpenStack
services support connecting to any database that the SQLAlchemy Python
drivers support; however, most common database deployments make use of
MySQL or some variation of it. We recommend that you make the database
that provides back-end services within a general-purpose cloud highly
available. Some of the more common software solutions include Galera,
MariaDB, and MySQL with multi-master replication.
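
Because OpenStack services reach the database through SQLAlchemy, a
highly available deployment typically points the connection string at a
single stable endpoint, such as a virtual IP in front of a Galera
cluster. A minimal sketch follows; the host name and credentials are
placeholders, not values this guide prescribes:

.. code-block:: python

   # Connectivity check against an assumed Galera VIP; the URL is a
   # placeholder, not a value this guide prescribes.
   import sqlalchemy

   engine = sqlalchemy.create_engine(
       'mysql+pymysql://nova:secret@db-vip.example.com/nova')
   with engine.connect() as connection:
       print(connection.execute(sqlalchemy.text('SELECT 1')).scalar())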


@@ -1,68 +0,0 @@
==========================
Operational considerations
==========================
There are a number of operational considerations that affect the design
of compute-focused OpenStack clouds, including:
* Enforcing strict API availability requirements
* Understanding and dealing with failure scenarios
* Managing host maintenance schedules
Service-level agreements (SLAs) are contractual obligations that ensure
the availability of a service. When designing an OpenStack cloud,
factoring in promises of availability implies a certain level of
redundancy and resiliency.
Monitoring
~~~~~~~~~~
OpenStack clouds require appropriate monitoring platforms to catch and
manage errors.
.. note::

   We recommend leveraging existing monitoring systems to see if they
   are able to effectively monitor an OpenStack environment.
Specific meters that are critically important to capture include:
* Image disk utilization
* Response time to the Compute API
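
As an illustration, a response-time sample for the Compute API can be
taken with a few lines of Python. The endpoint URL and token below are
placeholders for values obtained from the Identity service:

.. code-block:: python

   # Sample the Compute API response time; the endpoint and token are
   # placeholders, not values this guide prescribes.
   import time

   import requests

   endpoint = 'http://controller:8774/v2.1/servers'
   headers = {'X-Auth-Token': 'TOKEN'}

   start = time.time()
   response = requests.get(endpoint, headers=headers)
   print(response.status_code, '%.3f seconds' % (time.time() - start))
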
Capacity planning
~~~~~~~~~~~~~~~~~
Adding extra capacity to an OpenStack cloud is a horizontal scaling
process.
We recommend similar (or the same) CPUs when adding extra nodes to the
environment. This reduces the chance of breaking live-migration features
if they are present. Scaling out hypervisor hosts also has a direct
effect on network and other data center resources. We recommend you
factor in this increase when reaching rack capacity or when requiring
extra network switches.
Changing the internal components of a Compute host to account for
increases in demand is a process known as vertical scaling. Swapping a
CPU for one with more cores, or increasing the memory in a server, can
help add extra capacity for running applications.
Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.
.. note::

   It is important to remember that changing the CPU overcommit ratio
   can have a detrimental effect and cause a potential increase in
   noisy neighbor issues.

The added risk of increasing the overcommit ratio is that more instances
fail when a compute host fails. We do not recommend increasing the
CPU overcommit ratio in a compute-focused OpenStack design, as it
increases the potential for noisy neighbor issues.
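
To illustrate why, a quick calculation shows how many instances a
single host failure takes down at different overcommit ratios. The host
size and flavor here are arbitrary examples:

.. code-block:: python

   # Instances lost when one compute host fails, assuming 8 physical
   # cores per host and 2 vCPUs per instance; purely illustrative.
   cores_per_host = 8
   vcpus_per_instance = 2
   for ratio in (4, 8, 16):
       lost = (cores_per_host * ratio) // vcpus_per_instance
       print('%2d:1 overcommit -> up to %d instances lost' % (ratio, lost))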


@@ -1,126 +0,0 @@
=====================
Prescriptive examples
=====================
The Conseil Européen pour la Recherche Nucléaire (CERN), also known as
the European Organization for Nuclear Research, provides particle
accelerators and other infrastructure for high-energy physics research.
As of 2011, CERN operated two compute centers in Europe, with plans
to add a third.
+-----------------------+------------------------+
| Data center | Approximate capacity |
+=======================+========================+
| Geneva, Switzerland   | - 3.5 megawatts        |
| | |
| | - 91000 cores |
| | |
| | - 120 PB HDD |
| | |
| | - 100 PB Tape |
| | |
| | - 310 TB Memory |
+-----------------------+------------------------+
| Budapest, Hungary     | - 2.5 megawatts        |
| | |
| | - 20000 cores |
| | |
| | - 6 PB HDD |
+-----------------------+------------------------+
To support a growing number of compute-heavy users of experiments
related to the Large Hadron Collider (LHC), CERN ultimately elected to
deploy an OpenStack cloud using Scientific Linux and RDO. This effort
aimed to simplify the management of the center's compute resources with
a view to doubling compute capacity through the addition of a data
center in 2013 while maintaining the same levels of compute staff.
The CERN solution uses :term:`cells <cell>` for segregation of compute
resources and for transparently scaling between different data centers.
This decision meant trading off support for security groups and live
migration. In addition, they must manually replicate some details, like
flavors, across cells. In spite of these drawbacks cells provide the
required scale while exposing a single public API endpoint to users.
CERN created a compute cell for each of the two original data centers
and created a third when it added a new data center in 2013. Each cell
contains three availability zones to further segregate compute resources
and at least three RabbitMQ message brokers configured for clustering
with mirrored queues for high availability.
The API cell, which resides behind an HAProxy load balancer, is in the
data center in Switzerland and directs API calls to compute cells using
a customized variation of the cell scheduler. The customizations allow
certain workloads to route to a specific data center or all data
centers, with cell RAM availability determining cell selection in the
latter case.
.. figure:: figures/Generic_CERN_Example.png
There is also some customization of the filter scheduler that handles
placement within the cells:
ImagePropertiesFilter
Provides special handling depending on the guest operating system in
use (Linux-based or Windows-based).
ProjectsToAggregateFilter
Provides special handling depending on which project the instance is
associated with.
default_schedule_zones
Allows the selection of multiple default availability zones, rather
than a single default.
A central database team manages the MySQL database server in each cell
in an active/passive configuration with a NetApp storage back end.
Backups run every 6 hours.
Network architecture
~~~~~~~~~~~~~~~~~~~~
To integrate with existing networking infrastructure, CERN made
customizations to legacy networking (nova-network). This was in the form
of a driver to integrate with CERN's existing database for tracking MAC
and IP address assignments.
The driver facilitates selection of a MAC address and IP for new
instances based on the compute node where the scheduler places the
instance.
The driver considers the compute node where the scheduler placed an
instance and selects a MAC address and IP from the pre-registered list
associated with that node in the database. The database updates to
reflect the address assignment to that instance.
Storage architecture
~~~~~~~~~~~~~~~~~~~~
CERN deploys the OpenStack Image service in the API cell and configures
it to expose version 1 (V1) of the API. This also requires the image
registry. The storage back end in use is a 3 PB Ceph cluster.
CERN maintains a small set of Scientific Linux 5 and 6 images onto which
orchestration tools can place applications. Puppet manages instance
configuration and customization.
Monitoring
~~~~~~~~~~
CERN does not require direct billing, but uses the Telemetry service to
perform metering for the purposes of adjusting project quotas. CERN uses
a sharded, replicated MongoDB back end. To spread API load, CERN
deploys instances of the nova-api service within the child cells for
Telemetry to query against. This also requires the configuration of
supporting services such as keystone, glance-api, and glance-registry in
the child cells.
.. figure:: figures/Generic_CERN_Architecture.png
Additional monitoring tools in use include
`Flume <https://flume.apache.org/>`__, `Elastic
Search <https://www.elastic.co/>`__,
`Kibana <https://www.elastic.co/products/kibana>`__, and the CERN
developed `Lemon <http://lemon.web.cern.ch/lemon/index.shtml>`__
project.


@@ -1,214 +0,0 @@
========================
Technical considerations
========================
In a compute-focused OpenStack cloud, the type of instance workloads you
provision heavily influences technical decision making.
Public and private clouds require deterministic capacity planning to
support elastic growth in order to meet user SLA expectations.
Deterministic capacity planning is the path to predicting the effort and
expense of making a given process perform consistently. This process is
important because, when a service becomes a critical part of a user's
infrastructure, the user's experience links directly to the SLAs of the
cloud itself.
There are two aspects of capacity planning to consider:
* Planning the initial deployment footprint
* Planning expansion of the environment to stay ahead of cloud user demands
Begin planning an initial OpenStack deployment footprint with
estimations of expected uptake, and existing infrastructure workloads.
The starting point is the core count of the cloud. By applying relevant
ratios, the user can gather information about:
* The number of expected concurrent instances: (overcommit fraction ×
cores) / virtual cores per instance
* Required storage: flavor disk size × number of instances
These ratios determine the amount of additional infrastructure needed to
support the cloud. For example, consider a situation in which you
require 1600 instances, each with 2 vCPU and 50 GB of storage. Assuming
the default overcommit rate of 16:1, working out the math provides an
equation of:
* 1600 = (16 × (number of physical cores)) / 2
* Storage required = 50 GB × 1600
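
A short script makes this arithmetic easy to check or adapt. This is an
illustrative sketch only; the variable names are not part of any
OpenStack API:

.. code-block:: python

   # Worked example from this section: 1600 instances of 2 vCPUs and
   # 50 GB each, with the default 16:1 CPU overcommit ratio.
   instances = 1600
   vcpus_per_instance = 2
   cpu_overcommit = 16
   disk_gb_per_instance = 50

   physical_cores = instances * vcpus_per_instance / cpu_overcommit
   storage_gb = instances * disk_gb_per_instance

   print(physical_cores)  # 200.0 physical cores
   print(storage_gb)      # 80000 GB, that is, 80 TB
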
On the surface, the equations reveal the need for 200 physical cores and
80 TB of storage for ``/var/lib/nova/instances/``. However, it is also
important to look at patterns of usage to estimate the load that the API
services, database servers, and queue servers are likely to encounter.
Aside from the creation and termination of instances, consider the
impact of users accessing the service, particularly on nova-api and its
associated database. Listing instances gathers a great deal of
information and given the frequency with which users run this operation,
a cloud with a large number of users can increase the load
significantly. This can even occur unintentionally. For example, the
OpenStack Dashboard instances tab refreshes the list of instances every
30 seconds, so leaving it open in a browser window can cause unexpected
load.
Consideration of these factors can help determine how many cloud
controller cores you require. A server with 8 CPU cores and 8 GB of RAM
would be sufficient for a rack of compute nodes, given the above
caveats.
Key hardware specifications are also crucial to the performance of user
instances. Be sure to consider budget and performance needs, including
storage performance (spindles/core), memory availability (RAM/core),
network bandwidth (Gbps/core), and overall CPU performance (CPU/core).
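
As a rough way to compare candidate hardware on these ratios, the
per-core figures can be tabulated with a short script. The server
configurations below are invented examples, not recommendations:

.. code-block:: python

   # Compare hypothetical servers on the per-core ratios named above.
   servers = {
       'dual-socket 1U': {'cores': 24, 'ram_gb': 256,
                          'net_gbps': 20, 'spindles': 4},
       'quad-socket 4U': {'cores': 48, 'ram_gb': 768,
                          'net_gbps': 40, 'spindles': 8},
   }
   for name, s in servers.items():
       print('%s: %.1f GB RAM/core, %.2f Gbps/core, %.2f spindles/core'
             % (name, s['ram_gb'] / s['cores'],
                s['net_gbps'] / s['cores'],
                s['spindles'] / s['cores']))
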
The cloud resource calculator is a useful tool in examining the impacts
of different hardware and instance load outs. See `cloud-resource-calculator
<https://github.com/noslzzp/cloud-resource-calculator/blob/master/cloud-resource-calculator.ods>`_.
Expansion planning
~~~~~~~~~~~~~~~~~~
A key challenge for planning the expansion of cloud compute services is
the elastic nature of cloud infrastructure demands.
Planning for expansion is a balancing act. Planning too conservatively
can lead to unexpected oversubscription of the cloud and dissatisfied
users. Planning for cloud expansion too aggressively can lead to
unexpected underuse of the cloud and funds spent unnecessarily
on operating infrastructure.
The key is to carefully monitor the trends in cloud usage over time. The
intent is to measure the consistency with which you deliver services,
not the average speed or capacity of the cloud. Using this information
to model capacity performance enables users to more accurately determine
the current and future capacity of the cloud.
CPU and RAM
~~~~~~~~~~~
OpenStack enables users to overcommit CPU and RAM on compute nodes. This
allows an increase in the number of instances running on the cloud at
the cost of reducing the performance of the instances. OpenStack Compute
uses the following ratios by default:
* CPU allocation ratio: 16:1
* RAM allocation ratio: 1.5:1
The default CPU allocation ratio of 16:1 means that the scheduler
allocates up to 16 virtual cores per physical core. For example, if a
physical node has 12 cores, the scheduler sees 192 available virtual
cores. With typical flavor definitions of 4 virtual cores per instance,
this ratio would provide 48 instances on a physical node.
Similarly, the default RAM allocation ratio of 1.5:1 means that the
scheduler allocates instances to a physical node as long as the total
amount of RAM associated with the instances is less than 1.5 times the
amount of RAM available on the physical node.
You must select the appropriate CPU and RAM allocation ratio based on
particular use cases.
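
The scheduler arithmetic implied by these ratios can be sketched as
follows, using the 12-core example above. The RAM figure is an assumed
example, and this is not actual nova scheduler code:

.. code-block:: python

   # Schedulable capacity for one compute node under the default ratios.
   cpu_allocation_ratio = 16.0
   ram_allocation_ratio = 1.5

   physical_cores = 12
   physical_ram_mb = 96 * 1024  # assumed example value

   virtual_cores = physical_cores * cpu_allocation_ratio
   schedulable_ram_mb = physical_ram_mb * ram_allocation_ratio

   vcpus_per_instance = 4  # typical flavor from the example above
   print(int(virtual_cores))                        # 192 virtual cores
   print(int(virtual_cores // vcpus_per_instance))  # 48 instances
   print(int(schedulable_ram_mb))                   # 147456 MB schedulable RAM
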
Additional hardware
~~~~~~~~~~~~~~~~~~~
Certain use cases may benefit from exposure to additional devices on the
compute node. Examples might include:
* High performance computing jobs that benefit from the availability of
graphics processing units (GPUs) for general-purpose computing.
* Cryptographic routines that benefit from the availability of hardware
random number generators to avoid entropy starvation.
* Database management systems that benefit from the availability of
SSDs for ephemeral storage to maximize read/write time.
Host aggregates group hosts that share similar characteristics, which
can include hardware similarities. The addition of specialized hardware
to a cloud deployment is likely to add to the cost of each node, so
consider carefully whether all compute nodes, or just a subset targeted
by flavors, need the additional customization to support the desired
workloads.
Utilization
~~~~~~~~~~~
Infrastructure-as-a-Service offerings, including OpenStack, use flavors
to provide standardized views of virtual machine resource requirements
that simplify the problem of scheduling instances while making the best
use of the available physical resources.
In order to facilitate packing of virtual machines onto physical hosts,
the default selection of flavors provides a second largest flavor that
is half the size of the largest flavor in every dimension. It has half
the vCPUs, half the vRAM, and half the ephemeral disk space. The next
largest flavor is half that size again. The following figure provides a
visual representation of this concept for a general purpose computing
design:
.. figure:: figures/Compute_Tech_Bin_Packing_General1.png
The following figure displays a CPU-optimized, packed server:
.. figure:: figures/Compute_Tech_Bin_Packing_CPU_optimized1.png
These default flavors are well suited to typical configurations of
commodity server hardware. To maximize utilization, however, it may be
necessary to customize the flavors or create new ones in order to better
align instance sizes to the available hardware.
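
The halving pattern described above is simple to express in code. This
sketch starts from an arbitrary largest flavor and is illustrative
only:

.. code-block:: python

   # Generate a flavor ladder where each flavor is half the previous
   # one in every dimension, as in the default flavor set.
   vcpus, ram_mb, disk_gb = 32, 65536, 320  # assumed largest flavor
   while vcpus >= 1:
       print('%2d vCPUs, %6d MB RAM, %3d GB disk'
             % (vcpus, ram_mb, disk_gb))
       vcpus //= 2
       ram_mb //= 2
       disk_gb //= 2
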
Workload characteristics may also influence hardware choices and flavor
configuration, particularly where they present different ratios of CPU
versus RAM versus HDD requirements.
For more information on flavors, see `OpenStack Operations Guide:
Flavors <https://docs.openstack.org/ops-guide/ops-user-facing-operations.html#flavors>`_.
OpenStack components
~~~~~~~~~~~~~~~~~~~~
Due to the nature of the workloads in this scenario, a number of
components are highly beneficial for a Compute-focused cloud. This
includes the typical OpenStack components:
* :term:`Compute service (nova)`
* :term:`Image service (glance)`
* :term:`Identity service (keystone)`
Also consider several specialized components:
* :term:`Orchestration service (heat)`
Given the nature of the applications involved in this scenario, these
are heavily automated deployments. Making use of Orchestration is
highly beneficial in this case. You can script the deployment of a
batch of instances and the running of tests, but it makes sense to
use the Orchestration service to handle all these actions.
* :term:`Telemetry service (telemetry)`
Telemetry and the alarms it generates support autoscaling of
instances using Orchestration. Users that are not using the
Orchestration service do not need to deploy the Telemetry service and
may choose to use external solutions to fulfill their metering and
monitoring requirements.
* :term:`Block Storage service (cinder)`
Due to the burstable nature of the workloads and the applications
and instances that perform batch processing, this cloud mainly uses
memory or CPU, so the need for add-on storage to each instance is not
a likely requirement. This does not mean that you do not use
OpenStack Block Storage (cinder) in the infrastructure, but typically
it is not a central component.
* :term:`Networking service (neutron)`
When choosing a networking platform, ensure that it either works with
all desired hypervisor and container technologies and their OpenStack
drivers, or that it includes an implementation of an ML2 mechanism
driver. You can mix networking platforms that provide ML2 mechanism
drivers.


@@ -1,34 +0,0 @@
===============
Compute focused
===============
.. toctree::
   :maxdepth: 2

   compute-focus-technical-considerations.rst
   compute-focus-operational-considerations.rst
   compute-focus-architecture.rst
   compute-focus-prescriptive-examples.rst
Compute-focused clouds are a specialized subset of the general
purpose OpenStack cloud architecture. A compute-focused cloud
specifically supports compute intensive workloads.
.. note::

   Compute intensive workloads may be CPU intensive, RAM intensive,
   or both; they are not typically storage or network intensive.
Compute-focused workloads may include the following use cases:
* High performance computing (HPC)
* Big data analytics using Hadoop or other distributed data stores
* Continuous integration/continuous deployment (CI/CD)
* Platform-as-a-Service (PaaS)
* Signal processing for network function virtualization (NFV)
.. note::

   A compute-focused OpenStack cloud does not typically use raw
   block storage services as it does not host applications that
   require persistent block storage.


@@ -1,291 +0,0 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import os
# import sys
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
# sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['openstackdocstheme']
# Add any paths that contain templates here, relative to this directory.
# templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
# source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
repository_name = "openstack/openstack-manuals"
bug_project = 'openstack-manuals'
project = u'Architecture Design Guide'
bug_tag = u'arch-design-to-archive'
copyright = u'2015-2017, OpenStack contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.9'
# The full version, including alpha/beta/rc tags.
release = '0.9'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
# language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['common/cli*', 'common/nova*', 'common/get-started-*']
# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
# add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
# modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
# keep_warnings = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'openstackdocs'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
# html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = [openstackdocstheme.get_html_theme_path()]
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
# html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
# html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
# html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
# html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = []
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
# html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
# So that we can enable "log-a-bug" links from each output HTML page, this
# variable must be set to a format that includes year, month, day, hours and
# minutes.
html_last_updated_fmt = '%Y-%m-%d %H:%M'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
# html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
# html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
# html_additional_pages = {}
# If false, no module index is generated.
# html_domain_indices = True
# If false, no index is generated.
html_use_index = False
# If true, the index is split into individual pages for each letter.
# html_split_index = False
# If true, links to the reST sources are added to the pages.
html_show_sourcelink = False
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
# html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
# html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
# html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
# html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'arch-design-to-archive'
# If true, publish source files
html_copy_source = False
# -- Options for LaTeX output ---------------------------------------------
latex_engine = 'xelatex'
latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    # 'papersize': 'letterpaper',

    # set font (TODO: different fonts for translated PDF document builds)
    'fontenc': '\\usepackage{fontspec}',
    'fontpkg': '''\
\defaultfontfeatures{Scale=MatchLowercase}
\setmainfont{Liberation Serif}
\setsansfont{Liberation Sans}
\setmonofont[SmallCapsFont={Liberation Mono}]{Liberation Mono}
''',

    # The font size ('10pt', '11pt' or '12pt').
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    # 'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
    ('index', 'ArchGuideRst.tex', u'Architecture Design Guide',
     u'OpenStack contributors', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
# latex_use_parts = False
# If true, show page references after internal links.
# latex_show_pagerefs = False
# If true, show URL addresses after external links.
# latex_show_urls = False
# Documents to append as an appendix to all manuals.
# latex_appendices = []
# If false, no module index is generated.
# latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'ArchDesignRst', u'Architecture Design Guide',
     [u'OpenStack contributors'], 1)
]
# If true, show URL addresses after external links.
# man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
    ('index', 'ArchDesignRst', u'Architecture Design Guide',
     u'OpenStack contributors', 'ArchDesignRst',
     'To reap the benefits of OpenStack, you should plan, design, '
     'and architect your cloud properly, taking user needs into '
     'account and understanding the use cases.', 'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
# texinfo_appendices = []
# If false, no module index is generated.
# texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
# texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
# texinfo_no_detailmenu = False
# -- Options for Internationalization output ------------------------------
locale_dirs = ['locale/']

[24 binary image files (the guide's figures) deleted; contents not shown]


@@ -1,483 +0,0 @@
============
Architecture
============
Hardware selection involves three key areas:
* Compute
* Network
* Storage
Hardware for a general purpose OpenStack cloud should reflect a cloud
with no pre-defined usage model, designed to run a wide variety of
applications with varying resource usage requirements. These
applications include any of the following:
* RAM-intensive
* CPU-intensive
* Storage-intensive
Certain hardware form factors may better suit a general purpose
OpenStack cloud due to the requirement for equal (or nearly equal)
balance of resources. Server hardware must provide the following:
* Equal (or nearly equal) balance of compute capacity (RAM and CPU)
* Network capacity (number and speed of links)
* Storage capacity (gigabytes or terabytes as well as :term:`Input/Output
Operations Per Second (IOPS)`)
Evaluate server hardware around four conflicting dimensions:
Server density
A measure of how many servers can fit into a given measure of
physical space, such as a rack unit [U].
Resource capacity
The number of CPU cores, amount of RAM, or amount of deliverable
storage.
Expandability
Limit of additional resources you can add to a server.
Cost
The relative purchase price of the hardware weighted against the
level of design effort needed to build the system.
Increasing server density means sacrificing resource capacity or
expandability; however, increasing resource capacity and expandability
increases cost and decreases server density. As a result, determining
the best server hardware for a general purpose OpenStack architecture
means understanding how choice of form factor will impact the rest of
the design. The following list outlines the form factors to choose from:
* Blade servers typically support dual-socket multi-core CPUs. Blades
also offer outstanding density.
* 1U rack-mounted servers occupy only a single rack unit. Their
benefits include high density, support for dual-socket multi-core
CPUs, and support for reasonable RAM amounts. This form factor offers
limited storage capacity, limited network capacity, and limited
expandability.
* 2U rack-mounted servers offer the expanded storage and networking
capacity that 1U servers tend to lack, but with a corresponding
decrease in server density (half the density offered by 1U
rack-mounted servers).
* Larger rack-mounted servers, such as 4U servers, will tend to offer
even greater CPU capacity, often supporting four or even eight CPU
sockets. These servers often have much greater expandability so will
provide the best option for upgradability. This means, however, that
the servers have a much lower server density and a much greater
hardware cost.
* *Sled servers* are rack-mounted servers that support multiple
independent servers in a single 2U or 3U enclosure. This form factor
offers increased density over typical 1U-2U rack-mounted servers but
tends to suffer from limitations in the amount of storage or network
capacity each individual server supports.
The best form factor for server hardware supporting a general purpose
OpenStack cloud is driven by outside business and cost factors. No
single reference architecture applies to all implementations; the
decision must flow from user requirements, technical considerations, and
operational considerations. Here are some of the key factors that
influence the selection of server hardware:
Instance density
Sizing is an important consideration for a general purpose OpenStack
cloud. The expected or anticipated number of instances that each
hypervisor can host is a common meter used in sizing the deployment.
The selected server hardware needs to support the expected or
anticipated instance density.
Host density
Physical data centers have limited physical space, power, and
cooling. The number of hosts (or hypervisors) that can be fitted
into a given metric (rack, rack unit, or floor tile) is another
important method of sizing. Floor weight is an often overlooked
consideration. The data center floor must be able to support the
weight of the proposed number of hosts within a rack or set of
racks. These factors need to be applied as part of the host density
calculation and server hardware selection.
Power density
Data centers have a specified amount of power fed to a given rack or
set of racks. Older data centers may have a power density as low as
20 amps per rack, while more recent data centers can be architected
to support power densities as high as 120 amps per rack.
The selected server hardware must take power density into account.
Network connectivity
The selected server hardware must have the appropriate number of
network connections, as well as the right type of network
connections, in order to support the proposed architecture. Ensure
that, at a minimum, there are at least two diverse network
connections coming into each rack.
The selection of form factors or architectures affects the selection of
server hardware. Ensure that the selected server hardware is configured
to support enough storage capacity (or storage expandability) to match
the requirements of selected scale-out storage solution. Similarly, the
network architecture impacts the server hardware selection and vice
versa.
Selecting storage hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~
The storage hardware architecture follows from the selected storage
architecture. Select the storage architecture by evaluating possible
solutions against the critical factors: user requirements, technical
considerations, and operational considerations.
Incorporate the following factors into your storage architecture:
Cost
Storage can be a significant portion of the overall system cost. For
an organization that is concerned with vendor support, a commercial
storage solution is advisable, although it comes with a higher price
tag. If minimizing initial capital expenditure is the priority, a design
based on commodity hardware would apply. The trade-off is
potentially higher support costs and a greater risk of
incompatibility and interoperability issues.
Scalability
Scalability, along with expandability, is a major consideration in a
general purpose OpenStack cloud. It might be difficult to predict
the final intended size of the implementation as there are no
established usage patterns for a general purpose cloud. It might
become necessary to expand the initial deployment in order to
accommodate growth and user demand.
Expandability
Expandability is a major architecture factor for storage solutions
with general purpose OpenStack cloud. A storage solution that
expands to 50 PB is considered more expandable than a solution that
only scales to 10 PB. This meter is related to scalability, which is
the measure of a solution's performance as it expands.
Using a scale-out storage solution with direct-attached storage (DAS) in
the servers is well suited for a general purpose OpenStack cloud. Cloud
services requirements determine your choice of scale-out solution. You
need to determine whether a single, highly expandable and highly
vertically scalable, centralized storage array is suitable for your
design. After determining an approach, select the storage hardware
based on these criteria.
This list expands upon the potential impacts for including a particular
storage architecture (and corresponding storage hardware) into the
design for a general purpose OpenStack cloud:
Connectivity
Ensure that, if storage protocols other than Ethernet are part of
the storage solution, the appropriate hardware has been selected. If
a centralized storage array is selected, ensure that the hypervisor
will be able to connect to that storage array for image storage.
Usage
How the particular storage architecture will be used is critical for
determining the architecture. Some of the configurations that will
influence the architecture include whether it will be used by the
hypervisors for ephemeral instance storage or if OpenStack Object
Storage will use it for object storage.
Instance and image locations
Where instances and images will be stored will influence the
architecture.
Server hardware
If the solution is a scale-out storage architecture that includes
DAS, it will affect the server hardware selection. This could ripple
into the decisions that affect host density, instance density, power
density, OS-hypervisor, management tools and others.
A general purpose OpenStack cloud has multiple storage options. The key
factors that influence the selection of storage hardware for a
general purpose OpenStack cloud are as follows:
Capacity
Hardware resources selected for the resource nodes should be capable
of supporting enough storage for the cloud services. Defining the
initial requirements and ensuring the design can support adding
capacity is important. Hardware nodes selected for object storage
should be capable of supporting a large number of inexpensive disks
with no reliance on RAID controller cards. Hardware nodes selected
for block storage should be capable of supporting high speed storage
solutions and RAID controller cards to provide performance and
redundancy to storage at a hardware level. Selecting hardware RAID
controllers that automatically repair damaged arrays will assist
with the replacement and repair of degraded or deleted storage
devices.
Performance
Disks selected for object storage services do not need to be fast
performing disks. We recommend that object storage nodes take
advantage of the best cost per terabyte available for storage.
Contrastingly, disks chosen for block storage services should take
advantage of performance boosting features that may entail the use
of SSDs or flash storage to provide high performance block storage
pools. Storage performance of ephemeral disks used for instances
should also be taken into consideration.
Fault tolerance
Object storage resource nodes have no requirements for hardware
fault tolerance or RAID controllers. It is not necessary to plan for
fault tolerance within the object storage hardware because the
object storage service provides replication between zones as a
feature of the service. Block storage nodes, compute nodes, and
cloud controllers should all have fault tolerance built in at the
hardware level by making use of hardware RAID controllers and
varying levels of RAID configuration. The level of RAID chosen
should be consistent with the performance and availability
requirements of the cloud.
Selecting networking hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Selecting network architecture determines which network hardware will be
used. Networking software is determined by the selected networking
hardware.
There are more subtle design impacts that need to be considered. The
selection of certain networking hardware (and the networking software)
affects the management tools that can be used. There are exceptions to
this; the rise of *open* networking software that supports a range of
networking hardware means that there are instances where the
relationship between networking hardware and networking software is not
as tightly defined.
Some of the key considerations that should be included in the selection
of networking hardware include:
Port count
The design will require networking hardware that has the requisite
port count.
Port density
The network design will be affected by the physical space that is
required to provide the requisite port count. A higher port density
is preferred, as it leaves more rack space for compute or storage
components that may be required by the design. This can also lead
into concerns about fault domains and power density that should be
considered. Higher density switches are also more expensive, so it is
important not to overdesign the network if it is not required.
Port speed
The networking hardware must support the proposed network speed, for
example: 1 GbE, 10 GbE, or 40 GbE (or even 100 GbE).
Redundancy
The level of network hardware redundancy required is influenced by
the user requirements for high availability and cost considerations.
Network redundancy can be achieved by adding redundant power
supplies or paired switches. If this is a requirement, the hardware
will need to support this configuration.
Power requirements
Ensure that the physical data center provides the necessary power
for the selected network hardware.
.. note::

   This may be an issue for spine switches in a leaf and spine
   fabric, or end of row (EoR) switches.
There is no single best practice architecture for the networking
hardware supporting a general purpose OpenStack cloud that will apply to
all implementations. Some of the key factors that will have a strong
influence on selection of networking hardware include:
Connectivity
All nodes within an OpenStack cloud require network connectivity. In
some cases, nodes require access to more than one network segment.
The design must encompass sufficient network capacity and bandwidth
to ensure that all communications within the cloud, both north-south
and east-west traffic have sufficient resources available.
Scalability
The network design should encompass a physical and logical network
design that can be easily expanded upon. Network hardware should
offer the appropriate types of interfaces and speeds that are
required by the hardware nodes.
Availability
To ensure that access to nodes within the cloud is not interrupted,
we recommend that the network architecture identify any single
points of failure and provide some level of redundancy or fault
tolerance. With regard to the network infrastructure itself, this
often involves use of networking protocols such as LACP, VRRP or
others to achieve a highly available network connection. In
addition, it is important to consider the networking implications on
API availability. In order to ensure that the APIs, and potentially
other services in the cloud are highly available, we recommend you
design a load balancing solution within the network architecture to
accommodate for these requirements.
Software selection
~~~~~~~~~~~~~~~~~~
Software selection for a general purpose OpenStack architecture design
needs to include these three areas:
* Operating system (OS) and hypervisor
* OpenStack components
* Supplemental software
Operating system and hypervisor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The operating system (OS) and hypervisor have a significant impact on
the overall design. Selecting a particular operating system and
hypervisor can directly affect server hardware selection. Make sure the
storage hardware and topology support the selected operating system and
hypervisor combination. Also ensure the networking hardware selection
and topology will work with the chosen operating system and hypervisor
combination.
Some areas that could be impacted by the selection of OS and hypervisor
include:
Cost
Selecting a commercially supported hypervisor, such as Microsoft
Hyper-V, will result in a different cost model than
community-supported open source hypervisors such as
:term:`KVM<kernel-based VM (KVM)>` or :term:`Xen`. When
comparing open source OS solutions, choosing Ubuntu over Red Hat
(or vice versa) will have an impact on cost due to support
contracts.
Supportability
Depending on the selected hypervisor, staff should have the
appropriate training and knowledge to support the selected OS and
hypervisor combination. If they do not, training will need to be
provided which could have a cost impact on the design.
Management tools
The management tools used for Ubuntu and KVM differ from the
management tools for VMware vSphere. Although both OS and hypervisor
combinations are supported by OpenStack, there will be very
different impacts to the rest of the design as a result of the
selection of one combination versus the other.
Scale and performance
Ensure that selected OS and hypervisor combinations meet the
appropriate scale and performance requirements. The chosen
architecture will need to meet the targeted instance-host ratios
with the selected OS-hypervisor combinations.
Security
Ensure that the design can accommodate regular periodic
installations of application security patches while maintaining
required workloads. The frequency of security patches for the
proposed OS-hypervisor combination will have an impact on
performance and the patch installation process could affect
maintenance windows.
Supported features
Determine which features of OpenStack are required. This will often
determine the selection of the OS-hypervisor combination. Some
features are only available with specific operating systems or
hypervisors.
Interoperability
You will need to consider how the OS and hypervisor combination
interacts with other operating systems and hypervisors, including
other software solutions. Operational troubleshooting tools for one
OS-hypervisor combination may differ from the tools used for another
OS-hypervisor combination and, as a result, the design will need to
address if the two sets of tools need to interoperate.
OpenStack components
~~~~~~~~~~~~~~~~~~~~
Selecting which OpenStack components are included in the overall design
is important. Some OpenStack components, like compute and Image service,
are required in every architecture. Other components, like
Orchestration, are not always required.
Excluding certain OpenStack components can limit or constrain the
functionality of other components. For example, if the architecture
includes Orchestration but excludes Telemetry, then the design will not
be able to take advantage of Orchestration's auto-scaling functionality.
It is important to research the component interdependencies in
conjunction with the technical requirements before deciding on the final
architecture.
Networking software
-------------------
OpenStack Networking (neutron) provides a wide variety of networking
services for instances. There are many additional networking software
packages that can be useful when managing OpenStack components. Some
examples include:
* Software to provide load balancing
* Network redundancy protocols
* Routing daemons
Some of these software packages are described in more detail in the
`OpenStack network nodes chapter
<https://docs.openstack.org/ha-guide/networking-ha.html>`__ of
the OpenStack High Availability Guide.
For a general purpose OpenStack cloud, the OpenStack infrastructure
components need to be highly available. If the design does not include
hardware load balancing, networking software packages like HAProxy will
need to be included.
Management software
-------------------
The selected supplemental software solution affects the overall
OpenStack cloud design. This includes software for providing clustering,
logging, monitoring, and alerting.
Inclusion of clustering software, such as Corosync or Pacemaker, is
determined primarily by the availability requirements. The impact of
including (or not including) these software packages is primarily
determined by the availability of the cloud infrastructure and the
complexity of supporting the configuration after it is deployed. The
`OpenStack High Availability
Guide <https://docs.openstack.org/ha-guide/>`__ provides more details on
the installation and configuration of Corosync and Pacemaker, should
these packages need to be included in the design.
Requirements for logging, monitoring, and alerting are determined by
operational considerations. Each of these sub-categories includes a
number of options.
If these software packages are required, the design must account for the
additional resource consumption (CPU, RAM, storage, and network
bandwidth). Some other potential design impacts include:
* OS-hypervisor combination: Ensure that the selected logging,
monitoring, or alerting tools support the proposed OS-hypervisor
combination.
* Network hardware: The network hardware selection needs to be
supported by the logging, monitoring, and alerting software.
Database software
-----------------
OpenStack components often require access to back-end database services
to store state and configuration information, so you must select a
back-end database that satisfies the availability and fault tolerance
requirements of the OpenStack services. OpenStack services support
connecting to any database that is supported by the SQLAlchemy Python
drivers; however, most common database deployments make use of MySQL or
variations of it. We recommend that the database providing back-end
services within a general purpose cloud be made highly available using
a technology that can accomplish that goal.
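For example, each OpenStack service points at its back-end database
through a SQLAlchemy connection URL in its configuration file. The
following sketch uses ``nova.conf``; the credentials and host name are
placeholders, and in a highly available deployment the address is
typically a virtual IP in front of a MySQL/Galera cluster:

.. code-block:: ini

   [database]
   # Placeholder credentials and host name.
   connection = mysql+pymysql://nova:NOVA_DBPASS@db-vip.example.com/nova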

View File

@ -1,124 +0,0 @@
==========================
Operational considerations
==========================
In the planning and design phases of the build out, it is important to
include the operations function. Operational factors affect the design
choices for a general purpose cloud, and operations staff are often
tasked with the maintenance of cloud environments for larger
installations.
Expectations set by Service Level Agreements (SLAs) directly influence
when and where you should implement redundancy and high
availability. SLAs are contractual obligations that provide assurances
for service availability. They define the levels of availability that
drive the technical design, often with penalties for not meeting
contractual obligations.
SLA terms that affect design include:
* API availability guarantees implying multiple infrastructure services
and highly available load balancers.
* Network uptime guarantees affecting switch design, which might
require redundant switching and power.
* Networking security policy requirements that need to be factored
  into your deployments.
Support and maintainability
~~~~~~~~~~~~~~~~~~~~~~~~~~~
To be able to support and maintain an installation, OpenStack cloud
management requires operations staff to understand the design
architecture. The operations and engineering staff skill level,
and level of separation, are dependent on size and purpose of the
installation. Large cloud service providers, or telecom providers, are
more likely to be managed by specially trained, dedicated operations
organizations. Smaller implementations are more likely to rely on
support staff that need to take on combined engineering, design and
operations functions.
Maintaining OpenStack installations requires a variety of technical
skills. You may want to consider using a third-party management company
with special expertise in managing OpenStack deployments.
Monitoring
~~~~~~~~~~
OpenStack clouds require appropriate monitoring platforms to ensure
errors are caught and managed appropriately. Specific meters that are
critically important to monitor include:
* Image disk utilization
* Response time to the :term:`Compute API <Compute API (Nova API)>`
Leveraging existing monitoring systems is an effective way to ensure
that OpenStack environments can be monitored appropriately.
Downtime
~~~~~~~~
To effectively run cloud installations, initial downtime planning
includes creating processes and architectures that support the
following:
* Planned (maintenance)
* Unplanned (system faults)
The resiliency of the overall system, and of individual components, is
dictated by the requirements of the SLA, meaning that designing for
:term:`high availability (HA)` can have cost ramifications.
Capacity planning
~~~~~~~~~~~~~~~~~
Capacity constraints for a general purpose cloud environment include:
* Compute limits
* Storage limits
A relationship exists between the size of the compute environment and
the number of OpenStack infrastructure controller nodes required to
support it.
Increasing the size of the supporting compute environment increases the
network traffic and messages, adding load to the controller or
networking nodes. Effective monitoring of the environment will help with
capacity decisions on scaling.
Compute nodes automatically attach to OpenStack clouds, so adding extra
compute capacity to an OpenStack cloud is a horizontal scaling process.
Additional processes are required to place nodes into appropriate
availability zones and host aggregates. When adding compute nodes to
environments, ensure identical or functionally compatible CPUs are
used, otherwise live migration features will break.
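For example, placing a newly added node into a host aggregate exposed
as an availability zone can be done with the OpenStack client; the
aggregate, zone, and host names below are illustrative only:

.. code-block:: console

   $ openstack aggregate create --zone az-1 rack-a
   $ openstack aggregate add host rack-a compute-01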
It is necessary to add rack capacity or network switches, as scaling
out compute hosts directly affects network and data center resources.
Assessing the average workloads and increasing the number of instances
that can run within the compute environment by adjusting the overcommit
ratio is another option. It is important to remember that changing the
CPU overcommit ratio can have a detrimental effect and cause a potential
increase in noisy neighbor activity. The additional risk of increasing
the overcommit ratio is that more instances fail when a compute host
fails.
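The overcommit ratios are tuned per compute node in ``nova.conf``. A
minimal sketch using the default values discussed in this guide (the
option group can vary between releases):

.. code-block:: ini

   [DEFAULT]
   # Defaults discussed in this guide; raise or lower with care.
   cpu_allocation_ratio = 16.0
   ram_allocation_ratio = 1.5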
Compute host components can also be upgraded to account for increases in
demand; this is known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.
Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, capacity includes
adding disk shelves to enterprise storage systems or installing
additional block storage nodes. It may also be necessary to upgrade
directly attached storage installed in compute hosts, or to add capacity
to the shared storage that provides ephemeral storage to instances.
For a deeper discussion on many of these topics, refer to the `OpenStack
Operations Guide <https://docs.openstack.org/ops>`_.

View File

@ -1,85 +0,0 @@
====================
Prescriptive example
====================
An online classified advertising company wants to run web applications
consisting of Tomcat, Nginx, and MariaDB in a private cloud. To be able
to meet policy requirements, the cloud infrastructure will run in their
own data center. The company has predictable load requirements, but
requires scaling to cope with nightly increases in demand. Their current
environment does not have the flexibility to align with their goal of
running an open source API environment. The current environment consists
of the following:
* Between 120 and 140 installations of Nginx and Tomcat, each with 2
vCPUs and 4 GB of RAM
* A three-node MariaDB and Galera cluster, each with 4 vCPUs and 8 GB
RAM
The company runs hardware load balancers and multiple web applications
serving their websites, and orchestrates environments using combinations
of scripts and Puppet. The website generates large amounts of log data
daily that requires archiving.
The solution would consist of the following OpenStack components:
* A firewall, switches and load balancers on the public facing network
connections.
* OpenStack controller services running Image, Identity, and
  Networking, combined with support services such as MariaDB and
  RabbitMQ, configured for high availability on at least three
  controller nodes.
* OpenStack compute nodes running the KVM hypervisor.
* OpenStack Block Storage for use by compute instances, requiring
persistent storage (such as databases for dynamic sites).
* OpenStack Object Storage for serving static objects (such as images).
.. figure:: figures/General_Architecture3.png
Running up to 140 web instances and the small number of MariaDB
instances requires 292 vCPUs available, as well as 584 GB RAM. On a
typical 1U server using dual-socket hex-core Intel CPUs with
Hyperthreading, and assuming 2:1 CPU overcommit ratio, this would
require 8 OpenStack compute nodes.
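The totals can be checked with simple arithmetic; this is a sketch
following the assumptions stated above:

.. code-block:: python

   # Sanity check of the sizing figures above.
   vcpus = 140 * 2 + 3 * 4       # 292 vCPUs in total
   ram_gb = 140 * 4 + 3 * 8      # 584 GB RAM in total

   threads_per_node = 2 * 6 * 2  # dual-socket, hex-core, Hyperthreading
   vcpus_per_node = threads_per_node * 2  # 48 vCPUs at 2:1 overcommit

   # Eight such nodes expose 8 * 48 = 384 vCPUs, covering the 292 vCPUs
   # required while leaving headroom for RAM and host failures.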
The web application instances run from local storage on each of the
OpenStack compute nodes. The web application instances are stateless,
meaning that any of the instances can fail and the application will
continue to function.
MariaDB server instances store their data on shared enterprise storage,
such as NetApp or SolidFire devices. If a MariaDB instance fails, its
storage is expected to be re-attached to another instance, which then
rejoins the Galera cluster.
Logs from the web application servers are shipped to OpenStack Object
Storage for processing and archiving.
Additional capabilities can be realized by moving static web content to
be served from OpenStack Object Storage containers, and backing the
OpenStack Image service with OpenStack Object Storage.
.. note::
Increased use of OpenStack Object Storage means that network
bandwidth needs to be taken into consideration. Running OpenStack
Object Storage with network connections offering 10 GbE or better
connectivity is advised.
Leveraging the Orchestration and Telemetry services is also an option
for providing auto-scaling, orchestrated web application environments.
Defining the web applications in a
:term:`Heat Orchestration Template (HOT)`
removes the reliance on the current scripted Puppet
solution.
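A minimal HOT skeleton for a single web instance might look like the
following; the image, flavor, and network names are placeholders:

.. code-block:: yaml

   heat_template_version: 2016-10-14

   resources:
     web_server:
       type: OS::Nova::Server
       properties:
         image: ubuntu-16.04   # placeholder image name
         flavor: m1.web        # placeholder flavor
         networks:
           - network: web-net  # placeholder network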
OpenStack Networking can be used to control hardware load balancers
through the use of plug-ins and the Networking API. This allows users to
control hardware load balancer pools and instances as members in these
pools, but their use in production environments must be carefully
weighed against current stability.

View File

@ -1,618 +0,0 @@
========================
Technical considerations
========================
General purpose clouds are expected to include these base services:
* Compute
* Network
* Storage
Each of these services have different resource requirements. As a
result, you must make design decisions relating directly to the service,
as well as provide a balanced infrastructure for all services.
Take into consideration the unique aspects of each service, as
individual characteristics and service mass can impact the hardware
selection process. Hardware designs should be generated for each of the
services.
Hardware decisions are also made in relation to network architecture and
facilities planning. These factors play heavily into the overall
architecture of an OpenStack cloud.
Compute resource design
~~~~~~~~~~~~~~~~~~~~~~~
When designing compute resource pools, a number of factors can impact
your design decisions. Factors such as number of processors, amount of
memory, and the quantity of storage required for each hypervisor must be
taken into account.
You will also need to decide whether to provide compute resources in a
single pool or in multiple pools. In most cases, multiple pools of
resources can be allocated and addressed on demand. A compute design
that allocates multiple pools of resources makes best use of application
resources, and is commonly referred to as bin packing.
In a bin packing design, each independent resource pool provides service
for specific flavors. This helps to ensure that, as instances are
scheduled onto compute hypervisors, each independent node's resources
will be allocated in a way that makes the most efficient use of the
available hardware. Bin packing also requires a common hardware design,
with all hardware nodes within a compute resource pool sharing a common
processor, memory, and storage layout. This makes it easier to deploy,
support, and maintain nodes throughout their lifecycle.
An overcommit ratio is the ratio of available virtual resources to
available physical resources. This ratio is configurable for CPU and
memory. The default CPU overcommit ratio is 16:1, and the default memory
overcommit ratio is 1.5:1. Determining the tuning of the overcommit
ratios during the design phase is important as it has a direct impact on
the hardware layout of your compute nodes.
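For example, under the default ratios, a hypothetical compute node with
24 physical cores and 128 GB of RAM can schedule:

.. code-block:: python

   # Hypothetical node under the default overcommit ratios.
   physical_cores = 24
   physical_ram_gb = 128

   schedulable_vcpus = physical_cores * 16     # 384 vCPUs
   schedulable_ram_gb = physical_ram_gb * 1.5  # 192 GB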
When selecting a processor, compare features and performance
characteristics. Some processors include features specific to
virtualized compute hosts, such as hardware-assisted virtualization and
technology related to memory paging (such as Intel EPT or AMD RVI).
These types of features can have a significant impact on the
performance of your virtual machine.
You will also need to consider the compute requirements of
non-hypervisor nodes (sometimes referred to as resource nodes). This
includes controller, object storage, and block storage nodes, and
networking services.
The number of processor cores and threads impacts the number of worker
threads which can be run on a resource node. Design decisions must
relate directly to the service being run on it, as well as provide a
balanced infrastructure for all services.
Workload can be unpredictable in a general purpose cloud, so consider
including the ability to add additional compute resource pools on
demand. In some cases, however, the demand for certain instance types or
flavors may not justify individual hardware design. In either case,
start by allocating hardware designs that are capable of servicing the
most common instance requests. If you want to add additional hardware to
the overall architecture, this can be done later.
Designing network resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack clouds generally have multiple network segments, with each
segment providing access to particular resources. The network services
themselves also require network communication paths which should be
separated from the other networks. When designing network services for a
general purpose cloud, plan for either a physical or logical separation
of network segments used by operators and projects. You can also create
an additional network segment for access to internal services such as
the message bus and database used by various services. Segregating these
services onto separate networks helps to protect sensitive data and
protects against unauthorized access to services.
Choose a networking service based on the requirements of your instances.
The architecture and design of your cloud will impact whether you choose
OpenStack Networking (neutron), or legacy networking (nova-network).
Legacy networking (nova-network)
The legacy networking (nova-network) service is primarily a layer-2
networking service that functions in two modes, which use VLANs in
different ways. In a flat network mode, all network hardware nodes
and devices throughout the cloud are connected to a single layer-2
network segment that provides access to application data.
When the network devices in the cloud support segmentation using
VLANs, legacy networking can operate in the second mode. In this
design model, each project within the cloud is assigned a network
subnet which is mapped to a VLAN on the physical network. It is
especially important to remember the maximum of 4096 VLANs that
can be used within a spanning tree domain. This places a hard
limit on the amount of growth possible within the data center. When
designing a general purpose cloud intended to support multiple
projects, we recommend using legacy networking in VLAN mode rather
than flat network mode.
Another network consideration is that legacy
networking is entirely managed by the cloud operator; projects do not
have control over network resources. If projects require the ability to
manage and create network resources such as network segments and
subnets, it will be necessary to install the OpenStack Networking
service to provide network access to instances.
Networking (neutron)
OpenStack Networking (neutron) is a first-class networking service
that gives projects full control over the creation of virtual network
resources. This is often accomplished in the form of tunneling
protocols that establish encapsulated communication paths over
existing network infrastructure in order to segment project traffic.
These methods vary depending on the specific implementation, but
some of the more common methods include tunneling over GRE,
encapsulating with VXLAN, and VLAN tags.
We recommend you design at least three network segments:
* The first segment is a public network, used for access to REST APIs
by projects and operators. The controller nodes and swift proxies are
the only devices connecting to this network segment. In some cases,
this network might also be serviced by hardware load balancers and
other network devices.
* The second segment is used by administrators to manage hardware
resources. Configuration management tools also use this for deploying
software and services onto new hardware. In some cases, this network
segment might also be used for internal services, including the
message bus and database services. This network needs to communicate
with every hardware node. Due to the highly sensitive nature of this
network segment, you also need to secure this network from
unauthorized access.
* The third network segment is used by applications and consumers to
access the physical network, and for users to access applications.
This network is segregated from the one used to access the cloud APIs
and is not capable of communicating directly with the hardware
resources in the cloud. Compute resource nodes and network gateway
services which allow application data to access the physical network
from outside of the cloud need to communicate on this network
segment.
Designing Object Storage
~~~~~~~~~~~~~~~~~~~~~~~~
When designing hardware resources for OpenStack Object Storage, the
primary goal is to maximize the amount of storage in each resource node
while also ensuring that the cost per terabyte is kept to a minimum.
This often involves utilizing servers which can hold a large number of
spinning disks. Whether choosing to use 2U server form factors with
directly attached storage or an external chassis that holds a larger
number of drives, the main goal is to maximize the storage available in
each node.
.. note::
We do not recommend investing in enterprise class drives for an
OpenStack Object Storage cluster. The consistency and partition
tolerance characteristics of OpenStack Object Storage ensure that
data stays up to date and survives hardware faults without the use
of any specialized data replication devices.
One of the benefits of OpenStack Object Storage is the ability to mix
and match drives by making use of weighting within the swift ring. When
designing your swift storage cluster, we recommend making use of the
most cost effective storage solution available at the time.
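As a sketch of how drive weighting works, the following ring-builder
commands create an object ring with three replicas and add two devices
with different weights, so that the larger drive receives
proportionally more partitions. The part power, addresses, and device
names are examples only:

.. code-block:: console

   $ swift-ring-builder object.builder create 10 3 1
   $ swift-ring-builder object.builder add r1z1-10.0.1.11:6200/sdb 100
   $ swift-ring-builder object.builder add r1z2-10.0.1.12:6200/sdc 200
   $ swift-ring-builder object.builder rebalance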
To achieve durability and availability of data stored as objects it is
important to design object storage resource pools to ensure they can
provide the suggested availability. Considering rack-level and
zone-level designs to accommodate the number of replicas configured to
be stored in the Object Storage service (the default number of replicas
is three) is important when designing beyond the hardware node level.
Each replica of data should exist in its own availability zone with its
own power, cooling, and network resources available to service that
specific zone.
Object storage nodes should be designed so that the number of requests
does not hinder the performance of the cluster. The Object Storage
service uses a chatty protocol; making use of multiple processors with
higher core counts will ensure that I/O requests do not inundate the
server.
Designing Block Storage
~~~~~~~~~~~~~~~~~~~~~~~
When designing OpenStack Block Storage resource nodes, it is helpful to
understand the workloads and requirements that will drive the use of
block storage in the cloud. We recommend designing block storage pools
so that projects can choose appropriate storage solutions for their
applications. By creating multiple storage pools of different types, in
conjunction with configuring an advanced storage scheduler for the block
storage service, it is possible to provide projects with a large catalog
of storage services with a variety of performance levels and redundancy
options.
Block storage also takes advantage of a number of enterprise storage
solutions. These are addressed via a plug-in driver developed by the
hardware vendor. A large number of enterprise storage plug-in drivers
ship out-of-the-box with OpenStack Block Storage (and many more
available via third party channels). General purpose clouds are more
likely to use directly attached storage in the majority of block storage
nodes, making enterprise class storage solutions necessary for projects
that require additional levels of service.
Redundancy and availability requirements impact the decision to use a
RAID controller card in block storage nodes. The input/output operations
per second (IOPS) demand of your application will influence whether or
not you should use a RAID controller, and which level of RAID is
required.
Making use of higher performing RAID volumes is suggested when
considering performance. However, where redundancy of block storage
volumes is more important, we recommend making use of a redundant RAID
configuration such as RAID 5 or RAID 6. Some specialized features, such
as automated replication of block storage volumes, may require the use
of third-party plug-ins and enterprise block storage solutions to meet
the high demand on storage. Furthermore, where extreme performance is a
requirement, it may also be necessary to make use of high speed SSD
disk drives or high performing flash storage solutions.
Software selection
~~~~~~~~~~~~~~~~~~
The software selection process plays a large role in the architecture of
a general purpose cloud. The following have a large impact on the design
of the cloud:
* Choice of operating system
* Selection of OpenStack software components
* Choice of hypervisor
* Selection of supplemental software
Operating system (OS) selection plays a large role in the design and
architecture of a cloud. There are a number of OSes which have native
support for OpenStack including:
* Ubuntu
* Red Hat Enterprise Linux (RHEL)
* CentOS
* SUSE Linux Enterprise Server (SLES)
.. note::
Native support is not a constraint on the choice of OS; users are
free to choose just about any Linux distribution (or even Microsoft
Windows) and install OpenStack directly from source (or compile
their own packages). However, many organizations will prefer to
install OpenStack from distribution-supplied packages or
repositories (although using the distribution vendor's OpenStack
packages might be a requirement for support).
OS selection also directly influences hypervisor selection. A cloud
architect who selects Ubuntu, RHEL, or SLES has some flexibility in
hypervisor; KVM, Xen, and LXC are supported virtualization methods
available under OpenStack Compute (nova) on these Linux distributions.
However, a cloud architect who selects Windows Server is limited to Hyper-V.
Similarly, a cloud architect who selects XenServer is limited to the
CentOS-based dom0 operating system provided with XenServer.
The primary factors that play into OS-hypervisor selection include:
User requirements
The selection of OS-hypervisor combination first and foremost needs
to support the user requirements.
Support
The selected OS-hypervisor combination needs to be supported by
OpenStack.
Interoperability
The OS-hypervisor needs to be interoperable with other features and
services in the OpenStack design in order to meet the user
requirements.
Hypervisor
~~~~~~~~~~
OpenStack supports a wide variety of hypervisors, one or more of which
can be used in a single cloud. These hypervisors include:
* KVM (and QEMU)
* XCP/XenServer
* vSphere (vCenter and ESXi)
* Hyper-V
* LXC
* Docker
* Bare-metal
A complete list of supported hypervisors and their capabilities can be
found at `OpenStack Hypervisor Support
Matrix <https://wiki.openstack.org/wiki/HypervisorSupportMatrix>`_.
We recommend general purpose clouds use hypervisors that support the
most general purpose use cases, such as KVM and Xen. More specific
hypervisors should be chosen to account for specific functionality or a
supported feature requirement. In some cases, there may also be a
mandated requirement to run software on a certified hypervisor including
solutions from VMware, Microsoft, and Citrix.
The features offered through the OpenStack cloud platform determine the
best choice of a hypervisor. Each hypervisor has its own hardware
requirements which may affect the decisions around designing a general
purpose cloud.
In a mixed hypervisor environment, specific aggregates of compute
resources, each with defined capabilities, enable workloads to utilize
software and hardware specific to their particular requirements. This
functionality can be exposed explicitly to the end user, or accessed
through defined metadata within a particular flavor of an instance.
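A sketch of exposing such metadata through a flavor, assuming the
scheduler is configured with the ``AggregateInstanceExtraSpecsFilter``;
the aggregate, property, and flavor names are illustrative:

.. code-block:: console

   $ openstack aggregate create kvm-hosts
   $ openstack aggregate set --property hypervisor=kvm kvm-hosts
   $ openstack flavor create --vcpus 2 --ram 4096 --disk 40 m1.kvm
   $ openstack flavor set \
       --property aggregate_instance_extra_specs:hypervisor=kvm m1.kvm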
OpenStack components
~~~~~~~~~~~~~~~~~~~~
A general purpose OpenStack cloud design should incorporate the core
OpenStack services to provide a wide range of services to end-users. The
OpenStack core services recommended in a general purpose cloud are:
* :term:`Compute service (nova)`
* :term:`Networking service (neutron)`
* :term:`Image service (glance)`
* :term:`Identity service (keystone)`
* :term:`Dashboard (horizon)`
* :term:`Telemetry service (telemetry)`
A general purpose cloud may also include the :term:`Object Storage
service (swift)` and the :term:`Block Storage service (cinder)`. These
may be selected to provide storage to applications and instances.
Supplemental software
~~~~~~~~~~~~~~~~~~~~~
A general purpose OpenStack deployment consists of more than just
OpenStack-specific components. A typical deployment involves services
that provide supporting functionality, including databases and message
queues, and may also involve software to provide high availability of
the OpenStack environment. Design decisions around the underlying
message queue might affect the required number of controller services,
as well as the technology to provide highly resilient database
functionality, such as MariaDB with Galera. In such a scenario,
replication of services relies on quorum.
Where many general purpose deployments use hardware load balancers to
provide highly available API access and SSL termination, software
solutions, for example HAProxy, can also be considered. It is vital to
ensure that such software implementations are also made highly
available. High availability can be achieved by using software such as
Keepalived or Pacemaker with Corosync. Pacemaker and Corosync can
provide active-active or active-passive highly available configurations
depending on the specific service in the OpenStack environment. Using
this software can affect the design as it assumes at least a two-node
controller infrastructure where one of those nodes may be running
certain services in standby mode.
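As an illustration, a minimal Keepalived VRRP instance that moves a
virtual IP between two controllers running HAProxy could look like the
following; the interface name, router ID, and address are placeholders:

.. code-block:: none

   vrrp_instance haproxy_vip {
       state MASTER          # BACKUP on the second controller
       interface eth0
       virtual_router_id 51
       priority 101          # lower priority on the standby node
       virtual_ipaddress {
           192.168.1.100
       }
   }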
Memcached is a distributed memory object caching system, and Redis is a
key-value store. Both are deployed on general purpose clouds to assist
in alleviating load to the Identity service. The memcached service
caches tokens, and due to its distributed nature it can help alleviate
some bottlenecks to the underlying authentication system. Using
memcached or Redis does not affect the overall design of your
architecture as they tend to be deployed onto the infrastructure nodes
providing the OpenStack services.
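For example, caching in the Identity service is typically enabled
through the ``[cache]`` section of ``keystone.conf``; the server
addresses below are placeholders:

.. code-block:: ini

   [cache]
   enabled = true
   backend = dogpile.cache.memcached
   memcache_servers = controller1:11211,controller2:11211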
Controller infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~
The Controller infrastructure nodes provide management services to the
end-user as well as providing services internally for the operating of
the cloud. The Controllers run message queuing services that carry
system messages between each service. Performance issues related to the
message bus would lead to delays in sending messages to where they need
to go. The result of this condition would be delays in operational
functions such as spinning up and deleting instances, provisioning new
storage volumes, and managing network resources. Such delays could
adversely affect an application's ability to react to certain
conditions, especially when using auto-scaling features. It is important
to properly design the hardware used to run the controller
infrastructure as outlined above in the Hardware Selection section.
Performance of the controller services is not limited to processing
power; restrictions may emerge in serving concurrent users. Ensure that
the APIs and Horizon services are load tested so that you are able to
serve your customers. Particular attention should be paid to the
OpenStack Identity service (keystone), which provides authentication
and authorization for all services, both internally to OpenStack itself
and to end-users. This service can lead to a degradation of overall
performance if it is not sized appropriately.
Network performance
~~~~~~~~~~~~~~~~~~~
In a general purpose OpenStack cloud, the requirements of the network
help determine performance capabilities. It is possible to design
OpenStack environments that run a mix of networking capabilities. By
utilizing the different interface speeds, the users of the OpenStack
environment can choose networks that are fit for their purpose.
Network performance can be boosted considerably by implementing hardware
load balancers to provide front-end service to the cloud APIs. The
hardware load balancers also perform SSL termination if that is a
requirement of your environment. When implementing SSL offloading, it is
important to understand the SSL offloading capabilities of the devices
selected.
Compute host
~~~~~~~~~~~~
The choice of hardware specifications used in compute nodes including
CPU, memory and disk type directly affects the performance of the
instances. Other factors which can directly affect performance include
tunable parameters within the OpenStack services, for example the
overcommit ratio applied to resources. The defaults in OpenStack Compute
set a 16:1 overcommit of the CPU and 1.5 overcommit of the memory.
Running at such high ratios leads to an increase in "noisy-neighbor"
activity. Care must be taken when sizing your Compute environment to
avoid this scenario. For running general purpose OpenStack environments
it is possible to keep to the defaults, but make sure to monitor your
environment as usage increases.
Storage performance
~~~~~~~~~~~~~~~~~~~
When considering performance of Block Storage, hardware and
architecture choice is important. Block Storage can use enterprise
back-end systems such as NetApp or EMC, scale out storage such as
GlusterFS and Ceph, or simply use the capabilities of directly attached
storage in the nodes themselves. Block Storage may be deployed so that
traffic traverses the host network, which could affect, and be adversely
affected by, the front-side API traffic performance. As such, consider
using a dedicated data storage network with dedicated interfaces on the
Controller and Compute hosts.
When considering performance of Object Storage, a number of design
choices will affect performance. A user's access to Object
Storage is through the proxy services, which sit behind hardware load
balancers. By the very nature of a highly resilient storage system,
replication of the data would affect performance of the overall system.
In this case, 10 GbE (or better) networking is recommended throughout
the storage network architecture.
High Availability
~~~~~~~~~~~~~~~~~
In OpenStack, the infrastructure is integral to providing services and
should always be available, especially when operating with SLAs.
Ensuring network availability is accomplished by designing the network
architecture so that no single point of failure exists. A consideration
of the number of switches, routers, and redundant power supplies should
be factored into core infrastructure, as well as the associated bonding
of networks to provide diverse routes to your highly available switch
infrastructure.
The OpenStack services themselves should be deployed across multiple
servers that do not represent a single point of failure. Ensuring API
availability can be achieved by placing these services behind highly
available load balancers that have multiple OpenStack servers as
members.
OpenStack lends itself to deployment in a highly available manner where
it is expected that at least two servers be utilized. These can run all
of the services involved, from the message queuing service (for example,
RabbitMQ or Qpid) to an appropriately deployed database service (such as
MySQL or MariaDB). As services in the cloud are scaled out, back-end
services will need to scale too. Monitoring and reporting on server
utilization and response times, as well as load testing your systems,
will help determine scale out decisions.
Care must be taken when deciding network functionality. Currently,
OpenStack supports both the legacy networking (nova-network) system and
the newer, extensible OpenStack Networking (neutron). Both have their
pros and cons when it comes to providing highly available access. Legacy
networking, which provides networking access maintained in the OpenStack
Compute code, provides a feature that removes a single point of failure
when it comes to routing, and this feature is currently missing in
OpenStack Networking. The effect of legacy networking's multi-host
functionality restricts failure domains to the host running that
instance.
When using Networking, the OpenStack controller servers or
separate Networking hosts handle routing. For a deployment that requires
features available in only Networking, it is possible to remove this
restriction by using third party software that helps maintain highly
available L3 routes. Doing so allows for common APIs to control network
hardware, or to provide complex multi-tier web applications in a secure
manner. It is also possible to completely remove routing from
Networking, and instead rely on hardware routing capabilities. In this
case, the switching infrastructure must support L3 routing.
OpenStack Networking and legacy networking both have their advantages
and disadvantages. They are both valid and supported options that fit
different network deployment models described in the
`Networking deployment options table
<https://docs.openstack.org/ops-guide/arch-network-design.html#network-topology>`_
of the OpenStack Operations Guide.
Ensure your deployment has adequate back-up capabilities.
Application design must also be factored into the capabilities of the
underlying cloud infrastructure. If the compute hosts do not provide a
seamless live migration capability, then it must be expected that when a
compute host fails, that instance and any data local to that instance
will be deleted. However, when giving users the expectation that
instances have a high level of uptime, the infrastructure
must be deployed in a way that eliminates any single point of failure
when a compute host disappears. This may include utilizing shared file
systems on enterprise storage or OpenStack Block storage to provide a
level of guarantee to match service features.
For more information on high availability in OpenStack, see the
`OpenStack High Availability
Guide <https://docs.openstack.org/ha-guide/>`_.
Security
~~~~~~~~
A security domain comprises users, applications, servers or networks
that share common trust requirements and expectations within a system.
Typically they have the same authentication and authorization
requirements and users.
These security domains are:
* Public
* Guest
* Management
* Data
These security domains can be mapped to an OpenStack deployment
individually, or combined. In each case, the cloud operator should be
aware of the appropriate security concerns. Security domains should be
mapped out against your specific OpenStack deployment topology. The
domains and their trust requirements depend upon whether the cloud
instance is public, private, or hybrid.
* The public security domain is an entirely untrusted area of the cloud
infrastructure. It can refer to the internet as a whole or simply to
networks over which you have no authority. This domain should always
be considered untrusted.
* The guest security domain handles compute data generated by instances
on the cloud but not services that support the operation of the
cloud, such as API calls. Public cloud providers and private cloud
providers who do not have stringent controls on instance use or who
allow unrestricted internet access to instances should consider this
domain to be untrusted. Private cloud providers may want to consider
this network as internal and therefore trusted only if they have
controls in place to assert that they trust instances and all their
projects.
* The management security domain is where services interact. Sometimes
referred to as the control plane, the networks in this domain
transport confidential data such as configuration parameters, user
names, and passwords. In most deployments this domain is considered
trusted.
* The data security domain is concerned primarily with information
pertaining to the storage services within OpenStack. Much of the data
that crosses this network has high integrity and confidentiality
requirements and, depending on the type of deployment, may also have
strong availability requirements. The trust level of this network is
heavily dependent on other deployment decisions.
When deploying OpenStack in an enterprise as a private cloud it is
usually behind the firewall and within the trusted network alongside
existing systems. Users of the cloud are employees that are bound by the
security requirements set forth by the company. This tends to push most
of the security domains towards a more trusted model. However, when
deploying OpenStack in a public facing role, no assumptions can be made
and the attack vectors significantly increase.
Care must be taken when managing the users of the system for
both public and private clouds. The Identity service allows for LDAP to
be part of the authentication process. Including such systems in an
OpenStack deployment may ease user management if integrating into
existing systems.
It is important to understand that user authentication requests include
sensitive information including user names, passwords, and
authentication tokens. For this reason, placing the API services behind
hardware that performs SSL termination is strongly recommended.
For more information on OpenStack security, see the `OpenStack Security
Guide <https://docs.openstack.org/security-guide/>`_.

View File

@ -1,99 +0,0 @@
=================
User requirements
=================
When building a general purpose cloud, you should follow the
:term:`Infrastructure-as-a-Service (IaaS)` model; a platform best suited
for use cases with simple requirements. General purpose cloud user
requirements are not complex. However, it is important to capture them
even if the project has minimum business and technical requirements, such
as a proof of concept (PoC), or a small lab platform.
.. note::
The following user considerations are written from the perspective
of the cloud builder, not from the perspective of the end user.
Business requirements
~~~~~~~~~~~~~~~~~~~~~
Cost
Financial factors are a primary concern for any organization. Cost
is an important criterion as general purpose clouds are considered
the baseline from which all other cloud architecture environments
derive. General purpose clouds do not always provide the most
cost-effective environment for specialized applications or
situations. Unless razor-thin margins and costs have been mandated
as a critical factor, cost should not be the sole consideration when
choosing or designing a general purpose architecture.
Time to market
The ability to deliver services or products within a flexible time
frame is a common business factor when building a general purpose
cloud. Delivering a product in six months instead of two years is a
driving force behind the decision to build general purpose clouds.
General purpose clouds allow users to self-provision and gain access
to compute, network, and storage resources on-demand thus decreasing
time to market.
Revenue opportunity
Revenue opportunities for a cloud will vary greatly based on the
intended use case of that particular cloud. Some general purpose
clouds are built for commercial customer facing products, but there
are alternatives that might make the general purpose cloud the right
choice.
Technical requirements
~~~~~~~~~~~~~~~~~~~~~~
Technical cloud architecture requirements should be weighed against the
business requirements.
Performance
As a baseline product, general purpose clouds do not provide
optimized performance for any particular function. While a general
purpose cloud should provide enough performance to satisfy average
user considerations, performance is not a general purpose cloud
customer driver.
No predefined usage model
The lack of a pre-defined usage model enables the user to run a wide
variety of applications without having to know the application
requirements in advance. This provides a degree of independence and
flexibility that no other cloud scenarios are able to provide.
On-demand and self-service application
By definition, a cloud provides end users with the ability to
self-provision computing power, storage, networks, and software in a
simple and flexible way. The user must be able to scale their
resources up to a substantial level without disrupting the
underlying host operations. One of the benefits of using a general
purpose cloud architecture is the ability to start with limited
resources and increase them over time as the user demand grows.
Public cloud
For a company interested in building a commercial public cloud
offering based on OpenStack, the general purpose architecture model
might be the best choice. Designers are not always going to know the
purposes or workloads for which the end users will use the cloud.
Internal consumption (private) cloud
Organizations need to determine if it is logical to create their own
clouds internally. Using a private cloud, organizations are able to
maintain complete control over architectural and cloud components.
.. note::
Users will want to combine using the internal cloud with access
to an external cloud. If that case is likely, it might be worth
exploring the possibility of taking a multi-cloud approach with
regard to at least some of the architectural elements.
Designs that incorporate the use of multiple clouds, such as a
private cloud and a public cloud offering, are described in the
"Multi-Cloud" scenario, see :doc:`multi-site`.
Security
Security should be implemented according to asset, threat, and
vulnerability risk assessment matrices. For cloud domains that
require increased computer security, network security, or
information security, a general purpose cloud is not considered an
appropriate choice.

View File

@ -1,57 +0,0 @@
===============
General purpose
===============
.. toctree::
:maxdepth: 2
generalpurpose-user-requirements.rst
generalpurpose-technical-considerations.rst
generalpurpose-operational-considerations.rst
generalpurpose-architecture.rst
generalpurpose-prescriptive-example.rst
An OpenStack general purpose cloud is often considered a starting
point for building a cloud deployment. General purpose clouds are
designed to balance the components and do not emphasize any particular
aspect of the overall computing environment. Cloud design must give
equal weight
to the compute, network, and storage components. General purpose clouds
are found in private, public, and hybrid environments, lending
themselves to many different use cases.
.. note::
General purpose clouds are homogeneous deployments.
They are not suited to specialized environments or edge case situations.
Common uses of a general purpose cloud include:
* Providing a simple database
* A web application runtime environment
* A shared application development platform
* Lab test bed
Use cases that benefit from scale-out rather than scale-up approaches
are good candidates for general purpose cloud architecture.
A general purpose cloud is designed to have a range of potential
uses or functions; it is not specialized for specific use cases. General
purpose architecture is designed to address 80% of potential use
cases. The infrastructure, in itself, is a specific use
case, enabling it to be used as a base model for the design process.
General purpose clouds are designed to be platforms that are suited
for general purpose applications.
General purpose clouds are limited to the most basic components,
but they can include additional resources such as:
* Virtual-machine disk image library
* Raw block storage
* File or object storage
* Firewalls
* Load balancers
* IP addresses
* Network overlays or virtual local area networks (VLANs)
* Software bundles

View File

@ -1,149 +0,0 @@
============
Architecture
============
Map out the dependencies of the expected workloads and the cloud
infrastructures required to support them. This allows you to
architect a solution for the broadest compatibility between cloud
platforms, minimizing the need to create workarounds and processes
to fill identified gaps. For your chosen cloud management platform,
note the relative levels of support for both monitoring and
orchestration.
.. figure:: figures/Multi-Cloud_Priv-AWS4.png
:width: 100%
Image portability
~~~~~~~~~~~~~~~~~
The majority of cloud workloads currently run on instances using
hypervisor technologies. The challenge is that each of these hypervisors
uses an image format that may not be compatible with the others.
When possible, standardize on a single hypervisor and instance image format.
This may not be possible when using externally-managed public clouds.
Conversion tools exist to address image format compatibility.
Examples include `virt-p2v/virt-v2v <http://libguestfs.org/virt-v2v>`_
and `virt-edit <http://libguestfs.org/virt-edit.1.html>`_.
These tools cannot serve beyond basic cloud instance specifications.
Alternatively, build a thin operating system image as the base for
new instances.
This facilitates rapid creation of cloud instances using cloud orchestration
or configuration management tools for more specific templating.
Remember that if you intend to use portable images for disaster
recovery, application diversity, or high availability, your users could
move the images and instances between cloud platforms regularly.
Upper-layer services
~~~~~~~~~~~~~~~~~~~~
Many clouds offer complementary services beyond the
basic compute, network, and storage components.
These additional services often simplify the deployment
and management of applications on a cloud platform.
When moving workloads from the source to the destination
cloud platforms, consider that the destination cloud platform
may not have comparable services. In that case, you may need to
implement the workload in a different way or by using a different
technology.
For example, moving an application that uses a NoSQL database
service such as MongoDB could cause difficulties in maintaining
the application between the platforms.
There are a number of options that are appropriate for
the hybrid cloud use case:
* Implementing a baseline of upper-layer services across all
of the cloud platforms. For platforms that do not support
a given service, create a service on top of that platform
and apply it to the workloads as they are launched on that cloud.
* For example, through the :term:`Database service <Database service
(trove)>` for OpenStack (:term:`trove`), OpenStack supports MySQL
as a service but not NoSQL databases in production.
To move from or run alongside AWS, a NoSQL workload must use
an automation tool, such as the Orchestration service (heat),
to recreate the NoSQL database on top of OpenStack.
* Deploying a :term:`Platform-as-a-Service (PaaS)` technology that
abstracts the upper-layer services from the underlying cloud platform.
The unit of application deployment and migration is the PaaS.
It leverages the services of the PaaS and only consumes the base
infrastructure services of the cloud platform.
* Using automation tools to create the required upper-layer services
that are portable across all cloud platforms.
For example, instead of using database services that are inherent
in the cloud platforms, launch cloud instances and deploy the
databases on those instances using scripts or configuration and
application deployment tools.
Network services
~~~~~~~~~~~~~~~~
Network services functionality is a critical component of
multiple cloud architectures. It is an important factor
to assess when choosing a CMP and cloud provider.
Considerations include:
* Functionality
* Security
* Scalability
* High availability (HA)
Verify and test critical cloud endpoint features.
* After selecting the network functionality framework,
you must confirm the functionality is compatible.
This ensures that testing and functionality persist
during and after upgrades.
.. note::
Diverse cloud platforms may de-synchronize over time
if you do not maintain their mutual compatibility.
This is a particular issue with APIs.
* Scalability across multiple cloud providers determines
your choice of underlying network framework.
It is important to have the network API functions presented
and to verify that the desired functionality persists across
all chosen cloud endpoints.
* High availability implementations vary in functionality and design.
Examples of some common methods are active-hot-standby,
active-passive, and active-active.
Develop your high availability implementation and a test framework to
understand the functionality and limitations of the environment.
* It is imperative to address security considerations.
For example, addressing how data is secured between client and
endpoint and any traffic that traverses the multiple clouds.
Business and regulatory requirements dictate what security
approach to take. For more information, see the
:ref:`Security requirements <security>` chapter.
Data
~~~~
Traditionally, replication has been the best method of protecting
object store implementations. A variety of replication methods exist
in storage architectures, for example synchronous and asynchronous
mirroring. Most object stores and back-end storage systems implement
methods for replication at the storage subsystem layer.
Object stores also tailor replication techniques
to fit a cloud's requirements.
Organizations must find the right balance between
data integrity and data availability. Replication strategy may
also influence disaster recovery methods.
Replication across different racks, data centers, and geographical
regions increases focus on determining and ensuring data locality.
The ability to guarantee data is accessed from the nearest or
fastest storage can be necessary for applications to perform well.
.. note::
When running embedded object store methods, ensure that you do not
instigate extra data replication as this can cause performance issues.

View File

@ -1,80 +0,0 @@
==========================
Operational considerations
==========================
Hybrid cloud deployments present complex operational challenges.
Differences between provider clouds can cause incompatibilities
with workloads or Cloud Management Platforms (CMP).
Cloud providers may also offer different levels of integration
with competing cloud offerings.
Monitoring is critical to maintaining a hybrid cloud, and it is
important to determine if a CMP supports monitoring of all the
clouds involved, or if compatible APIs are available to be queried
for necessary information.
Agility
~~~~~~~
Hybrid clouds provide application availability across different
cloud environments and technologies.
This availability enables the deployment to survive disaster
in any single cloud environment.
Each cloud should provide the means to create instances quickly in
response to capacity issues or failure elsewhere in the hybrid cloud.
Application readiness
~~~~~~~~~~~~~~~~~~~~~
Enterprise workloads that depend on the underlying infrastructure
for availability are not designed to run on OpenStack.
If the application cannot tolerate infrastructure failures,
it is likely to require significant operator intervention to recover.
Applications for hybrid clouds must be fault tolerant, with an SLA
that is not tied to the underlying infrastructure.
Ideally, cloud applications should be able to recover when entire
racks and data centers experience an outage.
Upgrades
~~~~~~~~
If a deployment includes a public cloud, predicting upgrades may
not be possible. Carefully examine provider SLAs.
.. note::
At massive scale, even when dealing with a cloud that offers
an SLA with a high percentage of uptime, workloads must be able
to recover quickly.
When upgrading private cloud deployments, minimize disruption by
making incremental changes and providing a facility to either rollback
or continue to roll forward when using a continuous delivery model.
You may need to coordinate CMP upgrades with hybrid cloud upgrades
if there are API changes.
Network Operation Center
~~~~~~~~~~~~~~~~~~~~~~~~
Consider infrastructure control when planning the Network Operation
Center (NOC) for a hybrid cloud environment.
If a significant portion of the cloud is on externally managed systems,
prepare for situations where it may not be possible to make changes.
Additionally, providers may differ on how infrastructure must be
managed and exposed. This can lead to delays in root cause analysis
where each insists the blame lies with the other provider.
Ensure that the network structure connects all clouds to form an
integrated system, keeping in mind the state of handoffs.
These handoffs must both be as reliable as possible and
include as little latency as possible to ensure the best
performance of the overall system.
Maintainability
~~~~~~~~~~~~~~~
Hybrid clouds rely on third party systems and processes.
As a result, it is not possible to guarantee proper maintenance
of the overall system. Instead, be prepared to abandon workloads
and recreate them in an improved state.

View File

@ -1,155 +0,0 @@
=====================
Prescriptive examples
=====================
Hybrid cloud environments are designed for these use cases:
* Bursting workloads from private to public OpenStack clouds
* Bursting workloads from private to public non-OpenStack clouds
* High availability across clouds (for technical diversity)
This chapter provides examples of environments that address
each of these use cases.
Bursting to a public OpenStack cloud
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Company A's data center is running low on capacity.
It is not possible to expand the data center in the foreseeable future.
In order to accommodate the continuously growing need for
development resources in the organization,
Company A decides to use resources in the public cloud.
Company A has an established data center with a substantial amount
of hardware. Migrating the workloads to a public cloud is not feasible.
The company has an internal cloud management platform that directs
requests to the appropriate cloud, depending on the local capacity.
This is a custom in-house application written for this specific purpose.
This solution is depicted in the figure below:
.. figure:: figures/Multi-Cloud_Priv-Pub3.png
:width: 100%
This example shows two clouds with a Cloud Management
Platform (CMP) connecting them. This guide does not
discuss a specific CMP, but describes how the Orchestration and
Telemetry services handle, manage, and control workloads.
The private OpenStack cloud has at least one controller and at least
one compute node. It includes metering using the Telemetry service.
The Telemetry service captures the load increase and the CMP
processes the information. If there is available capacity,
the CMP uses the OpenStack API to call the Orchestration service.
This creates instances on the private cloud in response to user requests.
When capacity is not available on the private cloud, the CMP issues
a request to the Orchestration service API of the public cloud.
This creates the instance on the public cloud.
In this example, Company A does not direct the deployments to an
external public cloud due to concerns regarding resource control,
security, and increased operational expense.
Bursting to a public non-OpenStack cloud
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The second example examines bursting workloads from the private cloud
into a non-OpenStack public cloud using Amazon Web Services (AWS)
to take advantage of additional capacity and to scale applications.
The following diagram demonstrates an OpenStack-to-AWS hybrid cloud:
.. figure:: figures/Multi-Cloud_Priv-AWS4.png
:width: 100%
Company B states that its developers are already using AWS
and do not want to change to a different provider.
If the CMP is capable of connecting to an external cloud
provider with an appropriate API, the workflow process remains
the same as the previous scenario.
The actions the CMP takes, such as monitoring loads and
creating new instances, stay the same.
However, the CMP performs actions in the public cloud
using applicable API calls.
If the public cloud is AWS, the CMP would use the
EC2 API to create a new instance and assign an Elastic IP.
It can then add that IP to HAProxy in the private cloud.
The CMP can also reference AWS-specific
tools such as CloudWatch and CloudFormation.
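For illustration, the AWS side of this workflow might look like the
boto3 sketch below; the region, AMI ID, and instance type are
placeholders, and updating HAProxy with the new address is left to
the CMP.

.. code-block:: python

   # Sketch: launch an EC2 instance and attach an Elastic IP that the
   # private cloud's HAProxy can then target. IDs are placeholders.
   import boto3

   ec2 = boto3.client('ec2', region_name='us-east-1')

   # Launch one instance in the public cloud.
   reservation = ec2.run_instances(
       ImageId='ami-0123456789abcdef0',
       InstanceType='t2.small',
       MinCount=1,
       MaxCount=1,
   )
   instance_id = reservation['Instances'][0]['InstanceId']
   ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])

   # Allocate an Elastic IP and bind it to the new instance.
   address = ec2.allocate_address(Domain='vpc')
   ec2.associate_address(
       InstanceId=instance_id,
       AllocationId=address['AllocationId'],
   )
   # The CMP would now add address['PublicIp'] to the HAProxy back end.
   print('New burst instance reachable at', address['PublicIp'])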
Several open source tool kits for building CMPs are
available and can handle this kind of translation.
Examples include ManageIQ, jClouds, and JumpGate.
High availability and disaster recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Company C requires their local data center to be able to
recover from failure. Some of the workloads currently in
use are running on their private OpenStack cloud.
Protecting the data involves Block Storage, Object Storage,
and a database. The architecture supports the failure of
large components of the system while ensuring that the
system continues to deliver services.
While the services remain available to users, the failed
components are restored in the background based on standard
best practice data replication policies.
To achieve these objectives, Company C replicates data to
a second cloud in a geographically distant location.
The following diagram describes this system:
.. figure:: figures/Multi-Cloud_failover2.png
:width: 100%
This example includes two private OpenStack clouds connected with a CMP.
The source cloud, OpenStack Cloud 1, includes a controller and
at least one instance running MySQL. It also includes at least
one Block Storage volume and one Object Storage volume.
This means that data is available to the users at all times.
The details of the method for protecting each of these sources
of data differ.
Object Storage relies on the replication capabilities of
the Object Storage provider.
Company C configures OpenStack Object Storage to create
geographically separated replicas, taking advantage of this feature.
The company configures storage so that at least one replica
exists in each cloud. In order to make this work, the company
configures a single array spanning both clouds with OpenStack Identity.
Using Federated Identity, the array talks to both clouds, communicating
with OpenStack Object Storage through the Swift proxy.
For Block Storage, the replication is a little more difficult,
and involves tools outside of OpenStack itself.
The OpenStack Block Storage volume is not set as the drive itself
but as a logical object that points to a physical back end.
Disaster recovery for Block Storage is configured with synchronous
backup for the highest level of data protection, but asynchronous
backup could have been chosen as a less latency-sensitive
alternative.
For asynchronous backup, the Block Storage API makes it possible
to export the data and also the metadata of a particular volume,
so that it can be moved and replicated elsewhere.
More information can be found here:
`Add volume metadata support to Cinder backup
<https://blueprints.launchpad.net/cinder/+spec/cinder-backup-volume-metadata-support>`_.
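As a sketch of this asynchronous path, python-cinderclient can create
a backup and export its record so the second site can import it. The
Identity endpoint, credentials, and volume ID below are placeholders.

.. code-block:: python

   # Sketch: back up a volume and export the backup record (data
   # location and metadata) for import at the disaster-recovery site.
   from cinderclient import client as cinder_client
   from keystoneauth1 import session
   from keystoneauth1.identity import v3

   auth = v3.Password(
       auth_url='https://keystone.example.com:5000/v3',  # placeholder
       username='admin', password='secret', project_name='admin',
       user_domain_id='default', project_domain_id='default',
   )
   cinder = cinder_client.Client('2', session=session.Session(auth=auth))

   backup = cinder.backups.create('VOLUME_ID', name='db-volume-dr')
   record = cinder.backups.export_record(backup.id)
   # ``record`` carries the backup service name and an opaque URL that
   # the second cloud passes to its own backup-import call.
   print(record['backup_service'], record['backup_url'])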
The synchronous backups create an identical volume in both
clouds and choose the appropriate flavor so that each cloud
has an identical back end. This is done by creating volumes
through the CMP. After this is configured, a solution
involving DRBD synchronizes the physical drives.
The database component is backed up using synchronous backups.
MySQL does not support geographically diverse replication,
so disaster recovery is provided by replicating the file itself.
As it is not possible to use Object Storage as the back end of
a database like MySQL, Swift replication is not an option.
Company C decides not to store the data on another geo-tiered
storage system, such as Ceph, as Block Storage.
This would have given another layer of protection.
Another option would have been to store the database on an OpenStack
Block Storage volume and back it up like any other Block Storage volume.

@ -1,155 +0,0 @@
========================
Technical considerations
========================
A hybrid cloud environment requires inspection and
understanding of technical issues in external data centers that may
not be in your control. Ideally, select an architecture
and CMP that are adaptable to changing environments.
Using diverse cloud platforms increases the risk of compatibility
issues, but clouds using the same version and distribution
of OpenStack are less likely to experience problems.
Clouds that exclusively use the same versions of OpenStack should
have no issues, regardless of distribution. More recent distributions
are less likely to encounter incompatibility between versions.
An OpenStack community initiative defines core functions that need to
remain backward compatible between supported versions. For example, the
DefCore initiative defines basic functions that every distribution must
support in order to use the name OpenStack.
Vendors can add proprietary customization to their distributions.
If an application or architecture makes use of these features, it can be
difficult to migrate to or use other types of environments.
If an environment includes non-OpenStack clouds, it may experience
compatibility problems. CMP tools must account for the differences in
the handling of operations and the implementation of services.
**Possible cloud incompatibilities**
* Instance deployment
* Network management
* Application management
* Services implementation
Capacity planning
~~~~~~~~~~~~~~~~~
One of the primary reasons many organizations use a hybrid cloud
is to increase capacity without making large capital investments.
Capacity and the placement of workloads are key design considerations
for hybrid clouds. The long-term capacity plan for these designs must
incorporate growth over time to prevent permanent consumption of more
expensive external clouds.
To avoid this scenario, account for future applications' capacity
requirements and plan growth appropriately.
It is difficult to predict the amount of load a particular
application might incur if the number of users fluctuates, or the
application experiences an unexpected increase in use.
It is possible to define application requirements in terms of
vCPU, RAM, bandwidth, or other resources and plan appropriately.
However, other clouds might not use the same meters or even the same
oversubscription rates.
Oversubscription is a method to emulate more capacity than
may physically be present.
For example, a physical hypervisor node with 32 GB RAM may host
24 instances, each provisioned with 2 GB RAM.
As long as all 24 instances do not concurrently use 2 full
gigabytes, this arrangement works well.
However, some hosts take oversubscription to extremes and,
as a result, performance can be inconsistent.
If at all possible, determine what the oversubscription rates
of each host are and plan capacity accordingly.
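The arithmetic behind this example is worth making explicit. The
following lines compute the effective RAM oversubscription ratio;
the result, 1.5, happens to match nova's historical default
``ram_allocation_ratio``.

.. code-block:: python

   # Worked example: 24 instances at 2 GB each on a 32 GB hypervisor.
   physical_ram_gb = 32
   instances = 24
   ram_per_instance_gb = 2

   provisioned_gb = instances * ram_per_instance_gb     # 48 GB promised
   oversubscription = provisioned_gb / physical_ram_gb  # 48 / 32 = 1.5

   print('Provisioned %d GB on %d GB physical (ratio %.2f)'
         % (provisioned_gb, physical_ram_gb, oversubscription))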
Utilization
~~~~~~~~~~~
A CMP must be aware of what workloads are running, where they are
running, and their preferred utilizations.
For example, in most cases it is desirable to run as many workloads
internally as possible, utilizing other resources only when necessary.
On the other hand, situations exist in which the opposite is true,
such as when an internal cloud is only for development and stressing
it is undesirable. A cost model of various scenarios and
consideration of internal priorities helps with this decision.
To improve efficiency, automate these decisions when possible.
The Telemetry service (ceilometer) provides information on the usage
of various OpenStack components. Note the following:
* If Telemetry must retain a large amount of data, for
example when monitoring a large or active cloud, we recommend
using a NoSQL back end such as MongoDB.
* You must monitor connections to non-OpenStack clouds
and report this information to the CMP.
Performance
~~~~~~~~~~~
Performance is critical to hybrid cloud deployments, and they are
affected by many of the same issues as multi-site deployments, such
as network latency between sites. Also consider the time required to
run a workload in different clouds and methods for reducing this time.
This may require moving data closer to applications or applications
closer to the data they process, and grouping functionality so that
connections that require low latency take place over a single cloud
rather than spanning clouds.
This may also require a CMP that can determine which cloud can most
efficiently run which types of workloads.
As with utilization, native OpenStack tools help improve performance.
For example, you can use Telemetry to measure performance and the
Orchestration service (heat) to react to changes in demand.
.. note::
Orchestration requires special client configurations to integrate
with Amazon Web Services. For other types of clouds, use CMP features.
Components
~~~~~~~~~~
Using more than one cloud in any design requires consideration of
four OpenStack tools:
OpenStack Compute (nova)
Regardless of deployment location, hypervisor choice has a direct
effect on how difficult it is to integrate with additional clouds.
Networking (neutron)
Whether using OpenStack Networking (neutron) or legacy
networking (nova-network), it is necessary to understand
network integration capabilities in order to connect between clouds.
Telemetry (ceilometer)
Use of Telemetry depends, in large part, on the other parts of
the cloud you are using.
Orchestration (heat)
Orchestration can be a valuable tool in orchestrating tasks a
CMP decides are necessary in an OpenStack-based cloud.
Special considerations
~~~~~~~~~~~~~~~~~~~~~~
Hybrid cloud deployments require consideration of two issues that
are not common in other situations:
Image portability
As of the Kilo release, there is no common image format that is
usable by all clouds. Conversion or recreation of images is necessary
if migrating between clouds; a conversion sketch follows this list.
To simplify deployment, use the smallest and simplest images feasible,
install only what is necessary, and use a deployment manager such as
Chef or Puppet. Do not use golden images to speed up the process
unless you repeatedly deploy the same images on the same cloud.
API differences
Avoid using a hybrid cloud deployment with more than just
OpenStack (or with different versions of OpenStack) as API changes
can cause compatibility issues.
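As referenced above, one common way to convert images between formats
is the ``qemu-img`` tool. The following minimal sketch wraps it from
Python; the file names are placeholders, and the source and target
formats depend on the clouds involved.

.. code-block:: python

   # Sketch: convert a QCOW2 image to VMDK with qemu-img before
   # uploading it to a cloud that cannot consume QCOW2.
   import subprocess

   def convert_image(src, dst, src_format='qcow2', dst_format='vmdk'):
       """Convert a disk image between formats using qemu-img."""
       subprocess.run(
           ['qemu-img', 'convert', '-f', src_format, '-O', dst_format,
            src, dst],
           check=True,
       )

   convert_image('app-server.qcow2', 'app-server.vmdk')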

@ -1,178 +0,0 @@
=================
User requirements
=================
Hybrid cloud architectures are complex, especially those
that use heterogeneous cloud platforms.
Ensure that design choices match requirements so that the
benefits outweigh the inherent additional complexity and risks.
Business considerations
~~~~~~~~~~~~~~~~~~~~~~~
Business considerations when designing a hybrid cloud deployment
----------------------------------------------------------------
Cost
A hybrid cloud architecture involves multiple vendors and
technical architectures.
These architectures may be more expensive to deploy and maintain.
Operational costs can be higher because of the need for more
sophisticated orchestration and brokerage tools than in other architectures.
In contrast, overall operational costs might be lower by
virtue of using a cloud brokerage tool to deploy the
workloads to the most cost effective platform.
Revenue opportunity
Revenue opportunities vary based on the intent and use case of the cloud.
As a commercial, customer-facing product, you must consider whether building
over multiple platforms makes the design more attractive to customers.
Time-to-market
One common reason to use cloud platforms is to improve the
time-to-market of a new product or application.
For example, using multiple cloud platforms is viable because
there is an existing investment in several applications.
It is faster to tie the investments together rather than migrate
the components and refactor them to a single platform.
Business or technical diversity
Organizations leveraging cloud-based services can embrace business
diversity and utilize a hybrid cloud design to spread their
workloads across multiple cloud providers. This ensures that
no single cloud provider is the sole host for an application.
Application momentum
Businesses with existing applications may find that it is
more cost effective to integrate applications on multiple
cloud platforms than migrating them to a single platform.
Workload considerations
~~~~~~~~~~~~~~~~~~~~~~~
A workload can be a single application or a suite of applications
that work together. It can also be a duplicate set of applications that
need to run on multiple cloud environments.
In a hybrid cloud deployment, the same workload often needs to function
equally well on radically different public and private cloud environments.
The architecture needs to address these potential conflicts,
complexity, and platform incompatibilities.
Use cases for a hybrid cloud architecture
-----------------------------------------
Dynamic resource expansion or bursting
An application that requires additional resources may suit a multiple
cloud architecture. For example, a retailer needs additional resources
during the holiday season, but does not want to add private cloud
resources to meet the peak demand.
The user can accommodate the increased load by bursting to
a public cloud for these peak load periods. These bursts could be
for long or short cycles ranging from hourly to yearly.
Disaster recovery and business continuity
Cheaper storage makes the public cloud suitable for maintaining
backup applications.
Federated hypervisor and instance management
Adding self-service, charge back, and transparent delivery of
the resources from a federated pool can be cost effective.
In a hybrid cloud environment, this is a particularly important
consideration. Look for a cloud that provides cross-platform
hypervisor support and robust instance management tools.
Application portfolio integration
An enterprise cloud delivers efficient application portfolio
management and deployments by leveraging self-service features
and rules according to use.
Integrating existing cloud environments is a common driver
when building hybrid cloud architectures.
Migration scenarios
Hybrid cloud architecture enables the migration of
applications between different clouds.
High availability
A combination of locations and platforms enables a level of
availability that is not possible with a single platform.
However, this approach increases design complexity, so we recommend
first exploring options such as transferring workloads across clouds
at the application, instance, cloud platform, hypervisor, and
network levels.
Tools considerations
~~~~~~~~~~~~~~~~~~~~
Hybrid cloud designs must incorporate tools to facilitate working
across multiple clouds.
Tool functions
--------------
Broker between clouds
Brokering software evaluates relative costs between different
cloud platforms. Cloud Management Platforms (CMPs)
allow the designer to determine the right location for the
workload based on predetermined criteria.
Facilitate orchestration across the clouds
CMPs simplify the migration of application workloads between
public, private, and hybrid cloud platforms.
We recommend using cloud orchestration tools for managing a diverse
portfolio of systems and applications across multiple cloud platforms.
Network considerations
~~~~~~~~~~~~~~~~~~~~~~
It is important to consider the functionality, security, scalability,
availability, and testability of the network when choosing a CMP and
cloud provider.
* Decide on a network framework and design minimum functionality tests.
This ensures testing and functionality persists during and after
upgrades.
* Scalability across multiple cloud providers may dictate which underlying
network framework you choose in different cloud providers.
It is important to present the network API functions and to verify
that functionality persists across all cloud endpoints chosen.
* High availability implementations vary in functionality and design.
Examples of some common methods are active-hot-standby, active-passive,
and active-active.
Development of high availability and test frameworks is necessary to
ensure understanding of functionality and limitations.
* Consider the security of data between the client and the endpoint,
and of traffic that traverses the multiple clouds.
Risk mitigation and management considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hybrid cloud architectures introduce additional risk because
they are more complex than a single cloud design and may involve
incompatible components or tools. However, they also reduce
risk by spreading workloads over multiple providers.
Hybrid cloud risks
------------------
Provider availability or implementation details
Business changes can affect provider availability.
Likewise, changes in a provider's service can disrupt
a hybrid cloud environment or increase costs.
Differing SLAs
Hybrid cloud designs must accommodate differences in SLAs
between providers, and consider their enforceability.
Security levels
Securing multiple cloud environments is more complex than
securing single cloud environments. We recommend addressing
concerns at the application, network, and cloud platform levels.
Be aware that each cloud platform approaches security differently,
and a hybrid cloud design must address and compensate for these differences.
Provider API changes
Consumers of external clouds rarely have control over provider
changes to APIs, and changes can break compatibility.
Using only the most common and basic APIs can minimize potential conflicts.

@ -1,45 +0,0 @@
======
Hybrid
======
.. toctree::
:maxdepth: 2
hybrid-user-requirements.rst
hybrid-technical-considerations.rst
hybrid-architecture.rst
hybrid-operational-considerations.rst
hybrid-prescriptive-examples.rst
A :term:`hybrid cloud` design is one that uses more than one cloud.
For example, designs that use both an OpenStack-based private
cloud and an OpenStack-based public cloud, or that use an
OpenStack cloud and a non-OpenStack cloud, are hybrid clouds.
:term:`Bursting <bursting>` describes the practice of creating new instances
in an external cloud to alleviate capacity issues in a private cloud.
**Example scenarios suited to hybrid clouds**
* Bursting from a private cloud to a public cloud
* Disaster recovery
* Development and testing
* Federated cloud, enabling users to choose resources from multiple providers
* Supporting legacy systems as they transition to the cloud
Hybrid clouds interact with systems that are outside the
control of the private cloud administrator, and require
careful architecture to prevent conflicts with hardware,
software, and APIs under external control.
The degree to which the architecture is OpenStack-based affects your ability
to accomplish tasks with native OpenStack tools. By definition,
this is a situation in which no single cloud can provide all
of the necessary functionality. In order to manage the entire
system, we recommend using a cloud management platform (CMP).
There are several commercial and open source CMPs available,
but there is no single CMP that can address all needs in all
scenarios, and sometimes a manually-built solution is the best
option. This chapter includes discussion of using CMPs for
managing a hybrid cloud.

@ -1,35 +0,0 @@
.. meta::
:description: This guide targets OpenStack Architects
for architectural design
:keywords: Architecture, OpenStack
===================================
OpenStack Architecture Design Guide
===================================
Abstract
~~~~~~~~
To reap the benefits of OpenStack, you should plan, design,
and architect your cloud properly, taking users' needs into
account and understanding the use cases.
Contents
~~~~~~~~
.. toctree::
:maxdepth: 2
common/conventions.rst
introduction.rst
legal-security-requirements.rst
generalpurpose.rst
compute-focus.rst
storage-focus.rst
network-focus.rst
multi-site.rst
hybrid.rst
massively-scalable.rst
specialized.rst
references.rst
common/appendix.rst

@ -1,33 +0,0 @@
How this book is organized
~~~~~~~~~~~~~~~~~~~~~~~~~~
This book examines some of the most common uses for OpenStack clouds,
and explains the considerations for each use case. Cloud architects may
use this book as a comprehensive guide by reading all of the use cases,
but it is also possible to review only the chapters which pertain to a
specific use case. The use cases covered in this guide include:
* :doc:`General purpose<generalpurpose>`: Uses common components that
address 80% of common use cases.
* :doc:`Compute focused<compute-focus>`: For compute intensive workloads
such as high performance computing (HPC).
* :doc:`Storage focused<storage-focus>`: For storage intensive workloads
such as data analytics with parallel file systems.
* :doc:`Network focused<network-focus>`: For high performance and
reliable networking, such as a :term:`content delivery network (CDN)`.
* :doc:`Multi-site<multi-site>`: For applications that require multiple
site deployments for geographical, reliability, or data locality
reasons.
* :doc:`Hybrid cloud<hybrid>`: Uses multiple disparate clouds connected
for failover, hybrid cloud bursting, or availability.
* :doc:`Massively scalable<massively-scalable>`: For cloud service
providers or other large installations.
* :doc:`Specialized cases<specialized>`: Architectures that have not
previously been covered in the defined use cases.

@ -1,55 +0,0 @@
Why and how we wrote this book
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We wrote this book to guide you through designing an OpenStack cloud
architecture. This guide identifies design considerations for common
cloud use cases and provides examples.
The Architecture Design Guide was written in a book sprint format, which
is a facilitated, rapid development production method for books. The
Book Sprint was facilitated by Faith Bosworth and Adam Hyde of Book
Sprints. For more information, see the Book Sprints website
(www.booksprints.net).
This book was written in five days during July 2014 while exhausting the
M&M, Mountain Dew and healthy options supply, complete with juggling
entertainment during lunches at VMware's headquarters in Palo Alto.
We would like to thank VMware for their generous hospitality, as well as
our employers, Cisco, Cloudscaling, Comcast, EMC, Mirantis, Rackspace,
Red Hat, Verizon, and VMware, for enabling us to contribute our time. We
would especially like to thank Anne Gentle and Kenneth Hui for all of
their shepherding and organization in making this happen.
The author team includes:
* Kenneth Hui (EMC) `@hui\_kenneth <http://twitter.com/hui_kenneth>`__
* Alexandra Settle (Rackspace)
`@dewsday <http://twitter.com/dewsday>`__
* Anthony Veiga (Comcast) `@daaelar <http://twitter.com/daaelar>`__
* Beth Cohen (Verizon) `@bfcohen <http://twitter.com/bfcohen>`__
* Kevin Jackson (Rackspace)
`@itarchitectkev <http://twitter.com/itarchitectkev>`__
* Maish Saidel-Keesing (Cisco)
`@maishsk <http://twitter.com/maishsk>`__
* Nick Chase (Mirantis) `@NickChase <http://twitter.com/NickChase>`__
* Scott Lowe (VMware) `@scott\_lowe <http://twitter.com/scott_lowe>`__
* Sean Collins (Comcast) `@sc68cal <http://twitter.com/sc68cal>`__
* Sean Winn (Cloudscaling)
`@seanmwinn <http://twitter.com/seanmwinn>`__
* Sebastian Gutierrez (Red Hat) `@gutseb <http://twitter.com/gutseb>`__
* Stephen Gordon (Red Hat) `@xsgordon <http://twitter.com/xsgordon>`__
* Vinny Valdez (Red Hat)
`@VinnyValdez <http://twitter.com/VinnyValdez>`__

@ -1,11 +0,0 @@
Intended audience
~~~~~~~~~~~~~~~~~
This book has been written for architects and designers of OpenStack
clouds. For a guide on deploying and operating OpenStack, please refer
to the `OpenStack Operations Guide <https://docs.openstack.org/ops-guide/>`_.
Before reading this book, we recommend prior knowledge of cloud
architecture and principles, experience in enterprise system design,
Linux and virtualization experience, and a basic understanding of
networking principles and protocols.

@ -1,146 +0,0 @@
Methodology
~~~~~~~~~~~
The best way to design your cloud architecture is through creating and
testing use cases. Planning for applications that support thousands of
sessions per second, variable workloads, and complex, changing data,
requires you to identify the key meters. Identifying these key meters,
such as number of concurrent transactions per second, and size of
database, makes it possible to build a method for testing your
assumptions.
Use a functional user scenario to develop test cases, and to measure
overall project trajectory.
.. note::
If you do not want to use an application to develop user
requirements automatically, you need to create requirements to build
test harnesses and develop usable meters.
Establishing these meters allows you to respond to changes quickly
without having to set exact requirements in advance. This creates ways
to configure the system, rather than redesigning it every time there is
a requirements change.
.. important::
It is important to limit scope creep. Ensure you address tool
limitations, but do not recreate the entire suite of tools. Work
with technical product owners to establish critical features that
are needed for a successful cloud deployment.
Application cloud readiness
---------------------------
The cloud does more than host virtual machines and their applications.
This *lift and shift* approach works in certain situations, but there is
a fundamental difference between clouds and traditional bare-metal-based
environments, or even traditional virtualized environments.
In traditional environments, with traditional enterprise applications,
the applications and the servers that run on them are *pets*. They are
lovingly crafted and cared for, the servers have names like Gandalf or
Tardis, and if they get sick someone nurses them back to health. All of
this is designed so that the application does not experience an outage.
In cloud environments, servers are more like cattle. There are thousands
of them, they get names like NY-1138-Q, and if they get sick, they get
put down and a sysadmin installs another one. Traditional applications
that are unprepared for this kind of environment may suffer outages,
loss of data, or complete failure.
There are other reasons to design applications with the cloud in mind.
Some are defensive, such as the fact that because applications cannot be
certain of exactly where or on what hardware they will be launched, they
need to be flexible, or at least adaptable. Others are proactive. For
example, one of the advantages of using the cloud is scalability.
Applications need to be designed in such a way that they can take
advantage of these and other opportunities.
Determining whether an application is cloud-ready
-------------------------------------------------
There are several factors to take into consideration when looking at
whether an application is a good fit for the cloud.
Structure
A large, monolithic, single-tiered, legacy application typically is
not a good fit for the cloud. Efficiencies are gained when load can
be spread over several instances, so that a failure in one part of
the system can be mitigated without affecting other parts of the
system, or so that scaling can take place where the app needs it.
Dependencies
Applications that depend on specific hardware, such as a particular
chip set or an external device such as a fingerprint reader, might
not be a good fit for the cloud, unless those dependencies are
specifically addressed. Similarly, if an application depends on an
operating system or set of libraries that cannot be used in the
cloud, or cannot be virtualized, that is a problem.
Connectivity
Self-contained applications, or those that depend on resources that
are not reachable by the cloud in question, will not run. In some
situations, you can work around these issues with custom network
setup, but how well this works depends on the chosen cloud
environment.
Durability and resilience
Despite the existence of SLAs, things break: servers go down,
network connections are disrupted, or too many projects on a server
make a server unusable. An application must be sturdy enough to
contend with these issues.
Designing for the cloud
-----------------------
Here are some guidelines to keep in mind when designing an application
for the cloud:
* Be a pessimist: Assume everything fails and design backwards.
* Put your eggs in multiple baskets: Leverage multiple providers,
geographic regions and availability zones to accommodate for local
availability issues. Design for portability.
* Think efficiency: Inefficient designs will not scale. Efficient
designs become cheaper as they scale. Kill off unneeded components or
capacity.
* Be paranoid: Design for defense in depth and zero tolerance by
building in security at every level and between every component.
Trust no one.
* But not too paranoid: Not every application needs the platinum
solution. Architect for different SLAs, service tiers, and security
levels.
* Manage the data: Data is usually the most inflexible and complex area
of a cloud and cloud integration architecture. Do not shortchange
the effort in analyzing and addressing data needs.
* Hands off: Leverage automation to increase consistency and quality
and reduce response times.
* Divide and conquer: Pursue partitioning and parallel layering
wherever possible. Make components as small and portable as possible.
Use load balancing between layers.
* Think elasticity: Increasing resources should result in a
proportional increase in performance and scalability. Decreasing
resources should have the opposite effect.
* Be dynamic: Enable dynamic configuration changes such as auto
scaling, failure recovery and resource discovery to adapt to changing
environments, faults, and workload volumes.
* Stay close: Reduce latency by moving highly interactive components
and data near each other.
* Keep it loose: Loose coupling, service interfaces, separation of
concerns, abstraction, and well-defined APIs deliver flexibility.
* Be cost aware: Autoscaling, data transmission, virtual software
licenses, reserved instances, and similar costs can rapidly increase
monthly usage charges. Monitor usage closely.

@ -1,15 +0,0 @@
============
Introduction
============
.. toctree::
:maxdepth: 2
introduction-intended-audience.rst
introduction-how-this-book-is-organized.rst
introduction-how-this-book-was-written.rst
introduction-methodology.rst
:term:`OpenStack` is a fully-featured, self-service cloud. This book takes you
through some of the considerations you have to make when designing your
cloud.

@ -1,254 +0,0 @@
===============================
Security and legal requirements
===============================
This chapter discusses the legal and security requirements you
need to consider for the different OpenStack scenarios.
Legal requirements
~~~~~~~~~~~~~~~~~~
Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:
* Data retention policies ensuring storage of persistent data
and records management to meet data archival requirements.
* Data ownership policies governing the possession and
responsibility for data.
* Data sovereignty policies governing the storage of data in
foreign countries or otherwise separate jurisdictions.
* Data compliance policies governing certain types of
information needing to reside in certain locations due to
regulatory issues and, more importantly, not to reside in
other locations for the same reason.
Examples of such legal frameworks include the
`data protection framework <http://ec.europa.eu/justice/data-protection/>`_
of the European Union and the requirements of the
`Financial Industry Regulatory Authority
<http://www.finra.org/Industry/Regulation/FINRARules/>`_
in the United States.
Consult a local regulatory body for more information.
.. _security:
Security
~~~~~~~~
When deploying OpenStack in an enterprise as a private cloud, the
cloud architecture should not make assumptions about safety and
protection, even with a firewall activated and employees bound by
security agreements.
In addition to considering the users, operators, or administrators
who will use the environment, consider also negative or hostile users who
would attack or compromise the security of your deployment regardless
of firewalls or security agreements.
Attack vectors increase further in a public-facing OpenStack deployment.
For example, the API endpoints and the software behind it become
vulnerable to hostile entities attempting to gain unauthorized access
or prevent access to services.
This can result in loss of reputation and you must protect against
it through auditing and appropriate filtering.
It is important to understand that user authentication requests
contain sensitive information such as user names, passwords, and
authentication tokens. For this reason, place the API services
behind hardware that performs SSL termination.
.. warning::
Be mindful of consistency when utilizing third party
clouds to explore authentication options.
Security domains
~~~~~~~~~~~~~~~~
A security domain comprises users, applications, servers or networks
that share common trust requirements and expectations within a system.
Typically, security domains have the same authentication and
authorization requirements and users.
You can map security domains individually to the installation,
or combine them. For example, some deployment topologies combine both
guest and data domains onto one physical network.
In other cases these networks are physically separate.
Map out the security domains against the needs of specific OpenStack
topologies.
The domains and their trust requirements depend on whether the cloud
instance is public, private, or hybrid.
Public security domains
-----------------------
The public security domain is an untrusted area of the cloud
infrastructure. It can refer to the internet as a whole or simply
to networks over which the user has no authority.
Always consider this domain untrusted. For example,
in a hybrid cloud deployment, any information traversing between and
beyond the clouds is in the public domain and untrustworthy.
Guest security domains
----------------------
Typically used for compute instance-to-instance traffic, the
guest security domain handles compute data generated by
instances on the cloud but not services that support the
operation of the cloud, such as API calls. Public cloud
providers and private cloud providers who do not have
stringent controls on instance use or who allow unrestricted
internet access to instances should consider this domain to be
untrusted. Private cloud providers may want to consider this
network as internal and therefore trusted only if they have
controls in place to assert that they trust instances and all
their projects.
Management security domains
---------------------------
The management security domain is where services interact.
The networks in this domain transport confidential data such as
configuration parameters, user names, and passwords. Trust this
domain when it is behind an organization's firewall in deployments.
Data security domains
---------------------
The data security domain is concerned primarily with
information pertaining to the storage services within OpenStack.
The data that crosses this network has integrity and
confidentiality requirements. Depending on the type of deployment there
may also be availability requirements. The trust level of this network
is heavily dependent on deployment decisions and does not have a default
level of trust.
Hypervisor security
~~~~~~~~~~~~~~~~~~~
The hypervisor also requires a security assessment. In a
public cloud, organizations typically do not have control
over the choice of hypervisor. Properly securing your
hypervisor is important. Attacks made upon the
unsecured hypervisor are called a **hypervisor breakout**.
Hypervisor breakout describes the event of a
compromised or malicious instance breaking out of the resource
controls of the hypervisor and gaining access to the bare
metal operating system and hardware resources.
If the security of instances is not important, this is not an issue.
However, enterprises need to avoid vulnerability. The only way to
do this is to avoid running instances on a public cloud. That does not
mean that there is a need to own all of the infrastructure on which an
OpenStack installation operates; it suggests avoiding situations in
which hardware is shared with others.
Baremetal security
~~~~~~~~~~~~~~~~~~
There are other services worth considering that provide a
bare metal instance instead of a cloud. In other cases, it is
possible to replicate a second private cloud by integrating
with a private Cloud-as-a-Service deployment. The
organization does not buy the hardware, but also does not share it
with other projects. It is also possible to use a provider that
hosts a bare-metal public cloud instance for which the
hardware is dedicated only to one customer, or a provider that
offers private Cloud-as-a-Service.
.. important::
Each cloud implements services differently.
What keeps data secure in one cloud may not do the same in another.
Be sure to know the security requirements of every cloud that
handles the organization's data or workloads.
More information on OpenStack Security can be found in the
`OpenStack Security Guide <https://docs.openstack.org/security-guide>`_.
Networking security
~~~~~~~~~~~~~~~~~~~
Consider security implications and requirements before designing the
physical and logical network topologies. Make sure that the networks are
properly segregated and traffic flows are going to the correct
destinations without crossing through locations that are undesirable.
Consider the following example factors:
* Firewalls
* Overlay interconnects for joining separated project networks
* Routing through or avoiding specific networks
How networks attach to hypervisors can expose security
vulnerabilities. To mitigate hypervisor breakouts,
separate networks from other systems and schedule instances for the
network onto dedicated compute nodes. This prevents attackers
from having access to the networks from a compromised instance.
Multi-site security
~~~~~~~~~~~~~~~~~~~
Securing a multi-site OpenStack installation brings
extra challenges. Projects may expect a project-created network
to be secure. In a multi-site installation the use of a
non-private connection between sites may be required. This may
mean that traffic would be visible to third parties and, in
cases where an application requires security, this issue
requires mitigation. In these instances, install a VPN or
encrypted connection between sites to conceal sensitive traffic.
Another security consideration with regard to multi-site
deployments is Identity. Centralize authentication within a
multi-site deployment. Centralization provides a
single authentication point for users across the deployment,
as well as a single point of administration for traditional
create, read, update, and delete operations. Centralized
authentication is also useful for auditing purposes because
all authentication tokens originate from the same source.
Just as projects in a single-site deployment need isolation
from each other, so do projects in multi-site installations.
The extra challenges in multi-site designs revolve around
ensuring that project networks function across regions.
OpenStack Networking (neutron) does not presently support
a mechanism to provide this functionality; therefore, an
external system may be necessary to manage these mappings.
Project networks may contain sensitive information requiring
that this mapping be accurate and consistent to ensure that a
project in one site does not connect to a different project in
another site.
OpenStack components
~~~~~~~~~~~~~~~~~~~~
Most OpenStack installations require a bare minimum set of
pieces to function. These include OpenStack Identity
(keystone) for authentication, OpenStack Compute
(nova) for compute, OpenStack Image service (glance) for image
storage, OpenStack Networking (neutron) for networking, and
potentially an object store in the form of OpenStack Object
Storage (swift). Bringing multi-site into play also demands extra
components in order to coordinate between regions. A centralized
Identity service is necessary to provide the single authentication
point. A centralized dashboard is also recommended to provide a
single login point and a mapped experience to the API and CLI
options available. If needed, use a centralized Object Storage service,
installing the required swift proxy service alongside the Object
Storage service.
It may also be helpful to install a few extra options in
order to facilitate certain use cases. For instance,
installing DNS service may assist in automatically generating
DNS domains for each region with an automatically-populated
zone full of resource records for each instance. This
facilitates using DNS as a mechanism for determining which
region would be selected for certain applications.
Another useful tool for managing a multi-site installation
is Orchestration (heat). The Orchestration service allows
the use of templates to define a set of instances to be launched
together or for scaling existing sets.
It can set up matching or differentiated groupings based on regions.
For instance, if an application requires an equally balanced
number of nodes across sites, the same heat template can be used
to cover each site with small alterations to only the region name.
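A minimal sketch of this pattern with the openstacksdk follows: the
same template file is launched in each region, with only the region
name varying per connection. The cloud name, region names, and
template path are placeholders.

.. code-block:: python

   # Sketch: launch the same heat template in every region, varying
   # only the region name passed to each connection.
   import openstack

   REGIONS = ['region-one', 'region-two']  # placeholder region names

   for region in REGIONS:
       conn = openstack.connect(cloud='mycloud', region_name=region)
       stack = conn.create_stack('app-%s' % region,
                                 template_file='app-stack.yaml',
                                 wait=True)
       print('Stack %s created in %s' % (stack.id, region))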

@ -1,85 +0,0 @@
Operational considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to run efficiently at massive scale, automate as many of the
operational processes as possible. Automation includes the configuration of
provisioning, monitoring and alerting systems. Part of the automation process
includes the capability to determine when human intervention is required and
who should act. The objective is to decrease the ratio of operational staff to
running systems as much as possible in order to reduce maintenance costs. In a
massively scaled environment, it is very difficult for staff to give each
system individual care.
Configuration management tools such as Puppet and Chef enable operations staff
to categorize systems into groups based on their roles and thus create
configurations and system states that the provisioning system enforces.
Systems that fall out of the defined state due to errors or failures are
quickly removed from the pool of active nodes and replaced.
At large scale the resource cost of diagnosing failed individual systems is
far greater than the cost of replacement. It is more economical to replace the
failed system with a new system, provisioning and configuring it automatically
and adding it to the pool of active nodes. By automating tasks that are
labor-intensive, repetitive, and critical to operations, cloud operations
teams can work more efficiently because fewer resources are required for these
common tasks. Administrators are then free to tackle tasks that are not easy
to automate and that have longer-term impacts on the business, for example,
capacity planning.
The bleeding edge
-----------------
Running OpenStack at massive scale requires striking a balance between
stability and features. For example, it might be tempting to run an older
stable release branch of OpenStack to make deployments easier. However, when
running at massive scale, known issues that may be of some concern or only
have minimal impact in smaller deployments could become pain points. Recent
releases may address well known issues. The OpenStack community can help
resolve reported issues by applying the collective expertise of the OpenStack
developers.
The number of organizations running at massive scales is a small proportion of
the OpenStack community, therefore it is important to share related issues
with the community and be a vocal advocate for resolving them. Some issues
only manifest when operating at large scale, and the number of organizations
able to duplicate and validate an issue is small, so it is important to
document and dedicate resources to their resolution.
In some cases, the resolution to the problem is ultimately to deploy a more
recent version of OpenStack. Alternatively, when you must resolve an issue in
a production environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific underlying
components in order to resolve issues or gain significant performance
improvements. Although this may appear to expose the deployment to
increased risk and instability, in many cases the update resolves an
issue that was already present but undiscovered.
We recommend building a development and operations organization that is
responsible for creating desired features, diagnosing and resolving issues,
and building the infrastructure for large scale continuous integration tests
and continuous deployment. This helps catch bugs early and makes deployments
faster and easier. In addition to development resources, we also recommend the
recruitment of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.
Growth and capacity planning
----------------------------
An important consideration in running at massive scale is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor projects
can lead to rapid jumps in the utilization rates of all resources, the steady
adoption of the cloud inside an organization or by consumers in a public
offering also creates a steady trend of increased utilization.
Skills and training
-------------------
Projecting growth for storage, networking, and compute is only one aspect of a
growth plan for running OpenStack at massive scale. Growing and nurturing
development and operational staff is an additional consideration. Sending team
members to OpenStack conferences, meetup events, and encouraging active
participation in the mailing lists and committees is a very important way to
maintain skills and forge relationships in the community. For a list of
OpenStack training providers in the marketplace, see the `OpenStack Marketplace
<https://www.openstack.org/marketplace/training/>`_.

@ -1,110 +0,0 @@
Technical considerations
~~~~~~~~~~~~~~~~~~~~~~~~
Repurposing an existing OpenStack environment to be massively scalable is a
formidable task. When building a massively scalable environment from the
ground up, ensure you build the initial deployment with the same principles
and choices that apply as the environment grows. For example, a good approach
is to deploy the first site as a multi-site environment. This enables you to
use the same deployment and segregation methods as the environment grows to
separate locations across dedicated links or wide area networks. In a
hyperscale cloud, scale trumps redundancy. Modify applications with this in
mind, relying on the scale and homogeneity of the environment to provide
reliability rather than redundant infrastructure provided by non-commodity
hardware solutions.
Infrastructure segregation
--------------------------
OpenStack services support massive horizontal scale. Be aware that this is
not the case for the entire supporting infrastructure. This is particularly a
problem for the database management systems and message queues that OpenStack
services use for data storage and remote procedure call communications.
Traditional clustering techniques typically provide high availability and some
additional scale for these environments. In the quest for massive scale,
however, you must take additional steps to relieve the performance pressure on
these components in order to prevent them from negatively impacting the
overall performance of the environment. Ensure that all the components are in
balance so that if the massively scalable environment fails, all the
components are near maximum capacity and a single component is not causing the
failure.
Regions segregate completely independent installations linked only by an
Identity and Dashboard (optional) installation. Services have separate API
endpoints for each region, and include separate database and queue
installations. This exposes some awareness of the environment's fault domains
to users and gives them the ability to ensure some degree of application
resiliency while also imposing the requirement to specify which region to
apply their actions to.
Environments operating at massive scale typically need their regions or sites
subdivided further without exposing the requirement to specify the failure
domain to the user. This provides the ability to further divide the
installation into failure domains while also providing a logical unit for
maintenance and the addition of new hardware. At hyperscale, instead of adding
single compute nodes, administrators can add entire racks or even groups of
racks at a time with each new addition of nodes exposed via one of the
segregation concepts mentioned herein.
:term:`Cells <cell>` provide the ability to subdivide the compute portion of
an OpenStack installation, including regions, while still exposing a single
endpoint. Each region has an API cell along with a number of compute cells
where the workloads actually run. Each cell has its own database and message
queue setup (ideally clustered), providing the ability to subdivide the load
on these subsystems, improving overall performance.
Each compute cell provides a complete compute installation, complete with full
database and queue installations, scheduler, conductor, and multiple compute
hosts. The cells scheduler handles placement of user requests from the single
API endpoint to a specific cell from those available. The normal filter
scheduler then handles placement within the cell.
Unfortunately, Compute is the only OpenStack service that provides good
support for cells. In addition, cells do not adequately support some standard
OpenStack functionality such as security groups and host aggregates. Due to
their relative newness and specialized use, cells receive relatively little
testing in the OpenStack gate. Despite these issues, cells play an important
role in well known OpenStack installations operating at massive scale, such as
those at CERN and Rackspace.
Host aggregates
---------------
Host aggregates enable partitioning of OpenStack Compute deployments into
logical groups for load balancing and instance distribution. You can also use
host aggregates to further partition an availability zone. Consider a cloud
which might use host aggregates to partition an availability zone into groups
of hosts that either share common resources, such as storage and network, or
have a special property, such as trusted computing hardware. You cannot target
host aggregates explicitly. Instead, select instance flavors that map to host
aggregate metadata. These flavors target host aggregates implicitly.
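For illustration, the mapping can be wired up with python-novaclient
as sketched below. The endpoint, credentials, aggregate, host, and
flavor names are placeholders, and the implicit targeting assumes the
``AggregateInstanceExtraSpecsFilter`` scheduler filter is enabled.

.. code-block:: python

   # Sketch: create an aggregate of SSD-backed hosts and a flavor whose
   # extra specs match the aggregate metadata, so instances using the
   # flavor land on those hosts implicitly.
   from keystoneauth1 import session
   from keystoneauth1.identity import v3
   from novaclient import client as nova_client

   auth = v3.Password(
       auth_url='https://keystone.example.com:5000/v3',  # placeholder
       username='admin', password='secret', project_name='admin',
       user_domain_id='default', project_domain_id='default',
   )
   nova = nova_client.Client('2', session=session.Session(auth=auth))

   # Group the SSD-backed hosts and tag the aggregate with metadata.
   agg = nova.aggregates.create('ssd-hosts', None)  # no availability zone
   nova.aggregates.set_metadata(agg.id, {'ssd': 'true'})
   nova.aggregates.add_host(agg.id, 'compute01')

   # A flavor whose extra specs match the aggregate metadata.
   flavor = nova.flavors.create('m1.ssd', ram=4096, vcpus=2, disk=40)
   flavor.set_keys({'aggregate_instance_extra_specs:ssd': 'true'})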
Availability zones
------------------
Availability zones provide another mechanism for subdividing an installation
or region. They are, in effect, host aggregates exposed for (optional)
explicit targeting by users.
Unlike cells, availability zones do not have their own database server or
queue broker but represent an arbitrary grouping of compute nodes. Typically,
nodes are grouped into availability zones using a shared failure domain based
on a physical characteristic such as a shared power source or physical network
connections. Users can target exposed availability zones; however, this is not
a requirement. An alternative approach is to set a default availability
zone (nova's ``default_schedule_zone`` option) so that instances schedule
into a zone other than the default ``nova`` zone without users targeting
one explicitly.
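A brief python-novaclient sketch of exposing and targeting a zone
follows; as before, the endpoint, credentials, and resource names are
placeholders.

.. code-block:: python

   # Sketch: expose a shared power-failure domain as an availability
   # zone, then (optionally) target it explicitly at boot.
   from keystoneauth1 import session
   from keystoneauth1.identity import v3
   from novaclient import client as nova_client

   auth = v3.Password(
       auth_url='https://keystone.example.com:5000/v3',  # placeholder
       username='admin', password='secret', project_name='admin',
       user_domain_id='default', project_domain_id='default',
   )
   nova = nova_client.Client('2', session=session.Session(auth=auth))

   # Creating an aggregate with a zone name exposes the zone to users.
   agg = nova.aggregates.create('rack-a', 'az-rack-a')
   nova.aggregates.add_host(agg.id, 'compute01')

   # Users may target the zone explicitly at boot, but need not.
   nova.servers.create(
       'db-1',
       image='IMAGE_UUID',   # placeholder
       flavor='FLAVOR_ID',   # placeholder
       availability_zone='az-rack-a',
   )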
Segregation example
-------------------
In this example, the cloud is divided into two regions, an API cell and
three child cells for each region, with three availability zones in each
cell based on the power layout of the data centers.
The figure below describes the relationship between them within one region.
.. figure:: figures/Massively_Scalable_Cells_regions_azs.png
A number of host aggregates enable targeting of virtual machine instances
using flavors that require special capabilities shared by the target
hosts, such as SSDs, 10 GbE networks, or GPU cards.

@ -1,91 +0,0 @@
User requirements
~~~~~~~~~~~~~~~~~
Defining user requirements for a massively scalable OpenStack design
architecture dictates approaching the design from two different, yet sometimes
opposing, perspectives: the cloud user, and the cloud operator. The
expectations and perceptions of the consumption and management of resources of
a massively scalable OpenStack cloud from these two perspectives are
distinctly different.
Massively scalable OpenStack clouds have the following user requirements:
* The cloud user expects repeatable, dependable, and deterministic processes
for launching and deploying cloud resources. You could deliver this through
a web-based interface or publicly available API endpoints. All appropriate
options for requesting cloud resources must be available through some type
of user interface, a command-line interface (CLI), or API endpoints.
* Cloud users expect a fully self-service and on-demand consumption model.
When an OpenStack cloud reaches the massively scalable size, expect
consumption as a service in each and every way.
* For a user of a massively scalable OpenStack public cloud, there are no
expectations for control over security, performance, or availability. Users
expect only SLAs related to uptime of API services, and very basic SLAs for
services offered. It is the user's responsibility to address these issues on
their own. The exception to this expectation is the rare case of a massively
scalable cloud infrastructure built for a private or government organization
that has specific requirements.
The cloud user's requirements and expectations that determine the cloud design
focus on the consumption model. The user expects to consume cloud resources in
an automated and deterministic way, without any need for knowledge of the
capacity, scalability, or other attributes of the cloud's underlying
infrastructure.
Operator requirements
---------------------
While the cloud user can be completely unaware of the underlying
infrastructure of the cloud and its attributes, the operator must build and
support the infrastructure for operating at scale. This presents a very
demanding set of requirements for building such a cloud from the operator's
perspective:
* Everything must be capable of automation. This includes everything
from the compute, storage, and networking hardware to the installation
and configuration of the supporting software. Manual processes are
impractical in a massively scalable OpenStack design architecture.
* The cloud operator requires that capital expenditure (CapEx) is minimized at
all layers of the stack. Operators of massively scalable OpenStack clouds
require the use of dependable commodity hardware and freely available open
source software components to reduce deployment costs and operational
expenses. Initiatives like OpenCompute (more information available at
`Open Compute Project <http://www.opencompute.org>`_)
provide additional information and pointers. To
cut costs, many operators sacrifice redundancy, forgoing, for example,
redundant power supplies, network connections, and rack switches.
* Companies operating a massively scalable OpenStack cloud also require that
operational expenditures (OpEx) be minimized as much as possible. We
recommend using cloud-optimized hardware when managing operational overhead.
Some of the factors to consider include power, cooling, and the physical
design of the chassis. Through customization, it is possible to optimize the
hardware and systems for this type of workload because of the scale of these
implementations.
* Massively scalable OpenStack clouds require extensive metering and
monitoring functionality to maximize the operational efficiency by keeping
the operator informed about the status and state of the infrastructure. This
includes full scale metering of the hardware and software status. A
corresponding framework of logging and alerting is also required to store
and enable operations to act on the meters provided by the metering and
monitoring solutions. The cloud operator also needs a solution that uses the
data provided by the metering and monitoring solution to provide capacity
planning and capacity trending analysis.
* Invariably, massively scalable OpenStack clouds extend over several sites.
Therefore, the user-operator requirements for a multi-site OpenStack
architecture design are also applicable here. This includes various legal
requirements; other jurisdictional legal or compliance requirements; image
consistency-availability; storage replication and availability (both block
and file/object storage); and authentication, authorization, and auditing
(AAA). See :doc:`multi-site` for more details on requirements and
considerations for multi-site OpenStack clouds.
* The design architecture of a massively scalable OpenStack cloud must address
considerations around physical facilities such as space, floor weight, rack
height and type, environmental considerations, power usage and power usage
efficiency (PUE), and physical security.

View File

@ -1,57 +0,0 @@
==================
Massively scalable
==================
.. toctree::
:maxdepth: 2
massively-scalable-user-requirements.rst
massively-scalable-technical-considerations.rst
massively-scalable-operational-considerations.rst
A massively scalable architecture is a cloud implementation
that is either a very large deployment, such as a commercial
service provider might build, or one that has the capability
to support user requests for large amounts of cloud resources.
An example is an infrastructure in which requests to service
500 or more instances at a time are common. A massively scalable
infrastructure fulfills such a request without exhausting the
available cloud infrastructure resources. While the high capital
cost of implementing such a cloud architecture means that it
is currently in limited use, many organizations are planning for
massive scalability in the future.
A massively scalable OpenStack cloud design presents a unique
set of challenges and considerations. For the most part it is
similar to a general purpose cloud architecture, as it is built
to address a non-specific range of potential use cases or
functions. It is rare for particular workloads to determine
the design or configuration of massively scalable clouds. The
massively scalable cloud is most often built as a platform for
a variety of workloads. Because private organizations rarely
require or have the resources for them, massively scalable
OpenStack clouds are generally built as commercial, public
cloud offerings.
Services provided by a massively scalable OpenStack cloud
include:
* Virtual-machine disk image library
* Raw block storage
* File or object storage
* Firewall functionality
* Load balancing functionality
* Private (non-routable) and public (floating) IP addresses
* Virtualized network topologies
* Software bundles
* Virtual compute resources
Like a general purpose cloud, the instances deployed in a
massively scalable OpenStack cloud do not necessarily use
any specific aspect of the cloud offering (compute, network, or storage).
As the cloud grows in scale, the number of workloads can cause
stress on all the cloud components. This adds further stresses
to supporting infrastructure such as databases and message brokers.
The architecture design for such a cloud must account for these
performance pressures without negatively impacting user experience.

View File

@ -1,118 +0,0 @@
============
Architecture
============
:ref:`ms-openstack-architecture` illustrates a high level multi-site
OpenStack architecture. Each site is an OpenStack cloud, but it may be
necessary to run the sites on different versions. For example, if the
second site is intended to replace the first site, the two would run
different versions during the migration. Another common design is a
private OpenStack cloud with a replicated site that is used for high
availability or disaster recovery. The most important design decision
is configuring storage as a single shared pool or separate pools, depending
on user and technical requirements.
.. _ms-openstack-architecture:
.. figure:: figures/Multi-Site_shared_keystone_horizon_swift1.png
**Multi-site OpenStack architecture**
OpenStack services architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Identity service, which is used by all other OpenStack components
for authorization and the catalog of service endpoints, supports the
concept of regions. A region is a logical construct used to group
OpenStack services in close proximity to one another. The concept of
regions is flexible; it may contain OpenStack service endpoints located
within a distinct geographic region or regions. It may be smaller in
scope, where a region is a single rack within a data center, with
multiple regions existing in adjacent racks in the same data center.
The majority of OpenStack components are designed to run within the
context of a single region. The Compute service is designed to manage
compute resources within a region, with support for subdivisions of
compute resources by using availability zones and cells. The Networking
service can be used to manage network resources in the same broadcast
domain or collection of switches that are linked. The OpenStack Block
Storage service controls storage resources within a region with all
storage resources residing on the same storage network. Like the
OpenStack Compute service, the OpenStack Block Storage service also
supports the availability zone construct which can be used to subdivide
storage resources.
The OpenStack dashboard, OpenStack Identity, and OpenStack Object
Storage services are components that can each be deployed centrally in
order to serve multiple regions.
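As a minimal sketch, regions and per-region service endpoints are
registered in the Identity service; the region name and URL below are
hypothetical:

.. code-block:: console

   $ openstack region create RegionTwo
   $ openstack endpoint create --region RegionTwo \
     compute public http://compute.region2.example.com:8774/v2.1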
Storage
~~~~~~~
With multiple OpenStack regions, it is recommended to configure a single
OpenStack Object Storage service endpoint to deliver shared object
storage for all regions. The Object Storage service internally
replicates objects to multiple nodes, which can be used by applications
or workloads in
multiple regions. This simplifies high availability failover and
disaster recovery rollback.
In order to scale the Object Storage service to meet the workload of
multiple regions, multiple proxy workers are run and load-balanced,
storage nodes are installed in each region, and the entire Object
Storage service can be fronted by an HTTP caching layer. This is done so
client requests for objects can be served out of caches rather than
directly from the storage nodes themselves, reducing the actual load
on the storage network. In addition to an HTTP caching layer, use a
caching layer like Memcache to cache objects between the proxy and
storage nodes.
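For example, the swift proxy's ``[filter:cache]`` middleware can be
pointed at a shared memcached pool; the addresses below are
hypothetical, and the proxy service name varies by distribution:

.. code-block:: console

   $ crudini --set /etc/swift/proxy-server.conf filter:cache \
     memcache_servers 10.0.0.11:11211,10.0.0.12:11211
   $ systemctl restart openstack-swift-proxy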
If the cloud is designed with a separate Object Storage service endpoint
made available in each region, applications are required to handle
synchronization (if desired) and other management operations to ensure
consistency across the nodes. For some applications, having multiple
Object Storage Service endpoints located in the same region as the
application may be desirable due to reduced latency, cross region
bandwidth, and ease of deployment.
.. note::
For the Block Storage service, the most important decisions are the
selection of the storage technology, and whether a dedicated network
is used to carry storage traffic from the storage service to the
compute nodes.
Networking
~~~~~~~~~~
When connecting multiple regions together, there are several design
considerations. The overlay network technology choice determines how
packets are transmitted between regions and how the logical network and
addresses are presented to the application. If there are security or
regulatory requirements, encryption should be implemented to secure the
traffic between regions. For networking inside a region, the overlay
network technology for project networks is equally important. The overlay
technology and the network traffic that an application generates or
receives can be either complementary or serve cross purposes. For
example, using an overlay technology for an application that transmits a
large amount of small packets could add excessive latency or overhead to
each packet if not configured properly.
Dependencies
~~~~~~~~~~~~
The architecture for a multi-site OpenStack installation is dependent on
a number of factors. One major dependency to consider is storage. When
designing the storage system, the storage mechanism needs to be
determined. Once the storage type is determined, how it is accessed is
critical. For example, we recommend that storage should use a dedicated
network. Another concern is how the storage is configured to protect the
data, for example, the Recovery Point Objective (RPO) and the Recovery
Time Objective (RTO). How quickly recovery from a fault must be
completed determines how often the data must be replicated.
Ensure that enough storage is allocated to support the data protection
strategy.
Networking decisions include the encapsulation mechanism that can be
used for the project networks, how large the broadcast domains should be,
and the contracted SLAs for the interconnects.

View File

@ -1,156 +0,0 @@
==========================
Operational considerations
==========================
Multi-site OpenStack cloud deployment using regions requires that the
service catalog contains per-region entries for each service deployed
other than the Identity service. Most off-the-shelf OpenStack deployment
tools have limited support for defining multiple regions in this
fashion.
Deployers should be aware of this and provide the appropriate
customization of the service catalog for their site either manually, or
by customizing deployment tools in use.
.. note::
As of the Kilo release, documentation for implementing this feature
is in progress. See this bug for more information:
https://bugs.launchpad.net/openstack-manuals/+bug/1340509.
Licensing
~~~~~~~~~
Multi-site OpenStack deployments present additional licensing
considerations over and above regular OpenStack clouds, particularly
where site licenses are in use to provide cost efficient access to
software licenses. The licensing for host operating systems, guest
operating systems, OpenStack distributions (if applicable),
software-defined infrastructure including network controllers and
storage systems, and even individual applications need to be evaluated.
Topics to consider include:
* The definition of what constitutes a site in the relevant licenses,
as the term does not necessarily denote a geographic or otherwise
physically isolated location.
* Differentiations between "hot" (active) and "cold" (inactive) sites,
where significant savings may be made in situations where one site is
a cold standby for disaster recovery purposes only.
* Certain locations might require local vendors to provide support and
services for each site which may vary with the licensing agreement in
place.
Logging and monitoring
~~~~~~~~~~~~~~~~~~~~~~
Logging and monitoring do not significantly differ for a multi-site
OpenStack cloud. The tools described in the `Logging and monitoring
chapter <https://docs.openstack.org/ops-guide/ops-logging-monitoring.html>`__
of the OpenStack Operations Guide remain applicable. Logging and monitoring
can be provided on a per-site basis, and in a common centralized location.
When attempting to deploy logging and monitoring facilities to a
centralized location, care must be taken with the load placed on the
inter-site networking links.
Upgrades
~~~~~~~~
In multi-site OpenStack clouds deployed using regions, sites are
independent OpenStack installations which are linked together using
shared centralized services such as OpenStack Identity. At a high level
the recommended order of operations to upgrade an individual OpenStack
environment is (see the `Upgrades
chapter <https://docs.openstack.org/ops-guide/ops-upgrades.html>`__
of the OpenStack Operations Guide for details):
#. Upgrade the OpenStack Identity service (keystone).
#. Upgrade the OpenStack Image service (glance).
#. Upgrade OpenStack Compute (nova), including networking components.
#. Upgrade OpenStack Block Storage (cinder).
#. Upgrade the OpenStack dashboard (horizon).
The process for upgrading a multi-site environment is not significantly
different:
#. Upgrade the shared OpenStack Identity service (keystone) deployment.
#. Upgrade the OpenStack Image service (glance) at each site.
#. Upgrade OpenStack Compute (nova), including networking components, at
each site.
#. Upgrade OpenStack Block Storage (cinder) at each site.
#. Upgrade the OpenStack dashboard (horizon), at each site or in the
single central location if it is shared.
Compute upgrades within each site can also be performed in a rolling
fashion. Compute controller services (API, Scheduler, and Conductor) can
be upgraded prior to upgrading of individual compute nodes. This allows
operations staff to keep a site operational for users of Compute
services while performing an upgrade.
Quota management
~~~~~~~~~~~~~~~~
Quotas are used to set operational limits to prevent system capacities
from being exhausted without notification. They are currently enforced
at the project level rather than at the user level.
Quotas are defined on a per-region basis. Operators can define identical
quotas for projects in each region of the cloud to provide a consistent
experience, or even create a process for synchronizing allocated quotas
across regions. It is important to note that only the operational limits
imposed by the quotas will be aligned consumption of quotas by users
will not be reflected between regions.
For example, given a cloud with two regions, if the operator grants a
user a quota of 25 instances in each region then that user may launch a
total of 50 instances spread across both regions. They may not, however,
launch more than 25 instances in any single region.
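As a sketch, the per-region quota from this example could be set with
the client's ``--os-region-name`` option; the project name is
hypothetical:

.. code-block:: console

   $ openstack --os-region-name RegionOne quota set --instances 25 demo-project
   $ openstack --os-region-name RegionTwo quota set --instances 25 demo-project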
For more information on managing quotas refer to the `Managing projects
and users
chapter <https://docs.openstack.org/ops-guide/ops-projects-users.html>`__
of the OpenStack Operators Guide.
Policy management
~~~~~~~~~~~~~~~~~
OpenStack provides a default set of Role Based Access Control (RBAC)
policies, defined in a ``policy.json`` file, for each service. Operators
edit these files to customize the policies for their OpenStack
installation. If the application of consistent RBAC policies across
sites is a requirement, then it is necessary to ensure proper
synchronization of the ``policy.json`` files to all installations.
This must be done using system administration tools such as rsync as
functionality for synchronizing policies across regions is not currently
provided within OpenStack.
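A minimal sketch of such synchronization, assuming hypothetical site
hostnames and root SSH access between sites:

.. code-block:: console

   $ for site in site2.example.com site3.example.com; do
   >   rsync -avz /etc/nova/policy.json root@$site:/etc/nova/policy.json
   > done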
Documentation
~~~~~~~~~~~~~
Users must be able to leverage cloud infrastructure and provision new
resources in the environment. It is important that documentation is
accessible to users to ensure they are given sufficient information to
help them leverage the cloud. As an example, by default OpenStack
schedules instances on a compute node automatically. However, when
multiple regions are available, the end user needs to decide in which
region to schedule the new instance. The dashboard presents the user
with the first region in your configuration. The API and CLI tools do
not execute commands unless a valid region is specified. It is therefore
important to provide documentation to your users describing the region
layout as well as calling out that quotas are region-specific. If a user
reaches his or her quota in one region, OpenStack does not automatically
build new instances in another. Documenting specific examples helps
users understand how to operate the cloud, thereby reducing calls and
tickets filed with the help desk.

View File

@ -1,192 +0,0 @@
=====================
Prescriptive examples
=====================
There are multiple ways to build a multi-site OpenStack installation,
based on the needs of the intended workloads. Below are example
architectures based on different requirements. These examples are meant
as a reference, and not a hard and fast rule for deployments. Use the
previous sections of this chapter to assist in selecting specific
components and implementations based on specific needs.
A large content provider needs to deliver content to customers that are
geographically dispersed. The workload is very sensitive to latency and
needs a rapid response to end-users. After reviewing the user, technical
and operational considerations, it is determined beneficial to build a
number of regions local to the customer's edge. Rather than build a few
large, centralized data centers, the intent of the architecture is to
provide a pair of small data centers in locations that are closer to the
customer. In this use case, spreading applications out allows for
different horizontal scaling than a traditional compute workload scale.
The intent is to scale by creating more copies of the application in
closer proximity to the users that need it most, in order to ensure
faster response time to user requests. This provider deploys two
data centers at each of the four chosen regions. The implications of this
design center on the method of placing copies of resources in
each of the remote regions. Swift objects, Glance images, and block
storage need to be manually replicated into each region. This may be
beneficial for some systems, such as a content service, where
only some of the content needs to exist in some, but not all, regions. A
centralized Keystone is recommended to provide shared authentication and
easily manageable access to the API endpoints.
It is recommended that you install an automated DNS system such as
Designate. Application administrators need a way to manage the mapping
of which application copy exists in each region and how to reach it,
unless an external Dynamic DNS system is available. Designate assists by
making the process automatic and by populating the records in each
region's zone.
Telemetry for each region is also deployed, as each region may grow
differently or be used at a different rate. Ceilometer collects each
region's meters from each of the controllers and reports them back to a
central location. This is useful both to the end user and the
administrator of the OpenStack environment. The end user will find this
method useful, as it makes it possible to determine whether certain
locations are experiencing higher load than others, and to take
appropriate action.
Administrators also benefit by possibly being able to forecast growth
per region, rather than expanding the capacity of all regions
simultaneously, therefore maximizing the cost-effectiveness of the
multi-site design.
One of the key decisions of running this infrastructure is whether or
not to provide a redundancy model. Two types of redundancy and high
availability models can be implemented in this configuration. The first
type is the availability of central OpenStack components. Keystone can
be made highly available in three central data centers that host the
centralized OpenStack components. This prevents a loss of any one of the
regions causing an outage in service. It also has the added benefit of
being able to run a central storage repository as a primary cache for
distributing content to each of the regions.
The second redundancy type is the edge data center itself. A second data
center in each of the edge regional locations houses a second region near
the first region. This ensures that the application does not suffer
degraded performance in terms of latency and availability.
:ref:`ms-customer-edge` depicts the solution designed to have both a
centralized set of core data centers for OpenStack services and paired edge
data centers:
.. _ms-customer-edge:
.. figure:: figures/Multi-Site_Customer_Edge.png
**Multi-site architecture example**
Geo-redundant load balancing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A large-scale web application has been designed with cloud principles in
mind. The application is designed to provide service to an application
store on a 24/7 basis. The company has a typical two-tier architecture
with a web front end servicing the customer requests, and a NoSQL
database back end storing the information.
Lately, there have been several outages at a number of major public
cloud providers caused by applications running out of a single
geographical location. The design therefore should mitigate the chance
of a single site causing an outage for the business.
The solution would consist of the following OpenStack components:
* A firewall, switches and load balancers on the public facing network
connections.
* OpenStack controller services (Networking, dashboard, Block
Storage, and Compute) running locally in each of the three regions.
The Identity service, Orchestration service, Telemetry service, Image
service, and Object Storage service can be installed centrally, with
nodes in each of the regions providing a redundant OpenStack
controller plane throughout the globe.
* OpenStack compute nodes running the KVM hypervisor.
* OpenStack Object Storage for serving static objects such as images
can be used to ensure that all images are standardized across all the
regions, and replicated on a regular basis.
* A distributed DNS service available to all regions that allows for
dynamic update of DNS records of deployed instances.
* A geo-redundant load balancing service can be used to service the
requests from the customers based on their origin.
An autoscaling heat template can be used to deploy the application in
the three regions (see the sketch after this list). This template
includes:
* Web Servers, running Apache.
* Appropriate ``user_data`` to populate the central DNS servers upon
instance launch.
* Appropriate Telemetry alarms that maintain state of the application
and allow for handling of region or instance failure.
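As a sketch, the same template could be deployed to each region with
the client's ``--os-region-name`` option; the template file and
parameter names here are hypothetical:

.. code-block:: console

   $ for region in RegionOne RegionTwo RegionThree; do
   >   openstack --os-region-name $region stack create \
   >     --template webapp-autoscaling.yaml \
   >     --parameter region_name=$region webapp-$region
   > done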
Another autoscaling Heat template can be used to deploy a distributed
MongoDB shard over the three locations, with the option of storing
required data on a globally available swift container. According to the
usage and load on the database server, additional shards can be
provisioned according to the thresholds defined in Telemetry.
Two data centers would have been sufficient to meet the availability
requirements. However, three regions are selected here to avoid abnormal
load on a single region in the event of a failure.
Orchestration is used because of the built-in functionality of
autoscaling and auto healing in the event of increased load. Additional
configuration management tools, such as Puppet or Chef could also have
been used in this scenario, but were not chosen since Orchestration had
the appropriate built-in hooks into the OpenStack cloud, whereas the
other tools were external and not native to OpenStack. In addition,
external tools were not needed since this deployment scenario was
straightforward.
OpenStack Object Storage is used here to serve as a back end for the
Image service since it is the most suitable solution for a globally
distributed storage solution with its own replication mechanism.
Home-grown solutions, including replication handling, could also have
been used, but were not chosen because Object Storage is already an
integral part of the infrastructure and a proven solution.
An external load balancing service was used and not the LBaaS in
OpenStack because the solution in OpenStack is not redundant and does
not have any awareness of geo location.
.. _ms-geo-redundant:
.. figure:: figures/Multi-site_Geo_Redundant_LB.png
**Multi-site geo-redundant architecture**
Location-local service
~~~~~~~~~~~~~~~~~~~~~~
A common use for multi-site OpenStack deployment is creating a Content
Delivery Network. An application that uses a location-local architecture
requires low network latency and proximity to the user to provide an
optimal user experience and reduce the cost of bandwidth and transit.
The content resides on sites closer to the customer, instead of a
centralized content store that requires utilizing higher cost
cross-country links.
This architecture includes a geo-location component that places user
requests to the closest possible node. In this scenario, 100% redundancy
of content across every site is a goal rather than a requirement, with
the intent to maximize the amount of content available within a minimum
number of network hops for end users. Despite these differences, the
storage replication configuration has significant overlap with that of a
geo-redundant load balancing use case.
In :ref:`ms-shared-keystone`, the location-aware application utilizing
this multi-site OpenStack installation would launch web server or content
serving instances on the compute cluster in each site. Requests from clients
are first sent to a global services load balancer that determines the location
of the client, then routes the request to the closest OpenStack site where the
application completes the request.
.. _ms-shared-keystone:
.. figure:: figures/Multi-Site_shared_keystone1.png
**Multi-site shared keystone architecture**

View File

@ -1,164 +0,0 @@
========================
Technical considerations
========================
There are many technical considerations to take into account with regard
to designing a multi-site OpenStack implementation. An OpenStack cloud
can be designed in a variety of ways to handle individual application
needs. A multi-site deployment has additional challenges compared to
single site installations and therefore is a more complex solution.
When determining capacity options be sure to take into account not just
the technical issues, but also the economic or operational issues that
might arise from specific decisions.
Inter-site link capacity describes the capabilities of the connectivity
between the different OpenStack sites. This includes parameters such as
bandwidth, latency, whether or not a link is dedicated, and any business
policies applied to the connection. The capability and number of the
links between sites determine what kind of options are available for
deployment. For example, if two sites have a pair of high-bandwidth
links available between them, it may be wise to configure a separate
storage replication network between the two sites to support a single
Swift endpoint and a shared Object Storage capability between them. An
example of this technique, as well as a configuration walk-through, is
available at `Dedicated replication network
<https://docs.openstack.org/developer/swift/replication_network.html#dedicated-replication-network>`_.
Another option in this scenario is to build a dedicated set of project
private networks across the secondary link, using overlay networks with
a third party mapping the site overlays to each other.
The capacity requirements of the links between sites are driven by
application behavior. If the link latency is too high, certain
applications that use a large number of small packets, for example RPC
calls, may encounter issues communicating with each other or operating
properly. Additionally, OpenStack may encounter similar types of issues.
To mitigate this, Identity service call timeouts can be tuned to prevent
issues authenticating against a central Identity service.
Another network capacity consideration for a multi-site deployment is
the amount and performance of overlay networks available for project
networks. If using shared project networks across zones, it is imperative
that an external overlay manager or controller be used to map these
overlays together. It is necessary to ensure that the number of possible
IDs is identical between the zones.
.. note::
As of the Kilo release, OpenStack Networking was not capable of
managing tunnel IDs across installations. So if one site runs out of
IDs, but another does not, that project's network is unable to reach
the other site.
Capacity can take other forms as well. The ability for a region to grow
depends on scaling out the number of available compute nodes. This topic
is covered in greater detail in the section for compute-focused
deployments. However, it may be necessary to grow cells in an individual
region, depending on the size of your cluster and the ratio of virtual
machines per hypervisor.
A third form of capacity comes in the multi-region-capable components of
OpenStack. Centralized Object Storage is capable of serving objects
through a single namespace across multiple regions. Since this works by
accessing the object store through swift proxy, it is possible to
overload the proxies. There are two options available to mitigate this
issue:
* Deploy a large number of swift proxies. The drawback is that the
proxies are not load-balanced and a large file request could
continually hit the same proxy.
* Add a caching HTTP proxy and load balancer in front of the swift
proxies. Since swift objects are returned to the requester via HTTP,
this load balancer would alleviate the load required on the swift
proxies.
Utilization
~~~~~~~~~~~
While constructing a multi-site OpenStack environment is the goal of
this guide, the real test is whether an application can utilize it.
The Identity service is normally the first interface for OpenStack users
and is required for almost all major operations within OpenStack.
Therefore, it is important that you provide users with a single URL for
Identity service authentication, and document the configuration of
regions within the Identity service. Each of the sites defined in your
installation is considered to be a region in Identity nomenclature. This
is important for users, as they must specify the region name when
directing actions at an API endpoint or in the dashboard.
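For example, a user can inspect which region hosts which endpoints and
pin the client to one region; the region name below is hypothetical:

.. code-block:: console

   $ openstack endpoint list --interface public
   $ export OS_REGION_NAME=RegionTwo
   $ openstack server list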
Load balancing is another common issue with multi-site installations.
While it is still possible to run HAproxy instances with
Load-Balancer-as-a-Service, these are defined to a specific region. Some
applications can manage this using internal mechanisms. Other
applications may require the implementation of an external system,
including global services load balancers or anycast-advertised DNS.
Depending on the storage model chosen during site design, storage
replication and availability are also a concern for end-users. If an
application can support regions, then it is possible to keep the object
storage system separated by region. In this case, users who want to have
an object available to more than one region need to perform cross-site
replication. However, with a centralized swift proxy, the user may need
to benchmark the replication timing of the Object Storage back end.
Benchmarking allows the operational staff to provide users with an
understanding of the amount of time required for a stored or modified
object to become available to the entire environment.
Performance
~~~~~~~~~~~
Determining the performance of a multi-site installation involves
considerations that do not come into play in a single-site deployment.
Being a distributed deployment, performance in multi-site deployments
may be affected in certain situations.
Since multi-site systems can be geographically separated, there may be
greater latency or jitter when communicating across regions. This can
especially impact systems like the OpenStack Identity service when
making authentication attempts from regions that do not contain the
centralized Identity implementation. It can also affect applications
which rely on Remote Procedure Call (RPC) for normal operation. An
example of this can be seen in high performance computing workloads.
Storage availability can also be impacted by the architecture of a
multi-site deployment. A centralized Object Storage service requires
more time for an object to be available to instances locally in regions
where the object was not created. Some applications may need to be tuned
to account for this effect. Block Storage does not currently have a
method for replicating data across multiple regions, so applications
that depend on available block storage need to manually cope with this
limitation by creating duplicate block storage entries in each region.
OpenStack components
~~~~~~~~~~~~~~~~~~~~
Most OpenStack installations require a bare minimum set of pieces to
function. These include the OpenStack Identity (keystone) for
authentication, OpenStack Compute (nova) for compute, OpenStack Image
service (glance) for image storage, OpenStack Networking (neutron) for
networking, and potentially an object store in the form of OpenStack
Object Storage (swift). Deploying a multi-site installation also demands
extra components in order to coordinate between regions. A centralized
Identity service is necessary to provide the single authentication
point. A centralized dashboard is also recommended to provide a single
login point and a mapping to the API and CLI options available. A
centralized Object Storage service may also be used, but will require
the installation of the swift proxy service.
It may also be helpful to install a few extra options in order to
facilitate certain use cases. For example, installing Designate may
assist in automatically generating DNS domains for each region with an
automatically-populated zone full of resource records for each instance.
This facilitates using DNS as a mechanism for determining which region
will be selected for certain applications.
Another useful tool for managing a multi-site installation is
Orchestration (heat). The Orchestration service allows the use of
templates to define a set of instances to be launched together or for
scaling existing sets. It can also be used to set up matching or
differentiated groupings based on regions. For instance, if an
application requires an equally balanced number of nodes across sites,
the same heat template can be used to cover each site with small
alterations to only the region name.

View File

@ -1,168 +0,0 @@
=================
User requirements
=================
Workload characteristics
~~~~~~~~~~~~~~~~~~~~~~~~
An understanding of the expected workloads for a desired multi-site
environment and use case is an important factor in the decision-making
process. In this context, ``workload`` refers to the way the systems are
used. A workload could be a single application or a suite of
applications that work together. It could also be a duplicate set of
applications that need to run in multiple cloud environments. Often in a
multi-site deployment, the same workload will need to work identically
in more than one physical location.
This multi-site scenario likely includes one or more of the other
scenarios in this book with the additional requirement of having the
workloads in two or more locations. The following are some possible
scenarios:
For many use cases the proximity of the user to their workloads has a
direct influence on the performance of the application and therefore
should be taken into consideration in the design. Certain applications
require zero to minimal latency that can only be achieved by deploying
the cloud in multiple locations. These locations could be in different
data centers, cities, countries or geographical regions, depending on
the user requirement and location of the users.
Consistency of images and templates across different sites
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is essential that the deployment of instances is consistent across
the different sites and built into the infrastructure. If the OpenStack
Object Storage is used as a back end for the Image service, it is
possible to create repositories of consistent images across multiple
sites. Having central endpoints with multiple storage nodes allows
consistent centralized storage for every site.
Not using a centralized object store increases the operational overhead
of maintaining a consistent image library. This could include
development of a replication mechanism to handle the transport of images
and the changes to the images across multiple sites.
High availability
~~~~~~~~~~~~~~~~~
If high availability is a requirement to provide continuous
infrastructure operations, a basic requirement of high availability
should be defined.
The OpenStack management components need to have a basic and minimal
level of redundancy. The simplest example is that the loss of any single
site should have minimal impact on the availability of the OpenStack
services.
The `OpenStack High Availability
Guide <https://docs.openstack.org/ha-guide/>`_ contains more information
on how to provide redundancy for the OpenStack components.
Multiple network links should be deployed between sites to provide
redundancy for all components. This includes storage replication, which
should be isolated to a dedicated network or VLAN with the ability to
assign QoS to control the replication traffic or provide priority for
this traffic. Note that if the data store is highly changeable, the
network requirements could have a significant effect on the operational
cost of maintaining the sites.
The ability to maintain object availability in both sites has
significant implications on the object storage design and
implementation. It also has a significant impact on the WAN network
design between the sites.
Connecting more than two sites increases the challenges and adds more
complexity to the design considerations. Multi-site implementations
require planning to address the additional topology used for internal
and external connectivity. Some options include full mesh,
hub-and-spoke, spine-leaf, and 3D torus topologies.
If applications running in a cloud are not cloud-aware, there should be
clear measures and expectations to define what the infrastructure can
and cannot support. An example would be shared storage between sites. It
is possible; however, such a solution is not native to OpenStack and
requires a third-party hardware vendor to fulfill such a requirement.
Another example can be seen in applications that are able to consume
resources in object storage directly. These applications need to be
cloud aware to make good use of an OpenStack Object Store.
Application readiness
~~~~~~~~~~~~~~~~~~~~~
Some applications are tolerant of the lack of synchronized object
storage, while others may need those objects to be replicated and
available across regions. Understanding how the cloud implementation
impacts new and existing applications is important for risk mitigation,
and the overall success of a cloud project. Applications may have to be
written or rewritten for an infrastructure with little to no redundancy,
or with the cloud in mind.
Cost
~~~~
A greater number of sites increase cost and complexity for a multi-site
deployment. Costs can be broken down into the following categories:
* Compute resources
* Networking resources
* Replication
* Storage
* Management
* Operational costs
Site loss and recovery
~~~~~~~~~~~~~~~~~~~~~~
Outages can cause partial or full loss of site functionality. Strategies
should be implemented to understand and plan for recovery scenarios.
* The deployed applications need to continue to function and, more
importantly, you must consider the impact on the performance and
reliability of the application when a site is unavailable.
* It is important to understand what happens to the replication of
objects and data between the sites when a site goes down. If this
causes queues to start building up, consider how long these queues
can safely exist until an error occurs.
* After an outage, ensure the method for resuming proper operations of
a site is implemented when it comes back online. We recommend you
architect the recovery to avoid race conditions.
Compliance and geo-location
~~~~~~~~~~~~~~~~~~~~~~~~~~~
An organization may have certain legal obligations and regulatory
compliance measures which could require certain workloads or data to not
be located in certain regions.
Auditing
~~~~~~~~
A well thought-out auditing strategy is important in order to be able to
quickly track down issues. Keeping track of changes made to security
groups and project changes can be useful in rolling back the changes if
they affect production. For example, if all security group rules for a
project disappeared, the ability to quickly track down the issue would be
important for operational and legal reasons.
Separation of duties
~~~~~~~~~~~~~~~~~~~~
A common requirement is to define different roles for the different
cloud administration functions. An example would be a requirement to
segregate the duties and permissions by site.
Authentication between sites
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is recommended to have a single authentication domain rather than a
separate implementation for each and every site. This requires an
authentication mechanism that is highly available and distributed to
ensure continuous operation. Authentication server locality might be
required and should be planned for.

View File

@ -1,26 +0,0 @@
==========
Multi-site
==========
.. toctree::
:maxdepth: 2
multi-site-user-requirements.rst
multi-site-technical-considerations.rst
multi-site-operational-considerations.rst
multi-site-architecture.rst
multi-site-prescriptive-examples.rst
OpenStack is capable of running in a multi-region configuration. This
enables some parts of OpenStack to effectively manage a group of sites
as a single cloud.
Some use cases that might indicate a need for a multi-site deployment of
OpenStack include:
* An organization with a diverse geographic footprint.
* Geo-location sensitive data.
* Data locality, in which specific data or functionality should be
close to users.

View File

@ -1,184 +0,0 @@
Architecture
~~~~~~~~~~~~
Network-focused OpenStack architectures have many similarities to other
OpenStack architecture use cases. There are several factors to consider
when designing for a network-centric or network-heavy application
environment.
Networks exist to serve as a medium of transporting data between
systems. It is inevitable that an OpenStack design has
inter-dependencies with non-network portions of OpenStack as well as on
external systems. Depending on the specific workload, there may be major
interactions with storage systems both within and external to the
OpenStack environment. For example, in the case of a content delivery
network, there is a twofold interaction with storage. Traffic flows to and
from the storage array for ingesting and serving content in a
north-south direction. In addition, there is replication traffic flowing
in an east-west direction.
Compute-heavy workloads may also induce interactions with the network.
Some high performance compute applications require network-based memory
mapping and data sharing and, as a result, induce a higher network load
when they transfer results and data sets. Others may be highly
transactional and issue transaction locks, perform their functions, and
revoke transaction locks at high rates. This also has an impact on the
network performance.
Some network dependencies are external to OpenStack. While OpenStack
Networking is capable of providing network ports, IP addresses, some
level of routing, and overlay networks, there are some other functions
that it cannot provide. For many of these, you may require external
systems or equipment to fill in the functional gaps. Hardware load
balancers are an example of equipment that may be necessary to
distribute workloads or offload certain functions. OpenStack Networking
provides a tunneling feature, however it is constrained to a
Networking-managed region. If the need arises to extend a tunnel beyond
the OpenStack region to either another region or an external system,
implement the tunnel itself outside OpenStack or use a tunnel management
system to map the tunnel or overlay to an external tunnel.
Depending on the selected design, Networking itself might not support
the required :term:`layer-3 network<Layer-3 network>` functionality. If
you choose to use the provider networking mode without running the layer-3
agent, you must install an external router to provide layer-3 connectivity
to outside systems.
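A minimal sketch of such a provider network, assuming a hypothetical
physical network name and addressing, with the external router serving
as the subnet gateway:

.. code-block:: console

   $ openstack network create --external --provider-network-type flat \
     --provider-physical-network physnet1 provider-net
   $ openstack subnet create --network provider-net \
     --subnet-range 203.0.113.0/24 --gateway 203.0.113.1 provider-subnet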
Interaction with orchestration services is inevitable in larger-scale
deployments. The Orchestration service is capable of allocating network
resources defined in templates to map to project networks and for port
creation, as well as allocating floating IPs. If there is a requirement
to define and manage network resources when using orchestration, we
recommend that the design include the Orchestration service to meet the
demands of users.
Design impacts
--------------
A wide variety of factors can affect a network-focused OpenStack
architecture. While there are some considerations shared with a general
use case, specific workloads related to network requirements influence
network design decisions.
One decision is whether or not to use Network Address Translation (NAT)
and where to implement it. If there is a requirement for floating IPs
instead of public fixed addresses, then you must use NAT. An example is
a DHCP relay that must know the IP of the DHCP server. In these cases,
it is easier to automate the infrastructure to apply the target IP to a
new instance rather than to reconfigure legacy or external systems for
each new instance.
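As a sketch of NAT via floating IPs, the following allocates an address
from an external network and attaches it to an instance; the network,
server, and address names are hypothetical:

.. code-block:: console

   $ openstack floating ip create public
   $ openstack server add floating ip web01 203.0.113.25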
NAT for floating IPs managed by Networking resides within the hypervisor
but there are also versions of NAT that may be running elsewhere. If
there is a shortage of IPv4 addresses there are two common methods to
mitigate this externally to OpenStack. The first is to run a load
balancer either within OpenStack as an instance, or use an external load
balancing solution. In the internal scenario, Networking's
Load-Balancer-as-a-Service (LBaaS) can manage load balancing software,
for example HAproxy. This is specifically to manage the Virtual IP (VIP)
while a dual-homed connection from the HAproxy instance connects the
public network with the project private network that hosts all of the
content servers. In the external scenario, a load balancer needs to
serve the VIP and also connect to the project overlay network through
external means or through private addresses.
Another kind of NAT that may be useful is protocol NAT. In some cases it
may be desirable to use only IPv6 addresses on instances and operate
either an instance or an external service to provide a NAT-based
transition technology such as NAT64 and DNS64. This provides the ability
to have a globally routable IPv6 address while only consuming IPv4
addresses as necessary or in a shared manner.
Application workloads affect the design of the underlying network
architecture. If a workload requires network-level redundancy, the
routing and switching architecture have to accommodate this. There are
differing methods for providing this that are dependent on the selected
network hardware, the performance of the hardware, and which networking
model you deploy. Examples include Link aggregation (LAG) and Hot
Standby Router Protocol (HSRP). Also consider whether to deploy
OpenStack Networking or legacy networking (nova-network), and which
plug-in to select for OpenStack Networking. If using an external system,
configure Networking to run :term:`layer-2<Layer-2 network>` with a provider
network configuration. For example, implement HSRP to terminate layer-3
connectivity.
Depending on the workload, overlay networks may not be the best
solution. Where application network connections are small, short lived,
or bursty, running a dynamic overlay can generate as much bandwidth as
the packets it carries. It also can induce enough latency to cause
issues with certain applications. There is an impact on the device
generating the overlay which, in most installations, is the hypervisor.
This degrades packet-per-second and connection-per-second rates.
Overlays also come with a secondary option that may not be appropriate
to a specific workload. While all of them operate in full mesh by
default, there might be good reasons to disable this function because it
may cause excessive overhead for some workloads. Conversely, other
workloads operate without issue. For example, most web services
applications do not have major issues with a full mesh overlay network,
while some network monitoring tools or storage replication workloads
have performance issues with throughput or excessive broadcast traffic.
Many people overlook an important design decision: The choice of layer-3
protocols. While OpenStack was initially built with only IPv4 support,
Networking now supports IPv6 and dual-stacked networks. Some workloads
are possible through the use of IPv6 and IPv6 to IPv4 reverse transition
mechanisms such as NAT64 and DNS64 or :term:`6to4`. This alters the
requirements for any address plan as single-stacked and transitional IPv6
deployments can alleviate the need for IPv4 addresses.
OpenStack has limited support for dynamic routing, however there are a
number of options available by incorporating third party solutions to
implement routing within the cloud including network equipment, hardware
nodes, and instances. Some workloads perform well with nothing more than
static routes and default gateways configured at the layer-3 termination
point. In most cases this is sufficient, however some cases require the
addition of at least one type of dynamic routing protocol if not
multiple protocols. Having a form of interior gateway protocol (IGP)
available to the instances inside an OpenStack installation opens up the
possibility of use cases for anycast route injection for services that
need to use it as a geographic location or failover mechanism. Other
applications may wish to directly participate in a routing protocol,
either as a passive observer, as in the case of a looking glass, or as
an active participant in the form of a route reflector. Since an
instance might have a large amount of compute and memory resources, it
is trivial to hold an entire unpartitioned routing table and use it to
provide services such as network path visibility to other applications
or as a monitoring tool.
Path maximum transmission unit (MTU) failures are lesser known but
harder to diagnose. The MTU must be large enough to handle normal
traffic, overhead from an overlay network, and the desired layer-3
protocol. Adding externally built tunnels reduces the MTU packet size.
In this case, you must pay attention to the fully calculated MTU size
because some systems ignore or drop path MTU discovery packets.
Tunable networking components
-----------------------------
When designing for network intensive workloads, consider configurable
networking components related to an OpenStack architecture design, such
as MTU and QoS. Some workloads require a larger MTU than normal due
to the transfer of large blocks of data. When providing network service
for applications such as video streaming or storage replication, we
recommend that you configure both OpenStack hardware nodes and the
supporting network equipment for jumbo frames where possible. This
allows for better use of available bandwidth. Configure jumbo frames
across the complete path the packets traverse. If one network component
is not capable of handling jumbo frames then the entire path reverts to
the default MTU.
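A sketch of a jumbo frame configuration, assuming the physical fabric
carries 9000-byte frames; overlay protocols such as VXLAN or GRE consume
part of that headroom, so project networks receive a smaller MTU:

.. code-block:: console

   $ crudini --set /etc/neutron/neutron.conf DEFAULT global_physnet_mtu 9000
   $ openstack network create --mtu 8950 replication-net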
:term:`Quality of Service (QoS)` also has a great impact on network
intensive workloads, as it expedites the delivery of higher-priority
packets that are sensitive to poor network performance. In applications
such as Voice over IP (VoIP), differentiated services code points are a
near requirement for proper operation. You can also use QoS in the
opposite direction for mixed workloads to prevent low priority but high
bandwidth applications, for example backup services, video conferencing,
or file sharing, from blocking bandwidth that is needed for the proper
operation of other workloads. It is possible to tag file storage traffic
as a lower class, such as best effort or scavenger, to allow the higher
priority traffic through. In cases where regions within a cloud might be
geographically distributed it may also be necessary to plan accordingly
to implement WAN optimization to combat latency or packet loss.
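As a sketch, where the Networking QoS extension is deployed, a DSCP mark
such as Expedited Forwarding (46) can be applied to a hypothetical VoIP
port:

.. code-block:: console

   $ openstack network qos policy create voip-priority
   $ openstack network qos rule create --type dscp-marking --dscp-mark 46 voip-priority
   $ openstack port set --qos-policy voip-priority voip-port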

View File

@ -1,64 +0,0 @@
Operational considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
Network-focused OpenStack clouds have a number of operational
considerations that influence the selected design, including:
* Dynamic routing or static routes
* Service level agreements (SLAs)
* Ownership of user management
An initial network consideration is the selection of a telecom company
or transit provider.
Make additional design decisions about monitoring and alarming. This can
be an internal responsibility or the responsibility of the external
provider. In the case of using an external provider, service level
agreements (SLAs) likely apply. In addition, other operational
considerations such as bandwidth, latency, and jitter can be part of an
SLA.
Consider the ability to upgrade the infrastructure. As demand for
network resources increases, operators add IP address blocks and
additional bandwidth capacity. In addition, consider managing
hardware and software lifecycle events, for example upgrades,
decommissioning, and outages, while avoiding service interruptions for
projects.
Factor maintainability into the overall network design. This includes
the ability to manage and maintain IP addresses as well as the use of
overlay identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS
tags. As an example, if you need to change all of the IP addresses
on a network, a process known as renumbering, then the design must
support this function.
Address network-focused applications when considering certain
operational realities. For example, consider the impending exhaustion of
IPv4 addresses, the migration to IPv6, and the use of private networks
to segregate different types of traffic that an application receives or
generates. In the case of IPv4 to IPv6 migrations, applications should
follow best practices for storing IP addresses. We recommend you avoid
relying on IPv4 features that did not carry over to the IPv6 protocol or
have differences in implementation.
To segregate traffic, allow applications to create a private project
network for database and storage network traffic. Use a public network
for services that require direct client access from the internet. Upon
segregating the traffic, consider :term:`quality of service (QoS)` and
security to ensure each network has the required level of service.
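A minimal sketch, again assuming the Networking QoS extension is
available, that caps a hypothetical backup network so higher priority
traffic retains headroom:

.. code-block:: console

   $ openstack network qos policy create backup-limit
   $ openstack network qos rule create --type bandwidth-limit \
     --max-kbps 500000 --max-burst-kbits 50000 backup-limit
   $ openstack network set --qos-policy backup-limit backup-net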
Finally, consider the routing of network traffic. For some applications,
develop a complex policy framework for routing. To create a routing
policy that satisfies business requirements, consider the economic cost
of transmitting traffic over expensive links versus cheaper links, in
addition to bandwidth, latency, and jitter requirements.
Additionally, consider how to respond to network events. As an example,
how load transfers from one link to another during a failure scenario
could be a factor in the design. If you do not plan network capacity
correctly, failover traffic could overwhelm other ports or network links
and create a cascading failure scenario. In this case, traffic that
fails over to one link overwhelms that link and then moves to the
subsequent links until all network traffic stops.

View File

@ -1,165 +0,0 @@
Prescriptive examples
~~~~~~~~~~~~~~~~~~~~~
An organization designs a large-scale web application with cloud
principles in mind. The application scales horizontally in a bursting
fashion and generates a high instance count. The application requires an
SSL connection to secure data and must not lose connection state to
individual servers.
The figure below depicts an example design for this workload. In this
example, a hardware load balancer provides SSL offload functionality and
connects to project networks in order to reduce address consumption. This
load balancer links to the routing architecture as it services the VIP
for the application. The router and load balancer use the GRE tunnel ID
of the application's project network and an IP address within the project
subnet but outside of the address pool. This is to ensure that the load
balancer can communicate with the application's HTTP servers without
requiring the consumption of a public IP address.
Because sessions persist until closed, the routing and switching
architecture provides high availability. Switches mesh to each
hypervisor and each other, and also provide an MLAG implementation to
ensure that layer-2 connectivity does not fail. Routers use VRRP and
fully mesh with switches to ensure layer-3 connectivity. Since GRE
provides an overlay network, Networking is present and uses the Open
vSwitch agent in GRE tunnel mode. This ensures all devices can reach all
other devices and that you can create project networks for private
addressing of links to the load balancer.
.. figure:: figures/Network_Web_Services1.png
A web service architecture has many options and optional components. Due
to this, it can fit into a large number of other OpenStack designs. A
few key components, however, need to be in place to handle the nature of
most web-scale workloads. You require the following components:
* OpenStack Controller services (Image, Identity, Networking and
supporting services such as MariaDB and RabbitMQ)
* OpenStack Compute running KVM hypervisor
* OpenStack Object Storage
* Orchestration service
* Telemetry service
Beyond the normal Identity, Compute, Image service, and Object Storage
components, we recommend the Orchestration service component to handle
the proper scaling of workloads to adjust to demand. Due to the
requirement for auto-scaling, the design includes the Telemetry service.
Web services tend to be bursty in load, have very defined peak and
valley usage patterns and, as a result, benefit from automatic scaling
of instances based upon traffic. At a network level, a split network
configuration works well with databases residing on private project
networks since these do not emit a large quantity of broadcast traffic
and may need to interconnect to some databases for content.
Load balancing
--------------
Load balancing spreads requests across multiple instances. This workload
scales well horizontally across large numbers of instances. This enables
instances to run without publicly routed IP addresses and instead to
rely on the load balancer to provide a globally reachable service. Many
of these services do not require direct server return. This aids in
address planning and utilization at scale since only the virtual IP
(VIP) must be public.
Overlay networks
----------------
The overlay functionality design includes OpenStack Networking in Open
vSwitch GRE tunnel mode. In this case, the layer-3 external routers pair
with VRRP, and switches pair with an implementation of MLAG to ensure
that you do not lose connectivity with the upstream routing
infrastructure.
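As a hedged illustration of this mode, the Open vSwitch agent settings
that enable GRE tunneling typically resemble the following; the file
path and the endpoint address are placeholders that vary by
distribution and release:

.. code-block:: ini

   # Illustrative Open vSwitch agent excerpt, for example
   # /etc/neutron/plugins/ml2/openvswitch_agent.ini
   [ovs]
   local_ip = 192.0.2.10   # this hypervisor's tunnel endpoint address

   [agent]
   tunnel_types = gre      # build GRE tunnels between hypervisors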
Performance tuning
------------------
Network level tuning for this workload is minimal. :term:`Quality of Service
(QoS)` applies a middle-ground Class Selector to these workloads,
depending on existing policies: higher than a best effort queue
but lower than an Expedited Forwarding or Assured Forwarding queue.
Since this type of application generates larger packets with
longer-lived connections, you can optimize bandwidth utilization for
long duration TCP. Normal bandwidth planning applies here with regards
to benchmarking a session's usage multiplied by the expected number of
concurrent sessions with overhead.
Network functions
-----------------
Network functions is a broad category that encompasses workloads
supporting the rest of a system's network. These workloads tend to
consist of large amounts of small packets that are very short lived,
such as DNS queries or SNMP traps. These messages need to arrive
quickly and, because there can be a very large volume of them, do not
cope well with packet loss. There are a few extra considerations to
take into account for this type of workload, and these can change a
configuration all the way down to the hypervisor level. For example, an
application that generates 10 TCP sessions per user, with an average
aggregate bandwidth of 512 kilobits per second per user, and an
expected count of ten thousand concurrent users yields an expected
bandwidth plan of approximately 4.88 gigabits per second.
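A quick sketch of that arithmetic, assuming the per-user figure above
and binary unit conversion (which is what produces the 4.88 number):

.. code-block:: python

   # Rough bandwidth plan: concurrent users x average per-user bandwidth.
   # The 512 kbit/s per-user figure is the assumption stated above.
   users = 10000             # expected concurrent users
   kbit_per_user = 512       # average aggregate bandwidth per user (kbit/s)

   total_kbit = users * kbit_per_user
   gbit = total_kbit / 1024.0 / 1024.0   # kbit -> Mbit -> Gbit (binary)
   print("%.2f Gbit/s" % gbit)           # ~4.88 Gbit/s, before overhead

Add protocol and failover overhead on top of this baseline before
sizing links.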
The supporting network for this type of configuration needs to have a
low latency and evenly distributed availability. This workload benefits
from having services local to the consumers of the service. Use a
multi-site approach as well as deploying many copies of the application
to handle load as close as possible to consumers. Since these
applications function independently, they do not warrant running
overlays to interconnect project networks. Overlays also have the
drawback of performing poorly with rapid flow setup, and they may incur
too much overhead with large quantities of small packets; therefore, we
do not recommend them.
QoS is desirable for some workloads to ensure delivery. DNS has a major
impact on the load times of other services and needs to be reliable and
provide rapid responses. Configure rules in upstream devices to apply a
higher Class Selector to DNS to ensure faster delivery or a better spot
in queuing algorithms.
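The guidance above targets upstream devices, but if a deployment
enables the OpenStack Networking QoS extension with DSCP marking
support, a similar policy can be sketched in the cloud itself; the
policy name and the port identifier are illustrative:

.. code-block:: console

   $ openstack network qos policy create dns-priority
   $ openstack network qos rule create --type dscp-marking \
     --dscp-mark 26 dns-priority
   $ openstack port set --qos-policy dns-priority DNS_SERVER_PORT_UUID

DSCP value 26 (AF31) is one example of a mid-range marking; choose a
value consistent with your existing queuing policy.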
Cloud storage
-------------
Another common use case for OpenStack environments is providing a
cloud-based file storage and sharing service. You might consider this a
storage-focused use case, but its network-side requirements make it a
network-focused use case.
For example, consider a cloud backup application. This workload has two
specific behaviors that impact the network. Because this workload is an
externally-facing service and an internally-replicating application, it
has both :term:`north-south<north-south traffic>` and
:term:`east-west<east-west traffic>` traffic considerations:
north-south traffic
When a user uploads and stores content, that content moves into the
OpenStack installation. When users download this content, the
content moves out from the OpenStack installation. Because this
service operates primarily as a backup, most of the traffic moves
southbound into the environment. In this situation, it benefits you
to configure a network to be asymmetrically downstream because the
traffic that enters the OpenStack installation is greater than the
traffic that leaves the installation.
east-west traffic
Likely to be fully symmetric. Because replication originates from
any node and might target multiple other nodes algorithmically, it
is less likely for this traffic to have a larger volume in any
specific direction. However this traffic might interfere with
north-south traffic.
.. figure:: figures/Network_Cloud_Storage2.png
This application prioritizes the north-south traffic over east-west
traffic: the north-south traffic involves customer-facing data.
The network design in this case is less dependent on availability and
more dependent on being able to handle high bandwidth. As a direct
result, it is beneficial to forgo redundant links in favor of bonding
those connections. This increases available bandwidth. It is also
beneficial to configure all devices in the path, including OpenStack, to
generate and pass jumbo frames.
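On the OpenStack side, a hedged sketch of the relevant Networking
setting follows; the value must match what every physical switch and
router in the path supports:

.. code-block:: ini

   # Illustrative excerpt from /etc/neutron/neutron.conf
   [DEFAULT]
   # Maximum MTU of the underlying physical network; networks created
   # on top of it derive their MTU from this value.
   global_physnet_mtu = 9000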

@ -1,367 +0,0 @@
Technical considerations
~~~~~~~~~~~~~~~~~~~~~~~~
When you design an OpenStack network architecture, you must consider
layer-2 and layer-3 issues. Layer-2 decisions involve those made at the
data-link layer, such as the decision to use Ethernet versus Token Ring.
Layer-3 decisions involve those made about the protocol layer and the
point when IP comes into the picture. As an example, a completely
internal OpenStack network can exist at layer 2 and ignore layer 3. In
order for any traffic to go outside of that cloud, to another network,
or to the Internet, however, you must use a layer-3 router or switch.
The past few years have seen two competing trends in networking. One
trend leans towards building data center network architectures based on
layer-2 networking. Another trend treats the cloud environment
essentially as a miniature version of the Internet. This approach
implies a network architecture radically different from that of a
traditional data center: the Internet uses only layer-3 routing rather
than layer-2 switching.
A network designed on layer-2 protocols has advantages over one designed
on layer-3 protocols. In spite of the difficulties of using a bridge to
perform the network role of a router, many vendors, customers, and
service providers choose to use Ethernet in as many parts of their
networks as possible. The benefits of selecting a layer-2 design are:
* Ethernet frames contain all the essentials for networking. These
include, but are not limited to, globally unique source addresses,
globally unique destination addresses, and error control.
* Ethernet frames can carry any kind of packet. Networking at layer-2
is independent of the layer-3 protocol.
* Adding more layers to the Ethernet frame only slows the networking
process down. This is known as 'nodal processing delay'.
* You can add adjunct networking features, for example class of service
(CoS) or multicasting, to Ethernet as readily as IP networks.
* VLANs are an easy mechanism for isolating networks.
Most information starts and ends inside Ethernet frames. Today this
applies to data, voice (for example, VoIP), and video (for example, web
cameras). The concept is that if you can perform more of the end-to-end
transfer of information from a source to a destination in the form of
Ethernet frames, the network benefits more from the advantages of
Ethernet. Although it is not a substitute for IP networking, networking
at layer-2 can be a powerful adjunct to IP networking.
Layer-2 Ethernet usage has these advantages over layer-3 IP network
usage:
* Speed
* Reduced overhead of the IP hierarchy.
* No need to keep track of address configuration as systems move
  around.

Whereas the simplicity of layer-2 protocols might work well in a data
center with hundreds of physical machines, cloud data centers have the
additional burden of needing to keep track of all virtual machine
addresses and networks. In these data centers, it is not uncommon for
one physical node to support 30-40 instances.
.. important::
Networking at the frame level says nothing about the presence or
absence of IP addresses at the packet level. Almost all ports,
links, and devices on a network of LAN switches still have IP
addresses, as do all the source and destination hosts. There are
many reasons for the continued need for IP addressing. The largest
one is the need to manage the network. A device or link without an
IP address is usually invisible to most management applications.
Utilities including remote access for diagnostics, file transfer of
configurations and software, and similar applications cannot run
without IP addresses as well as MAC addresses.
Layer-2 architecture limitations
--------------------------------
Outside of the traditional data center the limitations of layer-2
network architectures become more obvious.
* Number of VLANs is limited to 4096.
* The number of MACs stored in switch tables is limited.
* You must accommodate the need to maintain a set of layer-4 devices to
handle traffic control.
* MLAG, often used for switch redundancy, is a proprietary solution
that does not scale beyond two devices and forces vendor lock-in.
* It can be difficult to troubleshoot a network without IP addresses
and ICMP.
* Configuring :term:`ARP<Address Resolution Protocol (ARP)>` can be
complicated on large layer-2 networks.
* All network devices need to be aware of all MACs, even instance MACs,
so there is constant churn in MAC tables and network state changes as
instances start and stop.
* Migrating MACs (instance migration) to different physical locations
  is a potential problem if you do not set ARP table timeouts
  properly.
It is important to know that layer 2 has a very limited set of network
management tools. Controlling traffic is difficult because layer 2 has
no mechanisms to manage the network or shape the traffic, and network
troubleshooting is difficult as well. One reason for this difficulty is
that network devices have no IP addresses. As a result, there is no
reasonable way to check network delay in a layer-2 network.
On large layer-2 networks, configuring ARP learning can also be
complicated. The setting for the MAC address timer on switches is
critical and, if set incorrectly, can cause significant performance
problems. As an example, the Cisco default MAC address timer is
extremely long. Migrating MACs to different physical locations to
support instance migration can be a significant problem. In this case,
the network information maintained in the switches could be out of sync
with the new location of the instance.
In a layer-2 network, all devices are aware of all MACs, even those that
belong to instances. The network state information in the backbone
changes whenever an instance starts or stops. As a result there is far
too much churn in the MAC tables on the backbone switches.
Layer-3 architecture advantages
-------------------------------
In the layer-3 case, there is no churn in the routing tables due to
instances starting and stopping. The only time there would be a routing
state change is in the case of a Top of Rack (ToR) switch failure or a
link failure in the backbone itself. Other advantages of using a layer-3
architecture include:
* Layer-3 networks provide the same level of resiliency and scalability
as the Internet.
* Controlling traffic with routing metrics is straightforward.
* You can configure layer 3 to use :term:`BGP<Border Gateway Protocol (BGP)>`
confederation for scalability so core routers have state proportional to the
number of racks, not to the number of servers or instances.
* Routing takes instance MAC and IP addresses out of the network core,
reducing state churn. Routing state changes only occur in the case of
a ToR switch failure or backbone link failure.
* There are a variety of well tested tools, for example ICMP, to
monitor and manage traffic.
* Layer-3 architectures enable the use of :term:`quality of service (QoS)` to
manage network performance.
Layer-3 architecture limitations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The main limitation of layer 3 is that there is no built-in isolation
mechanism comparable to the VLANs in layer-2 networks. Furthermore, the
hierarchical nature of IP addresses means that an instance is on the
same subnet as its physical host. This means that you cannot migrate it
outside of the subnet easily. For these reasons, network virtualization
needs to use IP :term:`encapsulation` and software at the end hosts for
isolation and the separation of the addressing in the virtual layer from
the addressing in the physical layer. Other potential disadvantages of
layer 3 include the need to design an IP addressing scheme rather than
relying on the switches to keep track of the MAC addresses automatically
and to configure the interior gateway routing protocol in the switches.
Network recommendations overview
--------------------------------
OpenStack has complex networking requirements for several reasons. Many
components interact at different levels of the system stack, which adds
complexity. Data flows are complex. Data in an OpenStack cloud moves
both between instances across the network (also known as East-West), as
well as in and out of the system (also known as North-South). Physical
server nodes have network requirements that are independent of instance
network requirements, which you must isolate from the core network to
account for scalability. We recommend functionally separating the
networks for security purposes and tuning performance through traffic
shaping.
You must consider a number of important general technical and business
factors when planning and designing an OpenStack network. They include:
* A requirement for vendor independence. To avoid hardware or software
vendor lock-in, the design should not rely on specific features of a
vendor's router or switch.
* A requirement to massively scale the ecosystem to support millions of
end users.
* A requirement to support indeterminate platforms and applications.
* A requirement to design for cost efficient operations to take
advantage of massive scale.
* A requirement to ensure that there is no single point of failure in
the cloud ecosystem.
* A requirement for high availability architecture to meet customer SLA
requirements.
* A requirement to be tolerant of rack level failure.
* A requirement to maximize flexibility to architect future production
environments.
Bearing in mind these considerations, we recommend the following:
* Layer-3 designs are preferable to layer-2 architectures.
* Design a dense multi-path network core to support multi-directional
scaling and flexibility.
* Use hierarchical addressing because it is the only viable option to
  scale the network ecosystem.
* Use virtual networking to isolate instance service network traffic
from the management and internal network traffic.
* Isolate virtual networks using encapsulation technologies.
* Use traffic shaping for performance tuning.
* Use eBGP to connect to the Internet up-link.
* Use iBGP to flatten the internal traffic on the layer-3 mesh.
* Determine the most effective configuration for block storage network.
Additional considerations
-------------------------
There are several further considerations when designing a
network-focused OpenStack cloud.
OpenStack Networking versus legacy networking (nova-network) considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Selecting the type of networking technology to implement depends on many
factors. OpenStack Networking (neutron) and legacy networking
(nova-network) both have their advantages and disadvantages. They are
both valid and supported options that fit different use cases:
.. list-table:: **OpenStack Networking versus legacy networking
   (nova-network) comparison**
:widths: 50 40
:header-rows: 1
* - Legacy networking (nova-network)
- OpenStack Networking
* - Simple, single agent
- Complex, multiple agents
* - More mature, established
- Newer, maturing
* - Flat or VLAN
- Flat, VLAN, Overlays, L2-L3, SDN
* - No plug-in support
- Plug-in support for 3rd parties
* - Scales well
- Scaling requires 3rd party plug-ins
* - No multi-tier topologies
- Multi-tier topologies
Redundant networking: ToR switch high availability risk analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A technical consideration of networking is the idea that you should
install switching gear in a data center with backup switches in case of
hardware failure.
Research indicates the mean time between failures (MTBF) on switches is
between 100,000 and 200,000 hours. This number is dependent on the
ambient temperature of the switch in the data center. When properly
cooled and maintained, this translates to between 11 and 22 years before
failure. Even in the worst case of poor ventilation and high ambient
temperatures in the data center, the MTBF is still 2-3 years. See
`Ethernet switch reliability: Temperature vs. moving parts
<http://media.beldensolutions.com/garrettcom/techsupport/papers/ethernet_switch_reliability.pdf>`_
for further information.
In most cases, it is much more economical to use a single switch with a
small pool of spare switches to replace failed units than it is to
outfit an entire data center with redundant switches. Applications
should tolerate rack level outages without affecting normal operations,
since network and compute resources are easily provisioned and
plentiful.
Preparing for the future: IPv6 support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
One of the most important networking topics today is the impending
exhaustion of IPv4 addresses. In early 2014, ICANN announced that they
started allocating the final IPv4 address blocks to the `Regional
Internet Registries
<http://www.internetsociety.org/deploy360/blog/2014/05/goodbye-ipv4-iana-starts-allocating-final-address-blocks/>`_.
This means the IPv4 address space is close to being fully allocated. As
a result, it will soon become difficult to allocate more IPv4 addresses
to an application that has experienced growth, or that you expect to
scale out, due to the lack of unallocated IPv4 address blocks.
For network focused applications the future is the IPv6 protocol. IPv6
increases the address space significantly, fixes long standing issues in
the IPv4 protocol, and will become essential for network focused
applications in the future.
OpenStack Networking supports IPv6 when configured to take advantage of
it. To enable IPv6, create an IPv6 subnet in Networking and use IPv6
prefixes when creating security groups.
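As a minimal sketch of that workflow, assuming illustrative network,
subnet, and security group names, and using the IPv6 documentation
prefix ``2001:db8::/32`` (substitute an allocated prefix in a real
deployment):

.. code-block:: console

   $ openstack subnet create --network app-net --ip-version 6 \
     --subnet-range 2001:db8:1::/64 app-v6-subnet
   $ openstack security group rule create --ethertype IPv6 \
     --protocol tcp --dst-port 443 --remote-ip 2001:db8::/32 app-secgroup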
Asymmetric links
^^^^^^^^^^^^^^^^
When designing a network architecture, the traffic patterns of an
application heavily influence the allocation of total bandwidth and the
number of links that you use to send and receive traffic. Applications
that provide file storage for customers allocate bandwidth and links to
favor incoming traffic, whereas video streaming applications allocate
bandwidth and links to favor outgoing traffic.
Performance
^^^^^^^^^^^
It is important to analyze the applications' tolerance for latency and
jitter when designing an environment to support network focused
applications. Certain applications, for example VoIP, are less tolerant
of latency and jitter. Where latency and jitter are concerned, certain
applications may require tuning of QoS parameters and network device
queues to ensure that they queue for transmit immediately or guarantee
minimum bandwidth. Since OpenStack currently does not support these
functions, carefully consider your selected network plug-in.
The location of a service may also impact the application or consumer
experience. If an application serves differing content to different
users it must properly direct connections to those specific locations.
Where appropriate, use a multi-site installation for these situations.
You can implement networking in two separate ways. Legacy networking
(nova-network) provides a flat DHCP network with a single broadcast
domain. This implementation does not support project isolation networks
or advanced plug-ins, but it is currently the only way to implement a
distributed :term:`layer-3 (L3) agent` using the multi_host configuration.
OpenStack Networking (neutron) is the official networking implementation and
provides a pluggable architecture that supports a large variety of
network methods. Some of these include a layer-2 only provider network
model, external device plug-ins, or even OpenFlow controllers.
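For example, a minimal ML2 configuration sketch that enables several of
these network methods side by side (file path and driver choices are
illustrative):

.. code-block:: ini

   # Illustrative excerpt from /etc/neutron/plugins/ml2/ml2_conf.ini
   [ml2]
   type_drivers = flat,vlan,vxlan,gre   # segmentation types available
   tenant_network_types = vxlan         # default for project networks
   mechanism_drivers = openvswitch,l2population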
Networking at large scales becomes a set of boundary questions. The
determination of how large a layer-2 domain must be is based on the
number of nodes within the domain and the amount of broadcast traffic
that passes between instances. Breaking layer-2 boundaries may require
the implementation of overlay networks and tunnels. This decision is a
balancing act between the need for smaller overhead and the need for a
smaller domain.
When selecting network devices, be aware that making this decision based
on the greatest port density often comes with a drawback. Aggregation
switches and routers have not all kept pace with Top of Rack switches
and may induce bottlenecks on north-south traffic. As a result, it may
be possible for massive amounts of downstream network utilization to
impact upstream network devices, impacting service to the cloud. Since
OpenStack does not currently provide a mechanism for traffic shaping or
rate limiting, it is necessary to implement these features at the
network hardware level.

@ -1,71 +0,0 @@
User requirements
~~~~~~~~~~~~~~~~~
Network-focused architectures vary from the general-purpose architecture
designs. Certain network-intensive applications influence these
architectures. Business requirements also shape the design; for example,
network latency, experienced as slow page loads, degraded video streams,
and low-quality VoIP sessions, directly impacts the user experience.
Users are often not aware of how network design and architecture affects their
experiences. Both enterprise customers and end-users rely on the network for
delivery of an application. Network performance problems can result in a
negative experience for the end-user, as well as productivity and economic
loss.
High availability issues
------------------------
Depending on the application and use case, network-intensive OpenStack
installations can have high availability requirements. Financial
transaction systems have a much higher requirement for high availability
than a development application. Use network availability technologies,
for example :term:`quality of service (QoS)`, to improve the network
performance of sensitive applications such as VoIP and video streaming.
High performance systems have SLA requirements for a minimum QoS with
regard to guaranteed uptime, latency, and bandwidth. The level of the
SLA can have a significant impact on the network architecture and
requirements for redundancy in the systems.
Risks
-----
Network misconfigurations
Configuring incorrect IP addresses, VLANs, and routers can cause
outages to areas of the network or, in the worst-case scenario, the
entire cloud infrastructure. Automate network configurations to
minimize the opportunity for operator error as it can cause
disruptive problems.
Capacity planning
Cloud networks require management for capacity and growth over time.
Capacity planning includes the purchase of network circuits and
hardware that can potentially have lead times measured in months or
years.
Network tuning
Configure cloud networks to minimize link loss, packet loss, packet
storms, broadcast storms, and loops.
Single Point Of Failure (SPOF)
Consider high availability at the physical and environmental layers.
If there is a single point of failure due to only one upstream link,
or only one power supply, an outage can become unavoidable.
Complexity
An overly complex network design can be difficult to maintain and
troubleshoot. While device-level configuration can ease maintenance
concerns and automated tools can handle overlay networks, avoid or
document non-traditional interconnects between functions and
specialized hardware to prevent outages.
Non-standard features
There are additional risks that arise from configuring the cloud
network to take advantage of vendor specific features. One example
is multi-link aggregation (MLAG) used to provide redundancy at the
aggregator switch level of the network. MLAG is not a standard and,
as a result, each vendor has their own proprietary implementation of
the feature. MLAG architectures are not interoperable across switch
vendors, which leads to vendor lock-in and can delay or prevent
component upgrades.

@ -1,101 +0,0 @@
===============
Network focused
===============
.. toctree::
:maxdepth: 2
network-focus-user-requirements.rst
network-focus-technical-considerations.rst
network-focus-operational-considerations.rst
network-focus-architecture.rst
network-focus-prescriptive-examples.rst
All OpenStack deployments depend on network communication in order to function
properly due to its service-based nature. In some cases, however, the network
elevates beyond simple infrastructure. This chapter discusses architectures
that are more reliant or focused on network services. These architectures
depend on the network infrastructure and require network services that
perform reliably in order to satisfy user and application requirements.
Some possible use cases include:
Content delivery network
This includes streaming video, viewing photographs, or accessing any other
cloud-based data repository distributed to a large number of end users.
Network configuration affects latency, bandwidth, and the distribution of
instances. Therefore, it impacts video streaming. Not all video streaming
is consumer-focused. For example, multicast videos (used for media, press
conferences, corporate presentations, and web conferencing services) can
also use a content delivery network. The location of the video repository
and its relationship to end users affects content delivery. Network
throughput of the back-end systems, as well as the WAN architecture and
the cache methodology, also affect performance.
Network management functions
Use this cloud to provide network service functions built to support the
delivery of back-end network services such as DNS, NTP, or SNMP.
Network service offerings
Use this cloud to run customer-facing network tools to support services.
Examples include VPNs, MPLS private networks, and GRE tunnels.
Web portals or web services
Web servers are a common application for cloud services, and we recommend
an understanding of their network requirements. The network requires scaling
out to meet user demand and deliver web pages with a minimum latency.
Depending on the details of the portal architecture, consider the internal
east-west and north-south network bandwidth.
High speed and high volume transactional systems
These types of applications are sensitive to network configurations. Examples
include financial systems, credit card transaction applications, and trading
and other extremely high volume systems. These systems are sensitive to
network jitter and latency. They must balance a high volume of East-West and
North-South network traffic to maximize efficiency of the data delivery. Many
of these systems must access large, high performance database back ends.
High availability
These types of use cases are dependent on the proper sizing of the network to
maintain replication of data between sites for high availability. If one site
becomes unavailable, the extra sites can serve the displaced load until the
original site returns to service. It is important to size network capacity to
handle the desired loads.
Big data
Clouds used for the management and collection of big data (data ingest) have
a significant demand on network resources. Big data often uses partial
replicas of the data to maintain integrity over large distributed clouds.
Other big data applications that require a large amount of network resources
are Hadoop, Cassandra, NuoDB, Riak, and other NoSQL and distributed
databases.
Virtual desktop infrastructure (VDI)
This use case is sensitive to network congestion, latency, jitter, and other
network characteristics. Like video streaming, the user experience is
important. However, unlike video streaming, caching is not an option to
offset the network issues. VDI requires both upstream and downstream traffic
and cannot rely on caching for the delivery of the application to the end
user.
Voice over IP (VoIP)
This is sensitive to network congestion, latency, jitter, and other network
characteristics. VoIP has a symmetrical traffic pattern and it requires
network :term:`quality of service (QoS)` for best performance. In addition,
you can implement active queue management to deliver voice and multimedia
content. Users are sensitive to latency and jitter fluctuations and can detect
them at very low levels.
Video Conference or web conference
This is sensitive to network congestion, latency, jitter, and other network
characteristics. Video Conferencing has a symmetrical traffic pattern, but
unless the network is on an MPLS private network, it cannot use network
:term:`quality of service (QoS)` to improve performance. Similar to VoIP,
users are sensitive to network performance issues even at low levels.
High performance computing (HPC)
This is a complex use case that requires careful consideration of the traffic
flows and usage patterns to address the needs of cloud clusters. It has high
east-west traffic patterns for distributed computing, but there can be
substantial north-south traffic depending on the specific application.

@ -1,85 +0,0 @@
==========
References
==========
`Data Protection framework of the European Union
<http://ec.europa.eu/justice/data-protection/>`_
: Guidance on Data Protection laws governed by the EU.
`Depletion of IPv4 Addresses
<http://www.internetsociety.org/deploy360/blog/2014/05/
goodbye-ipv4-iana-starts-allocating-final-address-blocks/>`_
: describing how IPv4 addresses and the migration to IPv6 is inevitable.
`Ethernet Switch Reliability <http://www.garrettcom.com/
techsupport/papers/ethernet_switch_reliability.pdf>`_
: Research white paper on Ethernet Switch reliability.
`Financial Industry Regulatory Authority
<http://www.finra.org/Industry/Regulation/FINRARules/>`_
: Requirements of the Financial Industry Regulatory Authority in the USA.
`Image Service property keys <https://docs.openstack.org/
cli-reference/glance.html#image-service-property-keys>`_
: Glance API property keys allows the administrator to attach custom
characteristics to images.
`LibGuestFS Documentation <http://libguestfs.org>`_
: Official LibGuestFS documentation.
`Logging and Monitoring
<https://docs.openstack.org/ops-guide/ops-logging-monitoring.html>`_
: Official OpenStack Operations documentation.
`ManageIQ Cloud Management Platform <http://manageiq.org/>`_
: An Open Source Cloud Management Platform for managing multiple clouds.
`N-Tron Network Availability
<https://www.scribd.com/doc/298973976/Network-Availability>`_
: Research white paper on network availability.
`Nested KVM <http://davejingtian.org/2014/03/30/nested-kvm-just-for-fun>`_
: Post on how to nest KVM under KVM.
`Open Compute Project <http://www.opencompute.org/>`_
: The Open Compute Project Foundation's mission is to design
and enable the delivery of the most efficient server,
storage and data center hardware designs for scalable computing.
`OpenStack Flavors
<https://docs.openstack.org/ops-guide/ops-user-facing-operations.html#flavors>`_
: Official OpenStack documentation.
`OpenStack High Availability Guide <https://docs.openstack.org/ha-guide/>`_
: Information on how to provide redundancy for the OpenStack components.
`OpenStack Hypervisor Support Matrix
<https://wiki.openstack.org/wiki/HypervisorSupportMatrix>`_
: Matrix of supported hypervisors and capabilities when used with OpenStack.
`OpenStack Object Store (Swift) Replication Reference
<https://docs.openstack.org/developer/swift/replication_network.html>`_
: Developer documentation of Swift replication.
`OpenStack Operations Guide <https://docs.openstack.org/ops-guide/>`_
: The OpenStack Operations Guide provides information on setting up
and installing OpenStack.
`OpenStack Security Guide <https://docs.openstack.org/security-guide/>`_
: The OpenStack Security Guide provides information on securing
OpenStack deployments.
`OpenStack Training Marketplace
<https://www.openstack.org/marketplace/training>`_
: The OpenStack Market for training and Vendors providing training
on OpenStack.
`PCI passthrough <https://wiki.openstack.org/wiki/
Pci_passthrough#How_to_check_PCI_status_with_PCI_api_paches>`_
: The PCI API patches extend the servers/os-hypervisor to
show PCI information for instance and compute node,
and also provides a resource endpoint to show PCI information.
`TripleO <https://wiki.openstack.org/wiki/TripleO>`_
: TripleO is a program aimed at installing, upgrading and operating
OpenStack clouds using OpenStack's own cloud facilities as the foundation.

@ -1,47 +0,0 @@
====================
Desktop-as-a-Service
====================
Virtual Desktop Infrastructure (VDI) is a service that hosts
user desktop environments on remote servers. This application
is very sensitive to network latency and requires a high
performance compute environment. Traditionally these types of
services do not use cloud environments because few clouds
support such a demanding workload for user-facing applications.
As cloud environments become more robust, vendors are starting
to provide services that provide virtual desktops in the cloud.
OpenStack may soon provide the infrastructure for these types of deployments.
Challenges
~~~~~~~~~~
Designing an infrastructure that is suitable to host virtual
desktops is a very different task from designing for most other
virtual workloads. For example, the design must consider:
* Boot storms, when a high volume of logins occur in a short period of time
* The performance of the applications running on virtual desktops
* Operating systems and their compatibility with the OpenStack hypervisor
Broker
~~~~~~
The connection broker determines which remote desktop host
users can access. Medium and large scale environments require a broker
since its service represents a central component of the architecture.
The broker is a complete management product, and enables automated
deployment and provisioning of remote desktop hosts.
Possible solutions
~~~~~~~~~~~~~~~~~~
There are a number of commercial products currently available that
provide a broker solution. However, no native OpenStack projects
provide broker services.
Not providing a broker is also an option, but managing this manually
would not suffice for a large scale, enterprise solution.
Diagram
~~~~~~~
.. figure:: figures/Specialized_VDI1.png

@ -1,43 +0,0 @@
====================
Specialized hardware
====================
Certain workloads require specialized hardware devices that
have significant virtualization or sharing challenges.
Applications such as load balancers, highly parallel brute
force computing, and direct to wire networking may need
capabilities that basic OpenStack components do not provide.
Challenges
~~~~~~~~~~
Some applications need access to hardware devices to either
improve performance or provide capabilities that are not
virtual CPU, RAM, network, or storage. These can be a shared
resource, such as a cryptography processor, or a dedicated
resource, such as a Graphics Processing Unit (GPU). OpenStack can
provide some of these, while others may need extra work.
Solutions
~~~~~~~~~
To provide cryptography offloading to a set of instances,
you can use Image service configuration options.
For example, assign the cryptography chip to a device node in the guest.
The OpenStack Command Line Reference contains further information on
configuring this solution in the section `Image service property keys
<https://docs.openstack.org/cli-reference/glance.html#image-service-property-keys>`_.
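One concrete instance of this pattern, with an illustrative image name,
is exposing a virtio random number generator backed by the host's
entropy source through an image property:

.. code-block:: console

   $ openstack image set --property hw_rng_model=virtio crypto-enabled-image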
A challenge, however, is that this option allows all guests using the
configured images to access the hypervisor cryptography device.
If you require direct access to a specific device, PCI pass-through
enables you to dedicate the device to a single instance per hypervisor.
You must define a flavor that specifically requests the PCI device in
order to properly schedule instances (see the sketch below).
More information regarding PCI pass-through, including instructions for
implementing and using it, is available at
`https://wiki.openstack.org/wiki/Pci_passthrough <https://wiki.openstack.org/
wiki/Pci_passthrough#How_to_check_PCI_status_with_PCI_api_patches>`_.
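As a hedged sketch of such a flavor, assuming a PCI alias named
``crypto_card`` has already been defined in the Compute service
configuration:

.. code-block:: console

   $ openstack flavor create --ram 8192 --disk 80 --vcpus 4 pci.large
   $ openstack flavor set pci.large \
     --property "pci_passthrough:alias"="crypto_card:1"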
.. figure:: figures/Specialized_Hardware2.png
:width: 100%

@ -1,78 +0,0 @@
========================
Multi-hypervisor example
========================
A financial company requires its applications migrated
from a traditional, virtualized environment to an API driven,
orchestrated environment. The new environment needs
multiple hypervisors since many of the company's applications
have strict hypervisor requirements.
Currently, the company's vSphere environment runs 20 VMware
ESXi hypervisors. These hypervisors support 300 instances of
various sizes. Approximately 50 of these instances must run
on ESXi. The remaining 250 or so have more flexible requirements.
The financial company decides to manage the
overall system with a common OpenStack platform.
.. figure:: figures/Compute_NSX.png
:width: 100%
Architecture planning teams decided to run a host aggregate
containing KVM hypervisors for the general purpose instances.
A separate host aggregate targets instances requiring ESXi.
Images in the OpenStack Image service have particular
hypervisor metadata attached. When a user requests a
certain image, the instance spawns on the relevant aggregate.
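A minimal sketch of that metadata, with illustrative image names and
assuming the Compute scheduler enables the ``ImagePropertiesFilter``:

.. code-block:: console

   $ openstack image set --property hypervisor_type=vmware app-esxi-image
   $ openstack image set --property hypervisor_type=kvm app-kvm-image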
Images for ESXi use the VMDK format. You can convert
QEMU disk images to VMDK or VMFS flat disks, which can
be thin, thick, zeroed-thick, or eager-zeroed-thick.
After exporting a VMFS thin disk from VMFS to the
OpenStack Image service (a non-VMFS location), it becomes a
preallocated flat disk. This impacts the transfer time from the
OpenStack Image service to the data store since transfers require
moving the full preallocated flat disk rather than the thin disk.
The VMware host aggregate compute nodes communicate with
vCenter rather than spawning directly on a hypervisor.
The vCenter then requests scheduling for the instance to run on
an ESXi hypervisor.
This functionality requires that VMware Distributed Resource
Scheduler (DRS) is enabled on a cluster and set to **Fully Automated**.
vSphere requires shared storage because DRS uses vMotion,
which is a service that relies on shared storage.
This solution to the company's migration uses shared storage
to provide Block Storage capabilities to the KVM instances while
also providing vSphere storage. The new environment provides this
storage functionality using a dedicated data network. The
compute hosts should have dedicated NICs to support the
dedicated data network. vSphere supports OpenStack Block Storage. This
support gives storage from a VMFS datastore to an instance. For the
financial company, Block Storage in their new architecture supports
both hypervisors.
OpenStack Networking provides network connectivity in this new
architecture, with the VMware NSX plug-in driver configured. Legacy
networking (nova-network) supports both hypervisors in this new
architecture example, but has limitations. Specifically, vSphere
with legacy networking does not support security groups. The new
architecture uses VMware NSX as a part of the design. When users launch an
instance within either of the host aggregates, VMware NSX ensures the
instance attaches to the appropriate network overlay-based logical networks.
The architecture planning teams also consider OpenStack Compute integration.
When running vSphere in an OpenStack environment, nova-compute
communicates with vCenter and appears as a single large hypervisor
that represents the entire ESXi cluster. Multiple nova-compute
instances can represent multiple ESXi clusters and can connect to
multiple vCenter servers. If the process running nova-compute
crashes, it cuts the connection to the vCenter server, management of
the corresponding ESXi clusters stops, and you will not be able to
provision further instances on the vCenter, even if you enable high
availability. You must monitor the nova-compute service connected
to vSphere carefully for any disruptions that result from this failure point.

@ -1,32 +0,0 @@
==============================
Specialized networking example
==============================
Some applications that interact with a network require
specialized connectivity. Applications such as a looking glass
require the ability to connect to a BGP peer, and routing
participant applications may need to join a network at layer 2.
Challenges
~~~~~~~~~~
Connecting specialized network applications to their required
resources alters the design of an OpenStack installation.
Installations that rely on overlay networks are unable to
support a routing participant, and may also block layer-2 listeners.
Possible solutions
~~~~~~~~~~~~~~~~~~
Deploying an OpenStack installation using OpenStack Networking with a
provider network allows direct layer-2 connectivity to an
upstream networking device.
This design provides the layer-2 connectivity required to communicate
via Intermediate System-to-Intermediate System (ISIS) protocol or
to pass packets controlled by an OpenFlow controller.
Using the Modular Layer 2 (ML2) plug-in with an agent such as
:term:`Open vSwitch` allows a private connection through a VLAN
directly to a specific port in a layer-3 device.
This allows a BGP point-to-point link to join the autonomous system.
Avoid using layer-3 plug-ins as they divide the broadcast
domain and prevent router adjacencies from forming.
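For illustration, a provider network that maps a project-visible
network onto a specific upstream VLAN might be created as follows; the
physical network label and segment ID are deployment specific:

.. code-block:: console

   $ openstack network create --provider-network-type vlan \
     --provider-physical-network physnet1 --provider-segment 100 \
     bgp-peering-net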

@ -1,71 +0,0 @@
======================
OpenStack on OpenStack
======================
In some cases, users may run OpenStack nested on top
of another OpenStack cloud. This scenario describes how to
manage and provision complete OpenStack environments on instances
supported by hypervisors and servers, which an underlying OpenStack
environment controls.
Public cloud providers can use this technique to manage the
upgrade and maintenance process on complete OpenStack environments.
Developers and those testing OpenStack can also use this
technique to provision their own OpenStack environments on
available OpenStack Compute resources, whether public or private.
Challenges
~~~~~~~~~~
The network is the most complicated aspect of deploying a nested cloud.
You must expose VLANs to the physical ports on which the underlying
cloud runs, because the bare metal cloud owns all the hardware, and
you must also expose them to the nested levels.
Alternatively, you can use the network overlay technologies on the
OpenStack environment running on the host OpenStack environment to
provide the required software defined networking for the deployment.
Hypervisor
~~~~~~~~~~
In this example architecture, consider which
approach you should take to provide a nested
hypervisor in OpenStack. This decision influences the
operating systems you use for the nested OpenStack deployments.
Possible solutions: deployment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Deployment of a full stack can be challenging but you can mitigate
this difficulty by creating a Heat template to deploy the
entire stack, or a configuration management system. After creating
the Heat template, you can automate the deployment of additional stacks.
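As a hedged sketch, once such a template exists, each additional nested
environment becomes a single command; the template and parameter names
are illustrative:

.. code-block:: console

   $ openstack stack create --template nested-openstack.yaml \
     --parameter flavor=m1.xlarge nested-cloud-01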
The OpenStack-on-OpenStack project (:term:`TripleO`)
addresses this issue. Currently, however, the project does
not completely cover nested stacks. For more information, see
https://wiki.openstack.org/wiki/TripleO.
Possible solutions: hypervisor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the case of running TripleO, the underlying OpenStack
cloud deploys the compute nodes as bare-metal. You then deploy
OpenStack on these Compute bare-metal servers with the
appropriate hypervisor, such as KVM.
In the case of running smaller OpenStack clouds for testing
purposes, where performance is not a critical factor, you can use
QEMU instead. It is also possible to run a KVM hypervisor in an instance
(see `davejingtian.org
<http://davejingtian.org/2014/03/30/nested-kvm-just-for-fun/>`_),
though this is not a supported configuration, and could be a
complex solution for such a use case.
Diagram
~~~~~~~
.. figure:: figures/Specialized_OOO.png
:width: 100%

@ -1,46 +0,0 @@
===========================
Software-defined networking
===========================
Software-defined networking (SDN) is the separation of the data
plane and control plane. SDN is a popular method of
managing and controlling packet flows within networks.
SDN uses overlays or directly controlled layer-2 devices to
determine flow paths, and as such presents challenges to a
cloud environment. Some designers may wish to run their
controllers within an OpenStack installation. Others may wish
to have their installations participate in an SDN-controlled network.
Challenges
~~~~~~~~~~
SDN is a relatively new concept that is not yet standardized,
so SDN systems come in a variety of different implementations.
Because of this, a truly prescriptive architecture is not feasible.
Instead, examine the differences between an existing and a planned
OpenStack design and determine where potential conflicts and gaps exist.
Possible solutions
~~~~~~~~~~~~~~~~~~
If an SDN implementation requires layer-2 access because it
directly manipulates switches, we do not recommend running an
overlay network or a layer-3 agent.
If the controller resides within an OpenStack installation,
it may be necessary to build an ML2 plug-in and schedule the
controller instances to connect to project VLANs so that they can
talk directly to the switch hardware.
Alternatively, depending on the external device support,
use a tunnel that terminates at the switch hardware itself.
Diagram
-------
OpenStack hosted SDN controller:
.. figure:: figures/Specialized_SDN_hosted.png
OpenStack participating in an SDN controller network:
.. figure:: figures/Specialized_SDN_external.png

@ -1,39 +0,0 @@
=================
Specialized cases
=================
.. toctree::
:maxdepth: 2
specialized-multi-hypervisor.rst
specialized-networking.rst
specialized-software-defined-networking.rst
specialized-desktop-as-a-service.rst
specialized-openstack-on-openstack.rst
specialized-hardware.rst
Although most OpenStack architecture designs fall into one
of the seven major scenarios outlined in other sections
(compute focused, network focused, storage focused, general
purpose, multi-site, hybrid cloud, and massively scalable),
there are a few use cases that do not fit into these categories.
This section discusses these specialized cases and provides some
additional details and design considerations for each use case:
* :doc:`Multi-hypervisor example <specialized-multi-hypervisor>`:
  describes a deployment that combines multiple hypervisors, such as
  KVM and ESXi, under a single OpenStack platform.
* :doc:`Specialized networking <specialized-networking>`:
  describes running networking-oriented software that may involve reading
  packets directly from the wire or participating in routing protocols.
* :doc:`Software-defined networking (SDN)
<specialized-software-defined-networking>`:
describes both running an SDN controller from within OpenStack
as well as participating in a software-defined network.
* :doc:`Desktop-as-a-Service <specialized-desktop-as-a-service>`:
describes running a virtualized desktop environment in a cloud
(:term:`Desktop-as-a-Service`).
This applies to private and public clouds.
* :doc:`OpenStack on OpenStack <specialized-openstack-on-openstack>`:
describes building a multi-tiered cloud by running OpenStack
on top of an OpenStack installation.
* :doc:`Specialized hardware <specialized-hardware>`:
describes the use of specialized hardware devices from within
the OpenStack environment.

@ -1,440 +0,0 @@
Architecture
~~~~~~~~~~~~
Consider the following factors when selecting storage hardware:
* Cost
* Performance
* Reliability
Storage-focused OpenStack clouds must address I/O intensive workloads.
These workloads are not CPU intensive, nor are they consistently network
intensive. The network may be heavily utilized to transfer storage data,
but these workloads are not otherwise network intensive.
The selection of storage hardware determines the overall performance and
scalability of a storage-focused OpenStack design architecture. Several
factors impact the design process, including:
Cost
The cost of components affects which storage architecture and
hardware you choose.
Performance
The latency of storage I/O requests indicates performance.
Performance requirements affect which solution you choose.
Scalability
Scalability refers to how the storage solution performs as it
expands to its maximum size. Storage solutions that perform well in
small configurations but have degraded performance in large
configurations are not scalable. A solution that performs well at
maximum expansion is scalable. Large deployments require a storage
solution that performs well as it expands.
Latency is a key consideration in a storage-focused OpenStack cloud.
Using solid-state disks (SSDs) minimizes latency, reduces CPU delays
caused by waiting for the storage, and increases performance. Use
RAID controller cards in compute hosts to improve the performance of the
underlying disk subsystem.
Depending on the storage architecture, you can adopt a scale-out
solution, or use a highly expandable and scalable centralized storage
array. If a centralized storage array is the right fit for your
requirements, then the array vendor determines the hardware selection.
It is possible to build a storage array using commodity hardware with
Open Source software, but doing so requires people with expertise to
build such a system.
On the other hand, a scale-out storage solution that uses
direct-attached storage (DAS) in the servers may be an appropriate
choice. This requires configuration of the server hardware to support
the storage solution.
Considerations affecting the storage architecture (and corresponding
storage hardware) of a storage-focused OpenStack cloud include:
Connectivity
Based on the selected storage solution, ensure the connectivity
matches the storage solution requirements. We recommend confirming
that the network characteristics minimize latency to boost the
overall performance of the design.
Latency
Determine if the use case has consistent or highly variable latency.
Throughput
Ensure that the storage solution throughput is optimized for your
application requirements.
Server hardware
Use of DAS impacts the server hardware choice and affects host
density, instance density, power density, OS-hypervisor, and
management tools.
Compute (server) hardware selection
-----------------------------------
Four opposing factors determine the compute (server) hardware selection:
Server density
A measure of how many servers can fit into a given measure of
physical space, such as a rack unit [U].
Resource capacity
The number of CPU cores, how much RAM, or how much storage a given
server delivers.
Expandability
The number of additional resources you can add to a server before it
reaches capacity.
Cost
The relative cost of the hardware weighed against the level of
design effort needed to build the system.
You must weigh the dimensions against each other to determine the best
design for the desired purpose. For example, increasing server density
can mean sacrificing resource capacity or expandability. Increasing
resource capacity and expandability can increase cost but decrease
server density. Decreasing cost often means decreasing supportability,
server density, resource capacity, and expandability.
Compute capacity (CPU cores and RAM capacity) is a secondary
consideration for selecting server hardware. The server hardware
must supply adequate CPU sockets, CPU cores, and RAM, while network
connectivity and storage capacity are less critical: the hardware
needs to provide enough of both to meet the user requirements, but
they are not the primary consideration.
Some server hardware form factors are better suited to storage-focused
designs than others. The following is a list of these form factors:
* Most blade servers support dual-socket multi-core CPUs. Choose either
  full-width or full-height blades to avoid this CPU limit. High density
  blade servers support up to 16 servers in only 10 rack units using
  half-height or half-width blades.
.. warning::
This decreases density by 50% (only 8 servers in 10 U) if a full
width or full height option is used.
* 1U rack-mounted servers have the ability to offer greater server
density than a blade server solution, but are often limited to
dual-socket, multi-core CPU configurations.
.. note::
Due to cooling requirements, it is rare to see 1U rack-mounted
servers with more than 2 CPU sockets.
To obtain greater than dual-socket support in a 1U rack-mount form
factor, customers need to buy their systems from Original Design
Manufacturers (ODMs) or second-tier manufacturers.
.. warning::
This may cause issues for organizations that have preferred
vendor policies or concerns with support and hardware warranties
of non-tier 1 vendors.
* 2U rack-mounted servers provide quad-socket, multi-core CPU support
but with a corresponding decrease in server density (half the density
offered by 1U rack-mounted servers).
* Larger rack-mounted servers, such as 4U servers, often provide even
  greater CPU capacity, commonly supporting four or even eight CPU
  sockets. These servers have greater expandability, but much lower
  server density and usually greater hardware cost.
* Rack-mounted servers that support multiple independent servers in a
  single 2U or 3U enclosure, "sled servers", deliver increased density
  as compared to typical 1U or 2U rack-mounted servers.
Other factors that influence server hardware selection for a
storage-focused OpenStack design architecture include:
Instance density
In this architecture, instance density and CPU-RAM oversubscription
are lower. You require more hosts to support the anticipated scale,
especially if the design uses dual-socket hardware designs.
Host density
Another option to address the higher host count is to use a
quad-socket platform. Taking this approach decreases host density
which also increases rack count. This configuration affects the
number of power connections and also impacts network and cooling
requirements.
Power and cooling density
The power and cooling density requirements might be lower than with
blade, sled, or 1U server designs due to lower host density (by
using 2U, 3U or even 4U server designs). For data centers with older
infrastructure, this might be a desirable feature.
Server hardware selection for a storage-focused OpenStack design
architecture should weigh a "scale-up" solution against a "scale-out"
solution. The best choice (a smaller number of larger hosts or a larger
number of smaller hosts) depends on a combination of factors: cost,
power, cooling, physical rack and floor space, support-warranty, and
manageability.
Networking hardware selection
-----------------------------
Key considerations for the selection of networking hardware include:
Port count
   The design requires networking hardware that provides the requisite
   port count.

Port density
   The physical space required to provide the requisite port count
   affects the network design. A switch that provides 48 10 GbE ports
   in 1U has a much higher port density than a switch that provides 24
   10 GbE ports in 2U (see the sketch after this list). Higher port
   density generally leaves more rack space for compute or storage
   components, which is preferred. It is also important to consider
   fault domains and power density. Finally, higher density switches
   are more expensive, so it is important not to over-design the
   network.
Port speed
The networking hardware must support the proposed network speed, for
example: 1 GbE, 10 GbE, or 40 GbE (or even 100 GbE).
Redundancy
User requirements for high availability and cost considerations
influence the required level of network hardware redundancy. Achieve
network redundancy by adding redundant power supplies or paired
switches.
.. note::

   User requirements determine whether a completely redundant network
   infrastructure is required; if so, the selected hardware must
   support that configuration.
Power requirements
Ensure that the physical data center provides the necessary power
for the selected network hardware. This is not an issue for top of
rack (ToR) switches, but may be an issue for spine switches in a
leaf and spine fabric, or end of row (EoR) switches.
Protocol support
   It is possible to gain more performance out of a single storage
   system by using specialized network technologies such as RDMA, SRP,
   iSER, and SCST. The specifics of using these technologies are beyond
   the scope of this book.
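
As a rough illustration of the port density trade-off described above,
the following sketch compares the two example switches (a 48-port 1U
switch versus a 24-port 2U switch); the figures are illustrative only,
not vendor data:

.. code-block:: python

   # Port density comparison for the two switch examples given under
   # "Port density" above. Figures are illustrative only.

   def ports_per_rack_unit(port_count, rack_units):
       """Return switch port density in ports per rack unit (U)."""
       return port_count / rack_units

   dense = ports_per_rack_unit(48, 1)   # 48-port 10 GbE switch in 1U
   sparse = ports_per_rack_unit(24, 2)  # 24-port 10 GbE switch in 2U

   print(f"{dense:.0f} ports/U versus {sparse:.0f} ports/U "
         f"({dense / sparse:.0f}x denser)")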
Software selection
------------------
Factors that influence the software selection for a storage-focused
OpenStack architecture design include:
* Operating system (OS) and hypervisor
* OpenStack components
* Supplemental software
Design decisions made in each of these areas impact the rest of the
OpenStack architecture design.
Operating system and hypervisor
-------------------------------
Operating system (OS) and hypervisor have a significant impact on the
overall design and also affect server hardware selection. Ensure the
selected operating system and hypervisor combination support the storage
hardware and work with the networking hardware selection and topology.
Operating system and hypervisor selection affect the following areas:
Cost
   Selecting a commercially supported hypervisor, such as Microsoft
   Hyper-V, results in a different cost model than a
   community-supported open source hypervisor like KVM or Xen.
   Similarly, choosing Ubuntu over Red Hat (or vice versa) impacts cost
   due to support contracts. However, business or application
   requirements might dictate a specific or commercially supported
   hypervisor.
Supportability
Staff must have training with the chosen hypervisor. Consider the
cost of training when choosing a solution. The support of a
commercial product such as Red Hat, SUSE, or Windows, is the
responsibility of the OS vendor. If an open source platform is
chosen, the support comes from in-house resources.
Management tools
   Ubuntu and KVM use different management tools than VMware vSphere.
   Although both OS and hypervisor combinations are supported by
   OpenStack, there are varying impacts to the rest of the design as a
   result of selecting one combination over the other.
Scale and performance
Ensure the selected OS and hypervisor combination meet the
appropriate scale and performance requirements needed for this
storage focused OpenStack cloud. The chosen architecture must meet
the targeted instance-host ratios with the selected OS-hypervisor
combination.
Security
Ensure the design can accommodate the regular periodic installation
of application security patches while maintaining the required
workloads. The frequency of security patches for the proposed
OS-hypervisor combination impacts performance and the patch
installation process could affect maintenance windows.
Supported features
   The chosen OS-hypervisor combination determines which OpenStack
   features are available. Certain features are only available with
   specific OSes or hypervisors; if required features are unavailable,
   you might need to modify the design to meet user requirements.
Interoperability
   The OS-hypervisor combination should be chosen based on its
   interoperability with other OS-hypervisor combinations deployed in
   the cloud. Operational and troubleshooting tools for one
   OS-hypervisor combination may differ from the tools used for
   another, in which case the design must address whether the two sets
   of tools need to interoperate.
OpenStack components
--------------------
The OpenStack components you choose can have a significant impact on the
overall design. While there are certain components that are always
present (Compute and Image service, for example), there are other
services that may not be required. As an example, a certain design may
not require the Orchestration service. Omitting Orchestration would not
typically have a significant impact on the overall design, however, if
the architecture uses a replacement for OpenStack Object Storage for its
storage component, this could potentially have significant impacts on
the rest of the design.
A storage-focused design might require the ability to use Orchestration
to launch instances with Block Storage volumes to perform
storage-intensive processing.
A storage-focused OpenStack design architecture uses the following
components:
* OpenStack Identity (keystone)
* OpenStack dashboard (horizon)
* OpenStack Compute (nova) (including the use of multiple hypervisor
drivers)
* OpenStack Object Storage (swift) (or another object storage solution)
* OpenStack Block Storage (cinder)
* OpenStack Image service (glance)
* OpenStack Networking (neutron) or legacy networking (nova-network)
Excluding certain OpenStack components may limit or constrain the
functionality of other components. If a design includes Orchestration
but excludes Telemetry, then the design cannot take advantage of
Orchestration's auto-scaling functionality, which relies on information
from Telemetry. Because you can use Orchestration to spin up a large
number of instances to perform storage-intensive processing, we
strongly recommend including Orchestration in a storage-focused
architecture design.
Networking software
-------------------
OpenStack Networking (neutron) provides a wide variety of networking
services for instances. There are many additional networking software
packages that may be useful to manage the OpenStack components
themselves. Some examples include HAProxy, Keepalived, and various
routing daemons (like Quagga). The OpenStack High Availability Guide
describes some of these software packages, HAProxy in particular. See
the `Network controller cluster stack
chapter <https://docs.openstack.org/ha-guide/networking-ha.html>`_ of
the OpenStack High Availability Guide.
Management software
-------------------
Management software includes software for providing:
* Clustering
* Logging
* Monitoring
* Alerting
.. important::

   The factors for determining which software packages in this
   category to select are outside the scope of this design guide.
The availability design requirements determine the selection of
clustering software, such as Corosync or Pacemaker. The availability of
the cloud infrastructure and the complexity of supporting the
configuration after deployment determines the impact of including these
software packages. The OpenStack High Availability Guide provides more
details on the installation and configuration of Corosync and Pacemaker.
Operational considerations determine the requirements for logging,
monitoring, and alerting. Each of these sub-categories includes options.
For example, in the logging sub-category you could select Logstash,
Splunk, Log Insight, or another log aggregation-consolidation tool.
Store logs in a centralized location to facilitate performing
analytics against the data. Log analytics engines can also provide
automation and issue notification by alerting on, and automatically
attempting to remediate, some commonly known issues.
If you require any of these software packages, the design must account
for the additional resource consumption. Some other potential design
impacts include:
* OS-Hypervisor combination: Ensure that the selected logging,
monitoring, or alerting tools support the proposed OS-hypervisor
combination.
* Network hardware: The network hardware selection needs to be
supported by the logging, monitoring, and alerting software.
Database software
-----------------
Most OpenStack components require access to back-end database services
to store state and configuration information. Choose an appropriate
back-end database which satisfies the availability and fault tolerance
requirements of the OpenStack services.
MySQL is the default database for OpenStack, but other compatible
databases are available.
.. note::
Telemetry uses MongoDB.
The chosen high availability database solution changes according to the
selected database. MySQL, for example, provides several options. Use a
replication technology such as Galera for active-active clustering. For
active-passive use some form of shared storage. Each of these potential
solutions has an impact on the design:
* Solutions that employ Galera/MariaDB require at least three MySQL
  nodes (see the quorum sketch after this list).
* MongoDB has its own design considerations for high availability.
* OpenStack design, generally, does not include shared storage.
However, for some high availability designs, certain components might
require it depending on the specific implementation.
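
To see why Galera-based solutions need at least three nodes, consider
majority quorum: the cluster keeps accepting writes only while a strict
majority of its members remain. The following minimal sketch (ours, not
part of any OpenStack component) makes the arithmetic explicit:

.. code-block:: python

   # Majority quorum for a Galera/MariaDB cluster: the cluster stays
   # writable only while a strict majority of nodes survives.

   def quorum_size(cluster_size):
       """Smallest number of nodes that constitutes a majority."""
       return cluster_size // 2 + 1

   def tolerated_failures(cluster_size):
       """Nodes that can fail while a majority survives."""
       return cluster_size - quorum_size(cluster_size)

   for n in (2, 3, 5):
       print(f"{n} nodes: quorum={quorum_size(n)}, "
             f"tolerates {tolerated_failures(n)} failure(s)")
   # A 2-node cluster tolerates zero failures, which is why three
   # nodes is the practical minimum.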
@ -1,252 +0,0 @@
Operational Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
Several operational factors affect the design choices for a
storage-focused cloud. In larger installations, operations staff are
responsible for maintaining the cloud environment, including:
Maintenance tasks
The storage solution should take into account storage maintenance
and the impact on underlying workloads.
Reliability and availability
Reliability and availability depend on wide area network
availability and on the level of precautions taken by the service
provider.
Flexibility
Organizations need to have the flexibility to choose between
off-premise and on-premise cloud storage options. This relies on
relevant decision criteria with potential cost savings. For example,
continuity of operations, disaster recovery, security, records
retention laws, regulations, and policies.
Monitoring and alerting services are vital in cloud environments with
high demands on storage resources. These services provide a real-time
view into the health and performance of the storage systems. An
integrated management console, or other dashboards capable of
visualizing SNMP data, is helpful when discovering and resolving issues
that arise within the storage cluster.
A storage-focused cloud design should include:
* Monitoring of physical hardware resources.
* Monitoring of environmental resources such as temperature and
humidity.
* Monitoring of storage resources such as available storage, memory,
and CPU.
* Monitoring of advanced storage performance data to ensure that
storage systems are performing as expected.
* Monitoring of network resources for service disruptions which would
affect access to storage.
* Centralized log collection.
* Log analytics capabilities.
* Ticketing system (or integration with a ticketing system) to track
issues.
* Alerting and notification of responsible teams or automated systems
which remediate problems with storage as they arise.
* Network Operations Center (NOC) staffed and always available to
resolve issues.
Application awareness
---------------------
Well-designed applications should be aware of underlying storage
subsystems in order to use cloud storage solutions effectively.
If replication is not available natively from the storage subsystem,
operations personnel must be able to modify the application so that it
provides its own replication service. An application designed to
detect underlying storage systems can function in a wide variety of
infrastructures, and still have the same basic behavior regardless of
the differences in the underlying infrastructure.
Fault tolerance and availability
--------------------------------
Designing for fault tolerance and availability of storage systems in an
OpenStack cloud is vastly different when comparing the Block Storage and
Object Storage services.
Block Storage fault tolerance and availability
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Configure Block Storage resource nodes with advanced RAID controllers
and high performance disks to provide fault tolerance at the hardware
level.
Deploy high performing storage solutions such as SSD disk drives or
flash storage systems for applications requiring extreme performance out
of Block Storage devices.
In environments that place extreme demands on Block Storage, we
recommend using multiple storage pools. In this case, each pool of
devices should have a similar hardware design and disk configuration
across all hardware nodes in that pool. This allows for a design that
provides applications with access to a wide variety of Block Storage
pools, each with their own redundancy, availability, and performance
characteristics. When deploying multiple pools of storage it is also
important to consider the impact on the Block Storage scheduler which is
responsible for provisioning storage across resource nodes. Ensuring
that applications can schedule volumes in multiple regions, each with
their own network, power, and cooling infrastructure, can give projects
the ability to build fault tolerant applications that are distributed
across multiple availability zones.
In addition to the Block Storage resource nodes, it is important to
design for high availability and redundancy of the APIs, and related
services that are responsible for provisioning and providing access to
storage. We recommend designing a layer of hardware or software load
balancers in order to achieve high availability of the appropriate REST
API services to provide uninterrupted service. In some cases, it may
also be necessary to deploy an additional layer of load balancing to
provide access to back-end database services responsible for servicing
and storing the state of Block Storage volumes. We also recommend
designing a highly available database solution to store the Block
Storage databases. Leverage highly available database solutions such as
Galera and MariaDB to help keep database services online for
uninterrupted access, so that projects can manage Block Storage volumes.
In a cloud with extreme demands on Block Storage, the network
architecture should take into account the amount of East-West bandwidth
required for instances to make use of the available storage resources.
The selected network devices should support jumbo frames for
transferring large blocks of data. In some cases, it may be necessary to
create an additional back-end storage network dedicated to providing
connectivity between instances and Block Storage resources so that there
is no contention of network resources.
Object Storage fault tolerance and availability
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
While consistency and partition tolerance are both inherent features of
the Object Storage service, it is important to design the overall
storage architecture to ensure that the implemented system meets those
goals. The OpenStack Object Storage service places a specific number of
data replicas as objects on resource nodes. These replicas are
distributed throughout the cluster based on a consistent hash ring which
exists on all nodes in the cluster.
Design the Object Storage system with a sufficient number of zones to
provide quorum for the number of replicas defined. For example, with
three replicas configured in the Swift cluster, the recommended number
of zones to configure within the Object Storage cluster in order to
achieve quorum is five. While it is possible to deploy a solution with
fewer zones, the implied risk of doing so is that some data may not be
available and API requests to certain objects stored in the cluster
might fail. For this reason, ensure you properly account for the number
of zones in the Object Storage cluster.
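
The three-replica example suggests a ``2 * replicas - 1`` rule of
thumb. The sketch below only illustrates that heuristic, inferred from
the example above; it is not an official Object Storage formula:

.. code-block:: python

   # Zone count heuristic for an Object Storage cluster, matching the
   # example above (three replicas -> five zones). This is a rule of
   # thumb inferred from the text, not an official Swift formula.

   def recommended_zones(replica_count):
       """Suggested zone count: 2 * replicas - 1."""
       return 2 * replica_count - 1

   for replicas in (2, 3, 4):
       print(f"{replicas} replicas -> {recommended_zones(replicas)} zones")
   # 3 replicas -> 5 zones, as recommended in the text.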
Each Object Storage zone should be self-contained within its own
availability zone. Each availability zone should have independent access
to network, power and cooling infrastructure to ensure uninterrupted
access to data. In addition, a pool of Object Storage proxy servers
providing access to data stored on the object nodes should service each
availability zone. Object proxies in each region should leverage local
read and write affinity so that local storage resources facilitate
access to objects wherever possible. We recommend deploying upstream
load balancing to ensure that proxy services are distributed across the
multiple zones and, in some cases, it may be necessary to make use of
third-party solutions to aid with geographical distribution of services.
A zone within an Object Storage cluster is a logical division. Any of
the following may represent a zone:
* A disk within a single node
* One zone per node
* Zone per collection of nodes
* Multiple racks
* Multiple data centers
Selecting the proper zone design is crucial for allowing the Object
Storage cluster to scale while providing an available and redundant
storage system. It may be necessary to configure storage policies that
have different requirements with regards to replicas, retention and
other factors that could heavily affect the design of storage in a
specific zone.
Scaling storage services
------------------------
Adding storage capacity and bandwidth is a very different process when
comparing the Block and Object Storage services. While adding Block
Storage capacity is a relatively simple process, adding capacity and
bandwidth to the Object Storage systems is a complex task that requires
careful planning and consideration during the design phase.
Scaling Block Storage
^^^^^^^^^^^^^^^^^^^^^
You can upgrade Block Storage pools to add storage capacity without
interrupting the overall Block Storage service. Add nodes to the pool by
installing and configuring the appropriate hardware and software and
then allowing that node to report in to the proper storage pool via the
message bus. This is because Block Storage nodes report into the
scheduler service advertising their availability. After the node is
online and available, projects can make use of those storage resources
instantly.
In some cases, the demand on Block Storage from instances may exhaust
the available network bandwidth. As a result, design network
infrastructure that services Block Storage resources in such a way that
you can add capacity and bandwidth easily. This often involves the use
of dynamic routing protocols or advanced networking solutions to add
capacity to downstream devices easily. Both the front-end and back-end
storage network designs should encompass the ability to quickly and
easily add capacity and bandwidth.
Scaling Object Storage
^^^^^^^^^^^^^^^^^^^^^^
Adding back-end storage capacity to an Object Storage cluster requires
careful planning and consideration. In the design phase, it is important
to determine the maximum partition power required by the Object Storage
service, which determines the maximum number of partitions which can
exist. Object Storage distributes data among all available storage, but
a partition cannot span more than one disk, although a disk can have
multiple partitions.
For example, a system that starts with a single disk and a partition
power of 3 can have at most 8 (2^3) partitions. Adding a second disk
means that each disk holds 4 partitions. The one-disk-per-partition
limit means that this system can never have more than 8 partitions, and
therefore can never use more than 8 disks, limiting its scalability.
However, a system that starts with a single disk and a partition power
of 10 can have up to 1024 (2^10) partitions.
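
The following sketch makes the partition arithmetic concrete, assuming
only the power-of-two scheme described above:

.. code-block:: python

   # Worked example of Object Storage partition power: the cluster has
   # a fixed total of 2**partition_power partitions, and a partition
   # cannot span more than one disk.

   def total_partitions(partition_power):
       return 2 ** partition_power

   # Partition power 3: 8 partitions, forever.
   print(total_partitions(3))        # 8
   print(total_partitions(3) // 2)   # 4 partitions per disk with 2 disks
   # Past 8 disks, some disks would hold no partitions at all, so the
   # cluster cannot scale further.

   # Partition power 10: up to 1024 partitions.
   print(total_partitions(10))       # 1024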
As you add back-end storage capacity to the system, the partition maps
redistribute data amongst the storage nodes. In some cases, this
replication consists of extremely large data sets. In these cases, we
recommend using back-end replication links that do not contend with
projects' access to data.
As more projects begin to access data within the cluster and their data
sets grow, it is necessary to add front-end bandwidth to service data
access requests. Adding front-end bandwidth to an Object Storage cluster
requires careful planning and design of the Object Storage proxies that
projects use to gain access to the data, along with the high availability
solutions that enable easy scaling of the proxy layer. We recommend
designing a front-end load balancing layer that projects and consumers
use to gain access to data stored within the cluster. This load
balancing layer may be distributed across zones, regions or even across
geographic boundaries, which may also require that the design encompass
geo-location solutions.
In some cases, you must add bandwidth and capacity to the network
resources servicing requests between proxy servers and storage nodes.
For this reason, the network architecture used for access to storage
nodes and proxy servers should use a scalable design.
@ -1,142 +0,0 @@
Prescriptive Examples
~~~~~~~~~~~~~~~~~~~~~
Storage-focused architecture depends on specific use cases. This section
discusses three example use cases:
* An object store with a RESTful interface
* Compute analytics with parallel file systems
* High performance database
The example below shows a REST interface without a high performance
requirement.
Swift is a highly scalable object store that is part of the OpenStack
project. The following diagram shows the example architecture:
.. figure:: figures/Storage_Object.png
The example REST interface, presented as a traditional Object store
running on traditional spindles, does not require a high performance
caching tier.
This example uses the following components:
Network:
* 10 GbE horizontally scalable spine leaf back-end storage and front
end network.
Storage hardware:

* 10 storage servers, each with 12 x 4 TB disks, for 480 TB of raw
  space and approximately 160 TB of usable space after replicas (see
  the capacity check after the note below).
Proxy:
* 3x proxies
* 2x10 GbE bonded front end
* 2x10 GbE back-end bonds
* Approximately 60 Gb of total bandwidth to the back-end storage
cluster
.. note::

   It may be necessary to implement a third-party caching layer for
   some applications to achieve suitable performance.
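
As a quick check of the usable-space figure, the following sketch
assumes the quoted 160 TB reflects three-way replication:

.. code-block:: python

   # Capacity check for the object store example above, assuming the
   # quoted usable figure reflects three-way replication.

   servers = 10
   disks_per_server = 12
   disk_size_tb = 4
   replicas = 3

   raw_tb = servers * disks_per_server * disk_size_tb   # 480 TB raw
   usable_tb = raw_tb / replicas                        # ~160 TB usable

   print(f"raw: {raw_tb} TB, usable after {replicas} replicas: "
         f"{usable_tb:.0f} TB")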
Compute analytics with Data processing service
----------------------------------------------
Analytics of large data sets are dependent on the performance of the
storage system. Clouds using storage systems such as Hadoop Distributed
File System (HDFS) have inefficiencies which can cause performance
issues.
One potential solution to this problem is the implementation of storage
systems designed for performance. Parallel file systems have previously
filled this need in the HPC space and are suitable for large scale
performance-oriented systems.
OpenStack has integration with Hadoop to manage the Hadoop cluster
within the cloud. The following diagram shows an OpenStack store with a
high performance requirement:
.. figure:: figures/Storage_Hadoop3.png
The hardware requirements and configuration are similar to those of the
High Performance Database example below. In this case, the architecture
uses Ceph's Swift-compatible REST interface, together with features
that allow a caching pool to be connected to accelerate the presented
pool.
High performance database with Database service
-----------------------------------------------
Databases are a common workload that benefit from high performance
storage back ends. Although enterprise storage is not a requirement,
many environments have existing storage that OpenStack cloud can use as
back ends. You can create a storage pool to provide block devices with
OpenStack Block Storage for instances as well as object interfaces. In
this example, the database I/O requirements are high and demand storage
presented from a fast SSD pool.
A storage system presents a LUN backed by a set of SSDs using a
traditional storage array with OpenStack Block Storage integration or a
storage platform such as Ceph or Gluster.
This system can provide additional performance. For example, in the
database example below, a portion of the SSD pool can act as a block
device to the Database server. In the high performance analytics
example, the inline SSD cache layer accelerates the REST interface.
.. figure:: figures/Storage_Database_+_Object5.png
In this example, Ceph presents a Swift-compatible REST interface, as
well as a block level storage from a distributed storage cluster. It is
highly flexible and has features that reduce the cost of operations,
such as self-healing and auto-balancing. Using erasure coded pools is a
suitable way of maximizing the amount of usable space.
.. note::
There are special considerations around erasure coded pools. For
example, higher computational requirements and limitations on the
operations allowed on an object; erasure coded pools do not support
partial writes.
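
To see why erasure coded pools maximize usable space, compare storage
efficiency: with k data chunks and m coding chunks, the usable fraction
of raw capacity is k / (k + m), versus 1/3 for triple replication. In
the sketch below, the k=8, m=4 profile is a hypothetical example, not a
recommendation:

.. code-block:: python

   # Usable-space efficiency: triple replication versus an erasure
   # coded pool with k data chunks and m coding chunks. The k=8, m=4
   # profile is a hypothetical example, not a recommendation.

   def replication_efficiency(replicas):
       """Fraction of raw capacity usable under replication."""
       return 1 / replicas

   def erasure_code_efficiency(k, m):
       """Fraction of raw capacity usable under k+m erasure coding."""
       return k / (k + m)

   print(f"3x replication: {replication_efficiency(3):.0%} usable")
   print(f"EC k=8, m=4:    {erasure_code_efficiency(8, 4):.0%} usable")
   # Roughly double the usable space, at the cost of extra CPU and the
   # partial-write limitation noted above.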
Using Ceph as an applicable example, a potential architecture would have
the following requirements:
Network:
* 10 GbE horizontally scalable spine leaf back-end storage and
front-end network
Storage hardware:
* 5 storage servers for the caching layer, each with 24 x 1 TB SSDs

* 10 storage servers, each with 12 x 4 TB disks, for 480 TB of raw
  space and approximately 160 TB of usable space after 3 replicas
REST proxy:
* 3x proxies
* 2x10 GbE bonded front end
* 2x10 GbE back-end bonds
* Approximately 60 Gb of total bandwidth to the back-end storage
cluster
Using an SSD cache layer, you can present block devices directly to
hypervisors or instances. The REST interface can also use the SSD cache
systems as an inline cache.
@ -1,62 +0,0 @@
Technical considerations
~~~~~~~~~~~~~~~~~~~~~~~~
Some of the key technical considerations that are critical to a
storage-focused OpenStack design architecture include:
Input-Output requirements
Input-Output performance requirements require researching and
modeling before deciding on a final storage framework. Running
benchmarks for Input-Output performance provides a baseline for
expected performance levels. If the benchmarks are sufficiently
detailed, the resulting data can help model behavior under different
workloads.
lifecycle of the architecture helps record the system health at
different points in time. The data from these scripted benchmarks
assist in future scoping and gaining a deeper understanding of an
organization's needs.
Scale
Scaling storage solutions in a storage-focused OpenStack
architecture design is driven by initial requirements, including
:term:`IOPS <Input/output Operations Per Second (IOPS)>`, capacity,
bandwidth, and future needs. Planning capacity based on projected
needs over the course of a budget cycle is important for a design.
The architecture should balance cost and capacity, while also allowing
flexibility to implement new technologies and methods as they become
available.
Security
Designing security around data has multiple points of focus that
vary depending on SLAs, legal requirements, industry regulations,
and certifications needed for systems or people. Consider compliance
with HIPAA, ISO 9000, and SOX based on the type of data. For certain
organizations, multiple levels of access control are important.
OpenStack compatibility
Interoperability and integration with OpenStack can be paramount in
deciding on a storage hardware and storage management platform.
Interoperability and integration includes factors such as OpenStack
Block Storage interoperability, OpenStack Object Storage
compatibility, and hypervisor compatibility (which affects the
ability to use storage for ephemeral instance storage).
Storage management
You must address a range of storage management-related
considerations in the design of a storage-focused OpenStack cloud.
These considerations include, but are not limited to, backup
strategy (and restore strategy, since a backup that cannot be
restored is useless), data valuation and hierarchical storage
management, retention strategy, data placement, and workflow
automation.
Data grids
Data grids are helpful when answering questions around data
valuation. Data grids improve decision making through correlation of
access patterns, ownership, and business-unit revenue with other
metadata values to deliver actionable information about data.
When building a storage-focused OpenStack architecture, strive to build
a flexible design based on an industry standard core. One way of
accomplishing this might be through the use of different back ends
serving different use cases.
@ -1,61 +0,0 @@
===============
Storage focused
===============
.. toctree::
:maxdepth: 2
storage-focus-technical-considerations.rst
storage-focus-operational-considerations.rst
storage-focus-architecture.rst
storage-focus-prescriptive-examples.rst
Cloud storage is a model of data storage that stores digital data in
logical pools and physical storage that spans across multiple servers
and locations. Cloud storage commonly refers to a hosted object storage
service, however the term also includes other types of data storage that
are available as a service, for example block storage.
Cloud storage runs on virtualized infrastructure and resembles broader
cloud computing in terms of accessible interfaces, elasticity,
scalability, multi-tenancy, and metered resources. You can use cloud
storage services from an off-premises service or deploy on-premises.
Cloud storage consists of many distributed resources acting as one, an
arrangement often referred to as an integrated storage cloud. Cloud
storage is highly fault tolerant through redundancy and the
distribution of data. It is highly durable through the creation of
versioned copies, and can be consistent with regard to data replicas.
At large scale, management of data operations is a resource intensive
process for an organization. Hierarchical storage management (HSM)
systems and data grids help annotate and report a baseline data
valuation to make intelligent decisions and automate data decisions. HSM
enables automated tiering and movement, as well as orchestration of data
operations. A data grid is an architecture, or a set of services built
on evolving technology, that brings together services enabling users to
manage large data sets.
Example applications deployed with cloud storage characteristics:
* Active archive, backups and hierarchical storage management.
* General content storage and synchronization, such as a private
  Dropbox-style service.
* Data analytics with parallel file systems.
* Unstructured data store for services. For example, social media
back-end storage.
* Persistent block storage.
* Operating system and application image store.
* Media streaming.
* Databases.
* Content distribution.
* Cloud storage peering.