Adds starting point for Architecture and Design Guide

The areas that still need work are:
- needs double-checking for tables
- see http://docs.openstack.org/arc/OpenStackArchitectureDesignGuide.epub
  for intended structure

Co-Authored-By: Nick Chase <nchase@mirantis.com>
Co-Authored-By: Beth Cohen <beth.cohen@verizon.com>
Co-Authored-By: Sean Collins <sean_collins2@cable.comcast.com>
Co-Authored-By: Steve Gordon <sgordon@redhat.com>
Co-Authored-By: Sebastian Gutierrez <segutier@redhat.com>
Co-Authored-By: Kevin Jackson <Kevin.Jackson@rackspace.co.uk>
Co-Authored-By: Scott Lowe <slowe@vmware.com>
Co-Authored-By: Maish Saidel-Keesing <msaidelk@cisco.com>
Co-Authored-By: Alexandra Settle <alexandra.settle@rackspace.com>
Co-Authored-By: Vinny Valdez <vvaldez@redhat.com>
Co-Authored-By: Anthony Veiga <Anthony_Veiga@cable.comcast.com>
Co-Authored-By: Sean Winn <sean.winn@cloudscaling.com>

Change-Id: Ia0ca278cd5d2d0ee67b9b7528870c1a2a80fdadf
Anne Gentle 2014-07-17 15:50:40 -05:00 committed by Andreas Jaeger
parent 61e4a39c4f
commit 483b337b9e
111 changed files with 10567 additions and 0 deletions

View File

@ -0,0 +1,75 @@
<?xml version="1.0" encoding="UTF-8"?>
<book xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="openstack-compute-admin-manual-grizzly">
<title>OpenStack Architecture Design Guide</title>
<?rax title.font.size="28px" subtitle.font.size="28px"?>
<titleabbrev>Architecture Guide</titleabbrev>
<info>
<author>
<personname>
<firstname/>
<surname/>
</personname>
<affiliation>
<orgname>OpenStack Foundation</orgname>
</affiliation>
</author>
<copyright>
<year>2014</year>
<holder>OpenStack Foundation</holder>
</copyright>
<releaseinfo>current</releaseinfo>
<productname>OpenStack</productname>
<pubdate/>
<legalnotice role="apache2">
<annotation>
<remark>Copyright details are filled in by the
template.</remark>
</annotation>
</legalnotice>
<legalnotice role="cc-by-sa">
<annotation>
<remark>Remaining licensing details are filled in by
the template.</remark>
</annotation>
</legalnotice>
<abstract>
<para>To reap the benefits of OpenStack, you should
plan, design, and architect your cloud properly,
taking users' needs into account and understanding the
use cases.</para>
</abstract>
<revhistory>
<!-- ... continue adding more revisions here as you change this document using the markup shown below... -->
<revision>
<date>2014-07-21</date>
<revdescription>
<itemizedlist>
<listitem>
<para>Initial release.</para>
</listitem>
</itemizedlist>
</revdescription>
</revision>
</revhistory>
</info>
<!-- Chapters are referred from the book file through these
include statements. You can add additional chapters using
these types of statements. -->
<xi:include href="../common/ch_preface.xml"/>
<xi:include href="ch_introduction.xml"/>
<xi:include href="ch_generalpurpose.xml"/>
<xi:include href="ch_compute_focus.xml"/>
<xi:include href="ch_storage_focus.xml"/>
<xi:include href="ch_network_focus.xml"/>
<xi:include href="ch_multi_site.xml"/>
<xi:include href="ch_hybrid.xml"/>
<xi:include href="ch_massively_scalable.xml"/>
<xi:include href="ch_specialized.xml"/>
<xi:include href="ch_references.xml"/><!--
<xi:include href="ch_glossary.xml"/>-->
<xi:include href="../common/app_support.xml"/>
</book>

View File

@ -0,0 +1,16 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="compute_focus">
<title>Compute Focused</title>
<xi:include href="compute_focus/section_introduction_compute_focus.xml"/>
<xi:include href="compute_focus/section_user_requirements_compute_focus.xml"/>
<xi:include href="compute_focus/section_tech_considerations_compute_focus.xml"/>
<xi:include href="compute_focus/section_operational_considerations_compute_focus.xml"/>
<xi:include href="compute_focus/section_architecture_compute_focus.xml"/>
<xi:include href="compute_focus/section_prescriptive_examples_compute_focus.xml"/>
</chapter>

View File

@ -0,0 +1,16 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="generalpurpose">
<title>General Purpose</title>
<xi:include href="generalpurpose/section_introduction_generalpurpose.xml"/>
<xi:include href="generalpurpose/section_user_requirements_general_purpose.xml"/>
<xi:include href="generalpurpose/section_tech_considerations_general_purpose.xml"/>
<xi:include href="generalpurpose/section_operational_considerations_general_purpose.xml"/>
<xi:include href="generalpurpose/section_architecture_general_purpose.xml"/>
<xi:include href="generalpurpose/section_prescriptive_example_general_purpose.xml"/>
</chapter>

View File

@ -0,0 +1,580 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-design-glossary">
<title>Glossary</title>
<itemizedlist>
<listitem>
<para>6to4 - A mechanism that allows IPv6 packets to be
transmitted over an IPv4 network, providing a strategy
for migrating to IPv6.</para>
</listitem>
<listitem>
<para>AAA - authentication, authorization and
auditing.</para>
</listitem>
<listitem>
<para>Anycast - A network routing methodology that routes
traffic from a single sender to the nearest node, in a
pool of nodes.</para>
</listitem>
<listitem>
<para>ARP - Address Resolution Protocol - the protocol by
which layer 3 IP addresses are resolved into layer 2,
link local addresses.</para>
</listitem>
<listitem>
<para>BGP - Border Gateway Protocol is a dynamic routing
protocol that connects autonomous systems together.
Considered the backbone of the Internet, this protocol
connects disparate networks together to form a larger
network.</para>
</listitem>
<listitem>
<para>Boot Storm - When hundreds of users log in and
consume resources at the same time, causing
significant performance degradation. This problem is
particularly common in Virtual Desktop Infrastructure
(VDI) environments.</para>
</listitem>
<listitem>
<para>Broadcast Domain - The layer 2 segment shared by a
group of network connected nodes.</para>
</listitem>
<listitem>
<para>Bursting - The practice of utilizing a secondary
environment to elastically build instances on-demand
when the primary environment is resource
constrained.</para>
</listitem>
<listitem>
            <para>Capital Expenditure (CapEx) - A capital expense,
                or CapEx, is an initial cost for building a
                product, business, or system.</para>
</listitem>
<listitem>
<para>Cascading Failure - A scenario where a single
failure in a system creates a cascading effect, where
other systems fail as load is transferred from the
failing system.</para>
</listitem>
<listitem>
<para>CDN - Content delivery network - a specialized
network that is used to distribute content to clients,
typically located close to the client for increased
performance.</para>
</listitem>
<listitem>
<para>Cells - An OpenStack Compute (Nova) feature, where a
compute deployment can be split into smaller clusters
or cells with their own queue and database for
performance and scalability, while still providing a
single API endpoint.</para>
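                <para>As an illustrative sketch only, a child cell
                    can be enabled through the
                    <literal>[cells]</literal> section of
                    <filename>nova.conf</filename>; the cell name
                    below is a hypothetical example:</para>
                <programlisting language="ini"># Minimal cells sketch for a child (compute) cell;
# "cell1" is an example name, not a required value.
[cells]
enable = True
name = cell1
cell_type = compute</programlisting>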
</listitem>
<listitem>
<para>CI/CD - Continuous Integration / Continuous
Deployment, a methodology where software is
continually built and unit tests run for each change
that is merged, or proposed for merge. Continuous
Deployment is a software development methodology where
changes are deployed into production as they are
merged into source control, rather than being
collected into a release and deployed at regular
                intervals.</para>
</listitem>
<listitem>
<para>Cloud Broker - A cloud broker is a third-party
individual or business that acts as an intermediary
between the purchaser of a cloud computing service and
the sellers of that service. In general, a broker is
someone who acts as an intermediary between two or
more parties during negotiations.</para>
</listitem>
<listitem>
<para>Cloud Consumer - User that consumes cloud instances,
storage, or other resources in a cloud environment.
This user interacts with OpenStack or other cloud
management tools.</para>
</listitem>
<listitem>
<para>Cloud Management Platform (CMP) - Products that
provide a common interface to manage multiple cloud
environments or platforms.</para>
</listitem>
<listitem>
<para>Connection Broker - In desktop virtualization, a
connection broker is a software program that allows
the end-user to connect to an available
desktop.</para>
</listitem>
<listitem>
<para>Direct Attached Storage (DAS) - Data storage that is
directly connected to a machine.</para>
</listitem>
<listitem>
<para>DefCore - DefCore sets base requirements by defining
capabilities, code and must-pass tests for all
OpenStack products. This definition uses community
resources and involvement to drive interoperability by
creating the minimum standards for products labeled
"OpenStack." See
https://wiki.openstack.org/wiki/Governance/CoreDefinition
for more information.</para>
</listitem>
<listitem>
<para>Desktop as a Service (DaaS) - A platform that
provides a suite of desktop environments that users
                may log in to receive a desktop experience from any
                location. This may provide general use, development,
                or even homogeneous testing environments.</para>
</listitem>
<listitem>
<para>Direct Server Return - A technique in load balancing
where an initial request is routed through a load
balancer, and the reply is sent from the responding
node directly to the requester.</para>
</listitem>
<listitem>
<para>Denial of Service (DoS) - In computing, a
denial-of-service or distributed denial-of-service
attack is an attempt to make a machine or network
resource unavailable to its intended users.</para>
</listitem>
<listitem>
            <para>Distributed Replicated Block Device (DRBD) - A
                distributed replicated storage system for the Linux
                platform.</para>
</listitem>
<listitem>
<para>Differentiated Service Code Point (DSCP) - Defined
in RFC 2474, this field in IPv4 and IPv6 headers is
used to define classes of network traffic, for quality
of service purposes.</para>
</listitem>
<listitem>
<para>External Border Gateway Protocol (eBGP) - External
Border Gateway Protocol describes a specific
implementation of BGP designed for inter-autonomous
                system communication.</para>
</listitem>
<listitem>
<para>Elastic IP - An Amazon Web Services concept, which
is an IP address that can be dynamically allocated and
re-assigned to running instances on the fly. The
OpenStack equivalent is a Floating IP.</para>
</listitem>
<listitem>
<para>Encapsulation - The practice of placing one packet
type within another for the purposes of abstracting or
securing data. Examples include GRE, MPLS, or
IPSEC.</para>
</listitem>
<listitem>
            <para>External Cloud - A cloud environment that exists
                outside of the control of an organization. Used in
                the context of hybrid cloud to indicate a public
                cloud or an off-site hosted cloud.</para>
</listitem>
<listitem>
            <para>Federated Cloud - A federated cloud describes
                multiple sets of cloud resources, for example compute
or storage, that are managed by a centralized
endpoint.</para>
</listitem>
<listitem>
<para>Flow - A series of packets that are stateful in
nature and represent a session. Usually represented by
a TCP stream, but can also indicate other packet types
that when combined comprise a connection between two
points.</para>
</listitem>
<listitem>
<para>Golden Image - An operating system image that
contains a set of pre-installed software packages and
configurations. This may be used to build standardized
instances that have the same base set of configuration
                to improve mean time to a functional
                application.</para>
</listitem>
<listitem>
<para>Graphics Processing Unit (GPU) - A single chip
processor with integrated transform, lighting,
triangle setup/clipping, and rendering engines that is
capable of processing a minimum of 10 million polygons
per second. Traditional uses are any compute problem
that can be represented as a vector or matrix
operation.</para>
</listitem>
<listitem>
<para>Hadoop Distributed File System (HDFS) - A
distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth
across the cluster.</para>
</listitem>
<listitem>
<para>High Availability (HA) - High availability system
design approach and associated service implementation
that ensures a prearranged level of operational
performance will be met during a contractual
measurement period.</para>
</listitem>
<listitem>
<para>High Performance Computing (HPC) - Also known as
distributed computing - used for computation intensive
                processes run on a large number of
                instances.</para>
</listitem>
<listitem>
<para>Hierarchical Storage Management (HSM) - Hierarchical
storage management is a data storage technique, which
automatically moves data between high-cost and
                low-cost storage media.</para>
</listitem>
<listitem>
<para>Hot Standby Router Protocol (HSRP) - Hot Standby
Router Protocol is a Cisco proprietary redundancy
protocol for establishing a fault-tolerant default
gateway, and has been described in detail in RFC
2281.</para>
</listitem>
<listitem>
<para>Hybrid Cloud - Hybrid cloud is a composition of two
or more clouds (private, community or public) that
remain distinct entities but are bound together,
offering the benefits of multiple deployment models.
Hybrid cloud can also mean the ability to connect
colocation, managed and/or dedicated services with
cloud resources.</para>
</listitem>
<listitem>
<para>Interior Border Gateway Protocol (iBGP) - Interior
                Border Gateway Protocol is an interior gateway
protocol designed to exchange routing and reachability
information within autonomous systems.</para>
</listitem>
<listitem>
<para>Interior Gateway Protocol (IGP) - An Interior
Gateway Protocol is a type of protocol used for
exchanging routing information between gateways
(commonly routers) within an Autonomous System (for
example, a system of corporate local area networks).
This routing information can then be used to route
network-level protocols like IP.</para>
</listitem>
<listitem>
<para>Input/Output Operations Per Second (IOPS) - A common
performance measurement used to benchmark computer
storage devices like hard disk drives, solid state
drives, and storage area networks.</para>
</listitem>
<listitem>
<para>jClouds - An open source multi-cloud toolkit for the
Java platform that gives you the freedom to create
applications that are portable across clouds while
giving you full control to use cloud-specific
features.</para>
</listitem>
<listitem>
            <para>Jitter - The deviation from true periodicity of a
presumed periodic signal in electronics and
telecommunications, often in relation to a reference
clock source.</para>
</listitem>
<listitem>
<para>Jumbo Frame - Ethernet frames with more than 1500
bytes of payload.</para>
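            <para>For example, on a Linux host an interface can be
                configured for jumbo frames with a command along
                these lines (the interface name is an
                example):</para>
            <programlisting language="bash"># Enable jumbo frames (9000-byte payload) on eth0
ip link set dev eth0 mtu 9000</programlisting>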
</listitem>
<listitem>
<para>Kernel-based Virtual Machine (KVM) - A full
virtualization solution for Linux on x86 hardware
containing virtualization extensions (Intel VT or
                AMD-V). It consists of a loadable kernel module that
                provides the core virtualization infrastructure, and
                a processor-specific module.</para>
</listitem>
<listitem>
<para>LAG - Link aggregation group is a term to describe
various methods of combining (aggregating) multiple
network connections in parallel into a group to
increase throughput beyond what a single connection
could sustain, and to provide redundancy in case one
                of the links fails.</para>
</listitem>
<listitem>
<para>Layer 2 - The data link layer provides a reliable
link between two directly connected nodes, by
detecting and possibly correcting errors that may
occur in the physical layer.</para>
</listitem>
<listitem>
<para>Layer 3 - The network layer provides the functional
and procedural means of transferring variable length
data sequences (called datagrams) from one node to
another connected to the same network.</para>
</listitem>
<listitem>
<para>Legacy System - An old method, technology, computer
system, or application program that is considered
outdated.</para>
</listitem>
<listitem>
<para>Looking Glass - A tool that provides information on
backbone routing and network efficiency.</para>
</listitem>
<listitem>
<para>Microsoft Azure - A cloud computing platform and
infrastructure, created by Microsoft, for building,
deploying and managing applications and services
through a global network of Microsoft-managed
datacenters.</para>
</listitem>
<listitem>
<para>MongoDB - A cross-platform document-oriented
database. Classified as a NoSQL database, MongoDB
eschews the traditional table-based relational
database structure in favor of JSON-like documents
with dynamic schemas.</para>
</listitem>
<listitem>
            <para>Mean Time Between Failures (MTBF) - Mean time
                between failures is the predicted elapsed time
                between inherent
failures of a system during operation. MTBF can be
calculated as the arithmetic mean (average) time
between failures of a system.</para>
</listitem>
<listitem>
<para>Maximum Transmission Unit (MTU) - The maximum
transmission unit of a communications protocol of a
layer is the size (in bytes) of the largest protocol
data unit that the layer can pass onwards.</para>
</listitem>
<listitem>
<para>NAT64 - NAT64 is a mechanism to allow IPv6 hosts to
communicate with IPv4 servers. The NAT64 server is the
endpoint for at least one IPv4 address and an IPv6
                network segment of 32 bits.</para>
</listitem>
<listitem>
<para>Network Functions Virtualization (NFV) - Network
Functions Virtualization is a network architecture
concept that proposes using IT virtualization related
                technologies to virtualize entire classes of network
node functions into building blocks that may be
connected, or chained, together to create
communication services.</para>
</listitem>
<listitem>
<para>NoSQL - A NoSQL or Not Only SQL database provides a
mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used
in relational databases.</para>
</listitem>
<listitem>
<para>Open vSwitch - Open vSwitch is a production quality,
multilayer virtual switch licensed under the open
source Apache 2.0 license. It is designed to enable
massive network automation through programmatic
extension, while still supporting standard management
interfaces and protocols (e.g. NetFlow, sFlow, SPAN,
RSPAN, CLI, LACP, 802.1ag).</para>
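            <para>As a brief illustration, a bridge can be created
                and a physical interface attached with the
                <command>ovs-vsctl</command> utility; the bridge and
                interface names below are example values:</para>
            <programlisting language="bash"># Create an integration bridge and attach a physical NIC
ovs-vsctl add-br br-int
ovs-vsctl add-port br-int eth1</programlisting>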
</listitem>
<listitem>
<para>Operational Expenditure (OPEX) - An operating
expense, operating expenditure, operational expense,
operational expenditure or OPEX is an ongoing cost for
running a product, business, or system.</para>
</listitem>
<listitem>
            <para>Original Design Manufacturer (ODM) - A company
                that designs and manufactures a product that is
                specified and eventually branded by another firm for
                sale.</para>
</listitem>
<listitem>
<para>Overlay Network - An overlay network is a computer
network which is built on the top of another network.
Nodes in the overlay can be thought of as being
connected by virtual or logical links, each of which
corresponds to a path, perhaps through many physical
links, in the underlying network.</para>
</listitem>
<listitem>
<para>Packet Storm - A cause of degraded service or
failure that occurs when a network system is
overwhelmed by continuous multicast or broadcast
traffic.</para>
</listitem>
<listitem>
<para>Platform as a Service (PaaS) - Platform as a Service
is a category of cloud computing services that
provides a computing platform and a solution stack as
a service.</para>
</listitem>
<listitem>
<para>Power Usage Effectiveness (PUE) - Power usage
effectiveness is a measure of how efficiently a
computer data center uses energy; specifically, how
much energy is used by the computing equipment (in
contrast to cooling and other overhead).</para>
</listitem>
<listitem>
<para>Quality of Service (QoS) - Quality of Service is the
overall performance of a telephony or computer
network, particularly the performance seen by the
users of the network.</para>
</listitem>
<listitem>
<para>Remote Desktop Host - A server that hosts Remote
Applications as session-based desktops. Users can
access a Remote Desktop Host server by using the
Remote Desktop Connection client.</para>
</listitem>
<listitem>
            <para>Renumbering - Network renumbering is the exercise
                of changing the IP host addresses, and perhaps the
                network mask, of each device within the network that
                has an address associated with it.</para>
</listitem>
<listitem>
<para>Rollback - In database technologies, a rollback is
an operation which returns the database to some
previous state. Rollbacks are important for database
integrity, because they mean that the database can be
restored to a clean copy even after erroneous
operations are performed.</para>
</listitem>
<listitem>
<para>Remote Procedure Call (RPC) - A powerful technique
for constructing distributed, client-server based
applications. The communicating processes may be on
the same system, or they may be on different systems
with a network connecting them.</para>
</listitem>
<listitem>
<para>Recovery Point Objective (RPO) - A recovery point
objective is defined by business continuity planning.
It is the maximum tolerable period in which data might
be lost from an IT service due to a major incident.
The RPO gives systems designers a limit to work
to.</para>
</listitem>
<listitem>
<para>Recovery Time Objective (RTO) - The recovery time
objective is the duration of time and a service level
within which a business process must be restored after
a disaster (or disruption) in order to avoid
unacceptable consequences associated with a break in
business continuity.</para>
</listitem>
<listitem>
<para>Software Development Kit (SDK) - A software
development kit is typically a set of software
development tools that allows for the creation of
applications for a certain software package, software
framework, hardware platform, computer system, video
game console, operating system, or similar development
platform.</para>
</listitem>
<listitem>
<para>Service Level Agreement (SLA) - A service-level
                agreement is a part of a service contract where a
                service is
formally defined. In practice, the term SLA is
sometimes used to refer to the contracted delivery
time (of the service or performance).</para>
</listitem>
<listitem>
            <para>Software Development Lifecycle (SDLC) - A software
                development process, also known as a software
                development life cycle, is a structure imposed on
                the development of a software product.</para>
</listitem>
<listitem>
            <para>Top of Rack Switch (ToR Switch) - A top-of-rack
                (ToR) switch is a small port-count switch that sits
                at or near the top of a telco rack in data
                centers.</para>
</listitem>
<listitem>
<para>Traffic Shaping - Traffic shaping (also known as
"packet shaping") is a computer network traffic
management technique which delays some or all
datagrams to bring them into compliance with a desired
traffic profile. Traffic shaping is a form of rate
limiting.</para>
</listitem>
<listitem>
<para>Tunneling - Computer networks use a tunneling
protocol when one network protocol (the delivery
protocol) encapsulates a different payload protocol.
By using tunneling one can (for example) carry a
payload over an incompatible delivery-network, or
provide a secure path through an untrusted
network.</para>
</listitem>
<listitem>
<para>Virtual Desktop Infrastructure (VDI) - Virtual
Desktop Infrastructure is a desktop-centric service
that hosts user desktop environments on remote
servers, which are accessed over a network using a
remote display protocol. A connection brokering
service is used to connect users to their assigned
desktop sessions.</para>
</listitem>
<listitem>
<para>Virtual Local Area Networks (VLAN) - In computer
networking, a single layer-2 network may be
partitioned to create multiple distinct broadcast
domains, which are mutually isolated so that packets
can only pass between them via one or more routers;
such a domain is referred to as a virtual local area
network, virtual LAN or VLAN.</para>
</listitem>
<listitem>
<para>Voice over Internet Protocol (VoIP) -
Voice-over-Internet Protocol (VoIP) is a methodology
and group of technologies for the delivery of voice
communications and multimedia sessions over Internet
Protocol (IP) networks, such as the Internet.</para>
</listitem>
<listitem>
<para>Virtual Router Redundancy Protocol (VRRP) - The
Virtual Router Redundancy Protocol (VRRP) is a
computer networking protocol that provides for
automatic assignment of available Internet Protocol
(IP) routers to participating hosts. This increases
the availability and reliability of routing paths via
automatic default gateway selections on an IP
sub-network.</para>
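            <para>A minimal sketch of a VRRP configuration using
                keepalived follows; the interface, router ID,
                priority, and address are hypothetical example
                values:</para>
            <programlisting language="ini"># /etc/keepalived/keepalived.conf (example values)
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.168.1.1/24
    }
}</programlisting>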
</listitem>
<listitem>
            <para>VXLAN Tunnel Endpoint (VTEP) - Used for frame
                encapsulation. VTEP functionality can be implemented
                in software, such as a virtual switch, or in the
                form of a physical switch.</para>
</listitem>
<listitem>
<para>Virtual Extensible Local Area Network (VXLAN) -
Virtual Extensible LAN is a network virtualization
technology that attempts to ameliorate the scalability
problems associated with large cloud computing
deployments. It uses a VLAN-like encapsulation
technique to encapsulate MAC-based OSI layer 2
Ethernet frames within layer 3 UDP packets.</para>
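            <para>As an illustration, a VXLAN interface can be
                created directly on Linux; the VXLAN network
                identifier (VNI) and multicast group below are
                example values:</para>
            <programlisting language="bash"># Create VXLAN interface vxlan0 with VNI 42 over eth0
ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789</programlisting>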
</listitem>
<listitem>
<para>Wide Area Network (WAN) - A wide area network is a
network that covers a broad area using leased or
private telecommunication lines.</para>
</listitem>
<listitem>
<para>Xen - Xen is a hypervisor using a microkernel
design, providing services that allow multiple
computer operating systems to execute on the same
computer hardware concurrently.</para>
</listitem>
</itemizedlist>
</chapter>

View File

@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="hybrid">
<title>Hybrid</title>
<xi:include href="hybrid/section_introduction_hybrid.xml"/>
<xi:include href="hybrid/section_user_requirements_hybrid.xml"/>
<xi:include href="hybrid/section_tech_considerations_hybrid.xml"/>
<xi:include href="hybrid/section_operational_considerations_hybrid.xml"/>
<xi:include href="hybrid/section_architecture_hybrid.xml"/>
<xi:include href="hybrid/section_prescriptive_examples_hybrid.xml"/>
</chapter>

View File

@ -0,0 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="introduction">
<title>Introduction</title>
<xi:include href="introduction/section_introduction_to_openstack_architecture_design_guide.xml"/>
<xi:include href="introduction/section_intended_audience.xml"/>
<xi:include href="introduction/section_how_this_book_is_organized.xml"/>
<xi:include href="introduction/section_how_this_book_was_written.xml"/>
<xi:include href="introduction/section_methodology.xml"/>
</chapter>

View File

@ -0,0 +1,14 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="massively_scalable">
<title>Massively Scalable</title>
<xi:include href="massively_scalable/section_introduction_massively_scalable.xml"/>
<xi:include href="massively_scalable/section_user_requirements_massively_scalable.xml"/>
<xi:include href="massively_scalable/section_tech_considerations_massively_scalable.xml"/>
<xi:include href="massively_scalable/section_operational_considerations_massively_scalable.xml"/>
</chapter>

View File

@ -0,0 +1,16 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="multi_site">
<title>Multi-Site</title>
<xi:include href="multi_site/section_introduction_multi_site.xml"/>
<xi:include href="multi_site/section_user_requirements_multi_site.xml"/>
<xi:include href="multi_site/section_tech_considerations_multi_site.xml"/>
<xi:include href="multi_site/section_operational_considerations_multi_site.xml"/>
<xi:include href="multi_site/section_architecture_multi_site.xml"/>
<xi:include href="multi_site/section_prescriptive_examples_multi_site.xml"/>
</chapter>

View File

@ -0,0 +1,16 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="network_focus">
<title>Network Focused</title>
<xi:include href="network_focus/section_introduction_network_focus.xml"/>
<xi:include href="network_focus/section_user_requirements_network_focus.xml"/>
<xi:include href="network_focus/section_tech_considerations_network_focus.xml"/>
<xi:include href="network_focus/section_operational_considerations_network_focus.xml"/>
<xi:include href="network_focus/section_architecture_network_focus.xml"/>
<xi:include href="network_focus/section_prescriptive_examples_network_focus.xml"/>
</chapter>

View File

@ -0,0 +1,77 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-design-references">
<?dbhtml stop-chunking?>
<title>References</title>
    <para>Data Protection framework of the European Union:
        http://ec.europa.eu/justice/data-protection/ - Guidance on
        Data Protection laws governed by the EU.</para>
    <para>Depletion of IPv4 Addresses:
        http://www.internetsociety.org/deploy360/blog/2014/05/goodbye-ipv4-iana-starts-allocating-final-address-blocks/
        - Article describing how the depletion of IPv4 addresses
        makes the migration to IPv6 inevitable.</para>
    <para>Ethernet Switch Reliability:
        http://www.garrettcom.com/techsupport/papers/ethernet_switch_reliability.pdf
        - Research white paper on Ethernet switch
        reliability.</para>
    <para>Financial Industry Regulatory Authority:
        http://www.finra.org/Industry/Regulation/FINRARules/ -
        Requirements of the Financial Industry Regulatory Authority
        in the USA.</para>
    <para>Image Service property keys:
        http://docs.openstack.org/cli-reference/content/chapter_cli-glance-property.html
        - Glance API property keys allow the administrator to attach
        custom characteristics to images.</para>
    <para>LibGuestFS Documentation: http://libguestfs.org - Official
        LibGuestFS documentation.</para>
    <para>Logging and Monitoring:
        http://docs.openstack.org/openstack-ops/content/logging_monitoring.html
        - Official OpenStack Operations documentation.</para>
    <para>ManageIQ Cloud Management Platform: http://manageiq.org/ -
        An open source cloud management platform for managing
        multiple clouds.</para>
    <para>N-Tron Network Availability:
        http://www.n-tron.com/pdf/network_availability.pdf -
        Research white paper on network availability.</para>
    <para>Nested KVM:
        http://davejingtian.org/2014/03/30/nested-kvm-just-for-fun -
        Blog post on how to nest KVM under KVM.</para>
    <para>Open Compute Project: http://www.opencompute.org/ - The
        Open Compute Project Foundation's mission is to design and
        enable the delivery of the most efficient server, storage,
        and data center hardware designs for scalable
        computing.</para>
    <para>OpenStack Flavors:
        http://docs.openstack.org/openstack-ops/content/flavors.html
        - Official OpenStack documentation.</para>
    <para>OpenStack High Availability Guide:
        http://docs.openstack.org/high-availability-guide/content/ -
        Information on how to provide redundancy for the OpenStack
        components.</para>
    <para>OpenStack Hypervisor Support Matrix:
        https://wiki.openstack.org/wiki/HypervisorSupportMatrix -
        Matrix of supported hypervisors and capabilities when used
        with OpenStack.</para>
    <para>OpenStack Object Store (Swift) Replication Reference:
        http://docs.openstack.org/developer/swift/replication_network.html
        - Developer documentation of Swift replication.</para>
    <para>OpenStack Operations Guide:
        http://docs.openstack.org/openstack-ops/ - The OpenStack
        Operations Guide provides information on setting up and
        installing OpenStack.</para>
    <para>OpenStack Security Guide:
        http://docs.openstack.org/security-guide/ - The OpenStack
        Security Guide provides information on securing OpenStack
        deployments.</para>
    <para>OpenStack Training Marketplace:
        http://www.openstack.org/marketplace/training - The
        OpenStack marketplace for training and vendors providing
        training on OpenStack.</para>
    <para>PCI passthrough:
        https://wiki.openstack.org/wiki/Pci_passthrough#How_to_check_PCI_status_with_PCI_api_paches
        - The PCI API patches extend the servers/os-hypervisor API
        to show PCI information for instances and compute nodes, and
        also provide a resource endpoint to show PCI
        information.</para>
    <para>TripleO: https://wiki.openstack.org/wiki/TripleO - TripleO
        is a program aimed at installing, upgrading, and operating
        OpenStack clouds using OpenStack's own cloud facilities as
        the foundation.</para>
</chapter>

View File

@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="specialized">
<title>Specialized Cases</title>
<xi:include href="specialized/section_introduction_specialized.xml"/>
<xi:include href="specialized/section_multi_hypervisor_specialized.xml"/>
<xi:include href="specialized/section_networking_specialized.xml"/>
<xi:include href="specialized/section_software_defined_networking_specialized.xml"/>
<xi:include href="specialized/section_desktop_as_a_service_specialized.xml"/>
<xi:include href="specialized/section_openstack_on_openstack_specialized.xml"/>
<xi:include href="specialized/section_hardware_specialized.xml"/>
</chapter>

View File

@ -0,0 +1,16 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="storage_focus">
<title>Storage Focused</title>
<xi:include href="storage_focus/section_introduction_storage_focus.xml"/>
<xi:include href="storage_focus/section_user_requirements_storage_focus.xml"/>
<xi:include href="storage_focus/section_tech_considerations_storage_focus.xml"/>
<xi:include href="storage_focus/section_operational_considerations_storage_focus.xml"/>
<xi:include href="storage_focus/section_architecture_storage_focus.xml"/>
<xi:include href="storage_focus/section_prescriptive_examples_storage_focus.xml"/>
</chapter>

View File

@ -0,0 +1,879 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-design-architecture-hardware">
<?dbhtml stop-chunking?>
<title>Architecture</title>
<para>The hardware selection covers three areas:</para>
<itemizedlist>
<listitem>
<para>Compute</para>
</listitem>
<listitem>
<para>Network</para>
</listitem>
<listitem>
<para>Storage</para>
</listitem>
</itemizedlist>
    <para>In a compute-focused OpenStack cloud, the hardware
        selection must reflect the compute-intensive nature of the
        workloads. Compute-focused is defined as having extreme
        demands on processor and memory resources. These workloads
        are not storage intensive, nor are they consistently network
        intensive: the network and storage may be heavily utilized
        while loading a data set into the computational cluster, but
        they are not otherwise intensive.
</para>
<para>Compute (server) hardware must be evaluated against four
opposing
dimensions:
</para>
<itemizedlist>
<listitem>
<para>Server density: A measure of how many servers can
fit into a
given measure of physical space, such as a
rack unit [U].
</para>
</listitem>
<listitem>
<para>Resource capacity: The number of CPU cores, how much
RAM, or how
much storage a given server will
deliver.
</para>
</listitem>
<listitem>
<para>Expandability: The number of additional resources
that can be
added to a server before it has reached
its limit.
</para>
</listitem>
<listitem>
<para>Cost: The relative purchase price of the hardware
weighted
against the level of design effort needed to
build the system.
</para>
</listitem>
</itemizedlist>
    <para>The dimensions need to be weighed against each other to
determine the best design for the desired purpose. For
example,
increasing server density means sacrificing resource
capacity or
expandability. Increasing resource capacity and
expandability can
increase cost but decreases server density.
Decreasing cost can mean
decreasing supportability, server
density, resource capacity, and
expandability.
</para>
<para>Selection of hardware for a compute-focused cloud should
have an
emphasis on server hardware that can offer more CPU
sockets, more CPU
cores, and more RAM; network connectivity
and storage capacity are less
critical. The hardware will need
to be configured to provide enough
network connectivity and
storage capacity to meet minimum user
requirements, but they
are not the primary consideration.
</para>
<para>Some server hardware form factors are better suited than
others,
as CPU and RAM capacity have the highest
priority.
</para>
<itemizedlist>
<listitem>
            <para>Most blade servers can support dual-socket
                multi-core CPUs. To avoid this CPU limit, select
                "full width" or "full height" blades, which
                consequently reduces server density. For example,
                high-density blade servers such as HP BladeSystem
                and Dell PowerEdge M1000e support up to 16 servers
                in only 10 rack units using half-height blades, but
                selecting full-height blades halves the density to
                only 8 servers per 10 rack units.
</para>
</listitem>
<listitem>
            <para>1U rack-mounted servers (servers that occupy only a
                single rack unit) may be able to offer greater
                server density than a blade server solution. It is
                possible to place 40 servers in a rack, leaving
                space for the top of rack (ToR) switches, versus 32
                "full width" or "full height" blade servers in a
                rack. However, 1U servers are often limited to
                dual-socket, multi-core CPU configurations. Note
                that, as of the Icehouse release, neither HP, IBM,
                nor Dell offered 1U rack servers with more than 2
                CPU sockets. To obtain greater than dual-socket
                support in a 1U rack-mount form factor, customers
                need to buy their systems from Original Design
                Manufacturers (ODMs) or second-tier manufacturers.
                This may cause issues for organizations that have
                preferred vendor policies or concerns with support
                and hardware warranties of non-tier-1 vendors.
</para>
</listitem>
<listitem>
<para>2U rack-mounted servers provide quad-socket,
multi-core CPU
support, but with a corresponding
decrease in server density (half
the density offered
by 1U rack-mounted servers).
</para>
</listitem>
<listitem>
<para>Larger rack-mounted servers, such as 4U servers,
often provide
even greater CPU capacity, commonly
supporting four or even eight CPU
sockets. These
servers have greater expandability, but such servers
have much lower server density and usually greater
hardware cost.
</para>
</listitem>
<listitem>
<para>"Sled servers" (rack-mounted servers that support
multiple
independent servers in a single 2U or 3U
enclosure) deliver increased
density as compared to
typical 1U or 2U rack-mounted servers. For
example,
many sled servers offer four independent dual-socket
nodes in
2U for a total of 8 CPU sockets in 2U.
However, the dual-socket
limitation on individual
nodes may not be sufficient to offset their
additional
cost and configuration complexity.
</para>
</listitem>
</itemizedlist>
<para>The following facts will strongly influence server hardware
selection for a compute-focused OpenStack design
architecture:
</para>
<itemizedlist>
<listitem>
            <para>Instance density: In this architecture, instance
                density is considered lower; therefore, CPU and RAM
                over-subscription ratios are also lower. More hosts
                will be required to support the anticipated scale,
                especially if the design uses dual-socket hardware.
</para>
</listitem>
<listitem>
<para>Host density: Another option to address the higher host count
                that might be needed with dual-socket designs is to
                use a quad-socket platform. Taking this approach
                will decrease host
density,
which increases rack count. This configuration may
affect the network
requirements, the number of power
connections, and possibly impact
the cooling
requirements.
</para>
</listitem>
<listitem>
<para>Power and cooling density: The power and cooling
density
requirements might be lower than with blade,
sled, or 1U server
designs because of lower host
density (by using 2U, 3U or even 4U
server designs).
For data centers with older infrastructure, this may
be a desirable feature.
</para>
</listitem>
</itemizedlist>
    <para>Compute-focused OpenStack design architecture server
        hardware selection results in a "scale up" versus "scale
        out" decision. Whether the better solution is a smaller
        number of larger hosts or a larger number of smaller hosts
        depends on a combination of factors: cost, power, cooling,
        physical rack and floor space, support-warranty, and
        manageability.
</para>
<section xml:id="storage-hardware-selection">
<title>Storage Hardware Selection</title>
    <para>For a compute-focused OpenStack design architecture, the
        selection of storage hardware is not critical, as it is not
        a primary criterion; however, it is still important. There
        are a number of different factors that a cloud architect
        must consider:
</para>
<itemizedlist>
<listitem>
<para>Cost: The overall cost of the solution will play a major role
in what storage architecture (and resulting storage hardware) is
selected.
</para>
</listitem>
<listitem>
            <para>Performance: The performance of the solution also
                plays a big role and can be measured by observing
                the latency of storage I/O requests. In a
                compute-focused OpenStack cloud, storage latency can
                be a major consideration. In some compute-intensive
                workloads, minimizing the delays that the CPU
                experiences while fetching data from storage can
                have a significant impact on the overall performance
                of the application.
</para>
</listitem>
<listitem>
            <para>Scalability: This section uses the term
                "scalability" to refer to how well the storage
                solution performs as it is expanded up to its
                maximum size. A storage solution that performs well
                in small configurations but whose performance
                degrades as it expands would not be considered
                scalable. On the other hand, a solution that
                continues to perform well at maximum expansion would
                be considered scalable.
</para>
</listitem>
<listitem>
<para>Expandability: Expandability refers to the overall ability of
the solution to grow. A storage solution that expands to 50 PB
is
considered more expandable than a solution that only scales
                to 10 PB.
Note that this metric is related to, but different
from,
scalability, which is a measure of the solution's
performance as it
expands.
</para>
</listitem>
</itemizedlist>
<para>For a compute-focused OpenStack cloud, latency of storage is a
major
consideration. Using solid-state disks (SSDs) to minimize
latency for
instance storage and reduce CPU delays caused by waiting
for the storage
will increase performance. Consider using RAID
controller cards in
compute hosts to improve the performance of the
underlying disk
subsystem.
</para>
<para>The selection of storage architecture, and the corresponding
storage
hardware (if there is the option), is determined by evaluating
possible
solutions against the key factors listed above. This will
determine
if a
scale-out solution (such as Ceph, GlusterFS, or similar)
should be used,
or if a single, highly expandable and scalable
centralized storage
array
would be a better choice. If a centralized
storage array is the right
fit for the requirements, the hardware will
be determined by the array
vendor. It is also possible to build a
storage array using commodity
hardware with Open Source software, but
there needs to be access to
people with expertise to build such a
system. Conversely, a scale-out
storage solution that uses
direct-attached storage (DAS) in the
servers
may be an appropriate
choice. If so, then the server hardware needs to
be configured to
support the storage solution.
</para>
<para>The following lists some of the potential impacts that may
affect a
particular storage architecture, and the corresponding
storage hardware,
of a compute-focused OpenStack cloud:
</para>
<itemizedlist>
<listitem>
<para>Connectivity: Based on the storage solution selected, ensure
the connectivity matches the storage solution requirements. If a
centralized storage array is selected, it is important to
determine
how the hypervisors will connect to the storage array.
Connectivity
could affect latency and thus performance, so check
that the network
characteristics will minimize latency to boost
the overall
performance of the design.
</para>
</listitem>
<listitem>
<para>Latency: Determine if the use case will have consistent or
highly variable latency.
</para>
</listitem>
<listitem>
            <para>Throughput: To improve overall performance, make
                sure that the storage solution throughput is
                optimized. While it is not likely that a
                compute-focused cloud will have major data I/O to
                and from storage, this is an important factor to
                consider.
</para>
</listitem>
<listitem>
            <para>Server hardware: If the solution uses DAS, this
                impacts the server hardware choice, which in turn
                ripples into host density, instance density, power
                density, OS-hypervisor selection, and management
                tools.
</para>
</listitem>
</itemizedlist>
    <para>Where instances need to be made highly available, or need
        to be capable of migration between hosts, a shared storage
        file system should be used to store instance ephemeral data,
        ensuring that compute services can run uninterrupted in the
        event of a node failure.
</para>
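    <para>One minimal approach, sketched here with a hypothetical
        NFS server, is to mount the shared file system at the
        default Compute instances directory on every compute
        host:</para>
    <programlisting language="ini"># /etc/fstab entry on each compute host (example server)
nfs-server:/srv/nova /var/lib/nova/instances nfs4 defaults 0 0</programlisting>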
</section>
<section xml:id="selecting-networking-hardware-arch">
<title>Selecting Networking Hardware</title>
<para>Some of the key considerations that should be included in
the
selection of networking hardware include:
</para>
<itemizedlist>
<listitem>
<para>Port count: The design will require networking
hardware that
has the requisite port count.
</para>
</listitem>
<listitem>
            <para>Port density: The network design will be affected
                by the physical space that is required to provide
                the requisite port count. A switch that can provide
                48 10 GbE ports in 1U has a much higher port density
                than a switch that provides 24 10 GbE ports in 2U. A
                higher port density is preferred, as it leaves more
                rack space for compute or storage components that
                might be required by the design. This also leads
                into concerns about fault domains and power density
                that must be considered. Higher-density switches are
                more expensive, so it is important not to overdesign
                the network if it is not required.
</para>
</listitem>
<listitem>
<para>Port speed: The networking hardware must support the
proposed
network speed, for example: 1 GbE, 10 GbE, or
40 GbE (or even 100
GbE).
</para>
</listitem>
<listitem>
<para>Redundancy: The level of network hardware redundancy
required
is influenced by the user requirements for
high availability and
cost considerations. Network
redundancy can be achieved by adding
redundant power
supplies or paired switches. If this is a
requirement,
the hardware will need to support this configuration.
User requirements will determine if a completely
redundant network
infrastructure is required.
</para>
</listitem>
<listitem>
<para>Power requirements: Ensure that the physical data
center
provides the necessary power for the selected
network hardware. This
is not an issue for top of rack
(ToR) switches, but may be an issue
for spine switches
in a leaf and spine fabric, or end of row (EoR)
switches.
</para>
</listitem>
</itemizedlist>
<para>It is important to first understand additional factors as
well as
the use case because these additional factors heavily
influence the
cloud network architecture. Once these key
considerations have been
decided, the proper network can be
designed to best serve the
workloads being placed in the
cloud.
</para>
<para>It is recommended that the network architecture is designed
using a scalable network model that makes it easy to add
capacity and
bandwidth. A good example of such a model is the
        leaf-spine model. In
this type of network design, it is
possible to easily add additional
bandwidth as well as scale
out to additional racks of gear. It is
important to select
network hardware that will support the required
port count,
port speed and port density while also allowing for future
growth as workload demands increase. It is also important to
evaluate
where in the network architecture it is valuable to
provide
redundancy. Increased network availability and
redundancy comes at a
cost, therefore it is recommended to
weigh the cost versus the benefit
gained from utilizing and
deploying redundant network switches and
using bonded
interfaces at the host level.
</para>
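        <para>As a sketch of bonded interfaces at the host level, a
            Linux host using LACP might declare a bond as follows
            (interface names and addressing are example
            values):</para>
        <programlisting language="ini"># /etc/network/interfaces (example): LACP bond of two NICs
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100</programlisting>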
</section>
<section xml:id="software-selection-arch">
<title>Software Selection</title>
        <para>Selecting software to be included in a compute-focused
            OpenStack architecture design involves three main
            areas:
</para>
<itemizedlist>
<listitem>
<para>Operating system (OS) and hypervisor</para>
</listitem>
<listitem>
<para>OpenStack components</para>
</listitem>
<listitem>
<para>Supplemental software</para>
</listitem>
</itemizedlist>
<para>Design decisions made in each of these areas impact the rest
of
the OpenStack architecture design.
</para>
</section>
<section xml:id="os-and-hypervisor-arch">
<title>OS and Hypervisor</title>
<para>The selection of OS and hypervisor has a significant impact
on
the end point design. Selecting a particular operating
system and
hypervisor could affect server hardware selection.
For example, a
selected combination needs to be supported on
the selected hardware.
Ensuring the storage hardware selection
and topology supports the
selected operating system and
hypervisor combination should also be
considered.
Additionally, make sure that the networking hardware
selection
and topology will work with the chosen operating system and
hypervisor combination. For example, if the design uses Link
Aggregation Control Protocol (LACP), the hypervisor needs to
support
it.
</para>
<para>Some areas that could be impacted by the selection of OS and
hypervisor include:
</para>
<itemizedlist>
<listitem>
                <para>Cost: Selecting a commercially supported
                    hypervisor, such as Microsoft Hyper-V, will
                    result in a different cost model than choosing a
                    community-supported open source hypervisor like
                    KVM or Xen. Even within the ranks of open source
                    solutions, choosing Ubuntu over Red Hat (or vice
                    versa) will have an impact on cost due to
                    support contracts. On the other hand, business
                    or application requirements might dictate a
                    specific or commercially supported hypervisor.
</para>
</listitem>
<listitem>
<para>Supportability: Depending on the selected
hypervisor, the staff
should have the appropriate
training and knowledge to support the
selected OS and
hypervisor combination. If they do not, training
will
                    need to be provided, which could have a cost impact on
the
design.
</para>
</listitem>
<listitem>
                <para>Management tools: The management tools used
                    for Ubuntu and KVM differ from the management
                    tools for VMware vSphere. Although both OS and
                    hypervisor combinations are supported by
                    OpenStack, there will be very different impacts
                    to the rest of the design as a result of the
                    selection of one combination versus the other.
</para>
</listitem>
<listitem>
<para>Scale and performance: Ensure that selected OS and
hypervisor
combinations meet the appropriate scale and
performance
requirements. The chosen architecture will
need to meet the targeted
instance-host ratios with
the selected OS-hypervisor combination.
</para>
</listitem>
<listitem>
<para>Security: Ensure that the design can accommodate the
regular
periodic installation of application security
patches while
maintaining the required workloads. The
frequency of security
                    patches for the proposed OS-hypervisor
                    combination will have an
impact on
performance and the patch installation process could
affect maintenance windows.
</para>
</listitem>
<listitem>
                <para>Supported features: Determine which features
                    of OpenStack are required. This will often
                    determine the selection of the OS-hypervisor
                    combination, as certain features are only
                    available with specific OSs or hypervisors. If
                    required features are not available, the design
                    might need to be modified to meet the user
                    requirements.
</para>
</listitem>
<listitem>
                <para>Interoperability: Consideration should be
                    given to the ability of the selected
                    OS-hypervisor combination to interoperate or
                    co-exist with other OS-hypervisor combinations,
                    or with other software solutions in the overall
                    design (if required). Operational and
                    troubleshooting tools for one OS-hypervisor
                    combination may differ from the tools used for
                    another, and, as a result, the design will need
                    to address whether the two sets of tools need to
                    interoperate.
</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="openstack-components-arch">
<title>OpenStack Components</title>
<para>The selection of which OpenStack components will actually be
included in the design and deployed has significant impact.
There are
            certain components that will always be present (Nova and
            Glance, for example), yet there are other services
that might not need to be
present. For example, a certain
design may not require OpenStack Heat.
Omitting Heat would not
typically have a significant impact on the
overall design.
However, if the architecture uses a replacement for
OpenStack
Swift for its storage component, this could potentially have
significant impacts on the rest of the design.
</para>
<para>For a compute-focused OpenStack design architecture, the
following components would be used:
</para>
<itemizedlist>
<listitem>
<para>Identity (Keystone)</para>
</listitem>
<listitem>
<para>Dashboard (Horizon)</para>
</listitem>
<listitem>
<para>Compute (Nova)</para>
</listitem>
<listitem>
<para>Object Storage (Swift, Ceph or a commercial
solution)
</para>
</listitem>
<listitem>
<para>Image (Glance)</para>
</listitem>
<listitem>
<para>Networking (Neutron)</para>
</listitem>
<listitem>
<para>Orchestration (Heat)</para>
</listitem>
</itemizedlist>
        <para>OpenStack Block Storage would potentially not be
            incorporated into a compute-focused design because
            persistent block storage is not a significant
            requirement for the types of workloads deployed onto
            instances running in a compute-focused cloud. However,
            there may be some situations where the need for
            performance dictates that a block storage component be
            used to improve data I/O.
</para>
<para>The exclusion of certain OpenStack components might also
limit or
constrain the functionality of other components. If a
design opts to
include Heat but exclude Ceilometer, then the
design will not be able
to take advantage of Heat's
auto scaling functionality (which relies
on information from
            Ceilometer). Because Heat can be used to spin up a large
            number of instances to perform compute-intensive
            processing, including Heat in a compute-focused
            architecture design is strongly recommended.
</para>
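        <para>As a sketch of this concept, a HOT template can
            declare an auto scaling group of compute workers; the
            flavor and image names below are hypothetical:</para>
        <programlisting language="yaml"># Minimal HOT sketch: scale a pool of compute workers
heat_template_version: 2013-05-23
resources:
  worker_group:
    type: OS::Heat::AutoScalingGroup
    properties:
      min_size: 2
      max_size: 10
      resource:
        type: OS::Nova::Server
        properties:
          flavor: m1.large        # example flavor
          image: compute-worker   # example image name</programlisting>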
</section>
<section xml:id="supplemental-software">
<title>Supplemental Software</title>
<para>While OpenStack is a fairly complete collection of software
projects for building a platform for cloud services, there are
invariably additional pieces of software that might need to be
added
to any given OpenStack design.
</para>
<section xml:id="networking-software-arch">
<title>Networking Software</title>
<para>OpenStack Networking provides a wide variety of networking
services for instances. There are many additional networking
software packages that might be useful to manage the OpenStack
components themselves. Some examples include software to
provide load
balancing, network redundancy protocols, and
routing daemons. Some of
            these software packages are described in more detail in
            the OpenStack High Availability Guide (refer to Chapter
            8).
</para>
<para>For a compute-focused OpenStack cloud, the OpenStack
infrastructure components will need to be highly available. If
the
design does not include hardware load balancing,
networking software
packages like HAProxy will need to be
included.
</para>
</section>
<section xml:id="management-software-arch">
<title>Management Software</title>
<para>The selected supplemental software solution impacts and
affects
the overall OpenStack cloud design. This includes
software for
providing clustering, logging, monitoring and
alerting.
</para>
            <para>Inclusion of clustering software, such as Corosync or
Pacemaker, is determined primarily by the availability design
requirements. Therefore, the impact of including (or not
including)
these software packages is primarily determined by
the availability
of the cloud infrastructure and the
complexity of supporting the
configuration after it is
deployed. The OpenStack High Availability
Guide provides more
details on the installation and configuration of
Corosync and
Pacemaker, should these packages need to be included in
the
design.
</para>
<para>Requirements for logging, monitoring, and alerting are
determined by operational considerations. Each of these
sub-categories includes a number of various options. For
example, in
the logging sub-category one might consider
Logstash, Splunk, Log
Insight, or some other log
aggregation-consolidation tool. Logs
should be stored in a
centralized location to make it easier to
perform analytics
against the data. Log data analytics engines can
also provide
automation and issue notification by providing a
mechanism to
both alert and automatically attempt to remediate some
of the
more commonly known issues.
</para>
<para>If any of these software packages are needed, then the
design
must account for the additional resource consumption
(CPU, RAM,
storage, and network bandwidth for a log
aggregation solution, for
example). Some other potential
design impacts include:
</para>
<itemizedlist>
<listitem>
                    <para>OS-hypervisor combination: Ensure that the
selected logging,
monitoring, or alerting tools
support the proposed OS-hypervisor
combination.
</para>
</listitem>
<listitem>
<para>Network hardware: The network hardware selection
needs to be
supported by the logging, monitoring, and
alerting software.
</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="database-software-arch">
<title>Database Software</title>
<para>A large majority of the OpenStack components require access
to
back-end database services to store state and configuration
information. Selection of an appropriate back-end database
that will
satisfy the availability and fault tolerance
requirements of the
OpenStack services is required. OpenStack
            services support connecting to any database that is
            supported by the SQLAlchemy Python drivers; however, most
            common database deployments make use of MySQL or some
            variation of it. It is recommended that the database that
            provides back-end service within a general purpose cloud
            be made highly available using an available technology
            which can accomplish that goal. Some of the more common
            software solutions used include Galera, MariaDB, and
            MySQL with multi-master replication.
</para>
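            <para>As a simple illustration of the service side of this
            choice, the following is a minimal, hedged Python sketch of
            the kind of SQLAlchemy connection an OpenStack service
            opens; the URL, credentials, and host name are placeholder
            assumptions for a virtual IP fronting a Galera
            cluster.</para>
            <programlisting language="python"># Hedged sketch: an OpenStack service reaches its database through
# SQLAlchemy. The URL is a placeholder; in an HA design it points at
# a virtual IP or load balancer in front of a Galera cluster.
from sqlalchemy import create_engine, text

engine = create_engine("mysql://nova:secret@db-vip.example.com/nova",
                       pool_recycle=3600)  # recycle long-idle connections

with engine.connect() as connection:
    result = connection.execute(text("SELECT 1"))  # trivial liveness check
    print(list(result))
</programlisting>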
</section>
</section>
</section>


@ -0,0 +1,49 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-compute-focus">
<title>Introduction</title>
<para>A compute-focused cloud is a specialized subset of the
general purpose OpenStack cloud architecture. Unlike the
general purpose OpenStack architecture, which is built to host
a wide variety of workloads and applications and does not
heavily tax any particular computing aspect, a compute-focused
        cloud is built and designed specifically to support compute
        intensive workloads, and its design must be tailored
        accordingly.
Compute intensive workloads may be CPU intensive, RAM
intensive, or both. However, they are not typically storage
intensive or network intensive. Compute-focused workloads may
include the following use cases:</para>
<itemizedlist>
<listitem>
<para>High performance computing (HPC)</para>
</listitem>
<listitem>
<para>Big data analytics using Hadoop or other distributed
data stores</para>
</listitem>
<listitem>
<para>Continuous integration/continuous deployment
(CI/CD)</para>
</listitem>
<listitem>
<para>Platform-as-a-Service (PaaS)</para>
</listitem>
<listitem>
<para>Signal processing for Network Function
Virtualization (NFV)</para>
</listitem>
</itemizedlist>
<para>Based on the use case requirements, such clouds might need
to provide additional services such as a virtual machine disk
library, file or object storage, firewalls, load balancers, IP
addresses, and network connectivity in the form of overlays or
virtual Local Area Networks (VLANs). A compute-focused
OpenStack cloud will not typically use raw block storage
services since the applications hosted on a compute-focused
OpenStack cloud generally do not need persistent block
storage.</para>
</section>


@ -0,0 +1,117 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="operational-considerations-compute-focus">
<?dbhtml stop-chunking?>
<title>Operational Considerations</title>
<para>Operationally, there are a number of considerations that
affect the design of compute-focused OpenStack clouds. Some
examples might include enforcing strict API availability
requirements, understanding and dealing with failure
scenarios, or managing host maintenance schedules.</para>
    <para>Service-level agreements (SLAs) are contractual obligations
        that give assurances around availability of a provided
        service. As such, factoring in promises of availability
implies a certain level of redundancy and resiliency when
designing an OpenStack cloud.</para>
<itemizedlist>
<listitem>
<para>Guarantees for API availability imply multiple
                infrastructure services combined with highly
                available load balancers.</para>
</listitem>
<listitem>
<para>Network uptime guarantees will affect the switch
design and might require redundant switching and
power.</para>
</listitem>
<listitem>
<para>Network security policy requirements need to be
factored in to deployments.</para>
</listitem>
</itemizedlist>
<para>Knowing when and where to implement redundancy and high
availability (HA) is directly affected by terms contained in
any associated SLA, if one is present.</para>
<section xml:id="support-and-maintainability-compute-focus">
<title>Support and Maintainability</title>
    <para>OpenStack cloud management requires operations staff to
        understand the design architecture on some level. The level
        of skill and the degree of separation between the operations
        and engineering staff depend on the size and purpose of the
        installation. A large cloud service provider or a telecom
        provider is more inclined to be managed by a specially
        trained, dedicated operations organization. A
smaller implementation is more inclined to rely on a smaller
support staff that might need to take on the combined
engineering, design and operations functions.</para>
    <para>Maintaining OpenStack installations requires a variety of
technical skills. Some of these skills may include the ability
to debug Python log output to a basic level as well as an
understanding of networking concepts.</para>
<para>Consider incorporating features into the architecture and
design that reduce the operational burden. Some examples
include automating some of the operations functions, or
alternatively exploring the possibility of using a third party
management company with special expertise in managing
OpenStack deployments.</para></section>
<section xml:id="montioring-compute-focus"><title>Monitoring</title>
<para>Like any other infrastructure deployment, OpenStack clouds
need an appropriate monitoring platform to ensure errors are
caught and managed appropriately. Consider leveraging any
existing monitoring system to see if it will be able to
effectively monitor an OpenStack environment. While there are
        many aspects that need to be monitored, specific metrics that
        are critically important to capture include image disk
        utilization and response time to the Compute API.</para>
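    <para>As a minimal illustration of such a metric, the following
        hedged Python sketch times a single Compute API request; the
        endpoint URL, tenant ID, and token are placeholder
        assumptions that must match a given deployment.</para>
    <programlisting language="python"># Hedged sketch: time one Compute API call with the requests library.
# NOVA_URL and TOKEN are placeholders, not values from a real cloud.
import time
import requests

NOVA_URL = "http://controller:8774/v2/TENANT_ID/servers"  # assumption
TOKEN = "replace-with-a-valid-keystone-token"             # assumption

start = time.time()
response = requests.get(NOVA_URL, headers={"X-Auth-Token": TOKEN})
elapsed = time.time() - start
print("GET servers: status %d in %.3f seconds"
      % (response.status_code, elapsed))
</programlisting>
    </section>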
<section xml:id="expected-unexpected-server-downtime"><title>Expected and unexpected server downtime</title>
<para>At some point, servers will fail. The SLAs in place affect
how the design has to address recovery time. Recovery of a
failed host may mean restoring instances from a snapshot, or
respawning that instance on another available host, which then
has consequences on the overall application design running on
the OpenStack cloud.</para>
<para>It might be acceptable to design a compute-focused cloud
without the ability to migrate instances from one host to
another, because the expectation is that the application
developer must handle failure within the application itself.
Conversely, a compute-focused cloud might be provisioned to
provide extra resilience as a requirement of that business. In
this scenario, it is expected that extra supporting services
are also deployed, such as shared storage attached to hosts to
aid in recovery and resiliency of services in order to meet
strict SLAs.</para></section>
<section xml:id="capacity-planning-operational"><title>Capacity Planning</title>
<para>Adding extra capacity to an OpenStack cloud is an easy
        horizontal scaling process, as consistently configured nodes
automatically attach to an OpenStack cloud. Be mindful,
however, of any additional work to place the nodes into
appropriate Availability Zones and Host Aggregates if
necessary. The same (or very similar) CPUs are recommended
        when adding extra nodes to the environment because it reduces
        the chance of breaking any live-migration features, if they are
present. Scaling out hypervisor hosts also has a direct effect
on network and other data center resources, so factor in this
increase when reaching rack capacity or when extra network
switches are required.</para>
<para>Compute hosts can also have internal components changed to
account for increases in demand, a process also known as
vertical scaling. Swapping a CPU for one with more cores, or
increasing the memory in a server, can help add extra needed
capacity depending on whether the running applications are
more CPU intensive or memory based (as would be expected in a
compute-focused OpenStack cloud).</para>
<para>Another option is to assess the average workloads and
increase the number of instances that can run within the
compute environment by adjusting the overcommit ratio. While
        only appropriate in some environments, it is important to
        remember that changing the CPU overcommit ratio can have a
        detrimental effect and cause a potential increase in noisy
        neighbor issues. The added risk of increasing the overcommit
        ratio is that more instances will fail when a compute host
        fails. In a compute-focused OpenStack design, increasing the
        CPU overcommit ratio increases the potential for noisy
        neighbor issues and is not recommended.</para></section>
</section>


@ -0,0 +1,128 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="prescriptive-example-compute-focus">
<?dbhtml stop-chunking?>
<title>Prescriptive Examples</title>
    <para>The Conseil Européen pour la Recherche Nucléaire (CERN),
        also known as the European Organization for Nuclear Research,
        provides particle accelerators and other infrastructure for
        high-energy physics research.</para>
    <para>As of 2011, CERN operated two compute centers in Europe,
        with plans to add a third.</para>
    <para>To support a growing number of compute-heavy users of
        experiments related to the Large Hadron Collider (LHC), CERN
        ultimately elected to deploy an OpenStack cloud using
Scientific Linux and RDO. This effort aimed to simplify the
management of the center's compute resources with a view to
doubling compute capacity through the addition of an
additional data center in 2013 while maintaining the same
levels of compute staff.</para>
<para>The CERN solution uses Cells for segregation of compute
resources and to transparently scale between different data
centers. This decision meant trading off support for security
        groups and live migration. In addition, some details, like
        flavors, needed to be manually replicated across cells. In
        spite of these drawbacks, cells were determined to provide the
required scale while exposing a single public API endpoint to
users.</para>
<para>A compute cell was created for each of the two original data
centers and a third was created when a new data center was
added in 2013. Each cell contains three availability zones to
further segregate compute resources and at least three
RabbitMQ message brokers configured to be clustered with
mirrored queues for high availability.</para>
    <para>The API cell, which resides behind an HAProxy load balancer,
is located in the data center in Switzerland and directs API
calls to compute cells using a customized variation of the
cell scheduler. The customizations allow certain workloads to
be directed to a specific data center or "all" data centers
with cell selection determined by cell RAM availability in the
latter case.</para>
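    <para>The following hedged Python sketch illustrates the
        selection policy just described; it is not CERN's code, and
        the cell names and RAM figures are illustrative
        assumptions.</para>
    <programlisting language="python"># Hedged sketch: route a request to a pinned data center if one was
# requested, otherwise to the cell with the most available RAM.
def pick_cell(cells, requested_datacenter=None):
    if requested_datacenter is not None:
        candidates = [c for c in cells
                      if c["datacenter"] == requested_datacenter]
    else:
        candidates = cells
    # Weight purely by free RAM, per the policy described above.
    return max(candidates, key=lambda c: c["free_ram_mb"])

cells = [
    {"name": "cell-a", "datacenter": "CH", "free_ram_mb": 8192},
    {"name": "cell-b", "datacenter": "HU", "free_ram_mb": 16384},
]
print(pick_cell(cells)["name"])        # cell-b (most free RAM)
print(pick_cell(cells, "CH")["name"])  # cell-a (pinned)
</programlisting>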
<mediaobject>
<imageobject>
<imagedata fileref="../images/Generic_CERN_Example.png"/>
</imageobject>
</mediaobject>
<para>There is also some customization of the filter scheduler
that handles placement within the cells:</para>
<itemizedlist>
<listitem>
<para>ImagePropertiesFilter - To provide special handling
depending on the guest operating system in use
(Linux-based or Windows-based).</para>
</listitem>
<listitem>
<para>ProjectsToAggregateFilter - To provide special
handling depending on the project the instance is
associated with.</para>
</listitem>
<listitem>
<para>default_schedule_zones - Allows the selection of
multiple default availability zones, rather than a
single default.</para>
</listitem>
</itemizedlist>
<para>The MySQL database server in each cell is managed by a
central database team and configured in an active/passive
configuration with a NetApp storage back end. Backups are
        performed every six hours.</para>
<section xml:id="network-architecture"><title>Network Architecture</title>
<para>To integrate with existing CERN networking infrastructure
customizations were made to Nova Networking. This was in the
form of a driver to integrate with CERN's existing database
for tracking MAC and IP address assignments.</para>
    <para>The driver considers the compute node that the scheduler
        placed an instance on, and then selects a MAC address and IP
        from the pre-registered list associated with that node in the
        database. The database is then updated to reflect the instance
        to which the addresses were assigned.</para>
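    <para>The following is a minimal, hedged Python sketch of the
        allocation pattern just described; it is not CERN's actual
        driver, and the node names, addresses, and instance IDs are
        illustrative assumptions.</para>
    <programlisting language="python"># Hedged sketch (not the actual CERN driver): allocate a MAC/IP pair
# from a pre-registered pool keyed by compute node, then record the
# owning instance.
pool = {
    "compute-01": [("02:16:3e:00:00:01", "10.0.0.11"),
                   ("02:16:3e:00:00:02", "10.0.0.12")],
}
assignments = {}  # (mac, ip) -> instance identifier

def allocate(node, instance_id):
    """Pop the next free address pair registered for this node."""
    mac, ip = pool[node].pop(0)
    assignments[(mac, ip)] = instance_id  # record the owner
    return mac, ip

print(allocate("compute-01", "instance-0001"))
</programlisting>
    </section>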
<section xml:id="storage-architecture"><title>Storage Architecture</title>
<para>The OpenStack image service is deployed in the API cell and
        configured to expose version 1 (V1) of the API. As a result,
the image registry is also required. The storage back end in
use is a 3 PB Ceph cluster.</para>
<para>A small set of "golden" Scientific Linux 5 and 6 images are
maintained which applications can in turn be placed on using
orchestration tools. Puppet is used for instance configuration
management and customization but Heat deployment is
expected.</para></section>
<section xml:id="monitoring"><title>Monitoring</title>
<para>Although direct billing is not required, OpenStack Telemetry
is used to perform metering for the purposes of adjusting
project quotas. A sharded, replicated, MongoDB back end is
used. To spread API load, instances of the nova-api service
were deployed within the child cells for Telemetry to query
        against. This also meant that some supporting services,
        including keystone, glance-api, and glance-registry, also
        needed to be configured in the child cells.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Generic_CERN_Architecture.png"/>
</imageobject>
</mediaobject>
<para>Additional monitoring tools in use include Flume
(http://flume.apache.org/), Elastic Search, Kibana
(http://www.elasticsearch.org/overview/kibana/), and the CERN
developed Lemon (http://lemon.web.cern.ch/lemon/index.shtml)
project.</para></section>
<section xml:id="references-cern-resources"><title>References</title>
<para>The authors of the Architecture Design Guide would like to
thank CERN for publicly documenting their OpenStack deployment
in these resources, which formed the basis for this
chapter:</para>
<itemizedlist>
<listitem>
<para>http://openstack-in-production.blogspot.fr/</para>
</listitem>
<listitem>
<para>http://www.openstack.org/assets/presentation-media/Deep-Dive-into-the-CERN-Cloud-Infrastructure.pdf</para>
</listitem>
</itemizedlist></section>
</section>


@ -0,0 +1,421 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="technical-considerations-compute-focus">
<?dbhtml stop-chunking?>
<title>Technical Considerations</title>
<para>In a compute-focused OpenStack cloud, the type of instance
workloads being provisioned heavily influences technical
decision making. For example, specific use cases that demand
multiple short running jobs present different requirements
than those that specify long-running jobs, even though both
situations are considered "compute focused."</para>
<para>Public and private clouds require deterministic capacity
planning to support elastic growth in order to meet user SLA
expectations. Deterministic capacity planning is the path to
predicting the effort and expense of making a given process
consistently performant. This process is important because,
when a service becomes a critical part of a user's
infrastructure, the user's fate becomes wedded to the SLAs of
        the cloud itself. In cloud computing, a service's performance
will not be measured by its average speed but rather by the
consistency of its speed.</para>
<para>There are two aspects of capacity planning to consider:
planning the initial deployment footprint, and planning
expansion of it to stay ahead of the demands of cloud
users.</para>
<para>Planning the initial footprint for an OpenStack deployment
is typically done based on existing infrastructure workloads
and estimates based on expected uptake.</para>
<para>The starting point is the core count of the cloud. By
applying relevant ratios, the user can gather information
about:</para>
<itemizedlist>
<listitem>
<para>The number of instances expected to be available
concurrently: (overcommit fraction × cores) / virtual
cores per instance</para>
</listitem>
<listitem>
<para>How much storage is required: flavor disk size ×
number of instances</para>
</listitem>
</itemizedlist>
<para>These ratios can be used to determine the amount of
additional infrastructure needed to support the cloud. For
example, consider a situation in which you require 1600
instances, each with 2 vCPU and 50 GB of storage. Assuming the
default overcommit rate of 16:1, working out the math provides
an equation of:</para>
<itemizedlist>
<listitem>
<para>1600 = (16 x (number of physical cores)) / 2</para>
</listitem>
<listitem>
<para>storage required = 50 GB x 1600</para>
</listitem>
</itemizedlist>
<para>On the surface, the equations reveal the need for 200
physical cores and 80 TB of storage for
/var/lib/nova/instances/. However, it is also important to
look at patterns of usage to estimate the load that the API
services, database servers, and queue servers are likely to
encounter.</para>
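    <para>The sizing arithmetic above is simple enough to capture in
        a few lines; the following Python sketch reproduces the
        worked example, with the figures taken directly from
        it.</para>
    <programlisting language="python"># Sketch of the sizing arithmetic above: 1600 instances of 2 vCPUs
# and 50 GB each, with the default 16:1 CPU overcommit ratio.
instances = 1600
vcpus_per_instance = 2
disk_gb_per_instance = 50
cpu_overcommit = 16.0

physical_cores = instances * vcpus_per_instance / cpu_overcommit
storage_tb = instances * disk_gb_per_instance / 1000.0

print("physical cores: %d" % physical_cores)  # 200
print("storage: %d TB" % storage_tb)          # 80
</programlisting>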
<para>Consider, for example, the differences between a cloud that
        supports a managed web-hosting platform and one running
integration tests for a development project that creates one
instance per code commit. In the former, the heavy work of
creating an instance happens only every few months, whereas
the latter puts constant heavy load on the cloud controller.
The average instance lifetime must be considered, as a larger
number generally means less load on the cloud
controller.</para>
    <para>Aside from the creation and termination of instances, the
        impact of users accessing the service must be considered,
        particularly on nova-api and its associated database. Listing
instances garners a great deal of information and, given the
frequency with which users run this operation, a cloud with a
large number of users can increase the load significantly.
This can even occur unintentionally. For example, the
OpenStack Dashboard instances tab refreshes the list of
instances every 30 seconds, so leaving it open in a browser
window can cause unexpected load.</para>
<para>Consideration of these factors can help determine how many
        cloud controller cores are required. A server with 8 CPU cores
        and 8 GB of RAM would be sufficient for up to a rack of
        compute nodes, given the above caveats.</para>
<para>Key hardware specifications are also crucial to the
performance of user instances. Be sure to consider budget and
performance needs, including storage performance
(spindles/core), memory availability (RAM/core), network
bandwidth (Gbps/core), and overall CPU performance
(CPU/core).</para>
<para>The cloud resource calculator is a useful tool in examining
the impacts of different hardware and instance load outs. It
is available at:</para>
<itemizedlist>
<listitem>
<para>https://github.com/noslzzp/cloud-resource-calculator/blob/master/cloud-resource-calculator.ods</para>
</listitem>
</itemizedlist>
<section xml:id="expansion-planning-compute-focus">
<title>Expansion Planning</title>
<para>A key challenge faced when planning the expansion of cloud
compute services is the elastic nature of cloud infrastructure
demands. Previously, new users or customers would be forced to
plan for and request the infrastructure they required ahead of
time, allowing time for reactive procurement processes. Cloud
computing users have come to expect the agility provided by
having instant access to new resources as they are required.
        Consequently, planning should provide for
        typical usage and, more importantly, for sudden bursts in
        usage.</para>
<para>Planning for expansion can be a delicate balancing act.
Planning too conservatively can lead to unexpected
oversubscription of the cloud and dissatisfied users. Planning
for cloud expansion too aggressively can lead to unexpected
underutilization of the cloud and funds spent on operating
infrastructure that is not being used efficiently.</para>
<para>The key is to carefully monitor the spikes and valleys in
cloud usage over time. The intent is to measure the
consistency with which services can be delivered, not the
        average speed or capacity of the cloud. Using this information
        to model capacity and performance enables users to more
        accurately determine the current and future capacity of the
        cloud.</para></section>
<section xml:id="cpu-and-ram-compute-focus"><title>CPU and RAM</title>
<para>(Adapted from:
http://docs.openstack.org/openstack-ops/content/compute_nodes.html#cpu_choice)</para>
<para>In current generations, CPUs have up to 12 cores. If an
Intel CPU supports Hyper-Threading, those 12 cores are doubled
        to 24 logical cores. If a server is purchased that supports multiple
CPUs, the number of cores is further multiplied.
Hyper-Threading is Intel's proprietary simultaneous
multi-threading implementation, used to improve
parallelization on their CPUs. Consider enabling
Hyper-Threading to improve the performance of multithreaded
applications.</para>
<para>Whether the user should enable Hyper-Threading on a CPU
depends upon the use case. For example, disabling
Hyper-Threading can be beneficial in intense computing
environments. Performance testing conducted by running local
workloads with both Hyper-Threading on and off can help
determine what is more appropriate in any particular
case.</para>
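    <para>As a minimal, hedged sketch of such a test, the following
        Python snippet runs the same CPU-bound task at increasing
        worker counts; comparing the timings with Hyper-Threading
        enabled and disabled (via BIOS or kernel settings) gives a
        rough signal. The task and iteration counts are illustrative
        assumptions.</para>
    <programlisting language="python"># Hedged micro-benchmark sketch: wall-clock time for N parallel
# CPU-bound workers. Run once with SMT on and once with SMT off.
import time
from multiprocessing import Pool

def burn(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        pool = Pool(processes=workers)
        start = time.time()
        pool.map(burn, [2000000] * workers)  # one task per worker
        pool.close()
        pool.join()
        print("%d workers: %.2f seconds" % (workers, time.time() - start))
</programlisting>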
    <para>If the Libvirt/KVM hypervisor driver is the intended
        choice, then the CPUs used in the compute nodes must support
        virtualization by way of the VT-x extensions for Intel chips
        and the AMD-V extensions for AMD chips to provide full
        performance.</para>
<para>OpenStack enables the user to overcommit CPU and RAM on
compute nodes. This allows an increase in the number of
instances running on the cloud at the cost of reducing the
performance of the instances. OpenStack Compute uses the
following ratios by default:</para>
<itemizedlist>
<listitem>
<para>CPU allocation ratio: 16:1</para>
</listitem>
<listitem>
<para>RAM allocation ratio: 1.5:1</para>
</listitem>
</itemizedlist>
<para>The default CPU allocation ratio of 16:1 means that the
scheduler allocates up to 16 virtual cores per physical core.
For example, if a physical node has 12 cores, the scheduler
sees 192 available virtual cores. With typical flavor
definitions of 4 virtual cores per instance, this ratio would
provide 48 instances on a physical node.</para>
<para>Similarly, the default RAM allocation ratio of 1.5:1 means
that the scheduler allocates instances to a physical node as
long as the total amount of RAM associated with the instances
is less than 1.5 times the amount of RAM available on the
physical node.</para>
<para>For example, if a physical node has 48 GB of RAM, the
scheduler allocates instances to that node until the sum of
the RAM associated with the instances reaches 72 GB (such as
nine instances, in the case where each instance has 8 GB of
RAM).</para>
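    <para>The scheduler arithmetic in the two examples above can be
        summarized in a short Python sketch; the node sizes are the
        ones used in the examples.</para>
    <programlisting language="python"># Sketch of the default allocation-ratio arithmetic: 16:1 for CPU
# and 1.5:1 for RAM.
def schedulable_capacity(physical_cores, ram_gb,
                         cpu_ratio=16.0, ram_ratio=1.5):
    """Return the virtual cores and RAM the scheduler will hand out."""
    return physical_cores * cpu_ratio, ram_gb * ram_ratio

vcores, vram = schedulable_capacity(12, 48)
print("virtual cores: %d" % vcores)                    # 192
print("allocatable RAM: %d GB" % vram)                 # 72
print("4-vCPU instances per node: %d" % (vcores / 4))  # 48
</programlisting>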
<para>The appropriate CPU and RAM allocation ratio must be
selected based on particular use cases.</para></section>
<section xml:id="additional-hardware-compute-focus"><title>Additional Hardware</title>
<para>Certain use cases may benefit from exposure to additional
devices on the compute node. Examples might include:</para>
<itemizedlist>
<listitem>
<para>High performance computing jobs that benefit from
the availability of graphics processing units (GPUs)
for general-purpose computing.</para>
</listitem>
</itemizedlist>
<itemizedlist>
<listitem>
<para>Cryptographic routines that benefit from the
availability of hardware random number generators to
avoid entropy starvation.</para>
</listitem>
<listitem>
<para>Database management systems that benefit from the
availability of SSDs for ephemeral storage to maximize
read/write time when it is required.</para>
</listitem>
</itemizedlist>
<para>Host aggregates are used to group hosts that share similar
characteristics, which can include hardware similarities. The
addition of specialized hardware to a cloud deployment is
likely to add to the cost of each node, so careful
consideration must be given to whether all compute nodes, or
just a subset which is targetable using flavors, need the
additional customization to support the desired
workloads.</para></section>
<section xml:id="utilization"><title>Utilization</title>
<para>Infrastructure-as-a-Service offerings, including OpenStack,
use flavors to provide standardized views of virtual machine
resource requirements that simplify the problem of scheduling
instances while making the best use of the available physical
resources.</para>
<para>In order to facilitate packing of virtual machines onto
        physical hosts, the default selection of flavors is
constructed so that the second largest flavor is half the size
of the largest flavor in every dimension. It has half the
vCPUs, half the vRAM, and half the ephemeral disk space. The
next largest flavor is half that size again. As a result,
packing a server for general purpose computing might look
conceptually something like this figure:</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Compute_Tech_Bin_Packing_General1.png"
/>
</imageobject>
</mediaobject>
    <para>On the other hand, a CPU-optimized packed server might look
like the following figure:</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Compute_Tech_Bin_Packing_CPU_optimized1.png"
/>
</imageobject>
</mediaobject>
<para>These default flavors are well suited to typical load outs
for commodity server hardware. To maximize utilization,
however, it may be necessary to customize the flavors or
create new ones, to better align instance sizes to the
available hardware.</para>
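    <para>The halving pattern described above is easy to see in a
        short Python sketch; the starting flavor dimensions below are
        illustrative assumptions, not the stock defaults.</para>
    <programlisting language="python"># Sketch of the halving pattern: each flavor is half the previous one
# in every dimension, so instances bin-pack cleanly onto a host.
def flavor_ladder(vcpus, ram_mb, disk_gb, steps):
    ladder = []
    for _ in range(steps):
        ladder.append({"vcpus": vcpus, "ram_mb": ram_mb,
                       "disk_gb": disk_gb})
        vcpus, ram_mb, disk_gb = vcpus // 2, ram_mb // 2, disk_gb // 2
    return ladder

for flavor in flavor_ladder(16, 65536, 320, 4):
    print(flavor)
</programlisting>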
<para>Workload characteristics may also influence hardware choices
and flavor configuration, particularly where they present
different ratios of CPU versus RAM versus HDD
requirements.</para>
    <para>For more information on flavors, refer to:
http://docs.openstack.org/openstack-ops/content/flavors.html</para>
</section>
<section xml:id="performance-compute-focus"><title>Performance</title>
    <para>The infrastructure of a compute-focused cloud should not be
        shared, so that workloads can consume as many of the
        available resources as they need, and accommodations should
        be made to support large-scale workloads.</para>
<para>The duration of batch processing differs depending on
        individual workloads that are launched. Run times range from
        seconds to minutes to hours; as a result, it is
        difficult to predict when resources will be used, for how
        long, and even which resources will be used.</para>
</section>
<section xml:id="security-compute-focus"><title>Security</title>
<para>The security considerations needed for this scenario are
similar to those of the other scenarios discussed in this
book.</para>
    <para>A security domain comprises users, applications, servers,
or networks that share common trust requirements and
expectations within a system. Typically they have the same
authentication and authorization requirements and
users.</para>
<para>These security domains are:</para>
<orderedlist>
<listitem>
<para>Public</para>
</listitem>
<listitem>
<para>Guest</para>
</listitem>
<listitem>
<para>Management</para>
</listitem>
<listitem>
<para>Data</para>
</listitem>
</orderedlist>
<para>These security domains can be mapped individually to the
installation, or they can also be combined. For example, some
deployment topologies combine both guest and data domains onto
one physical network, whereas in other cases these networks
are physically separated. In each case, the cloud operator
should be aware of the appropriate security concerns. Security
        domains should be mapped out against the specific OpenStack
deployment topology. The domains and their trust requirements
depend upon whether the cloud instance is public, private, or
hybrid.</para>
<para>The public security domain is an entirely untrusted area of
the cloud infrastructure. It can refer to the Internet as a
whole or simply to networks over which the user has no
authority. This domain should always be considered
untrusted.</para>
<para>Typically used for compute instance-to-instance traffic, the
guest security domain handles compute data generated by
instances on the cloud; not services that support the
operation of the cloud, for example API calls. Public cloud
providers and private cloud providers who do not have
stringent controls on instance use or who allow unrestricted
internet access to instances should consider this domain to be
untrusted. Private cloud providers may want to consider this
network as internal and therefore trusted only if they have
controls in place to assert that they trust instances and all
their tenants.</para>
<para>The management security domain is where services interact.
Sometimes referred to as the "control plane", the networks in
this domain transport confidential data such as configuration
parameters, user names, and passwords. In most deployments this
domain is considered trusted.</para>
<para>The data security domain is concerned primarily with
information pertaining to the storage services within
OpenStack. Much of the data that crosses this network has high
integrity and confidentiality requirements and depending on
the type of deployment there may also be strong availability
requirements. The trust level of this network is heavily
dependent on deployment decisions and as such we do not assign
this any default level of trust.</para>
    <para>When deploying OpenStack in an enterprise as a private cloud,
it is assumed to be behind a firewall and within the trusted
network alongside existing systems. Users of the cloud are
typically employees or trusted individuals that are bound by
the security requirements set forth by the company. This tends
to push most of the security domains towards a more trusted
model. However, when deploying OpenStack in a public-facing
role, no assumptions can be made and the attack vectors
significantly increase. For example, the API endpoints and the
        software behind them will be vulnerable to potentially hostile
entities wanting to gain unauthorized access or prevent access
to services. This can result in loss of reputation and must be
protected against through auditing and appropriate
filtering.</para>
    <para>Consideration must be taken when managing the users of the
        system, whether operating public or private clouds. The
        identity service allows LDAP to be part of the
        authentication process, which may ease user management if the
        OpenStack deployment can be integrated into existing
        systems.</para>
<para>It is strongly recommended that the API services are placed
behind hardware that performs SSL termination. API services
transmit user names, passwords, and generated tokens between
client machines and API endpoints and therefore must be
secured.</para>
    <para>More information on OpenStack security can be found
at http://docs.openstack.org/security-guide/</para>
</section>
<section xml:id="openstack-components-compute-focus"><title>OpenStack Components</title>
<para>Due to the nature of the workloads that will be used in this
scenario, a number of components will be highly beneficial in
        a compute-focused cloud. This includes the typical OpenStack
components:</para>
<itemizedlist>
<listitem>
<para>OpenStack Compute (Nova)</para>
</listitem>
<listitem>
<para>OpenStack Image Service (Glance)</para>
</listitem>
<listitem>
<para>OpenStack Identity Service (Keystone)</para>
</listitem>
</itemizedlist>
<para>Also consider several specialized components:</para>
<itemizedlist>
<listitem>
<para>OpenStack Orchestration Engine (Heat)</para>
</listitem>
</itemizedlist>
<para>It is safe to assume that, given the nature of the
applications involved in this scenario, these will be heavily
automated deployments. Making use of Heat will be highly
        beneficial in this case. Deploying a batch of instances and
        running an automated set of tests can be scripted; however,
        it makes sense to use the OpenStack Orchestration Engine (Heat)
        to handle all these actions.</para>
<itemizedlist>
<listitem>
<para>OpenStack Telemetry (Ceilometer)</para>
</listitem>
</itemizedlist>
<para>OpenStack Telemetry and the alarms it generates are required
to support autoscaling of instances using OpenStack
Orchestration. Users that are not using OpenStack
Orchestration do not need to deploy OpenStack Telemetry and
may choose to use other external solutions to fulfill their
metering and monitoring requirements.</para>
<para>See also:
http://docs.openstack.org/openstack-ops/content/logging_monitoring.html</para>
<itemizedlist>
<listitem>
<para>OpenStack Block Storage (Cinder)</para>
</listitem>
</itemizedlist>
    <para>Due to the bursty nature of the workloads and the
        applications and instances that will be used for batch
        processing, this cloud will mainly consume memory and CPU, so
        add-on storage for each instance is not a likely
        requirement. This does not mean the OpenStack Block Storage
        service (Cinder) will not be used in the infrastructure, but
        typically it will not be used as a central component.</para>
<itemizedlist>
<listitem>
<para>Networking</para>
</listitem>
</itemizedlist>
<para>When choosing a networking platform, ensure that it either
works with all desired hypervisor and container technologies
and their OpenStack drivers, or includes an implementation of
an ML2 mechanism driver. Networking platforms that provide ML2
mechanisms drivers can be mixed.</para></section>
</section>


@ -0,0 +1,144 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="user-requirements-compute-focus">
<?dbhtml stop-chunking?>
<title>User Requirements</title>
<para>Compute intensive workloads are defined by their high
utilization of CPU, RAM, or both. User requirements will
determine if a cloud must be built to accommodate anticipated
performance demands.</para>
<itemizedlist>
<listitem>
            <para>Cost: Cost is not generally a primary concern for a
                compute-focused cloud; however, some organizations
                might be concerned with cost avoidance. Repurposing
existing resources to tackle compute-intensive tasks
instead of needing to acquire additional resources may
offer cost reduction opportunities.</para>
</listitem>
<listitem>
<para>Time to Market: Compute-focused clouds can be used
to deliver products more quickly, for example,
speeding up a company's software development life cycle
(SDLC) for building products and applications.</para>
</listitem>
<listitem>
<para>Revenue Opportunity: Companies that are interested
in building services or products that rely on the
power of the compute resources will benefit from a
compute-focused cloud. Examples include the analysis
of large data sets (via Hadoop or Cassandra) or
completing computational intensive tasks such as
rendering, scientific computation, or
simulations.</para>
</listitem>
</itemizedlist>
<section xml:id="legal-requirements-compute-focus"><title>Legal Requirements</title>
<para>Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:</para>
<itemizedlist>
<listitem>
<para>Data retention policies ensuring storage of
persistent data and records management to meet data
archival requirements.</para>
</listitem>
<listitem>
<para>Data ownership policies governing the possession and
responsibility for data.</para>
</listitem>
<listitem>
<para>Data sovereignty policies governing the storage of
data in foreign countries or otherwise separate
jurisdictions.</para>
</listitem>
<listitem>
            <para>Data compliance policies governing where certain
                types of information must reside due to regulatory
                requirements and, more importantly, where it cannot
                reside for the same reason.</para>
</listitem>
</itemizedlist>
<para>Examples of such legal frameworks include the data
protection framework of the European Union
(http://ec.europa.eu/justice/data-protection/ ) and the
requirements of the Financial Industry Regulatory Authority
(http://www.finra.org/Industry/Regulation/FINRARules/ ) in the
United States. Consult a local regulatory body for more
information.</para></section>
<section xml:id="technical-considerations-compute-focus-user"><title>Technical Considerations</title>
<para>The following are some technical requirements that need to
be incorporated into the architecture design.</para>
<itemizedlist>
<listitem>
<para>Performance: If a primary technical concern is for
the environment to deliver high performance
capability, then a compute-focused design is an
obvious choice because it is specifically designed to
host compute-intensive workloads.</para>
</listitem>
<listitem>
<para>Workload persistence: Workloads can be either
short-lived or long running. Short-lived workloads
might include continuous integration and continuous
deployment (CI-CD) jobs, where large numbers of
compute instances are created simultaneously to
perform a set of compute-intensive tasks. The results
or artifacts are then copied from the instance into
long-term storage before the instance is destroyed.
Long-running workloads, like a Hadoop or
high-performance computing (HPC) cluster, typically
ingest large data sets, perform the computational work
on those data sets, then push the results into long
term storage. Unlike short-lived workloads, when the
computational work is completed, they will remain idle
until the next job is pushed to them. Long-running
workloads are often larger and more complex, so the
effort of building them is mitigated by keeping them
active between jobs. Another example of long running
workloads is legacy applications that typically are
persistent over time.</para>
</listitem>
<listitem>
<para>Storage: Workloads targeted for a compute-focused
OpenStack cloud generally do not require any
persistent block storage (although some usages of
Hadoop with HDFS may dictate the use of persistent
block storage). A shared filesystem or object store
will maintain the initial data set(s) and serve as the
destination for saving the computational results. By
avoiding the input-output (IO) overhead, workload
performance is significantly enhanced. Depending on
the size of the data set(s), it might be necessary to
scale the object store or shared file system to match
the storage demand.</para>
</listitem>
<listitem>
<para>User Interface: Like any other cloud architecture, a
compute-focused OpenStack cloud requires an on-demand
and self-service user interface. End users must be
able to provision computing power, storage, networks
and software simply and flexibly. This includes
scaling the infrastructure up to a substantial level
without disrupting host operations.</para>
</listitem>
<listitem>
            <para>Security: Security will be highly dependent
on the business requirements. For example, a
computationally intense drug discovery application
will obviously have much higher security requirements
than a cloud that is designed for processing market
data for a retailer. As a general start, the security
recommendations and guidelines provided in the
OpenStack Security Guide are applicable.</para>
</listitem>
</itemizedlist></section>
<section xml:id="operational-considerations-compute-focus-user"><title>Operational Considerations</title>
    <para>From an operational perspective, a compute-intensive cloud
        is similar to a general-purpose cloud in its requirements.
        More details on operational requirements can be found in the
        general-purpose design section.</para></section>
</section>


@ -0,0 +1,744 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-architecture-overview">
<?dbhtml stop-chunking?>
<title>Architecture</title>
<para>Hardware selection involves three key areas:</para>
<itemizedlist>
<listitem>
<para>Compute</para>
</listitem>
<listitem>
<para>Network</para>
</listitem>
<listitem>
<para>Storage</para>
</listitem>
</itemizedlist>
<para>For each of these areas, the selection of hardware for a
        general purpose OpenStack cloud must reflect the fact that
the cloud has no pre-defined usage model. This means that
there will be a wide variety of applications running on this
cloud that will have varying resource usage requirements. Some
applications will be RAM-intensive, some applications will be
CPU-intensive, while others will be storage-intensive.
Therefore, choosing hardware for a general purpose OpenStack
cloud must provide balanced access to all major
resources.</para>
<para>Certain hardware form factors may be better suited for use
in a general purpose OpenStack cloud because of the need for
an equal or nearly equal balance of resources. Server hardware
for a general purpose OpenStack architecture design must
provide an equal or nearly equal balance of compute capacity
(RAM and CPU), network capacity (number and speed of links),
        and storage capacity (gigabytes or terabytes as well as I/O
        operations per second (IOPS)).</para>
<para>Server hardware is evaluated around four conflicting
dimensions:</para>
<itemizedlist>
<listitem>
<para>Server density: A measure of how many servers can
fit into a given measure of physical space, such as a
rack unit [U].</para>
</listitem>
<listitem>
<para>Resource capacity: The number of CPU cores, how much
RAM, or how much storage a given server will
deliver.</para>
</listitem>
<listitem>
<para>Expandability: The number of additional resources
that can be added to a server before it has reached
its limit.</para>
</listitem>
<listitem>
<para>Cost: The relative purchase price of the hardware
weighted against the level of design effort needed to
build the system.</para>
</listitem>
</itemizedlist>
<para>Increasing server density means sacrificing resource
        capacity or expandability; however, increasing resource
capacity and expandability increases cost and decreases server
density. As a result, determining the best server hardware for
a general purpose OpenStack architecture means understanding
how choice of form factor will impact the rest of the
design.</para>
<itemizedlist>
<listitem>
<para>Blade servers typically support dual-socket
multi-core CPUs, which is the configuration generally
considered to be the "sweet spot" for a general
purpose cloud deployment. Blades also offer
outstanding density. As an example, both HP
BladeSystem and Dell PowerEdge M1000e support up to 16
servers in only 10 rack units. However, the blade
servers themselves often have limited storage and
networking capacity. Additionally, the expandability
of many blade servers can be limited.</para>
</listitem>
<listitem>
<para>1U rack-mounted servers occupy only a single rack
unit. Their benefits include high density, support for
dual-socket multi-core CPUs, and support for
reasonable RAM amounts. This form factor offers
limited storage capacity, limited network capacity,
and limited expandability.</para>
</listitem>
<listitem>
<para>2U rack-mounted servers offer the expanded storage
and networking capacity that 1U servers tend to lack,
but with a corresponding decrease in server density
(half the density offered by 1U rack-mounted
servers).</para>
</listitem>
<listitem>
<para>Larger rack-mounted servers, such as 4U servers,
will tend to offer even greater CPU capacity, often
supporting four or even eight CPU sockets. These
servers often have much greater expandability so will
provide the best option for upgradability. This means,
however, that the servers have a much lower server
density and a much greater hardware cost.</para>
</listitem>
<listitem>
<para>"Sled servers" are rack-mounted servers that support
multiple independent servers in a single 2U or 3U
enclosure. This form factor offers increased density
over typical 1U-2U rack-mounted servers but tends to
suffer from limitations in the amount of storage or
network capacity each individual server
supports.</para>
</listitem>
</itemizedlist>
<para>Given the wide selection of hardware and general user
requirements, the best form factor for the server hardware
supporting a general purpose OpenStack cloud is driven by
outside business and cost factors. No single reference
architecture will apply to all implementations; the decision
must flow out of the user requirements, technical
considerations, and operational considerations. Here are some
of the key factors that influence the selection of server
hardware:</para>
<itemizedlist>
<listitem>
<para>Instance density: Sizing is an important
consideration for a general purpose OpenStack cloud.
The expected or anticipated number of instances that
each hypervisor can host is a common metric used in
sizing the deployment. The selected server hardware
needs to support the expected or anticipated instance
density.</para>
</listitem>
<listitem>
<para>Host density: Physical data centers have limited
physical space, power, and cooling. The number of
hosts (or hypervisors) that can be fitted into a given
metric (rack, rack unit, or floor tile) is another
important method of sizing. Floor weight is an often
overlooked consideration. The data center floor must
be able to support the weight of the proposed number
of hosts within a rack or set of racks. These factors
need to be applied as part of the host density
calculation and server hardware selection.</para>
</listitem>
<listitem>
<para>Power density: Data centers have a specified amount
                of power fed to a given rack or set of racks. Older
                data centers may have a power density as low
                as 20 amps per rack, while more recent data centers
                can be architected to support power densities as high
                as 120 amps per rack. The selected server hardware must
take power density into account.</para>
</listitem>
<listitem>
<para>Network connectivity: The selected server hardware
must have the appropriate number of network
connections, as well as the right type of network
connections, in order to support the proposed
architecture. Ensure that, at a minimum, there are at
least two diverse network connections coming into each
rack. For architectures requiring even more
redundancy, it might be necessary to confirm that the
network connections are from diverse telecom
providers. Many data centers have that capacity
available.</para>
</listitem>
</itemizedlist>
<para>The selection of certain form factors or architectures will
affect the selection of server hardware. For example, if the
        design calls for a scale-out storage architecture (for
example, leveraging Ceph, Gluster, or a similar commercial
solution), then the server hardware selection will need to be
carefully considered to match the requirements set by the
commercial solution. Ensure that the selected server hardware
is configured to support enough storage capacity (or storage
expandability) to match the requirements of selected scale-out
storage solution. For example, if a centralized storage
solution is required, such as a centralized storage array from
        a storage vendor that has InfiniBand or FDDI connections, the
server hardware will need to have appropriate network adapters
installed to be compatible with the storage array vendor's
specifications.</para>
<para>Similarly, the network architecture will have an impact on
the server hardware selection and vice versa. For example,
make sure that the server is configured with enough additional
network ports and expansion cards to support all of the
networks required. There is variability in network expansion
cards, so it is important to be aware of potential impacts or
interoperability issues with other components in the
architecture. This is especially true if the architecture uses
InfiniBand or another less commonly used networking
protocol.</para>
<section xml:id="selecting-storage-hardware">
<title>Selecting Storage Hardware</title>
<para>The selection of storage hardware is largely determined by
the proposed storage architecture. Factors that need to be
incorporated into the storage architecture include:</para>
<itemizedlist>
<listitem>
<para>Cost: Storage can be a significant portion of the
overall system cost that should be factored into the
design decision. For an organization that is concerned
                with vendor support, a commercial storage solution is
                advisable, although it comes with a higher price
                tag. If minimizing initial capital expenditure is the
                priority, designing a system based on commodity
                hardware would apply. The trade-off is potentially
higher support costs and a greater risk of
incompatibility and interoperability issues.</para>
</listitem>
<listitem>
<para>Performance: Storage performance, measured by
observing the latency of storage I-O requests, is not
a critical factor for a general purpose OpenStack
cloud as overall systems performance is not a design
priority.</para>
</listitem>
<listitem>
<para>Scalability: The term "scalability" refers to how
well the storage solution performs as it expands up to
its maximum designed size. A solution that continues
to perform well at maximum expansion is considered
                scalable. A storage solution that performs well in
                small configurations but whose performance degrades as
                it expands is not considered scalable.
Scalability, along with expandability, is a major
consideration in a general purpose OpenStack cloud. It
might be difficult to predict the final intended size
of the implementation because there are no established
usage patterns for a general purpose cloud. Therefore,
it may become necessary to expand the initial
deployment in order to accommodate growth and user
demand. The ability of the storage solution to
continue to perform well as it expands is
important.</para>
</listitem>
<listitem>
<para>Expandability: This refers to the overall ability of
the solution to grow. A storage solution that expands
to 50 PB is considered more expandable than a solution
                that only scales to 10 PB. This metric is related to,
                but different from, scalability, which is a measure of
                the solution's performance as it expands.
                Expandability is a major architecture factor for
                storage solutions in a general purpose OpenStack
                cloud. For example, the storage architecture for a
cloud that is intended for a development platform may
not have the same expandability and scalability
requirements as a cloud that is intended for a
commercial product.</para>
</listitem>
</itemizedlist>
<para>Storage hardware architecture is largely determined by the
selected storage architecture. The selection of storage
architecture, as well as the corresponding storage hardware,
is determined by evaluating possible solutions against the
critical factors, the user requirements, technical
considerations, and operational considerations. A combination
of all the factors and considerations will determine which
approach will be best.</para>
<para>Using a scale-out storage solution with direct-attached
storage (DAS) in the servers is well suited for a general
purpose OpenStack cloud. In this scenario, it is possible to
populate storage in either the compute hosts similar to a grid
computing solution or into hosts dedicated to providing block
        storage exclusively. When deploying storage in the compute
        hosts, appropriate hardware that can support both the storage
        and compute services on the same node will be required.
This approach is referred to as a grid computing architecture
because there is a grid of modules that have both compute and
storage in a single box.</para>
<para>Understanding the requirements of cloud services will help
determine if Ceph, Gluster, or a similar scale-out solution
        should be used. It can then be further determined if a single,
        highly expandable and highly vertically scalable, centralized
        storage array should be included in the design. Once the
        approach has been determined, the storage hardware needs to be
        chosen based on these criteria. If a centralized storage array
fits the requirements best, then the array vendor will
determine the hardware. For cost reasons it may be decided to
build an open source storage array using solutions such as
OpenFiler, Nexenta Open Source, or BackBlaze Open
Source.</para>
<para>This list expands upon the potential impacts for including a
particular storage architecture (and corresponding storage
hardware) into the design for a general purpose OpenStack
cloud:</para>
<itemizedlist>
<listitem>
<para>Connectivity: Ensure that, if storage protocols
other than Ethernet are part of the storage solution,
the appropriate hardware has been selected. Some
examples include InfiniBand, FDDI and Fibre Channel.
If a centralized storage array is selected, ensure
that the hypervisor will be able to connect to that
storage array for image storage.</para>
</listitem>
<listitem>
<para>Usage: How the particular storage architecture will
be used is critical for determining the architecture.
Some of the configurations that will influence the
architecture include whether it will be used by the
hypervisors for ephemeral instance storage or if
OpenStack Swift will use it for object storage. All of
these usage models are affected by the selection of
particular storage architecture and the corresponding
storage hardware to support that architecture.</para>
</listitem>
<listitem>
<para>Instance and image locations: Where instances and
                images will be stored will influence the architecture.
                For example, instances can be stored in a number of
                locations. OpenStack Cinder is a good location for
                instances because it is persistent block storage;
                however, Swift can be used if storage latency is less
                of a concern. The same argument applies to the
appropriate image storage location.</para>
</listitem>
<listitem>
<para>Server Hardware: If the solution is a scale-out
storage architecture that includes DAS, naturally that
will affect the server hardware selection. This could
ripple into the decisions that affect host density,
instance density, power density, OS-hypervisor,
management tools and others.</para>
</listitem>
</itemizedlist>
<para>A general purpose OpenStack cloud has multiple options. As a
result, there is no single decision that will apply to all
implementations. The key factors that will have an influence
on selection of storage hardware for a general purpose
OpenStack cloud are as follows:</para>
<itemizedlist>
<listitem>
<para>Capacity: Hardware resources selected for the
resource nodes should be capable of supporting enough
storage for the cloud services that will use them. It
is important to clearly define the initial
requirements and ensure that the design can support
adding capacity as resources are used in the cloud, as
workloads are relatively unknown. Hardware nodes
selected for object storage should be capable of
supporting a large number of inexpensive disks and
should not have any reliance on RAID controller cards.
Hardware nodes selected for block storage should be
capable of supporting higher speed storage solutions
and RAID controller cards to provide performance and
redundancy to storage at the hardware level. Selecting
hardware RAID controllers that can automatically
repair damaged arrays will further assist with
replacing and repairing degraded or destroyed storage
devices within the cloud.</para>
</listitem>
<listitem>
<para>Performance: Disks selected for the object storage
service do not need to be fast performing disks. It is
recommended that object storage nodes take advantage
of the best cost per terabyte available for storage at
the time of acquisition and avoid enterprise class
drives. In contrast, disks chosen for the block
storage service should take advantage of performance
boosting features and may entail the use of SSDs or
flash storage to provide for high performing block
storage pools. Storage performance of ephemeral disks
used for instances should also be taken into
consideration. If compute pools are expected to have a
high utilization of ephemeral storage or require very
high performance, it would be advantageous to deploy
similar hardware solutions to block storage in order
to increase the storage performance.</para>
</listitem>
<listitem>
<para>Fault Tolerance: Object storage resource nodes have
no requirements for hardware fault tolerance or RAID
controllers. It is not necessary to plan for fault
tolerance within the object storage hardware because
the object storage service provides replication
between zones as a feature of the service. Block
storage nodes, compute nodes and cloud controllers
should all have fault tolerance built in at the
hardware level by making use of hardware RAID
controllers and varying levels of RAID configuration.
The level of RAID chosen should be consistent with the
performance and availability requirements of the
cloud.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="selecting-networking-hardware">
<title>Selecting Networking Hardware</title>
<para>As is the case with storage architecture, selecting a
network architecture often determines which network hardware
will be used. The networking software in use is determined by
the selected networking hardware. Some design impacts are
obvious, for example, selecting networking hardware that only
supports Gigabit Ethernet (GbE) will naturally have an impact
on many different areas of the overall design. Similarly,
deciding to use 10 Gigabit Ethernet (10 GbE) has a number of
impacts on various areas of the overall design.</para>
<para>As an example, selecting Cisco networking hardware implies
that the architecture will be using Cisco networking software
(IOS, NX-OS, etc.). Conversely, selecting Arista networking
hardware means the network devices will use Arista networking
software (EOS). In addition, there are more subtle design
impacts that need to be considered. The selection of certain
networking hardware (and therefore the networking software)
could affect the management tools that can be used. There are
exceptions to this; the rise of "open" networking software
that supports a range of networking hardware means that there
are instances where the relationship between networking
hardware and networking software is not as tightly defined.
An example of this type of software is Cumulus Linux, which is
capable of running on a number of switch vendors' hardware
solutions.</para>
<para>Some of the key considerations that should be included in
the selection of networking hardware include:</para>
<itemizedlist>
<listitem>
<para>Port count: The design will require networking
hardware that has the requisite port count.</para>
</listitem>
<listitem>
<para>Port density: The network design will be affected by
the physical space that is required to provide the
requisite port count. A switch that can provide 48 10
GbE ports in 1U has a much higher port density than a
switch that provides 24 10 GbE ports in 2U. A higher
port density is preferred, as it leaves more rack
space for compute or storage components that may be
required by the design, although it can also raise
concerns about fault domains and power density.
Higher density switches are also more expensive, so
weigh this cost carefully; it is important not to
overdesign the network if it is not required. A
simple sizing sketch follows this list.</para>
</listitem>
<listitem>
<para>Port speed: The networking hardware must support the
proposed network speed, for example: 1 GbE, 10 GbE, or
40 GbE (or even 100 GbE).</para>
</listitem>
<listitem>
<para>Redundancy: The level of network hardware redundancy
required is influenced by the user requirements for
high availability and cost considerations. Network
redundancy can be achieved by adding redundant power
supplies or paired switches. If this is a requirement,
the hardware will need to support this configuration.
User requirements will determine if a completely
redundant network infrastructure is required.</para>
</listitem>
<listitem>
<para>Power requirements: Make sure that the physical data
center provides the necessary power for the selected
network hardware. This is not an issue for top of rack
(ToR) switches, but may be an issue for spine switches
in a leaf and spine fabric, or end of row (EoR)
switches.</para>
</listitem>
</itemizedlist>
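<para>As a rough illustration of the port count and port density
trade-off noted in the list above, the following Python sketch
compares two hypothetical switch options against the ports required
by a rack of hosts. All figures are illustrative assumptions, not
recommendations.</para>
<programlisting language="python"># Hypothetical sizing sketch comparing switch options against the
# ports required by one rack of hosts; all figures are assumptions.
import math

hosts_per_rack = 20    # assumed hosts in one rack
ports_per_host = 2     # assumed redundant pair of 10 GbE links per host
ports_required = hosts_per_rack * ports_per_host

for name, ports, rack_units in [("48-port 1U", 48, 1),
                                ("24-port 2U", 24, 2)]:
    switches = math.ceil(ports_required / ports)
    density = ports / rack_units
    print(f"{name}: {density:.0f} ports/U, {switches} switch(es), "
          f"{switches * rack_units}U of rack space")</programlisting>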
<para>There is no single best practice architecture for the
networking hardware supporting a general purpose OpenStack
cloud that will apply to all implementations. Some of the key
factors that will have a strong influence on selection of
networking hardware include:</para>
<itemizedlist>
<listitem>
<para>Connectivity: All nodes within an OpenStack cloud
require some form of network connectivity. In some
cases, nodes require access to more than one network
segment. The design must encompass sufficient network
capacity and bandwidth to ensure that all
communications within the cloud, both north-south and
east-west traffic, have sufficient resources
available.</para>
</listitem>
<listitem>
<para>Scalability: The chosen network design should
encompass a physical and logical network design that
can be easily expanded upon. Network hardware should
offer the appropriate types of interfaces and speeds
that are required by the hardware nodes.</para>
</listitem>
<listitem>
<para>Availability: To ensure that access to nodes within
the cloud is not interrupted, it is recommended that
the network architecture identify any single points of
failure and provide some level of redundancy or fault
tolerance. With regard to the network infrastructure
itself, this often involves use of networking
protocols such as LACP, VRRP or others to achieve a
highly available network connection. In addition, it
is important to consider the networking implications
on API availability. In order to ensure that the APIs,
and potentially other services in the cloud are highly
available, it is recommended to design load balancing
solutions within the network architecture to
accommodate these requirements.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="software-selection">
<title>Software Selection</title>
<para>Software selection for a general purpose OpenStack
architecture design needs to include these three areas:</para>
<itemizedlist>
<listitem>
<para>Operating system (OS) and hypervisor</para>
</listitem>
<listitem>
<para>OpenStack components</para>
</listitem>
<listitem>
<para>Supplemental software</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="os-and-hypervisor"><title>OS and Hypervisor</title>
<para>The selection of OS and hypervisor has a tremendous impact
on the overall design. Selecting a particular operating system
and hypervisor can also directly affect server hardware
selection. It is recommended to make sure the storage hardware
selection and topology support the selected operating system
and hypervisor combination. Finally, it is important to ensure
that the networking hardware selection and topology will work
with the chosen operating system and hypervisor combination.
For example, if the design uses Link Aggregation Control
Protocol (LACP), the OS and hypervisor both need to support
it.</para>
<para>Some areas that could be impacted by the selection of OS and
hypervisor include:</para>
<itemizedlist>
<listitem>
<para>Cost: Selecting a commercially supported hypervisor,
such as Microsoft Hyper-V, will result in a different
cost model than community-supported open source
hypervisors such as KVM or Xen. When
comparing open source OS solutions, choosing Ubuntu
over Red Hat (or vice versa) will have an impact on
cost due to support contracts. On the other hand,
business or application requirements may dictate a
specific or commercially supported hypervisor.</para>
</listitem>
<listitem>
<para>Supportability: Depending on the selected
hypervisor, the staff should have the appropriate
training and knowledge to support the selected OS and
hypervisor combination. If they do not, training will
need to be provided which could have a cost impact on
the design.</para>
</listitem>
<listitem>
<para>Management tools: The management tools used for
Ubuntu and KVM differ from the management tools
for VMware vSphere. Although both OS and hypervisor
combinations are supported by OpenStack, there will be
very different impacts to the rest of the design as a
result of the selection of one combination versus the
other.</para>
</listitem>
<listitem>
<para>Scale and performance: Ensure that selected OS and
hypervisor combinations meet the appropriate scale and
performance requirements. The chosen architecture will
need to meet the targeted instance-host ratios with
the selected OS-hypervisor combinations.</para>
</listitem>
<listitem>
<para>Security: Ensure that the design can accommodate the
regular periodic installation of application security
patches while maintaining the required workloads. The
frequency of security patches for the proposed
OS-hypervisor combination will have an impact on
performance and the patch installation process could
affect maintenance windows.</para>
</listitem>
<listitem>
<para>Supported features: Determine which features of
OpenStack are required. This will often determine the
selection of the OS-hypervisor combination. Certain
features are only available with specific OSs or
hypervisors. If certain required features are not
available, the design might need to be modified to
meet the user requirements.</para>
</listitem>
<listitem>
<para>Interoperability: Consideration should be given to
the ability of the selected OS-hypervisor combination
to interoperate or co-exist with other OS-hypervisors
as well as other software solutions in the overall
design (if required). Operational troubleshooting
tools for one OS-hypervisor combination may differ
from the tools used for another OS-hypervisor
combination and, as a result, the design will need to
address whether the two sets of tools need to interoperate.
</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="openstack-components">
<title>OpenStack Components</title>
<para>The selection of which OpenStack components are included has
a significant impact on the overall design. While there are
certain components that will always be present (Nova and
Glance, for example), there are other services that may not be
required. As an example, a certain design might not need
OpenStack Heat. Omitting Heat would not have a significant
impact on the overall design of a cloud; however, if the
architecture uses a replacement for OpenStack Swift for its
storage component, it could potentially have significant
impacts on the rest of the design.</para>
<para>The exclusion of certain OpenStack components might also
limit or constrain the functionality of other components. If
the architecture includes Heat but excludes Ceilometer, then
the design will not be able to take advantage of Heat's auto
scaling functionality (which relies on information from
Ceilometer). It is important to research the component
interdependencies in conjunction with the technical
requirements before deciding what components need to be
included and what components can be dropped from the final
architecture.</para>
</section>
<section xml:id="supplemental-components"><title>Supplemental Components</title>
<para>While OpenStack is a fairly complete collection of software
projects for building a platform for cloud services, there are
invariably additional pieces of software that need to be
considered in any given OpenStack design.</para>
</section>
<section xml:id="networking-software"><title>Networking Software</title>
<para>OpenStack Neutron provides a wide variety of networking
services for instances. There are many additional networking
software packages that might be useful to manage the OpenStack
components themselves. Some examples include software to
provide load balancing, network redundancy protocols, and
routing daemons. Some of these software packages are described
in more detail in Chapter 8 of the OpenStack High
Availability Guide.</para>
<para>For a general purpose OpenStack cloud, the OpenStack
infrastructure components will need to be highly available. If
the design does not include hardware load balancing,
networking software packages like HAProxy will need to be
included.</para>
</section>
<section xml:id="management-software"><title>Management Software</title>
<para>The selected supplemental software solution affects
the overall OpenStack cloud design. This includes
software for providing clustering, logging, monitoring and
alerting.</para>
<para>Inclusion of clustering software, such as Corosync or
Pacemaker, is determined primarily by the availability
requirements. Therefore, the impact of including (or not
including) these software packages is primarily determined by
the availability of the cloud infrastructure and the
complexity of supporting the configuration after it is
deployed. The OpenStack High Availability Guide provides more
details on the installation and configuration of Corosync and
Pacemaker, should these packages need to be included in the
design.</para>
<para>Requirements for logging, monitoring, and alerting are
determined by operational considerations. Each of these
sub-categories includes a number of options. For
example, in the logging sub-category one might consider
Logstash, Splunk, VMware Log Insight, or some other log
aggregation-consolidation tool. Logs should be stored in a
centralized location to make it easier to perform analytics
against the data. Log data analytics engines can also provide
automation and issue notification by providing a mechanism to
both alert and automatically attempt to remediate some of the
more commonly known issues.</para>
<para>If any of these software packages are required, then the
design must account for the additional resource consumption
(CPU, RAM, storage, and network bandwidth for a log
aggregation solution, for example). Some other potential
design impacts include:</para>
<itemizedlist>
<listitem>
<para>OS-hypervisor combination: Ensure that the
selected logging, monitoring, or alerting tools
support the proposed OS-hypervisor combination.</para>
</listitem>
<listitem>
<para>Network hardware: The network hardware selection
needs to be supported by the logging, monitoring, and
alerting software.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="database-software"><title>Database Software</title>
<para>A large majority of the OpenStack components require access
to back-end database services to store state and configuration
information. Selection of an appropriate back-end database
that will satisfy the availability and fault tolerance
requirements of the OpenStack services is required. OpenStack
services support connecting to any database that is supported
by the SQLAlchemy Python drivers; however, most common
database deployments make use of MySQL or variations of it. It
is recommended that the database which provides back-end
services within a general purpose cloud be made highly
available using a technology which can accomplish that goal.
Some of the more common software solutions used include
Galera, MariaDB, and MySQL with multi-master
replication.</para>
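<para>As a minimal sketch of this pattern, the following Python
fragment connects an OpenStack-style service to a MySQL-compatible
back end through SQLAlchemy. The virtual IP is assumed to front a
highly available Galera cluster; the hostname, credentials, and
database name are placeholders.</para>
<programlisting language="python"># Minimal sketch: connecting through SQLAlchemy to a MySQL-compatible
# back end. The VIP hostname, credentials, and database name below
# are placeholders, not a prescribed configuration.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://nova:secret@db-vip.example.com/nova",
    pool_recycle=3600,   # recycle connections before the server drops them
    pool_pre_ping=True,  # detect connections broken by a failover
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT VERSION()")).scalar())</programlisting>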
</section>
<section xml:id="addressing-performance-sensitive-workloads"><title>Addressing Performance-Sensitive Workloads</title>
<para>Although one of the key defining factors for a general
purpose OpenStack cloud is that performance is not a
determining factor, there may still be some
performance-sensitive workloads deployed on the general
purpose OpenStack cloud. For design guidance on
performance-sensitive workloads, it is recommended to refer to
the focused scenarios later in this guide. The
resource-focused guides can be used as a supplement to this
guide to help with decisions regarding performance-sensitive
workloads.</para>
</section>
<section xml:id="compute-focused-workloads"><title>Compute-Focused Workloads</title>
<para>In an OpenStack cloud that is compute-focused, there are
some design choices that can help accommodate those workloads.
Compute-focused workloads are generally those that would place
a higher demand on CPU and memory resources with lower
priority given to storage and network performance, other than
what is required to support the intended compute workloads.
For guidance on designing for this type of cloud, please refer
to the section on Compute Focused clouds.</para>
</section>
<section xml:id="network-focused-workloads"><title>Network-Focused Workloads</title>
<para>In a network-focused OpenStack cloud some design choices can
improve the performance of these types of workloads.
Network-focused workloads have extreme demands on network
bandwidth and services that require specialized consideration
and planning. For guidance on designing for this type of
cloud, please refer to the section on Network-Focused clouds.</para>
</section>
<section xml:id="storage-focused-workloads"><title>Storage-Focused Workloads</title>
<para>Storage focused OpenStack clouds need to be designed to
accommodate workloads that have extreme demands on either
object or block storage services that require specialized
consideration and planning. For guidance on designing for this
type of cloud, please refer to the section on Storage-Focused
clouds.</para></section>
</section>


@ -0,0 +1,64 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-general-purpose">
<title>Introduction</title>
<para>An OpenStack general purpose cloud is often considered a
starting point for building a cloud deployment. General
purpose clouds, by their nature, balance the components and do
not emphasize (or heavily emphasize) any particular aspect of
the overall computing environment. The expectation is that the
compute, network, and storage components will be given equal
weight in the design. General purpose clouds can be found in
private, public, and hybrid environments. They lend themselves
to many different use cases but, since they are homogeneous
deployments, they are not suited to specialized environments
or edge case situations. Common uses to consider for a general
purpose cloud could be, but are not limited to, providing a
simple database, a web application runtime environment, a
shared application development platform, or a lab test bed. In
other words, any use case that would benefit from a scale-out
rather than a scale-up approach is a good candidate for a
general purpose cloud architecture.</para>
<para>A general purpose cloud, by definition, is something that is
designed to have a range of potential uses or functions; not
specialized for a specific use. General purpose architecture
is largely considered a scenario that would address 80% of the
potential use cases. The infrastructure, in itself, is a
specific use case. It is also a good place to start the design
process. As the most basic cloud service model, general
purpose clouds are designed to be platforms suited for general
purpose applications.</para>
<para>General purpose clouds are limited to the most basic
components, but they can include additional resources such
as:</para>
<itemizedlist>
<listitem>
<para>Virtual-machine disk image library</para>
</listitem>
<listitem>
<para>Raw block storage</para>
</listitem>
<listitem>
<para>File or object storage</para>
</listitem>
<listitem>
<para>Firewalls</para>
</listitem>
<listitem>
<para>Load balancers</para>
</listitem>
<listitem>
<para>IP addresses</para>
</listitem>
<listitem>
<para>Network overlays or virtual local area networks
(VLANs)</para>
</listitem>
<listitem>
<para>Software bundles</para>
</listitem>
</itemizedlist>
</section>


@ -0,0 +1,143 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="operational-considerations-general-purpose">
<?dbhtml stop-chunking?>
<title>Operational Considerations</title>
<para>Many operational factors will affect general purpose cloud
design choices. In larger installations, it is not uncommon
for operations staff to be tasked with maintaining cloud
environments. This differs from the operations staff that is
responsible for building or designing the infrastructure. It
is important to include the operations function in the
planning and design phases of the build out.</para>
<para>Service Level Agreements (SLAs) are contractual obligations
that provide assurances for service availability. SLAs define
levels of availability that drive the technical design, often
with penalties for not meeting the contractual obligations.
The strictness of the SLA dictates the level of redundancy and
resiliency in the OpenStack cloud design. Knowing when and
where to implement redundancy and HA is directly affected by
expectations set by the terms of the SLA. Some of the SLA
terms that will affect the design include:</para>
<itemizedlist>
<listitem>
<para>Guarantees for API availability imply multiple
infrastructure services combined with highly available
load balancers.</para>
</listitem>
<listitem>
<para>Network uptime guarantees will affect the switch
design and might require redundant switching and
power.</para>
</listitem>
<listitem>
<para>Network security policy requirements need to be
factored into deployments.</para>
</listitem>
</itemizedlist>
<section xml:id="support-and-maintainability-general-purpose"><title>Support and Maintainability</title>
<para>OpenStack cloud management requires operations staff to
understand the design architecture content on some level.
The level of skills and the level of separation
of the operations and engineering staff are dependent on the
size and purpose of the installation. A large cloud service
provider or a telecom provider is more likely to be managed by
a specially trained, dedicated operations organization. A
smaller implementation is more likely to rely on a smaller
support staff that might need to take on the combined
engineering, design and operations functions.</para>
<para>Furthermore, maintaining OpenStack installations requires a
variety of technical skills. Some of these skills may include
the ability to debug Python log output to a basic level and an
understanding of networking concepts.</para>
<para>Consider incorporating features into the architecture and
design that reduce the operations burden. This is accomplished
by automating some of the operations functions. In some cases
it may be beneficial to use a third party management company
with special expertise in managing OpenStack
deployments.</para></section>
<section xml:id="monitoring-general-purpose"><title>Monitoring</title>
<para>Like any other infrastructure deployment, OpenStack clouds
need an appropriate monitoring platform to ensure any errors
are caught and managed appropriately. Consider leveraging any
existing monitoring system to see if it will be able to
effectively monitor an OpenStack environment. While there are
many aspects that need to be monitored, specific metrics that
are critically important to capture include image disk
utilization and response time to the Compute API.</para>
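<para>The following is an illustrative sketch of capturing one such
metric, Compute API response time, with the standard Python requests
library. The endpoint URL is a placeholder, and a real deployment
would authenticate through the Identity service first.</para>
<programlisting language="python"># Illustrative check of Compute API response time. The endpoint URL
# is a placeholder; a production check would authenticate first and
# feed the result into the existing monitoring platform.
import time
import requests

NOVA_ENDPOINT = "http://controller.example.com:8774/"  # placeholder

start = time.monotonic()
response = requests.get(NOVA_ENDPOINT, timeout=5)
elapsed = time.monotonic() - start

print(f"HTTP {response.status_code} in {elapsed * 1000:.1f} ms")
# Alert when this measurement exceeds the agreed SLA threshold.</programlisting></section>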
<section xml:id="downtime-general-purpose"><title>Downtime</title>
<para>No matter how robust the architecture is, at some point
components will fail. Designing for high availability (HA) can
have significant cost ramifications, therefore the resiliency
of the overall system and the individual components is going
to be dictated by the requirements of the SLA. Downtime
planning includes creating processes and architectures that
support planned (maintenance) and unplanned (system faults)
downtime.</para>
<para>An example of an operational consideration is the recovery
of a failed compute host. This might mean requiring the
restoration of instances from a snapshot or respawning an
instance on another available compute host. This could have
consequences on the overall application design. A general
purpose cloud should not need to provide an ability to migrate
instances from one host to another. However, if an
application is not designed to tolerate failure,
additional considerations need to be made around supporting
instance migration. In this scenario, extra supporting
services, including shared storage attached to compute hosts,
might need to be deployed.</para></section>
<section xml:id="capacity-planning"><title>Capacity Planning</title>
<para>Capacity planning for future growth is a critically
important and often overlooked consideration. Capacity
constraints in a general purpose cloud environment include
compute and storage limits. There is a relationship between
the size of the compute environment and the supporting
OpenStack infrastructure controller nodes required to support
it. As the size of the supporting compute environment
increases, the network traffic and messages will increase
which will add load to the controller or networking nodes.
While no hard and fast rule exists, effective monitoring of
the environment will help with capacity decisions on when to
scale the back-end infrastructure as part of the scaling of
the compute resources.</para>
<para>Adding extra compute capacity to an OpenStack cloud is a
horizontally scaling process as consistently configured
compute nodes automatically attach to an OpenStack cloud. Be
mindful of any additional work that is needed to place the
nodes into appropriate availability zones and host aggregates.
Make sure to use identical or functionally compatible CPUs
when adding additional compute nodes to the environment;
otherwise, live migration features will break. Scaling out
compute hosts directly affects network and other
data center resources, so it may be necessary to add rack
capacity or network switches.</para>
<para>Another option is to assess the average workloads and
increase the number of instances that can run within the
compute environment by adjusting the overcommit ratio. While
only appropriate in some environments, it is important to
remember that changing the CPU overcommit ratio can have a
detrimental effect and cause a potential increase in
noisy-neighbor problems. The added risk of increasing the
overcommit ratio is that more instances will fail when a
compute host fails.</para>
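<para>The following back-of-the-envelope sketch, using purely
illustrative numbers, shows this trade-off: a higher CPU overcommit
ratio packs more instances onto each host, but also increases the
number of instances lost when a single host fails.</para>
<programlisting language="python"># Back-of-the-envelope view of the CPU overcommit trade-off;
# all inputs are illustrative assumptions.
threads_per_host = 24      # assumed CPU threads on one compute host
vcpus_per_instance = 2

for ratio in (4, 8, 16):
    instances = (threads_per_host * ratio) // vcpus_per_instance
    print(f"{ratio}:1 overcommit -> {instances} instances per host; "
          f"all of them are lost if that host fails")</programlisting>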
<para>Compute host components can also be upgraded to account for
increases in demand; this is known as vertical scaling.
Upgrading CPUs with more cores, or increasing the overall
server memory, can add extra needed capacity depending on
whether the running applications are more CPU intensive or
memory intensive.</para>
<para>Insufficient disk capacity could also have a negative effect
on overall performance including CPU and memory usage.
Depending on the back-end architecture of the OpenStack Block
Storage layer, adding capacity might involve adding disk shelves
to enterprise storage systems or installing additional block
storage nodes. It may also be necessary to upgrade directly
attached storage installed in compute hosts or add capacity to
the shared storage to provide additional ephemeral storage to
instances.</para>
<para>For a deeper discussion on many of these topics, refer to
the OpenStack Operations Guide at
http://docs.openstack.org/ops.</para></section>
</section>


@ -0,0 +1,100 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="prescriptive-example-online-classifieds">
<?dbhtml stop-chunking?>
<title>Prescriptive Example</title>
<para>An online classified advertising company wants to run web applications
consisting of Tomcat, Nginx and MariaDB in a private cloud. In order to
meet policy requirements, the cloud infrastructure will run in their own
data center. They have predictable load requirements but require an
element of scaling to cope with nightly increases in demand. Their
current environment is not flexible enough to align with their goal of
running an open source, API-driven environment. Their current environment
consists of the following:</para>
<itemizedlist>
<listitem>
<para>Between 120 and 140 installations of Nginx and
Tomcat, each with 2 vCPUs and 4 GB of RAM</para>
</listitem>
<listitem>
<para>A three-node MariaDB and Galera cluster, each with 4
vCPUs and 8 GB RAM</para>
</listitem>
</itemizedlist>
<para>The company runs hardware load balancers and multiple web
applications serving the sites. The company orchestrates their
environment using a combination of scripts and Puppet. The
websites generate a large amount of log data each day that
needs to be archived.</para>
<para>The solution would consist of the following OpenStack
components:</para>
<itemizedlist>
<listitem>
<para>A firewall, switches and load balancers on the
public facing network connections.</para>
</listitem>
<listitem>
<para>OpenStack Controller services running Image,
Identity, Networking and supporting services such as
MariaDB and RabbitMQ. The controllers will run in a
highly available configuration on at least three
controller nodes.</para>
</listitem>
<listitem>
<para>OpenStack Compute nodes running the KVM
hypervisor.</para>
</listitem>
<listitem>
<para>OpenStack Block Storage for use by compute instances
that require persistent storage such as databases for
dynamic sites.</para>
</listitem>
<listitem>
<para>OpenStack Object Storage for serving static objects
such as images.</para>
</listitem>
</itemizedlist>
<para><inlinemediaobject><imageobject><imagedata
fileref="../images/General_Architecture3.png"
/></imageobject></inlinemediaobject>Running up to 140
web instances and the small number of MariaDB instances
requires 292 vCPUs available, as well as 584 GB RAM. On a
typical 1U server using dual-socket hex-core Intel CPUs with
Hyperthreading, and assuming 2:1 CPU overcommit ratio, this
would require 8 OpenStack Compute nodes.</para>
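<para>The sizing arithmetic can be reconstructed roughly as follows;
the addition of one spare node for failure headroom is an assumption
made here to reconcile the rounded-up node count.</para>
<programlisting language="python"># Rough reconstruction of the sizing arithmetic above. The one
# spare node added for failure headroom is an assumption.
import math

web_count, web_vcpus, web_ram_gb = 140, 2, 4
db_count, db_vcpus, db_ram_gb = 3, 4, 8

total_vcpus = web_count * web_vcpus + db_count * db_vcpus     # 292
total_ram_gb = web_count * web_ram_gb + db_count * db_ram_gb  # 584

threads_per_node = 2 * 6 * 2           # dual-socket, hex-core, hyperthreaded
vcpus_per_node = threads_per_node * 2  # 2:1 CPU overcommit ratio

nodes = math.ceil(total_vcpus / vcpus_per_node) + 1  # +1 spare node
print(f"{total_vcpus} vCPUs and {total_ram_gb} GB RAM -> {nodes} nodes")</programlisting>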
<para>The web application instances run from local storage on each
of the OpenStack Compute nodes. The web application instances
are stateless, meaning that any of the instances can fail and
the application will continue to function.</para>
<para>MariaDB server instances store their data on shared
enterprise storage, such as NetApp or SolidFire devices. If a
MariaDB instance fails, storage would be expected to be
re-attached to another instance and rejoined to the Galera
cluster.</para>
<para>Logs from the web application servers are shipped to
OpenStack Object Storage for later processing and
archiving.</para>
<para>In this scenario, additional capabilities can be realized by
moving static web content to be served from OpenStack Object
Storage containers, and backing the OpenStack Image Service
with OpenStack Object Storage. Note that an increase in
OpenStack Object Storage means that network bandwidth needs to
be taken into consideration. It is best to run OpenStack
Object Storage with network connections offering 10 GbE or
better connectivity.</para>
<para>There is also a potential to leverage the Orchestration and
Telemetry OpenStack modules to provide an auto-scaling,
orchestrated web application environment. Defining the web
applications in Heat Orchestration Templates (HOT) would
negate the reliance on the scripted Puppet solution currently
employed.</para>
<para>OpenStack Networking can be used to control hardware load
balancers through the use of plug-ins and the Networking API.
This would allow a user to control hardware load balancer pools
and instances as members in these pools, but their use in
production environments must be carefully weighed against
current stability.</para>
</section>


@ -0,0 +1,715 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="technical-considerations-general-purpose">
<?dbhtml stop-chunking?>
<title>Technical Considerations</title>
<para>When designing a general purpose cloud, there is an implied
requirement to design for all of the base services generally
associated with providing Infrastructure-as-a-Service:
compute, network and storage. Each of these services has
different resource requirements. As a result, it is important
to make design decisions relating directly to the service
currently under design, while providing a balanced
infrastructure that provides for all services.</para>
<para>When designing an OpenStack cloud as a general purpose
cloud, the hardware selection process can be lengthy and
involved due to the sheer number of services which need to be
designed and the unique characteristics and requirements of
each service within the cloud. Hardware designs need to be
generated for each type of resource pool; specifically,
compute, network, and storage. In addition to the hardware
designs, which affect the resource nodes themselves, there are
also a number of additional hardware decisions to be made
related to network architecture and facilities planning. These
factors play heavily into the overall architecture of an
OpenStack cloud.</para>
<section xml:id="designing-compute-resources-tech-considerations">
<title>Designing Compute Resources</title>
<para>It is recommended to design compute resources as pools of
resources which will be addressed on-demand. When designing
compute resource pools, a number of factors impact your design
decisions. For example, decisions related to processors,
memory, and storage within each hypervisor are just one
element of designing compute resources. In addition, it is
necessary to decide whether compute resources will be provided
in a single pool or in multiple pools.</para>
<para>To design for the best use of available resources by
applications running in the cloud, it is recommended to design
more than one compute resource pool. Each independent resource
pool should be designed to provide service for specific
flavors of instances or groupings of flavors. For the purpose
of this book, "instance" refers to a virtual machine and the
operating system running on the virtual machine. Designing
multiple resource pools helps to ensure that, as instances are
scheduled onto compute hypervisors, each independent node's
resources will be allocated in a way that makes the most
efficient use of available hardware. This is commonly referred
to as bin packing.</para>
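<para>The following minimal first-fit-decreasing sketch illustrates
the bin packing idea. The actual OpenStack scheduler is far more
sophisticated; this only shows why identically sized hosts make
efficient packing easier to reason about.</para>
<programlisting language="python"># Minimal first-fit-decreasing sketch of bin packing: instances
# (by vCPU demand) are packed onto identically sized hosts. This is
# an illustration only, not the Nova scheduler's algorithm.
def pack(instance_vcpus, host_capacity):
    hosts = []  # each entry is the remaining free vCPUs on one host
    for demand in sorted(instance_vcpus, reverse=True):
        for i, free in enumerate(hosts):
            if free >= demand:
                hosts[i] -= demand
                break
        else:
            hosts.append(host_capacity - demand)  # open a new host
    return len(hosts)

print(pack([8, 4, 4, 2, 2, 2, 1, 1], 16))  # -> 2 hosts</programlisting>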
<para>Using a consistent hardware design among the nodes that are
placed within a resource pool also helps support bin packing.
Hardware nodes selected for being a part of a compute resource
pool should share a common processor, memory, and storage
layout. By choosing a common hardware design, it becomes
easier to deploy, support and maintain those nodes throughout
their life cycle in the cloud.</para>
<para>OpenStack provides the ability to configure the overcommit
ratio (the ratio of virtual resources available for allocation
to physical resources present) for both CPU and memory. The
default CPU overcommit ratio is 16:1 and the default memory
overcommit ratio is 1.5:1. Determine the tuning of the
overcommit ratios for both of these options during the design
phase, as this has a direct impact on the hardware layout of
your compute nodes.</para>
<para>As an example, consider that an m1.small instance uses 1
vCPU, 20 GB of ephemeral storage and 2,048 MB of RAM. When
designing a hardware node as a compute resource pool to
service instances, take into consideration the number of
processor cores available on the node as well as the required
disk and memory to service instances running at capacity. For
a server with 2 CPUs of 10 cores each, with hyperthreading
turned on, the default CPU overcommit ratio of 16:1 would
allow for 640 (2 x 10 x 2 x 16) total m1.small instances. By
the same reasoning, using the default memory overcommit ratio
of 1.5:1 you can determine that the server will need at least
853 GB (640 x 2,048 MB / 1.5) of RAM. When sizing nodes for
memory, it is also important to consider the additional memory
required to service operating system and service needs.</para>
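<para>The calculation above can be written out as a short worked
example:</para>
<programlisting language="python"># Worked form of the example above, using the default ratios.
sockets, cores, threads = 2, 10, 2   # node hardware from the text
cpu_ratio, ram_ratio = 16, 1.5       # default overcommit ratios

instances = sockets * cores * threads * cpu_ratio  # 640 m1.small
ram_mb = instances * 2048 / ram_ratio              # m1.small = 2,048 MB

print(f"{instances} m1.small instances per node")
print(f"at least {ram_mb / 1024:.0f} GB of RAM required")  # ~853 GB</programlisting>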
<para>Processor selection is an extremely important consideration
in hardware design, especially when comparing the features and
performance characteristics of different processors. Some
newly released processors include features specific to
virtualized compute hosts including hardware assisted
virtualization and technology related to memory paging (also
known as EPT shadowing). These features have a tremendous
positive impact on the performance of virtual machines running
in the cloud.</para>
<para>In addition to the impact on actual compute services, it is
also important to consider the compute requirements of
resource nodes within the cloud. "Resource nodes" refers to
non-hypervisor nodes providing controller, object storage,
block storage, or networking services in the cloud. The number
of processor cores and threads has a direct correlation to the
number of worker threads which can be run on a resource node.
It is important to ensure sufficient compute capacity and
memory is planned on resource nodes.</para>
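<para>A common sizing heuristic, presented here as an assumption
rather than an OpenStack default, is to run roughly one worker
process per core with a floor for small machines:</para>
<programlisting language="python"># Heuristic sketch (an assumption, not an OpenStack default):
# roughly one worker process per core, with a minimum floor.
import os

def worker_count(minimum=4):
    return max(minimum, os.cpu_count() or minimum)

print(f"suggested worker processes: {worker_count()}")</programlisting>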
<para>Workload profiles are unpredictable in a general purpose
cloud, so it may be difficult to design with every specific use
case in mind. Additional compute resource pools can be added
to the cloud at a later time, so this unpredictability should
not be a problem. In some cases, the demand on certain
instance types or flavors may not justify an individual
hardware design. In either of these cases, start by providing
hardware designs which will be capable of servicing the most
common instance requests first, looking to add additional
hardware designs to the overall architecture in the form of
new hardware node designs and resource pools as they become
justified at a later time.</para></section>
<section xml:id="designing-network-resources-tech-considerations">
<title>Designing Network Resources</title>
<para>An OpenStack cloud traditionally has multiple network
segments, each of which provides access to resources within
the cloud to both operators and tenants. In addition, the
network services themselves also require network communication
paths which should also be separated from the other networks.
When designing network services for a general purpose cloud,
it is recommended to plan for either a physical or logical
separation of network segments which will be used by operators
and tenants. It is further suggested to create an additional
network segment for access to internal services such as the
message bus and database used by the various cloud services.
Segregating these services onto separate networks helps to
protect sensitive data and also protects against unauthorized
access to services.</para>
<para>Based on the requirements of instances being serviced in the
cloud, the next design choice which will affect your design is
the choice of network service which will be used to service
instances in the cloud. The choice between nova-network, as a
part of OpenStack Compute Service, and Neutron, the OpenStack
Networking Service, has tremendous implications and will have
a huge impact on the architecture and design of the cloud
network infrastructure.</para>
<para>The nova-network service is primarily a layer 2 networking
service which has two main modes in which it will function.
The difference between the two modes in nova-network pertain
to whether or not nova-network uses VLANs. When using
nova-network in a flat network mode, all network hardware
nodes and devices throughout the cloud are connected to a
single layer 2 network segment which provides access to
application data.</para>
<para>When the network devices in the cloud support segmentation
using VLANs, nova-network can operate in the second mode. In
this design model, each tenant within the cloud is assigned a
network subnet which is mapped to a VLAN on the physical
network. It is especially important to remember the maximum
number of 4096 VLANs which can be used within a spanning tree
domain. These limitations place hard limits on the amount of
growth possible within the data center. When designing a
general purpose cloud intended to support multiple tenants, it
is especially recommended to use nova-network with VLANs, and
not in flat network mode.</para>
<para>Another network consideration is that
nova-network is entirely managed by the cloud operator;
tenants do not have control over network resources. If tenants
require the ability to manage and create network resources
such as network segments and subnets, it will be necessary to
install the OpenStack Networking Service to provide network
access to instances.</para>
<para>The OpenStack Networking Service is a first class networking
service that gives full control over creation of virtual
network resources to tenants. This is often accomplished in
the form of tunneling protocols which will establish
encapsulated communication paths over existing network
infrastructure in order to segment tenant traffic. These
methods vary depending on the specific implementation, but
some of the more common methods include tunneling over GRE,
encapsulating with VXLAN, and VLAN tags.</para>
<para>Initially, it is suggested to design at least three network
segments, the first of which will be used for access to the
cloud's REST APIs by tenants and operators. This is generally
referred to as a public network. In most cases, the controller
nodes and swift proxies within the cloud will be the only
devices necessary to connect to this network segment. In some
cases, this network might also be serviced by hardware load
balancers and other network devices.</para>
<para>The next segment is used by cloud administrators to manage
hardware resources and is also used by configuration
management tools when deploying software and services onto new
hardware. In some cases, this network segment might also be
used for internal services, including the message bus and
database services, to communicate with each other. Due to the
sensitive nature of this network segment, it may be
desirable to secure this network from unauthorized access.
This network will likely need to communicate with every
hardware node within the cloud.</para>
<para>The last network segment is used by applications and
consumers to provide access to the physical network and also
for users accessing applications running within the cloud.
This network is generally segregated from the one used to
access the cloud APIs and is not capable of communicating
directly with the hardware resources in the cloud. Compute
resource nodes will need to communicate on this network
segment, as will any network gateway services which allow
application data to access the physical network outside of the
cloud.</para></section>
<section xml:id="designing-storage-resources-tech-considerations"><title>Designing Storage Resources</title>
<para>OpenStack has two independent storage services to consider,
each with its own specific design requirements and goals. In
addition to services which provide storage as their primary
function, there are additional design considerations with
regard to compute and controller nodes which will affect the
overall cloud architecture.</para></section>
<section xml:id="designing-openstack-object-storage-tech-considerations">
<title>Designing OpenStack Object Storage</title>
<para>When designing hardware resources for OpenStack Object
Storage, the primary goal is to maximize the amount of storage
in each resource node while also ensuring that the cost per
terabyte is kept to a minimum. This often involves utilizing
servers which can hold a large number of spinning disks.
Whether choosing to use 2U server form factors with directly
attached storage or an external chassis that holds a larger
number of drives, the main goal is to maximize the storage
available in each node.</para>
<para>It is not recommended to invest in enterprise class drives
for an OpenStack Object Storage cluster. The consistency and
partition tolerance characteristics of OpenStack Object
Storage will ensure that data stays up to date and survives
hardware faults without the use of any specialized data
replication devices.</para>
<para>A great benefit of OpenStack Object Storage is the ability
to mix and match drives by utilizing weighting within the
swift ring. When designing your swift storage cluster, it is
recommended to make use of the most cost effective storage
solution available at the time. Many server chassis on the
market can hold 60 or more drives in 4U of rack space,
therefore it is recommended to maximize the amount of storage
per rack unit at the best cost per terabyte. Furthermore, the
use of RAID controllers is not recommended in an object
storage node.</para>
<para>In order to achieve this durability and availability of data
stored as objects, it is important to design object storage
resource pools in a way that provides the suggested
availability that the service can provide. Beyond designing at
the hardware node level, it is important to consider
rack-level and zone-level designs to accommodate the number of
replicas configured to be stored in the Object Storage service
(the default number of replicas is three). Each replica of
data should exist in its own availability zone with its own
power, cooling, and network resources available to service
that specific zone.</para>
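<para>A simple usable-capacity calculation follows from the replica
count; the zone, node, and disk figures below are illustrative
assumptions.</para>
<programlisting language="python"># Usable-capacity arithmetic for an object storage cluster with
# three replicas; all counts and sizes are illustrative assumptions.
zones = 3
nodes_per_zone = 4
disks_per_node = 60   # dense 4U chassis, as described above
disk_tb = 4
replicas = 3          # the default replica count

raw_tb = zones * nodes_per_zone * disks_per_node * disk_tb
usable_tb = raw_tb / replicas
print(f"{raw_tb} TB raw -> {usable_tb:.0f} TB usable at {replicas} replicas")</programlisting>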
<para>Object storage nodes should be designed so that the number
of requests does not hinder the performance of the cluster.
The object storage service uses a chatty protocol; therefore,
making use of multiple processors that have higher core counts
will ensure the IO requests do not inundate the server.</para></section>
<section xml:id="designing-openstack-block-storage"><title>Designing OpenStack Block Storage</title>
<para>When designing OpenStack Block Storage resource nodes, it is
helpful to understand the workloads and requirements that will
drive the use of block storage in the cloud. In a general
purpose cloud these use patterns are often unknown. It is
recommended to design block storage pools so that tenants can
choose the appropriate storage solution for their
applications. By creating multiple storage pools of different
types, in conjunction with configuring an advanced storage
scheduler for the block storage service, it is possible to
provide tenants with a large catalog of storage services with
a variety of performance levels and redundancy options.</para>
<para>In addition to directly attached storage populated in
servers, block storage can also take advantage of a number of
enterprise storage solutions. These are addressed via a plug-in
driver developed by the hardware vendor. A large number of
enterprise storage plug-in drivers ship out-of-the-box with
OpenStack Block Storage (and many more available via third
party channels). While a general purpose cloud would likely
use directly attached storage in the majority of block storage
nodes, it may also be necessary to provide additional levels
of service to tenants which can only be provided by enterprise
class storage solutions.</para>
<para>The determination to use a RAID controller card in block
storage nodes is impacted primarily by the redundancy and
availability requirements of the application. Applications
which have a higher demand on input-output per second (IOPS)
will influence both the choice to use a RAID controller and
the level of RAID configured on the volume. Where performance
is a consideration, it is suggested to make use of higher
performing RAID volumes. In contrast, where redundancy of
block storage volumes is more important it is recommended to
make use of a redundant RAID configuration such as RAID 5 or
RAID 6. Some specialized features, such as automated
replication of block storage volumes, may require the use of
third-party plug-ins and enterprise block storage solutions in
order to meet the high demand on storage. Furthermore,
where extreme performance is a requirement, it may also be
necessary to make use of high speed SSD disk drives or high
performing flash storage solutions.</para></section>
<section xml:id="software-selection-tech-considerations">
<title>Software Selection</title>
<para>The software selection process can play a large role in the
architecture of a general purpose cloud. Choice of operating
system, selection of OpenStack software components, choice of
hypervisor and selection of supplemental software will have a
large impact on the design of the cloud.</para>
<para>Operating system (OS) selection plays a large role in the
design and architecture of a cloud. There are a number of OSes
which have native support for OpenStack including Ubuntu, Red
Hat Enterprise Linux (RHEL), CentOS, and SUSE Linux Enterprise
Server (SLES). "Native support" in this context means that the
distribution provides distribution-native packages by which to
install OpenStack in their repositories. Note that "native
support" is not a constraint on the choice of OS; users are
free to choose just about any Linux distribution (or even
Microsoft Windows) and install OpenStack directly from source
(or compile their own packages). However, the reality is that
many organizations will prefer to install OpenStack from
distribution-supplied packages or repositories (although using
the distribution vendor's OpenStack packages might be a
requirement for support).</para>
<para>OS selection also directly influences hypervisor selection.
A cloud architect who selects Ubuntu or RHEL has some
flexibility in hypervisor; KVM, Xen, and LXC are supported
virtualization methods available under OpenStack Compute
(Nova) on these Linux distributions. A cloud architect who
selects Hyper-V, on the other hand, is limited to Windows
Server. Similarly, a cloud architect who selects XenServer is
limited to the CentOS-based dom0 operating system provided
with XenServer.</para>
<para>The primary factors that play into OS/hypervisor selection
include:</para>
<itemizedlist>
<listitem>
<para>User requirements: The selection of OS/hypervisor
combination first and foremost needs to support the
user requirements.</para>
</listitem>
<listitem>
<para>Support: The selected OS/hypervisor combination
needs to be supported by OpenStack.</para>
</listitem>
<listitem>
<para>Interoperability: The OS/hypervisor needs to be
interoperable with other features and services in the
OpenStack design in order to meet the user
requirements.</para>
</listitem>
</itemizedlist></section>
<section xml:id="hypervisor-tech-considerations"><title>Hypervisor</title>
<para>OpenStack supports a wide variety of hypervisors, one or
more of which can be used in a single cloud. These hypervisors
include:</para>
<itemizedlist>
<listitem>
<para>KVM (and QEMU)</para>
</listitem>
<listitem>
<para>XCP/XenServer</para>
</listitem>
<listitem>
<para>vSphere (vCenter and ESXi)</para>
</listitem>
<listitem>
<para>Hyper-V</para>
</listitem>
<listitem>
<para>LXC</para>
</listitem>
<listitem>
<para>Docker</para>
</listitem>
<listitem>
<para>Bare-metal</para>
</listitem>
</itemizedlist>
<para>A complete list of supported hypervisors and their
capabilities can be found at
https://wiki.openstack.org/wiki/HypervisorSupportMatrix.</para>
<para>General purpose clouds should make use of hypervisors that
support the most general purpose use cases, such as KVM and
Xen. More specific hypervisors should then be chosen to
account for specific functionality or a supported feature
requirement. In some cases, there may also be a mandated
requirement to run software on a certified hypervisor
including solutions from VMware, Microsoft, and Citrix.</para>
<para>The features offered through the OpenStack cloud platform
determine the best choice of a hypervisor. As an example, for
a general purpose cloud that predominantly supports a
Microsoft-based migration, or is managed by staff that has a
particular skill for managing certain hypervisors and
operating systems, Hyper-V might be the best available choice.
While the decision to use Hyper-V does not limit the ability
to run alternative operating systems, be mindful of those that
are deemed supported. Each hypervisor also has its
own hardware requirements which may affect the decisions
around designing a general purpose cloud. For example,
utilizing the live migration feature of VMware, vMotion,
requires an installation of vCenter/vSphere and the use of the
ESXi hypervisor, which increases the infrastructure
requirements.</para>
<para>In a mixed hypervisor environment, specific aggregates of
compute resources, each with defined capabilities, enable
workloads to utilize software and hardware specific to their
particular requirements. This functionality can be exposed
explicitly to the end user, or accessed through defined
metadata within a particular flavor of an instance.</para></section>
<section xml:id="openstack-components-tech-considerations"><title>OpenStack Components</title>
<para>A general purpose OpenStack cloud design should incorporate
the core OpenStack services to provide a wide range of
services to end-users. The OpenStack core services recommended
in a general purpose cloud are:</para>
<itemizedlist>
<listitem>
<para>OpenStack Compute (Nova)</para>
</listitem>
<listitem>
<para>OpenStack Networking (Neutron)</para>
</listitem>
<listitem>
<para>OpenStack Image Service (Glance)</para>
</listitem>
<listitem>
<para>OpenStack Identity Service (Keystone)</para>
</listitem>
<listitem>
<para>OpenStack Dashboard (Horizon)</para>
</listitem>
<listitem>
<para>OpenStack Telemetry (Ceilometer)</para>
</listitem>
</itemizedlist>
<para>A general purpose cloud may also include OpenStack Object
Storage (Swift). OpenStack Block Storage (Cinder) may be
selected to provide persistent storage to applications and
instances although, depending on the use case, this could be
optional.</para></section>
<section xml:id="supplemental-software-tech-considerations"><title>Supplemental Software</title>
<para>A general purpose OpenStack deployment consists of more than
just OpenStack-specific components. A typical deployment
involves services that provide supporting functionality,
including databases and message queues, and may also involve
software to provide high availability of the OpenStack
environment. Design decisions around the underlying message
queue might affect the required number of controller services,
as well as the technology to provide highly resilient database
functionality, such as MariaDB with Galera. In such a
scenario, replication of services relies on quorum; the
underlying database cluster, for example, should consist of
at least three nodes so that a failed Galera node can rejoin
and recover. When increasing the number of nodes to support a
feature of the software, consideration of rack space and
switch port density becomes important.</para>
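<para>As an illustration of the quorum requirement, a minimal
Galera configuration fragment for a three-node cluster might
look like the following sketch (the file path and addresses
are illustrative):</para>
<programlisting># /etc/mysql/conf.d/galera.cnf
[mysqld]
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="openstack"
wsrep_cluster_address="gcomm://10.0.0.11,10.0.0.12,10.0.0.13"
wsrep_sst_method=rsync</programlisting>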
<para>Where many general purpose deployments use hardware load
balancers to provide highly available API access and SSL
termination, software solutions, for example HAProxy, can also
be considered. It is vital to ensure that such software
implementations are also made highly available. This high
availability can be achieved by using software such as
Keepalived or Pacemaker with Corosync. Pacemaker and Corosync
can provide Active-Active or Active-Passive highly available
configuration depending on the specific service in the
OpenStack environment. Using this software can affect the
design as it assumes at least a 2-node controller
infrastructure where one of those nodes may be running certain
services in standby mode.</para>
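<para>A minimal HAProxy fragment for one API endpoint might
look like the following sketch, where the virtual IP
192.168.1.100 is assumed to be managed by Keepalived or
Pacemaker and all addresses are illustrative:</para>
<programlisting># /etc/haproxy/haproxy.cfg
listen nova-api
    bind 192.168.1.100:8774
    balance roundrobin
    option tcpka
    server controller1 10.0.0.11:8774 check
    server controller2 10.0.0.12:8774 check</programlisting>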
<para>Memcached is a distributed memory object caching system, and
Redis is a key-value store. Both are usually deployed on
general purpose clouds to assist in alleviating load to the
Identity service. The memcached service caches tokens, and due
to its distributed nature it can help alleviate some
bottlenecks to the underlying authentication system. Using
memcached or Redis does not affect the overall design of your
architecture as they tend to be deployed onto the
infrastructure nodes providing the OpenStack services.</para>
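<para>As a sketch, caching Identity tokens in memcached
involves a keystone.conf fragment along the following lines;
exact option and driver names vary between OpenStack
releases, so treat this as illustrative:</para>
<programlisting># /etc/keystone/keystone.conf
[memcache]
servers = 10.0.0.11:11211,10.0.0.12:11211

[token]
# Icehouse-era memcache token backend; verify the driver
# name against your release
driver = keystone.token.backends.memcache.Token</programlisting>
</section>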
<section xml:id="performance-tech-considerations"><title>Performance</title>
<para>Performance of an OpenStack deployment is dependent on a
number of factors related to the infrastructure and controller
services. The user requirements can be split into general
network performance, performance of compute resources, and
performance of storage systems.</para></section>
<section xml:id="controller-infrastructure-tech-considerations">
<title>Controller Infrastructure</title>
<para>The Controller infrastructure nodes provide management
services to the end-user as well as providing services
internally for the operating of the cloud. The Controllers
typically run message queuing services that carry system
messages between each service. Performance issues related to
the message bus lead to delays in delivering messages to
their destinations. The result of this condition would be
delays in operational functions such as spinning up and deleting
instances, provisioning new storage volumes, and managing
network resources. Such delays could adversely affect an
application's ability to react to certain conditions,
especially when using auto-scaling features. It is important
to properly design the hardware used to run the controller
infrastructure as outlined above in the Hardware Selection
section.</para>
<para>Performance of the controller services is not just limited
to processing power, but restrictions may emerge in serving
concurrent users. Load test the APIs and Horizon services
to verify that you are able to serve your customers.
Particular attention should be paid to the OpenStack
Identity Service (Keystone), which provides authentication
and authorization for all services, both internally to
OpenStack itself and to end-users. This service can lead to
a degradation of overall performance if it is not sized
appropriately.</para></section>
<section xml:id="network-performance-tech-considerations"><title>Network Performance</title>
<para>In a general purpose OpenStack cloud, the requirements of
the network help determine its performance capabilities. For
example, small deployments may employ 1 Gigabit Ethernet (GbE)
networking, whereas larger installations serving multiple
departments or many users would be better architected with 10
GbE networking. The performance of the running instances will
be limited by these speeds. It is possible to design OpenStack
environments that run a mix of networking capabilities. By
utilizing the different interface speeds, the users of the
OpenStack environment can choose networks that are fit for
their purpose. For example, web application instances may run
on a public network presented through OpenStack Networking
that has 1 GbE capability, whereas the back-end database uses
an OpenStack Networking network that has 10 GbE capability to
replicate its data or, in some cases, the design may
incorporate link aggregation for greater throughput.</para>
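<para>One way to offer that choice is to expose each fabric
as a separate provider network, as in this illustrative
sketch (the physical network labels and VLAN IDs are
assumptions of the example):</para>
<programlisting>$ neutron net-create web-1g --provider:network_type vlan \
    --provider:physical_network physnet-1g --provider:segmentation_id 100
$ neutron net-create db-10g --provider:network_type vlan \
    --provider:physical_network physnet-10g --provider:segmentation_id 200</programlisting>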
<para>Network performance can be boosted considerably by
implementing hardware load balancers to provide front-end
service to the cloud APIs. The hardware load balancers also
perform SSL termination if that is a requirement of your
environment. When implementing SSL offloading, it is important
to understand the SSL offloading capabilities of the devices
selected.</para></section>
<section xml:id="compute-host-tech-considerations"><title>Compute Host</title>
<para>The choice of hardware specifications used in compute nodes
including CPU, memory and disk type directly affects the
performance of the instances. Other factors which can directly
affect performance include tunable parameters within the
OpenStack services, for example the overcommit ratio applied
to resources. The defaults in OpenStack Compute set a 16:1
over-commit of the CPU and a 1.5:1 over-commit of the memory.
Running at such high ratios leads to an increase in
"noisy-neighbor" activity. Care must be taken when sizing your
Compute environment to avoid this scenario. For running
general purpose OpenStack environments it is possible to keep
to the defaults, but make sure to monitor your environment as
usage increases.</para>
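<para>The ratios in question are set in nova.conf; the values
shown below are the defaults described above:</para>
<programlisting># /etc/nova/nova.conf
cpu_allocation_ratio = 16.0
ram_allocation_ratio = 1.5</programlisting>
</section>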
<section xml:id="storage-performance-tech-considerations"><title>Storage Performance</title>
<para>When considering performance of OpenStack Block Storage,
hardware and architecture choice is important. Block Storage
can use enterprise back-end systems such as NetApp or EMC, use
scale out storage such as GlusterFS and Ceph, or simply use
the capabilities of directly attached storage in the nodes
themselves. Block Storage may be deployed so that traffic
traverses the host network, which could affect, and be
adversely affected by, the front-side API traffic performance.
As such, consider using a dedicated data storage network with
dedicated interfaces on the Controller and Compute
hosts.</para>
<para>When considering performance of OpenStack Object Storage, a
number of design choices will affect performance. A user's
access to the Object Storage is through the proxy services,
which typically sit behind hardware load balancers. By the
very nature of a highly resilient storage system, replication
of the data would affect performance of the overall system. In
this case, 10 GbE (or better) networking is recommended
throughout the storage network architecture.</para></section>
<section xml:id="availability-tech-considerations"><title>Availability</title>
<para>In OpenStack, the infrastructure is integral to providing
services and should always be available, especially when
operating with SLAs. Ensuring network availability is
accomplished by designing the network architecture so that no
single point of failure exists. A consideration of the number
of switches, routers, and redundancy of power should be
factored into core infrastructure, as well as the associated
bonding of networks to provide diverse routes to your highly
available switch infrastructure.</para>
<para>The OpenStack services themselves should be deployed across
multiple servers that do not represent a single point of
failure. Ensuring API availability can be achieved by placing
these services behind highly available load balancers that
have multiple OpenStack servers as members.</para>
<para>OpenStack lends itself to deployment in a highly available
manner where it is expected that at least 2 servers be
utilized. These can run all of the services involved, from the
message queuing service, for example RabbitMQ or Qpid, to an
appropriately deployed database service such as MySQL or
MariaDB. As services in the cloud are scaled out, back-end
services will need to scale too. Monitoring and reporting on
server utilization and response times, as well as load testing
your systems, will help determine scale out decisions.</para>
<para>Care must be taken when deciding network functionality.
Currently, OpenStack supports both the legacy Nova-network
system and the newer, extensible OpenStack Networking. Both
have their pros and cons when it comes to providing highly
available access. Nova-network, which provides networking
access maintained in the OpenStack Compute code, provides a
feature that removes a single point of failure when it comes
to routing, and this feature is currently missing in OpenStack
Networking. Nova network's multi-host functionality
restricts the failure domain for routing to the individual
compute host running the instances.</para>
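<para>As a minimal sketch, enabling the multi-host mode of
nova-network involves setting the following on each compute
node, which then runs its own nova-network and
nova-api-metadata services:</para>
<programlisting># /etc/nova/nova.conf on every compute node
multi_host = True</programlisting>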
<para>On the other hand, when using OpenStack Networking, the
OpenStack controller servers or separate OpenStack Networking
hosts handle routing. For a deployment that requires features
available in only OpenStack Networking, it is possible to
remove this restriction by using third party software that
helps maintain highly available L3 routes. Doing so allows for
common APIs to control network hardware, or to provide complex
multi-tier web applications in a secure manner. It is also
possible to completely remove routing from OpenStack
Networking, and instead rely on hardware routing capabilities.
In this case, the switching infrastructure must support L3
routing.</para>
<para>OpenStack Networking (Neutron) and Nova Network both have
their advantages and disadvantages. They are both valid and
supported options that fit different use cases.</para>
<para>Ensure your deployment has adequate back-up capabilities. As
an example, in a deployment that has two infrastructure
controller nodes, the design should include controller
availability: in the event of the loss of a single
controller, cloud services continue to run from the
remaining controller. Where the design has higher availability
requirements, it is important to meet those requirements by
designing the proper redundancy and availability of controller
nodes.</para>
<para>Application design must also be factored into the
capabilities of the underlying cloud infrastructure. If the
compute hosts do not provide a seamless live migration
capability, then it must be expected that when a compute host
fails, the instances running on it and any data local to them
will be lost. Conversely, when providing an expectation to users
that instances have a high-level of uptime guarantees, the
infrastructure must be deployed in a way that eliminates any
single point of failure when a compute host disappears. This
may include utilizing shared file systems on enterprise
storage or OpenStack Block storage to provide a level of
guarantee to match service features.</para>
<para>For more information on HA in OpenStack, see the OpenStack
High Availability Guide found at
http://docs.openstack.org/high-availability-guide.</para></section>
<section xml:id="security-tech-considerations"><title>Security</title>
<para>A security domain comprises users, applications, servers or
networks that share common trust requirements and expectations
within a system. Typically they have the same authentication
and authorization requirements and users.</para>
<para>These security domains are:</para>
<itemizedlist>
<listitem>
<para>Public</para>
</listitem>
<listitem>
<para>Guest</para>
</listitem>
<listitem>
<para>Management</para>
</listitem>
<listitem>
<para>Data</para>
</listitem>
</itemizedlist>
<para>These security domains can be mapped to an OpenStack
deployment individually, or combined. For example, some
deployment topologies combine both guest and data domains onto
one physical network, whereas in other cases these networks
are physically separated. In each case, the cloud operator
should be aware of the appropriate security concerns. Security
domains should be mapped out against your specific OpenStack
deployment topology. The domains and their trust requirements
depend upon whether the cloud instance is public, private, or
hybrid.</para>
<para>The public security domain is an entirely untrusted area of
the cloud infrastructure. It can refer to the Internet as a
whole or simply to networks over which you have no authority.
This domain should always be considered untrusted.</para>
<para>Typically used for compute instance-to-instance traffic, the
guest security domain handles compute data generated by
instances on the cloud but not services that support the
operation of the cloud, such as API calls. Public cloud
providers and private cloud providers who do not have
stringent controls on instance use or who allow unrestricted
internet access to instances should consider this domain to be
untrusted. Private cloud providers may want to consider this
network as internal and therefore trusted only if they have
controls in place to assert that they trust instances and all
their tenants.</para>
<para>The management security domain is where services interact.
Sometimes referred to as the "control plane", the networks in
this domain transport confidential data such as configuration
parameters, user names, and passwords. In most deployments this
domain is considered trusted.</para>
<para>The data security domain is concerned primarily with
information pertaining to the storage services within
OpenStack. Much of the data that crosses this network has high
integrity and confidentiality requirements and, depending on
the type of deployment, may also have strong availability
requirements. The trust level of this network is heavily
dependent on other deployment decisions.</para>
<para>When deploying OpenStack in an enterprise as a private cloud
it is usually behind the firewall and within the trusted
network alongside existing systems. Users of the cloud are,
traditionally, employees that are bound by the security
requirements set forth by the company. This tends to push most
of the security domains towards a more trusted model. However,
when deploying OpenStack in a public facing role, no
assumptions can be made and the attack vectors significantly
increase. For example, the API endpoints, along with the
software behind them, become vulnerable to bad actors wanting
to gain unauthorized access or prevent access to services,
which could lead to loss of data, functionality, and
reputation. These services must be protected against such
threats through auditing and appropriate filtering.</para>
<para>Consideration must be given to managing the users of the
system for both public and private clouds. The identity
service allows for LDAP to be part of the authentication
process. Including such systems in an OpenStack deployment may
ease user management when integrating with existing
systems.</para>
<para>It is important to understand that user authentication
requests include sensitive information including user names,
passwords and authentication tokens. For this reason, placing
the API services behind hardware that performs SSL termination
is strongly recommended.</para>
<para>For more information on OpenStack Security, see the OpenStack
Security Guide, at
http://docs.openstack.org/security-guide/.</para>
</section>
</section>


@ -0,0 +1,175 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="user-requirements-general-purpose">
<?dbhtml stop-chunking?>
<title>User Requirements</title>
<para>The general purpose cloud is built following the
Infrastructure-as-a-Service (IaaS) model, as a platform best
suited for use cases with simple requirements. The general
purpose cloud user requirements themselves are typically not
complex. However, it is still important to capture them even
if the project has minimum business and technical requirements
such as a Proof of Concept (PoC) or a small lab
platform.</para>
<para>These user considerations are written from the perspective
of the organization that is building the cloud, not from the
perspective of the end-users who will consume cloud services
provided by this design.</para>
<itemizedlist>
<listitem>
<para>Cost: Financial factors are a primary concern for
any organization. Since general purpose clouds are
considered the baseline from which all other cloud
architecture environments derive, cost will commonly
be an important criterion. This type of cloud, however,
does not always provide the most cost-effective
environment for a specialized application or
situation. Unless razor-thin margins and costs have
been mandated as a critical factor, cost should not be
the sole consideration when choosing or designing a
general purpose architecture.</para>
</listitem>
<listitem>
<para>Time to market: Another common business factor in
building a general purpose cloud is the ability to
deliver a service or product more quickly and
flexibly. In the modern hyper-fast business world,
being able to deliver a product in six months instead
of two years is often a major driving force behind the
decision to build a general purpose cloud. General
purpose clouds allow users to self-provision and gain
access to compute, network, and storage resources
on-demand thus decreasing time to market. It may
potentially make more sense to build a general purpose
PoC as opposed to waiting to finalize the ultimate use
case for the system. The tradeoff with taking this
approach is the risk that the general purpose cloud is
not optimized for the actual final workloads. The
final decision on which approach to take will be
dependent on the specifics of the business objectives
and time frame for the project.</para>
</listitem>
<listitem>
<para>Revenue opportunity: The revenue opportunity for a
given cloud will vary greatly based on the intended
use case of that particular cloud. Some general
purpose clouds are built for commercial customer
facing products, but there are plenty of other reasons
that might make the general purpose cloud the right
choice. A small cloud service provider (CSP) might
want to build a general purpose cloud rather than a
massively scalable cloud because they do not have the
deep financial resources needed, or because they do
not or will not know in advance the purposes for which
their customers are going to use the cloud. For some
users, the advantages the cloud itself offers mean an
enhancement of revenue opportunity. For others, the
fact that a general purpose cloud provides only
baseline functionality will be a disincentive for use,
leading to a potential stagnation of revenue
opportunities.</para>
</listitem>
</itemizedlist>
<section xml:id="legal-requirements-general-purpose"><title>Legal Requirements</title>
<para>Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:</para>
<itemizedlist>
<listitem>
<para>Data retention policies ensuring storage of
persistent data and records management to meet data
archival requirements.</para>
</listitem>
<listitem>
<para>Data ownership policies governing the possession and
responsibility for data.</para>
</listitem>
<listitem>
<para>Data sovereignty policies governing the storage of
data in foreign countries or otherwise separate
jurisdictions.</para>
</listitem>
<listitem>
<para>Data compliance policies governing the requirement
that certain types of information reside in certain
locations due to regulatory issues, and, more
importantly, cannot reside in other locations for the
same reason.</para>
</listitem>
</itemizedlist>
<para>Examples of such legal frameworks include the data
protection framework of the European Union
(http://ec.europa.eu/justice/data-protection/) and the
requirements of the Financial Industry Regulatory Authority
(http://www.finra.org/Industry/Regulation/FINRARules/) in the
United States. Consult a local regulatory body for more
information.</para></section>
<section xml:id="technical-requirements"><title>Technical Requirements</title>
<para>Technical cloud architecture requirements should be weighed
against the business requirements.</para>
<itemizedlist>
<listitem>
<para>Performance: As a baseline product, general purpose
clouds do not provide optimized performance for any
particular function. While a general purpose cloud
should provide enough performance to satisfy average
user considerations, performance is not a primary
driver for the general purpose cloud customer.</para>
</listitem>
<listitem>
<para>No predefined usage model: The lack of a predefined
usage model enables the user to run a wide variety of
applications without having to know the application
requirements in advance. This provides a degree of
independence and flexibility that no other cloud
scenarios are able to provide.</para>
</listitem>
<listitem>
<para>On-demand and self-service application: By
definition, a cloud provides end users with the
ability to self-provision computing power, storage,
networks, and software in a simple and flexible way.
The user must be able to scale their resources up to a
substantial level without disrupting the underlying
host operations. One of the benefits of using a
general purpose cloud architecture is the ability to
start with limited resources and increase them over
time as the user demand grows.</para>
</listitem>
<listitem>
<para>Public cloud: For a company interested in building a
commercial public cloud offering based on OpenStack,
the general purpose architecture model might be the
best choice because the designers are not going to
know the purposes or workloads for which the end users
will use the cloud.</para>
</listitem>
<listitem>
<para>Internal consumption (private) cloud: Organizations
need to determine if it makes the most sense to create
their own clouds internally. The main advantage of a
private cloud is that it allows the organization to
maintain complete control over all the architecture
and the cloud components. One caution is to think
about the possibility that users will want to combine
using the internal cloud with access to an external
cloud. If that case is likely, it might be worth
exploring the possibility of taking a multi-cloud
approach with regard to at least some of the
architectural elements. Designs that incorporate the
use of multiple clouds, such as a private cloud and a
public cloud offering, are described in the
"Multi-Cloud" scenario.</para>
</listitem>
<listitem>
<para>Security: Security should be implemented according
to asset, threat, and vulnerability risk assessment
matrices. For cloud domains that require increased
computer security, network security, or information
security, a general purpose cloud is not considered an
appropriate choice.</para>
</listitem>
</itemizedlist></section>
</section>


@ -0,0 +1,186 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-architecture-hybrid">
<?dbhtml stop-chunking?>
<title>Architecture</title>
<para>Once business and application requirements have been
defined, the first step for designing a hybrid cloud solution
is to map out the dependencies between the expected workloads
and the diverse cloud infrastructures that need to support
them. By mapping the applications and the targeted cloud
environments, you can architect a solution that enables the
broadest compatibility between cloud platforms and minimizes
the need to create workarounds and processes to fill
identified gaps. Be sure to evaluate the monitoring and
orchestration APIs available on each cloud platform and the
relative levels of support for them in the chosen Cloud
Management Platform (CMP).</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Cloud_Priv-AWS4.png"
/>
</imageobject>
</mediaobject>
<section xml:id="image-portability"><title>Image portability</title>
<para>The majority of cloud workloads currently run on instances
using hypervisor technologies such as KVM, Xen, or ESXi. The
challenge is that each of these hypervisors uses an image
format that is only partly compatible, or entirely
incompatible, with the others. In a private or hybrid cloud solution, this can be
mitigated by standardizing on the same hypervisor and instance
image format but this is not always feasible. This is
particularly evident if one of the clouds in the architecture
is a public cloud that is outside of the control of the
designers.</para>
<para>There are conversion tools such as virt-v2v
(http://libguestfs.org/virt-v2v/) and virt-edit
(http://libguestfs.org/virt-edit.1.html) that can be used in
those scenarios but they are often not suitable beyond very
basic cloud instance specifications. An alternative is to
build a thin operating system image as the base for new
instances. This facilitates rapid creation of cloud instances
using cloud orchestration or configuration management tools,
driven by the CMP, for more specific templating. Another more
expensive option is to use a commercial image migration tool.
The issue of image portability is not limited to a one-time
migration. If the intention is to use multiple clouds for
disaster recovery, application diversity, or high availability,
the images and instances are likely to be moved between the
different cloud platforms regularly.</para>
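<para>A minimal sketch of such a conversion, using qemu-img (a
common companion to the tools named above; the image names
are illustrative), followed by an upload to the Image
Service:</para>
<programlisting>$ qemu-img convert -f vmdk -O qcow2 web-server.vmdk web-server.qcow2
$ glance image-create --name web-server --disk-format qcow2 \
    --container-format bare --file web-server.qcow2</programlisting>
</section>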
<section xml:id="upper-layer-services"><title>Upper-Layer Services</title>
<para>Many clouds offer complementary services over and above the
basic compute, network, and storage components. These
additional services are often used to simplify the deployment
and management of applications on a cloud platform.</para>
<para>Consideration must be given to moving workloads that
have upper-layer service dependencies on the source
cloud platform to a destination cloud platform that may not
have a comparable service, or that implements it in a
different way or with a different technology. For
example, moving an application that uses a NoSQL database
service such as MongoDB that is delivered as a service on the
source cloud, to a destination cloud that does not offer that
service or may only use a relational database such as MySQL,
could cause difficulties in maintaining the application
between the platforms.</para>
<para>There are a number of options that might be appropriate for
the hybrid cloud use case:</para>
<itemizedlist>
<listitem>
<para>Create a baseline of upper-layer services that are
implemented across all of the cloud platforms. For
platforms that do not support a given service, create
a service on top of that platform and apply it to the
workloads as they are launched on that cloud. For
example, OpenStack, via Trove, supports MySQL as a
service but not NoSQL databases in production. Moving
a NoSQL workload from AWS, or running it alongside
AWS, would require recreating the NoSQL database on
top of OpenStack and automating its deployment using
a tool such as OpenStack Orchestration (Heat), as
shown in the sketch after this list.</para>
</listitem>
<listitem>
<para>Deploy a Platform as a Service (PaaS) technology
such as Cloud Foundry or OpenShift that abstracts the
upper-layer services from the underlying cloud
platform. The unit of application deployment and
migration is the PaaS and leverages the services of
the PaaS and only consumes the base infrastructure
services of the cloud platform. The downside to this
approach is that the PaaS itself then potentially
becomes a source of lock-in.</para>
</listitem>
<listitem>
<para>Use only the base infrastructure services that are
common across all cloud platforms. Use automation
tools to create the required upper-layer services
which are portable across all cloud platforms. For
example, instead of using any database services that
are inherent in the cloud platforms, launch cloud
instances and deploy the databases on to those
instances using scripts or various configuration and
application deployment tools.</para>
</listitem>
</itemizedlist>
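<para>As a sketch of the first option above, a minimal
OpenStack Orchestration template can launch an instance and
install the missing service itself; the image and flavor
names and the install script are illustrative
assumptions:</para>
<programlisting>heat_template_version: 2013-05-23
description: Illustrative self-managed NoSQL database server
resources:
  nosql_server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-14.04
      flavor: m1.medium
      user_data: |
        #!/bin/bash
        # Install the database that the platform does not
        # provide as a service
        apt-get update
        apt-get install -y mongodb-server</programlisting>
</section>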
<section xml:id="network-services"><title>Network Services</title>
<para>Network services functionality is a significant barrier for
multiple cloud architectures. It could be an important factor
to assess when choosing a CMP and cloud provider.
Considerations include functionality, security, scalability, and
high availability (HA). Verification and ongoing testing of
the critical features of the cloud endpoint used by the
architecture are important tasks.</para>
<itemizedlist>
<listitem>
<para>Once the network functionality framework has been
decided, a minimum functionality test should be
designed to confirm that the functionality is in fact
compatible. This will ensure testing and functionality
persists during and after upgrades. Note that over
time, the diverse cloud platforms are likely to
de-synchronize if care is not taken to maintain
compatibility. This is a particular issue with
APIs.</para>
</listitem>
<listitem>
<para>Scalability across multiple cloud providers may
dictate which underlying network framework is chosen
for the different cloud providers. It is important to
have the network API functions presented and to verify
that the desired functionality persists across all
chosen cloud endpoints.</para>
</listitem>
<listitem>
<para>High availability (HA) implementations vary in
functionality and design. Examples of some common
methods are Active-Hot-Standby, Active-Passive and
Active-Active. High availability and a test framework
need to be developed to insure that the functionality
and limitations are well understood.</para>
</listitem>
<listitem>
<para>Security considerations, such as how data is secured
between the client and the endpoint, and how traffic
that traverses the multiple clouds is protected against
threats ranging from eavesdropping to DoS activities,
must be addressed. Business and regulatory requirements
dictate the security approach that needs to be taken.</para>
</listitem>
</itemizedlist></section>
<section xml:id="data"><title>Data</title>
<para>Replication has been the traditional method for protecting
object store implementations. A variety of different
implementations have existed in storage architectures.
Examples of this are both synchronous and asynchronous
mirroring. Most object stores and back-end storage systems have
a method for replication that can be implemented at the
storage subsystem layer. Object stores also have implemented
replication techniques that can be tailored to fit a cloud's
needs. An organization must find the right balance between
data integrity and data availability. Replication strategy may
also influence the disaster recovery methods
implemented.</para>
<para>Replication across different racks, data centers and
geographical regions has led to the increased focus of
determining and ensuring data locality. The ability to
guarantee data is accessed from the nearest or fastest storage
can be necessary for applications to perform well. An example
of this is Hadoop running in a cloud. The user either runs with
a native HDFS, when applicable, or on a separate parallel file
system such as those provided by Hitachi and IBM. Special
consideration should be taken when running embedded object
store methods to not cause extra data replication, which can
create unnecessary performance issues. Another example of
ensuring data locality is by using Ceph. Ceph has a data
container abstraction called a pool. Pools can be created with
replicas or erasure code. Replica based pools can also have a
rule set defined to have data written to a “local” set of
hardware which would be the primary access and modification
point.</para>
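<para>As a minimal sketch of that approach (the pool name,
placement group counts, and rule number are illustrative,
and the CRUSH rule favoring local hardware must already be
defined):</para>
<programlisting># Create a replicated pool with three copies of each object
$ ceph osd pool create volumes 128 128 replicated
$ ceph osd pool set volumes size 3
# Point the pool at a CRUSH rule that keeps data local
$ ceph osd pool set volumes crush_ruleset 1</programlisting>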
</section>
</section>


@ -0,0 +1,68 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-hybrid">
<title>Introduction</title>
<para>Hybrid cloud, by definition, means that the design spans
more than one cloud. An example of this kind of architecture
may include a situation in which the design involves more than
one OpenStack cloud (for example, an OpenStack-based private
cloud and an OpenStack-based public cloud), or it may be a
situation incorporating an OpenStack cloud and a non-OpenStack
cloud (for example, an OpenStack-based private cloud that
interacts with Amazon Web Services). Bursting into an external
cloud is the practice of creating new instances to alleviate
extra load where there is no available capacity in the private
cloud.</para>
<para>Some situations that could involve hybrid cloud architecture
include:</para>
<itemizedlist>
<listitem>
<para>Bursting from a private cloud to a public
cloud</para>
</listitem>
<listitem>
<para>Disaster recovery</para>
</listitem>
<listitem>
<para>Development and testing</para>
</listitem>
<listitem>
<para>Federated cloud, enabling users to choose resources
from multiple providers</para>
</listitem>
<listitem>
<para>Hybrid clouds built to support legacy systems as
they transition to cloud</para>
</listitem>
</itemizedlist>
<para>As a hybrid cloud design deals with systems that are outside
of the control of the cloud architect or organization, a
hybrid cloud architecture requires considering aspects of the
architecture that might not have otherwise been necessary. For
example, the design may need to deal with hardware, software,
and APIs under the control of a separate organization.</para>
<para>Similarly, the degree to which the architecture is
OpenStack-based will have an effect on the cloud operator or
cloud consumer's ability to accomplish tasks with native
OpenStack tools. By definition, this is a situation in which
no single cloud can provide all of the necessary
functionality. In order to manage the entire system, users,
operators and consumers will need an overarching tool known as
a cloud management platform (CMP). Any organization that is
working with multiple clouds already has a CMP, even if that
CMP is the operator who logs into an external web portal and
launches a public cloud instance.</para>
<para>There are commercially available options, such as
Rightscale, and open source options, such as ManageIQ
(http://manageiq.org/), but there is no single CMP that can
address all needs in all scenarios. Whereas most of the
sections of this book talk about the aspects of OpenStack an
architect needs to consider when designing an OpenStack
architecture, this section also discusses the things the
architect must address when choosing or building a CMP to run
a hybrid cloud design, even if the CMP will be a manually
built solution.</para>
</section>


@ -0,0 +1,99 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-hybrid-operational-considerations">
<?dbhtml stop-chunking?>
<title>Operational Considerations</title>
<para>Hybrid cloud deployments present complex operational
challenges. There are several factors to consider that affect
the way each cloud is deployed and how users and operators
will interact with each cloud. Not every cloud provider
implements infrastructure components the same way, which may
lead to incompatible interactions with workloads or a specific
Cloud Management Platform (CMP). Different cloud providers may
also offer different levels of integration with competing
cloud offerings.</para>
<para>When selecting a CMP, one of the most important aspects to
consider is monitoring. Gaining valuable insight into each
cloud is critical to gaining a holistic view of all involved
clouds. When choosing an existing CMP, it is vital to
determine whether it supports monitoring of all the clouds
involved, or whether compatible APIs are available that can
be queried for the necessary information. Once all the information
about each cloud can be gathered and stored in a searchable
database, proper actions can be taken on that data offline so
workloads will not be impacted.</para>
<section xml:id="agility"><title>Agility</title>
<para>Implementing a hybrid cloud solution can provide application
availability across disparate cloud environments and
technologies. This availability enables the deployment to
survive a complete disaster in any single cloud environment.
Each cloud should provide the means to quickly spin up new
instances in the case of capacity issues or complete
unavailability of a single cloud installation.</para></section>
<section xml:id="application-readiness-hybrid"><title>Application Readiness</title>
<para>It is important to understand the type of application
workloads that will be deployed across the hybrid cloud
environment. Enterprise workloads that depend on the
underlying infrastructure for availability are not designed to
run on OpenStack. Although these types of applications can run
on an OpenStack cloud, if the application is not able to
tolerate infrastructure failures, it is likely to require
significant operator intervention to recover. Cloud workloads,
however, are designed with fault tolerance in mind and the SLA
of the application is not tied to the underlying
infrastructure. Ideally, cloud applications will be designed
to recover when entire racks and even data centers full of
infrastructure experience an outage.</para></section>
<section xml:id="upgrades"><title>Upgrades</title>
<para>OpenStack is a complex and constantly evolving collection of
software. Upgrades may be performed to one or more of the
cloud environments involved. If a public cloud is involved in
the deployment, predicting upgrades may not be possible. Be
sure to examine the advertised SLA for any public cloud
provider being used. Note that at massive scale, even when
dealing with a cloud that offers an SLA with a high percentage
of uptime, workloads must be able to recover at short
notice.</para>
<para>Similarly, when upgrading private cloud deployments, care
must be taken to minimize disruption by making incremental
changes and providing a facility to either roll back or
continue to roll forward when using a continuous delivery
model.</para>
<para>Another consideration is upgrades to the CMP which may need
to be completed in coordination with any of the hybrid cloud
upgrades. This may be necessary whenever API changes are made
in one of the cloud solutions in use to support the new
functionality.</para></section>
<section xml:id="network-operation-center-noc"><title>Network Operation Center (NOC)</title>
<para>When planning the Network Operation Center for a hybrid
cloud environment, it is important to recognize where control
over each piece of infrastructure resides. If a significant
portion of the cloud is on externally managed systems, be
prepared for situations in which it may not be possible to
make changes at all or at the most convenient time.
Additionally, situations of conflict may arise in which
multiple providers have differing points of view on the way
infrastructure must be managed and exposed. This can lead to
delays in root cause analysis, with each provider insisting
the blame lies with the other.</para>
<para>It is important to ensure that the structure put in place
enables connection of the networking of both clouds to form an
integrated system, keeping in mind the state of handoffs.
These handoffs must both be as reliable as possible and
include as little latency as possible to ensure the best
performance of the overall system.</para></section>
<section xml:id="maintainability"><title>Maintainability</title>
<para>Operating hybrid clouds is a situation in which there is a
greater reliance on third party systems and processes. As a
result of a lack of control of various pieces of a hybrid
cloud environment, it is not necessarily possible to guarantee
proper maintenance of the overall system. Instead, the user
must be prepared to abandon workloads and spin them up again
in an improved state. Having a hybrid cloud deployment does,
however, provide agility for these situations by allowing the
migration of workloads to alternative clouds in response to
cloud-specific issues.</para></section>
</section>


@ -0,0 +1,175 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="prescriptive-examples-multi-cloud">
<?dbhtml stop-chunking?>
<title>Prescriptive Examples</title>
<para>Multi-cloud environments are typically created to facilitate
these use cases:</para>
<itemizedlist>
<listitem>
<para>Bursting workloads from private to public OpenStack
clouds</para>
</listitem>
<listitem>
<para>Bursting workloads from private to public
non-OpenStack clouds</para>
</listitem>
<listitem>
<para>High Availability across clouds (for technical
diversity)</para>
</listitem>
</itemizedlist>
<para>Examples of environments that address each of these use
cases will be discussed in this chapter.</para>
<para>Company A's data center is running dangerously low on
capacity. The option of expanding the data center will not be
possible in the foreseeable future. In order to accommodate
the continuously growing need for development resources in the
organization, the decision was made to make use of resources
in the public cloud.</para>
<para>The company has an internal cloud management platform that
will direct requests to the appropriate cloud, depending on
the currently local capacity.</para>
<para>This is a custom in-house application that has been written
for this specific purpose.</para>
<para>An example of such a solution is described in the figure
below.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Cloud_Priv-Pub3.png"
/>
</imageobject>
</mediaobject>
<para>This example shows two clouds, with a Cloud Management
Platform (CMP) connecting them. This guide does not attempt to
cover a specific CMP, but describes how workloads are
typically orchestrated using the Orchestration and Telemetry
services as shown in the diagram above. It is also possible to
connect directly to the other OpenStack APIs with a
CMP.</para>
<para>The private cloud is an OpenStack cloud with one or more
controllers and one or more compute nodes. It includes
metering provided by OpenStack Telemetry. As load increases,
Telemetry captures this and the information is in turn
processed by the CMP. As long as capacity is available, the
CMP uses the OpenStack API to call the Orchestration service
to create instances on the private cloud in response to user
requests. When capacity is not available on the private cloud,
the CMP issues a request to the Orchestration service API of
the public cloud to create the instance on the public
cloud.</para>
<para>In this example, the whole deployment was not directed to an
external public cloud because of the company's concerns over
a lack of resource control, security, and increased
operational expense.</para>
<para>In addition, Company A has already established a data center
with a substantial amount of hardware, and migrating all the
workloads out to a public cloud was not feasible.</para>
<section xml:id="bursting-to-public-nonopenstack-cloud"><title>Bursting to a Public non-OpenStack Cloud</title>
<para>Another common scenario is bursting workloads from the
private cloud into a non-OpenStack public cloud such as Amazon
Web Services (AWS) to take advantage of additional capacity
and scale applications as needed.</para>
<para>For an OpenStack-to-AWS hybrid cloud, the architecture looks
similar to the figure below:</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Cloud_Priv-AWS4.png"
/>
</imageobject>
</mediaobject>
<para>In this scenario Company A has an additional requirement in
that the developers were already using AWS for some of their
work and did not want to change the cloud provider, primarily
due to the excessive overhead of creating the necessary
network firewall rules and corporate financial procedures
that required entering into an agreement with a new
provider.</para>
<para>As long as the CMP is capable of connecting an external
cloud provider with the appropriate API, the workflow process
will remain the same as the previous scenario. The actions the
CMP takes such as monitoring load, creating new instances, and
so forth are the same, but they would be performed in the
public cloud using the appropriate API calls. For example, if
the public cloud is Amazon Web Services, the CMP would use the
EC2 API to create a new instance and assign an Elastic IP.
That IP can then be added to HAProxy in the private cloud,
just as it was before. The CMP can also reference AWS-specific
tools such as CloudWatch and CloudFormation.</para>
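<para>A minimal sketch of those calls using the AWS command
line tools (all identifiers and addresses are
illustrative):</para>
<programlisting>$ aws ec2 run-instances --image-id ami-1a2b3c4d \
    --instance-type m3.medium --count 1
$ aws ec2 allocate-address
$ aws ec2 associate-address --instance-id i-0123abcd \
    --public-ip 203.0.113.10</programlisting>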
<para>Several open source tool kits for building CMPs are now
available that can handle this kind of translation, including
ManageIQ, jClouds, and JumpGate.</para></section>
<section xml:id="high-availability-disaster-recovery"><title>High Availability/Disaster Recovery</title>
<para>Company A has a requirement to be able to recover from
catastrophic failure in their local data center. Some of the
workloads currently in use are running on their private
OpenStack cloud. Protecting the data involves block storage,
object storage, and a database. The architecture is designed
to support the failure of large components of the system
while ensuring that the system continues to deliver services.
While the services remain available to users, the failed
components are restored in the background based on standard
best practice DR policies. To achieve the objectives, data is
replicated to a second cloud, in a geographically distant
location. The logical diagram of the system is described in
the figure below:</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Cloud_failover2.png"
/>
</imageobject>
</mediaobject>
<para>This example includes two private OpenStack clouds connected
with a Cloud Management Platform (CMP). The source cloud,
OpenStack Cloud 1, includes a controller and at least one
instance running MySQL. It also includes at least one block
storage volume and one object storage volume so that the data
is available to the users at all times. The details of the
method for protecting each of these sources of data
differ.</para>
<para>The object storage relies on the replication capabilities of
the object storage provider. OpenStack Object Storage is
enabled so that it creates geographically separated replicas
that take advantage of this feature. It is configured so that
at least one replica exists in each cloud. In order to make
this work, a single array spanning both clouds is configured
with OpenStack Identity using federated identity, which talks
to both clouds and communicates with OpenStack Object Storage
through the Swift proxy.</para>
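<para>A minimal sketch of building such a two-region object
ring with one replica per cloud (device names, addresses,
and weights are illustrative):</para>
<programlisting># 2^10 partitions, 2 replicas, 24-hour minimum between moves
$ swift-ring-builder object.builder create 10 2 24
$ swift-ring-builder object.builder add r1z1-10.0.0.51:6000/sdb 100
$ swift-ring-builder object.builder add r2z1-172.16.0.51:6000/sdb 100
$ swift-ring-builder object.builder rebalance</programlisting>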
<para>For block storage, the replication is a little more
difficult, and involves tools outside of OpenStack itself. The
OpenStack Block Storage volume is not set as the drive itself
but as a logical object that points to a physical back end. The
disaster recovery is configured for Block Storage for
synchronous backup for the highest level of data protection,
but asynchronous backup could have been set as an alternative
that is not as latency sensitive. For asynchronous backup, the
Cinder API makes it possible to export the data and also the
metadata of a particular volume, so that it can be moved and
replicated elsewhere. More information can be found here:
https://blueprints.launchpad.net/cinder/+spec/cinder-backup-volume-metadata-support.</para>
<para>The synchronous backups create an identical volume in both
clouds and choose the appropriate flavor so that each cloud
has an identical back end. This was done by creating volumes
through the CMP, because the CMP knows to create identical
volumes in both clouds. Once this is configured, a solution
involving DRBD is used to synchronize the actual physical
drives.</para>
<para>The database component is backed up using synchronous
backups. MySQL does not support geographically diverse
replication, so disaster recovery is provided by replicating
the file itself. As it is not possible to use object storage
as the back end of a database like MySQL, Swift replication
was not an option. It was decided not to store the data on
another geo-tiered storage system, such as Ceph, as block
storage, although this would have given another layer of
protection. Another option would have been to store the
database on an OpenStack Block Storage volume and back it up
just as any other block storage.</para></section>
</section>


@ -0,0 +1,325 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="technical-considerations-hybrid">
<?dbhtml stop-chunking?>
<title>Technical Considerations</title>
<para>A hybrid cloud environment requires inspection and
understanding of technical issues that are not only outside of
an organization's data center, but potentially outside of an
organization's control. In many cases, it is necessary to
ensure that the architecture and CMP chosen can adapt not
only to different environments, but also to the possibility of
change. In this situation, applications are crossing diverse
platforms and are likely to be located in diverse locations.
All of these factors will influence and add complexity to the
design of a hybrid cloud architecture.</para>
<para>The only situation where cloud platform incompatibilities
are not going to be an issue is when working with clouds that
are based on the same version and the same distribution of
OpenStack. Otherwise incompatibilities are virtually
inevitable.</para>
<para>Incompatibility should be less of an issue for clouds that
exclusively use the same version of OpenStack, even if they
use different distributions. The newer the distribution in
question, the less likely it is that there will be
incompatibilities between versions. This is due to the fact
that the OpenStack community has established an initiative to
define core functions that need to remain backward compatible
between supported versions. The DefCore initiative defines
basic functions that every distribution must support in order
to bear the name "OpenStack".</para>
<para>Some vendors, however, add proprietary customizations to
their distributions. If an application or architecture makes
use of these features, it will be difficult to migrate to or
use other types of environments. Anyone incorporating
versions of OpenStack older than Havana should think
carefully before attempting to span functionality between
versions. Internal differences in older
versions may be so great that the best approach might be to
consider the versions to be essentially diverse platforms, as
different as OpenStack and Amazon Web Services or Microsoft
Azure.</para>
<para>The situation is more predictable if the use of different
cloud platforms is incorporated from inception. If the other clouds
are not based on OpenStack, then all pretense of compatibility
vanishes, and CMP tools must account for the myriad of
differences in the way operations are handled and services are
implemented. Some situations in which these incompatibilities
can arise include differences between the way in which a
cloud:</para>
<itemizedlist>
<listitem>
<para>Deploys instances</para>
</listitem>
<listitem>
<para>Manages networks</para>
</listitem>
<listitem>
<para>Treats applications</para>
</listitem>
<listitem>
<para>Implements services</para>
</listitem>
</itemizedlist>
<section xml:id="capacity-planning-hybrid"><title>Capacity planning</title>
<para>One of the primary reasons many organizations turn to a
hybrid cloud system is to increase capacity without having to
make large capital investments. However, capacity planning is
still necessary when designing an OpenStack installation even
if it is augmented with external clouds.</para>
<para>Specifically, overall capacity and placement of workloads
need to be accounted for when designing for a mostly
internally-operated cloud with the occasional capacity burst.
The long-term capacity plan for such a design needs to
incorporate growth over time to prevent the need to
permanently burst into, and occupy, a potentially more
expensive external cloud. In order to avoid this scenario,
account for the future applications and capacity requirements
and plan growth appropriately.</para>
<para>One of the drawbacks of capacity planning is
unpredictability. It is difficult to predict the amount of
load a particular application might incur if the number of
users fluctuates or the application experiences an unexpected
increase in popularity. It is possible to define application
requirements in terms of vCPU, RAM, bandwidth or other
resources and plan appropriately, but other clouds may not use
the same metric or even the same oversubscription
rates.</para>
<para>Oversubscription is a method to emulate more capacity than
may physically be present. For example, a physical
hypervisor node with 32 gigabytes of RAM may host 24
instances, each provisioned with 2 gigabytes of RAM; that is
48 gigabytes provisioned against 32 gigabytes of physical
memory, a 1.5:1 memory oversubscription. As long as all 24
of them are not concurrently utilizing 2 full
gigabytes, this arrangement is a non-issue. However, some
hosts take oversubscription to extremes and, as a result,
performance can frequently be inconsistent. If at all
possible, determine what the oversubscription rates of each
host are and plan capacity accordingly.</para></section>
<section xml:id="security-hybrid"><title>Security</title>
<para>The nature of a hybrid cloud environment removes complete
control over the infrastructure. Security becomes a stronger
requirement because data or applications may exist in a cloud
that is outside of an organization's control. Security domains
become an important distinction when planning for a hybrid
cloud environment and its capabilities. A security domain
comprises users, applications, servers or networks that share
common trust requirements and expectations within a
system.</para>
<para>The security domains are:</para>
<orderedlist>
<listitem>
<para>Public</para>
</listitem>
<listitem>
<para>Guest</para>
</listitem>
<listitem>
<para>Management</para>
</listitem>
<listitem>
<para>Data</para>
</listitem>
</orderedlist>
<para>These security domains can be mapped individually to the
organization's installation or combined. For example, some
deployment topologies combine both guest and data domains onto
one physical network, whereas other topologies may physically
separate these networks. In each case, the cloud operator
should be aware of the appropriate security concerns. Security
domains should be mapped out against the specific OpenStack
deployment topology. The domains and their trust requirements
depend upon whether the cloud instance is public, private, or
hybrid.</para>
<para>The public security domain is an entirely untrusted area of
the cloud infrastructure. It can refer to the Internet as a
whole or simply to networks over which an organization has no
authority. This domain should always be considered untrusted.
When considering hybrid cloud deployments, any traffic
traversing beyond and between the multiple clouds should
always be considered to reside in this security domain and is
therefore untrusted.</para>
<para>Typically used for instance-to-instance traffic within a
single data center, the guest security domain handles compute
data generated by instances on the cloud but not services that
support the operation of the cloud such as API calls. Public
cloud providers that are used in a hybrid cloud configuration
which an organization does not control and private cloud
providers who do not have stringent controls on instance use
or who allow unrestricted internet access to instances should
consider this domain to be untrusted. Private cloud providers
may consider this network as internal and therefore trusted
only if there are controls in place to assert that instances
and tenants are trusted.</para>
        <para>The management security domain is where services interact.
            Sometimes referred to as the "control plane", the networks in
            this domain transport confidential data such as configuration
            parameters, user names, and passwords. In deployments behind an
            organization's firewall, this domain is considered trusted.
            When a public cloud is part of the architecture, the controls
            that the public cloud provider has in place must be assessed
            before this domain can be treated as trusted.</para>
<para>The data security domain is concerned primarily with
information pertaining to the storage services within
OpenStack. Much of the data that crosses this network has high
integrity and confidentiality requirements and depending on
the type of deployment there may also be strong availability
requirements. The trust level of this network is heavily
dependent on deployment decisions and as such this is not
assigned a default level of trust.</para>
        <para>Consideration must be given to managing the users of the
            system, whether operating or utilizing public or private
            clouds. The Identity service allows LDAP to be part of the
            authentication process. Including such systems in an OpenStack
            deployment may ease user management when integrating with
            existing systems. When utilizing third-party clouds, explore
            the authentication options applicable to the installation to
            help keep user authentication consistent.</para>
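        <para>As a minimal sketch only, the following keystone.conf
            fragment shows how the Identity service can be pointed at an
            LDAP back end. Option names assume an Icehouse-era release, and
            all values are placeholders for site-specific settings:</para>
        <programlisting># /etc/keystone/keystone.conf (illustrative sketch only)
[identity]
driver = keystone.identity.backends.ldap.Identity

[ldap]
url = ldap://ldap.example.com
suffix = dc=example,dc=com
user_tree_dn = ou=Users,dc=example,dc=com
user_objectclass = inetOrgPerson</programlisting>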
<para>Due to the process of passing user names, passwords, and
generated tokens between client machines and API endpoints,
placing API services behind hardware that performs SSL
termination is strongly recommended.</para>
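        <para>The exact approach varies by deployment; the following
            HAProxy fragment is one minimal sketch of SSL termination in
            front of an Identity API endpoint. It assumes HAProxy 1.5 or
            later, and the host names, addresses, and certificate path are
            placeholders:</para>
        <programlisting># /etc/haproxy/haproxy.cfg (illustrative sketch only)
frontend keystone_api
    bind 0.0.0.0:5000 ssl crt /etc/haproxy/certs/api.example.com.pem
    mode http
    default_backend keystone_api_nodes

backend keystone_api_nodes
    mode http
    server keystone01 192.0.2.10:5000 check</programlisting>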
        <para>Within the cloud itself, another component that needs
            security scrutiny is the hypervisor. In a public cloud,
            organizations typically do not have control over the choice of
            hypervisor. (Amazon uses its own particular version of Xen,
            for example.) In some cases, hypervisors may be vulnerable to
            a type of attack called "hypervisor breakout" if they are not
            properly secured. Hypervisor breakout describes the event of a
            compromised or malicious instance breaking out of the resource
            controls of the hypervisor and gaining access to the bare
            metal operating system and hardware resources.</para>
<para>If the security of instances is not considered important,
there may not be an issue. In most cases, however, enterprises
need to avoid this kind of vulnerability, and the only way to
do that is to avoid a situation in which the instances are
running on a public cloud. That does not mean that there is a
need to own all of the infrastructure on which an OpenStack
installation operates; it suggests avoiding situations in
which hardware may be shared with others.</para>
        <para>There are other options worth considering, such as services
            that provide bare metal instances instead of a shared cloud.
            It is also possible to replicate a second private cloud by
            integrating with a Private Cloud as a Service deployment, in
            which an organization does not buy hardware but also does not
            share it with other tenants, or to use a provider that hosts a
            bare-metal "public" cloud instance for which the hardware is
            dedicated to a single customer.</para>
<para>Finally, it is important to realize that each cloud
implements services differently. What keeps data secure in one
cloud may not do the same in another. Be sure to know the
security requirements of every cloud that handles the
organization's data or workloads.</para>
<para>More information on OpenStack Security can be found at
http://docs.openstack.org/security-guide/</para></section>
<section xml:id="utilization-hybrid"><title>Utilization</title>
        <para>When it comes to utilization, it is important that the CMP
            understands what workloads are running, where they are
            running, and their preferred utilization levels. For example,
            in most cases it is desirable to run as many workloads
            internally as possible, utilizing other resources only when
            necessary. On the other hand, situations exist in which the
            opposite is true, such as when the internal cloud is dedicated
            to development and stressing it is undesirable. In most cases,
            a cost model of various scenarios helps with this decision;
            however, this analysis is heavily influenced by internal
            priorities. The important thing is the ability to make those
            decisions efficiently and on a programmatic basis.</para>
        <para>The OpenStack Telemetry (Ceilometer) project is designed to
            provide information on the usage of various OpenStack
            components. There are two limitations to consider: first, if a
            large amount of data will be collected (for example, when
            monitoring a large or very active cloud), it is desirable to
            use a NoSQL back end for Ceilometer, such as MongoDB. Second,
            when connecting to a non-OpenStack cloud, a way is needed to
            monitor that usage and provide the monitoring data back to the
            CMP.</para>
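        <para>As a minimal sketch only, switching Ceilometer to a MongoDB
            back end is a configuration change along the following lines;
            the host name and credentials are placeholders, and option
            names assume an Icehouse-era release:</para>
        <programlisting># /etc/ceilometer/ceilometer.conf (illustrative sketch only)
[database]
connection = mongodb://ceilometer:CEILOMETER_DBPASS@mongo.example.com:27017/ceilometer</programlisting></section>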
<section xml:id="performace-hybrid"><title>Performance</title>
<para>Performance is of primary importance in the design of a
cloud. When it comes to a hybrid cloud deployment, many of the
same issues for multi-site deployments apply, such as network
latency between sites. It is also important to think about the
speed at which a workload can be spun up in another cloud, and
what can be done to reduce the time necessary to accomplish
that task. That may mean moving data closer to applications,
or conversely, applications closer to the data they process.
It may mean grouping functionality so that connections that
require low latency take place over a single cloud rather than
spanning clouds. That may also mean ensuring that the CMP has
the intelligence to know which cloud can most efficiently run
which types of workloads.</para>
        <para>As with utilization, native OpenStack tools are available to
            assist. Ceilometer can measure performance and, if necessary,
            OpenStack Orchestration via the Heat project can be used to
            react to changes in demand by spinning up more resources. It
            is important to note, however, that Orchestration requires
            special client configuration to work with the solution
            offerings from Amazon Web Services. When dealing with other
            types of clouds, it is necessary to rely on the features of
            the CMP.</para>
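        <para>As an illustrative sketch of the kind of template involved
            (the image and flavor names are placeholders, and scaling
            policies and alarms are omitted for brevity), a minimal Heat
            autoscaling group looks like the following:</para>
        <programlisting># Minimal HOT sketch only; not a complete, tuned template.
heat_template_version: 2013-05-23
resources:
  web_group:
    type: OS::Heat::AutoScalingGroup
    properties:
      min_size: 1
      max_size: 4
      resource:
        type: OS::Nova::Server
        properties:
          image: my-web-image
          flavor: m1.small</programlisting></section>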
<section xml:id="components"><title>Components</title>
        <para>The number and types of native OpenStack components that are
            available for use depend on whether the deployment is
            exclusively an OpenStack cloud. If it is, all of the
            OpenStack components are available for use, and in many
            ways the issues that need to be considered are similar to
            those that need to be considered for a multi-site
            deployment.</para>
        <para>That said, in any situation in which more than one cloud is
            being used, at least four OpenStack tools should be
            considered:</para>
<itemizedlist>
<listitem>
                <para>OpenStack Compute (Nova): Regardless of deployment
                    location, hypervisor choice has a direct effect on how
                    difficult it is to integrate with one or more
                    additional clouds. For example, integrating a Hyper-V
                    based OpenStack cloud with Azure will have fewer
                    compatibility issues than if KVM is used.</para>
</listitem>
<listitem>
<para>Networking: Whether OpenStack Networking (Neutron)
or Nova-network is used, the network is one place
where integration capabilities need to be understood
in order to connect between clouds.</para>
</listitem>
<listitem>
<para>OpenStack Telemetry (Ceilometer): Use of Ceilometer
depends, in large part, on what the other parts of the
cloud are using.</para>
</listitem>
<listitem>
<para>Orchestration module (Heat): Similarly, Heat can
be a valuable tool in orchestrating tasks a CMP
decides are necessary in an OpenStack-based
cloud.</para>
</listitem>
</itemizedlist></section>
<section xml:id="special-considerations-hybrid"><title>Special considerations</title>
<para>Hybrid cloud deployments also involve two more issues that
are not common in other situations:</para>
        <para>Image portability: Note that, as of the Icehouse release,
            there is no single common image format that is usable by all
            clouds. This means that images will need to be converted or
            recreated when porting between clouds. To make things simpler,
            launch the smallest and simplest images feasible, installing
            only what is necessary, preferably using a deployment manager
            such as Chef or Puppet. This generally means avoiding golden
            images; however, if the same images are being repeatedly
            deployed, it may make more sense to use golden images rather
            than provisioning applications on lighter images each
            time.</para>
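        <para>For simple format conversions, the qemu-img utility can
            convert between common disk image formats; whether the
            resulting image boots unmodified still depends on the target
            cloud. An illustrative example, with placeholder file
            names:</para>
        <programlisting># Convert a QCOW2 image to VMDK (illustrative example only)
$ qemu-img convert -f qcow2 -O vmdk web-server.qcow2 web-server.vmdk</programlisting>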
<para>API differences: The most profound issue that cannot be
avoided when using a hybrid cloud deployment with more than
just OpenStack (or with different versions of OpenStack) is
that the APIs needed to perform certain functions are
different. The CMP needs to know how to handle all necessary
versions. To get around this issue, some implementers build
portals to achieve a hybrid cloud environment, but a heavily
developer-focused organization will get more use out of a
hybrid cloud broker SDK such as jClouds.</para></section>
</section>

View File

@ -0,0 +1,314 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="user-requirements-hybrid">
<?dbhtml stop-chunking?>
<title>User Requirements</title>
<para>Hybrid cloud architectures introduce additional
complexities, particularly those that use heterogeneous cloud
platforms. As a result, it is important to make sure that
design choices match requirements in such a way that the
benefits outweigh the inherent additional complexity and
risks.</para>
<para>Business considerations to make when designing a hybrid
cloud deployment include:</para>
<itemizedlist>
<listitem>
<para>Cost: A hybrid cloud architecture involves multiple
vendors and technical architectures. These
architectures may be more expensive to deploy and
maintain. Operational costs can be higher because of
the need for more sophisticated orchestration and
brokerage tools than in other architectures. In
contrast, overall operational costs might be lower by
virtue of using a cloud brokerage tool to deploy the
workloads to the most cost effective platform.</para>
</listitem>
<listitem>
                <para>Revenue opportunity: Revenue opportunities vary
                    greatly based on the intent and use case of the cloud.
                    If it is being built as a commercial customer-facing
                    product, consider the drivers for building it over
                    multiple platforms and whether the use of multiple
                    platforms makes the design more attractive to target
                    customers, thus enhancing the revenue
                    opportunity.</para>
</listitem>
<listitem>
                <para>Time to market: One of the most common reasons to
                    use cloud platforms is to speed the time to market of
                    a new product or application. A business may require
                    the use of multiple cloud platforms because there is
                    an existing investment in several applications and it
                    is faster to tie them together than to migrate
                    components and refactor onto a single
                    platform.</para>
</listitem>
<listitem>
<para>Business or technical diversity: Organizations
already leveraging cloud-based services may wish to
embrace business diversity and utilize a hybrid cloud
design to spread their workloads across multiple cloud
providers so that no application is hosted in a single
cloud provider.</para>
</listitem>
<listitem>
<para>Application momentum: A business with existing
applications that are already in production on
multiple cloud environments may find that it is more
cost effective to integrate the applications on
multiple cloud platforms rather than migrate them to a
single platform.</para>
</listitem>
</itemizedlist>
<section xml:id="legal-requirements-hybrid"><title>Legal Requirements</title>
<para>Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:</para>
<itemizedlist>
<listitem>
<para>Data retention policies ensuring storage of
persistent data and records management to meet data
archival requirements.</para>
</listitem>
<listitem>
<para>Data ownership policies governing the possession and
responsibility for data.</para>
</listitem>
<listitem>
<para>Data sovereignty policies governing the storage of
data in foreign countries or otherwise separate
jurisdictions.</para>
</listitem>
<listitem>
                <para>Data compliance policies governing the types of
                    information that must reside in certain locations due
                    to regulatory issues and, more importantly, cannot
                    reside in other locations for the same reason.</para>
</listitem>
</itemizedlist>
<para>Examples of such legal frameworks include the data
protection framework of the European Union
(http://ec.europa.eu/justice/data-protection/) and the
requirements of the Financial Industry Regulatory Authority
(http://www.finra.org/Industry/Regulation/FINRARules/) in the
United States. Consult a local regulatory body for more
information.</para></section>
<section xml:id="workload-considerations"><title>Workload Considerations</title>
<para>Defining what the word "workload" means in the context of a
hybrid cloud environment is important. Workload can be defined
as the intended way the systems will be utilized, which is
often referred to as a “use case.” A workload can be a single
application or a suite of applications that work in concert.
It can also be a duplicate set of applications that need to
run on multiple cloud environments. In a hybrid cloud
deployment, the same workload will often need to function
equally well on radically different public and private cloud
environments. The architecture needs to address these
potential conflicts, complexity, and platform
incompatibilities.</para>
<para>Some possible use cases for a hybrid cloud architecture
include:</para>
<itemizedlist>
<listitem>
                <para>Dynamic resource expansion or "bursting": Another
                    common reason to use a multiple cloud architecture is
                    a "bursty" application that needs additional resources
                    at times. An example of this case could be a retailer
                    that needs additional resources during the holiday
                    selling season, but does not want to build expensive
                    cloud resources to meet the peak demand. They might
                    have an OpenStack private cloud but want to burst to
                    AWS or some other public cloud for these peak load
                    periods. These bursts could be for long or short
                    cycles, ranging from hourly to yearly.</para>
</listitem>
<listitem>
                <para>Disaster recovery and business continuity: Cheaper
                    storage and instance management make a good case for
                    using the cloud as a secondary site. The public cloud
                    is already heavily used for these purposes in
                    combination with an OpenStack public or private
                    cloud.</para>
</listitem>
<listitem>
<para>Federated hypervisor-instance management: Adding
self-service, charge back and transparent delivery of
the right resources from a federated pool can be cost
effective. In a hybrid cloud environment, this is a
particularly important consideration. Look for a cloud
that provides cross-platform hypervisor support and
robust instance management tools.</para>
</listitem>
<listitem>
<para>Application portfolio integration: An enterprise
cloud delivers better application portfolio management
and more efficient deployment by leveraging
self-service features and rules for deployments based
on types of use. A common driver for building hybrid
cloud architecture is to stitch together multiple
existing cloud environments that are already in
production or development.<!-- In the interest of time to
market, the requirements may be to maintain the
multiple clouds and just integrate the pieces
together, not rationalize to one cloud environment, but
instead to --></para>
</listitem>
<listitem>
<para>Migration scenarios: A common reason to create a
hybrid cloud architecture is to allow the migration of
applications between different clouds. This may be
because the application will be migrated permanently
to a new platform, or it might be because the
application needs to be supported on multiple
platforms going forward.</para>
</listitem>
<listitem>
<para>High availability: Another important reason for
wanting a multiple cloud architecture is to address
the needs for high availability. By using a
combination of multiple locations and platforms, a
design can achieve a level of availability that is not
possible with a single platform. This approach does
add a significant amount of complexity.</para>
</listitem>
</itemizedlist>
<para>In addition to thinking about how the workload will work on
a single cloud, the design must accommodate the added
complexity of needing the workload to run on multiple cloud
platforms. The complexity of transferring workloads across
clouds needs to be explored at the application, instance,
cloud platform, hypervisor, and network levels.</para></section>
<section xml:id="tools-considerations-hybrid"><title>Tools Considerations</title>
        <para>When working with designs spanning multiple clouds, the
            design must incorporate tools to facilitate working across
            those multiple clouds. Some of the user requirements drive the
            need for tools that perform the following functions:</para>
<itemizedlist>
<listitem>
                <para>Broker between clouds: Since the multiple cloud
                    architecture assumes that there will be at least two
                    different and possibly incompatible platforms that are
                    likely to have different costs, brokering software is
                    designed to evaluate relative costs between different
                    cloud platforms. These solutions are sometimes
                    referred to as Cloud Management Platforms (CMPs).
                    Examples include RightScale, Gravitant, Scalr,
                    CloudForms, and ManageIQ. These tools allow the
                    designer to determine the right location for the
                    workload based on predetermined criteria.</para>
</listitem>
<listitem>
                <para>Facilitate orchestration across the clouds: CMPs are
                    the tools used to tie everything together. Cloud
                    orchestration tools improve the management of IT
                    application portfolios as they migrate onto public,
                    private, and hybrid cloud platforms, and are an
                    important consideration when managing a diverse
                    portfolio of installed systems across multiple cloud
                    platforms. The typical enterprise IT application
                    portfolio is still composed of a few thousand
                    applications scattered over legacy hardware,
                    virtualized infrastructure, and now dozens of
                    disjointed shadow public Infrastructure-as-a-Service
                    (IaaS) and Software-as-a-Service (SaaS) providers and
                    offerings.</para>
</listitem>
</itemizedlist></section>
<section xml:id="network-considerations-hybrid"><title>Network Considerations</title>
        <para>The network services functionality is an important factor to
            assess when choosing a CMP and cloud provider. Considerations
            include functionality, security, scalability, and high
            availability (HA). Verification and ongoing testing of the
            critical features of the cloud endpoint used by the
            architecture are important tasks.</para>
<itemizedlist>
<listitem>
                <para>Once the network functionality framework has been
                    decided, a minimum functionality test should be
                    designed to ensure that functionality persists, and
                    can be verified, during and after upgrades.</para>
</listitem>
<listitem>
                <para>Scalability across multiple cloud providers may
                    dictate the choice of underlying network framework in
                    the different cloud providers. It is important to
                    have the network API functions presented and to
                    verify that the functionality persists across all
                    cloud endpoints chosen.</para>
</listitem>
<listitem>
                <para>High availability implementations vary in
                    functionality and design. Examples of some common
                    methods are Active-Hot-Standby, Active-Passive, and
                    Active-Active. A high availability test framework
                    needs to be developed to ensure that the functionality
                    and limitations are well understood.</para>
</listitem>
<listitem>
                <para>Security considerations include protecting data in
                    transit between the client and endpoint, as well as
                    any traffic that traverses the multiple clouds,
                    against threats ranging from eavesdropping to denial
                    of service (DoS) activities.</para>
</listitem>
</itemizedlist></section>
<section xml:id="risk-mitigation-management-hybrid"><title>Risk Mitigation and Management
Considerations</title>
        <para>Hybrid cloud architectures introduce additional risk because
            they are more complex and potentially involve conflicting or
            incompatible components or tools. However, they also reduce
            risk by spreading workloads over multiple providers. This
            means that, if one provider were to go out of business, the
            organization could remain operational.</para>
<para>Risks that will be heightened by using a hybrid cloud
architecture include:</para>
<itemizedlist>
<listitem>
                <para>Provider availability or implementation details:
                    This can range from the company going out of business
                    to the company changing how it delivers its services.
                    Cloud architectures are inherently designed to be
                    flexible and changeable; paradoxically, the cloud is
                    perceived to be both rock solid and ever flexible at
                    the same time.</para>
</listitem>
<listitem>
<para>Differing SLAs: Users of hybrid cloud environments
potentially encounter some losses through differences
in service level agreements. A hybrid cloud design
needs to accommodate the different SLAs provided by
the various clouds involved in the design, and must
address the actual enforceability of the providers'
SLAs.</para>
</listitem>
<listitem>
                <para>Security levels: Securing multiple cloud
                    environments is more complex than securing a single
                    cloud environment. Concerns need to be addressed at,
                    but are not limited to, the application, network, and
                    cloud platform levels. One issue is that different
                    cloud platforms approach security differently, and a
                    hybrid cloud design must address and compensate for
                    these differences. For example, AWS uses a relatively
                    simple model that relies on user privilege combined
                    with firewalls.</para>
</listitem>
<listitem>
<para>Provider API changes: APIs are crucial in a hybrid
cloud environment. As a consumer of a provider's cloud
services, an organization will rarely have any control
over provider changes to APIs. Cloud services that
might have previously had compatible APIs may no
longer work. This is particularly a problem with AWS
and OpenStack AWS-compatible APIs. OpenStack was
originally planned to maintain compatibility with
changes in AWS APIs. However, over time, the APIs have
become more divergent in functionality. One way to
address this issue is to focus on using only the most
common and basic APIs to minimize potential
conflicts.</para>
</listitem>
</itemizedlist></section>
</section>

View File

@ -0,0 +1,97 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-how-this-book-is-organized">
<title>How this Book is Organized</title>
    <para>This book is organized into chapters that help
        define the use cases associated with making architectural
        choices related to an OpenStack cloud installation. Each
        chapter is intended to stand alone to encourage individual
        chapter readability; however, each chapter also contains
        useful information that may be applicable in
        situations covered by other chapters. Cloud architects may use
        this book as a comprehensive guide by reading all of the use
        cases, but it is also possible to review only the chapters
        which pertain to a specific use case. When choosing to read
        specific use cases, note that it may be necessary to read more
        than one section of the guide to formulate a complete design
        for the cloud. The use cases covered in this guide
        include:</para>
<itemizedlist>
<listitem>
<para>General purpose: A cloud built with common
components that should address 80% of common use
cases.</para>
</listitem>
<listitem>
<para>Compute focused: A cloud designed to address compute
intensive workloads such as high performance computing
(HPC).</para>
</listitem>
<listitem>
<para>Storage focused: A cloud focused on storage
intensive workloads such as data analytics with
parallel file systems.</para>
</listitem>
<listitem>
<para>Network focused: A cloud depending on high
performance and reliable networking, such as a content
delivery network (CDN).</para>
</listitem>
<listitem>
<para>Multi-site: A cloud built with multiple sites
available for application deployments for
geographical, reliability or data locality
reasons.</para>
</listitem>
<listitem>
<para>Hybrid cloud: An architecture where multiple
disparate clouds are connected either for failover,
hybrid cloud bursting, or availability.</para>
</listitem>
<listitem>
            <para>Massively scalable: An architecture that is intended
                for cloud service providers or other extremely large
                installations.</para>
</listitem>
</itemizedlist>
<para>A section titled Specialized Use Cases provides information
on architectures that have not previously been covered in the
defined use cases.</para>
<para>Each chapter in the guide is then further broken down into
the following sections:</para>
<itemizedlist>
<listitem>
<para>Introduction: Provides an overview of the
architectural use case.</para>
</listitem>
<listitem>
<para>User requirements: Defines the set of user
considerations that typically come into play for that
use case.</para>
</listitem>
<listitem>
            <para>Technical considerations: Covers the technical
                issues that must be accounted for when dealing with this
                use case.</para>
</listitem>
<listitem>
<para>Operational considerations: Covers the ongoing
operational tasks associated with this use case and
architecture.</para>
</listitem>
<listitem>
<para>Architecture: Covers the overall architecture
associated with the use case.</para>
</listitem>
<listitem>
<para>Prescriptive examples: Presents one or more
scenarios where this architecture could be
deployed.</para>
</listitem>
</itemizedlist>
<para>A Glossary covers the terms and phrases used in the
book.</para>
</section>

View File

@ -0,0 +1,88 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-why-and-who-we-wrote-this-book">
<title>Why and How We Wrote this Book</title>
<para>The velocity at which OpenStack environments are moving from
proof-of-concepts to production deployments is leading to
increasing questions and issues related to architecture design
considerations. By and large these considerations are not
addressed in the existing documentation, which typically
focuses on the specifics of deployment and configuration
options or operational considerations, rather than the bigger
picture.</para>
<para>We wrote this book to guide readers in designing an
OpenStack architecture that meets the needs of their
organization. This guide concentrates on identifying important
design considerations for common cloud use cases and provides
examples based on these design guidelines. This guide does not
aim to provide explicit instructions for installing and
configuring the cloud, but rather focuses on design principles
as they relate to user requirements as well as technical and
operational considerations. For specific guidance with
installation and configuration there are a number of resources
already available in the OpenStack documentation that help in
that area.</para>
<para>This book was written in a book sprint format, which is a
facilitated, rapid development production method for books.
For more information, see the Book Sprints website
(www.booksprints.net).</para>
<para>This book was written in five days during July 2014 while
exhausting the M&amp;M, Mountain Dew and healthy options
supply, complete with juggling entertainment during lunches at
VMware's headquarters in Palo Alto. The event was also
documented on Twitter using the #OpenStackDesign hashtag. The
Book Sprint was facilitated by Faith Bosworth and Adam
Hyde.</para>
    <para>We would like to thank VMware for their generous
        hospitality, as well as our employers, Cisco, Cloudscaling,
        Comcast, EMC, Mirantis, Rackspace, Red Hat, Verizon, and
        VMware, for enabling us to contribute our time. We would
        especially like to thank Anne Gentle and Kenneth Hui for all
        of their shepherding and organization in making this
        happen.</para>
<para>The author team includes:</para>
<itemizedlist>
<listitem>
<para>Kenneth Hui (EMC) @hui_kenneth</para>
</listitem>
<listitem>
<para>Alexandra Settle (Rackspace) @dewsday</para>
</listitem>
<listitem>
<para>Anthony Veiga (Comcast) @daaelar</para>
</listitem>
<listitem>
<para>Beth Cohen (Verizon) @bfcohen</para>
</listitem>
<listitem>
<para>Kevin Jackson (Rackspace) @itarchitectkev</para>
</listitem>
<listitem>
<para>Maish Saidel-Keesing (Cisco) @maishsk</para>
</listitem>
<listitem>
<para>Nick Chase (Mirantis) @NickChase</para>
</listitem>
<listitem>
<para>Scott Lowe (VMware) @scott_lowe</para>
</listitem>
<listitem>
<para>Sean Collins (Comcast) @sc68cal</para>
</listitem>
<listitem>
<para>Sean Winn (Cloudscaling) @seanmwinn</para>
</listitem>
<listitem>
<para>Sebastian Gutierrez (Red Hat) @gutseb</para>
</listitem>
<listitem>
<para>Stephen Gordon (Red Hat) @xsgordon</para>
</listitem>
<listitem>
<para>Vinny Valdez (Red Hat) @VinnyValdez</para>
</listitem>
</itemizedlist>
</section>

View File

@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intended-audience">
<title>Intended Audience</title>
<para>This book has been written for architects and designers of
OpenStack clouds. This book is not intended for people who are
deploying OpenStack. For a guide on deploying and operating
OpenStack, please refer to the Operations Guide
http://docs.openstack.org/openstack-ops.</para>
<para>The reader should have prior knowledge of cloud architecture
and principles, experience in enterprise system design, Linux
and virtualization experience, and a basic understanding of
networking principles and protocols.</para>
</section>

View File

@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-to-openstack-arch-design-guide">
<title>Introduction to the OpenStack Architecture Design
Guide</title>
<para>OpenStack is a leader in the cloud technology gold rush, as
organizations of all stripes discover the increased
flexibility and speed to market that self-service cloud and
Infrastructure as a Service (IaaS) provides. To truly reap
those benefits, however, the cloud must be designed and
architected properly.</para>
<para>A well-architected cloud provides a stable IT environment
that offers easy access to needed resources, usage-based
expenses, extra capacity on demand, disaster recovery, and a
secure environment, but a well-architected cloud does not
magically build itself. It requires careful consideration of a
multitude of factors, both technical and non-technical.</para>
<para>There is no single architecture that is "right" for an
OpenStack cloud deployment. OpenStack can be used for any
number of different purposes, and each of them has its own
particular requirements and architectural
peculiarities.</para>
<para>This book is designed to look at some of the most common
uses for OpenStack clouds (and even some that are less common,
but provide a good example) and explain what issues need to be
considered and why, along with a wealth of knowledge and
advice to help an organization to design and build a
well-architected OpenStack cloud that will fit its unique
requirements.</para>
</section>

View File

@ -0,0 +1,232 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="methodology">
<title>Methodology</title>
    <para>The magic of the cloud is that it can do almost anything. It is
        both robust and flexible, the best of both worlds. But to get the
        most out of a cloud investment, it is important to define how the
        cloud will be used by creating and testing use cases. This chapter
        describes the thought process behind designing a cloud
        architecture that best suits the intended use.</para>
<mediaobject>
<imageobject>
<imagedata fileref="../images/Methodology.png"/>
</imageobject>
</mediaobject>
<para>The diagram shows at a very abstract level the process for capturing
requirements and building use cases. Once a set of use cases has been
defined, it can then be used to design the cloud architecture.</para>
    <para>Use case planning can seem counter-intuitive. After all, it takes
        about five minutes to sign up for a server with Amazon. Amazon does not
        know in advance what any given user is planning on doing with it, right?
        Wrong. Amazon's product management department spends plenty of time
        figuring out exactly what would be attractive to their typical customer
        and honing the service to deliver it. For the enterprise, the planning
        process is no different, but instead of planning for an external paying
        customer, for example, the use could be for internal application
        developers or a web portal. The following is a list of the high-level
        objectives that need to be incorporated into the thinking about creating
        a use case.</para>
<para>Overall business objectives</para>
<itemizedlist>
<listitem>
            <para>Develop a clear definition of business goals and
                requirements</para>
</listitem>
<listitem>
<para>Increase project support and engagement with business,
customers and end users.</para>
</listitem>
</itemizedlist>
<para>Technology</para>
<itemizedlist>
<listitem>
<para>Coordinate the OpenStack architecture across the project and
leverage OpenStack community efforts more effectively.</para>
</listitem>
<listitem>
<para>Architect for automation as much as possible to speed
development and deployment.</para>
</listitem>
<listitem>
<para>Use the appropriate tools for the development effort.</para>
</listitem>
<listitem>
<para>Create better and more test metrics and test harnesses to
support continuous and integrated development, test processes
and automation.</para>
</listitem>
</itemizedlist>
<para>Organization</para>
<itemizedlist>
<listitem>
<para>Better messaging of management support of team efforts</para>
</listitem>
<listitem>
            <para>Develop a better cultural understanding of Open Source, cloud
                architectures, Agile methodologies, continuous development, test
                and integration, and overall development concepts in general</para>
</listitem>
</itemizedlist>
    <para>As an example of how this works, consider a business goal of using the
        cloud for the company's e-commerce website. This goal means planning for
        applications that will support thousands of sessions per second,
        variable workloads, and lots of complex and changing data. By
        identifying the key metrics, such as number of concurrent transactions
        per second, size of database, and so on, it is possible to then build a
        method for testing the assumptions.</para>
    <para>Develop functional user scenarios: Develop functional user scenarios
        that can be used to develop test cases for measuring overall project
        trajectory. If the organization is not ready to commit
        to an application or applications that can be used to develop user
        requirements, it needs to create requirements to build valid test
        harnesses and develop usable metrics. Once the metrics are established,
        as requirements change, it is easier to respond to the changes quickly
        without having to worry overly much about setting the exact requirements
        in advance. Think of this as creating ways to configure the system,
        rather than redesigning it every time there is a requirements change.</para>
<para>Limit cloud feature set: Create requirements that address the pain
points, but do not recreate the entire OpenStack tool suite. The
requirement to build OpenStack, only better, is self-defeating. It is
important to limit scope creep by concentrating on developing a platform
that will address tool limitations for the requirements, but not
recreating the entire suite of tools. Work with technical product owners
to establish critical features that are needed for a successful cloud
deployment.</para>
<section xml:id="application-cloud-readiness-methods">
<title>Application Cloud Readiness</title>
<para>Although the cloud is designed to make things easier, it is
important to realize that "using cloud" is more than just firing up
an instance and dropping an application on it. The "lift and shift"
approach works in certain situations, but there is a fundamental
difference between clouds and traditional bare-metal-based
environments, or even traditional virtualized environments.</para>
<para>In traditional environments, with traditional enterprise
applications, the applications and the servers that run on them are
"pets". They're lovingly crafted and cared for, the servers have
names like Gandalf or Tardis, and if they get sick, someone nurses
them back to health. All of this is designed so that the application
does not experience an outage.</para>
<para>In cloud environments, on the other hand, servers are more like
cattle. There are thousands of them, they get names like NY-1138-Q,
and if they get sick, they get put down and a sysadmin installs
another one. Traditional applications that are unprepared for this
kind of environment, naturally will suffer outages, lost data, or
worse.</para>
        <para>There are other reasons to design applications with cloud in mind.
            Some are defensive: because applications cannot be
            certain of exactly where or on what hardware they will be launched,
            they need to be flexible, or at least adaptable. Others are
            proactive. For example, one of the advantages of using the cloud is
            scalability, so applications need to be designed in such a way that
            they can take advantage of those and other opportunities.</para>
</section>
<section xml:id="determining-whether-an-application-is-cloud-ready">
<title>Determining whether an application is cloud-ready</title>
<para>There are several factors to take into consideration when looking
at whether an application is a good fit for the cloud.</para>
<para>Structure: A large, monolithic, single-tiered legacy application
typically isn't a good fit for the cloud. Efficiencies are gained
when load can be spread over several instances, so that a failure in
one part of the system can be mitigated without affecting other
parts of the system, or so that scaling can take place where the app
needs it.</para>
<para>Dependencies: Applications that depend on specific hardware --
such as a particular chip set or an external device such as a
fingerprint reader -- might not be a good fit for the cloud, unless
those dependencies are specifically addressed. Similarly, if an
application depends on an operating system or set of libraries that
cannot be used in the cloud, or cannot be virtualized, that is a
problem.</para>
        <para>Connectivity: Applications that are not self-contained and
            depend on resources that are not reachable by the cloud in
            question will not run. In some situations, it is possible to work
            around these issues with a custom network setup, but how well this
            works depends on the chosen cloud environment.</para>
        <para>Durability and resilience: Despite the existence of SLAs, the one
            reality of the cloud is that Things Break. Servers go down, network
            connections are disrupted, or other tenants on a server ramp up the
            load, making the server unusable. Any number of things can happen,
            and an application that isn't built to withstand this kind of
            disruption isn't going to work properly.</para>
</section>
<section xml:id="designing-for-the-cloud">
<title>Designing for the cloud</title>
<para>Here are some guidelines to keep in mind when designing an
application for the cloud:</para>
<itemizedlist>
<listitem>
<para>Be a pessimist: Assume everything fails and design
backwards. Love your chaos monkey.</para>
</listitem>
<listitem>
<para>Put your eggs in multiple baskets: Leverage multiple
providers, geographic regions and availability zones to
accommodate for local availability issues. Design for
portability.</para>
</listitem>
<listitem>
<para>Think efficiency: Inefficient designs will not scale.
Efficient designs become cheaper as they scale. Kill off
unneeded components or capacity.</para>
</listitem>
<listitem>
<para>Be paranoid: Design for defense in depth and zero
tolerance by building in security at every level and between
every component. Trust no one.</para>
</listitem>
<listitem>
<para>But not too paranoid: Not every application needs the
platinum solution. Architect for different SLAs, service
tiers and security levels.</para>
</listitem>
<listitem>
            <para>Manage the data: Data is usually the most inflexible and
                complex area of a cloud and cloud integration architecture.
                Don't shortchange the effort in analyzing and addressing
                data needs.</para>
</listitem>
<listitem>
<para>Hands off: Leverage automation to increase consistency and
quality and reduce response times.</para>
</listitem>
<listitem>
<para>Divide and conquer: Pursue partitioning and
parallel layering wherever possible. Make components as small
and portable as possible. Use load balancing between layers.
</para>
</listitem>
<listitem>
<para>Think elasticity: Increasing resources should result in a
proportional increase in performance and scalability.
Decreasing resources should have the opposite effect.
</para>
</listitem>
<listitem>
<para>Be dynamic: Enable dynamic configuration changes such as
auto scaling, failure recovery and resource discovery to
adapt to changing environments, faults and workload volumes.
</para>
</listitem>
<listitem>
<para>Stay close: Reduce latency by moving highly interactive
components and data near each other.</para>
</listitem>
<listitem>
<para>Keep it loose: Loose coupling, service interfaces,
separation of concerns, abstraction and well defined APIs
deliver flexibility.</para>
</listitem>
<listitem>
<para>Be cost aware: Autoscaling, data transmission, virtual
software licenses, reserved instances, and so on can rapidly
increase monthly usage charges. Monitor usage closely.
</para>
</listitem>
</itemizedlist>
</section>
</section>

View File

@ -0,0 +1,75 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-massive-scale">
<title>Introduction</title>
    <para>A massively scalable architecture is defined as a cloud
        implementation that is either a very large deployment, such as
        one that would be built by a commercial service provider, or
        one that has the capability to support user requests for large
        amounts of cloud resources. An example would be an
        infrastructure in which requests to service 500 or more
        instances at a time are not uncommon. In a massively scalable
        infrastructure, such a request is fulfilled without completely
        consuming all of the available cloud infrastructure resources.
        While the high capital cost of implementing such a cloud
        architecture means that it is currently only being pursued by a
        few organizations, many organizations are planning for massive
        scalability in the future.</para>
<para>A massively scalable OpenStack cloud design presents a
unique set of challenges and considerations. For the most part
it is similar to a general purpose cloud architecture, as it
is built to address a non-specific range of potential use
cases or functions. Typically, it is rare that massively
scalable clouds are designed or specialized for particular
workloads. Like the general purpose cloud, the massively
scalable cloud is most often built as a platform for a variety
of workloads. Massively scalable OpenStack clouds are
generally built as commercial public cloud offerings since
single private organizations rarely have the resources or need
for this scale.</para>
<para>Services provided by a massively scalable OpenStack cloud
will include:</para>
<itemizedlist>
<listitem>
<para>Virtual-machine disk image library</para>
</listitem>
<listitem>
<para>Raw block storage</para>
</listitem>
<listitem>
<para>File or object storage</para>
</listitem>
<listitem>
<para>Firewall functionality</para>
</listitem>
<listitem>
<para>Load balancing functionality</para>
</listitem>
<listitem>
<para>Private (non-routable) and public (floating) IP
addresses</para>
</listitem>
<listitem>
<para>Virtualized network topologies</para>
</listitem>
<listitem>
<para>Software bundles</para>
</listitem>
<listitem>
<para>Virtual compute resources</para>
</listitem>
</itemizedlist>
    <para>Like a general purpose cloud, the instances deployed in a
        massively scalable OpenStack cloud will not necessarily use
        any specific aspect of the cloud offering (compute, network,
        or storage). As the cloud grows, the sheer number of
        workloads can cause stress on all of the cloud
        components. Additional stresses are introduced to supporting
        infrastructure, including databases and message brokers. The
        architecture design for such a cloud must account for these
        performance pressures without negatively impacting user
        experience.</para>
</section>

View File

@ -0,0 +1,99 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="operational-considerations-massive-scale">
<?dbhtml stop-chunking?>
<title>Operational Considerations</title>
    <para>In order to run at massive scale, it is important to plan on
        the automation of as many of the operational processes as
        possible. Automation includes the configuration of
        provisioning, monitoring, and alerting systems. Part of the
        automation process includes the capability to determine when
        human intervention is required and who should act. The
        objective is to increase the ratio of running systems to
        operational staff as much as possible to reduce maintenance
        costs. In a massively scaled environment, it is impossible for
        staff to give each system individual care.</para>
<para>Configuration management tools such as Puppet or Chef allow
operations staff to categorize systems into groups based on
their role and thus create configurations and system states
that are enforced through the provisioning system. Systems
that fall out of the defined state due to errors or failures
are quickly removed from the pool of active nodes and
replaced.</para>
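    <para>As a minimal sketch only, role-based classification in Puppet
        might look like the following; the openstack::compute and
        openstack::controller classes are hypothetical stand-ins for
        site-specific modules:</para>
    <programlisting># Illustrative Puppet sketch only: classify systems by role.
# The included classes are hypothetical site-specific modules.
node /^compute\d+\.example\.com$/ {
  include openstack::compute
}
node /^controller\d+\.example\.com$/ {
  include openstack::controller
}</programlisting>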
    <para>At large scale, the resource cost of diagnosing individual
        systems that have failed is far greater than the cost of
        replacement. It is more economical to immediately replace the
        system with a new system that can be provisioned and
        configured automatically and quickly brought back into the
        pool of active nodes. By automating tasks that are labor
        intensive, repetitive, and critical to operations, cloud
        operations teams can work more efficiently because fewer staff
        are needed for these babysitting tasks. Administrators are
        then free to tackle tasks that cannot be easily automated and
        have longer-term impacts on the business, such as capacity
        planning.</para>
<section xml:id="the-bleeding-edge"><title>The Bleeding Edge</title>
    <para>Running OpenStack at massive scale requires striking a
        balance between stability and features. For example, it might
        be tempting to run an older stable release branch of OpenStack
        to make deployments easier. However, when running at massive
        scale, known issues that may be of some concern or only have
        minimal impact in smaller deployments could become pain
        points. If the issue is well known, in many cases it may be
        resolved in more recent releases. The OpenStack community can
        help resolve any reported issues by applying the collective
        expertise of the OpenStack developers.</para>
    <para>When issues crop up, the number of organizations running at
        a similar scale is a relatively tiny proportion of the
        OpenStack community; therefore, it is important to share these
        issues with the community and be a vocal advocate for
        resolving them. Some issues only manifest when operating at
        large scale, and the number of organizations able to duplicate
        and validate an issue is small, so it is important to
        document the issues and dedicate resources to their
        resolution.</para>
    <para>In some cases, the resolution to the problem is ultimately
        to deploy a more recent version of OpenStack. Alternatively,
        when the issue needs to be resolved in a production
        environment where rebuilding the entire environment is not an
        option, it is sometimes possible to deploy only more recent
        versions of the separate underlying components required to
        resolve issues or gain significant performance improvements.
        At first glance, this could be perceived as exposing the
        deployment to increased risk and instability; however, in many
        cases the alternative is running with an issue that simply has
        not been discovered yet.</para>
    <para>It is advisable to cultivate a development and operations
        organization that is responsible for creating desired
        features, diagnosing and resolving issues, and building the
        infrastructure for large scale continuous integration tests
        and continuous deployment. This helps catch bugs early and
        makes deployments quicker and less painful. In addition to
        development resources, the recruitment of experts in the
        fields of message queues, databases, distributed systems,
        networking, cloud, and storage is also advisable.</para>
<section xml:id="growth-and-capacity-planning"><title>Growth and Capacity Planning</title>
    <para>An important consideration in running at massive scale is
        projecting growth and utilization trends to plan capital
        expenditures for the near and long term. Utilization metrics
        for compute, network, and storage, as well as a historical
        record of these metrics, are required. While securing major
        anchor tenants can lead to rapid jumps in the utilization
        rates of all resources, the steady adoption of the cloud
        inside an organization or by public consumers in a public
        offering will also create a steady trend of increased
        utilization.</para></section>
<section xml:id="skills-and-training"><title>Skills and Training</title>
    <para>Projecting growth for storage, networking, and compute is
        only one aspect of a growth plan for running OpenStack at
        massive scale. Growing and nurturing development and
        operational staff is an additional consideration. Sending team
        members to OpenStack conferences and meetup events, and
        encouraging active participation in the mailing lists and
        committees, are important ways to maintain skills and
        forge relationships in the community. A list of OpenStack
        training providers in the marketplace can be found here:
        http://www.openstack.org/marketplace/training/.</para>
</section>
</section>

View File

@ -0,0 +1,127 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="technical-considerations-massive-scale">
<?dbhtml stop-chunking?>
<title>Technical Considerations</title>
<para>Converting an existing OpenStack environment that was
designed for a different purpose to be massively scalable is a
formidable task. When building a massively scalable
environment from the ground up, make sure the initial
deployment is built with the same principles and choices that
apply as the environment grows. For example, a good approach
is to deploy the first site as a multi-site environment. This
allows the same deployment and segregation methods to be used
as the environment grows to separate locations across
dedicated links or wide area networks. In a hyperscale cloud,
scale trumps redundancy. Applications must be modified with
this in mind, relying on the scale and homogeneity of the
environment to provide reliability rather than redundant
infrastructure provided by non-commodity hardware
solutions.</para>
<section xml:id="infrastructure-segregation-massive-scale"><title>Infrastructure Segregation</title>
<para>Fortunately, OpenStack services are designed to support
massive horizontal scale. Be aware that this is not the case
for the entire supporting infrastructure. This is particularly
a problem for the database management systems and message
queues used by the various OpenStack services for data storage
and remote procedure call communications.</para>
    <para>Traditional clustering techniques are typically used to
        provide high availability and some additional scale for these
        environments. In the quest for massive scale, however,
        additional steps need to be taken to relieve the performance
        pressure on these components to prevent them from negatively
        impacting the overall performance of the environment. It is
        important to make sure that all the components are in balance
        so that if and when the massively scalable environment reaches
        its limits, all the components are at, or close to, maximum
        capacity at the same time.</para>
    <para>Regions are used to segregate completely independent
        installations linked only by a shared Identity service and,
        optionally, Dashboard installation. Services are installed
        with separate API endpoints for each region, complete with
        separate database and queue installations. This exposes some
        awareness of the environment's fault domains to users and
        gives them the ability to ensure some degree of application
        resiliency, while also imposing the requirement to specify
        which region their actions must be applied to.</para>
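    <para>As an illustrative sketch only, per-region endpoints are
        registered separately in the Identity service; the service ID
        and URLs below are placeholders, and the command syntax
        assumes the Icehouse-era keystone client:</para>
    <programlisting># Illustrative sketch only: one endpoint per region for a service.
$ keystone endpoint-create --region RegionOne \
    --service-id $NOVA_SERVICE_ID \
    --publicurl "http://compute.r1.example.com:8774/v2/%(tenant_id)s"
$ keystone endpoint-create --region RegionTwo \
    --service-id $NOVA_SERVICE_ID \
    --publicurl "http://compute.r2.example.com:8774/v2/%(tenant_id)s"</programlisting>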
<para>Environments operating at massive scale typically need their
regions or sites subdivided further without exposing the
requirement to specify the failure domain to the user. This
provides the ability to further divide the installation into
failure domains while also providing a logical unit for
maintenance and the addition of new hardware. At hyperscale,
instead of adding single compute nodes, administrators may add
entire racks or even groups of racks at a time with each new
addition of nodes exposed via one of the segregation concepts
mentioned herein.</para>
<para>Cells provide the ability to subdivide the compute portion
of an OpenStack installation, including regions, while still
exposing a single endpoint. In each region an API cell is
created along with a number of compute cells where the
workloads actually run. Each cell gets its own database and
message queue setup (ideally clustered), providing the ability
to subdivide the load on these subsystems, improving overall
performance.</para>
<para>Within each compute cell a complete compute installation is
provided, with its own database and queue installations,
scheduler, conductor, and multiple compute hosts. The cells
scheduler handles placement of user requests from the single
API endpoint to a specific cell from those available. The
normal filter scheduler then handles placement within the
cell.</para>
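<para>As a hedged sketch of how a cell is wired together (the
cell name, RabbitMQ credentials and host, and the use of the
<literal>openstack-config</literal> utility are illustrative
assumptions, not prescribed values), the API cell is marked
in <filename>nova.conf</filename> and each child compute
cell is then registered with it:</para>
<screen><prompt>#</prompt> <userinput>openstack-config --set /etc/nova/nova.conf DEFAULT compute_api_class nova.compute.cells_api.ComputeCellsAPI</userinput>
<prompt>#</prompt> <userinput>openstack-config --set /etc/nova/nova.conf cells enable True</userinput>
<prompt>#</prompt> <userinput>openstack-config --set /etc/nova/nova.conf cells cell_type api</userinput>
<prompt>#</prompt> <userinput>nova-manage cell create --name=cell1 --cell_type=child \
  --username=guest --password=guest --hostname=cell1-rabbit \
  --port=5672 --virtual_host=/ --woffset=1.0 --wscale=1.0</userinput></screen>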
<para>The downside of using cells is that they are not well
supported by any of the OpenStack services other than compute.
Also, they do not adequately support some relatively standard
OpenStack functionality such as security groups and host
aggregates. Due to their relative newness and specialized use,
they receive relatively little testing in the OpenStack gate.
Despite these issues, however, cells are used in some very
well known OpenStack installations operating at massive scale
including those at CERN and Rackspace.</para></section>
<section xml:id="host-aggregates"><title>Host Aggregates</title>
<para>Host Aggregates enable partitioning of OpenStack Compute
deployments into logical groups for load balancing and
instance distribution. Host aggregates may also be used to
further partition an availability zone. Consider a cloud which
might use host aggregates to partition an availability zone
into groups of hosts that either share common resources, such
as storage and network, or have a special property, such as
trusted computing hardware. Host aggregates are not explicitly
user-targetable; instead they are implicitly targeted via the
selection of instance flavors with extra specifications that
map to host aggregate metadata.</para>
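<para>As a brief sketch (the aggregate ID returned by the
first command is assumed to be 1, and the host, flavor, and
metadata key names are illustrative; the
<literal>AggregateInstanceExtraSpecsFilter</literal> must be
enabled in the scheduler for the flavor mapping to take
effect):</para>
<screen><prompt>$</prompt> <userinput>nova aggregate-create ssd-hosts</userinput>
<prompt>$</prompt> <userinput>nova aggregate-add-host 1 compute-01</userinput>
<prompt>$</prompt> <userinput>nova aggregate-set-metadata 1 ssd=true</userinput>
<prompt>$</prompt> <userinput>nova flavor-key m1.ssd set aggregate_instance_extra_specs:ssd=true</userinput></screen></section>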
<section xml:id="availability-zones"><title>Availability Zones</title>
<para>Availability zones provide another mechanism for subdividing
an installation or region. They are, in effect, Host
aggregates that are exposed for (optional) explicit targeting
by users.</para>
<para>Unlike cells, they do not have their own database server or
queue broker but simply represent an arbitrary grouping of
compute nodes. Typically, grouping of nodes into availability
zones is based on a shared failure domain based on a physical
characteristic such as a shared power source, physical network
connection, and so on. Availability zones are exposed to the
user because they can be targeted; however, users are not
required to target them. An alternative is for the operator
to configure a default availability zone, so that instances
are scheduled to a zone other than the built-in nova default
zone.</para>
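<para>To sketch how this looks in practice (the zone,
aggregate, image, and flavor names are illustrative), an
aggregate created with an availability zone argument defines
the zone, a user may optionally target it at boot time, and
the operator can change the zone that untargeted instances
are scheduled to:</para>
<screen><prompt>$</prompt> <userinput>nova aggregate-create rack-a az-rack-a</userinput>
<prompt>$</prompt> <userinput>nova boot --flavor m1.small --image cirros --availability-zone az-rack-a test-vm</userinput>
<prompt>#</prompt> <userinput>openstack-config --set /etc/nova/nova.conf DEFAULT default_schedule_zone az-rack-a</userinput></screen></section>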
<section xml:id="segregation-example"><title>Segregation Example</title>
<para>In this example the cloud is divided into two regions, one
for each site, with two availability zones in each based on
the power layout of the data centers. A number of host
aggregates have also been defined to allow targeting of
virtual machine instances using flavors, that require special
capabilities shared by the target hosts such as SSDs, 10 G
networks, or GPU cards.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Massively_Scalable_Cells_+_regions_+_azs.png"
/>
</imageobject>
</mediaobject></section>
</section>


@ -0,0 +1,173 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="user-requirements-massive-scale-overview">
<?dbhtml stop-chunking?>
<title>User Requirements</title>
<para>More so than other scenarios, defining user requirements for
a massively scalable OpenStack design architecture dictates
approaching the design from two different, yet sometimes
opposing, perspectives: the cloud user, and the cloud
operator. The expectations and perceptions of the consumption
and management of resources of a massively scalable OpenStack
cloud from the user point of view is distinctly different from
that of the cloud operator.</para>
<para>Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:</para>
<itemizedlist>
<listitem>
<para>Data retention policies ensuring storage of
persistent data and records management to meet data
archival requirements.</para>
</listitem>
<listitem>
<para>Data ownership policies governing the possession and
responsibility for data.</para>
</listitem>
<listitem>
<para>Data sovereignty policies governing the storage of
data in foreign countries or otherwise separate
jurisdictions.</para>
</listitem>
<listitem>
<para>Data compliance policies governing certain types of
information that must reside in certain locations due
to regulatory issues and, more importantly, cannot reside
in other locations for the same reason.</para>
</listitem>
</itemizedlist>
<para>Examples of such legal frameworks include the data
protection framework of the European Union
(http://ec.europa.eu/justice/data-protection/) and the
requirements of the Financial Industry Regulatory Authority
(http://www.finra.org/Industry/Regulation/FINRARules/) in the
United States. Consult a local regulatory body for more
information.</para>
<section xml:id="user-requirements-massive-scale"><title>User Requirements</title>
<para>Massively scalable OpenStack clouds have the following user
requirements:</para>
<itemizedlist>
<listitem>
<para>The cloud user expects repeatable, dependable, and
deterministic processes for launching and deploying
cloud resources. This could be delivered through a
web-based interface or publicly available API
endpoints. All appropriate options for requesting
cloud resources need to be available through some type
of user interface, a command-line interface (CLI), or
API endpoints.</para>
</listitem>
<listitem>
<para>Cloud users expect a fully self-service and
on-demand consumption model. When an OpenStack cloud
reaches the "massively scalable" size, it means it is
expected to be consumed "as a service" in each and
every way.</para>
</listitem>
<listitem>
<para>For a user of a massively scalable OpenStack public
cloud, there will be no expectations for control over
security, performance, or availability. Only SLAs
related to uptime of API services are expected, and
only very basic SLAs are expected of the services
offered. The user
understands it is his or her responsibility to address
these issues on their own. The exception to this
expectation is the rare case of a massively scalable
cloud infrastructure built for a private or government
organization that has specific requirements.</para>
</listitem>
</itemizedlist>
<para>As might be expected, the cloud user requirements or
expectations that determine the design are all focused on the
consumption model. The user expects to be able to easily
consume cloud resources in an automated and deterministic way,
without any need for knowledge of the capacity, scalability,
or other attributes of the cloud's underlying
infrastructure.</para></section>
<section xml:id="operator-requirements-massive-scale"><title>Operator Requirements</title>
<para>Whereas the cloud user should be completely unaware of the
underlying infrastructure of the cloud and its attributes, the
operator must be able to build and support the infrastructure,
as well as understand how it needs to operate at scale. This presents a
very demanding set of requirements for building such a cloud
from the operator's perspective:</para>
<itemizedlist>
<listitem>
<para>First and foremost, everything must be capable of
automation: from the deployment of new hardware,
whether compute, storage, or networking, through to
the installation and configuration of the supporting
software. Manual processes will not suffice in a
massively scalable OpenStack design
architecture.</para>
</listitem>
<listitem>
<para>The cloud operator requires that capital expenditure
(CapEx) is minimized at all layers of the stack.
Operators of massively scalable OpenStack clouds
require the use of dependable commodity hardware and
freely available open source software components to
reduce deployment costs and operational expenses.
Initiatives like the Open Compute Project
(http://www.opencompute.org) provide additional
pointers. To cut costs, many operators sacrifice
redundancy, for example redundant power supplies,
network connections, and rack switches.</para>
</listitem>
<listitem>
<para>Companies operating a massively scalable OpenStack
cloud also require that operational expenditures
(OpEx) be minimized as much as possible.
Cloud-optimized hardware is a good approach to
managing operational overhead. Some of
the factors that need to be considered include power,
cooling, and the physical design of the chassis. It is
possible to customize the hardware and systems so they
are optimized for this type of workload because of the
scale of these implementations.</para>
</listitem>
<listitem>
<para>Massively scalable OpenStack clouds require
extensive metering and monitoring functionality to
maximize the operational efficiency by keeping the
operator informed about the status and state of the
infrastructure. This includes full scale metering of
the hardware and software status. A corresponding
framework of logging and alerting is also required to
store and allow operations to act upon the metrics
provided by the metering and monitoring solution(s).
The cloud operator also needs a solution that uses the
data provided by the metering and monitoring solution
to provide capacity planning and capacity trending
analysis.</para>
</listitem>
<listitem>
<para>A massively scalable OpenStack cloud will be a
multi-site cloud. Therefore, the user-operator
requirements for a multi-site OpenStack architecture
design are also applicable here. This includes various
legal requirements for data storage, data placement,
and data retention; other jurisdictional legal or
compliance requirements; image consistency and
availability; storage replication and
availability (both block and file/object storage); and
authentication, authorization, and auditing (AAA),
just to name a few. Refer to the "Multi-Site" section
for more details on requirements and considerations
for multi-site OpenStack clouds.</para>
</listitem>
<listitem>
<para>Considerations around physical facilities such as
space, floor weight, rack height and type,
environmental considerations, power usage and power
usage efficiency (PUE), and physical security must
also be addressed by the design architecture of a
massively scalable OpenStack cloud.</para>
</listitem>
</itemizedlist></section>
</section>


@ -0,0 +1,118 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="arch-design-architecture-multiple-site">
<?dbhtml stop-chunking?>
<title>Architecture</title>
<para>This graphic is a high-level diagram of a multiple-site OpenStack
architecture. Each site is an OpenStack cloud, but it may be necessary to
run the sites on different versions; for example, if the second site is
intended to be a replacement for the first site, the two would run
different versions during the transition. Another common design is a
private OpenStack cloud with a replicated site used for high availability
or disaster recovery. The most important design decision is how to configure the
storage. It can be configured as a single shared pool or separate pools,
depending on the user and technical requirements.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Site_shared_keystone_horizon_swift1.png"/>
</imageobject>
</mediaobject>
<section xml:id="openstack-services-architecture">
<title>OpenStack Services Architecture</title>
<para>The OpenStack Identity service, which is used by all other
OpenStack components for authorization and the catalog of service
endpoints, supports the concept of regions. A region is a logical
construct that can be used to group OpenStack services that are in
close proximity to one another. The concept of regions is flexible;
it may contain OpenStack service endpoints located within a
distinct geographic region, or span several such regions. It may be smaller in scope,
where a region is a single rack within a data center or even a
single blade chassis, with multiple regions existing in adjacent
racks in the same data center.</para>
<para>The majority of OpenStack components are designed to run within
the context of a single region. The OpenStack Compute service is
designed to manage compute resources within a region, with support
for subdivisions of compute resources by using Availability Zones
and Cells. The OpenStack Networking service can be used to manage
network resources in the same broadcast domain or collection of
switches that are linked. The OpenStack Block Storage service
controls storage resources within a region with all storage
resources residing on the same storage network. Like the OpenStack
Compute service, the OpenStack Block Storage Service also supports
the Availability Zone construct, which can be used to subdivide
storage resources.</para>
<para>The OpenStack Dashboard, OpenStack Identity Service, and OpenStack
Object Storage services are components that can each be deployed
centrally in order to serve multiple regions.</para>
</section>
<section xml:id="arch-multi-storage">
<title>Storage</title>
<para>With multiple OpenStack regions, having a single OpenStack Object
Storage Service endpoint that delivers shared object storage for all
regions is desirable. The Object Storage service internally
replicates files to multiple nodes. The advantage of this is that a
file placed into the Object Storage service is visible to all
regions and can be used by applications or workloads in any or all
of the regions. This simplifies high availability failover and
disaster recovery rollback.</para>
<para>In order to scale the Object Storage service to meet the workload
of multiple regions, multiple proxy workers are run and
load-balanced, storage nodes are installed in each region, and the
entire Object Storage Service can be fronted by an HTTP caching
layer. This is done so client requests for objects can be served out
of caches rather than directly from the storage modules themselves,
reducing the actual load on the storage network. In addition to an
HTTP caching layer, use a caching layer like Memcache to cache
objects between the proxy and storage nodes.</para>
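<para>As an illustrative sketch (the memcached addresses are
placeholders), the proxy's cache middleware can be pointed at
a pool of memcached servers in
<filename>proxy-server.conf</filename>:</para>
<screen><prompt>#</prompt> <userinput>openstack-config --set /etc/swift/proxy-server.conf filter:cache memcache_servers 10.0.0.1:11211,10.0.0.2:11211</userinput></screen>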
<para>If the cloud is designed without a single Object Storage Service
endpoint for multiple regions, and instead a separate Object Storage
Service endpoint is made available in each region, applications are
required to handle synchronization (if desired) and other management
operations to ensure consistency across the nodes. For some
applications, having multiple Object Storage Service endpoints, each
located in the same region as the application, may be desirable due
to reduced latency, lower cross-region bandwidth usage, and ease of
deployment.</para>
<para>For the Block Storage service, the most important decisions are
the selection of the storage technology and whether or not a
dedicated network is used to carry storage traffic from the storage
service to the compute nodes.</para>
</section>
<section xml:id="arch-networking-multiple">
<title>Networking</title>
<para>When connecting multiple regions together there are several design
considerations. The overlay network technology choice determines how
packets are transmitted between regions and how the logical network
and addresses are presented to the application. If there are security or
regulatory requirements, encryption should be implemented to secure
the traffic between regions. For networking inside a region, the
overlay network technology for tenant networks is equally important.
The overlay technology and the network traffic an application
generates or receives can be either complementary or at cross
purposes. For example, using an overlay technology for an application
that transmits a large amount of small packets could add excessive
latency or overhead to each packet if not configured
properly.</para>
</section>
<section xml:id="arch-dependencies-multiple">
<title>Dependencies</title>
<para>The architecture for a multi-site installation of OpenStack is
dependent on a number of factors. One major dependency to consider
is storage. When designing the storage system, the storage mechanism
needs to be determined. Once the storage type is determined, how it
will be accessed is critical. For example, it is recommended that
storage utilize a dedicated network. Another concern is how
the storage is configured to protect the data, for example, the
recovery point objective (RPO) and the recovery time objective
(RTO). How quickly recovery from a fault must be completed
determines how often the replication of data is required. Ensure that
enough storage is allocated to support the data protection
strategy.</para>
<para>Networking decisions include the encapsulation mechanism that will
be used for the tenant networks, how large the broadcast domains
should be, and the contracted SLAs for the interconnects.</para>
</section>
</section>


@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-multi">
<title>Introduction</title>
<para>A multi-site OpenStack environment is one in which services
located in more than one data center are used to provide the
overall solution. Usage requirements of different multi-site
clouds may vary widely; however, they share some common needs.
OpenStack is capable of running in a multi-region
configuration allowing some parts of OpenStack to effectively
manage a grouping of sites as a single cloud. With some
careful planning in the design phase, OpenStack can act as an
excellent multi-site cloud solution for a multitude of
needs.</para>
<para>Some use cases that might indicate a need for a multi-site
deployment of OpenStack include:</para>
<itemizedlist>
<listitem>
<para>An organization with a diverse geographic
footprint.</para>
</listitem>
<listitem>
<para>Geo-location sensitive data.</para>
</listitem>
<listitem>
<para>Data locality, in which specific data or
functionality should be close to users.</para>
</listitem>
</itemizedlist>
</section>


@ -0,0 +1,178 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="operational-considerations-multi-site">
<?dbhtml stop-chunking?>
<title>Operational Considerations</title>
<para>Deployment of a multi-site OpenStack cloud using regions
requires that the service catalog contains per-region entries
for each service deployed other than the Identity service
itself. There is limited support amongst currently available
off-the-shelf OpenStack deployment tools for defining multiple
regions in this fashion.</para>
<para>Deployers must be aware of this and provide the appropriate
customization of the service catalog for their site either
manually or via customization of the deployment tools in
use.</para>
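<para>As an illustration of what that customization involves
(the region name, service ID, and URLs are placeholders), a
per-region endpoint for the Compute service might be
registered as follows:</para>
<screen><prompt>$</prompt> <userinput>keystone endpoint-create --region site-1 \
  --service-id $NOVA_SERVICE_ID \
  --publicurl "http://site1.example.com:8774/v2/%(tenant_id)s" \
  --internalurl "http://site1.example.com:8774/v2/%(tenant_id)s" \
  --adminurl "http://site1.example.com:8774/v2/%(tenant_id)s"</userinput></screen>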
<para>Note that, as of the Icehouse release, documentation for
implementing this feature is in progress. See this bug for
more information:
https://bugs.launchpad.net/openstack-manuals/+bug/1340509</para>
<section xml:id="licensing"><title>Licensing</title>
<para>Multi-site OpenStack deployments present additional
licensing considerations over and above regular OpenStack
clouds, particularly where site licenses are in use to provide
cost efficient access to software licenses. The licensing for
host operating systems, guest operating systems, OpenStack
distributions (if applicable), software-defined infrastructure
including network controllers and storage systems, and even
individual applications needs to be evaluated in light of the
multi-site nature of the cloud.</para>
<para>Topics to consider include:</para>
<itemizedlist>
<listitem>
<para>The specific definition of what constitutes a site
in the relevant licenses, as the term does not
necessarily denote a geographic or otherwise
physically isolated location in the traditional
sense.</para>
</listitem>
<listitem>
<para>Differentiations between "hot" (active) and "cold"
(inactive) sites where significant savings may be made
in situations where one site is a cold standby for
disaster recovery purposes only.</para>
</listitem>
<listitem>
<para>Certain locations might require local vendors to
provide support and services for each site, which
presents challenges; the specifics will vary depending
on the licensing agreement in place.</para>
</listitem>
</itemizedlist></section>
<section xml:id="logging-and-monitoring-multi-site"><title>Logging and Monitoring</title>
<para>Logging and monitoring does not significantly differ for a
multi-site OpenStack cloud. The same well known tools
described in the Operations Guide
(http://docs.openstack.org/openstack-ops/content/logging_monitoring.html)
remain applicable. Logging and monitoring can be provided both
on a per-site basis and in a common centralized
location.</para>
<para>When attempting to deploy logging and monitoring facilities
to a centralized location, care must be taken with regards to
the load placed on the inter-site networking links.</para></section>
<section xml:id="upgrades-multi-site"><title>Upgrades</title>
<para>In multi-site OpenStack clouds deployed using regions each
site is, effectively, an independent OpenStack installation
which is linked to the others by using centralized services
such as Identity which are shared between sites. At a high
level the recommended order of operations to upgrade an
individual OpenStack environment is
(http://docs.openstack.org/openstack-ops/content/ops_upgrades-general-steps.html):</para>
<orderedlist>
<listitem>
<para>Upgrade the OpenStack Identity Service
(Keystone).</para>
</listitem>
<listitem>
<para>Upgrade the OpenStack Image Service (Glance).</para>
</listitem>
<listitem>
<para>Upgrade OpenStack Compute (Nova), including
networking components.</para>
</listitem>
<listitem>
<para>Upgrade OpenStack Block Storage (Cinder).</para>
</listitem>
<listitem>
<para>Upgrade the OpenStack dashboard (Horizon).</para>
</listitem>
</orderedlist>
<para>The process for upgrading a multi-site environment is not
significantly different:</para>
<orderedlist>
<listitem>
<para>Upgrade the shared OpenStack Identity Service
(Keystone) deployment.</para>
</listitem>
<listitem>
<para>Upgrade the OpenStack Image Service (Glance) at each
site.</para>
</listitem>
<listitem>
<para>Upgrade OpenStack Compute (Nova), including
networking components, at each site.</para>
</listitem>
<listitem>
<para>Upgrade OpenStack Block Storage (Cinder) at each
site.</para>
</listitem>
<listitem>
<para>Upgrade the OpenStack dashboard (Horizon) at each
site, or in the single central location if it is
shared.</para>
</listitem>
</orderedlist>
<para>Note that, as of the OpenStack Icehouse release, compute
upgrades within each site can also be performed in a rolling
fashion. Compute controller services (API, Scheduler, and
Conductor) can be upgraded prior to upgrading of individual
compute nodes. This maximizes the ability of operations staff
to keep a site operational for users of compute services while
performing an upgrade.</para></section>
<section xml:id="quota-management-multi-site"><title>Quota Management</title>
<para>To prevent system capacities from being exhausted without
notification, OpenStack provides operators with the ability to
define quotas. Quotas are used to set operational limits and
are currently enforced at the tenant (or project) level rather
than at the user level.</para>
<para>Quotas are defined on a per-region basis. Operators may wish
to define identical quotas for tenants in each region of the
cloud to provide a consistent experience, or even create a
process for synchronizing allocated quotas across regions. It
is important to note that only the operational limits imposed
by the quotas will be aligned; consumption of quotas by users
will not be reflected between regions.</para>
<para>For example, given a cloud with two regions, if the operator
grants a user a quota of 25 instances in each region then that
user may launch a total of 50 instances spread across both
regions. They may not, however, launch more than 25 instances
in any single region.</para>
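<para>A minimal sketch of aligning that limit across both
regions with the nova client (the region names and tenant ID
are placeholders):</para>
<screen><prompt>$</prompt> <userinput>nova --os-region-name region-1 quota-update --instances 25 $TENANT_ID</userinput>
<prompt>$</prompt> <userinput>nova --os-region-name region-2 quota-update --instances 25 $TENANT_ID</userinput></screen>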
<para>For more information on managing quotas refer to Chapter 9.
Managing Projects and Users
(http://docs.openstack.org/openstack-ops/content/projects_users.html)
of the OpenStack Operators Guide.</para></section>
<section xml:id="policy-management-multi-site"><title>Policy Management</title>
<para>OpenStack provides a default set of Role Based Access
Control (RBAC) policies, defined in a <filename>policy.json</filename> file, for
each service. Operators edit these files to customize the
policies for their OpenStack installation. If the application
of consistent RBAC policies across sites is considered a
requirement, then it is necessary to ensure proper
synchronization of the <filename>policy.json</filename> files to all
installations.</para>
<para>This must be done using normal system administration tools
such as rsync as no functionality for synchronizing policies
across regions is currently provided within OpenStack.</para>
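<para>For example, a minimal sketch of pushing a locally
edited Compute policy file to the controller at a second site
(the host name is illustrative):</para>
<screen><prompt>#</prompt> <userinput>rsync -avz /etc/nova/policy.json site2-controller:/etc/nova/policy.json</userinput></screen></section>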
<section xml:id="documentation-multi-site"><title>Documentation</title>
<para>Users must be able to leverage cloud infrastructure and
provision new resources in the environment. It is important
that user documentation is accessible by users of the cloud
infrastructure to ensure they are given sufficient information
to help them leverage the cloud. As an example, by default
OpenStack will schedule instances on a compute node
automatically. However, when multiple regions are available,
it is left to the end user to decide in which region to
schedule the new instance. Horizon will present the user with
the first region in your configuration. The API and CLI tools
will not execute commands unless a valid region is specified.
It is therefore important to provide documentation to your
users describing the region layout as well as calling out that
quotas are region-specific. If a user reaches his or her quota
in one region, OpenStack will not automatically build new
instances in another. Documenting specific examples will help
users understand how to operate the cloud, thereby reducing
calls and tickets filed with the help desk.</para></section>
</section>


@ -0,0 +1,218 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="prescriptive-example-multisite">
<?dbhtml stop-chunking?>
<title>Prescriptive Examples</title>
<para>Based on the needs of the intended workloads, there are
multiple ways to build a multi-site OpenStack installation.
Below are example architectures based on different
requirements. These examples are meant as a reference, and not
a hard and fast rule for deployments. Use the previous
sections of this chapter to assist in selecting specific
components and implementations based on specific needs.</para>
<para>A large content provider needs to deliver content to
customers that are geographically dispersed. The workload is
very sensitive to latency and needs a rapid response to
end-users. After reviewing the user, technical, and operational
considerations, it is determined beneficial to build a number
of regions local to the customers' edge. In this case, rather
than build a few large, centralized data centers, the intent
of the architecture is to provide a pair of small data centers
in locations that are closer to the customer. In this use
case, spreading applications out allows for a different kind
of horizontal scaling than a traditional compute workload
requires. The intent is to scale by creating more copies of
the application in closer proximity to the users that need it
most, in order to ensure faster response times to user
requests. This provider will deploy two data centers at each
of the four chosen regions. The implications of this design are
based around the method of placing copies of resources in each
of the remote regions. Swift objects, Glance images, and block
storage will need to be manually replicated into each region.
This may be beneficial for some systems, such as the case of
content service, where only some of the content needs to exist
in some but not all regions. A centralized Keystone is
recommended to ensure authentication and that access to the
API endpoints is easily manageable.</para>
<para>Installation of an automated DNS system such as Designate is
highly recommended. Unless an external Dynamic DNS system is
available, application administrators will need a way to
manage the mapping of which application copy exists in each
region and how to reach it. Designate will assist by making
the process automatic and by populating the records in each
region's zone.</para>
<para>Telemetry for each region is also deployed, as each region
may grow differently or be used at a different rate.
Ceilometer will run to collect each region's metrics from each
of the controllers and report them back to a central location.
This is useful both to the end user and the administrator of
the OpenStack environment. The end user will find this method
useful, in that it is possible to determine if certain
locations are experiencing higher load than others, and take
appropriate action. Administrators will also benefit by
possibly being able to forecast growth per region, rather than
expanding the capacity of all regions simultaneously,
therefore maximizing the cost-effectiveness of the multi-site
design.</para>
<para>One of the key decisions of running this sort of
infrastructure is whether or not to provide a redundancy
model. Two types of redundancy and high availability models in
this configuration will be implemented. The first type
revolves around the availability of the central OpenStack
components. Keystone will be made highly available in three
central data centers that will host the centralized OpenStack
components. This prevents a loss of any one of the regions
causing an outage in service. It also has the added benefit of
being able to run a central storage repository as a primary
cache for distributing content to each of the regions.</para>
<para>The second redundancy topic is that of the edge data center
itself. A second data center in each of the edge regional
locations will house a second region near the first. This
ensures that the application will not suffer degraded
performance in terms of latency and availability.</para>
<para>This figure depicts the solution designed to have both a
centralized set of core data centers for OpenStack services
and paired edge data centers:</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Site_Customer_Edge.png"/>
</imageobject>
</mediaobject>
<section xml:id="geo-redundant-load-balancing"><title>Geo-redundant load balancing</title>
<para>A large-scale web application has been designed with cloud
principles in mind. The application is designed to provide
service to an application store, on a 24/7 basis. The company
has a typical two-tier architecture, with a web front end
servicing the customer requests and a NoSQL database back end
storing the information.</para>
<para>Recently there have been several outages in a number of
major public cloud providers, usually due to the fact that
these applications were running out of a single geographical
location. The design therefore should mitigate the chance of
a single site causing an outage for the business.</para>
<para>The solution would consist of the following OpenStack
components:</para>
<itemizedlist>
<listitem>
<para>A firewall, switches and load balancers on the
public facing network connections.</para>
</listitem>
<listitem>
<para>OpenStack Controller services running Networking,
Horizon, Cinder, and Nova compute locally in each of
the three regions. The other services, Keystone, Heat,
Ceilometer, Glance, and Swift, will be installed
centrally, with nodes in each of the regions
providing a redundant OpenStack Controller plane
throughout the globe.</para>
</listitem>
<listitem>
<para>OpenStack Compute nodes running the KVM
hypervisor.</para>
</listitem>
<listitem>
<para>OpenStack Object Storage for serving static objects
such as images will be used to ensure that all images
are standardized across all the regions, and
replicated on a regular basis.</para>
</listitem>
<listitem>
<para>A distributed DNS service, available to all
regions, that allows for dynamic update of DNS records
of deployed instances.</para>
</listitem>
<listitem>
<para>A geo-redundant load balancing service will be used
to service the requests from the customers based on
their origin.</para>
</listitem>
</itemizedlist>
<para>An autoscaling Heat template will be used to deploy the
application in the three regions. This template will
include:</para>
<itemizedlist>
<listitem>
<para>Web Servers, running Apache.</para>
</listitem>
<listitem>
<para>Appropriate user_data to populate the central DNS
servers upon instance launch.</para>
</listitem>
<listitem>
<para>Appropriate Ceilometer alarms that maintain state of
the application and allow for handling of region or
instance failure.</para>
</listitem>
</itemizedlist>
<para>Another autoscaling Heat template will be used to deploy a
distributed MongoDB shard over the three locations, with the
option of storing required data on a globally available Swift
container. According to the usage of and load on the database
server, additional shards will be provisioned according to
the thresholds defined in Ceilometer.</para>
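<para>A minimal sketch of driving the same template into each
region with the heat client (the template file name, stack
names, and region names are assumptions):</para>
<screen><prompt>$</prompt> <userinput>heat --os-region-name region-1 stack-create app-region-1 -f app-tier.yaml</userinput>
<prompt>$</prompt> <userinput>heat --os-region-name region-2 stack-create app-region-2 -f app-tier.yaml</userinput>
<prompt>$</prompt> <userinput>heat --os-region-name region-3 stack-create app-region-3 -f app-tier.yaml</userinput></screen>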
<para>The reason that three regions were selected here was
the concern about abnormal load landing on a single region in
the event of a failure. Two data centers would have been
sufficient had the availability requirements been the only
consideration.</para>
<para>Heat is used because of its built-in autoscaling and
auto-healing functionality in the event of increased load.
Additional configuration management tools, such as Puppet or
Chef, could also have been used in this scenario, but were
not chosen because Heat had the appropriate built-in hooks
into the OpenStack cloud, whereas the other tools were
external and not native to OpenStack. In addition, since this
deployment scenario was relatively straightforward, the
external tools were not needed.</para>
<para>Swift is used here to serve as a back end for Glance and
object storage since it was the most suitable solution for
globally distributed storage, with its own replication
mechanism. Home-grown solutions could also have been used,
including the handling of replication, but were not chosen
because Swift is already an integral and proven part of the
infrastructure.</para>
<para>An external load balancing service was used rather than
the OpenStack LBaaS because the solution in OpenStack is not
redundant and does not have any awareness of geographic
location.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-site_Geo_Redundant_LB.png"/>
</imageobject>
</mediaobject></section>
<section xml:id="location-local-services"><title>Location-local service</title>
<para>A common use for a multi-site deployment of OpenStack is
the creation of a Content Delivery Network. An application that
uses a location-local architecture will require low network
latency and proximity to the user, in order to provide an
optimal user experience, in addition to reducing the cost of
bandwidth and transit, since the content resides on sites
closer to the customer, instead of a centralized content store
that would require utilizing higher cost cross country
links.</para>
<para>This architecture usually includes a geo-location component
that places user requests at the closest possible node. In
this scenario, 100% redundancy of content across every site is
a goal rather than a requirement, with the intent being to
maximize the amount of content available that is within a
minimum number of network hops for any given end user. Despite
these differences, the storage replication configuration has
significant overlap with that of a geo-redundant load
balancing use case.</para>
<para>In this example, a location-aware application utilizing
this multi-site OpenStack installation would launch web
server or content-serving instances on the compute cluster in
each site. Requests from clients will first be sent to a
global services load balancer that determines the location of
the client, then routes the request to the closest OpenStack
site where the application completes the request.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Multi-Site_shared_keystone1.png"/>
</imageobject>
</mediaobject></section>
</section>


@ -0,0 +1,196 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="technical-considerations-multi-site">
<?dbhtml stop-chunking?>
<title>Technical Considerations</title>
<para>There are many technical considerations to take into account
with regard to designing a multi-site OpenStack
implementation. An OpenStack cloud can be designed in a
variety of ways to handle individual application needs. A
multi-site deployment will have additional challenges compared
to single site installations and will therefore be a more
complex solution.</para>
<para>When determining capacity options be sure to take into
account not just the technical issues, but also the economic
or operational issues that might arise from specific
decisions.</para>
<para>Inter-site link capacity describes the capabilities of the
connectivity between the different OpenStack sites. This
includes parameters such as bandwidth, latency, whether or not
a link is dedicated, and any business policies applied to the
connection. The capability and number of the links between
sites will determine what kind of options may be available for
deployment. For example, if two sites have a pair of
high-bandwidth links available between them, it may be wise to
configure a separate storage replication network between the
two sites to support a single Swift endpoint and a shared
object storage capability between them. (An example of this
technique, as well as a configuration walk-through, is
available at
http://docs.openstack.org/developer/swift/replication_network.html#dedicated-replication-network).
Another option in this scenario is to build a dedicated set of
tenant private networks across the secondary link using
overlay networks with a third party mapping the site overlays
to each other.</para>
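<para>As a sketch of the dedicated replication network
technique referenced above (the region, zone, IP addresses,
device, and weight are illustrative), each device can be
added to the ring with a separate replication IP and port
using the <literal>R</literal> syntax:</para>
<screen><prompt>$</prompt> <userinput>swift-ring-builder object.builder add r1z1-192.0.2.10:6000R198.51.100.10:6000/sda1 100</userinput></screen>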
<para>The capacity requirements of the links between sites will be
driven by application behavior. If the latency of the links is
too high, certain applications that use a large number of
small packets, for example RPC calls, may encounter issues
communicating with each other or operating properly.
Additionally, OpenStack may encounter similar types of issues.
To mitigate this, tuning of the Keystone call timeouts may be
necessary to prevent issues authenticating against a central
Identity Service.</para>
<para>Another capacity consideration when it comes to networking
for a multi-site deployment is the available amount and
performance of overlay networks for tenant networks. If using
shared tenant networks across zones, it is imperative that an
external overlay manager or controller be used to map these
overlays together. It is necessary to ensure that the range of
possible IDs is identical between the zones. Note that, as of
the Icehouse release, Neutron was not capable of managing
tunnel IDs across installations. This means that if one site
runs out of IDs, but the other does not, that tenant's network
will be unable to reach the other site.</para>
<para>Capacity can take other forms as well. The ability for a
region to grow depends on scaling out the number of available
compute nodes. This topic is covered in greater detail in the
section for compute-focused deployments. However, it should be
noted that cells may be necessary to grow an individual region
beyond a certain point. This point depends on the size of your
cluster and the ratio of virtual machines per
hypervisor.</para>
<para>A third form of capacity comes in the multi-region-capable
components of OpenStack. Centralized Object Storage is capable
of serving objects through a single namespace across multiple
regions. Since this works by accessing the object store via
swift proxy, it is possible to overload the proxies. There are
two options available to mitigate this issue. The first is to
deploy a large number of swift proxies. The drawback to this
is that the proxies are not load-balanced and a large file
request could continually hit the same proxy. The other way to
mitigate this is to front-end the proxies with a caching HTTP
proxy and load balancer. Since swift objects are returned to
the requester via HTTP, this load balancer would alleviate the
load required on the swift proxies.</para>
<section xml:id="utilization-multi-site"><title>Utilization</title>
<para>While constructing a multi-site OpenStack environment is the
goal of this guide, the real test is whether an application
can utilize it.</para>
<para>Identity is normally the first interface for the majority of
OpenStack users. Interacting with Keystone is required for
almost all major operations within OpenStack. Therefore, it is
important to ensure that you provide users with a single URL
for Keystone authentication. Equally important is proper
documentation and configuration of regions within Keystone.
Each of the sites defined in your installation is considered
to be a region in Keystone nomenclature. This is important for
the users of the system, when reading Keystone documentation,
as it is required to define the Region name when providing
actions to an API endpoint or in Horizon.</para>
<para>Load balancing is another common issue with multi-site
installations. While it is still possible to run HAProxy
instances with Load-Balancer-as-a-Service (LBaaS), these will
be local to a specific region. Some applications may be able to cope
with this via internal mechanisms. Others, however, may
require the implementation of an external system including
global services load balancers or anycast-advertised
DNS.</para>
<para>Depending on the storage model chosen during site design,
storage replication and availability will also be a concern
for end-users. If an application is capable of understanding
regions, then it is possible to keep the object storage system
separated by region. In this case, users who want to have an
object available to more than one region will need to do the
cross-site replication themselves. With a centralized swift
proxy, however, the user may need to benchmark the replication
timing of the Swift back end. Benchmarking allows the
operational staff to provide users with an understanding of
the amount of time required for a stored or modified object to
become available to the entire environment.</para>
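<para>Where users handle that replication themselves,
container synchronization is one option; a sketch with the
swift client (the realm, cluster, account, container, and key
are placeholders that the operator defines in
<filename>container-sync-realms.conf</filename>):</para>
<screen><prompt>$</prompt> <userinput>swift post -t '//realm/site2/AUTH_tenant/backups' -k 'sync-secret' backups</userinput></screen></section>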
<section xml:id="performance"><title>Performance</title>
<para>Determining the performance of a multi-site installation
involves considerations that do not come into play in a
single-site deployment. Being a distributed deployment,
multi-site deployments incur a few extra penalties to
performance in certain situations.</para>
<para>Since multi-site systems can be geographically separated,
they may have worse than normal latency or jitter when
communicating across regions. This can especially impact
systems like the OpenStack Identity service when making
authentication attempts from regions that do not contain the
centralized Keystone implementation. It can also affect
certain applications which rely on remote procedure call (RPC)
for normal operation. An example of this can be seen in High
Performance Computing workloads.</para>
<para>Storage availability can also be impacted by the
architecture of a multi-site deployment. A centralized Object
Storage Service requires more time for an object to be
available to instances locally in regions where the object was
not created. Some applications may need to be tuned to account
for this effect. Block storage does not currently have a
method for replicating data across multiple regions, so
applications that depend on available block storage will need
to manually cope with this limitation by creating duplicate
block storage entries in each region.</para></section>
<section xml:id="security-multi-site"><title>Security</title>
<para>Securing a multi-site OpenStack installation also brings
extra challenges. Tenants may expect a tenant-created network
to be secure. In a multi-site installation the use of a
non-private connection between sites may be required. This may
mean that traffic would be visible to third parties and, in
cases where an application requires security, this issue will
require mitigation. Installing a VPN or encrypted connection
between sites is recommended in such instances.</para>
<para>Another security consideration with regard to multi-site
deployments is Identity. Authentication in a multi-site
deployment should be centralized. Centralization provides a
single authentication point for users across the deployment,
as well as a single point of administration for traditional
create, read, update and delete operations. Centralized
authentication is also useful for auditing purposes because
all authentication tokens originate from the same
source.</para>
<para>Just as tenants in a single-site deployment need isolation
from each other, so do tenants in multi-site installations.
The extra challenges in multi-site designs revolve around
ensuring that tenant networks function across regions.
Unfortunately, OpenStack Networking does not presently support
a mechanism to provide this functionality; therefore, an
external system may be necessary to manage these mappings.
Tenant networks may contain sensitive information requiring
that this mapping be accurate and consistent to ensure that a
tenant in one site does not connect to a different tenant in
another site.</para></section>
<section xml:id="openstack-components-multi-site"><title>OpenStack Components</title>
<para>Most OpenStack installations require a bare minimum set of
pieces to function. These include Keystone for authentication,
Nova for compute, Glance for image storage, Neutron for
networking, and potentially an object store in the form of
Swift. Bringing multi-site into play also demands extra
components in order to coordinate between regions. Centralized
Keystone is necessary to provide the single authentication
point. Centralized Horizon is also recommended to provide a
single login point and a mapped experience to the API and CLI
options available. If necessary, a centralized Swift may be
used and will require the installation of the Swift proxy
service.</para>
<para>It may also be helpful to install a few extra options in
order to facilitate certain use cases. For instance,
installing Designate may assist in automatically generating
DNS domains for each region with an automatically-populated
zone full of resource records for each instance. This
facilitates using DNS as a mechanism for determining which
region would be selected for certain applications.</para>
<para>Another useful tool for managing a multi-site installation
is Heat. Heat allows the use of templates to define a set of
instances to be launched together or for scaling existing
sets. It can also be used to set up matching or differentiated
groupings based on regions. For instance, if an application
requires an equally balanced number of nodes across sites, the
same heat template can be used to cover each site with small
alterations to only the region name.</para></section>
</section>


@ -0,0 +1,213 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="user-requirements-multi-site">
<?dbhtml stop-chunking?>
<title>User Requirements</title>
<para>A multi-site architecture is complex and has its own
risks and considerations; therefore, when contemplating the
design of such an architecture, it is important to make sure
it meets the user and business requirements.</para>
<para>Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:</para>
<itemizedlist>
<listitem>
<para>Data retention policies ensuring storage of
persistent data and records management to meet data
archival requirements.</para>
</listitem>
<listitem>
<para>Data ownership policies governing the possession and
responsibility for data.</para>
</listitem>
<listitem>
<para>Data sovereignty policies governing the storage of
data in foreign countries or otherwise separate
jurisdictions.</para>
</listitem>
<listitem>
<para>Data compliance policies governing types of
information that need to reside in certain locations
due to regulatory issues and, more importantly, cannot
reside in other locations for the same reason.</para>
</listitem>
</itemizedlist>
<para>Examples of such legal frameworks include the data
protection framework of the European Union
(http://ec.europa.eu/justice/data-protection) and the
requirements of the Financial Industry Regulatory Authority
(http://www.finra.org/Industry/Regulation/FINRARules) in the
United States. Consult a local regulatory body for more
information.</para>
<section xml:id="workload-characteristics"><title>Workload Characteristics</title>
<para>The expected workload is a critical requirement that needs
to be captured to guide decision-making. An understanding of
the workloads in the context of the desired multi-site
environment and use case is important. Another way of thinking
about a workload is to think of it as the way the systems are
used. A workload could be a single application or a suite of
applications that work together. It could also be a duplicate
set of applications that need to run in multiple cloud
environments. Often in a multi-site deployment the same
workload will need to work identically in more than one
physical location.</para>
<para>This multi-site scenario likely includes one or more of the
other scenarios in this book with the additional requirement
of having the workloads in two or more locations. The
following are some possible scenarios:</para>
<para>For many use cases the proximity of the user to their
workloads has a direct influence on the performance of the
application and therefore should be taken into consideration
in the design. Certain applications require zero to minimal
latency that can only be achieved by deploying the cloud in
multiple locations. These locations could be in different data
centers, cities, countries or geographical regions, depending
on the user requirement and location of the users.</para></section>
<section xml:id="consistency-images-templates-across-sites">
<title>Consistency of images and templates across different
sites</title>
<para>It is essential that the deployment of instances is
consistent across the different sites. This needs to be built
into the infrastructure. If OpenStack Object Store is used as
a back end for Glance, it is possible to create repositories of
consistent images across multiple sites. Having a central
endpoint with multiple storage nodes will allow for a
consistent centralized storage for each and every site.</para>
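<para>A hedged sketch of one way to wire this up (the
configuration keys exist in
<filename>glance-api.conf</filename>; the endpoint URL and
credentials are placeholders) is to point each site's Image
Service at the shared Object Storage back end:</para>
<screen><prompt>#</prompt> <userinput>openstack-config --set /etc/glance/glance-api.conf DEFAULT default_store swift</userinput>
<prompt>#</prompt> <userinput>openstack-config --set /etc/glance/glance-api.conf DEFAULT swift_store_auth_address http://identity.example.com:5000/v2.0/</userinput></screen>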
<para>Not using a centralized object store increases the
operational overhead of maintaining a consistent image
library. This could include development of a replication
mechanism to handle the transport of images and the changes to
the images across multiple sites.</para></section>
<section xml:id="high-availability-multi-site"><title>High Availability</title>
<para>If high availability is a requirement for providing
continuous infrastructure operations, the required level of
high availability should be defined.</para>
<para>The OpenStack management components need to have a basic and
minimal level of redundancy. The simplest example is that the
loss of any single site should have no significant impact on the
availability of the OpenStack services of the entire
infrastructure.</para>
<para>The OpenStack High Availability Guide
(http://docs.openstack.org/high-availability-guide/content/)
contains more information on how to provide redundancy for the
OpenStack components.</para>
<para>Multiple network links should be deployed between sites to
provide redundancy for all components. This includes storage
replication, which should be isolated to a dedicated network
or VLAN with the ability to assign QoS to control the
replication traffic or provide priority for this traffic. Note
that if the data store is highly changeable, the network
requirements could have a significant effect on the
operational cost of maintaining the sites.</para>
<para>The ability to maintain object availability in both sites
has significant implications on the object storage design and
implementation. It will also have a significant impact on the
WAN network design between the sites.</para>
<para>Connecting more than two sites increases the challenges and
adds more complexity to the design considerations. Multi-site
implementations require extra planning to address the
additional topology complexity used for internal and external
connectivity. Some options include full mesh, hub-and-spoke,
spine-leaf, and 3D torus topologies.</para>
<para>Not all the applications running in a cloud are cloud-aware.
If that is the case, there should be clear measures and
expectations to define what the infrastructure can support
and, more importantly, what it cannot. An example would be
shared storage between sites. It is possible; however, such a
solution is not native to OpenStack and requires a third-party
hardware vendor to fulfill such a requirement. Another example
can be seen in applications that are able to consume resources
in object storage directly. These applications need to be
cloud aware to make good use of an OpenStack Object
Store.</para></section>
<section xml:id="application-readiness"><title>Application readiness</title>
<para>Some applications are tolerant of the lack of synchronized
object storage, while others may need those objects to be
replicated and available across regions. Understanding of how
the cloud implementation impacts new and existing applications
is important for risk mitigation and the overall success of a
cloud project. Applications may have to be written to expect
an infrastructure with little to no redundancy. Existing
applications not developed with the cloud in mind may need to
be rewritten.</para></section>
<section xml:id="cost-multi-site"><title>Cost</title>
<para>The requirement of having more than one site has a cost
attached to it. The greater the number of sites, the greater
the cost and complexity. Costs can be broken down into the
following categories:</para>
<itemizedlist>
<listitem>
<para>Compute resources</para>
</listitem>
<listitem>
<para>Networking resources</para>
</listitem>
<listitem>
<para>Replication</para>
</listitem>
<listitem>
<para>Storage</para>
</listitem>
<listitem>
<para>Management</para>
</listitem>
<listitem>
<para>Operational costs</para>
</listitem>
</itemizedlist></section>
<section xml:id="site-loss-and-recovery"><title>Site Loss and Recovery</title>
<para>Outages can cause loss of partial or full functionality of a
site. Strategies should be implemented to understand and plan
for recovery scenarios.</para>
<itemizedlist>
<listitem>
<para>The deployed applications need to continue to
function and, more importantly, consideration should
be taken of the impact on the performance and
reliability of the application when a site is
unavailable.</para>
</listitem>
<listitem>
<para>It is important to understand what will happen to the
replication of objects and data between the sites when
a site goes down. If this causes queues to start
building up, consider how long these queues can
safely exist before something fails.</para>
</listitem>
<listitem>
<para>Determine the method for resuming proper
operations of a site when it comes back online
after a disaster. It is recommended to architect the
recovery to avoid race conditions.</para>
</listitem>
</itemizedlist></section>
<section xml:id="compliance-and-geo-location-multi-site"><title>Compliance and Geo-location</title>
<para>An organization may have legal obligations and
regulatory compliance measures that require certain
workloads or data not to be located in certain regions.</para></section>
<section xml:id="auditing-multi-site"><title>Auditing</title>
<para>A well thought-out auditing strategy is important in order
to be able to quickly track down issues. Keeping track of
changes made to security groups and tenant changes can be
useful in rolling back the changes if they affect production.
For example, if all security group rules for a tenant
disappeared, the ability to quickly track down the issue would
be important for operational and legal reasons.</para></section>
<section xml:id="separation-of-duties"><title>Separation of duties</title>
<para>A common requirement is to define different roles for the
different cloud administration functions. An example would be
a requirement to segregate the duties and permissions by
site.</para></section>
<section xml:id="authentication-between-sites">
<title>Authentication between sites</title>
<para>Ideally, use a single authentication domain
rather than a separate implementation for each and every
site. This will, of course, require an authentication
mechanism that is highly available and distributed to ensure
continuous operation. Authentication server locality might
also be required and should be planned
for.</para></section>
</section>

@ -0,0 +1,215 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="architecture-network-focus">
<title>Architecture</title>
<para>Network-focused OpenStack architectures have many
similarities to other OpenStack architecture use cases. There
are, however, a number of very specific considerations to keep
in mind when designing for a network-centric or network-heavy
application environment.</para>
<para>Networks exist to serve as a medium for transporting data
between systems. It is inevitable that an OpenStack design
has interdependencies with non-network portions of OpenStack
as well as on external systems. Depending on the specific
workload, there may be major interactions with storage systems
both within and external to the OpenStack environment. For
example, if the workload is a content delivery network, then
the interactions with storage will be two-fold. There will be
traffic flowing to and from the storage array for ingesting
and serving content in a north-south direction. In addition,
there is replication traffic flowing in an east-west
direction.</para>
<para>Compute-heavy workloads may also induce interactions with
the network. Some high performance compute applications
require network-based memory mapping and data sharing and, as
a result, will induce a higher network load when they transfer
results and data sets. Others may be highly transactional and
issue transaction locks, perform their functions and rescind
transaction locks at very high rates. This also has an impact
on the network performance.</para>
<para>Some network dependencies are going to be external to
OpenStack. While Neutron is capable of providing network
ports, IP addresses, some level of routing, and overlay
networks, there are some other functions that it cannot
provide. For many of these, external systems or equipment may
be required to fill in the functional gaps. Hardware load
balancers are an example of equipment that may be necessary to
distribute workloads or offload certain functions. Note that,
as of the Icehouse release, dynamic routing is in its
infancy within OpenStack and may need to be implemented
either by an external device or a specialized service instance
within OpenStack. Tunneling is a feature provided by Neutron,
however it is constrained to a Neutron-managed region. If the
need arises to extend a tunnel beyond the OpenStack region to
either another region or an external system, it is necessary
to implement the tunnel itself outside OpenStack or by using a
tunnel management system to map the tunnel or overlay to an
external tunnel. OpenStack does not currently provide quotas
for network resources. Where network quotas are required, it
is necessary to implement quality of service management
outside of OpenStack. In many of these instances, similar
solutions for traffic shaping or other network functions will
be needed.</para>
<para>Depending on the selected design, Neutron itself may not
even support the required layer 3 network functionality. If it
is necessary or advantageous to use the provider networking
mode of Neutron without running the layer 3 agent, then an
external router will be required to provide layer 3
connectivity to outside systems.</para>
<para>Interaction with orchestration services is inevitable in
larger-scale deployments. Heat is capable of allocating
network resources defined in templates to map to tenant
networks and for port creation, as well as allocating floating
IPs. If there is a requirement to define and manage network
resources using orchestration, it is recommended that the
design include OpenStack Orchestration to meet the demands of
users.</para>
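<para>The following is a minimal sketch of such a template, shown
here as the JSON-style Python dictionary that Heat accepts;
the resource names and the external network UUID are
illustrative placeholders, not values mandated by
OpenStack:</para>
<programlisting language="python"># A minimal HOT template sketch: a tenant network, a port on it, and
# a floating IP bound to that port. All names and the external
# network UUID are hypothetical placeholders.
template = {
    "heat_template_version": "2013-05-23",
    "resources": {
        "app_net": {"type": "OS::Neutron::Net"},
        "app_subnet": {
            "type": "OS::Neutron::Subnet",
            "properties": {
                "network_id": {"get_resource": "app_net"},
                "cidr": "10.0.10.0/24",
            },
        },
        "app_port": {
            "type": "OS::Neutron::Port",
            "properties": {"network_id": {"get_resource": "app_net"}},
        },
        "app_floating_ip": {
            "type": "OS::Neutron::FloatingIP",
            "properties": {
                "floating_network_id": "EXTERNAL-NET-UUID",
                "port_id": {"get_resource": "app_port"},
            },
        },
    },
}</programlisting>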
<section xml:id="desing-impacts"><title>Design Impacts</title>
<para>A wide variety of factors can affect a network focused
OpenStack architecture. While there are some considerations
shared with a general use case, specific workloads related to
network requirements will influence network design
decisions.</para>
<para>One decision includes whether or not to use Network Address
Translation (NAT) and where to implement it. If there is a
requirement for floating IPs to be available instead of using
public fixed addresses then NAT is required. This can be seen
in network management applications that rely on an IP
endpoint. An example of this is a DHCP relay that needs to
know the IP of the actual DHCP server. In these cases it is
easier to automate the infrastructure to apply the target IP
to a new instance rather than reconfigure legacy or external
systems for each new instance.</para>
<para>NAT for floating IPs managed by Neutron will reside within
the hypervisor but there are also versions of NAT that may be
running elsewhere. If there is a shortage of IPv4 addresses
there are two common methods to mitigate this externally to
OpenStack. The first is to run a load balancer either within
OpenStack as an instance, or to use an external load balancing
solution. In the internal scenario, load balancing software,
such as HAProxy, can be managed with Neutron's Load Balancer
as a Service (LBaaS), which specifically manages the
virtual IPs (VIPs) while a dual-homed connection from the
HAProxy instance connects the public network with the tenant
private network that hosts all of the content servers. In the
external scenario, a load balancer would need to serve the VIP
and also be joined to the tenant overlay network through
external means or routed to it via private addresses.</para>
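<para>As a rough illustration of the internal scenario, the
following Python sketch drives Neutron LBaaS with
python-neutronclient; the credentials, addresses, and UUIDs
are assumed placeholders:</para>
<programlisting language="python"># Sketch: create a round-robin HTTP pool, register a content server,
# and expose a VIP on the tenant subnet via Neutron LBaaS.
from neutronclient.v2_0 import client

neutron = client.Client(username="admin", password="secret",
                        tenant_name="demo",
                        auth_url="http://keystone.example.com:5000/v2.0")

pool = neutron.create_pool({"pool": {
    "name": "web-pool",
    "protocol": "HTTP",
    "lb_method": "ROUND_ROBIN",
    "subnet_id": "TENANT-SUBNET-UUID"}})["pool"]

# Each content server on the private tenant network becomes a member.
neutron.create_member({"member": {
    "pool_id": pool["id"],
    "address": "10.0.10.11",
    "protocol_port": 80}})

# The VIP is the single address the load balancer answers on.
neutron.create_vip({"vip": {
    "name": "web-vip",
    "protocol": "HTTP",
    "protocol_port": 80,
    "pool_id": pool["id"],
    "subnet_id": "TENANT-SUBNET-UUID"}})</programlisting>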
<para>Another kind of NAT that may be useful is protocol NAT. In
some cases it may be desirable to use only IPv6 addresses on
instances and operate either an instance or an external
service to provide a NAT-based transition technology such as
NAT64 and DNS64. This provides the ability to have a globally
routable IPv6 address while only consuming IPv4 addresses as
necessary or in a shared manner.</para>
<para>Application workloads will affect the design of the
underlying network architecture. If a workload requires
network-level redundancy, the routing and switching
architecture will have to accommodate this. There are
differing methods for providing this that are dependent on the
network hardware selected, the performance of the hardware,
and which networking model is deployed. Some examples of this
are the use of Link aggregation (LAG) or Hot Standby Router
Protocol (HSRP). There are also the considerations of whether
to deploy Neutron or Nova-network and which plug-in to select
for Neutron. If using an external system, Neutron will need to
be configured to run layer 2 with a provider network
configuration. For example, it may be necessary to implement
HSRP to terminate layer 3 connectivity.</para>
<para>Depending on the workload, overlay networks may or may not
be a recommended configuration. Where application network
connections are small, short lived or bursty, running a
dynamic overlay can generate as much bandwidth as the packets
it carries. It also can induce enough latency to cause issues
with certain applications. There is an impact to the device
generating the overlay which, in most installations, will be
the hypervisor. This will cause performance degradation on
packet per second and connection per second rates.</para>
<para>Overlays also come with a secondary option that may or may
not be appropriate to a specific workload. While all of them
will operate in full mesh by default, there might be good
reasons to disable this function because it may cause
excessive overhead for some workloads. Conversely, other
workloads will operate without issue. For example, most web
services applications will not have major issues with a full
mesh overlay network, while some network monitoring tools or
storage replication workloads will have performance issues
with throughput or excessive broadcast traffic.</para>
<para>A design decision that many overlook is a choice of layer 3
protocols. While OpenStack was initially built with only IPv4
support, Neutron now supports IPv6 and dual-stacked networks.
Note that, as of the Icehouse release, this only includes
stateless address autoconfiguration, but work is in
progress to support stateless and stateful DHCPv6 as well as
IPv6 floating IPs without NAT. Some workloads become possible
through the use of IPv6 and IPv6 to IPv4 reverse transition
mechanisms such as NAT64 and DNS64 or 6to4, because these
options are available. This will alter the requirements for
any address plan as single-stacked and transitional IPv6
deployments can alleviate the need for IPv4 addresses.</para>
<para>As of the Icehouse release, OpenStack has limited support
for dynamic routing, however there are a number of options
available by incorporating third party solutions to implement
routing within the cloud including network equipment, hardware
nodes, and instances. Some workloads will perform well with
nothing more than static routes and default gateways
configured at the layer 3 termination point. In most cases
this will suffice, however some cases require the addition of
at least one type of dynamic routing protocol if not multiple
protocols. Having a form of interior gateway protocol (IGP)
available to the instances inside an OpenStack installation
opens up the possibility of use cases for anycast route
injection for services that need to use it as a geographic
location or failover mechanism. Other applications may wish to
directly participate in a routing protocol, either as a
passive observer as in the case of a looking glass, or as an
active participant in the form of a route reflector. Since an
instance might have a large amount of compute and memory
resources, it is trivial to hold an entire unpartitioned
routing table and use it to provide services such as network
path visibility to other applications or as a monitoring
tool.</para>
<para>A lesser known, but harder to diagnose, issue is that of
path Maximum Transmission Unit (MTU) failures. It is less an
optional design consideration and more a design warning: the
MTU must be at least large enough to handle normal traffic,
plus any overhead from an overlay network, and the desired
layer 3 protocol. Adding externally built tunnels further
reduces the usable packet size, making it imperative to pay
attention to the fully calculated MTU, as some systems may be
configured to ignore or drop path MTU discovery
packets.</para>
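<para>A back-of-the-envelope budget makes the warning concrete;
the overhead figures below assume GRE over IPv4 and are
typical rather than universal:</para>
<programlisting language="python"># Path MTU budget sketch: subtract overlay and tunnel overhead from
# the physical MTU to find the largest safe guest packet size.
physical_mtu = 1500
gre_overhead = 24        # 20-byte outer IPv4 header + 4-byte GRE header
external_tunnel = 24     # an externally built GRE tunnel on the same path

guest_mtu = physical_mtu - gre_overhead - external_tunnel
print("Guest MTU must not exceed %d bytes" % guest_mtu)  # 1452
# Systems that ignore or drop path MTU discovery packets will
# silently lose anything larger.</programlisting>
</section>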
<section xml:id="tunables">
<title>Tunable networking components</title>
<para>Configurable networking components to consider when designing
for network-intensive workloads include MTU and QoS. Some workloads
require a larger MTU than normal due to a requirement to transfer
large blocks of data. When providing network service for applications
such as video streaming or storage replication, it is recommended to
ensure that both OpenStack hardware nodes and the supporting network
equipment are configured for jumbo frames where possible. This allows
for better utilization of the available bandwidth, as estimated in the
sketch below. Configuration of jumbo frames should be done across the
complete path the packets will traverse. If one network component is
not capable of handling jumbo frames, then the entire path will revert
to the default MTU.</para>
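<para>The utilization gain is easy to estimate. The sketch below
compares per-packet protocol efficiency for standard and jumbo
frames, assuming 40 bytes of IPv4 plus TCP headers per packet and
ignoring Ethernet framing:</para>
<programlisting language="python"># Protocol efficiency: fraction of each packet that is payload.
def efficiency(mtu, header_bytes=40):
    payload = mtu - header_bytes
    return float(payload) / mtu

print("1500-byte MTU: %.1f%% payload" % (100 * efficiency(1500)))  # 97.3%
print("9000-byte MTU: %.1f%% payload" % (100 * efficiency(9000)))  # 99.6%
# Jumbo frames also cut the packet-per-second load for the same
# throughput by a factor of six.</programlisting>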
<para>Quality of Service (QoS) also has a great impact on
network-intensive workloads by expediting delivery of packets that
have a higher priority and are sensitive to poor network
performance. In applications such as Voice over IP (VoIP),
Differentiated Services Code Point (DSCP) markings are a near
requirement for
proper operation. QoS can also be used in the opposite direction for
mixed workloads to prevent low priority but high bandwidth
applications, for example backup services, video conferencing or
file sharing, from blocking bandwidth that is needed for the proper
operation of other workloads. It is possible to tag file storage
traffic as a lower class, such as best effort or scavenger, to allow
the higher priority traffic through. In cases where regions within a
cloud might be geographically distributed it may also be necessary
to plan accordingly to implement WAN optimization to combat latency
or packet loss.</para>
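<para>For reference, the sketch below pairs the traffic classes
discussed above with their standard DSCP values; the class names
on the left are illustrative examples, not an OpenStack API:</para>
<programlisting language="python"># Standard DSCP values for the traffic classes mentioned above.
DSCP_MARKINGS = {
    "voip-bearer": 46,        # EF, Expedited Forwarding
    "video-conference": 34,   # AF41, Assured Forwarding
    "web-traffic": 0,         # CS0, best effort
    "backup-replication": 8,  # CS1, scavenger class
}</programlisting>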
</section>
</section>

@ -0,0 +1,138 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-guide-intro-network-focus">
<title>Introduction</title>
<para>All OpenStack deployments depend, to some extent, on
network communication in order to function properly due to
their service-based nature. In some cases, however, use cases
dictate that the network is elevated beyond simple
infrastructure. This section is a discussion of architectures
that are more reliant or focused on network services. These
architectures are heavily dependent on the network
infrastructure and need to be architected so that the network
services perform and are reliable in order to satisfy user and
application requirements.</para>
<para>Some possible use cases include:</para>
<itemizedlist>
<listitem>
<para>Content Delivery Network: This could include
streaming video, photographs or any other cloud based
repository of data that is distributed to a large
number of end users. Mass market streaming video will
be very heavily affected by the network configurations
that would affect latency, bandwidth, and the
distribution of instances. Not all video streaming is
consumer focused. For example, multicast videos (used
for media, press conferences, corporate presentations,
web conferencing services, etc.) can also utilize a
content delivery network. Content delivery will be
affected by the location of the video repository and
its relationship to end users. Performance is also
affected by network throughput of the backend systems,
as well as the WAN architecture and the cache
methodology.</para>
</listitem>
<listitem>
<para>Network Management Functions: A cloud that provides
network service functions would be built to support
the delivery of back-end network services such as DNS,
NTP or SNMP and would be used by a company for
internal network management.</para>
</listitem>
<listitem>
<para>Network Service Offerings: A cloud can be used to
run customer facing network tools to support services.
For example, VPNs, MPLS private networks, GRE tunnels
and others.</para>
</listitem>
<listitem>
<para>Web portals / Web Services: Web servers are a common
application for cloud services and it is recommended
to have an understanding of the network requirements.
The network will need to be able to scale out to meet
user demand and deliver webpages with a minimum of
latency. Internal east-west and north-south network
bandwidth must be considered depending on the details
of the portal architecture.</para>
</listitem>
<listitem>
<para>High Speed and High Volume Transactional Systems:
These types of applications are very sensitive to
network configurations. Examples include many
financial systems, credit card transaction
applications, trading and other extremely high volume
systems. These systems are sensitive to network jitter
and latency. They also have a high volume of both
east-west and north-south network traffic that needs
to be balanced to maximize efficiency of the data
delivery. Many of these systems have large high
performance database back ends that need to be
accessed.</para>
</listitem>
<listitem>
<para>High Availability: These types of use cases are
highly dependent on the proper sizing of the network
to maintain replication of data between sites for high
availability. If one site becomes unavailable, the
extra sites will be able to serve the displaced load
until the original site returns to service. It is
important to size network capacity to handle the loads
that are desired.</para>
</listitem>
<listitem>
<para>Big Data: Clouds that will be used for the
management and collection of big data (data ingest)
will have a significant demand on network resources.
Big data often uses partial replicas of the data to
maintain data integrity over large distributed clouds.
Other big data applications that require a large
amount of network resources are Hadoop, Cassandra,
NuoDB, Riak, and other NoSQL and distributed
databases.</para>
</listitem>
<listitem>
<para>Virtual Desktop Infrastructure (VDI): This use case
is very sensitive to network congestion, latency,
jitter and other network characteristics. Like video
streaming, the user experience is very important;
however, unlike video streaming, caching is not an
option to offset the network issues. VDI requires both
upstream and downstream traffic and cannot rely on
caching for the delivery of the application to the end
user.</para>
</listitem>
<listitem>
<para>Voice over IP (VoIP): This is extremely sensitive to
network congestion, latency, jitter and other network
characteristics. VoIP has a symmetrical traffic
pattern and it requires network quality of service
(QoS) for best performance. It may also require an
active queue management implementation to ensure
delivery. Users are very sensitive to latency and
jitter fluctuations and can detect them at very low
levels.</para>
</listitem>
<listitem>
<para>Video Conference / Web Conference: This also is
extremely sensitive to network congestion, latency,
jitter and other network flaws. Video Conferencing has
a symmetrical traffic pattern, but unless the network
is on an MPLS private network, it cannot use network
quality of service (QoS) to improve performance.
Similar to VoIP, users will be sensitive to network
performance issues even at low levels.</para>
</listitem>
<listitem>
<para>High Performance Computing (HPC): This is a complex
use case that requires careful consideration of the
traffic flows and usage patterns to address the needs
of cloud clusters. It has high east-west traffic
patterns for distributed computing, but there can be
substantial north-south traffic depending on the
specific application.</para>
</listitem>
</itemizedlist>
</section>

@ -0,0 +1,72 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="operational-considerations-networking-focus">
<?dbhtml stop-chunking?>
<title>Operational Considerations</title>
<para>Network-focused OpenStack clouds have a number of
operational considerations that will influence the selected
design. Topics including, but not limited to, dynamic routing
versus static routes, service level agreements, and ownership of
user management all need to be considered.</para>
<para>One of the first required decisions is the selection of a
telecom company or transit provider. This is especially true
if the network requirements include external or site-to-site
network connectivity.</para>
<para>Additional design decisions need to be made about monitoring
and alarming. These can be an internal responsibility or the
responsibility of the external provider. In the case of using
an external provider, SLAs will likely apply. In addition,
other operational considerations such as bandwidth, latency,
and jitter can be part of a service level agreement.</para>
<para>The ability to upgrade the infrastructure is another subject
for consideration. As demand for network resources increase,
operators will be required to add additional IP address blocks
and add additional bandwidth capacity. Managing hardware and
software life cycle events, for example upgrades,
decommissioning, and outages while avoiding service
interruptions for tenants, will also need to be
considered.</para>
<para>Maintainability will also need to be factored into the
overall network design. This includes the ability to manage
and maintain IP addresses as well as the use of overlay
identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS
tags. As an example, if all of the IP addresses have to be
changed on a network, a process known as renumbering, then the
design needs to support the ability to do so.</para>
<para>Network-focused applications themselves need to be addressed
when considering certain operational realities, for example,
the impending exhaustion of IPv4 addresses, the migration to
IPv6, and the use of private networks to segregate
different types of traffic that an application receives or
generates. In the case of IPv4 to IPv6 migrations,
applications should follow best practices for storing IP
addresses. It is further recommended to avoid relying on IPv4
features that were not carried over to the IPv6 protocol or
have differences in implementation.</para>
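<para>One such best practice is to store and parse addresses in a
version-agnostic way. A minimal sketch using the Python standard
library ipaddress module (available in Python 3.3 and later)
follows:</para>
<programlisting language="python"># Version-agnostic IP address handling: the same code path accepts
# IPv4 and IPv6 and stores a canonical form.
import ipaddress

def normalize(address):
    ip = ipaddress.ip_address(address)
    return ip.version, ip.compressed

print(normalize("192.0.2.10"))        # (4, '192.0.2.10')
print(normalize("2001:db8:0::000a"))  # (6, '2001:db8::a')</programlisting>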
<para>When using private networks to segregate traffic,
applications should create private tenant networks for
database and data storage network traffic, and utilize public
networks for client-facing traffic. By segregating this
traffic, quality of service and security decisions can be made
to ensure that each network has the correct level of service
that it requires.</para>
<para>Finally, decisions must be made about the routing of network
traffic. For some applications, a more complex policy
framework for routing must be developed. The economic cost of
transmitting traffic over expensive links versus cheaper
links, in addition to bandwidth, latency, and jitter
requirements, can be used to create a routing policy that will
satisfy business requirements.</para>
<para>How to respond to network events must also be taken into
consideration. As an example, how load is transferred from one
link to another during a failure scenario could be a factor in
the design. If network capacity is not planned correctly,
failover traffic could overwhelm other ports or network links
and create a cascading failure scenario. In this case, traffic
that fails over to one link overwhelms that link and then
moves to the subsequent links until all network traffic
stops.</para>
</section>

@ -0,0 +1,189 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="prescriptive-example-large-scale-web-app">
<?dbhtml stop-chunking?>
<title>Prescriptive Examples</title>
<para>A large-scale web application has been designed with cloud
principles in mind. The application is designed to scale
horizontally in a bursting fashion and will generate a high
instance count. The application requires an SSL connection to
secure data and must not lose connection state to individual
servers.</para>
<para>An example design for this workload is depicted in the
figure below. In this example, a hardware load balancer is
configured to provide SSL offload functionality and to connect
to tenant networks in order to reduce address consumption.
This load balancer is linked to the routing architecture as it
will service the VIP for the application. The router and load
balancer are configured with GRE tunnel ID of the
application's tenant network and provided an IP address within
the tenant subnet but outside of the address pool. This is to
ensure that the load balancer can communicate with the
application's HTTP servers without requiring the consumption
of a public IP address.</para>
<para>Since sessions must persist until closed, the routing and
switching architecture is designed for high availability.
Switches are meshed to each hypervisor and to each other, and
also provide an MLAG implementation to ensure layer 2
connectivity does not fail. Routers are configured with VRRP
and fully meshed with switches to ensure layer 3 connectivity.
Since GRE is used as an overlay network, Neutron is installed
and configured to use the Open vSwitch agent in GRE tunnel
mode. This ensures all devices can reach all other devices and
that tenant networks can be created for private addressing
links to the load balancer.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Network_Web_Services1.png"
/>
</imageobject>
</mediaobject>
<para>A web service architecture has many options and optional
components. Due to this, it can fit into a large number of
other OpenStack designs; however, a few key components need
to be in place to handle the nature of most web-scale
workloads. The user needs the following components:</para>
<itemizedlist>
<listitem>
<para>OpenStack Controller services (Image, Identity,
Networking and supporting services such as MariaDB and
RabbitMQ)</para>
</listitem>
<listitem>
<para>OpenStack Compute running KVM hypervisor</para>
</listitem>
<listitem>
<para>OpenStack Object Storage</para>
</listitem>
<listitem>
<para>OpenStack Orchestration</para>
</listitem>
<listitem>
<para>OpenStack Telemetry</para>
</listitem>
</itemizedlist>
<para>Beyond the normal Keystone, Nova, Glance, and Swift
components, Heat is a recommended component to properly handle
scaling the workloads to adjust to demand. Ceilometer will
also need to be included in the design due to the requirement
for auto-scaling; a sketch of this wiring follows below. Web
services tend to be bursty in load, have
very defined peak and valley usage patterns and, as a result,
benefit from automatic scaling of instances based upon
traffic. At a network level, a split network configuration
will work well, with databases residing on private tenant
networks, since these do not emit a large quantity of broadcast
traffic, while the web tier may still need to interconnect
with the databases for content.</para>
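<para>The following is a sketch of what that auto-scaling wiring
can look like in a HOT template, again expressed as a JSON-style
Python dictionary; all names, thresholds, and the flavor and
image are assumed placeholders:</para>
<programlisting language="python"># Auto-scaling sketch: a scaling group of web servers that grows by
# one instance when average CPU utilization exceeds 80 percent.
scaling_template = {
    "heat_template_version": "2013-05-23",
    "resources": {
        "web_group": {
            "type": "OS::Heat::AutoScalingGroup",
            "properties": {
                "min_size": 2,
                "max_size": 20,
                "resource": {
                    "type": "OS::Nova::Server",
                    "properties": {"flavor": "m1.small",
                                   "image": "web-server-image"},
                },
            },
        },
        "scale_out_policy": {
            "type": "OS::Heat::ScalingPolicy",
            "properties": {
                "auto_scaling_group_id": {"get_resource": "web_group"},
                "adjustment_type": "change_in_capacity",
                "scaling_adjustment": 1,
                "cooldown": 60,
            },
        },
        "cpu_alarm_high": {
            "type": "OS::Ceilometer::Alarm",
            "properties": {
                "meter_name": "cpu_util",
                "statistic": "avg",
                "period": 60,
                "evaluation_periods": 1,
                "threshold": 80,
                "comparison_operator": "gt",
                "alarm_actions": [
                    {"get_attr": ["scale_out_policy", "alarm_url"]}],
            },
        },
    },
}</programlisting>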
<section xml:id="load-balancing"><title>Load Balancing</title>
<para>Load balancing was included in this design to spread
requests across multiple instances. This workload scales well
horizontally across large numbers of instances. This allows
instances to run without publicly routed IP addresses and
simply rely on the load balancer for the service to be
globally reachable. Many of these services do not require
direct server return. This aids in address planning and
utilization at scale since only the virtual IP (VIP) must be
public.</para></section>
<section xml:id="overlay-networks"><title>Overlay Networks</title>
<para>OpenStack Networking using the Open vSwitch GRE tunnel mode
was included in the design to provide overlay functionality.
In this case, the layer 3 external routers will be in a pair
with VRRP and switches should be paired with an implementation
of MLAG running to ensure that there is no loss of
connectivity with the upstream routing infrastructure.</para></section>
<section xml:id="performance-tuning"><title>Performance Tuning</title>
<para>Network level tuning for this workload is minimal.
Quality-of-Service (QoS) will be applied to these workloads
for a middle ground Class Selector depending on existing
policies. It will be higher than a best effort queue but lower
than an Expedited Forwarding or Assured Forwarding queue.
Since this type of application generates larger packets with
longer-lived connections, bandwidth utilization can be
optimized for long-duration TCP sessions. Normal bandwidth planning
applies here: benchmark a session's usage and multiply
by the expected number of concurrent sessions, adding
overhead.</para></section>
<section xml:id="network-functions"><title>Network Functions</title>
<para>Network functions is a broad category but encompasses
workloads that support the rest of a system's network. These
workloads tend to consist of large amounts of small packets
that are very short lived, such as DNS queries or SNMP traps.
These messages need to arrive quickly and do not deal well with
packet loss, as there can be a very large volume of them. There
are a few extra considerations to take into account for this
type of workload and this can change a configuration all the
way to the hypervisor level. For an application that generates
10 TCP sessions per user with an average bandwidth of 512
kilobytes per second per user and an expected count of ten
thousand concurrent users, the expected bandwidth plan is
approximately 4.88 gigabytes per second, as computed below.</para>
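<para>The arithmetic behind that figure, assuming binary units and
that the 512 kilobytes per second is each user's total across its
ten sessions, is spelled out below:</para>
<programlisting language="python"># Aggregate bandwidth plan: 512 KB/s per user across ten thousand
# concurrent users (the 10 TCP sessions split, not multiply, the
# assumed per-user bandwidth).
per_user_bytes = 512 * 1024
concurrent_users = 10000

total_bytes = per_user_bytes * concurrent_users
print("Aggregate: %.2f GB/s" % (total_bytes / float(2 ** 30)))  # ~4.88
print("Aggregate: %.1f Gbit/s" % (total_bytes * 8 / 1e9))       # ~41.9</programlisting>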
<para>The supporting network for this type of configuration needs
to have a low latency and evenly distributed availability.
This workload benefits from having services local to the
consumers of the service. A multi-site approach is used as
well as deploying many copies of the application to handle
load as close as possible to consumers. Since these
applications function independently, they do not warrant
running overlays to interconnect tenant networks. Overlays
also have the drawback of performing poorly with rapid flow
setup and may incur too much overhead with large quantities of
small packets and are therefore not recommended.</para>
<para>QoS is desired for some workloads to ensure delivery. DNS
has a major impact on the load times of other services and
needs to be reliable and provide rapid responses. It is
recommended to configure rules in upstream devices to apply a higher Class
Selector to DNS to ensure faster delivery or a better spot in
queuing algorithms.</para></section>
<section xml:id="cloud-storage"><title>Cloud Storage</title>
<para>Another common use case for OpenStack environments is to
provide a cloud based file storage and sharing service. While
this may initially be considered to be a storage focused use
case there are also major requirements on the network side
that place it in the realm of requiring a network focused
architecture. An example for this application is cloud
backup.</para>
<para>There are two specific behaviors of this workload that have
major and different impacts on the network. Since this is both
an externally facing service and internally replicating
application there are both North-South and East-West traffic
considerations.</para>
<para>North-South traffic is primarily user facing. This means
that when a user uploads content for storage it will be coming
into the OpenStack installation. Users who download this
content will be drawing traffic from the OpenStack
installation. Since the service is intended primarily as a
backup, the majority of the traffic will be southbound into the
environment. In this case it is beneficial to configure a
network to be asymmetric downstream as the traffic entering
the OpenStack installation will be greater than traffic
leaving.</para>
<para>East-West traffic is likely to be fully symmetric. Since
replication will originate from any node and may target
multiple other nodes algorithmically, it is less likely for
this traffic to have a larger volume in any specific
direction. However this traffic may interfere with north-south
traffic.</para>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Network_Cloud_Storage2.png"
/>
</imageobject>
</mediaobject>
<para>This application will prioritize the North-South traffic
over East-West traffic, as it is the customer-facing data. QoS
is implemented on East-West traffic as a lower priority
Class Selector, while North-South traffic requires a higher
level in the priority queue.</para>
<para>The network design in this case is less dependent on
availability and more dependent on being able to handle high
bandwidth. As a direct result, it is beneficial to forego
redundant links in favor of bonding those connections. This
increases available bandwidth. It is also beneficial to
configure all devices in the path, including OpenStack, to
generate and pass jumbo frames.</para></section>
</section>

@ -0,0 +1,402 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="technical-considerations-network-focus">
<?dbhtml stop-chunking?>
<title>Technical Considerations</title>
<para>Designing an OpenStack network architecture involves a
combination of layer 2 and layer 3 considerations. Layer 2
decisions involve those made at the data-link layer, such as
the decision to use Ethernet versus Token Ring. Layer 3
decisions involve those made about the protocol layer and the point
which IP comes into the picture. As an example, a completely
internal OpenStack network can exist at layer 2 and ignore
layer 3; however, in order for any traffic to go outside of
that cloud, to another network, or to the Internet, a layer 3
router or switch must be involved.</para>
<para>The past few years have seen two competing trends in
networking. One trend is towards building data center network
architectures based on layer 2 networking; the other is to
treat the cloud environment essentially as a miniature version
of the Internet. The latter represents a radically different
approach to network architecture from what is commonly
deployed today, because the Internet is based entirely on
layer 3 routing rather than layer 2 switching.</para>
<para>In the data center context, there are advantages to
designing the network on layer 2 protocols rather than layer
3. In spite of the difficulties of using a bridge to perform
the network role of a router, many vendors, customers, and
service providers are attracted to the idea of using Ethernet
in as many parts of their networks as possible. The benefits
of selecting a layer 2 design are:</para>
<itemizedlist>
<listitem>
<para>Ethernet frames contain all the essentials for
networking. These include, but are not limited to,
globally unique source addresses, globally unique
destination addresses, and error control.</para>
</listitem>
<listitem>
<para>Ethernet frames can carry any kind of packet.
Networking at layer 2 is independent of the layer 3
protocol.</para>
</listitem>
<listitem>
<para>More layers added to the Ethernet frame only slow
the networking process down. This is known as 'nodal
processing delay'.</para>
</listitem>
<listitem>
<para>Adjunct networking features, for example class of
service (CoS) or multicasting, can be added to
Ethernet as readily as IP networks.</para>
</listitem>
<listitem>
<para>VLANs are an easy mechanism for isolating
networks.</para>
</listitem>
</itemizedlist>
<para>Most information starts and ends inside Ethernet frames.
Today this applies to data, voice (for example, VoIP) and
video (for example, web cameras). The concept is that, if more
of the end-to-end transfer of information from a source to a
destination can be done in the form of Ethernet frames, more
of the benefits of Ethernet can be realized on the network.
Though it is not a substitute for IP networking, networking at
layer 2 can be a powerful adjunct to IP networking.</para>
<para>The basic reasoning behind using layer 2 Ethernet over layer
3 IP networks is the speed, the reduced overhead of the IP
hierarchy, and the lack of requirement to keep track of IP
address configuration as systems are moved around. Whereas the
simplicity of layer 2 protocols might work well in a data
center with hundreds of physical machines, cloud data centers
have the additional burden of needing to keep track of all
virtual machine addresses and networks. In these data centers,
it is not uncommon for one physical node to support 30-40
instances.</para>
<note>
<para>Networking at the frame level says nothing
about the presence or absence of IP addresses at the packet
level. Almost all ports, links, and devices on a network of
LAN switches still have IP addresses, as do all the source and
destination hosts. There are many reasons for the continued
need for IP addressing. The largest one is the need to manage
the network. A device or link without an IP address is usually
invisible to most management applications. Utilities including
remote access for diagnostics, file transfer of configurations
and software, and similar applications require IP addresses as
well as MAC addresses in order to run.</para>
</note>
<section xml:id="layer-2-arch-limitations"><title>Layer 2 Architecture Limitations</title>
<para>Outside of the traditional data center the limitations of
layer 2 network architectures become more obvious.</para>
<itemizedlist>
<listitem>
<para>The number of VLANs is limited to 4096.</para>
</listitem>
<listitem>
<para>The number of MACs stored in switch tables is
limited.</para>
</listitem>
<listitem>
<para>The need to maintain a set of layer 4 devices to
handle traffic control must be accommodated.</para>
</listitem>
<listitem>
<para>MLAG, often used for switch redundancy, is a
proprietary solution that does not scale beyond two
devices and forces vendor lock-in.</para>
</listitem>
<listitem>
<para>It can be difficult to troubleshoot a network
without IP addresses and ICMP.</para>
</listitem>
<listitem>
<para>Configuring ARP is considered complicated on large
layer 2 networks.</para>
</listitem>
<listitem>
<para>All network devices need to be aware of all MACs,
even instance MACs, so there is constant churn in MAC
tables and network state changes as instances are
started or stopped.</para>
</listitem>
<listitem>
<para>Migrating MACs (instance migration) to different
physical locations is a potential problem if ARP
table timeouts are not set properly.</para>
</listitem>
</itemizedlist>
<para>It is important to know that layer 2 has a very limited set
of network management tools. It is very difficult to control
traffic, as it does not have mechanisms to manage the network
or shape the traffic, and network troubleshooting is very
difficult. One reason for this difficulty is that network devices
have no IP addresses. As a result, there is no reasonable way
to check network delay in a layer 2 network.</para>
<para>On large layer 2 networks, configuring ARP learning can also
be complicated. The setting for the MAC address timer on
switches is critical and, if set incorrectly, can cause
significant performance problems. As an example, the Cisco
default MAC address timer is extremely long. Migrating MACs to
different physical locations to support instance migration can
be a significant problem. In this case, the network
information maintained in the switches could be out of sync
with the new location of the instance.</para>
<para>In a layer 2 network, all devices are aware of all MACs,
even those that belong to instances. The network state
information in the backbone changes whenever an instance is
started or stopped. As a result there is far too much churn in
the MAC tables on the backbone switches.</para></section>
<section xml:id="layer-3-arch-advantages"><title>Layer 3 Architecture Advantages</title>
<para>In the layer 3 case, there is no churn in the routing tables
due to instances starting and stopping. The only time there
would be a routing state change would be in the case of a Top
of Rack (ToR) switch failure or a link failure in the backbone
itself. Other advantages of using a layer 3 architecture
include:</para>
<itemizedlist>
<listitem>
<para>Layer 3 networks provide the same level of
resiliency and scalability as the Internet.</para>
</listitem>
<listitem>
<para>Controlling traffic with routing metrics is
straightforward.</para>
</listitem>
<listitem>
<para>Layer 3 can be configured to use BGP confederation
for scalability so core routers have state
proportional to the number of racks, not to the number of
servers or instances.</para>
</listitem>
<listitem>
<para>Routing keeps instance MAC and IP addresses
out of the network core, reducing state churn. Routing
state changes only occur in the case of a ToR switch
failure or backbone link failure.</para>
</listitem>
<listitem>
<para>There are a variety of well tested tools, for
example ICMP, to monitor and manage traffic.</para>
</listitem>
<listitem>
<para>Layer 3 architectures allow for the use of Quality
of Service (QoS) to manage network performance.</para>
</listitem>
</itemizedlist>
<section xml:id="layer-3-arch-limitations"><title>Layer 3 Architecture Limitations</title>
<para>The main limitation of layer 3 is that there is no built-in
isolation mechanism comparable to the VLANs in layer 2
networks. Furthermore, the hierarchical nature of IP addresses
means that an instance will also be on the same subnet as its
physical host. This means that it cannot be migrated outside
of the subnet easily. For these reasons, network
virtualization needs to use IP encapsulation and software at
the end hosts for both isolation, as well as for separation of
the addressing in the virtual layer from addressing in the
physical layer. Other potential disadvantages of layer 3
include the need to design an IP addressing scheme rather than
relying on the switches to automatically keep track of the MAC
addresses and to configure the interior gateway routing
protocol in the switches.</para></section></section>
<section xml:id="network-recommendations-overview">
<title>Network Recommendations Overview</title>
<para>OpenStack has complex networking requirements for several
reasons. Many components interact at different levels of the
system stack, which adds complexity. Data flows are complex.
Data in an OpenStack cloud moves both between instances across
the network (also known as East-West), as well as in and out
of the system (also known as North-South). Physical server
nodes have network requirements that are independent of those
used by instances which need to be isolated from the core
network to account for scalability. It is also recommended to
functionally separate the networks for security purposes and
tune performance through traffic shaping.</para>
<para>A number of important general technical and business factors
need to be taken into consideration when planning and
designing an OpenStack network. They include:</para>
<itemizedlist>
<listitem>
<para>A requirement for vendor independence. To avoid
hardware or software vendor lock-in, the design should
not rely on specific features of a vendor's router or
switch.</para>
</listitem>
<listitem>
<para>A requirement to massively scale the ecosystem to
support millions of end users.</para>
</listitem>
<listitem>
<para>A requirement to support indeterminate platforms and
applications.</para>
</listitem>
<listitem>
<para>A requirement to design for cost efficient
operations to take advantage of massive scale.</para>
</listitem>
<listitem>
<para>A requirement to ensure that there is no single
point of failure in the cloud ecosystem.</para>
</listitem>
<listitem>
<para>A requirement for high availability architecture to
meet customer SLA requirements.</para>
</listitem>
<listitem>
<para>A requirement to be tolerant of rack level
failure.</para>
</listitem>
<listitem>
<para>A requirement to maximize flexibility to architect
future production environments.</para>
</listitem>
</itemizedlist>
<para>Keeping all of these in mind, the following network design
recommendations can be made:</para>
<itemizedlist>
<listitem>
<para>Layer 3 designs are preferred over layer 2
architectures.</para>
</listitem>
<listitem>
<para>Design a dense multi-path network core to support
multi-directional scaling and flexibility.</para>
</listitem>
<listitem>
<para>Use hierarchical addressing because it is the only
viable option to scale the network ecosystem.</para>
</listitem>
<listitem>
<para>Use virtual networking to isolate instance service
network traffic from the management and internal
network traffic.</para>
</listitem>
<listitem>
<para>Isolate virtual networks using encapsulation
technologies.</para>
</listitem>
<listitem>
<para>Use traffic shaping for performance tuning.</para>
</listitem>
<listitem>
<para>Use eBGP to connect to the Internet up-link.</para>
</listitem>
<listitem>
<para>Use iBGP to flatten the internal traffic on the
layer 3 mesh.</para>
</listitem>
<listitem>
<para>Determine the most effective configuration for the
block storage network.</para>
</listitem>
</itemizedlist></section>
<section xml:id="additional-considerations-network-focus"><title>Additional Considerations</title>
<para>There are numerous topics to consider when designing a
network-focused OpenStack cloud.</para>
<section xml:id="openstack-networking-versus-nova-network"><title>OpenStack Networking versus Nova Network
Considerations</title>
<para>Selecting the type of networking technology to implement
depends on many factors. OpenStack Networking (Neutron) and
Nova Network both have their advantages and disadvantages.
They are both valid and supported options that fit different
use cases as described in the following table.</para></section>
<section xml:id="redundant-networking-tor-switch-ha"><title>Redundant Networking: ToR Switch High Availability
Risk Analysis</title>
<para>A technical consideration of networking is whether the
switching gear in the data center should be installed with
backup switches in case of hardware failure.</para>
<para>Research indicates that the mean time between failures (MTBF)
of switches is between 100,000 and 200,000 hours. This number is
dependent on the ambient temperature of the switch in the data
center. When properly cooled and maintained, this translates
to between 11 and 22 years before failure. Even in the worst
case of poor ventilation and high ambient temperatures in the
data center, the MTBF is still 2-3 years. This is based on
published research found at
http://www.garrettcom.com/techsupport/papers/ethernet_switch_reliability.pdf
and http://www.n-tron.com/pdf/network_availability.pdf</para>
<para>In most cases, it is much more economical to only use a
single switch with a small pool of spare switches to replace
failed units than it is to outfit an entire data center with
redundant switches. Applications should also be able to
tolerate rack level outages without affecting normal
operations since network and compute resources are easily
provisioned and plentiful.</para></section>
<section xml:id="preparing-for-future-ipv6-support"><title>Preparing for the future: IPv6 Support</title>
<para>One of the most important networking topics today is the
impending exhaustion of IPv4 addresses. In early 2014, ICANN
announced that they started allocating the final IPv4 address
blocks to the Regional Internet Registries
(http://www.internetsociety.org/deploy360/blog/2014/05/goodbye-ipv4-iana-starts-allocating-final-address-blocks/).
This means the IPv4 address space is close to being fully
allocated. As a result, it will soon become difficult to
allocate more IPv4 addresses to an application that has
experienced growth, or is expected to scale out, due to the
lack of unallocated IPv4 address blocks.</para>
<para>For network-focused applications the future is the IPv6
protocol. IPv6 increases the address space significantly,
fixes long-standing issues in the IPv4 protocol, and will
become essential for network-focused applications in the
future.</para>
<para>Neutron supports IPv6 when configured to take advantage of
the feature. To enable it, simply create an IPv6 subnet in
OpenStack Neutron and use IPv6 prefixes when creating security
groups, as sketched below.</para>
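<para>A minimal sketch with python-neutronclient follows; the
UUIDs, prefixes, and credentials are assumed placeholders:</para>
<programlisting language="python"># Sketch: add an IPv6 subnet to a tenant network, then allow inbound
# HTTP from an IPv6 prefix in a security group.
from neutronclient.v2_0 import client

neutron = client.Client(username="admin", password="secret",
                        tenant_name="demo",
                        auth_url="http://keystone.example.com:5000/v2.0")

neutron.create_subnet({"subnet": {
    "network_id": "TENANT-NET-UUID",
    "ip_version": 6,
    "cidr": "2001:db8:1::/64"}})

neutron.create_security_group_rule({"security_group_rule": {
    "security_group_id": "SECGROUP-UUID",
    "direction": "ingress",
    "ethertype": "IPv6",
    "protocol": "tcp",
    "port_range_min": 80,
    "port_range_max": 80,
    "remote_ip_prefix": "2001:db8::/32"}})</programlisting>
</section>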
<section xml:id="asymetric-links"><title>Asymmetric Links</title>
<para>When designing a network architecture, the traffic patterns
of an application will heavily influence the allocation of
total bandwidth and the number of links that are used to send
and receive traffic. Applications that provide file storage
for customers will allocate bandwidth and links to favor
incoming traffic, whereas video streaming applications will
allocate bandwidth and links to favor outgoing traffic.</para></section>
<section xml:id="performance-network-focus"><title>Performance</title>
<para>It is important to analyze the applications' tolerance for
latency and jitter when designing an environment to support
network focused applications. Certain applications, for
example VoIP, are less tolerant of latency and jitter. Where
latency and jitter are concerned, certain applications may
require tuning of QoS parameters and network device queues to
ensure that they are queued for transmit immediately or
guaranteed minimum bandwidth. Since OpenStack currently does
not support these functions, some considerations may need to
be made for the network plug-in selected.</para>
<para>The location of a service may also impact the application or
consumer experience. If an application is designed to serve
differing content to differing users it will need to be
designed to properly direct connections to those specific
locations. Use a multi-site installation for these situations,
where appropriate.</para>
<para>OpenStack networking can be implemented in two separate
ways. The legacy nova-network provides a flat DHCP network
with a single broadcast domain. This implementation does not
support tenant isolation networks or advanced plug-ins, but it
is currently the only way to implement a distributed layer 3
agent using the multi_host configuration. Neutron is the
official current implementation of OpenStack Networking. It
provides a pluggable architecture that supports a large
variety of network methods. Some of these include a layer 2
only provider network model, external device plug-ins, or even
OpenFlow controllers.</para>
<para>Networking at large scales becomes a set of boundary
questions. The determination of how large a layer 2 domain
needs to be is based on the amount of nodes within the domain
and the amount of broadcast traffic that passes between
instances. Breaking layer 2 boundaries may require the
implementation of overlay networks and tunnels. This decision
is a balancing act between the need for smaller overhead and
the need for a smaller broadcast domain.</para>
<para>When selecting network devices, be aware that making this
decision based on largest port density often comes with a
drawback. Aggregation switches and routers have not all kept
pace with Top of Rack switches and may induce bottlenecks on
north-south traffic. As a result, it may be possible for
massive amounts of downstream network utilization to impact
upstream network devices, impacting service to the cloud.
Since OpenStack does not currently provide a mechanism for
traffic shaping or rate limiting, it is necessary to implement
these features at the network hardware level.</para></section></section>
</section>

@ -0,0 +1,170 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="user-requirements-network-focus">
<?dbhtml stop-chunking?>
<title>User Requirements</title>
<para>Network-focused architectures vary from the general purpose
designs. They are heavily influenced by a specific subset of
applications that interact with the network in a more
demanding way. Some of the business requirements that will
influence the design include:</para>
<itemizedlist>
<listitem>
<para>User experience: User experience is impacted by
network latency through slow page loads, degraded
video streams, and low quality VoIP sessions. Users
are often not aware of how network design and
architecture affects their experiences. Both
enterprise customers and end-users rely on the network
for delivery of an application. Network performance
problems can result in a negative experience for the
end user, as well as productivity and economic loss.
</para>
</listitem>
<listitem>
<para>Regulatory requirements: Networks need to take into
consideration any regulatory requirements about the
physical location of data as it traverses the network.
For example, Canadian medical records cannot pass
outside of Canadian sovereign territory. Another
network consideration is maintaining network
segregation of private data flows and ensuring that
the network between cloud locations is encrypted where
required. Network architectures are affected by
regulatory requirements for encryption and protection
of data in flight as the data moves through various
networks.</para>
</listitem>
</itemizedlist>
<para>Many jurisdictions have legislative and regulatory
requirements governing the storage and management of data in
cloud environments. Common areas of regulation include:</para>
<itemizedlist>
<listitem>
<para>Data retention policies ensuring storage of
persistent data and records management to meet data
archival requirements.</para>
</listitem>
<listitem>
<para>Data ownership policies governing the possession and
responsibility for data.</para>
</listitem>
<listitem>
<para>Data sovereignty policies governing the storage of
data in foreign countries or otherwise separate
jurisdictions.</para>
</listitem>
<listitem>
<para>Data compliance policies governing where information
needs to reside in certain locations due to regulatory
issues and, more importantly, where it cannot reside
in other locations for the same reason.</para>
</listitem>
</itemizedlist>
<para>Examples of such legal frameworks include the data
protection framework of the European Union
(http://ec.europa.eu/justice/data-protection/) and the
requirements of the Financial Industry Regulatory Authority
(http://www.finra.org/Industry/Regulation/FINRARules) in the
United States. Consult a local regulatory body for more
information.</para>
<section xml:id="high-availability-issues-network-focus"><title>High Availability Issues</title>
<para>OpenStack installations with high demand on network
resources have high availability requirements that are
determined by the application and use case. Financial
transaction systems will have a much higher requirement for
high availability than a development application. Forms of
network availability, for example quality of service (QoS),
can be used to improve the network performance of sensitive
applications, for example VoIP and video streaming.</para>
<para>Often, high performance systems will have SLA requirements
for a minimum QoS with regard to guaranteed uptime, latency
and bandwidth. The level of the SLA can have a significant
impact on the network architecture and requirements for
redundancy in the systems.</para></section>
<section xml:id="risks-network-focus"><title>Risks</title>
<itemizedlist>
<listitem>
<para>Network Misconfigurations: Configuring incorrect IP
addresses, VLANs, and routes can cause outages to
areas of the network or, in the worst case scenario,
the entire cloud infrastructure. Misconfigurations can
cause disruptive problems, so configuration changes should
be automated to minimize the opportunity for operator error.</para>
</listitem>
<listitem>
<para>Capacity Planning: Cloud networks need to be managed
for capacity and growth over time. There is a risk
that the network will not grow to support the
workload. Capacity planning includes the purchase of
network circuits and hardware that can potentially
have lead times measured in months or more.</para>
</listitem>
<listitem>
<para>Network Tuning: Cloud networks need to be configured
to minimize link loss, packet loss, packet storms,
broadcast storms, and loops.</para>
</listitem>
<listitem>
<para>Single Point Of Failure (SPOF): High availability
must be taken into account even at the physical and
environmental layers. If there is a single point of
failure due to only one upstream link, or only one
power supply, an outage becomes unavoidable.</para>
</listitem>
<listitem>
<para>Complexity: An overly complex network design becomes
difficult to maintain and troubleshoot. While
automated tools that handle overlay networks or device
level configuration can mitigate this, non-traditional
interconnects between functions and specialized
hardware need to be well documented or avoided to
prevent outages.</para>
</listitem>
<listitem>
<para>Non-standard features: There are additional risks
that arise from configuring the cloud network to take
advantage of vendor specific features. One example is
multi-link aggregation (MLAG) that is being used to
provide redundancy at the aggregator switch level of
the network. MLAG is not a standard and, as a result,
each vendor has their own proprietary implementation
of the feature. MLAG architectures are not
interoperable across switch vendors, which leads to
vendor lock-in, and can cause delays or inability when
upgrading components.</para>
</listitem>
</itemizedlist></section>
<section xml:id="security-network-focus"><title>Security</title>
<para>Security is often overlooked or added after a design has
been implemented. Consider security implications and
requirements before designing the physical and logical network
topologies. Some of the factors that need to be addressed
include making sure the networks are properly segregated and
traffic flows are going to the correct destinations without
crossing through locations that are undesirable. Some examples
of factors that need to be taken into consideration are:</para>
<itemizedlist>
<listitem>
<para>Firewalls</para>
</listitem>
<listitem>
<para>Overlay interconnects for joining separated tenant
networks</para>
</listitem>
<listitem>
<para>Routing through or avoiding specific networks</para>
</listitem>
</itemizedlist>
<para>Another security vulnerability that must be taken into
account is how networks are attached to hypervisors. If a
network must be separated from other systems at all costs, it
may be necessary to schedule instances for that network onto
dedicated compute nodes. This may also be done to mitigate
against exploiting a hypervisor breakout allowing the attacker
access to networks from a compromised instance.</para>
</section>
</section>

doc/arch-design/pom.xml Normal file

@ -0,0 +1,79 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<parent>
<groupId>org.openstack.docs</groupId>
<artifactId>parent-pom</artifactId>
<version>1.0.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>openstack-arch-design</artifactId>
<packaging>jar</packaging>
<name>OpenStack Architecture Design Guide</name>
<properties>
<!-- This is set by Jenkins according to the branch. -->
<release.path.name></release.path.name>
<comments.enabled>0</comments.enabled>
</properties>
<!-- ################################################ -->
<!-- USE "mvn clean generate-sources" to run this POM -->
<!-- ################################################ -->
<build>
<plugins>
<plugin>
<groupId>com.rackspace.cloud.api</groupId>
<artifactId>clouddocs-maven-plugin</artifactId>
<!-- version set in ../pom.xml -->
<executions>
<execution>
<id>generate-webhelp</id>
<goals>
<goal>generate-webhelp</goal>
</goals>
<phase>generate-sources</phase>
<configuration>
<!-- These parameters only apply to webhelp -->
<enableDisqus>0</enableDisqus>
<disqusShortname>openstack-arch-design</disqusShortname>
<enableGoogleAnalytics>1</enableGoogleAnalytics>
<googleAnalyticsId>UA-17511903-1</googleAnalyticsId>
<generateToc>
appendix toc,title
article/appendix nop
article toc,title
book toc,title,figure,table,example,equation
chapter toc,title
section toc
part toc,title
qandadiv toc
qandaset toc
reference toc,title
set toc,title
</generateToc>
<!-- The following elements sets the autonumbering of sections in output for chapter numbers but no numbered sections-->
<sectionAutolabel>0</sectionAutolabel>
<tocSectionDepth>1</tocSectionDepth>
<sectionLabelIncludesComponentLabel>0</sectionLabelIncludesComponentLabel>
<webhelpDirname>arch-design</webhelpDirname>
<pdfFilenameBase>arch-design</pdfFilenameBase>
</configuration>
</execution>
</executions>
<configuration>
<!-- These parameters apply to pdf and webhelp -->
<xincludeSupported>true</xincludeSupported>
<sourceDirectory>.</sourceDirectory>
<includes>
bk-openstack-arch-design.xml
</includes>
<canonicalUrlBase>http://docs.openstack.org/openstack-arch-design/content</canonicalUrlBase>
<glossaryCollection>${basedir}/../glossary/glossary-terms.xml</glossaryCollection>
<branding>openstack</branding>
<formalProcedures>0</formalProcedures>
</configuration>
</plugin>
</plugins>
</build>
</project>


@ -0,0 +1,62 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="desktop-as-a-service">
<?dbhtml stop-chunking?>
<title>Desktop as a Service</title>
<para>Virtual Desktop Infrastructure (VDI) is a service that hosts
user desktop environments on remote servers. This application
is very sensitive to network latency and requires a high
performance compute environment. Traditionally these types of
environments have not been put on cloud environments because
few clouds are built to support such a demanding workload that
is so exposed to end users. Recently, as cloud environments
become more robust, vendors are starting to provide services
that allow virtual desktops to be hosted in the cloud. In the
not too distant future, OpenStack could be used as the
underlying infrastructure to run a virtual infrastructure
environment, either in-house or in the cloud.</para>
<section xml:id="challenges"><title>Challenges</title>
<para>Designing an infrastructure that is suitable to host virtual
desktops is a very different task to that of most virtual
workloads. The infrastructure will need to be designed, for
example:</para>
<itemizedlist>
<listitem>
<para>Boot storms - What happens when hundreds or
thousands of users log in during shift changes,
affects the storage design.</para>
</listitem>
<listitem>
<para>The performance of the applications running in these
virtual desktops</para>
</listitem>
<listitem>
<para>Operating system and compatibility with the
OpenStack hypervisor</para>
</listitem>
</itemizedlist></section>
<section xml:id="broker"><title>Broker</title>
<para>The Connection Broker is a central component of the
architecture that determines which Remote Desktop Host will be
assigned or connected to the user. The broker is often a
full-blown management product allowing for the automated
deployment and provisioning of Remote Desktop Hosts.</para></section>
<section xml:id="possible-solutions"><title>Possible Solutions</title>
<para>There a number of commercial products available today that
provide such a broker solution but nothing that is native in
the OpenStack project. There of course is also the option of
not providing a broker and managing this manually - but this
would not suffice as a large scale, enterprise
solution.</para></section>
<section xml:id="diagram"><title>Diagram</title>
<mediaobject>
<imageobject>
<imagedata
fileref="../images/Specialized_VDI1.png"
/>
</imageobject>
</mediaobject></section>
</section>


@ -0,0 +1,45 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="specialized-hardware">
<?dbhtml stop-chunking?>
<title>Specialized Hardware</title>
<para>Certain workloads require specialized hardware devices that
are either difficult to virtualize or impossible to share.
Applications such as load balancers, highly parallel brute
force computing, and direct to wire networking may need
capabilities that basic OpenStack components do not
provide.</para>
<section xml:id="challenges-specialized-hardware"><title>Challenges</title>
<para>Some applications need access to hardware devices to either
improve performance or provide capabilities that are not
virtual CPU, RAM, network or storage. These can be a shared
resource, such as a cryptography processor, or a dedicated
resource such as a Graphics Processing Unit. OpenStack has
ways of providing some of these, while others may need extra
work.</para></section>
<section xml:id="solutions-specialized-hardware"><title>Solutions</title>
<para>In order to provide cryptography offloading to a set of
instances, it is possible to use Glance configuration options
to assign the cryptography chip to a device node in the guest.
The documentation at
http://docs.openstack.org/cli-reference/content/chapter_cli-glance-property.html
contains further information on configuring this solution, but
it allows all guests using the configured images to access the
hypervisor cryptography device.</para>
<para>If direct access to a specific device is required, it can be
dedicated to a single instance per hypervisor through the use
of PCI pass-through. The OpenStack administrator needs to
define a flavor that specifically has the PCI device in order
to properly schedule instances. More information regarding PCI
pass-through, including instructions for implementing and
using it, is available at
https://wiki.openstack.org/wiki/Pci_passthrough#How_to_check_PCI_status_with_PCI_api_patches.</para>
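        <para>The following sketch shows the general shape of such a
            configuration; the vendor and product IDs, alias name,
            and flavor are illustrative only:</para>
        <programlisting language="bash"># Sketch: nova.conf entries on the controller and compute node.
# pci_alias={"vendor_id":"8086","product_id":"0443","name":"crypto"}
# pci_passthrough_whitelist=[{"vendor_id":"8086","product_id":"0443"}]
# Request one matching device in a flavor used for scheduling:
nova flavor-key m1.crypto set "pci_passthrough:alias"="crypto:1"</programlisting>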
<mediaobject>
<imageobject>
<imagedata fileref="../images/Specialized_Hardware2.png"/>
</imageobject>
</mediaobject></section>
</section>
