<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="arch-design-architecture-hardware">
<?dbhtml stop-chunking?>
<title>Architecture</title>
<para>The hardware selection covers three areas:</para>
<itemizedlist>
<listitem>
<para>Compute</para>
</listitem>
<listitem>
<para>Network</para>
</listitem>
<listitem>
<para>Storage</para>
</listitem>
</itemizedlist>
<para>
An OpenStack cloud with extreme demands on processor and memory
resources is considered to be compute-focused, and requires hardware that
can handle these demands. This can mean choosing hardware that might
not perform as well on storage or network capabilities. In a
compute-focused architecture, storage and networking are required while
loading a data set into the computational cluster, but are not otherwise
in heavy demand.
</para>
<para>
Compute (server) hardware must be evaluated against four dimensions:
</para>
<variablelist>
<varlistentry>
<term>Server density</term>
<listitem>
<para>A measure of how many servers can fit into a
given amount of physical space, such as a rack unit (U).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Resource capacity</term>
<listitem>
<para>The number of CPU cores, how much RAM, or how
much storage a given server will deliver.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Expandability</term>
<listitem>
<para>The number of additional resources that can be
added to a server before it has reached its limit.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Cost</term>
<listitem>
<para>The relative purchase price of the hardware weighted
against the level of design effort needed to build the system.</para>
</listitem>
</varlistentry>
</variablelist>
<para>The dimensions need to be weighed against each other to determine the
best design for the desired purpose. For example, increasing server density
means sacrificing resource capacity or expandability. Increasing resource
capacity and expandability can increase cost and decrease server density.
Decreasing cost can mean decreasing supportability, server density,
resource capacity, and expandability.</para>
<para>A compute-focused cloud should have an emphasis on server hardware
that can offer more CPU sockets, more CPU cores, and more RAM. Network
connectivity and storage capacity are less critical. The hardware will
need to be configured to provide enough network connectivity and storage
capacity to meet minimum user requirements, but they are not the primary
consideration.</para>
<para>Some server hardware form factors are better suited than others, as
CPU and RAM capacity have the highest priority. Some considerations for
selecting hardware:</para>
<itemizedlist>
<listitem>
<para>Most blade servers can support dual-socket, multi-core CPUs. To
avoid this CPU limit, select "full width" or "full height" blades;
however, this also decreases server density. For example,
high density blade servers (like HP BladeSystem or Dell PowerEdge
M1000e) support up to 16 servers in only ten rack units when using
half-height blades. Half-height blades are twice as dense as
full-height blades, which allow only eight servers per ten rack
units.</para>
</listitem>
<listitem>
<para>1U rack-mounted servers (servers that occupy only a single rack
unit) may be able to offer greater server density than a blade server
solution. It is possible to place forty 1U servers in a rack while
still leaving space for the top of rack (ToR) switches, compared to 32
full width blade servers. However, as of the Icehouse release, 1U
servers from the major vendors are limited to dual-socket, multi-core
CPU configurations. To obtain greater than dual-socket support in a 1U
rack-mount form factor, you will need to buy systems from original
design manufacturers (ODMs) or second-tier manufacturers.</para>
</listitem>
<listitem>
<para>2U rack-mounted servers provide quad-socket, multi-core CPU
support, but with a corresponding decrease in server density (half
the density offered by 1U rack-mounted servers).</para>
</listitem>
<listitem>
<para>Larger rack-mounted servers, such as 4U servers, often provide
even greater CPU capacity, commonly supporting four or even eight CPU
sockets. These servers have greater expandability, but such servers
have much lower server density and are often more expensive.</para>
</listitem>
<listitem>
<para>"Sled servers" (rack-mounted servers that support multiple
independent servers in a single 2U or 3U enclosure) deliver increased
density as compared to typical 1U or 2U rack-mounted servers. For
example, many sled servers offer four independent dual-socket
nodes in 2U for a total of eight CPU sockets in 2U. However, because
individual nodes are limited to dual-socket configurations, the density
gain may not be sufficient to offset their additional cost and
configuration complexity.</para>
</listitem>
</itemizedlist>
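<para>The trade-off between server density and socket count in the list
above can be made concrete with simple arithmetic. The following Python
sketch is illustrative only. The enclosure figures come from the examples
in this section (16 half-height or eight full-height blades in ten rack
units, 1U and 2U rack-mounted servers, and a four-node 2U sled). The
quad-socket figure for full-height blades is an assumption rather than a
vendor specification, and the names <literal>FormFactor</literal> and
<literal>FORM_FACTORS</literal> are hypothetical.</para>
<programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormFactor:
    name: str
    servers: int             # independent servers per enclosure
    rack_units: int          # rack units the enclosure occupies
    sockets_per_server: int  # CPU sockets in each server

    def servers_per_unit(self) -> float:
        """Server density: servers per rack unit."""
        return self.servers / self.rack_units

    def sockets_per_unit(self) -> float:
        """CPU socket density: sockets per rack unit."""
        return self.servers * self.sockets_per_server / self.rack_units


FORM_FACTORS = [
    FormFactor("half-height blade", 16, 10, 2),
    FormFactor("full-height blade", 8, 10, 4),  # assumed quad-socket
    FormFactor("1U rack-mount", 1, 1, 2),
    FormFactor("2U rack-mount", 1, 2, 4),
    FormFactor("2U four-node sled", 4, 2, 2),
]

if __name__ == "__main__":
    for ff in FORM_FACTORS:
        print(f"{ff.name:20} {ff.servers_per_unit():5.2f} servers/U "
              f"{ff.sockets_per_unit():5.2f} sockets/U")
```
]]></programlisting>
<para>Comparing the two columns shows why density alone is a poor guide:
a form factor can hold fewer servers per rack unit yet still deliver more
sockets per server, which is often what a compute-focused workload
needs.</para>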
<para>Consider these facts when choosing server hardware for a
compute-focused OpenStack design architecture:</para>
<variablelist>
<varlistentry>
<term>Instance density</term>
<listitem>
<para>In a compute-focused architecture, instance density is
lower, which means CPU and RAM over-subscription ratios are
also lower. Because of this lower instance density, more hosts are
required to support the anticipated scale, especially if the
design uses dual-socket hardware.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Host density</term>
<listitem>
<para>Another option to address the higher host count
that might be needed with dual-socket designs is to use a quad-socket
platform. Taking this approach will decrease host density,
which increases rack count. This configuration may
affect the network requirements, the number of power connections, and
possibly the cooling requirements.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Power and cooling density</term>
<listitem>
<para>The power and cooling density
requirements might be lower than with blade, sled, or 1U server
designs because of lower host density (by using 2U, 3U or even 4U
server designs). For data centers with older infrastructure, this may
be a desirable feature.</para>
</listitem>
</varlistentry>
</variablelist>
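<para>The relationship between over-subscription ratios and host count
described above can be estimated with a short calculation. The following
Python sketch is a rough capacity estimate under stated assumptions, not a
sizing tool: it compares aggregate demand with aggregate capacity and
ignores placement fragmentation and host overhead. The function name
<literal>hosts_required</literal> and all workload and hardware figures
are hypothetical.</para>
<programlisting language="python"><![CDATA[
```python
import math


def hosts_required(instances, vcpus_per_instance, ram_gb_per_instance,
                   cores_per_host, ram_gb_per_host,
                   cpu_ratio, ram_ratio):
    """Estimate how many hosts are needed to place all instances.

    cpu_ratio and ram_ratio are over-subscription ratios: virtual
    resources offered per unit of physical resource.
    """
    vcpu_capacity = cores_per_host * cpu_ratio
    ram_capacity = ram_gb_per_host * ram_ratio
    hosts_for_cpu = math.ceil(instances * vcpus_per_instance / vcpu_capacity)
    hosts_for_ram = math.ceil(instances * ram_gb_per_instance / ram_capacity)
    # Whichever resource runs out first determines the host count.
    return max(hosts_for_cpu, hosts_for_ram)


if __name__ == "__main__":
    # Hypothetical workload: 1000 instances, 4 vCPUs and 16 GB RAM each.
    workload = dict(instances=1000, vcpus_per_instance=4,
                    ram_gb_per_instance=16)
    # Low ratios on hypothetical dual-socket hosts (32 cores, 256 GB).
    print(hosts_required(**workload, cores_per_host=32, ram_gb_per_host=256,
                         cpu_ratio=2.0, ram_ratio=1.0))
    # Same ratios on hypothetical quad-socket hosts (64 cores, 512 GB).
    print(hosts_required(**workload, cores_per_host=64, ram_gb_per_host=512,
                         cpu_ratio=2.0, ram_ratio=1.0))
    # Higher ratios on the dual-socket hosts, for contrast.
    print(hosts_required(**workload, cores_per_host=32, ram_gb_per_host=256,
                         cpu_ratio=8.0, ram_ratio=1.5))
```
]]></programlisting>
<para>With these example inputs, lowering the ratios raises the host
count on the same hardware, and moving from dual-socket to quad-socket
hosts roughly halves it, at the price of larger and typically more
expensive hosts.</para>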
<para>Server hardware selection for a compute-focused OpenStack design
architecture results in a "scale up" versus "scale out" decision.
Selecting the better solution, a smaller number of larger hosts or a
larger number of smaller hosts, depends on a combination of factors:
cost, power, cooling, physical rack and floor space, support-warranty,
and manageability.</para>
<section xml:id="storage-hardware-selection">
<title>Storage hardware selection</title>
<para>For a compute-focused OpenStack design architecture, the selection of
storage hardware is not a primary criterion; however,
it is still important. There are a number of different factors that a
cloud architect must consider:</para>
<variablelist>
<varlistentry>
<term>Cost</term>
<listitem>
<para>The overall cost of the solution will play a major role
in what storage architecture (and resulting storage hardware) is
selected.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Performance</term>
<listitem>
<para>The performance of the solution also plays a major
role and can be measured by observing the latency of storage I/O
requests. In a compute-focused OpenStack cloud, storage latency
can be a major consideration. In some compute-intensive
workloads, minimizing the delays that the CPU experiences while
fetching data from storage can have a significant impact on
the overall performance of the application.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Scalability</term>
<listitem>
<para>This section uses the term "scalability"
to refer to how well the storage solution performs as it is
expanded up to its maximum size. A storage solution that performs
well in small configurations but has degrading
performance as it expands would not be considered scalable. On
the other hand, a solution that continues to perform well at
maximum expansion would be considered scalable.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Expandability</term>
<listitem>
<para>Expandability refers to the overall ability of
the solution to grow. A storage solution that expands to 50 PB is
considered more expandable than a solution that only expands to 10 PB.
Note that this metric is related to, but different
from, scalability, which is a measure of the solution's
performance as it expands.</para>
</listitem>
</varlistentry>
</variablelist>
<para>For a compute-focused OpenStack cloud, latency of storage is a
major consideration. Using solid-state disks (SSDs) to minimize
latency for instance storage and reduce CPU delays caused by waiting
for the storage will increase performance. Consider using RAID
controller cards in compute hosts to improve the performance of the
underlying disk subsystem.</para>
<para>The selection of storage architecture, and the corresponding
storage hardware (if there is the option), is determined by evaluating
possible solutions against the key factors listed above. This will
determine if a scale-out solution (such as Ceph, GlusterFS, or similar)
should be used, or if a single, highly expandable and scalable
centralized storage array would be a better choice. If a centralized
storage array is the right fit for the requirements, the hardware will
be determined by the array vendor. It is also possible to build a
storage array using commodity hardware with open source software, but
this requires access to people with the expertise to build such a
system. Conversely, a scale-out storage solution that uses
direct-attached storage (DAS) in the servers may be an appropriate
choice. If so, then the server hardware needs to be configured to
support the storage solution.</para>
<para>The following lists some of the potential impacts that may affect a
particular storage architecture, and the corresponding storage hardware,
of a compute-focused OpenStack cloud:</para>
<variablelist>
<varlistentry>
<term>Connectivity</term>
<listitem>
<para>Based on the storage solution selected, ensure
the connectivity matches the storage solution requirements. If a
centralized storage array is selected, it is important to determine
how the hypervisors will connect to the storage array. Connectivity
could affect latency and thus performance, so check that the network
characteristics will minimize latency to boost the overall
performance of the design.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Latency</term>
<listitem>
<para>Determine if the use case will have consistent or
highly variable latency.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Throughput</term>
<listitem>
<para>To improve overall performance, make sure that the
storage solution throughput is optimized. While it is not likely
that a compute-focused cloud will have major data I/O to and
from storage, this is an important factor to consider.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Server Hardware</term>
<listitem>
<para>If the solution uses DAS, this impacts, but
is not limited to, the server hardware choice, which in turn ripples into
host density, instance density, power density, OS-hypervisor, and
management tools.</para>
</listitem>
</varlistentry>
</variablelist>
<para>Where instances need to be made highly available, or need to be
capable of migration between hosts, use a shared storage file system
to store instance ephemeral data so that
compute services can run uninterrupted in the event of a node
failure.</para>
</section>
<section xml:id="selecting-networking-hardware-arch">
<title>Selecting networking hardware</title>
<para>Key considerations in
the selection of networking hardware include:</para>
<variablelist>
<varlistentry>
<term>Port count</term>
<listitem>
<para>The design will require networking hardware that
has the requisite port count.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Port density</term>
<listitem>
<para>The network design will be affected by the
physical space that is required to provide the requisite port count.
A switch that can provide 48 10 GbE ports in 1U has a much higher
port density than a switch that provides 24 10 GbE ports in 2U. A
higher port density is preferred, as it leaves more rack space for
compute or storage components that might be required by the design.
This also leads into concerns about fault domains and power density
that must also be considered. Higher density switches are more
expensive, which should also be taken into account, as it is important
not to over design the network if it is not required.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Port speed</term>
<listitem>
<para>The networking hardware must support the proposed
network speed, for example: 1 GbE, 10 GbE, or 40 GbE (or even 100
GbE).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Redundancy</term>
<listitem>
<para>The level of network hardware redundancy required
is influenced by the user requirements for high availability and
cost considerations. Network redundancy can be achieved by adding
redundant power supplies or paired switches. If this is a
requirement, the hardware will need to support this configuration.
User requirements will determine if a completely redundant network
infrastructure is required.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Power requirements</term>
<listitem>
<para>Ensure that the physical data center
provides the necessary power for the selected network hardware. This
is not an issue for top of rack (ToR) switches, but may be an issue
for spine switches in a leaf and spine fabric, or end of row (EoR)
switches.</para>
</listitem>
</varlistentry>
</variablelist>
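<para>The port density comparison above reduces to a simple ratio. The
following Python sketch is illustrative only: the switch figures (48 ports
in 1U and 24 ports in 2U) come from the example in this section, while the
requirement of 96 ports and the function names are hypothetical.</para>
<programlisting language="python"><![CDATA[
```python
import math


def port_density(ports, rack_units):
    """Ports provided per rack unit occupied by the switch."""
    return ports / rack_units


def switch_rack_units(required_ports, ports_per_switch, units_per_switch):
    """Rack units consumed by enough switches to supply required_ports."""
    switches = math.ceil(required_ports / ports_per_switch)
    return switches * units_per_switch


if __name__ == "__main__":
    print(port_density(48, 1), port_density(24, 2))
    # Rack space needed for a hypothetical requirement of 96 ports.
    print(switch_rack_units(96, ports_per_switch=48, units_per_switch=1))
    print(switch_rack_units(96, ports_per_switch=24, units_per_switch=2))
```
]]></programlisting>
<para>The rack units saved by the denser switch are available for compute
or storage hardware, which is the benefit that has to be weighed against
its higher price.</para>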
<para>It is important to first understand additional factors as well as
the use case because these additional factors heavily influence the
cloud network architecture. Once these key considerations have been
decided, the proper network can be designed to best serve the workloads
being placed in the cloud.</para>
<para>We recommend designing the network architecture using
a scalable network model that makes it easy to add capacity and
bandwidth. A good example of such a model is the leaf-spine model. In
this type of network design, it is possible to easily add additional
bandwidth as well as scale out to additional racks of gear. It is
important to select network hardware that will support the required
port count, port speed and port density while also allowing for future
growth as workload demands increase. It is also important to evaluate
where in the network architecture it is valuable to provide redundancy.
Increased network availability and redundancy come at a cost; therefore,
we recommend weighing the cost against the benefit gained from
deploying redundant network switches and using bonded
interfaces at the host level.</para>
</section>
<section xml:id="software-selection-arch">
<title>Software selection</title>
<para>Software selection for a compute-focused OpenStack
architecture design must cover three main areas:</para>
<itemizedlist>
<listitem>
<para>Operating system (OS) and hypervisor</para>
</listitem>
<listitem>
<para>OpenStack components</para>
</listitem>
<listitem>
<para>Supplemental software</para>
</listitem>
</itemizedlist>
<para>Design decisions made in each of these areas impact the rest
of the OpenStack architecture design.</para>
</section>
<section xml:id="os-and-hypervisor-arch">
<title>Operating system and hypervisor</title>
<para>The selection of operating system (OS) and hypervisor has a
significant impact on the end point design. Selecting a particular
operating system and hypervisor could affect server hardware selection.
For example, a selected combination needs to be supported on the selected
hardware. Also ensure that the storage hardware selection and topology
support the selected operating system and hypervisor combination.
Additionally, make sure that the networking hardware
selection and topology will work with the chosen operating system and
hypervisor combination. For example, if the design uses Link Aggregation
Control Protocol (LACP), the hypervisor needs to support it.</para>
<para>Some areas that could be impacted by the selection of OS and
hypervisor include:</para>
<variablelist>
<varlistentry>
<term>Cost</term>
<listitem>
<para>Selecting a commercially supported hypervisor such as
Microsoft Hyper-V will result in a different cost model than
choosing a community-supported open source hypervisor like KVM
or Xen. Even within the ranks of open source solutions, choosing
Ubuntu over Red Hat (or vice versa) will have an impact on cost due
to support contracts. On the other hand, business or application
requirements might dictate a specific or commercially supported
hypervisor.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Supportability</term>
<listitem>
<para>Depending on the selected hypervisor, the staff
should have the appropriate training and knowledge to support the
selected OS and hypervisor combination. If they do not, training
will need to be provided which could have a cost impact on the
design.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Management tools</term>
<listitem>
<para>The management tools used for Ubuntu and
KVM differ from the management tools for VMware vSphere.
Although both OS and hypervisor combinations are supported by
OpenStack, there will be very different impacts to the rest of the
design as a result of the selection of one combination versus the
other.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Scale and performance</term>
<listitem>
<para>Ensure that selected OS and hypervisor
combinations meet the appropriate scale and performance
requirements. The chosen architecture will need to meet the targeted
instance-host ratios with the selected OS-hypervisor
combination.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Security</term>
<listitem>
<para>Ensure that the design can accommodate the regular
periodic installation of application security patches while
maintaining the required workloads. The frequency of security
patches for the proposed OS-hypervisor combination will have an
impact on performance and the patch installation process could
affect maintenance windows.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Supported features</term>
<listitem>
<para>Determine what features of OpenStack are
required. This will often determine the selection of the
OS-hypervisor combination. Certain features are only available with
specific OSs or hypervisors. For example, if certain features are
not available, the design might need to be modified to meet the user
requirements.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Interoperability</term>
<listitem>
<para>Consideration should be given to the ability
of the selected OS-hypervisor combination to interoperate or
co-exist with other OS-hypervisors, or other software solutions in
the overall design (if required). Operational and troubleshooting
tools for one OS-hypervisor combination may differ from the
tools used for another OS-hypervisor combination and,
as a result, the design will need to address if the
two sets of tools need to interoperate.</para>
</listitem>
</varlistentry>
</variablelist>
</section>
<section xml:id="openstack-components-arch">
<title>OpenStack components</title>
<para>The selection of which OpenStack components will actually be
included in the design and deployed has a significant impact. Certain
components will always be present (Compute and Image Service, for
example), yet other services might not need to be present.
For example, a certain design may not require the Orchestration module.
Omitting Orchestration (heat) would not typically have a significant
impact on the overall design. However, if the architecture uses a
replacement for OpenStack Object Storage for its storage component, this
could potentially have significant impacts on the rest of the
design.</para>
<para>For a compute-focused OpenStack design architecture, the
following components would be used:</para>
<itemizedlist>
<listitem>
<para>Identity (keystone)</para>
</listitem>
<listitem>
<para>Dashboard (horizon)</para>
</listitem>
<listitem>
<para>Compute (nova)</para>
</listitem>
<listitem>
<para>Object Storage (swift, Ceph, or a commercial solution)</para>
</listitem>
<listitem>
<para>Image (glance)</para>
</listitem>
<listitem>
<para>Networking (neutron)</para>
</listitem>
<listitem>
<para>Orchestration (heat)</para>
</listitem>
</itemizedlist>
<para>OpenStack Block Storage would potentially not be incorporated
into a compute-focused design due to persistent block storage not
being a significant requirement for the types of workloads that would
be deployed onto instances running in a compute-focused cloud. However,
there may be some situations where the need for performance dictates
that a block storage component be used to improve data I/O.</para>
<para>The exclusion of certain OpenStack components might also limit or
constrain the functionality of other components. If a design opts to
include the Orchestration module but excludes the Telemetry module, then
the design will not be able to take advantage of Orchestration's auto
scaling functionality (which relies on information from Telemetry).
Because you can use Orchestration to spin up a large number of
instances to perform compute-intensive processing, including
Orchestration in a compute-focused architecture design is strongly
recommended.</para>
</section>
<section xml:id="supplemental-software">
<title>Supplemental software</title>
<para>While OpenStack is a fairly complete collection of software
projects for building a platform for cloud services, there are
often additional pieces of software that need to be
added to any given OpenStack design.</para>
<section xml:id="networking-software-arch">
<title>Networking software</title>
<para>OpenStack Networking provides a wide variety of networking services
for instances. There are many additional networking software packages
that might be useful to manage the OpenStack components themselves.
Some examples include software to provide load balancing,
network redundancy protocols, and routing daemons. Some of these
software packages are described in more detail in the
<citetitle>OpenStack High Availability Guide</citetitle> (<link
xlink:href="http://docs.openstack.org/high-availability-guide/content">http://docs.openstack.org/high-availability-guide/content</link>).
</para>
<para>For a compute-focused OpenStack cloud, the OpenStack infrastructure
components will need to be highly available. If the design does not
include hardware load balancing, networking software packages like
HAProxy will need to be included.</para>
</section>
<section xml:id="management-software-arch">
<title>Management software</title>
<para>The selected supplemental software solution affects
the overall OpenStack cloud design. This includes software for
providing clustering, logging, monitoring and alerting.</para>
<para>Inclusion of clustering software, such as Corosync or Pacemaker,
is determined primarily by the availability design requirements.
Therefore, the impact of including (or not including) these software
packages is primarily determined by the availability of the cloud
infrastructure and the complexity of supporting the configuration after
it is deployed. The <citetitle>OpenStack High Availability Guide</citetitle> provides more
details on the installation and configuration of Corosync and Pacemaker,
should these packages need to be included in the design.</para>
<para>Requirements for logging, monitoring, and alerting are determined
by operational considerations. Each of these sub-categories includes
a number of options. For example, in the logging sub-category
one might consider Logstash, Splunk, Log Insight, or some other log
aggregation-consolidation tool. Logs should be stored in a centralized
location to make it easier to perform analytics against the data. Log
data analytics engines can also provide automation and issue
notification by providing a mechanism to both alert and automatically
attempt to remediate some of the more commonly known issues.</para>
<para>If any of these software packages are needed, then the design
must account for the additional resource consumption (CPU, RAM,
storage, and network bandwidth for a log aggregation solution, for
example). Some other potential design impacts include:</para>
<itemizedlist>
<listitem>
<para>OS-hypervisor combination: Ensure that the selected logging,
monitoring, or alerting tools support the proposed OS-hypervisor
combination.</para>
</listitem>
<listitem>
<para>Network hardware: The network hardware selection needs to be
supported by the logging, monitoring, and alerting software.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="database-software-arch">
<title>Database software</title>
<para>A large majority of the OpenStack components require access to
back-end database services to store state and configuration
information. Selection of an appropriate back-end database that will
satisfy the availability and fault tolerance requirements of the
OpenStack services is required. OpenStack services support connecting
to any database that is supported by the SQLAlchemy Python drivers;
however, most common database deployments make use of MySQL or some
variation of it. We recommend that the database that provides
back-end services within the cloud be made highly
available using a technology that can accomplish that
goal. Some of the more common software solutions used include Galera,
MariaDB, and MySQL with multi-master replication.</para>
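<para>SQLAlchemy identifies a database with a URL of the form
<literal>dialect+driver://user:password@host:port/database</literal>,
which is how the choice of back-end database and driver is expressed. The
following Python sketch uses only the standard library (not SQLAlchemy
itself) to show the parts of such a URL. The host name, credentials,
database name, and the function name
<literal>describe_connection</literal> are hypothetical. Pointing services
at a single address in front of the highly available database is an
assumption made for illustration.</para>
<programlisting language="python"><![CDATA[
```python
from urllib.parse import urlsplit


def describe_connection(url):
    """Split a SQLAlchemy-style database URL into its main parts."""
    parts = urlsplit(url)
    # The scheme is either "dialect" or "dialect+driver".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Hypothetical endpoint: one address in front of the HA database.
EXAMPLE_URL = "mysql+pymysql://nova:secret@db.example.com:3306/nova"

if __name__ == "__main__":
    print(describe_connection(EXAMPLE_URL))
```
]]></programlisting>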
</section>
</section>
</section>