<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="operational-considerations-general-purpose">
<?dbhtml stop-chunking?>
<title>Operational considerations</title>
<para>In the planning and design phases of the build out, it is
important to include the operations function. Operational
factors affect the design choices for a general purpose cloud,
and operations staff are often tasked with the maintenance of
cloud environments for larger installations.</para>
<para>Decisions about when and where to implement redundancy and high
availability are directly affected by the expectations set by the terms
of the Service Level Agreements (SLAs). SLAs are contractual
obligations that provide assurances for service availability.
They define the levels of availability that drive the technical
design, often with penalties for not meeting contractual obligations.</para>
<para>SLA terms that will affect the design include:</para>
<itemizedlist>
<listitem>
<para>API availability guarantees, which imply multiple
infrastructure services and highly available
load balancers.</para>
</listitem>
<listitem>
<para>Network uptime guarantees, which affect switch
design and might require redundant switching and
power.</para>
</listitem>
<listitem>
<para>Network security policy requirements, which need to be
factored into deployments.</para>
</listitem>
</itemizedlist>
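<para>As a rough illustration of how an availability target turns into a
design constraint, the following Python sketch converts an SLA availability
percentage into the downtime it permits over a period. The function name
<literal>allowed_downtime_minutes</literal>, the 30-day period, and the
sample targets are assumptions chosen for this example; they do not come
from any particular SLA.</para>
<programlisting language="python"><![CDATA[
```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Return the downtime, in minutes, permitted by an availability target.

    Both the function name and the 30-day default are illustrative.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    unavailable_fraction = (100 - availability_percent) / 100
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    # Hypothetical targets; real values come from the contract.
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {budget:.1f} minutes per 30 days")
```
]]></programlisting>
<para>A tighter target leaves less room for planned maintenance and
unplanned faults combined, which is one reason API availability guarantees
tend to imply redundant services behind highly available load
balancers.</para>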
<section xml:id="support-and-maintainability-general-purpose">
<title>Support and maintainability</title>
<para>To support and maintain an installation, OpenStack
cloud management requires operations staff to understand the
architecture design. The skill level of the operations and engineering
staff, and the level of separation between them, depend on the size and
purpose of the installation. Large cloud service providers, or telecom
providers, are more likely to be managed by a specially trained, dedicated
operations organization. Smaller implementations are more likely to rely
on support staff that need to take on combined engineering, design, and
operations functions.</para>
<para>Maintaining OpenStack installations requires a
variety of technical skills. To reduce the operations burden, incorporate
features into the architecture and design that automate operations
functions. It may also be beneficial to use third party management
companies with special expertise in managing OpenStack
deployments.</para>
</section>
<section xml:id="monitoring-general-purpose">
<title>Monitoring</title>
<para>OpenStack clouds require appropriate monitoring platforms to
ensure errors are caught and managed appropriately. Specific
meters that are critically important to monitor include:</para>
<itemizedlist>
<listitem>
<para>
Image disk utilization
</para>
</listitem>
<listitem>
<para>
Response time to the Compute API
</para>
</listitem>
</itemizedlist>
<para>Integrating OpenStack with existing monitoring systems is an
effective way to ensure that the environment is monitored.</para>
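<para>The two meters listed above can be sampled with very little code. The
following Python sketch is a minimal illustration using only the standard
library. The helper names <literal>disk_utilization_percent</literal> and
<literal>timed_call</literal> are hypothetical, the current directory stands
in for wherever images are stored, and <literal>time.sleep</literal> stands
in for a real request to the Compute API.</para>
<programlisting language="python"><![CDATA[
```python
import shutil
import time


def disk_utilization_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def timed_call(func, *args, **kwargs):
    """Call `func` and return a (result, elapsed_seconds) tuple."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # "." is a placeholder for the directory where images are stored.
    print(f"disk utilization: {disk_utilization_percent('.'):.1f}%")

    # time.sleep is a placeholder for a request to the Compute API.
    _, elapsed = timed_call(time.sleep, 0.05)
    print(f"response time: {elapsed:.3f}s")
```
]]></programlisting>
<para>In practice, an existing monitoring platform would collect these
values on a schedule and raise alerts when they cross agreed
thresholds.</para>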
</section>
<section xml:id="downtime-general-purpose">
<title>Downtime</title>
<para>To run cloud installations effectively, initial downtime planning
includes creating processes and architectures that support
the following types of downtime:</para>
<itemizedlist>
<listitem>
<para>
Planned (maintenance)
</para>
</listitem>
<listitem>
<para>
Unplanned (system faults)
</para>
</listitem>
</itemizedlist>
<para>The resiliency of the overall system and of individual components is
dictated by the requirements of the SLA, meaning that designing
for high availability (HA) can have cost ramifications.</para>
<para>For example, if a compute host fails, recovery is an operational
consideration that requires either restoring instances from a
snapshot or re-spawning them. The overall application design
is also impacted: general purpose clouds should not need to provide
abilities to migrate instances from one host to another. Additional
considerations need to be made around supporting instance migration if
the expectation is that the application will be designed to tolerate
failure. Extra support services, including shared storage attached to
compute hosts, might need to be deployed in this example.</para>
</section>
<section xml:id="capacity-planning">
<title>Capacity planning</title>
<para>Capacity constraints for a general purpose cloud environment
include:</para>
<itemizedlist>
<listitem>
<para>
Compute limits
</para>
</listitem>
<listitem>
<para>
Storage limits
</para>
</listitem>
</itemizedlist>
<para>A relationship exists between the size of the compute environment
and the number of OpenStack infrastructure controller nodes required to
support it.</para>
<para>Increasing the size of the compute environment increases
the network traffic and messages, adding load to the controller or
networking nodes. Effective monitoring of the environment will help
with capacity decisions on scaling.</para>
<para>Compute nodes automatically attach to OpenStack clouds, resulting in
a horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding
compute nodes to environments, ensure that identical or functionally
compatible CPUs are used; otherwise live migration features will break.
Scaling out compute hosts directly affects network and data center
resources, so it may also be necessary to add rack capacity or network
switches.</para>
<para>Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by adjusting
the overcommit ratio. Keep in mind that changing the CPU overcommit
ratio can have a detrimental effect and potentially increase noisy
neighbor problems. An additional risk of increasing the overcommit ratio is
that more instances fail when a compute host fails.</para>
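<para>The arithmetic behind the overcommit trade-off can be sketched in a
few lines of Python. This is an illustration under stated assumptions, not a
description of how the Compute scheduler places instances: the function
name <literal>instances_per_host</literal>, its parameter names, and the
host and flavor sizes are all hypothetical.</para>
<programlisting language="python"><![CDATA[
```python
def instances_per_host(physical_cores, ram_mb, cpu_ratio, ram_ratio,
                       flavor_vcpus, flavor_ram_mb):
    """Estimate how many instances of one flavor fit on a compute host.

    The result is limited by whichever resource runs out first.
    """
    vcpu_capacity = int(physical_cores * cpu_ratio)
    ram_capacity_mb = int(ram_mb * ram_ratio)
    return min(vcpu_capacity // flavor_vcpus,
               ram_capacity_mb // flavor_ram_mb)


if __name__ == "__main__":
    # Hypothetical host: 16 cores, 128 GB RAM; flavor: 2 vCPUs, 4 GB RAM.
    for cpu_ratio in (1.0, 4.0, 16.0):
        count = instances_per_host(16, 131072, cpu_ratio, 1.0, 2, 4096)
        print(f"CPU overcommit {cpu_ratio}:1 -> {count} instances per host")
```
]]></programlisting>
<para>The returned count is also the number of instances affected if that
host fails. Once another resource, memory in this sketch, becomes the
limit, raising the CPU overcommit ratio no longer adds capacity.</para>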
<para>Compute host components can also be upgraded to account for
increases in demand; this is known as vertical scaling.
Upgrading CPUs with more cores, or increasing the overall
server memory, can add extra needed capacity depending on
whether the running applications are more CPU intensive or
memory intensive.</para>
<para>Insufficient disk capacity could also have a negative effect
on overall performance, including CPU and memory usage.
Depending on the back-end architecture of the OpenStack Block
Storage layer, adding capacity includes adding disk shelves to
enterprise storage systems or installing additional block
storage nodes. It may also be necessary to upgrade directly
attached storage installed in compute hosts, and to add capacity
to the shared storage that provides additional ephemeral storage
to instances.</para>
<para>
For a deeper discussion on many of these topics, refer to the
<link
xlink:href="http://docs.openstack.org/ops"><citetitle>OpenStack
Operations Guide</citetitle></link>.
</para>
</section>
</section>