0733bb0bed
a general purpose cloud requirements need to be instead of needing changed to “telecom providers” “a” specially trained, dedicated operations organization Change-Id: I296e94c6cb3647b61f5d9cec25a0f3f25b7373c8
165 lines
8.1 KiB
XML
165 lines
8.1 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<section xmlns="http://docbook.org/ns/docbook"
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
version="5.0"
|
|
xml:id="operational-considerations-general-purpose">
|
|
<?dbhtml stop-chunking?>
|
|
<title>Operational considerations</title>
|
|
<para>In the planning and design phases of the build out, it is
|
|
important to include the operation's function. Operational
|
|
factors affect the design choices for a general purpose cloud,
|
|
and operations staff are often tasked with the maintenance of
|
|
cloud environments for larger installations.</para>
|
|
<para>Knowing when and where to implement redundancy and high
|
|
availability is directly affected by expectations set by the terms
|
|
of the Service Level Agreements (SLAs). SLAs are contractual
|
|
obligations that provide assurances for service availability.
|
|
They define the levels of availability that drive the technical
|
|
design, often with penalties for not meeting contractual obligations.</para>
|
|
<para>SLA terms that will affect the design include:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>API availability guarantees implying multiple
|
|
infrastructure services, and highly available
|
|
load balancers.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>Network uptime guarantees affecting switch
|
|
design, which might require redundant switching and
|
|
power.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>Network security policies requirements need to be
|
|
factored in to deployments.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<section xml:id="support-and-maintainability-general-purpose">
|
|
<title>Support and maintainability</title>
|
|
<para>To be able to support and maintain an installation, OpenStack
|
|
cloud management requires operations staff to understand and
|
|
comprehend design architecture content. The operations and engineering
|
|
staff skill level, and level of separation, are dependent on size and
|
|
purpose of the installation. Large cloud service providers, or telecom
|
|
providers, are more likely to be managed by a specially trained, dedicated
|
|
operations organization. Smaller implementations are more likely to rely
|
|
on support staff that need to take on combined engineering, design and
|
|
operations functions.</para>
|
|
<para>Maintaining OpenStack installations requires a
|
|
variety of technical skills. For example, if you are to incorporate features
|
|
into an architecture and design that reduce the operations burden,
|
|
it is advised to automate the operations functions. It may, however,
|
|
be beneficial to use third party management companies with special
|
|
expertise in managing OpenStack deployment.</para>
|
|
</section>
|
|
<section xml:id="monitoring-general-purpose">
|
|
<title>Monitoring</title>
|
|
<para>OpenStack clouds require appropriate monitoring platforms to
|
|
ensure errors are caught and managed appropriately. Specific
|
|
metrics that are critically important to monitor include:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Image disk utilization
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Response time to the Compute API
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>Leveraging existing monitoring systems is an effective check to
|
|
ensure OpenStack environments can be monitored.</para>
|
|
</section>
|
|
|
|
<section xml:id="downtime-general-purpose">
|
|
<title>Downtime</title>
|
|
<para>To effectively run cloud installations, initial downtime planning
|
|
includes creating processes and architectures that support
|
|
the following:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Planned (maintenance)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Unplanned (system faults)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>Resiliency of overall system and individual components are going
|
|
to be dictated by the requirements of the SLA, meaning designing
|
|
for high availability (HA) can have cost ramifications.</para>
|
|
<para>For example, if a compute host failed, this would be an operational
|
|
consideration; requiring the restoration of instances from a
|
|
snapshot or re-spawning an instance. The overall application design
|
|
is impacted, general purpose clouds should not need to provide
|
|
abilities to migrate instances from one host to another. Additional
|
|
considerations need to be made around supporting instance migration if
|
|
the expectation is that the application will be designed to tolerate
|
|
failure. Extra support services, including shared storage attached to
|
|
compute hosts, might need to be deployed in this example.</para>
|
|
</section>
|
|
<section xml:id="capacity-planning">
|
|
<title>Capacity planning</title>
|
|
<para>Capacity constraints for a general purpose cloud environment
|
|
include:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Compute limits
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Storage limits
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>A relationship exists between the size of the compute environment
|
|
and the supporting OpenStack infrastructure controller nodes requiring
|
|
support.</para>
|
|
<para>Increasing the size of the supporting compute environment increases
|
|
the network traffic and messages, adding load to the controller or
|
|
networking nodes. Effective monitoring of the environment will help
|
|
with capacity decisions on scaling.</para>
|
|
<para>Compute nodes automatically attach to OpenStack clouds, resulting in
|
|
a horizontally scaling process when adding extra compute capacity to an
|
|
OpenStack cloud. Additional processes are required to place nodes into
|
|
appropriate availability zones and host aggregates. When adding additional
|
|
compute nodes to environments, ensure identical or functional compatible
|
|
CPUs are used, otherwise live migration features will break. It is necessary
|
|
to add rack capacity or network switches as scaling out compute hosts directly
|
|
affects network and datacenter resources.</para>
|
|
<para>Assessing the average workloads and increasing the number of instances
|
|
that can run within the compute environment by adjusting the overcommit
|
|
ratio is another option. It is important to remember that changing the CPU overcommit
|
|
ratio can have a detrimental effect and cause a potential increase in a
|
|
noisy neighbor. The additional risk of increasing the overcommit ratio is
|
|
more instances failing when a compute host fails.</para>
|
|
<para>Compute host components can also be upgraded to account for
|
|
increases in demand; this is known as vertical scaling.
|
|
Upgrading CPUs with more cores, or increasing the overall
|
|
server memory, can add extra needed capacity depending on
|
|
whether the running applications are more CPU intensive or
|
|
memory intensive.</para>
|
|
<para>Insufficient disk capacity could also have a negative effect
|
|
on overall performance including CPU and memory usage.
|
|
Depending on the back-end architecture of the OpenStack Block
|
|
Storage layer, capacity includes adding disk shelves to
|
|
enterprise storage systems or installing additional block
|
|
storage nodes. Upgrading directly attached storage installed in
|
|
compute hosts, and adding capacity to the shared storage for
|
|
additional ephemeral storage to instances, may be necessary.</para>
|
|
<para>
|
|
For a deeper discussion on many of these topics, refer to the
|
|
<link
|
|
xlink:href="http://docs.openstack.org/ops"><citetitle>OpenStack
|
|
Operations Guide</citetitle></link>.
|
|
</para>
|
|
</section>
|
|
</section>
|