<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="operational-considerations-general-purpose">
    <?dbhtml stop-chunking?>
    <title>Operational considerations</title>
    <para>In the planning and design phases of the build out, it is
        important to include the operations function. Operational
        factors affect the design choices for a general purpose cloud,
        and operations staff are often tasked with the maintenance of
        cloud environments for larger installations.</para>
    <para>Expectations set by Service Level Agreements (SLAs) directly
        affect when and where you should implement redundancy and
        high availability. SLAs are contractual
        obligations that provide assurances for service availability.
        They define the levels of availability that drive the technical
        design, often with penalties for not meeting contractual obligations.</para>
    <para>SLA terms that affect design include:</para>
    <itemizedlist>
        <listitem>
            <para>API availability guarantees implying multiple
                infrastructure services and highly available
                load balancers.</para>
        </listitem>
        <listitem>
            <para>Network uptime guarantees affecting switch
                design, which might require redundant switching and
                power.</para>
        </listitem>
        <listitem>
            <para>Networking security policy requirements, which you
                must factor into your deployments.</para>
        </listitem>
    </itemizedlist>
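    <para>As a rough illustration of how an availability guarantee
        translates into redundancy, the following sketch estimates the
        availability of redundant components. It assumes that components
        fail independently and that one healthy replica is enough to serve
        requests; the function names and the availability figures are
        illustrative assumptions, not measured values.</para>
    <programlisting language="python"><![CDATA[
```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of redundant replicas when one healthy replica suffices.

    Assumes independent failures, which real deployments only approximate.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - component_availability) ** replicas


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    # Illustrative figures only: a 99% available load balancer and API node.
    lb_pair = parallel_availability(0.99, 2)
    api_nodes = parallel_availability(0.99, 3)
    print(f"load balancer pair: {lb_pair:.6f}")
    print(f"API request path:   {serial_availability(lb_pair, api_nodes):.6f}")
```
]]></programlisting>
    <para>Under these assumptions, pairing two 99% available load
        balancers raises the estimate for that tier to 99.99%, which is
        why an API availability guarantee often implies redundant
        infrastructure services.</para>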

    <section xml:id="support-and-maintainability-general-purpose">
      <title>Support and maintainability</title>
    <para>To support and maintain an installation, OpenStack
        cloud management requires operations staff to understand the
        design architecture. The skill level of the operations and engineering
        staff, and the level of separation between them, depend on the size and
        purpose of the installation. Large cloud service providers, or telecom
        providers, are more likely to be managed by specially trained, dedicated
        operations organizations. Smaller implementations are more likely to rely
        on support staff who take on combined engineering, design, and
        operations functions.</para>
    <para>Maintaining OpenStack installations requires a
        variety of technical skills. You may want to consider using a third-party
        management company with special expertise in managing
        OpenStack deployments.</para>
    </section>

    <section xml:id="monitoring-general-purpose">
      <title>Monitoring</title>
    <para>OpenStack clouds require appropriate monitoring platforms to
        ensure errors are caught and managed appropriately. Specific
        meters that are critically important to monitor include:</para>
      <itemizedlist>
        <listitem>
          <para>
            Image disk utilization
          </para>
        </listitem>
        <listitem>
          <para>
            Response time to the Compute API
          </para>
        </listitem>
      </itemizedlist>
    <para>Leveraging existing monitoring systems is an effective way to
        ensure that OpenStack environments are monitored.</para>
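    <para>A minimal sketch of how the two meters above could be sampled
        and compared against thresholds follows. The path, the placeholder
        request, and the threshold values are assumptions for illustration;
        a real deployment would point at its image store and issue an
        authenticated request to its own Compute API endpoint.</para>
    <programlisting language="python"><![CDATA[
```python
import shutil
import time
from typing import Callable


def disk_utilization_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def timed_call(request: Callable[[], object]) -> float:
    """Seconds taken by one call to `request`, for example an API GET."""
    start = time.perf_counter()
    request()
    return time.perf_counter() - start


def check(name: str, value: float, warn_at: float) -> str:
    """Format one meter reading with an OK or WARN status."""
    status = "WARN" if value >= warn_at else "OK"
    return f"{status} {name}={value:.2f} (threshold {warn_at:.2f})"


if __name__ == "__main__":
    # Hypothetical stand-ins: "." for the image store path, and a short
    # sleep in place of a real Compute API request.
    print(check("image_disk_pct", disk_utilization_percent("."), 80.0))
    print(check("compute_api_seconds", timed_call(lambda: time.sleep(0.05)), 1.0))
```
]]></programlisting>
    <para>Passing the request in as a callable keeps the timing logic
        independent of any particular client library, so the same check
        can be wired into whichever monitoring system is already in
        use.</para>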
    </section>

    <section xml:id="downtime-general-purpose">
      <title>Downtime</title>
    <para>To run cloud installations effectively, initial downtime planning
        must include processes and architectures that support
        the following types of downtime:</para>
      <itemizedlist>
        <listitem>
          <para>
            Planned (maintenance)
          </para>
        </listitem>
        <listitem>
          <para>
            Unplanned (system faults)
          </para>
        </listitem>
      </itemizedlist>
    <para>The requirements of the SLA dictate the resiliency of the overall
        system and of its individual components. Designing for high
        availability (HA) to meet those requirements can have cost
        ramifications.</para>
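    <para>To make the link between an SLA and downtime planning concrete,
        the following sketch converts an availability percentage into a
        downtime budget. The function names and the 30-day period are
        assumptions for illustration. Whether planned maintenance counts
        against the budget depends on the wording of the SLA; this sketch
        assumes that it does.</para>
    <programlisting language="python"><![CDATA[
```python
def downtime_budget_minutes(availability_percent: float,
                            period_days: float = 30.0) -> float:
    """Minutes of downtime allowed in a period at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - availability_percent / 100.0)


def remaining_budget_minutes(availability_percent: float,
                             planned_minutes: float,
                             unplanned_minutes: float,
                             period_days: float = 30.0) -> float:
    """Budget left after planned and unplanned downtime; negative if exceeded.

    Assumes planned maintenance counts against the SLA.
    """
    budget = downtime_budget_minutes(availability_percent, period_days)
    return budget - planned_minutes - unplanned_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```
]]></programlisting>
    <para>A 99.9% target over 30 days leaves about 43.2 minutes for
        planned and unplanned downtime combined, which shows how quickly
        a stricter SLA forces an HA design.</para>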
    </section>

    <section xml:id="capacity-planning">
      <title>Capacity planning</title>
    <para>Capacity constraints for a general purpose cloud environment
        include:</para>
      <itemizedlist>
       <listitem>
         <para>
          Compute limits
         </para>
       </listitem>
       <listitem>
         <para>
          Storage limits
         </para>
       </listitem>
     </itemizedlist>
   <para>A relationship exists between the size of the compute environment
        and the OpenStack infrastructure controller nodes required to
        support it.</para>
   <para>Increasing the size of the compute environment increases
        the network traffic and messages, adding load to the controller or
        networking nodes. Effective monitoring of the environment helps
        with capacity decisions on scaling.</para>
   <para>Compute nodes automatically attach to OpenStack clouds, which makes
        adding extra compute capacity to an OpenStack cloud a horizontally
        scaling process. Additional processes are required to place nodes into
        appropriate availability zones and host aggregates. When adding
        compute nodes to environments, ensure identical or functionally compatible
        CPUs are used, otherwise live migration features will break. Because
        scaling out compute hosts directly affects network and datacenter
        resources, it may also be necessary to add rack capacity or network
        switches.</para>
   <para>Another option is to assess the average workloads and increase the
        number of instances that can run within the compute environment by
        adjusting the overcommit ratio. It is important to remember that
        changing the CPU overcommit ratio can have a detrimental effect and
        potentially increase noisy neighbor activity. An additional risk of
        increasing the overcommit ratio is that more instances fail when a
        compute host fails.</para>
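    <para>The trade-off can be sketched with simple arithmetic. The
        example below considers only vCPUs and ignores memory and disk
        limits, which also constrain placement in practice. The parameter
        <literal>cpu_allocation_ratio</literal> is named after the Compute
        configuration option; the host size, flavor size, and ratios are
        illustrative assumptions.</para>
    <programlisting language="python"><![CDATA[
```python
import math


def schedulable_vcpus(physical_cores: int, cpu_allocation_ratio: float) -> int:
    """vCPUs that can be scheduled on a host at a given overcommit ratio."""
    if physical_cores < 0 or cpu_allocation_ratio <= 0:
        raise ValueError("cores must be non-negative and ratio positive")
    return math.floor(physical_cores * cpu_allocation_ratio)


def instances_per_host(physical_cores: int,
                       cpu_allocation_ratio: float,
                       vcpus_per_instance: int) -> int:
    """Instances of one flavor that fit on a host, counting vCPUs only.

    This is also the number of instances affected if that host fails.
    """
    if vcpus_per_instance < 1:
        raise ValueError("vcpus_per_instance must be at least 1")
    return schedulable_vcpus(physical_cores, cpu_allocation_ratio) // vcpus_per_instance


if __name__ == "__main__":
    # Illustrative host: 32 physical cores, running a 2 vCPU flavor.
    for ratio in (1.0, 4.0, 16.0):
        count = instances_per_host(32, ratio, 2)
        print(f"ratio {ratio:>4}: up to {count} instances on, and lost with, one host")
```
]]></programlisting>
    <para>For this illustrative host, raising the ratio from 1.0 to 16.0
        raises the instance count from 16 to 256, and the number of
        instances affected by a single host failure rises with it.</para>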
    <para>Compute host components can also be upgraded to account for
        increases in demand; this is known as vertical scaling.
        Upgrading CPUs with more cores, or increasing the overall
        server memory, can add extra needed capacity depending on
        whether the running applications are more CPU intensive or
        memory intensive.</para>
    <para>Insufficient disk capacity could also have a negative effect
        on overall performance, including CPU and memory usage.
        Depending on the back-end architecture of the OpenStack Block
        Storage layer, adding capacity can mean adding disk shelves to
        enterprise storage systems or installing additional block
        storage nodes. Upgrading directly attached storage installed in
        compute hosts, and adding capacity to the shared storage to provide
        additional ephemeral storage to instances, may also be necessary.</para>
    <para>
      For a deeper discussion on many of these topics, refer to the
      <link
      xlink:href="http://docs.openstack.org/ops"><citetitle>OpenStack
      Operations Guide</citetitle></link>.
    </para>
    </section>
</section>