Operational considerations

Many operational factors will affect general purpose cloud
design choices. In larger installations, it is not uncommon
for operations staff to be tasked with maintaining cloud
environments. This staff is typically distinct from the staff
responsible for building or designing the infrastructure. It
is important to include the operations function in the
planning and design phases of the build out.

Service Level Agreements (SLAs) are contractual obligations
that provide assurances for service availability. SLAs define
levels of availability that drive the technical design, often
with penalties for not meeting the contractual obligations.
The strictness of the SLA dictates the level of redundancy and
resiliency in the OpenStack cloud design. The expectations set
by the terms of the SLA directly determine when and where to
implement redundancy and high availability. Some of the SLA
terms that will affect the design include:

- Guarantees for API availability imply multiple
  infrastructure services combined with highly available
  load balancers.
- Network uptime guarantees will affect the switch
  design and might require redundant switching and
  power.
- Network security policy requirements need to be
  factored into deployments.

Support and maintainability

OpenStack cloud management requires operations staff to
understand the design architecture at some level. The level of
skill and the degree of separation between the operations and
engineering staff depend on the
size and purpose of the installation. A large cloud service
provider or a telecom provider is more likely to be managed by
a specially trained, dedicated operations organization. A
smaller implementation is more likely to rely on a smaller
support staff that might need to take on the combined
engineering, design and operations functions.

Furthermore, maintaining OpenStack installations requires a
variety of technical skills. Some of these skills may include
the ability to debug Python log output to a basic level and an
understanding of networking concepts.

Consider incorporating features into the architecture and
design that reduce the operations burden. This is accomplished
by automating some of the operations functions. In some cases
it may be beneficial to use a third-party management company
with special expertise in managing OpenStack
deployments.

Monitoring

Like any other infrastructure deployment, OpenStack clouds
need an appropriate monitoring platform to ensure that errors
are caught and managed appropriately. Consider evaluating any
existing monitoring system to see whether it can
effectively monitor an OpenStack environment. While there are
many aspects that need to be monitored, specific metrics that
are critically important to capture include image disk
utilization and response time to the Compute API.

Downtime

No matter how robust the architecture is, at some point
components will fail. Designing for high availability (HA) can
have significant cost ramifications; therefore, the resiliency
of the overall system and the individual components is
dictated by the requirements of the SLA. Downtime
planning includes creating processes and architectures that
support planned (maintenance) and unplanned (system faults)
downtime.

An example of an operational consideration is the recovery
of a failed compute host. This might mean requiring the
restoration of instances from a snapshot or respawning an
instance on another available compute host. This could have
consequences on the overall application design. A general
purpose cloud should not need to provide an ability to migrate
instances from one host to another. If the expectation is that
the application will not be designed to tolerate failure,
additional considerations need to be made around supporting
instance migration. In this scenario, extra supporting
services, including shared storage attached to compute hosts,
might need to be deployed.

Capacity planning

Capacity planning for future growth is a critically
important and often overlooked consideration. Capacity
constraints in a general purpose cloud environment include
compute and storage limits. There is a relationship between
the size of the compute environment and the
OpenStack infrastructure controller nodes required to support
it. As the size of the compute environment
increases, the network traffic and messages will increase,
which will add load to the controller or networking nodes.
While no hard-and-fast rule exists, effective monitoring of
the environment will help with capacity decisions on when to
scale the back-end infrastructure as part of the scaling of
the compute resources.

Adding extra compute capacity to an OpenStack cloud is a
horizontal scaling process, as consistently configured
compute nodes automatically attach to an OpenStack cloud. Be
mindful of any additional work that is needed to place the
nodes into appropriate availability zones and host aggregates.
Make sure to use identical or functionally compatible CPUs
when adding compute nodes to the environment;
otherwise, live migration features will break. Scaling out
compute hosts will directly affect network and other
datacenter resources, so it may be necessary to add rack
capacity or network switches.

Another option is to assess the average workloads and
increase the number of instances that can run within the
compute environment by adjusting the overcommit ratio. This is
only appropriate in some environments, and it's important to
remember that changing the CPU overcommit ratio can have a
detrimental effect and potentially increase noisy
neighbor effects. The added risk of increasing the overcommit
ratio is that
more instances will fail when a compute host fails.

Compute host components can also be upgraded to account for
increases in demand; this is known as vertical scaling.
Upgrading CPUs with more cores, or increasing the overall
server memory, can add extra needed capacity depending on
whether the running applications are more CPU intensive or
memory intensive.

Insufficient disk capacity could also have a negative effect
on overall performance, including CPU and memory usage.
Depending on the back-end architecture of the OpenStack Block
Storage layer, adding capacity might involve adding disk
shelves to enterprise storage systems or installing additional
block storage nodes. It may also be necessary to upgrade directly
attached storage installed in compute hosts or add capacity to
the shared storage to provide additional ephemeral storage to
instances.
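
The overcommit trade-off discussed in this section can be made
concrete with a little arithmetic. The following is a minimal
sketch, not an OpenStack API: the ComputeHost class, the
max_instances function, and all host and flavor sizes are
hypothetical values chosen for illustration. It assumes every
instance uses the same flavor and that only CPU and RAM limit
placement.

```python
from dataclasses import dataclass


@dataclass
class ComputeHost:
    """Hypothetical description of one compute host (illustration only)."""
    physical_cores: int
    ram_mb: int


def max_instances(host, flavor_vcpus, flavor_ram_mb, cpu_ratio, ram_ratio):
    """Return how many instances of a single flavor fit on one host.

    Schedulable vCPUs and RAM are the physical totals multiplied by the
    overcommit ratios; the tighter of the two limits decides the result.
    """
    by_cpu = int(host.physical_cores * cpu_ratio) // flavor_vcpus
    by_ram = int(host.ram_mb * ram_ratio) // flavor_ram_mb
    return min(by_cpu, by_ram)


# Assumed example host: 32 physical cores, 256 GiB of RAM.
host = ComputeHost(physical_cores=32, ram_mb=262144)

for cpu_ratio in (1.0, 4.0, 16.0):
    count = max_instances(
        host,
        flavor_vcpus=2,
        flavor_ram_mb=4096,
        cpu_ratio=cpu_ratio,
        ram_ratio=1.0,
    )
    # Every instance placed on the host is affected if that host fails.
    print(f"CPU overcommit {cpu_ratio}: up to {count} instances per host")
```

With these assumed numbers, raising the CPU ratio beyond 4.0 adds
no further capacity because RAM becomes the limiting resource,
while the number of instances exposed to a single host failure
has already grown fourfold.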

For a deeper discussion on many of these topics, refer to the
OpenStack Operations Guide.