Operational considerations
A number of operational considerations affect the design of
compute-focused OpenStack clouds. Examples include enforcing strict API
availability requirements, understanding and dealing with failure
scenarios, and managing host maintenance schedules.
Service-level agreements (SLAs) are contractual obligations that
give assurances around the availability of a provided service. As such,
factoring in promises of availability implies a certain level of
redundancy and resiliency when designing an OpenStack cloud.
Guarantees for API availability imply multiple infrastructure
services combined with appropriately highly available load
balancers.
Network uptime guarantees will affect the switch design and might
require redundant switching and power.
Network security policy requirements need to be factored into
deployments.
The terms contained in any associated SLA, if one is present, directly
determine when and where to implement redundancy and high availability
(HA).
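As a rough illustration of how an SLA figure translates into design
requirements, the following Python sketch converts an availability
target into a downtime budget and estimates the combined availability
of redundant components. The function names and sample figures are
illustrative assumptions, and the redundancy formula assumes that
components fail independently, which real deployments rarely achieve
exactly.

```python
def allowed_downtime_minutes(availability, period_hours=30 * 24):
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours * 60


def redundant_availability(single, copies):
    """Availability of N redundant copies, assuming independent failures."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Sample targets chosen for illustration only.
    for target in (0.99, 0.999, 0.9999):
        budget = allowed_downtime_minutes(target)
        print(f"{target:.2%} -> {budget:.1f} minutes per 30 days")
    print(f"two 99% endpoints: {redundant_availability(0.99, 2):.4%}")
```

A tighter target shrinks the downtime budget quickly, which is why
stricter SLAs tend to push a design toward redundant API services,
load balancers, switching, and power.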
Support and maintainability
OpenStack cloud management requires operations staff to understand
the design architecture at some level. The skills required, and the
degree of separation between operations and engineering staff,
depend on the size and purpose of the installation. A large cloud
service provider or telecom provider is more likely to be managed
by a specially trained, dedicated operations organization. A
smaller implementation is more likely to rely on a smaller
support staff that might need to take on combined engineering,
design, and operations functions.
Maintaining OpenStack installations requires a variety of
technical skills. These may include the ability to debug Python
log output at a basic level and an understanding of networking
concepts.
Consider incorporating features into the architecture and
design that reduce the operational burden. Examples include
automating some of the operations functions, or engaging a
third-party management company with special expertise in
managing OpenStack deployments.
Monitoring
Like any other infrastructure deployment, OpenStack clouds
need an appropriate monitoring platform to ensure errors are
caught and managed appropriately. Consider whether an existing
monitoring system can effectively monitor an OpenStack
environment. While many aspects need to be monitored, metrics
that are critically important to capture include image disk
utilization and response time to the Compute API.
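The two metrics named above can be sampled with very little code. The
following Python sketch is a minimal illustration using only the
standard library, not a replacement for a monitoring platform. The
function names, thresholds, and the idea of passing in a `probe`
callable are assumptions made for this example. In practice the probe
would issue an authenticated request to the Compute API endpoint, and
the path would be whichever directory holds image data in a given
deployment.

```python
import shutil
import statistics
import time


def disk_utilization(path):
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def time_probe(probe, samples=5):
    """Call `probe` several times; return (mean, worst) latency in seconds."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        probe()  # hypothetical callable, e.g. one Compute API request
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), max(timings)


def check(value, threshold, label):
    """Return an alert string when `value` exceeds `threshold`, else None."""
    if value > threshold:
        return f"ALERT: {label} is {value:.2f}, above {threshold:.2f}"
    return None
```

Reporting both the mean and the worst latency is a deliberate choice
here, since an API that is fast on average can still breach a response
time target on its slowest requests.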
Expected and unexpected server downtime
At some point, servers will fail. The SLAs in place affect
how the design has to address recovery time. Recovery of a
failed host may mean restoring its instances from snapshots or
respawning them on another available host, which in turn has
consequences for the overall design of the applications running
on the OpenStack cloud.
It might be acceptable to design a compute-focused cloud
without the ability to migrate instances from one host to
another, because the expectation is that the application
developer must handle failure within the application itself.
Conversely, a compute-focused cloud might be provisioned to
provide extra resilience as a business requirement. In this
scenario, extra supporting services are expected to be deployed
as well, such as shared storage attached to hosts to aid in the
recovery and resiliency of services in order to meet strict
SLAs.
Capacity planning
Adding extra capacity to an OpenStack cloud is a straightforward
horizontal scaling process, as consistently configured nodes
automatically attach to an OpenStack cloud. Be mindful,
however, of any additional work needed to place the nodes into
appropriate Availability Zones and Host Aggregates. The same
(or very similar) CPUs are recommended when adding extra nodes
to the environment because this reduces the chance of breaking
any live-migration features that are present. Scaling out
hypervisor hosts also has a direct effect on network and other
data center resources, so factor in this increase when reaching
rack capacity or when extra network switches are required.
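One way to reason about the CPU recommendation above is to compare CPU
feature flags between existing hosts and a candidate node. The
following Python sketch is a simplification under stated assumptions:
it parses x86 Linux `/proc/cpuinfo`-style text, the host names and flag
sets are hypothetical, and the helper names are invented for this
example. The hypervisor's own compatibility checks remain the
authority on whether a particular live migration will succeed.

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(existing_hosts, new_host_flags):
    """Map each existing host to the flags it has that the new host lacks."""
    gaps = {}
    for name, flags in existing_hosts.items():
        diff = flags - new_host_flags
        if diff:
            gaps[name] = diff
    return gaps


if __name__ == "__main__":
    # Hypothetical host names and flag sets, for illustration only.
    existing = {
        "compute-01": {"sse4_2", "avx", "avx2"},
        "compute-02": {"sse4_2", "avx"},
    }
    candidate = parse_cpu_flags("model name : example\nflags : sse4_2 avx\n")
    print(missing_flags(existing, candidate))
```

An empty result suggests the candidate exposes every feature the
existing hosts do. Any reported gap marks a host whose instances might
depend on a feature the new node cannot provide.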
Compute hosts can also have internal components changed to
account for increases in demand, a process also known as
vertical scaling. Swapping a CPU for one with more cores, or
increasing the memory in a server, can help add extra needed
capacity depending on whether the running applications are
more CPU intensive or memory based (as would be expected in a
compute-focused OpenStack cloud).
Another option is to assess the average workloads and
increase the number of instances that can run within the
compute environment by adjusting the overcommit ratio. This is
appropriate only in some environments: raising the CPU
overcommit ratio can have a detrimental effect and increases
the likelihood of noisy neighbor activity. A further risk is
that more instances will fail when a compute host fails. In a
compute-focused OpenStack design architecture, increasing the
CPU overcommit ratio is therefore not recommended.
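The trade-off can be made concrete with some simple arithmetic. The
following Python sketch estimates how many instances fit on one host
for a given CPU overcommit ratio, which is also the number of
instances lost if that host fails. The function names and sample
figures are assumptions for illustration, and the calculation ignores
memory and disk limits. In Nova, the ratio itself is typically set
through the `cpu_allocation_ratio` configuration option.

```python
import math


def schedulable_vcpus(physical_cores, cpu_overcommit_ratio):
    """Return the number of vCPUs the scheduler may place on one host."""
    if physical_cores < 1 or cpu_overcommit_ratio <= 0:
        raise ValueError("cores and ratio must be positive")
    return math.floor(physical_cores * cpu_overcommit_ratio)


def instances_per_host(physical_cores, cpu_overcommit_ratio, vcpus_per_instance):
    """Return how many equally sized instances fit on one host (CPU only)."""
    if vcpus_per_instance < 1:
        raise ValueError("vcpus_per_instance must be at least 1")
    total = schedulable_vcpus(physical_cores, cpu_overcommit_ratio)
    return total // vcpus_per_instance


if __name__ == "__main__":
    # Sample host: 32 physical cores, 4-vCPU instances (illustrative only).
    for ratio in (1.0, 2.0, 4.0):
        count = instances_per_host(32, ratio, 4)
        print(f"ratio {ratio}: {count} instances share 32 cores")
```

With these sample figures, moving from a ratio of 1.0 to 4.0 raises
the per-host instance count from 8 to 32, so four times as many
instances compete for the same cores and are affected by a single host
failure.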