Operational considerations

Operational considerations Operationally, there are a number of considerations that affect the design of compute-focused OpenStack clouds. Some examples might include enforcing strict API availability requirements, understanding and dealing with failure scenarios, or managing host maintenance schedules. Service-level agreements (SLAs) are a contractual obligation that gives assurances around the availability of a provided service. As such, factoring in promises of availability implies a certain level of redundancy and resiliency when designing an OpenStack cloud. Guarantees for API availability imply multiple infrastructure services combined with appropriately high available load balancers. Network uptime guarantees will affect the switch design and might require redundant switching and power. Network security policy requirements need to be factored in to deployments. Knowing when and where to implement redundancy and high availability (HA) is directly affected by the terms contained in any associated SLA, if one is present.

Support and maintainability OpenStack cloud management requires operations staff to be able to understand and comprehend design architecture content on some level. The level of skills and the level of separation of the operations and engineering staff is dependent on the size and purpose of the installation. A large cloud service provider or a telecom provider is more inclined to be managed by a specially trained dedicated operations organization. A smaller implementation is more inclined to rely on a smaller support staff that might need to take on the combined engineering, design and operations functions. Maintaining OpenStack installations require a variety of technical skills. Some of these skills may include the ability to debug Python log output to a basic level as well as an understanding of networking concepts. Consider incorporating features into the architecture and design that reduce the operational burden. Some examples include automating some of the operations functions, or alternatively exploring the possibility of using a third party management company with special expertise in managing OpenStack deployments.

Monitoring Like any other infrastructure deployment, OpenStack clouds need an appropriate monitoring platform to ensure errors are caught and managed appropriately. Consider leveraging any existing monitoring system to see if it will be able to effectively monitor an OpenStack environment. While there are many aspects that need to be monitored, specific metrics that are critically important to capture include image disk utilization, or response time to the Compute API.

Expected and unexpected server downtime At some point, servers will fail. The SLAs in place affect how the design has to address recovery time. Recovery of a failed host may mean restoring instances from a snapshot, or respawning that instance on another available host, which then has consequences on the overall application design running on the OpenStack cloud. It might be acceptable to design a compute-focused cloud without the ability to migrate instances from one host to another, because the expectation is that the application developer must handle failure within the application itself. Conversely, a compute-focused cloud might be provisioned to provide extra resilience as a requirement of that business. In this scenario, it is expected that extra supporting services are also deployed, such as shared storage attached to hosts to aid in recovery and resiliency of services in order to meet strict SLAs.

Capacity planning Adding extra capacity to an OpenStack cloud is an easy horizontally scaling process, as consistently configured nodes automatically attach to an OpenStack cloud. Be mindful, however, of any additional work to place the nodes into appropriate Availability Zones and Host Aggregates if necessary. The same (or very similar) CPUs are recommended when adding extra nodes to the environment because it reduces the chance to break any live-migration features if they are present. Scaling out hypervisor hosts also has a direct effect on network and other data center resources, so factor in this increase when reaching rack capacity or when extra network switches are required. Compute hosts can also have internal components changed to account for increases in demand, a process also known as vertical scaling. Swapping a CPU for one with more cores, or increasing the memory in a server, can help add extra needed capacity depending on whether the running applications are more CPU intensive or memory based (as would be expected in a compute-focused OpenStack cloud). Another option is to assess the average workloads and increase the number of instances that can run within the compute environment by adjusting the overcommit ratio. While only appropriate in some environments, it's important to remember that changing the CPU overcommit ratio can have a detrimental effect and cause a potential increase in a noisy neighbor. The added risk of increasing the overcommit ratio is that more instances will fail when a compute host fails. In a compute-focused OpenStack design architecture, increasing the CPU overcommit ratio increases the potential for noisy neighbor issues and is not recommended.