Operational considerations

Many operational factors affect general purpose cloud design choices. In larger installations, the operations staff tasked with maintaining the cloud environment is often distinct from the staff responsible for designing and building the infrastructure. Include the operations function in the planning and design phases of the build-out.

Service Level Agreements (SLAs) are contractual obligations that provide assurances for service availability. SLAs define levels of availability that drive the technical design, often with penalties for not meeting the contractual obligations. The strictness of the SLA dictates the level of redundancy and resiliency in the OpenStack cloud design, and the expectations set by its terms determine when and where to implement redundancy and high availability. SLA terms that affect the design include:

- Guarantees for API availability, which imply multiple infrastructure services combined with highly available load balancers.
- Network uptime guarantees, which affect the switch design and might require redundant switching and power.
- Network security policy requirements, which need to be factored into deployments.
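To see how an availability target in an SLA translates into design pressure, it can help to convert it into a downtime budget and to estimate what redundancy buys. The following is a minimal sketch, not part of any OpenStack tooling: the function names are hypothetical, and the redundancy formula assumes that component failures are independent, which real deployments often violate (shared power, shared switches).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Yearly downtime budget, in hours, for a target such as 0.999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of independent replicas when any one replica suffices."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime per year, and two independent 99% API endpoints behind a load balancer reach about 99.99% only if the load balancer itself is not the weakest link in the chain.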

Support and maintainability

Managing an OpenStack cloud requires operations staff to understand the design architecture at some level. The required skills and the degree of separation between operations and engineering staff depend on the size and purpose of the installation. A large cloud service provider or a telecom provider is more likely to be managed by a specially trained, dedicated operations organization. A smaller implementation is more likely to rely on a small support staff that might need to take on the combined engineering, design, and operations functions.

Maintaining OpenStack installations requires a variety of technical skills, such as the ability to debug Python log output at a basic level and an understanding of networking concepts.

Consider incorporating features into the architecture and design that reduce the operations burden, for example by automating some operations functions. In some cases it may be beneficial to use a third-party management company with special expertise in managing OpenStack deployments.

Monitoring

Like any other infrastructure deployment, OpenStack clouds need an appropriate monitoring platform so that errors are caught and managed appropriately. Evaluate whether any existing monitoring system can effectively monitor an OpenStack environment. While many aspects need to be monitored, metrics that are critically important to capture include image disk utilization and response time to the Compute API.
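The two metrics named above can be sampled with very little code. The sketch below is illustrative only and uses the Python standard library. The function names, the default thresholds, and the idea of passing the API request in as a `probe` callable are all assumptions made for this example; an existing monitoring platform would normally take the place of this logic.

```python
import shutil
import time
from typing import Callable, List


def disk_utilization(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def response_time(probe: Callable[[], object]) -> float:
    """Seconds taken by one call to `probe`, for example an HTTP GET."""
    start = time.perf_counter()
    probe()
    return time.perf_counter() - start


def collect_alerts(
    image_dir: str,
    probe: Callable[[], object],
    max_disk: float = 0.85,      # hypothetical threshold
    max_latency: float = 2.0,    # hypothetical threshold, in seconds
) -> List[str]:
    """Return a message for each sampled metric that crosses its threshold."""
    alerts = []
    used = disk_utilization(image_dir)
    if used > max_disk:
        alerts.append(f"image disk at {used:.0%}")
    latency = response_time(probe)
    if latency > max_latency:
        alerts.append(f"Compute API took {latency:.2f}s")
    return alerts
```

In practice `probe` would wrap an authenticated request to the Compute API endpoint of the deployment, and `image_dir` would point at the directory that holds image data; both are site specific and left to the caller here.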

Downtime

No matter how robust the architecture is, components will fail at some point. Designing for high availability (HA) can have significant cost ramifications, so the requirements of the SLA dictate the resiliency of the overall system and of the individual components. Downtime planning includes creating processes and architectures that support planned (maintenance) and unplanned (system faults) downtime.

An example of an operational consideration is the recovery of a failed compute host. This might mean restoring instances from a snapshot or respawning instances on another available compute host, which could have consequences for the overall application design. A general purpose cloud should not need to provide the ability to migrate instances from one host to another. If applications are not expected to tolerate failure, however, additional consideration needs to be given to supporting instance migration. In this scenario, extra supporting services, including shared storage attached to compute hosts, might need to be deployed.
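Whether instances from a failed compute host can be respawned elsewhere depends on the headroom left on the surviving hosts. The following sketch checks this with a simple first-fit heuristic that places the largest instances first. It is a deliberately simplified illustration: it considers only RAM, it is not how the OpenStack Compute scheduler works, and the function name and data shapes are hypothetical.

```python
from typing import Dict, Optional


def plan_respawn(
    instance_ram_mb: Dict[str, int],
    free_ram_mb: Dict[str, int],
) -> Optional[Dict[str, str]]:
    """Map each displaced instance to a surviving host.

    Returns None when this heuristic cannot place every instance.
    """
    remaining = dict(free_ram_mb)  # copy so the caller's data is not mutated
    placement: Dict[str, str] = {}
    # Place the largest instances first; small ones fit into leftover gaps.
    by_size = sorted(instance_ram_mb.items(), key=lambda item: item[1], reverse=True)
    for name, ram in by_size:
        target = next(
            (host for host, free in remaining.items() if free >= ram), None
        )
        if target is None:
            return None
        remaining[target] -= ram
        placement[name] = target
    return placement
```

A `None` result signals that the environment, as modeled, lacks the spare capacity to absorb the failure, which is the kind of finding that feeds back into the SLA and capacity discussions.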

Capacity planning

Capacity planning for future growth is a critically important and often overlooked consideration. Capacity constraints in a general purpose cloud environment include compute and storage limits. There is a relationship between the size of the compute environment and the OpenStack infrastructure controller nodes required to support it. As the compute environment grows, network traffic and messages increase, which adds load to the controller or networking nodes. While no hard and fast rule exists, effective monitoring of the environment helps with decisions on when to scale the back-end infrastructure as part of scaling the compute resources.

Adding extra compute capacity to an OpenStack cloud is a horizontal scaling process, because consistently configured compute nodes automatically attach to an OpenStack cloud. Be mindful of any additional work needed to place the nodes into appropriate availability zones and host aggregates. Use identical or functionally compatible CPUs when adding compute nodes to the environment; otherwise, live migration features will break. Scaling out compute hosts directly affects network and other datacenter resources, so it may be necessary to add rack capacity or network switches.

Another option is to assess the average workloads and increase the number of instances that can run within the compute environment by adjusting the overcommit ratio. This is appropriate only in some environments: changing the CPU overcommit ratio can have a detrimental effect and can increase noisy neighbor activity. Increasing the overcommit ratio also carries the added risk that more instances fail when a compute host fails.

Compute host components can also be upgraded to account for increases in demand; this is known as vertical scaling. Upgrading CPUs with more cores, or increasing the overall server memory, can add needed capacity depending on whether the running applications are more CPU intensive or memory intensive.

Insufficient disk capacity could also have a negative effect on overall performance, including CPU and memory usage. Depending on the back-end architecture of the OpenStack Block Storage layer, adding capacity might mean adding disk shelves to enterprise storage systems or installing additional block storage nodes. It may also be necessary to upgrade directly attached storage installed in compute hosts or add capacity to the shared storage to provide additional ephemeral storage to instances.

For a deeper discussion on many of these topics, refer to the OpenStack Operations Guide.
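The trade-off behind the CPU overcommit ratio discussed above can be made concrete with some simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, the calculation considers vCPUs alone and ignores memory and disk, and the ratios used in the usage note are example values rather than recommendations.

```python
import math


def instances_per_host(
    physical_cores: int, cpu_ratio: float, vcpus_per_instance: int
) -> int:
    """Instances of one flavor that fit on a host, by vCPU count alone."""
    schedulable_vcpus = physical_cores * cpu_ratio
    return math.floor(schedulable_vcpus / vcpus_per_instance)


def hosts_needed(total_instances: int, per_host: int, spare_hosts: int = 0) -> int:
    """Hosts required for the workload, plus optional spares for recovery."""
    return math.ceil(total_instances / per_host) + spare_hosts
```

For a host with 32 physical cores and 2-vCPU instances, a ratio of 4.0 allows 64 instances per host, so 1000 instances need 16 hosts. Raising the ratio to 8.0 allows 128 instances per host and cuts the requirement to 8 hosts, but each host failure then takes down up to twice as many instances.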