=====================
Operator requirements
=====================

.. toctree::
   :maxdepth: 2

Introduction
~~~~~~~~~~~~
Several operational factors affect the design choices for a general
purpose cloud. Operations staff receive tasks regarding the maintenance
of cloud environments, including:

Maintenance tasks
    Operating system patching, hardware/firmware upgrades, and
    datacenter-related changes, as well as minor and release upgrades to
    OpenStack components, are all ongoing operational tasks. In particular,
    the six-monthly release cycle of the OpenStack projects needs to be
    considered as part of the cost of ongoing maintenance. The solution
    should take into account storage and network maintenance and the impact
    on underlying workloads.

Reliability and availability
    Reliability and availability depend on the availability of the many
    supporting components and on the level of precautions taken by the
    service provider. This includes networks, storage systems, datacenters,
    and operating systems.

In order to run efficiently, automate as many of the operational processes as
possible. Automation includes the configuration of provisioning, monitoring,
and alerting systems. Part of the automation process includes the capability
to determine when human intervention is required and who should act. The
objective is to increase the ratio of running systems to operational staff as
much as possible in order to reduce maintenance costs. In a massively scaled
environment, it is very difficult for staff to give each system individual
care.
Configuration management tools such as Ansible, Puppet, and Chef enable
operations staff to categorize systems into groups based on their roles and
thus create configurations and system states that the provisioning system
enforces. Systems that fall out of the defined state due to errors or failures
are quickly removed from the pool of active nodes and replaced.
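
As a minimal sketch of this desired-state approach, the following Python
example compares a hypothetical role-based desired state with what each host
reports and flags hosts that should be removed from the active pool; the
roles, hosts, and service names are illustrative and not drawn from any
particular tool.

.. code-block:: python

   # Hypothetical illustration of desired-state enforcement: hosts whose
   # reported state no longer matches the state defined for their role are
   # flagged for removal from the active pool and re-provisioning.

   DESIRED_STATE = {
       "compute": {"nova-compute": "running", "openvswitch": "running"},
       "controller": {"nova-api": "running", "rabbitmq": "running"},
   }

   # In practice this inventory would come from the provisioning or
   # monitoring system rather than being hard-coded.
   REPORTED = {
       "compute-01": ("compute", {"nova-compute": "running", "openvswitch": "running"}),
       "compute-02": ("compute", {"nova-compute": "dead", "openvswitch": "running"}),
   }


   def hosts_to_replace(reported):
       """Return hosts whose state differs from the desired state for their role."""
       return [
           host
           for host, (role, state) in reported.items()
           if state != DESIRED_STATE[role]
       ]


   print(hosts_to_replace(REPORTED))  # ['compute-02']
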
At large scale, the resource cost of diagnosing failed individual systems is
far greater than the cost of replacement. It is more economical to replace the
failed system with a new system, provisioning and configuring it automatically
and adding it to the pool of active nodes. By automating tasks that are
labor-intensive, repetitive, and critical to operations, cloud operations
teams can work more efficiently because fewer resources are required for these
common tasks. Administrators are then free to tackle tasks that are not easy
to automate and that have longer-term impacts on the business, for example,
capacity planning.

SLA Considerations
~~~~~~~~~~~~~~~~~~

Service-level agreements (SLAs) are contractual obligations that provide
assurances for service availability. They define the levels of availability
that drive the technical design, often with penalties for not meeting the
contractual obligations. When designing an OpenStack cloud, factoring in
these promises of availability implies a certain level of redundancy and
resiliency, and the expectations set by the SLA directly affect when and
where you should implement redundancy and high availability.

SLA terms that affect design include:

* API availability guarantees implying multiple infrastructure services
  and highly available load balancers.
* Network uptime guarantees affecting switch design, which might
  require redundant switching and power.
* Networking security policy requirements that need to be factored into
  deployments.

In any environment larger than just a few hosts, there are two separate areas
that might be subject to an SLA. Firstly, the services that provide the
actual virtualization, networking, and storage are what customers of the
environment most need to be continuously available. This is often referred to
as the 'Data Plane'. Secondly, there are the ancillary services such as API
endpoints and the various services that control CRUD operations. These are
often referred to as the 'Control Plane'. The services in this category are
usually subject to a different SLA expectation and therefore may be better
suited to separate hardware, or at least separate containers, from the Data
Plane services.

To effectively run cloud installations, initial downtime planning
includes creating processes and architectures that support the
following:

* Planned (maintenance)
* Unplanned (system faults)

As part of the SLA negotiation, it is important to determine which party is
responsible for monitoring and restarting Compute service instances if an
outage shuts them down.

The resiliency of the overall system, and of individual components, is
dictated by the requirements of the SLA, meaning that designing for
:term:`high availability (HA)` can have cost ramifications.

Upgrading, patching, and changing configuration items may require downtime
for some services. In these cases, stopping services that form the Control
Plane may leave the Data Plane unaffected, while actions such as live
migration of Compute instances may be required in order to perform
maintenance that requires downtime of Data Plane components while still
meeting SLA expectations.
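
As a rough sketch of the live migration approach, assuming the openstacksdk,
admin credentials in a ``clouds.yaml`` entry named ``mycloud``, and a
hypothetical host ``compute-01`` that is about to undergo maintenance,
draining instances might look like this (in practice the ``nova-compute``
service on the host would first be disabled so that no new instances land
there):

.. code-block:: python

   import openstack

   # Hypothetical values: a clouds.yaml entry named "mycloud" and a host to drain.
   conn = openstack.connect(cloud="mycloud")
   HOST = "compute-01"

   # Live-migrate every instance off the host so Data Plane workloads keep
   # running while the host is taken down for maintenance.
   for server in conn.compute.servers(all_projects=True, host=HOST):
       # block_migration handling varies by microversion; "auto" is a common choice.
       conn.compute.live_migrate_server(server, block_migration="auto")
       print("requested live migration of", server.name)
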

Note that there are many services outside the realm of pure OpenStack code
which affect the ability of any given design to meet the SLA, including:

* Database services, such as ``MySQL`` or ``PostgreSQL``.
* Services providing RPC, such as ``RabbitMQ``.
* External network attachments.
* Physical constraints such as power, rack space, network cabling, etc.
* Shared storage including SAN based arrays, storage clusters such as
  ``Ceph``, and/or NFS services.

Depending on the design, some Network service functions may fall into both
the Control and Data Plane categories. For example, the neutron L3 agent
service may be considered a Control Plane component, but the routers
themselves are Data Plane.

A given set of requirements may dictate an SLA under which some services need
to be highly available and others do not.
In a design with multiple regions, the SLA would also need to take into
consideration the use of shared services such as the Identity service,
Dashboard, and so on.

Any SLA negotiation must also take into account the reliance on third parties
for critical aspects of the design. For example, if there is an existing SLA
on a component such as a storage system, the cloud SLA must take this
limitation into account. If the required SLA for the cloud exceeds the agreed
uptime levels of the components comprising that cloud, additional redundancy
would be required. This consideration is critical to review in a hybrid cloud
design, where multiple third parties are involved.
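
As a simple worked example of this constraint, the following sketch
multiplies the agreed availabilities of serially dependent components (the
figures are purely illustrative) to estimate the best availability the cloud
itself could promise without additional redundancy:

.. code-block:: python

   # Illustrative component availabilities from third-party or internal SLAs.
   component_availability = {
       "power and datacenter": 0.9998,
       "external network": 0.999,
       "shared storage": 0.9995,
   }

   cloud_target = 0.999  # the SLA the cloud would like to offer

   # If every component is required (serial dependency), the combined
   # availability is the product of the individual availabilities.
   combined = 1.0
   for value in component_availability.values():
       combined *= value

   print(f"combined availability: {combined:.4%}")
   if combined < cloud_target:
       print("target cannot be met without additional redundancy")
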

Logging and Monitoring
~~~~~~~~~~~~~~~~~~~~~~
OpenStack clouds require appropriate monitoring platforms to catch and
manage errors.

.. note::

   We recommend leveraging existing monitoring systems to see if they
   are able to effectively monitor an OpenStack environment.

Specific meters that are critically important to capture include:

* Image disk utilization
* Response time to the Compute API (see the example below)
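
One minimal way to sample the Compute API response time, for example, is to
time an authenticated API call; the sketch below assumes the openstacksdk and
a ``clouds.yaml`` entry named ``mycloud``, both of which are illustrative,
and a real deployment would feed the result into its monitoring system rather
than print it:

.. code-block:: python

   import time

   import openstack

   conn = openstack.connect(cloud="mycloud")  # hypothetical clouds.yaml entry

   # Time a lightweight Compute API call as a rough response-time meter.
   start = time.perf_counter()
   list(conn.compute.flavors())
   elapsed = time.perf_counter() - start

   print(f"Compute API flavor list took {elapsed:.3f}s")
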
Logging and monitoring does not significantly differ for a multi-site OpenStack
cloud. The tools described in the `Logging and monitoring chapter
<http://docs.openstack.org/openstack-ops/content/logging_monitoring.html>`__ of
the Operations Guide remain applicable. Logging and monitoring can be provided
on a per-site basis, and in a common centralized location.
When attempting to deploy logging and monitoring facilities to a centralized
location, care must be taken with the load placed on the inter-site networking
links.

Network
~~~~~~~
The network design for an OpenStack cluster includes decisions regarding
the interconnect needs within the cluster, plus the need to allow clients to
access their resources, and for operators to access the cluster for
maintenance. The bandwidth, latency, and reliability of these networks need
consideration.
Make additional design decisions about monitoring and alarming. This can
be an internal responsibility or the responsibility of the external
provider. In the case of using an external provider, service level
agreements (SLAs) likely apply. In addition, other operational
considerations such as bandwidth, latency, and jitter can be part of an
SLA.
Consider the ability to upgrade the infrastructure. As demand for
network resources increases, operators add additional IP address blocks
and additional bandwidth capacity.
hardware and software life cycle events, for example upgrades,
decommissioning, and outages, while avoiding service interruptions for
tenants.
Factor maintainability into the overall network design. This includes
the ability to manage and maintain IP addresses as well as the use of
overlay identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS
tags. For example, if you need to change all of the IP addresses on a
network, a process known as renumbering, then the design must support this
function.
Address network-focused applications when considering certain
operational realities. For example, consider the impending exhaustion of
IPv4 addresses, the migration to IPv6, and the use of private networks
to segregate different types of traffic that an application receives or
generates. In the case of IPv4 to IPv6 migrations, applications should
follow best practices for storing IP addresses. We recommend you avoid
relying on IPv4 features that did not carry over to the IPv6 protocol or
have differences in implementation.
To segregate traffic, allow applications to create a private tenant
network for database and storage network traffic. Use a public network
for services that require direct client access from the internet. Upon
segregating the traffic, consider quality of service (QoS) and security
to ensure each network has the required level of service.
Finally, consider the routing of network traffic. For some applications,
develop a complex policy framework for routing. To create a routing
policy that satisfies business requirements, consider the economic cost
of transmitting traffic over expensive links versus cheaper links, in
addition to bandwidth, latency, and jitter requirements.
Additionally, consider how to respond to network events. As an example,
how load transfers from one link to another during a failure scenario
could be a factor in the design. If you do not plan network capacity
correctly, failover traffic could overwhelm other ports or network links
and create a cascading failure scenario. In this case, traffic that
fails over to one link overwhelms that link and then moves to the
subsequent links until all network traffic stops.
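
A back-of-the-envelope check of this failover scenario can be expressed as
follows; the link capacities and load figures are purely illustrative:

.. code-block:: python

   # Illustrative inter-switch links: (capacity, current load) in Gbit/s.
   links = {"uplink-a": (40, 26), "uplink-b": (40, 26)}

   total_load = sum(load for _, load in links.values())

   # If any single link fails, the remaining links must absorb the full load,
   # otherwise traffic cascades from one saturated link to the next.
   for failed in links:
       remaining_capacity = sum(
           capacity for name, (capacity, _) in links.items() if name != failed
       )
       ok = remaining_capacity >= total_load
       print(f"loss of {failed}: {remaining_capacity} Gbit/s left "
             f"for {total_load} Gbit/s of traffic -> {'ok' if ok else 'overload'}")
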

Licensing
~~~~~~~~~
The many different forms of license agreements for software are often written
with the use of dedicated hardware in mind. This model is relevant for the
cloud platform itself, including the hypervisor operating system, supporting
software for items such as database, RPC, backup, and so on. Consideration
must be given when offering Compute service instances and applications to end
users of the cloud, since the license terms for that software may need some
adjustment to be able to operate economically in the cloud.
Multi-site OpenStack deployments present additional licensing
considerations over and above regular OpenStack clouds, particularly
where site licenses are in use to provide cost efficient access to
software licenses. The licensing for host operating systems, guest
operating systems, OpenStack distributions (if applicable),
software-defined infrastructure including network controllers and
storage systems, and even individual applications need to be evaluated.
Topics to consider include:

* The definition of what constitutes a site in the relevant licenses,
  as the term does not necessarily denote a geographic or otherwise
  physically isolated location.
* Differentiations between "hot" (active) and "cold" (inactive) sites,
  where significant savings may be made in situations where one site is
  a cold standby for disaster recovery purposes only.
* Certain locations might require local vendors to provide support and
  services for each site, which may vary with the licensing agreement in
  place.

Support and maintainability
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To be able to support and maintain an installation, OpenStack cloud
management requires operations staff to understand the design architecture
content. The skill level of the operations and engineering staff, and their
level of separation, depend on the size and purpose of the installation.
Large cloud service providers, or telecom providers, are more likely to be
managed by specially trained, dedicated operations organizations. Smaller
implementations are more likely to rely on support staff that need to take
on combined engineering, design, and operations functions.

Maintaining OpenStack installations requires a variety of technical
skills. You may want to consider using a third-party management company
with special expertise in managing OpenStack deployments.

Operator access to systems
~~~~~~~~~~~~~~~~~~~~~~~~~~

As more and more applications are migrated into a cloud-based environment, we
reach a position where systems that are critical for cloud operations are
hosted within the cloud being operated. Consideration must be given to how
operators can access the systems and tools required to resolve a major
incident.
If a significant portion of the cloud is on externally managed systems,
prepare for situations where it may not be possible to make changes.
Additionally, providers may differ on how infrastructure must be managed and
exposed. This can lead to delays in root cause analysis where each insists the
blame lies with the other provider.
Ensure that the network structure connects all clouds to form an integrated
system, keeping in mind the state of handoffs. These handoffs must both be as
reliable as possible and include as little latency as possible to ensure the
best performance of the overall system.

Capacity planning
~~~~~~~~~~~~~~~~~
An important consideration in running a cloud over time is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization rates of all resources, the steady
adoption of the cloud inside an organization or by consumers in a public
offering also creates a steady trend of increased utilization.

Capacity constraints for a general purpose cloud environment include:

* Compute limits
* Storage limits

A relationship exists between the size of the compute environment and the
number of supporting OpenStack infrastructure controller nodes required to
support it.
Increasing the size of the supporting compute environment increases the
network traffic and messages, adding load to the controller or
networking nodes. Effective monitoring of the environment will help with
capacity decisions on scaling.
Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding
additional compute nodes to environments, ensure identical or functionally
compatible CPUs are used, otherwise live migration features will break.
It is necessary to add rack capacity or network switches as scaling out
compute hosts directly affects network and datacenter resources.
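
One way to spot mixed CPU models before relying on live migration is to
compare the hypervisor inventory, as in the sketch below; it assumes the
openstacksdk, an admin ``clouds.yaml`` entry named ``mycloud``, and that
``cpu_info`` is reported in a JSON-compatible form, which varies between
releases:

.. code-block:: python

   import json

   import openstack

   conn = openstack.connect(cloud="mycloud")  # hypothetical admin cloud entry

   models = set()
   for hypervisor in conn.compute.hypervisors(details=True):
       info = hypervisor.cpu_info
       if isinstance(info, str):  # some releases return cpu_info as a JSON string
           info = json.loads(info)
       models.add(info.get("model"))

   if len(models) > 1:
       print("Mixed CPU models detected, live migration may break:", models)
   else:
       print("All hypervisors report the same CPU model:", models)
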
Compute host components can also be upgraded to account for increases in
demand; this is known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.
Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.

.. note::

   It is important to remember that changing the CPU overcommit ratio can
   have a detrimental effect and cause a potential increase in noisy
   neighbor issues.
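
The effect of the overcommit ratio on capacity is straightforward arithmetic,
sketched below with illustrative numbers:

.. code-block:: python

   physical_cores = 32   # cores per compute node (illustrative)
   flavor_vcpus = 4      # vCPUs per instance for the average workload

   for cpu_allocation_ratio in (8.0, 16.0):
       # Schedulable vCPUs scale linearly with the overcommit ratio.
       schedulable_vcpus = physical_cores * cpu_allocation_ratio
       instances = int(schedulable_vcpus // flavor_vcpus)
       print(f"ratio {cpu_allocation_ratio}: {instances} instances per node")
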
Insufficient disk capacity could also have a negative effect on overall
performance including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, adding capacity may
involve adding disk shelves to enterprise storage systems or installing
additional block storage nodes. Upgrading directly attached storage
installed in compute hosts, and adding capacity to the shared storage
for additional ephemeral storage to instances, may be necessary.
For a deeper discussion on many of these topics, refer to the `OpenStack
Operations Guide <http://docs.openstack.org/ops>`_.

Quota management
~~~~~~~~~~~~~~~~
Quotas are used to set operational limits to prevent system capacities
from being exhausted without notification. They are currently enforced
at the tenant (or project) level rather than at the user level.
Quotas are defined on a per-region basis. Operators can define identical
quotas for tenants in each region of the cloud to provide a consistent
experience, or even create a process for synchronizing allocated quotas
across regions. It is important to note that only the operational limits
imposed by the quotas will be aligned; consumption of quotas by users will
not be reflected between regions.
For example, given a cloud with two regions, if the operator grants a
user a quota of 25 instances in each region then that user may launch a
total of 50 instances spread across both regions. They may not, however,
launch more than 25 instances in any single region.
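
A sketch of applying identical quotas in each region with the openstacksdk
might look like the following; the cloud name, region names, project name,
and quota values are all hypothetical, and ``set_compute_quotas`` is the SDK
cloud-layer call, so adjust it to whatever quota interface your tooling uses:

.. code-block:: python

   import openstack

   REGIONS = ["RegionOne", "RegionTwo"]   # hypothetical region names
   PROJECT = "demo"                       # hypothetical project name

   for region in REGIONS:
       conn = openstack.connect(cloud="mycloud", region_name=region)
       # Apply the same operational limits in every region; note that usage
       # is still tracked independently per region.
       conn.set_compute_quotas(PROJECT, instances=25, cores=100, ram=204800)
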
For more information on managing quotas refer to the `Managing projects
and users
chapter <http://docs.openstack.org/openstack-ops/content/projects_users.html>`__
of the OpenStack Operators Guide.

Policy management
~~~~~~~~~~~~~~~~~
OpenStack provides a default set of Role Based Access Control (RBAC)
policies, defined in a ``policy.json`` file, for each service. Operators
edit these files to customize the policies for their OpenStack
installation. If the application of consistent RBAC policies across
sites is a requirement, then it is necessary to ensure proper
synchronization of the ``policy.json`` files to all installations.
This must be done using system administration tools such as rsync, as
functionality for synchronizing policies across regions is not currently
provided within OpenStack.
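
For example, a small wrapper around rsync, run from wherever the canonical
``policy.json`` files are maintained, could push them to each installation;
the host names and file path below are hypothetical:

.. code-block:: python

   import subprocess

   CONTROLLERS = ["ctl01.site-a.example.com", "ctl01.site-b.example.com"]  # hypothetical
   POLICY_FILE = "/etc/nova/policy.json"

   for host in CONTROLLERS:
       # -c compares files by checksum so unchanged policies are not re-copied.
       subprocess.run(
           ["rsync", "-c", POLICY_FILE, f"root@{host}:{POLICY_FILE}"],
           check=True,
       )
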

Selecting Hardware
~~~~~~~~~~~~~~~~~~

Integration with external IDP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Upgrades
~~~~~~~~
Running OpenStack with a focus on availability requires striking a balance
between stability and features. For example, it might be tempting to run an
older stable release branch of OpenStack to make deployments easier. However,
known issues that may be of some concern or only have minimal impact in smaller
deployments could become pain points as scale increases. Recent releases may
address well known issues. The OpenStack community can help resolve reported
issues by applying the collective expertise of the OpenStack developers.
In multi-site OpenStack clouds deployed using regions, sites are
independent OpenStack installations which are linked together using
shared centralized services such as OpenStack Identity. At a high level
the recommended order of operations to upgrade an individual OpenStack
environment is (see the `Upgrades
chapter <http://docs.openstack.org/openstack-ops/content/ops_upgrades-general-steps.html>`__
of the Operations Guide for details):

#. Upgrade the OpenStack Identity service (keystone).
#. Upgrade the OpenStack Image service (glance).
#. Upgrade OpenStack Compute (nova), including networking components.
#. Upgrade OpenStack Block Storage (cinder).
#. Upgrade the OpenStack dashboard (horizon).

The process for upgrading a multi-site environment is not significantly
different:

#. Upgrade the shared OpenStack Identity service (keystone) deployment.
#. Upgrade the OpenStack Image service (glance) at each site.
#. Upgrade OpenStack Compute (nova), including networking components, at
   each site.
#. Upgrade OpenStack Block Storage (cinder) at each site.
#. Upgrade the OpenStack dashboard (horizon), at each site or in the
   single central location if it is shared.
Compute upgrades within each site can also be performed in a rolling
fashion. Compute controller services (API, Scheduler, and Conductor) can
be upgraded prior to upgrading individual compute nodes. This allows
operations staff to keep a site operational for users of Compute
services while performing an upgrade.

The bleeding edge
-----------------
The number of organizations running at massive scales is a small proportion of
the OpenStack community, therefore it is important to share related issues
with the community and be a vocal advocate for resolving them. Some issues
only manifest when operating at large scale, and the number of organizations
able to duplicate and validate an issue is small, so it is important to
document and dedicate resources to their resolution.
In some cases, the resolution to the problem is ultimately to deploy a more
recent version of OpenStack. Alternatively, when you must resolve an issue in
a production environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific underlying
components in order to resolve issues or gain significant performance
improvements. Although this may appear to expose the deployment to increased
risk and instability, in many cases the issue being addressed is simply one
that has not yet been discovered elsewhere.
We recommend building a development and operations organization that is
responsible for creating desired features, diagnosing and resolving issues,
and building the infrastructure for large scale continuous integration tests
and continuous deployment. This helps catch bugs early and makes deployments
faster and easier. In addition to development resources, we also recommend the
recruitment of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.

Skills and training
~~~~~~~~~~~~~~~~~~~

Projecting growth for storage, networking, and compute is only one aspect of a
growth plan for running OpenStack at massive scale. Growing and nurturing
development and operational staff is an additional consideration. Sending team
members to OpenStack conferences and meetup events, and encouraging active
participation in the mailing lists and committees, is a very important way to
maintain skills and forge relationships in the community. For a list of
OpenStack training providers in the marketplace, see:
http://www.openstack.org/marketplace/training/.