
Remove arch-design docs

The arch-design docs have not been maintained, and the Ops Docs SIG plans
to take ownership and maintain them in a repo of its own.

To avoid jobs overwriting the published content, this removes the docs
from openstack-manuals.

Depends-on: https://review.openstack.org/621012

Change-Id: I58acb6a5d25d8e0b02e5f3b068aebb4ec144bf1a
Signed-off-by: Sean McGinnis <sean.mcginnis@gmail.com>
changes/26/621026/1
Sean McGinnis 3 years ago
parent
commit
aee2cd3da8
  1. 27
      doc/arch-design/setup.cfg
  2. 30
      doc/arch-design/setup.py
  3. 13
      doc/arch-design/source/arch-requirements.rst
  4. 433
      doc/arch-design/source/arch-requirements/arch-requirements-enterprise.rst
  5. 182
      doc/arch-design/source/arch-requirements/arch-requirements-ha.rst
  6. 259
      doc/arch-design/source/arch-requirements/arch-requirements-operations.rst
  7. 1
      doc/arch-design/source/common
  8. 307
      doc/arch-design/source/conf.py
  9. 49
      doc/arch-design/source/design-cmp-tools.rst
  10. 20
      doc/arch-design/source/design-compute.rst
  11. 104
      doc/arch-design/source/design-compute/design-compute-arch.rst
  12. 85
      doc/arch-design/source/design-compute/design-compute-cpu.rst
  13. 165
      doc/arch-design/source/design-compute/design-compute-hardware.rst
  14. 46
      doc/arch-design/source/design-compute/design-compute-hypervisor.rst
  15. 105
      doc/arch-design/source/design-compute/design-compute-logging.rst
  16. 51
      doc/arch-design/source/design-compute/design-compute-networking.rst
  17. 48
      doc/arch-design/source/design-compute/design-compute-overcommit.rst
  18. 154
      doc/arch-design/source/design-compute/design-compute-storage.rst
  19. 413
      doc/arch-design/source/design-control-plane.rst
  20. 3
      doc/arch-design/source/design-identity.rst
  21. 3
      doc/arch-design/source/design-images.rst
  22. 31
      doc/arch-design/source/design-networking.rst
  23. 218
      doc/arch-design/source/design-networking/design-networking-concepts.rst
  24. 281
      doc/arch-design/source/design-networking/design-networking-design.rst
  25. 70
      doc/arch-design/source/design-networking/design-networking-services.rst
  26. 13
      doc/arch-design/source/design-storage.rst
  27. 546
      doc/arch-design/source/design-storage/design-storage-arch.rst
  28. 329
      doc/arch-design/source/design-storage/design-storage-concepts.rst
  29. 50
      doc/arch-design/source/design.rst
  30. BIN
      doc/arch-design/source/figures/Check_mark_23x20_02.png
  31. BIN
      doc/arch-design/source/figures/Compute_NSX.png
  32. BIN
      doc/arch-design/source/figures/Compute_Tech_Bin_Packing_CPU_optimized1.png
  33. BIN
      doc/arch-design/source/figures/Compute_Tech_Bin_Packing_General1.png
  34. BIN
      doc/arch-design/source/figures/ELKbasicArch.png
  35. 1
      doc/arch-design/source/figures/ELKbasicArch.svg
  36. BIN
      doc/arch-design/source/figures/General_Architecture3.png
  37. BIN
      doc/arch-design/source/figures/Generic_CERN_Architecture.png
  38. BIN
      doc/arch-design/source/figures/Generic_CERN_Example.png
  39. BIN
      doc/arch-design/source/figures/Massively_Scalable_Cells_regions_azs.png
  40. BIN
      doc/arch-design/source/figures/Multi-Cloud_Priv-AWS4.png
  41. BIN
      doc/arch-design/source/figures/Multi-Cloud_Priv-Pub3.png
  42. BIN
      doc/arch-design/source/figures/Multi-Cloud_failover2.png
  43. BIN
      doc/arch-design/source/figures/Multi-Site_Customer_Edge.png
  44. BIN
      doc/arch-design/source/figures/Multi-Site_shared_keystone1.png
  45. BIN
      doc/arch-design/source/figures/Multi-Site_shared_keystone_horizon_swift1.png
  46. BIN
      doc/arch-design/source/figures/Multi-site_Geo_Redundant_LB.png
  47. BIN
      doc/arch-design/source/figures/Network_Cloud_Storage2.png
  48. BIN
      doc/arch-design/source/figures/Network_Web_Services1.png
  49. BIN
      doc/arch-design/source/figures/Specialized_Hardware2.png
  50. BIN
      doc/arch-design/source/figures/Specialized_OOO.png
  51. BIN
      doc/arch-design/source/figures/Specialized_SDN_external.png
  52. BIN
      doc/arch-design/source/figures/Specialized_SDN_hosted.png
  53. BIN
      doc/arch-design/source/figures/Specialized_VDI1.png
  54. BIN
      doc/arch-design/source/figures/Storage_Database_+_Object5.png
  55. BIN
      doc/arch-design/source/figures/Storage_Hadoop3.png
  56. BIN
      doc/arch-design/source/figures/Storage_Object.png
  57. BIN
      doc/arch-design/source/figures/osog_0201.png
  58. 52
      doc/arch-design/source/index.rst
  59. 7985
      doc/arch-design/source/locale/tr_TR/LC_MESSAGES/arch-design.po
  60. 14
      doc/arch-design/source/use-cases.rst
  61. 14
      doc/arch-design/source/use-cases/use-case-development.rst
  62. 196
      doc/arch-design/source/use-cases/use-case-general-compute.rst
  63. 181
      doc/arch-design/source/use-cases/use-case-nfv.rst
  64. 210
      doc/arch-design/source/use-cases/use-case-storage.rst
  65. 14
      doc/arch-design/source/use-cases/use-case-web-scale.rst
  66. 5
      tools/build-all-rst.sh

27
doc/arch-design/setup.cfg

@@ -1,27 +0,0 @@
[metadata]
name = architecturedesignguide
summary = OpenStack Architecture Design Guide
author = OpenStack
author-email = openstack-dev@lists.openstack.org
home-page = https://docs.openstack.org/
classifier =
    Environment :: OpenStack
    Intended Audience :: Information Technology
    Intended Audience :: Cloud Architects
    License :: OSI Approved :: Apache Software License
    Operating System :: POSIX :: Linux
    Topic :: Documentation

[global]
setup-hooks =
    pbr.hooks.setup_hook

[files]

[build_sphinx]
warning-is-error = 1
build-dir = build
source-dir = source

[wheel]
universal = 1

30
doc/arch-design/setup.py

@@ -1,30 +0,0 @@
#!/usr/bin/env python
# Copyright (c) 2013 Hewlett-Packard Development Company, L.P.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# THIS FILE IS MANAGED BY THE GLOBAL REQUIREMENTS REPO - DO NOT EDIT
import setuptools
# In python < 2.7.4, a lazy loading of package `pbr` will break
# setuptools if some other modules registered functions in `atexit`.
# solution from: http://bugs.python.org/issue15881#msg170215
try:
    import multiprocessing  # noqa
except ImportError:
    pass

setuptools.setup(
    setup_requires=['pbr'],
    pbr=True)

13
doc/arch-design/source/arch-requirements.rst

@@ -1,13 +0,0 @@
=========================
Architecture requirements
=========================
This chapter describes the enterprise and operational factors that impact the
design of an OpenStack cloud.
.. toctree::
:maxdepth: 2
arch-requirements/arch-requirements-enterprise
arch-requirements/arch-requirements-operations
arch-requirements/arch-requirements-ha

433
doc/arch-design/source/arch-requirements/arch-requirements-enterprise.rst

@@ -1,433 +0,0 @@
=======================
Enterprise requirements
=======================
The following sections describe business, usage, and performance
considerations for customers that will impact cloud architecture design.
Cost
~~~~
Financial factors are a primary concern for any organization. Cost
considerations may influence the type of cloud that you build.
For example, a general purpose cloud is unlikely to be the most
cost-effective environment for specialized applications.
Unless business needs dictate that cost is a critical factor,
cost should not be the sole consideration when choosing or designing a cloud.
As a general guideline, increasing the complexity of a cloud architecture
increases the cost of building and maintaining it. For example, a hybrid or
multi-site cloud architecture involving multiple vendors and technical
architectures may require higher setup and operational costs because of the
need for more sophisticated orchestration and brokerage tools than in other
architectures. However, overall operational costs might be lower by virtue of
using a cloud brokerage tool to deploy the workloads to the most cost-effective
platform.
.. TODO Replace examples with the proposed example use cases in this guide.
Consider the following costs categories when designing a cloud:
* Compute resources
* Networking resources
* Replication
* Storage
* Management
* Operational costs
It is also important to consider how costs will increase as your cloud scales.
Choices that have a negligible impact in small systems may considerably
increase costs in large systems. In these cases, it is important to minimize
capital expenditure (CapEx) at all layers of the stack. Operators of massively
scalable OpenStack clouds require the use of dependable commodity hardware and
freely available open source software components to reduce deployment costs and
operational expenses. Initiatives like Open Compute (more information available
in the `Open Compute Project <http://www.opencompute.org>`_) provide additional
information.
Time-to-market
~~~~~~~~~~~~~~
The ability to deliver services or products within a flexible time
frame is a common business factor when building a cloud. Allowing users to
self-provision and gain access to compute, network, and
storage resources on-demand may decrease time-to-market for new products
and applications.
You must balance the time required to build a new cloud platform against the
time saved by migrating users away from legacy platforms. In some cases,
existing infrastructure may influence your architecture choices. For example,
using multiple cloud platforms may be a good option when there is an existing
investment in several applications, as it could be faster to tie the
investments together rather than migrating the components and refactoring them
to a single platform.
Revenue opportunity
~~~~~~~~~~~~~~~~~~~
Revenue opportunities vary based on the intent and use case of the cloud.
The requirements of a commercial, customer-facing product are often very
different from an internal, private cloud. You must consider what features
make your design most attractive to your users.
Capacity planning and scalability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Capacity and the placement of workloads are key design considerations
for clouds. A long-term capacity plan for these designs must
incorporate growth over time to prevent permanent consumption of more
expensive external clouds. To avoid this scenario, account for future
applications' capacity requirements and plan growth appropriately.
It is difficult to predict the amount of load a particular
application might incur if the number of users fluctuates, or the
application experiences an unexpected increase in use.
It is possible to define application requirements in terms of
vCPU, RAM, bandwidth, or other resources and plan appropriately.
However, other clouds might not use the same meter or even the same
oversubscription rates.
Oversubscription is a method to emulate more capacity than
may physically be present. For example, a physical hypervisor node with 32 GB
RAM may host 24 instances, each provisioned with 2 GB RAM.
As long as all 24 instances do not concurrently use 2 full
gigabytes, this arrangement works well.
However, some hosts take oversubscription to extremes and,
as a result, performance can be inconsistent.
If at all possible, determine what the oversubscription rates
of each host are and plan capacity accordingly.
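As a minimal sketch of this arithmetic (the node size, flavor size, and
overcommit ratio below are example values only, not recommendations):

.. code-block:: python

   # Estimate instance capacity on a hypervisor under RAM oversubscription.
   node_ram_gb = 32        # physical RAM on the hypervisor
   flavor_ram_gb = 2       # RAM requested by each instance flavor
   ram_overcommit = 1.5    # e.g. the nova ram_allocation_ratio setting

   usable_ram_gb = node_ram_gb * ram_overcommit
   instances_per_node = int(usable_ram_gb // flavor_ram_gb)

   print("Estimated instances per node: %d" % instances_per_node)  # 24 here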
.. TODO Considerations when building your cloud, racks, CPUs, compute node
density. For ongoing capacity planning refer to the Ops Guide.
Performance
~~~~~~~~~~~
Performance is a critical consideration when designing any cloud, and becomes
increasingly important as size and complexity grow. While single-site, private
clouds can be closely controlled, multi-site and hybrid deployments require
more careful planning to reduce problems such as network latency between sites.
For example, you should consider the time required to
run a workload in different clouds and methods for reducing this time.
This may require moving data closer to applications or applications
closer to the data they process, and grouping functionality so that
connections that require low latency take place over a single cloud
rather than spanning clouds.
This may also require a CMP that can determine which cloud can most
efficiently run which types of workloads.
Using native OpenStack tools can help improve performance.
For example, you can use Telemetry to measure performance and the
Orchestration service (heat) to react to changes in demand.
.. note::
Orchestration requires special client configurations to integrate
with Amazon Web Services. For other types of clouds, use CMP features.
Cloud resource deployment
The cloud user expects repeatable, dependable, and deterministic processes
for launching and deploying cloud resources. You could deliver this through
a web-based interface or publicly available API endpoints. All appropriate
options for requesting cloud resources must be available through some type
of user interface, a command-line interface (CLI), or API endpoints.
Consumption model
Cloud users expect a fully self-service and on-demand consumption model.
When an OpenStack cloud reaches a massively scalable size, expect
consumption as a service in each and every way.
* Everything must be capable of automation. For example, everything from
compute hardware, storage hardware, networking hardware, to the installation
and configuration of the supporting software. Manual processes are
impractical in a massively scalable OpenStack design architecture.
* Massively scalable OpenStack clouds require extensive metering and
monitoring functionality to maximize the operational efficiency by keeping
the operator informed about the status and state of the infrastructure. This
includes full scale metering of the hardware and software status. A
corresponding framework of logging and alerting is also required to store
and enable operations to act on the meters provided by the metering and
monitoring solutions. The cloud operator also needs a solution that uses the
data provided by the metering and monitoring solution to provide capacity
planning and capacity trending analysis.
Location
For many use cases the proximity of the user to their workloads has a
direct influence on the performance of the application and therefore
should be taken into consideration in the design. Certain applications
require zero to minimal latency that can only be achieved by deploying
the cloud in multiple locations. These locations could be in different
data centers, cities, countries or geographical regions, depending on
the user requirement and location of the users.
Input-Output requirements
Input-Output performance requirements should be researched and modeled
before deciding on a final storage framework. Running benchmarks for
Input-Output performance provides a baseline for expected performance
levels. If these tests include details, then the resulting data can help
model behavior and results during different workloads. Running smaller
scripted benchmarks during the lifecycle of the architecture helps record
the system health at different points in time (see the sketch after this
list). The data from these scripted benchmarks assists in future scoping
and in gaining a deeper understanding of an organization's needs.
Scale
Scaling storage solutions in a storage-focused OpenStack
architecture design is driven by initial requirements, including
:term:`IOPS <Input/output Operations Per Second (IOPS)>`, capacity,
bandwidth, and future needs. Planning capacity based on projected needs
over the course of a budget cycle is important for a design. The
architecture should balance cost and capacity, while also allowing
flexibility to implement new technologies and methods as they become
available.
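The scripted benchmark mentioned under Input-Output requirements above could
be as simple as the following sketch, which assumes the ``fio`` tool is
installed and uses illustrative job parameters and file paths:

.. code-block:: python

   # Run a small, repeatable fio job and keep a timestamped copy of the
   # JSON results for later comparison. Parameters are illustrative only.
   import datetime
   import subprocess

   result = subprocess.run(
       ["fio", "--name=baseline", "--rw=randread", "--bs=4k",
        "--size=256m", "--runtime=30", "--time_based",
        "--filename=/var/tmp/fio-baseline", "--output-format=json"],
       capture_output=True, text=True, check=True)

   stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
   with open("/var/tmp/fio-baseline-%s.json" % stamp, "w") as out:
       out.write(result.stdout)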
Network
~~~~~~~
It is important to consider the functionality, security, scalability,
availability, and testability of the network when choosing a CMP and cloud
provider.
* Decide on a network framework and design minimum functionality tests.
This ensures testing and functionality persists during and after
upgrades.
* Scalability across multiple cloud providers may dictate which underlying
network framework you choose in different cloud providers.
It is important to present the network API functions and to verify
that functionality persists across all cloud endpoints chosen.
* High availability implementations vary in functionality and design.
Examples of some common methods are active-hot-standby, active-passive,
and active-active.
Development of high availability and test frameworks is necessary to
ensure understanding of functionality and limitations.
* Consider the security of data between the client and the endpoint,
and of traffic that traverses the multiple clouds.
For example, degraded video streams and low quality VoIP sessions negatively
impact user experience and may lead to productivity and economic loss.
Network misconfigurations
Configuring incorrect IP addresses, VLANs, and routers can cause
outages to areas of the network or, in the worst-case scenario, the
entire cloud infrastructure. Automate network configurations to
minimize the opportunity for operator error as it can cause
disruptive problems.
Capacity planning
Cloud networks require management for capacity and growth over time.
Capacity planning includes the purchase of network circuits and
hardware that can potentially have lead times measured in months or
years.
Network tuning
Configure cloud networks to minimize link loss, packet loss, packet
storms, broadcast storms, and loops.
Single Point Of Failure (SPOF)
Consider high availability at the physical and environmental layers.
If there is a single point of failure due to only one upstream link,
or only one power supply, an outage can become unavoidable.
Complexity
An overly complex network design can be difficult to maintain and
troubleshoot. While device-level configuration can ease maintenance
concerns and automated tools can handle overlay networks, avoid or
document non-traditional interconnects between functions and
specialized hardware to prevent outages.
Non-standard features
There are additional risks that arise from configuring the cloud
network to take advantage of vendor specific features. One example
is multi-link aggregation (MLAG) used to provide redundancy at the
aggregator switch level of the network. MLAG is not a standard and,
as a result, each vendor has their own proprietary implementation of
the feature. MLAG architectures are not interoperable across switch
vendors, which leads to vendor lock-in and can cause delays or failures
when upgrading components.
Dynamic resource expansion or bursting
An application that requires additional resources may suit a multiple
cloud architecture. For example, a retailer needs additional resources
during the holiday season, but does not want to add private cloud
resources to meet the peak demand.
The user can accommodate the increased load by bursting to
a public cloud for these peak load periods. These bursts could be
for long or short cycles ranging from hourly to yearly.
Compliance and geo-location
~~~~~~~~~~~~~~~~~~~~~~~~~~~
An organization may have certain legal obligations and regulatory
compliance measures which could require certain workloads or data to not
be located in certain regions.
Compliance considerations are particularly important for multi-site clouds.
Considerations include:
- federal legal requirements
- local jurisdictional legal and compliance requirements
- image consistency and availability
- storage replication and availability (both block and file/object storage)
- authentication, authorization, and auditing (AAA)
Geographical considerations may also impact the cost of building or leasing
data centers. Considerations include:
- floor space
- floor weight
- rack height and type
- environmental considerations
- power usage and power usage efficiency (PUE)
- physical security
Auditing
~~~~~~~~
A well-considered auditing plan is essential for quickly finding issues.
Keeping track of changes made to security groups and tenant changes can be
useful in rolling back the changes if they affect production. For example,
if all security group rules for a tenant disappeared, the ability to quickly
track down the issue would be important for operational and legal reasons.
For more details on auditing, see the `Compliance chapter
<https://docs.openstack.org/security-guide/compliance.html>`_ in the OpenStack
Security Guide.
Security
~~~~~~~~
The importance of security varies based on the type of organization using
a cloud. For example, government and financial institutions often have
very high security requirements. Security should be implemented according to
asset, threat, and vulnerability risk assessment matrices.
See `security-requirements`.
Service level agreements
~~~~~~~~~~~~~~~~~~~~~~~~
Service level agreements (SLA) must be developed in conjunction with business,
technical, and legal input. Small, private clouds may operate under an informal
SLA, but hybrid or public clouds generally require more formal agreements with
their users.
For a user of a massively scalable OpenStack public cloud, there are no
expectations for control over security, performance, or availability. Users
expect only SLAs related to uptime of API services, and very basic SLAs for
services offered. It is the user's responsibility to address these issues on
their own. The exception to this expectation is the rare case of a massively
scalable cloud infrastructure built for a private or government organization
that has specific requirements.
High performance systems have SLA requirements for a minimum quality of service
with regard to guaranteed uptime, latency, and bandwidth. The level of the
SLA can have a significant impact on the network architecture and
requirements for redundancy in the systems.
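For a rough sense of what an uptime guarantee implies, the downtime an
availability target allows can be computed directly (a simple illustration,
not tied to any particular OpenStack service):

.. code-block:: python

   # Convert an availability target into the downtime it allows per year.
   HOURS_PER_YEAR = 24 * 365

   for availability in (0.99, 0.999, 0.9999):
       downtime_hours = HOURS_PER_YEAR * (1 - availability)
       print("%.2f%% uptime allows about %.1f hours of downtime per year"
             % (availability * 100, downtime_hours))

   # 99% -> ~87.6 h, 99.9% -> ~8.8 h, 99.99% -> ~0.9 h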
Hybrid cloud designs must accommodate differences in SLAs between providers,
and consider their enforceability.
Application readiness
~~~~~~~~~~~~~~~~~~~~~
Some applications are tolerant of a lack of synchronized object
storage, while others may need those objects to be replicated and
available across regions. Understanding how the cloud implementation
impacts new and existing applications is important for risk mitigation,
and the overall success of a cloud project. Applications may have to be
written or rewritten for an infrastructure with little to no redundancy,
or with the cloud in mind.
Application momentum
Businesses with existing applications may find that it is
more cost effective to integrate applications on multiple
cloud platforms than migrating them to a single platform.
No predefined usage model
The lack of a pre-defined usage model enables the user to run a wide
variety of applications without having to know the application
requirements in advance. This provides a degree of independence and
flexibility that no other cloud scenarios are able to provide.
On-demand and self-service application
By definition, a cloud provides end users with the ability to
self-provision computing power, storage, networks, and software in a
simple and flexible way. The user must be able to scale their
resources up to a substantial level without disrupting the
underlying host operations. One of the benefits of using a general
purpose cloud architecture is the ability to start with limited
resources and increase them over time as the user demand grows.
Authentication
~~~~~~~~~~~~~~
It is recommended to have a single authentication domain rather than a
separate implementation for each and every site. This requires an
authentication mechanism that is highly available and distributed to
ensure continuous operation. Authentication server locality might be
required and should be planned for.
Migration, availability, site loss and recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Outages can cause partial or full loss of site functionality. Strategies
should be implemented to understand and plan for recovery scenarios.
* The deployed applications need to continue to function and, more
importantly, you must consider the impact on the performance and
reliability of the application when a site is unavailable.
* It is important to understand what happens to the replication of
objects and data between the sites when a site goes down. If this
causes queues to start building up, consider how long these queues
can safely exist until an error occurs.
* After an outage, ensure the method for resuming proper operations of
a site is implemented when it comes back online. We recommend you
architect the recovery to avoid race conditions.
Disaster recovery and business continuity
Cheaper storage makes the public cloud suitable for maintaining
backup applications.
Migration scenarios
Hybrid cloud architecture enables the migration of
applications between different clouds.
Provider availability or implementation details
Business changes can affect provider availability.
Likewise, changes in a provider's service can disrupt
a hybrid cloud environment or increase costs.
Provider API changes
Consumers of external clouds rarely have control over provider
changes to APIs, and changes can break compatibility.
Using only the most common and basic APIs can minimize potential conflicts.
Image portability
As of the Kilo release, there is no common image format that is
usable by all clouds. Conversion or recreation of images is necessary
if migrating between clouds. To simplify deployment, use the smallest
and simplest images feasible, install only what is necessary, and
use a deployment manager such as Chef or Puppet. Do not use golden
images to speed up the process unless you repeatedly deploy the same
images on the same cloud.
API differences
Avoid using a hybrid cloud deployment with more than just
OpenStack (or with different versions of OpenStack) as API changes
can cause compatibility issues.
Business or technical diversity
Organizations leveraging cloud-based services can embrace business
diversity and utilize a hybrid cloud design to spread their
workloads across multiple cloud providers. This ensures that
no single cloud provider is the sole host for an application.

182
doc/arch-design/source/arch-requirements/arch-requirements-ha.rst

@@ -1,182 +0,0 @@
.. _high-availability:
=================
High availability
=================
Data plane and control plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When designing an OpenStack cloud, it is important to consider the needs
dictated by the :term:`Service Level Agreement (SLA)`. This includes the core
services required to maintain availability of running Compute service
instances, networks, storage, and additional services running on top of those
resources. These services are often referred to as the Data Plane services,
and are generally expected to be available all the time.
The remaining services, responsible for create, read, update and delete (CRUD)
operations, metering, monitoring, and so on, are often referred to as the
Control Plane. The SLA is likely to dictate a lower uptime requirement for
these services.
The services comprising an OpenStack cloud have a number of requirements that
you need to understand in order to be able to meet SLA terms. For example, in
order to provide the Compute service, at a minimum, storage, message queuing,
and database services are necessary, as well as the networking between
them.
Ongoing maintenance operations are made much simpler if there is logical and
physical separation of Data Plane and Control Plane systems. It then becomes
possible to, for example, reboot a controller without affecting customers.
If one service failure affects the operation of an entire server (``noisy
neighbor``), the separation between Control and Data Planes enables rapid
maintenance with a limited effect on customer operations.
Eliminating single points of failure within each site
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack lends itself to deployment in a highly available manner, where it is
expected that at least two servers be utilized. These can run all of the
services involved, from the message queuing service (for example ``RabbitMQ``
or ``QPID``) to an appropriately deployed database service (such as ``MySQL``
or ``MariaDB``). As services in the cloud are scaled out, back-end services will
need to scale too. Monitoring and reporting on server utilization and response
times, as well as load testing your systems, will help determine scale out
decisions.
The OpenStack services themselves should be deployed across multiple servers
that do not represent a single point of failure. Ensuring availability can
be achieved by placing these services behind highly available load balancers
that have multiple OpenStack servers as members.
There are a small number of OpenStack services which are intended to only run
in one place at a time (for example, the ``ceilometer-agent-central``
service). In order to prevent these services from becoming a single point of
failure,
they can be controlled by clustering software such as ``Pacemaker``.
In OpenStack, the infrastructure is integral to providing services and should
always be available, especially when operating with SLAs. Ensuring network
availability is accomplished by designing the network architecture so that no
single point of failure exists. A consideration of the number of switches,
routes and redundancies of power should be factored into core infrastructure,
as well as the associated bonding of networks to provide diverse routes to your
highly available switch infrastructure.
Care must be taken when deciding network functionality. Currently, OpenStack
supports both the legacy networking (nova-network) system and the newer,
extensible OpenStack Networking (neutron). OpenStack Networking and legacy
networking both have their advantages and disadvantages. They are both valid
and supported options that fit different network deployment models described in
the `OpenStack Operations Guide
<https://docs.openstack.org/ops-guide/arch_network_design.html#network-topology>`_.
When using the Networking service, the OpenStack controller servers or separate
Networking hosts handle routing unless the dynamic virtual routers pattern for
routing is selected. Running routing directly on the controller servers mixes
the Data and Control Planes and can cause complex issues with performance and
troubleshooting. It is possible to use third party software and external
appliances that help maintain highly available layer three routes. Doing so
allows for common application endpoints to control network hardware, or to
provide complex multi-tier web applications in a secure manner. It is also
possible to completely remove routing from Networking, and instead rely on
hardware routing capabilities. In this case, the switching infrastructure must
support layer three routing.
Application design must also be factored into the capabilities of the
underlying cloud infrastructure. If the compute hosts do not provide a seamless
live migration capability, then it must be expected that if a compute host
fails, that instance and any data local to that instance will be deleted.
However, when giving users the expectation of a high level of guaranteed
instance uptime, the infrastructure must be deployed in a way
that eliminates any single point of failure if a compute host disappears.
This may include utilizing shared file systems on enterprise storage or
OpenStack Block storage to provide a level of guarantee to match service
features.
If using a storage design that includes shared access to centralized storage,
ensure that this is also designed without single points of failure and the SLA
for the solution matches or exceeds the expected SLA for the Data Plane.
Eliminating single points of failure in a multi-region design
-------------------------------------------------------------
Some services are commonly shared between multiple regions, including the
Identity service and the Dashboard. In this case, it is necessary to ensure
that the databases backing the services are replicated, and that access to
multiple workers across each site can be maintained in the event of losing a
single region.
Multiple network links should be deployed between sites to provide redundancy
for all components. This includes storage replication, which should be isolated
to a dedicated network or VLAN with the ability to assign QoS to control the
replication traffic or provide priority for this traffic.
.. note::
If the data store is highly changeable, the network requirements could have
a significant effect on the operational cost of maintaining the sites.
If the design incorporates more than one site, the ability to maintain object
availability in both sites has significant implications on the Object Storage
design and implementation. It also has a significant impact on the WAN network
design between the sites.
If applications running in a cloud are not cloud-aware, there should be clear
measures and expectations to define what the infrastructure can and cannot
support. An example would be shared storage between sites. This is possible,
but such a solution is not native to OpenStack and requires a third-party
hardware vendor to fulfill the requirement. Another example can be seen in
applications that are able to consume resources in object storage directly.
Connecting more than two sites increases the challenges and adds more
complexity to the design considerations. Multi-site implementations require
planning to address the additional topology used for internal and external
connectivity. Some options include full mesh topology, hub spoke, spine leaf,
and 3D Torus.
For more information on high availability in OpenStack, see the `OpenStack High
Availability Guide <https://docs.openstack.org/ha-guide/>`_.
Site loss and recovery
~~~~~~~~~~~~~~~~~~~~~~
Outages can cause partial or full loss of site functionality. Strategies
should be implemented to understand and plan for recovery scenarios.
* The deployed applications need to continue to function and, more
importantly, you must consider the impact on the performance and
reliability of the application if a site is unavailable.
* It is important to understand what happens to the replication of
objects and data between the sites when a site goes down. If this
causes queues to start building up, consider how long these queues
can safely exist until an error occurs.
* After an outage, ensure that operations of a site are resumed when it
comes back online. We recommend that you architect the recovery to
avoid race conditions.
Replicating inter-site data
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Traditionally, replication has been the best method of protecting object store
implementations. A variety of replication methods exist in storage
architectures, for example synchronous and asynchronous mirroring. Most object
stores and back-end storage systems implement methods for replication at the
storage subsystem layer. Object stores also tailor replication techniques to
fit a cloud's requirements.
Organizations must find the right balance between data integrity and data
availability. Replication strategy may also influence disaster recovery
methods.
Replication across different racks, data centers, and geographical regions
increases focus on determining and ensuring data locality. The ability to
guarantee data is accessed from the nearest or fastest storage can be necessary
for applications to perform well.
.. note::
When running embedded object store methods, ensure that you do not
instigate extra data replication as this may cause performance issues.

259
doc/arch-design/source/arch-requirements/arch-requirements-operations.rst

@@ -1,259 +0,0 @@
========================
Operational requirements
========================
This section describes operational factors affecting the design of an
OpenStack cloud.
Network design
~~~~~~~~~~~~~~
The network design for an OpenStack cluster includes decisions regarding
the interconnect needs within the cluster, the need to allow clients to
access their resources, and the access requirements for operators to
administrate the cluster. You should consider the bandwidth, latency,
and reliability of these networks.
Consider additional design decisions about monitoring and alarming.
If you are using an external provider, service level agreements (SLAs)
are typically defined in your contract. Operational considerations such
as bandwidth, latency, and jitter can be part of the SLA.
As demand for network resources increases, make sure your network design
accommodates expansion and upgrades. Operators may need to add IP address
blocks and additional bandwidth capacity. In addition, consider
managing hardware and software lifecycle events, for example upgrades,
decommissioning, and outages, while avoiding service interruptions for
tenants.
Factor maintainability into the overall network design. This includes
the ability to manage and maintain IP addresses as well as the use of
overlay identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS
tags. For example, if you need to change all of the IP addresses
on a network (a process known as renumbering), then the design must
support this function.
Address network-focused applications when considering certain
operational realities. For example, consider the impending exhaustion of
IPv4 addresses, the migration to IPv6, and the use of private networks
to segregate different types of traffic that an application receives or
generates. In the case of IPv4 to IPv6 migrations, applications should
follow best practices for storing IP addresses. We recommend you avoid
relying on IPv4 features that did not carry over to the IPv6 protocol or
have differences in implementation.
To segregate traffic, allow applications to create a private tenant
network for database and storage network traffic. Use a public network
for services that require direct client access from the Internet. Upon
segregating the traffic, consider :term:`quality of service (QoS)` and
security to ensure each network has the required level of service.
Also consider the routing of network traffic. For some applications,
develop a complex policy framework for routing. To create a routing
policy that satisfies business requirements, consider the economic cost
of transmitting traffic over expensive links versus cheaper links, in
addition to bandwidth, latency, and jitter requirements.
Finally, consider how to respond to network events. How load
transfers from one link to another during a failure scenario could be
a factor in the design. If you do not plan network capacity
correctly, failover traffic could overwhelm other ports or network
links and create a cascading failure scenario. In this case,
traffic that fails over to one link overwhelms that link and then
moves to the subsequent links until all network traffic stops.
SLA considerations
~~~~~~~~~~~~~~~~~~
Service-level agreements (SLAs) define the levels of availability that will
impact the design of an OpenStack cloud to provide redundancy and high
availability.
SLA terms that affect the design include:
* API availability guarantees implying multiple infrastructure services
and highly available load balancers.
* Network uptime guarantees affecting switch design, which might
require redundant switching and power.
* Networking security policy requirements.
In any environment larger than just a few hosts, there are two areas
that might be subject to an SLA:
* Data Plane - services that provide virtualization, networking, and
storage. Customers usually require these services to be continuously
available.
* Control Plane - ancillary services such as API endpoints, and services that
control CRUD operations. The services in this category are usually subject to
a different SLA expectation and may be better suited on separate
hardware or containers from the Data Plane services.
To effectively run cloud installations, initial downtime planning includes
creating processes and architectures that support planned maintenance
and unplanned system faults.
It is important to determine as part of the SLA negotiation which party is
responsible for monitoring and starting up the Compute service instances if an
outage occurs.
Upgrading, patching, and changing configuration items may require
downtime for some services. Stopping services that form the Control Plane may
not impact the Data Plane. Live-migration of Compute instances may be required
to perform any actions that require downtime to Data Plane components.
There are many services outside the realm of pure OpenStack
code that affect the ability of a cloud design to meet SLAs, including:
* Database services, such as ``MySQL`` or ``PostgreSQL``.
* Services providing RPC, such as ``RabbitMQ``.
* External network attachments.
* Physical constraints such as power, rack space, network cabling, etc.
* Shared storage including SAN based arrays, storage clusters such as ``Ceph``,
and/or NFS services.
Depending on the design, some network service functions may fall into both the
Control and Data Plane categories. For example, the neutron L3 Agent service
may be considered a Control Plane component, but the routers themselves would
be a Data Plane component.
In a design with multiple regions, the SLA would also need to take into
consideration the use of shared services such as the Identity service
and Dashboard.
Any SLA negotiation must also take into account the reliance on third parties
for critical aspects of the design. For example, if there is an existing SLA
on a component such as a storage system, the SLA must take into account this
limitation. If the required SLA for the cloud exceeds the agreed uptime levels
of the cloud components, additional redundancy would be required. This
consideration is critical in a hybrid cloud design, where there are multiple
third parties involved.
Support and maintenance
~~~~~~~~~~~~~~~~~~~~~~~
An operations staff supports, manages, and maintains an OpenStack environment.
Their skills may be specialized or varied depending on the size and purpose of
the installation.
The maintenance function of an operator should be taken into consideration:
Maintenance tasks
Operating system patching, hardware/firmware upgrades, and datacenter
related changes, as well as minor and release upgrades to OpenStack
components are all ongoing operational tasks. The six-month release
cycle of the OpenStack projects needs to be considered as part of the
cost of ongoing maintenance. The solution should take into account
storage and network maintenance and the impact on underlying
workloads.
Reliability and availability
Reliability and availability depend on the many supporting components'
availability and on the level of precautions taken by the service provider.
This includes network, storage systems, datacenter, and operating systems.
For more information on
managing and maintaining your OpenStack environment, see the
`OpenStack Operations Guide <https://docs.openstack.org/operations-guide/>`_.
Logging and monitoring
----------------------
OpenStack clouds require appropriate monitoring platforms to identify and
manage errors.
.. note::
We recommend leveraging existing monitoring systems to see if they
are able to effectively monitor an OpenStack environment.
Specific meters that are critically important to capture include:
* Image disk utilization
* Response time to the Compute API
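As an illustrative way to capture the second meter, the Compute API response
time can be sampled with a timed request. The endpoint URL and token below
are placeholders; a real deployment would obtain a token from the Identity
service and feed the measurement into whatever monitoring system is in use:

.. code-block:: python

   # Sample the response time of a Compute API endpoint (placeholders only).
   import time

   import requests

   NOVA_ENDPOINT = "https://compute.example.com:8774/v2.1"  # placeholder URL
   TOKEN = "<keystone-token>"                               # placeholder token

   start = time.monotonic()
   response = requests.get(NOVA_ENDPOINT + "/servers",
                           headers={"X-Auth-Token": TOKEN}, timeout=10)
   elapsed_ms = (time.monotonic() - start) * 1000.0

   print("GET /servers returned %s in %.1f ms"
         % (response.status_code, elapsed_ms))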
Logging and monitoring do not significantly differ for a multi-site OpenStack
cloud. The tools described in the `Logging and monitoring
<https://docs.openstack.org/operations-guide/ops-logging-monitoring.html>`__
chapter of the Operations Guide remain applicable. Logging and monitoring can
be provided on a per-site basis, and in a common centralized location.
When attempting to deploy logging and monitoring facilities to a centralized
location, care must be taken with the load placed on the inter-site networking
links.
Management software
-------------------
Management software providing clustering, logging, monitoring, and alerting
details for a cloud environment is often used. This affects the overall
OpenStack cloud design, which must account for the additional resource
consumption, such as CPU, RAM, storage, and network bandwidth.
The inclusion of clustering software, such as Corosync or Pacemaker, is
primarily determined by the availability of the cloud infrastructure and
the complexity of supporting the configuration after it is deployed. The
`OpenStack High Availability Guide <https://docs.openstack.org/ha-guide/>`_
provides more details on the installation and configuration of Corosync
and Pacemaker, should these packages need to be included in the design.
Some other potential design impacts include:
* OS-hypervisor combination
Ensure that the selected logging, monitoring, or alerting tools support
the proposed OS-hypervisor combination.
* Network hardware
The network hardware selection needs to be supported by the logging,
monitoring, and alerting software.
Database software
-----------------
Most OpenStack components require access to back-end database services
to store state and configuration information. Choose an appropriate
back-end database which satisfies the availability and fault tolerance
requirements of the OpenStack services.
MySQL is the default database for OpenStack, but other compatible
databases are available.
.. note::
Telemetry uses MongoDB.
The chosen high availability database solution changes according to the
selected database. MySQL, for example, provides several options. Use a
replication technology such as Galera for active-active clustering. For
active-passive use some form of shared storage. Each of these potential
solutions has an impact on the design:
* Solutions that employ Galera/MariaDB require at least three MySQL
nodes.
* MongoDB has its own design considerations for high availability.
* OpenStack design, generally, does not include shared storage.
However, for some high availability designs, certain components might
require it depending on the specific implementation.
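As an illustrative check on the first point, the current Galera cluster size
can be read from the ``wsrep_cluster_size`` status variable. The sketch below
assumes the ``PyMySQL`` library and uses placeholder connection details:

.. code-block:: python

   # Verify that a Galera cluster still has the expected number of members.
   import pymysql

   conn = pymysql.connect(host="db.example.com", user="monitor",
                          password="secret")  # placeholder credentials
   try:
       with conn.cursor() as cur:
           cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
           _name, cluster_size = cur.fetchone()
           if int(cluster_size) < 3:
               print("WARNING: only %s Galera members" % cluster_size)
           else:
               print("Galera cluster size: %s" % cluster_size)
   finally:
       conn.close()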
Operator access to systems
~~~~~~~~~~~~~~~~~~~~~~~~~~
There is a trend toward hosting cloud operations systems within the cloud
environment. Operators require access to these systems to resolve major
incidents.
Ensure that the network structure connects all clouds to form an integrated
system. Also consider the state of handoffs which must be reliable and have
minimal latency for optimal performance of the system.
If a significant portion of the cloud is on externally managed systems,
prepare for situations where it may not be possible to make changes.
Additionally, cloud providers may differ on how infrastructure must be managed
and exposed. This can lead to delays in root cause analysis when each provider
insists the blame lies with the other.

1
doc/arch-design/source/common

@@ -1 +0,0 @@
../../common

307
doc/arch-design/source/conf.py

@@ -1,307 +0,0 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import os
# import sys
import openstackdocstheme
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
# sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['openstackdocstheme']
# Add any paths that contain templates here, relative to this directory.
# templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
# source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
repository_name = "openstack/openstack-manuals"
bug_project = 'openstack-manuals'
project = u'Architecture Design Guide'
bug_tag = u'arch-design'
copyright = u'2015-2018, OpenStack contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = ''
# The full version, including alpha/beta/rc tags.
release = ''
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
# language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['common/cli*', 'common/nova*', 'common/get-started-*']
# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
# add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
# modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
# keep_warnings = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'openstackdocs'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
html_theme_options = {
'display_badge': False
}
# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = [openstackdocstheme.get_html_theme_path()]
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
# html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
# html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
# html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
# html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = []
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
# html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
# So that we can enable "log-a-bug" links from each output HTML page, this
# variable must be set to a format that includes year, month, day, hours and
# minutes.
html_last_updated_fmt = '%Y-%m-%d %H:%M'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
# html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
# html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
# html_additional_pages = {}
# If false, no module index is generated.
# html_domain_indices = True
# If false, no index is generated.
html_use_index = False
# If true, the index is split into individual pages for each letter.
# html_split_index = False
# If true, links to the reST sources are added to the pages.
html_show_sourcelink = False
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
# html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
# html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
# html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
# html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'arch-design'
# If true, publish source files
html_copy_source = False
# -- Options for LaTeX output ---------------------------------------------
pdf_theme_path = openstackdocstheme.get_pdf_theme_path()
openstack_logo = openstackdocstheme.get_openstack_logo_path()
latex_custom_template = r"""
\newcommand{\openstacklogo}{%s}
\usepackage{%s}
""" % (openstack_logo, pdf_theme_path)
latex_engine = 'xelatex'
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'a4paper',
# The font size ('10pt', '11pt' or '12pt').
'pointsize': '11pt',
#Default figure align
'figure_align': 'H',
# Not to generate blank page after chapter
'classoptions': ',openany',
# Additional stuff for the LaTeX preamble.
'preamble': latex_custom_template,
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'ArchGuide.tex', u'Architecture Design Guide',
u'OpenStack contributors', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
# latex_use_parts = False
# If true, show page references after internal links.
# latex_show_pagerefs = False
# If true, show URL addresses after external links.
# latex_show_urls = False
# Documents to append as an appendix to all manuals.
# latex_appendices = []
# If false, no module index is generated.
# latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'ArchDesign', u'Architecture Design Guide',
[u'OpenStack contributors'], 1)
]
# If true, show URL addresses after external links.
# man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'ArchDesign', u'Architecture Design Guide',
u'OpenStack contributors', 'ArchDesign',
'To reap the benefits of OpenStack, you should plan, design,'
'and architect your cloud properly, taking user needs into'
'account and understanding the use cases.'
'commands.', 'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
# texinfo_appendices = []
# If false, no module index is generated.
# texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
# texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
# texinfo_no_detailmenu = False
# -- Options for Internationalization output ------------------------------
locale_dirs = ['locale/']
# -- Options for PDF output --------------------------------------------------
pdf_documents = [
('index', u'ArchDesignGuide', u'Architecture Design Guide',
u'OpenStack contributors')
]

49
doc/arch-design/source/design-cmp-tools.rst

@ -1,49 +0,0 @@
=============================
Cloud management architecture
=============================
Complex clouds, in particular hybrid clouds, may require tools to
facilitate working across multiple clouds.
Broker between clouds
Brokering software evaluates relative costs between different
cloud platforms. Cloud Management Platforms (CMPs)
allow the designer to determine the right location for a
workload based on predetermined criteria.
Facilitate orchestration across the clouds
CMPs simplify the migration of application workloads between
public, private, and hybrid cloud platforms.
We recommend using cloud orchestration tools for managing a diverse
portfolio of systems and applications across multiple cloud platforms.
Technical details
~~~~~~~~~~~~~~~~~
.. TODO
Capacity and scale
~~~~~~~~~~~~~~~~~~
.. TODO
High availability
~~~~~~~~~~~~~~~~~
.. TODO
Operator requirements
~~~~~~~~~~~~~~~~~~~~~
.. TODO
Deployment considerations
~~~~~~~~~~~~~~~~~~~~~~~~~
.. TODO
Maintenance considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. TODO

20
doc/arch-design/source/design-compute.rst

@ -1,20 +0,0 @@
====================
Compute architecture
====================
.. toctree::
:maxdepth: 3
design-compute/design-compute-arch
design-compute/design-compute-cpu
design-compute/design-compute-hypervisor
design-compute/design-compute-hardware
design-compute/design-compute-overcommit
design-compute/design-compute-storage
design-compute/design-compute-networking
design-compute/design-compute-logging
This section describes some of the choices you need to consider
when designing and building your compute nodes. Compute nodes form the
resource core of the OpenStack Compute cloud, providing the processing, memory,
network and storage resources to run instances.

104
doc/arch-design/source/design-compute/design-compute-arch.rst

@ -1,104 +0,0 @@
====================================
Compute server architecture overview
====================================
When designing compute resource pools, consider the number of processors,
amount of memory, network requirements, the quantity of storage required for
each hypervisor, and any requirements for bare metal hosts provisioned
through ironic.
When architecting an OpenStack cloud, as part of the planning process, you
must not only determine what hardware to utilize but whether compute
resources will be provided in a single pool or in multiple pools or
availability zones. You should consider if the cloud will provide distinctly
different profiles for compute.
For example, you may offer CPU-optimized, memory-optimized, or local storage
based compute nodes. For NFV or HPC clouds, there may even be specific network
configurations that should be reserved for those workloads on particular
compute nodes. This method of designing specific resources into groups or
zones of compute can be referred to as bin packing.
.. note::
In a bin packing design, each independent resource pool provides service for
specific flavors. Since instances are scheduled onto compute hypervisors,
each independent node's resources will be allocated to efficiently use the
available hardware. While bin packing can separate workload specific
resources onto individual servers, bin packing also requires a common
hardware design, with all hardware nodes within a compute resource pool
sharing a common processor, memory, and storage layout. This makes it easier
to deploy, support, and maintain nodes throughout their lifecycle.
Increasing the size of the supporting compute environment increases the network
traffic and messages, adding load to the controllers and administrative
services used to support the OpenStack cloud or networking nodes. When
considering hardware for controller nodes, whether using the monolithic
controller design, where all of the controller services live on one or more
physical hardware nodes, or in any of the newer shared nothing control plane
models, adequate resources must be allocated and scaled to meet scale
requirements. Effective monitoring of the environment will help with capacity
decisions on scaling. Proper planning will help avoid bottlenecks and network
oversubscription as the cloud scales.
Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. To further group compute nodes and place nodes into
appropriate availability zones and host aggregates, additional work is
required. It is necessary to plan rack capacity and network switches as scaling
out compute hosts directly affects data center infrastructure resources as
would any other infrastructure expansion.
While not as common in large enterprises, compute host components can also be
upgraded to account for increases in
demand, known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive. We recommend a rolling upgrade of compute
nodes for redundancy and availability.
After the upgrade, when compute nodes return to the OpenStack cluster, they
are re-scanned and the new resources are discovered and adjusted in the
OpenStack database.
When selecting a processor, compare features and performance
characteristics. Some processors include features specific to
virtualized compute hosts, such as hardware-assisted virtualization, and
hardware-assisted memory paging (such as Intel EPT or AMD RVI). These
types of features can have a significant impact on the performance of
your virtual machine.
The number of processor cores and threads impacts the number of worker
threads which can be run on a resource node. Design decisions must
relate directly to the service being run on it, as well as provide a
balanced infrastructure for all services.
Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio. This ratio is configurable for CPU and
memory. The default CPU overcommit ratio is 16:1, and the default memory
overcommit ratio is 1.5:1. Determining the tuning of the overcommit
ratios during the design phase is important as it has a direct impact on
the hardware layout of your compute nodes.
.. note::
Changing the CPU overcommit ratio can have a detrimental effect on
performance and increases the likelihood of noisy neighbor issues.
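As a minimal illustration (not a tuning recommendation), the ratios are set
per compute node in ``nova.conf``; the values below simply restate the
defaults mentioned above:

.. code-block:: ini

   # nova.conf on a compute node -- the values shown are the defaults
   [DEFAULT]
   # Allow 16 vCPUs to be allocated for every physical core
   cpu_allocation_ratio = 16.0
   # Allow 1.5 GB of instance memory for every 1 GB of physical RAM
   ram_allocation_ratio = 1.5

For example, at a 16:1 CPU ratio a dual-socket host with 24 physical cores can
have up to 384 vCPUs allocated to instances, which illustrates why the ratios
chosen during design directly drive the hardware layout.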
Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, adding capacity may mean
adding disk shelves to enterprise storage systems or installing additional
Block Storage nodes. It may also be necessary to upgrade the directly attached
storage installed in Compute hosts, or to add capacity to the shared storage
that provides additional ephemeral storage to instances.
Consider the Compute requirements of non-hypervisor nodes (also referred to as
resource nodes). This includes controller nodes, Object Storage nodes, Block
Storage nodes, and networking services.
The ability to create pools or availability zones for unpredictable workloads
should be considered. In some cases, the demand for certain instance types or
flavors may not justify individual hardware design. Allocate hardware designs
that are capable of servicing the most common instance requests. Adding
hardware to the overall architecture can be done later.

85
doc/arch-design/source/design-compute/design-compute-cpu.rst

@ -1,85 +0,0 @@
.. _choosing-a-cpu:
==============
Choosing a CPU
==============
The type of CPU in your compute node is a very important decision. You must
ensure that the CPU supports virtualization by way of *VT-x* for Intel chips
and *AMD-v* for AMD chips.
.. tip::
Consult the vendor documentation to check for virtualization support. For
Intel CPUs, see
`Does my processor support Intel® Virtualization Technology?
<https://www.intel.com/content/www/us/en/support/processors/000005486.html>`_. For AMD CPUs,
see `AMD Virtualization
<https://www.amd.com/en-us/innovations/software-technologies/server-solution/virtualization>`_.
Your CPU may support virtualization but it may be disabled. Consult your
BIOS documentation for how to enable CPU features.
The number of cores that the CPU has also affects your decision. It is
common for current CPUs to have up to 24 cores. Additionally, if an Intel CPU
supports hyper-threading, those 24 cores appear as 48 logical cores. If you
purchase a server that supports multiple CPUs, the number of cores is further
multiplied.
As of the Kilo release, key enhancements have been added to the
OpenStack code to improve guest performance. These improvements allow the
Compute service to take advantage of greater insight into a compute host's
physical layout and therefore make smarter decisions regarding workload
placement. Administrators can use this functionality to enable smarter planning
choices for use cases like NFV (Network Function Virtualization) and HPC (High
Performance Computing).
Considering non-uniform memory access (NUMA) is important when selecting CPU
sizes and types, as there are use cases that use NUMA pinning to reserve host
cores for operating system processes. This reduces the CPU available to
workloads but protects the operating system.
.. tip::
When CPU pinning is requested for a guest, it is assumed
there is no overcommit (or an overcommit ratio of 1.0). When dedicated
resourcing is not requested for a workload, the normal overcommit ratios
are applied.
Therefore, we recommend that host aggregates are used to separate not
only bare metal hosts, but hosts that will provide resources for workloads
that require dedicated resources. This said, when workloads are provisioned
to NUMA host aggregates, NUMA nodes are chosen at random and vCPUs can float
across NUMA nodes on a host. If workloads require SR-IOV or DPDK, they should
be assigned to a NUMA node aggregate with hosts that supply the
functionality. More importantly, the workload or vCPUs that are executing
processes for a workload should be on the same NUMA node due to the limited
amount of cross-node memory bandwidth. In all cases, the ``NUMATopologyFilter``
must be enabled for ``nova-scheduler``.
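As a sketch of how the filter is enabled (option names vary between releases;
older releases use ``scheduler_default_filters`` in the ``[DEFAULT]``
section), the scheduler configuration on the controller might look like this:

.. code-block:: ini

   # nova.conf on the node running nova-scheduler (illustrative filter list)
   [filter_scheduler]
   # Keep the deployment's existing filters and append NUMATopologyFilter so
   # that flavors requesting CPU pinning or a guest NUMA topology are placed
   # only on hosts that can satisfy them.
   enabled_filters = RetryFilter,AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter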
Additionally, CPU selection may not be one-size-fits-all across enterprises;
it is more likely a short list of SKUs tuned for the enterprise's workloads.
For more information about NUMA, see `CPU topologies
<https://docs.openstack.org/admin-guide/compute-cpu-topologies.html>`_ in
the Administrator Guide.
In order to take advantage of these new enhancements in the Compute service,
compute hosts must be using NUMA capable CPUs.
.. tip::
**Multithread Considerations**
Hyper-Threading is Intel's proprietary simultaneous multithreading
implementation used to improve parallelization on their CPUs. You might
consider enabling Hyper-Threading to improve the performance of
multithreaded applications.
Whether you should enable Hyper-Threading on your CPUs depends upon your use
case. For example, disabling Hyper-Threading can be beneficial in intense
computing environments. We recommend performance testing with your local
workload with both Hyper-Threading on and off to determine what is more
appropriate in your case.
In most cases, hyper-threaded CPUs can provide a 1.3x to 2.0x performance
benefit over non-hyper-threaded CPUs, depending on the type of workload.

165
doc/arch-design/source/design-compute/design-compute-hardware.rst

@ -1,165 +0,0 @@
========================
Choosing server hardware
========================
Consider the following factors when selecting compute server hardware:
* Server density
A measure of how many servers can fit into a given measure of
physical space, such as a rack unit [U].
* Resource capacity
The number of CPU cores, how much RAM, or how much storage a given
server delivers.
* Expandability
The number of additional resources you can add to a server before it
reaches capacity.
* Cost
The relative cost of the hardware weighed against the total amount of
capacity available on the hardware based on predetermined requirements.
Weigh these considerations against each other to determine the best design for
the desired purpose. For example, increasing server density means sacrificing
resource capacity or expandability. It also can decrease availability and
increase the chance of noisy neighbor issues. Increasing resource capacity and
expandability can increase cost but decrease server density. Decreasing cost
often means decreasing supportability, availability, server density, resource
capacity, and expandability.
Determine the requirements for the cloud prior to constructing it, and plan
for hardware lifecycles, expansion, and new features that may require
different hardware.
If the cloud is initially built with near-end-of-life but cost-effective
hardware, then the performance and capacity demands of new workloads will drive
the purchase of more modern hardware. With individual hardware components
changing over time, you may prefer to manage configurations as stock keeping
units (SKU)s. This method provides an enterprise with a standard
configuration unit of compute (server) that can be placed in any IT service
manager or vendor supplied ordering system that can be triggered manually or
through advanced operational automations. This simplifies ordering,
provisioning, and activating additional compute resources. For example, there
are plug-ins for several commercial service management tools that enable
integration with hardware APIs. These configure and activate new compute
resources from standby hardware based on standard configurations. Using this
methodology, spare hardware can be ordered for a datacenter and provisioned
based on capacity data derived from OpenStack Telemetry.
Compute capacity (CPU cores and RAM capacity) is a secondary consideration for
selecting server hardware. The required server hardware must supply adequate
CPU sockets, additional CPU cores, and adequate RAM. For more information, see
:ref:`choosing-a-cpu`.
In compute server architecture design, you must also consider network and
storage requirements. For more information on network considerations, see
:ref:`network-design`.
Considerations when choosing hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here are some other factors to consider when selecting hardware for your
compute servers.
Instance density
----------------
For a general purpose OpenStack cloud, sizing is an important consideration.
The expected or anticipated number of instances that each hypervisor can
host is a common meter used in sizing the deployment. The selected server
hardware needs to support the expected or anticipated instance density. If
the design architecture uses dual-socket hardware designs, more hosts are
required to support the anticipated scale.
Host density
------------
Another option to address the higher host count is to use a
quad-socket platform. Taking this approach decreases host density
which also increases rack count. This configuration affects the
number of power connections and also impacts network and cooling
requirements.
Physical data centers have limited physical space, power, and
cooling. The number of hosts (or hypervisors) that can be fitted
into a given metric (rack, rack unit, or floor tile) is another
important method of sizing. Floor weight is an often overlooked
consideration.
The data center floor must be able to support the weight of the proposed number
of hosts within a rack or set of racks. These factors need to be applied as
part of the host density calculation and server hardware selection.
Power and cooling density
-------------------------
The power and cooling density requirements might be lower than with
blade, sled, or 1U server designs due to lower host density (by
using 2U, 3U or even 4U server designs). For data centers with older
infrastructure, this might be a desirable feature.
Data centers have a specified amount of power fed to a given rack or
set of racks. Older data centers may have power densities as low as 20A per
rack, and current data centers can be designed to support power densities as
high as 120A per rack. The selected server hardware must take power density
into account.
Selecting hardware form factor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider the following when selecting a server hardware form factor suited to
your OpenStack design architecture:
* Most blade servers can support dual-socket multi-core CPUs. To avoid
this CPU limit, select ``full width`` or ``full height`` blades. Be
aware, however, that this also decreases server density. For example,
high density blade servers such as HP BladeSystem or Dell PowerEdge
M1000e support up to 16 servers in only ten rack units. Using
half-height blades is twice as dense as using full-height blades,
which results in only eight servers per ten rack units.
* 1U rack-mounted servers have the ability to offer greater server density
than a blade server solution, but are often limited to dual-socket,
multi-core CPU configurations. It is possible to place forty 1U servers
in a rack, providing space for the top of rack (ToR) switches, compared
to 32 full width blade servers.
To obtain greater than dual-socket support in a 1U rack-mount form
factor, customers need to buy their systems from Original Design
Manufacturers (ODMs) or second-tier manufacturers.
.. warning::
This may cause issues for organizations that have preferred
vendor policies or concerns with support and hardware warranties
of non-tier 1 vendors.
* 2U rack-mounted servers provide quad-socket, multi-core CPU support,
but with a corresponding decrease in server density (half the density
that 1U rack-mounted servers offer).
* Larger rack-mounted servers, such as 4U servers, often provide even
greater CPU capacity, commonly supporting four or even eight CPU
sockets. These servers have greater expandability, but such servers
have much lower server density and are often more expensive.
* ``Sled servers`` are rack-mounted servers that support multiple
independent servers in a single 2U or 3U enclosure. These deliver
higher density as compared to typical 1U or 2U rack-mounted servers.
For example, many sled servers offer four independent dual-socket
nodes in 2U for a total of eight CPU sockets in 2U.
Scaling your cloud
~~~~~~~~~~~~~~~~~~
When designing an OpenStack cloud compute server architecture, you must
decide whether you intend to scale up or scale out. Selecting a
smaller number of larger hosts, or a larger number of smaller hosts,
depends on a combination of factors: cost, power, cooling, physical rack
and floor space, support-warranty, and manageability. Typically, the scale-out
model has been popular for OpenStack because it limits the impact of any single
failure domain by spreading workloads across more infrastructure.
However, the downside is the cost of additional servers and the datacenter
resources needed to power, network, and cool the servers.

46
doc/arch-design/source/design-compute/design-compute-hypervisor.rst

@ -1,46 +0,0 @@
======================
Choosing a hypervisor
======================
A hypervisor provides software to manage virtual machine access to the
underlying hardware. The hypervisor creates, manages, and monitors
virtual machines. OpenStack Compute (nova) supports many hypervisors to various
degrees, including:
* `Ironic <https://docs.openstack.org/ironic/latest/>`_
* `KVM <https://www.linux-kvm.org/page/Main_Page>`_
* `LXC <https://linuxcontainers.org/>`_
* `QEMU <https://wiki.qemu.org/Main_Page>`_
* `VMware ESX/ESXi <https://www.vmware.com/support/vsphere-hypervisor.html>`_
* `Xen (using libvirt) <https://www.xenproject.org>`_
* `XenServer <https://xenserver.org>`_
* `Hyper-V
<https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/hyper-v-technology-overview>`_
* `PowerVM <https://www.ibm.com/us-en/marketplace/ibm-powervm>`_
* `UML <http://user-mode-linux.sourceforge.net>`_
* `Virtuozzo <https://www.virtuozzo.com/products/vz7.html>`_
* `zVM <https://www.ibm.com/it-infrastructure/z/zvm>`_
An important factor in your choice of hypervisor is your current organization's
hypervisor usage or experience. Also important is the hypervisor's feature
parity, documentation, and the level of community experience.
According to recent OpenStack user surveys, KVM is the most widely adopted
hypervisor in the OpenStack community. Besides KVM, many deployments
run other hypervisors such as LXC, VMware, Xen, and Hyper-V. However,
these hypervisors are either less used, are niche hypervisors, or have limited
functionality compared to more commonly used hypervisors.
.. note::
It is also possible to run multiple hypervisors in a single
deployment using host aggregates or cells. However, an individual
compute node can run only a single hypervisor at a time.
For more information about feature support for
hypervisors as well as ironic and Virtuozzo (formerly Parallels), see
`Hypervisor Support Matrix
<https://docs.openstack.org/nova/latest/user/support-matrix.html>`_
and `Hypervisors
<https://docs.openstack.org/ocata/config-reference/compute/hypervisors.html>`_
in the Configuration Reference.
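As a minimal sketch of where this choice surfaces in configuration, each
compute node selects its hypervisor through the Compute service driver
settings, for example in the common libvirt/KVM case:

.. code-block:: ini

   # nova.conf on a compute node -- libvirt/KVM example
   [DEFAULT]
   compute_driver = libvirt.LibvirtDriver

   [libvirt]
   # Use 'qemu' instead when hardware virtualization support is not
   # available, for example in nested or CI environments.
   virt_type = kvm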

105
doc/arch-design/source/design-compute/design-compute-logging.rst

@ -1,105 +0,0 @@
======================
Compute server logging
======================
The logs on the compute nodes, or any server running nova-compute (for example
in a hyperconverged architecture), are the primary points for troubleshooting
issues with the hypervisor and compute services. Additionally, operating system
logs can also provide useful information.
As the cloud environment grows, the amount of log data increases exponentially.
Enabling debugging on either the OpenStack services or the operating system
further compounds the data issues.
Logging is described in more detail in `Logging and Monitoring
<https://docs.openstack.org/operations-guide/ops-logging-monitoring.html>`_ in
the Operations Guide.
However, it is an important design consideration to take into account before
commencing operations of your cloud.
OpenStack produces a great deal of useful logging information, but for
the information to be useful for operations purposes, you should consider
having a central logging server to send logs to, and a log parsing/analysis
system such as Elastic Stack (formerly known as ELK).
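As a hedged example of the first step, each OpenStack service can be pointed
at the local syslog daemon using the standard ``oslo.log`` options, and syslog
in turn forwards to the central log server:

.. code-block:: ini

   # Any OpenStack service configuration file, for example nova.conf
   [DEFAULT]
   # Keep debug logging off in normal operation to keep log volume manageable.
   debug = False
   # Send service logs to the local syslog daemon, which can then be
   # configured (for example with rsyslog) to forward to the central
   # logging server.
   use_syslog = True
   syslog_log_facility = LOG_LOCAL0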
Elastic Stack consists of three main components: Elasticsearch (log search
and analysis), Logstash (log intake, processing, and output), and Kibana (log
dashboard service).
.. figure:: ../figures/ELKbasicArch.png
:align: center
:alt: Elastic Search Basic Architecture
Due to the amount of logs being sent from servers in the OpenStack environment,
an optional in-memory data structure store can be used. Common examples are
Redis and Memcached. In newer versions of Elastic Stack, a lightweight log
shipper called `Filebeat <https://www.elastic.co/products/beats/filebeat>`_ is
used for a similar purpose, but adds a "backpressure-sensitive" protocol when
sending data to Logstash or Elasticsearch.
Log analysis often requires disparate logs of differing formats. Elastic
Stack (namely Logstash) was created to take many different log inputs and
transform them into a consistent format that Elasticsearch can catalog and
analyze. As seen in the image above, ingestion starts on the servers with
Logstash, data is forwarded to the Elasticsearch server for storage and
searching, and is then displayed through Kibana for visual analysis and
interaction.
For instructions on installing Logstash, Elasticsearch and Kibana, see the
`Elasticsearch reference
<https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html>`_.
There are some specific configuration parameters that are needed to
configure Logstash for OpenStack. For example, in order to get Logstash to
collect, parse, and send the correct portions of log files to the Elasticsearch
server, you need to format the configuration file properly. There
are input, output, and filter configurations. Input configurations tell
Logstash where to receive data from (log files, forwarders, Filebeat, stdin, or
the Windows event log), output configurations specify where to send the data,
and filter configurations define which input contents to forward to the output.
The Logstash filter performs intermediary processing on each event. Conditional
filters are applied based on the characteristics of the input and the event.
Some examples of filtering are:
* grok
<