[arch-design] Set up book structure
1. Move mitaka changes temporarily into the arch-guide-draft-mitaka
   subdirectory
2. Set up book structure per Architecture Design Guide specification

Change-Id: I27e66561208647f8ead32fc3e5b333051cd92a42
Implements: blueprint arch-guide-restructure

@ -0,0 +1,402 @@
=============================
Capacity planning and scaling
=============================

An important consideration in running a cloud over time is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization of resources, the average rate of
adoption of cloud services through normal usage also needs to be carefully
monitored.

General storage considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A wide variety of operator-specific requirements dictates the nature of the
storage back end. Examples of such requirements are as follows:

* Public, private, or hybrid cloud, and associated SLA requirements
* The need for encryption at rest for data on storage nodes
* Whether live migration will be offered

We recommend that data be encrypted both in transit and at rest.
If you plan to use live migration, a shared storage configuration is highly
recommended.

Capacity planning for a multi-site cloud
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An OpenStack cloud can be designed in a variety of ways to handle individual
application needs. A multi-site deployment has additional challenges compared
to single-site installations.

When determining capacity options, take into account technical, economic, and
operational issues that might arise from specific decisions.

Inter-site link capacity describes the connectivity capability between
different OpenStack sites. This includes parameters such as
bandwidth, latency, whether or not a link is dedicated, and any business
policies applied to the connection. The capability and number of the
links between sites determine what kind of options are available for
deployment. For example, if two sites have a pair of high-bandwidth
links available between them, it may be wise to configure a separate
storage replication network between the two sites to support a single
swift endpoint and a shared Object Storage capability between them. An
example of this technique, as well as a configuration walk-through, is
available at
http://docs.openstack.org/developer/swift/replication_network.html#dedicated-replication-network.
Another option in this scenario is to build a dedicated set of tenant
private networks across the secondary link, using overlay networks with
a third party mapping the site overlays to each other.

The capacity requirements of the links between sites are driven by
application behavior. If the link latency is too high, certain
applications that use a large number of small packets, for example
:term:`RPC <Remote Procedure Call (RPC)>` API calls, may encounter
issues communicating with each other or operating properly. OpenStack
may also encounter similar types of issues. To mitigate this, the
Identity service provides service call timeout tuning to prevent issues
when authenticating against a central Identity service.
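
For illustration, one place this tuning is commonly applied is the
``keystone_authtoken`` middleware section in each service's configuration
file. The values below are assumptions for the sketch only; derive real
values from the observed inter-site latency:

.. code-block:: ini

   [keystone_authtoken]
   # Illustrative values only; tune for the measured inter-site latency.
   http_connect_timeout = 10
   http_request_max_retries = 3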

Another network capacity consideration for a multi-site deployment is
the amount and performance of overlay networks available for tenant
networks. If using shared tenant networks across zones, it is imperative
that an external overlay manager or controller be used to map these
overlays together. It is also necessary to ensure that the number of
possible IDs in each zone is identical.

.. note::

   As of the Kilo release, OpenStack Networking was not capable of
   managing tunnel IDs across installations. If one site runs out of
   IDs, but another does not, that tenant's network is unable to reach
   the other site.

The ability for a region to grow depends on scaling out the number of
available compute nodes. However, it may be necessary to grow cells in an
individual region, depending on the size of your cluster and the ratio of
virtual machines per hypervisor.

A third form of capacity comes in the multi-region-capable components of
OpenStack. Centralized Object Storage is capable of serving objects
through a single namespace across multiple regions. Since this works by
accessing the object store through the swift proxy, it is possible to
overload the proxies. There are two options available to mitigate this
issue:

* Deploy a large number of swift proxies. The drawback is that the
  proxies are not load-balanced and a large file request could
  continually hit the same proxy.

* Add a caching HTTP proxy and load balancer in front of the swift
  proxies. Since swift objects are returned to the requester via HTTP,
  this load balancer alleviates the load required on the swift
  proxies. A minimal load balancer sketch follows this list.
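
The following is a minimal sketch of such a load balancer, assuming HAProxy
and hypothetical host names and addresses; it is not a complete production
configuration:

.. code-block:: none

   # Minimal HAProxy sketch; host names and addresses are hypothetical.
   frontend swift_api
       bind 203.0.113.10:8080
       default_backend swift_proxies

   backend swift_proxies
       balance roundrobin
       # The swift healthcheck middleware answers on /healthcheck.
       option httpchk GET /healthcheck
       server swift-proxy-01 192.0.2.11:8080 check
       server swift-proxy-02 192.0.2.12:8080 check
       server swift-proxy-03 192.0.2.13:8080 check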

Capacity planning for a compute-focused cloud
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adding extra capacity to a compute-focused cloud is a horizontal scaling
process.

We recommend using similar CPUs when adding extra nodes to the environment.
This reduces the chance of breaking live-migration features if they are
present. Scaling out hypervisor hosts also has a direct effect on network
and other data center resources. We recommend you factor in this increase
when reaching rack capacity or when requiring extra network switches.

Changing the internal components of a Compute host to account for increases in
demand is a process known as vertical scaling. Swapping a CPU for one with more
cores, or increasing the memory in a server, can help add extra capacity for
running applications.

Another option is to assess the average workloads and increase the number of
instances that can run within the compute environment by adjusting the
overcommit ratio.

.. note::

   Changing the CPU overcommit ratio can have a detrimental effect and
   cause a potential increase in noisy neighbor issues.

The added risk of increasing the overcommit ratio is that more instances fail
when a compute host fails. We do not recommend increasing the CPU overcommit
ratio in a compute-focused OpenStack design, as it can increase the potential
for noisy neighbor issues.

Capacity planning for a hybrid cloud
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the primary reasons many organizations use a hybrid cloud is to
increase capacity without making large capital investments.

Capacity and the placement of workloads are key design considerations for
hybrid clouds. The long-term capacity plan for these designs must incorporate
growth over time to prevent permanent consumption of more expensive external
clouds. To avoid this scenario, account for future applications' capacity
requirements and plan growth appropriately.

It is difficult to predict the amount of load a particular application might
incur if the number of users fluctuates, or the application experiences an
unexpected increase in use. It is possible to define application requirements
in terms of vCPU, RAM, bandwidth, or other resources and plan appropriately.
However, other clouds might not use the same meter or even the same
oversubscription rates.

Oversubscription is a method to emulate more capacity than may physically be
present. For example, a physical hypervisor node with 32 GB RAM may host 24
instances, each provisioned with 2 GB RAM. As long as all 24 instances do not
concurrently use 2 full gigabytes, this arrangement works well. However, some
hosts take oversubscription to extremes and, as a result, performance can be
inconsistent. If at all possible, determine what the oversubscription rates
of each host are and plan capacity accordingly.

Block Storage
~~~~~~~~~~~~~

Configure Block Storage resource nodes with advanced RAID controllers
and high-performance disks to provide fault tolerance at the hardware
level.

Deploy high performing storage solutions such as SSD drives or
flash storage systems for applications requiring additional performance out
of Block Storage devices.

In environments that place substantial demands on Block Storage, we
recommend using multiple storage pools. In this case, each pool of
devices should have a similar hardware design and disk configuration
across all hardware nodes in that pool. This allows for a design that
provides applications with access to a wide variety of Block Storage
pools, each with their own redundancy, availability, and performance
characteristics. When deploying multiple pools of storage, it is also
important to consider the impact on the Block Storage scheduler, which is
responsible for provisioning storage across resource nodes. Ideally,
ensure that applications can schedule volumes in multiple regions, each with
their own network, power, and cooling infrastructure. This will give tenants
the option of building fault-tolerant applications that are distributed
across multiple availability zones.
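
As an illustration, multiple pools are typically exposed to the Block Storage
scheduler as separate back ends in ``cinder.conf``. The back-end names, volume
groups, and driver choice below are assumptions for the sketch only:

.. code-block:: ini

   # Hypothetical two-pool layout: one SSD pool and one capacity pool.
   [DEFAULT]
   enabled_backends = pool-ssd,pool-capacity

   [pool-ssd]
   volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
   volume_group = cinder-volumes-ssd
   volume_backend_name = pool-ssd

   [pool-capacity]
   volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
   volume_group = cinder-volumes-sata
   volume_backend_name = pool-capacity

A volume type mapped to each ``volume_backend_name`` then lets tenants select
a pool with the desired characteristics when creating a volume.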

In addition to the Block Storage resource nodes, it is important to
design for high availability and redundancy of the APIs, and related
services that are responsible for provisioning and providing access to
storage. We recommend designing a layer of hardware or software load
balancers in order to achieve high availability of the appropriate REST
API services to provide uninterrupted service. In some cases, it may
also be necessary to deploy an additional layer of load balancing to
provide access to back-end database services responsible for servicing
and storing the state of Block Storage volumes. It is imperative that a
highly available database cluster is used to store the Block
Storage metadata.

In a cloud with significant demands on Block Storage, the network
architecture should take into account the amount of East-West bandwidth
required for instances to make use of the available storage resources.
The selected network devices should support jumbo frames for
transferring large blocks of data, and utilize a dedicated network for
providing connectivity between instances and Block Storage.

Scaling Block Storage
---------------------

You can upgrade Block Storage pools to add storage capacity without
interrupting the overall Block Storage service. Add nodes to the pool by
installing and configuring the appropriate hardware and software and
then allowing that node to report in to the proper storage pool through the
message bus. Block Storage nodes generally report into the scheduler
service advertising their availability. As a result, after the node is
online and available, tenants can make use of those storage resources
instantly.
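
For example, you can confirm that a newly added node has registered with the
scheduler before relying on it; the command below is a sketch and the new node
should appear in the output with state ``up``:

.. code-block:: console

   $ openstack volume service list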

In some cases, the demand on Block Storage may exhaust the available
network bandwidth. As a result, design the network infrastructure that
services Block Storage resources in such a way that you can add capacity
and bandwidth easily. This often involves the use of dynamic routing
protocols or advanced networking solutions to add capacity to downstream
devices easily. Both the front-end and back-end storage network designs
should encompass the ability to quickly and easily add capacity and
bandwidth.

.. note::

   Sufficient monitoring and data collection should be in place
   from the start, so that timely decisions regarding capacity,
   input/output metrics (IOPS), or storage-associated bandwidth can
   be made.

Object Storage
~~~~~~~~~~~~~~

While consistency and partition tolerance are both inherent features of
the Object Storage service, it is important to design the overall
storage architecture to ensure that the implemented system meets those
goals. The OpenStack Object Storage service places a specific number of
data replicas as objects on resource nodes. Replicas are distributed
throughout the cluster, based on a consistent hash ring also stored on
each node in the cluster.

Design the Object Storage system with a sufficient number of zones to
provide quorum for the number of replicas defined. For example, with
three replicas configured in the swift cluster, the recommended number
of zones to configure within the Object Storage cluster in order to
achieve quorum is five. While it is possible to deploy a solution with
fewer zones, the implied risk of doing so is that some data may not be
available and API requests to certain objects stored in the cluster
might fail. For this reason, ensure you properly account for the number
of zones in the Object Storage cluster.

Each Object Storage zone should be self-contained within its own
availability zone. Each availability zone should have independent access
to network, power, and cooling infrastructure to ensure uninterrupted
access to data. In addition, a pool of Object Storage proxy servers
providing access to data stored on the object nodes should service each
availability zone. Object proxies in each region should leverage local
read and write affinity so that local storage resources facilitate
access to objects wherever possible. We recommend deploying upstream
load balancing to ensure that proxy services are distributed across the
multiple zones and, in some cases, it may be necessary to make use of
third-party solutions to aid with geographical distribution of services.
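
As a sketch, read and write affinity can be enabled in each region's
``proxy-server.conf``; the region numbering below is a hypothetical example
for a proxy located in region 1:

.. code-block:: ini

   [app:proxy-server]
   # Prefer the local region (r1) for reads, and write to local nodes
   # first; the region numbering here is hypothetical.
   sorting_method = affinity
   read_affinity = r1=100
   write_affinity = r1
   write_affinity_node_count = 2 * replicas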

A zone within an Object Storage cluster is a logical division. Any of
the following may represent a zone:

* A disk within a single node
* One zone per node
* Zone per collection of nodes
* Multiple racks
* Multiple data centers

Selecting the proper zone design is crucial for allowing the Object
Storage cluster to scale while providing an available and redundant
storage system. It may be necessary to configure storage policies that
have different requirements with regards to replicas, retention, and
other factors that could heavily affect the design of storage in a
specific zone.

Scaling Object Storage
----------------------

Adding back-end storage capacity to an Object Storage cluster requires
careful planning and forethought. In the design phase, it is important
to determine the maximum partition power required by the Object Storage
service, which determines the maximum number of partitions which can
exist. Object Storage distributes data among all available storage, but
a partition cannot span more than one disk, so the maximum number of
partitions can only be as high as the number of disks.

For example, a system that starts with a single disk and a partition
power of 3 can have 8 (2^3) partitions. Adding a second disk means that
each disk has 4 partitions. The one-disk-per-partition limit means that this
system can never have more than 8 disks, limiting its scalability.
However, a system that starts with a single disk and a partition power
of 10 can have up to 1024 (2^10) disks.
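
The partition power is fixed when a ring is first built, so choose it for the
largest disk count the cluster is ever expected to reach. A sketch of creating
an object ring with a partition power of 10, three replicas, and a one-hour
minimum between partition moves:

.. code-block:: console

   # swift-ring-builder <builder> create <part_power> <replicas> <min_part_hours>
   $ swift-ring-builder object.builder create 10 3 1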

As you add back-end storage capacity to the system, the partition maps
redistribute data amongst the storage nodes. In some cases, this
involves replication of extremely large data sets. In these cases, we
recommend using back-end replication links that do not contend with
tenants' access to data.

As more tenants begin to access data within the cluster and their data
sets grow, it is necessary to add front-end bandwidth to service data
access requests. Adding front-end bandwidth to an Object Storage cluster
requires careful planning and design of the Object Storage proxies that
tenants use to gain access to the data, along with the high availability
solutions that enable easy scaling of the proxy layer. We recommend
designing a front-end load balancing layer that tenants and consumers
use to gain access to data stored within the cluster. This load
balancing layer may be distributed across zones, regions, or even across
geographic boundaries, which may also require that the design encompass
geo-location solutions.

In some cases, you must add bandwidth and capacity to the network
resources servicing requests between proxy servers and storage nodes.
For this reason, the network architecture used for access to storage
nodes and proxy servers should make use of a design which is scalable.

Compute resource design
~~~~~~~~~~~~~~~~~~~~~~~

When designing compute resource pools, consider the number of processors,
amount of memory, and the quantity of storage required for each hypervisor.

Consider whether compute resources will be provided in a single pool or in
multiple pools. In most cases, multiple pools of resources can be allocated
and addressed on demand, commonly referred to as bin packing.

In a bin packing design, each independent resource pool provides service
for specific flavors. Since instances are scheduled onto compute hypervisors,
each independent node's resources will be allocated to efficiently use the
available hardware. Bin packing also requires a common hardware design,
with all hardware nodes within a compute resource pool sharing a common
processor, memory, and storage layout. This makes it easier to deploy,
support, and maintain nodes throughout their lifecycle.
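
One common way to dedicate a pool of hosts to specific flavors, assuming host
aggregates and the ``AggregateInstanceExtraSpecsFilter`` scheduler filter are
in use, is sketched below; the aggregate, host, and flavor names are
hypothetical:

.. code-block:: console

   # Group the hosts that make up the pool (names are hypothetical).
   $ openstack aggregate create highmem-pool
   $ openstack aggregate set --property pool=highmem highmem-pool
   $ openstack aggregate add host highmem-pool compute-21

   # Tie a flavor to that pool so its instances land only there.
   $ openstack flavor create --ram 65536 --vcpus 16 --disk 40 m1.highmem
   $ openstack flavor set --property aggregate_instance_extra_specs:pool=highmem m1.highmem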

Increasing the size of the supporting compute environment increases the
network traffic and messages, adding load to the controller or
networking nodes. Effective monitoring of the environment will help with
capacity decisions on scaling.

Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding
additional compute nodes to environments, ensure identical or functionally
compatible CPUs are used, otherwise live migration features will break.
It is necessary to add rack capacity or network switches as scaling out
compute hosts directly affects network and data center resources.

Compute host components can also be upgraded to account for increases in
demand, known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.

When selecting a processor, compare features and performance
characteristics. Some processors include features specific to
virtualized compute hosts, such as hardware-assisted virtualization, and
technology related to memory paging (also known as EPT shadowing). These
types of features can have a significant impact on the performance of
your virtual machine.

The number of processor cores and threads impacts the number of worker
threads which can be run on a resource node. Design decisions must
relate directly to the service being run on it, as well as provide a
balanced infrastructure for all services.

Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.

An overcommit ratio is the ratio of available virtual resources to
available physical resources. This ratio is configurable for CPU and
memory. The default CPU overcommit ratio is 16:1, and the default memory
overcommit ratio is 1.5:1. Determining the tuning of the overcommit
ratios during the design phase is important as it has a direct impact on
the hardware layout of your compute nodes.

.. note::

   Changing the CPU overcommit ratio can have a detrimental effect
   and cause a potential increase in noisy neighbor issues.
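
The defaults above correspond to the allocation ratio options in
``nova.conf``; the sketch below simply restates them, and any other values
should be derived from your own workload measurements:

.. code-block:: ini

   [DEFAULT]
   # Default ratios; raise or lower them based on measured workloads.
   cpu_allocation_ratio = 16.0
   ram_allocation_ratio = 1.5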

Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, capacity includes
adding disk shelves to enterprise storage systems or installing
additional Block Storage nodes. Upgrading directly attached storage
installed in compute hosts, and adding capacity to the shared storage
for additional ephemeral storage to instances, may be necessary.

Consider the compute requirements of non-hypervisor nodes (also referred to as
resource nodes). This includes controller, Object Storage, and Block Storage
nodes, and networking services.

The ability to add compute resource pools for unpredictable workloads should
be considered. In some cases, the demand for certain instance types or flavors
may not justify individual hardware design. Allocate hardware designs that are
capable of servicing the most common instance requests. Adding hardware to the
overall architecture can be done later.

For more information on these topics, refer to the `OpenStack
Operations Guide <http://docs.openstack.org/ops>`_.

.. TODO Add information on control plane API services and horizon.
@ -0,0 +1,190 @@
.. _high-availability:

=================
High availability
=================

.. toctree::
   :maxdepth: 2

Data Plane and Control Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When designing an OpenStack cloud, it is important to consider the needs
dictated by the :term:`Service Level Agreement (SLA)` in terms of the core
services required to maintain availability of running Compute service
instances, networks, storage, and additional services running on top of those
resources. These services are often referred to as the Data Plane services,
and are generally expected to be available all the time.

The remaining services, responsible for CRUD operations, metering, monitoring,
and so on, are often referred to as the Control Plane. The SLA is likely to
dictate a lower uptime requirement for these services.

The services comprising an OpenStack cloud have a number of requirements that
the architect needs to understand in order to be able to meet SLA terms. For
example, in order to provide the Compute service a minimum of storage, message
queueing, and database services are necessary, as well as the networking
between them.

Ongoing maintenance operations are made much simpler if there is logical and
physical separation of Data Plane and Control Plane systems. It then becomes
possible to, for example, reboot a controller without affecting customers.
If one service failure affects the operation of an entire server (a 'noisy
neighbor'), the separation between Control and Data Planes enables rapid
maintenance with a limited effect on customer operations.

Eliminating Single Points of Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within each site
----------------

OpenStack lends itself to deployment in a highly available manner where it is
expected that at least 2 servers be utilized. These can run all the services
involved, from the message queuing service, for example ``RabbitMQ`` or
``QPID``, to an appropriately deployed database service such as ``MySQL`` or
``MariaDB``. As services in the cloud are scaled out, back-end services will
need to scale too. Monitoring and reporting on server utilization and response
times, as well as load testing your systems, will help determine scale out
decisions.

The OpenStack services themselves should be deployed across multiple servers
that do not represent a single point of failure. Ensuring availability can
be achieved by placing these services behind highly available load balancers
that have multiple OpenStack servers as members.

There are a small number of OpenStack services which are intended to only run
in one place at a time (for example, the ``ceilometer-agent-central`` service).
In order to prevent these services from becoming a single point of failure,
they can be controlled by clustering software such as ``Pacemaker``.
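
As a sketch, assuming a Pacemaker cluster managed with ``pcs`` and a systemd
unit named ``openstack-ceilometer-central``, such a service can be registered
as a single, non-cloned resource so that exactly one copy runs somewhere in
the cluster at any time:

.. code-block:: console

   # Hypothetical resource name; leave the resource un-cloned so that
   # Pacemaker keeps only one instance of the service running.
   $ pcs resource create ceilometer-central systemd:openstack-ceilometer-central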

In OpenStack, the infrastructure is integral to providing services and should
always be available, especially when operating with SLAs. Ensuring network
availability is accomplished by designing the network architecture so that no
single point of failure exists. Factor the number of switches, routes, and
redundant power supplies into the core infrastructure, as well as the
associated bonding of networks to provide diverse routes to your
highly available switch infrastructure.

Care must be taken when deciding network functionality. Currently, OpenStack
supports both the legacy networking (nova-network) system and the newer,
extensible OpenStack Networking (neutron). OpenStack Networking and legacy
networking both have their advantages and disadvantages. They are both valid
and supported options that fit different network deployment models described in
the `OpenStack Operations Guide
<http://docs.openstack.org/openstack-ops/content/network_design.html#network_deployment_options>`_.

When using the Networking service, the OpenStack controller servers or separate
Networking hosts handle routing unless the dynamic virtual routers pattern for
routing is selected. Running routing directly on the controller servers mixes
the Data and Control Planes and can cause complex issues with performance and
troubleshooting. It is possible to use third-party software and external
appliances that help maintain highly available layer three routes. Doing so
allows for common application endpoints to control network hardware, or to
provide complex multi-tier web applications in a secure manner. It is also
possible to completely remove routing from Networking, and instead rely on
hardware routing capabilities. In this case, the switching infrastructure must
support layer three routing.

Application design must also be factored into the capabilities of the
underlying cloud infrastructure. If the compute hosts do not provide a seamless
live migration capability, then it must be expected that if a compute host
fails, the instances and any data local to those instances will be lost.
However, when providing users with an expectation that instances have a
high level of uptime guaranteed, the infrastructure must be deployed in a way
that eliminates any single point of failure if a compute host disappears.
This may include utilizing shared file systems on enterprise storage or
OpenStack Block Storage to provide a level of guarantee to match service
features.

If using a storage design that includes shared access to centralized storage,
ensure that this is also designed without single points of failure and that the
SLA for the solution matches or exceeds the expected SLA for the Data Plane.

Between sites in a multi-region design
--------------------------------------

Some services are commonly shared between multiple regions, including the
Identity service and the Dashboard. In this case, it is necessary to ensure
that the databases backing the services are replicated, and that access to
multiple workers across each site can be maintained in the event of losing a
single region.

Multiple network links should be deployed between sites to provide redundancy
for all components. This includes storage replication, which should be isolated
to a dedicated network or VLAN with the ability to assign QoS to control the
replication traffic or provide priority for this traffic. Note that if the data
store is highly changeable, the network requirements could have a significant
effect on the operational cost of maintaining the sites.

If the design incorporates more than one site, the ability to maintain object
availability in both sites has significant implications for the object storage
design and implementation. It also has a significant impact on the WAN network
design between the sites.

If applications running in a cloud are not cloud-aware, there should be clear
measures and expectations to define what the infrastructure can and cannot
support. An example would be shared storage between sites. It is possible,
however such a solution is not native to OpenStack and requires a third-party
hardware vendor to fulfill such a requirement. Another example can be seen in
applications that are able to consume resources in object storage directly.

Connecting more than two sites increases the challenges and adds more
complexity to the design considerations. Multi-site implementations require
planning to address the additional topology used for internal and external
connectivity. Some options include full mesh, hub and spoke, spine and leaf,
and 3D torus topologies.

For more information on high availability in OpenStack, see the `OpenStack High
Availability Guide <http://docs.openstack.org/ha-guide/>`_.

Site loss and recovery
~~~~~~~~~~~~~~~~~~~~~~

Outages can cause partial or full loss of site functionality. Strategies
should be implemented to understand and plan for recovery scenarios.

* The deployed applications need to continue to function and, more
  importantly, you must consider the impact on the performance and
  reliability of the application if a site is unavailable.

* It is important to understand what happens to the replication of
  objects and data between the sites when a site goes down. If this
  causes queues to start building up, consider how long these queues
  can safely exist until an error occurs.

* After an outage, ensure that operations of a site are resumed when it
  comes back online. We recommend that you architect the recovery to
  avoid race conditions.

Inter-site replication data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traditionally, replication has been the best method of protecting object store
implementations. A variety of replication methods exist in storage
architectures, for example synchronous and asynchronous mirroring. Most object
stores and back-end storage systems implement methods for replication at the
storage subsystem layer. Object stores also tailor replication techniques to
fit a cloud's requirements.

Organizations must find the right balance between data integrity and data
availability. Replication strategy may also influence disaster recovery
methods.

Replication across different racks, data centers, and geographical regions
increases focus on determining and ensuring data locality. The ability to
guarantee data is accessed from the nearest or fastest storage can be necessary
for applications to perform well.

.. note::

   When running embedded object store methods, ensure that you do not
   instigate extra data replication as this may cause performance issues.
@ -0,0 +1,56 @@
.. meta::
   :description: This guide targets OpenStack Architects
                 for architectural design
   :keywords: Architecture, OpenStack

===================================
OpenStack Architecture Design Guide
===================================

Abstract
~~~~~~~~

To reap the benefits of OpenStack, you should plan, design,
and architect your cloud properly, taking users' needs into
account and understanding the use cases.

.. TODO rewrite the abstract

Contents
~~~~~~~~

.. toctree::
   :maxdepth: 2

   common/conventions.rst
   introduction.rst
   identifying-stakeholders.rst
   technical-requirements.rst
   customer-requirements.rst
   operator-requirements.rst
   capacity-planning-scaling.rst
   high-availability.rst
   security-requirements.rst
   legal-requirements.rst
   arch-examples.rst

Appendix
~~~~~~~~

.. toctree::
   :maxdepth: 1

   common/app_support.rst

Glossary
~~~~~~~~

.. toctree::
   :maxdepth: 1

   common/glossary.rst

Search in this guide
~~~~~~~~~~~~~~~~~~~~

* :ref:`search`
@ -1,402 +1,3 @@
|
||||
=============================
|
||||
Capacity planning and scaling
|
||||
=============================
|
||||
|
||||
An important consideration in running a cloud over time is projecting growth
|
||||
and utilization trends in order to plan capital expenditures for the short and
|
||||
long term. Gather utilization meters for compute, network, and storage, along
|
||||
with historical records of these meters. While securing major anchor tenants
|
||||
can lead to rapid jumps in the utilization of resources, the average rate of
|
||||
adoption of cloud services through normal usage also needs to be carefully
|
||||
monitored.
|
||||
|
||||
General storage considerations
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
A wide variety of operator-specific requirements dictates the nature of the
|
||||
storage back end. Examples of such requirements are as follows:
|
||||
|
||||
* Public, private or a hybrid cloud, and associated SLA requirements
|
||||
* The need for encryption-at-rest, for data on storage nodes
|
||||
* Whether live migration will be offered
|
||||
|
||||
We recommend that data be encrypted both in transit and at-rest.
|
||||
If you plan to use live migration, a shared storage configuration is highly
|
||||
recommended.
|
||||
|
||||
Capacity planning for a multi-site cloud
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
An OpenStack cloud can be designed in a variety of ways to handle individual
|
||||
application needs. A multi-site deployment has additional challenges compared
|
||||
to single site installations.
|
||||
|
||||
When determining capacity options, take into account technical, economic and
|
||||
operational issues that might arise from specific decisions.
|
||||
|
||||
Inter-site link capacity describes the connectivity capability between
|
||||
different OpenStack sites. This includes parameters such as
|
||||
bandwidth, latency, whether or not a link is dedicated, and any business
|
||||
policies applied to the connection. The capability and number of the
|
||||
links between sites determine what kind of options are available for
|
||||
deployment. For example, if two sites have a pair of high-bandwidth
|
||||
links available between them, it may be wise to configure a separate
|
||||
storage replication network between the two sites to support a single
|
||||
swift endpoint and a shared Object Storage capability between them. An
|
||||
example of this technique, as well as a configuration walk-through, is
|
||||
available at
|
||||
http://docs.openstack.org/developer/swift/replication_network.html#dedicated-replication-network.
|
||||
Another option in this scenario is to build a dedicated set of tenant
|
||||
private networks across the secondary link, using overlay networks with
|
||||
a third party mapping the site overlays to each other.
|
||||
|
||||
The capacity requirements of the links between sites is driven by
|
||||
application behavior. If the link latency is too high, certain
|
||||
applications that use a large number of small packets, for example
|
||||
:term:`RPC <Remote Procedure Call (RPC)>` API calls, may encounter
|
||||
issues communicating with each other or operating
|
||||
properly. OpenStack may also encounter similar types of issues.
|
||||
To mitigate this, the Identity service provides service call timeout
|
||||
tuning to prevent issues authenticating against a central Identity services.
|
||||
|
||||
Another network capacity consideration for a multi-site deployment is
|
||||
the amount and performance of overlay networks available for tenant
|
||||
networks. If using shared tenant networks across zones, it is imperative
|
||||
that an external overlay manager or controller be used to map these
|
||||
overlays together. It is necessary to ensure the amount of possible IDs
|
||||
between the zones are identical.
|
||||
|
||||
.. note::
|
||||
|
||||
As of the Kilo release, OpenStack Networking was not capable of
|
||||
managing tunnel IDs across installations. So if one site runs out of
|
||||
IDs, but another does not, that tenant's network is unable to reach
|
||||
the other site.
|
||||
|
||||
The ability for a region to grow depends on scaling out the number of
|
||||
available compute nodes. However, it may be necessary to grow cells in an
|
||||
individual region, depending on the size of your cluster and the ratio of
|
||||
virtual machines per hypervisor.
|
||||
|
||||
A third form of capacity comes in the multi-region-capable components of
|
||||
OpenStack. Centralized Object Storage is capable of serving objects
|
||||
through a single namespace across multiple regions. Since this works by
|
||||
accessing the object store through swift proxy, it is possible to
|
||||
overload the proxies. There are two options available to mitigate this
|
||||
issue:
|
||||
|
||||
* Deploy a large number of swift proxies. The drawback is that the
|
||||
proxies are not load-balanced and a large file request could
|
||||
continually hit the same proxy.
|
||||
|
||||
* Add a caching HTTP proxy and load balancer in front of the swift
|
||||
proxies. Since swift objects are returned to the requester via HTTP,
|
||||
this load balancer alleviates the load required on the swift
|
||||
proxies.
|
||||
|
||||
Capacity planning for a compute-focused cloud
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Adding extra capacity to an compute-focused cloud is a horizontally scaling
|
||||
process.
|
||||
|
||||
We recommend using similar CPUs when adding extra nodes to the environment.
|
||||
This reduces the chance of breaking live-migration features if they are
|
||||
present. Scaling out hypervisor hosts also has a direct effect on network
|
||||
and other data center resources. We recommend you factor in this increase
|
||||
when reaching rack capacity or when requiring extra network switches.
|
||||
|
||||
Changing the internal components of a Compute host to account for increases in
|
||||
demand is a process known as vertical scaling. Swapping a CPU for one with more
|
||||
cores, or increasing the memory in a server, can help add extra capacity for
|
||||
running applications.
|
||||
|
||||
Another option is to assess the average workloads and increase the number of
|
||||
instances that can run within the compute environment by adjusting the
|
||||
overcommit ratio.
|
||||
|
||||
.. note::
|
||||
It is important to remember that changing the CPU overcommit ratio can
|
||||
have a detrimental effect and cause a potential increase in a noisy
|
||||
neighbor.
|
||||
|
||||
The added risk of increasing the overcommit ratio is that more instances fail
|
||||
when a compute host fails. We do not recommend that you increase the CPU
|
||||
overcommit ratio in compute-focused OpenStack design architecture. It can
|
||||
increase the potential for noisy neighbor issues.
|
||||
|
||||
Capacity planning for a hybrid cloud
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
One of the primary reasons many organizations use a hybrid cloud is to
|
||||
increase capacity without making large capital investments.
|
||||
|
||||
Capacity and the placement of workloads are key design considerations for
|
||||
hybrid clouds. The long-term capacity plan for these designs must incorporate
|
||||
growth over time to prevent permanent consumption of more expensive external
|
||||
clouds. To avoid this scenario, account for future applications’ capacity
|
||||
requirements and plan growth appropriately.
|
||||
|
||||
It is difficult to predict the amount of load a particular application might
|
||||
incur if the number of users fluctuate, or the application experiences an
|
||||
unexpected increase in use. It is possible to define application requirements
|
||||
in terms of vCPU, RAM, bandwidth, or other resources and plan appropriately.
|
||||
However, other clouds might not use the same meter or even the same
|
||||
oversubscription rates.
|
||||
|
||||
Oversubscription is a method to emulate more capacity than may physically be
|
||||
present. For example, a physical hypervisor node with 32 GB RAM may host 24
|
||||
instances, each provisioned with 2 GB RAM. As long as all 24 instances do not
|
||||
concurrently use 2 full gigabytes, this arrangement works well. However, some
|
||||
hosts take oversubscription to extremes and, as a result, performance can be
|
||||
inconsistent. If at all possible, determine what the oversubscription rates
|
||||
of each host are and plan capacity accordingly.
|
||||
|
||||
Block Storage
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
Configure Block Storage resource nodes with advanced RAID controllers
|
||||
and high-performance disks to provide fault tolerance at the hardware
|
||||
level.
|
||||
|
||||
Deploy high performing storage solutions such as SSD drives or
|
||||
flash storage systems for applications requiring additional performance out
|
||||
of Block Storage devices.
|
||||
|
||||
In environments that place substantial demands on Block Storage, we
|
||||
recommend using multiple storage pools. In this case, each pool of
|
||||
devices should have a similar hardware design and disk configuration
|
||||
across all hardware nodes in that pool. This allows for a design that
|
||||
provides applications with access to a wide variety of Block Storage
|
||||
pools, each with their own redundancy, availability, and performance
|
||||
characteristics. When deploying multiple pools of storage, it is also
|
||||
important to consider the impact on the Block Storage scheduler which is
|
||||
responsible for provisioning storage across resource nodes. Ideally,
|
||||
ensure that applications can schedule volumes in multiple regions, each with
|
||||
their own network, power, and cooling infrastructure. This will give tenants
|
||||
the option of building fault-tolerant applications that are distributed
|
||||
across multiple availability zones.
|
||||
|
||||
In addition to the Block Storage resource nodes, it is important to
|
||||
design for high availability and redundancy of the APIs, and related
|
||||
services that are responsible for provisioning and providing access to
|
||||
storage. We recommend designing a layer of hardware or software load
|
||||
balancers in order to achieve high availability of the appropriate REST
|
||||
API services to provide uninterrupted service. In some cases, it may
|
||||
also be necessary to deploy an additional layer of load balancing to
|
||||
provide access to back-end database services responsible for servicing
|
||||
and storing the state of Block Storage volumes. It is imperative that a
|
||||
highly available database cluster is used to store the Block
|
||||
Storage metadata.
|
||||
|
||||
In a cloud with significant demands on Block Storage, the network
|
||||
architecture should take into account the amount of East-West bandwidth
|
||||
required for instances to make use of the available storage resources.
|
||||
The selected network devices should support jumbo frames for
|
||||
transferring large blocks of data, and utilize a dedicated network for
|
||||
providing connectivity between instances and Block Storage.
|
||||
|
||||
Scaling Block Storage
|
||||
---------------------
|
||||
|
||||
You can upgrade Block Storage pools to add storage capacity without
|
||||
interrupting the overall Block Storage service. Add nodes to the pool by
|
||||
installing and configuring the appropriate hardware and software and
|
||||
then allowing that node to report in to the proper storage pool through the
|
||||
message bus. Block Storage nodes generally report into the scheduler
|
||||
service advertising their availability. As a result, after the node is
|
||||
online and available, tenants can make use of those storage resources
|
||||
instantly.
|
||||
|
||||
In some cases, the demand on Block Storage may exhaust the available
|
||||
network bandwidth. As a result, design network infrastructure that
|
||||
services Block Storage resources in such a way that you can add capacity
|
||||
and bandwidth easily. This often involves the use of dynamic routing
|
||||
protocols or advanced networking solutions to add capacity to downstream
|
||||
devices easily. Both the front-end and back-end storage network designs
|
||||
should encompass the ability to quickly and easily add capacity and
|
||||
bandwidth.
|
||||
|
||||
.. note::
|
||||
|
||||
Sufficient monitoring and data collection should be in-place
|
||||
from the start, such that timely decisions regarding capacity,
|
||||
input/output metrics (IOPS) or storage-associated bandwidth can
|
||||
be made.
|
||||
|
||||
Object Storage
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
While consistency and partition tolerance are both inherent features of
|
||||
the Object Storage service, it is important to design the overall
|
||||
storage architecture to ensure that the implemented system meets those
|
||||
goals. The OpenStack Object Storage service places a specific number of
|
||||
data replicas as objects on resource nodes. Replicas are distributed
|
||||
throughout the cluster, based on a consistent hash ring also stored on
|
||||
each node in the cluster.
|
||||
|
||||
Design the Object Storage system with a sufficient number of zones to
|
||||
provide quorum for the number of replicas defined. For example, with
|
||||
three replicas configured in the swift cluster, the recommended number
|
||||
of zones to configure within the Object Storage cluster in order to
|
||||
achieve quorum is five. While it is possible to deploy a solution with
|
||||
fewer zones, the implied risk of doing so is that some data may not be
|
||||
available and API requests to certain objects stored in the cluster
|
||||
might fail. For this reason, ensure you properly account for the number
|
||||
of zones in the Object Storage cluster.
|
||||
|
||||
Each Object Storage zone should be self-contained within its own
|
||||
availability zone. Each availability zone should have independent access
|
||||
to network, power, and cooling infrastructure to ensure uninterrupted
|
||||
access to data. In addition, a pool of Object Storage proxy servers
|
||||
providing access to data stored on the object nodes should service each
|
||||
availability zone. Object proxies in each region should leverage local
|
||||
read and write affinity so that local storage resources facilitate
|
||||
access to objects wherever possible. We recommend deploying upstream
|
||||
load balancing to ensure that proxy services are distributed across the
|
||||
multiple zones and, in some cases, it may be necessary to make use of
|
||||
third-party solutions to aid with geographical distribution of services.
|
||||
|
||||
A zone within an Object Storage cluster is a logical division. Any of
|
||||
the following may represent a zone:
|
||||
|
||||
* A disk within a single node
|
||||
* One zone per node
|
||||
* Zone per collection of nodes
|
||||
* Multiple racks
|
||||
* Multiple data centers
|
||||
|
||||
Selecting the proper zone design is crucial for allowing the Object
|
||||
Storage cluster to scale while providing an available and redundant
|
||||
storage system. It may be necessary to configure storage policies that
|
||||
have different requirements with regards to replicas, retention, and
|
||||
other factors that could heavily affect the design of storage in a
|
||||
specific zone.
|
||||
|
||||
Scaling Object Storage
|
||||
----------------------
|
||||
|
||||
Adding back-end storage capacity to an Object Storage cluster requires
|
||||
careful planning and forethought. In the design phase, it is important
|
||||
to determine the maximum partition power required by the Object Storage
|
||||
service, which determines the maximum number of partitions which can
|
||||
exist. Object Storage distributes data among all available storage, but
|
||||
a partition cannot span more than one disk, so the maximum number of
|
||||
partitions can only be as high as the number of disks.
|
||||
|
||||
For example, a system that starts with a single disk and a partition
|
||||
power of 3 can have 8 (2^3) partitions. Adding a second disk means that
|
||||
each has 4 partitions. The one-disk-per-partition limit means that this
|
||||
system can never have more than 8 disks, limiting its scalability.
|
||||
However, a system that starts with a single disk and a partition power
|
||||
of 10 can have up to 1024 (2^10) disks.
|
||||
|
||||
As you add back-end storage capacity to the system, the partition maps
|
||||
redistribute data amongst the storage nodes. In some cases, this
|
||||
involves replication of extremely large data sets. In these cases, we
|
||||
recommend using back-end replication links that do not contend with
|
||||
tenants' access to data.
|
||||
|
||||
As more tenants begin to access data within the cluster and their data
|
||||
sets grow, it is necessary to add front-end bandwidth to service data
|
||||
access requests. Adding front-end bandwidth to an Object Storage cluster
|
||||
requires careful planning and design of the Object Storage proxies that
|
||||
tenants use to gain access to the data, along with the high availability
|
||||
solutions that enable easy scaling of the proxy layer. We recommend
|
||||
designing a front-end load balancing layer that tenants and consumers
|
||||
use to gain access to data stored within the cluster. This load
|
||||
balancing layer may be distributed across zones, regions or even across
|
||||
geographic boundaries, which may also require that the design encompass
|
||||
geo-location solutions.
|
||||
|
||||
In some cases, you must add bandwidth and capacity to the network
|
||||
resources servicing requests between proxy servers and storage nodes.
|
||||
For this reason, the network architecture used for access to storage
|
||||
nodes and proxy servers should make use of a design which is scalable.
|
||||
|
||||
Compute resource design
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When designing compute resource pools, consider the number of processors,
|
||||
amount of memory, and the quantity of storage required for each hypervisor.
|
||||
|
||||
Consider whether compute resources will be provided in a single pool or in
|
||||
multiple pools. In most cases, multiple pools of resources can be allocated
|
||||
and addressed on demand, commonly referred to as bin packing.
|
||||
|
||||
In a bin packing design, each independent resource pool provides service
|
||||
for specific flavors. Since instances are scheduled onto compute hypervisors,
|
||||
each independent node's resources will be allocated to efficiently use the
|
||||
available hardware. Bin packing also requires a common hardware design,
|
||||
with all hardware nodes within a compute resource pool sharing a common
|
||||
processor, memory, and storage layout. This makes it easier to deploy,
|
||||
support, and maintain nodes throughout their lifecycle.
|
||||
|
||||
Increasing the size of the supporting compute environment increases the
|
||||
network traffic and messages, adding load to the controller or
|
||||
networking nodes. Effective monitoring of the environment will help with
|
||||
capacity decisions on scaling.
|
||||
|
||||
Compute nodes automatically attach to OpenStack clouds, resulting in a
|
||||
horizontally scaling process when adding extra compute capacity to an
|
||||
OpenStack cloud. Additional processes are required to place nodes into
|
||||
appropriate availability zones and host aggregates. When adding
|
||||
additional compute nodes to environments, ensure identical or functional
|
||||
compatible CPUs are used, otherwise live migration features will break.
|
||||
It is necessary to add rack capacity or network switches as scaling out
|
||||
compute hosts directly affects network and data center resources.

Compute host components can also be upgraded to account for increases in
demand, known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.

When selecting a processor, compare features and performance
characteristics. Some processors include features specific to
virtualized compute hosts, such as hardware-assisted virtualization, and
technology related to memory paging (also known as EPT shadowing). These
types of features can have a significant impact on the performance of
your virtual machine.
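
On Linux compute hosts you can usually confirm that these processor features
are present before deploying a hypervisor. As a rough check, a non-zero count
from either command below indicates support:

.. code-block:: console

   # Hardware-assisted virtualization (Intel VT-x or AMD-V)
   $ grep -E -c '(vmx|svm)' /proc/cpuinfo
   # Extended page table support on Intel CPUs
   $ grep -c ept /proc/cpuinfo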

The number of processor cores and threads impacts the number of worker
threads which can be run on a resource node. Design decisions must
relate directly to the service being run on the node, as well as provide a
balanced infrastructure for all services.
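
Many OpenStack services expose their worker process counts as configuration
options. As an illustration, the following ``nova.conf`` values size the
Compute API, metadata, and conductor workers; the counts shown are examples
only, and most services default to the number of available cores:

.. code-block:: ini

   [DEFAULT]
   osapi_compute_workers = 8
   metadata_workers = 8

   [conductor]
   workers = 8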

Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.

An overcommit ratio is the ratio of available virtual resources to
available physical resources. This ratio is configurable for CPU and
memory. The default CPU overcommit ratio is 16:1, and the default memory
overcommit ratio is 1.5:1. Determining the tuning of the overcommit
ratios during the design phase is important as it has a direct impact on
the hardware layout of your compute nodes.
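
For example, at the default ratios a compute node with 24 physical cores and
128 GB of RAM presents roughly 384 vCPUs (24 x 16) and 192 GB of virtual
memory (128 x 1.5) to the scheduler. The ratios can be tuned per compute node
in ``nova.conf``; the values below simply restate the defaults:

.. code-block:: ini

   [DEFAULT]
   cpu_allocation_ratio = 16.0
   ram_allocation_ratio = 1.5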

.. note::

   Changing the CPU overcommit ratio can have a detrimental effect
   and cause a potential increase in noisy neighbor issues.

Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, adding capacity may mean
adding disk shelves to enterprise storage systems or installing
additional block storage nodes. Upgrading directly attached storage
installed in compute hosts, and adding capacity to the shared storage
for additional ephemeral storage to instances, may be necessary.

Consider the compute requirements of non-hypervisor nodes (also referred to as
resource nodes). This includes controller, object storage, block storage, and
networking service nodes.

The ability to add compute resource pools for unpredictable workloads should
be considered. In some cases, the demand for certain instance types or flavors
may not justify individual hardware design. Allocate hardware designs that are
capable of servicing the most common instance requests. Adding hardware to the
overall architecture can be done later.

For more information on these topics, refer to the `OpenStack
Operations Guide <http://docs.openstack.org/ops>`_.

.. TODO Add information on control plane API services and horizon.

@ -90,7 +90,7 @@ html_context = {"gitsha": gitsha, "bug_tag": bug_tag,

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['common/cli*', 'common/nova*', 'common/get_started_*',
                    'common/dashboard_customizing.rst']
                    'common/dashboard_customizing.rst', 'arch-guide-draft-mitaka']

# The reST default role (used for this markup: `text`) to use for all
# documents.

doc/arch-design-draft/source/design.rst (new file)
@ -0,0 +1,24 @@

======
Design
======

Compute service
~~~~~~~~~~~~~~~

Storage
~~~~~~~

Networking service
~~~~~~~~~~~~~~~~~~

Identity service
~~~~~~~~~~~~~~~~

Image service
~~~~~~~~~~~~~

Control Plane
~~~~~~~~~~~~~

Dashboard and APIs
~~~~~~~~~~~~~~~~~~

@ -1,190 +1,3 @@

.. _high-availability:

=================
High availability
High Availability
=================

.. toctree::
   :maxdepth: 2

Data Plane and Control Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When designing an OpenStack cloud, it is important to consider the needs
dictated by the :term:`Service Level Agreement (SLA)` in terms of the core
services required to maintain availability of running Compute service
instances, networks, storage, and additional services running on top of those
resources. These services are often referred to as the Data Plane services,
and are generally expected to be available all the time.

The remaining services, responsible for CRUD operations, metering, monitoring,
and so on, are often referred to as the Control Plane. The SLA is likely to
dictate a lower uptime requirement for these services.

The services comprising an OpenStack cloud have a number of requirements which
the architect needs to understand in order to be able to meet SLA terms. For
example, in order to provide the Compute service a minimum of storage, message
queueing, and database services are necessary, as well as the networking
between them.

Ongoing maintenance operations are made much simpler if there is logical and
physical separation of Data Plane and Control Plane systems. It then becomes
possible to, for example, reboot a controller without affecting customers.
If one service failure affects the operation of an entire server ('noisy
neighbor'), the separation between Control and Data Planes enables rapid
maintenance with a limited effect on customer operations.

Eliminating Single Points of Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within each site
----------------

OpenStack lends itself to deployment in a highly available manner where it is
expected that at least two servers be utilized. These can run all the services
involved from the message queuing service, for example ``RabbitMQ`` or
``QPID``, and an appropriately deployed database service such as ``MySQL`` or
``MariaDB``. As services in the cloud are scaled out, back-end services will
need to scale too. Monitoring and reporting on server utilization and response
times, as well as load testing your systems, will help determine scale out
decisions.

The OpenStack services themselves should be deployed across multiple servers
that do not represent a single point of failure. Ensuring availability can
be achieved by placing these services behind highly available load balancers
that have multiple OpenStack servers as members.
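
A minimal sketch of such a load balancing layer, using HAProxy in front of the
Compute API as an example, is shown below. The addresses, port, and member
names are placeholders, and an equivalent stanza would exist for each
load-balanced API endpoint:

.. code-block:: none

   listen nova-api
       bind 192.168.1.100:8774
       balance roundrobin
       option tcpka
       option httpchk
       server controller1 192.168.1.11:8774 check inter 2000 rise 2 fall 5
       server controller2 192.168.1.12:8774 check inter 2000 rise 2 fall 5
       server controller3 192.168.1.13:8774 check inter 2000 rise 2 fall 5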

There are a small number of OpenStack services which are intended to only run
in one place at a time (e.g. the ``ceilometer-agent-central`` service). In
order to prevent these services from becoming a single point of failure, they
can be controlled by clustering software such as ``Pacemaker``.
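
As a hedged example, assuming the agent is packaged with a systemd unit named
``openstack-ceilometer-central``, a single-instance Pacemaker resource could
be created with the ``pcs`` shell:

.. code-block:: console

   # pcs resource create ceilometer-central systemd:openstack-ceilometer-central

Pacemaker then ensures that exactly one copy of the service runs somewhere in
the cluster, and restarts it on another node if its current host fails.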

In OpenStack, the infrastructure is integral to providing services and should
always be available, especially when operating with SLAs. Ensuring network
availability is accomplished by designing the network architecture so that no
single point of failure exists. A consideration of the number of switches,
routes and redundancies of power should be factored into core infrastructure,
as well as the associated bonding of networks to provide diverse routes to your
highly available switch infrastructure.

Care must be taken when deciding network functionality. Currently, OpenStack
supports both the legacy networking (nova-network) system and the newer,
extensible OpenStack Networking (neutron). OpenStack Networking and legacy
networking both have their advantages and disadvantages. They are both valid
and supported options that fit different network deployment models described in
the `OpenStack Operations Guide
<http://docs.openstack.org/openstack-ops/content/network_design.html#network_deployment_options>`_.

When using the Networking service, the OpenStack controller servers or separate
Networking hosts handle routing unless the dynamic virtual routers pattern for
routing is selected. Running routing directly on the controller servers mixes
the Data and Control Planes and can cause complex issues with performance and
troubleshooting. It is possible to use third-party software and external
appliances that help maintain highly available layer three routes. Doing so
allows for common application endpoints to control network hardware, or to
provide complex multi-tier web applications in a secure manner. It is also
possible to completely remove routing from Networking, and instead rely on
hardware routing capabilities. In this case, the switching infrastructure must
support layer three routing.

Application design must also be factored into the capabilities of the
underlying cloud infrastructure. If the compute hosts do not provide a seamless
live migration capability, then it must be expected that if a compute host
fails, the instances running on that host and any data local to them will be
lost. However, when providing an expectation to users that instances have a
high level of uptime guaranteed, the infrastructure must be deployed in a way
that eliminates any single point of failure if a compute host disappears.
This may include utilizing shared file systems on enterprise storage or
OpenStack Block Storage to provide a level of guarantee to match service
features.

If using a storage design that includes shared access to centralized storage,
ensure that this is also designed without single points of failure and the SLA
for the solution matches or exceeds the expected SLA for the Data Plane.

Between sites in a multi-region design
--------------------------------------

Some services are commonly shared between multiple regions, including the
Identity service and the Dashboard. In this case, it is necessary to ensure
that the databases backing the services are replicated, and that access to
multiple workers across each site can be maintained in the event of losing a
single region.

Multiple network links should be deployed between sites to provide redundancy
for all components. This includes storage replication, which should be isolated
to a dedicated network or VLAN with the ability to assign QoS to control the
replication traffic or provide priority for this traffic. Note that if the data
store is highly changeable, the network requirements could have a significant
effect on the operational cost of maintaining the sites.

If the design incorporates more than one site, the ability to maintain object
availability in both sites has significant implications on the object storage
design and implementation. It also has a significant impact on the WAN network
design between the sites.

If applications running in a cloud are not cloud-aware, there should be clear
measures and expectations to define what the infrastructure can and cannot
support. An example would be shared storage between sites. This is possible,
however, such a solution is not native to OpenStack and requires a third-party
hardware vendor to fulfill such a requirement. Another example can be seen in
applications that are able to consume resources in object storage directly.

Connecting more than two sites increases the challenges and adds more
complexity to the design considerations. Multi-site implementations require
planning to address the additional topology used for internal and external
connectivity. Some options include full mesh topology, hub and spoke, spine
and leaf, and 3D Torus.

For more information on high availability in OpenStack, see the `OpenStack High
Availability Guide <http://docs.openstack.org/ha-guide/>`_.

Site loss and recovery
~~~~~~~~~~~~~~~~~~~~~~

Outages can cause partial or full loss of site functionality. Strategies
should be implemented to understand and plan for recovery scenarios.

* The deployed applications need to continue to function and, more
  importantly, you must consider the impact on the performance and
  reliability of the application if a site is unavailable.

* It is important to understand what happens to the replication of
  objects and data between the sites when a site goes down. If this
  causes queues to start building up, consider how long these queues
  can safely exist until an error occurs.

* After an outage, ensure that operations of a site are resumed when it
  comes back online. We recommend that you architect the recovery to
  avoid race conditions.

Inter-site replication data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traditionally, replication has been the best method of protecting object store
implementations. A variety of replication methods exist in storage
architectures, for example synchronous and asynchronous mirroring. Most object
stores and back-end storage systems implement methods for replication at the
storage subsystem layer. Object stores also tailor replication techniques to
fit a cloud's requirements.

Organizations must find the right balance between data integrity and data
availability. Replication strategy may also influence disaster recovery
methods.

Replication across different racks, data centers, and geographical regions
increases focus on determining and ensuring data locality. The ability to
guarantee data is accessed from the nearest or fastest storage can be necessary
for applications to perform well.
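
In an Object Storage deployment that spans regions, for instance, read and
write affinity can be configured on the proxy servers so that requests favor
the local region first. The following is a sketch of the relevant
``proxy-server.conf`` settings, assuming region ``r1`` is the local region:

.. code-block:: ini

   [app:proxy-server]
   sorting_method = affinity
   # Prefer devices in region 1 for reads, falling back to other regions
   read_affinity = r1=100
   write_affinity = r1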

.. note::

   When running embedded object store methods, ensure that you do not
   instigate extra data replication as this may cause performance issues.

@ -10,11 +10,11 @@ OpenStack Architecture Design Guide

Abstract
~~~~~~~~

To reap the benefits of OpenStack, you should plan, design,
and architect your cloud properly, taking users' needs into
account and understanding the use cases.

.. TODO rewrite the abstract
This guide provides information on planning and designing an OpenStack
cloud. It describes common use cases, high availability, and considerations
when changing capacity and scaling your cloud environment. A breakdown of the
major OpenStack components is also described in relation to cloud architecture
design.

Contents
~~~~~~~~

@ -23,16 +23,11 @@ Contents

   :maxdepth: 2

   common/conventions.rst
   introduction.rst
   identifying-stakeholders.rst
   technical-requirements.rst
   customer-requirements.rst
   operator-requirements.rst
   capacity-planning-scaling.rst
   overview.rst
   use-cases.rst
   high-availability.rst
   security-requirements.rst
   legal-requirements.rst
   arch-examples.rst
   capacity-planning-scaling.rst
   design.rst

Appendix
~~~~~~~~

doc/arch-design-draft/source/overview.rst (new file)
@ -0,0 +1,3 @@

========
Overview
========

doc/arch-design-draft/source/use-cases.rst (new file)
@ -0,0 +1,15 @@

=========
Use Cases
=========

Development Cloud
~~~~~~~~~~~~~~~~~

General Compute Cloud
~~~~~~~~~~~~~~~~~~~~~

Web Scale Cloud
~~~~~~~~~~~~~~~

Public Cloud
~~~~~~~~~~~~