[arch-design] Update capacity planning and scaling chapter

Consolidate capacity planning and scaling content from current guide to the updated arch-guide

Change-Id: I2520954a3b2a67337445615d982263513872b1f5
Closes-Bug: #1548179
2016-04-11 17:08:50 +10:00 · 2016-04-11 17:08:50 +10:00 · 4619ae9b19
commit 4619ae9b19
parent 098ab6546b
1 changed files with 130 additions and 12 deletions
--- a/doc/arch-design-draft/source/capacity-planning-scaling.rst
+++ b/doc/arch-design-draft/source/capacity-planning-scaling.rst
@ -10,13 +10,12 @@ can lead to rapid jumps in the utilization of resources, the average rate of
 adoption of cloud services through normal usage also needs to be carefully
 monitored.

-
 General storage considerations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A wide variety of operator-specific requirements dictates the nature of the
 storage back end. Examples of such requirements are as follows:

-* Public or private cloud, and associated SLA requirements
+* Public, private, or hybrid cloud, and associated SLA requirements
 * The need for encryption-at-rest, for data on storage nodes
 * Whether live migration will be offered

@ -24,6 +23,133 @@ We recommend that data be encrypted both in transit and at-rest.
 If you plan to use live migration, a shared storage configuration is highly
 recommended.

+Capacity planning for a multi-site cloud
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+An OpenStack cloud can be designed in a variety of ways to handle individual
+application needs. A multi-site deployment has additional challenges compared
+to single-site installations.
+
+When determining capacity options, take into account technical, economic and
+operational issues that might arise from specific decisions.
+
+Inter-site link capacity describes the connectivity capability between
+different OpenStack sites. This includes parameters such as
+bandwidth, latency, whether or not a link is dedicated, and any business
+policies applied to the connection. The capability and number of the
+links between sites determine what kind of options are available for
+deployment. For example, if two sites have a pair of high-bandwidth
+links available between them, it may be wise to configure a separate
+storage replication network between the two sites to support a single
+swift endpoint and a shared Object Storage capability between them. An
+example of this technique, as well as a configuration walk-through, is
+available at
+http://docs.openstack.org/developer/swift/replication_network.html#dedicated-replication-network.
+Another option in this scenario is to build a dedicated set of tenant
+private networks across the secondary link, using overlay networks with
+a third party mapping the site overlays to each other.
+
+The capacity requirements of the links between sites are driven by
+application behavior. If the link latency is too high, certain
+applications that use a large number of small packets, for example
+:term:`RPC <Remote Procedure Call (RPC)>` API calls, may encounter
+issues communicating with each other or operating
+properly. OpenStack may also encounter similar types of issues.
+To mitigate this, the Identity service provides service call timeout
+tuning to prevent issues authenticating against a central Identity service.
+
+Another network capacity consideration for a multi-site deployment is
+the number and performance of overlay networks available for tenant
+networks. If using shared tenant networks across zones, it is imperative
+that an external overlay manager or controller be used to map these
+overlays together. It is also necessary to ensure that the number of
+possible IDs is identical in each zone.
+
+.. note::
+
+   As of the Kilo release, OpenStack Networking was not capable of
+   managing tunnel IDs across installations. So if one site runs out of
+   IDs, but another does not, that tenant's network is unable to reach
+   the other site.
+
+The ability for a region to grow depends on scaling out the number of
+available compute nodes. However, it may be necessary to grow cells in an
+individual region, depending on the size of your cluster and the ratio of
+virtual machines per hypervisor.
+
+A third form of capacity comes in the multi-region-capable components of
+OpenStack. Centralized Object Storage is capable of serving objects
+through a single namespace across multiple regions. Since this works by
+accessing the object store through the swift proxy, it is possible to
+overload the proxies. There are two options available to mitigate this
+issue:
+
+* Deploy a large number of swift proxies. The drawback is that the
+  proxies are not load-balanced and a large file request could
+  continually hit the same proxy.
+
+* Add a caching HTTP proxy and load balancer in front of the swift
+  proxies. Since swift objects are returned to the requester via HTTP,
+  this load balancer reduces the load placed on the swift
+  proxies.
+
+Capacity planning for a compute-focused cloud
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Adding extra capacity to a compute-focused cloud is a horizontal scaling
+process.
+
+We recommend using similar CPUs when adding extra nodes to the environment.
+This reduces the chance of breaking live-migration features if they are
+present. Scaling out hypervisor hosts also has a direct effect on network
+and other data center resources. We recommend you factor in this increase
+when reaching rack capacity or when requiring extra network switches.
+
+Changing the internal components of a Compute host to account for increases in
+demand is a process known as vertical scaling. Swapping a CPU for one with more
+cores, or increasing the memory in a server, can help add extra capacity for
+running applications.
+
+Another option is to assess the average workloads and increase the number of
+instances that can run within the compute environment by adjusting the
+overcommit ratio.
+
+.. note::
+   It is important to remember that changing the CPU overcommit ratio can
+   have a detrimental effect and increase the potential for noisy neighbor
+   issues.
+
+The added risk of increasing the overcommit ratio is that more instances fail
+when a compute host fails. We do not recommend that you increase the CPU
+overcommit ratio in a compute-focused OpenStack design architecture, because
+it can increase the potential for noisy neighbor issues.
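
The trade-off described above can be illustrated with a rough sketch. The host size, instance size, and the ratios tried here are hypothetical values chosen only for illustration, and the ``max_instances`` helper is not part of any OpenStack API:

```python
# Hypothetical host figures, chosen only for illustration.
PHYSICAL_CORES = 32
VCPUS_PER_INSTANCE = 2


def max_instances(cores, ratio, vcpus_each):
    """Instances that fit on one host at a given CPU overcommit ratio."""
    return int(cores * ratio) // vcpus_each


for ratio in (1.0, 4.0, 16.0):
    count = max_instances(PHYSICAL_CORES, ratio, VCPUS_PER_INSTANCE)
    # Every instance on the host is affected if that host fails.
    print(f"ratio {ratio}: up to {count} instances share one failure domain")
```

A higher ratio packs more instances onto each host, which raises both the number of instances lost in a single host failure and the contention between neighbors.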
+
+Capacity planning for a hybrid cloud
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+One of the primary reasons many organizations use a hybrid cloud is to
+increase capacity without making large capital investments.
+
+Capacity and the placement of workloads are key design considerations for
+hybrid clouds. The long-term capacity plan for these designs must incorporate
+growth over time to prevent permanent consumption of more expensive external
+clouds. To avoid this scenario, account for future applications’ capacity
+requirements and plan growth appropriately.
+
+It is difficult to predict the amount of load a particular application might
+incur if the number of users fluctuates, or the application experiences an
+unexpected increase in use. It is possible to define application requirements
+in terms of vCPU, RAM, bandwidth, or other resources and plan appropriately.
+However, other clouds might not use the same meter or even the same
+oversubscription rates.
+
+Oversubscription is a method to emulate more capacity than may physically be
+present. For example, a physical hypervisor node with 32 GB RAM may host 24
+instances, each provisioned with 2 GB RAM. As long as all 24 instances do not
+concurrently use 2 full gigabytes, this arrangement works well. However, some
+hosts take oversubscription to extremes and, as a result, performance can be
+inconsistent. If at all possible, determine the oversubscription rate
+of each host and plan capacity accordingly.
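
As a minimal sketch of the arithmetic in the example above, the following computes the provisioned RAM and the resulting oversubscription ratio. The function name ``ram_oversubscription`` is hypothetical and used only for illustration:

```python
def ram_oversubscription(physical_gb, instances, gb_per_instance):
    """Return (provisioned RAM in GB, ratio of provisioned to physical RAM)."""
    provisioned_gb = instances * gb_per_instance
    return provisioned_gb, provisioned_gb / physical_gb


# Figures from the example: a 32 GB host running 24 instances of 2 GB each.
provisioned, ratio = ram_oversubscription(32, 24, 2)
print(f"{provisioned} GB provisioned on 32 GB physical, ratio {ratio}:1")
```

With these figures, 48 GB is provisioned against 32 GB of physical RAM, a ratio of 1.5:1, so the host copes only while the instances together use no more than two thirds of what they were allocated.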
+
 Block Storage
 ~~~~~~~~~~~~~

@ -186,11 +312,6 @@ resources servicing requests between proxy servers and storage nodes.
 For this reason, the network architecture used for access to storage
 nodes and proxy servers should make use of a design which is scalable.

-
-Network
-~~~~~~~
-.. TODO(unassigned): consolidate and update existing network sub-chapters.
-
 Compute resource design
 ~~~~~~~~~~~~~~~~~~~~~~~

@ -278,7 +399,4 @@ overall architecture can be done later.
 For more information on these topics, refer to the `OpenStack
 Operations Guide <http://docs.openstack.org/ops>`_.

-Control plane API services and Horizon
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. No existing control plane sub-chapters in the current guide.
+.. TODO Add information on control plane API services and horizon.