diff --git a/doc/arch-design/ch_massively_scalable.xml b/doc/arch-design/ch_massively_scalable.xml index 7d51996ccd..7fa6dd52fe 100644 --- a/doc/arch-design/ch_massively_scalable.xml +++ b/doc/arch-design/ch_massively_scalable.xml @@ -6,33 +6,31 @@ xml:id="massively_scalable"> Massively scalable - A massively scalable architecture is defined as a cloud + A massively scalable architecture is a cloud implementation that is either a very large deployment, such as - one that would be built by a commercial service provider, or + a commercial service provider might build, or one that has the capability to support user requests for large - amounts of cloud resources. An example would be an - infrastructure in which requests to service 500 instances or - more at a time is not uncommon. In a massively scalable - infrastructure, such a request is fulfilled without completely - consuming all of the available cloud infrastructure resources. - While the high capital cost of implementing such a cloud - architecture makes it cost prohibitive and is only spearheaded - by few organizations, many organizations are planning for - massive scalability moving toward the future. + amounts of cloud resources. An example is an + infrastructure in which requests to service 500 or more instances + at a time are common. A massively scalable infrastructure + fulfills such a request without exhausting the available + cloud infrastructure resources. While the high capital cost + of implementing such a cloud architecture means that it is + currently in limited use, many organizations are planning + for massive scalability in the future. A massively scalable OpenStack cloud design presents a unique set of challenges and considerations. For the most part it is similar to a general purpose cloud architecture, as it is built to address a non-specific range of potential use - cases or functions. Typically, it is rare that massively - scalable clouds are designed or specialized for particular - workloads. Like the general purpose cloud, the massively + cases or functions. It is rare that particular + workloads determine the design or configuration of massively + scalable clouds. Like the general purpose cloud, the massively scalable cloud is most often built as a platform for a variety - of workloads. Massively scalable OpenStack clouds are - generally built as commercial public cloud offerings since - single private organizations rarely have the resources or need - for this scale. + of workloads. Because private organizations rarely have the need + or the resources for them, massively scalable OpenStack clouds + are generally built as commercial, public cloud offerings. Services provided by a massively scalable OpenStack cloud - will include: + include: Virtual-machine disk image library @@ -64,12 +62,12 @@ Like a general purpose cloud, the instances deployed in a - massively scalable OpenStack cloud will not necessarily use + massively scalable OpenStack cloud do not necessarily use any specific aspect of the cloud offering (compute, network, - or storage). As the cloud grows in scale, the scale of the - number of workloads can cause stress on all of the cloud - components. Additional stresses are introduced to supporting - infrastructure including databases and message brokers. The + or storage). As the cloud grows in scale, the number of + workloads can cause stress on all the cloud + components. This adds further stresses to supporting + infrastructure such as databases and message brokers. 
The architecture design for such a cloud must account for these performance pressures without negatively impacting user experience. diff --git a/doc/arch-design/massively_scalable/section_operational_considerations_massively_scalable.xml b/doc/arch-design/massively_scalable/section_operational_considerations_massively_scalable.xml index c999835b2d..7a082d0dda 100644 --- a/doc/arch-design/massively_scalable/section_operational_considerations_massively_scalable.xml +++ b/doc/arch-design/massively_scalable/section_operational_considerations_massively_scalable.xml @@ -6,35 +6,35 @@ xml:id="operational-considerations-massive-scale"> Operational considerations - In order to run at massive scale, it is important to plan on - the automation of as many of the operational processes as + In order to run efficiently at massive scale, automate + as many of the operational processes as possible. Automation includes the configuration of provisioning, monitoring, and alerting systems. Part of the automation process includes the capability to determine when human intervention is required and who should act. The objective is to increase the ratio of running systems to - operational staff as much as possible to reduce maintenance + operational staff as much as possible in order to reduce maintenance costs. In a massively scaled environment, it is impossible for staff to give each system individual care. - Configuration management tools such as Puppet or Chef allow + Configuration management tools such as Puppet and Chef enable operations staff to categorize systems into groups based on - their role and thus create configurations and system states - that are enforced through the provisioning system. Systems + their roles and thus create configurations and system states + that the provisioning system enforces. Systems that fall out of the defined state due to errors or failures are quickly removed from the pool of active nodes and replaced. - At large scale the resource cost of diagnosing individual - systems that have failed is far greater than the cost of - replacement. It is more economical to immediately replace the - system with a new system that can be provisioned and - configured automatically and quickly brought back into the + At large scale, the resource cost of diagnosing individual + failed systems is far greater than the cost of + replacement. It is more economical to replace the failed + system with a new system, provisioning and configuring it + automatically, and then quickly adding it to the pool of active nodes. By automating tasks that are labor-intensive, - repetitive, and critical to operations with - automation, cloud operations teams are able to be managed more - efficiently because fewer resources are needed for these - babysitting tasks. Administrators are then free to tackle - tasks that cannot be easily automated and have longer-term - impacts on the business such as capacity planning. + repetitive, and critical to operations, cloud operations + teams can work more + efficiently because fewer resources are required for these + common tasks. Administrators are then free to tackle + tasks that are not easy to automate and that have longer-term + impacts on the business, such as capacity planning.
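The replace-rather-than-diagnose pattern described above is straightforward to automate. The following is a minimal Python sketch of such a remediation loop; the inventory, health-check, and provisioning calls are hypothetical stubs standing in for your monitoring system and a Puppet- or Chef-driven provisioning pipeline, not part of any OpenStack API.

```python
import time

POLL_INTERVAL = 60  # seconds between health sweeps

# Hypothetical integration points, stubbed so the sketch runs standalone.
# In a real deployment these would call the monitoring system and the
# provisioning pipeline (for example, PXE boot followed by a Puppet run).
def get_active_nodes():
    return [{"name": "compute01", "role": "compute"}]

def is_healthy(node):
    return True

def remove_from_pool(node):
    print("removing %s from the active pool" % node["name"])

def provision_replacement(role):
    return {"name": "compute02", "role": role}

def add_to_pool(node):
    print("adding %s to the active pool" % node["name"])

def remediation_sweep():
    """One pass: replace failed nodes rather than diagnosing them in place."""
    for node in get_active_nodes():
        if not is_healthy(node):
            remove_from_pool(node)
            add_to_pool(provision_replacement(node["role"]))

if __name__ == "__main__":
    while True:
        remediation_sweep()
        time.sleep(POLL_INTERVAL)
```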
The bleeding edge Running OpenStack at massive scale requires striking a @@ -42,49 +42,48 @@ be tempting to run an older stable release branch of OpenStack to make deployments easier. However, when running at massive scale, known issues that may be of some concern or only have - minimal impact in smaller deployments could become pain points - at massive scale. If the issue is well known, in many cases, - it may be resolved in more recent releases. The OpenStack - community can help resolve any issues reported by applying + minimal impact in smaller deployments could become pain points. + Recent releases may address well-known issues. The OpenStack + community can help resolve reported issues by applying the collective expertise of the OpenStack developers. - When issues crop up, the number of organizations running at - a similar scale is a relatively tiny proportion of the - OpenStack community, therefore it is important to share these - issues with the community and be a vocal advocate for + The number of organizations running at + massive scale is a small proportion of the + OpenStack community; therefore, it is important to share + related issues with the community and be a vocal advocate for resolving them. Some issues only manifest when operating at - large scale and the number of organizations able to duplicate - and validate an issue is small, so it will be important to + large scale, and the number of organizations able to duplicate + and validate an issue is small, so it is important to document and dedicate resources to their resolution. In some cases, the resolution to the problem is ultimately to deploy a more recent version of OpenStack. Alternatively, - when the issue needs to be resolved in a production + when you must resolve an issue in a production environment where rebuilding the entire environment is not an - option, it is possible to deploy just the more recent separate - underlying components required to resolve issues or gain - significant performance improvements. At first glance, this - could be perceived as potentially exposing the deployment to - increased risk and instability. However, in many cases it - could be an issue that has not been discovered yet. - It is advisable to cultivate a development and operations + option, it is sometimes possible to deploy updates to specific + underlying components in order to resolve issues or gain + significant performance improvements. Although this may appear + to expose the deployment to + increased risk and instability, in many cases the risk + is already present in the form of undiscovered issues. + We recommend building a development and operations organization that is responsible for creating desired - features, diagnose and resolve issues, and also build the + features, diagnosing and resolving issues, and building the infrastructure for large-scale continuous integration testing and continuous deployment. This helps catch bugs early and
+ makes deployments faster and easier. In addition to + development resources, we recommend recruiting experts + in the fields of message queues, databases, distributed + systems, networking, cloud, and storage.
Growth and capacity planning An important consideration in running at massive scale is - projecting growth and utilization trends to plan capital - expenditures for the near and long term. Utilization metrics - for compute, network, and storage as well as a historical - record of these metrics are required. While securing major + projecting growth and utilization trends in order to plan capital + expenditures for the short and long term. Gather utilization + metrics for compute, network, and storage, along with historical + records of these metrics. While securing major anchor tenants can lead to rapid jumps in the utilization rates of all resources, the steady adoption of the cloud - inside an organizations or by public consumers in a public - offering will also create a steady trend of increased + inside an organization or by consumers in a public + offering also drives a steady increase in utilization.
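As one illustration of trending, this minimal Python sketch fits a least-squares line to hypothetical monthly utilization averages and projects when the deployment crosses a planning threshold; the sample numbers and the simple linear model are assumptions for illustration, not part of any OpenStack tooling.

```python
# Capacity-trending sketch: fit a straight line to monthly average
# utilization (as a fraction of total capacity) and project when the
# cloud crosses a planning threshold. The sample data is hypothetical.
monthly_utilization = [0.42, 0.45, 0.47, 0.52, 0.55, 0.58]
threshold = 0.80  # plan to add capacity before 80% utilization

n = len(monthly_utilization)
xs = list(range(n))
mean_x = sum(xs) / float(n)
mean_y = sum(monthly_utilization) / n
slope = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(xs, monthly_utilization))
         / sum((x - mean_x) ** 2 for x in xs))

if slope > 0:
    months_left = (threshold - monthly_utilization[-1]) / slope
    print("Utilization grows by %.1f points/month; about %.1f months "
          "until %.0f%% utilization." % (slope * 100, months_left,
                                         threshold * 100))
```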
Skills and training @@ -95,8 +94,8 @@ members to OpenStack conferences, meetup events, and encouraging active participation in the mailing lists and committees is a very important way to maintain skills and - forge relationships in the community. A list of OpenStack - training providers in the marketplace can be found here: http://www.openstack.org/marketplace/training/. + forge relationships in the community. See + http://www.openstack.org/marketplace/training/ for a list of + OpenStack training providers in the marketplace.
diff --git a/doc/arch-design/massively_scalable/section_tech_considerations_massively_scalable.xml b/doc/arch-design/massively_scalable/section_tech_considerations_massively_scalable.xml index 5f257d973a..fe9bd1b1b1 100644 --- a/doc/arch-design/massively_scalable/section_tech_considerations_massively_scalable.xml +++ b/doc/arch-design/massively_scalable/section_tech_considerations_massively_scalable.xml @@ -10,119 +10,114 @@ xml:id="technical-considerations-massive-scale"> Technical considerations - Converting an existing OpenStack environment that was - designed for a different purpose to be massively scalable is a - formidable task. When building a massively scalable - environment from the ground up, make sure the initial - deployment is built with the same principles and choices that - apply as the environment grows. For example, a good approach - is to deploy the first site as a multi-site environment. This - allows the same deployment and segregation methods to be used - as the environment grows to separate locations across - dedicated links or wide area networks. In a hyperscale cloud, - scale trumps redundancy. Applications must be modified with - this in mind, relying on the scale and homogeneity of the + Repurposing an existing OpenStack environment to be + massively scalable is a formidable task. When building + a massively scalable environment from the ground up, ensure + you build the initial deployment with the same principles + and choices that apply as the environment grows. For example, + a good approach is to deploy the first site as a multi-site + environment. This enables you to use the same deployment + and segregation methods as the environment grows to separate + locations across dedicated links or wide area networks. In + a hyperscale cloud, scale trumps redundancy. Modify applications + with this in mind, relying on the scale and homogeneity of the environment to provide reliability, rather than on redundant infrastructure built from non-commodity hardware.
Infrastructure segregation - Fortunately, OpenStack services are designed to support - massive horizontal scale. Be aware that this is not the case - for the entire supporting infrastructure. This is particularly - a problem for the database management systems and message - queues used by the various OpenStack services for data storage - and remote procedure call communications. - Traditional clustering techniques are typically used to + OpenStack services support massive horizontal scale. + Be aware that this is not the case for the entire supporting + infrastructure. This is particularly a problem for the database + management systems and message queues that OpenStack services + use for data storage and remote procedure call communications. + Traditional clustering techniques typically provide high availability and some additional scale for these - environments. In the quest for massive scale, however, - additional steps need to be taken to relieve the performance - pressure on these components to prevent them from negatively - impacting the overall performance of the environment. It is - important to make sure that all the components are in balance - so that, if and when the massively scalable environment fails, - all the components are at, or close to, maximum + environments. In the quest for massive scale, however, you must + take additional steps to relieve the performance + pressure on these components in order to prevent them from negatively + impacting the overall performance of the environment. Ensure + that all the components are in balance so that, if and when the + massively scalable environment reaches its limits, all the + components are at, or near, maximum capacity. - Regions are used to segregate completely independent + Regions segregate completely independent installations linked only by a shared Identity service and, - (optional) installation. Services are installed with separate - API endpoints for each region, complete with separate database + optionally, a shared Dashboard installation. Services have separate + API endpoints for each region, and include separate database and queue installations. This exposes some awareness of the environment's fault domains to users and gives them the ability to ensure some degree of application resiliency while - also imposing the requirement to specify which region their - actions must be applied to. + also imposing the requirement to specify which region to apply + their actions to. Environments operating at massive scale typically need their regions or sites subdivided further without exposing the requirement to specify the failure domain to the user. This provides the ability to further divide the installation into failure domains while also providing a logical unit for maintenance and the addition of new hardware. At hyperscale, - instead of adding single compute nodes, administrators may add + instead of adding single compute nodes, administrators can add entire racks or even groups of racks at a time with each new addition of nodes exposed via one of the segregation concepts mentioned herein. Cells provide the ability to subdivide the compute portion of an OpenStack installation, including regions, while still - exposing a single endpoint. In each region an API cell is - created along with a number of compute cells where the - workloads actually run. Each cell gets its own database and + exposing a single endpoint. Each region has an API cell + along with a number of compute cells where the + workloads actually run. 
Each cell has its own database and message queue setup (ideally clustered), providing the ability to subdivide the load on these subsystems, improving overall performance. - Within each compute cell a complete compute installation is - provided, complete with full database and queue installations, + Each compute cell provides a complete compute installation, + with full database and queue installations, scheduler, conductor, and multiple compute hosts. The cells scheduler handles placement of user requests from the single API endpoint to a specific cell from those available. The normal filter scheduler then handles placement within the cell. - The downside of using cells is that they are not well - supported by any of the OpenStack services other than Compute. - Also, they do not adequately support some relatively standard + Unfortunately, Compute is the only OpenStack service that + provides good support for cells. In addition, cells + do not adequately support some standard OpenStack functionality such as security groups and host aggregates. Due to their relative newness and specialized use, - they receive relatively little testing in the OpenStack gate. - Despite these issues, however, cells are used in some very - well known OpenStack installations operating at massive scale - including those at CERN and Rackspace.
+ cells receive relatively little testing in the OpenStack gate. + Despite these issues, cells play an important role in + well known OpenStack installations operating at massive scale, + such as those at CERN and Rackspace.
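To make the cells layout concrete, the following nova.conf excerpts sketch the cells v1 configuration style, with one API cell fronting one compute cell; the cell names are illustrative assumptions, and a real deployment also registers each cell's database and message queue details with the nova-manage cell create command.

```ini
# nova.conf in the API cell (runs the API endpoint and cells scheduler).
# Sketch only; cell names are illustrative.
[cells]
enable = True
name = api
cell_type = api
```

```ini
# nova.conf in a compute cell (its own database, queue, scheduler,
# conductor, and compute hosts). Sketch only.
[cells]
enable = True
name = cell1
cell_type = compute
```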
Host aggregates Host aggregates enable partitioning of OpenStack Compute deployments into logical groups for load balancing and - instance distribution. Host aggregates may also be used to + instance distribution. You can also use host aggregates to further partition an availability zone. Consider a cloud which might use host aggregates to partition an availability zone into groups of hosts that either share common resources, such as storage and network, or have a special property, such as - trusted computing hardware. Host aggregates are not explicitly - user-targetable; instead they are implicitly targeted via the - selection of instance flavors with extra specifications that - map to host aggregate metadata.
+ trusted computing hardware. You cannot target host aggregates + explicitly. Instead, select instance flavors that map to host + aggregate metadata. These flavors target host aggregates + implicitly.
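As an illustration of this implicit targeting, the following Python sketch uses python-novaclient to group SSD-backed hosts into an aggregate and create a flavor whose extra specs match the aggregate metadata; the credentials, names, and host are placeholder assumptions, and the scheduler must have the AggregateInstanceExtraSpecsFilter enabled for the mapping to take effect.

```python
from novaclient import client

# Placeholder credentials and endpoint -- replace with real values.
nova = client.Client("2", "admin", "secret", "admin",
                     "http://controller:5000/v2.0")

# Group SSD-backed hosts into an aggregate and tag it with metadata.
agg = nova.aggregates.create("ssd-hosts", "zone-1")
nova.aggregates.add_host(agg, "compute01")   # placeholder host name
nova.aggregates.set_metadata(agg, {"ssd": "true"})

# Create a flavor whose extra specs match the aggregate metadata.
# Users simply choose this flavor; the scheduler places the instance
# on matching hosts, so the aggregate itself is never user-visible.
flavor = nova.flavors.create("m1.ssd", ram=8192, vcpus=4, disk=80)
flavor.set_keys({"aggregate_instance_extra_specs:ssd": "true"})
```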
Availability zones Availability zones provide another mechanism for subdividing an installation or region. They are, in effect, host - aggregates that are exposed for (optional) explicit targeting + aggregates exposed for (optional) explicit targeting by users. - Unlike cells, they do not have their own database server or - queue broker but simply represent an arbitrary grouping of - compute nodes. Typically, grouping of nodes into availability - zones is based on a shared failure domain based on a physical - characteristic such as a shared power source, physical network - connection, and so on. Availability zones are exposed to the - user because they can be targeted; however, users are not - required to target them. An alternate approach is for the - operator to set a default availability zone to schedule - instances to other than the default availability zone of - nova.
+ Unlike cells, availability zones do not have their own database + server or queue broker but represent an arbitrary grouping of + compute nodes. Typically, nodes are grouped into availability + zones using a shared failure domain based on a physical + characteristic such as a shared power source or physical network + connections. Users can target exposed availability zones; however, + this is not a requirement. An alternative approach is for the + operator to configure a default availability zone so that + instances schedule to a zone other than the built-in default + zone, nova.
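For example, because each availability zone is defined through a host aggregate, an operator can create one zone per power feed and then point the scheduler at a preferred default in nova.conf; the zone name below is a placeholder, while default_schedule_zone is the relevant Compute option.

```ini
# nova.conf -- sketch with a placeholder zone name. Instances that do
# not request a zone explicitly are scheduled into power-feed-a
# instead of the built-in default zone named "nova".
[DEFAULT]
default_schedule_zone = power-feed-a
```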
Segregation example In this example the cloud is divided into two regions, one for each site, with two availability zones in each based on the power layout of the data centers. A number of host - aggregates have also been defined to allow targeting of + aggregates enable targeting of virtual machine instances using flavors that require special capabilities shared by the target hosts, such as SSDs, 10 GbE networks, or GPU cards. diff --git a/doc/arch-design/massively_scalable/section_user_requirements_massively_scalable.xml b/doc/arch-design/massively_scalable/section_user_requirements_massively_scalable.xml index 3c0263c37d..a7a50a438c 100644 --- a/doc/arch-design/massively_scalable/section_user_requirements_massively_scalable.xml +++ b/doc/arch-design/massively_scalable/section_user_requirements_massively_scalable.xml @@ -56,48 +56,47 @@ The cloud user expects repeatable, dependable, and deterministic processes for launching and deploying - cloud resources. This could be delivered through a + cloud resources. You could deliver this through a web-based interface or publicly available API endpoints. All appropriate options for requesting - cloud resources need to be available through some type + cloud resources must be available through some type of user interface, a command-line interface (CLI), or API endpoints. Cloud users expect a fully self-service and on-demand consumption model. When an OpenStack cloud - reaches the "massively scalable" size, it means it is - expected to be consumed "as a service" in each and + reaches the "massively scalable" size, users expect + to consume it "as a service" in each and every way. For a user of a massively scalable OpenStack public - cloud, there will be no expectations for control over - security, performance, or availability. Only SLAs - related to uptime of API services are expected, and - very basic SLAs expected of services offered. The user - understands it is his or her responsibility to address + cloud, there are no expectations for control over + security, performance, or availability. Users expect + only SLAs related to uptime of API services, and + very basic SLAs for services offered. It is the user's + responsibility to address these issues. The exception to this expectation is the rare case of a massively scalable cloud infrastructure built for a private or government organization that has specific requirements. - As might be expected, the cloud user requirements or - expectations that determine the design are all focused on the - consumption model. The user expects to be able to easily - consume cloud resources in an automated and deterministic way, + The cloud user's requirements and expectations that determine + the cloud design focus on the + consumption model. The user expects to consume cloud resources + in an automated and deterministic way, without any need for knowledge of the capacity, scalability, or other attributes of the cloud's underlying infrastructure.
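As a sketch of what this consumption model looks like against the Compute API, the following Python example boots an instance with python-novaclient; the credentials, image name, and flavor name are placeholder assumptions, and nothing about the underlying infrastructure appears in the request.

```python
from novaclient import client

# Placeholder credentials -- replace with real values.
nova = client.Client("2", "demo", "secret", "demo",
                     "http://controller:5000/v2.0")

# A repeatable, deterministic launch request: the user names only an
# image and a flavor; capacity and topology stay invisible.
image = nova.images.find(name="cirros-0.3.4")   # placeholder image name
flavor = nova.flavors.find(name="m1.small")
server = nova.servers.create("web-01", image, flavor)
print("%s %s" % (server.id, server.status))
```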
Operator requirements - Whereas the cloud user should be completely unaware of the + While the cloud user can be completely unaware of the underlying infrastructure of the cloud and its attributes, the - operator must be able to build and support the infrastructure, - as well as how it needs to operate at scale. This presents a - very demanding set of requirements for building such a cloud - from the operator's perspective: + operator must build and support the infrastructure for operating + at scale. This presents a very demanding set of requirements + for building such a cloud from the operator's perspective: First and foremost, everything must be capable of @@ -105,7 +104,7 @@ compute hardware, storage hardware, or networking hardware, to the installation and configuration of the supporting software, everything must be capable of - being automated. Manual processes will not suffice in + automation. Manual processes are impractical in a massively scalable OpenStack design architecture. @@ -127,13 +126,13 @@ Companies operating a massively scalable OpenStack cloud also require that operational expenditures - (OpEx) be minimized as much as possible. It is - recommended that cloud-optimized hardware is a good - approach when managing operational overhead. Some of - the factors that need to be considered include power, - cooling, and the physical design of the chassis. It is - possible to customize the hardware and systems so they - are optimized for this type of workload because of the + (OpEx) be minimized as much as possible. We + recommend using cloud-optimized hardware to + manage operational overhead. Some of + the factors to consider include power, + cooling, and the physical design of the chassis. Through + customization, it is possible to optimize the hardware + and systems for this type of workload because of the scale of these implementations. @@ -144,16 +143,16 @@ infrastructure. This includes full-scale metering of the hardware and software status. A corresponding framework of logging and alerting is also required to - store and allow operations to act upon the metrics - provided by the metering and monitoring solution(s). + store and enable operations to act on the metrics + provided by the metering and monitoring solutions. The cloud operator also needs a solution that uses the data provided by the metering and monitoring solution to provide capacity planning and capacity trending analysis; a sketch of metric collection follows this list. - A massively scalable OpenStack cloud will be a - multi-site cloud. Therefore, the user-operator + Invariably, massively scalable OpenStack clouds extend + over several sites. Therefore, the user-operator requirements for a multi-site OpenStack architecture design are also applicable here. This includes various legal requirements for data storage, data placement, @@ -161,18 +160,17 @@ compliance requirements; image consistency-availability; storage replication and availability (both block and file/object storage); and - authentication, authorization, and auditing (AAA), - just to name a few. Refer to the + authentication, authorization, and auditing (AAA). + See for more details on requirements and considerations for multi-site OpenStack clouds. 
- Considerations around physical facilities such as - space, floor weight, rack height and type, + The design architecture of a massively scalable OpenStack + cloud must also address physical facility considerations + such as space, floor weight, rack height and type, environmental factors, power usage and power - usage efficiency (PUE), and physical security must - also be addressed by the design architecture of a - massively scalable OpenStack cloud. + usage efficiency (PUE), and physical security.
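As one sketch of collecting the utilization metrics these requirements call for, the following Python example queries the Telemetry (ceilometer) API for hourly CPU-utilization statistics; the credentials are placeholders, and it assumes the cpu_util meter is being collected.

```python
import ceilometerclient.client

# Placeholder credentials -- replace with real values.
cclient = ceilometerclient.client.get_client(
    2,
    os_username="admin",
    os_password="secret",
    os_tenant_name="admin",
    os_auth_url="http://controller:5000/v2.0",
)

# Hourly CPU-utilization statistics across all instances; feed these
# into the logging/alerting framework and capacity-trending analysis.
for stat in cclient.statistics.list(meter_name="cpu_util", period=3600):
    print("%s avg=%.1f%% max=%.1f%%"
          % (stat.period_start, stat.avg, stat.max))
```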