Updating overview document

* formatted lines to 79 characters
* many small grammar fixes
* changed most "Hadoop" references to either specify "Hadoop, Spark, and
  Storm", or reference "data processing frameworks"
* updated image reference link

Change-Id: I5e358babdaede6d795422fb281c18cd280ca2876
Partial-Bug: 1490687

Introduction
------------

Apache Hadoop is an industry standard and widely adopted MapReduce
implementation; it is one of a growing number of data processing
frameworks. The aim of this project is to enable users to easily provision
and manage clusters with Hadoop and other data processing frameworks on
OpenStack. It is worth mentioning that Amazon has provided Hadoop for
several years as the Amazon Elastic MapReduce (EMR) service.

Sahara aims to provide users with a simple means to provision Hadoop,
Spark, and Storm clusters by specifying several parameters such as the
framework version, cluster topology, node hardware details and more. After
a user fills in all the parameters, sahara deploys the cluster in a few
minutes. Sahara also provides the means to scale an already provisioned
cluster by adding or removing worker nodes on demand.

The solution will address the following use cases:

* fast provisioning of data processing clusters on OpenStack for
  development and quality assurance (QA).
* utilization of unused compute power from a general purpose OpenStack
  IaaS cloud.
* "Analytics as a Service" for ad-hoc or bursty analytic workloads
  (similar to AWS EMR).

Key features are:

* designed as an OpenStack component;
* managed through a REST API with a user interface (UI) available as part
  of the OpenStack Dashboard (a minimal API sketch follows this list);
* support for a variety of data processing frameworks:

  * multiple Hadoop vendor distributions;
  * Apache Spark and Storm;
  * pluggable system of Hadoop installation engines;
  * integration with vendor specific management tools, such as Apache
    Ambari and Cloudera Management Console;

* predefined configuration templates with the ability to modify parameters.
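
The feature list above mentions the REST API; as a minimal, hedged
illustration, the sketch below lists the provisioning plugins known to a
sahara endpoint. The endpoint URL (including the default port 8386),
project ID and token are placeholders, and the resource paths should be
checked against the API reference for the deployed release.

.. code-block:: python

    import requests

    # Illustrative values; a real deployment supplies its own endpoint,
    # project ID and token (obtained from the Identity service).
    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    # List the available provisioning plugins (vanilla, spark, storm, ...).
    resp = requests.get(SAHARA_URL + "/plugins", headers=HEADERS)
    for plugin in resp.json()["plugins"]:
        print(plugin["name"], plugin.get("versions"))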

Details
-------

The sahara product communicates with the following OpenStack services:

* Dashboard (horizon) - provides a GUI with the ability to use all of
  sahara's features.
* Identity (keystone) - authenticates users and provides security tokens
  that are used to work with OpenStack, limiting a user's abilities in
  sahara to their OpenStack privileges (see the sketch after this list).
* Compute (nova) - used to provision VMs for data processing clusters.
* Orchestration (heat) - used to provision and orchestrate the deployment
  of data processing clusters.
* Image (glance) - stores VM images, each image containing an operating
  system and a pre-installed data processing distribution or framework.
* Object Storage (swift) - can be used as storage for job binaries and
  data that will be processed or created by framework jobs.
* Block Storage (cinder) - can be used to provision block storage for VM
  instances.
* Networking (neutron) - provides networking services to data processing
  clusters.
* Telemetry (ceilometer) - used to collect measures of cluster usage for
  metering and monitoring purposes.
* Shared file systems (manila) - can be used for storage of framework job
  binaries and data that will be processed or created by jobs.
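
Every call to sahara is authorized by the Identity service: a client first
obtains a token from keystone and then sends it with each request. The
sketch below uses the ``keystoneauth1`` library; the URL and credentials
are placeholders.

.. code-block:: python

    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    # Authenticate against the Identity service (keystone).
    auth = v3.Password(
        auth_url="http://controller:5000/v3",  # placeholder URL
        username="demo",
        password="secret",
        project_name="demo",
        user_domain_id="default",
        project_domain_id="default",
    )
    sess = session.Session(auth=auth)

    # The resulting token scopes every sahara call to the user's project,
    # so users can only see and manage what their privileges allow.
    token = sess.get_token()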

.. image:: images/openstack-interop.png
   :width: 800 px

General Workflow
----------------

Sahara will provide two levels of abstraction for the API and UI based on
the addressed use cases: cluster provisioning and analytics as a service.

For fast cluster provisioning a generic workflow will be as follows (a
short provisioning sketch appears after the list):

* select a Hadoop (or framework) version.
* select a base image with or without a pre-installed data processing
  framework:

  * for base images without a pre-installed framework, sahara will support
    pluggable deployment engines that integrate with vendor tooling.
  * you can download prepared up-to-date images from
    http://sahara-files.mirantis.com/images/upstream/liberty/

* define cluster configuration, including cluster size, topology, and
  framework parameters (for example, heap size):

  * to ease the configuration of such parameters, configurable templates
    are provided.

* provision the cluster; sahara will provision VMs, install and configure
  the data processing framework.
* perform operations on the cluster; add or remove nodes.
* terminate the cluster when it is no longer needed.
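
As a hedged sketch of the last three steps, the calls below follow the
v1.1 REST API as commonly documented (the ``resize_node_groups`` body in
particular is illustrative); the endpoint, token, and identifiers are all
placeholders.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}
    cluster_id = "<cluster-id>"  # returned when the cluster is provisioned

    # Scale the running cluster: resize the "worker" node group to five
    # instances (worker nodes are added or removed on demand).
    requests.put(
        SAHARA_URL + "/clusters/" + cluster_id,
        headers=HEADERS,
        json={"resize_node_groups": [{"name": "worker", "count": 5}]},
    )

    # Terminate the cluster when it is no longer needed.
    requests.delete(SAHARA_URL + "/clusters/" + cluster_id, headers=HEADERS)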

For analytics as a service, a generic workflow will be as follows (a job
submission sketch appears after the list):

* select one of the predefined data processing framework versions.
* configure a job:

  * choose the type of job: pig, hive, jar-file, etc.
  * provide the job script source or jar location.
  * select input and output data location.

* set the limit for the cluster size.
* execute the job:

  * all cluster provisioning and job execution will happen transparently
    to the user.
  * the cluster will be removed automatically after job completion.

* get the results of computations (for example, from swift).
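
A hedged sketch of the job execution step is shown below. The
``/jobs/<job-id>/execute`` resource and its body follow the v1.1 Elastic
Data Processing (EDP) API as commonly documented; all identifiers are
placeholders.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    # Run a previously registered job (for example a Pig script) against
    # registered input and output data sources. In the analytics as a
    # service scenario the referenced cluster is transient: it is
    # provisioned for the job and removed after completion.
    requests.post(
        SAHARA_URL + "/jobs/<job-id>/execute",
        headers=HEADERS,
        json={
            "cluster_id": "<cluster-id>",
            "input_id": "<input-data-source-id>",
            "output_id": "<output-data-source-id>",
            "job_configs": {"configs": {}, "args": []},
        },
    )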

User's Perspective
------------------

While provisioning clusters through sahara, the user operates on three
types of entities: Node Group Templates, Cluster Templates and Clusters.

A Node Group Template describes a group of nodes within a cluster. It
contains a list of Hadoop processes that will be launched on each instance
in a group. A Node Group Template may also provide node scoped
configurations for those processes. This kind of template encapsulates
hardware parameters (flavor) for the node VM and configuration for data
processing framework processes running on the node.
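
A minimal sketch of creating such a template through the REST API might
look as follows. The field names (``node_processes``, ``flavor_id``,
``node_configs``) follow the v1.1 API as commonly documented, and the
configuration values are purely illustrative.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    # A worker node group: each instance runs the TaskTracker and DataNode
    # processes and is booted with the given nova flavor.
    worker_template = {
        "name": "worker",
        "plugin_name": "vanilla",
        "hadoop_version": "1.2.1",
        "flavor_id": "<flavor-id>",
        "node_processes": ["tasktracker", "datanode"],
        # Node scoped framework configuration (names are illustrative).
        "node_configs": {"MapReduce": {"Task Tracker Heap Size": 1024}},
    }
    requests.post(
        SAHARA_URL + "/node-group-templates",
        headers=HEADERS,
        json=worker_template,
    )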

A Cluster Template is designed to bring Node Group Templates together to
form a Cluster. A Cluster Template defines what Node Groups will be
included and how many instances will be created in each. Some data
processing framework configurations cannot be applied to a single node,
but only to a whole Cluster; a user can specify these kinds of
configurations in a Cluster Template. Sahara enables users to specify
which processes should be added to an anti-affinity group within a
Cluster Template. If a process is included in an anti-affinity group, the
VMs where this process is going to be launched will be scheduled to
different hardware hosts.
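
For illustration, a Cluster Template payload could combine node groups
with an anti-affinity list as below; this is a sketch, and the exact
schema should be verified against the API reference.

.. code-block:: python

    # One master VM plus three workers; the "datanode" process is placed
    # in an anti-affinity group, asking the scheduler to keep the DataNode
    # VMs on different hardware hosts. POSTed to /cluster-templates.
    cluster_template = {
        "name": "vanilla-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "1.2.1",
        "node_groups": [
            {"name": "master",
             "node_group_template_id": "<master-template-id>",
             "count": 1},
            {"name": "worker",
             "node_group_template_id": "<worker-template-id>",
             "count": 3},
        ],
        "anti_affinity": ["datanode"],
        # Cluster wide framework configuration goes here.
        "cluster_configs": {},
    }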

The Cluster entity represents a collection of VM instances that all have
the same data processing framework installed. It is mainly characterized
by a VM image with a pre-installed framework which will be used for
cluster deployment. Users may choose one of the pre-configured Cluster
Templates to start a Cluster. To get access to VMs after a Cluster has
started, the user should specify a keypair.
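
The image and keypair are supplied when the Cluster itself is launched. A
hedged sketch of such a request follows; ``default_image_id`` and
``user_keypair_id`` are the field names as commonly documented for the
v1.1 API, and all identifiers are placeholders.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    requests.post(
        SAHARA_URL + "/clusters",
        headers=HEADERS,
        json={
            "name": "my-cluster",
            "plugin_name": "vanilla",
            "hadoop_version": "1.2.1",
            "cluster_template_id": "<cluster-template-id>",
            # Image with the pre-installed framework, registered in glance.
            "default_image_id": "<image-id>",
            # Nova keypair used to get SSH access to the cluster VMs.
            "user_keypair_id": "my-keypair",
        },
    )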

Sahara provides several constraints on cluster framework topology. The
JobTracker and NameNode processes can run either on a single VM or on two
separate VMs. A cluster can also contain worker nodes of different types:
worker nodes can run both the TaskTracker and DataNode processes, or
either of these processes alone. Sahara allows a user to create a cluster
with any combination of these options, but it will not allow the creation
of a non-working topology (for example: a set of workers with DataNodes,
but without a NameNode).

Each Cluster belongs to an Identity service project determined by the
user. Users have access only to objects located in projects they have
access to. Users can edit and delete only objects they have created or
that exist in their project. Naturally, admin users have full access to
every object. In this manner, sahara complies with the general OpenStack
access policy.

Integration with Object Storage
-------------------------------

The swift project provides the standard Object Storage service for
OpenStack environments; it is an analog of the Amazon S3 service. As a
rule it is deployed on bare metal machines. It is natural to expect data
processing on OpenStack to access data stored there. Sahara provides this
option with a file system implementation for swift
`HADOOP-8545 <https://issues.apache.org/jira/browse/HADOOP-8545>`_ and
`Change I6b1ba25b <https://review.openstack.org/#/c/21015/>`_, which
implements the ability to list endpoints for an object, account or
container. This makes it possible to integrate swift with software that
relies on data locality information to avoid network overhead.
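
When swift support is enabled, data locations use the ``swift://`` scheme
and jobs are given credentials for the swift file system. The property
names below follow the hadoop-swift integration described in the sahara
documentation; the values are placeholders.

.. code-block:: python

    # A data location using the swift file system implementation: the
    # object "input" inside the container "demo-container".
    input_url = "swift://demo-container.sahara/input"

    # Illustrative Hadoop configuration passed along with a job so that
    # the framework can authenticate against swift.
    swift_configs = {
        "fs.swift.service.sahara.username": "demo",
        "fs.swift.service.sahara.password": "secret",
    }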

To get more information on how to enable swift support, see
:doc:`userdoc/hadoop-swift`.

Pluggable Deployment and Monitoring
-----------------------------------

In addition to the monitoring capabilities provided by vendor-specific
Hadoop management tooling, sahara provides pluggable integration with
external monitoring systems such as Nagios or Zabbix.

Both deployment and monitoring tools can be installed on stand-alone VMs,
thus allowing a single instance to manage and monitor several clusters at
once.