From 07226e6089a42ec419aadff756a1893c20e4615b Mon Sep 17 00:00:00 2001
From: Michael McCune
Date: Mon, 31 Aug 2015 14:23:19 -0400
Subject: [PATCH] Updating overview document

* formatted lines to 79 characters
* many small grammar fixes
* changed most "Hadoop" references to either specify "Hadoop, Spark, and
  Storm", or reference "data processing frameworks"
* updating image reference link

Change-Id: I5e358babdaede6d795422fb281c18cd280ca2876
Partial-Bug: 1490687
---
 doc/source/overview.rst | 211 +++++++++++++++++++++++++---------------
 1 file changed, 131 insertions(+), 80 deletions(-)

diff --git a/doc/source/overview.rst b/doc/source/overview.rst
index 53e4ce39..29fc4870 100644
--- a/doc/source/overview.rst
+++ b/doc/source/overview.rst
@@ -4,46 +4,67 @@ Rationale
 Introduction
 ------------

-Apache Hadoop is an industry standard and widely adopted MapReduce implementation.
-The aim of this project is to enable users to easily provision and manage Hadoop clusters on OpenStack.
-It is worth mentioning that Amazon provides Hadoop for several years as Amazon Elastic MapReduce (EMR) service.
+Apache Hadoop is an industry standard and widely adopted MapReduce
+implementation; it is one of a growing number of data processing
+frameworks. The aim of this project is to enable users to easily provision
+and manage clusters with Hadoop and other data processing frameworks on
+OpenStack. It is worth mentioning that Amazon has provided Hadoop for
+several years as the Amazon Elastic MapReduce (EMR) service.

-Sahara aims to provide users with simple means to provision Hadoop clusters
-by specifying several parameters like Hadoop version, cluster topology, nodes hardware details
-and a few more. After user fills in all the parameters, Sahara deploys the cluster in a few minutes.
-Also Sahara provides means to scale already provisioned cluster by adding/removing worker nodes on demand.
+Sahara aims to provide users with a simple means to provision Hadoop, Spark,
+and Storm clusters by specifying several parameters such as the version,
+cluster topology, node hardware details, and more. After a user fills in all
+the parameters, sahara deploys the cluster in a few minutes. Sahara also
+provides the means to scale an already provisioned cluster by adding or
+removing worker nodes on demand.

-The solution will address following use cases:
+The solution will address the following use cases:

-* fast provisioning of Hadoop clusters on OpenStack for Dev and QA;
-* utilization of unused compute power from general purpose OpenStack IaaS cloud;
-* "Analytics as a Service" for ad-hoc or bursty analytic workloads (similar to AWS EMR).
+* fast provisioning of data processing clusters on OpenStack for development
+  and quality assurance (QA).
+* utilization of unused compute power from a general purpose OpenStack IaaS
+  cloud.
+* "Analytics as a Service" for ad-hoc or bursty analytic workloads (similar
+  to AWS EMR).

 Key features are:

 * designed as an OpenStack component;
-* managed through REST API with UI available as part of OpenStack Dashboard;
-* support for different Hadoop distributions:
-    * pluggable system of Hadoop installation engines;
-    * integration with vendor specific management tools, such as Apache Ambari or Cloudera Management Console;
-* predefined templates of Hadoop configurations with ability to modify parameters.
+* managed through a REST API with a user interface (UI) available as part of
+  OpenStack Dashboard;
+* support for a variety of data processing frameworks:
+
+  * multiple Hadoop vendor distributions
+  * Apache Spark and Storm
+  * pluggable system of Hadoop installation engines
+  * integration with vendor specific management tools, such as Apache
+    Ambari and Cloudera Management Console
+
+* predefined configuration templates with the ability to modify parameters.

 Details
 -------

-The Sahara product communicates with the following OpenStack components:
+The sahara product communicates with the following OpenStack services:

-* Horizon - provides GUI with ability to use all of Sahara’s features.
-* Keystone - authenticates users and provides security token that is used to work with the OpenStack,
-  hence limiting user abilities in Sahara to his OpenStack privileges.
-* Nova - is used to provision VMs for Hadoop Cluster.
-* Heat - Sahara can be configured to use Heat; Heat orchestrates the required services for Hadoop Cluster.
-* Glance - Hadoop VM images are stored there, each image containing an installed OS and Hadoop.
-  the pre-installed Hadoop should give us good handicap on node start-up.
-* Swift - can be used as a storage for data that will be processed by Hadoop jobs.
-* Cinder - can be used as a block storage.
-* Neutron - provides the networking service.
-* Ceilometer - used to collect measures of cluster usage for metering and monitoring purposes.
+* Dashboard (horizon) - provides a GUI with the ability to use all of
+  sahara’s features.
+* Identity (keystone) - authenticates users and provides security tokens
+  that are used to work with OpenStack, limiting a user's abilities in
+  sahara to their OpenStack privileges.
+* Compute (nova) - used to provision VMs for data processing clusters.
+* Orchestration (heat) - used to provision and orchestrate the deployment of
+  data processing clusters.
+* Image (glance) - stores VM images, each image containing an operating
+  system and a pre-installed data processing distribution or framework.
+* Object Storage (swift) - can be used as storage for job binaries and data
+  that will be processed or created by framework jobs.
+* Block Storage (cinder) - can be used to provision block storage for VM
+  instances.
+* Networking (neutron) - provides networking services to data processing
+  clusters.
+* Telemetry (ceilometer) - used to collect measures of cluster usage for
+  metering and monitoring purposes.
+* Shared File Systems (manila) - can be used for storage of framework job
+  binaries and data that will be processed or created by jobs.

 .. image:: images/openstack-interop.png
    :width: 800 px
    :scale: 99 %
    :align: left
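+
+The following sketch illustrates this interplay from a client's point of
+view: it obtains a token from the Identity service and then asks the sahara
+REST API for the list of clusters. This is only an illustrative sketch; the
+endpoint addresses and credentials are placeholders, and error handling is
+omitted:
+
+.. code-block:: python
+
+    import requests
+
+    KEYSTONE = "http://controller:5000/v3"   # placeholder endpoints
+    SAHARA = "http://controller:8386/v1.1"
+
+    # Request a project-scoped token from the Identity service (v3 API).
+    user = {"name": "demo", "domain": {"id": "default"},
+            "password": "secret"}
+    auth = {"auth": {
+        "identity": {"methods": ["password"], "password": {"user": user}},
+        "scope": {"project": {"name": "demo",
+                              "domain": {"id": "default"}}}}}
+    resp = requests.post(KEYSTONE + "/auth/tokens", json=auth)
+    token = resp.headers["X-Subject-Token"]
+    project_id = resp.json()["token"]["project"]["id"]
+
+    # Every sahara call carries the token; here we list existing clusters.
+    url = "%s/%s/clusters" % (SAHARA, project_id)
+    clusters = requests.get(url, headers={"X-Auth-Token": token}).json()
+    for cluster in clusters["clusters"]:
+        print(cluster["name"], cluster["status"])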
@@ -53,86 +74,116 @@
 General Workflow
 ----------------

-Sahara will provide two level of abstraction for API and UI based on the addressed use cases:
-cluster provisioning and analytics as a service.
+Sahara will provide two levels of abstraction for the API and UI based on
+the addressed use cases: cluster provisioning and analytics as a service.

-For the fast cluster provisioning generic workflow will be as following:
+For fast cluster provisioning a generic workflow will be as follows:

-* select Hadoop version;
-* select base image with or without pre-installed Hadoop:
+* select a Hadoop (or framework) version.
+* select a base image with or without a pre-installed data processing
+  framework:

-  * for base images without Hadoop pre-installed Sahara will support pluggable deployment engines integrated with vendor tooling;
-  * you could download prepared up-to-date images from http://sahara-files.mirantis.com/images/upstream/kilo/;
+  * for base images without a pre-installed framework, sahara will support
+    pluggable deployment engines that integrate with vendor tooling.
+  * you can download prepared up-to-date images from
+    http://sahara-files.mirantis.com/images/upstream/liberty/

-* define cluster configuration, including size and topology of the cluster and setting the different type of Hadoop parameters (e.g. heap size):
+* define cluster configuration, including cluster size, topology, and
+  framework parameters (for example, heap size):

-  * to ease the configuration of such parameters mechanism of configurable templates will be provided;
+  * to ease the configuration of such parameters, configurable templates
+    are provided.

-* provision the cluster: Sahara will provision VMs, install and configure Hadoop;
-* operation on the cluster: add/remove nodes;
-* terminate the cluster when it’s not needed anymore.
+* provision the cluster; sahara will provision VMs, install and configure
+  the data processing framework.
+* perform operations on the cluster; add or remove nodes.
+* terminate the cluster when it is no longer needed.
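+
+Expressed against the sahara REST API, the provisioning workflow above
+might look roughly as follows. This is an illustrative sketch, not a
+complete recipe: the plugin name, version, flavor, image, and keypair are
+placeholders, and the exact payload schema should be checked against the
+sahara API reference. The ``token`` and ``project_id`` values come from
+the earlier Identity example:
+
+.. code-block:: python
+
+    import requests
+
+    sahara = "http://controller:8386/v1.1/" + project_id
+    headers = {"X-Auth-Token": token}
+
+    def post(path, payload):
+        return requests.post(sahara + path, json=payload,
+                             headers=headers).json()
+
+    # Describe the node groups: a hardware flavor plus the processes
+    # that each instance in the group will run.
+    ngt = {"plugin_name": "vanilla", "hadoop_version": "2.7.1",
+           "flavor_id": "2"}
+    master = post("/node-group-templates",
+                  dict(ngt, name="master",
+                       node_processes=["namenode", "resourcemanager"]))
+    worker = post("/node-group-templates",
+                  dict(ngt, name="worker",
+                       node_processes=["datanode", "nodemanager"]))
+
+    # Combine the node groups into a cluster template, then launch a
+    # cluster; sahara provisions the VMs and configures the framework.
+    template = post("/cluster-templates", {
+        "name": "small", "plugin_name": "vanilla",
+        "hadoop_version": "2.7.1",
+        "node_groups": [
+            {"name": "master", "count": 1,
+             "node_group_template_id":
+                 master["node_group_template"]["id"]},
+            {"name": "worker", "count": 3,
+             "node_group_template_id":
+                 worker["node_group_template"]["id"]}]})
+    cluster = post("/clusters", {
+        "name": "demo-cluster", "plugin_name": "vanilla",
+        "hadoop_version": "2.7.1",
+        "cluster_template_id": template["cluster_template"]["id"],
+        "default_image_id": "<image-id>", "user_keypair_id": "demo-key"})
+
+    # Scale on demand, then delete the cluster when no longer needed.
+    cid = cluster["cluster"]["id"]
+    requests.put("%s/clusters/%s" % (sahara, cid), headers=headers,
+                 json={"resize_node_groups": [{"name": "worker",
+                                               "count": 5}]})
+    requests.delete("%s/clusters/%s" % (sahara, cid), headers=headers)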

-For analytic as a service generic workflow will be as following:
+For analytics as a service, a generic workflow will be as follows:

-* select one of predefined Hadoop versions;
-* configure the job:
+* select one of the predefined data processing framework versions.
+* configure a job:

-  * choose type of the job: pig, hive, jar-file, etc.;
-  * provide the job script source or jar location;
-  * select input and output data location (initially only Swift will be supported);
-  * select location for logs;
+  * choose the type of job: pig, hive, jar-file, etc.
+  * provide the job script source or jar location.
+  * select input and output data location.

-* set limit for the cluster size;
+* set the limit for the cluster size.
 * execute the job:

-  * all cluster provisioning and job execution will happen transparently to the user;
-  * cluster will be removed automatically after job completion;
+  * all cluster provisioning and job execution will happen transparently
+    to the user.
+  * the cluster will be removed automatically after job completion.

 * get the results of computations (for example, from Swift).
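+
+A hedged sketch of this workflow, using sahara's Elastic Data Processing
+(EDP) REST resources together with the ``post`` helper and ``cluster``
+from the provisioning sketch above, might look as follows. Here the job
+runs on an existing cluster; the container, object, and credential values
+are placeholders:
+
+.. code-block:: python
+
+    # Register the job script kept in the Object Storage service.
+    binary = post("/job-binaries", {
+        "name": "wordcount.pig",
+        "url": "swift://scripts.sahara/wordcount.pig",
+        "extra": {"user": "demo", "password": "secret"}})
+
+    # Register input and output data locations.
+    src = post("/data-sources", {
+        "name": "books", "type": "swift",
+        "url": "swift://data.sahara/books",
+        "credentials": {"user": "demo", "password": "secret"}})
+    dst = post("/data-sources", {
+        "name": "results", "type": "swift",
+        "url": "swift://data.sahara/results",
+        "credentials": {"user": "demo", "password": "secret"}})
+
+    # Define a Pig job around the binary and execute it.
+    job = post("/jobs", {"name": "wordcount", "type": "Pig",
+                         "mains": [binary["job_binary"]["id"]],
+                         "libs": []})
+    post("/jobs/%s/execute" % job["job"]["id"], {
+        "cluster_id": cluster["cluster"]["id"],
+        "input_id": src["data_source"]["id"],
+        "output_id": dst["data_source"]["id"],
+        "job_configs": {"configs": {}, "args": [], "params": {}}})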

 User's Perspective
 ------------------

-While provisioning cluster through Sahara, user operates on three types of entities: Node Group Templates, Cluster Templates and Clusters.
+While provisioning clusters through sahara, the user operates on three types
+of entities: Node Group Templates, Cluster Templates and Clusters.

-A Node Group Template describes a group of nodes within cluster. It contains a list of hadoop processes that will be launched on each instance in a group.
-Also a Node Group Template may provide node scoped configurations for those processes.
-This kind of templates encapsulates hardware parameters (flavor) for the node VM and configuration for Hadoop processes running on the node.
+A Node Group Template describes a group of nodes within a cluster. It
+contains a list of the processes that will be launched on each instance in
+a group. Also a Node Group Template may provide node scoped configurations
+for those processes. This kind of template encapsulates hardware parameters
+(flavor) for the node VM and configuration for data processing framework
+processes running on the node.

-A Cluster Template is designed to bring Node Group Templates together to form a Cluster.
-A Cluster Template defines what Node Groups will be included and how many instances will be created in each.
-Some of Hadoop Configurations can not be applied to a single node, but to a whole Cluster, so user can specify this kind of configurations in a Cluster Template.
-Sahara enables user to specify which processes should be added to an anti-affinity group within a Cluster Template. If a process is included into an anti-affinity
-group, it means that VMs where this process is going to be launched should be scheduled to different hardware hosts.
+A Cluster Template is designed to bring Node Group Templates together to
+form a Cluster. A Cluster Template defines what Node Groups will be included
+and how many instances will be created in each. Some data processing
+framework configurations cannot be applied to a single node, but only to a
+whole Cluster. A user can specify these kinds of configurations in a Cluster
+Template. Sahara enables users to specify which processes should be added to
+an anti-affinity group within a Cluster Template. If a process is included
+in an anti-affinity group, it means that VMs where this process is going to
+be launched should be scheduled to different hardware hosts.

-The Cluster entity represents a Hadoop Cluster. It is mainly characterized by VM image with pre-installed Hadoop which
-will be used for cluster deployment. User may choose one of pre-configured Cluster Templates to start a Cluster.
-To get access to VMs after a Cluster has started, user should specify a keypair.
+The Cluster entity represents a collection of VM instances that all have the
+same data processing framework installed. It is mainly characterized by a VM
+image with a pre-installed framework which will be used for cluster
+deployment. Users may choose one of the pre-configured Cluster Templates to
+start a Cluster. To get access to VMs after a Cluster has started, the user
+should specify a keypair.

-Sahara provides several constraints on Hadoop cluster topology. JobTracker and NameNode processes could be run either on a single
-VM or two separate ones. Also cluster could contain worker nodes of different types. Worker nodes could run both TaskTracker and DataNode,
-or either of these processes alone. Sahara allows user to create cluster with any combination of these options,
-but it will not allow to create a non working topology, for example: a set of workers with DataNodes, but without a NameNode.
+Sahara provides several constraints on cluster framework topology. JobTracker
+and NameNode processes could be run either on a single VM or two separate
+VMs. Also a cluster could contain worker nodes of different types. Worker
+nodes could run both TaskTracker and DataNode, or either of these processes
+alone. Sahara allows a user to create a cluster with any combination of
+these options, but it will not allow the creation of a non-working topology
+(for example: a set of workers with DataNodes, but without a NameNode).
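+
+As an illustrative example of these entities, the cluster template below
+(field names as in the earlier sketches, IDs hypothetical) mixes two worker
+types and puts the DataNode process in an anti-affinity group so that VMs
+running a DataNode are scheduled to different hardware hosts:
+
+.. code-block:: python
+
+    # 'post' is the helper from the provisioning example above.
+    post("/cluster-templates", {
+        "name": "spread-out", "plugin_name": "vanilla",
+        "hadoop_version": "2.7.1",
+        # processes listed here are placed in an anti-affinity group
+        "anti_affinity": ["datanode"],
+        "node_groups": [
+            # one master, plus workers that run compute and storage
+            # processes together and workers that run storage alone
+            {"name": "master", "count": 1,
+             "node_group_template_id": "<master-template-id>"},
+            {"name": "worker-mixed", "count": 3,
+             "node_group_template_id": "<mixed-template-id>"},
+            {"name": "worker-storage", "count": 2,
+             "node_group_template_id": "<storage-only-template-id>"}]})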

-Each Cluster belongs to some tenant determined by user. Users have access only to objects located in
-tenants they have access to. Users could edit/delete only objects they created. Naturally admin users have full access to every object.
-That way Sahara complies with general OpenStack access policy.
+Each Cluster belongs to an Identity service project determined by the user.
+Users have access only to objects located in projects they have access to.
+Users can edit and delete only objects they have created or that exist in
+their project. Naturally, admin users have full access to every object. In
+this manner, sahara complies with the general OpenStack access policy.

-Integration with Swift
-----------------------
+Integration with Object Storage
+-------------------------------

-The Swift service is a standard object storage in OpenStack environment, analog of Amazon S3. As a rule it is deployed
-on bare metal machines. It is natural to expect Hadoop on OpenStack to process data stored there. And it is so.
-With a FileSystem implementation for Swift `HADOOP-8545 `_
-and `Change I6b1ba25b `_ which implements the ability to list endpoints for
-an object, account or container, to make it possible to integrate swift with software that relies on data locality
-information to avoid network overhead.
+The swift project provides the standard Object Storage service for OpenStack
+environments; it is an analog of the Amazon S3 service. As a rule it is
+deployed on bare metal machines. It is natural to expect data processing on
+OpenStack to access data stored there. Sahara provides this option with a
+file system implementation for swift, `HADOOP-8545 `_, and with
+`Change I6b1ba25b `_, which implements the ability to list endpoints for an
+object, account or container. This makes it possible to integrate swift
+with software that relies on data locality information to avoid network
+overhead.

-To get more information on how to enable Swift support see :doc:`userdoc/hadoop-swift`.
+To get more information on how to enable swift support, see
+:doc:`userdoc/hadoop-swift`.

 Pluggable Deployment and Monitoring
 -----------------------------------

-In addition to the monitoring capabilities provided by vendor-specific Hadoop management tooling, Sahara will provide pluggable integration with external monitoring systems such as Nagios or Zabbix.
+In addition to the monitoring capabilities provided by vendor-specific
+Hadoop management tooling, sahara provides pluggable integration with
+external monitoring systems such as Nagios or Zabbix.

-Both deployment and monitoring tools will be installed on stand-alone VMs, thus allowing a single instance to manage/monitor several clusters at once.
+Both deployment and monitoring tools can be installed on stand-alone VMs,
+thus allowing a single instance to manage and monitor several clusters at
+once.
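+
+As a closing illustration of the data locality support described in the
+Integration with Object Storage section above, the endpoint listing added
+by `Change I6b1ba25b `_ can be queried directly. This sketch assumes
+swift's ``list_endpoints`` middleware is enabled at its default
+``/endpoints`` path; the proxy address, account, and object names are
+placeholders:
+
+.. code-block:: python
+
+    import requests
+
+    SWIFT_PROXY = "http://swift-proxy"   # placeholder proxy address
+    account, container, obj = "AUTH_demo", "data", "books"
+
+    # Ask swift which storage nodes hold the object; a locality-aware
+    # scheduler can then place work close to the data.
+    url = "%s/endpoints/%s/%s/%s" % (SWIFT_PROXY, account, container, obj)
+    endpoints = requests.get(url).json()
+    print(endpoints)   # list of URLs pointing at the storage nodes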