Updating overview document

* formatted lines to 79 characters
* many small grammar fixes
* changed most "Hadoop" references to either specify "Hadoop, Spark, and
  Storm", or reference "data processing frameworks"
* updating image reference link

Change-Id: I5e358babdaede6d795422fb281c18cd280ca2876
Partial-Bug: 1490687
Michael McCune 2015-08-31 14:23:19 -04:00
parent 6b7e3a0271
commit 07226e6089


Introduction
------------
Apache Hadoop is an industry standard and widely adopted MapReduce
implementation; it is one among a growing number of data processing
frameworks. The aim of this project is to enable users to easily provision
and manage clusters with Hadoop and other data processing frameworks on
OpenStack. It is worth mentioning that Amazon has provided Hadoop for
several years as the Amazon Elastic MapReduce (EMR) service.
Sahara aims to provide users with a simple means to provision Hadoop, Spark,
and Storm clusters by specifying several parameters such as the framework
version, cluster topology, node hardware details and more. After a user
fills in all the parameters, sahara deploys the cluster in a few minutes.
Sahara also provides the means to scale an already provisioned cluster by
adding or removing worker nodes on demand.
The solution will address the following use cases:
* fast provisioning of data processing clusters on OpenStack for development
  and quality assurance (QA).
* utilization of unused compute power from a general purpose OpenStack IaaS
  cloud.
* "Analytics as a Service" for ad-hoc or bursty analytic workloads (similar
  to AWS EMR).
Key features are:
* designed as an OpenStack component;
* managed through a REST API with a user interface (UI) available as part of
  OpenStack Dashboard (see the example after this list);
* support for a variety of data processing frameworks:

  * multiple Hadoop vendor distributions
  * Apache Spark and Storm
  * pluggable system of Hadoop installation engines
  * integration with vendor specific management tools, such as Apache
    Ambari and Cloudera Management Console

* predefined configuration templates with the ability to modify parameters.
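
For illustration, the REST API mentioned above could be exercised directly
with an HTTP client. The following minimal sketch lists the clusters visible
to a project; the endpoint URL, project ID and token are placeholders, and
the ``v1.1`` path and port 8386 are assumptions about a typical sahara
deployment.

.. code-block:: python

    import requests

    # Placeholder values; in a real deployment the token and endpoint come
    # from the Identity service (keystone) and the service catalog.
    SAHARA_URL = "http://controller:8386/v1.1"
    PROJECT_ID = "0123456789abcdef0123456789abcdef"
    TOKEN = "gAAAAA..."

    # List the clusters belonging to this project.
    resp = requests.get(
        "%s/%s/clusters" % (SAHARA_URL, PROJECT_ID),
        headers={"X-Auth-Token": TOKEN},
    )
    resp.raise_for_status()
    for cluster in resp.json().get("clusters", []):
        print(cluster["name"], cluster["status"])
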
Details
-------
The sahara product communicates with the following OpenStack services:
* Dashboard (horizon) - provides a GUI with the ability to use all of
  sahara's features.
* Identity (keystone) - authenticates users and provides security tokens
  that are used to work with OpenStack, limiting a user's abilities in
  sahara to their OpenStack privileges.
* Compute (nova) - used to provision VMs for data processing clusters.
* Orchestration (heat) - used to provision and orchestrate the deployment of
  data processing clusters.
* Image (glance) - stores VM images, each image containing an operating
  system and a pre-installed data processing distribution or framework.
* Object Storage (swift) - can be used as storage for job binaries and data
  that will be processed or created by framework jobs.
* Block Storage (cinder) - can be used to provision block storage for VM
  instances.
* Networking (neutron) - provides networking services to data processing
  clusters.
* Telemetry (ceilometer) - used to collect measures of cluster usage for
  metering and monitoring purposes.
* Shared file systems (manila) - can be used for storage of framework job
  binaries and data that will be processed or created by jobs.
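
As a sketch of the Identity interaction described above, a client typically
authenticates against keystone and then looks up the sahara endpoint in the
service catalog under the ``data-processing`` service type. The credentials
and URLs below are placeholders, and the use of the keystoneauth1 library is
an assumption made for illustration.

.. code-block:: python

    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    # Placeholder credentials for an example user and project.
    auth = v3.Password(
        auth_url="http://controller:5000/v3",
        username="demo",
        password="secret",
        project_name="demo",
        user_domain_id="default",
        project_domain_id="default",
    )
    sess = session.Session(auth=auth)

    # The token is scoped to the user's project, so sahara operations are
    # limited to that project's privileges.
    token = sess.get_token()

    # Sahara registers itself in the catalog as the "data-processing"
    # service type.
    sahara_endpoint = sess.get_endpoint(service_type="data-processing")
    print(sahara_endpoint)
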
.. image:: images/openstack-interop.png
   :width: 800 px
General Workflow
----------------
Sahara will provide two levels of abstraction for the API and UI based on the
addressed use cases: cluster provisioning and analytics as a service.
For fast cluster provisioning, a generic workflow will be as follows (see
the example after this list):
* select a Hadoop (or framework) version.
* select a base image with or without a pre-installed data processing
  framework:

  * for base images without a pre-installed framework, sahara will support
    pluggable deployment engines that integrate with vendor tooling.
  * you can download prepared up-to-date images from
    http://sahara-files.mirantis.com/images/upstream/liberty/

* define cluster configuration, including cluster size, topology, and
  framework parameters (for example, heap size):

  * to ease the configuration of such parameters, configurable templates
    are provided.

* provision the cluster; sahara will provision VMs, install and configure
  the data processing framework.
* perform operations on the cluster; add or remove nodes.
* terminate the cluster when it is no longer needed.
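
A rough sketch of the first provisioning steps against the REST API is shown
below; the plugin name, framework version, flavor and identifiers are
illustrative placeholders, and the request field names should be checked
against the API reference for the release in use.

.. code-block:: python

    import json
    import requests

    BASE = "http://controller:8386/v1.1/0123456789abcdef0123456789abcdef"
    HEADERS = {"X-Auth-Token": "gAAAAA...",
               "Content-Type": "application/json"}

    # 1. Describe a group of worker nodes: the flavor to use and the
    #    framework processes to run on each VM in the group.
    worker_ngt = {
        "name": "vanilla-worker",
        "plugin_name": "vanilla",       # illustrative plugin choice
        "hadoop_version": "2.7.1",      # illustrative framework version
        "flavor_id": "2",
        "node_processes": ["datanode", "nodemanager"],
    }
    resp = requests.post("%s/node-group-templates" % BASE,
                         headers=HEADERS, data=json.dumps(worker_ngt))
    worker_id = resp.json()["node_group_template"]["id"]

    # 2. Combine node groups into a cluster template, giving the number of
    #    instances to create in each group.
    cluster_template = {
        "name": "vanilla-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "node_groups": [
            {"name": "worker",
             "node_group_template_id": worker_id,
             "count": 3},
        ],
    }
    requests.post("%s/cluster-templates" % BASE,
                  headers=HEADERS, data=json.dumps(cluster_template))
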
For analytics as a service, a generic workflow will be as follows (see the
example after this list):
* select one of the predefined data processing framework versions.
* configure a job:

  * choose the type of job: pig, hive, jar-file, etc.
  * provide the job script source or jar location.
  * select input and output data location.

* set the limit for the cluster size.
* execute the job:

  * all cluster provisioning and job execution will happen transparently
    to the user.
  * cluster will be removed automatically after job completion.

* get the results of computations (for example, from Swift).
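
To make the job-oriented workflow more concrete, the sketch below registers
swift input and output locations as data sources through the REST API; the
container, credentials and identifiers are placeholders, and the field names
are assumptions based on the v1.1 API.

.. code-block:: python

    import json
    import requests

    BASE = "http://controller:8386/v1.1/0123456789abcdef0123456789abcdef"
    HEADERS = {"X-Auth-Token": "gAAAAA...",
               "Content-Type": "application/json"}

    def make_data_source(name, url):
        """Register a swift location a job can read from or write to."""
        body = {
            "name": name,
            "type": "swift",
            "url": url,
            "credentials": {"user": "demo", "password": "secret"},
        }
        resp = requests.post("%s/data-sources" % BASE,
                             headers=HEADERS, data=json.dumps(body))
        return resp.json()["data_source"]["id"]

    input_id = make_data_source("job-input", "swift://demo.sahara/input")
    output_id = make_data_source("job-output", "swift://demo.sahara/output")

    # A job (pig, hive, jar-file, ...) is then created from uploaded job
    # binaries and executed against a cluster with these input and output
    # data source IDs.
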
User's Perspective
------------------
While provisioning clusters through sahara, the user operates on three types
of entities: Node Group Templates, Cluster Templates and Clusters.
A Node Group Template describes a group of nodes within a cluster. It
contains a list of Hadoop processes that will be launched on each instance
in a group. A Node Group Template may also provide node scoped
configurations for those processes. This kind of template encapsulates
hardware parameters (flavor) for the node VM and configuration for data
processing framework processes running on the node.
A Cluster Template is designed to bring Node Group Templates together to
form a Cluster. A Cluster Template defines what Node Groups will be included
and how many instances will be created in each. Some data processing
framework configurations cannot be applied to a single node, only to a whole
Cluster. A user can specify these kinds of configurations in a Cluster
Template. Sahara enables users to specify which processes should be added to
an anti-affinity group within a Cluster Template. If a process is included
in an anti-affinity group, the VMs where this process is going to be
launched should be scheduled to different hardware hosts.
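
For illustration, anti-affinity is expressed as a list of process names in
the cluster template (or cluster) body; in the hypothetical fragment below
the VMs running the datanode process would be placed on different hosts
where possible. The ``anti_affinity`` key is assumed to match the sahara API
field name, and the remaining values are placeholders.

.. code-block:: python

    # Hypothetical cluster template fragment; only the anti-affinity part
    # is of interest here, all other values are placeholders.
    cluster_template = {
        "name": "ha-aware-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "node_groups": [
            {"name": "worker",
             "node_group_template_id": "WORKER-NGT-ID",
             "count": 4},
        ],
        # Processes listed here should not share a hardware host: VMs
        # running the datanode process are scheduled to different hosts.
        "anti_affinity": ["datanode"],
    }
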
The Cluster entity represents a collection of VM instances that all have the
same data processing framework installed. It is mainly characterized by a VM
image with a pre-installed framework which will be used for cluster
deployment. Users may choose one of the pre-configured Cluster Templates to
start a Cluster. To get access to VMs after a Cluster has started, the user
should specify a keypair.
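
A hypothetical cluster creation request illustrating the template, image and
keypair choices might look as follows; every identifier is a placeholder and
the field names are assumptions based on the v1.1 API.

.. code-block:: python

    import json
    import requests

    BASE = "http://controller:8386/v1.1/0123456789abcdef0123456789abcdef"
    HEADERS = {"X-Auth-Token": "gAAAAA...",
               "Content-Type": "application/json"}

    cluster = {
        "name": "demo-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "cluster_template_id": "CLUSTER-TEMPLATE-ID",
        # Glance image with the framework pre-installed (or prepared for
        # the chosen deployment engine).
        "default_image_id": "IMAGE-ID",
        # Nova keypair used to reach the instances over SSH once they
        # are running.
        "user_keypair_id": "my-keypair",
    }
    requests.post("%s/clusters" % BASE,
                  headers=HEADERS, data=json.dumps(cluster))
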
Sahara provides several constraints on cluster framework topology. The
JobTracker and NameNode processes can be run either on a single VM or on two
separate VMs. A cluster can also contain worker nodes of different types.
Worker nodes can run both TaskTracker and DataNode, or either of these
processes alone. Sahara allows a user to create a cluster with any
combination of these options, but it will not allow the creation of a
non-working topology (for example: a set of workers with DataNodes, but
without a NameNode).
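
As a small worked example of a valid topology (the group names are
hypothetical), the management processes can live on a single master group
while the workers run both storage and compute processes:

.. code-block:: python

    # One valid layout: management processes on a master group, storage
    # and compute processes together on the workers.  Leaving out the
    # NameNode entirely would be rejected as a non-working topology.
    node_groups = [
        {"name": "master", "count": 1,
         "node_processes": ["namenode", "jobtracker"]},
        {"name": "worker", "count": 5,
         "node_processes": ["datanode", "tasktracker"]},
    ]
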
Each Cluster belongs to some Identity service project determined by the user.
Users have access only to objects located in projects they have access to.
Users can edit and delete only objects they have created or that exist in
their project. Naturally, admin users have full access to every object. In
this manner, sahara complies with the general OpenStack access policy.
Integration with Object Storage
-------------------------------
The swift project provides the standard Object Storage service for OpenStack
environments; it is an analog of the Amazon S3 service. As a rule it is
deployed on bare metal machines. It is natural to expect data processing on
OpenStack to access data stored there. Sahara provides this option with a
file system implementation for swift
`HADOOP-8545 <https://issues.apache.org/jira/browse/HADOOP-8545>`_ and
`Change I6b1ba25b <https://review.openstack.org/#/c/21015/>`_ which
implements the ability to list endpoints for an object, account or container.
This makes it possible to integrate swift with software that relies on data
locality information to avoid network overhead.
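
As an illustration, once the swift file system implementation is enabled,
jobs can address Object Storage locations directly through ``swift://``
URLs; the container name and credentials below are placeholders, and the
configuration key names assume the hadoop-swift integration referenced
above.

.. code-block:: python

    # Example swift paths as they might appear in job arguments.  The
    # ".sahara" suffix selects the swift provider configured for
    # sahara-enabled Hadoop.
    input_url = "swift://demo-container.sahara/input"
    output_url = "swift://demo-container.sahara/output"

    # Credentials are usually passed as Hadoop configuration values; the
    # key names below assume the hadoop-swift integration.
    swift_configs = {
        "fs.swift.service.sahara.username": "demo",
        "fs.swift.service.sahara.password": "secret",
    }
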
To get more information on how to enable swift support see
:doc:`userdoc/hadoop-swift`.
Pluggable Deployment and Monitoring
-----------------------------------
In addition to the monitoring capabilities provided by vendor-specific
Hadoop management tooling, sahara provides pluggable integration with
external monitoring systems such as Nagios or Zabbix.
Both deployment and monitoring tools can be installed on stand-alone VMs,
thus allowing a single instance to manage and monitor several clusters at
once.