Updating overview document

* formatted lines to 79 characters
* many small grammar fixes
* changed most "Hadoop" references to either specify "Hadoop, Spark, and
  Storm", or reference "data processing frameworks"
* updated image reference link

Change-Id: I5e358babdaede6d795422fb281c18cd280ca2876
Partial-Bug: 1490687

Introduction
------------

Apache Hadoop is an industry standard and widely adopted MapReduce
implementation; it is one of a growing number of data processing
frameworks. The aim of this project is to enable users to easily provision
and manage clusters with Hadoop and other data processing frameworks on
OpenStack. It is worth mentioning that Amazon has provided Hadoop for
several years as the Amazon Elastic MapReduce (EMR) service.

Sahara aims to provide users with a simple means to provision Hadoop,
Spark, and Storm clusters by specifying several parameters such as the
framework version, cluster topology, node hardware details and more. After
a user fills in all the parameters, sahara deploys the cluster in a few
minutes. Sahara also provides the means to scale an already provisioned
cluster by adding or removing worker nodes on demand.

The solution will address the following use cases:

* fast provisioning of data processing clusters on OpenStack for
  development and quality assurance (QA).
* utilization of unused compute power from a general purpose OpenStack
  IaaS cloud.
* "Analytics as a Service" for ad-hoc or bursty analytic workloads
  (similar to AWS EMR).

Key features are:

* designed as an OpenStack component;
* managed through a REST API with a user interface (UI) available as part
  of the OpenStack Dashboard (a minimal API sketch follows this list);
* support for a variety of data processing frameworks:

  * multiple Hadoop vendor distributions;
  * Apache Spark and Storm;
  * pluggable system of Hadoop installation engines;
  * integration with vendor specific management tools, such as Apache
    Ambari and Cloudera Management Console;

* predefined configuration templates with the ability to modify parameters.
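
The feature list above mentions the REST API; as a minimal, hedged
illustration, the sketch below lists the provisioning plugins known to a
sahara endpoint. The endpoint URL (including the default port 8386),
project ID and token are placeholders, and the resource paths should be
checked against the API reference for the deployed release.

.. code-block:: python

    import requests

    # Illustrative values; a real deployment supplies its own endpoint,
    # project ID and token (obtained from the Identity service).
    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    # List the available provisioning plugins (vanilla, spark, storm, ...).
    resp = requests.get(SAHARA_URL + "/plugins", headers=HEADERS)
    for plugin in resp.json()["plugins"]:
        print(plugin["name"], plugin.get("versions"))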

Details
-------

The sahara product communicates with the following OpenStack services:

* Dashboard (horizon) - provides a GUI with the ability to use all of
  sahara's features.
* Identity (keystone) - authenticates users and provides security tokens
  that are used to work with OpenStack, limiting a user's abilities in
  sahara to their OpenStack privileges (see the sketch after this list).
* Compute (nova) - used to provision VMs for data processing clusters.
* Orchestration (heat) - used to provision and orchestrate the deployment
  of data processing clusters.
* Image (glance) - stores VM images, each image containing an operating
  system and a pre-installed data processing distribution or framework.
* Object Storage (swift) - can be used as storage for job binaries and
  data that will be processed or created by framework jobs.
* Block Storage (cinder) - can be used to provision block storage for VM
  instances.
* Networking (neutron) - provides networking services to data processing
  clusters.
* Telemetry (ceilometer) - used to collect measures of cluster usage for
  metering and monitoring purposes.
* Shared file systems (manila) - can be used for storage of framework job
  binaries and data that will be processed or created by jobs.
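
Every call to sahara is authorized by the Identity service: a client first
obtains a token from keystone and then sends it with each request. The
sketch below uses the ``keystoneauth1`` library; the URL and credentials
are placeholders.

.. code-block:: python

    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    # Authenticate against the Identity service (keystone).
    auth = v3.Password(
        auth_url="http://controller:5000/v3",  # placeholder URL
        username="demo",
        password="secret",
        project_name="demo",
        user_domain_id="default",
        project_domain_id="default",
    )
    sess = session.Session(auth=auth)

    # The resulting token scopes every sahara call to the user's project,
    # so users can only see and manage what their privileges allow.
    token = sess.get_token()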

.. image:: images/openstack-interop.png
   :width: 800 px

General Workflow
----------------

Sahara will provide two levels of abstraction for the API and UI based on
the addressed use cases: cluster provisioning and analytics as a service.

For fast cluster provisioning a generic workflow will be as follows (a
short provisioning sketch appears after the list):

* select a Hadoop (or framework) version.
* select a base image with or without a pre-installed data processing
  framework:

  * for base images without a pre-installed framework, sahara will support
    pluggable deployment engines that integrate with vendor tooling.
  * you can download prepared up-to-date images from
    http://sahara-files.mirantis.com/images/upstream/liberty/

* define cluster configuration, including cluster size, topology, and
  framework parameters (for example, heap size):

  * to ease the configuration of such parameters, configurable templates
    are provided.

* provision the cluster; sahara will provision VMs, install and configure
  the data processing framework.
* perform operations on the cluster; add or remove nodes.
* terminate the cluster when it is no longer needed.
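
As a hedged sketch of the last three steps, the calls below follow the
v1.1 REST API as commonly documented (the ``resize_node_groups`` body in
particular is illustrative); the endpoint, token, and identifiers are all
placeholders.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}
    cluster_id = "<cluster-id>"  # returned when the cluster is provisioned

    # Scale the running cluster: resize the "worker" node group to five
    # instances (worker nodes are added or removed on demand).
    requests.put(
        SAHARA_URL + "/clusters/" + cluster_id,
        headers=HEADERS,
        json={"resize_node_groups": [{"name": "worker", "count": 5}]},
    )

    # Terminate the cluster when it is no longer needed.
    requests.delete(SAHARA_URL + "/clusters/" + cluster_id, headers=HEADERS)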

For analytics as a service, a generic workflow will be as follows (a job
submission sketch appears after the list):

* select one of the predefined data processing framework versions.
* configure a job:

  * choose the type of job: pig, hive, jar-file, etc.
  * provide the job script source or jar location.
  * select input and output data location.

* set the limit for the cluster size.
* execute the job:

  * all cluster provisioning and job execution will happen transparently
    to the user.
  * the cluster will be removed automatically after job completion.

* get the results of computations (for example, from swift).
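
A hedged sketch of the job execution step is shown below. The
``/jobs/<job-id>/execute`` resource and its body follow the v1.1 Elastic
Data Processing (EDP) API as commonly documented; all identifiers are
placeholders.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    # Run a previously registered job (for example a Pig script) against
    # registered input and output data sources. In the analytics as a
    # service scenario the referenced cluster is transient: it is
    # provisioned for the job and removed after completion.
    requests.post(
        SAHARA_URL + "/jobs/<job-id>/execute",
        headers=HEADERS,
        json={
            "cluster_id": "<cluster-id>",
            "input_id": "<input-data-source-id>",
            "output_id": "<output-data-source-id>",
            "job_configs": {"configs": {}, "args": []},
        },
    )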

User's Perspective
------------------

While provisioning clusters through sahara, the user operates on three
types of entities: Node Group Templates, Cluster Templates and Clusters.

A Node Group Template describes a group of nodes within a cluster. It
contains a list of Hadoop processes that will be launched on each instance
in a group. A Node Group Template may also provide node scoped
configurations for those processes. This kind of template encapsulates
hardware parameters (flavor) for the node VM and configuration for data
processing framework processes running on the node.
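
A minimal sketch of creating such a template through the REST API might
look as follows. The field names (``node_processes``, ``flavor_id``,
``node_configs``) follow the v1.1 API as commonly documented, and the
configuration values are purely illustrative.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    # A worker node group: each instance runs the TaskTracker and DataNode
    # processes and is booted with the given nova flavor.
    worker_template = {
        "name": "worker",
        "plugin_name": "vanilla",
        "hadoop_version": "1.2.1",
        "flavor_id": "<flavor-id>",
        "node_processes": ["tasktracker", "datanode"],
        # Node scoped framework configuration (names are illustrative).
        "node_configs": {"MapReduce": {"Task Tracker Heap Size": 1024}},
    }
    requests.post(
        SAHARA_URL + "/node-group-templates",
        headers=HEADERS,
        json=worker_template,
    )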

A Cluster Template is designed to bring Node Group Templates together to
form a Cluster. A Cluster Template defines what Node Groups will be
included and how many instances will be created in each. Some data
processing framework configurations cannot be applied to a single node,
but only to a whole Cluster; a user can specify these kinds of
configurations in a Cluster Template. Sahara enables users to specify
which processes should be added to an anti-affinity group within a
Cluster Template. If a process is included in an anti-affinity group, the
VMs where this process is going to be launched will be scheduled to
different hardware hosts.
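
For illustration, a Cluster Template payload could combine node groups
with an anti-affinity list as below; this is a sketch, and the exact
schema should be verified against the API reference.

.. code-block:: python

    # One master VM plus three workers; the "datanode" process is placed
    # in an anti-affinity group, asking the scheduler to keep the DataNode
    # VMs on different hardware hosts. POSTed to /cluster-templates.
    cluster_template = {
        "name": "vanilla-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "1.2.1",
        "node_groups": [
            {"name": "master",
             "node_group_template_id": "<master-template-id>",
             "count": 1},
            {"name": "worker",
             "node_group_template_id": "<worker-template-id>",
             "count": 3},
        ],
        "anti_affinity": ["datanode"],
        # Cluster wide framework configuration goes here.
        "cluster_configs": {},
    }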

The Cluster entity represents a collection of VM instances that all have
the same data processing framework installed. It is mainly characterized
by a VM image with a pre-installed framework which will be used for
cluster deployment. Users may choose one of the pre-configured Cluster
Templates to start a Cluster. To get access to VMs after a Cluster has
started, the user should specify a keypair.
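
The image and keypair are supplied when the Cluster itself is launched. A
hedged sketch of such a request follows; ``default_image_id`` and
``user_keypair_id`` are the field names as commonly documented for the
v1.1 API, and all identifiers are placeholders.

.. code-block:: python

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"  # placeholder
    HEADERS = {"X-Auth-Token": "<keystone-token>"}

    requests.post(
        SAHARA_URL + "/clusters",
        headers=HEADERS,
        json={
            "name": "my-cluster",
            "plugin_name": "vanilla",
            "hadoop_version": "1.2.1",
            "cluster_template_id": "<cluster-template-id>",
            # Image with the pre-installed framework, registered in glance.
            "default_image_id": "<image-id>",
            # Nova keypair used to get SSH access to the cluster VMs.
            "user_keypair_id": "my-keypair",
        },
    )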

Sahara provides several constraints on cluster framework topology. The
JobTracker and NameNode processes can run either on a single VM or on two
separate VMs. A cluster can also contain worker nodes of different types:
worker nodes can run both the TaskTracker and DataNode processes, or
either of these processes alone. Sahara allows a user to create a cluster
with any combination of these options, but it will not allow the creation
of a non-working topology (for example: a set of workers with DataNodes,
but without a NameNode).

Each Cluster belongs to an Identity service project determined by the
user. Users have access only to objects located in projects they have
access to. Users can edit and delete only objects they have created or
that exist in their project. Naturally, admin users have full access to
every object. In this manner, sahara complies with the general OpenStack
access policy.

Integration with Object Storage
-------------------------------

The swift project provides the standard Object Storage service for
OpenStack environments; it is an analog of the Amazon S3 service. As a
rule it is deployed on bare metal machines. It is natural to expect data
processing on OpenStack to access data stored there. Sahara provides this
option with a file system implementation for swift
`HADOOP-8545 <https://issues.apache.org/jira/browse/HADOOP-8545>`_ and
`Change I6b1ba25b <https://review.openstack.org/#/c/21015/>`_, which
implements the ability to list endpoints for an object, account or
container. This makes it possible to integrate swift with software that
relies on data locality information to avoid network overhead.
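
When swift support is enabled, data locations use the ``swift://`` scheme
and jobs are given credentials for the swift file system. The property
names below follow the hadoop-swift integration described in the sahara
documentation; the values are placeholders.

.. code-block:: python

    # A data location using the swift file system implementation: the
    # object "input" inside the container "demo-container".
    input_url = "swift://demo-container.sahara/input"

    # Illustrative Hadoop configuration passed along with a job so that
    # the framework can authenticate against swift.
    swift_configs = {
        "fs.swift.service.sahara.username": "demo",
        "fs.swift.service.sahara.password": "secret",
    }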

To get more information on how to enable swift support, see
:doc:`userdoc/hadoop-swift`.

Pluggable Deployment and Monitoring
-----------------------------------

In addition to the monitoring capabilities provided by vendor-specific
Hadoop management tooling, sahara provides pluggable integration with
external monitoring systems such as Nagios or Zabbix.

Both deployment and monitoring tools can be installed on stand-alone VMs,
thus allowing a single instance to manage and monitor several clusters at
once.