Rationale
=========

Introduction
------------

Apache Hadoop is an industry-standard and widely adopted MapReduce
implementation, and it is one among a growing number of data processing
frameworks. The aim of this project is to enable users to easily provision
and manage clusters with Hadoop and other data processing frameworks on
OpenStack. It is worth mentioning that Amazon has provided Hadoop for
several years as the Amazon Elastic MapReduce (EMR) service.

Sahara aims to provide users with a simple means to provision Hadoop, Spark,
and Storm clusters by specifying several parameters such as the framework
version, cluster topology, hardware node details, and more. After a user
fills in all the parameters, sahara deploys the cluster in a few minutes.
Sahara also provides the means to scale an already provisioned cluster by
adding or removing worker nodes on demand.

The solution will address the following use cases:

* fast provisioning of data processing clusters on OpenStack for development
  and quality assurance (QA).
* utilization of unused compute power from a general purpose OpenStack IaaS
  cloud.
* "Analytics as a Service" for ad-hoc or bursty analytic workloads (similar
  to AWS EMR).

Key features are:

* designed as an OpenStack component;
* managed through a REST API with a user interface (UI) available as part of
  OpenStack Dashboard (see the sketch after this list);
* support for a variety of data processing frameworks:

  * multiple Hadoop vendor distributions
  * Apache Spark and Storm
  * pluggable system of Hadoop installation engines
  * integration with vendor-specific management tools, such as Apache
    Ambari and Cloudera Management Console

* predefined configuration templates with the ability to modify parameters.
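
For illustration, a minimal sketch of calling that REST API directly is
shown below. It assumes a sahara endpoint on the default port 8386 and a
token already obtained from the Identity service; the host, project ID,
and token values are placeholders.

.. code-block:: python

    import requests

    # Placeholder values: substitute a real sahara endpoint, project ID,
    # and keystone token for your deployment.
    SAHARA_URL = "http://sahara.example.com:8386/v1.1"
    PROJECT_ID = "0123456789abcdef0123456789abcdef"
    TOKEN = "<keystone-auth-token>"

    # Sahara calls are scoped to a project and authenticated with the
    # standard X-Auth-Token header.
    resp = requests.get("%s/%s/clusters" % (SAHARA_URL, PROJECT_ID),
                        headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()

    for cluster in resp.json()["clusters"]:
        print(cluster["name"], cluster["status"])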

Details
-------

The sahara product communicates with the following OpenStack services:

* Dashboard (horizon) - provides a GUI with the ability to use all of
  sahara’s features.
* Identity (keystone) - authenticates users and provides security tokens that
  are used to work with OpenStack, limiting a user's abilities in sahara to
  their OpenStack privileges (see the sketch after the diagram below).
* Compute (nova) - used to provision VMs for data processing clusters.
* Orchestration (heat) - used to provision and orchestrate the deployment of
  data processing clusters.
* Image (glance) - stores VM images, each image containing an operating
  system and a pre-installed data processing distribution or framework.
* Object Storage (swift) - can be used as storage for job binaries and data
  that will be processed or created by framework jobs.
* Block Storage (cinder) - can be used to provision block storage for VM
  instances.
* Networking (neutron) - provides networking services to data processing
  clusters.
* Telemetry (ceilometer) - used to collect measures of cluster usage for
  metering and monitoring purposes.
* Shared file systems (manila) - can be used for storage of framework job
  binaries and data that will be processed or created by jobs.

.. image:: images/openstack-interop.png
    :width: 800 px
    :scale: 99 %
    :align: left
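
As a rough sketch of how a client authenticates before talking to sahara,
the following assumes a keystone v3 endpoint and the python-saharaclient
library; all credential values are placeholders.

.. code-block:: python

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from saharaclient import client as sahara_client

    # Authenticate against the Identity service (keystone); every value
    # below is a placeholder for a real deployment.
    auth = v3.Password(auth_url="http://controller:5000/v3",
                       username="demo",
                       password="secret",
                       project_name="demo",
                       user_domain_id="default",
                       project_domain_id="default")
    sess = session.Session(auth=auth)

    # The session carries the security token; sahara's endpoint is
    # discovered from the service catalog.
    sahara = sahara_client.Client("1.1", session=sess)
    print([cluster.name for cluster in sahara.clusters.list()])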

General Workflow
----------------

Sahara will provide two levels of abstraction for the API and UI based on the
addressed use cases: cluster provisioning and analytics as a service.

For fast cluster provisioning, a generic workflow will be as follows
(a sketch using the sahara client appears after this list):

* select a Hadoop (or framework) version.
* select a base image with or without a pre-installed data processing
  framework:

  * for base images without a pre-installed framework, sahara will support
    pluggable deployment engines that integrate with vendor tooling.
  * you can download prepared up-to-date images from
    http://sahara-files.mirantis.com/images/upstream/liberty/

* define the cluster configuration, including cluster size, topology, and
  framework parameters (for example, heap size):

  * to ease the configuration of such parameters, configurable templates
    are provided.

* provision the cluster; sahara will provision VMs and install and configure
  the data processing framework.
* perform operations on the cluster; add or remove nodes.
* terminate the cluster when it is no longer needed.
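
A minimal sketch of this workflow with python-saharaclient might look as
follows; it reuses the authenticated ``sahara`` client from the earlier
sketch, and the plugin name, framework version, flavor, image, and process
names are illustrative placeholders.

.. code-block:: python

    # Describe the node groups: hardware flavor plus the framework
    # processes each instance will run.
    master = sahara.node_group_templates.create(
        name="master", plugin_name="vanilla", hadoop_version="2.7.1",
        flavor_id="2", node_processes=["namenode", "resourcemanager"])
    worker = sahara.node_group_templates.create(
        name="worker", plugin_name="vanilla", hadoop_version="2.7.1",
        flavor_id="2", node_processes=["datanode", "nodemanager"])

    # Combine the node groups into a cluster template, with one master
    # and three workers.
    template = sahara.cluster_templates.create(
        name="my-template", plugin_name="vanilla", hadoop_version="2.7.1",
        node_groups=[
            {"name": "master", "node_group_template_id": master.id,
             "count": 1},
            {"name": "worker", "node_group_template_id": worker.id,
             "count": 3},
        ])

    # Provision the cluster from the template; sahara creates the VMs
    # and installs and configures the framework.
    cluster = sahara.clusters.create(
        name="my-cluster", plugin_name="vanilla", hadoop_version="2.7.1",
        cluster_template_id=template.id,
        default_image_id="<registered-image-id>",
        user_keypair_id="my-keypair")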

For analytics as a service, a generic workflow will be as follows
(a job-execution sketch appears after this list):

* select one of the predefined data processing framework versions.
* configure a job:

  * choose the type of job: pig, hive, jar-file, etc.
  * provide the job script source or jar location.
  * select input and output data locations.

* set the limit for the cluster size.
* execute the job:

  * all cluster provisioning and job execution will happen transparently
    to the user.
  * the cluster will be removed automatically after job completion.

* get the results of computations (for example, from swift).
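
The following is a rough sketch of such a job run through sahara's EDP
(Elastic Data Processing) facilities with python-saharaclient; the job
type, swift URLs, and credentials are placeholders, and ``cluster`` is the
cluster created in the earlier sketch (in the full analytics-as-a-service
flow, the cluster would be provisioned and removed for you).

.. code-block:: python

    # Register the job script stored in swift as a job binary.
    script = sahara.job_binaries.create(
        name="wordcount.pig",
        url="swift://scripts.sahara/wordcount.pig",
        extra={"user": "demo", "password": "secret"},
        description="example Pig script")

    # Define the job and its input/output data sources.
    job = sahara.jobs.create(name="wordcount", type="Pig",
                             mains=[script.id], libs=[],
                             description="example job")
    inp = sahara.data_sources.create(
        name="input", description="", data_source_type="swift",
        url="swift://data.sahara/input",
        credential_user="demo", credential_pass="secret")
    out = sahara.data_sources.create(
        name="output", description="", data_source_type="swift",
        url="swift://data.sahara/output",
        credential_user="demo", credential_pass="secret")

    # Launch the job on the cluster; results land in the output source.
    run = sahara.job_executions.create(
        job_id=job.id, cluster_id=cluster.id,
        input_id=inp.id, output_id=out.id, configs={})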

User's Perspective
------------------

While provisioning clusters through sahara, the user operates on three types
of entities: Node Group Templates, Cluster Templates, and Clusters.

A Node Group Template describes a group of nodes within a cluster. It
contains a list of framework processes that will be launched on each
instance in the group. A Node Group Template may also provide node-scoped
configurations for those processes. This kind of template encapsulates
hardware parameters (flavor) for the node VM and configuration for data
processing framework processes running on the node (a template sketch
follows).
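
As a concrete but illustrative example, the JSON body sent to sahara to
create such a template might look like the following (shown as a Python
dict); the configuration names are plugin specific and assumed here.

.. code-block:: python

    # A worker Node Group Template for a hypothetical vanilla Hadoop
    # deployment; the node_configs keys depend on the chosen plugin.
    worker_template = {
        "name": "worker",
        "plugin_name": "vanilla",            # provisioning plugin
        "hadoop_version": "2.7.1",           # framework version
        "flavor_id": "2",                    # hardware parameters (flavor)
        "node_processes": ["datanode",       # processes launched on each
                           "nodemanager"],   # instance in the group
        "node_configs": {                    # node-scoped configurations
            "HDFS": {"DataNode Heap Size": 1024},
        },
    }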

A Cluster Template is designed to bring Node Group Templates together to
form a Cluster. A Cluster Template defines what Node Groups will be included
and how many instances will be created in each. Some data processing
framework configurations cannot be applied to a single node, but rather to a
whole Cluster. A user can specify these kinds of configurations in a Cluster
Template. Sahara enables users to specify which processes should be added to
an anti-affinity group within a Cluster Template. If a process is included
in an anti-affinity group, the VMs where this process is going to be
launched should be scheduled to different hardware hosts (a cluster template
sketch follows).
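
Continuing the illustration, a Cluster Template that combines master and
worker Node Groups, with the DataNode process placed in an anti-affinity
group, might look like this; the IDs and counts are placeholders.

.. code-block:: python

    cluster_template = {
        "name": "vanilla-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "node_groups": [
            {"name": "master",
             "node_group_template_id": "<master-template-id>",
             "count": 1},
            {"name": "worker",
             "node_group_template_id": "<worker-template-id>",
             "count": 3},
        ],
        # VMs running the datanode process should be scheduled to
        # different hardware hosts.
        "anti_affinity": ["datanode"],
    }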

The Cluster entity represents a collection of VM instances that all have the
same data processing framework installed. It is mainly characterized by a VM
image with a pre-installed framework which will be used for cluster
deployment. Users may choose one of the pre-configured Cluster Templates to
start a Cluster. To get access to VMs after a Cluster has started, the user
should specify a keypair.

Sahara provides several constraints on cluster framework topology. The
JobTracker and NameNode processes can run either on a single VM or on two
separate VMs. A cluster can also contain worker nodes of different types:
worker nodes can run both TaskTracker and DataNode, or either of these
processes alone. Sahara allows a user to create a cluster with any
combination of these options, but it will not allow the creation of a
non-working topology (for example: a set of workers with DataNodes, but
without a NameNode).

Each Cluster belongs to an Identity service project determined by the user.
Users have access only to objects located in projects they have access to,
and they can edit or delete only objects they have created or that exist in
their project. Naturally, admin users have full access to every object. In
this manner, sahara complies with the general OpenStack access policy.

Integration with Object Storage
-------------------------------

The swift project provides the standard Object Storage service for OpenStack
environments; it is an analog of the Amazon S3 service. As a rule, it is
deployed on bare metal machines. It is natural to expect data processing on
OpenStack to access data stored there. Sahara provides this option with a
file system implementation for swift,
`HADOOP-8545 <https://issues.apache.org/jira/browse/HADOOP-8545>`_, and
`Change I6b1ba25b <https://review.openstack.org/#/c/21015/>`_, which
implements the ability to list endpoints for an object, account, or
container. This makes it possible to integrate swift with software that
relies on data locality information to avoid network overhead (a
configuration sketch follows).
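
To give a feel for what this integration configures under the hood, the
sketch below lists the kind of Hadoop properties involved, expressed as a
Python dict; the property names come from the hadoop-openstack module
referenced above, but the exact set and the auth URL format depend on the
release, so treat the values as assumptions.

.. code-block:: python

    # Hadoop-side configuration behind swift:// URLs; sahara normally
    # injects equivalent settings into the cluster for you.
    swift_conf = {
        # file system driver registered for the swift:// scheme
        "fs.swift.impl":
            "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
        # "sahara" is the provider name in swift://container.sahara/ URLs
        "fs.swift.service.sahara.auth.url":
            "http://controller:5000/v2.0/tokens",
        "fs.swift.service.sahara.tenant": "demo",
        "fs.swift.service.sahara.username": "demo",
        "fs.swift.service.sahara.password": "secret",
    }

    # With these properties in place, jobs can address objects directly:
    input_path = "swift://mycontainer.sahara/input/data.txt"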

To get more information on how to enable swift support, see
:doc:`userdoc/hadoop-swift`.

Pluggable Deployment and Monitoring
-----------------------------------

In addition to the monitoring capabilities provided by vendor-specific
Hadoop management tooling, sahara provides pluggable integration with
external monitoring systems such as Nagios or Zabbix.

Both deployment and monitoring tools can be installed on stand-alone VMs,
thus allowing a single instance to manage and monitor several clusters at
once.