Minor updates to edp documentation
Partial-bug: 1490687
Change-Id: Ibf25881f2abecc9660445545be353a4683ed248f
parent 649e210b60
commit 08b3b503e6
@@ -4,15 +4,21 @@ Elastic Data Processing (EDP)

Overview
--------

Sahara's Elastic Data Processing facility or :dfn:`EDP` allows the execution
of jobs on clusters created from sahara. EDP supports:

* Hive, Pig, MapReduce, MapReduce.Streaming, Java, and Shell job types on
  Hadoop clusters
* Spark jobs on Spark standalone clusters, MapR spark clusters, and CDH
  clusters (v5.3.0 or higher)
* storage of job binaries in the OpenStack Object Storage service (swift),
  the OpenStack Shared file systems service (manila), or sahara's own
  database
* access to input and output data sources in

  + HDFS for all job types
  + swift for all types excluding Hive
  + manila (NFS shares only) for all types excluding Pig

* configuration of jobs at submission time
* execution of jobs on existing clusters or transient clusters
@@ -20,21 +26,24 @@ Sahara's Elastic Data Processing facility or :dfn:`EDP` allows the execution of

Interfaces
----------

The EDP features can be used from the sahara web UI which is described in the
:doc:`../horizon/dashboard.user.guide`.

The EDP features also can be used directly by a client through the
`REST api <http://developer.openstack.org/api-ref-data-processing-v1.1.html>`_.
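
A minimal sketch of calling the REST api directly is shown below. The endpoint
host, port, project id, and token handling are illustrative assumptions rather
than part of this guide; consult the API reference linked above for the
authoritative request formats.

.. code-block:: python

    import requests

    # Illustrative values: substitute your sahara endpoint, project id, and
    # a valid keystone token.
    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"
    HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

    # List the Job objects (job templates) known to sahara.
    resp = requests.get(SAHARA_URL + "/jobs", headers=HEADERS)
    resp.raise_for_status()
    for job in resp.json().get("jobs", []):
        print(job["id"], job["name"], job["type"])
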
EDP Concepts
------------

Sahara EDP uses a collection of simple objects to define and execute jobs.
These objects are stored in the sahara database when they are created,
allowing them to be reused. This modular approach with database persistence
allows code and data to be reused across multiple jobs.

The essential components of a job are:

* executable code to run
* input and output data paths, as needed for the job
* any additional configuration values needed for the job run

These components are supplied through the objects described below.
@@ -42,20 +51,37 @@ These components are supplied through the objects described below.

Job Binaries
++++++++++++

A :dfn:`Job Binary` object stores a URL to a single script or Jar file and
any credentials needed to retrieve the file. The file itself may be stored
in the sahara internal database, in swift, or in manila.

Files in the sahara database are stored as raw bytes in a
:dfn:`Job Binary Internal` object. This object's sole purpose is to store a
file for later retrieval. No extra credentials need to be supplied for files
stored internally.

Sahara requires credentials (username and password) to access files stored in
swift unless swift proxy users are configured as described in
:doc:`../userdoc/advanced.configuration.guide`. The swift service must be
running in the same OpenStack installation referenced by sahara.
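
The sketch below creates a Job Binary that points at a script stored in swift
and supplies the retrieval credentials. The request body field names
(``extra`` holding ``user`` and ``password``) are assumptions about the v1.1
API schema, not definitive; check the REST api reference before relying on
them.

.. code-block:: python

    import json

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"   # illustrative
    HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

    # Hypothetical job binary referencing a script in swift; the credentials
    # allow sahara to retrieve the file when the job is launched.
    body = {
        "name": "my-pig-script",
        "url": "swift://my-container/scripts/example.pig",
        "description": "example job binary",
        "extra": {"user": "demo", "password": "secret"},
    }
    resp = requests.post(SAHARA_URL + "/job-binaries",
                         headers=HEADERS, data=json.dumps(body))
    resp.raise_for_status()
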
To reference a binary file stored in manila, create the job binary with the
URL ``manila://{share_id}/{path}``. This assumes that you have already stored
that file in the appropriate path on the share. The share will be
automatically mounted to any cluster nodes which require access to the file,
if it is not mounted already.

There is a configurable limit on the size of a single job binary that may be
retrieved by sahara. This limit is 5MB and may be set with the
*job_binary_max_KB* setting in the :file:`sahara.conf` configuration file.

Jobs
++++

A :dfn:`Job` object specifies the type of the job and lists all of the
individual Job Binary objects that are required for execution. An individual
Job Binary may be referenced by multiple Jobs. A Job object specifies a main
binary and/or supporting libraries depending on its type:

+-------------------------+-------------+-----------+
| Job type                | Main binary | Libraries |
@@ -79,56 +105,92 @@ A :dfn:`Job` object specifies the type of the job and lists all of the individua

Data Sources
++++++++++++

A :dfn:`Data Source` object stores a URL which designates the location of
input or output data and any credentials needed to access the location.

Sahara supports data sources in swift. The swift service must be running in
the same OpenStack installation referenced by sahara.

Sahara also supports data sources in HDFS. Any HDFS instance running on a
sahara cluster in the same OpenStack installation is accessible without
manual configuration. Other instances of HDFS may be used as well provided
that the URL is resolvable from the node executing the job.

Sahara supports data sources in manila as well. To reference a path on an NFS
share as a data source, create the data source with the URL
``manila://{share_id}/{path}``. As in the case of job binaries, the specified
share will be automatically mounted to your cluster's nodes as needed to
access the data source.
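
The sketch below creates two data source objects over the REST api, a swift
input and a manila output. The body field names (``type``, ``url``, and a
``credentials`` map for swift) and the response shape are assumptions drawn
from common v1.1 usage; verify them against the API reference.

.. code-block:: python

    import json

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"   # illustrative
    HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

    def create_data_source(body):
        # Hypothetical helper wrapping the data source creation call.
        resp = requests.post(SAHARA_URL + "/data-sources",
                             headers=HEADERS, data=json.dumps(body))
        resp.raise_for_status()
        return resp.json()["data_source"]["id"]

    input_id = create_data_source({
        "name": "my_input",
        "type": "swift",
        "url": "swift://my-container/input",
        "credentials": {"user": "demo", "password": "secret"},
    })
    output_id = create_data_source({
        "name": "my_output",
        "type": "manila",
        "url": "manila://<share-id>/output",
    })
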
Some job types require the use of data source objects to specify input and
output when a job is launched. For example, when running a Pig job the UI will
prompt the user for input and output data source objects.

Other job types like Java or Spark do not require the user to specify data
sources. For these job types, data paths are passed as arguments. For
convenience, sahara allows data source objects to be referenced by name or id.
The section `Using Data Source References as Arguments`_ gives further
details.


Job Execution
+++++++++++++

Job objects must be *launched* or *executed* in order for them to run on the
cluster. During job launch, a user specifies execution details including data
sources, configuration values, and program arguments. The relevant details
will vary by job type. The launch will create a :dfn:`Job Execution` object in
sahara which is used to monitor and manage the job.

To execute Hadoop jobs, sahara generates an Oozie workflow and submits it to
the Oozie server running on the cluster. Familiarity with Oozie is not
necessary for using sahara but it may be beneficial to the user. A link to
the Oozie web console can be found in the sahara web UI in the cluster
details.

For Spark jobs, sahara uses the *spark-submit* shell script and executes the
Spark job from the master node. Logs of spark jobs run by sahara can be found
on the master node under the */tmp/spark-edp* directory.

.. _edp_workflow:

General Workflow
----------------

The general workflow for defining and executing a job in sahara is essentially
the same whether using the web UI or the REST API.

1. Launch a cluster from sahara if there is not one already available
2. Create all of the Job Binaries needed to run the job, stored in the sahara
   database, in swift, or in manila

   + When using the REST API and internal storage of job binaries, the Job
     Binary Internal objects must be created first
   + Once the Job Binary Internal objects are created, Job Binary objects may
     be created which refer to them by URL

3. Create a Job object which references the Job Binaries created in step 2
4. Create an input Data Source which points to the data you wish to process
5. Create an output Data Source which points to the location for output data
6. Create a Job Execution object specifying the cluster and Job object plus
   relevant data sources, configuration values, and program arguments

   + When using the web UI this is done with the
     :guilabel:`Launch On Existing Cluster` or
     :guilabel:`Launch on New Cluster` buttons on the Jobs tab
   + When using the REST API this is done via the */jobs/<job_id>/execute*
     method, as sketched in the example after this list
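
The sketch below shows step 6 performed against the REST api directly. The
*/jobs/<job_id>/execute* path comes from this guide; the body field names
(``cluster_id``, ``input_id``, ``output_id``, ``job_configs``) are assumptions
to be checked against the API reference.

.. code-block:: python

    import json

    import requests

    SAHARA_URL = "http://controller:8386/v1.1/<project-id>"   # illustrative
    HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

    body = {
        "cluster_id": "<cluster-uuid>",
        "input_id": "<input-data-source-uuid>",
        "output_id": "<output-data-source-uuid>",
        "job_configs": {
            # mapred.reduce.tasks is an ordinary Hadoop configuration value
            "configs": {"mapred.reduce.tasks": "2"},
            "args": [],
        },
    }
    resp = requests.post(SAHARA_URL + "/jobs/<job-uuid>/execute",
                         headers=HEADERS, data=json.dumps(body))
    resp.raise_for_status()
    print(resp.json()["job_execution"]["id"])
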
The workflow is simpler when using existing objects. For example, to
construct a new job which uses existing binaries and input data a user may
only need to perform steps 3, 5, and 6 above. Of course, to repeat the same
job multiple times a user would need only step 6.

Specifying Configuration Values, Parameters, and Arguments
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Jobs can be configured at launch. The job type determines the kinds of values
that may be set:

+--------------------------+---------------+------------+-----------+
| Job type                 | Configuration | Parameters | Arguments |
@@ -151,57 +213,78 @@ Jobs can be configured at launch. The job type determines the kinds of values th

* :dfn:`Configuration values` are key/value pairs.

  + The EDP configuration values have names beginning with *edp.* and are
    consumed by sahara
  + Other configuration values may be read at runtime by Hadoop jobs
  + Currently additional configuration values are not available to Spark jobs
    at runtime

* :dfn:`Parameters` are key/value pairs. They supply values for the Hive and
  Pig parameter substitution mechanisms. In Shell jobs, they are passed as
  environment variables.
* :dfn:`Arguments` are strings passed as command line arguments to a shell or
  main program

These values can be set on the :guilabel:`Configure` tab during job launch
through the web UI or through the *job_configs* parameter when using the
*/jobs/<job_id>/execute* REST method.
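
A hedged sketch of a *job_configs* value follows, showing how configuration
values, parameters, and arguments are grouped. The top-level keys ``configs``,
``params``, and ``args`` are assumptions consistent with how this guide refers
to them.

.. code-block:: python

    # Illustrative job_configs payload for a job launch; keys and values are
    # examples only.
    job_configs = {
        # key/value configuration values (edp.* values are read by sahara,
        # others may be read by the Hadoop job at runtime)
        "configs": {"mapred.map.tasks": "4"},
        # key/value parameters for Hive/Pig substitution or Shell env vars
        "params": {"INPUT_DATE": "2015-08-01"},
        # positional command line arguments for a shell or main program
        "args": ["first-arg", "second-arg"],
    }
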
In some cases sahara generates configuration values or parameters
automatically. Values set explicitly by the user during launch will override
those generated by sahara.

Using Data Source References as Arguments
+++++++++++++++++++++++++++++++++++++++++

Sometimes it's necessary or desirable to pass a data path as an argument to a
job. In these cases, a user may simply type out the path as an argument when
launching a job. If the path requires credentials, the user can manually add
the credentials as configuration values. However, if a data source object has
been created that contains the desired path and credentials there is no need
to specify this information manually.

As a convenience, sahara allows data source objects to be referenced by name
or id in arguments, configuration values, or parameters. When the job is
executed, sahara will replace the reference with the path stored in the data
source object and will add any necessary credentials to the job configuration.
Referencing an existing data source object is much faster than adding this
information by hand. This is particularly useful for job types like Java or
Spark that do not use data source objects directly.

There are two job configuration parameters that enable data source references.
They may be used with any job type and are set on the ``Configuration`` tab
when the job is launched:

* ``edp.substitute_data_source_for_name`` (default **False**) If set to
  **True**, causes sahara to look for data source object name references in
  configuration values, arguments, and parameters when a job is launched. Name
  references have the form **datasource://name_of_the_object**.

  For example, assume a user has a WordCount application that takes an input
  path as an argument. If there is a data source object named **my_input**, a
  user may simply set the **edp.substitute_data_source_for_name**
  configuration parameter to **True** and add **datasource://my_input** as an
  argument when launching the job (see the sketch after this list).

* ``edp.substitute_data_source_for_uuid`` (default **False**) If set to
  **True**, causes sahara to look for data source object ids in configuration
  values, arguments, and parameters when a job is launched. A data source
  object id is a uuid, so they are unique. The id of a data source object is
  available through the UI or the sahara command line client. A user may
  simply use the id as a value.
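
The sketch below expresses the WordCount example above as a *job_configs*
value; the key names are assumptions as noted earlier, while the
**edp.substitute_data_source_for_name** flag and the **datasource://** form
come from this section.

.. code-block:: python

    # Launch-time configuration for the hypothetical WordCount job: the
    # datasource:// references are replaced by sahara with the paths (and
    # credentials) stored in the named data source objects.
    job_configs = {
        "configs": {"edp.substitute_data_source_for_name": True},
        "args": ["datasource://my_input", "datasource://my_output"],
    }
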
Generation of Swift Properties for Data Sources
+++++++++++++++++++++++++++++++++++++++++++++++

If swift proxy users are not configured (see
:doc:`../userdoc/advanced.configuration.guide`) and a job is run with data
source objects containing swift paths, sahara will automatically generate
swift username and password configuration values based on the credentials
in the data sources. If the input and output data sources are both in swift,
it is expected that they specify the same credentials.

The swift credentials may be set explicitly with the following configuration
values:

+------------------------------------+
| Name                               |
@@ -211,36 +294,40 @@ The Swift credentials may be set explicitly with the following configuration val
| fs.swift.service.sahara.password   |
+------------------------------------+

Setting the swift credentials explicitly is required when passing literal
swift paths as arguments instead of using data source references. When
possible, use data source references as described in
`Using Data Source References as Arguments`_.
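
A hedged example of setting the swift credentials explicitly when a literal
swift path is passed as an argument; the usernames, passwords, and container
names are illustrative, and the ``configs``/``args`` key names are assumptions
as before.

.. code-block:: python

    # Explicit swift credentials plus a literal swift path argument; prefer
    # data source references where possible, as noted above.
    job_configs = {
        "configs": {
            "fs.swift.service.sahara.username": "demo",
            "fs.swift.service.sahara.password": "secret",
        },
        "args": ["swift://my-container/input"],
    }
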
Additional Details for Hive jobs
++++++++++++++++++++++++++++++++

Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT``
parameters required by Hive based on the specified data sources.

Additional Details for Pig jobs
+++++++++++++++++++++++++++++++

Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT``
parameters required by Pig based on the specified data sources.

For Pig jobs, ``arguments`` should be thought of as command line arguments
separated by spaces and passed to the ``pig`` shell.

``Parameters`` are a shorthand and are actually translated to the arguments
``-param name=value``.
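
As an illustrative sketch (key names assumed as before), the launch values
below pass one extra argument to the ``pig`` shell and one parameter that is
translated to ``-param CURRENCY=USD``; the script name and values are
hypothetical.

.. code-block:: python

    # Illustrative launch values for a Pig job type.
    job_configs = {
        # becomes the pig argument: -param CURRENCY=USD
        "params": {"CURRENCY": "USD"},
        # passed verbatim on the pig command line
        "args": ["-stop_on_failure"],
    }
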
Additional Details for MapReduce jobs
+++++++++++++++++++++++++++++++++++++

**Important!**

If the job type is MapReduce, the mapper and reducer classes *must* be
specified as configuration values. Note that the UI will not prompt the user
for these required values; they must be added manually with the ``Configure``
tab.

Make sure to add these values with the correct names:

+-------------------------+-----------------------------------------+
@@ -256,12 +343,13 @@ Additional Details for MapReduce.Streaming jobs

**Important!**

If the job type is MapReduce.Streaming, the streaming mapper and reducer
classes *must* be specified.

In this case, the UI *will* prompt the user to enter mapper and reducer
values on the form and will take care of adding them to the job configuration
with the appropriate names. If using the python client, however, be certain to
add these values to the job configuration manually with the correct names:

+-------------------------+---------------+
| Name                    | Example Value |
@@ -276,9 +364,9 @@ Additional Details for Java jobs

Data Source objects are not used directly with Java job types. Instead, any
input or output paths must be specified as arguments at job launch either
explicitly or by reference as described in
`Using Data Source References as Arguments`_. Using data source references is
the recommended way to pass paths to Java jobs.

If configuration values are specified, they must be added to the job's
Hadoop configuration at runtime. There are two methods of doing this. The
@@ -287,7 +375,8 @@ below. The other method is to use the code from
`this example <https://github.com/openstack/sahara/blob/master/etc/edp-examples/edp-java/README.rst>`_
to explicitly load the values.

The following special configuration values are read by sahara and affect how
Java jobs are run:

* ``edp.java.main_class`` (required) Specifies the full name of the class
  containing ``main(String[] args)``
@@ -301,11 +390,11 @@ The following special configuration values are read by Sahara and affect how Jav

* ``edp.java.java_opts`` (optional) Specifies configuration values for the JVM

* ``edp.java.adapt_for_oozie`` (optional) Specifies that sahara should perform
  special handling of configuration values and exit conditions. The default is
  **False**.

  If this configuration value is set to **True**, sahara will modify
  the job's Hadoop configuration before invoking the specified **main** method.
  Any configuration values specified during job launch (excluding those
  beginning with **edp.**) will be automatically set in the job's Hadoop
@@ -318,36 +407,40 @@ At this time, the following special configuration value only applies when
running jobs on a cluster generated by the Cloudera plugin with the
**Enable Hbase Common Lib** cluster config set to **True** (the default value):

* ``edp.hbase_common_lib`` (optional) Specifies that a common Hbase lib
  generated by sahara in HDFS be added to the **oozie.libpath**. This is for
  use when an Hbase application is driven from a Java job. Default is
  **False**.

The **edp-wordcount** example bundled with sahara shows how to use configuration
values, arguments, and swift data paths in a Java job type. Note that the
example does not use the **edp.java.adapt_for_oozie** option but includes the
code to load the configuration values explicitly.
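
Pulling the Java-specific values from this section together, a launch
configuration might look like the sketch below. The main class and arguments
are illustrative, and the ``configs``/``args`` key names are assumptions as
noted earlier.

.. code-block:: python

    # Illustrative launch values for a Java job type.
    job_configs = {
        "configs": {
            "edp.java.main_class": "org.example.WordCount",  # hypothetical class
            "edp.java.java_opts": "-Xmx512m",
            "edp.java.adapt_for_oozie": True,
            "edp.substitute_data_source_for_name": True,
        },
        "args": ["datasource://my_input", "datasource://my_output"],
    }
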
Additional Details for Shell jobs
+++++++++++++++++++++++++++++++++

A shell job will execute the script specified as ``main``, and will place any
files specified as ``libs`` in the same working directory (on both the
filesystem and in HDFS). Command line arguments may be passed to the script
through the ``args`` array, and any ``params`` values will be passed as
environment variables.
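
For instance, a launch configuration along the lines of the sketch below (key
names assumed as before) passes one command line argument to the script and
exposes one environment variable to it; the values are illustrative.

.. code-block:: python

    # Illustrative launch values for a Shell job type: args become command
    # line arguments, params become environment variables for the script.
    job_configs = {
        "args": ["/tmp/edp-shell-output.txt"],
        "params": {"EXTRA_MESSAGE": "hello from EDP"},
    }
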
Data Source objects are not used directly with Shell job types but data source
references may be used as described in
`Using Data Source References as Arguments`_.

The **edp-shell** example bundled with sahara contains a script which will
output the executing user to a file specified by the first command line
argument.

Additional Details for Spark jobs
+++++++++++++++++++++++++++++++++

Data Source objects are not used directly with Spark job types. Instead, any
input or output paths must be specified as arguments at job launch either
explicitly or by reference as described in
`Using Data Source References as Arguments`_. Using data source references
is the recommended way to pass paths to Spark jobs.

Spark jobs use some special configuration values:
@@ -357,29 +450,35 @@ Spark jobs use some special configuration values:
  + ``main(String[] args)`` for Java
  + ``main(args: Array[String])`` for Scala

  A Spark job will execute the **main** method of the specified main class.
  Any arguments set during job launch will be passed to the program through the
  **args** array.

* ``edp.spark.adapt_for_swift`` (optional) If set to **True**, instructs
  sahara to modify the job's Hadoop configuration so that swift paths may be
  accessed (see the sketch after this list). Without this configuration value,
  swift paths will not be accessible to Spark jobs. The default is **False**.
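
An illustrative Spark launch configuration combining these values is sketched
below. The key names are assumptions as before, the main class is
hypothetical, and the main class key is assumed to be the same
``edp.java.main_class`` value described for Java jobs above.

.. code-block:: python

    # Illustrative launch values for a Spark job type.
    job_configs = {
        "configs": {
            "edp.java.main_class": "org.example.SparkWordCount",  # hypothetical
            "edp.spark.adapt_for_swift": True,
        },
        "args": ["swift://my-container/input", "swift://my-container/output"],
    }
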
The **edp-spark** example bundled with sahara contains a Spark program for
estimating Pi.


Special Sahara URLs
--------------------

Sahara uses custom URLs to refer to objects stored in swift, in manila, or in
the sahara internal database. These URLs are not meant to be used outside of
sahara.

Sahara swift URLs passed to running jobs as input or output sources include a
".sahara" suffix on the container, for example:

``swift://container.sahara/object``

You may notice these swift URLs in job logs; however, you do not need to add
the suffix to the containers yourself. Sahara will add the suffix if
necessary, so when using the UI or the python client you may write the above
URL simply as:

``swift://container/object``
@@ -387,36 +486,47 @@ Sahara internal database URLs have the form:

``internal-db://sahara-generated-uuid``

This indicates a file object in the sahara database which has the given uuid
as a key.

Manila NFS filesystem reference URLs take the form:

``manila://share-uuid/path``

This format should be used when referring to a job binary or a data source
stored in a manila NFS share.


EDP Requirements
================

The OpenStack installation and the cluster launched from sahara must meet the
following minimum requirements in order for EDP to function:

OpenStack Services
------------------

When a Hadoop job is executed, binaries are first uploaded to a cluster node
and then moved from the node local filesystem to HDFS. Therefore, there must
be an instance of HDFS available to the nodes in the sahara cluster.

If the swift service *is not* running in the OpenStack installation:

+ Job binaries may only be stored in the sahara internal database
+ Data sources require a long-running HDFS

If the swift service *is* running in the OpenStack installation:

+ Job binaries may be stored in swift or the sahara internal database
+ Data sources may be in swift or a long-running HDFS


Cluster Processes
-----------------

Requirements for EDP support depend on the EDP job type and plugin used for
the cluster. For example, a Vanilla sahara cluster must run at least one
instance of these processes to support EDP:

* For Hadoop version 1:
@@ -445,29 +555,29 @@ to work properly. They are listed on this page.

Transient Clusters
------------------

EDP allows running jobs on transient clusters. In this case the cluster is
created specifically for the job and is shut down automatically once the job
is finished.

Two config parameters control the behaviour of periodic clusters:

* periodic_enable - if set to 'False', sahara will do nothing to a transient
  cluster once the job it was created for is completed. If it is set to
  'True', then the behaviour depends on the value of the next parameter.
* use_identity_api_v3 - set it to 'False' if your OpenStack installation
  does not provide keystone API v3. In that case sahara will not terminate
  unneeded clusters. Instead it will set their state to 'AwaitingTermination'
  meaning that they could be manually deleted by a user. If the parameter is
  set to 'True', sahara will itself terminate the cluster. The limitation is
  caused by lack of 'trusts' feature in Keystone API older than v3.

If both parameters are set to 'True', sahara works with transient clusters in
the following manner:

1. When a user requests that a job be executed on a transient cluster,
   sahara creates such a cluster.
2. Sahara drops the user's credentials once the cluster is created but
   prior to that it creates a trust allowing it to operate with the
   cluster instances in the future without user credentials.
3. Once a cluster is not needed, sahara terminates its instances using the
   stored trust. Sahara drops the trust after that.