Updates to the EDP doc

Made general corrections, tried to make the Java job section more readable,
and added explanations for the following parameters:

* edp.spark.adapt_for_swift
* edp.substitute_data_source_for_name
* edp.substitute_data_source_for_uuid

Change-Id: Ida7ec396998ade72f3137785cab27a0b4acae357

parent 24b5778cc4
commit 60b28a5aa4
@@ -22,7 +22,7 @@ Interfaces

The EDP features can be used from the Sahara web UI which is described in the :doc:`../horizon/dashboard.user.guide`.

The EDP features can also be used directly by a client through the `REST api <http://developer.openstack.org/api-ref-data-processing-v1.1.html>`_.

EDP Concepts
------------
@@ -83,6 +83,12 @@ Sahara supports data sources in Swift. The Swift service must be running in the

Sahara also supports data sources in HDFS. Any HDFS instance running on a Sahara cluster in the same OpenStack installation is accessible without manual configuration. Other instances of HDFS may be used as well provided that the URL is resolvable from the node executing the job.

Some job types require the use of data source objects to specify input and output when a job is launched. For example, when running a Pig job the UI will prompt the user for input and output data source objects.

Other job types like Java or Spark do not require the user to specify data sources. For these job types, data paths are passed as arguments. For convenience, Sahara allows data source objects to be referenced by name or id. The section `Using Data Source References as Arguments`_ gives further details.

Job Execution
+++++++++++++
@@ -108,9 +114,6 @@ The general workflow for defining and executing a job in Sahara is essentially t

3. Create a Job object which references the Job Binaries created in step 2
4. Create an input Data Source which points to the data you wish to process
5. Create an output Data Source which points to the location for output data

(Steps 4 and 5 do not apply to Java or Spark job types. See `Additional Details for Java jobs`_ and `Additional Details for Spark jobs`_)

6. Create a Job Execution object specifying the cluster and Job object plus relevant data sources, configuration values, and program arguments

   + When using the web UI this is done with the :guilabel:`Launch On Existing Cluster` or :guilabel:`Launch on New Cluster` buttons on the Jobs tab
@@ -155,12 +158,46 @@ These values can be set on the :guilabel:`Configure` tab during job launch throu

In some cases Sahara generates configuration values or parameters automatically. Values set explicitly by the user during launch will override those generated by Sahara.
Using Data Source References as Arguments
+++++++++++++++++++++++++++++++++++++++++

Sometimes it's necessary or desirable to pass a data path as an argument to a job. In these cases, a user may simply type out the path as an argument when launching a job. If the path requires credentials, the user can manually add the credentials as configuration values. However, if a data source object has been created that contains the desired path and credentials there is no need to specify this information manually.

As a convenience, Sahara allows data source objects to be referenced by name or id in arguments, configuration values, or parameters. When the job is executed, Sahara will replace the reference with the path stored in the data source object and will add any necessary credentials to the job configuration. Referencing an existing data source object is much faster than adding this information by hand. This is particularly useful for job types like Java or Spark that do not use data source objects directly.

There are two job configuration parameters that enable data source references. They may be used with any job type and are set on the ``Configuration`` tab when the job is launched:

* ``edp.substitute_data_source_for_name`` (default **False**) If set to **True**, causes Sahara to look for data source object name references in configuration values, arguments, and parameters when a job is launched. Name references have the form **datasource://name_of_the_object**.

  For example, assume a user has a WordCount application that takes an input path as an argument. If there is a data source object named **my_input**, a user may simply set the **edp.substitute_data_source_for_name** configuration parameter to **True** and add **datasource://my_input** as an argument when launching the job.

* ``edp.substitute_data_source_for_uuid`` (default **False**) If set to **True**, causes Sahara to look for data source object ids in configuration values, arguments, and parameters when a job is launched. A data source object id is a uuid, so it is unique. The id of a data source object is available through the UI or the Sahara command line client. A user may simply use the id as a value.
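For illustration, the job configuration for such a launch could look like the following sketch; it only builds the ``job_configs`` dictionary that mirrors the REST API's job execution body, and the data source name **my_input** and the output path are placeholders::

    import json

    # Hypothetical job_configs for a job that reads its input path from the
    # data source object named "my_input" and writes to a literal HDFS path.
    job_configs = {
        "configs": {
            "edp.substitute_data_source_for_name": True,
        },
        "args": [
            "datasource://my_input",             # replaced with the stored path
            "hdfs://namenode/user/demo/output",  # literal path, used as-is
        ],
    }

    print(json.dumps(job_configs, indent=2))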
Generation of Swift Properties for Data Sources
+++++++++++++++++++++++++++++++++++++++++++++++

If Swift proxy users are not configured (see :doc:`../userdoc/advanced.configuration.guide`) and a job is run with data source objects containing Swift paths, Sahara will automatically generate Swift username and password configuration values based on the credentials in the data sources. If the input and output data sources are both in Swift, it is expected that they specify the same credentials.

The Swift credentials may be set explicitly with the following configuration values:

+------------------------------------+
| Name                               |
@@ -170,17 +207,24 @@ The Swift credentials can be set explicitly with the following configuration val

| fs.swift.service.sahara.password   |
+------------------------------------+

Setting the Swift credentials explicitly is required when passing literal Swift paths as arguments instead of using data source references. When possible, use data source references as described in `Using Data Source References as Arguments`_.
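As a sketch, a job configuration that passes literal Swift paths and supplies the credentials explicitly might look like the following; the container, paths, and credentials are placeholders::

    # Hypothetical job_configs for a job that is given literal Swift paths,
    # so the Swift credentials must be supplied explicitly.
    job_configs = {
        "configs": {
            "fs.swift.service.sahara.username": "demo",
            "fs.swift.service.sahara.password": "secret",
        },
        "args": [
            "swift://demo-container.sahara/input",
            "swift://demo-container.sahara/output",
        ],
    }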
Additional Details for Hive jobs
++++++++++++++++++++++++++++++++

Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT`` parameters required by Hive based on the specified data sources.

Additional Details for Pig jobs
+++++++++++++++++++++++++++++++

Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT`` parameters required by Pig based on the specified data sources.

For Pig jobs, ``arguments`` should be thought of as command line arguments separated by spaces and passed to the ``pig`` shell.

``Parameters`` are a shorthand and are actually translated to the arguments ``-param name=value``
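For instance, a sketch of a Pig job configuration is shown below; the parameter name and value are placeholders::

    # Hypothetical job_configs for a Pig job. Each entry in "params" is
    # translated to a "-param name=value" argument for the pig shell.
    job_configs = {
        "params": {
            "INPUT_DATE": "2015-01-01",  # becomes: -param INPUT_DATE=2015-01-01
        },
        "args": [],                      # extra pig command line arguments, if any
    }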
@@ -189,8 +233,10 @@ Additional Details for MapReduce jobs

**Important!**

If the job type is MapReduce, the mapper and reducer classes *must* be specified as configuration values.
Note that the UI will not prompt the user for these required values; they must be added manually with the ``Configure`` tab.
Make sure to add these values with the correct names:

+-------------------------+-----------------------------------------+
@@ -208,9 +254,10 @@ Additional Details for MapReduce.Streaming jobs

If the job type is MapReduce.Streaming, the streaming mapper and reducer classes *must* be specified.

In this case, the UI *will* prompt the user to enter mapper and reducer values on the form and will take care of adding them to the job configuration with the appropriate names. If using the python client, however, be certain to add these values to the job configuration manually with the correct names:

+-------------------------+---------------+
| Name                    | Example Value |
@@ -223,64 +270,98 @@ to add these values to the job configuration manually with the correct names:

Additional Details for Java jobs
++++++++++++++++++++++++++++++++

Data Source objects are not used directly with Java job types. Instead, any input or output paths must be specified as arguments at job launch either explicitly or by reference as described in `Using Data Source References as Arguments`_. Using data source references is the recommended way to pass paths to Java jobs.

If configuration values are specified, they must be added to the job's Hadoop configuration at runtime. There are two methods of doing this. The simplest way is to use the **edp.java.adapt_for_oozie** option described below. The other method is to use the code from `this example <https://github.com/openstack/sahara/blob/master/etc/edp-examples/edp-java/README.rst>`_ to explicitly load the values.

The following special configuration values are read by Sahara and affect how Java jobs are run:

* ``edp.java.main_class`` (required) Specifies the full name of the class containing ``main(String[] args)``

  A Java job will execute the **main** method of the specified main class. Any arguments set during job launch will be passed to the program through the **args** array.

* ``oozie.libpath`` (optional) Specifies configuration values for the Oozie share libs; these libs can be shared by different workflows

* ``edp.java.java_opts`` (optional) Specifies configuration values for the JVM

* ``edp.java.adapt_for_oozie`` (optional) Specifies that Sahara should perform special handling of configuration values and exit conditions. The default is **False**.

  If this configuration value is set to **True**, Sahara will modify the job's Hadoop configuration before invoking the specified **main** method. Any configuration values specified during job launch (excluding those beginning with **edp.**) will be automatically set in the job's Hadoop configuration and will be available through standard methods.

  In addition, setting this option to **True** ensures that Oozie will handle program exit conditions correctly.

At this time, the following special configuration value only applies when running jobs on a cluster generated by the Cloudera plugin with the **Enable Hbase Common Lib** cluster config set to **True** (the default value):

* ``edp.hbase_common_lib`` (optional) Specifies that a common Hbase lib generated by Sahara in HDFS be added to the **oozie.libpath**. This is for use when an Hbase application is driven from a Java job. Default is **False**.

The **edp-wordcount** example bundled with Sahara shows how to use configuration values, arguments, and Swift data paths in a Java job type. Note that the example does not use the **edp.java.adapt_for_oozie** option but includes the code to load the configuration values explicitly.
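A sketch of a configuration for launching such a Java job follows, assuming the example's main class is **org.openstack.sahara.examples.WordCount**; the paths and the plain Hadoop value are placeholders::

    # Hypothetical job_configs for a Java job. With adapt_for_oozie set to
    # True, the plain Hadoop value below ("dfs.replication") is written into
    # the job's Hadoop configuration automatically; the "edp." values are
    # consumed by Sahara itself.
    job_configs = {
        "configs": {
            "edp.java.main_class": "org.openstack.sahara.examples.WordCount",
            "edp.java.adapt_for_oozie": True,
            "dfs.replication": 2,
        },
        "args": [
            "hdfs://namenode/user/demo/input",
            "hdfs://namenode/user/demo/output",
        ],
    }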
Additional Details for Shell jobs
+++++++++++++++++++++++++++++++++

A shell job will execute the script specified as ``main``, and will place any files specified as ``libs`` in the same working directory (on both the filesystem and in HDFS). Command line arguments may be passed to the script through the ``args`` array, and any ``params`` values will be passed as environment variables.

Data Source objects are not used directly with Shell job types but data source references may be used as described in `Using Data Source References as Arguments`_.

The **edp-shell** example bundled with Sahara contains a script which will output the executing user to a file specified by the first command line argument.
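A minimal sketch of a Shell job configuration follows; the script argument and the environment variable are placeholders::

    # Hypothetical job_configs for a Shell job. Entries in "args" become
    # command line arguments for the script; entries in "params" are made
    # available to the script as environment variables.
    job_configs = {
        "args": ["output-file.txt"],
        "params": {"EXTRA_SETTING": "some-value"},
    }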
Additional Details for Spark jobs
+++++++++++++++++++++++++++++++++

Data Source objects are not used directly with Spark job types. Instead, any input or output paths must be specified as arguments at job launch either explicitly or by reference as described in `Using Data Source References as Arguments`_. Using data source references is the recommended way to pass paths to Spark jobs.

Spark jobs use some special configuration values:

* ``edp.java.main_class`` (required) Specifies the full name of the class containing the Java or Scala main method:

  + ``main(String[] args)`` for Java
  + ``main(args: Array[String])`` for Scala

  A Spark job will execute the **main** method of the specified main class. Any arguments set during job launch will be passed to the program through the **args** array.

* ``edp.spark.adapt_for_swift`` (optional) If set to **True**, instructs Sahara to modify the job's Hadoop configuration so that Swift paths may be accessed. Without this configuration value, Swift paths will not be accessible to Spark jobs. The default is **False**.
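A sketch of a configuration for a Spark job that reads and writes Swift data follows; the main class and data source names are placeholders, and the data source references resolve the Swift credentials as described in `Using Data Source References as Arguments`_::

    # Hypothetical job_configs for a Spark job that accesses Swift paths.
    # Without edp.spark.adapt_for_swift the Swift paths referenced by the
    # data source objects would not be reachable from the job.
    job_configs = {
        "configs": {
            "edp.java.main_class": "example.spark.WordCount",  # placeholder class
            "edp.spark.adapt_for_swift": True,
            "edp.substitute_data_source_for_name": True,
        },
        "args": [
            "datasource://spark_input",   # data source objects holding Swift paths
            "datasource://spark_output",
        ],
    }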
The **edp-spark** example bundled with Sahara contains a Spark program for estimating Pi.
Special Sahara URLs