Updates to the EDP doc

Made general corrections, tried to make the Java job section more
readable, and added explanations for the following parameters:

* edp.spark.adapt_for_swift
* edp.substitute_data_source_for_name
* edp.substitute_data_source_for_uuid

Change-Id: Ida7ec396998ade72f3137785cab27a0b4acae357
Trevor McKay 2015-04-07 20:36:18 -04:00
parent 24b5778cc4
commit 60b28a5aa4

@@ -22,7 +22,7 @@ Interfaces
The EDP features can be used from the Sahara web UI which is described in the :doc:`../horizon/dashboard.user.guide`.
The EDP features also can be used directly by a client through the :doc:`../restapi/rest_api_v1.1_EDP`.
The EDP features also can be used directly by a client through the `REST api <http://developer.openstack.org/api-ref-data-processing-v1.1.html>`_.
EDP Concepts
------------
@@ -83,6 +83,12 @@ Sahara supports data sources in Swift. The Swift service must be running in the
Sahara also supports data sources in HDFS. Any HDFS instance running on a Sahara cluster in the same OpenStack installation is accessible without manual configuration. Other instances of HDFS may be used as well provided that the URL is resolvable from the node executing the job.
Some job types require the use of data source objects to specify input and output when a job is launched. For example, when running a Pig job the UI will prompt the user for input and output data source objects.
Other job types like Java or Spark do not require the user to specify data sources. For these job types, data paths are passed as arguments. For convenience, Sahara allows data source objects to be
referenced by name or id. The section `Using Data Source References as Arguments`_ gives further details.
Job Execution
+++++++++++++
@@ -108,9 +114,6 @@ The general workflow for defining and executing a job in Sahara is essentially t
3. Create a Job object which references the Job Binaries created in step 2
4. Create an input Data Source which points to the data you wish to process
5. Create an output Data Source which points to the location for output data
(Steps 4 and 5 do not apply to Java or Spark job types. See `Additional Details for Java jobs`_ and `Additional Details for Spark jobs`_)
6. Create a Job Execution object specifying the cluster and Job object plus relevant data sources, configuration values, and program arguments (a sketch of such a request follows this list)
+ When using the web UI this is done with the :guilabel:`Launch On Existing Cluster` or :guilabel:`Launch on New Cluster` buttons on the Jobs tab
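
For illustration, a minimal sketch of the request built in step 6 is shown below. The overall shape (``cluster_id``, ``input_id``, ``output_id``, and ``job_configs``) roughly follows the EDP job execution request; the ids and the sample configuration value are placeholders only.

.. code-block:: python

    # Illustrative sketch of a job execution request body (step 6).
    # The id values are placeholders for objects created in steps 1-5.
    job_execution = {
        "cluster_id": "your-cluster-id",
        "input_id": "your-input-data-source-id",    # omit for Java or Spark jobs
        "output_id": "your-output-data-source-id",  # omit for Java or Spark jobs
        "job_configs": {
            "configs": {"mapred.reduce.tasks": "1"},  # configuration values
            "params": {},                             # parameters (Pig and Hive job types)
            "args": [],                               # program arguments
        },
    }
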
@@ -155,12 +158,46 @@ These values can be set on the :guilabel:`Configure` tab during job launch throu
In some cases Sahara generates configuration values or parameters automatically. Values set explicitly by the user during launch will override those generated by Sahara.
Using Data Source References as Arguments
+++++++++++++++++++++++++++++++++++++++++
Sometimes it's necessary or desirable to pass a data path as an argument to a job. In these cases,
a user may simply type out the path as an argument when launching a job. If the path requires
credentials, the user can manually add the credentials as configuration values. However, if a data
source object has been created that contains the desired path and credentials there is no need
to specify this information manually.
As a convenience, Sahara allows data source objects to be referenced by name or id
in arguments, configuration values, or parameters. When the job is executed, Sahara will replace
the reference with the path stored in the data source object and will add any necessary credentials
to the job configuration. Referencing an existing data source object is much faster than adding
this information by hand. This is particularly useful for job types like Java or Spark that do
not use data source objects directly.
There are two job configuration parameters that enable data source references. They may
be used with any job type and are set on the ``Configure`` tab when the job is launched
(a short configuration sketch follows the list):
* ``edp.substitute_data_source_for_name`` (default **False**) If set to **True**, causes Sahara
to look for data source object name references in configuration values, arguments, and parameters
when a job is launched. Name references have the form **datasource://name_of_the_object**.
For example, assume a user has a WordCount application that takes an input path as an argument.
If there is a data source object named **my_input**, a user may simply set the
**edp.substitute_data_source_for_name** configuration parameter to **True** and add
**datasource://my_input** as an argument when launching the job.
* ``edp.substitute_data_source_for_uuid`` (default **False**) If set to **True**, causes Sahara
to look for data source object ids in configuration values, arguments, and parameters when
a job is launched. A data source object id is a uuid, so it is unique. The id of a data
source object is available through the UI or the Sahara command line client. A user may
simply use the id as an argument or configuration value.
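
As a short sketch of both reference styles, the relevant portion of a job configuration might look like the following. The data source name, the uuid, and the enclosing ``job_configs`` structure shown here are illustrative placeholders.

.. code-block:: python

    # Illustrative sketch: enable substitution and pass data source references
    # as arguments. "my_input" and the uuid are hypothetical examples.
    job_configs = {
        "configs": {
            "edp.substitute_data_source_for_name": True,
            "edp.substitute_data_source_for_uuid": True,
        },
        "args": [
            "datasource://my_input",                 # replaced with the path stored in
                                                     # the data source named my_input
            "b2a3f5d8-1c2e-4f6a-9b0d-7e8f9a0b1c2d",  # hypothetical data source id, replaced
                                                     # with that object's path
        ],
    }
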
Generation of Swift Properties for Data Sources
+++++++++++++++++++++++++++++++++++++++++++++++
If Swift proxy users are not configured (see :doc:`../userdoc/advanced.configuration.guide`) and a job is run with data sources in Swift, Sahara will automatically generate Swift username and password configuration values based on the credentials in the data sources. If the input and output data sources are both in Swift, it is expected that they specify the same credentials.
If Swift proxy users are not configured (see :doc:`../userdoc/advanced.configuration.guide`) and a job is run with data source objects containing Swift paths, Sahara will automatically generate Swift username and password configuration values based on the credentials in the data sources. If the input and output data sources are both in Swift, it is expected that they specify the same credentials.
The Swift credentials can be set explicitly with the following configuration values:
The Swift credentials may be set explicitly with the following configuration values:
+------------------------------------+
| Name                               |
@@ -170,17 +207,24 @@ The Swift credentials can be set explicitly with the following configuration val
| fs.swift.service.sahara.password   |
+------------------------------------+
Setting the Swift credentials explicitly is required when passing literal Swift paths as arguments
instead of using data source references. When possible, use data source references as described
in `Using Data Source References as Arguments`_.
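
As a sketch of the literal-path alternative, a job passing a raw Swift path would need to supply the values from the table above explicitly. The container, object, and credentials below are placeholders.

.. code-block:: python

    # Illustrative sketch: literal Swift path plus explicit credentials.
    job_configs = {
        "configs": {
            "fs.swift.service.sahara.username": "demo_user",
            "fs.swift.service.sahara.password": "demo_password",
        },
        "args": [
            "swift://my_container.sahara/input/data.txt",  # literal Swift path
        ],
    }
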
Additional Details for Hive jobs
++++++++++++++++++++++++++++++++
Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT`` parameters required by Hive based on the specified data sources.
Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT`` parameters required by
Hive based on the specified data sources.
Additional Details for Pig jobs
+++++++++++++++++++++++++++++++
Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT`` parameters required by Pig based on the specified data sources.
Sahara will automatically generate values for the ``INPUT`` and ``OUTPUT`` parameters required by
Pig based on the specified data sources.
For Pig jobs, ``arguments`` should be thought of as command line arguments separated by spaces and passed to the ``pig`` shell.
For Pig jobs, ``arguments`` should be thought of as command line arguments separated by spaces and
passed to the ``pig`` shell.
``Parameters`` are a shorthand and are actually translated to the arguments ``-param name=value``
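
A small sketch of the distinction, with hypothetical names: parameters are turned into ``-param name=value`` arguments, while arguments are handed to the ``pig`` shell exactly as given.

.. code-block:: python

    # Illustrative sketch for a Pig job launch.
    job_configs = {
        "params": {"INPUT": "swift://demo.sahara/input"},  # becomes -param INPUT=swift://demo.sahara/input
        "args": ["-stop_on_failure"],                      # passed verbatim to the pig command line
    }
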
@@ -189,8 +233,10 @@ Additional Details for MapReduce jobs
**Important!**
If the job type is MapReduce, the mapper and reducer classes *must* be specified as configuration values.
Note, the UI will not prompt the user for these required values, they must be added manually with the ``Configure`` tab.
If the job type is MapReduce, the mapper and reducer classes *must* be specified as configuration
values.
Note, the UI will not prompt the user for these required values; they must be added manually
on the ``Configure`` tab.
Make sure to add these values with the correct names:
+-------------------------+-----------------------------------------+
@@ -208,9 +254,10 @@ Additional Details for MapReduce.Streaming jobs
If the job type is MapReduce.Streaming, the streaming mapper and reducer classes *must* be specified.
In this case, the UI *will* prompt the user to enter mapper and reducer values on the form and will take care of
adding them to the job configuration with the appropriate names. If using the python client, however, be certain
to add these values to the job configuration manually with the correct names:
In this case, the UI *will* prompt the user to enter mapper and reducer values on the form and will
take care of adding them to the job configuration with the appropriate names. If using the python
client, however, be certain to add these values to the job configuration manually with the correct
names:
+-------------------------+---------------+
| Name                    | Example Value |
@@ -223,64 +270,98 @@ to add these values to the job configuration manually with the correct names:
Additional Details for Java jobs
++++++++++++++++++++++++++++++++
Java jobs use two special configuration values:
Data Source objects are not used directly with Java job types. Instead, any
input or output paths must be specified as arguments at job launch either
explicitly or by reference as described in `Using Data Source References as Arguments`_.
Using data source references is the recommended way to pass paths to
Java jobs.
* ``edp.java.main_class`` (required) Specifies the class(including the package name, for example: org.openstack.sahara.examples.WordCount) containing ``main(String[] args)``
If configuration values are specified, they must be added to the job's
Hadoop configuration at runtime. There are two methods of doing this. The
simplest way is to use the **edp.java.adapt_for_oozie** option described
below. The other method is to use the code from
`this example <https://github.com/openstack/sahara/blob/master/etc/edp-examples/edp-java/README.rst>`_
to explicitly load the values.
The following special configuration values are read by Sahara and affect how Java jobs are run:
* ``edp.java.main_class`` (required) Specifies the full name of the class
containing ``main(String[] args)``
A Java job will execute the **main** method of the specified main class. Any
arguments set during job launch will be passed to the program through the
**args** array.
* ``oozie.libpath`` (optional) Specifies the location of the Oozie share libs;
these libs can be shared by different workflows
* ``edp.java.java_opts`` (optional) Specifies configuration values for the JVM
* ``edp.java.adapt_for_oozie`` (optional) Specifies configuration values for adapting oozie. If this configuration value is unset or set to "False", users will need to modify source code as shown `here <https://github.com/openstack/sahara/blob/master/etc/edp-examples/edp-java/README.rst>`_ to read Hadoop configuration values from the Oozie job configuration. Setting this configuration value to "True" ensures that the Oozie job configuration values will be set in the Hadoop config automatically with no need for code modification and that exit conditions will be handled correctly by Oozie.
* ``edp.java.adapt_for_oozie`` (optional) Specifies that Sahara should perform
special handling of configuration values and exit conditions. The default is
**False**.
* ``oozie.libpath`` (optional) Specifies configuration values for the Oozie share libs, these libs can be shared by different workflows
If this configuration value is set to **True**, Sahara will modify
the job's Hadoop configuration before invoking the specified **main** method.
Any configuration values specified during job launch (excluding those
beginning with **edp.**) will be automatically set in the job's Hadoop
configuration and will be available through standard methods.
* Use HBase Common Libs (optional) specifies configuration value for whether using the common HBase libs on HDFS or not if running HBase Job written by Java
Secondly, setting this option to **True** ensures that Oozie will handle
program exit conditions correctly.
A Java job will execute the ``main(String[] args)`` method of the specified main class. There are two methods of passing
values to the ``main`` method:
At this time, the following special configuration value only applies when
running jobs on a cluster generated by the Cloudera plugin with the
**Enable Hbase Common Lib** cluster config set to **True** (the default value):
* Passing values as arguments
* ``edp.hbase_common_lib`` (optional) Specifies that a common Hbase lib generated by
Sahara in HDFS be added to the **oozie.libpath**. This is for use when an Hbase application
is driven from a Java job. The default is **False**.
Arguments set during job launch will be passed in the ``String[] args`` array.
* Setting configuration values
Any configuration values that are set can be read from a special file created by Oozie.
Data Source objects are not used with Java job types. Instead, any input or output paths must be passed to the ``main`` method
using one of the above two methods. Furthermore, if Swift data sources are used the configuration values listed in `Generation of Swift Properties for Data Sources`_ must be passed with one of the above two methods and set in the configuration by ``main``.
The ``edp-wordcount`` example bundled with Sahara shows how to use configuration values, arguments, and Swift data paths in a Java job type.
The **edp-wordcount** example bundled with Sahara shows how to use configuration
values, arguments, and Swift data paths in a Java job type. Note that the
example does not use the **edp.java.adapt_for_oozie** option but includes the
code to load the configuration values explicitly.
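
Putting the configuration values above together, a hypothetical launch configuration for a WordCount-style Java job might look like the sketch below. The class name, paths, and credentials are placeholders.

.. code-block:: python

    # Illustrative sketch for a Java job launch. With edp.java.adapt_for_oozie
    # set to True, the non-edp configuration values become visible to the program
    # through the standard Hadoop Configuration mechanisms.
    job_configs = {
        "configs": {
            "edp.java.main_class": "org.example.WordCount",      # hypothetical class
            "edp.java.adapt_for_oozie": True,
            "fs.swift.service.sahara.username": "demo_user",     # needed only for literal Swift paths
            "fs.swift.service.sahara.password": "demo_password",
        },
        "args": [
            "swift://demo.sahara/input",   # passed to main(String[] args)
            "swift://demo.sahara/output",
        ],
    }
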
Additional Details for Shell jobs
+++++++++++++++++++++++++++++++++
A shell job will execute the script specified as ``main``, and will place any files specified as ``libs``
in the same working directory (on both the filesystem and in HDFS). Command line arguments may be passed to
the script through the ``args`` array, and any ``params`` values will be passed as environment variables.
A shell job will execute the script specified as ``main``, and will place any files specified
as ``libs`` in the same working directory (on both the filesystem and in HDFS). Command line
arguments may be passed to the script through the ``args`` array, and any ``params`` values will
be passed as environment variables.
Data Source objects are not used with Shell job types.
Data Source objects are not used directly with Shell job types but data source references
may be used as described in `Using Data Source References as Arguments`_.
The ``edp-shell`` example bundled with Sahara contains a script which will output the executing user to
a file specified by the first command line argument.
The **edp-shell** example bundled with Sahara contains a script which will output the executing
user to a file specified by the first command line argument.
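
For a script like the ``edp-shell`` example, a launch configuration might look like the following sketch; the file name and environment variable are placeholders.

.. code-block:: python

    # Illustrative sketch for a Shell job launch: args become command line
    # arguments to the script, params become environment variables.
    job_configs = {
        "args": ["output_file.txt"],          # first argument, read by the script
        "params": {"EXTRA_ENV_VAR": "demo"},  # exported into the script's environment
    }
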
Additional Details for Spark jobs
+++++++++++++++++++++++++++++++++
Spark jobs use a special configuration value:
Data Source objects are not used directly with Spark job types. Instead, any
input or output paths must be specified as arguments at job launch either
explicitly or by reference as described in `Using Data Source References as Arguments`_.
Using data source references is the recommended way to pass paths to Spark jobs.
* ``edp.java.main_class`` (required) Specifies the class containing the Java or Scala main method:
Spark jobs use some special configuration values:
* ``edp.java.main_class`` (required) Specifies the full name of the class
containing the Java or Scala main method:
+ ``main(String[] args)`` for Java
+ ``main(args: Array[String])`` for Scala
A Spark job will execute the ``main`` method of the specified main class. Values may be passed to
the main method through the ``args`` array. Any arguments set during job launch will be passed to the
program as command-line arguments by *spark-submit*.
A Spark job will execute the **main** method of the specified main class. Any
arguments set during job launch will be passed to the program through the
**args** array.
Data Source objects are not used with Spark job types. Instead, any input or output paths must be passed to the ``main`` method
as arguments.
* ``edp.spark.adapt_for_swift`` (optional) If set to **True**, instructs Sahara to modify the
job's Hadoop configuration so that Swift paths may be accessed. Without this configuration
value, Swift paths will not be accessible to Spark jobs. The default is **False**.
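
For example, a hypothetical launch configuration for a Spark job that reads and writes Swift paths might look like the following sketch; the class name and paths are placeholders.

.. code-block:: python

    # Illustrative sketch for a Spark job launch. edp.spark.adapt_for_swift
    # must be True for the swift:// paths below to be accessible.
    job_configs = {
        "configs": {
            "edp.java.main_class": "org.example.SparkWordCount",  # hypothetical class
            "edp.spark.adapt_for_swift": True,
        },
        "args": [
            "swift://demo.sahara/input",
            "swift://demo.sahara/output",
        ],
    }
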
The ``edp-spark`` example bundled with Sahara contains a Spark program for estimating Pi.
The **edp-spark** example bundled with Sahara contains a Spark program for estimating Pi.
Special Sahara URLs