Merge "[DOC] update userdoc/edp.rst"

Jenkins 2016-09-22 22:52:07 +00:00 committed by Gerrit Code Review
commit c0f7e1f656


@ -9,8 +9,8 @@ of jobs on clusters created from sahara. EDP supports:
* Hive, Pig, MapReduce, MapReduce.Streaming, Java, and Shell job types on
Hadoop clusters
* Spark jobs on Spark standalone clusters, MapR (v5.0.0 - v5.2.0) clusters,
Vanilla clusters (v2.7.1), and CDH clusters (v5.3.0 or higher).
* storage of job binaries in the OpenStack Object Storage service (swift),
the OpenStack Shared file systems service (manila), or sahara's own
database
@ -53,11 +53,12 @@ Job Binaries
A :dfn:`Job Binary` object stores a URL to a single script or Jar file and
any credentials needed to retrieve the file. The file itself may be stored
in the sahara internal database (now deprecated), in swift, or in manila.
**deprecated:** Files in the sahara database are stored as raw bytes in a
:dfn:`Job Binary Internal` object. This object's sole purpose is to store a
file for later retrieval. No extra credentials need to be supplied for files
stored internally.
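Because internal storage is deprecated, new job binaries should normally point
at swift or manila instead. As a minimal sketch (the container, object name,
and credentials below are illustrative, and the field names are assumed to
follow the job binary REST request body), a swift-backed job binary might be
described as::

    {
        "name": "wordcount.jar",
        "description": "WordCount example job binary stored in swift",
        "url": "swift://edp-binaries/wordcount.jar",
        "extra": {
            "user": "demo",
            "password": "secret"
        }
    }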
Sahara requires credentials (username and password) to access files stored in
@ -72,7 +73,7 @@ automatically mounted to any cluster nodes which require access to the file,
if it is not mounted already.
There is a configurable limit on the size of a single job binary that may be
retrieved by sahara. This limit is 5MB and may be set with the
*job_binary_max_KB* setting in the :file:`sahara.conf` configuration file.
Jobs
@ -100,6 +101,10 @@ binary and/or supporting libraries depending on its type:
+-------------------------+-------------+-----------+
| ``Spark`` | required | optional |
+-------------------------+-------------+-----------+
| ``Storm`` | required | not used |
+-------------------------+-------------+-----------+
| ``Storm Pyelus`` | required | not used |
+-------------------------+-------------+-----------+
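As an illustration of the table above, a Spark job template that references a
main application jar and one supporting library might be registered roughly as
follows (the job binary IDs are placeholders, and the field names are assumed
to follow the job template REST request body)::

    {
        "name": "spark-wordcount",
        "type": "Spark",
        "description": "WordCount implemented with Spark",
        "mains": ["48c53b23-0873-4a7a-8e57-16d82f1d9336"],
        "libs": ["d45a6c1e-9f04-4c4e-8f3a-2b1d7e0c5a11"]
    }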
Data Sources
@ -144,13 +149,14 @@ sahara which is used to monitor and manage the job.
To execute Hadoop jobs, sahara generates an Oozie workflow and submits it to
the Oozie server running on the cluster. Familiarity with Oozie is not
necessary for using sahara but it may be beneficial to the user. A link to
the Oozie web console can be found in the sahara web UI in the cluster
details.
For Spark jobs, sahara uses the *spark-submit* shell script and executes the
Spark job from the master node in the case of a Spark cluster, or from the
Spark Job History server node in other cases. Logs of Spark jobs run by sahara
can be found on this node under the */tmp/spark-edp* directory.
.. _edp_workflow:
@ -181,13 +187,13 @@ the same whether using the web UI or the REST API.
+ When using the REST API this is done via the */jobs/<job_id>/execute*
method
The workflow is simpler when using existing objects. For example, to
construct a new job which uses existing binaries and input data a user may
only need to perform steps 3, 5, and 6 above. Of course, to repeat the same
job multiple times a user would need only step 6.
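For instance, repeating step 6 via the REST API's */jobs/<job_id>/execute*
method might use a request body along these lines (the cluster and data source
IDs are placeholders, ``job_configs`` is covered in the next section, and job
types that do not read or write data sources simply omit the input and output
IDs)::

    {
        "cluster_id": "811e1134-666f-4c48-bc92-afb5b10c9d8c",
        "input_id": "3e1bc8e6-8c69-4749-8e52-90d9341d15bc",
        "output_id": "52146b52-6540-4aac-a024-fee253cf52a9",
        "job_configs": {
            "configs": {"mapred.map.tasks": "1"},
            "args": []
        }
    }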
Specifying Configuration Values, Parameters, and Arguments
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Jobs can be configured at launch. The job type determines the kinds of values
that may be set:
@ -210,6 +216,10 @@ that may be set:
+--------------------------+---------------+------------+-----------+
| ``Spark`` | Yes | No | Yes |
+--------------------------+---------------+------------+-----------+
| ``Storm`` | Yes | No | Yes |
+--------------------------+---------------+------------+-----------+
| ``Storm Pyelus`` | Yes | No | Yes |
+--------------------------+---------------+------------+-----------+
* :dfn:`Configuration values` are key/value pairs.
@ -281,21 +291,21 @@ in the future), sahara allows the addition of an interface (or method
signature) to your job template. A sample interface for the Teragen Hadoop
example might be:
+---------+---------+-----------+-------------+----------+--------------------+
| Name | Mapping | Location | Value | Required | Default |
| | Type | | Type | | |
+=========+=========+===========+=============+==========+====================+
| Example | args | 0 | string | false | teragen |
| Class | | | | | |
+---------+---------+-----------+-------------+----------+--------------------+
| Rows | args | 1 | number | true | unset |
+---------+---------+-----------+-------------+----------+--------------------+
| Output | args | 2 | data_source | false | hdfs://ip:port/path|
| Path | | | | | |
+---------+---------+-----------+-------------+----------+--------------------+
| Mapper | configs | mapred. | number | false | unset |
| Count | | map.tasks | | | |
+---------+---------+-----------+-------------+----------+--------------------+
A "Description" field may also be added to each interface argument.
@ -307,14 +317,13 @@ the value of which consists of a list of JSON objects, as below:
[
{
"name": "Example Class",
"description": "Indicates which example job class should be used."
"description": "Indicates which example job class should be used.",
"mapping_type": "args",
"location": "0",
"value_type": "string",
"required": false,
"default": "teragen"
},
# Other arguments above here, as JSON objects
]
Creating this interface would allow you to specify a configuration for any
@ -336,8 +345,16 @@ each interface argument. The final ``job_configs`` map would be:
{
"job_configs": {
"configs": {"mapred.map.tasks": "3"},
"args" ["teragen", "1000000", "hdfs://mycluster:8020/user/myuser/teragen-output"]
"configs":
{
"mapred.map.tasks": "3"
},
"args":
[
"teragen",
"1000000",
"hdfs://mycluster:8020/user/myuser/teragen-output"
]
}
}
@ -369,7 +386,7 @@ If swift proxy users are not configured (see
:doc:`../userdoc/advanced.configuration.guide`) and a job is run with data
source objects containing swift paths, sahara will automatically generate
swift username and password configuration values based on the credentials
in the data sources. If the input and output data sources are both in swift,
it is expected that they specify the same credentials.
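For example, a swift data source that carries its own credentials might be
registered with a body similar to the sketch below (the container, path, and
credentials are illustrative, and the field names are assumed to follow the
data source REST request body)::

    {
        "name": "teragen-output",
        "type": "swift",
        "url": "swift://edp-data/teragen-output",
        "description": "Output location for the Teragen example",
        "credentials": {
            "user": "demo",
            "password": "secret"
        }
    }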
The swift credentials may be set explicitly with the following configuration
@ -558,7 +575,7 @@ estimating Pi.
Special Sahara URLs
-------------------
Sahara uses custom URLs to refer to objects stored in swift, in manila, or in
the sahara internal database. These URLs are not meant to be used outside of
@ -656,17 +673,17 @@ is finished.
Two config parameters control the behaviour of periodic clusters:
* periodic_enable - if set to 'false', sahara will do nothing to a transient
cluster once the job it was created for is completed. If it is set to
'true', then the behaviour depends on the value of the next parameter.
* use_identity_api_v3 - set it to 'false' if your OpenStack installation
does not provide keystone API v3. In that case sahara will not terminate
unneeded clusters. Instead, it will set their state to 'AwaitingTermination',
meaning that they can be manually deleted by a user. If the parameter is
set to 'true', sahara will itself terminate the cluster. The limitation is
caused by the lack of the 'trusts' feature in Keystone API versions older than
v3.
If both parameters are set to 'true', sahara works with transient clusters in
the following manner:
1. When a user requests that a job be executed on a transient cluster,