Add sample spark wordcount job

Added a new Spark job that can read data from Swift. Also added a job to
Sahara CI to test it.

Implements blueprint: edp-spark-example-with-swift

Change-Id: I3484a8ba0bddebea34b46ab33af9e6ed06bf4f44
2015-07-29 16:22:58 +03:00 · 2015-07-29 16:22:58 +03:00 · 82942e5125
commit 82942e5125
parent 32d8be795f
4 changed files with 51 additions and 3 deletions
--- a/etc/edp-examples/edp-spark/README.rst
+++ b/etc/edp-examples/edp-spark/README.rst
@@ -2,7 +2,33 @@ Example Spark Job
 =================
 This example contains the compiled classes for SparkPi extracted from
-the example jar distributed with Apache Spark version 1.0.0.
+the example jar distributed with Apache Spark version 1.3.1.
 SparkPi example estimates Pi. It can take a single optional integer
 argument specifying the number of slices (tasks) to use.
 Example spark-wordcount Job
 ===========================
 spark-wordcount is a modified version of the WordCount example from Apache Spark.
 It can read input data from HDFS or a Swift container and write the number of
 occurrences of each word to standard output or HDFS.
 Launching wordcount job from Sahara UI
 --------------------------------------
 1. Create a job binary that points to ``spark-wordcount.jar``.
 2. Create a job template and set ``spark-wordcount.jar`` as the main binary
   of the job template.
 3. Create a Swift container with your input file. As an example, you can upload
   ``sample_input.txt``.
 4. Launch the job:
    1. Put the path to the input file in ``args``.
    2. Put the path to the output file in ``args``.
    3. Fill the ``Main class`` input with the following class: ``sahara.edp.spark.SparkWordCount``.
    4. Put the following values in the job's configs: ``edp.spark.adapt_for_swift`` with value ``True``,
       ``fs.swift.service.sahara.password`` with the password for your username, and
       ``fs.swift.service.sahara.username`` with your username. These values are required to
       access your input file located in Swift.
    5. Execute the job. You will be able to view your output in HDFS.
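
The launch steps in the README can be summarized as one set of configs plus two positional arguments. The following is a minimal Python sketch of that summary, not Sahara client code: the helper name `build_wordcount_job_configs`, the `configs`/`args` dictionary layout, and the example paths are assumptions made for illustration. The config keys and the main class are taken from the README and the CI scenario in this change.

```python
# Sketch only: the helper name, the dictionary layout, and the example paths
# are illustrative assumptions, not something defined by this commit.

MAIN_CLASS = "sahara.edp.spark.SparkWordCount"  # main class named in the README


def build_wordcount_job_configs(username, password, input_path, output_path):
    """Collect the configs and args listed in the README launch steps."""
    if not username or not password:
        # The README states the Swift credentials are required to read the input.
        raise ValueError("Swift username and password are required")
    return {
        "configs": {
            "edp.java.main_class": MAIN_CLASS,
            "edp.spark.adapt_for_swift": True,
            "fs.swift.service.sahara.username": username,
            "fs.swift.service.sahara.password": password,
        },
        # Input path first, then output path, in the order the steps list them.
        "args": [input_path, output_path],
    }


if __name__ == "__main__":
    # Placeholder credentials and paths; substitute your own values.
    job = build_wordcount_job_configs(
        "demo-user",
        "demo-password",
        "swift://demo-container.sahara/sample_input.txt",
        "hdfs:///user/demo/wordcount-output",
    )
    for key, value in sorted(job["configs"].items()):
        print(key, "=", value)
    print("args:", job["args"])
```

Keeping the credentials as function parameters rather than constants mirrors the CI scenario, which reads them from `OS_USERNAME` and `OS_PASSWORD` instead of hard-coding them.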
--- a/etc/edp-examples/edp-spark/sample_input.txt
+++ b/etc/edp-examples/edp-spark/sample_input.txt
@@ -0,0 +1,10 @@
 one
 one
 one
 one
 two
 two
 two
 three
 three
 four
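
As a reference point when checking the job's output, here is a plain-Python sketch of the counts a word count should produce for this sample file. It is not the code inside `spark-wordcount.jar`, whose source is not part of this change. Splitting on whitespace is an assumption about how words are tokenized, and `count_words` is a hypothetical helper name.

```python
from collections import Counter

# Contents of sample_input.txt as added by this commit.
SAMPLE_INPUT = """one
one
one
one
two
two
two
three
three
four
"""


def count_words(text):
    """Count whitespace-separated words (assumed tokenization)."""
    return Counter(text.split())


if __name__ == "__main__":
    # Print the words from most to least frequent.
    for word, count in count_words(SAMPLE_INPUT).most_common():
        print(word, count)
```

For this input the counts are 4 for `one`, 3 for `two`, 2 for `three`, and 1 for `four`.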
--- a/etc/edp-examples/edp-spark/spark-wordcount.jar
+++ b/etc/edp-examples/edp-spark/spark-wordcount.jar
--- a/etc/scenario/sahara-ci/edp.yaml.mako
+++ b/etc/scenario/sahara-ci/edp.yaml.mako
@@ -95,6 +95,20 @@ edp_jobs_flow:
        edp.java.main_class: org.apache.spark.examples.SparkPi
      args:
        - 4
    - type: Spark
      input_datasource:
        type: swift
        source: etc/edp-examples/edp-spark/sample_input.txt
      main_lib:
        type: database
        source: etc/edp-examples/edp-spark/spark-wordcount.jar
      configs:
        edp.java.main_class: sahara.edp.spark.SparkWordCount
        edp.spark.adapt_for_swift: true
        fs.swift.service.sahara.username: ${OS_USERNAME}
        fs.swift.service.sahara.password: ${OS_PASSWORD}
      args:
        - '{input_datasource}'
  transient:
    - type: Pig
      input_datasource:
@@ -155,5 +169,3 @@ edp_jobs_flow:
      args:
        - 10
        - 10