Add sample spark wordcount job

Added a new Spark job that can read data from Swift, and added a job
to Sahara CI to test it.

Implements blueprint: edp-spark-example-with-swift
Change-Id: I3484a8ba0bddebea34b46ab33af9e6ed06bf4f44
2015-07-29 16:22:58 +03:00
commit 82942e5125
parent 32d8be795f
4 changed files with 51 additions and 3 deletions
--- a/etc/edp-examples/edp-spark/README.rst
+++ b/etc/edp-examples/edp-spark/README.rst
@@ -2,7 +2,33 @@ Example Spark Job
 =================

 This example contains the compiled classes for SparkPi extracted from
-the example jar distributed with Apache Spark version 1.0.0.
+the example jar distributed with Apache Spark version 1.3.1.

 SparkPi example estimates Pi. It can take a single optional integer
 argument specifying the number of slices (tasks) to use.
+
+Example spark-wordcount Job
+===========================
+
+spark-wordcount is a modified version of the WordCount example from Apache Spark.
+It can read input data from HDFS or a Swift container, then output the number of occurrences
+of each word to standard output or HDFS.
+
+Launching wordcount job from Sahara UI
+--------------------------------------
+
+1. Create a job binary that points to ``spark-wordcount.jar``.
+2. Create a job template and set ``spark-wordcount.jar`` as the main binary
+   of the job template.
+3. Create a Swift container with your input file. As an example, you can upload
+   ``sample_input.txt``.
+4. Launch the job:
+
+    1. Put path to input file in ``args``
+    2. Put path to output file in ``args``
+    3. Fill the ``Main class`` input with the following class: ``sahara.edp.spark.SparkWordCount``
+    4. Put the following values in the job's configs: ``edp.spark.adapt_for_swift`` with value ``True``,
+       ``fs.swift.service.sahara.password`` with password for your username, and
+       ``fs.swift.service.sahara.username`` with your username. These values are required for
+       correct access to your input file, located in Swift.
+    5. Execute the job. You will be able to view your output in HDFS.
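
The launch settings listed in the README steps above can be collected in one place. What follows is a minimal sketch, not part of this commit and not Sahara's API: `build_wordcount_job_configs` is a hypothetical helper, and the credentials, the `swift://` path layout, and the output path are placeholders assumed for illustration. Only the config keys and the main class name are taken from the README and the CI scenario in this commit.

```python
def build_wordcount_job_configs(username, password, input_path, output_path=None):
    """Collect the settings the README asks for into one dictionary.

    Hypothetical helper for illustration only; it does not call Sahara.
    """
    if not username or not password:
        raise ValueError("Swift access needs both a username and a password")

    # The CI scenario in this commit passes only the input path;
    # the README steps also put an output path in args.
    args = [input_path]
    if output_path is not None:
        args.append(output_path)

    return {
        "configs": {
            "edp.java.main_class": "sahara.edp.spark.SparkWordCount",
            "edp.spark.adapt_for_swift": True,
            "fs.swift.service.sahara.username": username,
            "fs.swift.service.sahara.password": password,
        },
        "args": args,
    }


if __name__ == "__main__":
    # Placeholder values; the swift:// path layout is an assumption.
    job = build_wordcount_job_configs(
        "demo-user",
        "demo-password",
        "swift://demo-container/sample_input.txt",
        "/user/demo/wordcount-output",
    )
    for key in sorted(job["configs"]):
        print(key)
    print(job["args"])
```

Keeping the Swift credentials next to `edp.spark.adapt_for_swift` makes it easier to spot a missing value before launching, since the README states that all three are required to read the input file from Swift.
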
--- a/etc/edp-examples/edp-spark/sample_input.txt
+++ b/etc/edp-examples/edp-spark/sample_input.txt
@@ -0,0 +1,10 @@
+one
+one
+one
+one
+two
+two
+two
+three
+three
+four
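
To show what a word count over `sample_input.txt` should produce, here is a minimal plain-Python sketch. It is not the Spark job shipped in `spark-wordcount.jar`, and the output format of the real job is not shown in this commit; `count_words` is a hypothetical name used only for illustration. The input text is copied from the sample file above.

```python
from collections import Counter

# Contents of sample_input.txt as added in this commit.
SAMPLE_INPUT = """one
one
one
one
two
two
two
three
three
four
"""


def count_words(text):
    """Return a Counter mapping each whitespace-separated word to its count."""
    return Counter(text.split())


if __name__ == "__main__":
    # Print words from most to least frequent.
    for word, count in count_words(SAMPLE_INPUT).most_common():
        print(f"{word}: {count}")
```

For this input the counts are 4 for `one`, 3 for `two`, 2 for `three`, and 1 for `four`, which gives a simple reference to compare against when checking the job's output.
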
--- a/etc/edp-examples/edp-spark/spark-wordcount.jar
+++ b/etc/edp-examples/edp-spark/spark-wordcount.jar
--- a/etc/scenario/sahara-ci/edp.yaml.mako
+++ b/etc/scenario/sahara-ci/edp.yaml.mako
@@ -95,6 +95,20 @@ edp_jobs_flow:
        edp.java.main_class: org.apache.spark.examples.SparkPi
      args:
        - 4
+    - type: Spark
+      input_datasource:
+        type: swift
+        source: etc/edp-examples/edp-spark/sample_input.txt
+      main_lib:
+        type: database
+        source: etc/edp-examples/edp-spark/spark-wordcount.jar
+      configs:
+        edp.java.main_class: sahara.edp.spark.SparkWordCount
+        edp.spark.adapt_for_swift: true
+        fs.swift.service.sahara.username: ${OS_USERNAME}
+        fs.swift.service.sahara.password: ${OS_PASSWORD}
+      args:
+        - '{input_datasource}'
  transient:
    - type: Pig
      input_datasource:
@@ -155,5 +169,3 @@ edp_jobs_flow:
      args:
        - 10
        - 10
-
-