deb-sahara/etc/edp-examples/edp-spark
Vitaly Gridnev 82942e5125 Add sample spark wordcount job
Added new spark job that can read data from
Swift. Also added job to Sahara CI to test that.

Implements blueprint: edp-spark-example-with-swift

Change-Id: I3484a8ba0bddebea34b46ab33af9e6ed06bf4f44
2015-08-27 15:31:49 +03:00
NOTICE.txt Add Spark integration test 2014-08-18 11:02:39 -04:00
README.rst Add sample spark wordcount job 2015-08-27 15:31:49 +03:00
sample_input.txt Add sample spark wordcount job 2015-08-27 15:31:49 +03:00
spark-example.jar Add Spark integration test 2014-08-18 11:02:39 -04:00
spark-wordcount.jar Add sample spark wordcount job 2015-08-27 15:31:49 +03:00

Example Spark Job
=================

This example contains the compiled classes for SparkPi extracted from the example jar distributed with Apache Spark version 1.3.1.

The SparkPi example estimates Pi. It takes a single optional integer argument specifying the number of slices (tasks) to use.
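As a rough illustration of what the job computes, the sketch below estimates Pi with Monte Carlo sampling in plain Python, without Spark. It is not the code shipped in the jar: the function names, the ``samples_per_slice`` parameter, and the fixed seed are assumptions made for this example, and each slice is processed sequentially here rather than as a distributed task.

```python
import random


def count_hits(samples, rng):
    """Count random points in the unit square that fall inside the quarter circle."""
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(slices=2, samples_per_slice=100_000, seed=0):
    """Estimate Pi by summing per-slice hit counts.

    Each slice stands in for one task; names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    total_hits = sum(count_hits(samples_per_slice, rng) for _ in range(slices))
    # The quarter circle covers Pi/4 of the unit square, so scale the ratio by 4.
    return 4.0 * total_hits / (slices * samples_per_slice)


if __name__ == "__main__":
    print(estimate_pi(slices=4))
```

Raising the number of slices in this sketch only adds samples; in the real job it also controls how many tasks the work is split into.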

Example spark-wordcount Job
===========================

spark-wordcount is a modified version of the WordCount example from Apache Spark. It can read input data from HDFS or a Swift container, and it writes the number of occurrences of each word to standard output or HDFS.
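The counting step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code inside spark-wordcount.jar: the ``word_count`` name is hypothetical, and splitting on whitespace is an assumption, since the jar's tokenization rules are not described here.

```python
from collections import Counter


def word_count(lines):
    """Return a Counter mapping each word to its number of occurrences.

    Assumption: words are separated by whitespace.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```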

Launching wordcount job from Sahara UI
--------------------------------------

  1. Create a job binary that points to spark-wordcount.jar.

  2. Create a job template and set spark-wordcount.jar as the main binary of the job template.

  3. Create a Swift container with your input file. As an example, you can upload sample_input.txt.

  4. Launch the job:

    1. Put the path to the input file in args.
    2. Put the path to the output file in args.
    3. Fill the Main class input with the following class: sahara.edp.spark.SparkWordCount
    4. Set the following values in the job's configs: edp.spark.adapt_for_swift to True, fs.swift.service.sahara.username to your username, and fs.swift.service.sahara.password to the password for that username. These values are required to access the input file stored in Swift.
    5. Execute the job. You will be able to view your output in HDFS.
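The values entered in step 4 can be collected in one place, for example when scripting the launch instead of using the UI. The sketch below only assembles them into a dictionary; it does not call Sahara. The ``build_job_configs`` helper, the ``configs``/``args`` layout, the ``edp.java.main_class`` key for the Main class field, and the example paths are all assumptions made for illustration and should be checked against your Sahara version.

```python
def build_job_configs(username, password, input_path, output_path):
    """Assemble the wordcount job settings described in the steps above.

    Hypothetical helper: the dictionary layout is an assumption,
    not a confirmed Sahara request format.
    """
    return {
        "configs": {
            # Assumed key for the value typed into the Main class field.
            "edp.java.main_class": "sahara.edp.spark.SparkWordCount",
            # Required so the job can read its input from Swift.
            "edp.spark.adapt_for_swift": True,
            "fs.swift.service.sahara.username": username,
            "fs.swift.service.sahara.password": password,
        },
        # Input path first, then output path, matching the order of the steps.
        "args": [input_path, output_path],
    }


if __name__ == "__main__":
    # Both paths are hypothetical placeholders.
    job_configs = build_job_configs(
        username="demo-user",
        password="demo-password",
        input_path="swift://demo-container.sahara/sample_input.txt",
        output_path="/user/hadoop/wordcount-output",
    )
    print(job_configs["args"])
```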