Add one more sample for pig job examples
This patch adds new pig example "Top TODOers" shown in previous ATL OpenStack Summit: * Updated corresponding documentation and paths in integration tests Implements blueprint: edp-examples Change-Id: I4285b9c3a334cda7387776bc06147ef53c0a57e0
This commit is contained in:
parent
889f5296b5
commit
4e86bda8eb
@ -240,10 +240,10 @@ will give a walkthrough on how to run those jobs via the Horizon UI. These steps
|
|||||||
assume that you already have a cluster up and running (in the "Active" state).
|
assume that you already have a cluster up and running (in the "Active" state).
|
||||||
|
|
||||||
1) Sample Pig job -
|
1) Sample Pig job -
|
||||||
https://github.com/openstack/sahara/tree/master/etc/edp-examples/pig-job
|
https://github.com/openstack/sahara/tree/master/etc/edp-examples/edp-pig/trim-spaces
|
||||||
|
|
||||||
- Load the input data file from
|
- Load the input data file from
|
||||||
https://github.com/openstack/sahara/tree/master/etc/edp-examples/pig-job/data/input
|
https://github.com/openstack/sahara/tree/master/etc/edp-examples/edp-pig/trim-spaces/data/input
|
||||||
into swift
|
into swift
|
||||||
|
|
||||||
- Click on Project/Object Store/Containers and create a container with any
|
- Click on Project/Object Store/Containers and create a container with any
|
||||||
@ -270,11 +270,11 @@ assume that you already have a cluster up and running (in the "Active" state).
|
|||||||
|
|
||||||
- Name = example.pig, Storage type = Internal database, click Browse and
|
- Name = example.pig, Storage type = Internal database, click Browse and
|
||||||
find example.pig wherever you checked out the sahara project
|
find example.pig wherever you checked out the sahara project
|
||||||
<sahara root>/etc/edp-examples/pig-job
|
<sahara root>/etc/edp-examples/edp-pig/trim-spaces
|
||||||
|
|
||||||
- Create another Job Binary: Name = udf.jar, Storage type = Internal
|
- Create another Job Binary: Name = udf.jar, Storage type = Internal
|
||||||
database, click Browse and find udf.jar wherever you checked out the
|
database, click Browse and find udf.jar wherever you checked out the
|
||||||
sahara project <sahara root>/etc/edp-examples/pig-job
|
sahara project <sahara root>/etc/edp-examples/edp-pig/trim-spaces
|
||||||
|
|
||||||
- Create a Job
|
- Create a Job
|
||||||
|
|
||||||
|
68
etc/edp-examples/edp-pig/top-todoers/README.rst
Normal file
68
etc/edp-examples/edp-pig/top-todoers/README.rst
Normal file
@ -0,0 +1,68 @@
|
|||||||
|
Top TODOers Pig job
|
||||||
|
===================
|
||||||
|
|
||||||
|
This script calculates top TODOers in input sources.
|
||||||
|
|
||||||
|
Example of usage
|
||||||
|
----------------
|
||||||
|
|
||||||
|
This pig script can process as many input files (sources) as you want.
|
||||||
|
Just put all input files in a directory in HDFS or container in Swift and
|
||||||
|
give the path of the HDFS directory (Swift object) as input DataSource for EDP.
|
||||||
|
|
||||||
|
Here are steps how to prepare input data:
|
||||||
|
|
||||||
|
1. Create dir 'input'
|
||||||
|
|
||||||
|
.. sourcecode:: console
|
||||||
|
|
||||||
|
$ mkdir input
|
||||||
|
|
||||||
|
2. Get some sources from GitHub and put it to 'input' directory:
|
||||||
|
|
||||||
|
.. sourcecode:: console
|
||||||
|
|
||||||
|
$ cd input
|
||||||
|
$ git clone "https://github.com/openstack/swift.git"
|
||||||
|
$ git clone "https://github.com/openstack/nova.git"
|
||||||
|
$ git clone "https://github.com/openstack/glance.git"
|
||||||
|
$ git clone "https://github.com/openstack/image-api.git"
|
||||||
|
$ git clone "https://github.com/openstack/neutron.git"
|
||||||
|
$ git clone "https://github.com/openstack/horizon.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-novaclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-keystoneclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/oslo-incubator.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-neutronclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-glanceclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-swiftclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-cinderclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/ceilometer.git"
|
||||||
|
$ git clone "https://github.com/openstack/cinder.git"
|
||||||
|
$ git clone "https://github.com/openstack/heat.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-heatclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-ceilometerclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/oslo.config.git"
|
||||||
|
$ git clone "https://github.com/openstack/ironic.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-ironicclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/operations-guide.git"
|
||||||
|
$ git clone "https://github.com/openstack/keystone.git"
|
||||||
|
$ git clone "https://github.com/openstack/oslo.messaging.git"
|
||||||
|
$ git clone "https://github.com/openstack/oslo.sphinx.git"
|
||||||
|
$ git clone "https://github.com/openstack/oslo.version.git"
|
||||||
|
$ git clone "https://github.com/openstack/sahara.git"
|
||||||
|
$ git clone "https://github.com/openstack/python-saharaclient.git"
|
||||||
|
$ git clone "https://github.com/openstack/openstack.git"
|
||||||
|
$ cd ..
|
||||||
|
|
||||||
|
3. Create single file containing all sources:
|
||||||
|
|
||||||
|
.. sourcecode:: console
|
||||||
|
|
||||||
|
tar -cf input.tar input/*
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Pig can operate with raw files as well as with compressed data, so in this
|
||||||
|
step you might want to create *.gz file with sources and it should work.
|
||||||
|
|
||||||
|
4. Upload input.tar to Swift or HDFS as input data source for EDP processing
|
@ -0,0 +1,3 @@
|
|||||||
|
2 https://launchpad.net/~slukjanov
|
||||||
|
1 https://launchpad.net/~aignatov
|
||||||
|
1 https://launchpad.net/~mimccune
|
18
etc/edp-examples/edp-pig/top-todoers/data/input
Normal file
18
etc/edp-examples/edp-pig/top-todoers/data/input
Normal file
@ -0,0 +1,18 @@
|
|||||||
|
# There is some source file with TODO labels inside
|
||||||
|
|
||||||
|
|
||||||
|
def sum(a, b):
|
||||||
|
# TODO(slukjanov): implement how to add numbers
|
||||||
|
return None
|
||||||
|
|
||||||
|
def sum(a, b):
|
||||||
|
# TODO(slukjanov): implement how to subtract numbers
|
||||||
|
return None
|
||||||
|
|
||||||
|
def divide(a, b):
|
||||||
|
# TODO(aignatov): implement how to divide numbers
|
||||||
|
return None
|
||||||
|
|
||||||
|
def mul(a, b):
|
||||||
|
# TODO(mimccune): implement how to multiply numbers
|
||||||
|
return None
|
17
etc/edp-examples/edp-pig/top-todoers/example.pig
Normal file
17
etc/edp-examples/edp-pig/top-todoers/example.pig
Normal file
@ -0,0 +1,17 @@
|
|||||||
|
input_lines = LOAD '$INPUT' AS (line:chararray);
|
||||||
|
|
||||||
|
-- filter out any lines that are not with TODO
|
||||||
|
todo_lines = FILTER input_lines BY line MATCHES '.*TODO\\s*\\(\\w+\\)+.*';
|
||||||
|
ids = FOREACH todo_lines GENERATE FLATTEN(REGEX_EXTRACT($0, '(.*)\\((.*)\\)(.*)', 2));
|
||||||
|
|
||||||
|
-- create a group for each word
|
||||||
|
id_groups = GROUP ids BY $0;
|
||||||
|
|
||||||
|
-- count the entries in each group
|
||||||
|
atc_count = FOREACH id_groups GENERATE COUNT(ids) AS count, group AS atc;
|
||||||
|
|
||||||
|
-- order the records by count
|
||||||
|
result = ORDER atc_count BY count DESC;
|
||||||
|
result = FOREACH result GENERATE count, CONCAT('https://launchpad.net/~', atc);
|
||||||
|
|
||||||
|
STORE result INTO '$OUTPUT' USING PigStorage();
|
@ -27,7 +27,7 @@ from sahara.utils import edp
|
|||||||
|
|
||||||
|
|
||||||
class EDPJobInfo(object):
|
class EDPJobInfo(object):
|
||||||
PIG_PATH = 'etc/edp-examples/pig-job/'
|
PIG_PATH = 'etc/edp-examples/edp-pig/trim-spaces/'
|
||||||
JAVA_PATH = 'etc/edp-examples/edp-java/'
|
JAVA_PATH = 'etc/edp-examples/edp-java/'
|
||||||
MAPREDUCE_PATH = 'etc/edp-examples/edp-mapreduce/'
|
MAPREDUCE_PATH = 'etc/edp-examples/edp-mapreduce/'
|
||||||
SPARK_PATH = 'etc/edp-examples/edp-spark/'
|
SPARK_PATH = 'etc/edp-examples/edp-spark/'
|
||||||
|
Loading…
Reference in New Issue
Block a user