From bcc350d87eb976ab9b7144aa2c55fe1f1d62a4b1 Mon Sep 17 00:00:00 2001
From: Trevor McKay
Date: Tue, 14 Jan 2014 14:31:30 -0500
Subject: [PATCH] Add a swift-enabled version of WordCount to edp-examples

The edp-wordcount example includes source code and instructions on how
to build the jar and run it from the Oozie command line or the Savanna
UI. WordCount will work with hdfs paths, or with swift paths as long as
the proper configs are set.

Change-Id: I9ac728de505d874fd50a6baf75b062d5b622f3d0
---
 edp-examples/README.rst                       |  1 +
 edp-examples/edp-wordcount/README.rst         | 75 ++++++++++++++
 edp-examples/edp-wordcount/src/NOTICE.txt     |  2 +
 edp-examples/edp-wordcount/src/WordCount.java | 95 ++++++++++++++++++
 .../edp-wordcount/wordcount/job.properties    | 23 +++++
 .../wordcount/lib/edp-wordcount.jar           | Bin 0 -> 3959 bytes
 .../edp-wordcount/wordcount/workflow.xml      | 49 +++++++++
 7 files changed, 245 insertions(+)
 create mode 100644 edp-examples/edp-wordcount/README.rst
 create mode 100644 edp-examples/edp-wordcount/src/NOTICE.txt
 create mode 100644 edp-examples/edp-wordcount/src/WordCount.java
 create mode 100644 edp-examples/edp-wordcount/wordcount/job.properties
 create mode 100644 edp-examples/edp-wordcount/wordcount/lib/edp-wordcount.jar
 create mode 100644 edp-examples/edp-wordcount/wordcount/workflow.xml

diff --git a/edp-examples/README.rst b/edp-examples/README.rst
index 549baf4..0924890 100644
--- a/edp-examples/README.rst
+++ b/edp-examples/README.rst
@@ -2,3 +2,4 @@ EDP Examples
 ============
 
 * Pig job example - trim spaces in input file
+* EDP WordCount - a version of WordCount that works with swift input and output
diff --git a/edp-examples/edp-wordcount/README.rst b/edp-examples/edp-wordcount/README.rst
new file mode 100644
index 0000000..3ff3843
--- /dev/null
+++ b/edp-examples/edp-wordcount/README.rst
@@ -0,0 +1,75 @@
+=====================
+EDP WordCount Example
+=====================
+Overview
+========
+
+``WordCount.java`` is a modified version of the WordCount example bundled with
+version 1.2.1 of Apache Hadoop. It has been extended for use from a java action
+in an Oozie workflow. The modification below allows any configuration values
+from the ``<configuration>`` tag in an Oozie workflow to be set in the Configuration
+object::
+
+  // This will add properties from the <configuration> tag specified
+  // in the Oozie workflow. For java actions, Oozie writes the
+  // configuration values to a file pointed to by oozie.action.conf.xml
+  conf.addResource(new Path("file:///",
+      System.getProperty("oozie.action.conf.xml")));
+
+In the example workflow, we use the ``<configuration>`` tag to specify user and
+password configuration values for accessing swift objects.
+
+Compiling
+=========
+
+To build the jar, add ``hadoop-core`` and ``commons-cli`` to the classpath.
+
+On a node running Ubuntu 13.04 with hadoop 1.2.1, the following commands
+will compile ``WordCount.java`` from within the ``src`` directory::
+
+  $ mkdir wordcount_classes
+  $ javac -classpath /usr/share/hadoop/hadoop-core-1.2.1.jar:/usr/share/hadoop/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java
+  $ jar -cvf edp-wordcount.jar -C wordcount_classes/ .
+
+(A compiled ``edp-wordcount.jar`` is included in ``wordcount/lib``. Replace it if you rebuild.)
+
+Running from the command line with Oozie
+========================================
+
+The ``wordcount`` subdirectory contains a ``job.properties`` file, a ``workflow.xml`` file,
+and a ``lib`` directory with an ``edp-wordcount.jar`` compiled as above.
+
+To run this example from Oozie, you will need to modify the ``job.properties`` file
+to specify the correct ``jobTracker`` and ``nameNode`` addresses for your cluster.
+
+You will also need to modify the ``workflow.xml`` file to contain the correct input
+and output paths. These paths may be Savanna swift URLs or hdfs paths. If swift
+URLs are used, set the ``fs.swift.service.savanna.username`` and ``fs.swift.service.savanna.password``
+properties in the ``<configuration>`` section.
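The ``conf.addResource`` mechanism described above works because Oozie publishes the path of the materialized action configuration through the ``oozie.action.conf.xml`` system property. As a rough, Hadoop-free analogue of that pattern (the class name ``OozieConfDemo`` is invented for illustration, and ``java.util.Properties`` stands in for Hadoop's XML ``Configuration`` format, which is different):

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class OozieConfDemo {
    // Rough analogue of conf.addResource(...): merge extra properties from a
    // file whose location is published through a system property. Real Oozie
    // writes a Hadoop-style XML config file; plain key=value is used here so
    // the sketch stays self-contained.
    static Properties loadActionConf(Properties base) throws IOException {
        String path = System.getProperty("oozie.action.conf.xml");
        if (path == null) {
            return base;  // no action configuration supplied
        }
        try (Reader r = Files.newBufferedReader(Path.of(path))) {
            base.load(r);  // later definitions override earlier ones
        }
        return base;
    }

    public static void main(String[] args) throws IOException {
        // Simulate Oozie: write a config file and advertise its location.
        Path tmp = Files.createTempFile("action-conf", ".properties");
        Files.writeString(tmp, "fs.swift.service.savanna.username=swiftuser\n");
        System.setProperty("oozie.action.conf.xml", tmp.toString());

        Properties conf = loadActionConf(new Properties());
        System.out.println(conf.getProperty("fs.swift.service.savanna.username"));
    }
}
```

In the real job, ``Configuration.addResource`` plays the role of ``load`` here, layering the Oozie-written values on top of the cluster defaults.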
+
+1) Upload the ``wordcount`` directory to hdfs
+
+   ``$ hadoop fs -put wordcount wordcount``
+
+2) Launch the job, specifying the correct Oozie server and port
+
+   ``$ oozie job -oozie http://oozie_server:port/oozie -config wordcount/job.properties -run``
+
+3) Don't forget to create your swift input path! A Savanna swift URL looks like *swift://container.savanna/object*
+
+Running from the Savanna UI
+===========================
+
+Running the WordCount example from the Savanna UI is very similar to running a Pig, Hive,
+or MapReduce job.
+
+1) Create a job binary that points to the ``edp-wordcount.jar`` file
+2) Create a job of type ``Java`` and add the job binary to the ``libs`` value
+3) Launch the job:
+
+   a) Add the input and output paths to ``args``
+   b) If swift input or output paths are used, set the ``fs.swift.service.savanna.username`` and ``fs.swift.service.savanna.password``
+      configuration values
+
+
diff --git a/edp-examples/edp-wordcount/src/NOTICE.txt b/edp-examples/edp-wordcount/src/NOTICE.txt
new file mode 100644
index 0000000..62fc581
--- /dev/null
+++ b/edp-examples/edp-wordcount/src/NOTICE.txt
@@ -0,0 +1,2 @@
+This product includes software developed by The Apache Software
+Foundation (http://www.apache.org/).
diff --git a/edp-examples/edp-wordcount/src/WordCount.java b/edp-examples/edp-wordcount/src/WordCount.java
new file mode 100644
index 0000000..db03798
--- /dev/null
+++ b/edp-examples/edp-wordcount/src/WordCount.java
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.examples;
+
+import java.io.IOException;
+import java.util.StringTokenizer;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.util.GenericOptionsParser;
+
+public class WordCount {
+
+  public static class TokenizerMapper
+       extends Mapper<Object, Text, Text, IntWritable>{
+
+    private final static IntWritable one = new IntWritable(1);
+    private Text word = new Text();
+
+    public void map(Object key, Text value, Context context
+                    ) throws IOException, InterruptedException {
+      StringTokenizer itr = new StringTokenizer(value.toString());
+      while (itr.hasMoreTokens()) {
+        word.set(itr.nextToken());
+        context.write(word, one);
+      }
+    }
+  }
+
+  public static class IntSumReducer
+       extends Reducer<Text, IntWritable, Text, IntWritable> {
+    private IntWritable result = new IntWritable();
+
+    public void reduce(Text key, Iterable<IntWritable> values,
+                       Context context
+                       ) throws IOException, InterruptedException {
+      int sum = 0;
+      for (IntWritable val : values) {
+        sum += val.get();
+      }
+      result.set(sum);
+      context.write(key, result);
+    }
+  }
+
+  public static void main(String[] args) throws Exception {
+    Configuration conf = new Configuration();
+    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
+    if (otherArgs.length != 2) {
+      System.err.println("Usage: wordcount <in> <out>");
+      System.exit(2);
+    }
+
+    // ---- Begin modifications for EDP ----
+    // This will add properties from the <configuration> tag specified
+    // in the Oozie workflow. For java actions, Oozie writes the
+    // configuration values to a file pointed to by oozie.action.conf.xml
+    conf.addResource(new Path("file:///",
+                              System.getProperty("oozie.action.conf.xml")));
+    // ---- End modifications for EDP ----
+
+    Job job = new Job(conf, "word count");
+    job.setJarByClass(WordCount.class);
+    job.setMapperClass(TokenizerMapper.class);
+    job.setCombinerClass(IntSumReducer.class);
+    job.setReducerClass(IntSumReducer.class);
+    job.setOutputKeyClass(Text.class);
+    job.setOutputValueClass(IntWritable.class);
+    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
+    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
+    System.exit(job.waitForCompletion(true) ? 0 : 1);
+  }
+}
diff --git a/edp-examples/edp-wordcount/wordcount/job.properties b/edp-examples/edp-wordcount/wordcount/job.properties
new file mode 100644
index 0000000..5e9f4fb
--- /dev/null
+++ b/edp-examples/edp-wordcount/wordcount/job.properties
@@ -0,0 +1,23 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# + +nameNode=hdfs://1.2.3.4:8020 +jobTracker=1.2.3.4:8021 +queueName=default + +oozie.wf.application.path=${nameNode}/user/${user.name}/wordcount diff --git a/edp-examples/edp-wordcount/wordcount/lib/edp-wordcount.jar b/edp-examples/edp-wordcount/wordcount/lib/edp-wordcount.jar new file mode 100644 index 0000000000000000000000000000000000000000..21fdef7618d1783659fff9b4a73e0098c93d6d8d GIT binary patch literal 3959 zcma)9c|6nqAD`SzE$6fxNotrYI$Wh0jmv$0XpZ8m;;N-kZXQ;|roY3?eoGc+M*JY7^-Pck2n$Okw7~+py$xjy zQ+g&~43qC{@4AOk>D@c>)=S95c;!n#upZdc1pFTY z1%D1CU~xpE#~7+~VkMRFR9|oi;+f0Pk}_3=myVFU}=~ zs?9o^lw>#gz9wlu+PDv`ak@&Oa`vnXA`{0TUiZz+ULp+P>&D_RD*2EaEqL_lou*Ia z+$LIrJ_Nfnn)6FE+?Dx>i~h#>Psd%QFWK6&9l{MUam&)TPHjptthYEo4U34ZLw{+ zcYuS~On>>2hw1x_$y!R%!aif_@Ghw?v+k=(7)H499S6&McS;~v@ZRx|6@GVxst{ZH z#GPRES!jyT>+@wgS2lAl%^j|{T~uAXp|3#2$!gcpPDquxVCGLwJu4pntD-2b_ zOn;6AW*_Vj$;`!^Fn1&!ZKmgXr+DmiOf<9o@yrzPhdt_Yz(xE1RpEV@~aq-V)eFJ|3bNioZMo z=n(SfDVA&Hv6e=E@&Aesb)W})m7tQUfag)J!&%k9x1X++)=EY)24@DAWP-i*eKS+O zWaUVS9o>hnhPPcx)wZZr$-6=%> zTx2|m>@sHv<(q^06Kp@(bgzhZJhJnPH~BOIHlQ9fsmP+Pgx6(EiQ~V>#qq3huudn{ zUDMaUF^{KJuI~rKuSIs?---|yuJTD;oFC9P?rvaAmoofP_H^3HTeDrB3UOB3 zd&bwP?x1&F=S||rLz}?O4*_YavSVHVN#I}4Lg~^pFt1E-AxL#@hmyX`C6hum19o8& zqN$4F0@RJPNzdu%zM)zYd^_>}pr%d#eX4Vk)sW?L^24+5Dl4tWkrA0M1coxP46>m$ zI4>b}5(pC?8i_P(bW)(#xG0slF$ND^pxCT^ z0f)c2T>S3X!MZGd!TqIa*1C92at3J0#)E4Id3cNuO3*8!Ksk{pp-JJv21>A9ij_*( zYd%a>?kP!rpuOSeqAYfMwOKWa1ciS6_^lJ;BBV$4dbBRh{8yl>%a1a+O;j$Br=Xb$ zJI3Q^WuHbRp|0^p71buHG5OyPZG)Ri1nqfUo?hOuKw`@d;W$m`U25|s8>(^Br|QQ^ zCFN6994_YMLPqm^p6SZCF&9 zDyotD&A3>)&3NBjI1ZBGI4e$wa=nbueZ1hoKkOGXTy=0pwsC@sOBx%ltx7E zmN!Wi!ni^NfEEJw0Wc4FUPhITZ z`R|!}V&ndkp5*_fp1xmFQC0)i5JS*IHv!2VWVB)#+%Yl~j7>tSfFUYexW|rLXdt?x zJ2DC!$V&EJkw92F^ZJd&qmj?myD@mSOqRam}02kD&bkdtLQSyfDL zuVm3tqjZ}xX;1UUxCZ?B;=*Hz%lQ!+SE1rqo#5Rs!%qpFMA{#36op^x@v(65M2}O8 z?}}ICIrC|mk3{d>PHrN(n5Hzv-c3GtOIF}cG^BuTq7pi)0|6S*T6xRu)KO*AB712L zlc}4W6J++I$e)*=H`=eJf8cbV_>|xRaSB9fzPNcJb11j4PUk+N(c*UW4mIx>kzM_F 
zQHG8EvNsjGIwNf125v7arcco3QGSOF`nsbIz?1QlB!|%fiR_PzbbR7sG2Z_8e)!NP;xA0i6EDOs` zmc3hq-NH$}E|x2DzUA|qR9)iM`{Col3rJ+M<0;byHsvHf+Or~b?9xZN4Vuby@8dL_N}*|U*W!cf+Zd3EjPx(IfG zu$TQFp;F+q;r2I1PBu7ua?+)2*%^pEiA%g`EHYWVw9K#lyoE!R%Mm4Z=YnZ%n3WxM zx5}EGzC~EH!W4`giq`74E-R?9Y3bd?T_0&ho9qLgMH+Eh$nj0&g|mm$WHrQNZto-) zRU%4zb#PT8<3bIUT8~u@*h_T1eGxiVs`?}(@C?=Jnt2Cb+kBdE z@p?h&LdUkJ#cIuGe9?JFk2O-T)>cu`0i@NZna*j*d!yC6y6i!vLKT)I2zBJ@-WkSP ziEI7SDKb*BhT0c5@BE@XGJW`}!NYlx#M)7;nR4gM>P!EcgY&vFNnJx{FwilL#FevH z5v7hVgE2vvWJ0Y%TXw^P3zgkhtyd~r-iWB2-}QU|j@i9qXL0pK;O96=u={i9BV3kp z%n_6OQwXURT&H?(%;x2{!Ztn4Jh03o$fYs`0m2&H(7fU=JQYHDmo_d0mz^xl~*2O65wpFS@Ta@Jfz~!g+dii zZF4y)jqcgoE^X={H^JC+ZEk1WviQQ(%{?%lZ&B_d?DC6`PTgJ(h@!1*^_dKpUZ?X~Ez-c^3q@$@!vRLGFo=Hr;%znL-HCSkzOUgm#Bh2>>_#t(DC zayI{@ZnPIHC$kP6=I{q~gRf!Dzz6uf4OyRo#vFb(CBJ8|Y{@#tnL`%ihe`Q8k!4fX zA;9(%@rP;oJ)dP;)}ixD{st4XzThkyv$6ML*_d_s{&}MRV`?^j9af%=cM_|UTgPkW zSKhEvf4}gLg4gr>qObdBmW^ABb-%4cfw|fKO_uMsotZHwvlj&bc$vRs=6-AivY!3} D2>>Ew

literal 0
HcmV?d00001

diff --git a/edp-examples/edp-wordcount/wordcount/workflow.xml b/edp-examples/edp-wordcount/wordcount/workflow.xml
new file mode 100644
index 0000000..2c0195a
--- /dev/null
+++ b/edp-examples/edp-wordcount/wordcount/workflow.xml
@@ -0,0 +1,49 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<workflow-app xmlns="uri:oozie:workflow:0.2" name="java-main-wf">
+    <start to="java-node"/>
+    <action name="java-node">
+        <java>
+            <job-tracker>${jobTracker}</job-tracker>
+            <name-node>${nameNode}</name-node>
+            <configuration>
+                <property>
+                    <name>mapred.job.queue.name</name>
+                    <value>${queueName}</value>
+                </property>
+                <property>
+                    <name>fs.swift.service.savanna.username</name>
+                    <value>swiftuser</value>
+                </property>
+                <property>
+                    <name>fs.swift.service.savanna.password</name>
+                    <value>swiftpassword</value>
+                </property>
+            </configuration>
+            <main-class>org.apache.hadoop.examples.WordCount</main-class>
+            <arg>swift://user.savanna/input</arg>
+            <arg>swift://user.savanna/output</arg>
+        </java>
+        <ok to="end"/>
+        <error to="fail"/>
+    </action>
+    <kill name="fail">
+        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
+    </kill>
+    <end name="end"/>
+</workflow-app>
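For reference, the per-key logic that ``TokenizerMapper`` and ``IntSumReducer`` distribute across the cluster can be sketched as a plain in-memory count, with no Hadoop types (the class name ``WordCountSketch`` is invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {
    // Tokenize on whitespace and sum a count per token, mirroring what the
    // mapper (emit (word, 1)) and reducer (sum values per word) compute.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
        // prints {to=2, be=2, or=1, not=1}
    }
}
```

The Hadoop job emits the same key/count pairs; since ``IntSumReducer`` is also registered as the combiner, counts are pre-summed on each mapper node before the shuffle.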