Tuesday, April 19, 2016

Spark Application in Eclipse

I have prepared a step-by-step guide to building a Spark project in Scala using Eclipse. We will run the application locally from Eclipse, and then run it against files in HDFS. You might have already guessed it: yes, we will use WordCount as the example. After all, we do not want to be an outlier by working on something else.



So, let's get started. I am assuming you already have Spark set up and running. If not, you can download the Cloudera QuickStart VM from http://www.cloudera.com/downloads.html. I recommend this if you do not have anything set up yet, because it is quick and easy.

1. The first thing to do is install Scala IDE in Eclipse. Open Eclipse and go to Help -> Eclipse Marketplace.


2. Search for 'scala', and then install 'Scala IDE'.




3. Once the installation is complete, click Window -> Open Perspective -> Other. You should see Scala as an option; choose Scala and click OK.



4. Now we create a Maven project. Right click in the Package Explorer and go to New -> Project, choose Maven Project and click Next. On the archetype page, choose the group ID and artifact ID shown and click Next. Then enter your own group ID and artifact ID and click Next.



Your package structure looks something like this.





5. We will not write the application in Java; we will write it in Scala instead. To do that, right click on your project, go to Configure, and click 'Add Scala Nature'.



6. Now your package structure should look like this.


7. We are still not ready to write Scala objects. Right click on src/main/java -> Build Path -> Configure Build Path.


8. Choose /src/main/java and click 'Add Folder'.


Choose main and click 'Create New Folder'.


Type scala as folder name and click Next.



Next we add an inclusion pattern: under Inclusion Patterns, click 'Add'.
Type **/*.scala as the pattern and click Next.

Then click Finish.
Now your package structure should look like this.




9. Now we are almost ready. Before we create an object, we will add the Scala and Spark dependencies to the pom file.
Open pom.xml and add these dependencies:

    <dependency> <!-- Scala dependency -->
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.11.8</version>
    </dependency>

    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>1.5.0</version>
    </dependency>


10. Now we are ready. Right click on 'src/main/scala' -> New -> Package. Give the package a name and click Finish.


11. Now let's create a Scala object. Right click on your package -> New -> Scala Object, and name it WordCount.


This is what we have now.




12. Modify WordCount as shown in the picture.


Notice that it requires one argument: the file whose words it will count. If no argument is given, it fails and prints the string 'Missing Arguments'.
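In case the picture is not visible, here is a minimal sketch of what such a WordCount object might look like. It is not necessarily the exact code from the screenshot. The package name com.test is taken from the spark-submit command used later in this post; the helper name countWords, the whitespace splitting, and the fallback to a local master are my own assumptions.

```scala
package com.test

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {

  // Hypothetical helper: split each line on whitespace and count each word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))   // one element per word
      .filter(_.nonEmpty)         // drop empty strings from leading whitespace
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // The input file path is required as the first argument.
    if (args.length < 1) {
      println("Missing Arguments")
      sys.exit(1)
    }

    val conf = new SparkConf().setAppName("WordCount")
    // Assumption: fall back to a local master when none is supplied,
    // so the object can be run directly from Eclipse.
    conf.setIfMissing("spark.master", "local")

    val sc = new SparkContext(conf)
    try {
      // Print each (word, count) pair to the console.
      countWords(sc.textFile(args(0))).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Using setIfMissing rather than hard-coding the master means a --master value passed on the command line should still take effect.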

13. Let's run this just to see what it says without setting up the argument. For that right click -> Run As -> Scala Application.


Notice the error, 'Missing Arguments'.


14. Now to set up an argument. I have created a file 'word' inside /home/cloudera/input. This will be my input for WordCount application.

Then right click in Eclipse -> Run Configurations.
Under Scala Application, choose WordCount and click the Arguments tab. Give the path to your file as the program argument. Then click Apply -> Run.




15. Here is the output.


16. The next step is to save the output as a file inside /home/cloudera/output.
Our application will now take two parameters: first, the input file; second, the path where the output should be saved.
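One way the two-argument version could look is sketched below. Again, this is an illustration under assumptions rather than the exact code from the screenshot: the package name com.test comes from the spark-submit command later in this post, and the whitespace splitting and local-master fallback are my own choices.

```scala
package com.test

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    // args(0): input file, args(1): output path
    if (args.length < 2) {
      println("Missing Arguments")
      sys.exit(1)
    }

    val conf = new SparkConf().setAppName("WordCount")
    // Assumption: use a local master unless one was supplied externally.
    conf.setIfMissing("spark.master", "local")

    val sc = new SparkContext(conf)
    try {
      val counts = sc.textFile(args(0))
        .flatMap(_.split("\\s+"))   // split lines into words
        .filter(_.nonEmpty)
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // Writes a directory at args(1) containing part-* files.
      counts.saveAsTextFile(args(1))
    } finally {
      sc.stop()
    }
  }
}
```

Note that saveAsTextFile creates a directory, not a single file, and fails if that directory already exists, so remove the old output folder before running the application again.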


Then update the arguments in the run configuration to include the output path, and click Apply -> Run.




17. Check inside /home/cloudera/output; you will see a folder named 'wordcount'.


18. Now we will read a file from HDFS and write the output back to HDFS. To do that, export the project as a jar file.
Right click on your project -> Export, and choose JAR file.


Give the path to save your jar and click Finish.


19. We can run the application with this command 

$ spark-submit --class com.test.WordCount --master local /home/cloudera/spark-test/WordCount.jar input/word output/wordcount

The first argument, input/word, is the path to the HDFS file 'word'. The second argument, output/wordcount, is the HDFS path where the output folder 'wordcount' will be created, with the output file inside it.



