Apache Spark is becoming more ubiquitous by the day and has been dubbed the next big thing in the Big Data world. With its speed and scalability, Spark has been steadily replacing MapReduce. In this series of articles on Spark, we will solve various problems using Spark and Java.
The word count program is the big data equivalent of the classic Hello World program. Its aim is to scan a text file and display the number of times each word occurs in that file. For this word count application we will be using Apache Spark 1.6 with Java 8.
For this program, we will be running Spark in local mode on a single machine, so you don't need to set up a cluster. Even Hadoop is not required for this exercise. Assuming you have Spark, Java and Maven installed properly, let's proceed.
Creating pom.xml
To compile Java programs with Maven, you will need a pom.xml file with the required dependencies. Use this pom.xml file if you don't already have one.
<?xml version="1.0" encoding="UTF-8"?>
<project>
<groupId>com.freblogg.sparklearning</groupId>
<artifactId>freblogg-spark-tutorial</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>example</name>
<packaging>jar</packaging>
<version>0.0.1</version>
<dependencies>
<dependency>
<!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
<scope>provided</scope>
</dependency>
</dependencies>
<properties>
<java.version>1.8</java.version>
<encoding>UTF-8</encoding>
<spark.version>1.6.1</spark.version>
</properties>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-plugin-plugin</artifactId>
<version>3.3</version>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
Now, save this file as pom.xml and put it in the same folder as your src directory.
Input File
After creating the POM file, you will need an input file to run our word count program on, so that it can count the number of occurrences of each word. This is the file I will be using.
It is close to midnight and something evil is lurking in the dark Under the moonlight you see a sight that almost stops your heart You try to scream but terror takes the sound before you make it You start to freeze as horror looks you right between the eyes You are paralyzed
Java Program
Once we have the pom file ready, we can start with the code.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import scala.Tuple2;

import java.util.Arrays;

public class WordCount {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load our input data.
        String inputFile = "file:///home/dsp/Desktop/sparkExamples/sample_testing/resources/inputFile";
        JavaRDD<String> input = sc.textFile(inputFile);

        // Split each line into a list of words.
        JavaRDD<String> words = input.flatMap(l -> Arrays.asList(l.split(" ")));

        // Transform into (word, 1) pairs and sum the counts per word.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((x, y) -> x + y);

        System.out.println(counts.collect());

        sc.stop();
    }
}
Execution
Once we have everything ready, it's time to execute our program and see the output.
To compile it, first run this command in the directory with the pom file.
mvn clean && mvn compile && mvn package
This will take some time the first time it runs because Maven has to download and install the dependencies. After a successful build, it creates the target folder and a jar file named freblogg-spark-tutorial-0.0.1.jar.
Then, to execute the program, run the spark-submit script from your SPARK_HOME folder.
$SPARK_HOME/bin/spark-submit --class "WordCount" target/freblogg-spark-tutorial-0.0.1.jar
Once this command is executed, your screen will be completely filled with Spark logs. If you scroll up a bit, you will see the following output, which is the output we are interested in.
[(freeze,1), (are,1), (Under,1), (it,1), (is,2), (you,3), (takes,1), (lurking,1), (right,1), (that,1), (a,1), (You,3), (terror,1), (start,1), (dark,1), (between,1), (scream,1), (before,1), (to,3), (as,1), (in,1), (moonlight,1), (sound,1), (midnight,1), (see,1), (stops,1), (sight,1), (try,1), (something,1), (paralyzed,1), (evil,1), (It,1), (eyes,1), (make,1), (almost,1), (but,1), (and,1), (close,1), (heart,1), (looks,1), (your,1), (horror,1), (the,4)]
Those are the counts of each word in the file. So, there you go. You have successfully written your first Spark application. Congratulations, you're officially a Spark programmer now!
Understanding the code
Now that we have our application set up, let's see what the program is doing, step by step.
First we have the Spark variables conf and sc. Don't worry too much about them right now. All you need to know is that every Spark program needs these two lines.
SparkConf conf = new SparkConf().setMaster("local").setAppName("wordCount");
JavaSparkContext sc = new JavaSparkContext(conf);
So, just copy-paste these lines into every application you work on.
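If you are curious what they actually do: setMaster tells Spark where to run, and setAppName is just a label that shows up in the logs and the web UI. As a small sketch (not needed for this tutorial, and local[*] is just one alternative value you could pass), the same two lines could also look like this:
// "local" runs Spark in a single thread on this machine.
// "local[*]" uses one worker thread per CPU core; "local[4]" would use exactly four.
SparkConf conf = new SparkConf()
        .setMaster("local[*]")       // use all available cores instead of just one
        .setAppName("wordCount");    // name shown in the Spark logs and web UI
JavaSparkContext sc = new JavaSparkContext(conf);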
Next, we read the input file into an RDD. RDDs (Resilient Distributed Datasets) are essentially chunks of data that you read from various sources and can transform into whatever you want using various operations. Here we are reading the input file from our local file system. If you want to read from HDFS instead, replace the file:/// prefix with hdfs:///.
String inputFile = "file:///home/dsp/Desktop/sparkExamples/sample_testing/resources/inputFile";
JavaRDD<String> input = sc.textFile(inputFile);
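For instance, if the same file lived on HDFS, the call would look roughly like this (the namenode host, port and path below are made-up placeholders, not part of this tutorial's setup):
// Hypothetical HDFS location; substitute your own namenode address and path.
String hdfsFile = "hdfs://namenode:9000/user/dsp/inputFile";
JavaRDD<String> inputFromHdfs = sc.textFile(hdfsFile);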
Then we have our first transformation, applied to the input RDD we created in the step above.
flatMap is a built-in transformation that takes one input element and can produce any number of output elements, depending on the operation used inside it.
JavaRDD<String> words = input.flatMap(l -> Arrays.asList(l.split(" ")));
Here we are splitting each line on whitespace, so the flatMap call returns a list of all the words in the input document, which is stored in the RDD named words. For more about flatMap, read this: Spark FlatMap and Map.
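If you want to see the "one input, many outputs" behaviour in isolation, here is a tiny sketch with a made-up line (sc.parallelize just builds an RDD from a local list):
// One input element...
JavaRDD<String> oneLine = sc.parallelize(Arrays.asList("to be or not to be"));
// ...becomes six output elements, one per word.
JavaRDD<String> sixWords = oneLine.flatMap(l -> Arrays.asList(l.split(" ")));
System.out.println(sixWords.collect()); // [to, be, or, not, to, be]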
Next, we have another transformation, mapToPair, which turns each word into a Tuple of the word and the number 1.
A Tuple is very similar to an ordered pair in the Cartesian coordinate system: a Tuple2 looks like (x, y), where x is the key. Similarly, a Tuple3 would be (x, y, z), and so on.
JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));
As an example, the word you in the input will be mapped to (you,1) by the mapToPair function. And since the result is a pair, we have to store it in a JavaPairRDD, which supports pairs.
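If you ever need to pull the key and value back out of one of these pairs, Tuple2 exposes them through its _1() and _2() accessors, for example:
Tuple2<String, Integer> pair = new Tuple2<>("you", 1);
System.out.println(pair._1()); // you -> the key
System.out.println(pair._2()); // 1   -> the value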
Then we perform the final transformation on the pairs, which adds up the individual counts of each word.
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((x, y) -> x + y);
The reduceByKey method groups all the tuples with the same key and combines their values with the given function. We have the word 'you' repeated thrice, so we have (you,1) three times. Now (you,1), (you,1), (you,1) become (you,3) because of the sum we are doing inside the function, and similarly for the other words.
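To watch that folding happen on its own, here is a small sketch over hand-built pairs (the data is made up; sc.parallelizePairs builds a JavaPairRDD from a local list):
JavaPairRDD<String, Integer> tiny = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("you", 1),
        new Tuple2<String, Integer>("you", 1),
        new Tuple2<String, Integer>("you", 1),
        new Tuple2<String, Integer>("the", 1)));
// (you,1)+(you,1) -> (you,2), then (you,2)+(you,1) -> (you,3); "the" stays at 1.
System.out.println(tiny.reduceByKey((x, y) -> x + y).collect());
// [(you,3), (the,1)] -- the order may differ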
Then, finally, we perform an action on the RDD, which is where the actual computation of all the above steps takes place. collect() returns all the elements in the RDD, and we print them with println, giving us the output we want.
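collect() pulls everything back to the driver, which is fine for a file this small but not something you would do on a huge dataset. As a sketch of the alternatives (the output path below is made up), you could instead write the counts to disk, or sort them by count before printing:
// Write the pairs out as text, one part-file per partition (the directory must not exist yet).
counts.saveAsTextFile("file:///tmp/wordcount-output");

// Or swap key and value, sort by count in descending order, and then collect.
JavaPairRDD<Integer, String> byCount =
        counts.mapToPair(t -> new Tuple2<Integer, String>(t._2(), t._1()));
System.out.println(byCount.sortByKey(false).collect());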
So there you go: your first Spark application, completed. To learn more, go through the documentation and examples on Spark's website, and subscribe to Freblogg for more tutorials.
Happy Sparking!
Image : http://www.datanami.com/wp-content/uploads/2014/12/spark-and-java-8.png
Self Promotion:
If you have liked this article and would like to see more, subscribe to our Facebook and G+ pages.
Facebook page @ Facebook.com/freblogg
Google Plus Page @ Google.com/freblogg