MapReduce is a programming model for running analyses over large amounts of data in parallel. Today we will see how to find the distinct list of words in a document. Consider the following scenario: Martin Luther King Jr. delivers a speech. If we can identify the distinct set of terms used in it, we can get a sense of the speech’s theme.
MapReduce diagram to find a distinct set of words in any document
The architecture of the MapReduce program is shown below.
Explanation of the diagram
Let’s understand the diagram now. First of all, we read an input file, which can be very large, so we divide the data into smaller chunks. In Hadoop terms, each chunk is called a block. Each block is processed by a separate mapper function. The mapper emits each word as a key with a null value. The shuffle phase then groups identical keys together, so the reducer receives each distinct word exactly once and simply writes it to the output file.
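The flow above can be sketched in plain Java before touching Hadoop at all. This is only a local illustration of the map/shuffle/reduce idea, not Hadoop code; the class and method names here are our own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class DistinctFlowSketch {

    // map phase: each "block" of input emits one (word, null) record per token
    static List<String> mapPhase(String[] blocks) {
        List<String> out = new ArrayList<>();
        for (String block : blocks) {
            out.addAll(Arrays.asList(block.split(" ")));
        }
        return out;
    }

    // shuffle + reduce: identical keys are grouped (and sorted),
    // and each distinct key is written out exactly once
    static SortedSet<String> shuffleAndReduce(List<String> mapOutput) {
        return new TreeSet<>(mapOutput);
    }

    public static void main(String[] args) {
        // two "blocks" of the input, each handled by its own mapper
        String[] blocks = { "India is a country", "and US is also country" };
        for (String word : shuffleAndReduce(mapPhase(blocks))) {
            System.out.println(word);
        }
    }
}
```

The duplicate elimination happens in the grouping step, which is exactly the role the shuffle phase plays between the mapper and the reducer in a real Hadoop job.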
Steps to implement
There are three classes in a MapReduce application: the driver class, the mapper class, and the reducer class. The mapper takes its input in the form of key/value pairs, and its output is passed to the reducer as input. The reducer then performs the operations that produce the distinct set of words in the given text.
Next, let’s understand the driver class. It configures and runs the MapReduce job and answers questions such as:
- What formats will be used for the input and output?
- What will the output types for the mapper key and value and the reducer key and value be?
Input data
Let’s see what the input for this exercise is:
India is a country and US is also country
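Before any deduplication, the mapper will emit one (word, null) pair per token of this line. A quick sketch of the split (the class name is ours, for illustration only):

```java
public class InputTokens {
    public static void main(String[] args) {
        String line = "India is a country and US is also country";
        String[] fields = line.split(" ");
        // nine tokens in total; "is" and "country" each appear twice,
        // so the reducer has duplicates to eliminate
        System.out.println(fields.length); // prints 9
    }
}
```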
Mapper class
The implementation of the mapper class is given below.
- First, all the required libraries are imported.
- Then a mapper class is created that takes LongWritable as the input key type and Text as the input value type, and emits Text keys with NullWritable values.
- Next, we read a line from the value and split it on spaces into the fields array. We then iterate through the words, writing each one to the context with the word as the key and null as the value. This is how the words are sent to the reducer.
First, here is the mapper implementation.
public static class Map extends Mapper<LongWritable, Text, Text, NullWritable> {

    // reusable Text object for the output key
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // read a line
        String line = value.toString();
        // split the line on spaces to get the list of words
        String[] fields = line.split(" ");
        for (int i = 0; i < fields.length; i++) {
            word.set(fields[i]);
            context.write(word, NullWritable.get());
        }
    }
}
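One caveat worth noting: splitting on a single space produces empty strings whenever words are separated by more than one space, and those empty strings would reach the reducer as a bogus key. A more defensive split (an adjustment on our part, not part of the original program) uses a whitespace regex:

```java
public class RobustSplit {
    public static void main(String[] args) {
        String line = "India  is a   country";
        // split(" ") would yield empty tokens here; \s+ collapses the runs
        String[] fields = line.trim().split("\\s+");
        System.out.println(fields.length); // prints 4
    }
}
```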
Reducer class
Here, the mapper output arrives at the reducer as input. Since the shuffle phase groups identical keys, each unique word reaches the reducer exactly once; we write it to the context output with a null value. So the final output is the list of unique words.
public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // each key is already unique after the shuffle; write it once
        context.write(key, NullWritable.get());
    }
}
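To see what the reducer actually receives, we can simulate the shuffle locally: each distinct word becomes one key carrying a list of values, and duplicates show up as multiple values under the same key. This sketch (our own illustration, not Hadoop code) counts the values per key:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    // simulate the shuffle: count how many values each key carries
    static Map<String, Integer> group(List<String> mapperOutput) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String key : mapperOutput) {
            grouped.merge(key, 1, Integer::sum);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> mapperOutput = Arrays.asList(
            "India", "is", "a", "country", "and", "US", "is", "also", "country");
        // "is" and "country" arrive with two null values each;
        // the reducer still writes each key exactly once
        group(mapperOutput).keySet().forEach(System.out::println);
    }
}
```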
Driver class
In the driver class, we first import all the libraries needed for the implementation.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;
Here, we have created a class with a main method; from main we call ToolRunner, which runs the driver. The driver specifies the different configurations and creates the job that we will launch with the hadoop command.
Now, let’s look at the driver implementation itself.
public class DistinctValuesExample extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
//input and output paths passed from cli
Path hdfsInputPath = new Path(args[0]);
Path hdfsOutputPath = new Path(args[1]);
//create job
Job job = new Job(conf, "Distinct Values Example 1");
job.setJarByClass(DistinctValuesExample.class);
//set mapper and reducer
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
//set output key and value classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
//set input format and path
FileInputFormat.addInputPath(job, hdfsInputPath);
job.setInputFormatClass(TextInputFormat.class);
//set output format and path
FileOutputFormat.setOutputPath(job, hdfsOutputPath);
job.setOutputFormatClass(TextOutputFormat.class);
//run job return status
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new DistinctValuesExample(), args);
System.exit(res);
}
}
Output and Explanation
The output of the above program is given below.
- India
- is
- a
- country
- and
- US
- also
As we can see, there is no repetition: each word appears exactly once. (Note that in an actual Hadoop run, the reducer receives keys in sorted order, so the words in the output file will appear sorted.)
Conclusion
We have seen how to use a Hadoop MapReduce program to find the distinct set of words in a document. Hopefully this helps you understand how MapReduce works internally, and how other problem statements can be handled with the same approach.