In today’s world we store data minute by minute, because data is the currency of the modern economy. But is a massive amount of raw data helpful on its own?
Data that yields no insights has no value. To get insights out of our data, we need to analyze it, and because the data is vast, analysis demands a lot of time and processing power. So, how do we do that?
This is where MapReduce comes into the picture.
So, why MapReduce?
MapReduce is a programming model for processing large amounts of complex data in parallel on a cluster of commodity hardware machines. It lets us analyze huge datasets without data loss.
History of MapReduce
Traditionally, we stored large amounts of data on a single, powerful server that handled a heavy, continuous load of requests from many users. In such a setup, as the number of requests increases, the load on the server increases and response times degrade. Eventually the requests can exceed the server’s capacity, and the server goes down for some time. You have probably faced this when exam results are published and the site crashes the moment they are declared. Distributed processing with the MapReduce paradigm was developed to cope with workloads that a single machine cannot handle: Google introduced the MapReduce model, and Doug Cutting brought it to the open-source world with Hadoop.
With Hadoop MapReduce, instead of a single centralized system, data is stored across multiple commodity machines, with replicas of the data kept on several machines (controlled by the replication factor). This lets the cluster handle a massive amount of data without putting a heavy load on any single system and without losing data.
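As a concrete illustration, in HDFS the replication factor is simply a configuration property; a minimal sketch of the relevant setting in hdfs-site.xml (3 is the usual default):

    <!-- hdfs-site.xml: number of copies HDFS keeps of each data block -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>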
How MapReduce works in Hadoop
If we want to understand how MapReduce works, we need to understand the phases it goes through. At a high level, there are two major phases: Map and Reduce.
MapReduce takes input in key-value format and generates output in key-value format.
- The Map task takes a set of data and converts it into another set of data.
- The Reduce task takes the output of the Map phase as input, aggregates the values by key, and reduces them to a smaller set of data.
The Reduce phase always executes after the Map phase, and multiple map-reduce stages can be chained together.
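Conceptually, this contract is often written in the following key-value notation (a sketch of the model, not Hadoop syntax):

    map:    (k1, v1)        ->  list of (k2, v2)
    reduce: (k2, list(v2))  ->  list of (k2, v3)

In the word count example later in this article, k2 is a word and v2 is the count 1 emitted for each occurrence of that word.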
Let’s see the complete map-reduce flow in detail.
Apart from the Map and Reduce phases, there are a few more phases; let’s look at them in the order in which they occur.
Steps of a MapReduce Job
- Record Reader: The record reader reads records from the input file and parses them into the key-value format expected by the mapper.
- Map: The map phase reads data in key-value format, processes it, and translates it into a new key-value format.
- Intermediate Stage: The output of the map phase is considered intermediate data. It is held in memory; if it exceeds the available memory, it is temporarily spilled to the local disk of the machine running the map task.
- Combiner: This is an optional phase. It takes the output of the map phase and aggregates values by key locally, on the same machine where the map task ran. It works like a reducer, but on the map side, and reduces the amount of data sent over the network.
- Sort & Shuffle: The shuffle step moves the intermediate data between machines so that all values belonging to the same key end up on the same reducer. After the shuffle, the data is sorted by key so that values sharing a key come together and can easily be consumed by the reduce phase.
- Reduce: The reduce phase takes input in key-value format. It executes the reduce method for every key in the reducer input and aggregates that key’s values. It can emit zero or more key-value pairs as output.
- Record Writer: It takes the output of the reducer and writes it to the output file. (A compact walk-through of one record through these phases is sketched below.)
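To make the sequence concrete, here is a rough sketch of how one input line from the word count example used later in this article moves through the phases (the values shown are illustrative):

    Input line     : HDFS is used to store data in hadoop
    Record reader  : (0, "HDFS is used to store data in hadoop")   -- key is the byte offset
    Map output     : (HDFS,1) (is,1) (used,1) (to,1) (store,1) (data,1) (in,1) (hadoop,1)
    Combiner       : optionally merges duplicate keys emitted on the same node (none in this line)
    Shuffle & sort : groups values by key across nodes, e.g. (is, [1,1]) once the other map outputs arrive
    Reduce output  : (is, 2), (HDFS, 1), ...
    Record writer  : writes "is 2", "HDFS 1", ... to the output file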
Benefits of Hadoop MapReduce
We have walked through the flow of MapReduce. So, what are the benefits of using Hadoop MapReduce?
- Simple: Anyone with basic programming knowledge can learn and work with the MapReduce programming model. It is simple to understand and implement, and it can be used from multiple languages such as Java, Python, and R.
- Scalable: The programming model is a logical model that scales with the cluster. As we add commodity machines to the Hadoop cluster, the same program runs across the larger cluster, and the same applies if we remove machines for maintenance.
- Available: Because MapReduce works on a cluster of machines, the data is replicated to more than one node. So, if a node goes down or fails for some reason, the data can be retrieved from another node and nothing is lost.
- Parallel: Data is stored across multiple machines, so the MapReduce job runs on those machines in parallel. Because the complete job is divided into many sub-tasks, it runs faster and finishes earlier.
- Data Locality: Instead of moving the data to where the program runs, MapReduce sends the program to where the data is stored. This is a very important feature of the model: it reduces data transfer over the network and therefore speeds up processing.
- Cost-Effective: Because we can process a huge amount of data on commodity hardware, we don’t need to pay for high-end servers; a multi-node Hadoop cluster can be built from inexpensive machines.
Applications of MapReduce
MapReduce is helpful for a wide variety of problems, for example:
- A very basic application of MapReduce is word count.
- Log Analysis
- Social Network Analysis
- Fraud Detection
- Data Searching
We can also use MapReduce for machine learning problems such as:
- Document Clustering
- Language Detection
- Entity Extraction
- Named Entity Recognition
- Image Identification
- Website Categorization
Creating the first MapReduce program (word count)
The “hello world” of the Hadoop world is the word count program: it counts the occurrences of each word in a document.
Before we start working on the problem statement, we should know what the input will be and what output we need. So, let’s discuss that first.
Input
HDFS is used to store data in hadoop. MapReduce is used to process data in hadoop
Output
- is 2
- used 2
- to 2
- data 2
- in 2
- hadoop 2
- HDFS 1
- store 1
- MapReduce 1
- process 1
Now, let’s understand how to implement the word count program. As we saw above, the MapReduce programming model has two main phases, so we will write a mapper class and a reducer class, plus a driver class to wire them together.
Mapper class
The map phase is implemented by the following program:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper: for every input line, emit (word, 1) for each word in the line.
// The class name WC_Mapper matches the name referenced in the driver class below.
public class WC_Mapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // key is the byte offset of the line; value is the line itself
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);   // emit (word, 1)
        }
    }
}
As we have seen, the map method of the mapper executes for each record of the input. For our understanding, let’s say it reads one line at a time, such as the line below:
HDFS is used to store data in hadoop
- It tokenizes the line, iterates over the words, and emits output in the format below:
HDFS, <1>
is, <1>, and so on.
- If the same word appears twice in the line, the mapper still emits one pair per occurrence; after grouping, the values for that key look like this:
HDFS, <1,1>
Reducer class
The reduce phase is implemented by the following program:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer: for every word, sum the list of 1s produced by the mappers.
public class WC_Reducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // add up the counts for this word
        }
        output.collect(key, new IntWritable(sum));   // emit (word, total count)
    }
}
In the reduce phase, the input will be as below:
- Key: is values: <1,1>
- Key: used values: <1,1>
- Key: to values: <1,1>
- Key: data values: <1,1>
- Key: in values: <1,1>
- Key: hadoop values: <1,1>
- Key: HDFS values: <1>
- Key: store values: <1>
- Key: MapReduce values: <1>
- Key: process values: <1>
Here, the reduce method is called for each key; it iterates through the list of values, adds them up in the sum variable, and finally writes the total to the output file, producing the output shown earlier.
Driver class
This class provides the configuration details needed to execute the map-reduce program for the word count application.
We can specify several things here, for example:
- Which class is used as the mapper?
- Which class is used as the reducer?
- What are the key and value types of the mapper output?
- What are the key and value types of the reducer output?
- What are the input and output formats?
So, basically, the driver class is responsible for configuring and submitting the map-reduce job. Its implementation is given below.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Driver: wires the mapper, combiner, and reducer together and submits the job.
public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");

        // Types of the final (reducer) output key and value
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Mapper, combiner (optional, runs on the map side), and reducer classes
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);

        // Input and output formats: plain text in, plain text out
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths come from the command-line arguments
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
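To actually run the job, the three classes are typically compiled and packaged into a jar, and the jar is submitted to the cluster with the hadoop jar command. A minimal sketch, assuming the jar is named wordcount.jar and the paths shown are illustrative:

    # copy the input file into HDFS
    hadoop fs -mkdir -p /user/hadoop/wordcount/input
    hadoop fs -put input.txt /user/hadoop/wordcount/input

    # submit the job: jar, main class, input path, output path
    hadoop jar wordcount.jar WC_Runner /user/hadoop/wordcount/input /user/hadoop/wordcount/output

    # inspect the result (with the old mapred API the output file is typically part-00000)
    hadoop fs -cat /user/hadoop/wordcount/output/part-00000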
The output of the word count program matches the word counts listed in the Output section above.
Wrap up
On a closing note, I hope this article has helped you understand the MapReduce architecture and given you a starting point for MapReduce programming.
If you liked this blog, you can share it with your friends or colleagues. You can connect with me on social media: LinkedIn, Twitter, and Instagram.
Linkedin – https://www.linkedin.com/in/abhishek-kumar-singh-8a6326148
Twitter- https://twitter.com/Abhi007si
Instagram- www.instagram.com/dataspoof