Friday, October 13, 2017

MapReduce Tutorial

In this post I am going to introduce you to MapReduce, a framework for writing applications that process huge amounts of data (structured or unstructured) stored in HDFS. It is specially designed to run on commodity hardware and allows us to perform parallel and distributed processing on huge data sets. MapReduce programs can be written in various languages such as Java, Python, and C++.

MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. One point to remember: the reduce job is always performed after the map job.
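To make the two phases concrete, here is a minimal in-memory sketch in plain Java (not Hadoop code): the map phase breaks records into (key, value) tuples, and the reduce phase groups the tuples by key and combines their values. The class and method names here are illustrative, not part of any Hadoop API.

```java
import java.util.*;
import java.util.stream.*;

public class MapReducePhases {
    // Map phase: break each record into individual (key, value) tuples.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> records) {
        return records.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: group the tuples by key and combine the values,
    // producing a smaller set of tuples.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
                Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data", "big cluster");
        // Reduce always runs after map, consuming the map output.
        Map<String, Integer> result = reducePhase(mapPhase(input));
        System.out.println(result); // contains big=2, data=1, cluster=1
    }
}
```

Note that the reduce phase consumes the map phase's output, mirroring the ordering described above.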

Advantages of MapReduce:

1. Parallel Processing:

In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process it is reduced by a tremendous amount.
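The divide-and-conquer idea can be sketched in plain Java: split the data into chunks, let each worker process its own chunk in parallel, then combine the partial results. This is only an illustration of the paradigm using `parallelStream`, not how Hadoop schedules tasks across nodes.

```java
import java.util.*;

public class DivideAndConquer {
    // Each worker handles one chunk of the data in parallel;
    // the partial sums are then combined into the final answer.
    static long totalLength(List<List<String>> chunks) {
        return chunks.parallelStream()              // divide: one worker per chunk
                .mapToLong(chunk -> chunk.stream()  // conquer: scan only this chunk
                        .mapToLong(String::length)
                        .sum())
                .sum();                             // combine the partial results
    }

    public static void main(String[] args) {
        List<List<String>> chunks = List.of(
                List.of("map", "reduce"),
                List.of("hadoop", "hdfs"));
        System.out.println(totalLength(chunks)); // 3 + 6 + 6 + 4 = 19
    }
}
```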

2. Data Locality: 

Instead of moving data to the processing unit, the MapReduce framework moves the processing unit to the data. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew and became very huge, bringing this huge amount of data to the processing unit posed the following issues:

Moving huge amounts of data to the processing unit is costly and deteriorates network performance.
Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
The master node can get overburdened and may fail.
MapReduce allows us to overcome these issues by bringing the processing unit to the data: the data is distributed among multiple nodes, and each node processes the part of the data residing on it. This gives us the following advantages:

It is very cost-effective to move the processing unit to the data.
Processing time is reduced, as all the nodes work on their part of the data in parallel.
Every node gets only a part of the data to process, so there is no chance of a node getting overburdened.

MapReduce Example:

Let's assume we have three tables, each with two columns representing a company name and the corresponding profit. In this example, the company name is the key and the profit is the value.



From this data, we have to find the maximum profit for each company across all of the tables (note that each table might list the same company multiple times). Using the MapReduce framework, we can break this down into three map tasks, where each mapper works on one of the three tables, goes through its data, and returns the maximum profit for each company.

For example, the results produced from the mapper task for table A would look like this:

[(abc,34),(pqr,15),(xyz,32)]

Similarly, the results produced from the mapper tasks working on the other two tables would look like this:

[(abc,23),(pqr,34),(xyz,18)]
[(abc,34),(pqr,34),(xyz,23)]

All three of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each company, producing the final result set:

[(abc,34),(pqr,34),(xyz,32)]
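The reduce step above can be simulated in plain Java. Since the original tables are not reproduced here, this sketch starts from the three mapper outputs listed above and keeps, for each company, the maximum over all of them; the class and method names are illustrative only.

```java
import java.util.*;

public class MaxProfitReduce {
    // Reducer: for each company, keep the maximum profit
    // seen across all of the mapper outputs.
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapperOutputs) {
        Map<String, Integer> result = new TreeMap<>(); // sorted by company name
        for (Map<String, Integer> partial : mapperOutputs) {
            partial.forEach((company, profit) ->
                    result.merge(company, profit, Integer::max));
        }
        return result;
    }

    public static void main(String[] args) {
        // The three per-table mapper outputs from the example above.
        List<Map<String, Integer>> mapperOutputs = List.of(
                Map.of("abc", 34, "pqr", 15, "xyz", 32),
                Map.of("abc", 23, "pqr", 34, "xyz", 18),
                Map.of("abc", 34, "pqr", 34, "xyz", 23));
        System.out.println(reduce(mapperOutputs)); // {abc=34, pqr=34, xyz=32}
    }
}
```

Running this produces the same final result set shown above.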

MapReduce Program:

Here I have taken the classic word count example, where we have to find the number of occurrences of each word in the input.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // For every input line, emit (word, 1) for each token in the line.
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // For every word, sum all of its 1s to get the total count.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Configure the input/output paths from the command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // Delete the output path from HDFS if it already exists,
        // so that we don't have to remove it manually between runs.
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Exit with 0 if the job succeeds, 1 if it fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
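Before packaging the job, the mapper/reducer logic can be checked locally without a cluster. This plain-Java sketch mirrors the program above (tokenize each line, then sum the 1s per word) using the same `StringTokenizer`; the class name is illustrative only.

```java
import java.util.*;

public class WordCountLocal {
    // Mirrors the WordCount job: the mapper emits (word, 1) per token,
    // and the reducer sums the values for each word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like job output
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("hadoop mapreduce", "hadoop hdfs");
        System.out.println(countWords(sample)); // {hadoop=2, hdfs=1, mapreduce=1}
    }
}
```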

Run the MapReduce code:
The command for running a MapReduce code is:
hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output



