TechAE Blogs - Explore now for new leading-edge technologies

Run MapReduce Job On Hadoop


Want to run a MapReduce job on a Hadoop cluster? Here is a simple tutorial to get you there.

What is MapReduce?

Hadoop MapReduce is a framework for writing applications that process massive volumes of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner.

A MapReduce job typically splits the input data set into independent chunks that are processed in parallel by the map tasks. The framework sorts the map outputs, which are then fed into the reduce tasks. Typically, both the job's input and output are stored in a file system. The framework handles task scheduling, task monitoring, and re-execution of failed tasks.
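
To make the map, sort, and reduce flow described above concrete, here is a minimal plain-Java sketch that imitates it in memory. It does not use Hadoop at all; the class name `MapReduceFlowSketch` and its method names are made up for illustration, and a real cluster would run the map step in parallel across many nodes rather than in a loop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of map -> sort/group -> reduce. No Hadoop classes are used.
final class MapReduceFlowSketch {

  // Map step: emit a (word, 1) pair for every whitespace-separated token of one record.
  static List<Map.Entry<String, Integer>> map(String record) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    StringTokenizer tokens = new StringTokenizer(record);
    while (tokens.hasMoreTokens()) {
      pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
    }
    return pairs;
  }

  // Sort/group step: collect all values per key; the TreeMap keeps keys sorted.
  static TreeMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
    TreeMap<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : pairs) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    return grouped;
  }

  // Reduce step: sum the values collected for a single key.
  static int reduce(List<Integer> values) {
    int sum = 0;
    for (int value : values) {
      sum += value;
    }
    return sum;
  }

  // Runs the three steps in sequence over all records.
  static TreeMap<String, Integer> run(List<String> records) {
    List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
    for (String record : records) {
      allPairs.addAll(map(record)); // on a cluster, each input split is mapped in parallel
    }
    TreeMap<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> entry : group(allPairs).entrySet()) {
      result.put(entry.getKey(), reduce(entry.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
  }
}
```

The WordCount program in Step 1 has the same shape: its mapper emits (word, 1) pairs, the framework groups them by key, and its reducer sums each group.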

PREREQUISITES:

Apache Hadoop must be configured and running.

Table of Contents

  • WordCount Program
  • Updating hadoop-env.sh
  • Compilation of WordCount Program
  • Creating Input Files
  • Executing Application
  • Results

Step 1: WordCount Program

WordCount is a simple application that counts the number of occurrences of each word in a given input data set. Save the code below as WordCount.java in your $HADOOP_HOME directory.


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner to pre-sum counts on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the input path and args[1] is the output path.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Block until the job finishes; exit code 0 signals success.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
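
The job above registers IntSumReducer as both the combiner and the reducer. That is safe here because addition is associative and commutative, so summing partial sums gives the same total as summing the raw values. The following plain-Java sketch checks this on hypothetical numbers; `CombinerSketch` and the sample counts are illustrative and not part of the tutorial's program.

```java
import java.util.Arrays;
import java.util.List;

// Checks that pre-summing per map task (the combiner's job) leaves the final sum unchanged.
final class CombinerSketch {

  // Same logic as the body of IntSumReducer.reduce: add up all values.
  static int sum(List<Integer> values) {
    int total = 0;
    for (int value : values) {
      total += value;
    }
    return total;
  }

  public static void main(String[] args) {
    // Hypothetical counts for one word, emitted by two different map tasks.
    List<Integer> fromMapper1 = Arrays.asList(1, 1, 1);
    List<Integer> fromMapper2 = Arrays.asList(1, 1);

    // Without a combiner, the reducer receives all five 1s.
    int withoutCombiner = sum(Arrays.asList(1, 1, 1, 1, 1));

    // With a combiner, each map task's output is summed locally first,
    // and the reducer only adds up the partial sums.
    int withCombiner = sum(Arrays.asList(sum(fromMapper1), sum(fromMapper2)));

    System.out.println(withoutCombiner + " " + withCombiner); // prints 5 5
  }
}
```

An operation such as an average would not have this property, so a reducer computing one could not simply be reused as a combiner.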

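Step 2: Updating hadoop-env.sh

Add the following lines to $HADOOP_HOME/etc/hadoop/hadoop-env.sh, adjusting JAVA_HOME to match your own JDK installation:
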
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Make sure JAVA_HOME points to your actual JDK installation (tools.jar ships only with JDK 8 and earlier); otherwise, compiling in Step 3 fails with the following error:

Error: Could not find or load main class com.sun.tools.javac.Main

Step 3: Compilation of WordCount Program

These commands will compile WordCount.java and create a jar file:


$ cd $HADOOP_HOME
$ bin/hadoop com.sun.tools.javac.Main WordCount.java 
$ jar cf wc.jar WordCount*.class
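
The wildcard in `WordCount*.class` matters because the compiler writes each nested class to its own file, named after the outer class plus `$` and the nested class name. The small sketch below shows that naming rule using an illustrative outer class, `NestedClassNames`, with two empty nested classes standing in for the mapper and reducer; it is not part of the tutorial's program.

```java
// Shows the class-file names the compiler produces for nested classes.
final class NestedClassNames {

  // Empty stand-ins for WordCount's nested mapper and reducer classes.
  static class TokenizerMapper {}
  static class IntSumReducer {}

  // The binary name of a class, plus ".class", is its class-file name.
  static String classFileName(Class<?> type) {
    return type.getName() + ".class";
  }

  public static void main(String[] args) {
    System.out.println(classFileName(NestedClassNames.class));
    System.out.println(classFileName(TokenizerMapper.class));
    System.out.println(classFileName(IntSumReducer.class));
  }
}
```

By the same rule, compiling WordCount.java yields WordCount.class, WordCount$TokenizerMapper.class, and WordCount$IntSumReducer.class, and all three must end up in wc.jar.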

Step 4: Creating Input Files

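These commands create input_dir in HDFS and upload file01 and file02 into it. Run them from the $HADOOP_HOME directory. The line shown after each nano command is the text to type into that file. In the two put commands, replace /home/hdoop/hadoop-3.3.1 with the path to your own $HADOOP_HOME.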


$ $HADOOP_HOME/bin/hadoop fs -mkdir -p input_dir
$ nano file01
Hello world, Good Morning Neophytes!
$ nano file02
Hello Hadoop, Good morning to hadoop.
$ $HADOOP_HOME/bin/hadoop fs -put -p /home/hdoop/hadoop-3.3.1/file01 input_dir
$ $HADOOP_HOME/bin/hadoop fs -put -p /home/hdoop/hadoop-3.3.1/file02 input_dir

Step 5: Executing Application


$ bin/hadoop jar wc.jar WordCount input_dir output_dir

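The output directory must not already exist in HDFS, or the job will fail. The job takes a little while to run; when it finishes, you should see output similar to this: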


2022-07-09 11:48:50,199 INFO mapreduce.Job: Running job: job_1657347551206_0001
2022-07-09 11:49:22,878 INFO mapreduce.Job: Job job_1657347551206_0001 running in uber mode : false
2022-07-09 11:49:22,880 INFO mapreduce.Job:  map 0% reduce 0%
2022-07-09 11:49:58,616 INFO mapreduce.Job:  map 100% reduce 0%
2022-07-09 11:50:13,897 INFO mapreduce.Job:  map 100% reduce 100%
2022-07-09 11:50:14,933 INFO mapreduce.Job: Job job_1657347551206_0001 completed successfully
2022-07-09 11:50:15,270 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=147
		FILE: Number of bytes written=821951
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=303
		HDFS: Number of bytes written=82
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=62984
		Total time spent by all reduces in occupied slots (ms)=12503
		Total time spent by all map tasks (ms)=62984
		Total time spent by all reduce tasks (ms)=12503
		Total vcore-milliseconds taken by all map tasks=62984
		Total vcore-milliseconds taken by all reduce tasks=12503
		Total megabyte-milliseconds taken by all map tasks=64495616
		Total megabyte-milliseconds taken by all reduce tasks=12803072
	Map-Reduce Framework
		Map input records=2
		Map output records=11
		Map output bytes=119
		Map output materialized bytes=153
		Input split bytes=228
		Combine input records=11
		Combine output records=11
		Reduce input groups=9
		Reduce shuffle bytes=153
		Reduce input records=11
		Reduce output records=9
		Spilled Records=22
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=1087
		CPU time spent (ms)=4160
		Physical memory (bytes) snapshot=644632576
		Virtual memory (bytes) snapshot=7456948224
		Total committed heap usage (bytes)=581115904
		Peak Map Physical memory (bytes)=251269120
		Peak Map Virtual memory (bytes)=2483400704
		Peak Reduce Physical memory (bytes)=144678912
		Peak Reduce Virtual memory (bytes)=2490146816
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=75
	File Output Format Counters 
		Bytes Written=82
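
Several of the counters above can be reproduced by hand from the two one-line input files. The sketch below does so in plain Java; the class name `CounterSketch` is illustrative, the two sample lines are assumed to match the tokens shown in the Step 6 results, and the extra byte per line assumes each file ends with a single newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Recomputes a few job counters from the input lines, without Hadoop.
final class CounterSketch {

  // Returns {map input records, map output records, reduce input groups, bytes read}.
  static long[] counters(List<String> lines) {
    long records = 0;
    long tokens = 0;
    long bytes = 0;
    Set<String> distinctKeys = new HashSet<>();
    for (String line : lines) {
      records++; // one input record per line
      bytes += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1: assumed trailing newline
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        tokens++; // the mapper emits one (word, 1) pair per token
        distinctKeys.add(itr.nextToken()); // each distinct word forms one reduce group
      }
    }
    return new long[] {records, tokens, distinctKeys.size(), bytes};
  }

  public static void main(String[] args) {
    // Assumed contents of file01 and file02.
    long[] c = counters(Arrays.asList(
        "Hello world, Good Morning Neophytes!",
        "Hello Hadoop, Good morning to hadoop."));
    System.out.println("Map input records=" + c[0]);   // prints Map input records=2
    System.out.println("Map output records=" + c[1]);  // prints Map output records=11
    System.out.println("Reduce input groups=" + c[2]); // prints Reduce input groups=9
    System.out.println("Bytes Read=" + c[3]);          // prints Bytes Read=75
  }
}
```

The 11 map output records collapse into 9 reduce groups because "Hello" and "Good" each appear in both files.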

Step 6: Results

You can check the result by running this command:


$ bin/hadoop fs -cat output_dir/part-r-00000
Good	2
Hadoop,	1
Hello	2
Morning	1
Neophytes!	1
hadoop.	1
morning	1
to	1
world,	1
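
Notice that punctuation stays attached to the words and that "Morning" and "morning" are counted separately. This follows from the mapper's use of StringTokenizer, which by default splits on whitespace only. The sketch below contrasts that behaviour with one possible normalization; `TokenizeSketch`, `normalizedTokens`, and the sample line (assembled from tokens in the output above) are illustrative assumptions, not something the tutorial's program does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Compares whitespace-only tokenization with a lowercased, punctuation-free variant.
final class TokenizeSketch {

  // Same splitting rule as the WordCount mapper: whitespace delimiters only.
  static List<String> rawTokens(String line) {
    List<String> tokens = new ArrayList<>();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      tokens.add(itr.nextToken());
    }
    return tokens;
  }

  // Hypothetical variation: lowercase each token and drop everything except letters and digits.
  static List<String> normalizedTokens(String line) {
    List<String> tokens = new ArrayList<>();
    for (String token : rawTokens(line)) {
      String cleaned = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
      if (!cleaned.isEmpty()) {
        tokens.add(cleaned); // skip tokens that were pure punctuation
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    String line = "Hello Hadoop, Good morning to hadoop.";
    System.out.println(rawTokens(line));        // prints [Hello, Hadoop,, Good, morning, to, hadoop.]
    System.out.println(normalizedTokens(line)); // prints [hello, hadoop, good, morning, to, hadoop]
  }
}
```

Applying a normalization like this inside the mapper's loop would make "Hadoop," and "hadoop." count as the same word.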

Let’s Put It All Together:

We started by defining MapReduce as Hadoop's framework for processing large data sets in parallel across a cluster.

Then we walked through the steps to run a MapReduce job on your Hadoop cluster: writing WordCount, setting up the environment, compiling, loading the input, and executing the job. With the results in hand, you now know how to run your own custom MapReduce job. Feel free to ask questions if you need any help.

Good Luck!
