Monday, July 21, 2014

Big Data: Map Reduce

In the last two posts, we discussed the Big Data technology stack and HDFS. If you want to revisit these topics, please follow the link below:

Big Data Technology Stack

In this blog we will discuss Map Reduce.

Map Reduce is often considered one of the more difficult concepts in big data. Why? Because it is low-level programming. In this post, we will look at Map Reduce from a high level and will not walk through production Hadoop code.

Map Reduce Architecture

The main elements of the Map Reduce architecture are:
1)      Job Client: submits jobs
2)      Job Tracker: coordinates jobs
3)      Task Tracker: executes the tasks that make up a job
The Job Tracker is the master of the Task Trackers in this architecture, much as the Name Node is the master of the Data Nodes in HDFS.
The rules of the game are:
   1)      The Job Client submits a job to the Job Tracker and copies the job's resources, such as its binaries, to HDFS.
   2)      The Job Tracker talks to the Name Node to find out where the input data lives.
   3)      The Job Tracker creates an execution plan.
   4)      The Job Tracker submits work to the Task Trackers. The Task Trackers do the heavy lifting: they execute the map and reduce functions.
   5)      The Task Trackers report progress via heartbeat: while executing the map and reduce functions, each one sends progress and status updates to the Job Tracker.
   6)      The Job Tracker manages the phases.
   7)      The Job Tracker updates the job status.
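
The steps above can be sketched as a toy simulation. This is only an illustration of the roles, not Hadoop's actual API: the class names, the round-robin assignment, and the shape of the heartbeat report are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Hypothetical worker that runs tasks and reports via heartbeat."""
    name: str
    completed: list = field(default_factory=list)

    def run(self, task):
        # A real Task Tracker would execute map or reduce code here.
        self.completed.append(task)

    def heartbeat(self):
        # Progress and status report sent back to the Job Tracker.
        return {"tracker": self.name, "completed": list(self.completed)}


@dataclass
class JobTracker:
    """Hypothetical master that plans a job and hands tasks to workers."""
    trackers: list

    def submit(self, job_name, num_splits):
        # Execution plan: one map task per input split (reduce tasks omitted).
        plan = [f"{job_name}-map-{i}" for i in range(num_splits)]
        # Round-robin assignment (an assumption); real scheduling prefers data locality.
        for i, task in enumerate(plan):
            self.trackers[i % len(self.trackers)].run(task)
        # Collect heartbeats and update the job status.
        reports = [t.heartbeat() for t in self.trackers]
        done = sum(len(r["completed"]) for r in reports)
        return {"job": job_name, "tasks": len(plan), "done": done, "reports": reports}


# The Job Client's role: submit a job with three input splits.
status = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")]).submit("wordcount", 3)
print(status["done"], "of", status["tasks"], "tasks finished")
```

In this sketch the Job Tracker only plans and tracks status, while the Task Trackers do all the work, which mirrors the master/worker split described above.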

This covers the high-level architecture of Map Reduce.

Now let's zoom in on MAP REDUCE INTERNALS

The first phase is the split phase. It uses the input format to bring data off the disk, out of HDFS, and split it up so that it can be sent to the mappers. The default input format is the text input format, which breaks the data up line by line, so each line is passed to a mapper as one record. A large file is divided into many input splits, each handled by its own map task, so a big job can have thousands of map tasks running in parallel across the cluster.
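
To make the line-by-line behaviour concrete, here is a minimal sketch of how a text input could be turned into records. The function name `text_records` and the sample bytes are made up for this example; keying each line by its byte offset follows the convention of Hadoop's text input format, but this is not Hadoop code.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # Strip the line terminator from the value but count it in the offset.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


# Illustrative input standing in for a file stored in HDFS.
sample = b"big data\nmap reduce\nhdfs\n"
records = list(text_records(sample))
print(records)  # [(0, 'big data'), (9, 'map reduce'), (20, 'hdfs')]
```

Each pair produced here corresponds to one call of the user's map function.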

Let’s look at input formats. There are a variety of them: binary input formats, database input formats, record-based input formats, and so on.
Mappers transform the input splits into key/value pairs based on user-defined code. The output then goes through an intermediate phase called shuffle and sort, which moves the map outputs to the reducers and sorts them by key.
Reducers aggregate the key/value pairs based on user-defined code and hand the results to the output format. The output format determines how the results are written to the output directory in HDFS.
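
The map, shuffle and sort, and reduce phases can be sketched in a few lines of plain Python. This is a single-process illustration under simplifying assumptions (everything fits in memory, there is one reducer); the names `map_reduce`, `word_mapper`, and `count_reducer` are invented for this example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle and sort: group values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: aggregate each key's values into one result.
    return [(key, reduce_fn(key, groups[key])) for key in sorted(groups)]


def word_mapper(line):
    # User-defined map code: emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    # User-defined reduce code: add up the counts for one word.
    return sum(values)


lines = ["map reduce", "reduce the data", "map the data"]
result = map_reduce(lines, word_mapper, count_reducer)
print(result)  # [('data', 2), ('map', 2), ('reduce', 2), ('the', 2)]
```

Only `word_mapper` and `count_reducer` are specific to word counting; swapping them for other functions changes the job without touching the phases in between.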

So let's send some data through.

Here we have an input, say a file with a bunch of information. By default it is split into lines, and each line is sent to a mapper. The mapper emits key/value pairs, which are then shuffled, sorted, and reduced as described above.

What we have seen above follows the functional programming paradigm, which treats computation as the evaluation of mathematical functions.
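
The same idea shows up in miniature in Python's built-in `map` and `functools.reduce`. The sample strings below are made up for illustration; the point is only that a function is applied to every element and the results are then folded into a single value.

```python
from functools import reduce

# "Map": apply a pure function (len) to every input element.
line_lengths = list(map(len, ["big data", "map reduce", "hdfs"]))

# "Reduce": fold the mapped values into a single result.
total = reduce(lambda a, b: a + b, line_lengths, 0)

print(line_lengths, total)  # [8, 10, 4] 22
```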

So this is how map reduce works
