
Big Data: Map Reduce

In the last two blogs, we discussed the big data technology stack and HDFS. If you want to revisit these topics, please follow the link below.

Big Data Technology Stack

In this blog we will discuss Map Reduce.

Map Reduce is often considered one of the most difficult concepts in big data. Why? Because it is low-level programming. In this blog, we will discuss Map Reduce at a high level, without walking through actual Hadoop job code.

Map Reduce Architecture

The main elements in the Map Reduce architecture are:
1)      Job Client: submits jobs
2)      Job Tracker: coordinates jobs
3)      Task Tracker: executes the tasks that make up a job
The Job Tracker is the master of the Task Trackers in this architecture, much as the Name Node is the master of the Data Nodes in HDFS.
So the rules of the game are:
   1)      The Job Client submits a job to the Job Tracker and also copies the job's resources, such as binaries, to HDFS.
   2)      The Job Tracker talks to the Name Node to find out where the input data lives.
   3)      The Job Tracker creates an execution plan.
   4)      The Job Tracker submits work to the Task Trackers. The Task Trackers do the heavy lifting: they execute the map and reduce functions.
   5)      The Task Trackers report progress via heartbeat: while executing the map and reduce functions, they send progress and status updates to the Job Tracker.
   6)      The Job Tracker manages the phases.
   7)      The Job Tracker updates the job status.
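
The division of labour in the steps above can be sketched as a toy, single-process simulation in Python. This is only an illustration, not Hadoop's implementation: the class names `JobTracker` and `TaskTracker` mirror the roles described here, while the round-robin assignment and the dictionary used as a "heartbeat" are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Worker: runs the task it is given and reports back (hypothetical model)."""
    name: str

    def run(self, task_id, func, data):
        result = func(data)
        # The heartbeat is modelled as a plain dict handed back to the master.
        heartbeat = {"tracker": self.name, "task": task_id, "status": "SUCCEEDED"}
        return result, heartbeat


@dataclass
class JobTracker:
    """Master: plans the job, hands tasks to trackers, records their status."""
    trackers: list
    status: dict = field(default_factory=dict)

    def submit(self, func, splits):
        results = []
        for task_id, split in enumerate(splits):
            # Execution plan: round-robin assignment (an assumption for this sketch).
            tracker = self.trackers[task_id % len(self.trackers)]
            result, heartbeat = tracker.run(task_id, func, split)
            # Update the job status from the heartbeat.
            self.status[heartbeat["task"]] = (heartbeat["tracker"], heartbeat["status"])
            results.append(result)
        return results


# The job client's role: package the work and submit it to the job tracker.
job_tracker = JobTracker([TaskTracker("tt-1"), TaskTracker("tt-2")])
split_lengths = job_tracker.submit(len, ["first split", "second", "third one"])
print(split_lengths)
print(job_tracker.status)
```

In a real cluster the trackers run on different machines and heartbeats arrive periodically over the network. The sketch only shows who asks whom to do what.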

So this covers the high-level architecture of Map Reduce.

Now let's zoom in on Map Reduce Internals

The first phase is the split phase. The split phase uses the input format to bring data off the disk, out of HDFS, and split it up so that it can be sent to the mappers. The default input format is the text input format, which breaks the data up line by line, so each line is passed to a mapper as one record. A large file is divided into many input splits, each handled by its own map task, so a very large input could have thousands of mappers running simultaneously.
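
The line-by-line behaviour of the text input format can be sketched in plain Python. In Hadoop's text input format, each record has the line's byte offset as its key and the line's text as its value. The helper name `text_records` and the sample data below are made up for this illustration, and the sketch ignores how lines that cross split boundaries are handled.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: where the line starts in the file. Value: the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


sample = b"big data\nmap reduce\nhdfs\n"
records = list(text_records(sample))
print(records)
```

Each of these pairs is what a mapper would receive as one input record.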

Let’s go back to the input format. There are a variety of input formats: binary input formats, database input formats, record-based input formats and so on.
Mappers transform the input splits into key/value pairs based on user-defined code. The output then goes to an intermediate phase called shuffle and sort, which moves the map outputs to the reducers and sorts them by key.
Reducers aggregate the key/value pairs based on user-defined code and then hand the results to the output format. The output format determines how the results are written to the output directory in HDFS.
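
To make the three phases concrete, here is a minimal word-count sketch in plain Python. It runs in a single process and only imitates the data flow. The function names `mapper`, `shuffle_and_sort` and `reducer` and the sample lines are assumptions for the example, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """User-defined map: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Sort map outputs by key and group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """User-defined reduce: sum the counts collected for one word."""
    return key, sum(values)


lines = ["big data big", "map reduce", "reduce big"]
map_output = [pair for line in lines for pair in mapper(line)]
grouped = list(shuffle_and_sort(map_output))
word_counts = [reducer(key, values) for key, values in grouped]
print(word_counts)
```

On a real cluster the map outputs are spread over many machines, and the shuffle moves all values for one key to the same reducer. Here sorting and grouping a single list plays that role.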

So let's send some data through.

Here we have an input, say a file with a bunch of information. By default it is split into its individual lines, and each line is sent to a mapper. The mapper turns it into key/value pairs, which then go through shuffle and sort and on to the reducers.

What we have seen above is the functional programming paradigm. It is a programming paradigm that treats computation as the evaluation of mathematical functions.

So this is how Map Reduce works.
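
The same idea exists in miniature in Python's built-in `map` and `functools.reduce`, which is a handy way to see the functional style without a cluster. The sample list is made up for the illustration. Note that this `reduce` folds one whole list into a single value, whereas a Map Reduce reducer is called once per key.

```python
from functools import reduce

lines = ["big data", "map reduce", "hdfs"]

# map: apply a pure function to every element independently.
lengths = list(map(len, lines))

# reduce: combine the mapped values into one result, starting from 0.
total = reduce(lambda a, b: a + b, lengths, 0)

print(lengths, total)
```

Because `len` and the addition depend only on their inputs and change no shared state, the mapped elements could be processed in any order or in parallel, which is the property Map Reduce relies on.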

