Sunday, July 20, 2014

Big Data: Hadoop Distributed Filesystem (HDFS)

In this post we will look at HDFS in detail. HDFS is not a typical file system. We will start with a high-level view of HDFS and then go deeper.

HDFS Architecture

On the left-hand side we have a single-rack cluster, and on the right-hand side a multi-rack cluster.
The first thing to note about the HDFS architecture is that it is a master/slave architecture. We have the following server roles:
- Name Node
- Secondary Name Node
- Job Tracker
- Data Nodes
- Task Trackers

Let's look at the left-side diagram.

Name Node

Think of the name node as the HDFS daemon that acts as the controller for all the data nodes. Any request that comes to the file system, such as creating a directory or creating, reading or writing a file, goes through the name node. The name node essentially manages the file system namespace. It holds in memory a snapshot of what the file system looks like.

It also handles block mappings. Whenever you put a file into HDFS, it breaks the file into blocks and spreads them across the data nodes. The name node knows where all of those blocks are within the cluster.
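To make the block mapping idea concrete, here is a minimal sketch in Python. It assumes the 64 MB default block size quoted later in this post, and the round-robin assignment is a hypothetical placeholder for illustration only, not how the name node really chooses data nodes. The function names are made up for this example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted in this post


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size bytes is cut into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def assign_blocks(file_size, data_nodes, block_size=BLOCK_SIZE):
    """Toy block map: block index -> data node name.

    Round-robin is an assumption made for this sketch, not the real policy.
    """
    block_count = len(split_into_blocks(file_size, block_size))
    return {i: data_nodes[i % len(data_nodes)] for i in range(block_count)}
```

With these assumptions, a 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and the dictionary returned by `assign_blocks` plays the role of the block map that the name node keeps in memory.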

Data node

Data nodes are the workhorses of the system. They are the ones that actually perform all the block operations. They receive instructions from the name node about where to put the blocks and how to put them.

Data nodes are also responsible for replication. They receive instructions from the name node, but the data nodes are the ones that physically do the replication.

Secondary Name Node

The secondary name node is not really a secondary name node, despite what the name implies. A lot of people in the community are renaming it the Checkpoint node. It is just there to take a snapshot of the name node now and then. It is like a backup, or a system restore point, for the name node.

Now let's turn our attention to the multi-rack cluster on the right side.



You can see we have split the roles onto their own separate servers, each in its own rack. So you can have a single-rack cluster or a multi-rack cluster. With a multi-rack cluster, new challenges appear.

The first challenge is data loss. Remember from our last post: when we set up HDFS, we set a replication factor, and the default is 3. This means that when we put a file into Hadoop, it makes 3 copies of every block and spreads them across the nodes in the cluster. So if we lose a node, we still have all the data, and Hadoop re-replicates the data that was on that node to other nodes. This is the fault tolerance in Hadoop.
So what happens if we lose an entire rack? Data loss, if all the copies of a block happened to sit in that rack. This is where the Rack Awareness feature helps us. It works by understanding our network topology, and setting it up is a manual step: we need to describe our network topology to the name node. Once the name node is rack aware, any time data comes in it makes sure that the copies are placed on multiple data nodes and on more than one rack.
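The rack awareness idea can be sketched in a few lines of Python. This is a simplified illustration and not Hadoop's actual placement code. The topology dictionary, the node names and the function name are all hypothetical, and nodes are picked in alphabetical order only to keep the example deterministic.

```python
def place_replicas(topology, first_node, replication=3):
    """Pick data nodes for one block so that the copies span more than one rack.

    topology maps a rack name to the list of data nodes in that rack.
    The rule here is a simplification: the second copy goes to a different rack,
    and further copies prefer the rack of the most recent copy, which keeps the
    extra traffic inside one rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if first_node not in rack_of:
        raise ValueError(f"unknown node: {first_node}")

    chosen = [first_node]

    # Second copy: a node on another rack, so losing one whole rack cannot lose every copy.
    remote = sorted(n for n in rack_of if rack_of[n] != rack_of[first_node])
    if remote and replication >= 2:
        chosen.append(remote[0])

    # Remaining copies: unused nodes, preferring the rack of the last chosen copy.
    anchor_rack = rack_of[chosen[-1]]
    rest = sorted(
        (n for n in rack_of if n not in chosen),
        key=lambda n: (rack_of[n] != anchor_rack, n),
    )
    chosen.extend(rest[: max(0, replication - len(chosen))])
    return chosen
```

For a two-rack topology such as `{"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}` and a first copy on `dn1`, the sketch puts the other two copies on `rack2`, so the block survives the loss of either rack.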

The second challenge is network performance. Bandwidth is a scarce resource. We are going to assume that within a rack there is much more bandwidth and lower latency than between two racks. Wouldn't it be great if rack awareness kept our bulky flows within the rack? Actually, it can.

So rack awareness, reliable storage and high throughput are the main features of multi-rack clusters.

Now let's go to the HDFS internals.

Name Node: it is the single most important server in the cluster. The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move or delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. The NameNode is a single point of failure for the HDFS cluster. In the architecture described here, HDFS is not a High Availability system: when the NameNode goes down, the file system goes offline.

So, in short, the name node:
- Contains the file system metadata
- Is the controller that clients communicate with
- Monitors the health of the data nodes

So let's go to the file system metadata.

The name node holds an in-memory snapshot of the entire file system. It looks something like the diagram above: it tracks all the files and their replication values, and it also holds the mapping of block IDs to data nodes. The name node also has an edit log, or journal, which keeps track of all the transactions: every change that clients make to the file system is recorded there. So how do the changes in the edit log get merged into the image on disk? For that the name node needs a checkpoint, which happens when it restarts. When the name node is restarted, it loads the file system image from a file on disk called fsimage, replays the journal on top of it, and writes the merged result back to disk as a new fsimage.

But in a production environment we don't restart the name node very often, so the edit log keeps growing and growing. If we lose the name node in a disaster, we also lose all the changes that were only in the edit log.

That's why it is nice to have a secondary name node.
The secondary name node takes on the responsibility of merging the edit log with the fsimage file. Periodically it goes to the name node and says, hey, give me your edit logs. The secondary name node has the most recent fsimage, so it merges the two together and uploads the result back to the name node. So when the name node restarts, its fsimage is already up to date.
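The merge of the edit log with fsimage can be pictured with a toy model in Python. Everything here is an assumption made for illustration: the namespace is a plain dictionary, the journal is a list of `(operation, path)` tuples with only `create` and `delete`, and the function names are invented. The real on-disk formats are far richer.

```python
def apply_edits(fsimage, edit_log):
    """Replay every logged edit, in order, on top of a copy of the old fsimage."""
    namespace = dict(fsimage)  # copy, so the previous checkpoint stays untouched
    for op, path in edit_log:
        if op == "create":
            namespace[path] = {"replication": 3}  # 3 is the default factor from this post
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edit_log):
    """Return the merged image plus a fresh, empty edit log."""
    return apply_edits(fsimage, edit_log), []
```

The point of the sketch is the shape of the operation: the old image plus the journal gives a new image, after which the journal can start again from empty. That is why a regular checkpoint keeps the edit log short.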

Moving on to the Data Nodes
A Data Node stores data in the Hadoop file system. A functional file system has more than one Data Node, with data replicated across them. On startup, a DataNode connects to the Name Node, spinning until that service comes up. It then responds to requests from the Name Node for file system operations. Client applications can talk directly to a Data Node once the Name Node has provided the location of the data.

A data node sends a heartbeat to the name node every 3 seconds. If a data node doesn't send a heartbeat for 10 minutes, the name node considers it a dead node. That's when it takes all the blocks that were on it and re-replicates them across the cluster.
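Here is a small Python sketch of the dead-node check, using the 10 minute timeout mentioned above. The data structures and function names are hypothetical: `last_heartbeat` maps a node name to the time of its last heartbeat in seconds, and `block_map` maps a block ID to the set of nodes holding a copy.

```python
DEAD_TIMEOUT = 10 * 60  # seconds without a heartbeat before a node counts as dead


def dead_nodes(last_heartbeat, now, timeout=DEAD_TIMEOUT):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def blocks_to_rereplicate(block_map, dead):
    """Return the blocks that had at least one copy on a dead node."""
    dead = set(dead)
    return sorted(block for block, nodes in block_map.items() if nodes & dead)
```

In this toy model the name node would run `dead_nodes` periodically and hand the result of `blocks_to_rereplicate` to the replication logic, which copies those blocks from the surviving nodes.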

Speaking of blocks, blocks are 64 MB in size (by default). Block placement is very important for bandwidth.
Replication management tells us whether a block is under-replicated or over-replicated. If it is over-replicated, a replica is marked for removal. If it is under-replicated, the block goes into a priority queue, and the blocks with the fewest replicas in the cluster are at the top of the queue.
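The priority queue for under-replicated blocks can be sketched with Python's `heapq` module. This is an illustrative model only: `replica_counts` (block ID to current number of copies) and the function name are assumptions made for this example, and the target of 3 is the default replication factor from this post.

```python
import heapq


def replication_plan(replica_counts, target=3):
    """Split blocks into an ordered under-replicated queue and an over-replicated list.

    Under-replicated blocks come out with the fewest copies first,
    because those are the closest to being lost.
    """
    over = sorted(block for block, count in replica_counts.items() if count > target)

    heap = [(count, block) for block, count in replica_counts.items() if count < target]
    heapq.heapify(heap)
    under = [heapq.heappop(heap)[1] for _ in range(len(heap))]
    return under, over
```

A block with a single remaining copy therefore gets repaired before a block that still has two copies, while blocks in the `over` list are the ones that would have a copy marked for removal.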

Finally, let's look at some tools we can use to interact with HDFS.

From the user's perspective, we will use the FS shell.
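The FS shell is used from the command line as `hadoop fs -<command> <arguments>`, for example `hadoop fs -ls`, `hadoop fs -mkdir`, `hadoop fs -put` and `hadoop fs -cat`. As a rough sketch, the Python helper below builds such commands and can run them. The helper names and the example paths are hypothetical, and actually running a command assumes a Hadoop client on the PATH and a reachable cluster.

```python
import subprocess


def fs_command(action, *args):
    """Build the argument list for `hadoop fs -<action> <args...>`."""
    return ["hadoop", "fs", f"-{action}", *args]


def run_fs(action, *args):
    """Run an FS shell command and return what it printed.

    Needs a Hadoop client on the PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        fs_command(action, *args), check=True, capture_output=True, text=True
    )
    return result.stdout


# Hypothetical paths, for illustration only:
# run_fs("mkdir", "/user/demo")
# run_fs("put", "notes.txt", "/user/demo/notes.txt")
# print(run_fs("ls", "/user/demo"))
```

Building the argument list separately from running it makes it easy to see exactly which FS shell command would be sent to the cluster.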



