Wednesday, July 30, 2014

Big Data: Managing HDFS

So far we have covered the following topics in the big data series. You can click a hyperlink to go to a specific topic.

Technology Stack
Hadoop Distributed File System (HDFS)
Map Reduce
Installing Hadoop (Single Node)
Apache Hadoop installing Multi Node
Big Data: Troubleshooting, Administering and optimizing Hadoop

In this blog, we cover the topic of managing HDFS.
Let's start with the data.
Below are the URLs for getting data of varying shapes and sizes on the internet.

When people get into Hadoop, the first thing they want to do is see the whole process end to end. That is why small data is really good to start with.

Books (Small): thousands of free books, which you can download as text files. Put these text files into Hadoop and start mining the information.

Other example data sets

S3 Data (Varying) :,com/datasets

Public datasets (varying) :

So we download this data onto our computer, into the Cygwin directory.

We now have small data under the books folder and semi-large data under the weather folder. Next we have to get this data into HDFS.
First we connect our client to the Hadoop cluster and make a test directory:

# ~S hadoop fs -mkdir test
When we make a directory without specifying a full path, it is created in the user's HDFS home directory.
We need to put our data in a place where everyone on the cluster can access it:
# ~S hadoop fs -mkdir hdfs://hnname:10001/data/small
The above command creates the directory for our book data. The next one creates the directory for the larger data:
# ~S hadoop fs -mkdir hdfs://hnname:10001/data/big

If you want to remove a directory, you can do this:

# ~S hadoop fs -rmr test

Now we move the data into our small and big directories. Note that -moveFromLocal deletes the local copy once the upload succeeds.

# ~S hadoop fs -moveFromLocal /home/abuser/data/war_and_peace.txt hdfs://hnname:10001/data/small/war_and_peace.txt

Now that the data is loaded, we can do some admin work.

# ~S hadoop dfsadmin -report
This gives a cluster summary, per-node information and so on.

We can also put the cluster in safe mode:
# ~S hadoop dfsadmin -safemode enter

To get out of safe mode:

# ~S hadoop dfsadmin -safemode leave

Now that we have data, we can also run the file system checker. We can't run this command from the client machine; you need to ssh to the name node.

# ~S hadoop fsck / -files -blocks

We can also check the file system under a specific directory:

# ~S hadoop fsck /data/big


Upgrading Hadoop follows these steps:

1) Shut down the cluster: shut down MapReduce first, then HDFS.
2) Install the new version of Hadoop.
3) Start Hadoop with the upgrade option: -upgrade
4) Check the status with dfsadmin.
5) When the status is complete:
   - Put the cluster in safe mode
   - Use fsck to check health
   - Read some files
6) Roll back if there are problems: -rollback
7) Finalize if successful: hadoop dfsadmin -finalizeUpgrade
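
The steps above can be sketched as a small shell script. This is only a rough outline, assuming a Hadoop 1.x install whose stop-mapred.sh, stop-dfs.sh and start-dfs.sh scripts are on the PATH. The function names and the DRY_RUN switch are made up for illustration, and installing the new release (step 2) is left outside the script.

```shell
#!/bin/sh
# Sketch of the upgrade steps, assuming Hadoop 1.x scripts are on PATH.
# DRY_RUN is a hypothetical switch: with DRY_RUN=1 each command is
# printed instead of executed, so the sequence can be reviewed first.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_upgrade() {
    run stop-mapred.sh          # step 1: MapReduce first
    run stop-dfs.sh             # step 1: then HDFS
    # step 2: install the new Hadoop release here (not scripted)
    run start-dfs.sh -upgrade   # step 3
}

check_upgrade() {
    run hadoop dfsadmin -upgradeProgress status   # step 4
}

verify_upgrade() {
    # step 5: safe mode, health check, then read a known file
    run hadoop dfsadmin -safemode enter
    run hadoop fsck /
    run hadoop fs -tail /data/small/war_and_peace.txt
}

rollback_upgrade() {
    # step 6: assumes the previous release has been put back in place
    run stop-dfs.sh
    run start-dfs.sh -rollback
}

finalize_upgrade() {
    run hadoop dfsadmin -finalizeUpgrade   # step 7
}
```

Running it with DRY_RUN=1 first, for example `DRY_RUN=1; start_upgrade`, prints the commands without touching the cluster. Keeping rollback and finalize as separate functions reflects that only one of them should be run, depending on what the checks in step 5 show.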

The next thing we need to discuss is rack awareness.

- The name node executes a script, passing an IP address as the argument
- The script returns the rack ID
- This is accomplished via direct code, a file lookup, or dynamically

The above process is manual and hard coded. We need a better way to do this, and the better way is to make a file.
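
To make the file lookup idea concrete, here is a minimal sketch of a rack script. It is not the exact script used on this cluster: the mapping file path and its "IP rack-id" line format are assumptions made for illustration. In Hadoop 1.x the name node is pointed at such a script through the topology.script.file.name property in core-site.xml.

```shell
#!/bin/sh
# Hypothetical mapping file: one "<ip> <rack-id>" pair per line, e.g.
#   10.0.0.1 /rack1
#   10.0.0.2 /rack2
TOPOLOGY_FILE="${TOPOLOGY_FILE:-/etc/hadoop/conf/topology.data}"
# Rack id returned for addresses that are not listed in the file.
DEFAULT_RACK="/default-rack"

resolve_rack() {
    # Print the rack id for one IP address, or the default if unknown.
    ip="$1"
    rack=""
    if [ -r "$TOPOLOGY_FILE" ]; then
        rack=$(awk -v ip="$ip" '$1 == ip { print $2; exit }' "$TOPOLOGY_FILE")
    fi
    echo "${rack:-$DEFAULT_RACK}"
}

# The name node may pass several addresses in one call;
# print one rack id per address, in the same order.
for addr in "$@"; do
    resolve_rack "$addr"
done
```

With this layout, adding a node to a rack means adding one line to the mapping file, and the script itself never changes.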
