
Big Data: Managing HDFS

So far, we have covered the following topics in the Big Data series. You can click on a hyperlink to go to a specific topic.

Technology Stack
Hadoop Distributed File System (HDFS)
MapReduce
Installing Hadoop (Single Node)
Installing Apache Hadoop (Multi Node)
Big Data: Troubleshooting, Administering and Optimizing Hadoop

In this blog, we cover the topic of Managing HDFS.
Let's start with DATA.
Below are some URLs for getting data of varying shapes and sizes from the internet.

When people get into Hadoop, the first thing they want to do is see the whole process end to end. That's why small data is really good to start with. Books (Small): www.gutenberg.org has thousands of free books, which you can download as text files. Put these text files into Hadoop and start mining the information.
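
For example, a rough sketch of downloading one book as a plain-text file with wget (the ebook number and file name below are just for illustration, not from the original post):

$ wget -O /home/abuser/data/war_and_peace.txt https://www.gutenberg.org/files/2600/2600-0.txt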

Other example data sets:

S3 Data (Varying): aws.amazon.com/datasets

Public Datasets (Varying): www.infochimps.com/datasets

So, we download this data onto our computer into the Cygwin directory, so we have small data under the books folder and semi-large data under the weather folder. Now we have to get this data into HDFS.
Now we connect our client to the Hadoop cluster and make a test directory:

$ hadoop fs -mkdir test
When we make a directory without specifying the path, the directory is created in the user's home directory.
We need to put our data in a place in HDFS where everyone can access it:
$ hadoop fs -mkdir hdfs://hnname:10001/data/small
The above command creates the directory for our book data.
$ hadoop fs -mkdir hdfs://hnname:10001/data/big
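
To confirm both directories were created, a quick listing against the same hnname:10001 name node address:

$ hadoop fs -ls hdfs://hnname:10001/data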

Also, if you want to remove a directory, you can do this:


$ hadoop fs -rmr test

Now we will move the data into our small and big directories:

$ hadoop fs -moveFromLocal /home/abuser/data/war_and_peace.txt hdfs://hnname:10001/data/small/war_and_peace.txt
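
The command above only covers the small data set; a similar sketch for the weather data going into the big directory would look like this (the local file name here is hypothetical). Note that -put keeps the local copy, while -moveFromLocal deletes it:

$ hadoop fs -put /home/abuser/data/weather/2012.csv hdfs://hnname:10001/data/big/2012.csv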

Since we have loaded the data, we can now do some admin work:

$ hadoop dfsadmin -report
It will give a cluster summary, per-node information, etc.

We can also put the cluster in safe mode:
$ hadoop dfsadmin -safemode enter

To get out of safe mode:

$ hadoop dfsadmin -safemode leave
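
To simply check whether the name node is currently in safe mode without changing anything:

$ hadoop dfsadmin -safemode get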

We can also run the file system checker since we now have data in HDFS. We can't run this command from the client machine; you need to SSH to the name node.

$ hadoop fsck / -blocks

We can also check the file system of a specific directory:

$ hadoop fsck /data/big
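
For more detail, fsck can also list the files, blocks, and block locations under that directory:

$ hadoop fsck /data/big -files -blocks -locations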

Now let's look at the UPGRADE PROCESS OF HADOOP

1) Shut down the cluster: shut down MapReduce first, then HDFS.
2) Install the new version of Hadoop.
3) Start Hadoop with the upgrade option: start-dfs.sh -upgrade
4) Check the status with dfsadmin (see the command sketch after this list).
5) When the status is complete:
   - Put HDFS in safe mode
   - Use fsck to check health
   - Read some files
6) Roll back if there are issues: start-dfs.sh -rollback
7) Finalize if successful: hadoop dfsadmin -finalizeUpgrade
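
A rough, consolidated sketch of the commands behind these steps (assuming the stop/start scripts are on the PATH of the HDFS user):

$ stop-mapred.sh
$ stop-dfs.sh
(install the new Hadoop version)
$ start-dfs.sh -upgrade
$ hadoop dfsadmin -upgradeProgress status
$ hadoop dfsadmin -safemode enter
$ hadoop fsck /
$ hadoop dfsadmin -finalizeUpgrade
(or start-dfs.sh -rollback if there are issues)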

The next thing we need to discuss is RACK AWARENESS

- The name node executes a script and passes the IP address as an argument
- The script returns the rack id
- This can be accomplished via direct code, a file lookup, or a dynamic lookup



The above process is manual and hard coded. We need to find a better way to do this, and the better way is to make a file that the script looks up, as sketched below.
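
A minimal sketch of this file-based approach (the file paths, IP addresses, and rack names below are assumptions for illustration, not from the original post). First, a plain-text mapping file, for example /etc/hadoop/topology.data:

10.1.1.11  /rack1
10.1.1.12  /rack1
10.1.2.11  /rack2

Then a small script, for example /etc/hadoop/topology.sh, which the name node calls with one or more IP addresses and which prints the matching rack id for each, falling back to /default-rack for unknown hosts:

#!/bin/bash
# Look up each IP address passed by the name node in topology.data
# and print its rack id; unknown hosts fall back to /default-rack.
DATAFILE=/etc/hadoop/topology.data
for ip in "$@"; do
  rack=$(awk -v host="$ip" '$1 == host {print $2}' "$DATAFILE")
  echo "${rack:-/default-rack}"
done

Finally, point Hadoop at the script with the topology.script.file.name property in core-site.xml:

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>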





