Saturday, July 19, 2014

Big data: Technology Stack

There is so much untapped information stored in unstructured data, and the volume of unstructured data is rising rapidly.
Hadoop is central to big data. Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. Let’s first start with the benefits of Hadoop:
  1)      Distributed storage: this feature enables large-scale processing on massive clusters
  2)      Scalable
  3)      Flexible
  4)      Fault tolerant
  5)      Intelligent

So let's start at a high level: I will begin with some basic high-level big data technologies and then try to go a little deeper.

Let's describe the Hadoop technology stack.

Core to Hadoop are HDFS, MapReduce, and YARN.
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop instance typically has a single namenode, plus a cluster of datanodes that form the HDFS cluster; a node does not have to run a datanode to be part of the cluster. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication.
HDFS is self-healing, high-bandwidth clustered storage. If we put a petabyte file into Hadoop, HDFS will break the file up into blocks and distribute them across all the nodes in the cluster.
When we set up HDFS, we set a replication factor; the default is 3. This means that when we put a file into Hadoop, it makes 3 copies of every block, spread across the nodes in the cluster. So if we lose a node, we still have all the data, and HDFS re-replicates the blocks that were on that node to the other nodes. This is the fault tolerance in Hadoop.
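To make the block-splitting and replication idea concrete, here is a toy sketch in plain Python (not real HDFS code). The block size, node names, and round-robin placement are all invented for illustration; real HDFS uses 128 MB blocks by default and rack-aware placement.

```python
# Toy illustration of HDFS-style block placement (not real HDFS code).
BLOCK_SIZE = 128   # pretend bytes; real HDFS defaults to 128 MB blocks
REPLICATION = 3    # HDFS default replication factor

def place_blocks(file_size, nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # simple round-robin placement; real HDFS is rack-aware
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(file_size=300, nodes=nodes)
# Every block lives on 3 different nodes, so losing any single node
# still leaves 2 copies of each of its blocks elsewhere in the cluster.
```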
How does it do this? As mentioned earlier, there are namenodes and datanodes: one namenode, and the rest are datanodes.

The namenode is the metadata server: it holds in memory the location of every block on every node. So how do we process all of this distributed data? Here comes MapReduce: a two-step process with a mapper and a reducer.

MapReduce handles retrieval and processing. Programmers write the mapper function, which goes out to the nodes and selects which data points to retrieve; the reducer then takes all that data and aggregates it.
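The classic example is word count, sketched here in plain Python rather than the actual Hadoop API, just to show the mapper/reducer split: the mapper emits (key, value) pairs, and the reducer aggregates all the values for each key.

```python
from collections import defaultdict

# Word count, the "hello world" of MapReduce, sketched in plain Python.
def mapper(line):
    # Emit one (word, 1) pair per word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # Aggregate all the values emitted for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data is big", "hadoop handles big data"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = reducer(pairs)
# counts["big"] == 3 and counts["data"] == 2
```

In real Hadoop, the mappers run in parallel on the nodes that hold the data blocks, and the framework shuffles each key's pairs to a reducer.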

YARN (Yet Another Resource Negotiator) is essentially MapReduce 2.0. The fundamental idea is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

Hadoop essentials:
·         Pig
·         Hive
·         HBase
·         Avro
·         ZooKeeper
·         Oozie
·         Sqoop
So let's describe them in detail.
Pig is a high-level dataflow scripting language. It sits at the data access layer of the Hadoop architecture. It's all about getting the data: loading it up, transforming it, filtering it, and returning the results.
There are 2 core components to Pig:
  1)      Pig Latin: the programming language
  2)      Pig Runtime: it compiles Pig Latin and converts it into MapReduce jobs
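The typical Pig dataflow (LOAD, then FILTER, then GROUP and aggregate) can be mimicked in plain Python to show what a Pig script does conceptually; the records and field meanings below are made up, and real Pig would run these steps as MapReduce jobs over HDFS.

```python
# The typical Pig dataflow (LOAD -> FILTER -> GROUP -> aggregate),
# mimicked in plain Python; the sample records are made up.
records = [                        # LOAD: read tuples from storage
    ("alice", "click"), ("bob", "view"),
    ("alice", "click"), ("carol", "click"),
]
clicks = [r for r in records if r[1] == "click"]   # FILTER BY action
grouped = {}                                        # GROUP BY user
for user, _ in clicks:
    grouped[user] = grouped.get(user, 0) + 1        # FOREACH ... COUNT
# grouped == {"alice": 2, "carol": 1}
```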

Hive is another data access project. Hive is a way to project structure onto the data inside the cluster, so it is really a database and data warehouse built on top of Hadoop. It comes with a query language called HiveQL, which is extremely similar to SQL.
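To give a feel for the kind of query Hive exposes, here is a comparable SQL query run against an in-memory SQLite table as a stand-in; the table and column names are invented. Hive would execute essentially the same HiveQL statement, but compile it into MapReduce jobs over data in HDFS.

```python
import sqlite3

# HiveQL reads much like standard SQL; SQLite here is only a stand-in
# for illustration. The page_views table and its columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/about"),
])
rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
).fetchall()
# rows == [("/home", 2), ("/about", 1)]
```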

Under the data storage layer

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Under the interaction, visualization, execution, and development layer

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.

Lucene is a free, open-source information retrieval software library.

Under the data serialization layer
Avro is a remote procedure call and serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. In other words, Avro is a data serialization system. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

It is similar to Thrift, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).
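Since Avro uses JSON for defining data types, a schema is just a JSON document. Below is a minimal example of an Avro record schema, parsed with Python's standard json module; the record and field names are invented for illustration.

```python
import json

# A minimal Avro record schema -- Avro schemas are plain JSON documents.
# The record name and fields here are invented for illustration.
schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
""")
# Avro serializes data matching this schema in a compact binary format;
# the schema itself travels with the persisted data (or is exchanged
# over the wire for RPC), which is what makes schema evolution easy.
```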

Under the data intelligence layer


Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification.

Under the data integration layer
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Under the management, monitoring, and orchestration layer

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. 

In the next few blogs we will go into more detail, so stay tuned.
