
Big data: Technology Stack


A huge amount of untapped information sits in unstructured data, and the volume of that data is rising rapidly.
Hadoop is central to big data. It is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware, and an Apache top-level project built and used by a global community of contributors and users. Let's first start with the benefits of Hadoop:
  1)      Distributed storage: data is spread across massive clusters, which enables large-scale processing
  2)      Scalable
  3)      Flexible
  4)      Fault tolerant
  5)      Intelligent



So let's start at a high level. I will begin with some basic, high-level big data technologies and then try to go a little deeper.

Let's describe the Hadoop technology stack.


Core to Hadoop are HDFS, MapReduce and YARN.
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode, while a cluster of datanodes forms the HDFS storage layer; not every node has to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS, and the file system uses TCP/IP sockets for communication.
HDFS is self-healing, high-bandwidth clustered storage. If we put a petabyte-scale file into Hadoop, HDFS breaks the file up into blocks and distributes them across all the nodes in the cluster.
When we set up HDFS, we configure a replication factor; the default is 3. That means when we put a file into Hadoop, HDFS makes three copies of every block, spread across the nodes in the cluster. So if we lose a node, we still have all the data, and HDFS re-replicates the blocks that were on that node to the other nodes. That is the fault tolerance in Hadoop.
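To make this concrete, here is a minimal sketch in Java using Hadoop's FileSystem API: it writes a small file into HDFS and requests a replication factor of 3. The path is a placeholder, and the cluster details are assumed to come from core-site.xml/hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/replication-demo.txt");  // hypothetical path, for illustration only
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");                       // HDFS splits large files into blocks behind the scenes
    }
    fs.setReplication(file, (short) 3);                 // ask for three copies of every block of this file
    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size: " + status.getBlockSize()
        + ", replication: " + status.getReplication());
  }
}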
How does HDFS pull this off? As mentioned earlier, it has a namenode and datanodes: there is one namenode, and the rest of the nodes are datanodes.

The namenode is the metadata server: it holds in memory the block locations for every file across every node. So how do we retrieve and process the data? Here comes MapReduce, a two-step process: a mapper and a reducer.

MapReduce is the retrieval and processing layer. Programmers write the mapper function, which goes out to the data and decides which records to retrieve; the reducer then takes all of that data and aggregates it.
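The classic illustration is word count, roughly as it appears in the standard Hadoop MapReduce tutorial: the mapper emits a (word, 1) pair for every token it sees, and the reducer sums those counts per word. Input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);                  // mapper: emit (word, 1) for every token
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));    // reducer: aggregate the counts per word
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}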

YARN (Yet Another Resource Negotiator) is essentially MapReduce 2.0. The fundamental idea is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
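As a small, hedged sketch of where the ResourceManager fits in, the snippet below uses the YARN client API to ask the RM for the list of applications it knows about; it assumes a reachable cluster described by yarn-site.xml.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());     // reads yarn-site.xml to locate the ResourceManager
    client.start();
    List<ApplicationReport> apps = client.getApplications();
    for (ApplicationReport app : apps) {
      // each application gets its own ApplicationMaster; the RM only arbitrates resources
      System.out.println(app.getApplicationId() + "  " + app.getName() + "  " + app.getYarnApplicationState());
    }
    client.stop();
  }
}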

Hadoop essentials:
·         Pig
·         Hive
·         HBase
·         Avro
·         ZooKeeper
·         Oozie
·         Sqoop
So let's describe them in detail.
Pig is a high-level data flow scripting language that sits at the data access layer of the Hadoop architecture. It is all about getting the data: loading it, transforming it, filtering it and returning the results.
There are 2 core components to Pig (a small sketch using both follows the list):
  1)      Pig Latin: the programming language

  2)      Pig Runtime: it compiles Pig Latin and converts it into MapReduce jobs
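Here is that sketch, using Pig's embedded PigServer API from Java so both pieces are visible: the Pig Latin statements are ordinary strings, and the Pig runtime turns them into MapReduce jobs. Local mode is used purely for illustration, and the file names access_log.txt and error_lines are made up.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemo {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);   // LOCAL for a quick test; MAPREDUCE on a real cluster
    // Pig Latin: load the data, filter it, return the results
    pig.registerQuery("logs = LOAD 'access_log.txt' AS (line:chararray);");
    pig.registerQuery("errors = FILTER logs BY line MATCHES '.*ERROR.*';");
    pig.store("errors", "error_lines");              // the Pig runtime compiles this pipeline into MapReduce jobs
  }
}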

Hive is another data access project. It is a way to project structure onto the data inside the cluster, so it is really a database and data warehouse built on top of Hadoop. It provides a query language called HiveQL, which is extremely similar to SQL.
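For a feel of HiveQL, here is a minimal sketch that talks to HiveServer2 over its JDBC driver. The host, port, credentials and the web_logs table are assumptions for illustration only; the query itself is plain SQL-style HiveQL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlDemo {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // looks like SQL, but Hive compiles it into jobs that run on the cluster
      ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}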

Under the data storage layer

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
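A short sketch with the HBase Java client (1.x-style API) shows the sparse, column-family model in practice: you write and read individual cells by row key. The metrics table, the d column family and the row-key layout are invented for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml for the ZooKeeper quorum
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("metrics"))) {   // hypothetical table with family "d"
      Put put = new Put(Bytes.toBytes("sensor-42#2014-06-01"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
      table.put(put);                                    // sparse row: only the cells you write are stored
      Result row = table.get(new Get(Bytes.toBytes("sensor-42#2014-06-01")));
      System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));
    }
  }
}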

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
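A quick sketch using the DataStax Java driver (3.x-style API): the keyspace is created with a replication factor of 3 so every row lives on three nodes, and reads and writes go through plain CQL. The keyspace and table names are illustrative only.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraDemo {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect()) {
      // replication_factor 3: each row is stored on three nodes, so there is no single point of failure
      session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
          + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
      session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");
      session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'alice')");
      ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
      Row row = rs.one();
      System.out.println(row.getString("name"));
    }
  }
}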


Under the interaction, visualization, execution and development layer

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.


Lucene is a free, open-source information retrieval software library.
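A tiny sketch of what that means in practice, using a recent Lucene API (5.x+ style): index a document, then search it with a parsed query. The field name and the index directory are arbitrary choices for this example.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneDemo {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory dir = FSDirectory.open(Paths.get("lucene-index"));
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "hadoop stores data in hdfs and processes it with mapreduce", Field.Store.YES));
      writer.addDocument(doc);                 // index the document
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new QueryParser("body", analyzer).parse("mapreduce");
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("body"));   // print matching documents
      }
    }
  }
}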


Under the data serialization layer
Avro is a remote procedure call and serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. In other words, Avro is a data serialization system. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

It is similar to Thrift, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).
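A minimal sketch of Avro's generic API: the schema is plain JSON, a record is built against it, serialized to the compact binary format, and read back with the same schema. The User schema here is made up for the example.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroDemo {
  public static void main(String[] args) throws Exception {
    // the schema is defined in JSON
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}");
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // serialize to the compact binary format
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // deserialize with the same schema
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(decoded.get("name") + " / " + decoded.get("age"));
  }
}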

Under the data intelligence layer

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets.


Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering and classification.
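As a flavor of Mahout's classic "Taste" collaborative filtering API, here is a hedged sketch of a user-based recommender. The input file ratings.csv (lines of userID,itemID,rating) and the neighborhood size are assumptions for illustration.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));          // hypothetical userID,itemID,rating file
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);   // how alike two users' ratings are
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> recs = recommender.recommend(1, 3);              // top 3 items for user 1
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}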

Under the data integration layer
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Under the management, monitoring and orchestration layer


ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. 
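A small sketch with the ZooKeeper Java client: store a bit of shared configuration in a znode and read it back, which is the kind of primitive that naming, locks and group membership are built on. The connection string and znode path are assumptions for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
  public static void main(String[] args) throws Exception {
    // connect to a (hypothetical) local ZooKeeper ensemble
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> System.out.println("event: " + event));
    if (zk.exists("/demo-config", false) == null) {
      // a znode is a small, highly available record that every node in the cluster can read
      zk.create("/demo-config", "replication=3".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}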




In the next few blogs we will go into more detail, so stay tuned.



















