Hadoop 101!

The world is one big data problem! - Andrew McAfee

This article is part of a Big Data series in which I'll be posting about the Big Data tech stack. Most of the articles will be short and to the point.

Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. (Wikipedia)

In simple words:

Like any other data platform, it stores and processes data, but with immense powers (same as Thor). Since its inception it has been a game changer in how we process and store data.

What is Hadoop?

  • It's an open-source project from the Apache Software Foundation.
  • It consists of a software framework for distributing and running applications on a cluster of servers (I told you, same power as Thor ;)).
  • Hadoop is written in Java.
  • It is inspired by Google's GFS (Google File System) and MapReduce papers.
  • Hadoop is capable of processing large volumes of data (Big Data) on a cluster of inexpensive hardware, also called commodity hardware.
  • Hadoop is highly fault tolerant, i.e. if a node fails it is taken care of automatically: data is replicated across nodes and failed tasks are re-run elsewhere.

So, like Thor, what powers does Hadoop have?

Fault Tolerance, Reliability, High Availability, Scalability (Vertical & Horizontal), Cost-Effectiveness, Data Locality (moving computation close to the data rather than moving data close to the computation)

Just as the human body has 2 most important parts, the heart and the brain, Hadoop has 2 most important parts too.

The Heart and Brain of Hadoop:

[Image: heart-and-brain.jpg]

1. HDFS: Hadoop Distributed File System

  • HDFS, aka the heart of Hadoop, is responsible for data storage and data protection.

  • HDFS is the storage layer in the Hadoop ecosystem.
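
To get a feel for how the heart protects data, here is a tiny Python sketch of the idea: a file is cut into blocks and each block is copied to several nodes. The 128 MB block size and replication factor of 3 are the usual HDFS defaults, but everything else (the function names `place_blocks` and `readable_after_failure`, the node names, the round-robin placement) is made up for illustration. Real HDFS uses a smarter, rack-aware placement policy, so treat this as a toy model and not as how HDFS is actually implemented.

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default block size
REPLICATION = 3      # common HDFS default replication factor


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Split a file into blocks and pick `replication` distinct nodes per block.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """The file survives if every block still has at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical 4-node cluster storing a 500 MB file.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(500, nodes)
for block_id, replicas in placement.items():
    print(block_id, replicas)
print(readable_after_failure(placement, "node2"))
```

With 3 copies of every block, losing any single node still leaves at least two live replicas of each block, which is the intuition behind "data protection" here.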

2. MapReduce:

  • MapReduce, aka the brain of Hadoop, is responsible for processing data in parallel.

  • MapReduce is the computation layer in the Hadoop ecosystem.
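
To make the brain a little less mysterious, here is the classic word count written as a map step, a shuffle step and a reduce step in plain Python. This is only a single-machine sketch of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions for illustration. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical input: each string stands in for a line of a large file.
lines = ["big data is big", "hadoop stores big data"]
result = word_count(lines)
print(result)  # counts: big=3, data=2, is=1, hadoop=1, stores=1
```

Because each line is mapped independently and each word is reduced independently, both steps can be spread over many machines, which is where the "in parallel" part comes from.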

We'll decipher these 2 organs in detail in the next blog post ;)