The world is one big data problem! - Andrew McAfee
This article is part of a Big Data series in which I'll be posting content related to the Big Data tech stack. Most of the articles will be short and to the point.
Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. (Wikipedia)
In simple words:
Like any other platform, it's a platform for data storage and processing, but with immense powers (same as Thor). Since its inception it has been a game changer in how we store and process data.
What is Hadoop?
- It's an open-source project from the Apache Software Foundation.
- It consists of a software framework for distributing and running applications on a cluster of servers (I told you, same power as Thor ;)).
- Hadoop is written in Java.
- It is inspired by Google's GFS (Google File System).
- Hadoop is capable of processing large volumes of data (Big Data) on a cluster of inexpensive hardware, also called commodity hardware.
- Hadoop is highly fault tolerant, i.e. if any failure occurs it is automatically taken care of, typically by keeping redundant copies of the data across nodes.
So, like Thor, what powers does Hadoop have?
- Fault tolerance
- Reliability
- High availability
- Scalability (vertical & horizontal)
- Highly economical
- Data locality (moving computation close to the data rather than moving data close to the computation)
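To make data locality concrete, here's a toy scheduler sketch in plain Python (the names and structure are my own invention for illustration, not real Hadoop APIs): given which nodes hold a replica of each data block, it assigns work to a node that already stores the block, so the computation travels to the data instead of the other way around.

```python
# Toy illustration of data locality -- not actual Hadoop code.
# block_locations maps each data block to the nodes holding a replica of it.
block_locations = {
    "block-1": ["node-A", "node-B", "node-C"],
    "block-2": ["node-B", "node-C", "node-D"],
}

def schedule(block, busy_nodes=()):
    """Prefer a node that already stores the block (data locality);
    fall back to a remote node only if every replica holder is busy."""
    for node in block_locations[block]:
        if node not in busy_nodes:
            return node  # computation moves to where the data lives
    return "node-remote"  # fallback: data must be shipped over the network

print(schedule("block-1"))                         # node-A
print(schedule("block-2", busy_nodes={"node-B"}))  # node-C
```

Shipping a few kilobytes of code to a node is far cheaper than shipping gigabytes of data to the code, which is why this is one of Hadoop's superpowers.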
Just as the human body has two most important parts, the heart and the brain, Hadoop has two most important parts.
The Heart and Brain of Hadoop:
1. HDFS: Hadoop Distributed File System
HDFS, aka the heart of Hadoop, is responsible for data storage and data protection.
HDFS is the storage layer of the Hadoop ecosystem.
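A rough sketch of the two ideas behind HDFS, splitting files into fixed-size blocks and replicating each block on several nodes, simulated in plain Python (a toy model, not real HDFS: the block size here is 8 bytes for readability, whereas real HDFS defaults to 128 MB blocks and a replication factor of 3):

```python
BLOCK_SIZE = 8          # toy value; real HDFS defaults to 128 MB blocks
REPLICATION = 3         # HDFS default replication factor
NODES = ["node-A", "node-B", "node-C", "node-D"]

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (the last may be smaller)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin style."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop distributed fs")
print(len(blocks))                # 4 blocks
print(place_replicas(blocks)[0])  # block 0 lives on 3 different nodes
```

Because every block exists on multiple nodes, losing a node loses no data, which is exactly the fault tolerance and data protection mentioned above.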
2. MapReduce:
MapReduce, aka the brain of Hadoop, is responsible for processing data in parallel.
MapReduce is the computation layer of the Hadoop ecosystem.
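To give a flavour of the model before the next post, here's a miniature word count, the "hello world" of MapReduce, simulated in plain Python. This only mimics the map → shuffle → reduce flow on one machine; it is not the actual Hadoop Java API, where each phase would run in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

The point of the model: map and reduce are tiny, independent functions, so the framework can run thousands of copies of them on different chunks of data at once.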
We'll decipher these two organs in detail in the next blog post ;)