Apache Hadoop Fundamentals – HDFS and MapReduce Explained with a Diagram

by Ramesh Natarajan on January 4, 2012

Hadoop is open source software for distributed computing that lets you query very large data sets and get results quickly, using a reliable and scalable architecture.

This is the first article in our new ongoing Hadoop series.

In a traditional non-distributed architecture, data is stored on one server, and any client program accesses this central server to retrieve it. This model has a few fundamental issues. You can mostly only scale vertically, by adding more CPU, more storage, etc. The architecture is also not reliable: if the main server fails, you have to restore the data from a backup. And from a performance point of view, this architecture will not return results quickly when you run a query against a huge data set.

In a Hadoop distributed architecture, both data and processing are distributed across multiple servers. The following are some of the key points to remember about Hadoop:

  • Each server offers local computation and storage. i.e., when you run a query against a large data set, every server in this distributed architecture executes the query on its local machine against its local data set. Finally, the result sets from all these local servers are consolidated.
  • In simple terms, instead of running a query on a single server, the query is split across multiple servers, and the results are consolidated. This means that the results of a query on a large dataset are returned faster.
  • You don’t need a powerful server. Just use several less expensive commodity servers as individual Hadoop nodes.
  • High fault tolerance. If any node in the Hadoop environment fails, the dataset is still returned properly, as Hadoop takes care of replicating and distributing the data efficiently across the multiple nodes.
  • A simple Hadoop implementation can use just two servers. But you can scale out to several thousand servers without any additional effort.
  • Hadoop is written in Java, so it can run on any platform.
  • Please keep in mind that Hadoop is not a replacement for your RDBMS. You’ll typically use Hadoop for unstructured data.
  • Google originally started using this distributed computing model, based on GFS (Google File System) and MapReduce. Later, Nutch (open source web search software) was rewritten using MapReduce. Hadoop was branched out of Nutch as a separate project. Now Hadoop is a top-level Apache project that has gained tremendous momentum and popularity in recent years.
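The split-the-query-and-consolidate idea described above can be sketched in a few lines of plain Python. This is a toy illustration only (the node names and data below are made up), not real Hadoop code:

```python
# Toy illustration of the scatter/gather idea: the same "query"
# (count matching records) runs against each node's local slice of
# the data, and the partial results are then consolidated.

def run_local_query(local_data, predicate):
    """Each node counts matching records in its own local slice."""
    return sum(1 for record in local_data if predicate(record))

# Pretend the data set is spread across three commodity servers.
nodes = {
    "node1": [12, 7, 30, 4],
    "node2": [25, 18, 3],
    "node3": [40, 2, 9, 50],
}

# Run the same query on every node, then consolidate the results.
partials = [run_local_query(data, lambda x: x > 10) for data in nodes.values()]
total = sum(partials)
print(total)  # 6 records match across the whole "cluster"
```

No single node ever sees the whole data set; each works only on its local slice, which is exactly why adding more nodes speeds up the query.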

HDFS

HDFS stands for Hadoop Distributed File System, which is the storage system used by Hadoop. The following is a high-level architecture that explains how HDFS works.

The following are some of the key points to remember about the HDFS:

  • In the above diagram, there is one NameNode and multiple DataNodes (servers). b1, b2, etc. indicate data blocks.
  • When you dump a file (or data) into HDFS, it is stored in blocks on the various nodes in the Hadoop cluster. HDFS creates several replicas of each data block and distributes them across the cluster in a way that is reliable and allows fast retrieval. A typical HDFS block size is 128MB. Each data block is replicated to multiple nodes across the cluster.
  • Hadoop internally makes sure that a node failure never results in data loss.
  • There is one NameNode that manages the file system metadata.
  • There are multiple DataNodes (these are the inexpensive commodity servers) that store the data blocks.
  • When you execute a query from a client, it first reaches out to the NameNode to get the file metadata, and then reaches out to the DataNodes to read the actual data blocks.
  • Hadoop provides a command line interface for administrators to work on HDFS.
  • The NameNode comes with a built-in web server from where you can browse the HDFS filesystem and view some basic cluster statistics.
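To make the block-and-replica bookkeeping above concrete, here is a simplified sketch in plain Python. The tiny block size and round-robin placement are illustrative assumptions; the real HDFS placement policy is rack-aware and considerably more sophisticated:

```python
# Simplified sketch of the HDFS idea: a file is split into fixed-size
# blocks, and each block is replicated to several DataNodes. This
# models the kind of metadata the NameNode keeps, nothing more.

BLOCK_SIZE = 4    # bytes here, purely for illustration; 128MB in a typical HDFS setup
REPLICATION = 3   # each block is stored on this many DataNodes

datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop the file content into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Return NameNode-style metadata: block id -> list of DataNodes."""
    metadata = {}
    for i, _ in enumerate(blocks):
        # Round-robin placement: block i lands on `replication` distinct nodes.
        metadata[f"b{i + 1}"] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return metadata

blocks = split_into_blocks(b"hello hdfs world")
metadata = place_blocks(blocks, datanodes)
print(metadata)  # e.g. {'b1': ['dn1', 'dn2', 'dn3'], 'b2': ['dn2', 'dn3', 'dn4'], ...}
```

Because every block lives on three different nodes, losing any single DataNode still leaves two copies of each of its blocks elsewhere in the cluster.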

MapReduce


The following are some of the key points to remember about MapReduce:

  • MapReduce is a parallel programming model that is used to retrieve the data from the Hadoop cluster.
  • In this model, the library handles a lot of messy details that programmers don’t need to worry about. For example, the library takes care of parallelization, fault tolerance, data distribution, load balancing, etc.
  • It splits the tasks and executes them on the various nodes in parallel, thus speeding up the computation and retrieving the required data from a huge dataset quickly.
  • This provides a clear abstraction for programmers. They just have to implement (or use) two functions: map and reduce.
  • The data is fed into the map function as key/value pairs to produce intermediate key/value pairs.
  • Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output.
  • JobTracker keeps track of all the MapReduce jobs that are running on the various nodes. It schedules the jobs and keeps track of all the map and reduce tasks running across the nodes. If any one of those tasks fails, it reallocates the task to another node. In simple terms, JobTracker is responsible for making sure that a query on a huge dataset runs successfully and that the data is returned to the client reliably.
  • TaskTracker performs the map and reduce tasks assigned by the JobTracker. TaskTracker also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether to delegate a new task to this particular node or not.
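The map and reduce functions described above can be sketched without any framework at all. The following plain-Python word count is a model of the programming abstraction only, not the Hadoop API (real Hadoop jobs implement Mapper and Reducer classes in Java):

```python
# Framework-free sketch of the map/reduce model using word count,
# the classic MapReduce example.
from collections import defaultdict

def map_phase(line):
    """Emit intermediate (key, value) pairs: one (word, 1) per word."""
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    """Consolidate all intermediate values for one key."""
    return key, sum(values)

# Input splits; on a real cluster each split is processed on a
# different node, in parallel.
splits = ["the quick brown fox", "the lazy dog", "the end"]

# Map phase: each split is mapped independently; the framework then
# groups intermediate pairs by key (the shuffle/sort step).
intermediate = defaultdict(list)
for split in splits:
    for key, value in map_phase(split):
        intermediate[key].append(value)

# Reduce phase: each key's values are combined into the final output.
counts = dict(reduce_phase(k, v) for k, v in intermediate.items())
print(counts["the"])  # 3
```

Notice that the programmer wrote only the two small functions; everything between them (grouping, and on a real cluster the parallelization and fault tolerance) is the framework’s job.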

We’ve only scratched the surface of Hadoop. This is just the first article in our ongoing series. In future articles of this series, we’ll explain how to install and configure a Hadoop environment, how to write MapReduce programs to retrieve data from the cluster, and how to effectively maintain a Hadoop infrastructure.


