What Is Hadoop?

Let’s start with a little quiz:

Hadoop is

a) Twitter shortcut for “I HAD it, but OOPS I lost it”?

b) The latest dance song craze (Macarena, Gangnam, Hadoop)?

c) A stuffed toy elephant?

d) A software solution for distributed computing of large datasets?

The correct answers are actually c) and d). You see, Hadoop is an open source software project of the Apache Software Foundation, and it was named after a stuffed toy elephant owned by the son of Doug Cutting, one of the framework’s creators.

But what exactly is Hadoop and how does it work?

Per the Apache website, “Apache Hadoop is a framework for running applications on large cluster built of commodity hardware.” This open source software framework lets developers and users store and process large amounts of data (Big Data) using a distributed file system. The power of Hadoop lies in its ability to spread storage and computation across clusters of commodity hardware. It does this with two key technologies.

The first is the Hadoop Distributed File System (HDFS), a distributed, scalable, and portable file system written in Java specifically for the Hadoop framework. A key component of HDFS is the NameNode. This is a single server that tracks all the other nodes (the DataNodes) in the cluster. In other words, the NameNode is the directory of which nodes belong to the cluster and which file data each one holds. As nodes and files are added to the cluster, the NameNode’s records are updated to point to them.

[Figure: Hadoop HDFS]
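
The directory role described above can be pictured with a small sketch. This is only a toy model, not the real HDFS API: the class `ToyNameNode`, its methods, and the node and file names are all hypothetical, and it tracks whole files where real HDFS tracks replicated blocks of each file.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical in-memory directory; not the real HDFS NameNode API."""

    def __init__(self):
        self.nodes = set()                      # storage nodes known to the cluster
        self.file_locations = defaultdict(set)  # file path -> nodes holding a copy

    def register_node(self, node_id):
        # A new node joins the cluster and is added to the directory.
        self.nodes.add(node_id)

    def add_file(self, path, node_id):
        # Record that a registered node now stores a copy of the file.
        if node_id not in self.nodes:
            raise ValueError(f"unknown node: {node_id}")
        self.file_locations[path].add(node_id)

    def locate(self, path):
        # Answer a lookup: which nodes hold this file?
        return sorted(self.file_locations.get(path, set()))


name_node = ToyNameNode()
name_node.register_node("node-1")
name_node.register_node("node-2")
name_node.add_file("/logs/2018-06-01.txt", "node-1")
name_node.add_file("/logs/2018-06-01.txt", "node-2")
print(name_node.locate("/logs/2018-06-01.txt"))  # ['node-1', 'node-2']
```

Because every lookup goes through this one directory, the nodes themselves never need to know about each other; they only need to keep the directory informed.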

The second key technology in a Hadoop implementation is MapReduce, a programming model for processing large datasets. It works by having a master node (the node assigned the processing request) break the request into smaller sub-tasks and send them out to worker nodes. This is the “Map” step: the master node maps the workload out to the worker nodes. As each worker node completes its assigned sub-task, it ships the results back to the master node. The master node then combines all the worker results into one result set, completing the assigned request. This is the “Reduce” step.

[Figure: MapReduce flow]
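
To make the split, process, and combine flow concrete, here is a minimal single-machine sketch of a word count. It follows the simplified description above and does not use Hadoop itself, where map and reduce functions operate on key/value pairs. The function names `split_work`, `worker_count_words`, and `combine` are made up for illustration, and the "workers" are ordinary function calls.

```python
from collections import Counter
from functools import reduce


def split_work(lines, num_workers):
    # Master node: break the request into roughly equal sub-tasks.
    return [lines[i::num_workers] for i in range(num_workers)]


def worker_count_words(chunk):
    # Worker node: count the words in its own sub-task only.
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partials):
    # Master node: merge every worker's partial result into one result set.
    return reduce(lambda total, part: total + part, partials, Counter())


lines = ["big data is big", "hadoop handles big data", "data data data"]
sub_tasks = split_work(lines, num_workers=2)
partials = [worker_count_words(chunk) for chunk in sub_tasks]
result = combine(partials)
print(result["data"], result["big"], result["hadoop"])  # 5 3 1
```

No worker ever sees the whole dataset, yet the combined result is the same as counting everything in one place. That property is what lets the work spread across many machines.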

It is important to note that for very large or complex requests, a worker node can in turn split its assigned task into smaller sub-tasks for worker nodes of its own. You could refer to this as Big Data outsourcing. When a node determines that other nodes are better equipped to handle a portion of its assigned request, it delegates that work to them while retaining responsibility for getting the completed assignment back to the master node.
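
This multi-level delegation can be sketched as a recursive function. The example below is an illustration only: the function name `process`, the `max_size` threshold, and the choice of summing numbers are all assumptions made for the sketch, not anything Hadoop prescribes.

```python
def process(numbers, max_size=2):
    # A node handles a small enough task itself.
    if len(numbers) <= max_size:
        return sum(numbers)
    # Otherwise it acts as a master for its own workers:
    # split the task, delegate each half, then combine the results.
    middle = len(numbers) // 2
    left = process(numbers[:middle], max_size)
    right = process(numbers[middle:], max_size)
    return left + right


total = process(list(range(1, 11)))
print(total)  # 55
```

Each call stays responsible for returning a complete answer to whoever called it, just as each node remains responsible for delivering its finished portion back up the chain.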

Sources:

Webopedia http://www.webopedia.com/TERM/H/hadoop.html

Wikipedia http://en.wikipedia.org/wiki/MapReduce

Hadoop Wiki http://wiki.apache.org/hadoop/

To learn more about Hadoop solutions, contact us today.

