The Hadoop software library is a framework developed for reliable, scalable, distributed processing of large data sets across clusters of computers, using a few simple programming models. It was first released by the Apache Software Foundation on April 1, 2006, and is implemented in the Java programming language. Hadoop includes a distributed file system, the Hadoop Distributed File System (HDFS), to manage storage, and uses the MapReduce programming model to distribute the processing of that data. HDFS splits files into blocks and spreads those blocks across the machines in the cluster. This design exploits data locality: the code is shipped to the nodes where the data already resides, so data is processed locally rather than being moved across the network. The Hadoop framework is composed of four base modules:
Hadoop Common - the common libraries and utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS) - stores data on commodity machines and provides high aggregate bandwidth across the cluster
Hadoop YARN - manages the cluster's computing resources and schedules them to users' applications
Hadoop MapReduce - a programming model for large-scale data processing (a minimal word-count sketch follows this list)
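To make the MapReduce model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class and variable names are illustrative; the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and submits it to the cluster
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is packaged into a JAR and submitted with a command such as hadoop jar wordcount.jar WordCount <input> <output>, where the input and output paths live on HDFS.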
A working Hadoop installation consists of all the above-mentioned modules together with the Java Archive (JAR) files and scripts needed to start Hadoop. For Hadoop to run on your system, it requires a Java Runtime Environment (JRE) and Secure Shell (SSH) set up between the nodes. The machines in a cluster store the data as blocks in HDFS. A classic Hadoop deployment runs five main services as node-level daemons: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. Here is each in brief:
NameNode - acts as the central master node; it manages the file system namespace, tracks files and holds the metadata (not the data itself). The client sketch after this list shows how reads and writes go through it
DataNode - stores the actual data in the form of blocks on the worker machines
Secondary NameNode - periodically checkpoints the metadata held by the NameNode
JobTracker - accepts processing requests from clients and schedules them as tasks across the cluster
TaskTracker - runs on each worker node; it receives tasks and code from the JobTracker and applies that code to the data residing on HDFS
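As a rough illustration of how a client interacts with these services, the sketch below uses Hadoop's FileSystem Java API to write and read a small file. The file path and the fs.defaultFS address are assumptions for the example; on a write the NameNode allocates blocks and the client streams the bytes to DataNodes, and on a read the NameNode returns block locations that the client then fetches from the DataNodes directly.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath; fs.defaultFS
    // (for example hdfs://localhost:9000) tells the client where the NameNode is.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path used only for this example.
    Path file = new Path("/user/demo/hello.txt");

    // Write: the NameNode chooses the blocks, the client streams data to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello hdfs\n");
    }

    // Read: the NameNode returns block locations, the data comes from DataNodes.
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}
```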
Since Hadoop deals with large volumes of data, reliable storage has to be the first priority, and the framework is built around that principle. Because the data sets are large, the JobTracker and TaskTrackers are used to schedule the work into queues and process it across the cluster. Hadoop provides pluggable schedulers for this, such as the Fair Scheduler and the Capacity Scheduler; a small configuration sketch follows this paragraph. There have been a few major versions of Hadoop: Hadoop 1, Hadoop 2 (which introduced YARN) and Hadoop 3, which among other things can schedule GPU hardware within a cluster. Many prominent companies make use of Hadoop. Yahoo launched what was then the world's largest Hadoop production application in 2008, and in 2010 Facebook had the largest Hadoop cluster in the world.
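In practice the scheduler is chosen in yarn-site.xml on the ResourceManager (in Hadoop 2 and later); the short Java sketch below only illustrates the property name and value involved, using the YarnConfiguration constants, and is not how a production cluster would normally be configured.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerChoice {
  public static void main(String[] args) {
    // YarnConfiguration.RM_SCHEDULER is the key "yarn.resourcemanager.scheduler.class".
    Configuration conf = new YarnConfiguration();

    // Switch from the default Capacity Scheduler to the Fair Scheduler.
    conf.set(YarnConfiguration.RM_SCHEDULER,
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");

    // Show the value that would be applied; on a real cluster this property
    // lives in yarn-site.xml and is read by the ResourceManager at startup.
    System.out.println(conf.get(YarnConfiguration.RM_SCHEDULER));
  }
}
```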
The main purpose behind the framework is to detect and handle failures at the application layer, so that a highly available service can be delivered on top of a cluster of computers, each of which may be prone to hardware failures.
Resource box-
Given the massive scope of the Hadoop framework, it is clear that a framework like Hadoop is needed for the maintenance, processing and further growth of large data sets. It is a well-established and still-expanding area, and taking Hadoop training would be a good option for students who want to make it their career.
