IS2021

Part 4 - Big Data Tools and Techniques

Hadoop

Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant. It was originally developed to support distribution for the Nutch search engine project.

  • Data Management: Store and process vast quantities of data in a storage layer that scales linearly
    • Hadoop Distributed File System (HDFS) is the core technology for the efficient scale out storage layer, and is designed to run across low-cost commodity hardware. It is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
    • Yet Another Resource Negotiator (YARN) is a next-generation framework for Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
  • Data Access: Interact with data in a wide variety of ways - from batch to real-time.
    • Apache Hive is the most widely adopted data access technology, though there are many specialized engines.
  • Data Governance & Integration: Quickly and easily load data, and manage according to policy.
    • Apache Falcon provides policy-based workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
  • Security: Address requirements of Authentication, Authorization, Accounting and Data Protection.
    • Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other Data Access components, up through the entire perimeter of the cluster via Apache Knox.
  • Operations: Provision, manage, monitor and operate Hadoop clusters at scale.
    • Apache Ambari offers the necessary interface and APIs to provision, manage and monitor Hadoop clusters and integrate with other management console software.

HDFS (Hadoop Distributed File System)

The goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives.

Data in a Hadoop cluster is broken down into smaller pieces and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of larger data sets, and this provides the scalability that is needed for big data processing.
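To make this concrete, below is a minimal sketch of how a client might write a file to HDFS through the Java FileSystem API and then inspect how it is stored. The NameNode address and file paths are placeholders chosen for illustration, not values from this course.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (placeholder hostname/port).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; larger files are split into blocks that
        // HDFS distributes and replicates across DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS");
        }

        // Report how the file is stored: length, block size, replication factor.
        FileStatus status = fs.getFileStatus(file);
        System.out.printf("len=%d blockSize=%d replication=%d%n",
                status.getLen(), status.getBlockSize(), status.getReplication());
    }
}
```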

Key tasks for failure management:

  • Monitoring
  • Rebalancing
    • A process of automatically migrating blocks of data from heavily used DataNodes to DataNodes with free space, keeping disk usage balanced across the cluster.
  • Managing integrity
    • Uses checksums associated with the actual data stored in a file (a brief sketch of the idea follows this list).
  • Metadata replication
    • The NameNode's metadata files are also subject to failure, so replicated copies of them are maintained.
  • Snapshots
    • Incremental copying of data to establish a point in time to which the system can be rolled back.
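As a rough illustration of the integrity check mentioned above, the sketch below computes a checksum over a block's bytes when it is written and recomputes it on read; a mismatch signals a corrupt replica. Plain CRC32 is used here for brevity, and the block contents are made up; real HDFS checksumming is per-chunk and handled internally by the file system.

```java
import java.util.zip.CRC32;

public class BlockChecksumSketch {
    // Illustrative only: this mimics the idea of verifying stored data
    // against a checksum computed when the data was written.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        long stored = checksum(block);    // computed when the block is written
        long observed = checksum(block);  // recomputed when the block is read

        // On a mismatch, a client would read from another replica
        // and report the corrupt block.
        System.out.println(stored == observed ? "block OK" : "block corrupt");
    }
}
```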

MapReduce

MapReduce, the heart of Hadoop, consists of two separate and distinct tasks that Hadoop programs perform.

The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
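The canonical word-count job illustrates both steps. The sketch below uses the standard org.apache.hadoop.mapreduce API; the class names are chosen here for illustration. The mapper emits a (word, 1) pair for every token in a line, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: each input line is broken into (word, 1) key/value pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: all counts for the same word are combined into a single pair.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```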

The MapReduce execution environment employs a master/slave model, in which one master node (the JobTracker) manages a pool of slave computing resources (TaskTrackers) that are called upon to do the actual work.

The role of the JobTracker is to manage these resources, with responsibilities that include managing the TaskTrackers, continually monitoring their accessibility and availability, and handling the various aspects of job management.

The role of the TaskTracker is much simpler - wait for a task assignment, initiate and execute the requested task, and periodically report status back to the JobTracker.
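A driver class ties the pieces together: the client program only describes the job and submits it, while the master (the JobTracker in classic MapReduce, or the ResourceManager under YARN) schedules the individual map and reduce tasks on the workers. The sketch below continues the word-count example from above; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Describe the job: which mapper/reducer to run and the output types.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations on HDFS (placeholders).
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit the job and block until the cluster reports completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```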