Tools And Technologies

Big Data Tools and Technologies

Apache Hadoop

Hadoop is a very significant player in the Big Data landscape.

It's an open-sourced framework for distributed storage and processing of very large data sets.

Originally built in 2005 by a Yahoo engineer.

It was inspired by Google's MapReduce and the Google File System papers.

It was written in Java to implement the MapReduce programming model for scalable, reliable and distributed computing.

The framework is composed of:

Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop MapReduce: A programming model for large-scale data processing.
Hadoop YARN: A resource management platform responsible for managing compute resources in clusters and using them for the scheduling of users' applications.

HDFS

Structured like a regular Unix file system with data storage distributed across several machines in the cluster.
Data service that sits atop regular file systems, allowing a fault tolerant, resilient clustered approach to storing and processing data.
Fault-tolerant: Detection of faults and quick automatic recovery is a core architectural goal.
Tuned to support large files. Typically a file is GB or TB and can support tens of millions of files by scaling to hundreds of nodes in a cluster.
Follows write once, read multiple approach, simplifying data coherency issues and enabling high throughput data access. Example is a web crawler application.
Optimized for throughput rather than latency. This makes it suited for long batch operations on large scale data rather than interactive analysis on streaming data.
Moving computation near the data reduces network congestion and increses throughput. HDFS provides interfaces or applications to move closer to data storage.

Architecture

Leader-follower architecture, where Namenode is the leader and Datanodes are slaves.
Files split into blocks, and blocks are stored on datanodes (generally one per node within cluster).
Datanodes manage storage attached to nodes that they run on.
Namenode controls all metadata, including what blocks make up a file and which datanode the blocks are stored on.
Namenode executres file system operations like opening, closing and renaming files and directories.
Datanodes serve read and write requests from the clients.
Datanodes perform block creation, deletion, replication upon instruction from the Namenode.
Namenode and Datanode are Java software designed to run on commodity hardware that supports Java.
Usually a cluster contains a single Namenode and multiple datanodes, one each for each node in the cluster.

HDFS Architecture

The Namenode makes all decisions around replication of blocks for data durability. Periodically receives heartbeat and BlockReport from datanodes in the cluster. Receipt of heartbeat is the health check.

Hadoop MapReduce

A framework that makes it easy to write applications which can consume huge amouts of data.

It allows processing in parallel on large clusters consisting of thousands of nodes in a manner that is reliable and fault tolerant.

The MapReduce layer consists of:

MapReduce Java API to write workflows
Services to manage these workflows and provide the scheduling, distribution and parallelizing.

MapReduce jobs

Splits the data sets into independent chunks.
Data sets are processed by map tasks in parallel.
MapReduce sorts the output of map jobs and feeds them to reduce tasks.
Both input and output of map and reduce are stored on the file system.
Framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
MapReduce framework and HDFS are running on the same set of nodes. Tasks are scheduled on nodes where data is already present, hence yielding high bandwidth across the cluster.

Inputs and Outputs of a MapReduce Job

Exclusively operates on key-value pairs.
Input is large scale data set which benefits from parallel processing and does not fit on a single machine.
Input split into independent data sets and map function produces key-value pair for each record in the data set.
Output of mappers is shuffled, sorted, grouped and passed to the reducers.
Reducer function applied to sets of key-value pairs that share the same key. The reducer function often agregates the value for the pairs with the same key.

It is important to know that:

Almost all data can be mapped to a key-value pair using a map function.
Keys and values can be of any type. If using a custom type, the type must be implement a writable interface.
MapReduce cannot be used if a computation of a value depends on a previously computed value. Recursive funcs like Fibonnaci cannot be implemented using MapReduce.

This is an example of a word count MapReduce job.

Example of MapReduce job

The order of a job goes as the following:

Input
Splitting
Mapping
Shuffle
Reduce
Final Result

Example Applications of MapReduce

Counting votes by processing data from each polling booth.
Aggregating electricy consumption from data points collected across a large geographical area.
Used by Google Maps to calculate nearest neighbour.
Performing statistical aggregate type functions on large data sets.
Counting number of href links in web log files for clickstream analysis.

Writing and Running Hadoop MapReduce Jobs

Typicall jobs are written in Java, but can also be written using:

Hadoop Streaming: A utility which allows users to create an run MapReduce jobs with any executables.
Hadoop Pipes: C++ API to implement MapReduce applications

Hadoop Job configurations

Consists of:

Input and output locations on HDFS.
Map and reduce functions via implementations of interfaces or abstract classes.
Other job parameters.

A Hadoop job client submits the job (jar/executable) and configuration to the ResourceManager in YARN which distributes them to the workers and performs functions like scheduling, monitoring and providing status and diagnostic information.

Yet Another Resource Negotiator (YARN)

Introduced in Hadoop 2.0, YARN provides a general processing platform not constrained to MapReduce.

Global ResourceManager is the authority that delegates resources among the applications in the system.

It has a NodeManager on each node that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting this data to the ResourceManager.

The ResourceManager has two components:

Scheduler - responsible for allocating resources the various running applications.
ApplicationsManager - responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster and provides the service for restarting the ApplicationMaster on failure.

The ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress.

Container

Note that for YARN, a container represents a collection of physical resources. Also could mean CPU cores, disk along with RAM.

Hadoop Ecosystem

Ecosystem

5 functions of Hadoop ecosystem:

Data management using HDFS, HBase and YARN
Data access with MapReduce, Hive and Pig
Data ingestion and integration using Flume, Sqoop, Kafka and Storm
Data monitoring using Ambari, Zookeeper and Oozie
Data governance and security using Falcon, Ranger and Knox