Big Data Tools and Technologies
Hadoop is a very significant player in the Big Data landscape. It is an open-source framework for distributed storage and processing of very large data sets. Originally created around 2005 by Doug Cutting and Mike Cafarella, and developed further at Yahoo!, it was inspired by Google's MapReduce and Google File System papers. It was written in Java to implement the MapReduce programming model for scalable, reliable, distributed computing.
The framework is composed of:
- Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop MapReduce: A programming model for large-scale data processing.
- Hadoop YARN: A resource management platform responsible for managing compute resources in clusters and using them for the scheduling of users' applications.
Hadoop Distributed File System (HDFS)
- Structured like a regular Unix file system, with data storage distributed across several machines in the cluster.
- A data service that sits atop regular file systems, allowing a fault-tolerant, resilient, clustered approach to storing and processing data.
- Fault-tolerant: Detection of faults and quick automatic recovery is a core architectural goal.
- Tuned to support large files. A typical file is gigabytes to terabytes in size, and HDFS can support tens of millions of files by scaling to hundreds of nodes in a cluster.
- Uses a write-once, read-many access model, which simplifies data coherency issues and enables high-throughput data access. A web crawler is an example application that fits this model.
- Optimized for throughput rather than latency. This makes it suited to long batch operations on large-scale data rather than interactive analysis of streaming data.
- Moving computation near the data reduces network congestion and increases throughput. HDFS provides interfaces for applications to move themselves closer to where the data is stored.
- Uses a leader-follower architecture, in which the Namenode is the leader and the Datanodes are the followers.
- Files split into blocks, and blocks are stored on datanodes (generally one per node within cluster).
- Datanodes manage storage attached to nodes that they run on.
- Namenode controls all metadata, including what blocks make up a file and which datanode the blocks are stored on.
- Namenode executes file system operations such as opening, closing, and renaming files and directories.
- Datanodes serve read and write requests from the clients.
- Datanodes perform block creation, deletion, replication upon instruction from the Namenode.
- Namenode and Datanode are Java programs designed to run on commodity hardware that supports Java.
- A cluster usually contains a single Namenode and multiple Datanodes, one per node in the cluster.

The Namenode makes all decisions around replication of blocks for data durability. It periodically receives a Heartbeat and a BlockReport from each Datanode in the cluster; receipt of a Heartbeat serves as the health check.
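The Namenode's bookkeeping can be illustrated with a small local sketch. This is a hypothetical simulation, not Hadoop's actual API: it only shows the idea of file-to-block and block-to-Datanode metadata, plus heartbeats as a health check.

```python
# Hypothetical sketch of Namenode-style metadata (not Hadoop's real API).
from collections import defaultdict

class Namenode:
    def __init__(self, replication=3):
        self.replication = replication
        self.file_blocks = {}                    # filename -> [block ids]
        self.block_locations = defaultdict(set)  # block id -> {datanode names}
        self.last_heartbeat = {}                 # datanode -> tick of last heartbeat

    def add_file(self, name, blocks, datanodes):
        self.file_blocks[name] = blocks
        for i, block in enumerate(blocks):
            # Place each replica on a different node (naive round-robin placement).
            for r in range(self.replication):
                self.block_locations[block].add(datanodes[(i + r) % len(datanodes)])

    def heartbeat(self, datanode, tick):
        self.last_heartbeat[datanode] = tick

    def healthy(self, datanode, now, timeout=3):
        # A Datanode is healthy if a heartbeat arrived within the timeout window.
        return now - self.last_heartbeat.get(datanode, -timeout - 1) <= timeout

nn = Namenode()
nn.add_file("web.log", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
nn.heartbeat("dn1", tick=10)
print(sorted(nn.block_locations["blk_1"]))  # ['dn1', 'dn2', 'dn3']
print(nn.healthy("dn1", now=11))            # True
print(nn.healthy("dn2", now=11))            # False: no heartbeat received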
Hadoop MapReduce
A framework that makes it easy to write applications that consume huge amounts of data. It allows processing in parallel on large clusters consisting of thousands of nodes in a manner that is reliable and fault tolerant.
The MapReduce layer consists of:
- MapReduce Java API to write workflows
- Services to manage these workflows and provide the scheduling, distribution and parallelizing.
- Splits the data sets into independent chunks.
- Data sets are processed by map tasks in parallel.
- MapReduce sorts the outputs of the map tasks and feeds them to the reduce tasks.
- Both input and output of map and reduce are stored on the file system.
- Framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
- The MapReduce framework and HDFS run on the same set of nodes. Tasks are scheduled on nodes where the data is already present, yielding high aggregate bandwidth across the cluster.
Inputs and Outputs of a MapReduce Job
- Exclusively operates on key-value pairs.
- Input is large scale data set which benefits from parallel processing and does not fit on a single machine.
- Input split into independent data sets and map function produces key-value pair for each record in the data set.
- Output of mappers is shuffled, sorted, grouped and passed to the reducers.
- The reducer function is applied to the sets of key-value pairs that share the same key. It often aggregates the values for the pairs with the same key.
It is important to know that:
- Almost all data can be mapped to a key-value pair using a map function.
- Keys and values can be of any type. A custom type must implement the Writable interface.

MapReduce cannot be used if the computation of a value depends on a previously computed value. Recursive functions like Fibonacci cannot be implemented using MapReduce.
A classic example is a word count job. The job proceeds through the following stages:
- Input
- Splitting
- Mapping
- Shuffling
- Reducing
- Final Result
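The word-count flow can be simulated in plain Python (a local sketch of the stages, not Hadoop's Java API — in a real job the framework performs the shuffle/sort between distributed map and reduce tasks):

```python
# Local simulation of the MapReduce word-count flow:
# splitting -> mapping -> shuffling/sorting -> reducing.
from itertools import groupby

lines = ["deer bear river", "car car river", "deer car bear"]

# Map: emit a (word, 1) pair for each word in each input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the pairs by key so each reducer sees one key's values.
mapped.sort(key=lambda kv: kv[0])
grouped = groupby(mapped, key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = {word: sum(v for _, v in pairs) for word, pairs in grouped}
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```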
Example Applications of MapReduce
- Counting votes by processing data from each polling booth.
- Aggregating electricity consumption from data points collected across a large geographical area.
- Used by Google Maps to calculate nearest neighbour.
- Performing statistical aggregate type functions on large data sets.
- Counting number of href links in web log files for clickstream analysis.
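A statistical aggregate such as a per-key average maps naturally onto this model. Below is a local Python sketch with made-up station/temperature data (the keys and values are illustrative, not from any real data set):

```python
# Per-key aggregation sketch: average temperature per station.
from collections import defaultdict

# Map output: (station, temperature) pairs, as emitted from many input records.
pairs = [("s1", 20.0), ("s2", 10.0), ("s1", 30.0), ("s2", 14.0), ("s1", 25.0)]

# Shuffle: group values by key.
groups = defaultdict(list)
for station, temp in pairs:
    groups[station].append(temp)

# Reduce: compute an aggregate (here, the mean) for each key.
means = {station: sum(temps) / len(temps) for station, temps in groups.items()}
print(means)  # {'s1': 25.0, 's2': 12.0}
```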
Writing and Running Hadoop MapReduce Jobs
Typically jobs are written in Java, but they can also be written using:
- Hadoop Streaming: A utility which allows users to create and run MapReduce jobs with any executables (e.g. shell scripts or Python programs) as the mapper and reducer.
- Hadoop Pipes: A C++ API to implement MapReduce applications.
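A minimal Streaming-style mapper and reducer for word count might look like the sketch below. The convention shown (tab-separated key-value lines on stdin/stdout, with input to the reducer sorted by key) is how Streaming jobs communicate; the local "dry run" at the end stands in for the sort the framework would perform between the two stages.

```python
# Streaming-style word count: mapper and reducer as line-oriented filters.
import io
import sys
from itertools import groupby

def mapper(stdin, stdout):
    # Emit "word<TAB>1" for every word on every input line.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Input arrives sorted by key; sum the counts per word.
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        stdout.write(f"{word}\t{sum(int(v) for _, v in group)}\n")

# Local dry run standing in for: cat input | mapper | sort | reducer
map_out = io.StringIO()
mapper(io.StringIO("deer bear river\ncar car river\n"), map_out)
sorted_in = io.StringIO("".join(sorted(map_out.getvalue().splitlines(keepends=True))))
reducer(sorted_in, sys.stdout)  # prints bear 1, car 2, deer 1, river 2 (tab-separated)
```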
Hadoop Job configurations
- Input and output locations on HDFS.
- Map and reduce functions via implementations of interfaces or abstract classes.
- Other job parameters.
A Hadoop job client submits the job (jar/executable) and configuration to YARN, which distributes them to the workers and performs functions such as scheduling, monitoring, and providing status and diagnostic information.
Yet Another Resource Negotiator (YARN)
Introduced in Hadoop 2.0, YARN provides a general-purpose processing platform that is not constrained to MapReduce.
A global ResourceManager is the authority that delegates resources among the applications in the system. A NodeManager runs on each node and is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting this data to the ResourceManager.
The ResourceManager has two components:
- Scheduler - responsible for allocating resources to the various running applications.
- ApplicationsManager - responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster on failure.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress.
Note that in YARN a container represents a collection of physical resources: RAM, and possibly CPU cores and disk as well.
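The division of labour can be sketched with a toy simulation. This is a hypothetical illustration, not YARN's real API: a Scheduler grants containers against the free capacity of each node, as a NodeManager would report it.

```python
# Hypothetical sketch of YARN-style container allocation (not the real API).
class NodeManager:
    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory = memory_mb
        self.free_vcores = vcores

class Scheduler:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        # Grant a container on the first node with enough free resources.
        for node in self.nodes:
            if node.free_memory >= memory_mb and node.free_vcores >= vcores:
                node.free_memory -= memory_mb
                node.free_vcores -= vcores
                return {"node": node.name, "memory_mb": memory_mb, "vcores": vcores}
        return None  # no capacity: the ApplicationMaster must wait and retry

nodes = [NodeManager("node1", 4096, 4), NodeManager("node2", 2048, 2)]
scheduler = Scheduler(nodes)

# An ApplicationMaster requests two containers for its tasks.
print(scheduler.allocate(3072, 2))  # {'node': 'node1', 'memory_mb': 3072, 'vcores': 2}
print(scheduler.allocate(2048, 2))  # {'node': 'node2', 'memory_mb': 2048, 'vcores': 2}
```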
Five functions of the Hadoop ecosystem:
- Data management using HDFS, HBase and YARN
- Data access with MapReduce, Hive and Pig
- Data ingestion and integration using Flume, Sqoop, Kafka and Storm
- Data monitoring using Ambari, Zookeeper and Oozie
- Data governance and security using Falcon, Ranger and Knox