
Hive, HBase, and Hadoop Ecosystem Components Tutorial

Welcome to the second lesson of the ‘Introduction to Big Data and Hadoop’ course. In this lesson, we will focus on Hive, HBase, and the components of the Hadoop ecosystem.

Objectives

By the end of this lesson, you will be able to:

  • Describe the basics of Hive

  • Explain HBase and Cloudera

  • Discuss the commercial distributions of Hadoop

  • Explain the components of the Hadoop ecosystem

In the next section of this lesson, we will focus on an introduction to Hive.

Hive – Introduction

Hive is defined as a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop.  

Following are the facts related to Hive:

  • It provides a SQL-like language called HiveQL (HQL). Due to this SQL-like interface, Hive is a popular choice for Hadoop analytics.

  • It provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

  • Relying on MapReduce for execution, Hive is batch-oriented and has high latency for query execution.

In the next section of this lesson, we will discuss the key characteristics of Hive.

Hive – Characteristics

Hive is a system for managing and querying unstructured data by imposing a tabular, structured format on it.

It uses the concept of MapReduce for the execution of its scripts and the Hadoop Distributed File System or HDFS for storage and retrieval of data.

Following are the key principles underlying Hive.

Hive commands are similar to SQL commands. SQL is the standard language for querying relational databases and data warehouses, and HiveQL closely follows it.

Hence, learning Hive will not be a big challenge for those who are familiar with SQL.

Hive can be extended with pluggable MapReduce scripts written in the language of your choice, along with rich user-defined data types and user-defined functions.

Hive has an extensible framework to support different files and data formats.

Performance is better in Hive because the Hive engine compiles each query into an optimized execution plan, reducing execution time and delivering higher throughput.

In the next section of this lesson, we will discuss the system architecture and the components of Hive.

System Architecture and Components of Hive

The image below shows the architecture of the Hive system. It also illustrates the role of Hive and Hadoop in the development process.

In the next section of this lesson, we will discuss the basics of Hive Query Language.

Basics of Hive Query Language

Hive Query Language or HQL is the query language for the Hive engine.

Hive supports basic SQL queries such as:

  • From clause subquery

  • ANSI JOIN

  • Multi-table insert

  • Multi group-by

  • Sampling

  • Objects traversal

HQL also provides support for pluggable MapReduce scripts through the TRANSFORM command.
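To make this concrete, here is a minimal sketch of submitting HQL from a Java program through the HiveServer2 JDBC driver. It assumes a HiveServer2 instance listening at localhost:10000, and the employees table, its columns, and the credentials are purely illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Loads the Hive JDBC driver (the hive-jdbc jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // A simple HQL query with a FROM-clause subquery and GROUP BY.
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) AS emp_count "
              + "FROM (SELECT * FROM employees WHERE active = TRUE) t "
              + "GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString("dept") + " -> " + rs.getLong("emp_count"));
            }
        }
    }
}
```

Behind the scenes, Hive compiles each such statement into one or more MapReduce jobs, which is why query latency is batch-oriented rather than interactive.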

In the next section of this lesson, we will focus on tables in Hive.

Data Model – Tables

Hive tables are analogous to tables in relational databases.

A Hive table logically comprises the data that is stored and the associated metadata. Each table has a corresponding directory in HDFS.

There are two types of tables in Hive: managed tables and external tables. Hive owns and manages the data of a managed table, whereas an external table only points to data whose lifecycle is managed outside Hive.
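To make the distinction concrete, the minimal sketch below creates one table of each kind over the same kind of JDBC connection used above; the table names, columns, and HDFS location are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableTypesExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Managed table: Hive owns both the metadata and the data;
            // DROP TABLE removes the underlying files as well.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views_managed ("
                       + "user_id STRING, url STRING, view_time TIMESTAMP)");

            // External table: Hive only records the metadata and the location;
            // DROP TABLE leaves the files in HDFS untouched.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS page_views_external ("
                       + "user_id STRING, url STRING, view_time TIMESTAMP) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                       + "LOCATION '/data/page_views'");
        }
    }
}
```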

In the next section of this lesson, we will focus on data types in Hive.


Data Types in Hive

There are three categories of data types in Hive: primitive types (such as INT, STRING, and BOOLEAN), complex types (such as ARRAY, MAP, and STRUCT), and user-defined types. In the next section of this lesson, we will discuss serialization and deserialization.

Serialization and Deserialization

Serialization takes a Java object that Hive has been working with, and turns it into something that Hive can write to HDFS or another supported system.

Serialization is used when writing data, for example, through an INSERT-SELECT statement.

De-serialization is used during query time to execute SELECT statements.

Other facts related to serialization and deserialization are:

  • The interface used for performing serialization and de-serialization is SerDe.

  • In some situations, the interface used for de-serialization is LazySerDe.

  • Unstructured data gets converted into structured data due to the flexibility of the LazySerDe interface.

  • While using the LazySerDe interface, data is read based on the separation by different delimiter characters.

  • Contributed SerDe implementations, such as RegexSerDe, are packaged in ‘hive-contrib.jar’.
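As an illustration of SerDes in practice, the sketch below declares an external table that uses the contributed RegexSerDe to deserialize plain-text log lines at query time. The regular expression, column names, and file location are assumptions, and the hive-contrib jar must be available to Hive (for example via ADD JAR).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSerDeExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // The SerDe deserializes each raw line into (host, request) columns when
            // the table is read, and serializes rows back when data is written.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS access_log ("
                       + "host STRING, request STRING) "
                       + "ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' "
                       + "WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) ([^ ]*)') "
                       + "LOCATION '/data/access_log'");
        }
    }
}
```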

In the next section of this lesson, we will focus on User-Defined Functions and MapReduce scripts.

UDF/UDAF vs. MapReduce Scripts

The table below shows the comparison of User-Defined and User-Defined Aggregate Functions with MapReduce scripts.

Attribute          | UDF/UDAF                                                     | MapReduce scripts
-------------------|--------------------------------------------------------------|----------------------------------
Language           | UDF is written in Java                                       | Any language
1/1 input/output   | Supported via UDF                                            | Supported
n/1 input/output   | Supported via UDAF                                           | Supported
1/n input/output   | Supported via User-Defined Table Generating Function (UDTF) | Supported
Speed              | Faster (runs in the same process)                            | Slower (spawns a new process)

User-Defined Functions are written in Java while MapReduce scripts can be written in any language.

Both User-Defined Functions and MapReduce scripts support 1 to 1, n to 1, and 1 to n input to output. However, User-Defined Functions are faster than MapReduce scripts since the latter spawns new processes for different operations.
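For illustration, here is a minimal sketch of a 1-to-1 Hive UDF in Java using the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name, jar name, and the registration statements shown in the comment are illustrative.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * A simple 1-to-1 UDF that lower-cases a string.
 * After packaging it into a jar, it can be registered in Hive with:
 *   ADD JAR my-udfs.jar;
 *   CREATE TEMPORARY FUNCTION to_lower AS 'ToLowerUDF';
 *   SELECT to_lower(name) FROM employees;
 */
public final class ToLowerUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;              // pass nulls through unchanged
        }
        return new Text(input.toString().toLowerCase());
    }
}
```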

In the next section of this lesson, we will focus on an introduction to HBase.

HBase – Introduction

Apache HBase is a distributed, column-oriented database built on top of HDFS.

Apache HBase can scale horizontally across thousands of commodity servers and petabytes of indexed storage.

Apache HBase is an open-source, distributed, and versioned non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data.

Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. HBase supports random, real-time CRUD (read as “C-R-U-D”) operations.

CRUD stands for Create, Read, Update, and Delete.

The goal of HBase is to host very large tables with billions of rows and millions of columns, atop clusters of commodity hardware.

In the next section of this lesson, we will focus on the key characteristics of HBase.

Characteristics of HBase

HBase is a type of NoSQL database and is classified as a key-value store.

In HBase,

  • Each value is identified by a key.

  • Both keys and values are byte arrays, which means binary formats can be stored easily.

  • Values are stored in key order and can be accessed quickly by their keys.

HBase is a database in which tables have no fixed schema; column families, rather than individual columns, are defined at the time of table creation.
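As a sketch of how this looks in code, the snippet below uses the standard HBase Java client (HBase 2.x style API) to create a table with a single column family and then write and read one cell. The table name, column family, and the presence of a running HBase instance reachable through the default configuration are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            TableName name = TableName.valueOf("users");

            // Only the column family ("info") is declared up front;
            // individual columns are created on the fly when data is written.
            try (Admin admin = connection.getAdmin()) {
                if (!admin.tableExists(name)) {
                    admin.createTable(TableDescriptorBuilder.newBuilder(name)
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                            .build());
                }
            }

            try (Table table = connection.getTable(name)) {
                // Both row keys and values are byte arrays.
                Put put = new Put(Bytes.toBytes("user#1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                              Bytes.toBytes("alice@example.com"));
                table.put(put);

                Result result = table.get(new Get(Bytes.toBytes("user#1001")));
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(Bytes.toString(email));
            }
        }
    }
}
```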

In the next section of this lesson, we will focus on the HBase architecture.

HBase Architecture

HBase has two types of nodes: Master and RegionServer.

Master

There is only one Master node running at a time whereas there can be one or more RegionServers. The high availability of the Master node is maintained with ZooKeeper.

The Master node manages cluster operations such as region assignment, load balancing, and splitting. It is not part of the read or write path.

RegionServer

The RegionServer hosts tables, performs reads, and buffers writes. Clients communicate with RegionServers to read and write data.

A region in HBase is a subset of a table’s rows. The Master node detects the status of RegionServers and assigns regions to RegionServers.

In the next section of this lesson, we will compare HBase with a Relational Database Management System or RDBMS.

HBase vs. RDBMS

HBase provides certain advantages compared to a Relational Database Management System (RDBMS):

HBase                                                              | RDBMS
-------------------------------------------------------------------|-------------------------------------------------------------------
Automatic partitioning                                              | Usually manual, admin-driven partitioning
Scales linearly and automatically with new nodes                    | Usually scales vertically by adding more hardware resources
Uses commodity hardware                                             | Relies on expensive servers
Built-in fault tolerance                                            | Fault tolerance may or may not be present
Leverages batch processing with MapReduce distributed processing    | Relies on multiple threads or processes rather than MapReduce distributed processing

In the next section of this lesson, we will focus on an introduction to Cloudera.

Cloudera – Introduction

Cloudera provides a commercially supported platform for deploying Hadoop in an enterprise setup.

Following are the salient features of Cloudera.

Cloudera uses a 100% open-source distribution of Apache Hadoop and related projects such as Apache Pig, Apache Hive, Apache HBase, and Apache Sqoop.

Cloudera offers the user-friendly Cloudera Manager for system management, Cloudera Navigator for data management, dedicated technical support, and so on.

In the next section, we will explore Cloudera distribution.

Cloudera Distribution

Just as Linux is open source yet packaged and distributed by commercial vendors, Cloudera and many other vendors offer Hadoop as a commercial distribution.

Cloudera’s distribution is known as CDH or Cloudera Distribution including Apache Hadoop, and it delivers the core elements of Hadoop.

These elements include scalable storage, distributed computing and additional components such as a user interface, and necessary enterprise capabilities such as security. CDH includes the core elements of Apache Hadoop and several key open source projects.

These projects, when coupled with customer support, management, and governance through a Cloudera Enterprise subscription, can deliver an enterprise data hub.

In the next section of this lesson, we will focus on Cloudera Manager.

Cloudera Manager

Cloudera Manager is used to administer Apache Hadoop. It is used to configure the following, among others:

  • HDFS

  • Hive engine

  • Hue

  • MapReduce

  • Oozie

  • ZooKeeper

  • Flume

  • HBase

  • Cloudera Impala

  • Cloudera Search

  • YARN

In the next section of this lesson, we will discuss the Hortonworks Data Platform.

Hortonworks Data Platform

Hortonworks Data Platform or HDP enables Enterprise Hadoop with a suite of essential capabilities that serve as the functional definition of any data platform technology.

It has a comprehensive set of capabilities aligned to functional areas such as data management, data access, data governance and integration, security, and operations.

HDP can be downloaded from the URL mentioned below.

http://hortonworks.com/hdp/downloads/

In the next section of this lesson, we will look at the MapR data platform.

MapR Data Platform

The MapR data platform supports more than 20 open source projects. It also supports multiple versions of the individual projects, thereby allowing users to migrate to the latest versions at their own pace.

The image below shows all the projects actively supported in the current General Availability or GA version of MapR Distribution for Hadoop—M7.

MapR can be downloaded from the URL mentioned below.

https://www.mapr.com/products/hadoop-download

In the next section of this lesson, we will focus on Pivotal HD, another commercial distribution of Hadoop.

Pivotal HD

Pivotal HD is a commercially supported, enterprise-capable distribution of Hadoop. It consists of GemFire XD® along with toolsets such as HAWQ, MADlib, OpenMPI, GraphLab, and Spring XD.

Pivotal HD can be downloaded from the URL mentioned below.

https://network.pivotal.io/products/big-data

Pivotal HD aims to accelerate data analytics projects and significantly expands Hadoop’s capabilities. Pivotal GemFire brings real-time analytics to Hadoop, enabling businesses to process and make critical decisions immediately.

In the next section of this lesson, we will focus on an introduction to ZooKeeper.

Introduction to ZooKeeper

ZooKeeper is an open-source and high-performance coordination service for distributed applications.

It offers services such as:

  • Naming

  • Locks and synchronization

  • Configuration management

  • Group services
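To give a feel for these services, here is a minimal sketch that uses the ZooKeeper Java client for simple configuration management: it publishes a value under a znode and reads it back. The connection string, znode path, and value are illustrative assumptions.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumes a ZooKeeper ensemble reachable at localhost:2181.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait until the session is established
        try {
            String path = "/batch-size";
            byte[] value = "500".getBytes("UTF-8");

            // Create the znode if it does not exist yet (configuration management).
            if (zk.exists(path, false) == null) {
                zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any client in the cluster now reads the same, consistent value.
            byte[] stored = zk.getData(path, false, null);
            System.out.println("batch-size = " + new String(stored, "UTF-8"));
        } finally {
            zk.close();
        }
    }
}
```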

In the next section of this lesson, we will discuss the features of ZooKeeper.

Features of ZooKeeper

Some of the salient features of ZooKeeper are as follows:

ZooKeeper provides a simple and high-performance kernel for building more complex coordination primitives at the client. It also provides distributed co-ordination services for distributed applications.

ZooKeeper processes requests from each client in FIFO, that is, First-In-First-Out, order. It allows synchronization, serialization, and coordination of the nodes in a Hadoop cluster, and its pipelined architecture supports a wait-free approach.

ZooKeeper helps avoid coordination problems such as deadlocks through its built-in detection and prevention algorithms, and it handles many requests concurrently to reduce the wait time for process execution.

ZooKeeper also allows for distributed processing. Thus, it is compatible with services related to MapReduce.

In the next section of this lesson, we will focus on the goals of ZooKeeper.

Goals of ZooKeeper

The goals of ZooKeeper are as follows:

  • Serialization ensures that read and write operations are applied in order, avoiding delays.

  • Reliability means that once an update is applied by a user, it persists in the cluster until it is overwritten.

  • Atomicity does not allow partial results: any user update either succeeds completely or fails.

  • A simple Application Programming Interface or API provides an interface for development and implementation.

In the next section of this lesson, we will discuss the typical uses of ZooKeeper.

Uses of ZooKeeper

The uses of ZooKeeper are as follows:

  • Configuration refers to ensuring that the nodes in the cluster are in sync with each other and also with the NameNode server.

  • Message queuing refers to the communication with the nodes present in the cluster.

  • Notification refers to the process of notifying the NameNode of any failure that occurs in the cluster so that the specific task can be restarted from another node.

  • Synchronization refers to ensuring that all the nodes in the cluster are in sync with each other, and the services are up and running.

In the next section of this lesson, we will focus on what Sqoop is and why it is used.

Sqoop – Reasons to Use It

Sqoop is an Apache Hadoop ecosystem project that performs import and export operations between relational databases such as MySQL (read as My S-Q-L), MSSQL (read as M-S S-Q-L), and Oracle, and HDFS.

Listed below are the reasons to use Sqoop.

  • SQL servers are deployed worldwide and are the primary means of accepting data from users.

  • Nightly processing has been done on SQL servers for years.

  • As Hadoop makes its way into enterprises, it is essential to have a mechanism to move data from traditional SQL databases to Hadoop HDFS.

  • Transferring the data using automated scripts is inefficient and time-consuming.

  • Traditional databases have reporting, data visualization, and other enterprise applications built around them, but handling very large data requires an ecosystem such as Hadoop.

  • Sqoop satisfies the need to bring processed data from Hadoop HDFS back to applications such as database engines or web services.

In the next section of this lesson, we will continue to discuss why Sqoop is needed.

Sqoop – Reasons to Use It (contd.)

Sqoop is required when data needs to be moved between a Relational Database or RDB and Hadoop, in either direction.

A Relational Database stores data in a structured format; databases in MySQL or Oracle are examples of RDBs.

Importing data from an RDB into Hadoop

While importing data from a Relational Database into Hadoop, users must consider the consistency of the data, the consumption of production system resources, and the preparation of the data for provisioning the downstream pipeline.

Exporting data from Hadoop to an RDB

While exporting data from Hadoop to a Relational Database, users must keep in mind that directly accessing data residing on external systems from within the MapReduce framework complicates applications and exposes the production system to excessive loads originating from cluster nodes. Hence, Sqoop is needed.
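As a rough sketch of what a typical transfer looks like, the Java snippet below simply launches the standard sqoop import command line; the JDBC URL, credentials file, table, and target directory are all illustrative assumptions, and the sqoop binary is assumed to be on the PATH.

```java
import java.io.IOException;

public class SqoopImportLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Builds the documented "sqoop import" command line:
        // copy the "orders" table from MySQL into an HDFS directory.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "etl_user",
                "--password-file", "/user/etl/.db_password",
                "--table", "orders",
                "--target-dir", "/data/sales/orders",
                "--num-mappers", "4");
        pb.inheritIO();                        // stream Sqoop's output to the console
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}
```

The corresponding sqoop export command, pointed at an HDFS directory with --export-dir, moves data in the opposite direction, back into a relational table.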

In the next section of this lesson, we will discuss the benefits of using Sqoop.

Benefits of Sqoop

The benefits of using Sqoop are as follows:

  • It is a tool designed to transfer data from Hadoop to an RDB and vice versa.

  • It transforms data in Hadoop with the help of MapReduce or Hive without extra coding.

  • It is used to import data from a relational database such as MySQL, SQL Server, or Oracle into the Hadoop Distributed File System.

  • Sqoop exports data back to the RDB.

In the next section of this lesson, we will focus on the Apache Hadoop ecosystem.

Apache Hadoop Ecosystem

The image shown below displays the various Hadoop ecosystem components as part of Apache Software Foundation projects.

Please note there are many other commercial and open source offerings apart from the Apache projects mentioned in this section.

The Hadoop ecosystem components have been categorized as follows:

  • File system

  • Data store

  • Serialization

  • Job execution

  • Work management

  • Development

  • Operations

  • Security

  • Data transfer

  • Data interactions

  • Analytics and intelligence

  • Search processing

  • Graph processing

In the next few sections, we will discuss some of the Hadoop ecosystem components. We will start with Apache Oozie in the following section.


Apache Oozie

Apache Oozie is a workflow scheduler system used to manage Hadoop MapReduce jobs. The workflow scheduler gives users the option to prioritize jobs based on their requirements.

Functions of Oozie

Following are the functions of Oozie.

  • Apache Oozie executes and monitors workflows in Hadoop.

  • It also performs periodic scheduling of workflows.

  • Oozie has the capability to trigger the execution of workflows based on data availability.

  • It also provides a web interface and a command-line interface (CLI).
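For illustration, here is a minimal sketch that submits and monitors a workflow through the Oozie Java client; the Oozie URL and the HDFS application path (which must contain a workflow.xml describing the actual actions) are assumptions.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Assumes an Oozie server on the default port of localhost.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // HDFS directory that contains workflow.xml describing the job graph.
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode:8020/user/etl/workflows/daily-load");
        conf.setProperty("queueName", "default");

        String jobId = oozie.run(conf);            // submit and start the workflow
        System.out.println("Submitted workflow " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId); // poll the current status
        System.out.println("Status: " + job.getStatus());
    }
}
```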

In the next section, we will focus on an introduction to Mahout.

Introduction to Mahout

Mahout is an ecosystem component that is dedicated to machine learning. The machine learning process can be carried out in three modes, namely, supervised, unsupervised, and semi-supervised.

In the next section, we will focus on the usage of Mahout.

Usage of Mahout

Mahout helps in clustering, which is one of the most popular techniques of machine learning.

Clustering allows the system to group numerous entities into separate clusters or groups based on certain characteristics or features of the entities. One of the best examples of clustering is seen in the Google News section.

In the next section, we will focus on an introduction to Apache Cassandra.

Apache Cassandra

Apache Cassandra is a freely distributed, high-performance, extremely scalable, and fault-tolerant post-relational database. It has the following features:

  • It is designed keeping in mind that system or hardware failures can occur.

  • Cassandra follows a read- and write-anywhere design, which makes it different from other ecosystem components.

The benefits of Cassandra are as follows:

  • It performs Online Transaction Processing or OLTP operations and Online Analytical Processing or OLAP operations; and

  • It helps to modify real-time data and perform data analytics.

In the next section, we will discuss Apache Spark.

Apache Spark

Apache Spark is a fast and general MapReduce-like engine for large-scale data processing.

Following are the key advantages of Spark:

Speed

  • Spark claims to run programs up to 100 times faster than Hadoop MapReduce in memory, or ten times faster on disk.

  • Spark has an advanced DAG (read as D-A-G) execution engine that supports cyclic data flow and in-memory computing.

Ease of use

  • It offers support to write applications quickly in Java, Scala, or Python.

  • It offers interactive Scala and Python shells.

Generality

  • It can combine SQL, streaming, and complex analytics.

  • It powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.

Integration with Hadoop

  • Spark can run on the YARN (read as one word ‘yarn’) cluster manager of Hadoop 2.

  • Spark can read any existing Hadoop data.
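To make this concrete, here is a minimal word-count sketch using Spark's Java API (Spark 2.x style); the HDFS paths are illustrative assumptions, and the local[*] master setting is only for trying the example on a single machine, whereas on a cluster it would typically run on YARN.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Spark can read any existing Hadoop data, e.g. a file in HDFS.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs://namenode:8020/data/wordcounts");
        }
    }
}
```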

In the next section, we will discuss Apache Ambari.

Apache Ambari

Apache Ambari is a completely open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters.

Ambari enables system administrators to:

  • Provision a Hadoop cluster

  • Manage a Hadoop cluster

  • Monitor a Hadoop cluster

  • Integrate Hadoop with enterprise operational tools

In the next section, we will list the key features of Apache Ambari.

Key Features of Apache Ambari

Some of the key features of Apache Ambari are as follows:

  • It has a wizard-driven installation for Hadoop across any number of hosts.

  • Ambari provides API-driven installation of Hadoop via Ambari Blueprints for automated provisioning.

  • It offers granular control of Hadoop service and component life cycles.

  • It helps manage Hadoop service configurations and provides advanced job diagnostic and visualization tools.

  • Ambari has robust RESTful APIs for customization and integration with enterprise systems.
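As a small illustration of those RESTful APIs, the sketch below lists the clusters managed by an Ambari server over plain HTTP; the host, port, and credentials are assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariClustersExample {
    public static void main(String[] args) throws Exception {
        // Assumes Ambari is reachable on its default port with admin credentials.
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // JSON listing of the managed clusters
            }
        } finally {
            conn.disconnect();
        }
    }
}
```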

In the next section, we will focus on Kerberos that ensures Hadoop security.

Hadoop Security – Kerberos

Hadoop relies on Kerberos for secure authentication.

Kerberos is a third-party authentication mechanism in which users, and the services that users wish to access, rely on the Kerberos server to authenticate each to the other.

The Kerberos server, also known as the Key Distribution Center or KDC, has three parts:

Principals

Principals is a database of the users and services along with their respective Kerberos passwords.

Authentication Server

Authentication Server or AS is meant for initial authentication and issuing a Ticket Granting Ticket or TGT.

Ticket Granting Server

Ticket Granting Server or TGS is meant for issuing subsequent service tickets based on the initial TGT.
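For illustration, here is a minimal sketch of how a Java client typically authenticates to a Kerberos-secured Hadoop cluster using Hadoop's UserGroupInformation class. The principal, keytab path, and HDFS directory are assumptions, and the cluster's configuration files (core-site.xml, hdfs-site.xml) are expected to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client libraries that the cluster uses Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Obtain a TGT from the KDC using a service principal and its keytab,
        // instead of prompting for a password.
        UserGroupInformation.loginUserFromKeytab(
                "etl-service@EXAMPLE.COM", "/etc/security/keytabs/etl-service.keytab");

        // Subsequent Hadoop calls (e.g. listing an HDFS directory) are then
        // authenticated with service tickets issued by the Ticket Granting Server.
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/user"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```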

Summary

Let us summarize the topics covered in this lesson:

  • Hive is a data warehouse system facilitating the analysis of large datasets in Hadoop.

  • HBase is a distributed column-oriented database built on top of HDFS.

  • Cloudera offers the user-friendly Cloudera Manager for system management.

  • Hortonworks Data Platform, MapR data platform, and Pivotal HD are some of the commercial distributions of Hadoop.

  • Some of the components of the Hadoop ecosystem are Oozie, Cassandra, and Spark.

Conclusion

With this, we have come to the end of the Introduction to Big Data and Hadoop tutorial.
