Hadoop Administration Course Objectives
Hadoop Administration Training
HADOOP ADMINISTRATION TRAINING CURRICULUM
1.1 Big Data Introduction
1.1.1 What is Big Data?
1.1.2 Big Data - Why
1.1.3 Big Data - Journey
1.1.4 Big Data Statistics
1.1.5 Big Data Analytics
1.1.6 Big Data Challenges
1.1.7 Technologies Supported By Big Data
1.2 Hadoop Introduction
1.2.1 What Is Hadoop?
1.2.2 History Of Hadoop
1.2.3 Breakthroughs Of Hadoop
1.2.4 Future of Hadoop
1.2.5 Who Is Using?
1.3 Basic Concepts
1.3.1 The Hadoop Distributed File System - At a Glance
1.3.2 Hadoop Daemon Processes
1.3.3 Anatomy Of A Hadoop Cluster
1.3.4 Hadoop Distributions
2 HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
2.1 What is HDFS?
2.1.1 Distributed File System (DFS)
2.1.2 Hadoop Distributed File System (HDFS)
2.2 HDFS Cluster Architecture and Block Placement
2.2.5 Secondary NameNode
2.3 HDFS Concepts
2.3.1 Typical Workflow
2.3.2 Data Replication
2.3.3 Replica Placement
2.3.4 Replication Policy
2.3.5 Hadoop Rack Awareness
2.3.6 Anatomy of a File Read
2.3.7 Anatomy of a File Write
3.1 STAGES OF MAPREDUCE
3.2.1 Job Tracker
3.2.2 Task Tracker
3.3 TASK FAILURES
3.3.2 Task Tracker Failures
3.3.3 Job Tracker Failures
3.3.4 HDFS Failures
4. HOW TO PLAN A CLUSTER
4.1 VERSIONS AND FEATURES
4.2 HARDWARE SELECTION
4.2.1 Master Hardware
4.2.2 Slave Hardware
4.2.3 Cluster sizing
4.3 OPERATING SYSTEM SELECTION
4.3.1 Deployment Layout
4.3.2 Software Packages
4.3.3 Hostname, DNS
4.3.4 Users, Groups, Privileges
4.4 DISK CONFIGURATION
4.4.1 Choose a FileSystem
4.4.2 Mount options
4.5 NETWORK DESIGN
4.5.1 Network usage in Hadoop
4.5.2 Typical network Topologies
5. INSTALLATION AND CONFIGURATION
5.1 APACHE HADOOP
5.1.1 Tarball Installation
5.1.2 Package Installation
5.2.1 XML Configuration
5.2.2 Environment Variables
5.2.3 Logging Configuration
5.3.1 Optimization and Tuning
5.4.1 Optimization and Tuning
6.1 KERBEROS AND HADOOP
6.1.2 Configuring Hadoop Security
7. RESOURCE MANAGEMENT
7.1 WHAT IS RESOURCE MANAGEMENT?
7.2 MAPREDUCE SCHEDULER
7.2.1 Capacity Scheduler
7.2.2 Fair Scheduler
8. CLUSTER MAINTENANCE
8.1 MANAGING HADOOP PROCESS
8.1.1 Starting and stopping processes with Init scripts
8.1.2 Starting and stopping processes manually
8.2 HDFS MAINTENANCE
8.2.1 Adding and Decommissioning DataNode
8.2.2 Balancing HDFS Block Data
8.2.3 Dealing with a Failed disk
8.3 MAPREDUCE MAINTENANCE
8.3.1 Adding and Decommissioning TaskTracker
8.3.2 Kill MapReduce Job and Task
8.3.3 Dealing Blacklisted Tasktracker
9.1 COMMON FAILURES AND PROBLEMS
9.2 HDFS AND MAPREDUCE CHECKS
10. BACKUP AND RECOVERY
10.1 DATA BACKUP
10.1.1 Distributed copy
10.1.2 Parallel data ingestion
10.2 NAMENODE METADATA
Workshop style coaching
Hands on practice exercises
Quiz at the end of each major topic
Tips and techniques on Cloudera Certification Examination
Mock interviews for each individual will be conducted on need basis
Resume preparation and guidance
Hadoop is the most important framework for working with Big Data in a distributed environment. Due to the rapid deluge of Big Data and the need for real-time insights from huge volumes of data, the job of a Hadoop administrator is critical to large organizations. Hence, there is huge demand for professionals with the right skills and certification. Intellipaat is offering the industry-designed Hadoop administration training to help you master this domain.
Hadoop Administration Interview Questions
1) What is Hadoop?
Hadoop evolved as the solution to the “Big Data” problem. It is a framework that offers a number of tools and services to store and process Big Data. It also plays an important role in analyzing Big Data and making efficient business decisions in situations where traditional methods fall short.
2) Name the Main Components of a Hadoop Application.
Hadoop offers a vast toolset that makes it possible to store and process data very easily. The main components of Hadoop are:
· Hadoop Common
· Hadoop MapReduce
· PIG and HIVE – The Data Access Components.
· HBase – For Data Storage
· Apache Flume, Sqoop, Chukwa – The Data Integration Components
· Ambari, Oozie and ZooKeeper – Data Management and Monitoring Component
· Thrift and Avro – Data Serialization components
· Apache Mahout and Drill – Data Intelligence Components
3) How many Input Formats are there in Hadoop? Explain.
There are three input formats in Hadoop:
1. Text Input Format: the default input format in Hadoop.
2. Sequence File Input Format: used to read files in sequence.
3. Key Value Input Format: used for plain text files.
4) What do you know about YARN?
YARN stands for Yet Another Resource Negotiator; it is the Hadoop processing framework. YARN is responsible for managing resources and establishing an execution environment for processes.
5) Why are nodes added and removed frequently in a Hadoop cluster?
The following features of the Hadoop framework lead a Hadoop administrator to add (commission) and remove (decommission) DataNodes in a Hadoop cluster:
1. The Hadoop framework utilizes commodity hardware, which is one of its important features. This results in frequent DataNode crashes in a Hadoop cluster.
2. Ease of scaling is another important feature of the Hadoop framework; nodes are commissioned as data volume grows rapidly.
6) What do you understand by “Rack Awareness”?
In Hadoop, Rack Awareness is the algorithm through which the NameNode determines how blocks and their replicas are placed in the Hadoop cluster. This is done via rack definitions that minimize the traffic between DataNodes in different racks. Let’s take an example: we know that the default replication factor is 3. According to the “Replica Placement Policy”, two replicas of every block are stored on nodes in a single rack, while the third replica is stored on a node in a different rack.
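Rack definitions are usually supplied by a topology script that maps each host address to a rack path; Hadoop calls the script configured in topology.script.file.name (Hadoop 1.x). A minimal sketch, assuming hypothetical 10.1.x and 10.2.x rack subnets:

```shell
#!/bin/sh
# Hypothetical topology script: Hadoop passes one or more IPs/hostnames
# and expects one rack path printed per argument.
rack_for() {
  case "$1" in
    10.1.*) echo /rack1 ;;        # first rack's subnet (assumed)
    10.2.*) echo /rack2 ;;        # second rack's subnet (assumed)
    *)      echo /default-rack ;; # fallback for unknown hosts
  esac
}
for host in "$@"; do
  rack_for "$host"
done
```

With these assumed subnets, a host such as 10.1.0.5 resolves to /rack1 and anything unrecognized falls back to /default-rack.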
7) What daemons are needed to run a Hadoop cluster?
DataNode, NameNode, TaskTracker, and JobTracker are required to run a Hadoop cluster.
8) Which OS are supported by Hadoop deployment?
The main OS used for Hadoop is Linux. However, with some additional software, it can also be deployed on the Windows platform.
9) What are the common Input Formats in Hadoop?
Three widely used input formats are:
1. Text Input: the default input format in Hadoop.
2. Key Value: used for plain text files.
3. Sequence: used for reading files in sequence.
10) What modes can Hadoop code be run in?
Hadoop can be deployed in
1. Standalone mode
2. Pseudo-distributed mode
3. Fully distributed mode.
11) What is the main difference between RDBMS and Hadoop?
RDBMS is used in transactional systems to store and process data, whereas Hadoop is used to store and process huge amounts of data.
12) What are the important hardware requirements for a Hadoop cluster?
There are no specific hardware requirements for DataNodes.
However, the NameNode needs a specific amount of RAM to hold the filesystem image in memory. How much depends on the particular design of the primary and secondary NameNode.
13) How would you deploy different components of Hadoop in production?
You need to deploy the JobTracker and NameNode on the master node, and then deploy DataNodes on multiple slave nodes.
14) What do you need to do as Hadoop admin after adding new datanodes?
The cluster picks up new DataNodes automatically, but existing block data is not moved onto them. To optimize cluster performance, you should start the balancer, which redistributes block data evenly across all DataNodes.
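The rebalancing step can be sketched as a guarded command; the 10% threshold below is illustrative, and the guard makes the script a harmless no-op on machines without a Hadoop client:

```shell
#!/bin/sh
# Rebalance block data after commissioning new DataNodes.
if command -v hadoop >/dev/null 2>&1; then
  # Move blocks until each DataNode's usage is within 10% of the cluster average
  hadoop balancer -threshold 10
else
  echo "hadoop client not found; run this on a cluster node"
fi
```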
15) Which Hadoop shell commands can be used for copy operations?
The copy operation commands are hadoop fs -put, hadoop fs -copyFromLocal, and hadoop fs -copyToLocal.
16) What is the Importance of the namenode?
The role of the NameNode is very crucial in Hadoop; it is the brain of the cluster. It is largely responsible for managing the distribution of blocks across the system, and it supplies the addresses of the requested data blocks when a client makes a request.
17) Explain how you will restart a NameNode?
The easiest way is to stop all the daemons by running the stop-all.sh script and then restart them, including the NameNode, with start-all.sh. Alternatively, you can restart only the NameNode with hadoop-daemon.sh stop namenode followed by hadoop-daemon.sh start namenode.
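A guarded sketch of restarting only the NameNode daemon, assuming the Hadoop 1.x helper scripts are on the PATH:

```shell
#!/bin/sh
# Restart the NameNode daemon alone; guarded so this is a no-op
# on machines where Hadoop is not installed.
if command -v hadoop-daemon.sh >/dev/null 2>&1; then
  hadoop-daemon.sh stop namenode
  hadoop-daemon.sh start namenode
else
  echo "hadoop-daemon.sh not on PATH; run this on the master node"
fi
```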
18) What happens when the NameNode is down?
If the NameNode is down, the file system goes offline.
19) Is it possible to copy files between different clusters? If yes, How can you achieve this?
Yes, we can copy files between multiple Hadoop clusters. This can be done using distributed copy.
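Distributed copy is invoked with the DistCp tool; a guarded sketch with hypothetical NameNode hostnames and paths:

```shell
#!/bin/sh
# Copy a directory from cluster A to cluster B with DistCp
# (namenode-a/namenode-b and /data are illustrative names).
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp hdfs://namenode-a:8020/data hdfs://namenode-b:8020/data
else
  echo "hadoop client not found; run this on a cluster node"
fi
```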
20) Is there any standard method to deploy Hadoop?
No, there is no standard procedure for deploying Hadoop. There are a few general requirements common to all Hadoop distributions, but the specific methods differ for each Hadoop admin.
21) List few Hadoop shell commands that are used to perform a copy operation.
- fs -put
- fs -copyToLocal
- fs -copyFromLocal
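Typical usage of the three commands, guarded so it only runs where a Hadoop client exists (the file paths are illustrative):

```shell
#!/bin/sh
# Copy files between the local filesystem and HDFS.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -copyFromLocal /tmp/sales.csv /user/demo/       # local -> HDFS
  hadoop fs -put /tmp/sales.csv /user/demo/sales2.csv       # local -> HDFS (synonym)
  hadoop fs -copyToLocal /user/demo/sales.csv /tmp/out.csv  # HDFS -> local
else
  echo "hadoop client not found; run this on a cluster node"
fi
```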
22) What is jps command used for?
The jps command is used to verify whether the daemons that run the Hadoop cluster are working. Its output shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker, and JobTracker.
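jps ships with the JDK, so the check is a one-liner; guarded here in case no JDK is installed:

```shell
#!/bin/sh
# List local JVMs by pid and main class name.
if command -v jps >/dev/null 2>&1; then
  jps   # on a master node one would expect NameNode and JobTracker entries
else
  echo "jps not found; install a JDK"
fi
```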
23) What are the important hardware considerations when deploying Hadoop in production environment?
- Memory-System’s memory requirements will vary between the worker services and management services based on the application.
- Operating System - a 64-bit operating system avoids any restrictions to be imposed on the amount of memory that can be used on worker nodes.
- Storage- It is preferable to design a Hadoop platform by moving the compute activity to data to achieve scalability and high performance.
- Capacity- Large Form Factor (3.5”) disks cost less and allow to store more, when compared to Small Form Factor disks.
- Network - Two TOR switches per rack provide better redundancy.
- Computational Capacity- This can be determined by the total number of MapReduce slots available across all the nodes within a Hadoop cluster.
24) How many NameNodes can you run on a single Hadoop cluster?
In Hadoop 1.x, only one NameNode runs per cluster; in Hadoop 2.x, a high-availability cluster runs two NameNodes, one active and one standby.
25) What happens when the NameNode on the Hadoop cluster goes down?
The file system goes offline whenever the NameNode is down.
26) What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?
This file provides the environment for Hadoop to run and contains variables such as HADOOP_CLASSPATH, JAVA_HOME, and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.
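A sketch of the entries one might put in conf/hadoop-env.sh; the paths below are examples, not requirements:

```shell
# Illustrative conf/hadoop-env.sh entries (paths are examples only)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64            # required for Hadoop to start
export HADOOP_LOG_DIR=/var/log/hadoop                         # where daemon logs are written
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/extra-jars/*" # optional extra jars
```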
27) Apart from using the jps command is there any other way that you can check whether the NameNode is working or not.
Use the command: /etc/init.d/hadoop-0.20-namenode status.
28) In a MapReduce system, suppose the HDFS block size is 64 MB and there are 3 files of sizes 127 MB, 64 KB, and 65 MB with FileInputFormat. Under this scenario, how many input splits are likely to be made by the Hadoop framework?
5 splits in total: 2 splits each for the 127 MB and 65 MB files and 1 split for the 64 KB file.
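The split count follows from dividing each file size by the block size and rounding up; a quick check in shell arithmetic:

```shell
#!/bin/sh
block=$((64 * 1024 * 1024))        # 64 MB HDFS block size
splits_for() {
  # ceil(file_size / block_size); every non-empty file gets at least one split
  echo $(( ($1 + block - 1) / block ))
}
splits_for $((127 * 1024 * 1024))  # 127 MB -> 2
splits_for $((64 * 1024))          # 64 KB  -> 1
splits_for $((65 * 1024 * 1024))   # 65 MB  -> 2
```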
29) Which command is used to verify if the HDFS is corrupt or not?
The hadoop fsck (file system check) command is used to check for missing, corrupt, or under-replicated blocks.
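A guarded sketch of running the check on the whole filesystem; the extra flags add per-file block detail:

```shell
#!/bin/sh
# Report HDFS health from the root path; -files/-blocks/-locations
# show each file's blocks and where the replicas live.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fsck / -files -blocks -locations
else
  echo "hadoop client not found; run this on a cluster node"
fi
```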
30) What are the essential features of Hadoop?
The Hadoop framework is capable of solving many questions of Big Data analysis. It is designed on Google’s MapReduce, which in turn is based on Google’s Big Data file system. Its essential features include open-source licensing, distributed storage and processing, fault tolerance through replication, and horizontal scalability on commodity hardware.
31) What is the main difference between an “Input Split” and “HDFS Block”?
An “Input Split” is the logical division of the data, while an “HDFS Block” is the physical division of the data.
32) State some of the important features of Hadoop.
The important features of Hadoop are –
- Hadoop framework is designed on Google MapReduce that is based on Google’s Big Data File Systems.
- Hadoop framework can solve many questions efficiently for Big Data analysis.
33) Do you know some companies that are using Hadoop?
Yahoo – using Hadoop
Facebook – developed Hive for analysis
Amazon, Adobe, Spotify, Netflix, eBay, and Twitter are some other well-known and established companies that are using Hadoop.
34) How can you differentiate RDBMS and Hadoop?
The key points that differentiate RDBMS and Hadoop are –
1. RDBMS is made to store structured data, whereas Hadoop can store any kind of data i.e. unstructured, structured, or semi-structured.
2. RDBMS follows “Schema on write” policy while Hadoop is based on “Schema on read” policy.
3. In RDBMS, the schema of the data is already known, which makes reads fast; in HDFS, no schema validation happens during writes, which makes writes fast.
4. RDBMS is licensed software, so one needs to pay for it, whereas Hadoop is open source software, so it is free of cost.
5. RDBMS is used for Online Transactional Processing (OLTP) system whereas Hadoop is used for data analytics, data discovery, and OLAP system as well.
35) What are the differences between Hadoop 1 and Hadoop 2?
The following two points explain the difference between Hadoop 1 and Hadoop 2:
In Hadoop 1.x, there is a single NameNode, which is thus a single point of failure, whereas in Hadoop 2.x there are Active and Passive NameNodes. If the active NameNode fails, the passive NameNode replaces it and takes charge. As a result, Hadoop 2.x provides high availability.
In Hadoop 2.x, YARN provides a central ResourceManager that shares cluster resources across multiple applications, whereas in Hadoop 1.x MapReduce is the only supported processing framework, which makes data processing a bottleneck.
36) What do you know about active and passive NameNodes?
In high-availability Hadoop architecture, two NameNodes are present.
Active NameNode – the NameNode that serves all client requests in the Hadoop cluster.
Passive NameNode – the standby NameNode that stores the same metadata as the active NameNode.
On failure of the active NameNode, the passive NameNode replaces it and takes charge. This way, there is always a running NameNode in the cluster, so the cluster never goes offline due to a NameNode failure.
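In Hadoop 2.x, HA state can be inspected and switched with the haadmin tool; a guarded sketch where nn1/nn2 are hypothetical NameNode IDs taken from hdfs-site.xml:

```shell
#!/bin/sh
# Inspect and switch HA state (nn1/nn2 are illustrative NameNode IDs
# configured under dfs.ha.namenodes.* in hdfs-site.xml).
if command -v hdfs >/dev/null 2>&1; then
  hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
  hdfs haadmin -failover nn1 nn2      # hand the active role from nn1 to nn2
else
  echo "hdfs client not found; run this on a cluster node"
fi
```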
37) What are the Components of Apache HBase?
Apache HBase Consists of the following main components:
• Region Server: A Table can be divided into several regions. A group of these regions gets served to the clients by a Region Server.
• HMaster: This coordinates and manages the Region server.
• ZooKeeper: This acts as a coordinator inside the HBase distributed environment. It maintains server state inside the cluster by communicating in sessions.
This course is specifically designed to help you clear the certification exam successfully. The comprehensive content of the course, along with demonstrations of practical scenarios and examples, will help you understand each and every topic in great depth. Since the course structure has a special focus on certification, you will go through a lot of real-time case studies and study material during the training that will help you crack the certification exam.
The NameNode continuously receives a signal (heartbeat) from each DataNode in the Hadoop cluster, which indicates that the DataNode is functioning properly. The list of all blocks present on a DataNode is stored in a block report. If a DataNode fails to send the heartbeat to the NameNode, it is marked dead after a specific time period. The NameNode then replicates the blocks of the dead node to other DataNodes from the earlier created replicas.
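An administrator can see which DataNodes the NameNode currently considers live or dead with a guarded one-liner:

```shell
#!/bin/sh
# Print cluster capacity figures plus the live and dead DataNode lists.
if command -v hadoop >/dev/null 2>&1; then
  hadoop dfsadmin -report
else
  echo "hadoop client not found; run this on a cluster node"
fi
```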
The process of NameNode recovery helps to keep the Hadoop cluster running, and can be explained by the following steps –
Step 1: To start a new NameNode, utilize the file system metadata replica (FsImage).
Step 2: Configure the clients and DataNodes to acknowledge the new NameNode.
Step 3: Once the new NameNode completes loading the last checkpoint FsImage and receives block reports from the DataNodes, it starts serving clients.
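Step 1 above can be sketched with the importCheckpoint option, which loads the latest checkpoint image into an empty name directory (a sketch, assuming a Hadoop 1.x installation on the new master):

```shell
#!/bin/sh
# Start a new NameNode from the secondary NameNode's checkpoint image.
if command -v hadoop >/dev/null 2>&1; then
  hadoop namenode -importCheckpoint
else
  echo "hadoop not found; run this on the new master node"
fi
```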
The different schedulers available in Hadoop are –
FIFO Scheduler – the default; it runs jobs in the order of submission.
Capacity Scheduler – divides cluster capacity among queues so that multiple organizations can share the cluster.
Fair Scheduler – gives every running job a fair share of cluster resources over time.
COSHH – makes scheduling decisions by considering the cluster, the workload, and their heterogeneity.
Both self-paced training and online instructor-led training have their own advantages and disadvantages.
1) Suitability - If you have no idea about the course content and no experience with it, online instructor-led training will help you understand the content better and more deeply.
2) Flexibility - Self-paced training is generally more flexible than tutor-led training, since you can learn at your own pace through the videos as and when you have time.
3) Doubt-clearing, assignments, etc. - In instructor-led training you can attempt assignments, get feedback from the tutor, and get your doubts cleared during the class.
Cost (higher for instructor-led training) and other factors are also important.
We have a highly qualified and experienced team of professionals who are experts in their fields. Our trainers are highly supportive and render a friendly learning environment to the students focusing on their career growth.
In normal circumstances, the tutor should be able to reschedule the class at a time convenient to you. If you accidentally miss a particular class in a multi-student batch, you can catch up from the corresponding session recording, which is shared with all students, or you can request the tutor to hold a separate class for you later on the topics covered in that class.
We provide server access for you to practice on and our trainers will ensure that you get practical real-time experience and training with all the utilities required for in-depth understanding of the course.
Yes. We provide lifetime access to Uplatz Learning Hub where you can view or download the course material anytime.
Yes. We have special offer/discount on this course. Please send email to firstname.lastname@example.org asking for the discount and our team will be glad to help you.
Yes. Please check our refunds and cancellation policy (link given in the footer of our website) for more details.
You can send an email to email@example.com and Uplatz course team will respond back on your queries.