Hadoop Administration Course Objectives
Hadoop Administration Training
HADOOP ADMINISTRATION TRAINING CURRICULUM
1.1 Big Data Introduction
1.1.1 What is Big Data?
1.1.2 Big Data - Why
1.1.3 Big Data - Journey
1.1.4 Big Data Statistics
1.1.5 Big Data Analytics
1.1.6 Big Data Challenges
1.1.7 Technologies Supported By Big Data
1.2 Hadoop Introduction
1.2.1 What Is Hadoop?
1.2.2 History Of Hadoop
1.2.3 Breakthroughs Of Hadoop
1.2.4 Future of Hadoop
1.2.5 Who Is Using?
1.3 Basic Concepts
1.3.1 The Hadoop Distributed File System - At a Glance
1.3.2 Hadoop Daemon Processes
1.3.3 Anatomy Of A Hadoop Cluster
1.3.4 Hadoop Distributions
2 HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
2.1 What is HDFS?
2.1.1 Distributed File System (DFS)
2.1.2 Hadoop Distributed File System (HDFS)
2.2 HDFS Cluster Architecture and Block Placement
2.2.5 Secondary NameNode
2.3 HDFS Concepts
2.3.1 Typical Workflow
2.3.2 Data Replication
2.3.3 Replica Placement
2.3.4 Replication Policy
2.3.5 Hadoop Rack Awareness
2.3.6 Anatomy of a File Read
2.3.7 Anatomy of a File Write
3.1 STAGES OF MAPREDUCE
3.2.1 Job Tracker
3.2.2 Task Tracker
3.3 TASK FAILURES
3.3.2 Task Tracker Failures
3.3.3 Job Tracker Failures
3.3.4 HDFS Failures
4. HOW TO PLAN A CLUSTER
4.1 VERSIONS AND FEATURES
4.2 HARDWARE SELECTION
4.2.1 Master Hardware
4.2.2 Slave Hardware
4.2.3 Cluster sizing
4.3 OPERATING SYSTEM SELECTION
4.3.1 Deployment Layout
4.3.2 Software Packages
4.3.3 Hostname, DNS
4.3.4 Users, Groups, Privileges
4.4 DISK CONFIGURATION
4.4.1 Choose a FileSystem
4.4.2 Mount options
4.5 NETWORK DESIGN
4.5.1 Network usage in Hadoop
4.5.2 Typical network Topologies
5. INSTALLATION AND CONFIGURATION
5.1 APACHE HADOOP
5.1.1 Tarball Installation
5.1.2 Package Installation
5.2.1 XML Configuration
5.2.2 Environment Variables
5.2.3 Logging Configuration
5.3.1 Optimization and Tuning
5.4.1 Optimization and Tuning
6.1 KERBEROS AND HADOOP
6.1.2 Configuring Hadoop Security
7. RESOURCE MANAGEMENT
7.1 WHAT IS RESOURCE MANAGEMENT?
7.2 MAPREDUCE SCHEDULER
7.2.1 Capacity Scheduler
7.2.2 Fair Scheduler
8. CLUSTER MAINTENANCE
8.1 MANAGING HADOOP PROCESS
8.1.1 Starting and stopping processes with Init scripts
8.1.2 Starting and stopping processes manually
8.2 HDFS MAINTENANCE
8.2.1 Adding and Decommissioning DataNode
8.2.2 Balancing HDFS Block Data
8.2.3 Dealing with a Failed disk
8.3 MAPREDUCE MAINTENANCE
8.3.1 Adding and Decommissioning TaskTracker
8.3.2 Kill MapReduce Job and Task
8.3.3 Dealing Blacklisted Tasktracker
9.1 COMMON FAILURES AND PROBLEMS
9.2 HDFS AND MAPREDUCE CHECKS
10. BACKUP AND RECOVERY
10.1 DATA BACKUP
10.1.1 Distributed copy
10.1.2 Parallel data ingestion
10.2 NAMENODE METADATA
Workshop style coaching
Hands on practice exercises
Quiz at the end of each major topic
Tips and techniques on Cloudera Certification Examination
Mock interviews for each individual will be conducted on need basis
Resume preparation and guidance
Hadoop is the most important framework for working with Big Data in a distributed environment. Due to the rapid deluge of Big Data and the need for real-time insights from huge volumes of data, the job of a Hadoop administrator is critical to large organizations. Hence, there is huge demand for professionals with the right skills and certification. Intellipaat is offering the industry-designed Hadoop administration training to help you master this domain.
Hadoop Administration Interview Questions
1) What is Hadoop?
Hadoop evolved as the solution to the “Big Data” problem. It is a framework that offers a number of tools and services to store and process Big Data. It also plays an important role in analyzing Big Data and making efficient business decisions in situations where traditional methods fall short.
2) Name the Main Components of a Hadoop Application.
Hadoop offers a vast toolset that makes it possible to store and process data very easily. The main components of Hadoop are:
· Hadoop Common
· Hadoop MapReduce
· PIG and HIVE – The Data Access Components.
· HBase – For Data Storage
· Apache Flume, Sqoop, Chukwa – The Data Integration Components
· Ambari, Oozie and ZooKeeper – Data Management and Monitoring Component
· Thrift and Avro – Data Serialization components
· Apache Mahout and Drill – Data Intelligence Components
3) How many Input Formats are there in Hadoop? Explain.
There are three input formats in Hadoop:
1. Text Input Format: the default input format in Hadoop.
2. Sequence File Input Format: used to read files in sequence.
3. Key Value Input Format: used for plain text files.
4) What do you know about YARN?
YARN stands for Yet Another Resource Negotiator; it is the Hadoop processing framework. YARN is responsible for managing resources and establishing an execution environment for processes.
5) Why are nodes added and removed frequently in a Hadoop cluster?
The following features of the Hadoop framework lead a Hadoop administrator to add (commission) and remove (decommission) DataNodes in a Hadoop cluster:
1. The Hadoop framework utilizes commodity hardware, which is one of its important features. This results in frequent DataNode crashes in a Hadoop cluster.
2. Ease of scaling is another important feature of the Hadoop framework; nodes are commissioned as data volume grows rapidly.
6) What do you understand by “Rack Awareness”?
In Hadoop, Rack Awareness is the algorithm through which the NameNode determines how blocks and their replicas are placed in the Hadoop cluster. This is done via rack definitions that minimize the traffic between DataNodes in different racks. Let’s take an example: we know that the default replication factor is 3. According to the “Replica Placement Policy”, two replicas of every block are stored on nodes in a single rack, while the third replica is stored on a node in a different rack.
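Rack definitions are usually supplied by a topology script that maps each host address to a rack path; Hadoop calls the script configured in topology.script.file.name (Hadoop 1.x). A minimal sketch, assuming hypothetical 10.1.x and 10.2.x rack subnets:

```shell
#!/bin/sh
# Hypothetical topology script: Hadoop passes one or more IPs/hostnames
# and expects one rack path printed per argument.
rack_for() {
  case "$1" in
    10.1.*) echo /rack1 ;;        # first rack's subnet (assumed)
    10.2.*) echo /rack2 ;;        # second rack's subnet (assumed)
    *)      echo /default-rack ;; # fallback for unknown hosts
  esac
}
for host in "$@"; do
  rack_for "$host"
done
```

With these assumed subnets, a host such as 10.1.0.5 resolves to /rack1 and anything unrecognized falls back to /default-rack.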
7) What daemons are needed to run a Hadoop cluster?
DataNode, NameNode, TaskTracker, and JobTracker are required to run a Hadoop cluster.
8) Which OS are supported by Hadoop deployment?
The main OS used for Hadoop is Linux. However, with some additional software, it can also be deployed on the Windows platform.
9) What are the common Input Formats in Hadoop?
Three widely used input formats are:
1. Text Input: the default input format in Hadoop.
2. Key Value: used for plain text files.
3. Sequence: used for reading files in sequence.
10) What modes can Hadoop code be run in?
Hadoop can be deployed in
1. Standalone mode
2. Pseudo-distributed mode
3. Fully distributed mode.
11) What is the main difference between RDBMS and Hadoop?
RDBMS is used in transactional systems to store and process data, whereas Hadoop is used to store and process huge amounts of data.
12) What are the important hardware requirements for a Hadoop cluster?
There are no specific hardware requirements for DataNodes.
However, the NameNode needs a specific amount of RAM to hold the filesystem image in memory. How much depends on the particular design of the primary and secondary NameNode.
13) How would you deploy different components of Hadoop in production?
You need to deploy the JobTracker and NameNode on the master node, and then deploy DataNodes on multiple slave nodes.
14) What do you need to do as Hadoop admin after adding new datanodes?
The cluster picks up new DataNodes automatically, but existing block data is not moved onto them. To optimize cluster performance, you should start the balancer, which redistributes block data evenly across all DataNodes.
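The rebalancing step can be sketched as a guarded command; the 10% threshold below is illustrative, and the guard makes the script a harmless no-op on machines without a Hadoop client:

```shell
#!/bin/sh
# Rebalance block data after commissioning new DataNodes.
if command -v hadoop >/dev/null 2>&1; then
  # Move blocks until each DataNode's usage is within 10% of the cluster average
  hadoop balancer -threshold 10
else
  echo "hadoop client not found; run this on a cluster node"
fi
```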
15) Which Hadoop shell commands can be used for copy operations?
The copy operation commands are hadoop fs -put, hadoop fs -copyFromLocal, and hadoop fs -copyToLocal.
16) What is the Importance of the namenode?
The role of the NameNode is very crucial in Hadoop; it is the brain of the cluster. It is largely responsible for managing the distribution of blocks across the system, and it supplies the addresses of the requested data blocks when a client makes a request.
17) Explain how you will restart a NameNode?
The easiest way is to stop all the daemons by running the stop-all.sh script and then restart them, including the NameNode, with start-all.sh. Alternatively, you can restart only the NameNode with hadoop-daemon.sh stop namenode followed by hadoop-daemon.sh start namenode.
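A guarded sketch of restarting only the NameNode daemon, assuming the Hadoop 1.x helper scripts are on the PATH:

```shell
#!/bin/sh
# Restart the NameNode daemon alone; guarded so this is a no-op
# on machines where Hadoop is not installed.
if command -v hadoop-daemon.sh >/dev/null 2>&1; then
  hadoop-daemon.sh stop namenode
  hadoop-daemon.sh start namenode
else
  echo "hadoop-daemon.sh not on PATH; run this on the master node"
fi
```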
18) What happens when the NameNode is down?
If the NameNode is down, the file system goes offline.
19) Is it possible to copy files between different clusters? If yes, How can you achieve this?
Yes, we can copy files between multiple Hadoop clusters. This can be done using distributed copy.
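Distributed copy is invoked with the DistCp tool; a guarded sketch with hypothetical NameNode hostnames and paths:

```shell
#!/bin/sh
# Copy a directory from cluster A to cluster B with DistCp
# (namenode-a/namenode-b and /data are illustrative names).
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp hdfs://namenode-a:8020/data hdfs://namenode-b:8020/data
else
  echo "hadoop client not found; run this on a cluster node"
fi
```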
20) Is there any standard method to deploy Hadoop?
No, there is no standard procedure for deploying Hadoop. There are a few general requirements common to all Hadoop distributions, but the specific methods differ for each Hadoop admin.
21) List few Hadoop shell commands that are used to perform a copy operation.
- fs -put
- fs -copyToLocal
- fs -copyFromLocal
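Typical usage of the three commands, guarded so it only runs where a Hadoop client exists (the file paths are illustrative):

```shell
#!/bin/sh
# Copy files between the local filesystem and HDFS.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -copyFromLocal /tmp/sales.csv /user/demo/       # local -> HDFS
  hadoop fs -put /tmp/sales.csv /user/demo/sales2.csv       # local -> HDFS (synonym)
  hadoop fs -copyToLocal /user/demo/sales.csv /tmp/out.csv  # HDFS -> local
else
  echo "hadoop client not found; run this on a cluster node"
fi
```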
22) What is jps command used for?
The jps command is used to verify whether the daemons that run the Hadoop cluster are working. Its output shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker, and JobTracker.
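jps ships with the JDK, so the check is a one-liner; guarded here in case no JDK is installed:

```shell
#!/bin/sh
# List local JVMs by pid and main class name.
if command -v jps >/dev/null 2>&1; then
  jps   # on a master node one would expect NameNode and JobTracker entries
else
  echo "jps not found; install a JDK"
fi
```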
23) What are the important hardware considerations when deploying Hadoop in production environment?
- Memory-System’s memory requirements will vary between the worker services and management services based on the application.
- Operating System - a 64-bit operating system avoids any restrictions to be imposed on the amount of memory that can be used on worker nodes.
- Storage- It is preferable to design a Hadoop platform by moving the compute activity to data to achieve scalability and high performance.
- Capacity- Large Form Factor (3.5”) disks cost less and allow to store more, when compared to Small Form Factor disks.
- Network - Two TOR switches per rack provide better redundancy.
- Computational Capacity- This can be determined by the total number of MapReduce slots available across all the nodes within a Hadoop cluster.
24) How many NameNodes can you run on a single Hadoop cluster?
In Hadoop 1.x, only one NameNode runs per cluster; in Hadoop 2.x, a high-availability cluster runs two NameNodes, one active and one standby.
25) What happens when the NameNode on the Hadoop cluster goes down?
The file system goes offline whenever the NameNode is down.
26) What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?
This file provides the environment for Hadoop to run and contains variables such as HADOOP_CLASSPATH, JAVA_HOME, and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.
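A sketch of the entries one might put in conf/hadoop-env.sh; the paths below are examples, not requirements:

```shell
# Illustrative conf/hadoop-env.sh entries (paths are examples only)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64            # required for Hadoop to start
export HADOOP_LOG_DIR=/var/log/hadoop                         # where daemon logs are written
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/extra-jars/*" # optional extra jars
```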
27) Apart from using the jps command is there any other way that you can check whether the NameNode is working or not.
Use the command: /etc/init.d/hadoop-0.20-namenode status.
28) In a MapReduce system, suppose the HDFS block size is 64 MB and there are 3 files of sizes 127 MB, 64 KB, and 65 MB with FileInputFormat. Under this scenario, how many input splits are likely to be made by the Hadoop framework?
5 splits in total: 2 splits each for the 127 MB and 65 MB files and 1 split for the 64 KB file.
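The split count follows from dividing each file size by the block size and rounding up; a quick check in shell arithmetic:

```shell
#!/bin/sh
block=$((64 * 1024 * 1024))        # 64 MB HDFS block size
splits_for() {
  # ceil(file_size / block_size); every non-empty file gets at least one split
  echo $(( ($1 + block - 1) / block ))
}
splits_for $((127 * 1024 * 1024))  # 127 MB -> 2
splits_for $((64 * 1024))          # 64 KB  -> 1
splits_for $((65 * 1024 * 1024))   # 65 MB  -> 2
```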
29) Which command is used to verify if the HDFS is corrupt or not?
The hadoop fsck (file system check) command is used to check for missing, corrupt, or under-replicated blocks.
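A guarded sketch of running the check on the whole filesystem; the extra flags add per-file block detail:

```shell
#!/bin/sh
# Report HDFS health from the root path; -files/-blocks/-locations
# show each file's blocks and where the replicas live.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fsck / -files -blocks -locations
else
  echo "hadoop client not found; run this on a cluster node"
fi
```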
30) What are the essential features of Hadoop?
The Hadoop framework is capable of solving many questions of Big Data analysis. It is designed on Google’s MapReduce, which in turn is based on Google’s Big Data file system. Its essential features include open-source licensing, distributed storage and processing, fault tolerance through replication, and horizontal scalability on commodity hardware.
31) What is the main difference between an “Input Split” and “HDFS Block”?
An “Input Split” is the logical division of the data, while an “HDFS Block” is the physical division of the data.
32) State some of the important features of Hadoop.
The important features of Hadoop are –
- Hadoop framework is designed on Google MapReduce that is based on Google’s Big Data File Systems.
- Hadoop framework can solve many questions efficiently for Big Data analysis.
33) Do you know some companies that are using Hadoop?
Yahoo – using Hadoop
Facebook – developed Hive for analysis
Amazon, Adobe, Spotify, Netflix, eBay, and Twitter are some other well-known and established companies that are using Hadoop.
34) How can you differentiate RDBMS and Hadoop?
The key points that differentiate RDBMS and Hadoop are –
1. RDBMS is made to store structured data, whereas Hadoop can store any kind of data i.e. unstructured, structured, or semi-structured.
2. RDBMS follows “Schema on write” policy while Hadoop is based on “Schema on read” policy.
3. In RDBMS, the schema of the data is already known, which makes reads fast; in HDFS, no schema validation happens during writes, which makes writes fast.
4. RDBMS is licensed software, so one needs to pay for it, whereas Hadoop is open source software, so it is free of cost.
5. RDBMS is used for Online Transactional Processing (OLTP) system whereas Hadoop is used for data analytics, data discovery, and OLAP system as well.
35) What are the differences between Hadoop 1 and Hadoop 2?
The following two points explain the difference between Hadoop 1 and Hadoop 2:
In Hadoop 1.x, there is a single NameNode, which is thus a single point of failure, whereas in Hadoop 2.x there are Active and Passive NameNodes. If the active NameNode fails, the passive NameNode replaces it and takes charge. As a result, Hadoop 2.x provides high availability.
In Hadoop 2.x, YARN provides a central ResourceManager that shares cluster resources across multiple applications, whereas in Hadoop 1.x MapReduce is the only supported processing framework, which makes data processing a bottleneck.
36) What do you know about active and passive NameNodes?
In high-availability Hadoop architecture, two NameNodes are present.
Active NameNode – the NameNode that serves all client requests in the Hadoop cluster.
Passive NameNode – the standby NameNode that stores the same metadata as the active NameNode.
On failure of the active NameNode, the passive NameNode replaces it and takes charge. This way, there is always a running NameNode in the cluster, so the cluster never goes offline due to a NameNode failure.
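In Hadoop 2.x, HA state can be inspected and switched with the haadmin tool; a guarded sketch where nn1/nn2 are hypothetical NameNode IDs taken from hdfs-site.xml:

```shell
#!/bin/sh
# Inspect and switch HA state (nn1/nn2 are illustrative NameNode IDs
# configured under dfs.ha.namenodes.* in hdfs-site.xml).
if command -v hdfs >/dev/null 2>&1; then
  hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
  hdfs haadmin -failover nn1 nn2      # hand the active role from nn1 to nn2
else
  echo "hdfs client not found; run this on a cluster node"
fi
```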
37) What are the Components of Apache HBase?
Apache HBase Consists of the following main components:
• Region Server: A Table can be divided into several regions. A group of these regions gets served to the clients by a Region Server.
• HMaster: This coordinates and manages the Region server.
• ZooKeeper: This acts as a coordinator inside the HBase distributed environment. It maintains server state inside the cluster by communicating in sessions.
This course is specifically designed to help you clear the certification exam successfully. The comprehensive content of the course, along with demonstrations of practical scenarios and examples, will help you understand each and every topic in great depth. Since the course structure has a special focus on certification, you will go through a lot of real-time case studies and study material during the training that will help you crack the certification exam.
The NameNode continuously receives a signal (heartbeat) from each DataNode in the Hadoop cluster, which indicates that the DataNode is functioning properly. The list of all blocks present on a DataNode is stored in a block report. If a DataNode fails to send the heartbeat to the NameNode, it is marked dead after a specific time period. The NameNode then replicates the blocks of the dead node to other DataNodes from the earlier created replicas.
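An administrator can see which DataNodes the NameNode currently considers live or dead with a guarded one-liner:

```shell
#!/bin/sh
# Print cluster capacity figures plus the live and dead DataNode lists.
if command -v hadoop >/dev/null 2>&1; then
  hadoop dfsadmin -report
else
  echo "hadoop client not found; run this on a cluster node"
fi
```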
The process of NameNode recovery helps to keep the Hadoop cluster running, and can be explained by the following steps –
Step 1: To start a new NameNode, utilize the file system metadata replica (FsImage).
Step 2: Configure the clients and DataNodes to acknowledge the new NameNode.
Step 3: Once the new NameNode completes loading the last checkpoint FsImage and receives block reports from the DataNodes, it starts serving clients.
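Step 1 above can be sketched with the importCheckpoint option, which loads the latest checkpoint image into an empty name directory (a sketch, assuming a Hadoop 1.x installation on the new master):

```shell
#!/bin/sh
# Start a new NameNode from the secondary NameNode's checkpoint image.
if command -v hadoop >/dev/null 2>&1; then
  hadoop namenode -importCheckpoint
else
  echo "hadoop not found; run this on the new master node"
fi
```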
The different schedulers available in Hadoop are –
FIFO Scheduler – the default; it runs jobs in the order of submission.
Capacity Scheduler – divides cluster capacity among queues so that multiple organizations can share the cluster.
Fair Scheduler – gives every running job a fair share of cluster resources over time.
COSHH – makes scheduling decisions by considering the cluster, the workload, and their heterogeneity.
Both self-paced training and online instructor-led training have their own advantages and disadvantages.
1) Suitability - If you have no idea about the course content and no experience with it, online instructor-led training will help you understand the content better and more deeply.
2) Flexibility - Self-paced training is generally more flexible than tutor-led training, since you can learn at your own pace through the videos as and when you have time.
3) Doubt-clearing, assignments, etc. - In instructor-led training you can attempt assignments, get feedback from the tutor, and get your doubts cleared during the class.
Cost (higher for instructor-led training) and other factors are also important.
We have a highly qualified and experienced team of professionals who are experts in their fields. Our trainers are highly supportive and render a friendly learning environment to the students focusing on their career growth.
In normal circumstances, the tutor should be able to reschedule the class at a time convenient to you. If you accidentally miss a particular class in a multi-student batch, you can catch up from the corresponding session recording, which is shared with all students, or you can request the tutor to hold a separate class for you later on the topics covered in that class.
We provide server access for you to practice on and our trainers will ensure that you get practical real-time experience and training with all the utilities required for in-depth understanding of the course.
Yes. We provide lifetime access to Uplatz Learning Hub where you can view or download the course material anytime.
Yes. We have special offer/discount on this course. Please send email to firstname.lastname@example.org asking for the discount and our team will be glad to help you.
Yes. Please check our refunds and cancellation policy (link given in the footer of our website) for more details.
You can send an email to email@example.com and Uplatz course team will respond back on your queries.