Hadoop Administration Interview Questions and Answers

Written by Venkatesan M | May 23, 2017 10:32:25 AM

Q1. How will you decide whether you need to use the Capacity Scheduler or the Fair Scheduler?

Ans: Fair scheduling assigns resources to jobs so that, over time, all jobs receive an equal share of the cluster's resources. The Fair Scheduler can be used under the following circumstances:

If you want jobs to make equal progress instead of following FIFO order, use the Fair Scheduler.
If connectivity is slow and data locality makes a significant difference to job runtime, use the Fair Scheduler.
Use the Fair Scheduler if utilization varies a lot between pools.

The Capacity Scheduler runs the Hadoop MapReduce cluster as a shared, multi-tenant cluster to maximize cluster utilization and throughput. The Capacity Scheduler can be used under the following circumstances:

If the jobs require scheduler determinism, the Capacity Scheduler can be useful.
The Capacity Scheduler's memory-based scheduling is useful if the jobs have varying memory requirements.
If you know the cluster utilization and workload well and want to enforce resource allocation, use the Capacity Scheduler.

Q2. What are the daemons required to run a Hadoop cluster?

Ans: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

Q3. How will you restart a NameNode?

Ans: The easiest way is to run the stop-all.sh shell script and, once it has finished, restart the NameNode by running start-all.sh. Note that these scripts stop and start every Hadoop daemon on the cluster, not only the NameNode.

Q4. Explain about the different schedulers available in Hadoop.

Ans:

FIFO Scheduler: This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
COSHH: This scheduler considers the workload, cluster and the user heterogeneity for scheduling decisions.
Fair Sharing: This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute the jobs.

Q5. List a few Hadoop shell commands that are used to perform a copy operation.

Ans:

hadoop fs -put
hadoop fs -copyToLocal
hadoop fs -copyFromLocal

Q6. What is the jps command used for?

Ans: The jps command lists the Java processes running on a machine, so it is used to verify whether the Hadoop daemons are running. On a healthy cluster its output includes the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker processes on the nodes that host them.
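
The check described above can be automated. The following is a minimal sketch, assuming jps prints one "<pid> <class name>" pair per line and that the daemon class names are the ones listed in the answer; the sample output and its PIDs are made up for illustration.

```python
# Daemon names for a classic (Hadoop 1.x) cluster, as listed in the answer above.
EXPECTED_DAEMONS = {"NameNode", "SecondaryNameNode", "DataNode",
                    "JobTracker", "TaskTracker"}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <main class name>"; skip anything else.
        if len(parts) >= 2 and parts[0].isdigit():
            running.add(parts[1])
    return sorted(set(expected) - running)


# Hypothetical output captured on a single-node setup (PIDs are invented).
sample = """\
4821 NameNode
4963 DataNode
5120 SecondaryNameNode
5297 JobTracker
5544 Jps
"""

print(missing_daemons(sample))  # TaskTracker is absent from the sample
```

On a real multi-node cluster each machine runs only some of the daemons, so the expected set would differ per node.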

Q7. What are the important hardware considerations when deploying Hadoop in a production environment?

Ans:

Memory: Memory requirements vary between the worker services and the management services, depending on the application.
Operating system: A 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
Storage: It is preferable to design a Hadoop platform that moves the compute activity to the data, to achieve scalability and high performance.
Capacity: Large Form Factor (3.5”) disks cost less and store more than Small Form Factor disks.
Network: Two TOR (top-of-rack) switches per rack provide better redundancy.
Computational capacity: This can be determined by the total number of MapReduce slots available across all the nodes in a Hadoop cluster.

Q8. How many NameNodes can you run on a single Hadoop cluster?

Ans: Only one.

Q9. What happens when the NameNode on the Hadoop cluster goes down?

Ans: The file system goes offline whenever the NameNode is down.

Q10. What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?

Ans: This file sets up the environment in which Hadoop runs and contains variables such as HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.

Q11. Apart from using the jps command, is there any other way to check whether the NameNode is working?

Ans: Use the command /etc/init.d/hadoop-0.20-namenode status.

Q12. In a MapReduce system, the HDFS block size is 64 MB and there are 3 files of size 127 MB, 64 KB and 65 MB, read with FileInputFormat. Under this scenario, how many input splits are likely to be made by the Hadoop framework?

Ans: 2 splits each for 127 MB and 65 MB files and 1 split for the 64KB file.
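
The arithmetic behind this answer can be sketched in a few lines. The function below is a simplified model, not Hadoop's actual implementation: with slop=1.0 it reproduces the "one split per started block" counting used above. The slop parameter is an assumption modeled on the small tolerance that FileInputFormat is generally understood to apply to the last split; with a value such as 1.1 the 65 MB file would be read as a single split, which is why the question says "likely".

```python
MB = 1024 * 1024
KB = 1024


def count_splits(file_size, split_size, slop=1.0):
    """Count input splits for one non-empty file (sizes in bytes).

    slop=1.0 gives plain block-by-block counting. A larger slop lets the
    final split absorb a small remainder instead of creating a tiny split.
    """
    if file_size <= 0 or split_size <= 0:
        raise ValueError("sizes must be positive; empty files are not modeled")
    splits = 0
    remaining = file_size
    # Carve off full-size splits while more than one split's worth remains.
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


files = {"127 MB": 127 * MB, "64 KB": 64 * KB, "65 MB": 65 * MB}
for name, size in files.items():
    print(name, count_splits(size, 64 * MB))
```

Calling count_splits(65 * MB, 64 * MB, slop=1.1) shows how the tolerance changes the result for the 65 MB file.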

Q13. Which command is used to verify if the HDFS is corrupt or not?

Ans: The hadoop fsck (file system check) command is used to check for missing or corrupt blocks, for example hadoop fsck / to check the whole file system.

Q14. List some use cases of the Hadoop Ecosystem

Ans: Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.

Q15. How can you kill a Hadoop job?

Ans: Run hadoop job -kill jobID, where jobID is the ID of the job to kill.

Q16. I want to see all the jobs running in a Hadoop cluster. How can you do this?

Ans: The command hadoop job -list gives the list of jobs running in a Hadoop cluster.

Q17. Is it possible to copy files across multiple clusters? If yes, how can you accomplish this?

Ans: Yes, files can be copied across multiple Hadoop clusters using distributed copy. The distcp command (hadoop distcp) is used for intra-cluster or inter-cluster copying.

Q18. Which is the best operating system to run Hadoop?

Ans: Linux (for example, Ubuntu) is the preferred operating system for running Hadoop. Windows can also be used, but it leads to several problems and is not recommended.

Q19. What are the network requirements to run Hadoop?

Ans:

SSH is required to launch server processes on the slave nodes.
A passwordless SSH connection is required between the master, the secondary machines and all the slaves.

Q20. The mapred.output.compress property is set to true so that all output files are compressed for efficient space usage on the Hadoop cluster. If a cluster user does not need compressed output for a particular job, what would you suggest that he do?

Ans: If the user does not want to compress the data for a particular job then he should create his own configuration file and set the mapred.output.compress property to false. This configuration file then should be loaded as a resource into the job.

Q21. What is the best practice to deploy a secondary NameNode?

Ans: It is always better to deploy a secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine it does not interfere with the operations of the primary node.

Q22. How often should the NameNode be reformatted?

Ans: The NameNode should never be reformatted. Doing so will result in complete data loss. NameNode is formatted only once at the beginning after which it creates the directory structure for file system metadata and namespace ID for the entire file system.

Q23. If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does Hadoop do?

Ans: The task will be started again on a new TaskTracker. If it fails 4 times, which is the default maximum number of attempts (the value can be changed), the job will be killed.
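
The retry behaviour can be illustrated with a small simulation. This is a sketch under stated assumptions, not the JobTracker's real logic: the function name, the tracker names and the round-robin choice of the next tracker are all hypothetical, and only the limit of 4 attempts comes from the answer above.

```python
def run_with_retries(task, trackers, max_attempts=4):
    """Try a task on successive trackers; return (succeeded, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        # Pick a different tracker for each attempt (simple round-robin).
        tracker = trackers[(attempt - 1) % len(trackers)]
        try:
            task(tracker)
            return True, attempt
        except Exception:
            # The attempt failed; fall through and retry elsewhere.
            continue
    # Every allowed attempt failed, so the whole job would be killed.
    return False, max_attempts


def flaky(tracker):
    """Hypothetical task that only fails on the tracker named tt1."""
    if tracker == "tt1":
        raise RuntimeError("disk failure on " + tracker)


def always_fails(tracker):
    """Hypothetical task with a bug that fails everywhere."""
    raise RuntimeError("bad input record")


print(run_with_retries(flaky, ["tt1", "tt2", "tt3"]))         # recovers on tt2
print(run_with_retries(always_fails, ["tt1", "tt2", "tt3"]))  # gives up
```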

Q24. How can you add and remove nodes from the Hadoop cluster?

Ans:

To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file and then the DataNode and TaskTracker daemons should be started on the new node.
To remove or decommission nodes from the HDFS cluster, the hostnames should be added to the exclude file (the file named by dfs.hosts.exclude) and hadoop dfsadmin -refreshNodes should be executed. Once decommissioning has finished, the hostnames can be removed from the slaves file.

Q25. You increase the replication level but notice that the data is under replicated. What could have gone wrong?

Ans: Possibly nothing has gone wrong. With a large volume of data, replication takes time because the cluster has to copy the data, and it might take a few hours.

Q26. Explain the different configuration files and where they are located.

Ans: The configuration files are located in the “conf” subdirectory. Hadoop has 3 main configuration files: hdfs-site.xml, core-site.xml and mapred-site.xml.

Q27. Which operating system(s) are supported for production Hadoop deployment?

Ans: Linux is the main supported operating system for production Hadoop deployments. As noted in Q18, Windows can also be used, but it is not recommended.

Q28. What is the role of the namenode?

Ans: The namenode is the "brain" of the Hadoop cluster and is responsible for managing the distribution of blocks on the system based on the replication policy. The namenode also supplies the specific addresses of the data in response to client requests.

Q29. What happens on the namenode when a client tries to read a data file?

Ans: The namenode will look up the information about the file in the edit file and then retrieve the remaining information from the in-memory filesystem snapshot.

Since the namenode needs to support a large number of clients, the primary namenode only sends back the location of the data. The datanode itself is responsible for the retrieval.

Q30. What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

Ans: There are no special requirements for datanodes. However, the namenodes require a certain amount of RAM to store the filesystem image in memory. Based on the design of the primary and secondary namenodes, the entire filesystem information is stored in memory. Therefore, both namenodes need enough memory to hold the entire filesystem image.

Q31. What mode(s) can Hadoop code be run in?

Ans: Hadoop can be deployed in standalone mode, pseudo-distributed mode or fully-distributed mode.

Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine and as a single process for testing purposes.

Q32. How would a Hadoop administrator deploy various components of Hadoop in production?

Ans: Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes.

There is a need for only one namenode and jobtracker on the system. The number of datanodes depends on the available hardware.

Q33. What is the best practice to deploy the secondary namenode?

Ans: Deploy the secondary namenode on a separate standalone machine so that it does not interfere with primary namenode operations. The secondary namenode has the same memory requirements as the main namenode.

Q34. Is there a standard procedure to deploy Hadoop?

Ans: No, there are some differences between various distributions. However, they all require that Hadoop jars be installed on the machine.

There are some common requirements for all Hadoop distributions but the specific procedures will be different for different vendors since they all have some degree of proprietary software.

Q35. What is the role of the secondary namenode?

Ans: The secondary namenode performs the CPU-intensive operation of combining edit logs and current filesystem snapshots.

The secondary namenode was separated out as a process because of its CPU-intensive operations and the additional requirements for metadata backup.

Q36. What are the side effects of not running a secondary name node?

Ans: The cluster performance will degrade over time since the edit log will grow bigger and bigger.

If the secondary namenode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time since the namenode needs to combine the edit log and the current filesystem checkpoint image.

Q37. What happens if a datanode loses network connection for a few minutes?

Ans: The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted.

The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. Once the namenode considers a datanode unavailable, it triggers replication of that node's data from the existing replicas. However, if the datanode comes back up, over-replicated data will be deleted. Note: the data might be deleted from the original datanode.
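
The bookkeeping described above comes down to comparing the number of live replicas of a block with the configured replication factor. The sketch below is illustrative only; the function name and the returned labels are hypothetical and not part of any Hadoop API.

```python
def replication_action(live_replicas, replication_factor=3):
    """Classify a block by comparing live replicas with the target count."""
    if live_replicas < replication_factor:
        # Under-replicated: copy from a surviving replica.
        return ("replicate", replication_factor - live_replicas)
    if live_replicas > replication_factor:
        # Over-replicated: remove the surplus copies.
        return ("delete", live_replicas - replication_factor)
    return ("ok", 0)


# A block starts with 3 replicas, one datanode drops out, a new copy is
# made, and then the lost datanode returns with its old replica.
timeline = [3, 2, 3, 4]
for live in timeline:
    print(live, replication_action(live))
```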

Q38. What happens if one of the datanodes has a much slower CPU?

Ans: The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact.

Hadoop was specifically designed to work with commodity hardware. Speculative execution helps to offset slow workers: multiple instances of the same task are created, the job tracker takes the first result into consideration, and the other instance of the task is killed.

Q39. What is speculative execution?

Ans: If speculative execution is enabled, the job tracker will issue multiple instances of the same task on multiple nodes and it will take the result of the task that finished first. The other instances of the task will be killed.

The speculative execution is used to offset the impact of the slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful task. The rest of the tasks will be discarded.
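
The "first result wins" idea can be sketched with threads standing in for nodes. This is a toy model under stated assumptions: the node names and delays are invented, every attempt is launched up front (a real cluster typically starts a duplicate only for a task that is lagging), and Python cannot kill a running thread, so the slower attempt is simply ignored rather than killed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_attempt(node, seconds):
    """Stand-in for one task attempt; sleeps to imitate work on a node."""
    time.sleep(seconds)
    return node


def speculative_run(node_speeds):
    """Launch the same task on every node and return the first finisher."""
    with ThreadPoolExecutor(max_workers=len(node_speeds)) as pool:
        futures = [pool.submit(run_attempt, node, secs)
                   for node, secs in node_speeds.items()]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        # Attempts that have not started are cancelled; attempts already
        # running cannot be stopped here, so their results are discarded.
        for future in pending:
            future.cancel()
        return next(iter(done)).result()


winner = speculative_run({"fast-node": 0.01, "slow-node": 0.3})
print(winner)
```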

Q40. After increasing the replication level, I still see that data is under replicated. What could be wrong?

Ans: Data replication takes time due to large quantities of data. The Hadoop administrator should allow sufficient time for data replication.

Depending on the data size the data replication will take some time. Hadoop cluster still needs to copy data around and if data size is big enough it is not uncommon that replication will take from a few minutes to a few hours.

Q41. How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?

Ans: In order to ensure a reliable operation it is recommended to have at least 2 racks with rack placement configured.

Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.

Q42. Are there any special requirements for namenode?

Ans: Yes, the namenode holds information about all files in the system and needs to be extra reliable.

The namenode is a single point of failure. It needs to be extra reliable, and its metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.

Q43. If you have a file of size 128M and the replication factor is set to 3, how many blocks can you find on the cluster that will correspond to that file (assuming the default apache and cloudera configuration)?

Ans: With the default block size of 64M, the file is divided into 128M / 64M = 2 blocks. Each block is replicated according to the replication factor (default 3), so the cluster holds 2 * 3 = 6 blocks for that file.
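
The same calculation can be written as a short helper for other file sizes. This is a minimal sketch; the function name is hypothetical, and the defaults of 64M and 3 are the ones assumed in the answer above.

```python
import math


def stored_block_replicas(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, total block replicas) for one file."""
    # A partially filled last block still counts as a block.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication


print(stored_block_replicas(128))  # the 128M file from the question
print(stored_block_replicas(130))  # a slightly larger file needs a third block
```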

Q44. What is distributed copy (distcp)?

Ans: Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for copying a large amount of data.

One of the major challenges in the Hadoop environment is copying data across multiple clusters and distcp will allow multiple datanodes to be leveraged for parallel copying of the data.

Q45. What is replication factor?

Ans: The replication factor controls how many times each individual block is replicated.

Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor improves data availability in the event of failure.

Q46. What daemons run on Master nodes?

Ans: NameNode, Secondary NameNode and JobTracker.

Hadoop comprises five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on Master nodes. DataNode and TaskTracker run on each Slave node.

Q47. What is rack awareness?

Ans: Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions.

Hadoop will try to keep network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.

Q48. What is the role of the jobtracker in a Hadoop cluster?

Ans: The jobtracker is responsible for scheduling tasks on slave nodes, collecting results and retrying failed tasks.

The job tracker is the main component of map-reduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the jobs and reports results back to the calling code.
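
A rack-aware placement decision can be sketched as follows. The policy shown (first replica on the writer's node, second on a node in another rack, third on a different node in that same remote rack) is an assumption about the commonly described HDFS default for a replication factor of 3 and is not stated in the text above. The topology, node names and function name are hypothetical, and the sketch assumes at least two racks with two or more nodes in the remote rack.

```python
import random


def place_replicas(topology, writer_node, rng=random):
    """Pick three datanodes for one block.

    topology maps rack name -> list of datanode names.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    # Replica 1: the writer's own node, so the first copy needs no network hop.
    first = writer_node
    # Replica 2: a node on a different rack, to survive a whole-rack failure.
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])
    # Replica 3: another node on that same remote rack, which keeps the
    # third copy from crossing racks a second time.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(topology, "dn1"))
```

With only two racks the result always spans both of them, which fits the recommendation in Q41 to run at least 2 racks.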

How does the Hadoop cluster tolerate datanode failures?

Ans: Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.

The namenode actively tracks the status of all datanodes and acts immediately if the datanodes become non-responsive. The namenode is the central "brain" of HDFS and starts replication of the data the moment a disconnect is detected.

Q49. What is the procedure for namenode recovery?

Ans: A namenode can be recovered in two ways: starting a new namenode from backup metadata or promoting the secondary namenode to primary namenode.

The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.

Q50. Web-UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

Ans: This means that the namenode is trying to retrieve data from those datanodes by moving replicas to the remaining datanodes. Data can be lost if the administrator removes those datanodes before decommissioning has finished.

Because of the replication strategy, it is possible to lose some data if datanodes are removed en masse before the decommissioning process completes. Decommissioning refers to the namenode retrieving data from datanodes by moving replicas to the remaining datanodes.

Q51. What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?

Ans: Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.

The Hadoop cluster will detect new datanodes automatically. However, to optimize cluster performance it is recommended to start the balancer to redistribute the data evenly between datanodes.
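
To see why a freshly added node calls for rebalancing, it helps to compare each node's disk utilization with the cluster average. The sketch below is a simplified illustration, not the balancer's real algorithm: the function name and sample figures are hypothetical, the 10% threshold is an assumed default, and all nodes are assumed to have equal capacity so that the cluster average is a plain mean.

```python
def classify_nodes(used_fraction, threshold=0.10):
    """Split datanodes into over- and under-utilized lists.

    used_fraction maps datanode name -> fraction of its disk in use (0..1).
    """
    average = sum(used_fraction.values()) / len(used_fraction)
    # A node counts as unbalanced when it differs from the cluster
    # average by more than the threshold.
    over = sorted(n for n, u in used_fraction.items() if u > average + threshold)
    under = sorted(n for n, u in used_fraction.items() if u < average - threshold)
    return average, over, under


# Three existing datanodes plus one empty, newly added node.
cluster = {"dn1": 0.80, "dn2": 0.75, "dn3": 0.70, "dn4-new": 0.00}
average, over, under = classify_nodes(cluster)
print(round(average, 4), over, under)
```

In this sample the empty node pulls the average down far enough that every older node counts as over-utilized, so data would move from them to the new node.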

Q52. If the Hadoop administrator needs to make a change, which configuration file does he need to change?

Ans: It depends on the nature of the change. Each node has its own set of configuration files and they are not always the same on each node.

Each node in the Hadoop cluster has its own configuration files, and the change needs to be made in every relevant file. One of the reasons for this is that the configuration can be different for every node.

Q53. MapReduce jobs are failing on a cluster that was just restarted. They worked before restart. What could be wrong?

Ans: The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.

This is a very common mistake by Hadoop administrators when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log and the current filesystem checkpoint image.
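
Waiting for safe mode to end can be scripted. The sketch below polls hadoop dfsadmin -safemode get and assumes its output contains the text "Safe mode is ON" or "Safe mode is OFF". The function names, polling interval and limits are hypothetical, and the status function is injectable so the logic can be tried without a cluster.

```python
import subprocess
import time


def safemode_status():
    """Ask the namenode for its safe mode state via the Hadoop CLI."""
    result = subprocess.run(["hadoop", "dfsadmin", "-safemode", "get"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def wait_for_safemode_exit(get_status=safemode_status, poll_seconds=5,
                           max_polls=60):
    """Poll until the namenode reports that safe mode is off."""
    for _ in range(max_polls):
        if "OFF" in get_status():
            return True
        time.sleep(poll_seconds)
    # Still in safe mode after all polls; the caller decides what to do.
    return False


# Dry run with canned responses instead of a real cluster.
canned = iter(["Safe mode is ON", "Safe mode is ON", "Safe mode is OFF"])
print(wait_for_safemode_exit(lambda: next(canned), poll_seconds=0))
```

The CLI also offers hadoop dfsadmin -safemode wait, which blocks until safe mode ends and may be simpler in shell scripts.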