In this page I am sharing Hadoop interview questions that are frequently asked in the technical round for Hadoop developers. Earlier I wrote a post, for developers who want to start learning Hadoop, on the prerequisites for learning Hadoop.
1) What is Hadoop?
Ans: Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
2) Why the name Hadoop?
Ans: Hadoop is not an acronym and does not expand to anything, the way a term like OOPS does. The charming yellow elephant you see is named after Doug Cutting's son's toy elephant.
3) Why do we need Hadoop?
Ans: Every day a large amount of unstructured data gets dumped into our machines. The major challenge is not just storing large data sets in our systems but retrieving and analyzing the big data in our organizations, especially when that data sits on different machines at different locations. This is where the need for Hadoop arises. Hadoop can analyze the data present on different machines at different locations without bringing it to one place. It uses the concept of MapReduce, which divides a query into small parts and processes them in parallel. This is also known as parallel computing.
4) Give some examples of companies that are using Hadoop?
Ans: A lot of companies are using Hadoop, such as Facebook, eBay, Amazon, Twitter, Google and so on.
5) What is the basic Difference between RDBMS and Hadoop?
Ans: A traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store a huge amount of data in a distributed file system and process it. RDBMS is useful when you want to seek one record from big data, whereas Hadoop is useful when you want to ingest big data in one shot and perform analysis on it later.
6) What is Structured and unstructured data?
Ans: Structured data is data that is easily identifiable because it is organized in a structure. The most common form of structured data is a database, where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
7) What are the core components of Hadoop?
Ans: Core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.
8) What is HDFS?
Ans: HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
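For illustration, here is a minimal sketch of a client writing and reading a file through the HDFS Java FileSystem API; the NameNode URI and file path are made-up values you would replace with your own cluster's settings.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; use your cluster's fs.defaultFS value here
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {   // write a small file
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(file)) {       // stream it back
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}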
9) What is Fault Tolerance?
Ans: Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting back the data present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well. So even if one or two of the systems collapse, the file is still available on the third system.
10) What is the meaning of replication factor?
Ans: Replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks and spread across the cluster.
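As a rough sketch, the replication factor can be adjusted from a client through the FileSystem API, either as a default for newly created files or per existing file; the path and the factor of 2 below are purely illustrative, and the client is assumed to be pointed at a real cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "2");      // default factor for files this client creates
        FileSystem fs = FileSystem.get(conf);  // assumes fs.defaultFS points at the cluster

        // Change the factor of an existing file (illustrative path)
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        fs.close();
    }
}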
11) What is MapReduce?
Ans: MapReduce is the programming model and set of libraries used to process and manipulate large data sets over a Hadoop cluster.
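The classic illustration is word count. Below is a minimal sketch of the map side using the org.apache.hadoop.mapreduce API; the matching Reducer is sketched under question 15.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into tokens and emit (word, 1) for every token
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}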
12) What is an InputSplit in MapReduce?
Ans: An InputSplit is the slice of data to be processed by a single Mapper. It is generally the same size as an HDFS block, which is where the data is stored.
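If you want to influence how the input is sliced, FileInputFormat lets you bound the split size before the job is submitted; the sizes below are only illustrative, and by default one split per HDFS block is used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size demo");
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // at most 256 MB per split
        // ... mapper, reducer and input/output paths would be set here before submitting the job
    }
}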
13) How does master-slave architecture work in Hadoop?
Ans: In total, 5 daemons run in the Hadoop master-slave architecture. On the master node: NameNode, JobTracker and Secondary NameNode. On the slave nodes: DataNode and TaskTracker. However, it is recommended to run the Secondary NameNode on a separate machine that has master-node capacity.
14) What is the Reducer used for?
Ans: The Reducer is used to combine the multiple outputs of the mappers into a single consolidated output.
15) What are the primary phases of the Reducer?
Ans: A Reducer has 3 primary phases: 1) shuffle, 2) sort and 3) reduce.
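Here is a minimal sketch of the reduce side of the word-count example from question 11. By the time reduce() is called, the shuffle and sort phases have already grouped all values for a key together.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {   // all counts for one word, grouped by the shuffle
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, total) pair
    }
}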
16) What is the typical block size of an HDFS block?
Ans: The default block size is 64 MB, but 128 MB is typical in practice.
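The block size is a per-file setting, so it can be overridden either through configuration or when a file is created; a small sketch follows, with purely illustrative sizes and paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB for new files from this client
        FileSystem fs = FileSystem.get(conf);

        // Or pass the block size explicitly: create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"), true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.write(new byte[]{1, 2, 3});
        }
        fs.close();
    }
}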
17) What are the input and output data formats of the Hadoop framework?
Ans: There are different formats such as FileInputFormat, TextInputFormat, KeyValueTextInputFormat and WholeFileInputFormat, which are some of the file formats in the Hadoop framework.
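A minimal sketch of explicitly choosing the input and output format classes on a job, reusing the hypothetical WordCountMapper and WordCountReducer sketched above; the paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatExample.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);    // one record per line (the default)
        job.setOutputFormatClass(TextOutputFormat.class);  // tab-separated key/value lines

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}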
18) Can I set the number of reducers to zero?
Ans: Yes, it can be set to zero. In that case the mapper output is treated as the final output and is stored directly in HDFS.
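A short driver sketch of such a map-only job: with zero reducers there is no shuffle, sort or reduce phase, and each mapper's output lands directly in the output directory. WordCountMapper refers to the sketch under question 11, and the paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only demo");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setNumReduceTasks(0);               // zero reducers: the map output is the job output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}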
19) What are Map files and why are they important in Hadoop?
Ans: Map files are sorted SequenceFiles that also carry an index. The index allows fast lookup of data.
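A small MapFile sketch: keys must be appended in sorted order, and the reader uses the index that is written alongside the data for fast lookups. The directory path and key/value payloads are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path("/user/demo/lookup.map");

        // A MapFile is a directory holding a sorted "data" file plus an "index" file
        try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                MapFile.Writer.keyClass(Text.class),
                MapFile.Writer.valueClass(Text.class))) {
            writer.append(new Text("apple"), new Text("fruit"));
            writer.append(new Text("carrot"), new Text("vegetable"));   // keys appended in sorted order
        }

        // get() uses the index to seek close to the key instead of scanning the whole file
        try (MapFile.Reader reader = new MapFile.Reader(dir, conf)) {
            Text value = new Text();
            reader.get(new Text("carrot"), value);
            System.out.println(value);   // prints "vegetable"
        }
    }
}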
20) How can you use binary data in MapReduce in Hadoop?
Ans: Binary data can be used directly by a MapReduce job. Often binary data is added to a SequenceFile.
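A minimal sketch of packing binary payloads into a SequenceFile using BytesWritable, which a job can then read through SequenceFileInputFormat; the path and byte values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/user/demo/binary.seq");

        byte[] payload = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(file),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Each record keeps its binary value intact; no text encoding is involved
            writer.append(new Text("record-1"), new BytesWritable(payload));
        }
    }
}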