Ans: Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to relational databases.
Ans: Sqoop metastore is a shared metadata repository for remote users to define and execute saved jobs created using sqoop job defined in the metastore. The sqoop –site.xml should be configured to connect to the metastore.
Ans: Delimited text and Sequence Files.
Ans: Both distCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel but the only difference is that distCP command can transfer any kind of data from one Hadoop cluster to another whereas Sqoop transfers data between RDBMS and other components in the Hadoop ecosystem like HBase, Hive, HDFS, etc.
Q5. Compare Sqoop and Flume
Ans: Sqoop vs Flume
Sqoop |
Flume |
Used for importing data from structured data sources like RDBMS. | Used for moving bulk streaming data into HDFS. |
It has a connector based architecture. | It has a agent based architecture. |
Data import in sqoop is not evetn driven. | Data load in flume is event driven |
HDFS is the destination for importing data. | Data flows into HDFS through one or more channels. |
Ans: Sqoop can import data form a relational database using any SQL query rather than only using table and column name parameters.
Ans: Yes, MySQL is the default database
Ans: This can be accomplished using the –m 1 option in the Sqoop import command. It will create only one MapReduce task which will then import rows serially.
Ans: This can be accomplished using the import-all-tables import command in Sqoop and by specifying the exclude-tables option with it as follows-
sqoop import-all-tables
--connect –username –password --exclude-tables Table298, Table 123, Table 299
Ans: Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To import large objects, I Sqoop, JDBC based imports have to be used without the direct argument to the import utility.
Ans: Unlike sqoop-list-tables and sqoop-list-databases, there is no direct command like sqoop-list-columns to list all the columns. The indirect way of achieving this is to retrieve the columns of the desired tables and redirect them to a file which can be viewed manually containing the column names of a particular table.
Sqoop import --m 1 --connect 'jdbc: sqlserver: //nameofmyserver; database=nameofmydatabase; username=DeZyre; password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'
Ans: Using the –input-null-string parameter, a default value can be specified so that the row gets inserted with the default value for the column that it has a NULL value in HDFS.
Ans: --Split-by clause is used to specify the columns of the table that are used to generate splits for data imports. This clause specifies the columns that will be used for splitting when importing the data into the Hadoop cluster. —split-by clause helps achieve improved performance through greater parallelism. Apache Sqoop will create splits based on the values present in the columns specified in the –split-by clause of the import command. If the –split-by clause is not specified, then the primary key of the table is used to create the splits while data import. At times the primary key of the table might not have evenly distributed values between the minimum and maximum range. Under such circumstances –split-by clause can be used to specify some other column that has even distribution of data to create splits so that data import is efficient.
Ans: Sqoop allows data to be imported using two file formats
This is the default file format to import data using Sqoop. This file format can be explicitly specified using the –as-textfile argument to the import command in Sqoop. Passing this as an argument to the command will produce the string based representation of all the records to the output files with the delimited characters between rows and columns.
It is a binary file format where records are stored in custom record-specific data types which are shown as Java classes. Sqoop automatically creates these data types and manifests them as java classes.
Ans: The basic commands of HadoopSqoop are
For each sqoop copying into HDFS how many MapReduce jobs and tasks will be submitted?
There are 4 jobs that will be submitted to each Sqoop copying into HDFS and no reduce tasks are scheduled.
Ans: If the imported records have rows that contain null values for all the columns, then probably those records might have been dropped off during import because HBase does not allow null values in all the columns of a record.
Ans: split-by is a clause, it is used to specify the columns of the table which are helping to generate splits for data imports during importing the data into the Hadoop cluster. This clause specifies the columns and helps to improve the performance via greater parallelism. And also it helps to specify the column that has an even distribution of data to create splits,that data is imported.
Ans: Data can be synchronised using incremental parameter with data import –
--Incremental parameter can be used with one of the two options-
Ans: In the Java code Sqoop jar is included in the classpath. The required parameters are created to Sqoop programmatically like for CLI (command line interface). Sqoop.runTool() method also invoked in Java code.
–connect jdbc: mysql: //localhost/test_db
Is the above command the best way to specify the connect string in case I want to use Apache Sqoop with a distributed hadoop cluster?
Ans: When using Sqoop with a distributed Hadoop cluster the URL should not be specified with localhost in the connect string because the connect string will be applied on all the DataNodes with the Hadoop cluster. So, if the literal name localhost is mentioned instead of the IP address or the complete hostname then each node will connect to a different database on their localhosts. It is always suggested to specify the hostname that can be seen by all remote nodes.
Ans: This can be proficient using the import-all-tables, import command in Sqoop and by specifying the exclude-tables option with it as follows-
sqoop import-all-tables
–connect –username –password –exclude-tables Table498, Table 323, Table 199
Ans: Using the –boundary-query clause. Generally, sqoop uses the SQL query select min (), max () from to find out the boundary values for creating splits. However, if this query is not optimal then using the –boundary-query argument any random query can be written to generate two numeric columns.
Ans: Hadoop MapReduce cluster is configured to run a maximum of 4 parallel MapReduce tasks and the sqoop import can be configured with number of parallel tasks less than or equal to 4 but not more than 4.
Ans: sqoop job –list
Ans: sqoop --tables --connect jdbc:mysql://host/DB1
Ans: The metastore database can be hosted anywhere within or outside of the Hadoop cluster..