Sqoop 1.4.5

Sqoop version 1.4.5 is delivered with OraOop — Oracle Hadoop connector. Previously this connector was a separate product, so probably not many users knew about it, but now it’s much easier to take an advantage of Oracle specific parameters and settings.

Installing from sources is straight forward. After building it I didn’t have any problems with running it on CDH 5.1.3. To use Oracle driver you have to add

--direct

parameter on the command line. If you see the following output in logs:

**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************

it means that Oracle connector was used. To make it works you need certain privileges in Oracle:

grant select on v_$instance to user;
grant select on dba_tables to user;
grant select on dba_tab_columns to user;
grant select on dba_objects to user;
grant select on dba_extents to user;
grant select on dba_tab_partitions to user;
grant select on dba_tab_subpartitions to user;

It was a bit confusing for me, because “select on dba_tab_partitions” and “select on dba_tab_subpartitions” were not mentioned in the manual. Anyway you can find all the queries in the sources so in case of exception you can browse Java files to figure out which grant you are still missing.

After setting things up I did a little test. I tired to load sample data bypassing buffer cache (using /* +append_values */ Oracle hint). Although It doesn’t have to reduce the time of loading as only one session can write to table in such mode, in my scenario it was indeed faster. Besides it’s a good way to reduce redo log generation.

Standard Sqoop loading gave the following performance:

14/12/01 22:54:08 INFO mapreduce.ExportJobBase: Transferred 407.5772 MB in 217.8112 seconds (1.8712 MB/sec)
14/12/01 22:54:08 INFO mapreduce.ExportJobBase: Exported 2285493 records.

Whereas with Oralce connector (with append hint set):

-Doraoop.oracle.append.values.hint.usage=ON

it was much better:

14/12/01 23:06:52 INFO mapreduce.ExportJobBase: Transferred 407.5772 MB in 50.1248 seconds (8.1313 MB/sec)
14/12/01 23:06:52 INFO mapreduce.ExportJobBase: Exported 2285493 records.

Of course this performance gain comes at its price — no recovery in case of crash.

See this link for more information on Oracle connector.

Leave a Reply