Hive table properties

It was a surprise for me to find those two properties – very useful when dealing with large amount of files generated in other systems. I couldn’t find it on Hive docs, but you can come across these settings on forums.

skip.header.line.count tells how many lines from the file should be skipped. Useful when you read CSV files and first file contains header with column names. It works with text file (ROW FORMAT DELIMITED FIELDS TERMINATED BY…) and with CSV Serde. There is also complementary settings that allows to skip footer lines: skip.footer.line.count. The problem is however that Spark does’t recognize those properties so be careful when you plan to read the table later via Spark HiveContex.

Speaking about Hive table properties the following setting may also be very useful.

serialization.null.format is another table property which define how NULL values are encoded in the text file. Here is an example of using “null” string as NULL marker:

CREATE TABLE table_null(
     s1 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('serialization.null.format'='null');

So whenever field will contain string “null” it will be interpreted as NULL value.

Field separator escape

One more useful property that can be used when dealing with text files is delim.escape. This property allows setting custom character that will be used to escpe separator in column values:

CREATE TABLE table_escape(
    s1 STRING,
    s2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('escape.delim'='\\');

(We use backslash for escape delimiter, but it has to be escaped as in Java)

In such case the following data file:

aaaa,bbbb
aaa\,bbb,cc

will be interpreated as:

0: jdbc:hive2://localhost:10000> select * from table_escape
0: jdbc:hive2://localhost:10000> ;
+------------------+------------------+--+
| table_escape.s1  | table_escape.s2  |
+------------------+------------------+--+
| aaaa             | bbbb             |
| aaa,bbb          | cc               |
+------------------+------------------+--+

Custom line breaks

There is also syntax that allows to split records with some character other than new line:

CREATE TABLE table_lines(
    s1 STRING,
    s2 STRING
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '|';

However it is currently not supported (I tried on Hive 1.2.1):

Error: Error while compiling statement: FAILED: SemanticException 5:20 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''|'' (state=42000,code=40000)

Binary formats

Generally speaking, text files should be rather avoided on Hadoop, because binary columnar format usually give better performance. Nevertheless CSV or other plain text files can be found quite often as an input from external systems. In such cases it good to have different formatting options and easily start using Hive in existing ecosystem without too much hassle.

 

 

Apache Drill

Last week I spent some time playing with Apache Drill and I really liked it. It seems that tool had some significant development last year and the coverage of SQL functionality is wide enough to migrate from Hive (at least in many typical cases).

I had a chance to do a short introduction presentation which is available here:

Introduction to Apache Drill

It seems there are still some issues that may block introducing Drill into existing production environments.

First of all it doesn’t support YARN, so it may be a challenge to deploy it alongside existing Hadoop nodes. Missing Kerberos support may also be an issue for some mature production clusters. There are however tickets already created for both of the features, so maybe in the next versions we will enjoy those improvements.