Hive Virtual Columns

Like many other databases, Hive provides virtual columns – columns that are not stored in the datafiles but can be accessed in queries. Usually they provide some metadata that can be very handy.

In this post I will describe what virtual columns are available in Hive 1.0 and show how it works on several examples.

Sample data set

Let’s generate some data first. In our table we will store only one column that has one character and there will be 200,000,000 records. I wanted to have the records as shortest as possible. Our data file will occupy 400,000,000 bytes so that are 3 HDFS blocks.

The data was generated in the following way:

$ cat /dev/urandom | tr -cd '0-9a-zA-Z' | fold -b -w 1 | head -n 200000000 > one_char.txt

The file content looks like that:

$ head one_char.txt
c
U
9
Z
k
C
W
0
m
i

Now, let’s create few Hive tables with different formats. Basic table is defined in the following way:

CREATE TABLE jp_one_char(c string) 
ROW FORMAT DELIMITED;

After loading the file generated previously let’s populate the parquet table:

CREATE TABLE jp_one_char_parquet(c string) 
STOREAD AS PARQUET;

INSERT INTO TABLE jp_one_char_parquet 
SELECT * FROM jp_one_char;

ORC table:

CREATE TABLE jp_one_char_orc(c string) 
STORED AS ORC;

INSERT INTO TABLE jp_one_char_orc
SELECT * FROM jp_one_cha

and sequence file table:

CREATE TABLE jp_one_char_seq(c string) 
STORED AS SEQUENCEFILE;

INSERT INTO TABLE jp_one_char_seq
SELECT * FROM jp_one_char

Virtual columns

Please not that the virtual columns contains double underscore (“_ _”) as word separator in their names.

INPUT__FILE__NAME – this is pretty straight-forward. When displaying this column you will see the file on HDFS that contains the data. This can be useful, when debugging – you can find the file that contains some strange data, for example. But it may also be helpful in finding file location for external table or partition (otherwise you would have to examine table/partition properties: DESCRIBE EXTENDED … PARTITION(…).

BLOCK__OFFSET__INSIDE__FILE – according to documentation it should get the offset of the data block. It seems however that this behavior depends on the file format.

For example in text file (compressed or not) this columns display the offset of the record in the file.

> SELECT BLOCK__OFFSET__INSIDE__FILE, *
> FROM jp_one_char
> LIMIT 20;
+------------------------------+----------------+--+
| block__offset__inside__file  | jp_one_char.c  |
+------------------------------+----------------+--+
| 0                            | c              |
| 2                            | U              |
| 4                            | 9              |
| 6                            | Z              |
| 8                            | k              |
| 10                           | C              |
| 12                           | W              |
| 14                           | 0              |
| 16                           | m              |
| 18                           | i              |
| 20                           | 5              |
| 22                           | z              |
| 24                           | T              |
| 26                           | Z              |
| 28                           | P              |
| 30                           | 3              |
| 32                           | 2              |
| 34                           | s              |
| 36                           | 4              |
| 38                           | e              |
+------------------------------+----------------+--+

Each row occupies 2 bytes, which is correct (one char + new line). So, as you can see this column holds the offset of the row. Let’s make sure that there are no duplicates within the file:

> SELECT offset, count(*)
> from (
>   SELECT BLOCK__OFFSET__INSIDE__FILE offset, *
>   FROM jp_one_char
> ) t
> GROUP BY offset
> HAVING COUNT(*) >1;

...

+---------+------+--+
| offset  | _c1  |
+---------+------+--+
+---------+------+--+
No rows selected (693.016 seconds)

No duplicates, so this is unique within a file, as expected That’s nice, because we can use it to create row IDs.

But you have to be careful, as it depends on file format. In parquet, ORC and sequence file BLOCK__OFFSET__INSIDE__FILE is not unique within file and it’s hard to determine what it actually means.

ROW__OFFSET__INSIDE__BLOCK – To access this column you have to enable it first by setting:

set hive.exec.rowoffset=true;

but it always produced 0, anyway.

GROUPING__ID – It can be helpful when dealing with GROUP BY … WITH CUBE/ROLLUP queries. This column provides binary value of grouping set (0 if given grouping column is aggregated together, 1 if grouping column has value).

Let’s create another sample table:

> select * from jp_grouping_sample;
+------------------------+------------------------+------------------------+---------------------------+--+
| jp_grouping_sample.c1  | jp_grouping_sample.v1  | jp_grouping_sample.v2  | jp_grouping_sample.value  |
+------------------------+------------------------+------------------------+---------------------------+--+
| a                      | 1                      | 1                      | 1234                      |
| a                      | 2                      | 1                      | 404                       |
| a                      | 2                      | 1                      | 40                        |
| b                      | 1                      | 1                      | 0                         |
| b                      | 1                      | 2                      | 10                        |
+------------------------+------------------------+------------------------+---------------------------+--+
5 rows selected (0.768 seconds)

When we GROUP BY WITH CUBE, then we can use GROUPING__ID to have binary value of columns that were used:

> SELECT GROUPING__ID, c1, v1, v2,  count(*)
> FROM jp_grouping_sample
> GROUP BY c1, v1, v2 WITH CUBE;

+---------------+-------+-------+-------+------+--+
| grouping__id  |  c1   |  v1   |  v2   | _c4  |
+---------------+-------+-------+-------+------+--+
| 0             | NULL  | NULL  | NULL  | 5    |
| 4             | NULL  | NULL  | 1     | 4    |
| 4             | NULL  | NULL  | 2     | 1    |
| 2             | NULL  | 1     | NULL  | 3    |
| 6             | NULL  | 1     | 1     | 2    |
| 6             | NULL  | 1     | 2     | 1    |
| 2             | NULL  | 2     | NULL  | 2    |
| 6             | NULL  | 2     | 1     | 2    |
| 1             | a     | NULL  | NULL  | 3    |
| 5             | a     | NULL  | 1     | 3    |
| 3             | a     | 1     | NULL  | 1    |
| 7             | a     | 1     | 1     | 1    |
| 3             | a     | 2     | NULL  | 2    |
| 7             | a     | 2     | 1     | 2    |
| 1             | b     | NULL  | NULL  | 2    |
| 5             | b     | NULL  | 1     | 1    |
| 5             | b     | NULL  | 2     | 1    |
| 3             | b     | 1     | NULL  | 2    |
| 7             | b     | 1     | 1     | 1    |
| 7             | b     | 1     | 2     | 1    |
+---------------+-------+-------+-------+------+--+
20 rows selected (41.555 seconds)

For example, last row has 7 in GROUPING__ID, because all 3 grouping columns have some values (b,1,2) so 2^0+2^1+2^2 = 7.

When browsing Hive sources I noticed two other virtual columns, but it was hard to find any information on the web regarding them:

ROW__ID – This one is interesting. It gives NULL in all tables except for table in ORC format:

> SELECT ROW__ID, * FROM jp_one_char_orc LIMIT 5;
+---------------------------------------------+--------------------+--+
|                   row__id                   | jp_one_char_orc.c  |
+---------------------------------------------+--------------------+--+
| {"transactionid":0,"bucketid":0,"rowid":0}  | w                  |
| {"transactionid":0,"bucketid":0,"rowid":1}  | k                  |
| {"transactionid":0,"bucketid":0,"rowid":2}  | Y                  |
| {"transactionid":0,"bucketid":0,"rowid":3}  | 2                  |
| {"transactionid":0,"bucketid":0,"rowid":4}  | V                  |
+---------------------------------------------+--------------------+--+
5 rows selected (0.691 seconds)

In ORC table ROW__ID returns structure with three values: transactionid, bucketid and rowid. I couldn’t find much infomation on those values, but it seems that bucketid is number of HDFS block, and rowid is unique ID, but within a HDFS block.

RAW__DATA__SIZE – This one actually was not visible by Hive (1.0). Maybe still not implemented or some additional property has to be set?

References:

  • Hive Documentation on Virtual Columns;
  • Programming Hive, Edward Capriolo, Dean Vampler, Jason Rutherglen;
  • Apache Hive