Hive partition filtering

Some time ago I found one pitfall related to partition filtering and user defined functions.

Let’s assume we have partitioned table that looks like this:

CREATE TABLE pageviews(
... 
) PARTITIONED BY (v_date STRING)
... /other properties/
;

Now let’s assume we usually access yesterday data, so for convenience we decided to create view that contains only page views from yesterday partition:

CREATE VIEW yesterday_pageviews AS
SELECT * FROM pageviews
WHERE v_date = date_add(from_utc_timestamp(unix_timestamp()*1000,'UTC'),-1)

You can check that the expression used in where clause works as intended. It always returns the current date minus 1 day.

hive> SELECT date_add(from_utc_timestamp(unix_timestamp()*1000,'UTC'),-1);
OK
2015-06-16
Time taken: 0.422 seconds, Fetched: 1 row(s)

This view (yesterday_pageviews) returns correct records, but the problem is that it’s very slow. As you can see in the explain plan the amount of data suggest that the whole table is read instead of single partition.

> EXPLAIN select count(*) from yesterday_pageviews;
...
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_table
            Statistics: Num rows: 505000 Data size: 50500000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (v_date = date_add(Converting field (unix_timestamp() * 1000) from UTC to timezone: 'UTC', -1)) (type: boolean)
              Statistics: Num rows: 252500 Data size: 25250000 Basic stats: COMPLETE Column stats: NONE
              Select Operator

As you can see, where clause where moved down to filter operator, and ~50MB is the total table size:

$ hdfs dfs -du -s /apps/hive/warehouse/test_table/
50500000  /apps/hive/warehouse/test_table

Deterministic functions

The root cause for this problem is unix_timestamp() function which is declared as not deterministic. See the source on github:

@UDFType(deterministic = false)
@Description(name = "unix_timestamp",
 value = "_FUNC_([date[, pattern]]) - Returns the UNIX timestamp",
 extended = "Converts the current or specified time to number of seconds "
 + "since 1970-01-01.")
public class GenericUDFUnixTimeStamp extends GenericUDFToUnixTimeStamp {
...

Deterministic=true means that each call (within one query execution) should return the same value. Because unix_timestamp is not deterministic Hive cannot execute it in advance to get the partition key and it will try to filter row-by-row. This of course causes a full table scan.

Solution

The real problem is that prior to Hive 1.2 you don’t have many options. unix_timestamp() is the only function that gives the current date/time. You can try creating your own UDF defined as deterministic and it should work as expected.

In Hive 1.2 there are two new functions:

  • current_date
  • current_timestmap

and both are declared as deterministic, so they can be used in this scenario. In the explain section below you can see that only one partition is read:

hive> CREATE OR REPLACE VIEW yesterday_pageviews AS 
    > SELECT * FROM pageviews WHERE v_date=date_add(current_date(),-1);

...

hive> EXPLAIN SELECT count(*) FROM yesterday_pageviews;

...

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20150617215157_ebaf02bb-0d27-4429-89fa-7f4292774c4b:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: test_table
                  filterExpr: (v_date = '2015-06-03') (type: boolean)
                  Statistics: Num rows: 1 Data size: 10100000 Basic stats: PARTIAL Column stats: NONE
                  Select Operator
                    Statistics: Num rows: 1 Data size: 10100000 Basic stats: PARTIAL Column stats: NONE
                    Group By Operator

It’s worth mentioning that unix_timestamp behaviour was described in HIVE-10728 issue and this function will become redefined as deterministic (and will be marked as deprecated) in the future releases.