I guess it can be a surprise for quite a few people that the schema of partitioned tables in Hive is defined also on partition level. Let’s have a look at this example.
Table schema
In Hive you can change the schema of an existing table. Let’s say you have a table:
create table sample_table ( ts string, value int ) row format delimited fields terminated by '\t';
We will focus on the second column named value. We load few records to this table. The file looks like this:
2015-01-01 10:01:23 140.23 2015-01-01 10:01:23 70.5 2015-01-01 10:01:23 123
The second column has some decimal values, but we have defined this column as integer, so the we won’t see the decimal part:
hive> select * from sample_table; OK 2015-01-01 10:01:23 140 2015-01-01 10:01:23 70 2015-01-01 10:01:23 123 Time taken: 0.103 seconds, Fetched: 3 row(s)
We have noticed our mistake, so let’s change the schema of the table by replacing int to decimal(10,2):
hive> alter table sample_table change value value decimal(10,2); OK Time taken: 0.29 seconds
Now we can see the data, as expected:
hive> select * from sample_table; OK 2015-01-01 10:01:23 140.23 2015-01-01 10:01:23 70.5 2015-01-01 10:01:23 123 Time taken: 0.099 seconds, Fetched: 3 row(s)
So everything works fine, but it’s get more complicated with partitioned tables.
Partitioned tables
Let’s say our table has partitions:
create table partitioned_table( ts string, value int ) partitioned by ( v_part string) row format delimited fields terminated by '\t';
Let’s load the same data as previously to partition part1:
hive -e "load data local inpath 'part1.txt' into table partitioned_table partition (v_part = 'part1');"
Selecting from the table shows int values — as previously:
hive> select * from partitioned_table; OK 2015-01-01 10:01:23 140 part1 2015-01-01 10:01:23 70 part1 2015-01-01 10:01:23 123 part1 Time taken: 0.145 seconds, Fetched: 3 row(s)
We have noticed our mistake and we change the schema of that table:
hive> alter table partitioned_table change value value decimal(10,2); OK Time taken: 0.288 seconds
So far, so good:
hive> desc partitioned_table; OK ts string value decimal(10,2) v_part string # Partition Information # col_name data_type comment v_part string Time taken: 0.224 seconds, Fetched: 8 row(s)
However when we select that data, we get int values again (as if the schema didn’t change):
hive> select * from partitioned_table; OK 2015-01-01 10:01:23 140 part1 2015-01-01 10:01:23 70 part1 2015-01-01 10:01:23 123 part1 Time taken: 0.135 seconds, Fetched: 3 row(s)
Partition schema
This can be a bit counter-intuitive. The schema for partitioned table is defined also on partition level. So altering the schema will affect only newly created partitions. For existing partition old schema will be used (that one which was in use during creation of that partition).
Loading the same file to another partition will show the difference:
hive -e "load data local inpath 'part1.txt' into table partitioned_table partition (v_part = 'part2');"
And now we can see that second column (named value) in part1 partition is converted into int (part1) whereas in partition part2 is displayed with precision decimal(10,2):
hive> select * from partitioned_table; OK 2015-01-01 10:01:23 140 part1 2015-01-01 10:01:23 70 part1 2015-01-01 10:01:23 123 part1 2015-01-01 10:01:23 140.23 part2 2015-01-01 10:01:23 70.5 part2 2015-01-01 10:01:23 123 part2 Time taken: 0.131 seconds, Fetched: 6 row(s)
It doesn’t mean that the partition schemas are totally independent. For example having string column in partition and int on table-level will cause we get NULL for all those string values that couldn’t be parsed as integers. At the beginning the values are read and interpreted according to partition schema and then they are converted according to the table schema.
How to deal with it?
In Hive 0.14 there is a way to change the partition schema:
alter table ... partition(...) change colname colname type;
I didn’t have a chance to test it, but according to documentation it should work.
I haven’t found any way to change the partition schema in releases prior to 0.14 (in particular 0.13), so it seems the only way to fix it is to recreate the table.