Hive gotchas – Order By

There is this one feature in Hive that I really hate: ORDER BY col_index.

Historically order by clause accepted only column aliases, like in the simple example below:

select id, name from people order by name;

+------------+--------------+--+
| people.id  | people.name  |
+------------+--------------+--+
| 5          | Jimmy        |
| 2          | John         |
| 1          | Kate         |
| 4          | Mike         |
| 3          | Sam          |
+------------+--------------+--+

In other relational databases it is possible to give not only column alias but also column index, It much simpler to say “column 3” rather than typing whole name or alias. This option was not supported in Hive at the beginning, but community noticed that and a ticket was created.

Since Hive 0.11.0 it is possible to order the result by column index as well, however there is a gotcha here. There is a property that enables this new option: hive.groupby.orderby.position.alias must be set to ‘true’. The problem is that by default it is set to ‘false’ and in that case you can still use numbers in order by clause, but they are interpreted literally (as numbers) not as column index, which is rather strange.

So for example in the any modern Hive version where you do something like that:

select id, name from people order by 2;

+------------+--------------+--+
| people.id  | people.name  |
+------------+--------------+--+
| 1          | Kate         |
| 2          | John         |
| 3          | Sam          |
| 4          | Mike         |
| 5          | Jimmy        |
+------------+--------------+--+

As you can see by default it was interpreted as “value 2”, not the “column number 2”. After enabling the option you can change how the order by works:

set hive.groupby.orderby.position.alias=true;
select id, name from people order by 2;

+-----+--------+--+
| id  |  name  |
+-----+--------+--+
| 5   | Jimmy  |
| 2   | John   |
| 1   | Kate   |
| 4   | Mike   |
| 3   | Sam    |
+-----+--------+--+

So this time after enabling option we can use column number to sort by name. The problem is that whenever you work in Hive you have to think if the hive.groupby.orderby.position.alias was enabled in current session or not. This makes rather impractical and limits the usage of this syntactical sugar. Moreover I cannot really see any use case for using order by <value>

References

Hive Order By – https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy