Timestamps in Parquet on Hadoop

One of the main advantages of the open source Hadoop environment is that we are free to choose the tools that make up our Big Data platform. No matter which software distribution you decide to use, most of the time you can freely customise it by adding extra frameworks or upgrading versions on your own. This freedom of choice gives great flexibility, but it can also cause a lot of difficulties when orchestrating the different parts together. In this post I’d like to share some of the problems with handling timestamps in Parquet files.

Timestamp is a commonly used and widely supported data type. You can find it in most frameworks, but it turns out that tools store and interpret it quite differently, which can lead to wrong results or even hours spent debugging your data workflow.

Timestamp in Hive

Hive has supported the TIMESTAMP type since version 0.8. Values are interpreted as timestamps in the local time zone, so the value actually stored in a Parquet file is normalised to UTC [4]. When timestamps are read back from the file, the server’s time zone is applied to the value to give a local timestamp. Of course, this behaviour depends on the file format: the text file format does not involve any conversion to UTC.
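To make the round trip concrete, here is a minimal Python sketch of the behaviour described above. It is only a model, not Hive code: the function names hive_write and hive_read and the fixed UTC+2 server zone are assumptions made for the illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical server time zone for the example: a fixed UTC+2 offset.
SERVER_TZ = timezone(timedelta(hours=2))


def hive_write(local_value: datetime, server_tz: timezone) -> datetime:
    """Treat a wall-clock value as server-local and return the UTC value to store."""
    aware = local_value.replace(tzinfo=server_tz)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


def hive_read(stored_value: datetime, server_tz: timezone) -> datetime:
    """Treat the stored value as UTC and return it in the reader's local zone."""
    aware = stored_value.replace(tzinfo=timezone.utc)
    return aware.astimezone(server_tz).replace(tzinfo=None)


wall_clock = datetime(2017, 8, 1, 12, 0, 0)
stored = hive_write(wall_clock, SERVER_TZ)    # 10:00, the UTC value in the file
same_server = hive_read(stored, SERVER_TZ)    # 12:00, the original wall clock
utc_server = hive_read(stored, timezone.utc)  # 10:00 on a server running in UTC
```

The value comes back unchanged only when the writer and the reader use the same time zone. A reader in a different zone sees a different wall-clock value for the same stored bytes.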

Timestamp in Impala

In Impala timestamps are saved in the local time zone, which is different from Hive. Because Impala-Hive interoperability has historically been very important, there are some workarounds that make the coexistence of these two frameworks possible. The following impalad start-up parameter adds proper handling for timestamps in Hive-generated Parquet files:

convert_legacy_hive_parquet_utc_timestamps=true (default false) [2]

It is worth mentioning that the Parquet file metadata is used to determine whether the file was created by Hive or not. The parquet-tools meta <file> command is helpful for checking the creator of a file.

There is also a Hive option that allows reading Impala’s files. The parameter hive.parquet.timestamp.skip.conversion is set to true by default, which means that Parquet files created by Impala won’t have the time zone conversion applied, because their timestamps are already saved in the local time zone.

Timestamp in Spark

Spark-Hive interoperability is fine. Every time we read a timestamp column we get the correct timestamp. The problem begins when we use Spark to read tables created in Impala. In that case Spark applies the server time zone to a file that already contains local timestamps, and as a result we get different timestamps.

The main problem is that Spark (up to the newest version, 2.2.0) doesn’t provide any special handling for Impala Parquet files. So whenever we have Impala scripts that process data later used in Spark, we need to stay aware of this problem. Keep in mind that there are various Impala and Hive parameters that can influence the timestamp adjustments.
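The mismatch can be sketched in a few lines of Python. Again this is only a model of the behaviour described above, not Spark or Impala code: the function names and the fixed UTC+2 server zone are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical server time zone for the example: a fixed UTC+2 offset.
SERVER_TZ = timezone(timedelta(hours=2))


def impala_write(local_value: datetime) -> datetime:
    """Model of Impala: the local wall-clock value is stored unchanged."""
    return local_value


def read_assuming_utc(stored_value: datetime, server_tz: timezone) -> datetime:
    """Model of a reader that assumes the file holds UTC and shifts to local time."""
    aware = stored_value.replace(tzinfo=timezone.utc)
    return aware.astimezone(server_tz).replace(tzinfo=None)


wall_clock = datetime(2017, 8, 1, 12, 0, 0)
stored = impala_write(wall_clock)                    # 12:00 in the file
seen_by_reader = read_assuming_utc(stored, SERVER_TZ)  # 14:00, off by the zone offset
shift = seen_by_reader - wall_clock                  # 2 hours
```

The error equals the server’s UTC offset, so it is easy to miss on a cluster that runs in UTC and only shows up after moving the job to a machine in another zone.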


There are even more problems if we add Sqoop to the workflow. Sqoop stores timestamps in Parquet as INT64, which makes the imported Parquet file incompatible with Hive and Impala. These two tools will return errors when reading Sqoop’s Parquet files with timestamps. The funny thing is that Spark reads such files correctly, without any problems.

Timestamp in Parquet

Parquet is one of the most popular columnar formats and is supported by most of the processing engines available on Hadoop. Its primitive data types include only BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY [1]. Timestamps are defined as logical types (TIMESTAMP_MILLIS, TIMESTAMP_MICROS) [5], but since Impala stores timestamps with up to nanosecond precision, it was decided to use INT96. Other frameworks followed Impala in using INT96, but compatibility of the time zone interpretation was somehow missed.
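For the curious, here is a small Python sketch of how an INT96 timestamp can be packed and unpacked. It assumes the layout used by Impala and Hive as I understand it: 12 little-endian bytes, first 8 bytes of nanoseconds within the day, then 4 bytes with the Julian day number. The helper names are mine, and precision is cut to microseconds because that is what Python’s datetime supports.

```python
import struct
from datetime import datetime, timedelta

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
EPOCH = datetime(1970, 1, 1)


def encode_int96(value: datetime) -> bytes:
    """Pack a naive datetime into 12 bytes: nanos of day, then Julian day."""
    delta = value - EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    julian_day = JULIAN_DAY_OF_UNIX_EPOCH + delta.days
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes) -> datetime:
    """Unpack 12 bytes into a naive datetime (sub-microsecond digits are dropped)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)


raw = encode_int96(datetime(2017, 8, 1, 12, 0, 0))
decoded = decode_int96(raw)
```

Note that nothing in these 12 bytes says whether the value is UTC or local time. That decision is left to whoever writes and reads the file, which is exactly where the tools disagree.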

In SQL databases

So, how is it done in SQL databases?

In Oracle, for example, we have TIMESTAMP for storing a timestamp without time zone information but with a defined precision, TIMESTAMP WITH TIME ZONE, and TIMESTAMP WITH LOCAL TIME ZONE (where timestamps are stored in the database time zone and converted to the session time zone when returned to the client) [7].

In Postgres we have two options too: TIMESTAMP (without time zone information) and TIMESTAMP WITH TIME ZONE (which is stored as UTC and converted to the local time zone on reading) [8].

It would be helpful to have such a choice on Hadoop.


Because Hadoop is an open ecosystem with multiple independently developed components, it is sometimes possible to find areas where the components are incompatible with each other. Timestamp handling is a good example of such a problem. Although there are some workarounds for this known issue, some cases are still quite hard to detect and overcome. It is also a good example of the typical difficulties of a complex, open environment compared with a product designed and developed by a single vendor.



Working day vs. weekend page views

Two months ago Stack Overflow published an interesting blog post on programming languages and whether people are more likely to ask questions about them during the week or on the weekend. It gives some
overview of how widely languages are used in business (week) and hobby (weekend) projects.

From their analysis we can see that, for example, T-SQL, PowerShell and Oracle are used
during the week, whereas Haskell, assembly and C are used during the weekend.

On Wikipedia…

I was interested in checking the same thing using Wikipedia page view data. Of course, with Wikipedia it will be a bit different. When someone learns a programming language
they don’t usually read about it on Wikipedia, but rather find a tutorial or look for answers on Stack Overflow. In some cases, however, Wikipedia can be the main source of knowledge, especially when someone is looking into theoretical aspects of programming or technology.

I checked several articles from different categories: databases, programming and data science. I used page views of the English Wikipedia since September 2016. For each article I computed the weekend-to-week ratio (average page views during the weekend / average page views during working days).
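The ratio itself is simple to compute. Below is a minimal Python sketch of the calculation. The function name and the sample numbers are made up for the illustration and are not the real Wikipedia data.

```python
from datetime import date
from statistics import mean
from typing import Mapping


def weekend_to_week_ratio(daily_views: Mapping[date, int]) -> float:
    """Average weekend views divided by average working-day views."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    weekend = [views for day, views in daily_views.items() if day.weekday() >= 5]
    working = [views for day, views in daily_views.items() if day.weekday() < 5]
    return mean(weekend) / mean(working)


# Made-up sample: one week starting on Monday, 5 September 2016.
sample = {date(2016, 9, 5 + i): 100 for i in range(5)}  # Monday to Friday
sample[date(2016, 9, 10)] = 50                          # Saturday
sample[date(2016, 9, 11)] = 50                          # Sunday

ratio = weekend_to_week_ratio(sample)  # 0.5 for this sample
```

A ratio below 1 means the article is read more on working days, and a ratio above 1 means it is more of a weekend read.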


The database category shows something interesting. There is a difference even between theoretical concepts: for example, the Slowly changing dimension article is more work-related than the normalisation and normal form definitions. At the other end of the scale there is Blockchain, which is the most ‘weekend’ page in this section.

Data science

In the data science section there is an interesting observation. Deep learning itself and
the various modern frameworks usually related to deep learning/neural networks are
much more ‘weekend’ articles than older machine learning algorithms.


As mentioned above, reading about a programming language on Wikipedia is not really
a sign that the language is used in projects. More likely people check some detail about
it when they hear its name for the first time. Nevertheless there are some interesting facts.
As in the Stack Overflow report, Haskell seems to attract more people during weekends.
On the other hand, it has a similar ratio to Java, so this is probably not the best
indicator of how popular a given language is in business.

Design patterns are more work-related than some theoretical articles related to
functional programming or internals (garbage collection or stack buffer overflow).

Surprisingly, Scala seems to be read more often during working days than the other
languages that I checked.

Hive – Selecting columns with regular expression

In Hive there is a rather unique feature that allows us to select columns by
regular expression instead of by column name.

It’s very useful when we need to select all columns except one. In most of the SQL databases we would have to specify all columns, but in Hive there is this feature that can save us typing.

Let’s say there is a people table with the columns name, age, city, country and created_at. To select all columns except created_at we can write:

set hive.support.quoted.identifiers=none;
select `(created_at)?+.+`
from people
limit 10;

This is equivalent to:

select
    name, age, city, country
from people
limit 10;

Please note that in Hive 0.13 or later you have to set hive.support.quoted.identifiers to none.
I have never seen such functionality in other SQL databases.
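A commonly used form of such an expression is `(created_at)?+.+`, which relies on a possessive quantifier from Java regular expressions: the optional group grabs created_at without giving it back, so nothing is left for the trailing .+ and that one name fails to match. The Python sketch below only models this selection outside Hive. Python’s re module has supported possessive quantifiers only since version 3.11, so the sketch uses a negative lookahead that should accept the same names. The function name is made up for the example.

```python
import re
from typing import List


def select_columns_except(columns: List[str], excluded: str) -> List[str]:
    """Keep every column whose full name is not exactly the excluded one."""
    # Stand-in for the possessive pattern: reject only a full match on the excluded name.
    pattern = re.compile(r"(?!%s$).+" % re.escape(excluded))
    return [name for name in columns if pattern.fullmatch(name)]


people_columns = ["name", "age", "city", "country", "created_at"]
selected = select_columns_except(people_columns, "created_at")
# selected is ["name", "age", "city", "country"]
```

Only an exact match is excluded, so a column such as created_at_utc would still be selected.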



Spark SQL

This is one of the Hive-specific features that are not available in Spark SQL.

Hadoop user name

Some time ago I was looking for this option:

The environment variable HADOOP_USER_NAME lets you specify the username that will be used when connecting to Hadoop, for example to create new HDFS files or access existing data.

Let’s have a look at the short example:

[root@sandbox ~]# echo "New file" | hdfs dfs -put - /tmp/file_as_root
[root@sandbox ~]# export HADOOP_USER_NAME=hdfs
[root@sandbox ~]# echo "New file" | hdfs dfs -put - /tmp/file_as_hdfs
[root@sandbox ~]# hdfs dfs -ls /tmp/file_*
-rw-r--r--   3 hdfs hdfs        154 2016-05-21 08:20 /tmp/file_as_hdfs
-rw-r--r--   3 root hdfs        154 2016-05-21 08:19 /tmp/file_as_root

So the second file (file_as_hdfs) is owned by the hdfs user, because that was the value of the HADOOP_USER_NAME variable.

Of course it works only on a Hadoop cluster without Kerberos, but it’s still very useful in a test environment or on a VM. You can act as many different users without executing sudo commands all the time.
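The same trick can be used from a script without exporting the variable for the whole session. The Python sketch below sets HADOOP_USER_NAME only for a single child process. The helper names are made up for the example, and it assumes the hdfs command is on the PATH of a non-Kerberos cluster.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def env_as_hadoop_user(user: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with HADOOP_USER_NAME set to the given user."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_USER_NAME"] = user
    return env


def hdfs_ls_as(user: str, path: str) -> str:
    """Run 'hdfs dfs -ls <path>' as the given user and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", path],
        env=env_as_hadoop_user(user),  # affects only this child process
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, hdfs_ls_as("hdfs", "/tmp") would list /tmp as the hdfs user, while the calling shell keeps its own identity.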

Hive gotchas – Order By

There is this one feature in Hive that I really hate: ORDER BY col_index.

Historically the order by clause accepted only column aliases, as in the simple example below:

select id, name from people order by name;

| people.id  | people.name  |
| 5          | Jimmy        |
| 2          | John         |
| 1          | Kate         |
| 4          | Mike         |
| 3          | Sam          |

In other relational databases it is possible to give not only a column alias but also a column index. It is much simpler to say “column 3” than to type the whole name or alias. This option was not supported in Hive at the beginning, but the community noticed that and a ticket was created.

Since Hive 0.11.0 it has been possible to order the result by column index as well; however, there is a gotcha here. There is a property that enables this new option: hive.groupby.orderby.position.alias must be set to ‘true’. The problem is that by default it is set to ‘false’, and in that case you can still use numbers in the order by clause, but they are interpreted literally (as numbers), not as a column index, which is rather strange.

So, for example, in any modern Hive version you can do something like this:

select id, name from people order by 2;

| people.id  | people.name  |
| 1          | Kate         |
| 2          | John         |
| 3          | Sam          |
| 4          | Mike         |
| 5          | Jimmy        |

As you can see, by default it was interpreted as “value 2”, not as “column number 2”. After enabling the option you can change how order by works:

set hive.groupby.orderby.position.alias=true;
select id, name from people order by 2;

| id  |  name  |
| 5   | Jimmy  |
| 2   | John   |
| 1   | Kate   |
| 4   | Mike   |
| 3   | Sam    |

So this time, after enabling the option, we can use the column number to sort by name. The problem is that whenever you work in Hive you have to think about whether hive.groupby.orderby.position.alias was enabled in the current session or not. This makes the feature rather impractical and limits the usefulness of this syntactic sugar. Moreover, I cannot really see any use case for order by <value>.
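The two interpretations can be modelled in a few lines of Python. This is only an illustration of the semantics, not of how Hive works internally, and the order_by function is made up for the example. Sorting by a constant gives every row the same key, and because Python’s sort is stable the rows stay in their original order. Hive itself gives no such ordering guarantee for a constant key.

```python
from typing import List, Tuple

Row = Tuple[int, str]

people: List[Row] = [(1, "Kate"), (2, "John"), (3, "Sam"), (4, "Mike"), (5, "Jimmy")]


def order_by(rows: List[Row], number: int, position_alias: bool) -> List[Row]:
    """Sort rows by the column at 1-based position 'number', or by the literal number."""
    if position_alias:
        # The number is a column position, like "order by 2" with the option enabled.
        return sorted(rows, key=lambda row: row[number - 1])
    # The number is a constant: all keys are equal, so nothing gets reordered.
    return sorted(rows, key=lambda row: number)


as_constant = order_by(people, 2, position_alias=False)
as_position = order_by(people, 2, position_alias=True)
```

Here as_constant keeps the rows in their original order, while as_position is sorted by name: Jimmy, John, Kate, Mike, Sam.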


Hive Order By – https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy