I have found a very useful feature in Pig that I didn’t know about. Imagine you have the following input data:
input_data/pagecounts-2014-01-01.bz2
input_data/pagecounts-2014-01-02.bz2
input_data/pagecounts-2014-01-03.bz2
Each file contains the logs from one day, but the date appears only in the file name; it doesn’t exist anywhere in the content of the file. When you need to process all the files at once and keep the date, there is one handy option in the default PigStorage loader:
rows = LOAD 'input_data' USING PigStorage(' ','-tagFile');
The -tagFile option adds an extra column ($0) containing the name of the source file for each tuple.
DUMP rows;
...
(pagecounts-2014-01-01.bz2,abc,123)
(pagecounts-2014-01-01.bz2,def,234)
(pagecounts-2014-01-01.bz2,c,45)
(pagecounts-2014-01-02.bz2,abc,123)
(pagecounts-2014-01-02.bz2,def,234)
(pagecounts-2014-01-02.bz2,c,45)
(pagecounts-2014-01-03.bz2,abc,123)
(pagecounts-2014-01-03.bz2,def,234)
(pagecounts-2014-01-03.bz2,c,45)
When you specify a schema in the LOAD statement, make sure you add an extra column for the file name. Otherwise the file name will replace the value of the first column in the schema definition.
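As a rough sketch of how this looks in practice, the statements below declare the file name as the first field of the schema and then pull the date out of it. The field names filename, page and hits, and their types, are assumptions made for illustration, guessed from the two-column sample output above. REGEX_EXTRACT is one of Pig's built-in functions.

```pig
-- filename must come first: -tagFile prepends it to every tuple.
-- page and hits are hypothetical names for the two data columns.
rows = LOAD 'input_data' USING PigStorage(' ', '-tagFile')
       AS (filename:chararray, page:chararray, hits:int);

-- Capture group 1 is the yyyy-MM-dd part of a name such as
-- pagecounts-2014-01-01.bz2
dated = FOREACH rows GENERATE
            REGEX_EXTRACT(filename, 'pagecounts-([0-9]{4}-[0-9]{2}-[0-9]{2}).*', 1) AS log_date,
            page,
            hits;
```

With the date in its own column, the relation can be grouped or filtered by day like any other field.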
In previous versions of Pig (prior to 0.12) this option was named -tagsource.
Besides -tagFile there is also the very useful -tagPath option, which adds the whole path including the file name. This makes the job much easier when you are processing, for example, a partitioned Hive table with the partition keys embedded in the directory names.
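A minimal sketch of the partitioned-table case might look like the following. Everything specific here is an assumption: the table location /warehouse/logs, the partition key dt, the tab delimiter, and the column names page and hits are all hypothetical and should be adjusted to your own table.

```pig
-- Hypothetical layout: /warehouse/logs/dt=2014-01-01/000000_0, ...
-- path must come first: -tagPath prepends it to every tuple.
rows = LOAD '/warehouse/logs/dt=*' USING PigStorage('\t', '-tagPath')
       AS (path:chararray, page:chararray, hits:int);

-- The partition value is whatever follows "dt=" up to the next slash
by_partition = FOREACH rows GENERATE
                   REGEX_EXTRACT(path, '.*/dt=([^/]+)/.*', 1) AS dt,
                   page,
                   hits;
```

The same pattern extends to several partition keys by adding one REGEX_EXTRACT call per key.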
The full documentation is here.