Custom HDFS block size

HDFS stores files split into blocks. By default blocks are 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later; production systems often use blocks larger than the default. The setting is controlled by the dfs.blocksize property (dfs.block.size is the older, deprecated name), usually defined in the hdfs-site.xml configuration file.
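
To see which default a client would actually pick up, the hdfs getconf tool can print a single configuration key. The following is a minimal sketch, assuming the hdfs client is on the PATH; show_default_blocksize is a hypothetical helper name used only for illustration.

```shell
# Print the effective dfs.blocksize (in bytes) as seen by the local Hadoop client,
# or a short note when no hdfs command is available on this machine.
show_default_blocksize() {
  if command -v hdfs >/dev/null 2>&1; then
    hdfs getconf -confKey dfs.blocksize
  else
    echo "hdfs client not found on PATH"
  fi
}

show_default_blocksize
```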

It may be surprising that the block size setting can be overridden when running a Hadoop application. For example, when creating a new file you can specify a block size different from the system-wide default:

$ hdfs dfs -D dfs.blocksize=10m -put file.txt /user/kuba/
$ hadoop fsck /user/kuba/file.txt
 Total blocks (validated): 19 (avg. block size 10313284 B)
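
The block count reported by fsck can be sanity-checked with a ceiling division of the file size by the block size. The sketch below assumes the file size is roughly 19 times the reported average (the exact size is not shown above), and expected_blocks is a hypothetical helper, not part of any Hadoop tool.

```shell
# expected_blocks FILE_BYTES BLOCK_BYTES
# Ceiling division with integer arithmetic: the last block may be partially filled.
expected_blocks() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# The "10m" suffix means 10 * 1024 * 1024 bytes.
block_bytes=$((10 * 1024 * 1024))

# Approximate file size inferred from the fsck line: 19 blocks * 10313284 B on average.
file_bytes=$((19 * 10313284))

expected_blocks "$file_bytes" "$block_bytes"   # prints 19
```

The average block size is slightly below 10 MB because the final block holds only the remainder of the file.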

Of course, this applies not only to the console HDFS tools. It's perfectly fine to create a table in Hive whose data is written in custom-sized HDFS blocks:

hive> set dfs.blocksize=300m;

hive> create table test_table_small_block_size(<schema...>);

hive> insert into table test_table_small_block_size select ... from other_tables;

Some limitations

dfs.blocksize must be a multiple of dfs.bytes-per-checksum, which defaults to 512 bytes.

There is a system-wide minimum block size defined by dfs.namenode.fs-limits.min-block-size (1048576 bytes, i.e. 1 MB, by default); any custom block size must be at least this value.
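
Both limits can be checked locally before a value is passed to -D dfs.blocksize. This is a rough sketch that hard-codes the default limits quoted above (a real cluster may override them) and assumes the NameNode rejects only sizes below the minimum; check_blocksize is a hypothetical helper name.

```shell
# Assumed defaults; a cluster may configure different values.
BYTES_PER_CHECKSUM=512     # dfs.bytes-per-checksum
MIN_BLOCK_SIZE=1048576     # dfs.namenode.fs-limits.min-block-size

# check_blocksize BYTES
# Prints "ok" and returns 0 when the size passes both checks;
# otherwise prints the reason and returns 1.
check_blocksize() {
  size=$1
  if [ "$size" -lt "$MIN_BLOCK_SIZE" ]; then
    echo "too small: below $MIN_BLOCK_SIZE bytes"
    return 1
  fi
  if [ $((size % BYTES_PER_CHECKSUM)) -ne 0 ]; then
    echo "not a multiple of $BYTES_PER_CHECKSUM bytes"
    return 1
  fi
  echo "ok"
}

check_blocksize 10485760           # 10 MB: passes both checks
check_blocksize 524288 || true     # 512 KB: below the minimum
check_blocksize 1048577 || true    # 1 MB + 1 byte: not checksum-aligned
```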