I spend most of my time at work developing and maintaining processing on Hadoop, and I’ve become interested in learning how different configurations impact performance. Some tests, like those related to compression and data format, can easily be performed on production systems, but in many other cases you need full control over the cluster.
As a summer hobby project, I decided to set up my own little Hadoop cluster. I hope it will be useful for testing various configurations and, of course, a lot of fun.
After some quick market research, I found some old but still good workstations:
Each has two dual-core Intel Xeon processors:
$ cat /proc/cpuinfo
...
model name : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
cache size : 4096 KB
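Each machine currently has only 4GB of RAM installed, though it can be upgraded to 16GB. Unfortunately, all four slots are already occupied (4x1GB), so an upgrade means replacing the existing modules rather than adding new ones.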
$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.9G       1.5G       2.3G       648K        18M       322M
-/+ buffers/cache:       1.2G       2.7G
Swap:          14G         0B        14G
I chose 2TB hard drives. Initially I will put one drive in each node, but the four SATA ports leave room to extend the storage later.
$ sudo lshw -class disk
...
product: ST2000DM001-1CH1
vendor: Seagate
...
size: 1863GiB (2TB)
Of course, the number of nodes won’t be impressive either. For now, I will start with just 4 nodes.
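For a rough sense of what this adds up to, here is a back-of-the-envelope sketch in Python. It assumes all four nodes are identical to the one shown above. The `Node` class and `cluster_totals` function are names made up for this illustration. The replication factor of 3 is the HDFS default rather than a setting I have chosen yet, and the usable figure ignores space taken by the operating system and temporary data. The maximum raw storage assumes every SATA port ends up holding a 2TB data drive, which may not be how things turn out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Per-node hardware, taken from the figures quoted in this post."""
    sockets: int = 2           # two Xeon 5140 processors per workstation
    cores_per_socket: int = 2  # each processor is dual core
    ram_gb: int = 4            # currently installed (4x1GB)
    max_ram_gb: int = 16       # upgrade ceiling
    disks: int = 1             # one drive per node to start with
    max_disks: int = 4         # four SATA ports (assumes all hold data drives)
    disk_tb: float = 2.0       # 2TB drives


def cluster_totals(node: Node, nodes: int, replication: int = 3) -> dict:
    """Aggregate resources for `nodes` identical machines.

    `replication` defaults to 3, the HDFS default; usable space is simply
    raw space divided by the replication factor (an approximation).
    """
    raw_tb = nodes * node.disks * node.disk_tb
    return {
        "cores": nodes * node.sockets * node.cores_per_socket,
        "ram_gb": nodes * node.ram_gb,
        "max_ram_gb": nodes * node.max_ram_gb,
        "raw_tb": raw_tb,
        "max_raw_tb": nodes * node.max_disks * node.disk_tb,
        "usable_tb": round(raw_tb / replication, 2),
    }


if __name__ == "__main__":
    for key, value in cluster_totals(Node(), nodes=4).items():
        print(f"{key}: {value}")
```

With the starting configuration this works out to 16 cores, 16GB of RAM and 8TB of raw disk across the cluster, or roughly 2.67TB of usable HDFS space at three-way replication.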