GPU computing

Nowadays GPU computing is a standard approach in deep learning. Even low-end graphics cards can give a significant performance boost in arithmetic operations. In this post I check how much faster you can train a neural network with GPU computing.

GPU performance

First, I used the Caffe deep learning framework. It comes with several well-known benchmarks that let you test some of its functionality and check how your hardware performs.

On the classic MNIST LeNet example, training was almost 5x faster on the GPU:

[Figure: CPU vs. GPU on Caffe]
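
Comparisons like this come down to wall-clock time for the same training run under two configurations. The sketch below is one simple way to collect such a number; it is not necessarily how the timings in this post were taken. The helper name time_command is hypothetical, and the command being timed is a placeholder that you would replace with your framework's own training command.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


# Placeholder command (assumption): a Python process that does nothing.
# Substitute the CPU or GPU training command you want to measure.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.2f} s")
```

Running the same command once per configuration and dividing the two results gives the speedup factor.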

Next, I tried cxxnet. There were some problems compiling it with CUDA support turned on. The error message from build.sh was:

nvcc -c -o updater_gpu.o --use_fast_math -g -O3 -ccbin g++  -Xcompiler "-DMSHADOW_FORCE_STREAM -Wall -g -O3 -I./mshadow/  -fPIC -msse3 -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_DIST_PS=0 -DCXXNET_USE_OPENCV=1 -DCXXNET_USE_OPENCV_DECODER=1 " src/updater/updater_impl.cu
src/updater/./sgd_updater-inl.hpp(18): error: expected an identifier

1 error detected in the compilation of "/tmp/tmpxft_000021a9_00000000-6_updater_impl.cpp1.ii".
make: *** [updater_gpu.o] Error 2

I had to remove the std:: prefix from the isnan call on line 18 of the src/updater/sgd_updater-inl.hpp file. The original line was:

if (std::isnan(a)) return 0.0f;

After this fix the build succeeded. I again ran an MNIST-based example, this time training a convolutional neural network, using the cxxnet/example/MNIST/MNIST_CONV.conf configuration file.

This time training on the GPU was over 20x faster than on the CPU (with BLAS).

[Figure: CPU vs. GPU on cxxnet]

Libraries

It’s worth mentioning some other factors that can influence overall performance. When building deep learning frameworks you can often choose which linear algebra library they will use. The usual options are ATLAS, BLAS/OpenBLAS or MKL. This choice can matter a lot when you work without GPU computing.

In my experiments BLAS/OpenBLAS performed much better than the default ATLAS option. In fact, the OpenBLAS Wikipedia page says that its performance is similar to Intel's proprietary MKL, which I have not tested.

  • In the Caffe scenario the BLAS build was about 1.5 times faster than the ATLAS build (1161 s vs. 1765 s)
  • In cxxnet the difference was smaller: 806 s vs. 907 s (again BLAS was faster)
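
The ratios behind these two bullets can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the smaller time in each pair belongs to the BLAS build, as the text says; the function name speedup is just an illustrative choice.

```python
# Training times in seconds, taken from the two bullets above.
# Assumption: the smaller time in each pair is the BLAS build.
timings = {
    "Caffe": {"ATLAS": 1765.0, "BLAS": 1161.0},
    "cxxnet": {"ATLAS": 907.0, "BLAS": 806.0},
}


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


ratios = {name: speedup(t["ATLAS"], t["BLAS"]) for name, t in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: BLAS is {ratio:.2f}x faster than ATLAS")
# Caffe: BLAS is 1.52x faster than ATLAS
# cxxnet: BLAS is 1.13x faster than ATLAS
```

So the library choice saved roughly a third of the training time in Caffe and about a tenth in cxxnet, both far smaller than the gain from moving to the GPU.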

Hardware

The tests were run on a system with two dual-core Intel(R) Xeon(R) 5140 CPUs @ 2.33GHz and 10GB RAM. The GPU was a GeForce GTX 460 card (bought used for about 50-60 USD).

References and useful links

I mentioned the Caffe and cxxnet deep learning frameworks earlier. Their documentation and tutorials are helpful, and I didn’t have many problems configuring and running them. Many dependencies were missing during compilation, but everything could be installed from the standard Ubuntu repository.

Installing the Nvidia CUDA library on Ubuntu was straightforward.

There is a great blog by Tim Dettmers about deep learning, especially from a hardware perspective (this post, for example). It seems very helpful if you are considering investing in hardware.