The R programming language uses Basic Linear Algebra Subprograms (BLAS) for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. R ships with the reference Netlib BLAS by default, and significant performance gains can be achieved by swapping in a faster BLAS library such as OpenBLAS or ATLAS.
Further gains are possible by intercepting certain calls to BLAS with NVIDIA’s NVBLAS. Operations that can benefit from running on a GPU will be automatically redirected to cuBLAS without any modification to your R code.
Begin by installing the GPU driver for your system from NVIDIA’s website, then install the CUDA Toolkit. Run nvidia-smi and you should see output like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 00000000:01:00.0  On |                  N/A |
| 31%   42C    P8    26W / 250W |     33MiB /  6081MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
If you only have one GPU, or multiple GPUs of exactly the same type, this basic configuration file is all you need. I have chosen OpenBLAS for operations that cannot benefit from the GPU.
echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
NVBLAS_GPU_LIST ALL" > /etc/nvblas.conf
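If you do have GPUs of mixed types, NVBLAS_GPU_LIST can name specific devices instead of ALL. A variant of the config restricted to two devices might look like this (the device numbers match the first column of nvidia-smi; these particular values are illustrative):

```
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
NVBLAS_GPU_LIST 0 1
```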
You can now run R on the command line with GPU-accelerated BLAS like this:
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf R
For convenience you can make a wrapper to automatically enable NVBLAS whenever you run R:
mv /usr/local/bin/R /usr/local/bin/R_
echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/local/bin/R_ "$@"' > /usr/local/bin/R
chmod +x /usr/local/bin/R
The location of R on your system might be different from mine. Find yours with which and adjust the commands above accordingly:
which R
Do the same for Rscript:
mv /usr/local/bin/Rscript /usr/local/bin/Rscript_
echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/local/bin/Rscript_ "$@"' > /usr/local/bin/Rscript
chmod +x /usr/local/bin/Rscript
R Studio:
mv /usr/lib/rstudio/bin/rsession /usr/lib/rstudio/bin/rsession_
echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/lib/rstudio/bin/rsession_ "$@"' > /usr/lib/rstudio/bin/rsession
chmod +x /usr/lib/rstudio/bin/rsession
R Studio Server:
mv /usr/lib/rstudio-server/bin/rsession /usr/lib/rstudio-server/bin/rsession_
echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/lib/rstudio-server/bin/rsession_ "$@"' > /usr/lib/rstudio-server/bin/rsession
chmod +x /usr/lib/rstudio-server/bin/rsession
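The four wrappers above all follow the same pattern, so if you maintain several machines it can be captured in a small helper function (a sketch; the CUDA path is the one used throughout this post and may differ on your system). The example below wraps a disposable copy of /bin/echo so nothing real is touched:

```shell
# wrap_with_nvblas: rename a binary to <name>_ and install an NVBLAS
# wrapper script in its place (sketch; library and config paths are
# the ones assumed in this post).
wrap_with_nvblas() {
  bin="$1"
  mv "$bin" "${bin}_"
  printf '#!/bin/sh\nLD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf %s_ "$@"\n' "$bin" > "$bin"
  chmod +x "$bin"
}

# Try it on a throwaway copy of /bin/echo:
cp /bin/echo /tmp/demo
wrap_with_nvblas /tmp/demo
/tmp/demo hello 2>/dev/null   # prints: hello (forwarded via /tmp/demo_)
```

Without libnvblas.so actually installed, the dynamic linker prints a preload warning to stderr but still runs the wrapped program, which is why the demo redirects stderr.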
You can compare the performance in R with this matrix multiplication example:
system.time(matrix(rnorm(4096*4096), nrow=4096, ncol=4096) %*% matrix(rnorm(4096*4096), nrow=4096, ncol=4096))
On a 4-core 3.2GHz CPU with one NVIDIA GTX Titan GPU, Netlib BLAS completed the example in 121.455s. Recompiling R with OpenBLAS reduced that to 16.486s. NVBLAS cut it down to 5.244s.