Accelerated R with CUDA on Linux

The R programming language uses Basic Linear Algebra Subprograms (BLAS) for common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. By default, R ships with the reference Netlib BLAS, which is not optimized for modern hardware; significant performance gains can be achieved by replacing it with an optimized BLAS library such as OpenBLAS or ATLAS.
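To check which BLAS and LAPACK libraries your R currently uses, recent versions of R print their paths in sessionInfo(). A quick check from the shell (assuming Rscript is on your PATH):

Rscript -e 'sessionInfo()'

Run it again after switching libraries to confirm the change took effect.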

Further gains are possible by intercepting certain BLAS calls with NVIDIA’s NVBLAS: operations that can benefit from running on a GPU are automatically redirected to cuBLAS, without any modification to your R code.

Begin by installing the GPU driver for your system from NVIDIA’s website, then install the CUDA Toolkit. Run nvidia-smi to confirm the driver is working; you should see output like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN    Off | 00000000:01:00.0  On |                  N/A |
| 31%   42C    P8    26W / 250W |     33MiB /  6081MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

If you only have one GPU, or multiple GPUs of exactly the same type, this basic configuration file is all you need. I have chosen OpenBLAS as the fallback for operations that cannot benefit from the GPU.

echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
NVBLAS_GPU_LIST ALL" > /etc/nvblas.conf
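If your machine mixes GPUs of different types, the NVBLAS documentation lets you list specific device indices in NVBLAS_GPU_LIST instead of ALL. A sketch restricting NVBLAS to device 0 (index as reported by nvidia-smi; adjust the OpenBLAS path for your distribution):

echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
NVBLAS_GPU_LIST 0" > /etc/nvblas.conf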

You can now run R on the command line with GPU-accelerated BLAS like this:

LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf R
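To confirm the GPU is actually being used, a simple sanity check is to watch nvidia-smi in a second terminal while a large multiplication runs through the preloaded NVBLAS (same paths as above; adjust for your system):

watch -n 1 nvidia-smi

LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
Rscript -e 'a <- matrix(rnorm(4096*4096), 4096); invisible(a %*% a)'

GPU-Util should climb above 0% while the multiplication runs, and the nvblas.log named in the config file may appear in the working directory.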

For convenience, you can make a wrapper script that enables NVBLAS automatically whenever you run R:

mv /usr/local/bin/R /usr/local/bin/R_

echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/local/bin/R_ "$@"' > /usr/local/bin/R

chmod +x /usr/local/bin/R

The location of R on your system might be different from mine. Find yours with which and adjust the paths in these commands accordingly:

which R

Do the same for Rscript:

mv /usr/local/bin/Rscript /usr/local/bin/Rscript_

echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/local/bin/Rscript_ "$@"' > /usr/local/bin/Rscript

chmod +x /usr/local/bin/Rscript

For RStudio, wrap the rsession binary in the same way:

mv /usr/lib/rstudio/bin/rsession /usr/lib/rstudio/bin/rsession_

echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/lib/rstudio/bin/rsession_ "$@"' > /usr/lib/rstudio/bin/rsession

chmod +x /usr/lib/rstudio/bin/rsession

For RStudio Server:

mv /usr/lib/rstudio-server/bin/rsession /usr/lib/rstudio-server/bin/rsession_

echo '#!/bin/sh
LD_PRELOAD=/usr/local/cuda-9.0/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /usr/lib/rstudio-server/bin/rsession_ "$@"' > /usr/lib/rstudio-server/bin/rsession

chmod +x /usr/lib/rstudio-server/bin/rsession

You can compare the performance in R with this matrix multiplication example:

system.time(matrix(rnorm(4096*4096), nrow=4096, ncol=4096) %*% matrix(rnorm(4096*4096), nrow=4096, ncol=4096))

On a 4-core 3.2GHz CPU with one NVIDIA GTX Titan GPU, Netlib BLAS completed the example in 121.455s. Recompiling R with OpenBLAS reduced that to 16.486s. NVBLAS cut it down to 5.244s.
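To reproduce a comparison like this on your own machine, time the same multiplication through the wrapped Rscript (NVBLAS) and through the renamed original binary, which uses only the CPU BLAS that R is linked against. A rough sketch, assuming you created the Rscript wrapper above:

Rscript -e 'a <- matrix(rnorm(4096*4096), 4096); print(system.time(a %*% a))'

/usr/local/bin/Rscript_ -e 'a <- matrix(rnorm(4096*4096), 4096); print(system.time(a %*% a))'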