We recently acquired an Nvidia GTX Titan graphics processing unit (GPU) for statistical computing at work, specifically for double-precision floating point operations using the CUDA API. Before I lock it away in the server room, I would like to see how it compares to my primary GPU at home, an Nvidia GTX 680, and to my older GPUs, a pair of AMD Radeon HD 5770s. This act of comparison is called benchmarking: running a number of standard tests and trials in order to assess the relative performance of a piece of hardware or software.

The reason a GPU (also called a graphics card or video card) is preferred over a central processing unit (CPU) for certain mathematical or scientific applications is its ability to do highly parallel stream processing. This feature evolved in GPUs as a way to perform thousands of graphics rendering tasks, such as shading polygons, very quickly by applying a single instruction to multiple data elements simultaneously. Exploiting that power for other computing tasks is called general-purpose computing on graphics processing units (GPGPU).

Compute Unified Device Architecture (CUDA) is a parallel computing platform and application programming interface (API) model created by Nvidia in 2007. It can only be run on GPUs made by Nvidia.

The following table lists all of the Nvidia GPUs that are capable of more than one thousand billion double-precision floating point operations per second (1 teraFLOPS, or 1000 GFLOPS) according to the list of Nvidia graphics processing units on Wikipedia. Prices are in US dollars from Amazon in September 2015.

Name                      RAM     GFLOPS (DP)   Price (USD)
Nvidia GTX Titan          6 GB    1500          $600
Nvidia GTX Titan Black    6 GB    1707          $700
Nvidia GTX Titan Z        12 GB   2707          $1593
Nvidia Tesla K20          5 GB    1173          $1975
Nvidia Tesla K20X         6 GB    1312          $1975
Nvidia Tesla K40          12 GB   1430          $3000
Nvidia Quadro K6000       12 GB   1732          $3650
Nvidia Tesla K80          24 GB   2330          $4246
Double-Precision GFLOPS Per Dollar
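
The value comparison behind this chart is simple arithmetic: divide each card's double-precision GFLOPS by its price. The sketch below redoes that calculation from the table figures. The names `nvidia_gpus` and `gflops_per_dollar` are made up for illustration, and the result is only as good as the Wikipedia and Amazon numbers quoted above.

```python
# Double-precision GFLOPS and September 2015 prices (USD), copied from the table above.
nvidia_gpus = {
    "GTX Titan": (1500, 600),
    "GTX Titan Black": (1707, 700),
    "GTX Titan Z": (2707, 1593),
    "Tesla K20": (1173, 1975),
    "Tesla K20X": (1312, 1975),
    "Tesla K40": (1430, 3000),
    "Quadro K6000": (1732, 3650),
    "Tesla K80": (2330, 4246),
}


def gflops_per_dollar(gpus):
    """Return (name, GFLOPS per dollar) pairs, best value first."""
    ratios = {name: gflops / price for name, (gflops, price) in gpus.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranking = gflops_per_dollar(nvidia_gpus)
for name, ratio in ranking:
    print(f"{name:16s} {ratio:.2f} GFLOPS per dollar")
```

On these figures the GTX Titan comes out on top at 2.5 GFLOPS per dollar, with the Titan Black close behind.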


CUDA can only be used on Nvidia GPUs, but it’s not the only GPGPU API available. If you’re looking for double-precision floating point performance on OpenCL or DirectCompute, you will find similar performance and value with a GPU from AMD.

Name                      RAM     GFLOPS (DP)   Price (USD)
AMD Radeon HD 6990        4 GB    1277          ?
AMD Radeon HD 7990        6 GB    1894          $693
AMD Radeon HD 8990        6 GB    1946          ?
AMD Radeon R9 295X2       8 GB    1433          $989
AMD Radeon Sky 900        6 GB    1478          $1650
AMD FirePro W8100         8 GB    2109          $995
AMD FirePro W9100         16 GB   2619          $3100
AMD FirePro S9150         16 GB   2530          $1999
AMD FirePro S9170         32 GB   2620          ?
AMD FirePro S10000        6 GB    1478          $3290

The Hardware

Motherboard: MSI 970A SLI Krait Edition
CPU: AMD FX-8350
RAM: Corsair Vengeance Pro 4x8GB
Disk: Samsung 850 EVO 500GB
Power: Cooler Master V1000
Case: Cooler Master Storm Scout 2
Operating System: Microsoft Windows 10

Motherboard with RAM and CPU

All tests will be conducted in this standard consumer-grade PC with an 8-core CPU at 4.0GHz, 32GB RAM, a solid-state drive, and enough power and space to run two high-end GPUs.

The GPUs

Nvidia GeForce GTX Titan

Year: 2013
14 Cores, 1.2GHz, 6GB RAM, 2688 Stream Processors

This is currently the best value processor for double-precision floating point operations in CUDA.

Nvidia GTX Titan 6GB

The Nvidia Titan, Titan Black, and Titan Z can be configured to run double-precision floating point operations at a higher rate than normal, with a trade-off of slower single-precision performance and game graphics. In some of these benchmarks I will be testing both modes, identified as Titan (normal performance) and Titan DP (boosted double-precision).

Note: the newer Titan X does not have double-precision mode!
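
The size of that trade-off can be estimated with the usual peak-throughput formula: stream processors × 2 floating point operations per cycle (one fused multiply-add) × clock speed, scaled by the fraction of full rate at which double-precision runs. The sketch below is a rough illustration only. The 837MHz base clock and the 1/3 (DP mode) and 1/24 (normal mode) double-precision ratios are assumptions taken from commonly published Titan specifications, not figures stated in this post, which lists 1.2GHz in the spec line above. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(stream_processors, clock_mhz, fp64_ratio=1.0):
    """Theoretical peak GFLOPS: processors x 2 FLOPs per cycle x clock x FP64 ratio."""
    single_precision = stream_processors * 2 * clock_mhz / 1000.0
    return single_precision * fp64_ratio


# Assumed values (not from this post): 837MHz base clock, 1/3 and 1/24 FP64 rates.
titan_single = peak_gflops(2688, 837)
titan_dp_mode = peak_gflops(2688, 837, fp64_ratio=1 / 3)
titan_normal_mode = peak_gflops(2688, 837, fp64_ratio=1 / 24)

print(f"Single precision:             {titan_single:.0f} GFLOPS")
print(f"Double precision, DP mode:    {titan_dp_mode:.0f} GFLOPS")
print(f"Double precision, normal:     {titan_normal_mode:.0f} GFLOPS")
```

Under those assumptions the DP-mode estimate lands at roughly 1500 GFLOPS, which agrees with the figure quoted for the Titan in the table near the top of this post.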

Activating ‘Double precision’ mode

Nvidia GeForce GTX 680

Year: 2012
8 Cores, 1.2GHz, 4GB RAM, 1536 Stream Processors

This is my current graphics card at home. It plays all the games.

Nvidia GTX 680 4GB

While it is a fine GPU on its own, the GTX 680 can also be used as a co-processor to assist a faster GPU (such as the Titan) during heavy computing workloads. In GPGPU computing tasks both cards will share the load. In games and graphics applications the primary card (Titan) will do graphics while the secondary card (680) handles computing tasks such as Nvidia PhysX (physics simulations on CUDA) or AMD TressFX (realistic hair simulation). This configuration will be identified as Titan+680.

TressFX simulated hair, AMD Blog
Titan and 680 installed together

AMD Radeon 5770

Year: 2009
10 Cores, 850MHz, 1GB RAM, 800 Stream Processors

I have two of these cards. I will be testing one individually, and then both working together in CrossFireX mode, identified as 5770+5770.

With only 1GB RAM its memory space is too small to run some tests. It is still limited to 1GB RAM when both cards are combined – the two cards share the computing load, but they operate in a single memory space.

The Radeon 5770 does not have native support for double-precision floating point operations, although it is able to emulate double-floats in some applications.
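
This post does not say how those applications emulate double-floats. One common approach, shown here as an assumption rather than as what any particular benchmark does, is "float-float" arithmetic: each value is carried as the sum of two single-precision numbers, a high part and a small correction. The sketch below imitates single-precision hardware in Python by rounding every intermediate result to 32 bits. The helper names `f32`, `split`, `two_sum` and `df_add` are made up for illustration.

```python
import struct


def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split(x):
    """Represent x as a (high, low) pair of 32-bit floats."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def two_sum(a, b):
    """Add two 32-bit floats; return the rounded sum and the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def df_add(x, y):
    """Add two (high, low) pairs using only 32-bit operations."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return two_sum(s, e)


plain = f32(f32(0.1) + f32(0.2))
pair = df_add(split(0.1), split(0.2))
print("single-float error:", abs(plain - 0.3))
print("float-float error: ", abs((pair[0] + pair[1]) - 0.3))
```

The pair gives far better accuracy than one single-precision value, but each addition costs several single-precision operations, which is one reason emulated double-floats score well below native ones.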

2x AMD Radeon 5770 1GB
2x 5770 installed (with 12cm side fan)

AMD FX-8350 CPU

Year: 2012
8 Cores, 4.0GHz, 8MB Cache, 64 Stream Processors

This is not a GPU, it’s a CPU. It can’t run CUDA. But it can run OpenCL and native code, so it can compete with the GPUs in some tests.

The Tests

AIDA64

AIDA64 Extreme is a hardware information and benchmark suite for Windows.

Memory Copy MB/s


Single-Precision GFLOPS


Double-Precision GFLOPS


If you’re reading this to find out which of these GPUs is the best at double-precision floating point calculations, you can stop reading now. We have a winner.

24-bit Integer GIOPS


64-bit Integer GIOPS


AES-256 MB/s


The FX-8350 beat the GPUs in this test because it has dedicated hardware instructions for AES encryption (AES-NI).

SiSoft Sandra 2015

Sandra 2015 is a benchmark suite for Windows.

Compute API Comparison

This test compares four different GPU computing APIs: CUDA, OpenCL, OpenGL, and DirectCompute.

Single-float Shaders Performance (Mpix/s) – Titan Normal Mode


Double-float Shaders Performance (Mpix/s) – Titan DP Mode


Thankfully CUDA achieved the highest score for double-floats on the Titan, because that’s exactly what I bought it for! In the remaining Sandra tests I will use CUDA on the Nvidia cards and OpenCL on the others.

GPGPU Processing

Single-float Shaders Performance (Mpix/s)


Double-float Shaders Performance (Mpix/s)


A surprising show of strength from the 5770+5770, beating the 680 at single- and double-floats, and matching the Titan at single-floats. Somehow the Titan+680 achieved a higher single-float score than their individual scores added together. There’s something weird about this test.

GPGPU Cryptography

Settings: AES256+SHA2-512

Encryption/Decryption Bandwidth (GB/s)


Hashing Bandwidth (GB/s)


The FX-8350 was unable to use its encryption acceleration features here, possibly because the test is run in OpenCL instead of native code.

GPGPU Financial Analysis Single-Precision

Binomial Euro Option Pricing (kOPT/s)


Monte Carlo Euro Option Pricing (kOPT/s)


GPGPU Financial Analysis Double-Precision

Binomial Euro Option Pricing (kOPT/s)


Monte Carlo Euro Option Pricing (kOPT/s)


GPGPU Scientific Analysis Single-Precision

General Matrix Multiply (GEMM) (GFLOPS)


Fast Fourier Transform (FFT) (GFLOPS)


GPGPU Scientific Analysis Double-Precision

General Matrix Multiply (GEMM) (GFLOPS)


Fast Fourier Transform (FFT) (GFLOPS)


GPGPU Bandwidth

Internal Memory Bandwidth (GB/s)


Interface Transfer Bandwidth (GB/s)


The maximum interface transfer bandwidth for the GPUs is limited by the PCI Express 2.0 interface on the motherboard.
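
For reference, the theoretical ceiling of that interface can be worked out from the PCI Express 2.0 signalling rate: 5 gigatransfers per second per lane, with 8b/10b encoding leaving 8 usable bits out of every 10 transferred. The sketch below does the arithmetic. The function name `pcie_bandwidth_gb_s` is hypothetical, and the lane counts are assumptions: which slots on this motherboard run at x16 or x8 with two cards installed is not stated in this post.

```python
def pcie_bandwidth_gb_s(lanes, gigatransfers_per_s=5.0, encoding_efficiency=8 / 10):
    """Theoretical one-direction bandwidth in GB/s for a PCI Express link."""
    usable_bits_per_s = lanes * gigatransfers_per_s * 1e9 * encoding_efficiency
    return usable_bits_per_s / 8 / 1e9


# PCI Express 2.0 defaults; lane counts are assumed for illustration.
x16 = pcie_bandwidth_gb_s(16)
x8 = pcie_bandwidth_gb_s(8)
print(f"x16 link: about {x16:.1f} GB/s per direction")
print(f"x8 link:  about {x8:.1f} GB/s per direction")
```

Measured transfer rates will sit below these ceilings because of protocol overhead.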

LuxMark 3.1

LuxMark is an OpenCL benchmark tool for Windows, Linux, and Mac. It renders images by simulating the flow of light according to physical equations.

LuxBall HDR Score


Neumann TLM-102 SE Score



Note: Radeon 5770 could not run this test because it requires more than 1GB RAM.

Hotel Lobby Score


Hotel Lobby after 30 seconds rendering on the Titan
Hotel Lobby after 16 hours rendering on the Titan

Graphics Benchmarks

3DMark 11 Basic Edition

3DMark 11 is a video card benchmark tool for Windows. It makes extensive use of DirectX 11 features including tessellation, compute shaders and multi-threading.

Performance Preset Score


3DMark Advanced Edition

3DMark is a video card benchmark tool for Windows, Android, and iOS. It is the successor to 3DMark 11.

Ice Storm 1.2 Score


Ice Storm Extreme 1.2 Score


Cloud Gate 1.1 Score


Sky Diver 1.0 Score


Fire Strike 1.1 Score


Fire Strike Extreme 1.1 Score


Fire Strike Ultra 1.1 Score


3DMark API Overhead feature test 1.2

Draw calls per second DX11 Multi-threaded


Draw calls per second DX11 Single-threaded



Note: Radeon 5770 cannot run this test.

Draw calls per second DX12


Unigine Heaven 4.0 Basic Edition

Heaven is a GPU stress-testing and benchmark tool for Windows, Linux and Mac. It makes comprehensive use of tessellation.

Extreme Preset Score


Unigine Valley 1.0 Basic Edition

Valley is a GPU stress-testing and benchmark tool for Windows, Linux and Mac.

Extreme HD Preset Score


Game Benchmarks

Doom 3

Year: 2004
Engine: id Tech 4

Settings: 1280×1024 Ultra, timedemo demo1

Average Frames Per Second


Half-Life 2: Lost Coast

Year: 2005
Engine: Source, Havok

Settings: 1920×1080 Maximum detail

Average Frames Per Second


Doom 3 and Half-Life 2 do not benefit from the 5770+5770, most likely because their engines predate this kind of multi-GPU rendering and were never updated to take advantage of it.

Batman: Arkham Origins

Year: 2013
Engine: Unreal Engine 3, PhysX

Settings: 1920×1080 Maximum detail

Average Frames Per Second


The Titan+680 achieved clear gains here; the Titan was able to work on graphics exclusively while the 680 handled PhysX processing.

Tomb Raider

Year: 2013
Engine: Foundation, TressFX

Settings: 2560×1080 ‘Ultimate’ preset.
Note: Game is limited to 60 frames per second.

Average Frames Per Second


The Titan hit the maximum frame rate on its own. I’m not sure why the Titan+680 was slightly slower.

Bioshock Infinite

Year: 2013
Engine: Unreal Engine 3

Settings: 1920×1080 ‘UltraDX11 DDOF’ preset.

Average Frames Per Second


EVE Online

Year: 2003-2015
Engine: Trinity

Note: EVE Probe is the benchmark tool for EVE Online.

Combined Score


Batman: Arkham Knight

Year: 2015
Engine: Unreal Engine 3, Nvidia GameWorks

Settings: 1920×1080 Maximum detail.
Note: Radeon 5770 could not run this test because it requires more than 1GB RAM.

Average Frames Per Second


Conclusion

Titan’s double-precision mode performed exactly as expected, easily dominating the double-precision tests. The older Radeon 5770 surprised with strong results in a few tests, highlighting the architectural differences in GPU designs from Nvidia and AMD. I would be very interested to see how a newer AMD GPU compares in those tests.

The FX-8350 CPU took last place in nearly every test because it has significantly fewer cores and stream processors than the GPUs, despite having a higher clock speed per core. It did win the AIDA64 AES test thanks to its encryption acceleration features.
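
A back-of-the-envelope way to see that gap is to multiply each device's stream processor count by its clock speed, using the spec lines listed earlier in this post. This is only a crude proxy, shown as a sketch: it ignores architecture, memory bandwidth and how much work each processor does per cycle, and the names `devices` and `raw_throughput` are made up for illustration.

```python
# (stream processors, clock in GHz) as listed in the spec lines of this post.
devices = {
    "Titan": (2688, 1.2),
    "680": (1536, 1.2),
    "5770": (800, 0.85),
    "FX-8350": (64, 4.0),
}


def raw_throughput(specs):
    """Stream processors x clock (GHz): a crude count of billions of lane-cycles per second."""
    return {name: processors * ghz for name, (processors, ghz) in specs.items()}


throughput = raw_throughput(devices)
for name, value in sorted(throughput.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {value:7.1f}")
```

Even with a clock more than three times faster than the GPUs, the FX-8350 ends up with the smallest product by a wide margin.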

Upgrading the CPU cooler

To increase the longevity of the FX-8350 CPU, I replaced the standard cooler with a bigger one: the Arctic Cooling Freezer 13. As you can see in the final picture below, the cooler overlaps the RAM slots. It required some cutting (the cooler, not the RAM) to make it fit.

Left: Standard cooler. Right: Cooler cooler.
Lowest Temperature (°C)
Highest Temperature (°C)


Final configuration: just the Titan
The final build, now running Linux. For statistics.