[Trilinos-Users] SpMV performance on CUDA GPUs

Max Grossman jmg3 at rice.edu
Sun Dec 20 17:59:53 EST 2015


Hello Kokkos/Tpetra developers,

I’m looking at Tpetra SpMV performance on GPUs, in particular comparing it against NVIDIA’s CUSPARSE library. I’ve observed a fairly severe loss in performance when switching from CUSPARSE to Trilinos, but I don’t have an explanation yet. I wanted to share my findings and see if anyone has suggestions or guesses that might help diagnose the problem.

To set the stage, I’m running performance tests with a square block-diagonal matrix of dimension 12,434,616 x 12,434,616. On average, each row has 10 non-zeroes; at most, a row has 16 non-zeroes (hence, load imbalance between rows is not a huge issue). I’m comparing the CUSPARSE implementation that comes with CUDA 7.5 to Trilinos from the github trilinos-release-12-4-branch branch. These experiments were run on a single GK210 in a K80, with 12GB of global memory (the whole matrix fits in global memory). I’ve attached the complete microbenchmarks I use for SpMV (trilinos_spmv.cpp and cusparse_spmv.cpp). In general, I run 30 repeats and take the median. If anyone is interested in trying to reproduce the results, message me directly and we can find some way to transfer the matrices (they consume ~1.6GB on disk in CSR format).
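For context, the CUSPARSE side boils down to something like the sketch below. This is a simplified sketch rather than the attached benchmark verbatim: the function name, device pointers, and sizes are placeholders, and handle/descriptor creation, data transfer, and error checking are omitted. cusparseDcsrmv is the CSR SpMV entry point in the CUDA 7.5 cuSPARSE API.

    #include <cuda_runtime.h>
    #include <cusparse_v2.h>

    // Simplified sketch of one cuSPARSE SpMV (CUDA 7.5 API): y = A*x for a
    // CSR matrix already resident in device memory. d_vals, d_rowPtr,
    // d_colInd, d_x, d_y and nrows/ncols/nnz are placeholders set up elsewhere.
    void spmv_once(cusparseHandle_t handle, cusparseMatDescr_t descr,
                   int nrows, int ncols, int nnz,
                   const double *d_vals, const int *d_rowPtr, const int *d_colInd,
                   const double *d_x, double *d_y)
    {
        const double alpha = 1.0, beta = 0.0;
        cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       nrows, ncols, nnz, &alpha, descr,
                       d_vals, d_rowPtr, d_colInd, d_x, &beta, d_y);
        cudaDeviceSynchronize();  // kernel-only timing brackets this call
    }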

The high-level results are as follows. When considering all data movement (cudaMemcpy) and kernel execution, CUSPARSE takes 193,551,177.5 ns (~193 ms) to perform a single SpMV. When considering only the CUSPARSE kernel, that drops to 24,804,730.5 ns (~24 ms). On the other hand, Trilinos takes 909,918,300.5 ns (~910 ms).
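For the Trilinos numbers, the timing loop is essentially the following (again a condensed sketch rather than trilinos_spmv.cpp itself; construction of the fillComplete'd Tpetra::CrsMatrix, the two vectors, and the Kokkos/MPI initialization are omitted):

    #include <algorithm>
    #include <chrono>
    #include <vector>
    #include <Kokkos_Core.hpp>
    #include <Tpetra_CrsMatrix.hpp>

    // Condensed sketch of the Trilinos timing loop: A is a fillComplete'd
    // Tpetra::CrsMatrix and x, y are conforming Tpetra vectors, all living on
    // the CUDA node. Each apply() computes y = A*x; the median over the
    // repeats is reported.
    template <class CrsMatrix, class Vector>
    double median_spmv_ns(const CrsMatrix &A, const Vector &x, Vector &y,
                          const int repeats = 30)
    {
        std::vector<double> samples;
        for (int r = 0; r < repeats; ++r) {
            const auto t0 = std::chrono::high_resolution_clock::now();
            A.apply(x, y);    // SpMV: y = A*x
            Kokkos::fence();  // wait for the asynchronous CUDA kernel(s)
            const auto t1 = std::chrono::high_resolution_clock::now();
            samples.push_back(
                std::chrono::duration<double, std::nano>(t1 - t0).count());
        }
        std::sort(samples.begin(), samples.end());
        return samples[samples.size() / 2];
    }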

Using nvprof, I’ve identified a single kernel launch in the Trilinos implementation as the bottleneck: a call to cuda_parallel_launch_local_memory that takes ~840 ms. Looking at the code I don’t see anything obviously wrong, so I tried different nvprof metrics and flags to find what differences there might be between the main CUSPARSE and Trilinos kernels. I’ve listed only the ones that seem the most interesting below; the full results are attached in all-metrics.txt:

- Profiling with --unified-memory-profiling doesn’t show any significant Unified Memory traffic between host and device, only a few KB.
- Trilinos generally shows lower global load efficiency: 34.92% compared to 59.20% for CUSPARSE.
- Trilinos does not appear to use CUDA shared memory, whereas CUSPARSE does.
- Trilinos shows much lower issue slot utilization (4.70% vs. 46.84% for CUSPARSE). I’m not entirely clear on what this implies, but it seems odd given that Trilinos generally shows plenty of eligible and active warps.
- Trilinos and CUSPARSE use slightly different grid and block configurations, but it’s unclear whether this makes any difference.

It’s hard to know which of these things is the culprit, or if it’s something else entirely. The biggest red flags would seem to be the lack of shared memory and the low issue slot utilization. But it’s hard to believe that those alone would account for such a massive difference in performance, which is why I’m also wondering if it’s programmer error. Does anyone have suggestions or explanations for this difference in performance? Or other possible avenues of investigation?

Thanks!

Max

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cusparse_spmv.c
Type: application/octet-stream
Size: 20055 bytes
Desc: not available
URL: <https://trilinos.org/pipermail/trilinos-users/attachments/20151220/f1ae204c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trilinos_spmv.cpp
Type: application/octet-stream
Size: 6979 bytes
Desc: not available
URL: <https://trilinos.org/pipermail/trilinos-users/attachments/20151220/f1ae204c/attachment-0001.obj>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: all-metrics.txt
URL: <https://trilinos.org/pipermail/trilinos-users/attachments/20151220/f1ae204c/attachment.txt>

