[Trilinos-Users] CUDA (not trilinos related)

Matt G mgoodman at email.arizona.edu
Wed Jul 29 14:38:11 MDT 2009

This is true as long as the multiplications performed don't need to exit the
scalar processors.  For most realistic linear algebra tasks you bump into
the memory bandwidth limit way before reaching peak compute.  You only ever get the
theoretical performance when doing stuff like
fractals <http://scipyed.wordpress.com/2009/01/03/pycuda-fun/> or wpa.
Benchmarking the cublas implementation vs. cpu blas for DP typically gives a
win to the CPU.  I have yet to have mkl lose to a gpu (295's included);
admittedly, I am rolling with a pricey i7.

All the magic in cpu blas comes from consecutive cache hits and never
needing to wait on memory.
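
A rough back-of-envelope sketch of the bandwidth argument, in Python.  The
function names are made up for illustration, and the peak and bandwidth
figures are placeholders, not measurements or specs of any real card or CPU.
It just counts flops per byte for a dense double-precision matrix-vector
product and applies a simple roofline-style bound.

```python
def matvec_intensity(n, bytes_per_value=8):
    """Flops per byte moved for a dense n x n matrix-vector product (y = A x)."""
    flops = 2 * n * n             # one multiply and one add per matrix entry
    values_moved = n * n + 2 * n  # read A and x once, write y once
    return flops / (values_moved * bytes_per_value)


def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Roofline-style bound: capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


# Placeholder hardware numbers, chosen only to illustrate the arithmetic.
peak = 100.0       # hypothetical peak DP GFLOP/s
bandwidth = 50.0   # hypothetical memory bandwidth in GB/s

ai = matvec_intensity(4096)
print("flops per byte:", ai)
print("attainable GFLOP/s:", attainable_gflops(peak, bandwidth, ai))
```

With roughly a quarter of a flop per byte, the bound is set by the bandwidth
term unless the matrix stays resident in cache, which is the case the cpu
blas point above relies on.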


On Wed, Jul 29, 2009 at 12:09 PM, <rrossi at cimne.upc.edu> wrote:

> A comment on CUDA ... which indeed has very little to do with trilinos.
> concerning the on-going discussion on CUDA performance, i would like to say
> that Matrix-Vector Multiplication using nvidia GTX280 and DOUBLE PRECISION
> is 5 to 10 times faster than a recent quad core system (old quad core ...
> not sure of the I7).
> If the matrix is sufficiently big this figure could be even better.
> I should say that the figures i report are obtained using a serial code for
> the matrix multiplication based on the ublas library. I am pretty sure in
> any case that even trilinos is not faster than that for a single CPU.
> greetings
> Riccardo

Matthew Goodman

Find me on LinkedIn: http://tinyurl.com/d6wlch
Follow me on twitter: http://twitter.com/meawoppl