[Trilinos-Users] Tpetra::CrsMatrix::apply performance

Mohammad Siahatgar siahatgar at rrzn.uni-hannover.de
Mon Jul 28 07:08:21 MDT 2014


Hi there,

It seems to me that there is a big difference in the performance of 
Tpetra::CrsMatrix::apply with Teuchos::NO_TRANS and Teuchos::TRANS in 
the parallel implementations. The former is about 12x faster using 
KokkosClassic::OpenMPNode with 32 cores and 15x faster using TBBNode, 
while with SerialNode it is only 1.2x faster. The numbers are results of 
SpMV operations of the order 5000 on a Sandy Bridge machine using the 
development version of Tpetra. My code is based on the 
EpetraBenchmarkTest but modified to use Tpetra.
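
To make the comparison concrete, here is a minimal serial sketch of the 
two operations in plain C++. It is not Tpetra code: the CsrMatrix struct 
and the functions applyNoTrans and applyTrans are hypothetical names 
made up for illustration, and I am assuming a row-wise compressed sparse 
row (CSR) layout.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container (illustration only, not a Tpetra type).
struct CsrMatrix {
    std::size_t numRows;
    std::size_t numCols;
    std::vector<std::size_t> rowPtr;  // numRows + 1 offsets into colInd/values
    std::vector<std::size_t> colInd;  // column index of each stored entry
    std::vector<double> values;       // value of each stored entry
};

// y = A * x. Row i reads x and writes only y[i], so rows are independent.
std::vector<double> applyNoTrans(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.numRows, 0.0);
    for (std::size_t i = 0; i < A.numRows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            sum += A.values[k] * x[A.colInd[k]];
        }
        y[i] = sum;
    }
    return y;
}

// y = A^T * x. Row i scatters into y[colInd[k]], so two different rows
// can update the same entry of y.
std::vector<double> applyTrans(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.numCols, 0.0);
    for (std::size_t i = 0; i < A.numRows; ++i) {
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            y[A.colInd[k]] += A.values[k] * x[i];
        }
    }
    return y;
}
```

In this layout the NO_TRANS loop can be split over rows without any 
coordination, whereas a row-parallel TRANS loop would need atomic updates 
or per-thread copies of y. Whether this is what actually limits the 
TRANS case in Tpetra is only my guess.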

On the other hand, if I use ThrustNode instead, I get impossibly good 
performance for NO_TRANS, while the TRANS timing is still comparable to 
the CPU parallelization. This has tied my hands for benchmarking 
Tpetra's real performance on GPUs.

Is this expected, or am I missing something?

Thanks in advance for the ideas!

Best,
Mohammad

-- 
Dr. Mohammad Siahatgar

Leibniz Universität IT Services
Schloßwender Straße 5, 30159 Hannover, Germany
Tel. +49 511 762 794666 | Fax +49 511 762 3003
siahatgar at rrzn.uni-hannover.de | www.rrzn.uni-hannover.de


