[Trilinos-Users] Tpetra::CrsMatrix::apply performance
Mohammad Siahatgar
siahatgar at rrzn.uni-hannover.de
Mon Jul 28 07:08:21 MDT 2014
Hi there,
It seems to me that there is a big difference in performance of
Tpetra::CrsMatrix::apply with Teuchos::NO_TRANS and Teuchos::TRANS in
the parallel implementations. The former is about 12x faster using
KokkosClassic::OpenMPNode with 32 cores and 15x using TBBNode, while the
SerialNode is only 1.2x faster. The numbers are results of SpMV
operations of the order 5000 on a Sandy Bridge machine using the
development version of Tpetra. My code is based on the
EpetraBenchmarkTest but modified to use Tpetra.
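To make the two operations being compared concrete, here is a minimal sketch with plain CSR kernels standing in for the Tpetra calls. The names Csr, applyNoTrans, and applyTrans are made up for illustration; this is not Tpetra's API and it makes no claim about how Tpetra implements either mode. It only assumes row-wise compressed storage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container for illustration; not a Tpetra type.
struct Csr {
    std::size_t numRows = 0;
    std::size_t numCols = 0;
    std::vector<std::size_t> rowPtr;  // size numRows + 1
    std::vector<std::size_t> colInd;  // size nnz
    std::vector<double> val;          // size nnz
};

// y = A * x (the NO_TRANS case).
// Each row i writes only y[i], so the outer loop has no write conflicts
// between rows.
std::vector<double> applyNoTrans(const Csr& A, const std::vector<double>& x) {
    std::vector<double> y(A.numRows, 0.0);
    for (std::size_t i = 0; i < A.numRows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            sum += A.val[k] * x[A.colInd[k]];
        }
        y[i] = sum;
    }
    return y;
}

// y = A^T * x (the TRANS case), using the same row-wise storage.
// Each row i scatters into y[colInd[k]], so different rows can update the
// same output entry; a threaded version of this loop would need atomics,
// per-thread buffers, or an explicitly stored transpose.
std::vector<double> applyTrans(const Csr& A, const std::vector<double>& x) {
    std::vector<double> y(A.numCols, 0.0);
    for (std::size_t i = 0; i < A.numRows; ++i) {
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            y[A.colInd[k]] += A.val[k] * x[i];
        }
    }
    return y;
}
```

In this sketch both kernels do the same number of multiply-adds; the only difference is whether the output is gathered per row or scattered across columns.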
On the other hand, if I use ThrustNode instead, I get impossibly good
performance for NO_TRANS, while the TRANS timing is still comparable to
the CPU parallelization. This keeps me from benchmarking Tpetra's real
performance on GPUs.
Is this expected, or am I missing something?
Thanks in advance for the ideas!
Best,
Mohammad
--
Dr. Mohammad Siahatgar
Leibniz Universität IT Services
Schloßwender Straße 5, 30159 Hannover, Germany
Tel. +49 511 762 794666 | Fax +49 511 762 3003
siahatgar at rrzn.uni-hannover.de | www.rrzn.uni-hannover.de