[Trilinos-Users] Tpetra MultiGPU

Hoemmen, Mark mhoemme at sandia.gov
Sun Sep 18 20:56:55 MDT 2011

We are always working on improving the performance of Tpetra, so we appreciate your observations. We would be grateful if you could share the benchmarks that revealed the performance issues you observed.

Regarding your observations about Import/Export, are you interested mainly in the sparse matrix-vector multiply kernel?  

Message: 1
Date: Sun, 18 Sep 2011 06:50:52 +0400
From: ???? ?????? <oleg.ryabkov.87 at gmail.com>
Subject: [Trilinos-Users] Tpetra MultiGPU
To: trilinos-users at software.sandia.gov
        <CAFa1kQpAO4rj0jVAMCbro_RC9G+VJKgSRmpr3+-bhL=EewOoGQ at mail.gmail.com>

  Hello, everyone!
I was testing the GPU and multi-GPU capabilities of Tpetra (with Belos
solvers) and noticed that the multi-GPU variant is always much slower than
the single-GPU variant.
I investigated the code and discovered that the Import/Export classes
simply create "views" using the viewBufferNonConst method of the kernel (which
is ThrustGPUNode in this case),
which means that all of the vector data is copied between GPU and CPU
(even though, for many sparse matrices, only the
small "extra" parts need to be transmitted).
Do you see any solution that does not change the interface of "Node"? Maybe
launch some additional kernels (parallel_for<>??) to copy only the elements
really needed to a temporary buffer (GPU<=>GPU) and then create its
(it is not just curiosity; I really liked the Trilinos design and wonder
if I can use it in my future projects).
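
The packing idea described above can be sketched as follows. This is only an illustration under stated assumptions, not Tpetra or Kokkos code: the function names packExportEntries and unpackImportEntries, the index lists, and the use of std::vector are all hypothetical, and the plain host loops stand in for what would be a GPU kernel (for example, the body of a parallel_for) writing into a small device-side buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: gather only the entries a neighboring process needs
// into a small contiguous buffer. On a GPU, each loop iteration would be one
// thread of a kernel; here it runs on the host purely for illustration.
std::vector<double> packExportEntries(
    const std::vector<double>& localData,
    const std::vector<std::size_t>& exportIndices) {
  std::vector<double> sendBuffer(exportIndices.size());
  for (std::size_t k = 0; k < exportIndices.size(); ++k) {
    assert(exportIndices[k] < localData.size());
    sendBuffer[k] = localData[exportIndices[k]];
  }
  return sendBuffer;
}

// Hypothetical helper: scatter a received buffer into the positions of the
// local vector that hold remote ("extra") entries.
void unpackImportEntries(const std::vector<double>& recvBuffer,
                         const std::vector<std::size_t>& importIndices,
                         std::vector<double>& localData) {
  assert(recvBuffer.size() == importIndices.size());
  for (std::size_t k = 0; k < importIndices.size(); ++k) {
    assert(importIndices[k] < localData.size());
    localData[importIndices[k]] = recvBuffer[k];
  }
}
```

With this arrangement, only the packed buffer (one value per exported index) would need to cross the GPU/CPU boundary, rather than the whole vector. Whether this can be done without changing the "Node" interface is exactly the open question in the message.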

