[Trilinos-Users] [EXTERNAL] Tpetra::CrsMatrix caching related to C= A^T B and D = A*C

Denis Davydov davydden at gmail.com
Wed Sep 12 03:48:44 EDT 2018

Hi Chris,

I have a few follow-up questions.

1. for MatrixMatrix::Multiply on the first run C has to have the same row distribution as A.
If both A and B are MPI distributed row-wise, what’s actually the distribution of C?
How does TPetra decide on this? Is it row-wise? column-wise? both? or simply serial?

2. keeping in mind row-wise distribution of tall and skinny A and B, for operations like D = A  (A^T B) = A C
each MPI process needs to know all rows in C for which columns in locally owned part of A are non-zero.
Say I know the required rows outside of Trilinos, do I have to create a C2 with such (not one-to-one) partition of rows and 
do Import from C before doing MatrixMatrix::Multiply?
Here it is still not clear what will be the distribution of D after this. I would hope it to be exactly the same as A,
namely row-wise with the same map.

3. I did some timings for C=A^T B on my mac in Release mode (just to get a rough idea before I occupy a cluster).

I setup an empty C:

 Teuchos::RCP<T_matrix> C =
      Teuchos::rcp(new T_matrix(dof_map, 0));

then I check the time for the first run (A and B are fill-complete):

                                   /*fill complete*/ true);

and then check the time for a few more runs where I would expect TPetra not to re-calculate the sparsity of the product (A^T B).

      Tpetra::MatrixMatrix::Multiply(*A, true,
                                     *B, false,
                                     /*fill complete*/true);

But so far it does not seem to be the case. 
Walltime is 0.136s in the first Multiply and then on average 0.501s on consequent calls.
So it actually takes much more time. Is resumeFill() so expensive? Or shall I, for the sake of timing, do
fill_complete=false in both steps and then would probably not need C->resumeFill()?

p.s. sizes of A and B are 22000 x 960, maximum number of non-zeroes per row are 70 and 54, respectively.


> On 10 Sep 2018, at 16:59, Siefert, Christopher <csiefer at sandia.gov> wrote:
> Denis,
> Tpetra caches nothing w.r.t. matrix-matrix multiplication:.  You do that yourself (but that is actually pretty simple)
> Imagine you do this at timestep 0:
> C = Teuchos::rcp(new Tpetra::CrsMatrix(blah));
> Tpetra::MatrixMatrix::Multiply(A,B,C)
> This forms your very first C matrix.  Now rather than delete it, you keep it.  Then at timestep 1, you do:
> C->putScalar(0)
> Tpetra::MatrixMatrix::Multiply(A,B,C)
> Since C already has the right sparsity & communication patterns, the multiply will re-use that.  Tpetra tests in the MatrixMatrix directory test this "reuse" case.
> -Chris
> From: Denis Davydov <davydden at gmail.com>
> Sent: Monday, September 10, 2018 8:50 AM
> To: Siefert, Christopher
> Cc: trilinos-users
> Subject: Re: [EXTERNAL] [Trilinos-Users] Tpetra::CrsMatrix caching related to C= A^T B and D = A*C
> Dear Chris,
>> On 10 Sep 2018, at 16:36, Siefert, Christopher <csiefer at sandia.gov <mailto:csiefer at sandia.gov>> wrote:
>> Denis,
>> If you pass in a matrix with the correct structure, for "C" it will keep the communication and sparsity patterns (and be a lot
> By “correct structure” you mean sparsity pattern, right? Because the way this matrix is distributed in MPI row-wise has some degree of freedom.
> So the communication pattern, essentially, depends on MPI distribution of C, in addition to sparsities and distribution of A and B. 
> If I understand you correctly, this will still be cached.
> Are there any examples / tutorials / unit-tests in TPetra where I can check how to get the sparsity of C needed for the product without actually calling Tpetra::MatrixMatrix::Multiply()?
> Regards,
> Denis.
>> faster).
>> -Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://trilinos.org/pipermail/trilinos-users/attachments/20180912/d1c8c236/attachment.html>

More information about the Trilinos-Users mailing list