[Trilinos-Users] [EXTERNAL] Matrix-matrix product with large CrsMatrices

Andrey Prokopenko aprokop at sandia.gov
Mon Oct 12 16:24:37 EDT 2015

Hi Christopher,

1. It works for sparse matrices. Say, for instance, we are dealing with 
a 1000 x 1000  matrix, and processor 0  owns rows [0,100). Lets say that 
row 0 is a sparse row containing {0, 4, 6, 50, 105, 200}. Then the 
processor only needs to fetch columns of the matrix B with a nonzero 
entry in one of this position (otherwise, the entry in the result C=A*B 
is going to be zero). To do that, one could fetch rows with nonzeros in 
one of the required columns. As matrix is sparse, there are very few of 
such rows, and thus no need to fetch the whole matrix B.

2. The max number of nonzeros per process is restricted to those 
addressable with LocalOrdinal. By default, it is int and therefore 2^31 
nonzeros. The amount of memory taken by those nonzeros is more than 2GB 
dues to having to store row offset and values in addition to column 
indices (plus GlobalOrdinal), which results >= 25GB. Note, that the 
global matrix may have much larger size due to using GlobalOrdinal, 
which one can set to be a 64-bit ordinal. In MPI routines we typically 
use only LocalOrdinals.

On 10/12/2015 10:56 AM, Christopher Thiele wrote:
> Hello,
> I have a code that involves a matrix-matrix multiplication (A*B) using 
> the Tpetra::MatrixMatrix::Multiply() function. As far as I understand, 
> this function first transfers remote parts of one of the matrices via 
> MPI and then computes the actual matrix-matrix product. However, for 
> large matrices this does not seem to work for two reasons:e
> 1. If only the rows of the matrices are distributed across MPI 
> processes, i.e. if a process always owns all or none of the entries of 
> a given row, then basically the full matrix B must be present on each 
> process to compute the matrix-matrix product, which somehow makes it 
> pointless to distribute it in the first place. Especially the global 
> number of non-zeros is restricted by the memory available to one 
> process, despite distributing the matrix.
> 2. In fact, the non-zeros per process seem to be restricted to 2 GB 
> anyway, because they are communicated with MPI functions that take a 
> 32 bit integer as byte count. One cannot circumvent this using e.g. a 
> Teuchos::MpiComm<int64_t> communicator, because the Tpetra::Map class 
> expects a Teuchos::Comm<int> communicator. At least this is what I 
> observed when I tried larger matrices.
> Do you see a way to get around these issues without re-writing the 
> Multiply() function, or am I maybe missing something here? I 
> appreciate any suggestions.
> Regards,
> Christopher
> _______________________________________________
> Trilinos-Users mailing list
> Trilinos-Users at trilinos.org
> https://trilinos.org/mailman/listinfo/trilinos-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://trilinos.org/pipermail/trilinos-users/attachments/20151012/0df01f4e/attachment.html>

More information about the Trilinos-Users mailing list