[Trilinos-Users] MPI error with Large Sparse Matrix
Mike Atambo
mikeat4999 at gmail.com
Thu May 21 10:26:22 EDT 2015
I'm working with the sparse matrices in Trilinos and was able to generate
and diagonalize matrices with up to hundreds of millions of non-zeros
(all double complex). Once the matrix is generated, the solver takes only
a short while to converge, which indicates we are still at manageable
sizes at hundreds of millions of non-zeros. However, when I tried a
96M x 96M sparse matrix with just under 3 billion non-zeros, I am faced
with the following error:
[cn08-15:28114] *** An error occurred in MPI_Isend
[cn08-15:28114] *** reported by process [1773404161,22]
[cn08-15:28114] *** on communicator MPI_COMM_WORLD
[cn08-15:28114] *** MPI_ERR_COUNT: invalid count argument
[cn08-15:28114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[cn08-15:28114] *** and potentially your MPI job)
When creating a matrix of this particular size, I gave the value of
exactly 3 billion non-zeros to the rcp (new matrix_map_type ...)
call, but only 2614493376 entries were actually non-zero.
Here is the series of function calls, in their exact order:
----------------- snip -----------------
const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
RCP<const matrix_map_type> EqualMatDistribution =
  rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
                            Tpetra::GloballyDistributed));
RCP<crs_matrix_type> mat =
  rcp (new crs_matrix_type (EqualMatDistribution, 0,
                            Tpetra::DynamicProfile));
const global_ordinal_type mat_global_start =
  EqualMatDistribution->getMinGlobalIndex ();
const global_ordinal_type mat_global_end =
  EqualMatDistribution->getMaxGlobalIndex ();
for (global_ordinal_type sInd = mat_global_start; sInd <= mat_global_end;
     sInd++)
{
  .... ...
  fInd = position_function ();
  elements += 2;
  const global_ordinal_type fView = static_cast<global_ordinal_type> (fInd);
  complex_scalar st = phase ();
  // insert the entry and its Hermitian conjugate
  mat->insertGlobalValues (sInd, tuple<global_ordinal_type> (fView),
                           tuple<complex_scalar> (st));
  mat->insertGlobalValues (fView, tuple<global_ordinal_type> (sInd),
                           tuple<complex_scalar> (std::conj (st)));
}
std::cout << "elements: " << elements << std::endl;
mat->fillComplete ();
std::cout << "After fillComp" << std::endl;
----------------- snip -----------------
From a simple debug (print statement), the error occurs at the sparse
matrix's ->fillComplete () method call.
I'd like to emphasize that the same code works just fine up to hundreds of
millions of elements, and fails only at the size we are now attempting, a
few billion non-zeros.
Does anyone have experience with this particular type of error or its cause?
Is there some problem with our local MPI software environment?
Our software environment is as follows:
- gcc 4.9.2 (GCC)
- OpenMPI 1.8.3 (compiled with the above)
- Trilinos 12.0 (compiled and linked with the above two)
The matrix occupies a lot of memory, so to rule out memory problems we
run on 10 fat-memory (160 GB) nodes, using a single process per node.
Any help would be highly appreciated.
--
M. O. Atambo
mikeat4999 at gmail.com
matambo at ictp.it
Ext .139
Room 209.