[Trilinos-Users] MPI error with Large Sparse Matrix

Mike Atambo mikeat4999 at gmail.com
Mon May 25 05:18:12 EDT 2015


After additional debugging (I recompiled Trilinos and Open MPI with GCC
4.8.2), the error seems to have changed. I'm not sure whether it's a
different issue or whether I'm exposing the same error with different
symptoms, so to speak. Please see the errors:

=====================================SNIP====================================================================
/mike_debugging/installdir/trilinos-12.0-debug/include/Tpetra_KokkosRefactor_CrsMatrix_def.hpp:1805:

Throw number = 2

Throw test that evaluated to true: true

Tpetra::CrsMatrix::insertGlobalValues:
allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
std::bad_alloc
[cn01-12:11042] *** Process received signal ***
[cn01-12:11042] Signal: Aborted (6)
[cn01-12:11042] Signal code:  (-6)
[cn01-12:11042] [ 0] /lib64/libpthread.so.0[0x314600f710]
[cn01-12:11042] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3145c32925]
[cn01-12:11042] [ 2] /lib64/libc.so.6(abort+0x175)[0x3145c34105]
[cn01-12:11042] [ 3]
/u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f0ac0ce68e5]
[cn01-12:11042] [ 4]
/u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea56)[0x7f0ac0ce4a56]
[cn01-12:11042] [ 5]
/u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea83)[0x7f0ac0ce4a83]
[cn01-12:11042] [ 6]
/u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ecae)[0x7f0ac0ce4cae]
[cn01-12:11042] [ 7] ./kryanasazi.x[0x5015f8]
[cn01-12:11042] [ 8] ./kryanasazi.x[0x6f3836]
[cn01-12:11042] [ 9] ./kryanasazi.x[0x6e6491]
[cn01-12:11042] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3145c1ed1d]
[cn01-12:11042] [11] ./kryanasazi.x[0x430395]
[cn01-12:11042] *** End of error message ***
==============================================================================
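
One way to narrow down which rank and row actually run out of memory would
be to wrap the insert calls, roughly as follows (only a sketch: mat, comm,
sInd, fView and st are the names from the code quoted below, and since
Tpetra wraps the bad_alloc in its own exception the catch is generic):

----------------- sketch -----------------
  try {
    mat->insertGlobalValues (sInd, tuple<global_ordinal_type> (fView),
                             tuple<complex_scalar> (st));
  }
  catch (const std::exception& e) {
    // Report the rank and global row at which the allocation failed,
    // then rethrow so the job still stops.
    std::cerr << "Rank " << comm->getRank ()
              << ": insert into global row " << sInd
              << " failed: " << e.what () << std::endl;
    throw;
  }
----------------- sketch -----------------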

In the stdout, there are some MPI-related issues reported; if anyone has
encountered this or has a suggestion, please let me know:

=================================SNIP=======================================
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              cn01-12
  Registerable memory:     16384 MiB
  Total memory:            40931 MiB

Your MPI job will continue, but may be behave poorly and/or hang

==============================================================================
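
Following the FAQ link above, on Mellanox mlx4 HCAs the registration limit
is controlled by kernel module parameters; something along these lines
should raise it, although the module name and values here are only a guess
for our nodes and depend on the actual HCA and driver:

----------------- sketch -----------------
# /etc/modprobe.d/mlx4_core.conf  (illustrative values only)
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3
----------------- sketch -----------------

With 4 KiB pages that corresponds to 2^24 * 2^3 * 4 KiB = 512 GiB of
registerable memory, comfortably above the 40931 MiB reported for the node.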

On Thu, May 21, 2015 at 4:26 PM, Mike Atambo <mikeat4999 at gmail.com> wrote:

> I'm working with the sparse matrices in Trilinos and was able to generate
> and diagonalize matrices with up to hundreds of millions of non-zeros
> (all double complex). Once the matrix is generated, the solver takes only
> a short while to converge, which indicates we are still at manageable
> sizes at hundreds of millions of non-zeros. However, I tried a 96M x 96M
> sparse matrix with just under 3 billion non-zeros, and I am faced with
> the following error:
>
>
> [cn08-15:28114] *** An error occurred in MPI_Isend
> [cn08-15:28114] *** reported by process [1773404161,22]
> [cn08-15:28114] *** on communicator MPI_COMM_WORLD
> [cn08-15:28114] *** MPI_ERR_COUNT: invalid count argument
> [cn08-15:28114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [cn08-15:28114] ***    and potentially your MPI job)
>
> When creating this particular size of matrix, I gave a value of exactly
> 3 billion non-zeros to the rcp (new matrix_map_type ..) call, but only
> 2,614,493,376 entries were actually non-zero.
>
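> (One thing I wonder about: MPI_Isend takes its count as a C int, so if a
> count larger than 2^31 - 1 = 2,147,483,647 ever gets narrowed to int it
> wraps to a negative value, which Open MPI rejects as MPI_ERR_COUNT.  Our
> 2,614,493,376 entries are past that limit, so this could be what is
> happening during the communication.  A minimal sketch of staying under
> the limit, purely illustrative and not taken from Tpetra, follows.)
>
> ----------------- sketch -----------------
> #include <mpi.h>
> #include <algorithm>
> #include <climits>
> #include <vector>
>
> // Post one or more MPI_Isend calls so that each count fits in an int.
> // buf, n, dest and tag are hypothetical parameters for illustration.
> void isend_large (const double* buf, long long n, int dest, int tag,
>                   MPI_Comm comm, std::vector<MPI_Request>& requests)
> {
>   long long offset = 0;
>   while (offset < n) {
>     const int count =
>       static_cast<int> (std::min<long long> (INT_MAX, n - offset));
>     MPI_Request req;
>     MPI_Isend (buf + offset, count, MPI_DOUBLE, dest, tag, comm, &req);
>     requests.push_back (req);
>     offset += count;
>   }
> }
> ----------------- sketch -----------------
>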
> Here is the series of function calls, in their exact order:
>
> ----------------- snip -----------------
>   const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
>   RCP<const matrix_map_type> EqualMatDistribution =
>     rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
>                               Tpetra::GloballyDistributed));
>
>   RCP<crs_matrix_type> mat =
>     rcp (new crs_matrix_type (EqualMatDistribution, 0, Tpetra::DynamicProfile));
>
>   const size_t mat_global_start = EqualMatDistribution->getMinGlobalIndex ();
>   const size_t mat_global_end   = EqualMatDistribution->getMaxGlobalIndex ();
>
>   for (global_ordinal_type sInd = mat_global_start; sInd <= mat_global_end; sInd++)
>     {
>       ... ...
>
>       fInd = position_function ();
>       elements += 2;
>       const global_ordinal_type fView = static_cast<global_ordinal_type> (fInd);
>       complex_scalar st = phase ();
>       // insert the entry and its Hermitian-conjugate counterpart
>       mat->insertGlobalValues (sInd, tuple<global_ordinal_type> (fView),
>                                tuple<complex_scalar> (st));
>       mat->insertGlobalValues (fView, tuple<global_ordinal_type> (sInd),
>                                tuple<complex_scalar> (std::conj (st)));
>     }
>   std::cout << "elements:" << elements << std::endl;
>   mat->fillComplete ();
>   std::cout << "After fillComp" << std::endl;
>
> ----------------- snip -----------------
>
> From simple debugging (a print statement), the error occurs at the
> sparse matrix's fillComplete () method call.
>
> I'd like to emphasize that the same code works just fine up to hundreds
> of millions of elements, and fails at the size we are now attempting of
> a few billion non-zeros.
> Does anyone have experience with this particular type of error or its
> cause? Is there some problem with our local MPI software environment?
>
> Our software environment is as follows:
> - GCC 4.9.2
> - Open MPI 1.8.3 (compiled with the above)
> - Trilinos 12.0 (compiled and linked with the above two)
>
> The matrix occupies a lot of memory, so to eliminate memory problems we
> run on 10 fat-memory (160 GB) nodes, using a single process per node.
>
>
> Any help would be highly appreciated.
>
>
>
>
>
> --
> M. O. Atambo
> mikeat4999 at gmail.com
> matambo at ictp.it
> Ext .139
> Room 209.
>
>


-- 
M. O. Atambo
mikeat4999 at gmail.com
matambo at ictp.it
Ext .139
Room 209.