[Trilinos-Users] [EXTERNAL] Re: MPI error with Large Sparse Matrix

Mike Atambo mikeat4999 at gmail.com
Wed May 27 12:04:03 EDT 2015


Thanks Andrew,
I wasn't able to find where I'm using the 32-bit integers, so I reduced it to a
simple case that does not fill the matrix at all, and I can still reproduce the
behaviour. Here is the MWE:

#include <Tpetra_DefaultPlatform.hpp>
#include <Tpetra_Vector.hpp>
#include <Tpetra_Version.hpp>
#include <Teuchos_GlobalMPISession.hpp>
#include <Teuchos_oblackholestream.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include "Ifpack2_BorderedOperator.hpp"
#include "Ifpack2_Preconditioner.hpp"
#include <Ifpack2_Factory.hpp>
#include "AnasaziBlockKrylovSchurSolMgr.hpp"
#include "AnasaziBasicEigenproblem.hpp"
#include "AnasaziTpetraAdapter.hpp"
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Teuchos_Array.hpp>
#include <Teuchos_ScalarTraits.hpp>
#include <Teuchos_RCP.hpp>
#include <complex>
#include <iostream>
#include <vector>
#include <sys/time.h>

typedef std::complex<double> complex_scalar;
typedef size_t global_ordinal_type;

int main (int argc, char *argv[])
{
  using std::endl;
  using Teuchos::RCP;
  using Teuchos::rcp;

  typedef Tpetra::DefaultPlatform::DefaultPlatformType::NodeType Node;
  typedef Tpetra::Map<global_ordinal_type, global_ordinal_type, Node> matrix_map_type;
  typedef Tpetra::CrsMatrix<complex_scalar, global_ordinal_type,
                            global_ordinal_type> crs_matrix_type;

  Teuchos::oblackholestream blackHole;
  Teuchos::GlobalMPISession mpiSession (&argc, &argv, &blackHole);
  RCP<const Teuchos::Comm<int> > comm =
    Tpetra::DefaultPlatform::getDefaultPlatform ().getComm ();
  const int myRank = comm->getRank ();
  std::ostream& out = (myRank == 0) ? std::cout : blackHole;

  const global_ordinal_type indexBase = 0;
  const Tpetra::global_size_t numMatGlobalEntries = 3000000000;

  // Row map: distribute numMatGlobalEntries rows over the communicator.
  RCP<const matrix_map_type> EqualMatDistribution =
    rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
                              Tpetra::GloballyDistributed));

  // Matrix over that row map.  The 0 is only a hint for the maximum number of
  // entries per row (it may be 0 with DynamicProfile); the last argument
  // selects whether storage is allocated dynamically or statically.
  RCP<crs_matrix_type> mat =
    rcp (new crs_matrix_type (EqualMatDistribution, 0, Tpetra::DynamicProfile));

  mat->fillComplete ();
  return 0;
}


If you can see something (and there may be something obviously wrong), I'd
appreciate the help.
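
For what it's worth, here is a trivial standalone check (nothing Tpetra-specific)
of the width of the size_t I use as global_ordinal_type; on a 64-bit build it
should comfortably hold 3000000000:

#include <cstddef>
#include <iostream>
#include <limits>

int main ()
{
  // size_t should be 8 bytes on a 64-bit build, so 3000000000 fits.
  const std::size_t n = 3000000000ULL;
  std::cout << "sizeof(size_t) = " << sizeof (std::size_t) << "\n";
  std::cout << "max size_t     = " << std::numeric_limits<std::size_t>::max () << "\n";
  std::cout << "n              = " << n << "\n";
  return 0;
}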

Mike


On Tue, May 26, 2015 at 6:32 PM, Bradley, Andrew Michael <ambradl at sandia.gov
> wrote:

>  Hi Mike,
>
>
>  I'm going to make a wild guess as to what is happening.
>
>
>  This guess is based on the line
>
>     Tpetra::CrsMatrix::insertGlobalValues:
> allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
> std::bad_alloc
>
>
>  I think that a size_t is being cast to an int at some point before the
> memory allocation that causes the std::bad_alloc happens. 3B is larger than
> 2^31 - 1, the largest integer an int can hold. Therefore, the int that
> results from a cast from the size_t will be a negative number. new will
> throw a std::bad_alloc if a negative size is requested.
>
>
>  Here's a demonstration of that.
>
>
>  #include <iostream>
> int main () {
>   // Set i to 1 larger than the maximum int.
>   size_t i = 1L << 31;
>   // Cast size_t to int.
>   int j = i;
>   std::cout << i << "\n";
>   std::cout << j << "\n";
>   // Will throw a std::bad_alloc if int is 32 bits.
>   int* k = new int[j];
>   return 0;
> }
>
>
>  One possible reason for the cast is that the global ordinal type (GO) is
> being used to hold the number of nonzeros somewhere in your code instead of
> global_size_t. If that is true and GO is an int, that would produce the
> error I describe above. The best fix is to find those lines in the code and
> fix the type. A quick first check of whether my guess is correct is to
> configure GO to be long long int instead of int and see if the program
> proceeds past the point at which it is currently failing.
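>
>  For illustration, here is roughly what that would look like in the typedefs
> (a sketch only; it assumes your Trilinos build enables the corresponding
> Tpetra instantiation, and I'm guessing at your type names from the snippet
> below):
>
>  #include <complex>
>  #include <Tpetra_Map.hpp>
>  #include <Tpetra_CrsMatrix.hpp>
>
>  typedef std::complex<double> complex_scalar;
>  // 64-bit global ordinal; the local ordinal can stay a 32-bit int.
>  typedef int       local_ordinal_type;
>  typedef long long global_ordinal_type;
>  typedef Tpetra::Map<local_ordinal_type, global_ordinal_type> matrix_map_type;
>  typedef Tpetra::CrsMatrix<complex_scalar, local_ordinal_type,
>                            global_ordinal_type> crs_matrix_type;
>
>  // Totals of global entries/nonzeros belong in Tpetra::global_size_t
>  // (or size_t), not in int and not in a 32-bit global ordinal.
>  const Tpetra::global_size_t numMatGlobalEntries = 3000000000;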
>
>
>  Cheers,
>
> Andrew
>
>
>  ------------------------------
> *From:* Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf of
> Mike Atambo <mikeat4999 at gmail.com>
> *Sent:* Monday, May 25, 2015 3:18 AM
> *To:* trilinos-users at trilinos.org
> *Subject:* [EXTERNAL] Re: [Trilinos-Users] MPI error with Large Sparse
> Matrix
>
>  After additional debugging (I recompiled Trilinos and Open MPI with
> gcc 4.8.2), the error seems to have changed. I'm not sure whether it's a
> different issue or whether I'm exposing the same error with different
> symptoms, so to speak. Please see the errors:
>
>
> =====================================SNIP====================================================================
>
> /mike_debugging/installdir/trilinos-12.0-debug/include/Tpetra_KokkosRefactor_CrsMatrix_def.hpp:1805:
>
>  Throw number = 2
>
>  Throw test that evaluated to true: true
>
>  Tpetra::CrsMatrix::insertGlobalValues:
> allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
> std::bad_alloc
> [cn01-12:11042] *** Process received signal ***
> [cn01-12:11042] Signal: Aborted (6)
> [cn01-12:11042] Signal code:  (-6)
> [cn01-12:11042] [ 0] /lib64/libpthread.so.0[0x314600f710]
> [cn01-12:11042] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3145c32925]
> [cn01-12:11042] [ 2] /lib64/libc.so.6(abort+0x175)[0x3145c34105]
> [cn01-12:11042] [ 3]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f0ac0ce68e5]
> [cn01-12:11042] [ 4]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea56)[0x7f0ac0ce4a56]
> [cn01-12:11042] [ 5]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea83)[0x7f0ac0ce4a83]
> [cn01-12:11042] [ 6]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ecae)[0x7f0ac0ce4cae]
> [cn01-12:11042] [ 7] ./kryanasazi.x[0x5015f8]
> [cn01-12:11042] [ 8] ./kryanasazi.x[0x6f3836]
> [cn01-12:11042] [ 9] ./kryanasazi.x[0x6e6491]
> [cn01-12:11042] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3145c1ed1d]
> [cn01-12:11042] [11] ./kryanasazi.x[0x430395]
> [cn01-12:11042] *** End of error message ***
>
> ........................................................................................................................................................................................................................................................
>
>  In the stdout, some MPI-related issues are also reported; if anyone has
> met this or has a suggestion, please let me know:
>
>
> =================================SNIP=======================================
>  WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.  This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
>  This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.  You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
>  See this Open MPI FAQ item for more information on these Linux kernel
> module
> parameters:
>
>      http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>    Local host:              cn01-12
>   Registerable memory:     16384 MiB
>   Total memory:            40931 MiB
>
>  Your MPI job will continue, but may be behave poorly and/or hang
>
>
> ........................................................................................................................................................
>
> On Thu, May 21, 2015 at 4:26 PM, Mike Atambo <mikeat4999 at gmail.com> wrote:
>
>>  I'm working with the sparse matrices in Trilinos and was able to
>> generate and diagonalize matrices with up to hundreds of millions of
>> non-zeros (all double complex). Once such a matrix is generated, the
>> solver takes only a short while to converge, which indicates that
>> hundreds of millions of non-zeros is still a manageable size. However,
>> when I tried a 96M x 96M sparse matrix with just under 3 billion
>> non-zeros, I was faced with the following error:
>>
>>
>> [cn08-15:28114] *** An error occurred in MPI_Isend
>> [cn08-15:28114] *** reported by process [1773404161,22]
>> [cn08-15:28114] *** on communicator MPI_COMM_WORLD
>> [cn08-15:28114] *** MPI_ERR_COUNT: invalid count argument
>> [cn08-15:28114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [cn08-15:28114] ***    and potentially your MPI job)
>>
>> When creating the matrix at this size, I gave a value of exactly
>> 3 billion to the rcp (new matrix_map_type ...) call, but only
>> 2614493376 entries are actually non-zero.
>>
>> Here is the series of function calls in their exact order,
>>
>> ----------------- snip-----------------
>>   const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
>>   RCP<const matrix_map_type> EqualMatDistribution =
>>     rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
>>                               Tpetra::GloballyDistributed));
>>
>>   // Row map, max-entries-per-row hint (0), dynamic allocation of storage.
>>   RCP<crs_matrix_type> mat =
>>     rcp (new crs_matrix_type (EqualMatDistribution, 0, Tpetra::DynamicProfile));
>>
>>   const size_t mat_global_start = EqualMatDistribution->getMinGlobalIndex ();
>>   const size_t mat_global_end   = EqualMatDistribution->getMaxGlobalIndex ();
>>
>>   for (global_ordinal_type sInd = mat_global_start; sInd <= mat_global_end; sInd++)
>>     {
>>       ... ...
>>
>>       fInd = position_function ();
>>       elements += 2;
>>       const global_ordinal_type fView = static_cast<global_ordinal_type> (fInd);
>>       complex_scalar st = phase ();
>>       // Insert the entry and its Hermitian counterpart.
>>       mat->insertGlobalValues (sInd, tuple<global_ordinal_type> (fView),
>>                                tuple<complex_scalar> (st));
>>       mat->insertGlobalValues (fView, tuple<global_ordinal_type> (sInd),
>>                                tuple<complex_scalar> (std::conj (st)));
>>     }
>>   std::cout << "elements:" << elements << std::endl;
>>   mat->fillComplete ();
>>   std::cout << "After fillComp" << std::endl;
>>
>> ------------------------------------------------ SNIP
>>
>> From a simple debug (print statements), the error occurs at the sparse
>> matrix's fillComplete () method call.
>>
>> I'd like to emphasize that the same code works just fine up to hundreds of
>> millions of elements and only fails at the size we are now attempting, a
>> few billion non-zeros.
>> Does anyone have experience with this particular type of error or its cause?
>> Is there some problem with our local MPI software environment?
>>
>> Our  software environment is as follows:
>> - gcc version 4.9.2 (GCC)
>> - openmpi 1.8.3   (Compiled with above)
>> - trilinos 12.0    (Compiled and linked with above two)
>>
>> The matrix occupies a lot of memory, so to eliminate memory problems we
>> run on 10 fat-memory (160 GB) nodes, using a single process per node.
>>
>>
>> Any help would be highly appreciated.
>>
>>
>>
>>
>>
>>  --
>> M. O. Atambo
>> mikeat4999 at gmail.com
>> matambo at ictp.it
>> Ext .139
>> Room 209.
>>
>>
>
>
>  --
>  M. O. Atambo
> mikeat4999 at gmail.com
> matambo at ictp.it
> Ext .139
> Room 209.
>
>


-- 
M. O. Atambo
mikeat4999 at gmail.com
matambo at ictp.it
Ext .139
Room 209.