[Trilinos-Users] [EXTERNAL] Re: MPI error with Large Sparse Matrix

Mike Atambo mikeat4999 at gmail.com
Wed May 27 13:14:31 EDT 2015


Andrew,
Thanks for the quick reply. I see what you mean in the example, but
unfortunately I'm having some difficulty reconciling it with what is
in the documentation:
http://trilinos.org/docs/dev/packages/tpetra/doc/html/classTpetra_1_1Map.html#aeb556362352e2ae62804601e36672a6c
How is the case where the map indicates the number of rows distinguished
from the case where the map is created with the number of global entries?
The docs seem to mention only the latter, unless I'm missing something.
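
For concreteness, here are the two readings I am trying to tell apart
(just a sketch reusing the typedefs from the MWE further down; I may
well be misreading the docs):

  // Reading A: the constructor argument is the total number of nonzero
  // entries of the matrix (what my MWE currently passes).
  const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
  RCP<const matrix_map_type> mapA =
           rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
                                     Tpetra::GloballyDistributed));

  // Reading B: the constructor argument is the number of global entries
  // of the *map*, i.e. the number of matrix rows when the map is used as
  // a row map (96M for our 96M x 96M matrix).
  const Tpetra::global_size_t numGlobalRows = 96000000;
  RCP<const matrix_map_type> mapB =
           rcp (new matrix_map_type (numGlobalRows, indexBase, comm,
                                     Tpetra::GloballyDistributed));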

Mike

On Wed, May 27, 2015 at 6:23 PM, Bradley, Andrew Michael <ambradl at sandia.gov
> wrote:

>  Hi Mike,
>
>
>  Thanks very much for the minimal breaking example. That helped. It looks
> like the issue is in how the map is set up. The map is currently set up like this:
>
> const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
>   RCP<const matrix_map_type> EqualMatDistribution =
>            rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
>                            Tpetra::GloballyDistributed));
>
> This map is meant to be the row map that is used in the construction of
> the CrsMatrix. Hence the first argument is not the number of nonzeros in
> the matrix (numMatGlobalEntries), but rather the number of rows. For more
> on setting up maps for use in a CrsMatrix, see this example:
>
>
>      tpetra/core/example/Lesson03-Power-Method/lesson03_power_method.cpp
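>
>  In case it helps, here is a minimal sketch of what I mean, reusing your
> typedefs (the 96M row count is taken from your earlier message, so adjust
> as needed):
>
>   // The row map distributes the ROWS of the matrix, so its global size
>   // is the number of rows, not the number of nonzeros.
>   const Tpetra::global_size_t numGlobalRows = 96000000;
>   RCP<const matrix_map_type> rowMap =
>            rcp (new matrix_map_type (numGlobalRows, indexBase, comm,
>                            Tpetra::GloballyDistributed));
>
>   // The expected number of nonzeros enters only as a per-row hint
>   // (here 0, since you use DynamicProfile); it is never the map size.
>   RCP<crs_matrix_type> mat =
>            rcp (new crs_matrix_type (rowMap, 0, Tpetra::DynamicProfile));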
>
>
>  Cheers,
>
> Andrew
>
>
>  ------------------------------
> *From:* Mike Atambo <mikeat4999 at gmail.com>
> *Sent:* Wednesday, May 27, 2015 10:04 AM
> *To:* Bradley, Andrew Michael
> *Cc:* trilinos-users at trilinos.org
> *Subject:* Re: [EXTERNAL] Re: [Trilinos-Users] MPI error with Large
> Sparse Matrix
>
>  Thanks Andrew,
> I wasn't able to find where I'm using 32-bit integers, so I reduced it to
> a simple case that does not fill the matrix at all, and I can still
> reproduce the behaviour. Here is the MWE:
>
>  #include <Tpetra_DefaultPlatform.hpp>
> #include <Tpetra_Vector.hpp>
> #include <Tpetra_Version.hpp>
> #include <Teuchos_GlobalMPISession.hpp>
> #include <Teuchos_oblackholestream.hpp>
> #include <Tpetra_CrsMatrix.hpp>
> #include "Ifpack2_BorderedOperator.hpp"
> #include "Ifpack2_Preconditioner.hpp"
> #include <Ifpack2_Factory.hpp>
> #include "AnasaziBlockKrylovSchurSolMgr.hpp"
> #include "AnasaziBasicEigenproblem.hpp"
> #include "AnasaziTpetraAdapter.hpp"
> #include <Tpetra_Map.hpp>
> #include <Tpetra_MultiVector.hpp>
> #include <Teuchos_Array.hpp>
> #include <Teuchos_ScalarTraits.hpp>
> #include <Teuchos_RCP.hpp>
> #include <vector>
> #include <sys/time.h>
>
>  typedef std::complex<double>  complex_scalar ;
> typedef size_t  global_ordinal_type;
>
>
>
>  int main (int argc, char *argv[])
> {
>   using std::endl;
>   using Teuchos::RCP;
>   using Teuchos::rcp;
>   typedef Tpetra::DefaultPlatform::DefaultPlatformType::NodeType  Node;
>   typedef Tpetra::Map<global_ordinal_type , global_ordinal_type, Node>
> matrix_map_type;
>   Teuchos::oblackholestream blackHole;
>   Teuchos::GlobalMPISession mpiSession (&argc, &argv, &blackHole);
>   RCP<const Teuchos::Comm<int> > comm =
>  Tpetra::DefaultPlatform::getDefaultPlatform ().getComm ();
>   const int myRank = comm->getRank ();
>   std::ostream& out = (myRank == 0) ? std::cout : blackHole;
>   typedef Tpetra::CrsMatrix< complex_scalar, global_ordinal_type,
> global_ordinal_type> crs_matrix_type ;
>   const global_ordinal_type indexBase = 0;
>   const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
>   RCP<const matrix_map_type> EqualMatDistribution =
>            rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
>                            Tpetra::GloballyDistributed));
>
>   RCP<crs_matrix_type> mat =
>     rcp (new crs_matrix_type (EqualMatDistribution,     // row map: how the rows of the matrix are distributed
>                               0,                        // hint: max number of entries per row; may be 0 with DynamicProfile
>                               Tpetra::DynamicProfile)); // allocate storage dynamically rather than statically
>   mat->fillComplete ();
>   return 0;
> }
>
>
>  If you can see something (and there may be something obviously wrong),
> I'd appreciate the help.
>
>  Mike
>
>
> On Tue, May 26, 2015 at 6:32 PM, Bradley, Andrew Michael <
> ambradl at sandia.gov> wrote:
>
>>  Hi Mike,
>>
>>
>>  I'm going to make a wild guess as to what is happening.
>>
>>
>>  This guess is based on the line
>>
>>     Tpetra::CrsMatrix::insertGlobalValues:
>> allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
>> std::bad_alloc
>>
>>
>>  I think that a size_t is being cast to an int at some point before the
>> memory allocation that throws the std::bad_alloc. 3 billion is larger than
>> 2^31 - 1 (2147483647), the largest value a 32-bit int can hold, so the int
>> that results from the cast will be a negative number, and new throws a
>> std::bad_alloc when a negative size is requested.
>>
>>
>>  Here's a demonstration of that.
>>
>>
>>  #include <cstddef>
>> #include <iostream>
>> int main () {
>>   // Set i to 2^31, i.e. 1 larger than the maximum 32-bit int.
>>   // (size_t (1) rather than 1L keeps the shift well defined even where
>>   // long is 32 bits.)
>>   size_t i = size_t (1) << 31;
>>   // Cast the size_t to an int; the value no longer fits, so the result
>>   // is a negative number.
>>   int j = i;
>>   std::cout << i << "\n";
>>   std::cout << j << "\n";
>>   // Will throw a std::bad_alloc if int is 32 bits, because the requested
>>   // array size is negative.
>>   int* k = new int[j];
>>   return 0;
>> }
>>
>>
>>  One possible reason for the cast is that the global ordinal type (GO)
>> is being used to hold the number of nonzeros somewhere in your code instead
>> of global_size_t. If that is true and GO is an int, that would produce the
>> error I describe above. The best fix is to find those lines in the code and
>> fix the type. A quick first check of whether my guess is correct is to
>> configure GO to be long long int instead of int and see if the program
>> proceeds past the point at which it is currently failing.
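>>
>>  If it helps while hunting for those lines, here is a minimal sketch of
>> the kind of guard I mean (a hypothetical helper, not part of Tpetra) that
>> makes the narrowing cast fail loudly instead of silently wrapping to a
>> negative value:
>>
>> #include <cstddef>
>> #include <limits>
>> #include <stdexcept>
>>
>> template <typename GO>
>> GO checked_narrow (std::size_t count) {
>>   // Refuse to narrow a count that does not fit in the global ordinal
>>   // type (e.g. a 32-bit int) instead of letting it go negative.
>>   if (count > static_cast<std::size_t> (std::numeric_limits<GO>::max ()))
>>     throw std::overflow_error ("count does not fit in the global ordinal type");
>>   return static_cast<GO> (count);
>> }
>>
>> // Usage, e.g.: const GO n = checked_narrow<GO> (numGlobalNonzeros);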
>>
>>
>>  Cheers,
>>
>> Andrew
>>
>>
>>  ------------------------------
>> *From:* Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf
>> of Mike Atambo <mikeat4999 at gmail.com>
>> *Sent:* Monday, May 25, 2015 3:18 AM
>> *To:* trilinos-users at trilinos.org
>> *Subject:* [EXTERNAL] Re: [Trilinos-Users] MPI error with Large Sparse
>> Matrix
>>
>>   After additional debugging (recompiling Trilinos and Open MPI with
>> gcc 4.8.2), the error seems to have changed. I'm not sure whether it's a
>> different issue or the same error showing different symptoms, so to
>> speak. Please see the errors:
>>
>>
>> =====================================SNIP====================================================================
>>
>> /mike_debugging/installdir/trilinos-12.0-debug/include/Tpetra_KokkosRefactor_CrsMatrix_def.hpp:1805:
>>
>>  Throw number = 2
>>
>>  Throw test that evaluated to true: true
>>
>>  Tpetra::CrsMatrix::insertGlobalValues:
>> allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
>> std::bad_alloc
>> [cn01-12:11042] *** Process received signal ***
>> [cn01-12:11042] Signal: Aborted (6)
>> [cn01-12:11042] Signal code:  (-6)
>> [cn01-12:11042] [ 0] /lib64/libpthread.so.0[0x314600f710]
>> [cn01-12:11042] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3145c32925]
>> [cn01-12:11042] [ 2] /lib64/libc.so.6(abort+0x175)[0x3145c34105]
>> [cn01-12:11042] [ 3]
>> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f0ac0ce68e5]
>> [cn01-12:11042] [ 4]
>> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea56)[0x7f0ac0ce4a56]
>> [cn01-12:11042] [ 5]
>> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea83)[0x7f0ac0ce4a83]
>> [cn01-12:11042] [ 6]
>> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ecae)[0x7f0ac0ce4cae]
>> [cn01-12:11042] [ 7] ./kryanasazi.x[0x5015f8]
>> [cn01-12:11042] [ 8] ./kryanasazi.x[0x6f3836]
>> [cn01-12:11042] [ 9] ./kryanasazi.x[0x6e6491]
>> [cn01-12:11042] [10]
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x3145c1ed1d]
>> [cn01-12:11042] [11] ./kryanasazi.x[0x430395]
>> [cn01-12:11042] *** End of error message ***
>>
>> ........................................................................................................................................................................................................................................................
>>
>>  In the stdout there are also some MPI-related issues reported; if anyone
>> has encountered this or has a suggestion, please let me know:
>>
>>
>> =================================SNIP=======================================
>>  WARNING: It appears that your OpenFabrics subsystem is configured to
>> only
>> allow registering part of your physical memory.  This can cause MPI jobs
>> to
>> run with erratic performance, hang, and/or crash.
>>
>>  This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered.  You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>>  See this Open MPI FAQ item for more information on these Linux kernel
>> module
>> parameters:
>>
>>      http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>>    Local host:              cn01-12
>>   Registerable memory:     16384 MiB
>>   Total memory:            40931 MiB
>>
>>  Your MPI job will continue, but may be behave poorly and/or hang
>>
>>
>> ........................................................................................................................................................
>>
>> On Thu, May 21, 2015 at 4:26 PM, Mike Atambo <mikeat4999 at gmail.com>
>> wrote:
>>
>>>  I'm working with sparse matrices in Trilinos and was able to generate
>>> and diagonalize matrices with up to hundreds of millions of non-zeros
>>> (all double complex). Once such a matrix is generated, the solver takes
>>> only a short while to converge, which indicates that hundreds of millions
>>> of non-zeros is still a manageable size. However, when I tried a 96M x
>>> 96M sparse matrix with just under 3 billion non-zeros, I was faced with
>>> the following error:
>>>
>>>
>>> [cn08-15:28114] *** An error occurred in MPI_Isend
>>> [cn08-15:28114] *** reported by process [1773404161,22]
>>> [cn08-15:28114] *** on communicator MPI_COMM_WORLD
>>> [cn08-15:28114] *** MPI_ERR_COUNT: invalid count argument
>>> [cn08-15:28114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>> will now abort,
>>> [cn08-15:28114] ***    and potentially your MPI job)
>>>
>>> When creating this particular size of matrix, I gave the value of
>>> exactly 3 billion nonzeros to the rcp (new matrix_map_type .. ) call,
>>> but only 2614493376 entries were actually non-zero.
>>>
>>> Here is the series of function calls, in their exact order:
>>>
>>> ----------------- snip-----------------
>>>   const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
>>>   RCP<const matrix_map_type> EqualMatDistribution =
>>>            rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
>>>                            Tpetra::GloballyDistributed));
>>>
>>>   RCP<crs_matrix_type> mat =
>>>     rcp (new crs_matrix_type (EqualMatDistribution,
>>>                               0,                      // hint: max entries per row
>>>                               Tpetra::DynamicProfile));
>>>
>>>   const size_t mat_global_start = EqualMatDistribution->getMinGlobalIndex ();
>>>   const size_t mat_global_end   = EqualMatDistribution->getMaxGlobalIndex ();
>>>
>>>   for (global_ordinal_type sInd = mat_global_start; sInd <= mat_global_end; sInd++)
>>>     {
>>>        ...
>>>        fInd = position_function ();
>>>        elements += 2;
>>>        const global_ordinal_type fView = static_cast<global_ordinal_type> (fInd);
>>>        complex_scalar st = phase ();
>>>        mat->insertGlobalValues (sInd, tuple<global_ordinal_type> (fView),
>>>                                 tuple<complex_scalar> (st));
>>>        mat->insertGlobalValues (fView, tuple<global_ordinal_type> (sInd),
>>>                                 tuple<complex_scalar> (std::conj (st)));
>>>     }
>>>   std::cout << "elements:" << elements << std::endl;
>>>   mat->fillComplete ();
>>>   std::cout << "After fillComp" << std::endl;
>>>
>>> ------------------------------------------------ SNIP
>>>
>>> From a simple debug print statement, the error occurs at the sparse
>>> matrix's fillComplete () method call.
>>>
>>> I'd like to emphasize that the same code works just fine up to hundreds
>>> of millions of elements, and only fails at the size we are now
>>> attempting, a few billion non-zeros.
>>> Has anyone any experience with this particular type of error or its
>>> cause? Is there some problem with our local MPI software environment?
>>>
>>> Our  software environment is as follows:
>>> - gcc version 4.9.2 (GCC)
>>> - openmpi 1.8.3   (Compiled with above)
>>> - trilinos 12.0    (Compiled and linked with above two)
>>>
>>> The matrix occupies a lot of memory, so to eliminate memory problems we
>>> run on 10 fat-memory (160 GB) nodes, using a single process per node.
>>>
>>>
>>> Any help would be highly appreciated.
>>>
>>>
>>>
>>>
>>>
>>>  --
>>> M. O. Atambo
>>> mikeat4999 at gmail.com
>>> matambo at ictp.it
>>> Ext .139
>>> Room 209.
>>>
>>>
>>
>>
>>  --
>>  M. O. Atambo
>> mikeat4999 at gmail.com
>> matambo at ictp.it
>> Ext .139
>> Room 209.
>>
>>
>
>
>  --
>  M. O. Atambo
> mikeat4999 at gmail.com
> matambo at ictp.it
> Ext .139
> Room 209.
>
>


-- 
M. O. Atambo
mikeat4999 at gmail.com
matambo at ictp.it
Ext .139
Room 209.