[Trilinos-Users] [EXTERNAL] Re: MPI error with Large Sparse Matrix
Mike Atambo
mikeat4999 at gmail.com
Wed May 27 12:04:03 EDT 2015
Thanks Andrew,
I wasn't able to find where I'm using 32-bit integers, so I reduced the code
to a simple case that doesn't even fill the matrix, and I can still
reproduce the behaviour. Here is the MWE:
#include <Tpetra_DefaultPlatform.hpp>
#include <Tpetra_Vector.hpp>
#include <Tpetra_Version.hpp>
#include <Teuchos_GlobalMPISession.hpp>
#include <Teuchos_oblackholestream.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include "Ifpack2_BorderedOperator.hpp"
#include "Ifpack2_Preconditioner.hpp"
#include <Ifpack2_Factory.hpp>
#include "AnasaziBlockKrylovSchurSolMgr.hpp"
#include "AnasaziBasicEigenproblem.hpp"
#include "AnasaziTpetraAdapter.hpp"
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Teuchos_Array.hpp>
#include <Teuchos_ScalarTraits.hpp>
#include <Teuchos_RCP.hpp>
#include <vector>
#include <sys/time.h>
typedef std::complex<double> complex_scalar;
typedef size_t global_ordinal_type;
int main (int argc, char *argv[])
{
  using std::endl;
  using Teuchos::RCP;
  using Teuchos::rcp;
  typedef Tpetra::DefaultPlatform::DefaultPlatformType::NodeType Node;
  typedef Tpetra::Map<global_ordinal_type, global_ordinal_type, Node> matrix_map_type;
  typedef Tpetra::CrsMatrix<complex_scalar, global_ordinal_type,
                            global_ordinal_type> crs_matrix_type;

  Teuchos::oblackholestream blackHole;
  Teuchos::GlobalMPISession mpiSession (&argc, &argv, &blackHole);
  RCP<const Teuchos::Comm<int> > comm =
    Tpetra::DefaultPlatform::getDefaultPlatform ().getComm ();
  const int myRank = comm->getRank ();
  std::ostream& out = (myRank == 0) ? std::cout : blackHole;

  const global_ordinal_type indexBase = 0;
  const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
  RCP<const matrix_map_type> EqualMatDistribution =
    rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
                              Tpetra::GloballyDistributed));
  RCP<crs_matrix_type> mat =
    rcp (new crs_matrix_type (EqualMatDistribution,     // rowMap: distribution of the rows of the matrix
                              0,                        // hint for max entries per row; may be 0 with DynamicProfile
                              Tpetra::DynamicProfile)); // whether to allocate storage dynamically or statically
  mat->fillComplete ();
  return 0;
}
If you can see something (and there may well be something obviously wrong),
I'd appreciate the help.
Mike
On Tue, May 26, 2015 at 6:32 PM, Bradley, Andrew Michael <ambradl at sandia.gov> wrote:
> Hi Mike,
>
>
> I'm going to make a wild guess as to what is happening.
>
>
> This guess is based on the line
>
> Tpetra::CrsMatrix::insertGlobalValues:
> allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
> std::bad_alloc
>
>
> I think that a size_t is being cast to an int at some point before the
> memory allocation that causes the std::bad_alloc. 3 billion is larger than
> 2^31 - 1 = 2,147,483,647, the largest value a 32-bit int can hold, so the
> int that results from the cast from the size_t will be a negative number.
> new will throw a std::bad_alloc if a negative size is requested.
>
>
> Here's a demonstration of that.
>
>
> #include <iostream>
> int main () {
>   // Set i to 1 larger than the maximum int.
>   size_t i = 1L << 31;
>   // Cast size_t to int.
>   int j = i;
>   std::cout << i << "\n";
>   std::cout << j << "\n";
>   // Will throw a std::bad_alloc if int is 32 bits.
>   int* k = new int[j];
>   return 0;
> }
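>
> (On a typical LP64 Linux system this prints 2147483648 for i and a
> negative value, usually -2147483648, for j; the exact result of the
> narrowing cast is implementation-defined, but the sign is what makes
> new fail.)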
>
>
> One possible reason for the cast is that the global ordinal type (GO) is
> being used to hold the number of nonzeros somewhere in your code instead of
> global_size_t. If that is true and GO is an int, that would produce the
> error I describe above. The best fix is to find those lines in the code and
> fix the type. A quick first check of whether my guess is correct is to
> configure GO to be long long int instead of int and see if the program
> proceeds past the point at which it is currently failing.
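>
> If it helps, here is a minimal standalone sketch of the arithmetic
> (independent of Trilinos; the 3000000000 entry count is taken from your
> code, everything else is just illustration):
>
> #include <iostream>
> #include <limits>
> #include <cstddef>
>
> int main () {
>   // The global entry count from your program.
>   const size_t nnz = 3000000000UL;
>   // A 32-bit GO (plain int) cannot represent 3e9 ...
>   std::cout << "max int: " << std::numeric_limits<int>::max () << "\n";
>   std::cout << "fits in int? "
>             << (nnz <= static_cast<size_t> (std::numeric_limits<int>::max ()))
>             << "\n"; // prints 0
>   // ... while a 64-bit GO (long long) can.
>   std::cout << "max long long: " << std::numeric_limits<long long>::max () << "\n";
>   std::cout << "fits in long long? "
>             << (nnz <= static_cast<size_t> (std::numeric_limits<long long>::max ()))
>             << "\n"; // prints 1
>   return 0;
> }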
>
>
> Cheers,
>
> Andrew
>
>
> ------------------------------
> *From:* Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf of
> Mike Atambo <mikeat4999 at gmail.com>
> *Sent:* Monday, May 25, 2015 3:18 AM
> *To:* trilinos-users at trilinos.org
> *Subject:* [EXTERNAL] Re: [Trilinos-Users] MPI error with Large Sparse
> Matrix
>
> After additional debugging (I recompiled Trilinos and OpenMPI with
> gcc 4.8.2), the error seems to have changed. I'm not sure whether it's a
> different issue, or whether I'm exposing the same error with different
> symptoms, so to speak. Please see the errors:
>
>
> =====================================SNIP====================================================================
>
> /mike_debugging/installdir/trilinos-12.0-debug/include/Tpetra_KokkosRefactor_CrsMatrix_def.hpp:1805:
>
> Throw number = 2
>
> Throw test that evaluated to true: true
>
> Tpetra::CrsMatrix::insertGlobalValues:
> allocateValues(GlobalIndices,GraphNotYetAllocated) threw an exception:
> std::bad_alloc
> [cn01-12:11042] *** Process received signal ***
> [cn01-12:11042] Signal: Aborted (6)
> [cn01-12:11042] Signal code: (-6)
> [cn01-12:11042] [ 0] /lib64/libpthread.so.0[0x314600f710]
> [cn01-12:11042] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3145c32925]
> [cn01-12:11042] [ 2] /lib64/libc.so.6(abort+0x175)[0x3145c34105]
> [cn01-12:11042] [ 3]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f0ac0ce68e5]
> [cn01-12:11042] [ 4]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea56)[0x7f0ac0ce4a56]
> [cn01-12:11042] [ 5]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ea83)[0x7f0ac0ce4a83]
> [cn01-12:11042] [ 6]
> /u/shared/programs/x86_64/gcc/4.8.2/lib64/libstdc++.so.6(+0x5ecae)[0x7f0ac0ce4cae]
> [cn01-12:11042] [ 7] ./kryanasazi.x[0x5015f8]
> [cn01-12:11042] [ 8] ./kryanasazi.x[0x6f3836]
> [cn01-12:11042] [ 9] ./kryanasazi.x[0x6e6491]
> [cn01-12:11042] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3145c1ed1d]
> [cn01-12:11042] [11] ./kryanasazi.x[0x430395]
> [cn01-12:11042] *** End of error message ***
>
> ........................................................................
>
> In the stdout, some MPI-related issues are reported as well; if anyone
> has met this or has a suggestion, please let me know:
>
>
> =================================SNIP=======================================
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory. This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered. You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel
> module
> parameters:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host: cn01-12
> Registerable memory: 16384 MiB
> Total memory: 40931 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang
>
>
> ........................................................................
>
> On Thu, May 21, 2015 at 4:26 PM, Mike Atambo <mikeat4999 at gmail.com> wrote:
>
>> I'm working with sparse matrices in Trilinos and was able to generate
>> and diagonalize matrices with up to hundreds of millions of non-zeros
>> (all double complex). Once such a matrix is generated, the solver takes
>> only a short while to converge, which indicates we are still at
>> manageable sizes at hundreds of millions of non-zeros. However, I tried
>> a 96M x 96M sparse matrix with just under 3 billion non-zeros, and I'm
>> faced with the following error:
>>
>>
>> [cn08-15:28114] *** An error occurred in MPI_Isend
>> [cn08-15:28114] *** reported by process [1773404161,22]
>> [cn08-15:28114] *** on communicator MPI_COMM_WORLD
>> [cn08-15:28114] *** MPI_ERR_COUNT: invalid count argument
>> [cn08-15:28114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [cn08-15:28114] *** and potentially your MPI job)
>>
>> When creating the matrix at this particular size, I gave the value of
>> exactly 3 billion non-zeros to the rcp (new matrix_map_type .. )
>> call, but only 2614493376 entries were actually non-zero.
>>
>> Here is the series of function calls, in their exact order:
>>
>> ----------------- snip-----------------
>> const Tpetra::global_size_t numMatGlobalEntries = 3000000000;
>> RCP<const matrix_map_type> EqualMatDistribution =
>>   rcp (new matrix_map_type (numMatGlobalEntries, indexBase, comm,
>>                             Tpetra::GloballyDistributed));
>>
>> RCP<crs_matrix_type> mat =
>>   rcp (new crs_matrix_type (EqualMatDistribution,
>>                             0, // hint for max entries per row
>>                             Tpetra::DynamicProfile));
>>
>> const size_t mat_global_start = EqualMatDistribution->getMinGlobalIndex ();
>> const size_t mat_global_end   = EqualMatDistribution->getMaxGlobalIndex ();
>>
>> for (global_ordinal_type sInd = mat_global_start; sInd <= mat_global_end; sInd++)
>> {
>>   .... ...
>>
>>   fInd = position_function ();
>>   elements += 2;
>>   const global_ordinal_type fView = static_cast<global_ordinal_type> (fInd);
>>   complex_scalar st = phase ();
>>   mat->insertGlobalValues (sInd, tuple<global_ordinal_type> (fView),
>>                            tuple<complex_scalar> (st));
>>   mat->insertGlobalValues (fView, tuple<global_ordinal_type> (sInd),
>>                            tuple<complex_scalar> (std::conj (st)));
>> }
>> std::cout << "elements:" << elements << std::endl;
>> mat->fillComplete ();
>> std::cout << "After fillComp" << std::endl;
>>
>> ------------------------------------------------ SNIP
>>
>> From a simple debug print statement, the error occurs at the
>> mat->fillComplete () method call.
>>
>> I'd like to emphasize that the same code works just fine up to hundreds
>> of millions of elements, and fails only at the size we are now
>> attempting, a few billion non-zeros.
>> Has anyone any experience with this particular type of error or its cause?
>> Is there some problem with our local MPI software environment?
>>
>> Our software environment is as follows:
>> - gcc version 4.9.2 (GCC)
>> - openmpi 1.8.3 (Compiled with above)
>> - trilinos 12.0 (Compiled and linked with above two)
>>
>> The matrix occupies a lot of memory, and to eliminate memory problems we
>> run on 10 fat-memory (160 GB) nodes, using a single process per node.
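>>
>> (As a rough estimate: just under 3e9 complex<double> values take about
>> 3e9 * 16 B = 48 GB, plus roughly 24 GB for column indices if those are
>> 8 bytes each, so on the order of 72 GB before any CRS overhead; the
>> aggregate 10 x 160 GB should be ample.)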
>>
>>
>> Any help would be highly appreciated.
>>
>>
>>
>>
>>
>> --
>> M. O. Atambo
>> mikeat4999 at gmail.com
>> matambo at ictp.it
>> Ext .139
>> Room 209.
>>
>>
>
>
> --
> M. O. Atambo
> mikeat4999 at gmail.com
> matambo at ictp.it
> Ext .139
> Room 209.
>
>
--
M. O. Atambo
mikeat4999 at gmail.com
matambo at ictp.it
Ext .139
Room 209.