[Trilinos-Users] aztec00 problem

Matthias Heil matthias.heil at manchester.ac.uk
Tue Jan 21 08:58:38 MST 2014


Mike,

    we've made some progress. After some more digging we've
established that the offending call to AZ_manage_memory is
made with the following arguments:

AZ_manage_memory (input_size=295202080,
                    action=0,
                    type=-914901,
                    name=0x7fffffffd720 "vblock in gmres0",
                    status=0x7fffffffd6c8)


at trilinos-11.4.3-Source/packages/aztecoo/src/az_util.c:944
when "viewed" from within AZ_manage_memory.

However, looking at the calling code, the first argument, input_size,
is derived from: kspace=5000; aligned_N_total=222084;
sizeof(double)=8, so it should be (5000+1)*222084*8=8885136672.

Andrew Hazel then wrote the small test code below, which shows that
295202080 is exactly the value obtained when the result of this
calculation is stored in an unsigned int. The problem appears to be
that the first argument to AZ_manage_memory is an unsigned int, rather
than an unsigned long (or some other custom type).
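
As a sanity check on the arithmetic, and assuming unsigned int is 32 bits
wide (so that values wrap modulo 2^32), the requested byte count reduces to
exactly the value seen inside AZ_manage_memory:

```latex
\[
(5000+1)\times 222084\times 8 \;=\; 8885136672 \;=\; 2\cdot 2^{32} + 295202080,
\qquad\text{so}\qquad
8885136672 \bmod 2^{32} \;=\; 295202080 .
\]
```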

      Matthias


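#include <iostream>

int main()
  {
   unsigned kspace = 5000;
   unsigned aligned_N_total = 222084;

   // Same expression stored in two types: temp wraps modulo 2^32 where
   // unsigned is 32 bits; temp2 keeps the full value where unsigned long
   // is 64 bits.
   unsigned temp = (kspace+1)*aligned_N_total*sizeof(double);
   unsigned long temp2 = (kspace+1)*aligned_N_total*sizeof(double);

   std::cout << temp << " " << temp2 << "\n";
  }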



On 20/01/14 21:18, Heroux, Mike wrote:
> Matthias,
>
> Do you have a sense of whether or not the data sizes you are using would
> result in array indexing that exceeds 2.1 billion?  The kinds of issues you
> are seeing would be consistent with trying to address an array using an
> integer value that is bigger than what a signed int can handle.
>
> The hex values you are printing are very large (more than 140 trillion),
> which seems to indicate an incorrect address calculation somewhere.  I
> agree that the memory manager should detect the issue, no matter what.
>
> Mike
>
> On 1/20/14 8:24 AM, "Matthias Heil" <matthias.heil at manchester.ac.uk> wrote:
>
>> Hi,
>>
>>    we've come across a possible bug in trilinos aztecoo.
>> The code segfaults when trying to execute the line
>>
>>     *dst_ptr++ = s;
>>
>> in
>>
>> trilinos-11.4.3-Source/packages/epetra/src/Epetra_CrsMatrix.cpp:3327
>>
>> An attempt to de-reference that pointer (in ddd) shows:
>>
>> (gdb) print *dst_ptr
>> Cannot access memory at address 0x8001cb466110
>>
>> Moving back through the call stack shows that the
>> memory is initially allocated in AZ_manage_memory(...)
>> which is called from just under
>>
>>     trilinos-11.4.3-Source/packages/aztecoo/src/az_gmres.c:239
>>
>> The problem arises only for large values of kspace, which is related
>> to the max number of iterations. We've set this to a rather large
>> value of 5000 (we don't usually need that many, BUT the code
>> should hopefully still be able to handle this or fail
>> gracefully; things work ok for smaller values, e.g. kspace=1000).
>>
>> Following the return from this call, the memory allocated in
>> AZ_manage_memory(...) gets distributed into two vectors, hh
>> and v, and it's v that contains the illegal memory address:
>> Placing a breakpoint in
>>
>>     trilinos-11.4.3-Source/packages/aztecoo/src/az_gmres.c:248
>>
>> (just after that loop) and interrogating various values of v yields:
>>
>> (gdb) print v[5000]
>> $1 = (double *) 0x8001cb466110
>>
>> and, predictably:
>>
>> (gdb) print *v[5000]
>> Cannot access memory at address 0x8001cb466110
>>
>> whereas
>>
>> (gdb) print *v[500]
>> $4 = 0
>>
>> is fine.
>>
>> Trial and error shows that things go wrong beyond entry 518:
>>
>> (gdb) print *v[519]
>> Cannot access memory at address 0x7ffff0bf14f0
>> (gdb) print *v[518]
>> $8 = 0
>>
>> Further information:
>>
>>    -- All code was completely built from source, using gcc
>>       without optimisation and with -g.
>>
>>    -- Based on a (small) sample of machines, the problem only
>>       arises on 64 bit machines (not 32)
>>
>>    -- The problem only arises for sufficiently big problem sizes
>>       (though they are still way short of the machines' total
>>       available memory). When running on a machine with very
>>       little memory, the call to AZ_manage_memory(...) fails
>>       gracefully with the "maybe you should try a smaller problem"
>>       message.
>>
>>    -- The problem arises with both serial and parallel installations
>>       (i.e. when the code is compiled with and without mpi support)
>>       and with different trilinos releases.
>>
>>    -- The problem is difficult to isolate further since we use
>>       trilinos from within our own big library (which provides the
>>       preconditioner). Note that our code works fine if we use our
>>       own (serial) GMRES solver (or a direct solver).
>>
>>    Does any of this ring a bell?
>>
>>       Happy to run further tests here or provide additional
>>       diagnostic information.
>>
>>       Best wishes,
>>
>>               Matthias
>>
>> -- 
>> ---------------------------------------------------------------------------
>> Professor Matthias Heil
>>
>> Alan Turing Building, Room 2.224
>> School of Mathematics           Tel. +44 (0)161 275 5808
>> University of Manchester        Fax. +44 (0)161 275 5819
>> Oxford Road                     email: M.Heil at maths.man.ac.uk
>> Manchester M13 9PL              WWW: http://www.maths.man.ac.uk/~mheil/
>> U.K.
>>
>> NEWS:   The beta release of oomph-lib, the object-oriented
>>          multi-physics finite-element library is now available
>>          as free open-source software at
>>
>>              http://www.oomph-lib.org
>>
>> ---------------------------------------------------------------------------
>>
>> _______________________________________________
>> Trilinos-Users mailing list
>> Trilinos-Users at software.sandia.gov
>> http://software.sandia.gov/mailman/listinfo/trilinos-users





