[Trilinos-Users] aztec00 problem

Heroux, Mike MHeroux at csbsju.edu
Tue Jan 21 10:05:20 MST 2014


Matthias,

AztecOO has been "upgraded" to handle larger problems.  We can now use it
for problems where the global integer data is "long long", but we decided
to avoid a complete transition to 64-bit ints.  Instead, the Belos
package, along with Tpetra, is our long-term solution for this issue.

I am recording the result of this conversation with trilinos-bugs, so we
can get the fix on the queue.

Thanks.

Mike

On 1/21/14 12:00 PM, "Matthias Heil" <matthias.heil at manchester.ac.uk>
wrote:

>Mike,
>
>   thanks for your quick reply. I agree that if this
>
>"AztecOO is not designed to work with problems
>where the size of any local data objects is beyond the range of signed
>32-bit ints."
>
>is the policy (which is sensible -- if potentially inconvenient/confusing
>for the user who can't necessarily assess what happens
>internally) then it's not a bug. I'd noticed some recent changes
>to aztec's source code where ints had been upgraded to long ints,
>and inferred (wrongly!) that there was a general attempt to allow
>it to handle bigger problems. I had hoped that we might simply
>need a similar tweak here but do realise that there would almost
>certainly be additional problems lurking further "downstream".
>
>However, adding some internal sanity checking that issues warnings
>(or aborts) if this problem arises would be VERY helpful. Took us
>quite a while to get to the bottom of this...
>
>   Anyway, with your explanation I'm happy to regard this as
>resolved...
>
>    Thanks for the quick feedback (and the great code!).
>
>      Best wishes,
>
>        Matthias
>
>  
>
>
>On 21/01/14 16:26, Heroux, Mike wrote:
>> Matthias,
>>
>> Just to make sure I understand:  This is a real overflow of range, so
>> the issue is not a bug in the correct execution of AztecOO, but in not
>> detecting the memory error.  AztecOO is not designed to work with
>> problems where the size of any local data objects is beyond the range of
>> signed 32-bit ints.
>>
>> It seems that we could add a quick check to AZ_manage_memory that would
>> copy the input_size value into a signed int and then compare the result,
>> or something similar.
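
A minimal sketch of such a check, written in C++ for illustration. One
caveat it tries to address: the value 295202080 reported further down the
thread has already wrapped by the time it reaches AZ_manage_memory, and it
is itself smaller than INT_MAX, so a comparison made on input_size inside
that function might not catch this particular case. The sketch therefore
computes the size in 64-bit arithmetic at the call site and compares before
narrowing. The function checked_workspace_bytes and its parameters are
hypothetical names, not part of AztecOO, and a 32-bit int is assumed.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper, not part of AztecOO.
// Computes (kspace + 1) * n_total * sizeof(double) in 64-bit arithmetic.
// Returns true and stores the byte count in *bytes_out only if the result
// fits in a signed int; returns false otherwise and leaves *bytes_out alone.
bool checked_workspace_bytes(unsigned kspace, unsigned n_total,
                             unsigned* bytes_out)
{
    // Widen before multiplying so the product is not truncated to 32 bits.
    // (This sketch does not guard against 64-bit overflow for extreme inputs.)
    const unsigned long long bytes =
        (static_cast<unsigned long long>(kspace) + 1ULL) * n_total *
        sizeof(double);

    if (bytes > static_cast<unsigned long long>(INT_MAX)) {
        return false;  // caller should warn or abort instead of allocating
    }

    *bytes_out = static_cast<unsigned>(bytes);
    return true;
}
```

With the values from this thread, kspace=5000 and aligned_N_total=222084
would be rejected (8885136672 bytes), whereas kspace=1000 gives 1778448672
bytes and would pass.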
>>
>> Is this the kind of fix you could use?  Please let me know if I am
>> missing the point.
>>
>> Thanks.
>>
>> Mike
>>
>> On 1/21/14 9:58 AM, "Matthias Heil" <matthias.heil at manchester.ac.uk>
>> wrote:
>>
>>> Mike,
>>>
>>>     we've made some progress. After some more digging we've
>>> established that the offending call to AZ_manage_memory is
>>> made with the following arguments:
>>>
>>> AZ_manage_memory (input_size=295202080,
>>>                     action=0,
>>>                     type=-914901,
>>>                     name=0x7fffffffd720 "vblock in gmres0",
>>>                     status=0x7fffffffd6c8)
>>>
>>>
>>> at trilinos-11.4.3-Source/packages/aztecoo/src/az_util.c:944
>>> when "viewed" from within AZ_manage_memory.
>>>
>>> However, looking at the calling code, the first argument, input_size,
>>> is derived from: kspace=5000; aligned_N_total=222084;
>>> sizeof(double)=8, so it should be (5000+1)*222084*8=8885136672.
>>>
>>> Andrew Hazel then wrote the small test code, below, which shows that
>>> 295202080 is the value given to an unsigned int that stores the result
>>> of the calculation. The problem appears to be that the first argument
>>> to AZ_manage_memory is an unsigned int, rather than an unsigned
>>> long (or some other custom type).
>>>
>>>       Matthias
>>>
>>>
>>> #include <iostream>
>>>
>>> int main()
>>>   {
>>>    unsigned kspace = 5000;
>>>    unsigned aligned_N_total = 222084;
>>>
>>>    unsigned temp = (kspace+1)*aligned_N_total*sizeof(double);
>>>    unsigned long temp2 = (kspace+1)*aligned_N_total*sizeof(double);
>>>
>>>    std::cout << temp << " " << temp2 << "\n";
>>>   }
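
The wrap-around can also be checked at compile time with fixed-width types.
The sketch below assumes 8-byte doubles; the function names are illustrative
and do not come from AztecOO. It shows that storing 8885136672 in 32 bits
drops two multiples of 2^32 (4294967296) and leaves exactly 295202080.

```cpp
#include <cstdint>

// Byte count of the GMRES workspace as described in the message above,
// evaluated entirely in 64-bit arithmetic. (Illustrative name.)
constexpr std::uint64_t workspace_bytes(std::uint64_t kspace,
                                        std::uint64_t n_total)
{
    return (kspace + 1) * n_total * sizeof(double);
}

// What survives when that count is stored in a 32-bit unsigned integer:
// the value modulo 2^32. (Illustrative name.)
constexpr std::uint32_t truncate_to_32_bits(std::uint64_t bytes)
{
    return static_cast<std::uint32_t>(bytes);
}

static_assert(sizeof(double) == 8, "the thread assumes 8-byte doubles");
static_assert(workspace_bytes(5000, 222084) == 8885136672ULL,
              "intended size in bytes");
static_assert(truncate_to_32_bits(workspace_bytes(5000, 222084)) == 295202080U,
              "value seen inside AZ_manage_memory");
static_assert(8885136672ULL - 2 * 4294967296ULL == 295202080ULL,
              "two multiples of 2^32 are lost");
```

If the file compiles, all four statements hold on that platform.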
>>>
>>>
>>>
>>> On 20/01/14 21:18, Heroux, Mike wrote:
>>>> Matthias,
>>>>
>>>> Do you have a sense of whether or not the data sizes you are using
>>>> would result in array indexing that exceeds 2.1 billion?  The kinds of
>>>> issues you are seeing would be consistent with trying to address an
>>>> array using an integer value that is bigger than what signed int can
>>>> handle.
>>>>
>>>> The hex values you are printing are very large (more than 140
>>>> trillion), which seems to indicate an incorrect address calculation
>>>> somewhere.  I agree that the memory manager should detect the issue,
>>>> no matter what.
>>>>
>>>> Mike
>>>>
>>>> On 1/20/14 8:24 AM, "Matthias Heil" <matthias.heil at manchester.ac.uk>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>     we've come across a possible bug in trilinos aztecoo.
>>>>> The code seg faults when trying to execute the line
>>>>>
>>>>>      *dst_ptr++ = s;
>>>>>
>>>>> in
>>>>>
>>>>> trilinos-11.4.3-Source/packages/epetra/src/Epetra_CrsMatrix.cpp:3327
>>>>>
>>>>> An attempt to de-reference that pointer (in ddd) shows:
>>>>>
>>>>> (gdb) print *dst_ptr
>>>>> Cannot access memory at address 0x8001cb466110
>>>>>
>>>>> Moving back through the call stack shows that the
>>>>> memory is initially allocated in AZ_manage_memory(...)
>>>>> which is called from just under
>>>>>
>>>>>      trilinos-11.4.3-Source/packages/aztecoo/src/az_gmres.c:239
>>>>>
>>>>> The problem arises only for large values of kspace which is related
>>>>> to the max number of iterations. We've set this to a rather large
>>>>> value of 5000 (We don't usually need that many, BUT the code
>>>>> should hopefully still be able to handle this or fail
>>>>> gracefully. Things work ok for smaller values, e.g. kspace=1000).
>>>>>
>>>>> Following the return from this call, the memory allocated in
>>>>> AZ_manage_memory(...) gets distributed into two vectors, hh
>>>>> and v, and it's v that contains the illegal memory address:
>>>>> Placing a breakpoint in
>>>>>
>>>>>      trilinos-11.4.3-Source/packages/aztecoo/src/az_gmres.c:248
>>>>>
>>>>> (just after that loop) and interrogating various values of v yields:
>>>>>
>>>>> (gdb) print v[5000]
>>>>> $1 = (double *) 0x8001cb466110
>>>>>
>>>>> and, predictably:
>>>>>
>>>>> (gdb) print *v[5000]
>>>>> Cannot access memory at address 0x8001cb466110
>>>>>
>>>>> whereas
>>>>>
>>>>> (gdb) print *v[500]
>>>>> $4 = 0
>>>>>
>>>>> is fine.
>>>>>
>>>>> Trial and error shows that things go wrong beyond entry 518:
>>>>>
>>>>> (gdb) print *v[519]
>>>>> Cannot access memory at address 0x7ffff0bf14f0
>>>>> (gdb) print *v[518]
>>>>> $8 = 0
>>>>>
>>>>> Further information:
>>>>>
>>>>>     -- All code was completely built from source, using gcc
>>>>>        without optimisation and with -g.
>>>>>
>>>>>     -- Based on a (small) sample of machines, the problem only
>>>>>        arises on 64 bit machines (not 32)
>>>>>
>>>>>     -- The problem only arises for sufficiently big problem sizes
>>>>>        (though they are still way short of the machines' total
>>>>>        available memory). When running on a machine with very
>>>>>        little memory, the call to AZ_manage_memory(...) fails
>>>>>        gracefully with the "maybe you should try a smaller problem"
>>>>>        message.
>>>>>
>>>>>     -- The problem arises with both serial and parallel installations
>>>>>        (i.e. when the code is compiled with and without mpi support)
>>>>>        and with different trilinos releases.
>>>>>
>>>>>     -- The problem is difficult to isolate further since we use
>>>>>        trilinos from within our own big library (which provides the
>>>>>        preconditioner). Note that our code works fine if we use our
>>>>>        own (serial) GMRES solver (or a direct solver).
>>>>>
>>>>>     Does any of this ring a bell?
>>>>>
>>>>>        Happy to run further tests here or provide additional
>>>>> diagnostic information.
>>>>>
>>>>>        Best wishes,
>>>>>
>>>>>                Matthias
>>>>>
>>>>> -- 
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>> Professor Matthias Heil
>>>>>
>>>>> Alan Turing Building, Room 2.224
>>>>> School of Mathematics           Tel. +44 (0)161 275 5808
>>>>> University of Manchester        Fax. +44 (0)161 275 5819
>>>>> Oxford Road                     email: M.Heil at maths.man.ac.uk
>>>>> Manchester M13 9PL              WWW: http://www.maths.man.ac.uk/~mheil/
>>>>> U.K.
>>>>>
>>>>> NEWS:   The beta release of oomph-lib, the object-oriented
>>>>>           multi-physics finite-element library is now available
>>>>>           as free open-source software at
>>>>>
>>>>>               http://www.oomph-lib.org
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>>
>



