[Trilinos-Users] [EXTERNAL] test failures

Bartlett, Roscoe A. bartlettra at ornl.gov
Wed Jul 9 13:20:34 MDT 2014


Jeremy,

Assuming this is not a broken development version of Trilinos on your platform (which seems unlikely, since 60% of your tests are failing, something we have not seen in many years) ...

I saw which executable was failing, so I did not mean to imply it was your code.  Instead, I was asking whether the problem might be due to your compilation or runtime environment or something else external to the source code.  To know more: exactly why did these tests fail (i.e. according to ctest)?  To find out, you can look at the detailed output in:

   <YOUR-BUILD-DIR>/Testing/Temporary/LastTest.log

and it will tell you why CTest passed or failed each test.  My guess is that it is because it is seeing these dangling RCPNode objects.  The problem is that this executable is not terminating normally; it is being aborted prematurely for some reason, as shown in your output:

Program received signal SIGABRT, Aborted.

Why?   The program aborting prematurely would concern me given that these tests pass on a bunch of platforms and compilers without aborting (again, assuming this is not a broken version of Trilinos).

If you want, you can just make these errors go away by configuring with:

   -DTeuchos_ENABLE_DEBUG_RCP_NODE_TRACING:BOOL=OFF

and that will turn off this type of checking (blow away your CMakeCache.txt file to be safe).  The test programs will likely pass even with this abort because ctest is looking for:

End Result: TEST PASSED

and *not* a zero return code in most cases (because return codes from mpiexec or mpirun are unreliable).
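As a rough illustration of that pass criterion, here is a minimal C++ sketch. It is not CTest or Trilinos code: the function name testPassed and its parameters are hypothetical, and CTest itself does this kind of output matching through a test's PASS_REGULAR_EXPRESSION property rather than through anything like this helper.

```cpp
#include <string>

// Hypothetical helper (assumption, not part of CTest or Trilinos): decides
// pass/fail from the captured test output instead of the process exit status.
bool testPassed(const std::string& capturedOutput, int exitCode) {
    (void)exitCode;  // deliberately ignored, mirroring the criterion described above
    return capturedOutput.find("End Result: TEST PASSED") != std::string::npos;
}
```

Under this criterion, a run that prints the pass line and then aborts still counts as passing, which is why the SIGABRT at exit would not by itself fail the test once the dangling-node check is turned off.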

But then again, if the program has finished its work at the end of main(), do you really care how it terminates?  The problem is that some sophisticated C++ programs may actually do necessary things after main() ends, in the destruction of static objects (but most good C++ programs will not do that).  This check in Teuchos is just helping to catch and debug circular references.
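To make the circular-reference case concrete, here is a minimal sketch using std::shared_ptr and std::weak_ptr as stand-ins for reference-counted pointers in general. It does not use Teuchos::RCP, and the global counter liveNodes is only a hypothetical, much simplified analogue of the debug-mode node tracing (Teuchos tracks RCPNode objects, not a bare count).

```cpp
#include <memory>

// Hypothetical stand-in for node tracing: number of Node objects still alive.
static int liveNodes = 0;

struct Node {
    std::shared_ptr<Node> next;  // strong (owning) link
    std::weak_ptr<Node> back;    // weak (non-owning) link
    Node()  { ++liveNodes; }
    ~Node() { --liveNodes; }
};

// Two nodes that own each other: neither reference count can reach zero,
// so both objects outlive the scope that created them.
int liveAfterCycle() {
    const int before = liveNodes;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;
        b->next = a;  // closes the cycle
    }
    return liveNodes - before;
}

// Same shape, but the link back is weak, so both nodes are destroyed.
int liveAfterWeakLink() {
    const int before = liveNodes;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;
        b->back = a;  // does not keep a alive
    }
    return liveNodes - before;
}
```

Here liveAfterCycle() returns 2 (both nodes are still alive after the scope ends), while liveAfterWeakLink() returns 0.  A check that runs at program exit and finds a nonzero count is, in spirit, what the dangling-node report is doing.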

-Ross



From: Templeton, Jeremy Alan [mailto:jatempl at sandia.gov]
Sent: Wednesday, July 09, 2014 1:12 PM
To: Bartlett, Roscoe A.
Cc: trilinos-users at software.sandia.gov
Subject: Re: [EXTERNAL] test failures

Hi Ross,

Thanks for the response.  Just to be clear, the program which is failing is a Trilinos test which I have not modified, specifically Tpetra_ExportToStaticGraphCrsMatrix.  This is an example as nearly 60% of the tests failed in similar fashion, again, without any changes from the distribution.

Jeremy

On Jul 9, 2014, at 9:21 AM, Bartlett, Roscoe A. <bartlettra at ornl.gov> wrote:


Jeremy,

See section:

   "5.11.2 Detection of circular references"

in:

    http://web.ornl.gov/~8vt/TeuchosMemoryManagementSAND.pdf

My guess is that your programs are terminating incorrectly for some reason.  This could be due to a number of different issues but it is not a good sign.  I would not trust software built on this system with these types of failures.

-Ross



From: trilinos-users-bounces at software.sandia.gov [mailto:trilinos-users-bounces at software.sandia.gov] On Behalf Of Templeton, Jeremy Alan
Sent: Wednesday, July 09, 2014 12:01 PM
To: trilinos-users at software.sandia.gov
Subject: [Trilinos-Users] test failures

Hi all,

I'm trying to get Trilinos 11.8 (tpetra, zoltan2) running on my mac (10.9, gcc4.8, openmpi) but when I build an MPI debug version nearly 60% of the tests fail on exit.  I ran one of them through GDB and the output is below.  When I build the MPI release version with no other configuration changes, all but one test passes.  So this looks like some extra error checking.  How relevant is it?  If it's not harmful, is there a way to turn it off?

Thanks,
Jeremy

Starting program: /Users/jatempl/Codes/Trilinos_build/packages/tpetra/test/ImportExport/Tpetra_ExportToStaticGraphCrsMatrix.exe
End Result: TEST PASSED

***
*** Warning! The following Teuchos::RCPNode objects were created but have
*** not been destroyed yet.  A memory checking tool may complain that these
*** objects are not destroyed correctly.
***
*** There can be many possible reasons that this might occur including:
***
***   a) The program called abort() or exit() before main() was finished.
***      All of the objects that would have been freed through destructors
***      are not freed but some compilers (e.g. GCC) will still call the
***      destructors on static objects (which is what causes this message
***      to be printed).
***
***   b) The program is using raw new/delete to manage some objects and
***      delete was not called correctly and the objects not deleted hold
***      other objects through reference-counted pointers.
***
***   c) This may be an indication that these objects may be involved in
***      a circular dependency of reference-counted managed objects.
***

  0: RCPNode (map_key_void_ptr=0x101353290)
       Information = {T=Teuchos::SerializationTraits<int, unsigned long>, ConcreteT=Teuchos::SerializationTraits<int, unsigned long>, p=0x101353290, has_ownership=1}
       RCPNode address = 0x1013533f0
       insertionNumber = 14
  1: RCPNode (map_key_void_ptr=0x101353440)
       Information = {T=Teuchos::SerializationTraits<int, int>, ConcreteT=Teuchos::SerializationTraits<int, int>, p=0x101353440, has_ownership=1}
       RCPNode address = 0x101354230
       insertionNumber = 15

NOTE: To debug issues, open a debugger, and set a break point in the function where
the RCPNode object is first created to determine the context where the object first
gets created.  Each RCPNode object is given a unique insertionNumber to allow setting
breakpoints in the code.  For example, in GDB one can perform:

1) Open the debugger (GDB) and run the program again to get updated object addresses

2) Set a breakpoint in the RCPNode insertion routine when the desired RCPNode is first
inserted.  In GDB, to break when the RCPNode with insertionNumber==3 is added, do:

  (gdb) b 'Teuchos::RCPNodeTracer::addNewRCPNode( [TAB] ' [ENTER]
  (gdb) cond 1 insertionNumber==3 [ENTER]

3) Run the program in the debugger.  In GDB, do:

  (gdb) run [ENTER]

4) Examine the call stack when the program breaks in the function addNewRCPNode(...)
libc++abi.dylib: terminating with uncaught exception of type std::logic_error: /Users/jatempl/Codes/trilinos-11.8.1-Source/packages/teuchos/core/src/Teuchos_RCPNode.cpp:497:

Throw number = 1

Throw test that evaluated to true: !(rcp_node_list())

Error!

Program received signal SIGABRT, Aborted.
0x00007fff8d734866 in ?? ()
(gdb) where
#0  0x00007fff8d734866 in ?? ()
#1  0x00007fff9294035c in ?? ()
#2  0x0000000000000000 in ?? ()

--------------------------------------------------------
Jeremy A. Templeton, Ph.D.
Thermal/Fluid Sciences & Engineering
jatempl at sandia.gov
http://tiny.sandia.gov/jatempl
925-294-1429



