[Trilinos-Users] [EXTERNAL] Re: decomp tool Error

Sjaardema, Gregory D gdsjaar at sandia.gov
Tue Feb 23 08:59:59 EST 2016


Those lines enabled the netcdf-4 support for exodus and the netcdf library. Glad they helped, but it should still work without the netcdf-4 support, so I need to figure out what was causing the previous issue. In the meantime, hopefully you are at a state where you can use the tools.
..Greg

--
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems"

From: Sai P Uppati <uppatis at utexas.edu>
Date: Monday, February 22, 2016 at 6:27 PM
To: "Sjaardema, Gregory D" <gdsjaar at sandia.gov>
Cc: "Bradley, Andrew Michael" <ambradl at sandia.gov>, "trilinos-users at trilinos.org" <trilinos-users at trilinos.org>
Subject: Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error

Wow. That worked! Adding these couple of lines to my cmake script for Trilinos and rebuilding it was the charm, I guess.

Thanks a lot for your patience with this issue, Greg. Thanks to everyone who had input on this issue.

Sai

On Mon, Feb 22, 2016 at 3:03 PM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:
OK. I’m not yet on El Capitan, so not sure if that could be part of the issue…

Could you try with the netcdf-4 options enabled and then add:

-D TPL_Netcdf_Enables_Netcdf4:BOOL=ON \
-D Trilinos_EXTRA_LINK_FLAGS:STRING="-L${TPL}/lib -lhdf5_hl -lhdf5 -lz" \
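
For context, a minimal sketch of how these two options might sit alongside the other netcdf-related options in a configure script. The function name netcdf4_cmake_args and the /usr/local prefix are hypothetical; TPL_ENABLE_Netcdf, Netcdf_INCLUDE_DIRS and Netcdf_LIBRARY_DIRS are the usual TriBITS-style TPL options and are assumed here, not taken from the attached script. The function only prints the options so they can be compared with an existing script; it does not run cmake.

```shell
# Print netcdf-related CMake options, one per line.
# $1: install prefix that holds netcdf, hdf5 and zlib (the thread calls it ${TPL}).
netcdf4_cmake_args() {
    tpl="$1"
    printf '%s\n' \
        "-DTPL_ENABLE_Netcdf:BOOL=ON" \
        "-DNetcdf_INCLUDE_DIRS:PATH=${tpl}/include" \
        "-DNetcdf_LIBRARY_DIRS:PATH=${tpl}/lib" \
        "-DTPL_Netcdf_Enables_Netcdf4:BOOL=ON" \
        "-DTrilinos_EXTRA_LINK_FLAGS:STRING=-L${tpl}/lib -lhdf5_hl -lhdf5 -lz"
}

# Example: show the options for an assumed prefix of /usr/local.
netcdf4_cmake_args "${TPL:-/usr/local}"
```

The last two printed lines correspond to the two options given above; the first three are only there to show the surrounding context.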

—Greg


From: Sai P Uppati <uppatis at utexas.edu>
Date: Monday, February 22, 2016 at 12:51 PM
To: "Sjaardema, Gregory D" <gdsjaar at sandia.gov>
Cc: "Bradley, Andrew Michael" <ambradl at sandia.gov>, "trilinos-users at trilinos.org" <trilinos-users at trilinos.org>
Subject: Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error

Mac OS X 10.11.3 (El Capitan)

I've attached the cmake configure script for the Trilinos build. I'm using MPI compilers from open-mpi version 1.10.1_1, which I installed via Homebrew on my machine.

Sai

On Mon, Feb 22, 2016 at 9:13 AM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:
What version of OS X are you running?

Can you send me your Trilinos configure script with the compilers that you are trying to use?

Not sure what is happening since my OS X version runs correctly… Will have to figure out what is different on your system…
..Greg


From: Sai P Uppati <uppatis at utexas.edu>
Date: Friday, February 19, 2016 at 11:58 AM
To: "Sjaardema, Gregory D" <gdsjaar at sandia.gov>
Cc: "Bradley, Andrew Michael" <ambradl at sandia.gov>, "trilinos-users at trilinos.org" <trilinos-users at trilinos.org>
Subject: Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error

UPDATE 2:

NetCDF 4.4.0: Tried two variations, both with netcdf-4 and dap enabled:

I) With the following changes in netcdf.h:

# Modify the following #define statements in the netcdf.h file.  Change the values to match what is given below.
#define NC_MAX_DIMS 65536
#define NC_MAX_ATTRS 8192
#define NC_MAX_VARS 524288
#define NC_MAX_NAME 256
#define NC_MAX_VAR_DIMS 8

II) With default numbers in the netcdf.h
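
To confirm which of the two variations a given installation actually carries, the limits can be read straight from the installed header. A minimal sketch; the function name show_netcdf_limits and the /usr/local/include location are assumptions, so adjust the path to wherever netcdf was installed.

```shell
# Print the NC_MAX_* limits defined in a netcdf.h header.
# $1: path to netcdf.h
show_netcdf_limits() {
    grep -E '^#define[[:space:]]+NC_MAX_(DIMS|ATTRS|VARS|NAME|VAR_DIMS)[[:space:]]' "$1"
}

# Example: inspect the header under an assumed install prefix, if it exists.
header="${NETCDF_HEADER:-/usr/local/include/netcdf.h}"
if [ -f "$header" ]; then
    show_netcdf_limits "$header"
else
    echo "no netcdf.h found at $header"
fi
```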

In case I: the following tests in the netcdf test suite failed:


155/169 Test #155: ncdap_tst_remote3 .....................***Failed   11.89 sec
        Start 156: ncdap_tst_formatx
156/169 Test #156: ncdap_tst_formatx .....................   Passed    0.47 sec
        Start 157: ncdap_test_partvar
157/169 Test #157: ncdap_test_partvar ....................   Passed    0.52 sec
        Start 158: ncdap_testurl
158/169 Test #158: ncdap_testurl .........................   Passed    0.77 sec
        Start 159: ncdap_test_nstride_cached
159/169 Test #159: ncdap_test_nstride_cached .............***Exception: SegFault  0.59 sec
        Start 160: ncdap_t_misc
160/169 Test #160: ncdap_t_misc ..........................   Passed    0.14 sec
        Start 161: ncdap_test_varm3
161/169 Test #161: ncdap_test_varm3 ......................***Exception: SegFault  0.58 sec
        Start 162: C_tests_simple_xy_wr
162/169 Test #162: C_tests_simple_xy_wr ..................   Passed    0.01 sec
        Start 163: C_tests_simple_xy_rd
163/169 Test #163: C_tests_simple_xy_rd ..................   Passed    0.04 sec
        Start 164: C_tests_sfc_pres_temp_wr
164/169 Test #164: C_tests_sfc_pres_temp_wr ..............   Passed    0.01 sec
        Start 165: C_tests_sfc_pres_temp_rd
165/169 Test #165: C_tests_sfc_pres_temp_rd ..............   Passed    0.01 sec
        Start 166: C_tests_pres_temp_4D_wr
166/169 Test #166: C_tests_pres_temp_4D_wr ...............   Passed    0.06 sec
        Start 167: C_tests_pres_temp_4D_rd
167/169 Test #167: C_tests_pres_temp_4D_rd ...............   Passed    0.03 sec
        Start 168: cdl_create_sample_files
168/169 Test #168: cdl_create_sample_files ...............   Passed    0.05 sec
        Start 169: cdl_do_comps
169/169 Test #169: cdl_do_comps ..........................   Passed    0.01 sec

98% tests passed, 3 tests failed out of 169

Total Test time (real) = 111.31 sec

The following tests FAILED:
        155 - ncdap_tst_remote3 (Failed)
        159 - ncdap_test_nstride_cached (SEGFAULT)
        161 - ncdap_test_varm3 (SEGFAULT)
Errors while running CTest

In case II, all tests passed (100%).

But in either case, a Trilinos build against these variations still results in the decomp error. However, Peridigm builds fine, with all tests passing, in either case.

Sai

On Fri, Feb 19, 2016 at 11:47 AM, Sai P Uppati <uppatis at utexas.edu> wrote:
Hi Greg,

It's not a specific mesh that I'm having trouble with. Basically, any mesh I'm trying to decompose, I get either the netcdf error I've pasted in previous messages on this thread, or the following about 'segmentation fault':

Executing:
   /usr/local/trilinos/bin/nem_slice -e -S  -l inertial -c -o cube_split.g.nem -m mesh=4 cube_split.g
   ...see cube_split.g.decomp.out for nem_slice status

Beginning nem_slice execution.
Input Mesh File = 'cube_split.g'
Using 32-bit integer mode for decomposition...
[dhcp-128-83-76-100:51612] *** Process received signal ***
[dhcp-128-83-76-100:51612] Signal: Segmentation fault: 11 (11)
[dhcp-128-83-76-100:51612] Signal code: Address not mapped (1)
[dhcp-128-83-76-100:51612] Failing at address: 0x7fbdae400000
[dhcp-128-83-76-100:51612] [ 0] 0   libsystem_platform.dylib            0x00007fff91404eaa _sigtramp + 26
[dhcp-128-83-76-100:51612] [ 1] 0   ???                                 0x00007fff6df35390 0x0 + 140735038051216
[dhcp-128-83-76-100:51612] [ 2] 0   nem_slice                           0x000000010ab53522 _Z13write_nemesisIiEiRNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEEP19Machine_DescriptionP19Problem_DescriptionP16Mesh_DescriptionIT_EP14LB_DescriptionISD_EP11Sphere_Info + 16162
[dhcp-128-83-76-100:51612] [ 3] 0   nem_slice                           0x000000010ab4afce _Z13internal_mainIiEiiPPcT_ + 8814
[dhcp-128-83-76-100:51612] [ 4] 0   nem_slice                           0x000000010ab44400 main + 1680
[dhcp-128-83-76-100:51612] [ 5] 0   libdyld.dylib                       0x00007fff9788e5ad start + 1
[dhcp-128-83-76-100:51612] *** End of error message ***
/usr/local/trilinos/bin/decomp: line 125: 51612 Segmentation fault: 11  ( $NEM_SLICE -e $spheres $decomp_method $do_viz $nem_slice_flag -o $nemesis -m mesh=$processors $genesis >> $output )

ERROR:******************************************************************
ERROR:
ERROR     During nem_slice execution. Check error output above and rerun
ERROR:
ERROR:******************************************************************

You can see that this is not one of the meshes you see in the errors prior to this message. Any mesh I try to decompose is having this problem. All these errors though are 'During nem_slice execution'.

I'm using netcdf 4.4.0, the latest stable version. I've also tried netcdf version 4.3.3.1, which was working before I rebased to the latest GitHub Trilinos version. I've tried variations with netcdf-4 and dap both enabled and disabled. None of these variations helped. The latest build I have is with netcdf-4 and dap enabled, and without the changes to the numbers in the netcdf.h file described on the Peridigm webpage. All the tests in the netcdf test suite (~160 tests, I believe) passed.

I'm attaching the mesh file you requested, but it's not a mesh specific issue. I really appreciate your help looking into this issue. Meanwhile, I'll try leaving netcdf-4 and dap enabled, but changing the variables as shown on the peridigm page, and see if that variation works to fix this issue.

Sai

On Fri, Feb 19, 2016 at 7:52 AM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:
Most builds of nem_slice use a netcdf with netcdf4 enabled.  It looks like there is a logic error somewhere in determining whether something is 32 or 64 bit (even in 32-bit mode we output some values to the nem file as 64-bit values if netcdf4 is enabled).  If you could enable netcdf4 on your build then hopefully it will work.  I'm still trying to track down why yours is failing but can't replicate it yet.
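
One concrete data point for the 32/64-bit suspicion: the "ncnt_cmap" size of 6313656973 in the error output quoted further down this thread does not fit in a signed 32-bit integer, and it is far larger than the 189405 nodes of the mesh in question. A minimal arithmetic sketch of that check follows; it says nothing about where in nem_slice the value comes from. The function name fits_int32 is hypothetical, and the shell is assumed to compare integers in 64-bit arithmetic (as bash does).

```shell
# Succeeds if the given integer fits in a signed 32-bit integer.
fits_int32() {
    [ "$1" -ge -2147483648 ] && [ "$1" -le 2147483647 ]
}

# Compare the node count of the mesh with the size reported in the error.
for n in 189405 6313656973; do
    if fits_int32 "$n"; then
        echo "$n fits in a signed 32-bit integer"
    else
        echo "$n does not fit in a signed 32-bit integer"
    fi
done
```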

What netcdf version are you using?

.. Greg

Sent from my iPhone

On Feb 18, 2016, at 9:28 PM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:

Is it possible to send me the mesh?

Also, you don't need to and probably shouldn't disable netcdf4.  It will give you more options, but shouldn't cause the problem you are seeing.

If I could try the mesh I might be able to replicate the issue.

..Greg

Sent from my iPhone

On Feb 18, 2016, at 4:47 PM, Bradley, Andrew Michael <ambradl at sandia.gov> wrote:


Hi Sai,


OK. Sorry, but I'll have to let the experts step in at this point. I had guessed that might work based on examining elb_main.C lines 139, 168-171, but there must be some deeper issue that I'm not seeing.


Andrew


________________________________
From: Sai P Uppati <uppatis at utexas.edu>
Sent: Thursday, February 18, 2016 4:39 PM
To: Bradley, Andrew Michael
Cc: Sjaardema, Gregory D; trilinos-users at trilinos.org
Subject: Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error

Andrew,

There is no change in the error it throws. Still uses a 32-bit integer mode for decomposition.

Sai

On Thu, Feb 18, 2016 at 5:28 PM, Bradley, Andrew Michael <ambradl at sandia.gov> wrote:

Hi Sai,


Just a guess, but what happens if you add the command-line flag -64 ?


Andrew


________________________________
From: Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf of Sai P Uppati <uppatis at utexas.edu>
Sent: Thursday, February 18, 2016 4:23 PM
To: Sjaardema, Gregory D
Cc: trilinos-users at trilinos.org
Subject: Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error

UPDATE:

I think all the other tools from my Trilinos build may be working fine. The tools I commonly use from Trilinos include decomp, epu and exodiff. All except decomp seem to be working fine.

Even after rebuilding trilinos several times (varying some options each time) and Peridigm passing all the tests each time, decomp throws errors like these:

Executing:
   /usr/local/trilinos/bin/nem_slice -e -S  -l inertial -c -o HEGF-res-cylin.g.nem -m mesh=4 HEGF-res-cylin.g
   ...see HEGF-res-cylin.g.decomp.out for nem_slice status

Beginning nem_slice execution.
Input Mesh File = 'HEGF-res-cylin.g'
Using 32-bit integer mode for decomposition...
Exodus Library Warning/Error: [ex_put_cmap_params_cc]
Error: failed to add dimension for "ncnt_cmap" of size 6313656973 in file ID 65536
NetCDF: Invalid dimension size
================================messages================================
fatal: unable to output communication map parameters
fatal: could not output Nemesis file

ERROR:******************************************************************
ERROR:
ERROR     During nem_slice execution. Check error output above and rerun
ERROR:
ERROR:******************************************************************

Sai

On Thu, Feb 18, 2016 at 11:03 AM, Sai P Uppati <uppatis at utexas.edu> wrote:
Hi Greg,

An example mesh I'm trying to decompose contains 178320 elements, 189405 nodes and 1 block. I tried decomposing for 4, 6 and 8 processors. I didn't have problems with the Trilinos versions I was using before. I think it has only been an issue since I rebased to the official version hosted on the GitHub page.

Anyways, getopt I was able to fix with John Foster's help. I just installed a gnu-getopt version from Homebrew and modified the PATH variable to look for it first before looking in /usr/bin.
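
A minimal sketch of that PATH change, together with a check that the getopt found first is the enhanced one. Assumptions: `brew --prefix gnu-getopt` is used to locate the keg (shown only as a comment, not run), and the function name is_enhanced_getopt is hypothetical. The check relies on the documented behavior of the enhanced getopt, which exits with status 4 when called with --test.

```shell
# Homebrew's gnu-getopt is keg-only, so it is not linked into the default PATH.
# Putting its bin directory first (not run here) would look like:
#   export PATH="$(brew --prefix gnu-getopt)/bin:$PATH"

# Succeeds if the given getopt is the enhanced (util-linux/GNU) version.
# $1: path to a getopt executable
is_enhanced_getopt() {
    getopt_status=0
    "$1" --test >/dev/null 2>&1 || getopt_status=$?
    [ "$getopt_status" -eq 4 ]
}

# Report on whichever getopt is currently first on PATH.
getopt_path=$(command -v getopt || true)
if [ -n "$getopt_path" ] && is_enhanced_getopt "$getopt_path"; then
    echo "enhanced getopt found at $getopt_path"
else
    echo "getopt on PATH is missing or not the enhanced version"
fi
```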

Coming to NetCDF, I followed the instructions exactly as stated on the following page: https://peridigm.sandia.gov/content/netcdf. So I disabled netcdf-4 and dap, and installed it with the changed numbers in the netcdf.h file as well. All the tests passed when I did 'make check', so I didn't think there were any issues with the netcdf installation. Doing it this way, however, the netcdf build never referenced the HDF5 build I did in the previous step. Even in the summary of the netcdf configuration, HDF5 support seems to be off. I left HDF5 installed, though, because I saw that it may be needed for the SEACAS package in Trilinos.

But as I mentioned before, I didn't have issues like this with previous Trilinos versions (I also didn't follow the netcdf instructions given at the webpage before, I just installed whatever was default from unidata). Perhaps, the instructions on the page are not completely correct?

Sorry for the long email, but those are all the details.

Sai

On Thu, Feb 18, 2016 at 7:38 AM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:
What size mesh are you decomposing (#elem, #block, #node) and how many processors are you decomposing it for?

Did you also install hdf5 and reference it in the netcdf build for netcdf-4 support, or is it a netcdf build only?
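
One way to answer this for an existing install is to ask nc-config, which ships with netcdf and reports yes or no for options such as --has-nc4 and --has-dap. A minimal sketch; the function name netcdf_has and the NC_CONFIG override variable are hypothetical.

```shell
# Succeeds if the installed netcdf reports the given feature as enabled.
# $1: feature name as understood by nc-config, e.g. nc4 or dap
# NC_CONFIG may point at a specific nc-config; defaults to the one on PATH.
netcdf_has() {
    answer=$("${NC_CONFIG:-nc-config}" "--has-$1" 2>/dev/null) || answer="unknown"
    [ "$answer" = "yes" ]
}

# Report the two features discussed in this thread.
for feature in nc4 dap; do
    if netcdf_has "$feature"; then
        echo "$feature: enabled"
    else
        echo "$feature: not enabled (or nc-config not found)"
    fi
done
```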

The current getopt that you have will work, but will give reduced functionality in regards to long options which you can see by entering -H and -h and seeing the difference.  I’m not sure if installing the gnu-getopt in parallel with the system getopt would cause issues or not, but on my and many other macs we have both installed and have not noticed any issues (However, I use port instead of brew).

..Greg

From: Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf of "John T. Foster" <jfoster at austin.utexas.edu>
Date: Wednesday, February 17, 2016 at 6:00 PM
To: Sai P Uppati <uppatis at utexas.edu>
Cc: "trilinos-users at trilinos.org" <trilinos-users at trilinos.org>
Subject: [EXTERNAL] Re: [Trilinos-Users] decomp tool Error

Sai,

I believe you're using Homebrew as a package manager, so use:

brew install getopt

To install the getopt command line utility.

JTF

On Wednesday, February 17, 2016, Sai P Uppati <uppatis at utexas.edu> wrote:
Hi,

I installed Trilinos and Peridigm (official versions hosted on GitHub) on my Mac OS X 10.11.3, including the dependencies boost, hdf5 and netcdf. I followed the instructions in Sandia's Peridigm installation guide to the letter.

The Peridigm unit tests all passed, which is good. However, when I try to use the decomp tool from Trilinos, I get the following errors:


########################################################################
The "getopt" executable that is available on this system is an older
version that is not compatible with the needs of the "decomp" tool.
If possible, you should update your getopt to a newer version and make
sure that the new getopt is in your path.

Below are some options for getting the current getopt version:
* If on a Mac: "sudo port install getopt"
* Search the internet for "getopt-1.1.5" or "getopt-1.1.4"; download and build

Enter "-h" for the modified options that this version supports.
Enter "-H" for the options that the standard version supports.
########################################################################



Executing:
   /usr/local/trilinos/bin/nem_slice -e -S  -l inertial -c -o prism-precrack.g.nem -m mesh=8 prism-precrack.g
   ...see prism-precrack.g.decomp.out for nem_slice status

Beginning nem_slice execution.
Input Mesh File = 'prism-precrack.g'
Using 32-bit integer mode for decomposition...
Exodus Library Warning/Error: [ex_put_cmap_params_cc]
Error: unable to output variable in file ID 65536
NetCDF: Index exceeds dimension bound
================================messages================================
fatal: unable to output communication map parameters
fatal: could not output Nemesis file

ERROR:******************************************************************
ERROR:
ERROR     During nem_slice execution. Check error output above and rerun
ERROR:
ERROR:******************************************************************


There are multiple errors here.

1) I don't know how to update the getopt executable. It seems Mac OS X already comes with a built-in version (which I checked and found to be in /usr/bin), but this version is not compatible with decomp. I checked Homebrew, and there is a keg-only option to install gnu-getopt, but they have a warning that installing different versions in parallel can cause trouble. I'm not able to find any other working way to install getopt without causing errors.

2) NetCDF error about exceeding dimensions. I installed the latest version of netcdf-c, 4.4.0. I changed the numbers in netcdf.h as instructed in the Peridigm installation guide. I have a feeling that this may have something to do with the error, but I'm not quite sure. All tests passed, however, when I installed netcdf from source.

There may be other errors I'm not seeing. Please, I would appreciate if I can get some guidance on how to address these errors.

Sai


--
Sent from iPhone
