[Trilinos-Users] [EXTERNAL] Re: decomp tool Error

Sai P Uppati uppatis at utexas.edu
Fri Feb 19 13:58:14 EST 2016


UPDATE 2:

NetCDF 4.4.0: Tried two variations, both with netcdf-4 and dap enabled:

I) With the following changes in netcdf.h:

# Modify the following #define statements in the netcdf.h file.
Change the values to match what is given below.
#define NC_MAX_DIMS 65536
#define NC_MAX_ATTRS 8192
#define NC_MAX_VARS 524288
#define NC_MAX_NAME 256
#define NC_MAX_VAR_DIMS 8
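
To confirm which of the two variations a given NetCDF install was actually built with, the limits can be read back out of the installed header. The following is only a minimal sketch: the helper names (`read_limits`, `compare_limits`) are made up for illustration, and it only understands plain numeric `#define` lines, so a macro defined in terms of another macro is reported as missing.

```python
import re

# Values listed above for variation I (taken from this message).
EXPECTED = {
    "NC_MAX_DIMS": 65536,
    "NC_MAX_ATTRS": 8192,
    "NC_MAX_VARS": 524288,
    "NC_MAX_NAME": 256,
    "NC_MAX_VAR_DIMS": 8,
}

# Matches lines such as "#define NC_MAX_DIMS 65536" (plain decimal values only).
DEFINE_RE = re.compile(
    r"^[ \t]*#[ \t]*define[ \t]+(NC_MAX_\w+)[ \t]+\(?(\d+)\b\)?",
    re.MULTILINE,
)


def read_limits(header_text):
    """Return {macro: value} for every simple numeric NC_MAX_* define."""
    return {name: int(value) for name, value in DEFINE_RE.findall(header_text)}


def compare_limits(header_text, expected=EXPECTED):
    """Return {macro: (found, wanted)} for macros that differ or are missing."""
    found = read_limits(header_text)
    return {
        name: (found.get(name), wanted)
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }
```

For example, `compare_limits(open("/usr/local/include/netcdf.h").read())` returns an empty dict when the header matches variation I. The path is an assumption; use wherever your netcdf.h was installed.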


II) With default numbers in the netcdf.h

In case I: the following tests in the netcdf test suite failed:

155/169 Test #155: ncdap_tst_remote3 .....................***Failed   11.89 sec
        Start 156: ncdap_tst_formatx
156/169 Test #156: ncdap_tst_formatx .....................   Passed    0.47 sec
        Start 157: ncdap_test_partvar
157/169 Test #157: ncdap_test_partvar ....................   Passed    0.52 sec
        Start 158: ncdap_testurl
158/169 Test #158: ncdap_testurl .........................   Passed    0.77 sec
        Start 159: ncdap_test_nstride_cached
159/169 Test #159: ncdap_test_nstride_cached .............***Exception: SegFault  0.59 sec
        Start 160: ncdap_t_misc
160/169 Test #160: ncdap_t_misc ..........................   Passed    0.14 sec
        Start 161: ncdap_test_varm3
161/169 Test #161: ncdap_test_varm3 ......................***Exception: SegFault  0.58 sec
        Start 162: C_tests_simple_xy_wr
162/169 Test #162: C_tests_simple_xy_wr ..................   Passed    0.01 sec
        Start 163: C_tests_simple_xy_rd
163/169 Test #163: C_tests_simple_xy_rd ..................   Passed    0.04 sec
        Start 164: C_tests_sfc_pres_temp_wr
164/169 Test #164: C_tests_sfc_pres_temp_wr ..............   Passed    0.01 sec
        Start 165: C_tests_sfc_pres_temp_rd
165/169 Test #165: C_tests_sfc_pres_temp_rd ..............   Passed    0.01 sec
        Start 166: C_tests_pres_temp_4D_wr
166/169 Test #166: C_tests_pres_temp_4D_wr ...............   Passed    0.06 sec
        Start 167: C_tests_pres_temp_4D_rd
167/169 Test #167: C_tests_pres_temp_4D_rd ...............   Passed    0.03 sec
        Start 168: cdl_create_sample_files
168/169 Test #168: cdl_create_sample_files ...............   Passed    0.05 sec
        Start 169: cdl_do_comps
169/169 Test #169: cdl_do_comps ..........................   Passed    0.01 sec

98% tests passed, 3 tests failed out of 169

Total Test time (real) = 111.31 sec

The following tests FAILED:
        155 - ncdap_tst_remote3 (Failed)
        159 - ncdap_test_nstride_cached (SEGFAULT)
        161 - ncdap_test_varm3 (SEGFAULT)
Errors while running CTest

In case II, all tests passed (100%).
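
For comparing runs like these, the failures can be pulled out of a saved CTest log automatically. This is a rough sketch, not part of CTest itself: `failed_tests` is a hypothetical helper, and it only recognizes the "Passed", "***Failed" and "***Exception: ..." statuses that appear in the log in this message. It also tolerates result lines that a mail client has wrapped.

```python
import re

# Matches e.g. "Test #159: name .....***Exception: SegFault" after the
# whitespace in the log has been collapsed to single spaces.
RESULT_RE = re.compile(
    r"Test\s+#(\d+):\s+(\S+)\s+\.+\s*"
    r"(Passed|\*\*\*Failed|\*\*\*Exception:\s*\w+)"
)


def failed_tests(ctest_output):
    """Return [(number, name, status)] for every test that did not pass."""
    # Collapse all whitespace so wrapped lines are joined back together.
    text = " ".join(ctest_output.split())
    failures = []
    for number, name, status in RESULT_RE.findall(text):
        if status != "Passed":
            failures.append((int(number), name, status.lstrip("*")))
    return failures
```

Run on the case I log, this lists tests 155, 159 and 161, all of which carry the ncdap_ prefix.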

But in either case, a Trilinos build against these NetCDF variations still
results in the decomp error. However, Peridigm builds fine, with all tests
passing, in either case.
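
One of the decomp errors quoted below reports a dimension size of 6313656973 for "ncnt_cmap" while nem_slice says it is running in 32-bit integer mode. As a quick sanity check (a sketch only; it does not locate the bug), that value cannot be represented as a 32-bit integer at all, which is consistent with Greg's suspicion of a 32/64-bit logic error:

```python
INT32_MAX = 2**31 - 1    # largest signed 32-bit value, 2147483647
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value, 4294967295


def fits_int32(value):
    """True if value is representable as a signed 32-bit integer."""
    return -2**31 <= value <= INT32_MAX


# Size printed for "ncnt_cmap" in the quoted nem_slice error.
reported = 6313656973

print(fits_int32(reported))     # whether it fits a signed 32-bit int
print(reported > UINT32_MAX)    # whether it exceeds even the unsigned range
```

Both checks say the reported size is out of range for 32-bit storage, so the number is more likely a misread or mis-sized value than a real communication map count.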

Sai





On Fri, Feb 19, 2016 at 11:47 AM, Sai P Uppati <uppatis at utexas.edu> wrote:

> Hi Greg,
>
> It's not a specific mesh that I'm having trouble with. Basically, any mesh
> I'm trying to decompose, I get either the netcdf error I've pasted in
> previous messages on this thread, or the following about 'segmentation
> fault':
>
> Executing:
>    /usr/local/trilinos/bin/nem_slice -e -S  -l inertial -c -o
> cube_split.g.nem -m mesh=4 cube_split.g
>    ...see cube_split.g.decomp.out for nem_slice status
>
> Beginning nem_slice execution.
> Input Mesh File = 'cube_split.g'
> Using 32-bit integer mode for decomposition...
> [dhcp-128-83-76-100:51612] *** Process received signal ***
> [dhcp-128-83-76-100:51612] Signal: Segmentation fault: 11 (11)
> [dhcp-128-83-76-100:51612] Signal code: Address not mapped (1)
> [dhcp-128-83-76-100:51612] Failing at address: 0x7fbdae400000
> [dhcp-128-83-76-100:51612] [ 0] 0   libsystem_platform.dylib
>  0x00007fff91404eaa _sigtramp + 26
> [dhcp-128-83-76-100:51612] [ 1] 0   ???
> 0x00007fff6df35390 0x0 + 140735038051216
> [dhcp-128-83-76-100:51612] [ 2] 0   nem_slice
> 0x000000010ab53522
> _Z13write_nemesisIiEiRNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEEP19Machine_DescriptionP19Problem_DescriptionP16Mesh_DescriptionIT_EP14LB_DescriptionISD_EP11Sphere_Info
> + 16162
> [dhcp-128-83-76-100:51612] [ 3] 0   nem_slice
> 0x000000010ab4afce _Z13internal_mainIiEiiPPcT_ + 8814
> [dhcp-128-83-76-100:51612] [ 4] 0   nem_slice
> 0x000000010ab44400 main + 1680
> [dhcp-128-83-76-100:51612] [ 5] 0   libdyld.dylib
> 0x00007fff9788e5ad start + 1
> [dhcp-128-83-76-100:51612] *** End of error message ***
> /usr/local/trilinos/bin/decomp: line 125: 51612 Segmentation fault: 11  (
> $NEM_SLICE -e $spheres $decomp_method $do_viz $nem_slice_flag -o $nemesis
> -m mesh=$processors $genesis >> $output )
>
> ERROR:******************************************************************
> ERROR:
> ERROR     During nem_slice execution. Check error output above and rerun
> ERROR:
> ERROR:******************************************************************
>
> You can see that this is not one of the meshes you see in the errors prior
> to this message. Any mesh I try to decompose is having this problem. All
> these errors though are 'During nem_slice execution'.
>
> I'm using netcdf 4.4.0, the latest stable version. I've also tried netcdf
> version 4.3.3.1, which was working before the rebasing to the latest GitHub
> trilinos version. I've tried variations including enabling and disabling
> netcdf-4 and dap. None of these variations helped. The latest build I
> have is with netcdf-4 and dap enabled, and I've not made changes to the
> numbers in netcdf.h file as instructed in the peridigm webpage. All the
> tests in the netcdf test suite (~160 tests, I believe) passed.
>
> I'm attaching the mesh file you requested, but it's not a mesh specific
> issue. I really appreciate your help looking into this issue. Meanwhile,
> I'll try leaving netcdf-4 and dap enabled, but changing the variables as
> shown on the peridigm page, and see if that variation works to fix this
> issue.
>
> Sai
>
> On Fri, Feb 19, 2016 at 7:52 AM, Sjaardema, Gregory D <gdsjaar at sandia.gov>
> wrote:
>
>> Most builds of nemslice use a netcdf with netcdf4 enabled.  It looks like
>> there is a logic error somewhere with determining whether something is 32
>> or 64 bit (even in 32 bit mode we output some values to the nem file as 64
>> bit values if netcdf4 enabled).  If you could enable netcdf4 on your build
>> then hopefully it will work.  I'm still trying to track down why yours is
>> failing but can't replicate yet.
>>
>> What netcdf version are you using?
>>
>> .. Greg
>>
>> Sent from my iPhone
>>
>> On Feb 18, 2016, at 9:28 PM, Sjaardema, Gregory D <gdsjaar at sandia.gov>
>> wrote:
>>
>> Is it possible to send me the mesh?
>>
>> Also, you don't need to and probably shouldn't disable netcdf4.  It will
>> give you more options, but shouldn't cause the problem you are seeing.
>>
>> If I could try the mesh I might be able to replicate the issue.
>>
>> ..Greg
>>
>> Sent from my iPhone
>>
>> On Feb 18, 2016, at 4:47 PM, Bradley, Andrew Michael <ambradl at sandia.gov>
>> wrote:
>>
>> Hi Sai,
>>
>>
>> OK. Sorry, but I'll have to let the experts step in at this point. I had
>> guessed that might work based on examining elb_main.C lines 139, 168-171,
>> but there must be some deeper issue that I'm not seeing.
>>
>>
>> Andrew
>>
>>
>> ------------------------------
>> *From:* Sai P Uppati <uppatis at utexas.edu>
>> *Sent:* Thursday, February 18, 2016 4:39 PM
>> *To:* Bradley, Andrew Michael
>> *Cc:* Sjaardema, Gregory D; trilinos-users at trilinos.org
>> *Subject:* Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error
>>
>> Andrew,
>>
>> There is no change in the error it throws. Still uses a 32-bit integer
>> mode for decomposition.
>>
>> Sai
>>
>> On Thu, Feb 18, 2016 at 5:28 PM, Bradley, Andrew Michael <
>> ambradl at sandia.gov> wrote:
>>
>>> Hi Sai,
>>>
>>>
>>> Just a guess, but what happens if you add the command-line flag -64 ?
>>>
>>>
>>> Andrew
>>>
>>>
>>> ------------------------------
>>> *From:* Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf
>>> of Sai P Uppati <uppatis at utexas.edu>
>>> *Sent:* Thursday, February 18, 2016 4:23 PM
>>> *To:* Sjaardema, Gregory D
>>> *Cc:* trilinos-users at trilinos.org
>>> *Subject:* Re: [Trilinos-Users] [EXTERNAL] Re: decomp tool Error
>>>
>>> UPDATE:
>>>
>>> I think all other tools may be working fine from my trilinos build. The
>>> tools I commonly use from trilinos include decomp, epu and exodiff.
>>> All except decomp seem to be working fine.
>>>
>>> Even after rebuilding trilinos several times (varying some options each
>>> time) and Peridigm passing all the tests each time, decomp throws errors
>>> like these:
>>>
>>> Executing:
>>>    /usr/local/trilinos/bin/nem_slice -e -S  -l inertial -c -o
>>> HEGF-res-cylin.g.nem -m mesh=4 HEGF-res-cylin.g
>>>    ...see HEGF-res-cylin.g.decomp.out for nem_slice status
>>>
>>> Beginning nem_slice execution.
>>> Input Mesh File = 'HEGF-res-cylin.g'
>>> Using 32-bit integer mode for decomposition...
>>> Exodus Library Warning/Error: [ex_put_cmap_params_cc]
>>> Error: failed to add dimension for "ncnt_cmap" of size 6313656973 in
>>> file ID 65536
>>> NetCDF: Invalid dimension size
>>> ================================messages================================
>>> fatal: unable to output communication map parameters
>>> fatal: could not output Nemesis file
>>>
>>> ERROR:******************************************************************
>>> ERROR:
>>> ERROR     During nem_slice execution. Check error output above and rerun
>>> ERROR:
>>> ERROR:******************************************************************
>>>
>>> Sai
>>>
>>> On Thu, Feb 18, 2016 at 11:03 AM, Sai P Uppati <uppatis at utexas.edu>
>>> wrote:
>>>
>>>> Hi Greg,
>>>>
>>>> An example mesh I'm trying to decompose contains 178320 elements,
>>>> 189405 nodes and 1 block. I tried decomposing for 4, 6 and 8 processors. I
>>>> haven't had problems with previous Trilinos versions I was using before. I
>>>> think it was only since I rebased to the official version hosted on the
>>>> GitHub page.
>>>>
>>>> Anyways, getopt I was able to fix with John Foster's help. I just
>>>> installed a gnu-getopt version from Homebrew and modified the PATH variable
>>>> to look for it first before looking in /usr/bin.
>>>>
>>>> Coming to Netcdf, I followed the instructions exactly as they stated in
>>>> the following page: https://peridigm.sandia.gov/content/netcdf. So, I
>>>> disabled netcdf-4 and dap, and installed it using the changed numbers in
>>>> netcdf.h file as well. All the tests passed when I did 'make check'. So I
>>>> didn't think there were any issues with the netcdf installation. Doing it
>>>> this way, however, the netcdf build never referenced the HDF5 build I did
>>>> in the previous step. Even in the summary of the netcdf configuration, HDF5
>>>> support seems to be off. I left HDF5 installed, though, because I saw that
>>>> it may be needed for the SEACAS package in Trilinos.
>>>>
>>>> But as I mentioned before, I didn't have issues like this with previous
>>>> Trilinos versions (I also didn't follow the netcdf instructions given at
>>>> the webpage before, I just installed whatever was default from unidata).
>>>> Perhaps, the instructions on the page are not completely correct?
>>>>
>>>> Sorry for the long email, but those are all the details.
>>>>
>>>> Sai
>>>>
>>>> On Thu, Feb 18, 2016 at 7:38 AM, Sjaardema, Gregory D <
>>>> gdsjaar at sandia.gov> wrote:
>>>>
>>>>> What size mesh are you decomposing (#elem, #block, #node) and how many
>>>>> processors are you decomposing it for?
>>>>>
>>>>> Did you also install hdf5 and reference it in the netcdf build for
>>>>> netcdf-4 support, or is it a netcdf build only?
>>>>>
>>>>> The current getopt that you have will work, but will give reduced
>>>>> functionality in regards to long options which you can see by entering -H
>>>>> and -h and seeing the difference.  I’m not sure if installing the
>>>>> gnu-getopt in parallel with the system getopt would cause issues or not,
>>>>> but on my and many other macs we have both installed and have not noticed
>>>>> any issues (However, I use port instead of brew).
>>>>>
>>>>> ..Greg
>>>>> --
>>>>> "A supercomputer is a device for turning compute-bound problems into
>>>>> I/O-bound problems"
>>>>>
>>>>> From: Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf
>>>>> of "John T. Foster" <jfoster at austin.utexas.edu>
>>>>> Date: Wednesday, February 17, 2016 at 6:00 PM
>>>>> To: Sai P Uppati <uppatis at utexas.edu>
>>>>> Cc: "trilinos-users at trilinos.org" <trilinos-users at trilinos.org>
>>>>> Subject: [EXTERNAL] Re: [Trilinos-Users] decomp tool Error
>>>>>
>>>>> Sai,
>>>>>
>>>>> I believe you're using Homebrew as a package manager, so use:
>>>>>
>>>>> brew install getopt
>>>>>
>>>>> To install the getopt command line utility.
>>>>>
>>>>> JTF
>>>>>
>>>>> On Wednesday, February 17, 2016, Sai P Uppati <uppatis at utexas.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I installed Trilinos and Peridigm (official versions hosted on
>>>>>> GitHub) on my Mac OS X 10.11.3, including the dependencies boost, hdf5 and
>>>>>> netcdf. I followed the instructions on Sandia's Peridigm installation guide
>>>>>> to the dot.
>>>>>>
>>>>>> The Peridigm unit tests all passed, which is good. However, when I
>>>>>> try to use the decomp tool from Trilinos, I get the following errors:
>>>>>>
>>>>>>
>>>>>>
>>>>>> ########################################################################
>>>>>> The "getopt" executable that is available on this system is an older
>>>>>> version that is not compatible with the needs of the "decomp" tool.
>>>>>> If possible, you should update your getopt to a newer version and make
>>>>>> sure that the new getopt is in your path.
>>>>>>
>>>>>> Below are some options for getting the current getopt version:
>>>>>> * If on a Mac: "sudo port install getopt"
>>>>>> * Search the internet for "getopt-1.1.5" or "getopt-1.1.4"; download
>>>>>> and build
>>>>>>
>>>>>> Enter "-h" for the modified options that this version supports.
>>>>>> Enter "-H" for the options that the standard version supports.
>>>>>>
>>>>>> ########################################################################
>>>>>>
>>>>>>
>>>>>>
>>>>>> Executing:
>>>>>>    /usr/local/trilinos/bin/nem_slice -e -S  -l inertial -c -o
>>>>>> prism-precrack.g.nem -m mesh=8 prism-precrack.g
>>>>>>    ...see prism-precrack.g.decomp.out for nem_slice status
>>>>>>
>>>>>> Beginning nem_slice execution.
>>>>>> Input Mesh File = 'prism-precrack.g'
>>>>>> Using 32-bit integer mode for decomposition...
>>>>>> Exodus Library Warning/Error: [ex_put_cmap_params_cc]
>>>>>> Error: unable to output variable in file ID 65536
>>>>>> NetCDF: Index exceeds dimension bound
>>>>>>
>>>>>> ================================messages================================
>>>>>> fatal: unable to output communication map parameters
>>>>>> fatal: could not output Nemesis file
>>>>>>
>>>>>>
>>>>>> ERROR:******************************************************************
>>>>>> ERROR:
>>>>>> ERROR     During nem_slice execution. Check error output above and
>>>>>> rerun
>>>>>> ERROR:
>>>>>>
>>>>>> ERROR:******************************************************************
>>>>>>
>>>>>>
>>>>>> There are multiple errors here.
>>>>>>
>>>>>> 1) I don't know how to update the getopt executable. It seems Mac OS
>>>>>> X already comes with a built-in version (which I checked and found to be in
>>>>>> /usr/bin), but this version is not compatible with decomp. I checked
>>>>>> Homebrew, and there is a keg-only option to install gnu-getopt, but they
>>>>>> have a warning that installing different versions in parallel can cause
>>>>>> trouble. I'm not able to find any other working way to install getopt
>>>>>> without causing errors.
>>>>>>
>>>>>> 2) NetCDF error about exceeding dimensions. I installed the latest
>>>>>> version of netcdf-c, 4.4.0. I changed the numbers in netcdf.h as instructed
>>>>>> in the Peridigm installation guide. I have a feeling that this may have
>>>>>> something to do with the error, but I'm not quite sure. All tests passed,
>>>>>> however, when I installed netcdf from source.
>>>>>>
>>>>>> There may be other errors I'm not seeing. Please, I would appreciate
>>>>>> if I can get some guidance on how to address these errors.
>>>>>>
>>>>>> Sai
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from iPhone
>>>>>
>>>>
>>>>
>>>
>>>
>>
>