[Trilinos-Users] Odd Behavior in EpetraExt_HDF5

Truman Ellis truman at ices.utexas.edu
Tue Sep 16 14:11:37 MDT 2014


This is an STL vector. I am relying on the fact that std::vector stores
its elements contiguously. There is no error when I remove the Trilinos
code. The point of this example was to demonstrate that the EpetraExt_HDF5
code hangs when two processors try to write data of different sizes. The
stall is not caused by the extra time needed to write more data: when both
processors write the larger amount (2000 doubles), the program finishes
immediately. With different sizes, the stall never resolves and the
program never completes.
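
For reference, std::vector is guaranteed (since C++03) to store its
elements contiguously, so &testVec[0] and the C++11 testVec.data() refer
to the same underlying buffer. A minimal sketch, separate from the test
code below, of the property I am relying on:

    #include <cassert>
    #include <vector>

    int main()
    {
      std::vector<double> testVec(2000, 1.0);

      // Contiguous storage: the address of element 0 is the start of the
      // underlying buffer, so it can be passed to a C API such as HDF5.
      double* p = &testVec[0];
      assert(p + 1 == &testVec[1]);

      // C++11 spelling of the same pointer.
      assert(testVec.data() == p);

      return 0;
    }

Either pointer is valid to hand to a routine expecting a raw buffer of
testVec.size() doubles.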

Truman
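
P.S. For completeness, the Trilinos-free check mentioned above looked
roughly like the following. This is a sketch only (serial HDF5 C API plus
MPI and a C++11 compiler; the plainN.h5 file names are placeholders, not
the actual test files), with the same size mismatch as the EpetraExt
example:

    #include <hdf5.h>
    #include <mpi.h>

    #include <string>
    #include <vector>

    int main(int argc, char** argv)
    {
      MPI_Init(&argc, &argv);
      int commRank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &commRank);

      // Rank 0 writes 1000 doubles, rank 1 writes 2000, as in the
      // EpetraExt::HDF5 example.
      std::vector<double> testVec(1000 + 1000*commRank, 1.0);
      std::string fileName = "plain" + std::to_string(commRank) + ".h5";

      // Each rank creates its own file with the serial HDF5 API and a
      // 1-D dataset sized to its local vector.
      hid_t file  = H5Fcreate(fileName.c_str(), H5F_ACC_TRUNC,
                              H5P_DEFAULT, H5P_DEFAULT);
      hsize_t dims[1] = { static_cast<hsize_t>(testVec.size()) };
      hid_t space = H5Screate_simple(1, dims, NULL);
      hid_t dset  = H5Dcreate2(file, "Test", H5T_NATIVE_DOUBLE, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      // Write the whole buffer; &testVec[0] would work equally well.
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
               testVec.data());

      H5Dclose(dset);
      H5Sclose(space);
      H5Fclose(file);

      MPI_Finalize();
      return 0;
    }

As noted above, this kind of Trilinos-free test completes without error
even with the size mismatch, which is why I suspect the hang is in the
EpetraExt::HDF5 layer itself.
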
On 9/16/14, 2:52 PM, Lofstead, Gerald F II wrote:
> Which vector is that? It looks like a C++ STL vector rather than a
> Trilinos vector (E/Tpetra). If so, I expect you can simplify this by
> removing all of the Trilinos code and see the same behavior. Depending
> on the version of C++ (C++11 is different), how you get at the raw data
> differs. C++11 added a member function, data(), that gives you a pointer
> to the elements and would work for the code as written. What you have in
> &testVec[0] takes the address of a single element reference. Whether or
> not that is part of the actual data storage in the vector is not
> defined. For that matter, taking the address of a reference is not a good
> idea either. You need to use an iterator or upgrade to C++11 to get the
> code you have written to work properly and portably. As for the stall, I
> am not sure why it is happening. If the sizes are what you expect, the
> delay may just be the additional time needed to write the larger amount
> of data. How long is the stall compared to the base case (1000 elements)?
>
> Jay
>
> On 9/16/14 12:08 PM, "Jonathan Hu" <jhu at sandia.gov> wrote:
>
>> trilinos-users-request at software.sandia.gov wrote on 09/16/2014 11:00 AM:
>>> Subject:
>>> Re: [Trilinos-Users] Odd Behavior in EpetraExt_HDF5
>>> From:
>>> Truman Ellis <truman at ices.utexas.edu>
>>> Date:
>>> 09/15/2014 03:56 PM
>>>
>>> To:
>>> <trilinos-users at software.sandia.gov>
>>>
>>>
>>> There isn't any distributed data in this example. I just wanted two
>>> MPI processes to simultaneously write out two independent HDF5 files.
>>> But I noticed that if the two HDF5 files were different sizes (1000
>>> data items vs. 2000 data items), then I got a stall. If they both
>>> write data of the same size, everything goes through.
>>>
>>> On 9/15/14, 4:10 PM, Jonathan Hu wrote:
>>>>> I am using the EpetraExt_HDF5 interface to save and load solutions,
>>>>> but I've run into some odd behavior and was wondering if anyone
>>>>> could explain it. My goal is to have each processor write out its
>>>>> own part of the solution to a different HDF5 file. For the time
>>>>> being, I am assuming that the number of processors loading the
>>>>> solution is equal to the number writing it. Since each processor is
>>>>> completely independent, I shouldn't get any weird race conditions or
>>>>> anything like that (theoretically). To communicate this to
>>>>> EpetraExt, I am using an Epetra_SerialComm in the constructor.
>>>>> However, the following code hangs when I run with 2 MPI processes:
>>>>>
>>>>>
>>>>> {
>>>>>      int commRank = Teuchos::GlobalMPISession::getRank();
>>>>>      Epetra_SerialComm Comm;
>>>>>      EpetraExt::HDF5 hdf5(Comm);
>>>>>      hdf5.Create("file"+Teuchos::toString(commRank)+".h5");
>>>>>      vector<double> testVec;
>>>>>      for (int i=0; i<1000+1000*commRank; i++)
>>>>>      {
>>>>>        testVec.push_back(1.0);
>>>>>      }
>>>>>      hdf5.Write("Test", "Group", H5T_NATIVE_DOUBLE, testVec.size(),
>>>>> &testVec[0]);
>>>>> }
>>>>> {
>>>>>      int commRank = Teuchos::GlobalMPISession::getRank();
>>>>>      Epetra_SerialComm Comm;
>>>>>      EpetraExt::HDF5 hdf5(Comm);
>>>>>      hdf5.Open("file"+Teuchos::toString(commRank)+".h5");
>>>>>      hdf5.Close();
>>>>> }
>>>>>
>>>>> Note that commRank 0 writes 1000 elements while commRank 1 writes
>>>>> 2000. The code works just fine when both write the same number of
>>>>> elements. Can someone enlighten me on what I am doing wrong? Is it
>>>>> possible to get the behavior I want, where each processor's read and
>>>>> write is independent of the others?
>>>>>
>>>>> Thanks,
>>>>> Truman Ellis
>>>> Truman,
>>>>
>>>>      Rank 1 is loading/writing testVec from 0..2000 due to the
>>>> bounds in your for loop.  I'm guessing that you want rank 1 to load
>>>> from 1001..2000 instead, so replace
>>>>
>>>>     for (int i=0; i<1000+1000*commRank; i++)
>>>>
>>>> with
>>>>
>>>>     for (int i=1000*commRank; i<1000+1000*commRank; i++)
>>>>
>>>> Hope this helps.
>>>>
>>>> Jonathan
>>>>
>> Truman,
>>
>>    Ok, I completely misunderstood your original email.  Hopefully one of
>> the I/O developers can chime in here.
>>
>> Jonathan
>>


