[Trilinos-Users] Parallel memory problem in ML?

Mon Oct 4 12:34:21 MDT 2010

  Hi Paul,

     Well, this could easily be a bug in the interface.  I'll have a 
look and report back.

Jonathan

trilinos-users-request at software.sandia.gov wrote on 10/04/2010 10:10 AM:
> Send Trilinos-Users mailing list submissions to
>          trilinos-users at software.sandia.gov
>
> To subscribe or unsubscribe via the World Wide Web, visit
>          http://software.sandia.gov/mailman/listinfo/trilinos-users
> or, via email, send a message with subject or body 'help' to
>          trilinos-users-request at software.sandia.gov
>
> You can reach the person managing the list at
>          trilinos-users-owner at software.sandia.gov
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Trilinos-Users digest..."
>
>
> Today's Topics:
>
>     1. Parallel memory problem in ML? (Paul Dionne)
>     2. Parallel Performance and non-continuous block     layout
>        (M. Scot Breitenfeld)
>     3. Re: Parallel Performance and non-continuous block layout
>        (Heroux, Michael A)
>     4. Re: Parallel Performance and non-continuous block layout
>        (M. Scot Breitenfeld)
>     5. Re: Tpetra for objects (Vol 62, Issue 2) (Mark Hoemmen)
>     6. Re: Parallel Performance and non-continuous       block   layout
>        (Erik Boman)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 04 Oct 2010 09:11:32 -0500
> From: "Paul Dionne"<pjd at cfdrc.com>
> Subject: [Trilinos-Users] Parallel memory problem in ML?
> To: "trilinos-users at software.sandia.gov"
>          <trilinos-users at software.sandia.gov>
> Message-ID:<4CA9E094.9080608 at cfdrc.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
>    Hi,
>
> I am new to Trilinos - I just started working with the the
> Trilinos/Petsc interface, with Trilinos version 10.4.1 and Petsc 3.0.0.
> I started with the Trilinos/Petsc example given at:
>
> http://trilinos.sandia.gov/packages/docs/dev/packages/epetraext/doc/html/epetraext_petsc_cpp.html
>
> and have been modifying it for my testing purposes.  I'm running a
> fairly small problem that takes about 0.4% of the machine memory, but
> when I run ML in parallel after 20 iterations or so I am up to 100%
> memory on all machines.  I have narrowed it down to the ML
> preconditioner, and I have also found that if I use Petsc as the
> smoother ("-petsc_smoother" option) the problem does not show up.
>
> Is this a known problem?  Is there anything else I can try?
>
> Thanks,
>
> Paul Dionne
>
> --
> Paul J. Dionne
> CFD Research Corporation
> 215 Wynn Drive, 5th floor
> Huntsville, Al  35805
> (256) 726-4837
> http://www.cfdrc.com
>
> Confidentiality Notice
>
> The information contained in this communication and its attachments is intended only for the use of the individual to whom it is addressed and may contain information that is legally privileged, confidential, or exempt from disclosure. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, copying, or reliance on this communication is strictly prohibited. If you have received this communication in error, please notify confidentiality at cfdrc.com (256-726-4800) and delete the communication without retaining any copies.
>
>
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: https://software.sandia.gov/pipermail/trilinos-users/attachments/20101004/4b42b983/attachment-0001.html
>
> ------------------------------
>
> Message: 2
> Date: Mon, 04 Oct 2010 11:29:52 -0500
> From: "M. Scot Breitenfeld"<brtnfld at uiuc.edu>
> Subject: [Trilinos-Users] Parallel Performance and non-continuous
>          block   layout
> To: trilinos-users at software.sandia.gov
> Message-ID:<4CAA0100.3060903 at uiuc.edu>
> Content-Type: text/plain; charset=iso-8859-1; format=flowed
>
>    Does it effect the parallel performance to have non-continuous nodal
> global ordering? For example, in 1D if I have (2 procs),
>
> 1    2    3   4    5   6   7   8
> o    o   o   o  | o   o   o  o
>
> then the processors will have continuous blocks of rows (proc 0, 0-3:
> proc 1: 4-7) or the other case:
>
> 1    4    5   8    3   6   7   2
> o    o   o   o  | o   o   o  o
>
> where proc 0 has rows 0,3,4,7 and proc 1 has rows 2, 5, 7, 1.
> Also, my method requires nodes in the region past partition boundary
> (for example nodes 3,6,7 contribute to proc 0), not just the nodes
> directly adjacent to the partition boundary.
>
> And another question, can I do my calculations for partition boundary
> nodes (nodes involved with communication) and then: call
> GlobalAssemble,  do the calculations for my interior nodes, and call
> FillComplete ?
>
> Thanks,
> Scot
>
>
>
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 4 Oct 2010 10:54:31 -0600
> From: "Heroux, Michael A"<maherou at sandia.gov>
> Subject: Re: [Trilinos-Users] Parallel Performance and non-continuous
>          block layout
> To: "M. Scot Breitenfeld"<brtnfld at uiuc.edu>,
>          "trilinos-users at software.sandia.gov"
>          <trilinos-users at software.sandia.gov>
> Message-ID:<C8CF70F7.2E2E7%maherou at sandia.gov>
> Content-Type: text/plain; charset="us-ascii"
>
> Scot,
>
> The performance difference continuous and non-continuous should be minimal,
> for sufficiently large problems.
>
> Regarding your last questions, I think Alan Williams can best answer it.  I
> assume you are using the FECrsMatrix class, right?
>
> Mike
>
>
> On 10/4/10 11:29 AM, "M. Scot Breitenfeld"<brtnfld at uiuc.edu>  wrote:
>
>>    Does it effect the parallel performance to have non-continuous nodal
>> global ordering? For example, in 1D if I have (2 procs),
>>
>> 1    2    3   4    5   6   7   8
>> o    o   o   o  | o   o   o  o
>>
>> then the processors will have continuous blocks of rows (proc 0, 0-3:
>> proc 1: 4-7) or the other case:
>>
>> 1    4    5   8    3   6   7   2
>> o    o   o   o  | o   o   o  o
>>
>> where proc 0 has rows 0,3,4,7 and proc 1 has rows 2, 5, 7, 1.
>> Also, my method requires nodes in the region past partition boundary
>> (for example nodes 3,6,7 contribute to proc 0), not just the nodes
>> directly adjacent to the partition boundary.
>>
>> And another question, can I do my calculations for partition boundary
>> nodes (nodes involved with communication) and then: call
>> GlobalAssemble,  do the calculations for my interior nodes, and call
>> FillComplete ?
>>
>> Thanks,
>> Scot
>>
>>
>>
>>
>>
>> _______________________________________________
>> Trilinos-Users mailing list
>> Trilinos-Users at software.sandia.gov
>> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>
>
> ------------------------------
>
> Message: 4
> Date: Mon, 04 Oct 2010 11:57:25 -0500
> From: "M. Scot Breitenfeld"<brtnfld at uiuc.edu>
> Subject: Re: [Trilinos-Users] Parallel Performance and non-continuous
>          block layout
> To: "Heroux, Michael A"<maherou at sandia.gov>
> Cc: "trilinos-users at software.sandia.gov"
>          <trilinos-users at software.sandia.gov>
> Message-ID:<4CAA0775.9000906 at uiuc.edu>
> Content-Type: text/plain; charset=iso-8859-1; format=flowed
>
>    On 10/04/2010 11:54 AM, Heroux, Michael A wrote:
>> Scot,
>>
>> The performance difference continuous and non-continuous should be minimal,
>> for sufficiently large problems.
>>
>> Regarding your last questions, I think Alan Williams can best answer it.  I
>> assume you are using the FECrsMatrix class, right?
> Correct, I'm using FECrsMartrix.
>
>
>> Mike
>>
>>
>> On 10/4/10 11:29 AM, "M. Scot Breitenfeld"<brtnfld at uiuc.edu>   wrote:
>>
>>>     Does it effect the parallel performance to have non-continuous nodal
>>> global ordering? For example, in 1D if I have (2 procs),
>>>
>>> 1    2    3   4    5   6   7   8
>>> o    o   o   o  | o   o   o  o
>>>
>>> then the processors will have continuous blocks of rows (proc 0, 0-3:
>>> proc 1: 4-7) or the other case:
>>>
>>> 1    4    5   8    3   6   7   2
>>> o    o   o   o  | o   o   o  o
>>>
>>> where proc 0 has rows 0,3,4,7 and proc 1 has rows 2, 5, 7, 1.
>>> Also, my method requires nodes in the region past partition boundary
>>> (for example nodes 3,6,7 contribute to proc 0), not just the nodes
>>> directly adjacent to the partition boundary.
>>>
>>> And another question, can I do my calculations for partition boundary
>>> nodes (nodes involved with communication) and then: call
>>> GlobalAssemble,  do the calculations for my interior nodes, and call
>>> FillComplete ?
>>>
>>> Thanks,
>>> Scot
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Trilinos-Users mailing list
>>> Trilinos-Users at software.sandia.gov
>>> http://software.sandia.gov/mailman/listinfo/trilinos-users
>>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 4 Oct 2010 11:09:44 -0600
> From: Mark Hoemmen<mhoemme at sandia.gov>
> Subject: Re: [Trilinos-Users] Tpetra for objects (Vol 62, Issue 2)
> To: "trilinos-users at software.sandia.gov"
>          <trilinos-users at software.sandia.gov>
> Message-ID:<278BAE7E-A2D3-4235-A274-62C87C9D1B84 at sandia.gov>
> Content-Type: text/plain; charset="us-ascii"
>
> On Oct 2, 2010, at 12:00 PM, trilinos-users-request at software.sandia.gov wrote:
>> Message: 1
>> Date: Fri, 1 Oct 2010 15:32:57 -0700
>> From: "Qiyang Hu"<huqy2000 at gmail.com>
>> Subject: [Trilinos-Users] Tpetra for objects
>> To: "trilinos-users at software.sandia.gov"
>>        <trilinos-users at software.sandia.gov>
>> Message-ID:
>>        <AANLkTinuBJfjwH8y4qQYk70QuwxyWHHw-j=vrOLxacHu at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hello,
>>
>> I am trying to use Tpetra::MultiVector to serve as a distributed container
>> for self-defined objects. According to the document, any data type can be
>> used so long as it implements Teuchos::ScalarTraits and
>> Teuchos::SerializationTraits. It seems obvious to me for ScalarTraits. But I
>> failed to implement SerializationTraits. Is there any example code that I
>> can get some hints on it? Thanks so much in advance!
> While you might get things to work, it seems like a bad idea to use Tpetra::MultiVector for completely arbitrary objects.  The Tpetra documentation says of Scalar: "A Scalar is the data structure used for storing values. This is the type most likely to be changed by many users. The most common use cases are float, double, complex<float>  and complex<double>. However, any data type can be used so long as it implements Teuchos::ScalarTraits and Teuchos::SerializationTraits and supports the necessary arithmetic operations, such as addition, subtraction, division and multiplication."  If those operations don't make sense for your object type, then maybe you should consider some other subclass of Tpetra::DistObject.
>
> mfh
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: https://software.sandia.gov/pipermail/trilinos-users/attachments/20101004/d4e82ed6/attachment-0001.html
>
> ------------------------------
>
> Message: 6
> Date: Mon, 04 Oct 2010 11:10:23 -0600
> From: Erik Boman<egboman at sandia.gov>
> Subject: Re: [Trilinos-Users] Parallel Performance and non-continuous
>          block   layout
> To: "M. Scot Breitenfeld"<brtnfld at uiuc.edu>
> Cc: "Heroux, Michael A"<maherou at sandia.gov>,
>          "trilinos-users at software.sandia.gov"
>          <trilinos-users at software.sandia.gov>
> Message-ID:<1286212223.26582.288.camel at octopi.sandia.gov>
> Content-Type: text/plain
>
> Scot,
>
> In general, the data distribution (maps) will affect parallel
> performance. You want to minimize communication between processors. I'm
> not sure how to interpret the figures below. If the global numbers
> correspond to the coordinates along the axis of a 1D PDE problem, then
> clearly the contiguous map (top) is better than the non-contiguous map
> (bottom), since there's less communication. However, if you're simply
> relabeling your variables then I agree with Mike the difference should
> be small (minor difference due to cache effects is possible).
>
> Erik
>
> On Mon, 2010-10-04 at 10:57 -0600, M. Scot Breitenfeld wrote:
>>    On 10/04/2010 11:54 AM, Heroux, Michael A wrote:
>>> Scot,
>>>
>>> The performance difference continuous and non-continuous should be minimal,
>>> for sufficiently large problems.
>>>
>>> Regarding your last questions, I think Alan Williams can best answer it.  I
>>> assume you are using the FECrsMatrix class, right?
>> Correct, I'm using FECrsMartrix.
>>
>>
>>> Mike
>>>
>>>
>>> On 10/4/10 11:29 AM, "M. Scot Breitenfeld"<brtnfld at uiuc.edu>   wrote:
>>>
>>>>     Does it effect the parallel performance to have non-continuous nodal
>>>> global ordering? For example, in 1D if I have (2 procs),
>>>>
>>>> 1    2    3   4    5   6   7   8
>>>> o    o   o   o  | o   o   o  o
>>>>
>>>> then the processors will have continuous blocks of rows (proc 0, 0-3:
>>>> proc 1: 4-7) or the other case:
>>>>
>>>> 1    4    5   8    3   6   7   2
>>>> o    o   o   o  | o   o   o  o
>>>>
>>>> where proc 0 has rows 0,3,4,7 and proc 1 has rows 2, 5, 7, 1.
>>>> Also, my method requires nodes in the region past partition boundary
>>>> (for example nodes 3,6,7 contribute to proc 0), not just the nodes
>>>> directly adjacent to the partition boundary.
>>>>
>>>> And another question, can I do my calculations for partition boundary
>>>> nodes (nodes involved with communication) and then: call
>>>> GlobalAssemble,  do the calculations for my interior nodes, and call
>>>> FillComplete ?
>>>>
>>>> Thanks,
>>>> Scot
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Trilinos-Users mailing list
>>>> Trilinos-Users at software.sandia.gov
>>>> http://software.sandia.gov/mailman/listinfo/trilinos-users
>>>
>>
>> _______________________________________________
>> Trilinos-Users mailing list
>> Trilinos-Users at software.sandia.gov
>> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>
> ------------------------------
>
> _______________________________________________
> Trilinos-Users mailing list
> Trilinos-Users at software.sandia.gov
> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>
> End of Trilinos-Users Digest, Vol 62, Issue 4
> *********************************************
>