[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos
Heroux, Michael A
maherou at sandia.gov
Thu Sep 27 12:38:30 MDT 2012
Eric,
Belos is a good choice of iterative solver, and it is compatible with ML.
As for threading support, please give me a sense of your motivation for
using threads. Is it your intent to use them instead of MPI, or with MPI?
Other things being equal, MPI-only is still a better approach than
MPI+OpenMP or OpenMP-only on most systems, unless you are using very large
systems like BlueGene or Cray.
Mike
On 9/27/12 1:19 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com>
wrote:
>Thanks Mike,
>This is very helpful. I ran your codes on my system and was able to
>reproduce
>your timing results fairly closely.
>
>Given these results, would you recommend that I use Belos instead of
>AztecOO and ML? That is what these performance numbers suggest to me,
>but many of the real problems I'm solving have 3D diffusion, so I think
>I may need the type of preconditioning that ML has. Or are there other
>(threaded) preconditioners that I could use instead with Belos?
>
>--Eric
>
>
>On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
>> Eric,
>>
>> I took a look at your example code on my 12 core (two hex core Intel
>> Westmeres) workstation. There are two primary reasons for not seeing
>> speedup:
>>
>>
>> * Your timer includes all the problem generation steps. These are
>> fairly expensive and not threaded, so you won't see speedup when they
>> are included. Also, it appears that even when you were not using the
>> ML preconditioner, an ILU preconditioner was invoked, which is also
>> not threaded.
>> * AztecOO doesn't actually use the Epetra vectors in threaded mode, so
>> only the matvec is running in parallel. AztecOO extracts a pointer to
>> the Epetra data and executes it on a single core. This is particularly
>> bad since Epetra has actually mapped the data across the NUMA memory
>> system for threaded execution. I actually knew this before, but
>> forgot.
>>
>> Even so, when I removed the preconditioner completely from the AztecOO
>> call there was some speedup (from 4.13 to 3.0 sec with 1 and 4
>> threads, respectively, on 1M equations).
>>
>> Next, I replaced AztecOO with Belos CG. Belos relies on the underlying
>> linear algebra library, so it performs its computations entirely with
>> Epetra.
>>
>> With this approach I went from 3.7 to 2.2 sec with 1 and 4 threads,
>> respectively. Still not a linear improvement, but better than what you
>> were seeing. With 12 threads, I only reduced it to 2.1 seconds.
>>
>> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds,
>> not too different. But then with 12 ranks the time went to 1.5
>> seconds.
>>
>> In my experience it is still the case for current-generation
>> workstations and small node counts that MPI-only does better than
>> OpenMP threads. This is changing, though, and is already not true on
>> large systems where the number of MPI ranks becomes a resource issue,
>> and where threading and shared memory enable different algorithms. We
>> are also learning better ways of managing data layout and process
>> mapping, but this is very manual and very finicky right now.
>>
>> I have attached my two codes. The first, ml-example, is your original
>> code with all preconditioning removed. The second, belos-example,
>> substitutes Belos for AztecOO.
>>
>> I hope this helps.
>>
>> Best regards,
>>
>> Mike
>>
>> On 9/20/12 1:18 PM, "Eric Marttila"
>> <eric.marttila at thermoanalytics.com> wrote:
>>
>> Mike,
>> I have been using ML, but changed my code to run the simulation
>> without ML as you suggested. It takes much longer to run this way
>> (about 25 seconds instead of ~3) but does give some very slight
>> speedup for 2 threads, then gets slower again for 4 and 8 threads. The
>> timing results below are wall time using the function omp_get_wtime().
>> I'm using the CG option in AztecOO (not GMRES).
>>
>> OMP_NUM_THREADS Wall Time for solve
>> ------------------------------------------------------------
>> 1 24.6308 (seconds)
>> 2 24.0489 (seconds)
>> 4 25.9732 (seconds)
>> 8 28.6976 (seconds)
>>
>> --Eric
>>
>> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
>> Here are a few comments:
>> - Your problem size is certainly sufficient to realize some parallel
>> speedup.
>> - ML (which I assume you are using) will not see any improvement from
>> OpenMP parallelism. It is not instrumented for it.
>> - This means that the only possible parallelism is in the sparse MV
>> and vector updates. Since ML is usually more than 50% of the total
>> runtime, you won't see a lot of improvement from threading, even when
>> other issues are resolved.
>> A few suggestions:
>> - Try to run your environment without ML, just to see if you get any
>> improvement in the SpMV and vector operations.
>> - If you are using GMRES, make sure you link with a threaded BLAS.
>> DGEMV is the main kernel of GMRES other than SpMV and will need to be
>> executed in threaded mode.
>> - Make sure your timer is a wall-clock timer, not a CPU timer. A
>> reasonable timer is the one that comes with OpenMP (omp_get_wtime()).
>> I hope this helps. Let me know what you find out.
>> Mike
>> On Sep 19, 2012, at 8:39 PM, "Eric Marttila"
>> <eric.marttila at thermoanalytics.com> wrote:
>> > Mike,
>> > The problem size is 1 million unknowns. I have Trilinos compiled
>> > with MPI enabled. However, I'm launching my program with only one
>> > MPI process.
>> > Here is some system information:
>> > Processors: Dual Intel Xeon E5645 hex-core / 2.4 GHz / Cache: 12 MB
>> > RAM: 96 GB
>> > OS: CentOS 6.2 64bit.
>> > When solving for 1 million unknowns on this system, AztecOO reports
>> > the following solution times:
>> > Solution time: 3.3 seconds (using Trilinos with OpenMP disabled)
>> > Solution time: 4.0 seconds (using Trilinos with OpenMP enabled)
>> > I had OMP_NUM_THREADS set to 4.
>> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
>> > Thanks for your help.
>> > --Eric
>> >
>> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
>> > > Eric,
>> > >
>> > > Can you give some details about problem size, use of MPI (or not),
>>type
>> > > of system, etc.
>> > >
>> > > Thanks.
>> > >
>> > > Mike
>> > >
>> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
>> > > > Hello,
>> > > >
>> > > > I'm using AztecOO and ML to solve a linear system. I've been
>> > > > running my simulation in serial mode, but now I would like to
>> > > > take advantage of multiple cores by using the OpenMP support
>> > > > that is available in Trilinos. I realize that the packages I'm
>> > > > using are not fully multi-threaded with OpenMP, but I'm hoping
>> > > > for some performance improvement since some of the packages I'm
>> > > > using have at least some level of OpenMP support.
>> > > >
>> > > > I reconfigured and built Trilinos 10.12.2 with
>> > > >
>> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
>> > > >
>> > > > ...but when I run my simulation I see that it is slower than if I
>> > > > have Trilinos configured without the above option. I have set the
>> > > > environment variable OMP_NUM_THREADS to the desired number of
>> > > > threads.
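For reference, a configure invocation along these lines enables that flag; the source path, build options, and package list below are placeholders for illustration, not taken from the original message:

```shell
# Hypothetical out-of-source configure; adjust the source path and
# package list for your own build.
cmake \
  -D Trilinos_ENABLE_OpenMP:BOOL=ON \
  -D Trilinos_ENABLE_AztecOO:BOOL=ON \
  -D Trilinos_ENABLE_ML:BOOL=ON \
  -D Trilinos_ENABLE_Belos:BOOL=ON \
  -D CMAKE_BUILD_TYPE:STRING=RELEASE \
  /path/to/Trilinos
make -j

# At run time, choose the thread count:
export OMP_NUM_THREADS=4
```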
>> > > >
>> > > > I was also able to reproduce this behavior with one of the
>> > > > Trilinos example programs (attached below), so I suspect I am
>> > > > missing something obvious in using the OpenMP support.
>> > > >
>> > > > Does anybody have thoughts on what I might be missing?
>> > > >
>> > > > Thanks.
>> > > > --Eric
>> >
>> > --
>> > Eric A. Marttila
>> > ThermoAnalytics, Inc.
>> > 23440 Airpark Blvd.
>> > Calumet, MI 49913
>> > email: Eric.Marttila at ThermoAnalytics.com
>> > phone: 810-636-2443
>> > fax: 906-482-9755
>> > web: http://www.thermoanalytics.com
>>
>>
>> _______________________________________________
>> Trilinos-Users mailing list
>> Trilinos-Users at software.sandia.gov
>> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>--
>Eric A. Marttila
>ThermoAnalytics, Inc.
>23440 Airpark Blvd.
>Calumet, MI 49913
>
>email: Eric.Marttila at ThermoAnalytics.com
>phone: 810-636-2443
>fax: 906-482-9755
>web: http://www.thermoanalytics.com
>