[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos
Eric Marttila
eric.marttila at thermoanalytics.com
Thu Sep 27 12:19:31 MDT 2012
Thanks Mike,
This is very helpful. I ran your codes on my system and was able to reproduce
your timing results fairly closely.
Given these results, would you recommend that I use Belos instead of AztecOO
and ML? That is what these performance numbers suggest to me, but many of
the real problems I'm solving have 3D diffusion, so I think I may need the type
of preconditioning that ML has. Or are there other (threaded)
preconditioners that I could use instead with Belos?
--Eric
On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
> Eric,
>
> I took a look at your example code on my 12 core (two hex core Intel
> Westmeres) workstation. There are two primary reasons for not seeing
> speedup:
>
>
> *   Your timer includes all of the problem generation steps. These are
>     fairly expensive and not threaded, so you won't see speedup when they
>     are included. Also, it appears that even when you were not using the
>     ML preconditioner, an ILU preconditioner was invoked, which is also
>     not threaded.
> *   AztecOO doesn't actually use the Epetra vectors in threaded mode, so
>     only the matvec runs in parallel. AztecOO extracts a pointer to the
>     Epetra data and executes on a single core. This is particularly bad
>     since Epetra has actually mapped the data across the NUMA memory
>     system for threaded execution. I actually knew this before, but
>     forgot.
>
> Even so, when I removed the preconditioner completely from the AztecOO
> call there was some speedup (from 4.13 to 3.0 seconds with 1 and 4
> threads, respectively, on 1M equations).
>
> Next, I replaced AztecOO with Belos CG. Belos relies on the underlying
> linear algebra library for its computational kernels, so it performs all
> of its computations with Epetra.
>
> With this approach I went from 3.7 to 2.2 seconds with 1 and 4 threads,
> respectively. Still not a linear improvement, but better than what you
> were seeing. With 12 threads, I only reduced it to 2.1 seconds.
>
> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds, not
> too different. But then with 12 ranks the time went to 1.5 seconds.
>
> In my experience, it is still the case for current-generation
> workstations and small node counts that MPI-only does better than OpenMP
> threads. This is changing, though, and is already not true on large
> systems, where the number of MPI ranks becomes a resource issue and where
> threading and shared memory let us use different algorithms. We are also
> learning better ways of managing data layout and process mapping, but
> right now this is very manual and very finicky.
>
> I have attached my two codes. The first, ml-example, is your original code
> with all preconditioning removed. The second, belos-example, substitutes
> Belos for AztecOO.
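The attachments do not survive in the archive, but the AztecOO-to-Belos substitution described above typically looks roughly like the sketch below. This is illustrative only, under the assumption that the matrix, solution vector, and right-hand side have already been built as Epetra objects; the function name, tolerances, and iteration limit are placeholders, not taken from the actual attached code.

```cpp
// Sketch: unpreconditioned Belos CG over Epetra objects,
// mirroring the substitution described in the message above.
#include "Epetra_MultiVector.h"
#include "Epetra_Operator.h"
#include "BelosLinearProblem.hpp"
#include "BelosBlockCGSolMgr.hpp"
#include "BelosEpetraAdapter.hpp"
#include "Teuchos_ParameterList.hpp"
#include "Teuchos_RCP.hpp"

typedef double             ST;  // scalar type
typedef Epetra_MultiVector MV;  // multivector type
typedef Epetra_Operator    OP;  // operator type

// Hypothetical helper: solve A x = b with Belos block CG.
Belos::ReturnType
solveWithBelosCG(const Teuchos::RCP<Epetra_Operator>& A,
                 const Teuchos::RCP<Epetra_MultiVector>& x,
                 const Teuchos::RCP<const Epetra_MultiVector>& b)
{
  using Teuchos::RCP;
  using Teuchos::rcp;

  // Wrap the system in a Belos LinearProblem; no preconditioner is
  // set, so this is unpreconditioned CG.
  RCP<Belos::LinearProblem<ST, MV, OP> > problem =
      rcp(new Belos::LinearProblem<ST, MV, OP>(A, x, b));
  problem->setProblem();

  // Illustrative solver parameters.
  RCP<Teuchos::ParameterList> params = rcp(new Teuchos::ParameterList());
  params->set("Maximum Iterations", 500);
  params->set("Convergence Tolerance", 1.0e-8);

  // Belos performs all vector operations through Epetra, which is
  // where the OpenMP threading lives.
  Belos::BlockCGSolMgr<ST, MV, OP> solver(problem, params);
  return solver.solve();  // Belos::Converged or Belos::Unconverged
}
```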
>
> I hope this helps.
>
> Best regards,
>
> Mike
>
> On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
>
> Mike,
> I have been using ML, but changed my code to run the simulation without ML
> as you suggested. It takes much longer to run this way (about 25 seconds
> instead of ~3) but does give some very slight speedup for 2 threads, then
> gets slower again for 4 and 8 threads. The timing results below are wall
> time using the function omp_get_wtime(). I'm using the CG option in
> AztecOO (not GMRES).
>
> OMP_NUM_THREADS    Wall time for solve (seconds)
> ---------------    -----------------------------
>        1                24.6308
>        2                24.0489
>        4                25.9732
>        8                28.6976
>
> --Eric
>
> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
> Here are a few comments:
>
> - Your problem size is certainly sufficient to realize some parallel
>   speedup.
> - ML (which I assume you are using) will not see any improvement from
>   OpenMP parallelism. It is not instrumented for it.
> - This means that the only possible parallelism is in the sparse MV and
>   vector updates. Since ML is usually more than 50% of the total
>   runtime, you won't see a lot of improvement from threading, even when
>   other issues are resolved.
>
> A few suggestions:
>
> - Try running without ML, just to see if you get any improvement in the
>   SpMV and vector operations.
> - If you are using GMRES, make sure you link with a threaded BLAS.
>   DGEMV is the main kernel of GMRES other than SpMV and will need to be
>   executed in threaded mode.
> - Make sure your timer is a wall-clock timer, not a CPU timer. A
>   reasonable timer is the one that comes with OpenMP.
> I hope this helps. Let me know what you find out.
> Mike
> On Sep 19, 2012, at 8:39 PM, "Eric Marttila"
> <eric.marttila at thermoanalytics.com> wrote:
> > Mike,
> > The problem size is 1 million unknowns. I have Trilinos compiled with MPI
> > enabled. However, I'm launching my program with only one MPI process.
> > Here is some system information:
> >   Processors: dual Intel Xeon E5645 hex-core / 2.4 GHz / 12 MB cache
> >   RAM: 96 GB
> >   OS: CentOS 6.2, 64-bit
> > When solving for 1 million unknowns on this system, AztecOO reports
> > the following solution times:
> >   3.3 seconds (Trilinos with OpenMP disabled)
> >   4.0 seconds (Trilinos with OpenMP enabled)
> > I had OMP_NUM_THREADS set to 4.
> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
> > Thanks for your help.
> > --Eric
> >
> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
> > > Eric,
> > >
> > > Can you give some details about problem size, use of MPI (or not), type
> > > of system, etc.
> > >
> > > Thanks.
> > >
> > > Mike
> > >
> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
> > > > Hello,
> > > >
> > > > I'm using AztecOO and ML to solve a linear system. I've been
> > > > running my simulation in serial mode, but now I would like to take
> > > > advantage of multiple cores by using the OpenMP support that is
> > > > available in Trilinos. I realize that the packages I'm using are
> > > > not fully multi-threaded with OpenMP, but I'm hoping for a
> > > > performance improvement since they have at least some level of
> > > > OpenMP support.
> > > >
> > > > I reconfigured and built Trilinos 10.12.2 with
> > > >
> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
> > > >
> > > > ...but when I run my simulation I see that it is slower than if I
> > > > have Trilinos configured without the above option. I have set the
> > > > environment variable OMP_NUM_THREADS to the desired number of
> > > > threads.
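A configure-and-run sequence along these lines might look like the following sketch. The source path, extra package options, and executable name are placeholders, not taken from the original message; only the `Trilinos_ENABLE_OpenMP` flag comes from the text above.

```shell
# Hypothetical out-of-source configure of Trilinos 10.12.2 with OpenMP.
cmake \
  -D CMAKE_BUILD_TYPE:STRING=RELEASE \
  -D Trilinos_ENABLE_OpenMP:BOOL=ON \
  -D Trilinos_ENABLE_AztecOO:BOOL=ON \
  -D Trilinos_ENABLE_ML:BOOL=ON \
  /path/to/trilinos-10.12.2-source
make -j 4

# At run time, choose the thread count per process:
export OMP_NUM_THREADS=4
./my_simulation
```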
> > > >
> > > > I was also able to reproduce this behavior with one of the
> > > > Trilinos example programs (attached below), so I suspect I am
> > > > missing something obvious in using the OpenMP support.
> > > >
> > > > Does anybody have thoughts on what I might be missing?
> > > >
> > > > Thanks.
> > > > --Eric
> >
> > --
> > Eric A. Marttila
> > ThermoAnalytics, Inc.
> > 23440 Airpark Blvd.
> > Calumet, MI 49913
> > email: Eric.Marttila at ThermoAnalytics.com
> > phone: 810-636-2443
> > fax: 906-482-9755
> > web: http://www.thermoanalytics.com
>
> _______________________________________________
> Trilinos-Users mailing list
> Trilinos-Users at software.sandia.gov
> http://software.sandia.gov/mailman/listinfo/trilinos-users
--
Eric A. Marttila
ThermoAnalytics, Inc.
23440 Airpark Blvd.
Calumet, MI 49913
email: Eric.Marttila at ThermoAnalytics.com
phone: 810-636-2443
fax: 906-482-9755
web: http://www.thermoanalytics.com