[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos
Eric Marttila
eric.marttila at thermoanalytics.com
Thu Sep 27 12:19:31 MDT 2012
Thanks Mike,
This is very helpful. I ran your codes on my system and was able to reproduce
your timing results fairly closely.
Given these results, would you recommend that I use Belos instead of AztecOO
and ML? That is what these performance numbers suggest to me, but many of
the real problems I'm solving have 3D diffusion, so I think I may need the type
of preconditioning that ML has. Or are there other (threaded)
preconditioners that I could use instead with Belos?
--Eric
On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
> Eric,
>
> I took a look at your example code on my 12 core (two hex core Intel
> Westmeres) workstation. There are two primary reasons for not seeing
> speedup:
>
>
> *   Your timer includes all of the problem generation steps. These are
>     fairly expensive and not threaded, so you won't see speedup when they
>     are included. Also, it appears that even when you were not using the
>     ML preconditioner, an ILU preconditioner was invoked, which is also
>     not threaded.
> *   AztecOO doesn't actually use the Epetra vectors in threaded mode, so
>     only the matvec runs in parallel. AztecOO extracts a pointer to the
>     Epetra data and executes on a single core. This is particularly bad
>     since Epetra has actually mapped the data across the NUMA memory
>     system for threaded execution. I actually knew this before, but
>     forgot.
>
> Even so, when I removed the preconditioner completely from the AztecOO
> call there was some speedup (from 4.13 to 3.0 seconds with 1 and 4
> threads, respectively, on 1M equations).
>
> Next, I replaced AztecOO with Belos CG. Belos relies on the underlying
> linear algebra library for its computational kernels, so it performs all
> of its computations with Epetra.
>
> With this approach I went from 3.7 to 2.2 seconds with 1 and 4 threads,
> respectively. Still not a linear improvement, but better than what you
> were seeing. With 12 threads, I only reduced it to 2.1 seconds.
>
> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds, not
> too different. But then with 12 ranks the time went to 1.5 seconds.
>
> In my experience, it is still the case for current-generation
> workstations and small node counts that MPI-only does better than OpenMP
> threads. This is changing, though, and is already not true on large
> systems, where the number of MPI ranks becomes a resource issue and where
> threading and shared memory let us use different algorithms. We are also
> learning better ways of managing data layout and process mapping, but
> right now this is very manual and very finicky.
>
> I have attached my two codes. The first, ml-example, is your original code
> with all preconditioning removed. The second, belos-example, substitutes
> Belos for AztecOO.
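The attachments do not survive in the archive, but the AztecOO-to-Belos substitution described above typically looks roughly like the sketch below. This is illustrative only, under the assumption that the matrix, solution vector, and right-hand side have already been built as Epetra objects; the function name, tolerances, and iteration limit are placeholders, not taken from the actual attached code.

```cpp
// Sketch: unpreconditioned Belos CG over Epetra objects,
// mirroring the substitution described in the message above.
#include "Epetra_MultiVector.h"
#include "Epetra_Operator.h"
#include "BelosLinearProblem.hpp"
#include "BelosBlockCGSolMgr.hpp"
#include "BelosEpetraAdapter.hpp"
#include "Teuchos_ParameterList.hpp"
#include "Teuchos_RCP.hpp"

typedef double             ST;  // scalar type
typedef Epetra_MultiVector MV;  // multivector type
typedef Epetra_Operator    OP;  // operator type

// Hypothetical helper: solve A x = b with Belos block CG.
Belos::ReturnType
solveWithBelosCG(const Teuchos::RCP<Epetra_Operator>& A,
                 const Teuchos::RCP<Epetra_MultiVector>& x,
                 const Teuchos::RCP<const Epetra_MultiVector>& b)
{
  using Teuchos::RCP;
  using Teuchos::rcp;

  // Wrap the system in a Belos LinearProblem; no preconditioner is
  // set, so this is unpreconditioned CG.
  RCP<Belos::LinearProblem<ST, MV, OP> > problem =
      rcp(new Belos::LinearProblem<ST, MV, OP>(A, x, b));
  problem->setProblem();

  // Illustrative solver parameters.
  RCP<Teuchos::ParameterList> params = rcp(new Teuchos::ParameterList());
  params->set("Maximum Iterations", 500);
  params->set("Convergence Tolerance", 1.0e-8);

  // Belos performs all vector operations through Epetra, which is
  // where the OpenMP threading lives.
  Belos::BlockCGSolMgr<ST, MV, OP> solver(problem, params);
  return solver.solve();  // Belos::Converged or Belos::Unconverged
}
```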
>
> I hope this helps.
>
> Best regards,
>
> Mike
>
> On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
>
> Mike,
> I have been using ML, but changed my code to run the simulation without ML
> as you suggested. It takes much longer to run this way (about 25 seconds
> instead of ~3) but does give some very slight speedup for 2 threads, then
> gets slower again for 4 and 8 threads. The timing results below are wall
> time using the function omp_get_wtime(). I'm using the CG option in
> AztecOO (not GMRES).
>
> OMP_NUM_THREADS    Wall time for solve (seconds)
> ---------------    -----------------------------
>        1                24.6308
>        2                24.0489
>        4                25.9732
>        8                28.6976
>
> --Eric
>
> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
> Here are a few comments:
>
> - Your problem size is certainly sufficient to realize some parallel
>   speedup.
> - ML (which I assume you are using) will not see any improvement from
>   OpenMP parallelism. It is not instrumented for it.
> - This means that the only possible parallelism is in the sparse MV and
>   vector updates. Since ML is usually more than 50% of the total
>   runtime, you won't see a lot of improvement from threading, even when
>   other issues are resolved.
>
> A few suggestions:
>
> - Try running without ML, just to see if you get any improvement in the
>   SpMV and vector operations.
> - If you are using GMRES, make sure you link with a threaded BLAS.
>   DGEMV is the main kernel of GMRES other than SpMV and will need to be
>   executed in threaded mode.
> - Make sure your timer is a wall-clock timer, not a CPU timer. A
>   reasonable timer is the one that comes with OpenMP.
> I hope this helps. Let me know what you find out.
> Mike
> On Sep 19, 2012, at 8:39 PM, "Eric Marttila"
> <eric.marttila at thermoanalytics.com> wrote:
> > Mike,
> > The problem size is 1 million unknowns. I have Trilinos compiled with MPI
> > enabled. However, I'm launching my program with only one MPI process.
> > Here is some system information:
> >   Processors: dual Intel Xeon E5645 hex-core / 2.4 GHz / 12 MB cache
> >   RAM: 96 GB
> >   OS: CentOS 6.2, 64-bit
> > When solving for 1 million unknowns on this system, AztecOO reports
> > the following solution times:
> >   3.3 seconds (Trilinos with OpenMP disabled)
> >   4.0 seconds (Trilinos with OpenMP enabled)
> > I had OMP_NUM_THREADS set to 4.
> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
> > Thanks for your help.
> > --Eric
> >
> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
> > > Eric,
> > >
> > > Can you give some details about problem size, use of MPI (or not), type
> > > of system, etc.
> > >
> > > Thanks.
> > >
> > > Mike
> > >
> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
> > > > Hello,
> > > >
> > > > I'm using AztecOO and ML to solve a linear system. I've been
> > > > running my simulation in serial mode, but now I would like to take
> > > > advantage of multiple cores by using the OpenMP support that is
> > > > available in Trilinos. I realize that the packages I'm using are
> > > > not fully multi-threaded with OpenMP, but I'm hoping for a
> > > > performance improvement since they have at least some level of
> > > > OpenMP support.
> > > >
> > > > I reconfigured and built Trilinos 10.12.2 with
> > > >
> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
> > > >
> > > > ...but when I run my simulation I see that it is slower than if I
> > > > have Trilinos configured without the above option. I have set the
> > > > environment variable OMP_NUM_THREADS to the desired number of
> > > > threads.
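A configure-and-run sequence along these lines might look like the following sketch. The source path, extra package options, and executable name are placeholders, not taken from the original message; only the `Trilinos_ENABLE_OpenMP` flag comes from the text above.

```shell
# Hypothetical out-of-source configure of Trilinos 10.12.2 with OpenMP.
cmake \
  -D CMAKE_BUILD_TYPE:STRING=RELEASE \
  -D Trilinos_ENABLE_OpenMP:BOOL=ON \
  -D Trilinos_ENABLE_AztecOO:BOOL=ON \
  -D Trilinos_ENABLE_ML:BOOL=ON \
  /path/to/trilinos-10.12.2-source
make -j 4

# At run time, choose the thread count per process:
export OMP_NUM_THREADS=4
./my_simulation
```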
> > > >
> > > > I was also able to reproduce this behavior with one of the
> > > > Trilinos example programs (attached below), so I suspect I am
> > > > missing something obvious in using the OpenMP support.
> > > >
> > > > Does anybody have thoughts on what I might be missing?
> > > >
> > > > Thanks.
> > > > --Eric
> >
> > --
> > Eric A. Marttila
> > ThermoAnalytics, Inc.
> > 23440 Airpark Blvd.
> > Calumet, MI 49913
> > email: Eric.Marttila at ThermoAnalytics.com
> > phone: 810-636-2443
> > fax: 906-482-9755
> > web: http://www.thermoanalytics.com
>
> _______________________________________________
> Trilinos-Users mailing list
> Trilinos-Users at software.sandia.gov
> http://software.sandia.gov/mailman/listinfo/trilinos-users
--
Eric A. Marttila
ThermoAnalytics, Inc.
23440 Airpark Blvd.
Calumet, MI 49913
email: Eric.Marttila at ThermoAnalytics.com
phone: 810-636-2443
fax: 906-482-9755
web: http://www.thermoanalytics.com