[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos

Eric Marttila eric.marttila at thermoanalytics.com
Thu Sep 27 13:15:36 MDT 2012


Mike,
I'm using Trilinos solvers within a monolithic piece of software that has an
integrated GUI, and at the moment I can't easily make use of MPI.  So I was
hoping to get some speedup initially by enabling OpenMP support in Trilinos.

In the future I was planning to use OpenMP with MPI, thinking that would give 
me the best performance on our target platforms, which generally consist of 
small clusters of multicore nodes (e.g. 4-8 machines that are similar to your 
12 core machine).  But from what you describe it sounds like it would be best 
for me to focus my efforts on MPI-only.

--Eric

On Thursday, September 27, 2012 02:38:30 pm Heroux, Michael A wrote:
> Eric,
> 
> Belos is a good choice for an iterative solver.  It is compatible with ML.
> As for threading support, please give me a sense of your motivation for
> using threads.  Is it your intent to use them instead of MPI, or with MPI?
>  Other things being equal, MPI-only is still a better approach than
> MPI+OpenMP or OpenMP-only on most systems, unless you are using very large
> systems like BlueGene or Cray.
> 
> Mike
> 
> On 9/27/12 1:19 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >Thanks Mike,
> >This is very helpful.  I ran your codes on my system and was able to
> >reproduce your timing results fairly closely.
> >
> >Given these results, would you recommend that I use Belos instead of
> >AztecOO and ML?  That is what these performance numbers suggest to me,
> >but many of the real problems I'm solving have 3D diffusion, so I think
> >I may need the type of preconditioning that ML has.  Or are there other
> >(threaded) preconditioners that I could use instead with Belos?
> >
> >--Eric
> >
> >On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
> >> Eric,
> >> 
> >> I took a look at your example code on my 12 core (two hex core Intel
> >> Westmeres) workstation.  There are two primary reasons for not seeing
> >> speedup:
> >>   *   Your timer includes all the problem generation steps.  These are
> >> fairly expensive and not threaded, so you won't see speedup when they are
> >> included (see the timing sketch below).  Also, it appears that even when
> >> you were not using the ML preconditioner, an ILU preconditioner was
> >> invoked, which is also not threaded.
> >>   *   AztecOO doesn't actually use the Epetra vectors in threaded
> >> mode, so only the Matvec is running in parallel.  AztecOO extracts a
> >> pointer to the Epetra data and executes it on a single core.  This is
> >> particularly bad since Epetra has actually mapped the data across the NUMA
> >> memory system for threaded execution.  I actually knew this before, but
> >> forgot.
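> >>
> >> As a concrete illustration of the first point, here is the kind of timing
> >> wrapper I mean: time only the iterative solve with the OpenMP wall-clock
> >> timer and leave problem generation outside.  This is a minimal sketch, not
> >> the attached code; it assumes the AztecOO solver object from the example:
> >>
> >>   #include <omp.h>
> >>   #include <iostream>
> >>   #include "AztecOO.h"
> >>
> >>   // Time only the iterative solve, not matrix or preconditioner setup.
> >>   double timedSolve(AztecOO& solver, int maxIters, double tol)
> >>   {
> >>     double t0 = omp_get_wtime();    // wall-clock time, not CPU time
> >>     solver.Iterate(maxIters, tol);  // only the solve sits inside the timer
> >>     double t1 = omp_get_wtime();
> >>     std::cout << "solve wall time: " << (t1 - t0) << " s" << std::endl;
> >>     return t1 - t0;
> >>   }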
> >> 
> >> Even so, when I removed the preconditioner completely from the AztecOO call
> >> there was some speedup (from 4.13 to 3.0 sec with 1 and 4 threads,
> >> respectively, on 1M equations).
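> >>
> >> In code, dropping the preconditioner amounts to something like the sketch
> >> below.  This is an illustration rather than the exact attachment; A, x, and
> >> b stand for the Epetra matrix and vectors the example already assembles,
> >> and the iteration count and tolerance are placeholders:
> >>
> >>   #include "AztecOO.h"
> >>   #include "Epetra_CrsMatrix.h"
> >>   #include "Epetra_Vector.h"
> >>   #include "Epetra_LinearProblem.h"
> >>
> >>   void solveUnpreconditioned(Epetra_CrsMatrix& A, Epetra_Vector& x,
> >>                              Epetra_Vector& b)
> >>   {
> >>     Epetra_LinearProblem problem(&A, &x, &b);
> >>     AztecOO solver(problem);
> >>     solver.SetAztecOption(AZ_solver, AZ_cg);    // CG, as in the example
> >>     solver.SetAztecOption(AZ_precond, AZ_none); // no ILU, no ML
> >>     solver.Iterate(500, 1.0e-8);                // max iterations, tolerance
> >>   }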
> >> 
> >> Next, I replaced AztecOO with Belos CG.  Belos relies entirely on the
> >> underlying linear algebra library so it performs computations entirely
> >> with Epetra.
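> >>
> >> Structurally the Belos version looks roughly like the sketch below.  Again,
> >> this is not the exact attachment: it assumes the same Epetra objects, the
> >> Belos/Epetra adapter, and placeholder solver parameters:
> >>
> >>   #include "BelosLinearProblem.hpp"
> >>   #include "BelosBlockCGSolMgr.hpp"
> >>   #include "BelosEpetraAdapter.hpp"
> >>   #include "Epetra_CrsMatrix.h"
> >>   #include "Epetra_MultiVector.h"
> >>   #include "Epetra_Operator.h"
> >>   #include "Teuchos_ParameterList.hpp"
> >>   #include "Teuchos_RCP.hpp"
> >>
> >>   typedef Epetra_MultiVector MV;
> >>   typedef Epetra_Operator    OP;
> >>
> >>   void solveWithBelosCG(const Teuchos::RCP<Epetra_CrsMatrix>& A,
> >>                         const Teuchos::RCP<Epetra_MultiVector>& x,
> >>                         const Teuchos::RCP<Epetra_MultiVector>& b)
> >>   {
> >>     // Belos drives the iteration but performs all matrix and vector
> >>     // operations through Epetra, so Epetra's threaded kernels are used.
> >>     Teuchos::RCP<Belos::LinearProblem<double,MV,OP> > problem =
> >>       Teuchos::rcp(new Belos::LinearProblem<double,MV,OP>(A, x, b));
> >>     problem->setProblem();
> >>
> >>     Teuchos::RCP<Teuchos::ParameterList> params =
> >>       Teuchos::rcp(new Teuchos::ParameterList());
> >>     params->set("Maximum Iterations", 500);
> >>     params->set("Convergence Tolerance", 1.0e-8);
> >>
> >>     Belos::BlockCGSolMgr<double,MV,OP> solver(problem, params);
> >>     solver.solve();
> >>   }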
> >> 
> >> With this approach I went from 3.7 to 2.2 sec with 1 and 4 threads,
> >> respectively.  Still not a linear improvement, but better than what you
> >> were seeing.  With 12 threads, I only reduced it to 2.1 seconds.
> >> 
> >> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds, not
> >> too different.  But then with 12 ranks the time went to 1.5 seconds.
> >> 
> >> In my experience it is still the case for current-generation workstations
> >> and small node counts that MPI-only does better than OpenMP threads.  That
> >> is changing, though, and it is already not true on large systems where the
> >> number of MPI ranks becomes a resource issue and where threading and shared
> >> memory let us use different algorithms.  We are also learning better ways
> >> of managing data layout and process mapping, but this is very manual and
> >> very finicky right now.
> >> 
> >> I have attached my two codes.  The first, ml-example, is your original code
> >> with all preconditioning removed.  The second, belos-example, substitutes
> >> Belos for AztecOO.
> >> 
> >> I hope this helps.
> >> 
> >> Best regards,
> >> 
> >> Mike
> >> 
> >> On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >> 
> >> Mike,
> >> I have been using ML, but changed my code to run the simulation without ML,
> >> as you suggested.  It takes much longer to run this way (about 25 seconds
> >> instead of ~3) but does give some very slight speedup for 2 threads, then
> >> gets slower again for 4 and 8 threads.  The timing results below are wall
> >> time using the function omp_get_wtime().  I'm using the CG option in
> >> AztecOO (not GMRES).
> >> 
> >> OMP_NUM_THREADS   Wall time for solve (seconds)
> >> ---------------   -----------------------------
> >> 1                 24.6308
> >> 2                 24.0489
> >> 4                 25.9732
> >> 8                 28.6976
> >> 
> >> --Eric
> >> 
> >> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
> >> Here are a few comments:
> >> - Your problem size is certainly sufficient to realize some parallel
> >> speedup.
> >> - ML (which I assume you are using) will not see any improvement from
> >> OpenMP parallelism.  It is not instrumented for it.
> >> - This means that the only possible parallelism is in the sparse MV and
> >> vector updates.  Since ML is usually more than 50% of the total runtime,
> >> you won't see a lot of improvement from threading, even when other issues
> >> are resolved.
> >> A few suggestions:
> >> - Try to run your environment without ML, just to see if you get any
> >> improvement in the SpMV and vector operations.
> >> - If you are using GMRES, make sure you link with a threaded BLAS.  DGEMV
> >> is the main kernel of GMRES other than SpMV and will need to be executed
> >> in threaded mode.
> >> - Make sure your timer is a wall-clock timer, not a CPU timer.  A
> >> reasonable timer is the one that comes with OpenMP (a quick check is
> >> sketched below).
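> >> As a sanity check that the OpenMP runtime is active and honoring
> >> OMP_NUM_THREADS, a trivial standalone program along these lines can help.
> >> It is generic OpenMP, nothing Trilinos-specific:
> >>
> >>   #include <omp.h>
> >>   #include <iostream>
> >>
> >>   int main() {
> >>     // Reports what OMP_NUM_THREADS (or the default) resolves to.
> >>     std::cout << "omp_get_max_threads() = "
> >>               << omp_get_max_threads() << std::endl;
> >>     #pragma omp parallel
> >>     {
> >>       #pragma omp single
> >>       std::cout << "threads in parallel region: "
> >>                 << omp_get_num_threads() << std::endl;
> >>     }
> >>     return 0;
> >>   }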
> >> I hope this helps.  Let me know what you find out.
> >> Mike
> >> On Sep 19, 2012, at 8:39 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >> > Mike,
> >> > The problem size is 1 million unknowns. I have Trilinos compiled with MPI
> >> > enabled. However, I'm launching my program with only one MPI process.
> >> > Here is some system information:
> >> > Processors: Dual Intel Xeon E5645 Hex-Core / 2.4 GHz / 12 MB cache
> >> > RAM: 96 GB
> >> > OS: CentOS 6.2, 64-bit
> >> > When solving for 1 million unknowns on this system, AztecOO reports the
> >> > following solution times:
> >> > Solution time: 3.3 seconds (using Trilinos with OpenMP disabled)
> >> > Solution time: 4.0 seconds (using Trilinos with OpenMP enabled)
> >> > I had OMP_NUM_THREADS set to 4.
> >> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
> >> > Thanks for your help.
> >> > --Eric
> >> > 
> >> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
> >> > > Eric,
> >> > > 
> >> > > Can you give some details about problem size, use of MPI (or not), type
> >> > > of system, etc.?
> >> > > 
> >> > > Thanks.
> >> > > 
> >> > > Mike
> >> > > 
> >> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
> >> > > > Hello,
> >> > > > 
> >> > > > I'm using AztecOO and ML to solve a linear system. I've been running
> >> > > > my simulation in serial mode, but now I would like to take advantage
> >> > > > of multiple cores by using the OpenMP support that is available in
> >> > > > Trilinos. I realize that the packages I'm using are not fully
> >> > > > multi-threaded with OpenMP, but I'm hoping for some performance
> >> > > > improvement since some of the packages I'm using have at least some
> >> > > > level of OpenMP support.
> >> > > > 
> >> > > > I reconfigured and built Trilinos 10.12.2 with
> >> > > > 
> >> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
> >> > > > 
> >> > > > ...but when I run my simulation I see that it is slower than if I
> >> > > > have Trilinos configured without the above option. I have set the
> >> > > > environment variable OMP_NUM_THREADS to the desired number of
> >> > > > threads.
> >> > > > 
> >> > > > I was also able to reproduce this behavior with one of the Trilinos
> >> > > > example programs (attached below), so I suspect I am missing something
> >> > > > obvious in using the OpenMP support.
> >> > > > 
> >> > > > Does anybody have thoughts of what I might be missing?
> >> > > > 
> >> > > > Thanks.
> >> > > > --Eric
> >> > 

-- 
Eric A. Marttila
ThermoAnalytics, Inc.
23440 Airpark Blvd.
Calumet, MI 49913

email: Eric.Marttila at ThermoAnalytics.com
phone: 810-636-2443
fax:   906-482-9755
web: http://www.thermoanalytics.com


