[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos

Eric Marttila eric.marttila at thermoanalytics.com
Thu Sep 27 14:42:18 MDT 2012


Thanks Mike,
I will try that approach.
--Eric

On Thursday, September 27, 2012 03:21:49 pm Heroux, Mike wrote:
> Eric,
> 
> I think a reasonable approach would be to insert MPI into the application
> but only make non-trivial use of the additional ranks (other than rank 0)
> when using the solver.  Epetra provides very nice data redistribution
> tools that allow you to take data objects that are on rank 0 and spread
> them out across all the other processors.  You should see a significant
> performance improvement for a 3D diffusion problem by doing this.
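> 
> In outline, the redistribution pattern looks something like this (an
> untested sketch; the map and vector names are illustrative, not from an
> attached example):
> 
>   #include "Epetra_MpiComm.h"
>   #include "Epetra_Map.h"
>   #include "Epetra_Vector.h"
>   #include "Epetra_Import.h"
> 
>   // Sketch: move a vector that lives entirely on rank 0 onto a map
>   // that spreads the rows evenly across all ranks.
>   void spreadFromRank0(const Epetra_MpiComm &comm, int numGlobalRows) {
>     // Source map: rank 0 owns everything (where the GUI builds the data).
>     int numMyRows = (comm.MyPID() == 0) ? numGlobalRows : 0;
>     Epetra_Map rank0Map(numGlobalRows, numMyRows, 0, comm);
> 
>     // Target map: rows distributed uniformly across all ranks.
>     Epetra_Map distMap(numGlobalRows, 0, comm);
> 
>     Epetra_Vector xRank0(rank0Map);  // filled on rank 0 by the application
>     Epetra_Vector xDist(distMap);    // distributed copy for the solver
> 
>     // The importer moves data from the rank-0 layout to the distributed
>     // layout; the same pattern works for an Epetra_CrsMatrix.
>     Epetra_Import importer(distMap, rank0Map);
>     xDist.Import(xRank0, importer, Insert);
>   }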
> 
> Mike
> 
> On 9/27/12 2:15 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >Mike,
> >I'm using Trilinos solvers within a monolithic piece of software that has
> >an integrated GUI, and at the moment I can't easily make use of MPI.  So I
> >was hoping to get some speedup initially by enabling OpenMP support in
> >Trilinos.
> >
> >In the future I was planning to use OpenMP with MPI, thinking that would
> >give me the best performance on our target platforms, which generally
> >consist of small clusters of multicore nodes (e.g. 4-8 machines that are
> >similar to your 12 core machine).  But from what you describe it sounds
> >like it would be best for me to focus my efforts on MPI-only.
> >
> >--Eric
> >
> >On Thursday, September 27, 2012 02:38:30 pm Heroux, Michael A wrote:
> >> Eric,
> >> 
> >> Belos is a good choice for an iterative solver.  It is compatible with
> >> ML.  As for threading support, please give me a sense of your motivation
> >> for using threads.  Is it your intent to use them instead of MPI, or
> >> with MPI?
> >>
> >> Other things being equal, MPI-only is still a better approach than
> >> MPI+OpenMP or OpenMP-only on most systems, unless you are using very
> >> large systems like BlueGene or Cray.
> >> 
> >> Mike
> >> 
> >> On 9/27/12 1:19 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >> >Thanks Mike,
> >> >This is very helpful.  I ran your codes on my system and was able to
> >> >reproduce your timing results fairly closely.
> >> >
> >> >Given these results, would you recommend that I use Belos instead of
> >> >AztecOO and ML?  That is what these performance numbers suggest to me,
> >> >but many of the real problems I'm solving have 3D diffusion, so I think
> >> >I may need the type of preconditioning that ML has.  Or are there other
> >> >(threaded) preconditioners that I could use instead with Belos?
> >> >
> >> >--Eric
> >> >
> >> >On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
> >> >> Eric,
> >> >> 
> >> >> I took a look at your example code on my 12 core (two hex core Intel
> >> >> Westmeres) workstation.  There are two primary reasons for not seeing
> >> >> speedup:
> >> >>   *   Your timer includes all the problem generation steps.  These
> >> >> are fairly expensive and not threaded, so you won't see speedup when
> >> >> they are included.  Also, it appears that even when you were not
> >> >> using the ML preconditioner, an ILU preconditioner was invoked, which
> >> >> is also not threaded.
> >> >>   *   AztecOO doesn't actually use the Epetra vectors in threaded
> >> >> mode, so only the Matvec is running in parallel.  AztecOO extracts a
> >> >> pointer to the Epetra data and executes it on a single core.  This is
> >> >> particularly bad since Epetra has actually mapped the data across the
> >> >> NUMA memory system for threaded execution.  I actually knew this
> >> >> before, but forgot.
> >> >>
> >> >> Even so, when I removed the preconditioner completely from the
> >> >> AztecOO call there was some speedup (from 4.13 to 3.0 sec with 1 and
> >> >> 4 threads, respectively, on 1M equations).
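> >> >>
> >> >> Turning the preconditioner off in AztecOO is a one-line option
> >> >> change.  A minimal sketch (the solver object name is illustrative):
> >> >>
> >> >>   #include "AztecOO.h"
> >> >>
> >> >>   // Sketch: run AztecOO's CG with no preconditioner at all, so the
> >> >>   // default (unthreaded) ILU-type preconditioner is never invoked.
> >> >>   void configureUnpreconditionedCG(AztecOO &solver) {
> >> >>     solver.SetAztecOption(AZ_solver, AZ_cg);     // CG iteration
> >> >>     solver.SetAztecOption(AZ_precond, AZ_none);  // no preconditioning
> >> >>   }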
> >> >> 
> >> >> Next, I replaced AztecOO with Belos CG.  Belos relies entirely on the
> >> >> underlying linear algebra library, so it performs computations
> >> >> entirely with Epetra.
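> >> >>
> >> >> The substitution looks roughly like this (an untested sketch, not the
> >> >> attached belos-example itself; the matrix and vector arguments are
> >> >> illustrative):
> >> >>
> >> >>   #include "BelosLinearProblem.hpp"
> >> >>   #include "BelosBlockCGSolMgr.hpp"
> >> >>   #include "BelosEpetraAdapter.hpp"
> >> >>   #include "Epetra_CrsMatrix.h"
> >> >>   #include "Epetra_MultiVector.h"
> >> >>   #include "Epetra_Operator.h"
> >> >>   #include "Teuchos_ParameterList.hpp"
> >> >>   #include "Teuchos_RCP.hpp"
> >> >>
> >> >>   typedef Epetra_MultiVector MV;
> >> >>   typedef Epetra_Operator    OP;
> >> >>
> >> >>   // Sketch: solve Ax = b with Belos CG; all vector kernels go
> >> >>   // through Epetra, which is where the OpenMP threading lives.
> >> >>   void solveWithBelosCG(const Teuchos::RCP<Epetra_CrsMatrix> &A,
> >> >>                         const Teuchos::RCP<Epetra_MultiVector> &x,
> >> >>                         const Teuchos::RCP<Epetra_MultiVector> &b) {
> >> >>     Teuchos::RCP<Belos::LinearProblem<double,MV,OP> > problem =
> >> >>       Teuchos::rcp(new Belos::LinearProblem<double,MV,OP>(A, x, b));
> >> >>     problem->setProblem();
> >> >>
> >> >>     Teuchos::RCP<Teuchos::ParameterList> params =
> >> >>       Teuchos::rcp(new Teuchos::ParameterList());
> >> >>     params->set("Convergence Tolerance", 1.0e-8);
> >> >>     params->set("Maximum Iterations", 500);
> >> >>
> >> >>     Belos::BlockCGSolMgr<double,MV,OP> solver(problem, params);
> >> >>     Belos::ReturnType ret = solver.solve();
> >> >>     (void)ret;  // ret == Belos::Converged on success
> >> >>   }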
> >> >> 
> >> >> With this approach I went from 3.7 to 2.2 sec with 1 and 4 threads,
> >> >> respectively.  Still not a linear improvement, but better than what
> >> >> you were seeing.  With 12 threads, I only reduced it to 2.1 seconds.
> >> >>
> >> >> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds,
> >> >> not too different.  But then with 12 ranks the time went to 1.5
> >> >> seconds.
> >> >> 
> >> >> In my experience it is still the case for current generation
> >> >> workstations and small node counts that MPI-only does better than
> >> >> OpenMP threads.  This is changing, though, and it is already not true
> >> >> on large systems, where the number of MPI ranks becomes a resource
> >> >> issue and where threading and shared memory let us use different
> >> >> algorithms.  We are also learning better ways of managing data layout
> >> >> and process mapping, but this is very manual right now and very
> >> >> finicky.
> >> >> 
> >> >> I have attached my two codes.  The first, ml-example, is your
> >> >> original code with all preconditioning removed.  The second,
> >> >> belos-example, substitutes Belos for AztecOO.
> >> >> 
> >> >> I hope this helps.
> >> >> 
> >> >> Best regards,
> >> >> 
> >> >> Mike
> >> >> 
> >> >> On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >> >> 
> >> >> Mike,
> >> >> I have been using ML, but changed my code to run the simulation
> >> >> without ML as you suggested.  It takes much longer to run this way
> >> >> (about 25 seconds instead of ~3) but does give some very slight
> >> >> speedup for 2 threads, then gets slower again for 4 and 8 threads.
> >> >> The timing results below are wall time using the function
> >> >> omp_get_wtime().  I'm using the CG option in AztecOO (not GMRES).
> >> >> 
> >> >> OMP_NUM_THREADS    Wall time for solve (seconds)
> >> >> ---------------    -----------------------------
> >> >> 1                  24.6308
> >> >> 2                  24.0489
> >> >> 4                  25.9732
> >> >> 8                  28.6976
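> >> >>
> >> >> The measurement pattern is essentially the following (a sketch; the
> >> >> solver object, iteration limit, and tolerance names are illustrative):
> >> >>
> >> >>   #include <omp.h>
> >> >>   #include <cstdio>
> >> >>   #include "AztecOO.h"
> >> >>
> >> >>   // Wall-clock timing of just the solve, excluding problem setup.
> >> >>   double timedSolve(AztecOO &solver, int maxIters, double tol) {
> >> >>     double t0 = omp_get_wtime();
> >> >>     solver.Iterate(maxIters, tol);
> >> >>     double t1 = omp_get_wtime();
> >> >>     std::printf("Wall time for solve: %g seconds\n", t1 - t0);
> >> >>     return t1 - t0;
> >> >>   }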
> >> >> 
> >> >> --Eric
> >> >> 
> >> >> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
> >> >> Here are a few comments:
> >> >> - Your problem size is certainly sufficient to realize some parallel
> >> >> speedup.
> >> >> - ML (which I assume you are using) will not see any improvement from
> >> >> OpenMP parallelism.  It is not instrumented for it.
> >> >> - This means that the only possible parallelism is in the sparse MV
> >> >> and vector updates.  Since ML is usually more than 50% of the total
> >> >> runtime, you won't see a lot of improvement from threading, even when
> >> >> other issues are resolved.
> >> >> A few suggestions:
> >> >> - Try to run your environment without ML, just to see if you get any
> >> >> improvement in the SpMV and vector operations.
> >> >> - If you are using GMRES, make sure you link with a threaded BLAS.
> >> >> DGEMV is the main kernel of GMRES other than SpMV and will need to be
> >> >> executed in threaded mode.
> >> >> - Make sure your timer is a wall-clock timer, not a CPU timer.  A
> >> >> reasonable timer is the one that comes with OpenMP.
> >> >> I hope this helps.  Let me know what you find out.
> >> >> Mike
> >> >> On Sep 19, 2012, at 8:39 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
> >> >> > Mike,
> >> >> > The problem size is 1 million unknowns.  I have Trilinos compiled
> >> >> > with MPI enabled.  However, I'm launching my program with only one
> >> >> > MPI process.
> >> >> > Here is some system information:
> >> >> > Processors: Dual Intel Xeon E5645 Hex-Core / 2.4 GHz / Cache: 12MB
> >> >> > RAM: 96 GB
> >> >> > OS: CentOS 6.2 64bit
> >> >> > When solving for 1 million unknowns on this system, AztecOO reports
> >> >> > the following solution times:
> >> >> > Solution time: 3.3 seconds (using Trilinos with OpenMP disabled)
> >> >> > Solution time: 4.0 seconds (using Trilinos with OpenMP enabled)
> >> >> > I had OMP_NUM_THREADS set to 4.
> >> >> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
> >> >> > Thanks for your help.
> >> >> > --Eric
> >> >> > 
> >> >> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
> >> >> > > Eric,
> >> >> > > 
> >> >> > > Can you give some details about problem size, use of MPI (or
> >> >> > > not), type of system, etc.
> >> >> > > 
> >> >> > > Thanks.
> >> >> > > 
> >> >> > > Mike
> >> >> > > 
> >> >> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
> >> >> > > > Hello,
> >> >> > > > 
> >> >> > > > I'm using AztecOO and ML to solve a linear system.  I've been
> >> >> > > > running my simulation in serial mode, but now I would like to
> >> >> > > > take advantage of multiple cores by using the OpenMP support
> >> >> > > > that is available in Trilinos.  I realize that the packages I'm
> >> >> > > > using are not fully multi-threaded with OpenMP, but I'm hoping
> >> >> > > > for some performance improvement since some of the packages I'm
> >> >> > > > using have at least some level of OpenMP support.
> >> >> > > > 
> >> >> > > > I reconfigured and built Trilinos 10.12.2 with
> >> >> > > > 
> >> >> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
> >> >> > > > 
> >> >> > > > ...but when I run my simulation I see that it is slower than if
> >> >> > > > I have Trilinos configured without the above option.  I have set
> >> >> > > > the environment variable OMP_NUM_THREADS to the desired number
> >> >> > > > of threads.
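> >> >> > > > 
> >> >> > > > A minimal check along these lines (illustrative, not part of
> >> >> > > > the attached example) confirms how many threads the OpenMP
> >> >> > > > runtime actually sees:
> >> >> > > > 
> >> >> > > >   #include <omp.h>
> >> >> > > >   #include <cstdio>
> >> >> > > > 
> >> >> > > >   int main() {
> >> >> > > >     // Should match OMP_NUM_THREADS if the build honors it.
> >> >> > > >     std::printf("omp_get_max_threads() = %d\n",
> >> >> > > >                 omp_get_max_threads());
> >> >> > > >     return 0;
> >> >> > > >   }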
> >> >> > > > 
> >> >> > > > I was also able to reproduce this behavior with one of the
> >> >> > > > Trilinos example programs (attached below), so I suspect I am
> >> >> > > > missing something obvious in using the OpenMP support.
> >> >> > > > 
> >> >> > > > Does anybody have thoughts of what I might be missing?
> >> >> > > > 
> >> >> > > > Thanks.
> >> >> > > > --Eric
> >> >> > 
> >> >> > --
> >> >> > Eric A. Marttila
> >> >> > ThermoAnalytics, Inc.
> >> >> > 23440 Airpark Blvd.
> >> >> > Calumet, MI 49913
> >> >> > email: Eric.Marttila at ThermoAnalytics.com
> >> >> > phone: 810-636-2443
> >> >> > fax: 906-482-9755
> >> >> > web: http://www.thermoanalytics.com
> >> >> 
> >> >> --
> >> >> Eric A. Marttila
> >> >> ThermoAnalytics, Inc.
> >> >> 23440 Airpark Blvd.
> >> >> Calumet, MI 49913
> >> >> email: Eric.Marttila at ThermoAnalytics.com
> >> >> phone: 810-636-2443
> >> >> fax: 906-482-9755
> >> >> web: http://www.thermoanalytics.com
> >> >> _______________________________________________
> >> >> Trilinos-Users mailing list
> >> >> Trilinos-Users at software.sandia.gov
> >> >> http://software.sandia.gov/mailman/listinfo/trilinos-users

-- 
Eric A. Marttila
ThermoAnalytics, Inc.
23440 Airpark Blvd.
Calumet, MI 49913

email: Eric.Marttila at ThermoAnalytics.com
phone: 810-636-2443
fax:   906-482-9755
web: http://www.thermoanalytics.com



More information about the Trilinos-Users mailing list