[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos

Heroux, Mike MHeroux at csbsju.edu
Thu Sep 27 13:21:49 MDT 2012


Eric,

I think a reasonable approach would be to insert MPI into the application
but only make non-trivial use of the additional ranks (other than rank 0)
when using the solver.  Epetra provides very nice data redistribution
tools that allow you to take data objects that are on rank 0 and spread
them out across all the other processors.  You should see a significant
performance improvement for a 3D diffusion problem by doing this.
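
To make that concrete, here is a minimal sketch of the idea (illustrative
only, not taken from any attached code; the problem size and variable
names are placeholders).  It builds one map that owns every row on rank 0,
a second map that spreads the rows across all ranks, and an Epetra_Import
that moves data from the first layout to the second:

    #include <mpi.h>
    #include "Epetra_MpiComm.h"
    #include "Epetra_Map.h"
    #include "Epetra_Vector.h"
    #include "Epetra_Import.h"
    #include "Epetra_CombineMode.h"

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      Epetra_MpiComm comm(MPI_COMM_WORLD);

      const int numGlobalRows = 1000000;   // placeholder problem size

      // Layout 1: every row lives on rank 0, where the application
      // (and its GUI) assembles the data.
      const int numMyRows = (comm.MyPID() == 0) ? numGlobalRows : 0;
      Epetra_Map rank0Map(-1, numMyRows, 0, comm);

      // Layout 2: rows spread uniformly across all ranks for the solve.
      Epetra_Map distMap(numGlobalRows, 0, comm);

      Epetra_Vector bOnRank0(rank0Map);    // filled on rank 0
      Epetra_Vector bDistributed(distMap);

      // The importer encodes the rank-0 -> distributed communication
      // plan; the same plan works for an Epetra_CrsMatrix as well.
      Epetra_Import importer(distMap, rank0Map);
      bDistributed.Import(bOnRank0, importer, Insert);

      // ... solve with the distributed objects, then move the solution
      // back to rank 0 (e.g., with an Export) if the GUI needs it.

      MPI_Finalize();
      return 0;
    }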

Mike

On 9/27/12 2:15 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com>
wrote:

>Mike,
>I'm using Trilinos solvers within a monolithic piece of software that has
>an integrated GUI, and at the moment I can't easily make use of MPI.  So I
>was hoping to get some speedup initially by enabling OpenMP support in
>Trilinos.
>
>In the future I was planning to use OpenMP with MPI, thinking that would
>give me the best performance on our target platforms, which generally
>consist of small clusters of multicore nodes (e.g. 4-8 machines that are
>similar to your 12-core machine).  But from what you describe it sounds
>like it would be best for me to focus my efforts on MPI-only.
>
>--Eric
>
>On Thursday, September 27, 2012 02:38:30 pm Heroux, Michael A wrote:
>> Eric,
>> 
>> Belos is a good choice for an iterative solver.  It is compatible with ML.
>> As for threading support, please give me a sense of your motivation for
>> using threads.  Is it your intent to use them instead of MPI, or with MPI?
>> Other things being equal, MPI-only is still a better approach than
>> MPI+OpenMP or OpenMP-only on most systems, unless you are using very large
>> systems like BlueGene or Cray.
>> 
>> Mike
>> 
>> On 9/27/12 1:19 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
>> >Thanks Mike,
>> >This is very helpful.  I ran your codes on my system and was able to
>> >reproduce your timing results fairly closely.
>> >
>> >Given these results, would you recommend that I use Belos instead of
>> >AztecOO and ML?  That is what these performance numbers suggest to me,
>> >but many of the real problems I'm solving have 3D diffusion, so I think
>> >I may need the type of preconditioning that ML has.  Or are there other
>> >(threaded) preconditioners that I could use instead with Belos?
>> >
>> >--Eric
>> >
>> >On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
>> >> Eric,
>> >> 
>> >> I took a look at your example code on my 12-core (two hex-core Intel
>> >> Westmere) workstation.  There are two primary reasons for not seeing
>> >> speedup:
>> >>   *   Your timer includes all the problem generation steps.  These are
>> >> fairly expensive and not threaded, so you won't see speedup when they
>> >> are included.  Also, it appears that even when you were not using the
>> >> ML preconditioner, an ILU preconditioner was invoked, which is also
>> >> not threaded.
>> >>   *   AztecOO doesn't actually use the Epetra vectors in threaded
>> >> mode, so only the Matvec is running in parallel.  AztecOO extracts a
>> >> pointer to the Epetra data and executes it on a single core.  This is
>> >> particularly bad since Epetra has actually mapped the data across the
>> >> NUMA memory system for threaded execution.  I actually knew this
>> >> before, but forgot.
>> >> 
>> >> Even so, when I removed the preconditioner completely from the AztecOO
>> >> call there was some speedup (from 4.13 to 3.0 sec with 1 and 4 threads,
>> >> respectively, on 1M equations).
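>> >>
>> >> (For reference, a minimal sketch of what removing the preconditioner
>> >> looks like in the AztecOO call; this is illustrative, not the attached
>> >> ml-example, and the iteration limit and tolerance are placeholders:)
>> >>
>> >>     AztecOO solver(problem);                    // Epetra_LinearProblem built elsewhere
>> >>     solver.SetAztecOption(AZ_solver, AZ_cg);    // CG for the symmetric diffusion problem
>> >>     solver.SetAztecOption(AZ_precond, AZ_none); // no ILU, no ML: unpreconditioned
>> >>     solver.Iterate(1000, 1.0e-8);               // placeholder max iterations / tolerance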
>> >> 
>> >> Next, I replaced AztecOO with Belos CG.  Belos relies entirely on the
>> >> underlying linear algebra library, so it performs computations entirely
>> >> with Epetra.
>> >>
>> >> With this approach I went from 3.7 to 2.2 sec with 1 and 4 threads,
>> >> respectively.  Still not a linear improvement, but better than what you
>> >> were seeing.  With 12 threads, I only reduced it to 2.1 seconds.
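>> >>
>> >> (Again just a sketch, not the attached belos-example; A, x, and b are
>> >> assumed to be Teuchos::RCPs to an Epetra_CrsMatrix and Epetra_Vectors
>> >> built elsewhere, and the solver parameters are placeholders:)
>> >>
>> >>     #include "BelosLinearProblem.hpp"
>> >>     #include "BelosBlockCGSolMgr.hpp"
>> >>     #include "BelosEpetraAdapter.hpp"
>> >>     #include "Teuchos_ParameterList.hpp"
>> >>
>> >>     typedef Epetra_MultiVector MV;
>> >>     typedef Epetra_Operator    OP;
>> >>
>> >>     // Describe the linear system A x = b (no preconditioner set,
>> >>     // so this is unpreconditioned CG).
>> >>     Teuchos::RCP<Belos::LinearProblem<double, MV, OP> > problem =
>> >>       Teuchos::rcp(new Belos::LinearProblem<double, MV, OP>(A, x, b));
>> >>     problem->setProblem();
>> >>
>> >>     // CG solver manager; all vector kernels stay inside Epetra, so
>> >>     // they run threaded when Trilinos is built with OpenMP.
>> >>     Teuchos::RCP<Teuchos::ParameterList> params =
>> >>       Teuchos::rcp(new Teuchos::ParameterList());
>> >>     params->set("Maximum Iterations", 1000);
>> >>     params->set("Convergence Tolerance", 1.0e-8);
>> >>     Belos::BlockCGSolMgr<double, MV, OP> solver(problem, params);
>> >>     Belos::ReturnType result = solver.solve();  // Belos::Converged on success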
>> >> 
>> >> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds,
>> >> not too different.  But then with 12 ranks the time went to 1.5 seconds.
>> >> 
>> >> In my experience it is still the case for current-generation
>> >> workstations and small node counts that MPI-only does better than
>> >> OpenMP threads, but this is changing and is already not true on large
>> >> systems where the number of MPI ranks becomes a resource issue, and
>> >> when we can use different algorithms because of threading and shared
>> >> memory.  We are also learning better ways of managing data layout and
>> >> process mapping, but this is very manual right now and very finicky.
>> >> 
>> >> I have attached my two codes.  The first, ml-example, is your original
>> >> code with all preconditioning removed.  The second, belos-example,
>> >> substitutes Belos for AztecOO.
>> >> 
>> >> I hope this helps.
>> >> 
>> >> Best regards,
>> >> 
>> >> Mike
>> >> 
>> >> On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
>> >> 
>> >> Mike,
>> >> I have been using ML, but changed my code to run the simulation without
>> >> ML as you suggested.  It takes much longer to run this way (about 25
>> >> seconds instead of ~3) but does give some very slight speedup for 2
>> >> threads, then gets slower again for 4 and 8 threads.  The timing results
>> >> below are wall time using the function omp_get_wtime().  I'm using the
>> >> CG option in AztecOO (not GMRES).
>> >> 
>> >> OMP_NUM_THREADS   Wall time for solve (seconds)
>> >> ---------------   -----------------------------
>> >> 1                 24.6308
>> >> 2                 24.0489
>> >> 4                 25.9732
>> >> 8                 28.6976
>> >> 
>> >> --Eric
>> >> 
>> >> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
>> >> Here are a few comments:
>> >> - Your problem size is certainly sufficient to realize some parallel
>> >>   speedup.
>> >> - ML (which I assume you are using) will not see any improvement from
>> >>   OpenMP parallelism.  It is not instrumented for it.
>> >> - This means that the only possible parallelism is in the sparse MV and
>> >>   vector updates.  Since ML is usually more than 50% of the total
>> >>   runtime, you won't see a lot of improvement from threading, even when
>> >>   other issues are resolved.
>> >> A few suggestions:
>> >> - Try to run your environment without ML, just to see if you get any
>> >>   improvement in the SpMV and vector operations.
>> >> - If you are using GMRES, make sure you link with a threaded BLAS.
>> >>   DGEMV is the main kernel of GMRES other than SpMV and will need to be
>> >>   executed in threaded mode.
>> >> - Make sure your timer is a wall-clock timer, not a CPU timer.  A
>> >>   reasonable timer is the one that comes with OpenMP (see the sketch
>> >>   after this list).
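>> >>
>> >> A minimal sketch of the wall-clock timing (the Iterate() call is just a
>> >> placeholder for whatever solve you are timing):
>> >>
>> >>     #include <omp.h>
>> >>     #include <iostream>
>> >>
>> >>     double t0 = omp_get_wtime();   // wall-clock time, not CPU time
>> >>     solver.Iterate(1000, 1.0e-8);  // placeholder for the timed solve
>> >>     double t1 = omp_get_wtime();
>> >>     std::cout << "Wall time for solve: " << (t1 - t0) << " s\n";
>> >>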
>> >> I hope this helps.  Let me know what you find out.
>> >> Mike
>> >> On Sep 19, 2012, at 8:39 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:
>> >> > Mike,
>> >> > The problem size is 1 million unknowns.  I have Trilinos compiled with
>> >> > MPI enabled.  However, I'm launching my program with only one MPI
>> >> > process.  Here is some system information:
>> >> > Processors: dual Intel Xeon E5645 hex-core / 2.4 GHz / cache: 12 MB
>> >> > RAM: 96 GB
>> >> > OS: CentOS 6.2, 64-bit
>> >> > When solving for 1 million unknowns on this system, AztecOO reports the
>> >> > following solution times:
>> >> > Solution time: 3.3 seconds (using Trilinos with OpenMP disabled)
>> >> > Solution time: 4.0 seconds (using Trilinos with OpenMP enabled)
>> >> > I had OMP_NUM_THREADS set to 4.
>> >> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
>> >> > Thanks for your help.
>> >> > --Eric
>> >> > 
>> >> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
>> >> > > Eric,
>> >> > > 
>> >> > > Can you give some details about problem size, use of MPI (or not),
>> >> > > type of system, etc.
>> >> > > 
>> >> > > Thanks.
>> >> > > 
>> >> > > Mike
>> >> > > 
>> >> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
>> >> > > > Hello,
>> >> > > > 
>> >> > > > I'm using AztecOO and ML to solve a linear system.  I've been
>> >> > > > running my simulation in serial mode, but now I would like to take
>> >> > > > advantage of multiple cores by using the OpenMP support that is
>> >> > > > available in Trilinos.  I realize that the packages I'm using are
>> >> > > > not fully multi-threaded with OpenMP, but I'm hoping for some
>> >> > > > performance improvement since some of the packages I'm using have
>> >> > > > at least some level of OpenMP support.
>> >> > > > 
>> >> > > > I reconfigured and built Trilinos 10.12.2 with
>> >> > > > 
>> >> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
>> >> > > > 
>> >> > > > ...but when I run my simulation I see that it is slower than if I
>> >> > > > have Trilinos configured without the above option.  I have set the
>> >> > > > environment variable OMP_NUM_THREADS to the desired number of
>> >> > > > threads.
>> >> > > > 
>> >> > > > I was also able to reproduce this behavior with one of the Trilinos
>> >> > > > example programs (attached below), so I suspect I am missing
>> >> > > > something obvious in using the OpenMP support.
>> >> > > > 
>> >> > > > Does anybody have thoughts on what I might be missing?
>> >> > > > 
>> >> > > > Thanks.
>> >> > > > --Eric
>> >> > 
>> >> > --
>> >> > Eric A. Marttila
>> >> > ThermoAnalytics, Inc.
>> >> > 23440 Airpark Blvd.
>> >> > Calumet, MI 49913
>> >> > email: Eric.Marttila at ThermoAnalytics.com
>> >> > phone: 810-636-2443
>> >> > fax: 906-482-9755
>> >> > web: http://www.thermoanalytics.com
>> >> 
>> >> --
>> >> Eric A. Marttila
>> >> ThermoAnalytics, Inc.
>> >> 23440 Airpark Blvd.
>> >> Calumet, MI 49913
>> >>
>> >> email: Eric.Marttila at ThermoAnalytics.com
>> >> phone: 810-636-2443
>> >> fax: 906-482-9755
>> >> web: http://www.thermoanalytics.com
>> >> 
>
>-- 
>Eric A. Marttila
>ThermoAnalytics, Inc.
>23440 Airpark Blvd.
>Calumet, MI 49913
>
>email: Eric.Marttila at ThermoAnalytics.com
>phone: 810-636-2443
>fax:   906-482-9755
>web: http://www.thermoanalytics.com
>
>_______________________________________________
>Trilinos-Users mailing list
>Trilinos-Users at software.sandia.gov
>http://software.sandia.gov/mailman/listinfo/trilinos-users
>




