[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos

Heroux, Michael A maherou at sandia.gov
Thu Sep 27 12:38:30 MDT 2012


Eric,

Belos is a good choice for an iterative solver.  It is compatible with ML.
As for threading support, please give me a sense of your motivation for
using threads.  Is it your intent to use them instead of MPI, or with MPI?
 Other things being equal, MPI-only is still a better approach than
MPI+OpenMP or OpenMP-only on most systems, unless you are using very large
systems like BlueGene or Cray.
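
For concreteness, here is a rough sketch of what Belos CG with an ML
preconditioner looks like through the Epetra adapters.  It is only a
sketch, not code from our shipped examples: it assumes the matrix and
vectors (A, x, b) already exist as Teuchos::RCP-wrapped Epetra objects,
and the solver parameters are placeholders.

  // Sketch only: A is an assembled Epetra_CrsMatrix, x and b are
  // Epetra_MultiVectors, all held in Teuchos::RCPs by the caller.
  #include "Epetra_CrsMatrix.h"
  #include "Epetra_MultiVector.h"
  #include "Epetra_Operator.h"
  #include "BelosLinearProblem.hpp"
  #include "BelosBlockCGSolMgr.hpp"
  #include "BelosEpetraAdapter.hpp"
  #include "ml_MultiLevelPreconditioner.h"
  #include "Teuchos_ParameterList.hpp"
  #include "Teuchos_RCP.hpp"

  typedef Epetra_MultiVector MV;
  typedef Epetra_Operator    OP;

  void solveWithBelosAndML(const Teuchos::RCP<Epetra_CrsMatrix>& A,
                           const Teuchos::RCP<Epetra_MultiVector>& x,
                           const Teuchos::RCP<Epetra_MultiVector>& b)
  {
    // Smoothed-aggregation ML preconditioner built from the Epetra matrix.
    Teuchos::ParameterList mlParams;
    ML_Epetra::SetDefaults("SA", mlParams);
    Teuchos::RCP<ML_Epetra::MultiLevelPreconditioner> mlPrec =
      Teuchos::rcp(new ML_Epetra::MultiLevelPreconditioner(*A, mlParams));

    // Wrap ML so Belos calls ApplyInverse() when applying the preconditioner.
    Teuchos::RCP<Belos::EpetraPrecOp> prec =
      Teuchos::rcp(new Belos::EpetraPrecOp(mlPrec));

    // Define the preconditioned problem and solve it with block CG.
    Teuchos::RCP<Belos::LinearProblem<double, MV, OP> > problem =
      Teuchos::rcp(new Belos::LinearProblem<double, MV, OP>(A, x, b));
    problem->setLeftPrec(prec);
    problem->setProblem();

    Teuchos::RCP<Teuchos::ParameterList> belosParams =
      Teuchos::rcp(new Teuchos::ParameterList);
    belosParams->set("Maximum Iterations", 500);        // placeholder
    belosParams->set("Convergence Tolerance", 1.0e-8);  // placeholder

    Belos::BlockCGSolMgr<double, MV, OP> solver(problem, belosParams);
    solver.solve();
  }

Keep in mind that the ML setup and apply are themselves not threaded, so
any OpenMP benefit is limited to the Epetra kernels that Belos drives.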

Mike

On 9/27/12 1:19 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com>
wrote:

>Thanks Mike,
>This is very helpful.  I ran your codes on my system and was able to
>reproduce your timing results fairly closely.
>
>Given these results, would you recommend that I use Belos instead of
>AztecOO and ML?  That is what these performance numbers suggest to me,
>but many of the real problems I'm solving have 3D diffusion, so I think
>I may need the type of preconditioning that ML has.  Or are there other
>(threaded) preconditioners that I could use instead with Belos?
>
>--Eric
>
>
>On Thursday, September 27, 2012 01:50:18 am Heroux, Michael A wrote:
>> Eric,
>> 
>> I took a look at your example code on my 12-core (two hex-core Intel
>> Westmere) workstation.  There are two primary reasons for not seeing
>> speedup:
>> 
>>   *   Your timer includes all the problem generation steps.  These are
>> fairly expensive and not threaded, so you won't see speedup when they
>> are included.  Also, it appears that even when you were not using the
>> ML preconditioner, an ILU preconditioner was invoked, which is also not
>> threaded.
>>   *   AztecOO doesn't actually use the Epetra vectors in threaded mode,
>> so only the matvec is running in parallel.  AztecOO extracts a pointer
>> to the Epetra data and operates on it on a single core.  This is
>> particularly bad since Epetra has actually mapped the data across the
>> NUMA memory system for threaded execution.  I actually knew this
>> before, but forgot.
>> 
>> Even so, when I removed the preconditioner completely from the AztecOO
>> call there was some speedup (from 4.13 to 3.0 seconds with 1 and 4
>> threads, respectively, on 1M equations).
>> 
>> Next, I replaced AztecOO with Belos CG.  Belos relies entirely on the
>> underlying linear algebra library, so it performs all of its
>> computations with Epetra.
>> 
>> With this approach I went from 3.7 to 2.2 seconds with 1 and 4 threads,
>> respectively.  Still not a linear improvement, but better than what you
>> were seeing.  With 12 threads, I only reduced it to 2.1 seconds.
>> 
>> I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds,
>> not too different.  But then with 12 ranks the time went to 1.5 seconds.
>> 
>> In my experience it is still the case on current-generation workstations
>> and small node counts that MPI-only does better than OpenMP threads.
>> That is changing, though, and it is already not true on large systems
>> where the number of MPI ranks becomes a resource issue, or where
>> threading and shared memory let us use different algorithms.  We are
>> also learning better ways of managing data layout and process mapping,
>> but right now this is very manual and very finicky.
>> 
>> I have attached my two codes.  The first, ml-example, is your original
>> code with all preconditioning removed.  The second, belos-example,
>> substitutes Belos for AztecOO.
>> 
>> I hope this helps.
>> 
>> Best regards,
>> 
>> Mike
>> 
>> On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com>
>> wrote:
>> 
>> Mike,
>> I have been using ML, but changed my code to run the simulation without
>> ML as you suggested.  It takes much longer to run this way (about 25
>> seconds instead of ~3) but does give some very slight speedup for 2
>> threads, then gets slower again for 4 and 8 threads.  The timing results
>> below are wall time using the function omp_get_wtime().  I'm using the
>> CG option in AztecOO (not GMRES).
>> 
>> OMP_NUM_THREADS     Wall time for solve (seconds)
>> ---------------     -----------------------------
>> 1                   24.6308
>> 2                   24.0489
>> 4                   25.9732
>> 8                   28.6976
>> 
>> --Eric
>> 
>> On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
>> Here are a few comments:
>> - Your problem size is certainly sufficient to realize some parallel
>>   speedup.
>> - ML (which I assume you are using) will not see any improvement from
>>   OpenMP parallelism.  It is not instrumented for it.
>> - This means that the only possible parallelism is in the sparse MV and
>>   vector updates.  Since ML is usually more than 50% of the total
>>   runtime, you won't see a lot of improvement from threading, even when
>>   other issues are resolved.
>> A few suggestions:
>> - Try running your problem without ML, just to see if you get any
>>   improvement in the SpMV and vector operations.
>> - If you are using GMRES, make sure you link with a threaded BLAS.
>>   DGEMV is the main kernel of GMRES other than SpMV and will need to be
>>   executed in threaded mode.
>> - Make sure your timer is a wall-clock timer, not a CPU timer.  A
>>   reasonable timer is the one that comes with OpenMP (see the sketch
>>   after these suggestions).
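>> 
>> For the first and third suggestions, something along these lines is
>> enough (a sketch only, shown with CG; it assumes A, x, and b are your
>> existing Epetra matrix and vectors, and the iteration limit and
>> tolerance are placeholders):
>> 
>>   #include <omp.h>
>>   #include <iostream>
>>   #include "AztecOO.h"
>>   #include "Epetra_CrsMatrix.h"
>>   #include "Epetra_Vector.h"
>>   #include "Epetra_LinearProblem.h"
>> 
>>   // Time only the iterative solve (no preconditioner) with the OpenMP
>>   // wall-clock timer, so problem generation is excluded.
>>   double timedCgSolve(Epetra_CrsMatrix& A, Epetra_Vector& x, Epetra_Vector& b)
>>   {
>>     Epetra_LinearProblem problem(&A, &x, &b);
>>     AztecOO solver(problem);
>>     solver.SetAztecOption(AZ_solver, AZ_cg);     // CG rather than GMRES
>>     solver.SetAztecOption(AZ_precond, AZ_none);  // drop the (unthreaded) preconditioner
>> 
>>     double t0 = omp_get_wtime();                 // wall-clock time, not CPU time
>>     solver.Iterate(1000, 1.0e-8);                // placeholder limits
>>     double t1 = omp_get_wtime();
>> 
>>     std::cout << "Solve wall time: " << (t1 - t0) << " seconds" << std::endl;
>>     return t1 - t0;
>>   }
>> 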
>> I hope this helps.  Let me know what you find out.
>> Mike
>> On Sep 19, 2012, at 8:39 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com>
>> wrote:
>> > Mike,
>> > The problem size is 1 million unknowns.  I have Trilinos compiled with
>> > MPI enabled.  However, I'm launching my program with only one MPI
>> > process.  Here is some system information:
>> > Processors: Dual Intel Xeon E5645 hex-core / 2.4 GHz / Cache: 12 MB
>> > RAM: 96 GB
>> > OS: CentOS 6.2, 64-bit
>> > When solving for 1 million unknowns on this system, AztecOO reports
>> > the following solution times:
>> > Solution time: 3.3 seconds (using Trilinos with OpenMP disabled)
>> > Solution time: 4.0 seconds (using Trilinos with OpenMP enabled)
>> > I had OMP_NUM_THREADS set to 4.
>> > If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
>> > Thanks for your help.
>> > --Eric
>> > 
>> > On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
>> > > Eric,
>> > > 
>> > > Can you give some details about problem size, use of MPI (or not),
>> > > type of system, etc.?
>> > > 
>> > > Thanks.
>> > > 
>> > > Mike
>> > > 
>> > > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
>> > > > Hello,
>> > > > 
>> > > > I'm using AztecOO and ML to solve a linear system.  I've been
>> > > > running my simulation in serial mode, but now I would like to take
>> > > > advantage of multiple cores by using the OpenMP support that is
>> > > > available in Trilinos.  I realize that the packages I'm using are
>> > > > not fully multi-threaded with OpenMP, but I'm hoping for some
>> > > > performance improvement since some of the packages I'm using have
>> > > > at least some level of OpenMP support.
>> > > > 
>> > > > I reconfigured and built Trilinos 10.12.2 with
>> > > > 
>> > > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
>> > > > 
>> > > > ...but when I run my simulation I see that it is slower than if I
>> > > > have Trilinos configured without the above option. I have set the
>> > > > environment variable OMP_NUM_THREADS to the desired number of
>> > > > threads.
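>> > > > 
>> > > > As a sanity check (not part of the attached example, just an
>> > > > illustration), a small stand-alone program compiled with the same
>> > > > flags confirms whether OMP_NUM_THREADS is actually being honored:
>> > > > 
>> > > >   #include <omp.h>
>> > > >   #include <iostream>
>> > > > 
>> > > >   int main()
>> > > >   {
>> > > >     // Report what OpenMP will use, and what a parallel region
>> > > >     // actually gets, to confirm the environment setting is seen.
>> > > >     std::cout << "omp_get_max_threads() = "
>> > > >               << omp_get_max_threads() << std::endl;
>> > > >     #pragma omp parallel
>> > > >     {
>> > > >       #pragma omp single
>> > > >       std::cout << "threads in parallel region = "
>> > > >                 << omp_get_num_threads() << std::endl;
>> > > >     }
>> > > >     return 0;
>> > > >   }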
>> > > > 
>> > > > I was also able to reproduce this behavior with one of the
>> > > > Trilinos example programs (attached below), so I suspect I am
>> > > > missing something obvious in using the OpenMP support.
>> > > > 
>> > > > Does anybody have thoughts on what I might be missing?
>> > > > 
>> > > > Thanks.
>> > > > --Eric
>> > 
>> > --
>> > Eric A. Marttila
>> > ThermoAnalytics, Inc.
>> > 23440 Airpark Blvd.
>> > Calumet, MI 49913
>> > email: Eric.Marttila at ThermoAnalytics.com
>> > phone: 810-636-2443
>> > fax: 906-482-9755
>> > web: http://www.thermoanalytics.com
>> 
>> --
>> Eric A. Marttila
>> ThermoAnalytics, Inc.
>> 23440 Airpark Blvd.
>> Calumet, MI 49913
>> 
>> email: Eric.Marttila at ThermoAnalytics.com
>> phone: 810-636-2443
>> fax:   906-482-9755
>> web: http://www.thermoanalytics.com
>> 
>> _______________________________________________
>> Trilinos-Users mailing list
>> 
>> Trilinos-Users at software.sandia.gov
>> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>-- 
>Eric A. Marttila
>ThermoAnalytics, Inc.
>23440 Airpark Blvd.
>Calumet, MI 49913
>
>email: Eric.Marttila at ThermoAnalytics.com
>phone: 810-636-2443
>fax:   906-482-9755
>web: http://www.thermoanalytics.com
>



