[Trilinos-Users] [EXTERNAL] Re: Using OpenMP support in Trilinos

Heroux, Michael A maherou at sandia.gov
Wed Sep 26 23:50:18 MDT 2012


Eric,

I took a look at your example code on my 12-core (two hex-core Intel Westmere) workstation.  There are two primary reasons you are not seeing speedup:


  *   Your timer includes all the problem generation steps.  These are fairly expensive and not threaded, so you won't see speedup when they are included.  Also, it appears that even when you were not using the ML preconditioner, an ILU preconditioner was invoked, which is also not threaded.
  *   AztecOO doesn't actually use the Epetra vectors in threaded mode, so only the matvec runs in parallel.  AztecOO extracts a raw pointer to the Epetra data and runs its own kernels on a single core.  This is particularly bad since Epetra has actually mapped the data across the NUMA memory system for threaded execution.  I actually knew this before, but forgot.

Even so, when I removed the preconditioner completely from the AztecOO call there was some speedup (from 4.13 to 3.0 seconds with 1 and 4 threads, respectively, on 1M equations).
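Roughly, the timed, unpreconditioned solve looks like this (a sketch only; the matrix and vector names stand in for whatever your driver builds, and the iteration limit and tolerance are placeholders):

  #include <iostream>
  #include <omp.h>
  #include "AztecOO.h"
  #include "Epetra_CrsMatrix.h"
  #include "Epetra_Vector.h"

  // Time only the solve; problem generation stays outside the timer.
  void timedSolve(Epetra_CrsMatrix& A, Epetra_Vector& x, Epetra_Vector& b)
  {
    AztecOO solver(&A, &x, &b);
    solver.SetAztecOption(AZ_solver, AZ_cg);
    solver.SetAztecOption(AZ_precond, AZ_none);  // no (unthreaded) ILU
    double start = omp_get_wtime();              // wall-clock, not CPU time
    solver.Iterate(500, 1.0e-8);                 // max iterations, tolerance
    double stop = omp_get_wtime();
    std::cout << "Solve time: " << (stop - start) << " s" << std::endl;
  }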

Next, I replaced AztecOO with Belos CG.  Belos relies on the underlying linear algebra library for all of its computations, so everything runs through Epetra.
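The core of the replacement looks roughly like this (again a sketch, not the attached file verbatim; the solver parameters are placeholders):

  #include "BelosLinearProblem.hpp"
  #include "BelosBlockCGSolMgr.hpp"
  #include "BelosEpetraAdapter.hpp"
  #include "Teuchos_ParameterList.hpp"
  #include "Teuchos_RCP.hpp"
  #include "Epetra_CrsMatrix.h"
  #include "Epetra_MultiVector.h"
  #include "Epetra_Operator.h"

  typedef double             ST;
  typedef Epetra_MultiVector MV;
  typedef Epetra_Operator    OP;

  // Solve Ax = b with Belos CG; all vector kernels stay inside Epetra.
  void belosSolve(const Teuchos::RCP<Epetra_CrsMatrix>& A,
                  const Teuchos::RCP<MV>& x,
                  const Teuchos::RCP<MV>& b)
  {
    Teuchos::RCP<Belos::LinearProblem<ST, MV, OP> > problem =
      Teuchos::rcp(new Belos::LinearProblem<ST, MV, OP>(A, x, b));
    problem->setProblem();

    Teuchos::RCP<Teuchos::ParameterList> params =
      Teuchos::rcp(new Teuchos::ParameterList());
    params->set("Maximum Iterations", 500);
    params->set("Convergence Tolerance", 1.0e-8);

    Belos::BlockCGSolMgr<ST, MV, OP> solver(problem, params);
    solver.solve();
  }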

With this approach I went from 3.7 to 2.2 seconds with 1 and 4 threads, respectively.  Still not a linear improvement, but better than what you were seeing.  With 12 threads, I only reduced it to 2.1 seconds.

I ran the same Belos code with MPI using 4 ranks and got 2.1 seconds, not too different.  But then with 12 ranks the time went to 1.5 seconds.

In my experience, it is still the case on current-generation workstations and at small node counts that MPI-only does better than OpenMP threads.  That is changing, though, and it is already untrue on large systems, where the number of MPI ranks becomes a resource issue and where threading and shared memory let us use different algorithms.  We are also learning better ways of managing data layout and process mapping, but right now this is very manual and very finicky.

I have attached my two codes.  The first, ml-example, is your original code with all preconditioning removed.  The second, belos-example, substitutes Belos for AztecOO.

I hope this helps.

Best regards,

Mike

On 9/20/12 1:18 PM, "Eric Marttila" <eric.marttila at thermoanalytics.com> wrote:

Mike,
I have been using ML, but changed my code to run the simulation without ML as
you suggested. It takes much longer to run this way (about 25 seconds instead
of ~3) but does give some very slight speedup for 2 threads, then gets slower
again for 4 and 8 threads.  The timing results below are wall time using the
function omp_get_wtime().  I'm using the CG option in AztecOO (not GMRES).

OMP_NUM_THREADS     Wall time for solve (seconds)
-------------------------------------------------
1                   24.6308
2                   24.0489
4                   25.9732
8                   28.6976

--Eric

On Wednesday, September 19, 2012 10:56:32 pm Heroux, Michael A wrote:
Here are a few comments:
- Your problem size is certainly sufficient to realize some parallel speedup.
- ML (which I assume you are using) will not see any improvement from OpenMP parallelism.  It is not instrumented for it.
- This means that the only possible parallelism is in the sparse MV and vector updates.  Since ML is usually more than 50% of the total runtime, you won't see a lot of improvement from threading, even when other issues are resolved.

A few suggestions:
- Try to run your environment without ML, just to see if you get any improvement in the SpMV and vector operations.
- If you are using GMRES, make sure you link with a threaded BLAS.  DGEMV is the main kernel of GMRES other than SpMV and will need to be executed in threaded mode.
- Make sure your timer is a wall-clock timer, not a CPU timer.  A reasonable timer is the one that comes with OpenMP; see the sketch below.
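To illustrate that last point, a tiny sketch (the function name is just for illustration): clock() accumulates CPU time across all threads, so it can grow as you add threads, while omp_get_wtime() measures elapsed wall-clock time.

  #include <ctime>
  #include <omp.h>

  void compareTimers()
  {
    double       t0 = omp_get_wtime();
    std::clock_t c0 = std::clock();
    // ... threaded work here ...
    double wall = omp_get_wtime() - t0;                        // elapsed time
    double cpu  = double(std::clock() - c0) / CLOCKS_PER_SEC;  // summed over
                                                               // threads; can
                                                               // exceed wall
  }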
I hope this helps.  Let me know what you find out.
Mike
On Sep 19, 2012, at 8:39 PM, "Eric Marttila"
<eric.marttila at thermoanalytics.com> wrote:
> Mike,
> The problem size is 1 million unknowns. I have Trilinos compiled with MPI
> enabled. However, I'm launching my program with only one MPI process.
> Here is some system information:
> Processors: Dual Intel Xeon E5645 hex-core / 2.4 GHz / Cache: 12 MB
> RAM: 96 GB
> OS: CentOS 6.2, 64-bit
> When solving for 1 million unknowns on this system, AztecOO reports the
> following solution times:
>   Solution time: 3.3 seconds (Trilinos built with OpenMP disabled)
>   Solution time: 4.0 seconds (Trilinos built with OpenMP enabled)
> I had OMP_NUM_THREADS set to 4.
> If I set OMP_NUM_THREADS to 1 then I get 3.3 seconds in both cases.
> Thanks for your help.
> --Eric
>
> On Wednesday, September 19, 2012 08:48:20 pm Heroux, Michael A wrote:
> > Eric,
> >
> > Can you give some details about problem size, use of MPI (or not), type
> > of system, etc.
> >
> > Thanks.
> >
> > Mike
> >
> > On Sep 19, 2012, at 3:17 PM, Eric Marttila wrote:
> > > Hello,
> > >
> > > I'm using AztecOO and ML to solve a linear system. I've been running
> > > my simulation in serial mode, but now I would like to take advantage
> > > of multiple cores by using the OpenMP support that is available in
> > > Trilinos. I realize that the packages I'm using are not fully
> > > multi-threaded with OpenMP, but I'm hoping for some performance
> > > improvement since some of them have at least some level of OpenMP
> > > support.
> > >
> > > I reconfigured and built Trilinos 10.12.2 with
> > >
> > > -D Trilinos_ENABLE_OpenMP:BOOL=ON
> > >
> > > ...but when I run my simulation I see that it is slower than if I
> > > have Trilinos configured without the above option. I have set the
> > > environment variable OMP_NUM_THREADS to the desired number of
> > > threads.
> > >
> > > I was also able to reproduce this behavior with one of the Trilinos
> > > example programs (attached below), so I suspect I am missing something
> > > obvious in using the OpenMP support.
> > >
> > > Does anybody have thoughts of what I might be missing?
> > >
> > > Thanks.
> > > --Eric
>
> --
> Eric A. Marttila
> ThermoAnalytics, Inc.
> 23440 Airpark Blvd.
> Calumet, MI 49913
> email: Eric.Marttila at ThermoAnalytics.com
> phone: 810-636-2443
> fax: 906-482-9755
> web: http://www.thermoanalytics.com

--
Eric A. Marttila
ThermoAnalytics, Inc.
23440 Airpark Blvd.
Calumet, MI 49913

email: Eric.Marttila at ThermoAnalytics.com
phone: 810-636-2443
fax:   906-482-9755
web: http://www.thermoanalytics.com

_______________________________________________
Trilinos-Users mailing list
Trilinos-Users at software.sandia.gov
http://software.sandia.gov/mailman/listinfo/trilinos-users

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ml-example.cpp
Type: application/octet-stream
Size: 1892 bytes
Desc: ml-example.cpp
URL: https://software.sandia.gov/pipermail/trilinos-users/attachments/20120927/8d69443f/attachment.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: belos-example.cpp
Type: application/octet-stream
Size: 4018 bytes
Desc: belos-example.cpp
URL: https://software.sandia.gov/pipermail/trilinos-users/attachments/20120927/8d69443f/attachment-0001.obj

