[Trilinos-Users] Performance Issues

Heroux, Michael A maherou at sandia.gov
Wed Feb 17 21:57:25 EST 2016


James,

I don’t know if you have made progress on this, but one thing to check is your compiler optimization flags.  Have you experimented with different levels of optimization?

Mike

From: James Hawkes <jameshawkes at outlook.com>
Date: Monday, February 15, 2016 at 12:10 PM
To: Michael A Heroux <maherou at sandia.gov>, Trilinos Users <trilinos-users at trilinos.org>
Subject: [EXTERNAL] RE: [Trilinos-Users] Performance Issues

Hi Mike,

For the Krylov solver I am using the standard "GMRES" from the Belos SolverFactory, with the following parameters:

solverParams->set ("Num Blocks", 30);
solverParams->set ("Maximum Iterations", 99999);
solverParams->set ("Convergence Tolerance", 0.1); //relative residual 2-norm
solverParams->set ("Orthogonalization", "ICGS"); // DKGS, ICGS, IMGS give minor performance differences
solverParams->set ( "Implicit Residual Scaling", "Norm of Initial Residual" );

The relative tolerance is very relaxed (deliberately; it only needs 9 iterations), but if I tighten it (0.01 or 0.001) not much changes other than the total number of iterations -- the time per iteration is still high. Likewise, I have tested on larger problems (up to 2.7m equations), but the performance gap is still there. We are trying to push the equations-per-core down, which is one of the motivations for switching to a hybrid scheme.

For the preconditioner:
PCParams.set ("fact: ilut level-of-fill", 1.0);
PCParams.set ("fact: drop tolerance", 0.0);
PCParams.set ("fact: absolute threshold", 0.1);

I have tried variations on the level-of-fill, drop tolerance and absolute threshold. Sure enough, these change the convergence, but they don't make much difference to the time per iteration. I think the above settings match the Block Jacobi preconditioner I am comparing against. In both cases I apply it as a right preconditioner.

I should also mention that 32 processes corresponds to two distributed-memory nodes in our cluster; I see the same problem when running on 16 cores in a single node, or with only one core.

Kind regards,
James

________________________________
From: maherou at sandia.gov
To: jameshawkes at outlook.com; trilinos-users at trilinos.org
Subject: Re: [Trilinos-Users] Performance Issues
Date: Mon, 15 Feb 2016 16:32:56 +0000

James,

Can you give us some details about the parameter values you are using?  Levels of fill, Krylov solver and its parameters, number of iterations, etc.

Also, 16K equations is quite small for running on 32 processors.  If this is your typical problem size, fine, but if not, you might consider a larger problem.

Mike

From: Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf of James Hawkes <jameshawkes at outlook.com>
Date: Monday, February 15, 2016 at 9:42 AM
To: Trilinos Users <trilinos-users at trilinos.org>
Subject: [EXTERNAL] [Trilinos-Users] Performance Issues

I am exploring alternative linear solver packages to PETSc, because PETSc recently stopped supporting hybrid parallelization (MPI+OpenMP), which is something we want to start using. However, I am struggling to match the performance of the two packages using very similar solvers. I am solving a Poisson equation generated from CFD, with 16k rows.

I'm using GMRES from PETSc and GMRES from Belos. With and without preconditioning, they both compute exactly the same residual in exactly the same number of iterations -- however, Trilinos takes much longer and I cannot pinpoint the reason why. Here are the timings without preconditioning:

PETSc    on 32 MPI processes: 7.246e-03s
Trilinos on 32 MPI processes: 2.902e-02s (compiled with OpenMP via Kokkos, OMP_NUM_THREADS=1)
Trilinos on 32 MPI processes: 2.535e-02s (compiled with serial nodes, no threading)
Trilinos on  4 MPI processes: 2.453e-02s (compiled with OpenMP, OMP_NUM_THREADS=8)

With preconditioning (Ifpack2 ILUT from Trilinos and Block Jacobi from PETSc), the differences become larger (~4x worse). I have profiled the results using the timing tools from Teuchos, and nothing particularly stands out. The proportion of time spent in the expensive operations (such as orthogonalization and normalization) is roughly equal between the two solvers; it seems like Trilinos is just running slower across the board. Matrix assembly is comparable in both packages (and very fast by comparison).
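
For reference, the Trilinos-side numbers come from the Teuchos time monitor, roughly like this (a sketch; the "Total solve" label is just illustrative, solver is the Belos solver manager from the snippet further up the thread, and the Belos-internal timers show up because Belos_ENABLE_TEUCHOS_TIME_MONITOR is on in the build below):

#include <Teuchos_TimeMonitor.hpp>

void timedSolve (Belos::SolverManager<SC, MV, OP>& solver)
{
  {
    // Time the whole solve; Belos' own timers (orthogonalization, norms,
    // operator/preconditioner applies, ...) are registered automatically.
    Teuchos::TimeMonitor timer (*Teuchos::TimeMonitor::getNewCounter ("Total solve"));
    solver.solve ();
  }
  // Print a table of all registered timers, aggregated over MPI ranks.
  Teuchos::TimeMonitor::summarize ();
}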

My first thoughts were thread affinity problems and oversubscription, but the problem is still there even when multi-threading is disabled.
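
A quick way to rule out oversubscription is a per-rank thread count, something like this (plain MPI + OpenMP, just an illustrative sketch, nothing Trilinos-specific):

#include <cstdio>
#include <mpi.h>
#include <omp.h>

// Print how many OpenMP threads each MPI rank actually gets, to check that
// ranks x threads matches the number of physical cores on the node.
void reportThreadsPerRank ()
{
  int rank = 0;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
#pragma omp parallel
  {
#pragma omp master
    std::printf ("rank %d: %d OpenMP threads\n", rank, omp_get_num_threads ());
  }
}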

The results are the same regardless of the compiler, too. I have tried Intel v15, v16 and GCC 5.3.0.

I feel like there must be something obvious I am missing, and would appreciate it if anyone could point me in the right direction. I have attached my CMake script below. When compiling with no threading, I set the Kokkos default node appropriately and disable anything related to OpenMP.

Kind regards,
James

cmake \
          -D Trilinos_ENABLE_DEFAULT_PACKAGES:BOOL=OFF \
          -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
          -D BUILD_SHARED_LIBS:BOOL=ON \
          -D Trilinos_ENABLE_Epetra:BOOL=OFF \
          -D Trilinos_ENABLE_Tpetra:BOOL=ON \
          -D Trilinos_ENABLE_Kokkos:BOOL=ON \
          -D Trilinos_ENABLE_Ifpack2:BOOL=ON \
          -D Trilinos_ENABLE_Belos:BOOL=ON \
          -D Trilinos_ENABLE_Teuchos:BOOL=ON \
          -D Trilinos_ENABLE_CTrilinos:BOOL=OFF \
          -D Trilinos_ENABLE_TESTS:BOOL=OFF \
          -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
          -D Tpetra_INST_SERIAL:BOOL=ON \
          -D Tpetra_INST_OPENMP:BOOL=ON \
          -D Trilinos_ENABLE_MPI:BOOL=ON \
          -D TPL_ENABLE_MKL:BOOL=ON \
          -D Trilinos_ENABLE_OpenMP:BOOL=ON \
          -D Kokkos_ENABLE_Pthread:BOOL=OFF \
          -D Tpetra_DefaultNode:STRING="Kokkos::Compat::KokkosOpenMPWrapperNode" \
          -D Kokkos_ENABLE_MPI:BOOL=ON \
          -D Kokkos_ENABLE_OpenMP:BOOL=ON \
          -D TPL_ENABLE_MPI:BOOL=ON \
          -D TPL_ENABLE_Pthread:BOOL=OFF \
          -D Belos_ENABLE_TEUCHOS_TIME_MONITOR:BOOL=ON \
          -D BLAS_LIBRARY_DIRS:PATH=$MKLROOT/lib/intel64/ \
          -D LAPACK_LIBRARY_DIRS:PATH=$MKLROOT/lib/intel64/ \
          -D BLAS_LIBRARY_NAMES:STRING="mkl_intel_lp64;mkl_core;mkl_sequential" \
          -D BLAS_INCLUDE_DIRS:PATH=$MKLROOT/include/ \
          -D LAPACK_LIBRARY_NAMES:STRING="mkl_intel_lp64;mkl_core;mkl_sequential" \
          -D TPL_MKL_LIBRARIES:STRING="mkl_intel_lp64;mkl_core;mkl_sequential" \
          -D CMAKE_INSTALL_PREFIX:PATH=$MYINSTALLPATH \
          -D Tpetra_INST_COMPLEX_DOUBLE:BOOL=OFF \
          -D CMAKE_BUILD_TYPE:STRING=RELEASE \
          ${EXTRA_ARGS} \
          ${TRILINOS_PATH}