[Trilinos-Users] Performance Issues

James Hawkes jameshawkes at outlook.com
Mon Feb 15 12:10:29 EST 2016


Hi Mike,
For the Krylov solver I am using the standard "GMRES" from the Belos SolverFactory, with the following parameters:
solverParams->set ("Num Blocks", 30);
solverParams->set ("Maximum Iterations", 99999);
solverParams->set ("Convergence Tolerance", 0.1); // relative residual 2-norm
solverParams->set ("Orthogonalization", "ICGS"); // DKGS, ICGS, IMGS give minor performance differences
solverParams->set ("Implicit Residual Scaling", "Norm of Initial Residual");
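
For reference, those parameters are passed to the factory roughly like this (a minimal sketch; SC, MV and OP are stand-ins for our Tpetra scalar, multivector and operator types):

#include <BelosSolverFactory.hpp>
#include <BelosTpetraAdapter.hpp>   // Tpetra specializations for Belos
#include <Teuchos_RCP.hpp>

// SC, MV, OP = Tpetra scalar, Tpetra::MultiVector and Tpetra::Operator (stand-ins).
Belos::SolverFactory<SC, MV, OP> factory;
Teuchos::RCP<Belos::SolverManager<SC, MV, OP>> solver =
    factory.create ("GMRES", solverParams);
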
The relative tolerance is deliberately very relaxed (it only needs 9 iterations), but if I tighten it (0.01 or 0.001) not much changes other than the total number of iterations -- the time-per-iteration is still high. Likewise, I have tested on larger problems (up to 2.7m equations) and the performance gap still exists. We are trying to push the equations-per-core down, which is one of the motivations for switching to a hybrid scheme.
For the preconditioner:
PCParams.set ("fact: ilut level-of-fill", 1.0);
PCParams.set ("fact: drop tolerance", 0.0);
PCParams.set ("fact: absolute threshold", 0.1);
I have tried variations on the level-of-fill, drop tolerance and absolute threshold. Sure enough, they change the convergence, but they make little difference to the time-per-iteration. I believe the above settings match the Block Jacobi preconditioner I am comparing against. In both cases I apply it as a right preconditioner.
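
And a rough sketch of how the preconditioner is built and attached (assuming A, x and b are RCPs to the assembled Tpetra matrix and vectors, with LO, GO and NO the usual ordinal and node types):

#include <Ifpack2_Factory.hpp>
#include <BelosLinearProblem.hpp>

// Build the ILUT preconditioner from the parameters above (A is passed as const).
Ifpack2::Factory precFactory;
Teuchos::RCP<Ifpack2::Preconditioner<SC, LO, GO, NO>> prec =
    precFactory.create ("ILUT", A.getConst ());
prec->setParameters (PCParams);
prec->initialize ();
prec->compute ();

// Wire everything into a LinearProblem, with ILUT applied on the right.
Teuchos::RCP<Belos::LinearProblem<SC, MV, OP>> problem =
    Teuchos::rcp (new Belos::LinearProblem<SC, MV, OP> (A, x, b));
problem->setRightPrec (prec);
problem->setProblem ();
solver->setProblem (problem);
solver->solve ();
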
I should also mention that 32 processes spans two distributed-memory nodes in our cluster; I see the same problem when running on 16 cores in a single node, or with only one core.
Kind regards,
James

From: maherou at sandia.gov
To: jameshawkes at outlook.com; trilinos-users at trilinos.org
Subject: Re: [Trilinos-Users] Performance Issues
Date: Mon, 15 Feb 2016 16:32:56 +0000

James,

Can you give us some details about the parameter values you are using?  Levels of fill, Krylov solver and its parameters, number of iterations, etc.

Also, 16K equations is quite small for running on 32 processors.  If this is your typical problem size, fine, but if not, you might consider a larger problem.  

Mike

From: Trilinos-Users <trilinos-users-bounces at trilinos.org> on behalf of James Hawkes <jameshawkes at outlook.com>
Date: Monday, February 15, 2016 at 9:42 AM
To: Trilinos Users <trilinos-users at trilinos.org>
Subject: [EXTERNAL] [Trilinos-Users] Performance Issues

I am exploring alternative linear solver packages to PETSc, because PETSc recently stopped supporting hybrid parallelization (MPI+OpenMP) and this is something we want to start exploring. However, I am struggling to match the performance of the two packages using very similar solvers. I am solving a Poisson equation generated from CFD, with 16k rows.

I'm using GMRES from PETSc and GMRES from Belos. With and without preconditioning, they both compute exactly the same residual in exactly the same number of iterations -- however, Trilinos takes much longer and I cannot pinpoint the reason why. Here are the timings without preconditioning:

PETSc    on 32 MPI processes: 7.246e-03s
Trilinos on 32 MPI processes: 2.902e-02s (compiled with the Kokkos OpenMP backend, OMP_NUM_THREADS=1)
Trilinos on 32 MPI processes: 2.535e-02s (compiled with the serial node, no threading)
Trilinos on  4 MPI processes: 2.453e-02s (compiled with OpenMP, OMP_NUM_THREADS=8)

With preconditioning (Ifpack2 ILUT from Trilinos and Block Jacobi from PETSc), the differences become larger (~4x worse). I have profiled the results using the timing tools from Teuchos, and nothing particularly stands out. The proportions of time spent in expensive operations (such as orthogonalization and normalization) are roughly equal between the two solvers; it seems like Trilinos is just running slower across the board. Matrix assembly is comparable in both packages (and very fast in comparison).
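
For the record, the profiling itself is nothing exotic: with Belos_ENABLE_TEUCHOS_TIME_MONITOR=ON, Belos registers its own timers, and I just print the table after the solve, roughly:

#include <Teuchos_TimeMonitor.hpp>
#include <iostream>

// Belos (and Ifpack2) register Teuchos timers internally when the time-monitor
// option is enabled; this prints the summary table, including per-rank statistics.
Teuchos::TimeMonitor::summarize (std::cout);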

My first thoughts were thread affinity problems and over-subscription, but even when disabling multi-threading the problems are still there.
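
For what it's worth, one way to check what Kokkos actually sees at runtime is to dump its configuration after initialization (a small sketch):

#include <Kokkos_Core.hpp>
#include <iostream>

// After Kokkos::initialize(), this reports the enabled backends and the thread
// count actually in use, which can be compared against OMP_NUM_THREADS and the
// MPI ranks-per-node to spot over-subscription.
Kokkos::print_configuration (std::cout, true);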

The results are also the same regardless of compiler; I have tried Intel v15, v16 and GCC 5.3.0.

I feel like there must be something obvious I am missing, and I would appreciate it if anyone could point me in the right direction. I have attached my CMake configure script below. When compiling with no threading, I set the Kokkos default node appropriately and disable anything related to OpenMP.

Kind regards,
James

cmake \
          -D Trilinos_ENABLE_DEFAULT_PACKAGES:BOOL=OFF \
          -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
          -D BUILD_SHARED_LIBS:BOOL=ON \
          -D Trilinos_ENABLE_Epetra:BOOL=OFF \
          -D Trilinos_ENABLE_Tpetra:BOOL=ON \
          -D Trilinos_ENABLE_Kokkos:BOOL=ON \
          -D Trilinos_ENABLE_Ifpack2:BOOL=ON \
          -D Trilinos_ENABLE_Belos:BOOL=ON \
          -D Trilinos_ENABLE_Teuchos:BOOL=ON \
          -D Trilinos_ENABLE_CTrilinos:BOOL=OFF \
          -D Trilinos_ENABLE_TESTS:BOOL=OFF \
          -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
          -D Tpetra_INST_SERIAL:BOOL=ON \
          -D Tpetra_INST_OPENMP:BOOL=ON \
          -D Trilinos_ENABLE_MPI:BOOL=ON \
          -D TPL_ENABLE_MKL:BOOL=ON \
          -D Trilinos_ENABLE_OpenMP:BOOL=ON \
          -D Kokkos_ENABLE_Pthread:BOOL=OFF \
          -D Tpetra_DefaultNode:STRING="Kokkos::Compat::KokkosOpenMPWrapperNode" \
          -D Kokkos_ENABLE_MPI:BOOL=ON \
          -D Kokkos_ENABLE_OpenMP:BOOL=ON \
          -D TPL_ENABLE_MPI:BOOL=ON \
          -D TPL_ENABLE_Pthread:BOOL=OFF \
          -D Belos_ENABLE_TEUCHOS_TIME_MONITOR:BOOL=ON \
          -D BLAS_LIBRARY_DIRS:PATH=$MKLROOT/lib/intel64/ \
          -D LAPACK_LIBRARY_DIRS:PATH=$MKLROOT/lib/intel64/ \
          -D BLAS_LIBRARY_NAMES:STRING="mkl_intel_lp64;mkl_core;mkl_sequential" \
          -D BLAS_INCLUDE_DIRS:PATH=$MKLROOT/include/ \
          -D LAPACK_LIBRARY_NAMES:STRING="mkl_intel_lp64;mkl_core;mkl_sequential" \
          -D TPL_MKL_LIBRARIES:STRING="mkl_intel_lp64;mkl_core;mkl_sequential" \
          -D CMAKE_INSTALL_PREFIX:PATH=$MYINSTALLPATH \
          -D Tpetra_INST_COMPLEX_DOUBLE:BOOL=OFF \
          -D CMAKE_BUILD_TYPE:STRING=RELEASE \
          ${EXTRA_ARGS} \
          ${TRILINOS_PATH}
