[Trilinos-Users] [EXTERNAL] Re: Results from a scaling study of ML

John Cary cary at colorado.edu
Tue Apr 6 05:50:48 MST 2021


Hi James,

I had forgotten that we patch Trilinos to get it to build without
ParMETIS/METIS. We cannot include those in the build chain, as they
have a commercial license.

We use SuperLU_Dist-5.4.0.

Our patch for that is

diff -ruN trilinos-13.0.0/packages/amesos/CMakeLists.txt trilinos-13.0.0-new/packages/amesos/CMakeLists.txt
--- trilinos-13.0.0/packages/amesos/CMakeLists.txt      2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/amesos/CMakeLists.txt  2020-10-31 13:03:07.394676831 -0600
@@ -10,9 +10,11 @@
 # B) Set up package-specific options
 #
 
-# if using SuperLUDist, must also link in ParMETIS for some reason
-IF(${PACKAGE_NAME}_ENABLE_SuperLUDist AND NOT ${PACKAGE_NAME}_ENABLE_ParMETIS)
-  MESSAGE(FATAL_ERROR "The Amesos support for the SuperLUIDist TPL requires the ParMETIS TPL.  Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.")
+# One can now configure SuperLUDist without ParMETIS
+if (NOT TPL_ENABLE_SuperLUDist_Without_ParMETIS)
+  IF(${PACKAGE_NAME}_ENABLE_SuperLUDist AND NOT ${PACKAGE_NAME}_ENABLE_ParMETIS)
+    MESSAGE(FATAL_ERROR "The Amesos support for the SuperLUDist TPL requires the ParMETIS TPL.  Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.")
+  ENDIF()
 ENDIF()
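
For illustration, a configure line that exercises the patched check could be assembled roughly as below. This is only a dry-run sketch: it prints the command instead of running it. The source and install paths are hypothetical placeholders, and the `SuperLUDist_INCLUDE_DIRS` / `SuperLUDist_LIBRARY_DIRS` variables follow the usual TriBITS TPL naming convention and are assumed here, not taken from our actual configure script. Only `TPL_ENABLE_SuperLUDist_Without_ParMETIS` comes from the patch itself.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build a Trilinos configure command line and print it.
# Nothing is configured or compiled by this script.

# Hypothetical locations; override through the environment.
trilinos_src="${TRILINOS_SRC:-/path/to/trilinos-13.0.0-new}"
superlu_dist_dir="${SUPERLU_DIST_DIR:-/path/to/superlu_dist-5.4.0}"

cmake_args=(
  -DTPL_ENABLE_MPI:BOOL=ON
  -DTrilinos_ENABLE_Amesos:BOOL=ON
  -DTrilinos_ENABLE_ML:BOOL=ON
  -DTPL_ENABLE_SuperLUDist:BOOL=ON
  # Assumed TriBITS-style TPL location variables
  -DSuperLUDist_INCLUDE_DIRS:PATH="$superlu_dist_dir/include"
  -DSuperLUDist_LIBRARY_DIRS:PATH="$superlu_dist_dir/lib"
  -DTPL_ENABLE_ParMETIS:BOOL=OFF
  # Only has an effect with the patched packages/amesos/CMakeLists.txt
  -DTPL_ENABLE_SuperLUDist_Without_ParMETIS:BOOL=TRUE
)

# Print the command that would be run, shell-quoted.
printf 'cmake'
printf ' %q' "${cmake_args[@]}" "$trilinos_src"
printf '\n'
```

Against an unpatched tree the last option is simply ignored, so Amesos would still stop with the FATAL_ERROR above.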

Our full patch is attached.  It has some pretty small changes to also
get trilinos to build for us on Windows, where we use LLVM-10.  I also
had to add a fix for superlu version < 5, which works for me, but I am
not sure whether it is right.  I suppose I should try submitting PRs
again, but will have to reproduce the reasons for the PR.

Our SuperLU_Dist configuration includes

   -Denable_parmetislib:BOOL='OFF' \
   -DXSDK_ENABLE_Fortran:BOOL='OFF' \
   -Denable_blaslib:BOOL='OFF' \

I attach that full configure script for your reference.
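
One way to confirm that a given SuperLU_Dist install really was built without ParMETIS is to look for an active `HAVE_PARMETIS` define in `include/superlu_dist_config.h`, the header quoted further down in this thread. A minimal sketch, assuming the header uses the `#define X` / `/* #undef X */` layout shown there; the install prefix is a hypothetical placeholder and `superlu_has_parmetis` is just an illustrative helper name:

```shell
#!/usr/bin/env bash
# Sketch: report whether a SuperLU_Dist install was configured with ParMETIS,
# judged by an active "#define HAVE_PARMETIS" in superlu_dist_config.h.

superlu_has_parmetis() {
  local header="$1"
  # Matches an active define only; "/* #undef HAVE_PARMETIS */" does not match.
  grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+HAVE_PARMETIS([[:space:]]|$)' "$header"
}

# Hypothetical install prefix; override via SUPERLU_DIST_DIR.
header="${SUPERLU_DIST_DIR:-/path/to/superlu_dist-5.4.0}/include/superlu_dist_config.h"

if [ ! -r "$header" ]; then
  echo "cannot read $header"
elif superlu_has_parmetis "$header"; then
  echo "ParMETIS enabled in SuperLU_Dist"
else
  echo "ParMETIS not enabled in SuperLU_Dist"
fi
```

The header only records what was requested at configure time, so checking the library's symbols (for example with nm) is still a useful cross-check.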

So when you run ML, is SuperLU used somehow?

Thx....John




On 4/6/21 6:28 AM, Elliott, James John wrote:
> I fat-fingered my final comment:
> So, I guess I am curious if Trilinos supports the case SuperLUDist **without** ParMETIS. Glancing at the superlu_dist.a library, I do see symbols for getting metis/parmetis. (I don't know the precise configure used for SuperLUDist when it was built)
>
> Sorry!
>
> On 4/6/21, 4:09 AM, "Trilinos-Users on behalf of Elliott, James John" <trilinos-users-bounces at trilinos.org on behalf of jjellio at sandia.gov> wrote:
>
>      John,
>      
>      I checked on our mini Cori. A few things:
>      
>      I tried using the mojo that our CI toolchains use for this platform (ATDM environment with ats1-haswell-intel-relese) - the following bit is a shorthand used in some of our apps+CI - on the mini Cori (ATS1), we have TPLs built that the CI framework uses for nightly testing. (I used a slightly modified version of your CMake though - not the SNL 'atdm shortcuts')
>      
>      1. On that platform, we don't support GNU - so I figured I'd just try Intel.
>      2. I then saw `-DTPL_ENABLE_SuperLUDist_Without_ParMETIS:BOOL=TRUE`
>      In the CMake script - I do not believe that is a combo we test.
>      
>      3. When I spun off a build against trilinos/develop, Amesos cries:
>      ```
>      Processing enabled package: Amesos (Libs, Examples)
>      CMake Error at packages/amesos/CMakeLists.txt:15 (MESSAGE):
>        The Amesos support for the SuperLUIDist TPL requires the ParMETIS TPL.
>        Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.
>      ```
>      
>      4. if I enable ParMETIS, I see this at the end of configure:
>      Unused:  Trilinos_ENABLE_SuperLU5_API (Maybe this is not needed? Or is my SuperLUDist version high/low enough to negate it?)
>      My SuperLUDist is: superlu_dist-5.4.0
>      
>      
>      5. If I keep `-DTPL_ENABLE_SuperLUDist_Without_ParMETIS:BOOL=TRUE` and add ParMETIS, Amesos will configure:
>      ```
>      Processing enabled package: Amesos (Libs, Examples)
>      -- Amesos_example_AmesosFactory_Tridiag: NOT added test because Amesos_ENABLE_TESTS='OFF'.
>      -- Amesos_example_AmesosFactory: NOT added test because Amesos_ENABLE_TESTS='OFF'.
>      -- Amesos_example_AmesosFactory_HB: NOT added test because Amesos_ENABLE_TESTS='OFF'.
>      -- Amesos_compare_solvers: NOT added test because Amesos_ENABLE_TESTS='OFF'.
>      -- Amesos_a_trivial_mpi_test: NOT added test because Amesos_ENABLE_TESTS='OFF'.
>      ```
>      
>      6. Curiously, if you configure with (5)
>      You will get:
>      ```
>      CMake Warning:
>        Manually-specified variables were not used by the project:
>      
>          TPL_ENABLE_SuperLUDist_Without_ParMETIS
>          Trilinos_ENABLE_SuperLU5_API
>      ```
>      
>      I am on github develop (not release v13)
>      
>      
>      So, I guess I am curious if Trilinos supports the case SuperLUDist w/ParMETIS. Glancing at the superlu_dist.a library, I do see symbols for getting metis/parmetis. (I don't know the precise configure used for SuperLUDist when it was built)
>      
>      include/superlu_dist_config.h:
>      ```
>      /* superlu_dist_config.h.in */
>        
>      /* Enable parmetis */
>      #define HAVE_PARMETIS TRUE
>      
>      /* Enable CombBLAS */
>      /* #undef HAVE_COMBBLAS */
>      
>      /* enable 64bit index mode */
>      /* #undef XSDK_INDEX_SIZE */
>      
>      #if (XSDK_INDEX_SIZE == 64)
>      #define _LONGINT 1
>      #endif
>      ```
>      
>      
>      
>      On 3/29/21, 3:52 PM, "Trilinos-Users on behalf of Elliott, James John" <trilinos-users-bounces at trilinos.org on behalf of jjellio at sandia.gov> wrote:
>      
>          John that's odd.
>          
>          Cori performance variations usually happen as you scale out to multiple nodes (and you end up with an allocation + other users that causes bad routing performance).
>          
>          It may be easier to post on github
>          
>          If you can give me your slurm: sbatch or salloc commands/script. A list of the modules used, and then your srun ( plus app name + flags you give it). I can try to reproduce this on our miniature Cori (trinity testbed at SNL). I no longer have access to NERSC (I was part of the KNL early access program on Cori).
>          
>          If you are somehow running the Haswell binary on KNL, this could explain a marked slowdown.
>          On Cori, you usually have to salloc/sbatch with -C haswell.
>          
>          A Haswell binary will run on KNL, but a KNL binary will not run on Haswell.
>          
>          Your loaded modules can also have some impacts on performance (even though the binary may be static)
>          
>          Jonathan, Chris, and I did run MueLu a reasonable amount on Cori during the early access. The main culprits (then) were large-scale perf variations and tracking down issues in MueLu's repartitioning routines (avoiding many-to-one communications)
>          
>          James
>          
>          On 3/29/21, 6:11 AM, "Trilinos-Users on behalf of John Cary" <trilinos-users-bounces at trilinos.org on behalf of cary at colorado.edu> wrote:
>          
>              Thanks, James.  So I did
>              
>              srun -n 32 --distribution=block,block -c 2
>              /global/cscratch1/sd/cary/builds-cori-gcc/vsimall-cori-gcc/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe
>              
>              but I am still seeing the same single-node scaling of dropping to 25%
>              parallel efficiency.
>              
>              I can see that it is not the fault of ML, because on my own local
>              cluster, which has two
>              AMD EPYC 7302 16-Core Processor per node, the single-node parallel
>              efficiency at 32 processes
>              is 82%.
>              
>              So I guess I still do not know how best to launch on cori.
>              
>              Thx.....John
>              
>              
>              On 3/28/21 6:18 PM, James Elliott wrote:
>              > # cores per proc is usually between 1 and 16 (fill up one socket)
>              >
>              > I may be off... been a while since I ran there. FYI, cori was really
>              > noisy.
>              >
>              > cores_per_proc=1
>              > John, I believe the usual Cori/Haswell slurm launch should look like:
>              >
>              > srun_opts=(
>              > # use cores,v if you want verbosity
>              > --cpu_bind=cores
>              > -c $(($cores_per_proc*2))
>              > # distribution puts ranks on nodes, then sockets
>              > # block,block - is like aprun default, which fills
>              > # a socket on a node, then the next socket on the same node
>              > # then the next node...
>              > # block,cyclic is/was the default on Cori
>              > # that will put rank0 on socket0, rank1 on socket1 (same node)
>              > # and repeat until the node is full. (it will stride your procs
>              > # between the sockets on the node)
>              > # This detail caused a few apps pain when Trinity swapped from
>              > # aprun.
>              > # Pick block,block or block,cyclic
>              > --distribution=block,block
>              > # the usual -n -N stuff
>              > )
>              >
>              > srun "${srun_opts[@]}" ./app ....
>              >
>              > On 3/28/2021 5:23 PM, John Cary wrote:
>              >> Hi All,
>              >>
>              >> As promised, we have done scaling studies on the haswell nodes on
>              >> Cori at NERSC using ML_preconditioner.exe
>              >> as compiled, so this is a weak scaling study with 65536 cells/nodes
>              >> per processor.  We find a parallel efficiency
>              >> (speedup/expected speedup) that drops to 25% on 32 processes.
>              >>
>              >> Is this expected?
>              >>
>              >> Are there command line args to srun that might improve this?  (I
>              >> tried various args to --cpu-bind.)
>              >>
>              >> I can provide plenty more info (configuration line, how run, ...).
>              >>
>              >> Thx.....John
>              >>
>              >> On 3/24/21 9:05 AM, John Cary wrote:
>              >>>
>              >>>
>              >>> Thanks, Chris, thanks Jonathan,
>              >>>
>              >>> I have found these executables, and we are doing scaling studies now.
>              >>>
>              >>> Will report....John
>              >>>
>              >>>
>              >>>
>              >>> On 3/23/21 9:42 PM, Siefert, Christopher wrote:
>              >>>> John,
>              >>>>
>              >>>> There are some scaling examples in
>              >>>> trilinoscouplings/examples/scaling (example_Poisson.cpp and
>              >>>> example_Poisson2D.cpp) that use the old stack and might do what you
>              >>>> need.
>              >>>>
>              >>>> -Chris
>              >>>
>              >>>
>              >>> On 3/23/21 7:48 PM, Hu, Jonathan wrote:
>              >>>> Hi John,
>              >>>>
>              >>>>     ML has a 2D Poisson driver in
>              >>>> ml/examples/BasicExamples/ml_preconditioner.cpp.  The cmake target
>              >>>> should be either "ML_preconditioner" or "ML_preconditioner.exe".
>              >>>> There's a really similar one in ml/examples/XML/ml_XML.cpp that you
>              >>>> can drive with an XML deck. Is this what you're after?
>              >>>>
>              >>>> Jonathan
>              >>>>
>              >>>> On 3/23/21, 5:47 PM, "Trilinos-Users on behalf of John Cary"
>              >>>> <trilinos-users-bounces at trilinos.org on behalf of
>              >>>> cary at colorado.edu> wrote:
>              >>>>
>              >>>>      We are still using the old stack: ML, Epetra, ...
>              >>>>
>              >>>>      When we run a simple Poisson solve on our cluster (32
>              >>>> cores/node), we
>              >>>>      see parallel efficiency drop to 4% on one node with 32 cores.
>              >>>> So we
>              >>>>      naturally believe we are doing something wrong.
>              >>>>
>              >>>>      Does trilinos come with a simple Poisson-solve executable that
>              >>>> we could
>              >>>>      use to test scaling (to get around the uncertainties of our
>              >>>> use of
>              >>>>      trilinos)?
>              >>>>
>              >>>>      Thx.......John Cary
>              >>>>
>              >>>>      _______________________________________________
>              >>>>      Trilinos-Users mailing list
>              >>>>      Trilinos-Users at trilinos.org
>              >>>> http://trilinos.org/mailman/listinfo/trilinos-users_trilinos.org
>              >>>>
>              >>>>
>              >>>>
>              >>>
>              >>
>              >>
>              >
>              
>              
>              
>          
>          
>      
>      
>
>

-------------- next part --------------
diff -ruN trilinos-13.0.0/cmake/tribits/win_interface/include/gettimeofday.c trilinos-13.0.0-new/cmake/tribits/win_interface/include/gettimeofday.c
--- trilinos-13.0.0/cmake/tribits/win_interface/include/gettimeofday.c	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/cmake/tribits/win_interface/include/gettimeofday.c	2020-10-31 13:03:07.386676573 -0600
@@ -1,4 +1,4 @@
-#include < time.h >
+#include <time.h>
 #include <Winsock2.h> /* to get timeval struct */
 
 struct timezone 
diff -ruN trilinos-13.0.0/packages/amesos/CMakeLists.txt trilinos-13.0.0-new/packages/amesos/CMakeLists.txt
--- trilinos-13.0.0/packages/amesos/CMakeLists.txt	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/amesos/CMakeLists.txt	2020-10-31 13:03:07.394676831 -0600
@@ -10,9 +10,11 @@
 # B) Set up package-specific options
 #
 
-# if using SuperLUDist, must also link in ParMETIS for some reason
-IF(${PACKAGE_NAME}_ENABLE_SuperLUDist AND NOT ${PACKAGE_NAME}_ENABLE_ParMETIS)
-  MESSAGE(FATAL_ERROR "The Amesos support for the SuperLUIDist TPL requires the ParMETIS TPL.  Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.")
+# One can now configure SuperLUDist without ParMETIS
+if (NOT TPL_ENABLE_SuperLUDist_Without_ParMETIS)
+  IF(${PACKAGE_NAME}_ENABLE_SuperLUDist AND NOT ${PACKAGE_NAME}_ENABLE_ParMETIS)
+    MESSAGE(FATAL_ERROR "The Amesos support for the SuperLUDist TPL requires the ParMETIS TPL.  Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.")
+  ENDIF()
 ENDIF()
 
 IF(${PACKAGE_NAME}_ENABLE_PARAKLETE)
diff -ruN trilinos-13.0.0/packages/amesos2/src/Amesos2_Superlu_def.hpp trilinos-13.0.0-new/packages/amesos2/src/Amesos2_Superlu_def.hpp
--- trilinos-13.0.0/packages/amesos2/src/Amesos2_Superlu_def.hpp	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/amesos2/src/Amesos2_Superlu_def.hpp	2020-10-31 13:23:37.114035426 -0600
@@ -747,6 +747,7 @@
 
     // ILU parameters
 
+#if (SUPERLU_MAJOR_VERSION < 5)
     setStringToIntegralParameter<SLU::rowperm_t>("RowPerm", "LargeDiag",
             "Type of row permutation strategy to use",
             tuple<string>("NOROWPERM","LargeDiag","MY_PERMR"),
@@ -758,6 +759,22 @@
             SLU::MY_PERMR),
             pl.getRawPtr());
 
+#else
+    setStringToIntegralParameter<SLU::rowperm_t>("RowPerm", "NOROWPERM",
+            "Type of row permutation strategy to use",
+            tuple<string>("NOROWPERM","LargeDiag_MC64", "LargeDiag_AWPM",
+              "MY_PERMR"),
+            tuple<string>("Use natural ordering",
+            "Use weighted bipartite matching algorithm (not for serial)",
+            "Parallelizable approximate matching algorithm (not for serial)",
+            "Use the ordering given in perm_r input"),
+            tuple<SLU::rowperm_t>(SLU::NOROWPERM,
+            SLU::LargeDiag_MC64,
+            SLU::LargeDiag_AWPM,
+            SLU::MY_PERMR),
+            pl.getRawPtr());
+#endif
+
     /*setStringToIntegralParameter<SLU::rule_t>("ILU_DropRule", "DROP_BASIC",
             "Type of dropping strategy to use",
             tuple<string>("DROP_BASIC","DROP_PROWS",
diff -ruN trilinos-13.0.0/packages/kokkos/cmake/kokkos_arch.cmake trilinos-13.0.0-new/packages/kokkos/cmake/kokkos_arch.cmake
--- trilinos-13.0.0/packages/kokkos/cmake/kokkos_arch.cmake	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/kokkos/cmake/kokkos_arch.cmake	2020-10-31 13:03:07.403677121 -0600
@@ -296,6 +296,14 @@
   )
 ENDIF()
 
+# From https://github.com/kokkos/kokkos/pull/2977/commits/9f6f9f8ecd320470d25e0094603c0255ff6afb40
+# Clang needs mcx16 option enabled for Windows atomic functions
+IF (CMAKE_CXX_COMPILER_ID STREQUAL Clang AND WIN32)
+  COMPILER_SPECIFIC_FLAGS(
+    Clang -mcx16
+  )
+ENDIF()
+
 #Right now we cannot get the compiler ID when cross-compiling, so just check
 #that HIP is enabled
 IF (Kokkos_ENABLE_HIP)
diff -ruN trilinos-13.0.0/packages/kokkos/core/src/Kokkos_Macros.hpp trilinos-13.0.0-new/packages/kokkos/core/src/Kokkos_Macros.hpp
--- trilinos-13.0.0/packages/kokkos/core/src/Kokkos_Macros.hpp	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/kokkos/core/src/Kokkos_Macros.hpp	2020-10-31 13:03:07.411677379 -0600
@@ -633,8 +633,10 @@
 #define KOKKOS_ATTRIBUTE_NODISCARD
 #endif
 
-#if defined(KOKKOS_COMPILER_GNU) || defined(KOKKOS_COMPILER_CLANG) || \
-    defined(KOKKOS_COMPILER_INTEL) || defined(KOKKOS_COMPILER_PGI)
+// From https://github.com/kokkos/kokkos/pull/2977/commits/9f6f9f8ecd320470d25e0094603c0255ff6afb40
+#if (defined(KOKKOS_COMPILER_GNU) || defined(KOKKOS_COMPILER_CLANG) || \
+    defined(KOKKOS_COMPILER_INTEL) || defined(KOKKOS_COMPILER_PGI)) && \
+    !defined(KOKKOS_COMPILER_MSVC)
 #define KOKKOS_IMPL_ENABLE_STACKTRACE
 #define KOKKOS_IMPL_ENABLE_CXXABI
 #endif
diff -ruN trilinos-13.0.0/packages/shylu/shylu_node/hts/src/shylu_hts_impl_def.hpp trilinos-13.0.0-new/packages/shylu/shylu_node/hts/src/shylu_hts_impl_def.hpp
--- trilinos-13.0.0/packages/shylu/shylu_node/hts/src/shylu_hts_impl_def.hpp	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/shylu/shylu_node/hts/src/shylu_hts_impl_def.hpp	2020-10-31 13:03:07.422677733 -0600
@@ -104,11 +104,11 @@
   T* c, blas_int ldc);
 
 extern "C" {
-  void F77_BLAS_MANGLE(sgemm,SGEMM)(
+  int F77_BLAS_MANGLE(sgemm,SGEMM)(
     const char*, const char*, const blas_int*, const blas_int*, const blas_int*,
     const float*, const float*, const blas_int*, const float*, const blas_int*,
     const float*, float*, const blas_int*);
-  void F77_BLAS_MANGLE(dgemm,DGEMM)(
+  int F77_BLAS_MANGLE(dgemm,DGEMM)(
     const char*, const char*, const blas_int*, const blas_int*, const blas_int*,
     const double*, const double*, const blas_int*, const double*, const blas_int*,
     const double*, double*, const blas_int*);
diff -ruN trilinos-13.0.0/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp trilinos-13.0.0-new/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp
--- trilinos-13.0.0/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp	2020-08-05 19:22:40.000000000 -0600
+++ trilinos-13.0.0-new/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp	2020-10-31 13:03:07.434678120 -0600
@@ -59,7 +59,7 @@
 #include "Xpetra_StridedMapFactory.hpp"
 #include "Xpetra_StridedMap.hpp"
 
-#include <execinfo.h>
+// #include <execinfo.h>
 
 #ifdef HAVE_XPETRA_EPETRA
 #include <Xpetra_EpetraCrsMatrix_fwd.hpp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vcloud.txcorp.com-superlu_dist-parcomm-config.sh
Type: application/x-sh
Size: 983 bytes
Desc: not available
URL: <http://trilinos.org/pipermail/trilinos-users_trilinos.org/attachments/20210406/8f3671a1/attachment.sh>
