[Trilinos-Users] [EXTERNAL] Re: Results from a scaling study of ML

Elliott, James John jjellio at sandia.gov
Tue Apr 6 05:58:22 MST 2021


John - I didn't push through a build and run, since this seems pretty important.

I can't speak to ML's behavior (which is why I replied back to the list). In general, I've found MueLu leans on Amesos (1 or 2) for its level solvers, and Amesos then pulls in the external solvers.

Let me see how it goes with a patch to Amesos.

On 4/6/21, 6:52 AM, "Trilinos-Users on behalf of John Cary" <trilinos-users-bounces at trilinos.org on behalf of cary at colorado.edu> wrote:

    Hi James,
    
    I had forgotten that we patch Trilinos to get it to build without
    ParMETIS/METIS.  We cannot include those in the build chain, as they have
    a commercial license.
    
    We use SuperLU_Dist-5.4.0.
    
    Our patch for that is
    
    diff -ruN trilinos-13.0.0/packages/amesos/CMakeLists.txt trilinos-13.0.0-new/packages/amesos/CMakeLists.txt
    --- trilinos-13.0.0/packages/amesos/CMakeLists.txt      2020-08-05 19:22:40.000000000 -0600
    +++ trilinos-13.0.0-new/packages/amesos/CMakeLists.txt  2020-10-31 13:03:07.394676831 -0600
    @@ -10,9 +10,11 @@
      # B) Set up package-specific options
      #
    
    -# if using SuperLUDist, must also link in ParMETIS for some reason
    -IF(${PACKAGE_NAME}_ENABLE_SuperLUDist AND NOT ${PACKAGE_NAME}_ENABLE_ParMETIS)
    -  MESSAGE(FATAL_ERROR "The Amesos support for the SuperLUIDist TPL requires the ParMETIS TPL.  Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.")
    +# One can now configure SuperLUDist without ParMETIS
    +if (NOT TPL_ENABLE_SuperLUDist_Without_ParMETIS)
    +  IF(${PACKAGE_NAME}_ENABLE_SuperLUDist AND NOT ${PACKAGE_NAME}_ENABLE_ParMETIS)
    +    MESSAGE(FATAL_ERROR "The Amesos support for the SuperLUDist TPL requires the ParMETIS TPL.  Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.")
    +  ENDIF()
      ENDIF()
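
    For completeness, a minimal sketch of the Trilinos configure flags that this
    guard is meant to allow (the exact package enables and source path here are
    illustrative assumptions, not our actual configure script):

       cmake \
          -DTrilinos_ENABLE_ML:BOOL=ON \
          -DTrilinos_ENABLE_Amesos:BOOL=ON \
          -DTPL_ENABLE_SuperLUDist:BOOL=ON \
          -DTPL_ENABLE_ParMETIS:BOOL=OFF \
          -DTPL_ENABLE_SuperLUDist_Without_ParMETIS:BOOL=TRUE \
          ../trilinos-13.0.0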
    
    Our full patch is attached.  It has some pretty small changes to also get
    Trilinos to build for us on Windows, where we use LLVM-10.  I also had to
    add a fix for SuperLU version < 5, which works for me, but I am not sure
    whether it is right.  I suppose I should try submitting PRs again, but I
    will have to reproduce the reasons for the PRs.
    
    Our SuperLU_Dist configuration includes
    
       -Denable_parmetislib:BOOL='OFF' \
       -DXSDK_ENABLE_Fortran:BOOL='OFF' \
       -Denable_blaslib:BOOL='OFF' \
    
    I attach that full configure script for your reference.
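
    In case it is useful, a hypothetical minimal SuperLU_Dist-5.4.0 configure
    without ParMETIS would look roughly like the following (the install prefix,
    BLAS variable, and source directory name are placeholders, not our actual
    script, which is in the attachment):

       cmake \
          -DCMAKE_INSTALL_PREFIX=$SUPERLU_DIST_DIR \
          -Denable_parmetislib:BOOL=OFF \
          -DXSDK_ENABLE_Fortran:BOOL=OFF \
          -Denable_blaslib:BOOL=OFF \
          -DTPL_BLAS_LIBRARIES=$BLAS_LIBS \
          ../superlu_dist-5.4.0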
    
    So when you run ML, is SuperLU used somehow?
    
    Thx....John
    
    
    
    
    On 4/6/21 6:28 AM, Elliott, James John wrote:
    > I fat-fingered my final comment.  It should read:
    > So, I guess I am curious if Trilinos supports the case SuperLUDist **without**
    > ParMETIS.  Glancing at the superlu_dist.a library, I do see symbols for getting
    > metis/parmetis.  (I don't know the precise configure used for SuperLUDist when
    > it was built.)
    >
    > Sorry!
    >
    > On 4/6/21, 4:09 AM, "Trilinos-Users on behalf of Elliott, James John" <trilinos-users-bounces at trilinos.org on behalf of jjellio at sandia.gov> wrote:
    >
    >      John,
    >      
    >      I checked on our mini Cori. A few things:
    >      
    >      I tried using the mojo that our CI toolchains use for this platform (the ATDM environment with ats1-haswell-intel-release); that is shorthand used in some of our apps + CI.  On the mini Cori (ATS1), we have TPLs built that the CI framework uses for nightly testing.  (I used a slightly modified version of your CMake script, though - not the SNL 'atdm shortcuts'.)
    >      
    >      1. On that platform, we don't support GNU - so I figured I'd just try Intel.
    >      2. I then saw `-DTPL_ENABLE_SuperLUDist_Without_ParMETIS:BOOL=TRUE`
    >      in the CMake script - I do not believe that is a combo we test.
    >      
    >      3. When I spun off a build against trilinos/develop, Amesos cries:
    >      ```
    >      Processing enabled package: Amesos (Libs, Examples)
    >      CMake Error at packages/amesos/CMakeLists.txt:15 (MESSAGE):
    >        The Amesos support for the SuperLUIDist TPL requires the ParMETIS TPL.
    >        Either disable Amesos SuperLUDist support or enable the ParMETIS TPL.
    >      ```
    >      
    >      4. If I enable ParMETIS, I see this at the end of configure:
    >      Unused:  Trilinos_ENABLE_SuperLU5_API (Maybe this is not needed? Or is my SuperLUDist version high/low enough to negate it?)
    >      My SuperLUDist is: superlu_dist-5.4.0
    >      
    >      
    >      5. If I keep `-DTPL_ENABLE_SuperLUDist_Without_ParMETIS:BOOL=TRUE` and add ParMETIS, Amesos will configure:
    >      ```
    >      Processing enabled package: Amesos (Libs, Examples)
    >      -- Amesos_example_AmesosFactory_Tridiag: NOT added test because Amesos_ENABLE_TESTS='OFF'.
    >      -- Amesos_example_AmesosFactory: NOT added test because Amesos_ENABLE_TESTS='OFF'.
    >      -- Amesos_example_AmesosFactory_HB: NOT added test because Amesos_ENABLE_TESTS='OFF'.
    >      -- Amesos_compare_solvers: NOT added test because Amesos_ENABLE_TESTS='OFF'.
    >      -- Amesos_a_trivial_mpi_test: NOT added test because Amesos_ENABLE_TESTS='OFF'.
    >      ```
    >      
    >      6. Curiously, if you configure as in (5), you will get:
    >      ```
    >      CMake Warning:
    >        Manually-specified variables were not used by the project:
    >      
    >          TPL_ENABLE_SuperLUDist_Without_ParMETIS
    >          Trilinos_ENABLE_SuperLU5_API
    >      ```
    >      
    >      I am on GitHub develop (not release v13).
    >      
    >      
    >      So, I guess I am curious if Trilinos supports the case SuperLUDist w/ParMETIS. Glancing at the superlu_dist.a library, I do see symbols for getting metis/parmetis. (I don't know the precise configure used for SuperLUDist when it was built)
    >      
    >      include/superlu_dist_config.h:
    >      ```
    >      /* superlu_dist_config.h.in */
    >        
    >      /* Enable parmetis */
    >      #define HAVE_PARMETIS TRUE
    >      
    >      /* Enable CombBLAS */
    >      /* #undef HAVE_COMBBLAS */
    >      
    >      /* enable 64bit index mode */
    >      /* #undef XSDK_INDEX_SIZE */
    >      
    >      #if (XSDK_INDEX_SIZE == 64)
    >      #define _LONGINT 1
    >      #endif
    >      ```
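    >      
    >      For what it's worth, one quick way to check how a given SuperLU_Dist
    >      install was built (the paths below are placeholders for wherever your
    >      install lives) is something like:
    >      ```
    >      nm libsuperlu_dist.a | grep -i parmetis
    >      grep -i PARMETIS include/superlu_dist_config.h
    >      ```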
    >      
    >      
    >      
    >      On 3/29/21, 3:52 PM, "Trilinos-Users on behalf of Elliott, James John" <trilinos-users-bounces at trilinos.org on behalf of jjellio at sandia.gov> wrote:
    >      
    >          John, that's odd.
    >          
    >          Cori performance variations usually happen as you scale out to multiple nodes (you end up with an allocation + other users that cause bad routing performance).
    >          
    >          It may be easier to post this on GitHub.
    >          
    >          If you can give me your Slurm sbatch or salloc commands/script, a list of the modules used, and then your srun line (plus the app name + flags you give it), I can try to reproduce this on our miniature Cori (the Trinity testbed at SNL).  I no longer have access to NERSC (I was part of the KNL early access program on Cori).
    >          
    >          If you are somehow running the Haswell binary on KNL, this could explain a marked slowdown.
    >          On Cori, you usually have to salloc/sbatch with -C haswell.
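    >          
    >          For example, grabbing a single Haswell node interactively looks
    >          something like this (the node count and time limit are placeholders):
    >          
    >          salloc -N 1 -C haswell -t 00:30:00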
    >          
    >          A Haswell binary will run on KNL, but a KNL binary will not run on Haswell.
    >          
    >          Your loaded modules can also have some impact on performance (even though the binary may be static).
    >          
    >          Jonathan, Chris, and I did run MueLu a reasonable amount on Cori during the early access.  The main culprits (then) were large-scale perf variations and tracking down issues in MueLu's repartitioning routines (avoiding many-to-one communications).
    >          
    >          James
    >          
    >          On 3/29/21, 6:11 AM, "Trilinos-Users on behalf of John Cary" <trilinos-users-bounces at trilinos.org on behalf of cary at colorado.edu> wrote:
    >          
    >              Thanks, James.  So I did
    >              
    >              srun -n 32 --distribution=block,block -c 2
    >              /global/cscratch1/sd/cary/builds-cori-gcc/vsimall-cori-gcc/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe
    >              
    >              but I am still seeing the same single-node scaling, with parallel
    >              efficiency dropping to 25%.
    >              
    >              I can see that it is not the fault of ML, because on my own local
    >              cluster, which has two
    >              AMD EPYC 7302 16-Core Processors per node, the single-node parallel
    >              efficiency at 32 processes
    >              is 82%.
    >              
    >              So I guess I still do not know how best to launch on Cori.
    >              
    >              Thx.....John
    >              
    >              
    >              On 3/28/21 6:18 PM, James Elliott wrote:
    >              > # cores per proc is usually between 1 and 16 (fill up one socket)
    >              >
    >              > I may be off... been a while since I ran there. FYI, Cori was really
    >              > noisy.
    >              >
    >              > cores_per_proc=1
    >              > John, I believe the usual Cori/Haswell slurm launch should look like:
    >              >
    >              > srun_opts=(
    >              > # use cores,v if you want verbosity
    >              > --cpu_bind=cores
    >              > -c $(($cores_per_proc*2))
    >              > # distribution puts ranks on nodes, then sockets
    >              > # block,block - is like aprun default, which fills
    >              > # a socket on a node, then the next socket on the same node
    >              > # then the next node...
    >              > # block,cyclic is/was the default on Cori
    >              > # that will put rank0 on socket0, rank1 on socket1 (same node)
    >              > # and repeat until the node is full. (it will stride your procs
    >              > # between the sockets on the node)
    >              > # This detail caused a few apps pain when Trinity swapped from
    >              > # aprun.
    >              > # Pick block,block or block,cyclic
    >              > --distribution=block,block
    >              > # the usual -n -N stuff
    >              > )
    >              >
    >              > srun "${srun_opts[@]}" ./app ....
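    >              >
    >              > For instance, with cores_per_proc=1 as above, a 32-rank run on a
    >              > single Haswell node using the ML example from this thread would be
    >              > something like (a sketch, not a command I have tested):
    >              >
    >              > srun -N 1 -n 32 "${srun_opts[@]}" ./ML_preconditioner.exe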
    >              >
    >              > On 3/28/2021 5:23 PM, John Cary wrote:
    >              >> Hi All,
    >              >>
    >              >> As promised, we have done scaling studies on the Haswell nodes on
    >              >> Cori at NERSC using ML_preconditioner.exe
    >              >> as compiled, so this is a weak scaling study with 65536 cells/nodes
    >              >> per processor.  We find a parallel efficiency
    >              >> (speedup/expected speedup) that drops to 25% on 32 processes.
    >              >>
    >              >> Is this expected?
    >              >>
    >              >> Are there command line args to srun that might improve this?  (I
    >              >> tried various args to --cpu-bind.)
    >              >>
    >              >> I can provide plenty more info (configuration line, how run, ...).
    >              >>
    >              >> Thx.....John
    >              >>
    >              >> On 3/24/21 9:05 AM, John Cary wrote:
    >              >>>
    >              >>>
    >              >>> Thanks, Chris, thanks Jonathan,
    >              >>>
    >              >>> I have found these executables, and we are doing scaling studies now.
    >              >>>
    >              >>> Will report....John
    >              >>>
    >              >>>
    >              >>>
    >              >>> On 3/23/21 9:42 PM, Siefert, Christopher wrote:
    >              >>>> John,
    >              >>>>
    >              >>>> There are some scaling examples in
    >              >>>> trilinoscouplings/examples/scaling (example_Poisson.cpp and
    >              >>>> example_Poisson2D.cpp) that use the old stack and might do what you
    >              >>>> need.
    >              >>>>
    >              >>>> -Chris
    >              >>>
    >              >>>
    >              >>> On 3/23/21 7:48 PM, Hu, Jonathan wrote:
    >              >>>> Hi John,
    >              >>>>
    >              >>>>     ML has a 2D Poisson driver in
    >              >>>> ml/examples/BasicExamples/ml_preconditioner.cpp.  The cmake target
    >              >>>> should be either "ML_preconditioner" or "ML_preconditioner.exe".
    >              >>>> There's a really similar one in ml/examples/XML/ml_XML.cpp that you
    >              >>>> can drive with an XML deck. Is this what you're after?
    >              >>>>
    >              >>>> Jonathan
    >              >>>>
    >              >>>> On 3/23/21, 5:47 PM, "Trilinos-Users on behalf of John Cary"
    >              >>>> <trilinos-users-bounces at trilinos.org on behalf of
    >              >>>> cary at colorado.edu> wrote:
    >              >>>>
    >              >>>>      We are still using the old stack: ML, Epetra, ...
    >              >>>>
    >              >>>>      When we run a simple Poisson solve on our cluster (32
    >              >>>> cores/node), we
    >              >>>>      see parallel efficiency drop to 4% on one node with 32 cores.
    >              >>>> So we
    >              >>>>      naturally believe we are doing something wrong.
    >              >>>>
    >              >>>>      Does trilinos come with a simple Poisson-solve executable that
    >              >>>> we could
    >              >>>>      use to test scaling (to get around the uncertainties of our
    >              >>>> use of
    >              >>>>      trilinos)?
    >              >>>>
    >              >>>>      Thx.......John Cary
    >              >>>>
    >              >>>>
    >              >>>>
    >              >>>>
    >              >>>
    >              >>
    >              >>
    >              >
    >              
    >              
    >              
    >          
    >          
    >      
    >      
    >
    >
    
    



More information about the Trilinos-Users mailing list