[Trilinos-Users] [EXTERNAL] Re: Results from a scaling study of ML

John Cary cary at colorado.edu
Tue Mar 30 05:12:17 MST 2021


Thanks for looking at this.

I attach several scripts.  The first is the configuration script, which
I first run with -DTrilinos_ENABLE_EXAMPLES=FALSE, and the project is
built.  (Configuring with TRUE from the start does not build.)
Then I set -DTrilinos_ENABLE_EXAMPLES=TRUE, reconfigure and do

cd packages/ml/examples/BasicExamples
make ML_preconditioner

which gives me ML_preconditioner.exe in
/global/cscratch1/sd/cary/builds-cori-gcc/vorpalall-cori-gcc-dev1/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples
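
In outline, the whole sequence is the following (a rough sketch only; the
configure-script name is the one in the attachment below, and I take the
build directory to be the parcomm directory that appears in the path above):

  # configure and build with examples disabled
  cd parcomm
  ./cori.nersc.gov-trilinos-parcomm-config.sh   # has -DTrilinos_ENABLE_EXAMPLES=FALSE
  make
  # switch the flag to -DTrilinos_ENABLE_EXAMPLES=TRUE in the script,
  # reconfigure, and build just the one example
  ./cori.nersc.gov-trilinos-parcomm-config.sh
  cd packages/ml/examples/BasicExamples
  make ML_preconditioner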

Also attached is the sbatch script, which contains the srun invocation
showing the path to the test executable.  That works out to

srun -n 16 --distribution=block,block -c 2 
/global/cscratch1/sd/cary/builds-cori-gcc/vorpalall-cori-gcc-dev1/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe

in a particular case.

I also attach the output of "module list".

The script is submitted with

   sbatch -C haswell ML_preconditionerCori.sh
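
In outline, ML_preconditionerCori.sh looks like this (a rough sketch; the
node count and time limit are placeholders, -C haswell is supplied on the
sbatch command line above, and the srun line is the one shown earlier):

  #!/bin/bash
  #SBATCH -N 1              # placeholder: one 32-core Haswell node
  #SBATCH -t 00:30:00       # placeholder time limit

  # 16 ranks, 2 logical CPUs per rank; block,block fills one socket,
  # then the other socket on the same node, then the next node
  srun -n 16 --distribution=block,block -c 2 \
      /global/cscratch1/sd/cary/builds-cori-gcc/vorpalall-cori-gcc-dev1/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe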

The timings are

NTASKS        time
     1    0.087072
     2    0.090850
     4    0.162583
     8    0.197526
    16    0.216090
    32    0.423770

Because this is a weak scaling study, I derive a parallel efficiency at
32 tasks of 0.087072/0.423770 = 20.5%.
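
(For reference, the efficiency at each task count works out with a
one-liner like the following, assuming the NTASKS/time pairs above are
saved, without the header line, in a file called timings.txt:)

  # weak scaling: efficiency(N) = T(1) / T(N)
  awk 'NR==1 {t1=$2} {printf "%6d  %10.6f  %6.1f%%\n", $1, $2, 100*t1/$2}' timings.txt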

As noted before, I have seen better parallel efficiency (70-80%) on our
own cluster, so I figure I am not launching with the best parameters, but
I do not know what those would be.

Thanks for any help!

John Cary




On 3/29/21 3:50 PM, Elliott, James John wrote:
> John, that's odd.
>
> Cori performance variations usually happen as you scale out to multiple nodes (and you end up with an allocation plus other users that cause bad routing performance).
>
> It may be easier to post on GitHub.
>
> If you can give me your Slurm sbatch or salloc commands/script, a list of the modules used, and your srun invocation (plus app name and the flags you give it), I can try to reproduce this on our miniature Cori (the Trinity testbed at SNL). I no longer have access to NERSC (I was part of the KNL early access program on Cori).
>
> If you are somehow running the Haswell binary on KNL, this could explain a marked slowdown.
> On Cori, you usually have to salloc/sbatch with -C haswell.
>
> A Haswell binary will run on KNL, but a KNL binary will not run on Haswell.
>
> Your loaded modules can also have some impact on performance (even though the binary may be static).
>
> Jonathan, Chris, and I did run MueLu a reasonable amount on Cori during the early access. The main culprits (then) were large-scale performance variations and tracking down issues in MueLu's repartitioning routines (avoiding many-to-one communications).
>
> James
>
> On 3/29/21, 6:11 AM, "Trilinos-Users on behalf of John Cary" <trilinos-users-bounces at trilinos.org on behalf of cary at colorado.edu> wrote:
>
>      Thanks, James.  So I did
>      
>      srun -n 32 --distribution=block,block -c 2
>      /global/cscratch1/sd/cary/builds-cori-gcc/vsimall-cori-gcc/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe
>      
>      but I am still seeing the same single-node scaling, with the parallel
>      efficiency dropping to 25%.
>      
>      I can see that it is not the fault of ML, because on my own local
>      cluster, which has two AMD EPYC 7302 16-core processors per node,
>      the single-node parallel efficiency at 32 processes is 82%.
>      
>      So I guess I still do not know how best to launch on Cori.
>      
>      Thx.....John
>      
>      
>      On 3/28/21 6:18 PM, James Elliott wrote:
>      > # cores per proc is usually between 1 and 16 (fill up one socket)
>      >
>      > I may be off... it has been a while since I ran there. FYI, Cori was
>      > really noisy.
>      >
>      > cores_per_proc=1
>      > John, I believe the usual Cori/Haswell Slurm launch should look like:
>      >
>      > srun_opts=(
>      > # use cores,v if you want verbosity
>      > --cpu_bind=cores
>      > -c $(($cores_per_proc*2))
>      > # distribution puts ranks on nodes, then sockets
>      > # block,block - is like aprun default, which fills
>      > # a socket on a node, then the next socket on the same node
>      > # then the next node...
>      > # block,cyclic is/was the default on Cori
>      > # that will put rank0 on socket0, rank1 on socket1 (same node)
>      > # and repeat until the node is full. (it will stride your procs
>      > # between the sockets on the node)
>      > # This detail caused a few apps pain when Trinity swapped from
>      > # aprun.
>      > # Pick block,block or block,cyclic
>      > --distribution=block,block
>      > # the usual -n -N stuff
>      > )
>      >
>      > srun "${srun_opts[@]}" ./app ....
>      >
>      > On 3/28/2021 5:23 PM, John Cary wrote:
>      >> Hi All,
>      >>
>      >> As promised, we have done scaling studies on the Haswell nodes on
>      >> Cori at NERSC using ML_preconditioner.exe as compiled, so this is a
>      >> weak scaling study with 65536 cells/nodes per processor.  We find a
>      >> parallel efficiency (speedup/expected speedup) that drops to 25% at
>      >> 32 processes.
>      >>
>      >> Is this expected?
>      >>
>      >> Are there command-line args to srun that might improve this?  (I
>      >> tried various args to --cpu-bind.)
>      >>
>      >> I can provide plenty more info (configuration line, how run, ...).
>      >>
>      >> Thx.....John
>      >>
>      >> On 3/24/21 9:05 AM, John Cary wrote:
>      >>>
>      >>>
>      >>> Thanks, Chris, thanks Jonathan,
>      >>>
>      >>> I have found these executables, and we are doing scaling studies now.
>      >>>
>      >>> Will report....John
>      >>>
>      >>>
>      >>>
>      >>> On 3/23/21 9:42 PM, Siefert, Christopher wrote:
>      >>>> John,
>      >>>>
>      >>>> There are some scaling examples in
>      >>>> trilinoscouplings/examples/scaling (example_Poisson.cpp and
>      >>>> example_Poisson2D.cpp) that use the old stack and might do what you
>      >>>> need.
>      >>>>
>      >>>> -Chris
>      >>>
>      >>>
>      >>> On 3/23/21 7:48 PM, Hu, Jonathan wrote:
>      >>>> Hi John,
>      >>>>
>      >>>>     ML has a 2D Poisson driver in
>      >>>> ml/examples/BasicExamples/ml_preconditioner.cpp.  The cmake target
>      >>>> should be either "ML_preconditioner" or "ML_preconditioner.exe".
>      >>>> There's a really similar one in ml/examples/XML/ml_XML.cpp that you
>      >>>> can drive with an XML deck. Is this what you're after?
>      >>>>
>      >>>> Jonathan
>      >>>>
>      >>>> On 3/23/21, 5:47 PM, "Trilinos-Users on behalf of John Cary"
>      >>>> <trilinos-users-bounces at trilinos.org on behalf of
>      >>>> cary at colorado.edu> wrote:
>      >>>>
>      >>>>      We are still using the old stack: ML, Epetra, ...
>      >>>>
>      >>>>      When we run a simple Poisson solve on our cluster (32
>      >>>>      cores/node), we see parallel efficiency drop to 4% on one
>      >>>>      node with 32 cores.  So we naturally believe we are doing
>      >>>>      something wrong.
>      >>>>
>      >>>>      Does Trilinos come with a simple Poisson-solve executable
>      >>>>      that we could use to test scaling (to get around the
>      >>>>      uncertainties of our use of Trilinos)?
>      >>>>
>      >>>>      Thx.......John Cary
>      >>>>
>      >>>>      _______________________________________________
>      >>>>      Trilinos-Users mailing list
>      >>>>      Trilinos-Users at trilinos.org
>      >>>> http://trilinos.org/mailman/listinfo/trilinos-users_trilinos.org
>      >>>>
>      >>>>
>      >>>>
>      >>>
>      >>
>      >>
>      >> _______________________________________________
>      >> Trilinos-Users mailing list
>      >> Trilinos-Users at trilinos.org
>      >> http://trilinos.org/mailman/listinfo/trilinos-users_trilinos.org
>      >
>      
>      
>      _______________________________________________
>      Trilinos-Users mailing list
>      Trilinos-Users at trilinos.org
>      http://trilinos.org/mailman/listinfo/trilinos-users_trilinos.org
>      
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cori.nersc.gov-trilinos-parcomm-config.sh
Type: application/x-sh
Size: 3064 bytes
Desc: not available
URL: <http://trilinos.org/pipermail/trilinos-users_trilinos.org/attachments/20210330/62e129aa/attachment.sh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ML_preconditionerCori.sh
Type: application/x-sh
Size: 963 bytes
Desc: not available
URL: <http://trilinos.org/pipermail/trilinos-users_trilinos.org/attachments/20210330/62e129aa/attachment-0001.sh>
-------------- next part --------------
Currently Loaded Modulefiles:
  1) modules/3.2.11.4
  2) altd/2.0
  3) darshan/3.2.1
  4) craype-network-aries
  5) craype/2.6.2
  6) cray-mpich/7.7.10
  7) craype-haswell
  8) craype-hugepages2M
  9) cray-libsci/19.06.1
 10) udreg/2.3.2-7.0.1.1_3.47__g8175d3d.ari
 11) ugni/6.0.14.0-7.0.1.1_7.49__ge78e5b0.ari
 12) pmi/5.0.14
 13) dmapp/7.1.1-7.0.1.1_4.61__g38cf134.ari
 14) gni-headers/5.0.12.0-7.0.1.1_6.36__g3b1768f.ari
 15) xpmem/2.2.20-7.0.1.1_4.20__g0475745.ari
 16) job/2.2.4-7.0.1.1_3.47__g36b56f4.ari
 17) dvs/2.12_2.2.165-7.0.1.1_14.4__ge967908e
 18) alps/6.6.58-7.0.1.1_6.19__g437d88db.ari
 19) rca/2.2.20-7.0.1.1_4.61__g8e3fb5b.ari
 20) atp/2.1.3
 21) PrgEnv-gnu/6.0.5
 22) cmake/3.14.4
 23) gcc/8.3.0
 24) texlive/2019
 25) git-lfs/2.8.0

