[Trilinos-Users] [EXTERNAL] Re: Results from a scaling study of ML
John Cary
cary at colorado.edu
Tue Mar 30 05:12:17 MST 2021
Thanks for looking at this.
I attach several scripts. The first is the configuration script, which I
first run with -DTrilinos_ENABLE_EXAMPLES=FALSE, after which the project
builds. (With TRUE, the project does not build.)
Then I set -DTrilinos_ENABLE_EXAMPLES=TRUE, reconfigure, and do
   cd packages/ml/examples/BasicExamples
   make ML_preconditioner
which gives me ML_preconditioner.exe in
/global/cscratch1/sd/cary/builds-cori-gcc/vorpalall-cori-gcc-dev1/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples
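For reference, the whole build sequence is roughly the following (a sketch,
not my exact commands; where the configure script lives is illustrative, and
the reconfigure can equally be done by editing the script and rerunning it):

   cd trilinos-13.0.0/parcomm                    # the build directory
   ./cori.nersc.gov-trilinos-parcomm-config.sh   # with -DTrilinos_ENABLE_EXAMPLES=FALSE
   make
   cmake -DTrilinos_ENABLE_EXAMPLES=TRUE .       # flip the flag and reconfigure in place
   cd packages/ml/examples/BasicExamples
   make ML_preconditioner                        # builds only this example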
Attached is the sbatch script, which contains the srun invocation with the
path to the test executable. That works out to
srun -n 16 --distribution=block,block -c 2 \
   /global/cscratch1/sd/cary/builds-cori-gcc/vorpalall-cori-gcc-dev1/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe
in a particular case.
I also attach the output of "module list".
The script is submitted with
sbatch -C haswell ML_preconditionerCori.sh
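In outline, that sbatch script is essentially the following (a sketch, not
the attachment verbatim; the node count and walltime are placeholders, and
-n is what varies from 1 to 32 for the timings below):

   #!/bin/bash
   #SBATCH -N 1             # placeholder node count
   #SBATCH -t 00:30:00      # placeholder walltime
   # -C haswell is given on the sbatch command line, as above
   EXE=/global/cscratch1/sd/cary/builds-cori-gcc/vorpalall-cori-gcc-dev1/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe
   srun -n 16 --distribution=block,block -c 2 "$EXE"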
The timings are

   NTASKS      time
        1    0.087072
        2    0.090850
        4    0.162583
        8    0.197526
       16    0.216090
       32    0.423770
Because this is a weak scaling study, I derive a parallel efficiency at 32
tasks of T(1)/T(32) = 0.087072/0.423770 = 20.5%.
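The same efficiency, E(N) = T(1)/T(N), can be computed for every row of the
table; a minimal shell/awk sketch (timings copied from above):

   printf '%s\n' "1 0.087072" "2 0.090850" "4 0.162583" \
                 "8 0.197526" "16 0.216090" "32 0.423770" |
     awk 'NR==1 {t1=$2} {printf "%6d  %5.1f%%\n", $1, 100*t1/$2}'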
As noted before, I have seen better parallel efficiency on our own cluster
(70-80%), so I figure I am not launching with the best parameters on Cori,
but I do not know what those would be.
Thanks for any help!
John Cary
On 3/29/21 3:50 PM, Elliott, James John wrote:
> John, that's odd.
>
> Cori performance variations usually happen as you scale out to multiple nodes (you end up with an allocation, plus other users, that together cause bad routing performance).
>
> It may be easier to post on GitHub.
>
> If you can give me your Slurm sbatch or salloc commands/script, a list of the modules used, and your srun invocation (plus the app name and flags you give it), I can try to reproduce this on our miniature Cori (the Trinity testbed at SNL). I no longer have access to NERSC (I was part of the KNL early-access program on Cori).
>
> If you are somehow running the Haswell binary on KNL, this could explain a marked slowdown.
> On Cori, you usually have to salloc/sbatch with -C haswell.
>
> A Haswell binary will run on KNL, but a KNL binary will not run on Haswell.
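>
> A quick way to confirm which node type you actually landed on (standard Slurm/Linux commands, nothing specific to your build; "job.sh" is just a placeholder for your script):
>
>    sbatch -C haswell job.sh    # request Haswell nodes
>    sbatch -C knl     job.sh    # request KNL nodes
>    # inside the allocation, check the hardware the ranks actually see
>    srun -n 1 lscpu | grep -E 'Model name|Socket|Core'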
>
> Your loaded modules can also have some impact on performance (even though the binary may be static).
>
> Jonathan, Chris, and I did run MueLu a reasonable amount on Cori during the early access. The main culprits (then) were large-scale performance variations and tracking down issues in MueLu's repartitioning routines (avoiding many-to-one communication).
>
> James
>
> On 3/29/21, 6:11 AM, "Trilinos-Users on behalf of John Cary" <trilinos-users-bounces at trilinos.org on behalf of cary at colorado.edu> wrote:
>
> Thanks, James. So I did
>
> srun -n 32 --distribution=block,block -c 2
> /global/cscratch1/sd/cary/builds-cori-gcc/vsimall-cori-gcc/trilinos-13.0.0/parcomm/packages/ml/examples/BasicExamples/ML_preconditioner.exe
>
> but I am still seeing the same single-node scaling of dropping to 25%
> parallel efficiency.
>
> I can see that it is not the fault of ML, because on my own local
> cluster, which has two AMD EPYC 7302 16-core processors per node, the
> single-node parallel efficiency at 32 processes is 82%.
>
> So I guess I still do not know how best to launch on Cori.
>
> Thx.....John
>
>
> On 3/28/21 6:18 PM, James Elliott wrote:
> > # cores per proc is usually between 1 and 16 (fill up one socket)
> >
> > I may be off... been a while since I ran there. FYI, Cori was really
> > noisy.
> >
> > cores_per_proc=1
> > John, I believe the usual Cori/Haswell slurm launch should look like:
> >
> > srun_opts=(
> > # use cores,v if you want verbosity
> > --cpu_bind=cores
> > -c $(($cores_per_proc*2))
> > # distribution puts ranks on nodes, then sockets
> > # block,block - is like aprun default, which fills
> > # a socket on a node, then the next socket on the same node
> > # then the next node...
> > # block,cyclic is/was the default on Cori
> > # that will put rank0 on socket0, rank1 on socket1 (same node)
> > # and repeat until the node is full. (it will stride your procs
> > # between the sockets on the node)
> > # This detail caused a few apps pain when Trinity swapped from
> > # aprun.
> > # Pick block,block or block,cyclic
> > --distribution=block,block
> > # the usual -n -N stuff
> > )
> >
> > srun "${srun_opts[@]}" ./app ....
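> >
> > For one full Haswell node that might come out to something like this (32 ranks, one core per rank, so -c 2 with hyperthreads; the app name is yours):
> >
> > srun -N 1 -n 32 --cpu_bind=cores -c 2 --distribution=block,block ./ML_preconditioner.exe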
> >
> > On 3/28/2021 5:23 PM, John Cary wrote:
> >> Hi All,
> >>
> >> As promised, we have done scaling studies on the haswell nodes on
> >> Cori at NERSC using ML_preconditioner.exe
> >> as compiled, so this is a weak scaling study with 65536 cells/nodes
> >> per processor. We find a parallel efficiency
> >> (speedup/expected speedup) that drops to 25% on 32 processes.
> >>
> >> Is this expected?
> >>
> >> Are there command-line args to srun that might improve this? (I
> >> tried various args to --cpu-bind.)
> >>
> >> I can provide plenty more info (configuration line, how run, ...).
> >>
> >> Thx.....John
> >>
> >> On 3/24/21 9:05 AM, John Cary wrote:
> >>>
> >>>
> >>> Thanks, Chris, thanks Jonathan,
> >>>
> >>> I have found these executables, and we are doing scaling studies now.
> >>>
> >>> Will report....John
> >>>
> >>>
> >>>
> >>> On 3/23/21 9:42 PM, Siefert, Christopher wrote:
> >>>> John,
> >>>>
> >>>> There are some scaling examples in
> >>>> trilinoscouplings/examples/scaling (example_Poisson.cpp and
> >>>> example_Poisson2D.cpp) that use the old stack and might do what you
> >>>> need.
> >>>>
> >>>> -Chris
> >>>
> >>>
> >>> On 3/23/21 7:48 PM, Hu, Jonathan wrote:
> >>>> Hi John,
> >>>>
> >>>> ML has a 2D Poisson driver in
> >>>> ml/examples/BasicExamples/ml_preconditioner.cpp. The cmake target
> >>>> should be either "ML_preconditioner" or "ML_preconditioner.exe".
> >>>> There's a really similar one in ml/examples/XML/ml_XML.cpp that you
> >>>> can drive with an XML deck. Is this what you're after?
> >>>>
> >>>> Jonathan
> >>>>
> >>>> On 3/23/21, 5:47 PM, "Trilinos-Users on behalf of John Cary"
> >>>> <trilinos-users-bounces at trilinos.org on behalf of
> >>>> cary at colorado.edu> wrote:
> >>>>
> >>>> We are still using the old stack: ML, Epetra, ...
> >>>>
> >>>> When we run a simple Poisson solve on our cluster (32
> >>>> cores/node), we
> >>>> see parallel efficiency drop to 4% on one node with 32 cores.
> >>>> So we
> >>>> naturally believe we are doing something wrong.
> >>>>
> >>>> Does trilinos come with a simple Poisson-solve executable that
> >>>> we could
> >>>> use to test scaling (to get around the uncertainties of our
> >>>> use of
> >>>> trilinos)?
> >>>>
> >>>> Thx.......John Cary
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cori.nersc.gov-trilinos-parcomm-config.sh
Type: application/x-sh
Size: 3064 bytes
Desc: not available
URL: <http://trilinos.org/pipermail/trilinos-users_trilinos.org/attachments/20210330/62e129aa/attachment.sh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ML_preconditionerCori.sh
Type: application/x-sh
Size: 963 bytes
Desc: not available
URL: <http://trilinos.org/pipermail/trilinos-users_trilinos.org/attachments/20210330/62e129aa/attachment-0001.sh>
-------------- next part --------------
Currently Loaded Modulefiles:
1) modules/3.2.11.4
2) altd/2.0
3) darshan/3.2.1
4) craype-network-aries
5) craype/2.6.2
6) cray-mpich/7.7.10
7) craype-haswell
8) craype-hugepages2M
9) cray-libsci/19.06.1
10) udreg/2.3.2-7.0.1.1_3.47__g8175d3d.ari
11) ugni/6.0.14.0-7.0.1.1_7.49__ge78e5b0.ari
12) pmi/5.0.14
13) dmapp/7.1.1-7.0.1.1_4.61__g38cf134.ari
14) gni-headers/5.0.12.0-7.0.1.1_6.36__g3b1768f.ari
15) xpmem/2.2.20-7.0.1.1_4.20__g0475745.ari
16) job/2.2.4-7.0.1.1_3.47__g36b56f4.ari
17) dvs/2.12_2.2.165-7.0.1.1_14.4__ge967908e
18) alps/6.6.58-7.0.1.1_6.19__g437d88db.ari
19) rca/2.2.20-7.0.1.1_4.61__g8e3fb5b.ari
20) atp/2.1.3
21) PrgEnv-gnu/6.0.5
22) cmake/3.14.4
23) gcc/8.3.0
24) texlive/2019
25) git-lfs/2.8.0