[Trilinos-Users] strange performance on simple matrix/vector product

Daniele Bettella sunriis at gmail.com
Fri Mar 7 02:18:02 MST 2008


I'm at a loss for words; I can't begin to say how helpful your answer is.
I'll try your suggestions as soon as possible, but I'm confident they will
work fine.
Actually, the tols4000 matrix was chosen just as a safe choice; it was fast
and gave no problems at the beginning of my experiments. The actual work will
be done on much larger problems, but thanks for pointing out the alternatives.
Thanks again for everything! You've been really great.

I'll keep in touch :)

Daniele

On 3/6/08, Heroux, Michael A <maherou at sandia.gov> wrote:
>
>  Daniele,
>
> I confirmed your observations, but I did make some changes to your test
> code.  Your test matrix is quite small (dimension 4000) and particularly
> sparse (avg 2.2 nonzeros per row).  With so little work to do, the inline
> loops do better because there is no function-call or logic overhead at all,
> whereas the Epetra kernel goes through a few levels of logic before getting
> to the actual kernels.
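>
> To make the difference concrete, here is a sketch of the two kernels being
> compared (the CRS arrays rowptr/colind/vals and the vector names are
> illustrative, not your actual test code):
>
>   // Hand-rolled CRS product, y = A*x: three raw loops, no call overhead.
>   for (int i = 0; i < n; ++i) {
>     double sum = 0.0;
>     for (int k = rowptr[i]; k < rowptr[i+1]; ++k)
>       sum += vals[k] * x[colind[k]];
>     y[i] = sum;
>   }
>
>   // Epetra path: a single call, but with argument and map checks in front
>   // of the same kind of kernel.
>   A.Multiply(false, xvec, yvec);   // Epetra_CrsMatrix::Multiply, no transpose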
>
> I did one thing to make the comparison fairer: I moved the allocation and
> initialization of the arrays used by your inline loops so that they execute
> before the Epetra kernel.  Your original code did this initialization right
> before the loops, so the data was already in cache.  With this change, the
> performance difference became smaller.
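>
> In outline, the reordered test looks something like this (just a sketch;
> Epetra_Time is used for the clock, and inline_spmv is a hypothetical wrapper
> around the raw loops shown above):
>
>   // (a) Allocate and initialize the inline-loop CRS arrays and vectors
>   //     here, before either kernel runs, so neither side starts with its
>   //     data already warm in cache.
>
>   Epetra_Time timer(Comm);
>
>   // (b) Time the Epetra kernel first.
>   timer.ResetStartTime();
>   A.Multiply(false, xvec, yvec);
>   double epetraTime = timer.ElapsedTime();
>
>   // (c) Then time the inline loops.
>   timer.ResetStartTime();
>   inline_spmv(n, rowptr, colind, vals, xh, yh);
>   double inlineTime = timer.ElapsedTime();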
>
> I configured Trilinos with FFLAGS = "-O3 -ftree-vectorize" (on my Intel
> Mac).
>
> With the above change, I found that the inline loops got 274 MFLOP/s
> while Epetra got 256 MFLOP/s.  I also ran the comparison on a variety of
> test problems I have and found that, for all but the very smallest problems,
> Epetra tends to do better than the inline loops.  (See results below.)
>
> My question to you: is the tols4000 Matrix Market example typical of the
> matrix sizes and patterns you are working with, or just a sample problem?
>  If it is typical, then I can suggest a few things to consider: use a
> direct sparse solver if the problems are this small, or write a special
> adapter for Epetra_BasicRowMatrix that uses data structures tailored to
> your class of problems.
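>
> For the direct-solver route, a minimal sketch with the Amesos package
> (assuming Amesos is enabled in your build, and that you have a right-hand
> side vector b and solution vector x) would be:
>
>   #include "Amesos.h"
>   #include "Epetra_LinearProblem.h"
>
>   Epetra_LinearProblem problem(&A, &x, &b);   // solve A*x = b
>   Amesos factory;
>   Amesos_BaseSolver* solver = factory.Create("Amesos_Klu", problem);
>   solver->SymbolicFactorization();
>   solver->NumericFactorization();
>   solver->Solve();
>   delete solver;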
>
> Here are some efficiency results for a bunch of Harwell-Boeing problems
> that I have from Matrix Market and my own "private collection".  You will
> see that, except for problems that are very small (and your particular
> problem), Epetra is 20 to 70% faster than the inline loops.
>
> Mike
>
> tols4000.rua (n=4000/nz=8784)                 Efficiency = 88.1327%
> 3DNODE92.hb (n=6120/nz=275583)                Efficiency = 161.357%
> INDUCTOR019.hb (n=3/nz=7)                     Efficiency = 17.0785%
> KENGO1.hb (n=4512/nz=226830)                  Efficiency = 161.289%
> Km3d2.hb (n=2048/nz=36344)                    Efficiency = 158.269%
> Km4d2.hb (n=20000/nz=353262)                  Efficiency = 143.19%
> NIMROD2.hb (n=9798/nz=504374)                 Efficiency = 138.24%
> RadProb.hb (n=306/nz=2364)                    Efficiency = 119.149%
> VM214IMG0_K1.hb (n=3108/nz=79436)             Efficiency = 156.257%
> VM214IMG1_K1.hb (n=3108/nz=101016)            Efficiency = 160.026%
> VM214IMG45_K1.hb (n=3108/nz=101016)           Efficiency = 155.722%
> add20_3.hb (n=2479/nz=18231)                  Efficiency = 136.348%
> cube27_bad_reorder.hb (n=32768/nz=830584)     Efficiency = 126.208%
> cube27p_32n_1x2x2.hb (n=32768/nz=830584)      Efficiency = 124.905%
> cube27p_32n_2x2x2.hb (n=32768/nz=830584)      Efficiency = 123.346%
> cube27p_32n_2x2x4.hb (n=32768/nz=830584)      Efficiency = 122.725%
> cube27p_32n_4x4x4.hb (n=32768/nz=830584)      Efficiency = 122.827%
> cube27p_32n_natural.hb (n=32768/nz=830584)    Efficiency = 123.569%
> cube_bad.hb (n=402/nz=27623)                  Efficiency = 161.876%
> cube_bad_trans.hb (n=402/nz=27623)            Efficiency = 162.572%
> cube_good.hb (n=402/nz=27607)                 Efficiency = 156.992%
> cube_good_trans.hb (n=402/nz=27607)           Efficiency = 162.053%
> cube_rcm.hb (n=32768/nz=830584)               Efficiency = 122.638%
> dday-complex_im.hb (n=10590/nz=276979)        Efficiency = 157.214%
> dday-complex_re.hb (n=10590/nz=276979)        Efficiency = 154.009%
> dday01.hb (n=21180/nz=923006)                 Efficiency = 115.398%
> defroll.hb (n=6277/nz=294197)                 Efficiency = 158.17%
> defrollnip_diagdom.hb (n=6001/nz=163533)      Efficiency = 157.787%
> defrollnip_step2.hb (n=5881/nz=265776)        Efficiency = 154.154%
> defrollnip_unordered.hb (n=6001/nz=163533)    Efficiency = 157.066%
> diag100.hb (n=100/nz=100)                     Efficiency = 118.069%
> diag10000.hb (n=10000/nz=10000)               Efficiency = 157.309%
> diag100000.hb (n=100000/nz=100000)            Efficiency = 158.567%
> diag1000000.hb (n=1000000/nz=1000000)         Efficiency = 120.673%
> diag125000.hb (n=125000/nz=125000)            Efficiency = 154.4%
> diag250000.hb (n=250000/nz=250000)            Efficiency = 133.095%
> diag31250.hb (n=31250/nz=31250)               Efficiency = 157.877%
> diag3x3.hb (n=3/nz=3)                         Efficiency = 7.33652%
> diag500000.hb (n=500000/nz=500000)            Efficiency = 119.015%
> diag62500.hb (n=62500/nz=62500)               Efficiency = 158.699%
> f122pe10.hb (n=4512/nz=226830)                Efficiency = 170.109%
> f122pe16.hb (n=4512/nz=226830)                Efficiency = 159.835%
> f122pe16a.hb (n=4512/nz=226830)               Efficiency = 154.688%
> f122pe20.hb (n=4512/nz=226830)                Efficiency = 158.816%
> f122pe20a.hb (n=4512/nz=226830)               Efficiency = 162.209%
> fidap035_rua.hb (n=19716/nz=218308)           Efficiency = 139.484%
> fidapm11_rua.hb (n=22294/nz=623554)           Efficiency = 124.381%
> full8x8.hb (n=8/nz=64)                        Efficiency = 51.2379%
> goma_braze1.hb (n=1540/nz=191710)             Efficiency = 166.607%
> goma_braze_npspg.hb (n=1344/nz=143640)        Efficiency = 170.365%
> m3d2.hb (n=2048/nz=36344)                     Efficiency = 158.82%
> m3d21.hb (n=2048/nz=36344)                    Efficiency = 160.053%
> m3d2_im.hb (n=1024/nz=12480)                  Efficiency = 145.436%
> m3d2_re.hb (n=1024/nz=12480)                  Efficiency = 157.675%
> m3d2v2_im.hb (n=1024/nz=5692)                 Efficiency = 113.381%
> m3d2v2_re.hb (n=1024/nz=12480)                Efficiency = 149.727%
> m4d2.hb (n=20000/nz=353262)                   Efficiency = 144.51%
> m4d21.hb (n=20000/nz=353262)                  Efficiency = 144.315%
> m4d2_im.hb (n=10000/nz=127400)                Efficiency = 157.564%
> m4d2_re.hb (n=10000/nz=127400)                Efficiency = 153.21%
> new_visco.hb (n=23439/nz=1069018)             Efficiency = 124.913%
> nos1.hb (n=237/nz=1017)                       Efficiency = 78.7574%
> s1rmq4m1.hb (n=5489/nz=6270)                  Efficiency = 111.317%
> simple_singleton.hb (n=5/nz=11)               Efficiency = 28.399%
> simpnewt.hb (n=2478/nz=73445)                 Efficiency = 157.419%
> small.hb (n=484/nz=12176)                     Efficiency = 158.983%
> square_rcm.hb (n=484/nz=12176)                Efficiency = 157.849%
> test.hb (n=237/nz=1017)                       Efficiency = 78.9191%
> tmpmatrix.hb (n=10/nz=28)                     Efficiency = 33.0626%
> tree_5x8.hb (n=4681/nz=14041)                 Efficiency = 133.882%
> triangle.fnf_0.5_02_001.hb (n=28/nz=464)      Efficiency = 120.305%
> trid10x10.hb (n=10/nz=28)                     Efficiency = 31.091%
> trid4x4.hb (n=4/nz=10)                        Efficiency = 17.2178%
> visco_unordered.hb (n=23439/nz=890449)        Efficiency = 122.235%
> viscoelastic.hb (n=23439/nz=911185)           Efficiency = 121.125%
> vm214_small_K1.hb (n=3108/nz=79436)           Efficiency = 155.256%
>
>
> On 2/29/08 11:45 AM, "Daniele Bettella" <sunriis at gmail.com> wrote:
>
> Actually I already do that (exactly 100 times :D) and then take the average
> time... there are too many things I forgot to mention...
>
> Daniele
>
> On Fri, Feb 29, 2008 at 6:35 PM, Hoekstra, Robert J <rjhoeks at sandia.gov>
> wrote:
>
>
> Daniele,
>
> I would also recommend looping this test, at least 100 times.  This is a
> small enough problem that you will be significantly impacted by system
> calls, etc.  If you want good timings, you really want the average of many
> operations.
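>
> Something along these lines, for example (a sketch; it assumes the matrix A
> and the vectors x and y are already set up, and uses Epetra_Time for the
> clock):
>
>   Epetra_Time timer(Comm);
>   const int ntrials = 100;
>   timer.ResetStartTime();
>   for (int it = 0; it < ntrials; ++it)
>     A.Multiply(false, x, y);                        // y = A*x, repeated
>   double avgSeconds = timer.ElapsedTime() / ntrials; // average per product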
>
> Rob
>
> Robert Hoekstra
> ____________________________
> Electrical & Microsystems Modeling
> Sandia National Laboratories
> P.O. Box 5800 / MS 0316
> Albuquerque, NM 87185
> phone: 505-844-7627
> fax: 505-284-5451
> e-mail: rjhoeks at sandia.gov
> web: http://www.cs.sandia.gov
>
>
> -----Original Message-----
> From: trilinos-users-bounces at software.sandia.gov
> [mailto:trilinos-users-bounces at software.sandia.gov] On Behalf Of Daniele Bettella
> Sent: Friday, February 29, 2008 7:55 AM
> To: trilinos-users at software.sandia.gov
> Subject: Re: [Trilinos-Users] strange performance on simple matrix/vector
> product
>
> Thanks for the answer.
> 1) Yes, you are correct; sorry I didn't state it in my original post.
> 2) This is the code used for importing; it's based on the Teuchos tutorial:
>
> // Headers needed by this excerpt:
> #ifdef HAVE_MPI
> #include "mpi.h"
> #include "Epetra_MpiComm.h"
> #else
> #include "Epetra_SerialComm.h"
> #endif
> #include "Epetra_Map.h"
> #include "Epetra_CrsMatrix.h"
> #include "Epetra_Vector.h"
> #include "Epetra_Export.h"
> #include "Trilinos_Util.h"   // Trilinos_Util_ReadHb2Epetra
> #include <cassert>
>
> #ifdef HAVE_MPI
>   MPI_Init(&argc, &argv);
>   Epetra_MpiComm Comm(MPI_COMM_WORLD);
> #else
>   Epetra_SerialComm Comm;
> #endif
>
>   // Objects allocated and filled by the Harwell-Boeing reader.
>   Epetra_Map* readMap;
>   Epetra_CrsMatrix* readA;
>   Epetra_Vector* readx;
>   Epetra_Vector* readb;
>   Epetra_Vector* readxexact;
>
>   char* matrix_file;
>   if (argc > 1) {
>     matrix_file = argv[1];
>   }
>   else {
>     matrix_file = "tols4000.rua";
>   }
>
>   // Read the matrix and associated vectors from the Harwell-Boeing file.
>   Trilinos_Util_ReadHb2Epetra(matrix_file, Comm, readMap, readA, readx,
>                               readb, readxexact);
>
>   // Build a uniformly distributed map and export the matrix onto it.
>   Epetra_Map map(readMap->NumGlobalElements(), 0, Comm);
>   Epetra_CrsMatrix A(Copy, map, 0);
>
>   const Epetra_Map& OriginalMap = readA->RowMatrixRowMap();
>   assert(OriginalMap.SameAs(*readMap));
>   Epetra_Export exporter(OriginalMap, map);
>   A.Export(*readA, exporter, Add);
>   A.FillComplete();
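>
> The product itself (which is what I time) is then essentially the
> following; the vector names here are just illustrative:
>
>   Epetra_Vector x(map), y(map);
>   x.Random();                  // random input vector
>   A.Multiply(false, x, y);     // y = A*x, the timed Epetra operation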
>
> 3) This is how I configured Trilinos on my laptop:
> ../configure --disable-default-packages --enable-pytrilinos --enable-teuchos
> --enable-epetra --enable-triutils --enable-aztecoo
> --with-mpi-compilers=/usr/local/bin/
> On the Xeon I used basically the same command, except that I had to add the
> -fPIC flag to the CXX, C, and Fortran flags.
>
> As for compiling the test itself, I just use -O3 (same results with -O2).
>
> I had already read that guide, thanks for pointing it out, but nothing in it
> tells me why the same test runs slower on the Xeon than on my laptop...
> at least as far as I can understand...
>
> Hope that helps, thanks again.
>
> Daniele
>
> Heroux, Michael A wrote:
> > Daniele,
> >
> > A few questions:
> >
> > 1)  I assume first that you are using Epetra_CrsMatrix and its
> > Multiply method.  Correct?
> >
> > 2) How are you importing the matrix into Epetra?
> >
> > 3) How are you compiling Trilinos?  Depending on how the matrix is
> > imported, the Multiply method is very sensitive to either the C++ or
> > the Fortran compiler optimization flags.
> >
> > Also, there is a performance optimization guide
> >
> > http://trilinos.sandia.gov/packages/epetra/EpetraPerformanceGuide.pdf
> >
> > Mike
> >
> >
> > On 2/28/08 11:11 AM, "Daniele Bettella" <jagfsdfhf at libero.it> wrote:
> >
> >     Hello, I have a strange performance problem in Epetra, using a
> >     simple multiply.
> >     I have two configurations for testing purposes: the first one is a
> >     personal laptop, a single Core Duo T2300 with 1GB of RAM; the second
> >     one is a dual quad-core Xeon E5345 with 16GB of RAM.
> >
> >     I take as an example the matrix e30r5000 from Matrix Market. The
> >     matrix is imported into Trilinos and multiplied, via the Epetra
> >     multiply function, by a random vector. I time it and get 0.0148
> >     seconds on my laptop and 0.00227 seconds on the dual Xeon. Now, since
> >     I'm not the only one using the Xeon, I wrote another test for
> >     comparison; this time I multiply "by hand", storing the matrix in
> >     CRS, and this is what I got:
> >     on my laptop - 0.00168 seconds
> >     on the Xeon - 0.00168 seconds
> >     So basically I get the same time with the manual implementation, but
> >     Trilinos is noticeably slower on the dual Xeon.
> >     I prepared many more matrices on which I tested the same program; on
> >     many of them the Xeon beats the laptop on the manual multiply, but
> >     the Trilinos multiply is always slower...
> >
> >     I understand it's quite hard to tell what the problem might be, but
> >     is there something I'm missing?
> >     I mean, I'm using just one core on each machine, but a core on my
> >     laptop runs at 1.6GHz, whereas the Xeon runs at 2.33GHz... 32-bit vs
> >     64-bit, 2MB vs 8MB of cache... and it generally shows in the product
> >     I wrote...
> >     If anyone has any idea I'd be grateful.
> >
> >     Thanks in advance, and sorry for my English.
> >
> >     Daniele
> >

