[Trilinos-Users] strange performance on simple matrix/vector product
Heroux, Michael A
maherou at sandia.gov
Thu Mar 6 12:23:37 MST 2008
Daniele,
I confirmed your observations but did make some changes to your test code. Your test matrix is quite small (dimension 4000), but it is particularly sparse (avg 2.2 nonzeros per row). With so little work to do, the inline loops do better because there is no function call or logic overhead at all, whereas the Epetra kernel has a few levels of logic it goes through before getting to the actual kernels.
I did one thing to make the comparison fairer: I moved the allocation and initialization of the arrays used by your inline loops so that they were executed before the Epetra kernel. Your original code had this initialization right before the loops so that the data was already in the cache. By doing this, the performance difference became smaller.
I configured Trilinos with FFLAGS = "-O3 -ftree-vectorize" (on my Intel Mac).
With the above change, I found that the inline loops got 274 MFLOPS while Epetra got 256 MFLOPS. I also ran the comparison on a variety of test problems I have and found that, for all but the very smallest problems, Epetra tends to do better than the inline loops. (See results below.)
My question to you: Is the tols4000 Matrix Market example typical of the matrix sizes and patterns you are working with, or just a sample problem? If it is typical, then I can suggest a few things to consider: use a direct sparse solver if the problems are this small, or write a special adapter for Epetra_BasicRowMatrix that uses data structures tailored to your class of problems.
Here are some efficiency results for a bunch of Harwell-Boeing problems that I have from Matrix Market and my own "private collection" (efficiency above 100% means Epetra beats the inline loops). You will see that, except for problems that are very small, and your particular problem, Epetra is 20 to 70% faster than the inline loops.
Mike
tols4000.rua (n=4000/nz=8784) Efficiency = 88.1327%
3DNODE92.hb (n=6120/nz=275583) Efficiency = 161.357%
INDUCTOR019.hb (n=3/nz=7) Efficiency = 17.0785%
KENGO1.hb (n=4512/nz=226830) Efficiency = 161.289%
Km3d2.hb (n=2048/nz=36344) Efficiency = 158.269%
Km4d2.hb (n=20000/nz=353262) Efficiency = 143.19%
NIMROD2.hb (n=9798/nz=504374) Efficiency = 138.24%
RadProb.hb (n=306/nz=2364) Efficiency = 119.149%
VM214IMG0_K1.hb (n=3108/nz=79436) Efficiency = 156.257%
VM214IMG1_K1.hb (n=3108/nz=101016) Efficiency = 160.026%
VM214IMG45_K1.hb (n=3108/nz=101016) Efficiency = 155.722%
add20_3.hb (n=2479/nz=18231) Efficiency = 136.348%
cube27_bad_reorder.hb (n=32768/nz=830584) Efficiency = 126.208%
cube27p_32n_1x2x2.hb (n=32768/nz=830584) Efficiency = 124.905%
cube27p_32n_2x2x2.hb (n=32768/nz=830584) Efficiency = 123.346%
cube27p_32n_2x2x4.hb (n=32768/nz=830584) Efficiency = 122.725%
cube27p_32n_4x4x4.hb (n=32768/nz=830584) Efficiency = 122.827%
cube27p_32n_natural.hb (n=32768/nz=830584) Efficiency = 123.569%
cube_bad.hb (n=402/nz=27623) Efficiency = 161.876%
cube_bad_trans.hb (n=402/nz=27623) Efficiency = 162.572%
cube_good.hb (n=402/nz=27607) Efficiency = 156.992%
cube_good_trans.hb (n=402/nz=27607) Efficiency = 162.053%
cube_rcm.hb (n=32768/nz=830584) Efficiency = 122.638%
dday-complex_im.hb (n=10590/nz=276979) Efficiency = 157.214%
dday-complex_re.hb (n=10590/nz=276979) Efficiency = 154.009%
dday01.hb (n=21180/nz=923006) Efficiency = 115.398%
defroll.hb (n=6277/nz=294197) Efficiency = 158.17%
defrollnip_diagdom.hb (n=6001/nz=163533) Efficiency = 157.787%
defrollnip_step2.hb (n=5881/nz=265776) Efficiency = 154.154%
defrollnip_unordered.hb (n=6001/nz=163533) Efficiency = 157.066%
diag100.hb (n=100/nz=100) Efficiency = 118.069%
diag10000.hb (n=10000/nz=10000) Efficiency = 157.309%
diag100000.hb (n=100000/nz=100000) Efficiency = 158.567%
diag1000000.hb (n=1000000/nz=1000000) Efficiency = 120.673%
diag125000.hb (n=125000/nz=125000) Efficiency = 154.4%
diag250000.hb (n=250000/nz=250000) Efficiency = 133.095%
diag31250.hb (n=31250/nz=31250) Efficiency = 157.877%
diag3x3.hb (n=3/nz=3) Efficiency = 7.33652%
diag500000.hb (n=500000/nz=500000) Efficiency = 119.015%
diag62500.hb (n=62500/nz=62500) Efficiency = 158.699%
f122pe10.hb (n=4512/nz=226830) Efficiency = 170.109%
f122pe16.hb (n=4512/nz=226830) Efficiency = 159.835%
f122pe16a.hb (n=4512/nz=226830) Efficiency = 154.688%
f122pe20.hb (n=4512/nz=226830) Efficiency = 158.816%
f122pe20a.hb (n=4512/nz=226830) Efficiency = 162.209%
fidap035_rua.hb (n=19716/nz=218308) Efficiency = 139.484%
fidapm11_rua.hb (n=22294/nz=623554) Efficiency = 124.381%
full8x8.hb (n=8/nz=64) Efficiency = 51.2379%
goma_braze1.hb (n=1540/nz=191710) Efficiency = 166.607%
goma_braze_npspg.hb (n=1344/nz=143640) Efficiency = 170.365%
m3d2.hb (n=2048/nz=36344) Efficiency = 158.82%
m3d21.hb (n=2048/nz=36344) Efficiency = 160.053%
m3d2_im.hb (n=1024/nz=12480) Efficiency = 145.436%
m3d2_re.hb (n=1024/nz=12480) Efficiency = 157.675%
m3d2v2_im.hb (n=1024/nz=5692) Efficiency = 113.381%
m3d2v2_re.hb (n=1024/nz=12480) Efficiency = 149.727%
m4d2.hb (n=20000/nz=353262) Efficiency = 144.51%
m4d21.hb (n=20000/nz=353262) Efficiency = 144.315%
m4d2_im.hb (n=10000/nz=127400) Efficiency = 157.564%
m4d2_re.hb (n=10000/nz=127400) Efficiency = 153.21%
new_visco.hb (n=23439/nz=1069018) Efficiency = 124.913%
nos1.hb (n=237/nz=1017) Efficiency = 78.7574%
s1rmq4m1.hb (n=5489/nz=6270) Efficiency = 111.317%
simple_singleton.hb (n=5/nz=11) Efficiency = 28.399%
simpnewt.hb (n=2478/nz=73445) Efficiency = 157.419%
small.hb (n=484/nz=12176) Efficiency = 158.983%
square_rcm.hb (n=484/nz=12176) Efficiency = 157.849%
test.hb (n=237/nz=1017) Efficiency = 78.9191%
tmpmatrix.hb (n=10/nz=28) Efficiency = 33.0626%
tree_5x8.hb (n=4681/nz=14041) Efficiency = 133.882%
triangle.fnf_0.5_02_001.hb (n=28/nz=464) Efficiency = 120.305%
trid10x10.hb (n=10/nz=28) Efficiency = 31.091%
trid4x4.hb (n=4/nz=10) Efficiency = 17.2178%
visco_unordered.hb (n=23439/nz=890449) Efficiency = 122.235%
viscoelastic.hb (n=23439/nz=911185) Efficiency = 121.125%
vm214_small_K1.hb (n=3108/nz=79436) Efficiency = 155.256%
On 2/29/08 11:45 AM, "Daniele Bettella" <sunriis at gmail.com> wrote:
Actually I already do that (exactly 100 times :D) and then take the average time...
too many things I forgot to mention...
Daniele
On Fri, Feb 29, 2008 at 6:35 PM, Hoekstra, Robert J <rjhoeks at sandia.gov> wrote:
Daniele,
I would recommend looping this test as well, at least 100 times. This is a small enough problem that you will be significantly impacted by system calls, etc. If you want good timings, you really want the average over many operations.
Rob
Robert Hoekstra
____________________________
Electrical & Microsystems Modeling
Sandia National Laboratories
P.O. Box 5800 / MS 0316
Albuquerque, NM 87185
phone: 505-844-7627
fax: 505-284-5451
e-mail: rjhoeks at sandia.gov
web: http://www.cs.sandia.gov
-----Original Message-----
From: trilinos-users-bounces at software.sandia.gov [mailto:trilinos-users-bounces at software.sandia.gov] On Behalf Of Daniele Bettella
Sent: Friday, February 29, 2008 7:55 AM
To: trilinos-users at software.sandia.gov
Subject: Re: [Trilinos-Users] strange performance on simple matrix/vector product
Thanks for the answer,
1) Yes, you are correct; sorry I didn't state it in my original post.
2) This is the code used for importing; it's based on the Teuchos tutorial:
#ifdef HAVE_MPI
  MPI_Init(&argc, &argv);
  Epetra_MpiComm Comm(MPI_COMM_WORLD);
#else
  Epetra_SerialComm Comm;
#endif

Epetra_Map* readMap;
Epetra_CrsMatrix* readA;
Epetra_Vector* readx;
Epetra_Vector* readb;
Epetra_Vector* readxexact;
char* matrix_file;

if (argc > 1) {
  matrix_file = argv[1];
} else {
  matrix_file = (char*) "tols4000.rua";  // cast: the reader API takes char*
}

Trilinos_Util_ReadHb2Epetra(matrix_file, Comm, readMap, readA, readx, readb, readxexact);

Epetra_Map map(readMap->NumGlobalElements(), 0, Comm);
Epetra_CrsMatrix A(Copy, map, 0);
const Epetra_Map& OriginalMap = readA->RowMatrixRowMap();
assert(OriginalMap.SameAs(*readMap));
Epetra_Export exporter(OriginalMap, map);
A.Export(*readA, exporter, Add);
A.FillComplete();
3) This is how I configured Trilinos on my laptop:

../configure --disable-default-packages --enable-pytrilinos --enable-teuchos --enable-epetra --enable-triutils --enable-aztecoo --with-mpi-compilers=/usr/local/bin/

On the Xeon I used basically the same command, except that I had to add the -fPIC flag to the C++, C, and Fortran flags.
As for compiling the test itself, I just use -O3 (same results with -O2).
I had already read that guide; thanks for pointing it out. Nothing in it tells me why the same test runs slower on the Xeon than on my laptop...
at least as far as I can understand...
hope that helps, thanks again
Daniele
Heroux, Michael A ha scritto:
> Daniele,
>
> A few questions:
>
> 1) I assume first that you are using Epetra_CrsMatrix and its
> Multiply method. Correct?
>
> 2) How are you importing the matrix into Epetra?
>
> 3) How are you compiling Trilinos? Depending on how the matrix is
> imported, the Multiply method is very sensitive to either the C++ or
> the Fortran compiler optimization flags.
>
> Also, there is a performance optimization guide
>
> http://trilinos.sandia.gov/packages/epetra/EpetraPerformanceGuide.pdf
>
> Mike
>
>
> On 2/28/08 11:11 AM, "Daniele Bettella" <jagfsdfhf at libero.it> wrote:
>
> Hello, I have a strange performance problem on Epetra, using a simple
> multiply.
> I have two configurations for testing purposes, the first one is a
> personal laptop, single core duo T2300 with 1GB of RAM, the second one
> is a dual Xeon quad core E5345 with 16GB of RAM
>
> I take as an example the matrix e30r5000 from Matrix Market. The
> matrix is imported into Trilinos and multiplied by a random vector via
> the Epetra Multiply function. I take the time and I get 0.0148 seconds
> on my laptop and 0.00227 seconds on the dual Xeon. Now, since I'm not
> the only one using the Xeon, I wrote another test for comparison; this
> time I multiply "by hand", storing the matrix in CRS. This is what I got:
> on my laptop - 0.00168 seconds
> on the Xeon - 0.00168 seconds
> so basically I have the same time with the manual implementation, but
> Trilinos is noticeably slower on the dual Xeon.
> I prepared many more matrices on which I tested the same program; on
> many of them the Xeon has better times performing the manual multiply
> than the laptop, but the Trilinos multiply is always slower...
>
> I understand it's quite hard to tell what the problem might be, but is
> there something I'm missing?
> I mean, I'm using just one core on each machine, but a core on my
> laptop runs at 1.6 GHz, whereas the Xeon goes at 2.33 GHz... 32-bit vs
> 64-bit, 2 MB vs 8 MB of cache... and it generally shows in the product
> I wrote...
> if anyone has any idea I'd be grateful
>
> thanks in advance and sorry for my English
>
> Daniele
>
> _______________________________________________
> Trilinos-Users mailing list
> Trilinos-Users at software.sandia.gov
> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>