# [Trilinos-Users] Finite Element Assembly

Sun Jan 31 12:50:41 EST 2016

Hello,

As an update, I added the call to Trilinos::K->OptimizeStorage() and now both methods take the same amount of time in the linear solver.

I have a remaining question about optimizing my assembly routine, since it is still ~1.6-1.7x slower than our in-house one.

Here are the main parts of my Trilinos assembly code, which loops over the elements and assembles the block element stiffnesses.

Please let me know if you see any optimization I am missing.

My global stiffness matrix is an Epetra_FEVbrMatrix constructed from an Epetra_FECrsGraph, i.e.

```cpp
Epetra_FECrsGraph K_graph(Copy, *Trilinos::blockMap, &nnzPerRow);
Trilinos::K = new Epetra_FEVbrMatrix(Copy, K_graph);
```

I saw that setting StaticProfile to true for Epetra_CrsGraph resulted in better performance, but I don't see this available for the Epetra_FECrsGraph.
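For reference, this is the kind of non-FE graph construction I was comparing against; StaticProfile is the optional trailing constructor argument there (a sketch, reusing my `blockMap` and `nnzPerRow` names):

```cpp
// Non-FE graph: StaticProfile=true preallocates contiguous storage up front,
// which avoided per-row reallocation in my tests.
Epetra_CrsGraph graph(Copy, *Trilinos::blockMap, &nnzPerRow, true /* StaticProfile */);
```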

I call OptimizeStorage on K after the assembly is complete.
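Concretely, the post-assembly sequence I use is roughly the following (a sketch using the Epetra FE class methods):

```cpp
// After the element loops: combine contributions for nodes shared across
// processors, then pack the matrix storage for the solver.
Trilinos::K->GlobalAssemble();   // also calls FillComplete()
Trilinos::K->OptimizeStorage();
Trilinos::F->GlobalAssemble();
```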

Any help or thoughts would be appreciated.

Thank you,

```cpp
for (int a = 0; a < numNodesPerElement; ++a)
{
    // Sum in this element node's contributions; GlobalAssemble() later combines
    // entries for nodes shared with other processors, since this is an FE class.
    int error = Trilinos::K->BeginSumIntoGlobalValues(
        localToGlobal[a],    // global block row
        numNodesPerElement,  // number of block entries in this row
        &localToGlobal[0]);  // global block column indices; implicitly covers the inner b loop
    if (error != 0)
    {
        std::cout << "ERROR: Setting block row and block columns for summing into global values!" << std::endl;
        exit(1);
    }

    // Multiply by the Dirichlet term to zero the appropriate rows.
    for (int i = 0; i < dof; ++i)
    {
        lR[a*dof + i] *= W[(eqN[a] - 1)*dof + i];
    }

    // Submit global F values.
    error = Trilinos::F->SumIntoGlobalValues(
        1,                  // number of global block rows, put in one at a time
        &localToGlobal[a],  // global block row index
        &numValuesPerID,    // dof values per ID
        &lR[a*dof]);        // values, of size dof
    if (error != 0) // one or more indices not associated with the calling processor!
    {
        std::cout << error << std::endl;
        std::cout << "ERROR: Summing into global vector values!" << std::endl;
        exit(1);
    }

    // Loop over local nodes for the block columns.
    for (int b = 0; b < numNodesPerElement; ++b)
    {
        // Repack the block, since Trilinos takes a SerialDenseMatrix-style
        // column-major block; lK was premultiplied by diagonal W, so W(a)
        // multiplies row a.
        std::vector<double> values(dof*dof);
        for (int i = 0; i < dof; ++i)
        {
            for (int j = 0; j < dof; ++j)
            {
                values[i*dof + j] = lK[b*dof*dof*numNodesPerElement + a*dof*dof + j*dof + i];
            }
        }

        error = Trilinos::K->SubmitBlockEntry(
            &values[0],  // square dof*dof block
            dof,         // LDA
            dof,         // number of rows
            dof);        // number of columns
        if (error != 0)
        {
            std::cout << "ERROR: Submitting block entry values for summing into global matrix values!" << std::endl;
            exit(1);
        }
    }

    // Finish submitting the block entries for the current row.
    error = Trilinos::K->EndSubmitEntries();
    if (error != 0)
    {
        std::cout << "ERROR: Ending submission of block entries!" << std::endl;
        exit(1);
    }
}
```

________________________________
Sent: Friday, January 29, 2016 2:17 PM
To: trilinos-users at trilinos.org
Subject: Finite Element Assembly

Hello,

I am getting confusing timing results from two versions of our finite element code that use Trilinos. I started with an in-house finite element solver, and we wanted to see how Trilinos linear solvers compared to ours. I am using Epetra and AztecOO. I have two methods. In the first, I copy our global stiffness matrix and global force vector into Trilinos using BeginReplaceGlobalValues. My matrix is an Epetra_FEVbrMatrix. In this case, the linear solver times for each Newton iteration are given below:

time_global=[0.1058    0.7401    0.8486    0.8326    0.8607]

iter_global =  [50 196 214 212 218]

In the other case, I replace our current assembly routine and assemble directly in Epetra using BeginSumIntoGlobalValues. I verified that the resulting global stiffness matrix and force vector are exactly the same before I call GMRES. However, the times are significantly slower for the same number of iterations and the same solution.

time_assemble = [0.320155 1.6002 1.75062 1.71539 1.84786]

iter_assemble = [50 196 214 212 218]

So I see a 2-3x slowdown between the two methods for the same number of iterations and the same linear problem. Does any potential reason for this come to mind?

Thanks,