[Trilinos-Users] Finite Element Assembly
Danielle Catherine Maddix
dcmaddix at stanford.edu
Sun Jan 31 12:50:41 EST 2016
Hello,
As an update, I added the call to Trilinos::K->OptimizeStorage() and now both methods take the same amount of time in the linear solver.
I have a remaining question about optimizing my assembly routine, since it is still ~1.6-1.7x slower than our in-house one.
Here are the main parts of my Trilinos assembly code, which loops over the elements and assembles the block element stiffnesses.
Please let me know if you see any optimization I am missing.
My global stiffness matrix is an Epetra_FEVbrMatrix constructed from an Epetra_FECrsGraph, i.e.
Trilinos::K = new Epetra_FEVbrMatrix(Copy, K_graph);
where
Epetra_FECrsGraph K_graph(Copy, *Trilinos::blockMap, &nnzPerRow[0]);
I saw that setting StaticProfile to true for Epetra_CrsGraph resulted in better performance, but I don't see this option available for Epetra_FECrsGraph.
I call OptimizeStorage on K after the assembly is complete.
Any help or thoughts would be appreciated.
Thank you,
Danielle Maddix
// (inside an enclosing loop over the elements)
for (int a = 0; a < numNodesPerElement; ++a)
{
    // Sum in contributions from this element node; the FE (global-assemble)
    // routine will gather contributions from shared nodes on other processors.
    int error = Trilinos::K->BeginSumIntoGlobalValues(
            localToGlobal[a],     // global block row
            numNodesPerElement,   // number of block entries in this row
            &localToGlobal[0]);   // global block column indices (implicit inner b loop)
    if (error != 0)
    {
        std::cout << "ERROR: Setting block row and block columns for summing into global values!" << std::endl;
        exit(1);
    }

    // Multiply by the Dirichlet weight to zero the appropriate rows.
    for (int i = 0; i < dof; ++i)
    {
        lR[a*dof + i] *= W[(eqN[a] - 1)*dof + i];
    }

    // Submit the global F values.
    error = Trilinos::F->SumIntoGlobalValues(
            1,                 // number of global block rows, put in 1 at a time
            &localToGlobal[a], // global block index
            &numValuesPerID,   // dof values per ID
            &lR[a*dof]);       // values of size dof
    if (error != 0) // one or more indices not associated with the calling processor
    {
        std::cout << error << std::endl;
        std::cout << "ERROR: Summing into global vector values!" << std::endl;
        exit(1);
    }

    // Loop over local nodes for the block columns.
    for (int b = 0; b < numNodesPerElement; ++b)
    {
        // Copy the block into column-major order, since Epetra serial dense
        // block entries are column-major.
        std::vector<double> values(dof*dof);
        for (int i = 0; i < dof; ++i)
        {
            for (int j = 0; j < dof; ++j)
            {
                values[i*dof + j] = lK[b*dof*dof*numNodesPerElement + a*dof*dof + j*dof + i];
            }
        }
        error = Trilinos::K->SubmitBlockEntry(
                &values[0], // square dof*dof block
                dof,        // LDA
                dof,        // number of rows
                dof);       // number of columns
        if (error != 0)
        {
            std::cout << "ERROR: Submitting block entry for summing into matrix global values!" << std::endl;
            exit(1);
        }
    }

    // Finish submitting the block entries for the current block row.
    error = Trilinos::K->EndSubmitEntries();
    if (error != 0)
    {
        std::cout << "ERROR: Ending submission of block entries!" << std::endl;
        exit(1);
    }
}
} // closes the enclosing (unshown) loop over elements
________________________________
From: Danielle Catherine Maddix
Sent: Friday, January 29, 2016 2:17 PM
To: trilinos-users at trilinos.org
Subject: Finite Element Assembly
Hello,
I am getting confusing timing results from two versions of our finite element code that use Trilinos. I started with an in-house finite element solver, and we wanted to see how the Trilinos linear solvers compare to ours. I am using Epetra and AztecOO. I have two methods. In the first, I copy our global stiffness matrix and global force vector into Trilinos using BeginReplaceIntoGlobalValues; my matrix is an Epetra_FEVbrMatrix. In this case, the linear solver times per Newton iteration are given below:
time_global=[0.1058 0.7401 0.8486 0.8326 0.8607]
iter_global = [50 196 214 212 218]
In the other case, I replace our current assembly routine and assemble directly in Epetra using BeginSumIntoGlobalValues. I verified that the global stiffness matrix and force vector I get are exactly the same before I call GMRES. However, the times are significantly slower for the same number of iterations and the same solution.
time_assemble = [0.320155 1.6002 1.75062 1.71539 1.84786]
iter_assemble = [50 196 214 212 218]
So I see a 2-3x slowdown between the two methods for the same number of iterations and the same linear problem. Is there any potential reason for this that comes to mind?
Thanks,
Danielle Maddix