[Trilinos-Users] Sparse direct solver and language options for GPU

Thu Sep 18 08:28:46 MDT 2014

Tom,

You might try using a preconditioner with AztecOO.  For example, Ifpack with incomplete Cholesky would give you some idea about how responsive the problem is to preconditioning.  But if you avoid the GPU, direct solvers are probably the best way to go anyway.  KLU is very robust and designed for problems like yours.

My results on EC2 are now fairly old, and we should go back and run the scalability tests again (but honestly I don’t have time right now).  EC2 is certainly competitive on single process performance and, since KLU is serial-only, you may be just fine on a big memory node.

Mike

From: Tom Anderson <tomacorp at gmail.com<mailto:tomacorp at gmail.com>>
Date: Thursday, September 18, 2014 at 3:17 AM
To: "trilinos-users at software.sandia.gov<mailto:trilinos-users at software.sandia.gov>" <trilinos-users at software.sandia.gov<mailto:trilinos-users at software.sandia.gov>>
Subject: Re: [Trilinos-Users] Sparse direct solver and language options for GPU

> I am a bit confused by the combination of well-conditioned and semi-definite in your description
I probably didn't use the correct matrix terminology.
My matrix is a Modified Nodal Analysis circuit solver matrix.
The only elements I have are resistors, constant current sources, and constant voltage sources,
so it really looks just like example 3 in http://www.swarthmore.edu/NatSci/echeeve1/Ref/mna/MNA2.html
except larger.

I used Anasazi.BasicEigenproblem to find the largest and smallest eigenvalues.
The condition number is the ratio of the largest to the smallest eigenvalue.
There were messages from AztecOO complaining about negative eigenvalues.
"Warning : The smallest eigenvalue of the Lanczos matrix

                is negative or zero (-6.975319e-01)"

My solution looks right, has energy balance, and agrees with Xyce for the circuit solution, so I think the input matrix is okay.

I didn't find great results with the iterative solvers.
Using PyTrilinos and AztecOO, the best result I found was with AZ_cg_condnum:
    solver = AztecOO.AztecOO(A, x, b)
    solver.SetAztecOption(AztecOO.AZ_solver, AztecOO.AZ_cg_condnum)
    ierr = solver.Iterate(iterations, 1e-8)
This iterative solution converges slowly, and more slowly with large problems.
I can send a comparison of the direct and iterative solutions, if that would be interesting.

This is the direct approach in PyTrilinos:
    problem= Epetra.LinearProblem(A, x, b)
    solver= Amesos.Klu(problem)
    solver.SymbolicFactorization()
    solver.NumericFactorization()
    ierr = solver.Solve()

> There are very few direct sparse solvers available on a GPU
Since I was mistaken in my original question and Amesos2 doesn't use the GPU for direct sparse solutions,
I think I should try something other than a GPU.

Perhaps I can cram the problem into one big machine, or try MPI on smaller machines.
It looks like there is an Amazon EC2 instance that might be sufficient to test the idea:
r3.8xlarge 244GB
Do the performance problems with EC2 noted here:
http://www.csm.ornl.gov/workshops/SOS17/documents/HerouxDataOwnership.pdf
apply to a single large instance running problems such as an Amesos sparse direct solver?

> you could work in either real or complex arithmetic
I haven't needed a complex matrix for this project yet, although it may become interesting in the future.

Thanks for the advice, it is all helpful.

-Tom Anderson

On Wed, Sep 17, 2014 at 10:17 AM, Heroux, Mike <MHeroux at csbsju.edu<mailto:MHeroux at csbsju.edu>> wrote:
Tom,

With such a well-conditioned matrix, it might be the case that an iterative method would be more effective, especially if your target device is a GPU.  There are very few direct sparse solvers available on a GPU right now (AFAIK).  Even those that are available on the GPU tend to do problem setup on the host and only push out portions of the computation to the GPU.

I am a bit confused by the combination of well-conditioned and semi-definite in your description, since the latter implies that some eigenvalues could be zero making the matrix singular.  Do you pre-process the matrix to filter out some equations?  Assuming you have a way to keep the condition number small, a diagonally scaled CG may be all you need.

Also, you could work in either real or complex arithmetic.  The equivalent real formulation of a Hermitian matrix is real positive definite with the same set of unique eigenvalue (multiplicity increases by a factor of two), so CG will converge (assuming equivalent preconditioning) in the same number of iterations regardless of using the complex or real version of CG.

I hope this helps.

Mike

From: Tom Anderson <tomacorp at gmail.com<mailto:tomacorp at gmail.com><mailto:tomacorp at gmail.com<mailto:tomacorp at gmail.com>>>
Date: Wednesday, September 17, 2014 at 11:49 AM
To: "trilinos-users at software.sandia.gov<mailto:trilinos-users at software.sandia.gov><mailto:trilinos-users at software.sandia.gov<mailto:trilinos-users at software.sandia.gov>>" <trilinos-users at software.sandia.gov<mailto:trilinos-users at software.sandia.gov><mailto:trilinos-users at software.sandia.gov<mailto:trilinos-users at software.sandia.gov>>>
Subject: [Trilinos-Users] Sparse direct solver and language options for GPU

Is C++ and Amesos2 the only GPU-capable sparse direct solver option using Trilinos?
Have I missed any good alternatives?

Here is a longer version of the same question with a description of the application:

I am working on a thermal simulation problem, using a fine square mesh instead of a more complex course mesh.

I think about the thermal problem as the solution to an equivalent electronic circuit.

I have constructed a matrix using Modified Nodal Analysis as described in
http://www.swarthmore.edu/NatSci/echeeve1/Ref/mna/MNA2.html .
My PyTrilinos matrix code agrees with the Xyce circuit simulator.
I could just use Xyce instead of writing my own matrix code,
but Xyce gets slow on large problems.
My equivalent circuit only needs voltage sources, current sources, and resistors.
The solution I want is the DC operating point of the circuit.

The matrix is real, Hermitian, positive semi-definite, if I understand matrix lingo correctly.
The condition number is about 430 and I expect it to always be less than about 7000.

I am aiming for 200e6 unknowns (circuit nodes) and 1.4e9 matrix entries (1.2e9 resistors).
As of now, I can solve 1e6 nodes using Amesos/PyTrilinos on an old Mac mini, single threaded.
Solve time is reasonable and linear in the number of unknowns, and memory usage is about 1kB per unknown.
My existing code uses PyTrilinos and is 2D.  I plan to move to 2.5D (3D planar) soon.
I expect to deploy the application on Linux.

Next I would like to use GPU processing. I didn't find a way to use the GPU with PyTrilinos.
I have been avoiding Fortran because of the install process on OSX.
It seems that C++ and Amesos2 is the only GPU-capable sparse direct solver using Trilinos.
Have I missed any good alternatives?

Any advice about my project is welcome, including suggestions for different approaches, platforms, or software.

Thanks,

Tom Anderson
tom_anderson at keysight.com<mailto:tom_anderson at keysight.com><mailto:tom_anderson at keysight.com<mailto:tom_anderson at keysight.com>>
Keysight Technologies
Santa Rosa, California