[Trilinos-Users] Belos GMRES results changing with MPI rank count

Rutherford, Joseph M jmruther at illinois.edu
Wed Jun 8 23:50:55 EDT 2016


Alicia,

Thanks for the guidance.  The MatrixMarket writer output is a huge help for seeing what is happening. That gives me a strong starting point.

To test that I can read the files reliably, I wrote a simple driver that writes out a CrsMatrix as MatrixMarket, and then tried to read it back in.  I was surprised to get an error message from MATLAB about my file:
<console>
>> [matrix,rows,cols,entries] = mmread('crsmatrix.mtx')
Data file does not contain expected amount of data.
Check that number of data lines matches nonzero count.
Error using mmread (line 113)
Invalid data.
</console>
I assume the intended reader is the one provided at http://math.nist.gov/MatrixMarket/mmio/matlab/mmiomatlab.html ?

I also tried to read the file in Python using scipy.io.mmread(), but I got “ValueError: total size of new array must be unchanged”.  Is there another package you recommend for reading these into MATLAB/Python/etc?
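
The MATLAB message points at a mismatch between the nonzero count declared in the file and the number of data lines, and the scipy error would be consistent with the same thing, although that is only a guess at the cause. A minimal sketch of a check that needs no reader at all, assuming a coordinate-format file (the function name and the sample text are made up for illustration, not taken from the Tpetra writer):

```python
def check_matrix_market_counts(text):
    """Return (rows, cols, declared_nnz, data_lines_found) for a
    coordinate-format Matrix Market file given as a string."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket banner")
    banner = lines[0].split()
    if len(banner) < 5 or banner[2].lower() != "coordinate":
        raise ValueError("this sketch only handles the coordinate format")
    # Drop comment lines (starting with '%') and blank lines.
    body = [ln for ln in lines[1:]
            if ln.strip() and not ln.lstrip().startswith("%")]
    if not body:
        raise ValueError("missing size line")
    # The first remaining line is the size line: rows cols nnz.
    rows, cols, declared = (int(tok) for tok in body[0].split()[:3])
    found = len(body) - 1
    return rows, cols, declared, found


# Hypothetical file that declares 4 entries but only contains 3.
sample = """%%MatrixMarket matrix coordinate real general
% written by a hypothetical driver
3 3 4
1 1 2.0
2 2 3.0
3 3 4.0
"""

rows, cols, declared, found = check_matrix_market_counts(sample)
print(declared, found)  # 4 3
```

For a real file, pass in open('crsmatrix.mtx').read(). If the declared and found counts differ, the file itself is inconsistent and switching readers is unlikely to help.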

Joe

From: Alicia Klinvex [mailto:aklinvex at gmail.com]
Sent: Wednesday, June 08, 2016 10:28 AM
To: Rutherford, Joseph M <jmruther at illinois.edu>
Cc: trilinos-users at trilinos.org
Subject: Re: [Trilinos-Users] Belos GMRES results changing with MPI rank count

Hello Joseph,

I recommend you try calling the MatrixMarket writer to ensure that your operators are in fact the same:
https://trilinos.org/docs/dev/packages/tpetra/doc/html/classTpetra_1_1MatrixMarket_1_1Writer.html#a6fcc0d5884f6b7ec0016dc628b170e86
I don't know if you've ever used MatrixMarket before, but it's a filetype that you can easily read into Matlab for debugging (assuming your matrix is small enough/computer is big enough).  I would start by calling that writer from 1, 2, and 4 MPI processes and using Matlab to determine whether those matrices are actually the same.  I would also use Matlab to estimate/compute the condition number.  I assume you're aware that you can get small differences in your mat-vec with different numbers of MPI processes, just because the order of operations is different, and those differences can be magnified like any other error.
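
To make the magnification point concrete: for A*x = b and an approximate solution with residual r, the relative error is bounded by cond(A) * ||r|| / ||b||, so a small residual only guarantees a small error when the condition number is modest. Below is a minimal sketch with a hypothetical 2x2 diagonal matrix (not the operator from this thread; all names are illustrative). The perturbed vector passes a 1e-6 relative-residual test even though one of its entries is off by 100%.

```python
import math


def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def matvec(a, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


# Hypothetical diagonal system with condition number 1e8.
A = [[1.0, 0.0], [0.0, 1e-8]]
x_exact = [1.0, 1.0]
b = matvec(A, x_exact)

# A perturbed "solution" whose second entry is wrong by 100%.
x_approx = [1.0, 2.0]
r = [bi - yi for bi, yi in zip(b, matvec(A, x_approx))]

rel_residual = norm2(r) / norm2(b)
diff = [xe - xa for xe, xa in zip(x_exact, x_approx)]
rel_error = norm2(diff) / norm2(x_exact)
cond = 1.0 / 1e-8  # largest over smallest diagonal entry

print(rel_residual < 1e-6)  # True: passes a 1e-6 residual test
print(rel_error)            # about 0.707: the error is O(1)
```

If the estimated condition number of your operator is large, answers that differ between rank counts may all be legitimate to within the convergence tolerance.
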
Best wishes,
Alicia

On Wed, Jun 8, 2016 at 9:17 AM, Rutherford, Joseph M <jmruther at illinois.edu> wrote:
All,

My custom operator expresses a diagonalized dense operator as a cascade of sparse operators.  If A,B,C are Tpetra::Operators, I is an optional identity Operator that performs a Tpetra::Import, and x,y are MultiVectors, then my custom operator computes y=(A*B+C)*I*x.  My setup is as follows:

1.)    Vectors x,y are 1:1 distributed in a non-uniform map.

2.)    apply() consistently computes the same mat-vec with MPI rank counts 1 and 2.

3.)    Belos GMRES converges to validated answers with 1 MPI rank.

4.)    Belos GMRES converges to different answers with MPI rank counts 1, 2, and 4.
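
A note on point 2: "the same mat-vec" across rank counts is best checked against a tolerance rather than bit for bit, because each rank sums its own partial results and the partials are then combined, which changes the order of the floating-point additions. Below is a minimal sketch of that effect in plain Python. The chunking mimics a contiguous distribution, the function name is made up, and the values are chosen adversarially so that the difference is visible. Real differences are usually near machine precision, and an actual MPI reduction may combine partials in a different order than this loop does.

```python
import math


def chunked_sum(values, nranks):
    """Sum values as nranks contiguous chunks: each 'rank' adds its
    chunk left to right, then the partial sums are added in rank order."""
    base, extra = divmod(len(values), nranks)
    partials = []
    start = 0
    for rank in range(nranks):
        size = base + (1 if rank < extra else 0)
        local = 0.0
        for v in values[start:start + size]:
            local += v
        partials.append(local)
        start += size
    total = 0.0
    for p in partials:
        total += p
    return total


# Adversarial data: large cancelling terms next to small ones.
values = [1e16, 1.0, -1e16, 1.0]
for nranks in (1, 2, 4):
    print(nranks, chunked_sum(values, nranks))
# 1 1.0
# 2 0.0
# 4 1.0
print("exact", math.fsum(values))  # exact 2.0
```

None of the three orderings reproduces the exactly rounded sum here, and they do not all agree with each other, so a bitwise comparison between rank counts can fail even when the operator is implemented correctly.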

Points 1 and 2 suggest that perhaps the operator itself is working correctly. I have no idea what I might be doing incorrectly with Belos.  The only “GMRES” solver parameters I’m defining are “Convergence Tolerance” and “Verbosity”.  For verbosity=127 (all enums summed together), I get the following output:

<console>
Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (153, 1, Passed)
   Passed.......OR Combination ->
     OK...........Number of Iterations = 152 < 1000
     Converged....(2-Norm Res Vec) / (2-Norm Prec Res0)
                  residual [ 0 ] = 8.47249e-07 < 1e-06

Passed.......OR Combination ->
  OK...........Number of Iterations = 152 < 1000
  Converged....(2-Norm Res Vec) / (2-Norm Prec Res0)
               residual [ 0 ] = 8.47249e-07 < 1e-06


=========================================================================================================================

                                          TimeMonitor results over 4 processors

Timer Name                                        MinOverProcs    MeanOverProcs    MaxOverProcs    MeanOverCallCounts
-------------------------------------------------------------------------------------------------------------------------
Belos: Operation Op*x                             9.564 (154)     9.717 (154)      9.883 (154)     0.0631 (154)
Belos: Operation Prec*x                           0 (0)           0 (0)            0 (0)           0 (0)
Belos: Orthogonalization                          0.1221 (153)    0.2889 (153)     0.4433 (153)    0.001888 (153)
Belos: PseudoBlockGmresSolMgr total solve time    9.949 (1)       9.949 (1)        9.95 (1)        9.949 (1)
=========================================================================================================================
</console>

Can anyone please suggest how to better diagnose the problem?

Joe

_______________________________________________
Trilinos-Users mailing list
Trilinos-Users at trilinos.org
https://trilinos.org/mailman/listinfo/trilinos-users
