[Trilinos-Users] Determinism in iterative solvers (Tpetra, Belos)

Hoemmen, Mark mhoemme at sandia.gov
Tue Jul 26 12:46:16 EDT 2016


On 7/26/16, 7:50 AM, "Trilinos-Users on behalf of trilinos-users-request at trilinos.org" <trilinos-users-bounces at trilinos.org on behalf of trilinos-users-request at trilinos.org> wrote:
>Date: Fri, 15 Jul 2016 17:18:23 +0200
>From: Christopher Thiele <christopher.thiele92 at gmail.com>
>To: trilinos-users at trilinos.org
>Subject: [Trilinos-Users] Determinism in iterative solvers (Tpetra, Belos)
>
>I am currently comparing a custom CG solver implementation against the CG
>solver from the Belos package (Belos+Tpetra with OpenMP) to verify
>correctness and to get an idea of its computational performance. When I
>look at the residual norms in my solver, they are basically the same as
>those I get from Belos, but with tiny deviations. These deviations are not
>unexpected: the CG method relies heavily on dot products, and the involved
>reductions are not fully deterministic, since the order in which threads and
>processes sum up their intermediate results may vary.

Just to summarize for other readers:

OpenMP parallel reductions are not deterministic by default.  You may get different results from run to run, even with the same input data and the same number of threads.
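
As a minimal sketch of what I mean (the names and data are made up for illustration), the dot product below may give slightly different answers across runs, because OpenMP does not specify the order in which per-thread partial sums are combined:

    // dot_omp.cpp -- compile with, e.g., g++ -fopenmp dot_omp.cpp
    #include <cstdio>
    #include <vector>

    int main() {
      const int n = 1000000;
      std::vector<double> x(n), y(n);
      for (int i = 0; i < n; ++i) {
        x[i] = 1.0 / (i + 1);
        y[i] = (i % 2 == 0) ? 1.0 : -1.0;
      }

      double sum = 0.0;
      // The order in which per-thread partial sums are combined is
      // unspecified, so 'sum' need not be bitwise identical across runs.
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
      }
      std::printf("dot = %.17g\n", sum);
      return 0;
    }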

Tpetra uses Kokkos for parallel reductions.  Kokkos’ OpenMP back-end takes special care to make parallel reductions deterministic.  If you run with the same input (and the same data alignment; vectorization is one factor we can’t completely control) and the same number of threads, Kokkos::parallel_reduce will give you the same answer each time.
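
For comparison, a Kokkos version of the same dot product might look like the sketch below (again, the names and data are just for illustration):

    // dot_kokkos.cpp -- a dot product via Kokkos::parallel_reduce.
    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 1000000;
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
          x(i) = 1.0 / (i + 1);
          y(i) = (i % 2 == 0) ? 1.0 : -1.0;
        });

        double dot = 0.0;
        // With the OpenMP back-end, the same data and thread count
        // reproduce the same result each run, as described above.
        Kokkos::parallel_reduce("dot", n,
          KOKKOS_LAMBDA(const int i, double& lsum) {
            lsum += x(i) * y(i);
          }, dot);
        std::printf("dot = %.17g\n", dot);
      }
      Kokkos::finalize();
      return 0;
    }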

Please note, however, that MPI does NOT promise deterministic reductions (MPI_Allreduce, MPI_Reduce, etc.).  For example, it is legal for MPI to change the reduction tree between two different calls to MPI_Allreduce in the same run with the same communicator.  I have seen research projects that did just that, for automatic performance tuning.  I don’t generally see this with production MPI implementations, but it is allowed.
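
To make this concrete, here is a minimal sketch (plain MPI, no Trilinos): nothing requires the two MPI_Allreduce calls below to use the same reduction tree, so the standard permits their results to differ in the last bits.

    // allreduce.cpp -- compile with, e.g., mpicxx allreduce.cpp
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[]) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // Each process contributes one partial sum.
      double local = 1.0 / (rank + 1);
      double global1 = 0.0, global2 = 0.0;
      // The standard permits a different reduction tree (and hence a
      // different floating-point association) for each of these calls.
      MPI_Allreduce(&local, &global1, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      MPI_Allreduce(&local, &global2, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0) {
        std::printf("%.17g\n%.17g\n", global1, global2);
      }
      MPI_Finalize();
      return 0;
    }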

>However, I noticed that with the Belos solver the residuals were identical
>every time, even up to the 25th digit or so. 

I’m very happy to hear this ☺

>My question is how much of an
>impact this deterministic behavior will have on performance. It
>probably requires additional synchronization, and I want to do a fair
>comparison. 

Deterministic reductions do not cost much, and they scale well with the number of threads.  The Kokkos developers have a strong incentive to make this fast.

>Also, is there a way to disable deterministic reductions in
>Trilinos/Tpetra/Kokkos altogether?

We have very little incentive to do this.  We want deterministic reductions, because they make algorithm development and debugging easier.  Our users want deterministic reductions for the same reasons.

Also, OpenMP reductions only work with built-in data types and a small set of built-in reduction operators.  What about std::complex, custom Scalar types, or custom reduction operators?  What if the reduction result is a pair of values (say, a Scalar and a bool that indicates whether the result is valid), or some other small struct?  Tpetra and downstream packages depend on all of these cases.
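
For example, a reduction whose result is a value/index pair is easy to express with Kokkos’ built-in MinLoc reducer, but has no direct counterpart among OpenMP’s built-in reduction clauses in C/C++.  A minimal sketch (the reducer interface here is from newer Kokkos releases, so the details may differ in older versions):

    // minloc_kokkos.cpp -- reduce to a (value, index) pair.
    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 1000;
        Kokkos::View<double*> x("x", n);
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
          x(i) = (i - 400) * (i - 400); // minimum at i == 400
        });

        // The reduction result is a (value, index) pair -- something
        // plain OpenMP reduction clauses in C/C++ do not support.
        typedef Kokkos::MinLoc<double, int> reducer_type;
        reducer_type::value_type result;
        Kokkos::parallel_reduce("minloc", n,
          KOKKOS_LAMBDA(const int i, reducer_type::value_type& lmin) {
            if (x(i) < lmin.val) {
              lmin.val = x(i);
              lmin.loc = i;
            }
          }, reducer_type(result));
        std::printf("min = %g at i = %d\n", result.val, result.loc);
      }
      Kokkos::finalize();
      return 0;
    }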

The Kokkos developers have contemplated adding an option to use built-in OpenMP reductions, when the result type and reduction operator work with OpenMP.  If that’s something you want, and you have evidence that it matters for performance, please let us know.

mfh


