[Trilinos-Users] Why isn't BLAS used more in Epetra?

Heroux, Michael A maherou at sandia.gov
Tue Sep 22 07:40:26 EDT 2015


Sven,

Thanks for the additional details.  

Regarding the non-associative option, we have it in place to protect users from unwanted variation in floating-point results.  We protect any collective operation, such as norm2, dot products, or atomic writes, with this macro.  Although sensitivity of this kind is not a desirable property and is best avoided where possible, some codes are very sensitive to variations in floating-point values, even when those variations come from valid but different orderings of operations.
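
As a small illustration of the effect (a standalone sketch, not Epetra code), summing the same three values in a different order can already change the result:

#include <cstdio>

int main() {
  double a = 1.0e16, b = -1.0e16, c = 1.0;
  std::printf("%.17g\n", (a + b) + c);  // prints 1
  std::printf("%.17g\n", a + (b + c));  // prints 0: b + c rounds back to b
  return 0;
}

A threaded reduction effectively chooses a different summation order from run to run (or with different thread counts), which is exactly the variation the macro is there to suppress.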

If you want to enable non-associative parallel execution, just define EPETRA_HAVE_OMP_NONASSOCIATIVE in your CMake compiler flags (-DEPETRA_HAVE_OMP_NONASSOCIATIVE).
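
For example, a configure invocation along these lines (illustrative only; the source path and the bracketed options are placeholders) would do it:

cmake \
  -D CMAKE_CXX_FLAGS="-DEPETRA_HAVE_OMP_NONASSOCIATIVE" \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D Trilinos_ENABLE_Epetra=ON \
  [your other options] \
  /path/to/Trilinos

The essential part is adding the define to CMAKE_CXX_FLAGS; the other options are just the usual ones for an OpenMP-enabled Epetra build.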

Regarding thread-safe RCP, we are finally working on it, and should have it done in a couple of months.

Thanks again.

Mike

> On Sep 22, 2015, at 3:52 AM, Sven Baars <s.baars at rug.nl> wrote:
> 
> Hi Mike,
> 
> I linked to whatever was present on my system, which apparently was
> OpenBLAS. I just tried it with ATLAS, and that is indeed slower than
> Update. I also tried it with MKL on a cluster: there, the performance
> of daxpy also seems better than that of Update on a single core, and
> threaded daxpy is likewise faster than threaded Update, but with MPI
> (4 processes on one node, 1 thread each) the performance is similar.
> 
> What strikes me as odd is mainly that ATLAS seems to be slower than
> Update on my system, because the ATLAS developers do seem to have put
> heavy optimization effort into the daxpy operation. So I assume there
> is something wrong with the version of ATLAS present on my system.
> 
> The fact that daxpy with MPI performs similarly might be due to
> assumptions about cache usage that no longer hold when the operation
> is applied repeatedly. So in most actual use cases (MPI + MKL) it
> might not matter all that much.
> 
> Now, my view is that we may assume the BLAS developers have tried to
> optimize their library as much as possible. This means that even
> something as simple as x + a*y is better optimized than a plain for
> loop, and for that reason BLAS should be used as much as possible. If
> performance is worse with BLAS, that is most likely a problem with the
> particular BLAS implementation.
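> 
> To make concrete what I mean by a plain for loop versus the BLAS call
> (a minimal sketch, assuming the usual Fortran-style daxpy_ symbol):
> 
> extern "C" void daxpy_(const int *n, const double *alpha,
>                        const double *x, const int *incx,
>                        double *y, const int *incy);
> 
> // plain loop: y = y + a*x
> void axpy_loop(int n, double a, const double *x, double *y) {
>   for (int i = 0; i < n; ++i)
>     y[i] += a * x[i];
> }
> 
> // same operation through BLAS, which may be vectorized and threaded
> void axpy_blas(int n, double a, const double *x, double *y) {
>   const int one = 1;
>   daxpy_(&n, &a, x, &one, y, &one);
> }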
> 
> By the way, it's not only Update, but also something like Norm2. I'm not
> really sure about the other functions. At least Multiply seems to use
> BLAS, which is nice.
> 
> Something I don't understand about the threaded implementations is,
> for instance, that Norm2 is only threaded when
> EPETRA_HAVE_OMP_NONASSOCIATIVE is defined, but not when only
> EPETRA_HAVE_OMP is defined. Shouldn't "non-associative" mean that you
> can't thread it the way it is done there, whereas you can when only
> EPETRA_HAVE_OMP is defined? Also: when is
> EPETRA_HAVE_OMP_NONASSOCIATIVE ever defined? Or is it just a define
> that disables threading for every for loop that sums something,
> because that code path was buggy?
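> 
> To be explicit, the pattern I am asking about looks roughly like this
> (a paraphrase of what I saw, not a verbatim copy of the Epetra source):
> 
> double sum = 0.0;
> #ifdef EPETRA_HAVE_OMP_NONASSOCIATIVE
> #pragma omp parallel for reduction(+:sum)
> #endif
> for (int i = 0; i < n; ++i)
>   sum += x[i] * x[i];
> 
> so the reduction is only threaded when the "non-associative" macro is
> defined, which is the opposite of what I would expect from its name.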
> 
> One of the problems I had with Epetra and OpenMP was described here:
> https://trilinos.org/pipermail/trilinos-users/2014-March/003996.html
> But I am currently not trying to get this to work properly any more.
> 
> Sven
> 
>> On 09/21/2015 06:12 PM, Heroux, Michael A wrote:
>> Sven,
>> 
>> Thanks for your comments.  When Epetra was originally developed,
>> level-1 BLAS was not much faster than raw code, and was sometimes
>> slower, especially for small to medium-sized problems where function
>> call overhead had an impact.  Also, since Update has broader
>> functionality, using daxpy would have complicated the implementation
>> without a substantial performance improvement.
>> 
>> Epetra does use BLAS extensively for level-3 operations, primarily in
>> the Multiply method, which is used for block Krylov solvers.  Linking
>> against a good BLAS implementation yields a dramatic performance
>> improvement for these operations.
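>> 
>> For example, a GEMM-style product of two multivectors (a minimal
>> sketch; A, B, and C are Epetra_MultiVector objects with compatible
>> maps) looks like:
>> 
>>   // C = 1.0 * A^T * B + 0.0 * C, dispatched to BLAS dgemm where possible
>>   C.Multiply('T', 'N', 1.0, A, B, 0.0);
>> 
>> This is the kind of call the block Krylov solvers make heavy use of.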
>> 
>> Your data point on Update vs. daxpy performance is intriguing.  Thanks
>> for letting us know.  Is this from the MKL version of daxpy?
>> 
>> Regarding Tpetra, all of these operations are handled by the Kokkos
>> package, which has custom backends for multicore, manycore, and GPU
>> architectures.  These operations have regularly shown excellent and
>> portable performance via Kokkos on all of these platforms.  So Tpetra
>> users should not see the same kind of issue.
>> 
>> If you have had issues with threaded Epetra, please let us know.  We will
>> fix these errors.
>> 
>> Thanks again.
>> 
>> Mike
>> 
>> On 9/21/15, 3:10 AM, "Trilinos-Users on behalf of Sven Baars"
>> <trilinos-users-bounces at trilinos.org on behalf of s.baars at rug.nl> wrote:
>> 
>>> Hey everyone,
>>> 
>>> I was wondering why BLAS isn't used more often. It seems to me that
>>> you'd want to use it as much as possible, for instance in the Update
>>> method of Epetra_MultiVector. I attached an example where I test its
>>> performance. Here is my output on my local machine:
>>> 
>>> $ ./test
>>> Time with Update 1.35864
>>> Time with daxpy_ 1.05639
>>> 
>>> and it's even threaded:
>>> 
>>> $ OMP_NUM_THREADS=4 ./test
>>> Time with Update 1.38495
>>> Time with daxpy_ 0.66636
>>> 
>>> Note that I don't compile Epetra with OpenMP support, because for me
>>> it's buggy in some places. But I can't imagine that the implementation
>>> in Epetra is better than the BLAS one. So why isn't BLAS used more often?
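>>> 
>>> For reference, the comparison in my attached example is essentially of
>>> this shape (a sketch with made-up sizes and repeat counts, not the
>>> exact attachment):
>>> 
>>> #include "Epetra_SerialComm.h"
>>> #include "Epetra_Map.h"
>>> #include "Epetra_MultiVector.h"
>>> extern "C" void daxpy_(const int *, const double *, const double *,
>>>                        const int *, double *, const int *);
>>> 
>>> Epetra_SerialComm comm;
>>> Epetra_Map map(1000000, 0, comm);
>>> Epetra_MultiVector x(map, 1), y(map, 1);
>>> x.Random(); y.Random();
>>> 
>>> // variant 1: y = 2.0*x + 1.0*y through Epetra
>>> for (int rep = 0; rep < 100; ++rep)
>>>   y.Update(2.0, x, 1.0);
>>> 
>>> // variant 2: the same operation through a direct BLAS call
>>> const int n = y.MyLength(), one = 1;
>>> const double alpha = 2.0;
>>> for (int rep = 0; rep < 100; ++rep)
>>>   daxpy_(&n, &alpha, x.Values(), &one, y.Values(), &one);
>>> 
>>> with each loop wrapped in a simple wall-clock timer.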
>>> 
>>> Cheers,
>>> Sven
>>> 
>>> P.S. I know Teuchos has BLAS wrappers, but I just wanted to make sure I
>>> was actually calling BLAS directly. I also know that Tpetra probably does
>>> this better, but Tpetra's API is still too unstable for me.
> 

