[Trilinos-Users] Why isn't BLAS used more in Epetra?

Sven Baars s.baars at rug.nl
Tue Sep 22 04:52:31 EDT 2015


Hi Mike,

I linked to whatever was present on my system, which apparently was
OpenBLAS. I just tried it with ATLAS and this is indeed slower. I also
tried it with MKL on a cluster and there performance of daxpy also seems
better than that of Update if I run it on a single core, if I run it
threaded, it's also faster than threaded Update, but if I run it with
MPI (4 processes on one node 1 thread each), performance is similar.

Mainly the fact that ATLAS seems to be slower on my system than Update
is weird, because they do seem to have tried to do some heavy
optimization for the daxpy operation. So I assume there is something
wrong with the version of ATLAS that is present on my system.

The fact that daxpy with MPI performs similarly might be due to
assumptions about cache usage that are not true because you use it
multiple times. So in most actual use cases (MPI+MKL) it might not
matter all that much.

Now, what I think is that we may assume that BLAS developers tried to
optimize their library as much as possible. This means that even
something simple as x+ay is more optimized than a simple for loop. And
for that reason, BLAS should be used as much as possible. If performance
is worse with BLAS, that's most likely a problem with the BLAS
implementation.

By the way, it's not only Update, but also something like Norm2. I'm not
really sure about the other functions. At least Multiply seems to use
BLAS, which is nice.

Something about the threaded implementations I don't understand is for
instance that Norm2 is only threaded when EPETRA_HAVE_OMP_NONASSOCIATIVE
is defined, but not when EPETRA_HAVE_OMP is defined. Shouldn't
non-associative mean that you can't thread it in the way it is done
there, but you can when EPETRA_HAVE_OMP is defined? Also: when is
EPETRA_HAVE_OMP_NONASSOCIATIVE defined? Or is this just some define that
disables threading for every for loop that sums something because it was
bugged?

One of the problems I have with Epetra and OpenMP was described here:
https://trilinos.org/pipermail/trilinos-users/2014-March/003996.html
But currently I am not in the process of trying to get this to work
properly any more.

Sven

On 09/21/2015 06:12 PM, Heroux, Michael A wrote:
> Sven,
>
> Thanks for your comments.  When Epetra was originally developed, the
> performance of level-1 BLAS was not much faster than raw code, and
> sometimes slower, especially for small to medium sized problem where
> function call overhead had an impact.  Also, since Update had broader
> functionality, the use of daxpy would have complicated the implementation
> without substantial performance improvement.
>
> Epetra does use BLAS extensively for level-3 operation, primarily found in
> the Multiply method, which is used for block Krylov solvers.  Linking with
> a good BLAS implementation has a dramatic performance improvement for
> these operations.
>
> Your data point of the Update vs daxpy performance is intriguing.  Thanks
> for letting us know.  Is this from the MKL version of daxpy?
>
> Regarding Tpetra, all of these operations are handled by the Kokkos
> package, which has custom backends for multicore, manycore and GPUs.
> These operations have regularly shown excellent and portable performance
> via Kokkos on all these platforms.  So for Tpetra users there should not
> be the same kind of issue.
>
> If you have had issues with threaded Epetra, please let us know.  We will
> fix these errors.
>
> Thanks again.
>
> Mike
>
> On 9/21/15, 3:10 AM, "Trilinos-Users on behalf of Sven Baars"
> <trilinos-users-bounces at trilinos.org on behalf of s.baars at rug.nl> wrote:
>
>> Hey everyone,
>>
>> I was wondering why BLAS isn't used more often. It seems to me that
>> you'd want to use this as much as possible. For instance in the Update
>> method of the Epetra_MultiVector. I attached an example where I test its
>> performance. Here is my output on my local machine:
>>
>> $ ./test
>> Time with Update 1.35864
>> Time with daxpy_ 1.05639
>>
>> and it's even threaded:
>>
>> $ OMP_NUM_THREADS=4 ./test
>> Time with Update 1.38495
>> Time with daxpy_ 0.66636
>>
>> Note that I don't compile Epetra with OpenMP support, because for me
>> it's bugged in some places. But I can't imagine that the implementation
>> in Epetra is better than the BLAS one. So why isn't BLAS used more often?
>>
>> Cheers,
>> Sven
>>
>> P.S. I know Teuchos has BLAS wrappers, but I just wanted to make sure I
>> was actually using BLAS, and that Tpetra probably does this better, but
>> Tpetra's API is still too unstable for me.



More information about the Trilinos-Users mailing list