[Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue 1

Baker, Christopher G. bakercg at ornl.gov
Mon Aug 2 13:44:56 MDT 2010


I believe that the compile-time-generated expression object can be passed via the Kokkos Node API.
According to my understanding, expression templates don't generate "new code", but instead a new combination of existing expression objects. The "new code" is (hopefully, but optionally) generated by the compiler when the code of the individual expression code is inlined; this will be done by the CUDA compiler.

There is the issue that we need to explicitly instantiate the parallel_for/parallel_reduce on the particular expression, inside a .cu file which will be processed by the CUDA compiler. However, that is not a big obstacle; the developer knows at compile time what the expression type is, because the compiler will list it in the link error when the instantiation isn't found for the CUDANode::parallel_for/parallel_reduce :).

For example, the expression example from the included wikipedia article:
http://en.wikipedia.org/wiki/Expression_templates
from the code:
Vec x = alpha*(u - v);
resulted in
Vec x = VecScaled< VecDifference<Vec,Vec> >
The implementation of "Vec" in that case would loop over the columns of the data in x and assign the expression, like so:
for (size_type i = 0; i != v.size(); ++i) {
 _data[i] = expr[i];
}
In our case, this does not work, because the for loop itself needs to be pushed down to the Node. Instead, there would be a ExpressionAssignmentOp<Expr>, so that the Vec(VecExpression<E>) operator would perform:
Vec(const VecExpression<E> &ve) {
   ExpressionAssignmentOp<VecExpression<E> > op;
   op.dst = data_;
   op.expr = ve;
   node->parallel_for(0,len_,op);
}

The downside of this is that it requires that the expression can be executed in parallel, i.e., there is no loop-carried dependency between assignment of x[i] and x[j], i != j. Otherwise, we could add a new node primitive that would sequentially initialize a buffer, i.e., the vector data. We could even, via partial specialization and new (outer) expression type(s) overload the expression template language to support both sequential and parallel initialization:
Vec x = par_init( alpha*(u-v) );
or
Vec x = seq_init( alpha*(u-v) );
One of the above could be the default.

All of this is to say, I think expression templates could be put to use, from the Kokkos level and Tpetra level (Tpetra vector will soon accept an arbitrary expression).

Chris

On 8/2/10 3:08 PM, "Bartlett, Roscoe A" <rabartl at sandia.gov> wrote:

Chris,

Expression templates generate new code right when you write:

   y = alpha * x1 + beta * x2;

and then gets optimized away by the compiler to a flat fused loop.

See:

   http://en.wikipedia.org/wiki/Expression_templates

which also references the original ET article.

How would this code get this ET compiled by CUDA and run on the GPU?

- Ross


> -----Original Message-----
> From: Baker, Christopher G. [mailto:bakercg at ornl.gov]
> Sent: Monday, August 02, 2010 12:55 PM
> To: Bartlett, Roscoe A; Hoemmen, Mark; trilinos-
> users at software.sandia.gov
> Subject: Re: [Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue 1
>
> I'm not sure why "expression templates will not work when the data is
> on a GPU". Can you elaborate?
>
> Chris
>
>
> On 8/2/10 2:51 PM, "Bartlett, Roscoe A" <rabartl at sandia.gov> wrote:
>
> Expression templates will not work when the data is on a GPU.
>
> My stop-gap measure for linear algebra operations is given here:
>
>
> http://trilinos.sandia.gov/packages/docs/dev/packages/thyra/doc/html/Li
> nearAlgebraFunctionConvention.pdf
>
> with examples at:
>
>
> http://trilinos.sandia.gov/packages/docs/dev/packages/thyra/doc/html/cl
> assThyra_1_1VectorBase.html
>
> Therefore, for:
>
>    y = alpha * x1 + beta * x2
>
> you would have:
>
>   V_StVpStV( y, alpha, x1, beta, x2 );
>
> This is not the most beautiful thing you have ever seen but at least
> you can go back and forth between the linear algebra notation and the
> function declaration and call sequence.  You can also make this run
> very fast without fancy expression templates on the GPU in a fairly
> transparent way (at least transparent to the user).
>
> Cheers,
>
> - Ross
>
>
> > -----Original Message-----
> > From: trilinos-users-bounces at software.sandia.gov [mailto:trilinos-
> > users-bounces at software.sandia.gov] On Behalf Of Mark Hoemmen
> > Sent: Monday, August 02, 2010 12:25 PM
> > To: trilinos-users at software.sandia.gov
> > Subject: Re: [Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue 1
> >
> > On Aug 1, 2010, at 10:08 PM, trilinos-users-
> request at software.sandia.gov
> > wrote:
> > > Message: 4
> > > Date: Mon, 02 Aug 2010 02:46:31 +0200
> > > From: Nico Schl?mer <nico.schloemer at ua.ac.be>
> > > Subject: Re: [Trilinos-Users] Epetra_SerialDenseVector vs.
> > >        Teuchos::SerialDenseVector
> > > To: "Bartlett, Roscoe A" <rabartl at sandia.gov>
> > > Cc: "Heroux, Michael A" <maherou at sandia.gov>, "Thornquist,
> Heidi
> > K"
> > >        <hkthorn at sandia.gov>, trilinos-users at software.sandia.gov
> > > Message-ID: <5a5040fba02ffca0981ca1e5c125a8d0 at ua.ac.be>
> > > Content-Type: text/plain; charset=utf-8
> > >
> > >> It is very hard to make operator+, operator-, operator* efficient.
> > >
> > > I see.
> > > Would you have any recommendation for me what to do with expression
> > like
> > >
> > >   y = alpha * x1 + beta * x2 + ...
> > >
> > > ?
> >
> > "Efficient" here means "not creating temporary vectors."  Each of
> those
> > "alpha*x1" terms will have to create a new temporary vector, and each
> > operator+ invocation will have to do so as well.  The temporaries
> will
> > go away, but you'll still have to spend time in memory allocation.
> >
> > If you're not worried about that, then you could define non-member
> > operator+, operator*, ... methods.  They don't need to belong to or
> be
> > friends of Teuchos::SerialDenseVector, because SerialDenseVector has
> an
> > operator[].  Each of these methods should create and return a new
> > SerialDenseVector.
> >
> > If you are worried about the cost of creating temporary vectors in
> such
> > expressions, but still want to write expressions in this way, you may
> > wish to investigate the uBLAS, which is in the Boost collection of
> C++
> > libraries.  See e.g., the overview at
> >
> >
> http://www.boost.org/doc/libs/1_43_0/libs/numeric/ublas/doc/overview.ht
> > m
> >
> > mfh
> >
> >
> >
> >
> > _______________________________________________
> > Trilinos-Users mailing list
> > Trilinos-Users at software.sandia.gov
> > http://software.sandia.gov/mailman/listinfo/trilinos-users
>
> _______________________________________________
> Trilinos-Users mailing list
> Trilinos-Users at software.sandia.gov
> http://software.sandia.gov/mailman/listinfo/trilinos-users
>
>







More information about the Trilinos-Users mailing list