[Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue 1

Mon Aug 2 13:52:00 MDT 2010

Chris,

We should probably take this discuss onto the Trilinos developers list.  If any other Trilinos users would like to be part of this discussion, please email me.

- Ross

> -----Original Message-----
> From: Baker, Christopher G. [mailto:bakercg at ornl.gov]
> Sent: Monday, August 02, 2010 1:45 PM
> To: Bartlett, Roscoe A; Hoemmen, Mark
> Cc: trilinos-users at software.sandia.gov
> Subject: Re: [Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue 1
> 
> I believe that the compile-time-generated expression object can be
> passed via the Kokkos Node API.
> According to my understanding, expression templates don't generate "new
> code", but instead a new combination of existing expression objects.
> The "new code" is (hopefully, but optionally) generated by the compiler
> when the code of the individual expression code is inlined; this will
> be done by the CUDA compiler.
> 
> There is the issue that we need to explicitly instantiate the
> parallel_for/parallel_reduce on the particular expression, inside a .cu
> file which will be processed by the CUDA compiler. However, that is not
> a big obstacle; the developer knows at compile time what the expression
> type is, because the compiler will list it in the link error when the
> instantiation isn't found for the
> CUDANode::parallel_for/parallel_reduce :).
> 
> For example, the expression example from the included wikipedia
> article:
> http://en.wikipedia.org/wiki/Expression_templates
> from the code:
> Vec x = alpha*(u - v);
> resulted in
> Vec x = VecScaled< VecDifference<Vec,Vec> >
> The implementation of "Vec" in that case would loop over the columns of
> the data in x and assign the expression, like so:
> for (size_type i = 0; i != v.size(); ++i) {
>  _data[i] = expr[i];
> }
> In our case, this does not work, because the for loop itself needs to
> be pushed down to the Node. Instead, there would be a
> ExpressionAssignmentOp<Expr>, so that the Vec(VecExpression<E>)
> operator would perform:
> Vec(const VecExpression<E> &ve) {
>    ExpressionAssignmentOp<VecExpression<E> > op;
>    op.dst = data_;
>    op.expr = ve;
>    node->parallel_for(0,len_,op);
> }
> 
> The downside of this is that it requires that the expression can be
> executed in parallel, i.e., there is no loop-carried dependency between
> assignment of x[i] and x[j], i != j. Otherwise, we could add a new node
> primitive that would sequentially initialize a buffer, i.e., the vector
> data. We could even, via partial specialization and new (outer)
> expression type(s) overload the expression template language to support
> both sequential and parallel initialization:
> Vec x = par_init( alpha*(u-v) );
> or
> Vec x = seq_init( alpha*(u-v) );
> One of the above could be the default.
> 
> All of this is to say, I think expression templates could be put to
> use, from the Kokkos level and Tpetra level (Tpetra vector will soon
> accept an arbitrary expression).
> 
> Chris
> 
> On 8/2/10 3:08 PM, "Bartlett, Roscoe A" <rabartl at sandia.gov> wrote:
> 
> Chris,
> 
> Expression templates generate new code right when you write:
> 
>    y = alpha * x1 + beta * x2;
> 
> and then gets optimized away by the compiler to a flat fused loop.
> 
> See:
> 
>    http://en.wikipedia.org/wiki/Expression_templates
> 
> which also references the original ET article.
> 
> How would this code get this ET compiled by CUDA and run on the GPU?
> 
> - Ross
> 
> 
> > -----Original Message-----
> > From: Baker, Christopher G. [mailto:bakercg at ornl.gov]
> > Sent: Monday, August 02, 2010 12:55 PM
> > To: Bartlett, Roscoe A; Hoemmen, Mark; trilinos-
> > users at software.sandia.gov
> > Subject: Re: [Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue 1
> >
> > I'm not sure why "expression templates will not work when the data is
> > on a GPU". Can you elaborate?
> >
> > Chris
> >
> >
> > On 8/2/10 2:51 PM, "Bartlett, Roscoe A" <rabartl at sandia.gov> wrote:
> >
> > Expression templates will not work when the data is on a GPU.
> >
> > My stop-gap measure for linear algebra operations is given here:
> >
> >
> >
> http://trilinos.sandia.gov/packages/docs/dev/packages/thyra/doc/html/Li
> > nearAlgebraFunctionConvention.pdf
> >
> > with examples at:
> >
> >
> >
> http://trilinos.sandia.gov/packages/docs/dev/packages/thyra/doc/html/cl
> > assThyra_1_1VectorBase.html
> >
> > Therefore, for:
> >
> >    y = alpha * x1 + beta * x2
> >
> > you would have:
> >
> >   V_StVpStV( y, alpha, x1, beta, x2 );
> >
> > This is not the most beautiful thing you have ever seen but at least
> > you can go back and forth between the linear algebra notation and the
> > function declaration and call sequence.  You can also make this run
> > very fast without fancy expression templates on the GPU in a fairly
> > transparent way (at least transparent to the user).
> >
> > Cheers,
> >
> > - Ross
> >
> >
> > > -----Original Message-----
> > > From: trilinos-users-bounces at software.sandia.gov [mailto:trilinos-
> > > users-bounces at software.sandia.gov] On Behalf Of Mark Hoemmen
> > > Sent: Monday, August 02, 2010 12:25 PM
> > > To: trilinos-users at software.sandia.gov
> > > Subject: Re: [Trilinos-Users] Trilinos-Users Digest, Vol 60, Issue
> 1
> > >
> > > On Aug 1, 2010, at 10:08 PM, trilinos-users-
> > request at software.sandia.gov
> > > wrote:
> > > > Message: 4
> > > > Date: Mon, 02 Aug 2010 02:46:31 +0200
> > > > From: Nico Schl?mer <nico.schloemer at ua.ac.be>
> > > > Subject: Re: [Trilinos-Users] Epetra_SerialDenseVector vs.
> > > >        Teuchos::SerialDenseVector
> > > > To: "Bartlett, Roscoe A" <rabartl at sandia.gov>
> > > > Cc: "Heroux, Michael A" <maherou at sandia.gov>, "Thornquist,
> > Heidi
> > > K"
> > > >        <hkthorn at sandia.gov>, trilinos-users at software.sandia.gov
> > > > Message-ID: <5a5040fba02ffca0981ca1e5c125a8d0 at ua.ac.be>
> > > > Content-Type: text/plain; charset=utf-8
> > > >
> > > >> It is very hard to make operator+, operator-, operator*
> efficient.
> > > >
> > > > I see.
> > > > Would you have any recommendation for me what to do with
> expression
> > > like
> > > >
> > > >   y = alpha * x1 + beta * x2 + ...
> > > >
> > > > ?
> > >
> > > "Efficient" here means "not creating temporary vectors."  Each of
> > those
> > > "alpha*x1" terms will have to create a new temporary vector, and
> each
> > > operator+ invocation will have to do so as well.  The temporaries
> > will
> > > go away, but you'll still have to spend time in memory allocation.
> > >
> > > If you're not worried about that, then you could define non-member
> > > operator+, operator*, ... methods.  They don't need to belong to or
> > be
> > > friends of Teuchos::SerialDenseVector, because SerialDenseVector
> has
> > an
> > > operator[].  Each of these methods should create and return a new
> > > SerialDenseVector.
> > >
> > > If you are worried about the cost of creating temporary vectors in
> > such
> > > expressions, but still want to write expressions in this way, you
> may
> > > wish to investigate the uBLAS, which is in the Boost collection of
> > C++
> > > libraries.  See e.g., the overview at
> > >
> > >
> >
> http://www.boost.org/doc/libs/1_43_0/libs/numeric/ublas/doc/overview.ht
> > > m
> > >
> > > mfh
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Trilinos-Users mailing list
> > > Trilinos-Users at software.sandia.gov
> > > http://software.sandia.gov/mailman/listinfo/trilinos-users
> >
> > _______________________________________________
> > Trilinos-Users mailing list
> > Trilinos-Users at software.sandia.gov
> > http://software.sandia.gov/mailman/listinfo/trilinos-users
> >
> >
> 
> 
>