[Mulgara-dev] Query Evaluation - binding variables

Wed Apr 23 14:48:42 UTC 2008

Hi,

I've been staying out of this conversation despite being the original
implementor of trans (which walk was subsequently derived from). But I
figure it's about time I weighed in.

On Wed, Apr 23, 2008 at 8:01 AM, Andrae Muys <andrae at netymon.com> wrote:
>  On 23/04/2008, at 10:29 PM, Alex Hall wrote:
<snip/>
>  > Can you give an estimate for how long this modification would
>  > take?  It
>  > seems like it would be a pretty significant one, as the reordering is
>  > done in the join operation (as I understand it) while the exception is
>  > thrown during constraint resolution, which occurs prior to the
>  > join.  In
>  > the meantime I'll just use the workaround suggested by Ronald.
>
>  Ok, I was going to say a day, maybe two.  However I just double
>  checked the implementation of walk/trans, and I suspect we're
>  probably talking 2-3 days, 4 if you haven't written a tuples before.

Given that the mechanism isn't immediately apparent, anyone
implementing this should really have a conversation with me before
proceeding!

>  The point is that all the infrastructure required to do it is now in
>  place, it's just that walk and trans haven't been updated to make use
>  of it.  The complication is that walk/trans weren't implemented to
>  defer evaluation, so they perform their inferencing during the resolve
>  ().
>
>  So fixing this is a case of:
>
>  a) Updating walk and/or trans (might as well do both I suppose) to
>  defer evaluation - ie. move most of the functionality currently in
>  the WalkFunction class into a WalkTuples class and perform it in
>  response to a beforeFirst() call if it doesn't yet have everything
>  bound that needs to be bound to perform the inferencing.  Note it may
>  be worth doing the inferencing up front if you can as you get a more
>  accurate row-count that way and that has an impact on join performance.

Deferring evaluation is a big deal here. When writing resolvers you
really want to know if your variable is bound to a single value or
many, as you may need different methods for handling the two
(depending on the function of the resolver). And since the Walk
function was a copy/paste of Trans then updating one should lead to
updating the other.

<snip/>

I'd also like to point out that the "anchored" form of trans was a
concession to "backward chaining". Of course, this is considered very
important for efficiency, but it is less relevant for the trans
operation. Most transitive closure algorithms have polynomial
complexity, as each chain has linear complexity and you multiply this
by the number of chains (this is ignoring the complexity in the
searches!). However, the Mulgara trans operation has
polynomial-logarithmic complexity. The polynomial part comes from the
join operations (and is pretty good), while the log aspect is the
trans algorithm itself. This is the theoretical limit of calculating
transitive closure, so we are pretty proud of it.  :-)

"Anchoring" this operation has an expected complexity that is smaller,
but a worst case complexity that is identical to the non-anchored
case. The difference is that each time we join to a tuples, the tuples
we join to represents an [N to M] relationship, however in the
anchored case the very first tuples represents a [1 to M] relationship
(all the remaining are [N to M]). The main advantage is that less
memory gets used, however due to the log nature of the operation, the
number of steps doesn't change much.

This said, it really isn't too expensive to take the following query
(which isn't working yet):

select $x $label from <rmi://localhost/server1#model>
  where $x <rdf:type> <my:Data>
    and $x <my:importantValue> $y
    and trans($y <my:relatedTo> $node and $z <my:relatedTo> $node)
    and $node <rdf:label> $label;

and convert it to:

select $x $label from <rmi://localhost/server1#model>
  where $x <rdf:type> <my:Data>
    and $x <my:importantValue> $y
    and trans($y <my:relatedTo> $node)
    and $node <rdf:label> $label;

Yes, it uses more memory, but it works pretty well. Since you have
this option, it didn't occur to me to allow a variable in an anchor.
Anchoring was only requested to make it more efficient when moving
from a *specific* node across the graph. Had we wanted to allow a
variable in an anchor then it would have saved us nothing, since back
then we couldn't guarantee that the transitive constraint didn't get
run first, and if the anchored end was a currently-unbound variable
then we'd be back to an unanchored transitive constraint.

I believe that this is different for Walk, but remember, I didn't
write Walk - it was just copied from my code at a later date. Hence,
it's needs were different, but the design didn't take that into
account.

Regards,
Paul