[Mulgara-general] Jena-Mulgara connector

Thu Jan 3 15:39:00 UTC 2008

-------- Original Message --------
> From: Paul Gearon <>
> Date: 2 January 2008 22:04
>

> > This is not optimized.  Updates should be batched (todo).  While
> > SPARQL using ARQ works (client-side execution of the algebra) I do not
> > intend to optimize it.  SPARQL is better done server-side when the
> > native Mulgara implementation is ready.
>
> It's coming.  Honest!  Things just slow down around Christmas. :-)

:-) Happy New Year.

> Blank nodes in Mulgara are given external representations like _:nnnn,
> where the "nnnn" is an integer containing the gNode value.  Because of
> the re-use of gNode values, a delete/insert sequence may end up
> re-using the IDs.  So if you try to refer to a particular blank node by
> the ID it displays, it is possible in some circumstances to instead
> refer to something else entirely.  For this reason, we have never
> allowed anyone to refer to a blank node by its external representation.  I agree
> that it's annoying, and it's frustrated me on several occasions as
> well.

It's also "interesting" if you create a blank node labelled with some small integer and see what node in the server graph you have really named.  id=1 seems to be rdf:type.
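
The reuse hazard can be shown with a toy model. This is only a sketch in Python: the allocator class, its free list, and the node descriptions are hypothetical and are not Mulgara's code. It assumes nothing beyond what is described above, namely that freed gNode values are handed out again and that the external label is built from the gNode value.

```python
class GNodeAllocator:
    """Toy allocator that recycles freed IDs (hypothetical, not Mulgara code)."""

    def __init__(self):
        self.next_id = 1
        self.free_list = []

    def allocate(self):
        # Prefer a recycled ID; otherwise hand out a fresh one.
        if self.free_list:
            return self.free_list.pop()
        gnode = self.next_id
        self.next_id += 1
        return gnode

    def release(self, gnode):
        self.free_list.append(gnode)


def label(gnode):
    # External representation of the form _:nnnn.
    return f"_:{gnode}"


alloc = GNodeAllocator()
nodes = {}  # gNode -> description of what the node stands for

first = alloc.allocate()
nodes[first] = "address of Alice"
remembered = label(first)      # a client keeps the label

alloc.release(first)           # the node is deleted ...
del nodes[first]

second = alloc.allocate()      # ... and its ID is handed out again
nodes[second] = "address of Bob"

# The remembered label now resolves to a different node.
stale_target = nodes[int(remembered[2:])]
print(remembered, "now denotes:", stale_target)
```

The client did nothing wrong by its own lights; the label simply outlived the node it was taken from.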

>
> I always get around it by doing one of two things:
> - Perform add/deletes as insert...select and delete...select commands.
> - Wrap all my read/writes in a transaction.  That means I'm the only
>  writer, so I know the ID can't change.  Unfortunately, once we get
> multiple writers (some time in the future still) then this won't hold.
> Also, if I'm using TQL then I have to perform insert...select and
> delete...select operations still, as this is the only way to refer to
> the blank nodes.
>
> Of course, if I'm doing insert...select and delete...select then I need
> some guaranteed way of identifying the correct node.  IMO this is
> almost always doable if you have knowledge of the schema you are using,
> but there is no general mechanism.  The only case where you can't get a
> particular blank node is if it's indistinguishable from another... in
> which case it doesn't matter which one you get.  :-)

It's the lack of general mechanism that's the issue here.  In trying to write a general connector, as with writing any library code, the hope is that the application assumptions don't have to run all the way from top to bottom.

My canonical use case is an RDF (or OWL) editor.  These tend to regard the graph as a syntactic entity, especially when it is between consistent states.  And those RDF collections are always there to keep us on our toes. Viewing the graph syntactically is just the same as thinking of being inside the graph, not querying from the outside.

This use case isn't the primary reason for wanting the connector so, above all, documentation for correct use is needed.

>
> As for a general mechanism for getting to the blank nodes, you say:
>
> > At the moment, I have to skolemize Jena-created blank nodes and handle
> > Mulgara-created blank nodes via their label.  This is handled
> > transparently by the connector - so it all works out for Jena (and
> > even SPARQL isBlank()) but the skolemized nodes would be visible via
> > iTQL.
> > What am I missing?
>
> Unfortunately, nothing.  You've created the generalized mechanism that
> I said we were missing, but unfortunately you've had to do it at an RDF
> level, with the obvious issues of interaction with your RDF data.
>
> As I see it, we could address the problem in a few ways:
> 1) Make all new blank nodes increment the "long" that represents the
> next gNode to be allocated when there are none left to be re-used.  We
> would also need to update the _ph file code, as this adjustment would
> cause them to grow too quickly, but that file needed updating anyway.
> :-)

> 2) The StringPool is a map of gNodes to data (either URIs or
> literal data), and the data back to gNodes.  So we just create a new
> mapping type for blank nodes. This means creating a pseudo-URI entry,
> which maps a gNode onto a representation of the form _:nnnn, where the
> number nnnn is not the ID of the gNode.  This will have a performance
> impact (though possibly not too severe) and will require several
> changes to the string pool code.

> 3) Since the system is 64 bit, we stop
> re-using gNodes altogether, and permit an explicit mapping of a blank
> node to its gNode value with the _:nnnn syntax.  This is a little like
> option 1, but impacts more of the system.  I think this is the route to
> take, since it allows for faster writes in general (we abandon the free
> lists), but it will take some engineering to get it right.

Of these, (2) with a UUID for the label seems a good long-term solution.  The UUID can be held compactly (it still needs a lookup to get its ID though).  The big win is that it scales to a single graph distributed across several machines.  The performance impact will be only on large graphs with lots of bnodes.
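
A rough sketch of what (2) with UUID labels amounts to, in Python. The class and method names are hypothetical, and this is not how the Mulgara string pool is implemented; it only illustrates a two-way mapping in which the label is independent of the gNode value, so re-use of a gNode cannot make an old label resolve to a new node.

```python
import uuid


class BlankNodeLabels:
    """Two-way map between gNodes and UUID labels (hypothetical sketch)."""

    def __init__(self):
        self.label_of = {}   # gNode -> label
        self.gnode_of = {}   # label -> gNode

    def register(self, gnode):
        # A fresh label is minted for every new blank node, even when
        # the gNode value itself has been used before.
        lbl = f"_:{uuid.uuid4()}"
        self.label_of[gnode] = lbl
        self.gnode_of[lbl] = gnode
        return lbl

    def remove(self, gnode):
        lbl = self.label_of.pop(gnode)
        del self.gnode_of[lbl]

    def resolve(self, lbl):
        # Returns None for a label whose node has been deleted.
        return self.gnode_of.get(lbl)


labels = BlankNodeLabels()
old = labels.register(42)   # blank node stored at gNode 42
labels.remove(42)           # node deleted, gNode 42 goes back to the pool
new = labels.register(42)   # a different blank node re-uses gNode 42

print(labels.resolve(old))  # the old label no longer resolves
print(labels.resolve(new))  # the new label resolves to gNode 42
```

The extra lookup in resolve() is the cost mentioned above: every reference to a labelled bnode goes through the map rather than reading the gNode straight out of the label.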

> However, any option we choose is going to take some time before it gets
> into the distribution.  In the meantime you're stuck, sorry.  :-(

Thanks for the transaction warning.  The connector assumes the id is stable, which is false.

I see my choices as:

A/ "indirect skolemization", that is, add a label property to bnodes to find them again across transactions.  This is a slightly better way to do what the code does currently because it results in true Mulgara-bnodes visible to iTQL.

B/ Only handle bnodes perfectly in the local case and extend DatabaseSession to handle the mapping between the two bnode approaches.  I haven't followed this through - it might involve a lot of changes.

And better documentation of the connector about what can and can't be done.  I'll add access to the session transaction from a Jena application - then at least in a transaction, I can assume id stability.
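
To make choice A concrete, here is a minimal sketch in Python over a plain set of triples. It is not connector code: the property URI, the helper names, and the sample data are all made up for illustration. The idea is that the client never remembers the server's id for a bnode, only a label stored as an ordinary property, and looks the node up by that label each time.

```python
import uuid

# Hypothetical property URI used to tag blank nodes; not part of Jena or Mulgara.
LABEL_PROP = "urn:x-example:bnodeLabel"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def tag_bnode(graph, bnode):
    """Attach a fresh, stable label to a blank node and return the label."""
    lbl = str(uuid.uuid4())
    graph.add((bnode, LABEL_PROP, lbl))
    return lbl


def find_bnode(graph, lbl):
    """Look the blank node up again by its label; None if it is gone."""
    for s, p, o in graph:
        if p == LABEL_PROP and o == lbl:
            return s
    return None


graph = set()
graph.add(("_:7", FOAF_NAME, "Alice"))
alice = tag_bnode(graph, "_:7")

# In a later transaction, find the node by label, not by remembered id.
found_before = find_bnode(graph, alice)

# The node is deleted and the server re-uses its id for another bnode.
graph = {t for t in graph if t[0] != "_:7"}
graph.add(("_:7", FOAF_NAME, "Bob"))
bob = tag_bnode(graph, "_:7")

found_alice = find_bnode(graph, alice)   # Alice's node is gone
found_bob = find_bnode(graph, bob)       # Bob's node now has that id
```

The price is the one already noted: the label triples sit in the graph alongside the real data, so anything querying from outside the connector will see them.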

>
> > One issue I encountered: Session.modelExists() for
> > RemoteSessionWrapperSession throws an exception if the model has never
> > existed but returns false if the model used to exist but has been
> > dropped.
>
> This is a bug.  I thought I'd put the fix into the trunk, but I must
> have missed it.

I'll check the trunk - I have been using the 1.1.1 distribution where possible.

>
> > Thanks to Paul for pointing me in the right direction and tolerating
> > some newbie questions,
>
> Actually, I appreciate the insights from someone who understands the
> underlying semantics so well, as well as having addressed many similar
> issues in the past.

Please do not tell Pat Hayes about the skolemization of the blank nodes :-)

> Regards,
> Paul

        Thanks,
        Andy