[Mulgara-general] Jena-Mulgara connector

Wed Jan 2 22:04:00 UTC 2008

Hi Andy,

On Jan 2, 2008 2:19 PM, Seaborne, Andy <andy.seaborne at hp.com> wrote:
> I've had a go at building a connector from Jena to Mulgara.  The
> objective is to get the Jena APIs running over Mulgara-stored data using
> the Session interface, so it works locally and remotely.  It is based on
> Jena's GraphBase component so it only needs to implement add/delete/find
>   for the rest of the system to work.
>
> This is not optimized.  Updates should be batched (todo).  While SPARQL
> using ARQ works (client-side execution of the algebra) I do not intend
> to optimize it.  SPARQL is better done server-side when the native
> Mulgara implementation is ready.

It's coming.  Honest!  Things just slow down around Christmas. :-)

> Draft documentation:
>    http://jena.hpl.hp.com/wiki/JenaMulgara
>
> I didn't find a way to completely transparently handle blank nodes that
> originate from a Jena client.  I would appreciate some guidance in
> correct use of Mulgara for blank nodes - I can't see a way to create
> them from the client and then know what blank node has been created for
> use in further add/delete operations.  A query to find them again may
> not be able to uniquely identify the node just created.

This is an issue which has been mentioned a few times.  Unfortunately,
it's not an easy one to answer.

As you know, all literals, URIs and blank nodes are represented
internally with an internal 64 bit ID.  We call these Graph Nodes, or
gNodes.  If these are ever freed up, then we re-use them wherever
possible.  This was vitally important when we had a purely 32 bit
system (else we would run out of IDs), but it's not so important with
a 64 bit system.

Blank nodes in Mulgara are given external representations like _:nnnn,
where the "nnnn" is an integer containing the gNode value.  Because of
the re-use of gNode values, a delete/insert sequence may end up
re-using the IDs.  So if you try to refer to a particular blank node
by the ID it displays, it is possible in some circumstances to instead
refer to something else entirely.  For this reason, we have never
allowed anyone to refer to a blank node by its external representation.  I
agree that it's annoying, and it's frustrated me on several occasions
as well.
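
To make the hazard concrete, here is a toy model of ID re-use through a
free list.  It is only a sketch: ToyNodePool and its methods are invented
for illustration and are not Mulgara's allocator code.

```python
# Toy model of gNode allocation with a free list. NOT Mulgara's code;
# the class and method names are invented for illustration.
class ToyNodePool:
    def __init__(self):
        self.next_id = 1   # next never-used gNode
        self.free = []     # released gNodes waiting to be re-used
        self.values = {}   # gNode -> what it currently stands for

    def allocate(self, value):
        # Prefer a released ID; otherwise hand out a fresh one.
        if self.free:
            gnode = self.free.pop()
        else:
            gnode = self.next_id
            self.next_id += 1
        self.values[gnode] = value
        return gnode

    def release(self, gnode):
        del self.values[gnode]
        self.free.append(gnode)


pool = ToyNodePool()
blank = pool.allocate("a blank node")
label = f"_:{blank}"            # external form a client might remember

pool.release(blank)             # the blank node is deleted...
other = pool.allocate("http://example.org/something-else")  # ...and its ID re-used

stale = int(label[2:])          # client resolves the remembered label
print(label, "now refers to:", pool.values[stale])
```

The remembered label still resolves, but to a different node, which is
exactly why references by external label are refused.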

I always get around it by doing one of two things:
- Perform add/deletes as insert...select and delete...select commands.
- Wrap all my read/writes in a transaction.  That means I'm the only
writer, so I know the ID can't change.  Unfortunately, once we get
multiple writers (some time in the future still) then this won't hold.
Also, if I'm using TQL then I have to perform insert...select and
delete...select operations still, as this is the only way to refer to
the blank nodes.

Of course, if I'm doing insert...select and delete...select then I
need some guaranteed way of identifying the correct node.  IMO this is
almost always doable if you have knowledge of the schema you are
using, but there is no general mechanism.  The only case where you
can't get a particular blank node is if it's indistinguishable from
another... in which case it doesn't matter which one you get.  :-)
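
As a rough illustration of the idea (a toy in-memory model, not TQL and
not Mulgara code), the blank node is never named directly.  It is selected
by a property that the schema makes identifying, and the new statement is
attached to whatever the selection returns.  The Blank class, the
insert_select helper and the sample data are all made up for the example.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# Blank nodes are opaque objects, so a caller cannot refer to them by label.
class Blank:
    pass


def insert_select(triples, match_pred, match_obj, new_pred, new_obj):
    """For every subject having (match_pred, match_obj), add (new_pred, new_obj).

    Returns the number of subjects that matched.
    """
    subjects = {s for (s, p, o) in triples if p == match_pred and o == match_obj}
    triples |= {(s, new_pred, new_obj) for s in subjects}
    return len(subjects)


person = Blank()
triples = {(person, "foaf:mbox", "mailto:a@example.org")}

# The mailbox is assumed to identify the node uniquely under the schema.
touched = insert_select(
    triples, "foaf:mbox", "mailto:a@example.org", "foaf:name", "A. Example"
)
print(touched, "node(s) updated")
```

If the selection matches more than one node, those nodes were not
distinguishable by that pattern, and a more specific pattern is needed.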

As for a general mechanism for getting to the blank nodes, you say:

> At the moment, I have to skolemize Jena-created blank nodes and handle
> Mulgara-created blank nodes via their label.  This is handled
> transparently by the connector - so it all works out for Jena (and even
> SPARQL isBlank()) but the skolemized nodes would be visible via iTQL.
> What am I missing?

Unfortunately, nothing.  You've created the generalized mechanism that
I said we were missing, but unfortunately you've had to do it at an
RDF level, with the obvious issues of interaction with your RDF data.
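
For anyone else following the thread, skolemizing at the RDF level looks
roughly like the sketch below.  This is my own illustration, not the
connector's code: the URI prefix and the (kind, label) tuple representation
are assumptions.

```python
# Hypothetical prefix for skolem URIs; the connector's real scheme may differ.
SKOLEM_PREFIX = "urn:x-skolem:bnode:"


def to_store(node):
    """Translate a client-side term before sending it to the store.

    Terms are modelled as (kind, label) tuples, kind being "bnode" or "uri".
    """
    kind, label = node
    if kind == "bnode":
        return ("uri", SKOLEM_PREFIX + label)
    return node


def from_store(node):
    """Translate a term read from the store back to its client-side form."""
    kind, label = node
    if kind == "uri" and label.startswith(SKOLEM_PREFIX):
        return ("bnode", label[len(SKOLEM_PREFIX):])
    return node


client_node = ("bnode", "A17")
stored = to_store(client_node)
print(stored)               # what a query made outside the connector would see
print(from_store(stored))   # what the client sees again
```

The round trip is transparent to the client, but anything that queries the
store directly sees an ordinary URI rather than a blank node.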

As I see it, we could address the problem in a few ways:
1) Make all new blank nodes increment the "long" that represents the
next gNode to be allocated when there are none left to be re-used.  We
would also need to update the _ph file code, as this adjustment would
cause them to grow too quickly, but that file needed updating anyway.
:-)
2) The StringPool is a map of gNodes to data (either URIs or literal
data), and the data back to gNodes.  So we just create a new mapping
type for blank nodes. This means creating a pseudo-URI entry, which
maps a gNode onto a representation of the form _:nnnn, where the
number nnnn is not the ID of the gNode.  This will have a performance
impact (though possibly not too severe) and will require several
changes to the string pool code.
3) Since the system is 64 bit, we stop re-using gNodes altogether, and
permit an explicit mapping of a blank node to its gNode value with the
_:nnnn syntax.  This is a little like option 1, but impacts more of
the system.  I think this is the route to take, since it allows for
faster writes in general (we abandon the free lists), but it will take
some engineering to get it right.
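
To show what option 2 amounts to, here is a toy two-way mapping in which
the label counter is independent of the gNode value.  It is a sketch under
my own assumptions (the class name, method names and the use of in-memory
dicts are all invented), not a design for the string pool.

```python
import itertools


class ToyBlankNodeLabels:
    """Two-way map between gNodes and labels that are never re-issued."""

    def __init__(self):
        self._counter = itertools.count(1)  # label numbers only ever increase
        self.label_of = {}                  # gNode -> label
        self.gnode_of = {}                  # label -> gNode

    def register(self, gnode):
        label = f"_:{next(self._counter)}"
        self.label_of[gnode] = label
        self.gnode_of[label] = gnode
        return label

    def forget(self, gnode):
        label = self.label_of.pop(gnode)
        del self.gnode_of[label]

    def resolve(self, label):
        # None means the label no longer names any node.
        return self.gnode_of.get(label)


labels = ToyBlankNodeLabels()
first = labels.register(42)    # gNode 42 holds a blank node
labels.forget(42)              # that blank node is deleted
second = labels.register(42)   # gNode 42 is re-used for a new blank node

print(first, "->", labels.resolve(first))
print(second, "->", labels.resolve(second))
```

Even though the gNode is re-used, the old label resolves to nothing rather
than to the wrong node.  Option 3 gets the same guarantee without the extra
map, by never re-using the gNode in the first place.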

However, any option we choose is going to take some time before it
gets into the distribution.  In the meantime you're stuck, sorry.  :-(

> One issue I encountered: Session.modelExists() for
> RemoteSessionWrapperSession throws an exception if the model has never
> existed but returns false if the model used to exist but has been dropped.

This is a bug.  I thought I'd put the fix into the trunk, but I must
have missed it.

> Thanks to Paul for pointing me in the right direction and tolerating
> some newbie questions,

Actually, I appreciate the insights from someone who understands the
underlying semantics so well, as well as having addressed many similar
issues in the past.  Also, any problems you have are definitely going
to be problems for other people looking to use Mulgara, so it's
worthwhile noting what you're finding difficult just so we can fix or
document it.

BTW, I haven't forgotten your last email, but I haven't had time to
get to it yet.

Regards,
Paul