[Mulgara-dev] Creating anonymous nodes

Paul Gearon gearon at ieee.org
Wed May 28 19:54:54 UTC 2008


With Alex's consent I've moved this discussion into the open......

On May 28, 2008, at 12:18 PM, Alex Hall wrote:

> Hi Paul,
>
> Paul Gearon wrote:
>> Hi Alex,
>> On Tue, May 27, 2008 at 7:18 PM, Alex Hall <alexhall at revelytix.com>  
>> wrote:
>>> There was another issue that I wanted to talk to you about earlier  
>>> today but
>>> didn't have time to.  In addition to being able to insert  
>>> statements using
>>> an existing blank node ID, we need a way to allocate a new blank  
>>> node and
>>> then insert statements about it into the triple store.
>>>
>>> If the insertions were localized in a single place, it would be as  
>>> simple as
>>> saying:
>>>
>>>  insert <ex:Bob> <ex:address> $x
>>>         $x <ex:street> "123 Main Street"
>>>         $x <ex:city> "Baltimore"
>>>  into <rmi://localhost/server1#graph>
>> So what's wrong with this?  I do it all the time.  Your wording makes
>> it appear that you don't think this will work.  Is it causing  
>> problems
>> for you?
>
> There's nothing wrong with that, but it doesn't do everything that I  
> would like.  Namely, I would like to get my hands on the JRDF  
> BlankNode object for the newly allocated node, so I can use it  
> directly in subsequent insertions/queries etc.  Keep in mind that  
> I'm building my commands directly via the API so using a BlankNode  
> is valid, even though there's no syntax for doing that in TQL.

Ah, so you're trying to mix APIs. That way there be dragons.....

Semantically, I just can't think of a valid way to do this, given the  
way things stand. It might be possible to modify the TQL API to return  
a binding of variable names to blank nodes after an insert, but that  
would take more time than I have.

Others have said that if they do a query and get back a blank node of,  
say, _:1234, then it would be great to refer to this syntactically,  
almost as if it were a URI. This is fraught with semantic issues, but  
is certainly doable. That means that if you want all the people living  
at Bob's address you could say something like:

  insert <ex:Bob> <ex:address> $x
         $x <ex:street> "123 Main Street"
         $x <ex:city> "Baltimore"
  into <rmi://localhost/server1#graph>;

select $x
from <rmi://localhost/server1#graph>
where <ex:Bob> <ex:address> $x
         $x <ex:street> "123 Main Street"
         $x <ex:city> "Baltimore";

result:  $x => _:1234

select $person
from <rmi://localhost/server1#graph>
where $person <ex:address> _:1234;

Since our blank node IDs will never change, it's tempting to do this,  
but I'm still not comfortable with it. For a start, it will be time  
consuming (i.e. an extra disk read) to work out whether a label like  
_:1234 actually refers to a URI and shouldn't be represented as a  
blank node. OTOH, blank nodes are SUPPOSED to be able to represent  
anything else (including URIs and Literals) so maybe that's OK.  :-)

>>> In this case, the $x in each statement gets mapped to the same  
>>> internal
>>> blank node.  The problem for me in this case is that I have no  
>>> idea *what*
>>> that internal node ID is, so I can't remember it in order to  
>>> insert any
>>> subsequent statements using that same ID.
>> OK, but here is where you can add the ID:
>>   insert <ex:Bob> <ex:address> $x
>>          $x <ex:street> "123 Main Street"
>>          $x <ex:city> "Baltimore"
>>          $x <ex:hasId> [GUID]
>>   into <rmi://localhost/server1#graph>
>
> Right, but I still have to do another query to get the BlankNode  
> object, or else use insert/select constructs anytime I want to add  
> another statement about the newly allocated node.  Granted, I'm  
> starting to see that this sort of thing will probably be  
> unavoidable, but I was hoping that wouldn't be the case.

If it's a blank node then you really SHOULD be doing insert/selects.  
That's a Good Thing.

Mind you, if we allowed the example I gave above, then I guess it  
would be possible to say:

insert _:1234 <ex:zip> "10101"
into <rmi://localhost/server1#graph>;

None of this would be hard to do. It just bugs me.  :-)  I think  
you'll find Andrae is against it.

OTOH, he has a better handle on how the language maps to the  
semantics, so maybe he thinks it's all OK.

>>> I think that one of the main difficulties for me is that Mulgara  
>>> is just a
>>> triple-store -- albeit a darn good one -- and not much more.
>> Yes, and of course, that is by design. Indeed, if you look at SPARQL
>> you'll see that this is expected.
>>> The software
>>> we write is intended for viewing and editing concepts and  
>>> individuals in an
>>> ontology; we want a model-based view of the contents of the ontology
>>> constructed from the raw RDF triples in the store. So the first  
>>> thing I did
>>> was write a modeling layer, backed by the Connection API, to  
>>> represent
>>> ontology concepts as Java objects. Naturally, the object  
>>> representing an
>>> individual needs to remember the RDF resource that it represents.
>> Modeling like this is a common requirement.  :-)
>> I have wanted to put a modeling layer into Mulgara for some time now,
>> but of course I have not had the resources to do one. However, this
>> could only ever be an option, as a lot of people have very different
>> ideas about how to model things.
>
> Of course.  For our particular software, it is already built on top  
> of Jena.  I would *love* to rip it all up and start fresh, but  
> obviously don't have the time to do this for a system that is  
> already in production.  So instead I took a Jena-like modeling  
> approach.

Fair enough, though I don't know how Jena does this. (Sorry Andy!)

>> I built such a layer for myself over a year ago, and it worked nicely
>> - for certain situations.
>> Many people want to embed the modeling in the language they are  
>> using.
>> Of course, this language is often Java. The problem with this  
>> approach
>> is that Java is statically typed, meaning that it cannot represent
>> arbitrary RDF. Some projects work within this restriction, explicitly
>> building RDF from their classes, usually using annotations to  
>> describe
>> the URIs they want for types and predicates. This permits a
>> "Hibernate" like system, though it has the advantage of being able to
>> store any structure without needing to modify the database schema.
>> However, it cannot deal with arbitrary RDF, and usually ignores
>> unknown fields, etc. Topaz have built exactly this system, and it
>> works very well. They even have a query language for their objects. I
>> highly recommend looking at this, as I know you have better things to
>> do with your time than building a library like this! (fun though it
>> may be)
>
> I'm not interested in embedding the modeling for arbitrary  
> ontologies in Java, fun though it may be :-)  The ontologies are  
> built collaboratively and incrementally by the users of our  
> software; the contents of those ontologies have no meaning to us.   
> So an OTM library such as Topaz isn't exactly what we're looking  
> for.  We want to describe properties of classes, properties, and  
> individuals in the ontologies; in other words, the modeling that  
> I've embedded into Java describes the structure of RDF (most of this  
> is provided by JRDF), RDFS, and OWL concepts.  So the code will look  
> something like:
>
> Resource r = model.getResource("Person");
> if (r.isClass()) {
>  for (Node n : r.asClass().getSubClasses()) {
>    Resource rr = model.getResource(n);
>    ...
>  }
> }
>
> If you've looked at code that uses the Jena libraries, then this  
> snippet will look very familiar to you.  The fact that Java is  
> statically typed doesn't really bother me here, because I'm not  
> trying to model arbitrary RDF.

Actually, I haven't looked at Jena, but it looks familiar  
anyway.  :-)  It seems that there are obvious ways to approach these  
problems.

>> To embed like this you really need a dynamic language like Groovy or
>> Ruby, as these languages let you redefine classes at runtime. You
>> *can* redefine classes at runtime in Java if you are prepared to use
>> class name mangling, a custom class loader, and a bytecode library -
>> but this is crazy, especially when you realize that you can't  
>> actually
>> talk to the resulting object, as you don't have an interface for  
>> it...
>> you'd have to use reflection, which defeats the point of trying to
>> embed the code in the language!
>
> I think Elmo addresses this by allowing the programmer to annotate  
> bean-like interfaces, and dynamically constructing the bytecode for  
> a class that implements all the appropriate interfaces for an object  
> depending on its declared RDF types.  So it simulates dynamic typing  
> in Java while still giving clients an interface to talk to the  
> objects... pretty cool stuff :-)

I learnt a lot about Elmo last week at SemTech and was REALLY  
impressed. I didn't hear about this feature though (assuming you have  
it right). It sounds very nice.

>> After looking at this, the approach I took was to NOT embed the
>> objects in my language (Java). Instead I took the Perl approach,
>> whereby every structure is really a hash map where properties are
>> keys, and objects are the values. This actually works surprisingly
>> well (extending easily into collections and class references), though
>> you end up with a lot of lines like:
>>  Instance address = model.newInstance("Address");
>>  address.put("street", "123 Main Street");
>>  address.put("city", "Baltimore");
>>  Instance bob = model.newInstance("Person");
>>  bob.put("name", "Bob");
>>  bob.put("address", address);
>
> This is roughly the same approach that I used for accessing  
> arbitrary properties on an instance.

Again, I find that the approaches to these common problems are  
self-evident.

> [snip]
>>> In most instances, this modeling layer works beautifully.  The  
>>> only place it
>>> really breaks down is when I want to allocate a blank node for an  
>>> anonymous
>>> instance in an ontology, encapsulate it in a Java object, and then  
>>> hand it
>>> off to other parts of the code that may want to query about it or  
>>> add
>>> properties to it.  In this case, it seems that I'm reduced to what  
>>> we talked
>>> about before -- generating a GUID and then saying something like:
>> In my case, I declared a field for my object to be the key for the
>> object. This was a compulsory field. If it happened to be the "name"
>> property then that was great. Otherwise it would be a generated  
>> value.
>> Unfortunately, I didn't allow for compound keys. After declaring  
>> this,
>> the queries I used in my library were of the form:
>> select....
>> from.....
>> where $subject $key "keyvalue"
>>    and $key <rdf:type> <owl:InverseFunctionalProperty>
>>    and .........

Now that I've moved this into a public forum I will state for the  
record that I am aware that this is OWL Full, but without a reasoner  
it's not hurting anything, and it describes exactly what I wanted with  
a well-publicized (though little-used) standard. My reading of OWL  
Full is that this is exactly what it is good for: clearly describing  
something, without necessarily throwing a reasoner at it.

Incidentally, I didn't realize this was OWL Full when I did it (August  
2006).  :-)
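The generated-key pattern being discussed might look something like
the following: mint a key value, attach it to the new anonymous node
at insert time, and use it to find the node again later. The
<ex:hasId> predicate is purely illustrative, not part of any standard
or Mulgara vocabulary:

```java
import java.util.UUID;

public class KeyDemo {
    // Build the TQL that inserts a new anonymous address carrying a
    // generated key value.
    static String insertFor(String key) {
        return "insert <ex:Bob> <ex:address> $x\n"
             + "       $x <ex:hasId> '" + key + "'\n"
             + "into <rmi://localhost/server1#graph>;";
    }

    // Build the TQL that finds the same node again by its key,
    // independent of whatever blank node ID the store assigned.
    static String selectFor(String key) {
        return "select $x from <rmi://localhost/server1#graph>\n"
             + "where $x <ex:hasId> '" + key + "';";
    }

    public static void main(String[] args) {
        String key = UUID.randomUUID().toString();
        System.out.println(insertFor(key));
        System.out.println(selectFor(key));
    }
}
```

Because the key travels with the data, the reference survives a graph
being moved between servers, unlike a store-assigned blank node ID.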

> Yeah, it looks like I'll need something along those lines.  Since I  
> don't have knowledge of any of the properties for an arbitrary blank  
> node, I'll always have to generate the key, and I should be able to  
> use a predefined key property instead of looking for an  
> owl:InverseFunctionalProperty.  I think it will be more  
> theoretically sound to always refer to a blank node using a  
> generated key value as opposed to relying on an implementation  
> detail of the underlying database.  This will allow external  
> references to anonymous nodes to work even when the graph is moved  
> from one server to another, which is important for doing things like  
> maintaining audit trails (which is another driving requirement for  
> me).

This is in the back of my mind when I talk about these things. This  
includes the blank node references I was talking about at the top.

> This doesn't account for imported RDF which will not have unique  
> keys for blank nodes, but it shouldn't be too hard to do some  
> post-processing to add them.

A lot of data is going to have SOME kind of unique key associated with  
it. When it doesn't, then it may be time consuming to process it, but  
there aren't a lot of options (unless I start adding generated outputs  
into the heads of rules).
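The post-processing Alex mentions could be as simple as one pass over
the imported statements, minting a key triple for each distinct blank
subject. This is a toy sketch over string triples, again using the
hypothetical <ex:hasId> predicate:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class PostProcess {
    // Each triple is a String[3]; subjects starting with "_:" stand in
    // for blank nodes. For every distinct blank subject, append one
    // key triple with a freshly generated value.
    static List<String[]> addKeys(List<String[]> triples) {
        Map<String, String> keys = new HashMap<>();
        List<String[]> out = new ArrayList<>(triples);
        for (String[] t : triples) {
            String s = t[0];
            if (s.startsWith("_:") && !keys.containsKey(s)) {
                String key = UUID.randomUUID().toString();
                keys.put(s, key);
                out.add(new String[] { s, "<ex:hasId>", "'" + key + "'" });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> data = new ArrayList<>();
        data.add(new String[] { "_:a", "<ex:street>", "'123 Main Street'" });
        data.add(new String[] { "_:a", "<ex:city>", "'Baltimore'" });
        data.add(new String[] { "<ex:Bob>", "<ex:address>", "_:a" });
        // 3 original triples + 1 key triple for _:a
        System.out.println(addKeys(data).size());
    }
}
```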

> [snip]
>>> I don't have a fundamental issue with the requirement that  
>>> anonymous nodes
>>> be scoped to the containing RDF document.  But in practice, people  
>>> want RDF
>>> graphs to be dynamic entities: facts are added to them as they  
>>> become
>>> available.  In order to fully support this, a graph's anonymous  
>>> node IDs need to
>>> be referenced externally by the tool that is doing the modifying;  
>>> in other
>>> words, round-tripping of blank nodes needs to be supported.  This  
>>> approach,
>>> while more flexible, relies on the external application to respect  
>>> the
>>> scoping of an anonymous node to its original graph -- for  
>>> instance, not to
>>> insert an anonymous node found in graph A into graph B.
>> While it might be tempting to start linking identifiers to specific
>> nodes, I would highly recommend against doing this. Version 2.0 of
>> SPARQL will be started soon, and insertions will almost certainly be
>> a part of that spec. You will NOT want to be incompatible with
>> everyone else on this, especially when Mulgara will end up having to
>> conform to it.
>
> I don't really understand what you're saying here.  Could you  
> elaborate a little?  I'm not talking about parsing blank node labels  
> in TQL and using those identifiers to link to a specific node.  What  
> I would prefer to see is the JRDF BlankNode object that comes back  
> from an Answer retain enough information to link back to the actual  
> node that it represents, and to have that object remain valid across  
> transactions.

OK, if you're just talking about APIs then my comment is irrelevant.  
For some reason I got it into my head that you were talking about  
round-tripping using a query interface rather than a programmatic one.

Incidentally, you should have a look at the SAIL API if you can. Since  
we don't want to duplicate functionality between interfaces, JRDF may  
end up being deprecated.

>>> I guess that ideally, a triple-store such as Mulgara would support
>>> round-tripping of blank nodes, while maintaining an association  
>>> between a
>>> blank node and its graph in order to prevent it from being  
>>> promoted to a
>>> larger scope.  If a client needed to describe properties about a new
>>> anonymous resource, they could request that a new blank node be  
>>> allocated in
>>> the correct graph and then use the resulting blank node ID in  
>>> inserted
>>> statements.  Then again, I understand the possible storage and  
>>> performance
>>> issues that might arise from recording the provenance of all blank  
>>> nodes in
>>> the system.
>> This is possibly OK, as I think it's the approach that many people
>> want to take on this (including Jena). It seems intuitive enough to
>> me, and it's certainly safe in both XA and XA2 (though it wasn't
>> supposed to be safe in XA).
>
> I guess my main point here is that, while it works in XA it wasn't  
> intended to be that way.  I'm very uncomfortable relying on an  
> unintended consequence of a particular implementation detail of the  
> underlying database in order to implement an important piece of  
> functionality in my own code.  This is why I'm leaning towards  
> allocating unique key values for all of my blank nodes.  Moving  
> forward with Mulgara, I think we have to recognize that  
> round-tripping of blank nodes is already supported to some degree, albeit  
> unintentionally. Given that, I think it's important to decide to  
> what extent we will support round-tripping of blank nodes and  
> explicitly enforce any constraints that we decide on.  I'm perfectly  
> happy to participate in that conversation, but once again it comes  
> down to not having the time to really devote to a comprehensive  
> solution :-(

Well XA1 won't be having a lot of work done on it, except for  
maintenance. All our efforts will be going to XA2, and in that system  
we will NEVER be recycling blank node IDs - this time by design. So  
while it may have been unintentional, you can rest assured that it  
will be guaranteed (unless YOU have the time to find a subtle and rare  
race condition in code that is about to be deprecated?)  :-)

Regards,
Paul Gearon


