[Mulgara-dev] Creating anonymous nodes
Paul Gearon
gearon at ieee.org
Wed May 28 19:54:54 UTC 2008
With Alex's consent I've moved this discussion into the open...
On May 28, 2008, at 12:18 PM, Alex Hall wrote:
> Hi Paul,
>
> Paul Gearon wrote:
>> Hi Alex,
>> On Tue, May 27, 2008 at 7:18 PM, Alex Hall <alexhall at revelytix.com>
>> wrote:
>>> There was another issue that I wanted to talk to you about earlier
>>> today but
>>> didn't have time to. In addition to being able to insert
>>> statements using
>>> an existing blank node ID, we need a way to allocate a new blank
>>> node and
>>> then insert statements about it into the triple store.
>>>
>>> If the insertions were localized in a single place, it would be as
>>> simple as
>>> saying:
>>>
>>> insert <ex:Bob> <ex:address> $x
>>> $x <ex:street> "123 Main Street"
>>> $x <ex:city> "Baltimore"
>>> into <rmi://localhost/server1#graph>
>> So what's wrong with this? I do it all the time. Your wording makes
>> it appear that you don't think this will work. Is it causing
>> problems
>> for you?
>
> There's nothing wrong with that, but it doesn't do everything that I
> would like. Namely, I would like to get my hands on the JRDF
> BlankNode object for the newly allocated node, so I can use it
> directly in subsequent insertions/queries etc. Keep in mind that
> I'm building my commands directly via the API so using a BlankNode
> is valid, even though there's no syntax for doing that in TQL.
Ah, so you're trying to mix APIs. That way there be dragons...
Semantically, I just can't think of a valid way to do this, given the
way things stand. It might be possible to modify the TQL API to return
a binding of variable names to blank nodes after an insert, but that
would take more time than I have.
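For what it's worth, the shape of such an extension might look like the sketch below. Every class and method name here is invented for illustration; nothing like it exists in the current API, and the toy allocator just counts from 1234 where a real store would use its internal node pool:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an insert result that reports which blank node
// was allocated for each unbound variable in the inserted statements.
// None of these names exist in the real TQL API.
public class InsertBindings {

    /** Result of a hypothetical insert: variable name -> allocated blank node label. */
    static class InsertResult {
        private final Map<String, String> bindings = new HashMap<>();

        void bind(String variable, String blankNodeLabel) {
            bindings.put(variable, blankNodeLabel);
        }

        /** Look up the blank node allocated for a variable, e.g. "x" -> "_:1234". */
        String blankNodeFor(String variable) {
            return bindings.get(variable);
        }
    }

    /** Toy stand-in for the server side: allocates a fresh node per variable. */
    static InsertResult insert(String... variables) {
        InsertResult result = new InsertResult();
        long nextId = 1234;  // a real store would draw from its node pool
        for (String v : variables) {
            result.bind(v, "_:" + nextId++);
        }
        return result;
    }

    public static void main(String[] args) {
        InsertResult result = insert("x");
        // The caller can hold onto the allocated label for later statements.
        System.out.println("x => " + result.blankNodeFor("x"));
    }
}
```

A caller could then keep the returned labels for subsequent inserts instead of re-querying for the node.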
Others have said that if they do a query and get back a blank node of,
say, _:1234 then it would be great to refer to this syntactically,
almost as if it were a URI. This is fraught with semantic issues, but
is certainly doable. That means that if you want all the people living
at Bob's address you could say something like:
insert <ex:Bob> <ex:address> $x
$x <ex:street> "123 Main Street"
$x <ex:city> "Baltimore"
into <rmi://localhost/server1#graph>;
select $x
from <rmi://localhost/server1#graph>
where <ex:Bob> <ex:address> $x
$x <ex:street> "123 Main Street"
$x <ex:city> "Baltimore";
result: $x => _:1234
select $person
from <rmi://localhost/server1#graph>
where $person <ex:address> _:1234;
Since our blank nodes will never change, it's tempting to do this,
but I'm still not comfortable with it. For a start, it will be time
consuming (ie. an extra disk read) to work out whether _:1234 actually
refers to a URI and shouldn't be represented as a blank node. OTOH,
blank nodes are SUPPOSED to be able to represent anything else
(including URIs and Literals), so maybe that's OK. :-)
>>> In this case, the $x in each statement gets mapped to the same
>>> internal
>>> blank node. The problem for me in this case is that I have no
>>> idea *what*
>>> that internal node ID is, so I can't remember it in order to
>>> insert any
>>> subsequent statements using that same ID.
>> OK, but here is where you can add the ID:
>> insert <ex:Bob> <ex:address> $x
>> $x <ex:street> "123 Main Street"
>> $x <ex:city> "Baltimore"
>> $x <ex:hasId> [GUID]
>> into <rmi://localhost/server1#graph>
>
> Right, but I still have to do another query to get the BlankNode
> object, or else use insert/select constructs anytime I want to add
> another statement about the newly allocated node. Granted, I'm
> starting to see that this sort of thing will probably be
> unavoidable, but I was hoping that wouldn't be the case.
If it's a blank node then you really SHOULD be doing insert/selects.
That's a Good Thing.
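For example (writing this from memory, so check the syntax before trusting it), adding a zip code to Bob's address node without ever naming the blank node could use insert/select with the <mulgara:is> magic predicate to bind the constant values:

```
insert select $x $p $zip
  from <rmi://localhost/server1#graph>
  where <ex:Bob> <ex:address> $x
  and $p <mulgara:is> <ex:zip>
  and $zip <mulgara:is> '10101'
  into <rmi://localhost/server1#graph>;
```

The point is that the node is always re-identified by its properties, never by an internal ID.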
Mind you, if we allowed the example I gave above, then I guess it
would be possible to say:
insert _:1234 <ex:zip> "10101"
into <rmi://localhost/server1#graph>;
None of this would be hard to do. It just bugs me. :-) I think
you'll find Andrae is against it.
OTOH, he has a better handle on how the language maps to the
semantics, so maybe he thinks it's all OK.
>>> I think that one of the main difficulties for me is that Mulgara
>>> is just a
>>> triple-store -- albeit a darn good one -- and not much more.
>> Yes, and of course, that is by design. Indeed, if you look at SPARQL
>> you'll see that this is expected.
>>> The software
>>> we write is intended for viewing and editing concepts and
>>> individuals in an
>>> ontology; we want a model-based view of the contents of the ontology
>>> constructed from the raw RDF triples in the store. So the first
>>> thing I did
>>> was write a modeling layer, backed by the Connection API, to
>>> represent
>>> ontology concepts as Java objects. Naturally, the object
>>> representing an
>>> individual needs to remember the RDF resource that it represents.
>> Modeling like this is a common requirement. :-)
>> I have wanted to put a modeling layer into Mulgara for some time now,
>> but of course I have not had the resources to do one. However, this
>> could only ever be an option, as a lot of people have very different
>> ideas about how to model things.
>
> Of course. For our particular software, it is already built on top
> of Jena. I would *love* to rip it all up and start fresh, but
> obviously don't have the time to do this for a system that is
> already in production. So instead I took a Jena-like modeling
> approach.
Fair enough, though I don't know how Jena does this. (Sorry Andy!)
>> I built such a layer for myself over a year ago, and it worked nicely
>> - for certain situations.
>> Many people want to embed the modeling in the language they are
>> using.
>> Of course, this language is often Java. The problem with this
>> approach
>> is that Java is statically typed, meaning that it cannot represent
>> arbitrary RDF. Some projects work within this restriction, explicitly
>> building RDF from their classes, usually using annotations to
>> describe
>> the URIs they want for types and predicates. This permits a
>> "Hibernate"-like system, though it has the advantage of being able to
>> store any structure without needing to modify the database schema.
>> However, it cannot deal with arbitrary RDF, and usually ignores
>> unknown fields, etc. Topaz have built exactly this system, and it
>> works very well. They even have a query language for their objects. I
>> highly recommend looking at this, as I know you have better things to
>> do with your time than building a library like this! (fun though it
>> may be)
>
> I'm not interested in embedding the modeling for arbitrary
> ontologies in Java, fun though it may be :-) The ontologies are
> built collaboratively and incrementally by the users of our
> software; the contents of those ontologies have no meaning to us.
> So an OTM library such as Topaz isn't exactly what we're looking
> for. We want to describe properties of classes, properties, and
> individuals in the ontologies; in other words, the modeling that
> I've embedded into Java describes the structure of RDF (most of this
> is provided by JRDF), RDFS, and OWL concepts. So the code will look
> something like:
>
> Resource r = model.getResource("Person");
> if (r.isClass()) {
>     for (Node n : r.asClass().getSubClasses()) {
>         Resource rr = model.getResource(n);
>         ...
>     }
> }
>
> If you've looked at code that uses the Jena libraries, then this
> snippet will look very familiar to you. The fact that Java is
> statically typed doesn't really bother me here, because I'm not
> trying to model arbitrary RDF.
Actually, I haven't looked at Jena, but it looks familiar
anyway. :-) It seems that there are obvious ways to approach these
problems.
>> To embed like this you really need a dynamic language like Groovy or
>> Ruby, as these languages let you redefine classes at runtime. You
>> *can* redefine classes at runtime in Java if you are prepared to use
>> class name mangling, a custom class loader, and a bytecode library -
>> but this is crazy, especially when you realize that you can't
>> actually
>> talk to the resulting object, as you don't have an interface for
>> it...
>> you'd have to use reflection, which defeats the point of trying to
>> embed the code in the language!
>
> I think Elmo addresses this by allowing the programmer to annotate
> bean-like interfaces, and dynamically constructing the bytecode for
> a class that implements all the appropriate interfaces for an object
> depending on its declared RDF types. So it simulates dynamic typing
> in Java while still giving clients an interface to talk to the
> objects... pretty cool stuff :-)
I learnt a lot about Elmo last week at SemTech and was REALLY
impressed. I didn't hear about this feature though (assuming you have
it right). It sounds very nice.
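For anyone unfamiliar with the trick, the general idea can be faked with plain JDK dynamic proxies. Elmo itself generates real bytecode, so treat this as a loose analogy rather than how Elmo works; all the names here are made up:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Loose analogy to the Elmo-style technique using JDK dynamic proxies.
// The client codes against a typed interface, while the implementation
// is conjured at runtime from a property map standing in for the RDF
// statements about a resource.
public class DynamicBean {

    // A bean-like interface the client programs against.
    interface Person {
        String getName();
    }

    @SuppressWarnings("unchecked")
    static <T> T asBean(Class<T> iface, Map<String, Object> props) {
        InvocationHandler handler = (Object proxy, Method method, Object[] args) -> {
            String name = method.getName();
            // Map getFoo() onto the property "foo".
            if (name.startsWith("get") && name.length() > 3) {
                String key = Character.toLowerCase(name.charAt(3)) + name.substring(4);
                return props.get(key);
            }
            throw new UnsupportedOperationException(name);
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                                          new Class<?>[] { iface }, handler);
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put("name", "Bob");
        Person bob = asBean(Person.class, props);
        System.out.println(bob.getName());  // prints "Bob"
    }
}
```

Real bytecode generation (as Elmo does) avoids the reflective dispatch cost, but the client-facing result is the same: a typed interface over dynamic data.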
>> After looking at this, the approach I took was to NOT embed the
>> objects in my language (Java). Instead I took the Perl approach,
>> whereby every structure is really a hash map where properties are
>> keys, and objects are the values. This actually works surprisingly
>> well (extending easily into collections and class references), though
>> you end up with a lot of lines like:
>> Instance address = model.newInstance("Address");
>> address.put("street", "123 Main Street");
>> address.put("city", "Baltimore");
>> Instance bob = model.newInstance("Person");
>> bob.put("name", "Bob");
>> bob.put("address", address);
>
> This is roughly the same approach that I used for accessing
> arbitrary properties on an instance.
Again, I find that the approaches to these common problems are self-
evident.
> [snip]
>>> In most instances, this modeling layer works beautifully. The
>>> only place it
>>> really breaks down is when I want to allocate a blank node for an
>>> anonymous
>>> instance in an ontology, encapsulate it in a Java object, and then
>>> hand it
>>> off to other parts of the code that may want to query about it or
>>> add
>>> properties to it. In this case, it seems that I'm reduced to what
>>> we talked
>>> about before -- generating a GUID and then saying something like:
>> In my case, I declared a field for my object to be the key for the
>> object. This was a compulsory field. If it happened to be the "name"
>> property then that was great. Otherwise it would be a generated
>> value.
>> Unfortunately, I didn't allow for compound keys. After declaring
>> this,
>> the queries I used in my library were of the form:
>> select....
>> from.....
>> where $subject $key "keyvalue"
>> and $key <rdf:type> <owl:InverseFunctionalPredicate>
>> and .........
Now that I've moved this into a public forum I will state for the
record that I am aware that this is OWL Full, but without a reasoner
it's not hurting anything, and it describes exactly what I wanted with
a well-publicized (though little-used) standard. My reading of OWL
Full is that this is exactly what it is good for: clearly describing
something, without necessarily throwing a reasoner at it.
Incidentally, I didn't realize this was OWL Full when I did it (August
2006). :-)
> Yeah, it looks like I'll need something along those lines. Since I
> don't have knowledge of any of the properties for an arbitrary blank
> node, I'll always have to generate the key, and I should be able to
> use a predefined key property instead of looking for an
> owl:InverseFunctionalPredicate. I think it will be more
> theoretically sound to always refer to a blank node using a
> generated key value as opposed to relying on an implementation
> detail of the underlying database. This will allow external
> references to anonymous nodes to work even when the graph is moved
> from one server to another, which is important for doing things like
> maintaining audit trails (which is another driving requirement for
> me).
This is in the back of my mind when I talk about these things. This
includes the blank node references I was talking about at the top.
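The generated-key approach can be sketched very simply: allocate a UUID once when the anonymous node is created, assert it via a key predicate (reusing the illustrative <ex:hasId> from above), and use that key in all later queries rather than the store's internal blank node ID. The TQL string-building here is only for illustration:

```java
import java.util.UUID;

// Sketch of the generated-key approach. The predicate <ex:hasId> is
// illustrative, not a real vocabulary term.
public class BlankNodeKey {

    /** Allocate a globally unique key for a new anonymous node. */
    static String newKey() {
        return UUID.randomUUID().toString();
    }

    /** Build the TQL that asserts the key alongside the node's properties. */
    static String insertCommand(String key, String graph) {
        // Repeated uses of $x in one insert map to the same blank node.
        return "insert $x <ex:hasId> '" + key + "'\n"
             + "  $x <ex:street> '123 Main Street'\n"
             + "  $x <ex:city> 'Baltimore'\n"
             + "into <" + graph + ">;";
    }

    public static void main(String[] args) {
        String key = newKey();
        // Later code can find the same node again by its key, regardless of
        // which server the graph lives on or what internal ID it was given.
        System.out.println(insertCommand(key, "rmi://localhost/server1#graph"));
    }
}
```

The key survives moving the graph between servers, which is exactly the audit-trail property Alex is after.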
> This doesn't account for imported RDF, which will not have unique
> keys for blank nodes, but it shouldn't be too hard to do some
> post-processing to add them.
A lot of data is going to have SOME kind of unique key associated with
it. When it doesn't, then it may be time consuming to process it, but
there aren't a lot of options (unless I start adding generated
outputs into the heads of rules).
> [snip]
>>> I don't have a fundamental issue with the requirement that
>>> anonymous nodes
>>> be scoped to the containing RDF document. But in practice, people
>>> want RDF
>>> graphs to be dynamic entities: facts are added to them as they
>>> become
>>> available. In order to fully support this, anonymous node IDs need
>>> to be referenced externally by the tool that is doing the modifying;
>>> in other
>>> words, round-tripping of blank nodes needs to be supported. This
>>> approach,
>>> while more flexible, relies on the external application to respect
>>> the
>>> scoping of an anonymous node to its original graph -- for
>>> instance, not to
>>> insert an anonymous node found in graph A into graph B.
>> While it might be tempting to start linking identifiers to specific
>> nodes, I would highly recommend against doing this. Version 2.0 of
>> SPARQL will be started soon, and insertions will almost certainly be
>> a part of that spec. You will NOT want to be incompatible with
>> everyone else on this, especially when Mulgara will end up having to
>> conform to it.
>
> I don't really understand what you're saying here. Could you
> elaborate a little? I'm not talking about parsing blank node labels
> in TQL and using those identifiers to link to a specific node. What
> I would prefer to see is the JRDF BlankNode object that comes back
> from an Answer retain enough information to link back to the actual
> node that it represents, and to have that object remain valid across
> transactions.
OK, if you're just talking about APIs then my comment is irrelevant.
For some reason I got it into my head that you were talking about
round-tripping using a query interface rather than a programmatic one.
Incidentally, you should have a look at the SAIL API if you can. Since
we don't want to duplicate functionality between interfaces, JRDF may
end up being deprecated.
>>> I guess that ideally, a triple-store such as Mulgara would support
>>> round-tripping of blank nodes, while maintaining an association
>>> between a
>>> blank node and its graph in order to prevent it from being
>>> promoted to a
>>> larger scope. If a client needed to describe properties about a new
>>> anonymous resource, they could request that a new blank node be
>>> allocated in
>>> the correct graph and then use the resulting blank node ID in
>>> inserted
>>> statements. Then again, I understand the possible storage and
>>> performance
>>> issues that might arise from recording the provenance of all blank
>>> nodes in
>>> the system.
>> This is possibly OK, as I think it's the approach that many people
>> want to take on this (including Jena). It seems intuitive enough to
>> me, and it's certainly safe in both XA and XA2 (though it wasn't
>> supposed to be safe in XA).
>
> I guess my main point here is that, while it works in XA it wasn't
> intended to be that way. I'm very uncomfortable relying on an
> unintended consequence of a particular implementation detail of the
> underlying database in order to implement an important piece of
> functionality in my own code. This is why I'm leaning towards
> allocating unique key values for all of my blank nodes. Moving
> forward with Mulgara, I think we have to recognize that
> round-tripping of blank nodes is already supported to some degree, albeit
> unintentionally. Given that, I think it's important to decide to
> what extent we will support round-tripping of blank nodes and
> explicitly enforce any constraints that we decide on. I'm perfectly
> happy to participate in that conversation, but once again it comes
> down to not having the time to really devote to a comprehensive
> solution :-(
Well, XA1 won't be having a lot of work done on it, except for
maintenance. All our efforts will be going to XA2, and in that system
we will NEVER be recycling blank node IDs - this time by design. So
while it may have been unintentional, you can rest assured that it
will be guaranteed (unless YOU have the time to find a subtle and rare
race condition in code that is about to be deprecated?) :-)
Regards,
Paul Gearon