[Mulgara-dev] Blank Node Assignment in Inserts
Alex Hall
alexhall at revelytix.com
Thu Feb 28 19:54:41 UTC 2008
Paul Gearon wrote:
> On Feb 27, 2008, at 10:33 PM, Andrae Muys wrote:
>
>> On 28/02/2008, at 1:37 PM, Life is hard, and then you die wrote:
>>
>>> On Wed, Feb 27, 2008 at 08:20:48PM -0500, Alex Hall wrote:
>
> <snip description="Bug #81" href="http://mulgara.org/trac/ticket/81"/>
<snip/>
> While it *is* inadvertent, it might also be desirable. We have
> already had questions about re-accessing blank nodes across multiple
> insertions. Maybe this is the way to do it? :-)
Well, as soon as you provide for re-accessing blank nodes across multiple
insertions, people will want to do the same thing for other types of
operations. :-) I know there is also a desire to re-access blank nodes
across multiple queries, so that one can include the results from one
query in the constraints for another one via the iTQL shell without
having to use subqueries. Granted, that is entirely outside the scope of
this particular conversation.
From an offline conversation with Paul:
> On Feb 28, 2008, at 8:28 AM, Alex Hall wrote:
>> I'm looking at the relevant source code for bug 81 regarding the
>> scoping of blank node variables on inserts. I tracked down in
>> StringPoolSession.java where you added the blank node cache. I seem
>> to remember this being handled at a higher level of the architecture
>> in previous versions. Can you give me a little bit of background on
>> why the cache was added to StringPoolSession? The SVN comments
>> indicate that it had something to do with the distributed resolver,
>> but I'm not familiar with that aspect of the system.
>
> The cache here is not for optimizing, but rather to remember the
> mapping between blank node labels and the internal blank nodes. When
> loading a file (in this case the change was concentrating on loading
> N3 files) then blank nodes are identified by anything with a label
> that starts with "_:". This results in a new blank node being
> allocated (meaning we are given a gNode, but there is no associated
> entry in the string pool). We need to make sure that the next time
> the blank node is seen in the file, that it is given the same gNode.
When loading RDF using the RDF/XML and RIO content handlers, the blank
node label to internal blank node mapping is maintained in the content
handler itself. The Content interface also contains a getBlankNodeMap()
method which returns, according to the Javadoc, a "map attached to the
'scope' of the content object containing a mapping from ContentHandler
specific identifiers to blank nodes from previous parses of this
content," and is used by the MP3 content handler. The N3 content
handler also maintains its own internal mapping of blank node label to
internal blank node, and furthermore the ResolverSession method that it
uses to create blank nodes is newBlankNode() which always allocates a
new node, bypassing the blank node cache in StringPoolSession
altogether. So I'm still curious why the StringPoolSession needs to
keep the label to node ID mapping. This isn't to suggest that it's
incorrect, just that I don't quite understand. :-) Perhaps it's meant
to support foreign blank nodes?
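
To make the distinction concrete, here is a minimal sketch of the two
behaviours described above: a label map that returns the same node each
time a label is seen, next to an allocator that always hands out a new
node. The class and the localize() method are hypothetical stand-ins and
are not taken from the Mulgara source; only the name newBlankNode()
comes from the discussion, and its body here is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-parse blank node label map; not actual Mulgara code. */
final class BlankNodeLabelMap {
    // Maps a blank node label such as "_:a" to the node ID allocated for it.
    private final Map<String, Long> labelToNode = new HashMap<>();
    // Stand-in for the node pool; real node allocation is more involved.
    private long nextNode = 1;

    /** Always allocates a fresh node and never consults the label map. */
    long newBlankNode() {
        return nextNode++;
    }

    /** Returns the node already assigned to this label, allocating on first sight. */
    long localize(String label) {
        return labelToNode.computeIfAbsent(label, l -> newBlankNode());
    }
}

final class BlankNodeLabelMapDemo {
    public static void main(String[] args) {
        BlankNodeLabelMap map = new BlankNodeLabelMap();
        long first = map.localize("_:a");
        long again = map.localize("_:a");   // same label, same node
        long other = map.localize("_:b");   // different label, different node
        System.out.println(first == again);
        System.out.println(first != other);
    }
}
```

If a content handler keeps a map like this for the duration of one parse,
a second copy of the same mapping in the session is redundant for that
load, which is the source of my question.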
> Looking at bug 81 I can see that this might be the cause... though I'd
> have to check. The cache was supposed to remember the blank node
> label to gNode mapping during a file load, but I didn't think about
> transactioned insert statements. It wouldn't have occurred to me to
> think about this, since insert statements don't use blank node labels,
> but rather use variables (as per the description in bug 81). While I
> could probably trace it out, it would be quicker to just debug it to
> see if the variable names are going into this cache.
This is indeed the case -- the variable names are going into the blank
node cache in StringPoolSession. If we were to fix this bug, the most
straightforward option would be to exclude variables from the blank node
cache in StringPoolSession and always allocate a new node for them. The
TripleSetWrapperStatements class already maintains a mapping of
globalized node object to local node ID, so the identity of a variable
blank node would be maintained over the course of a single insertion. I
can't tell if making this change would have any other side effects.
Another, less elegant solution would be to mangle variable names inside
an insertion to ensure that they are unique to that operation.
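
As a rough illustration of the first option, the sketch below keeps a
label cache for the lifetime of the session but resolves variables
through a map that only exists for one insertion. Everything in it
(SessionSketch, localizeLabel(), insert()) is a hypothetical name chosen
for this example; it is not how StringPoolSession or
TripleSetWrapperStatements are actually written.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical session sketch; not actual Mulgara code. */
final class SessionSketch {
    // Label cache that lives as long as the session or transaction.
    private final Map<String, Long> labelCache = new HashMap<>();
    // Stand-in for the node pool allocator.
    private long nextNode = 1;

    long newBlankNode() {
        return nextNode++;
    }

    /** Blank node labels from a file load still go through the session cache. */
    long localizeLabel(String label) {
        return labelCache.computeIfAbsent(label, l -> newBlankNode());
    }

    /**
     * One insertion. Variables bypass the session cache and are resolved
     * through a map that is discarded when the insertion finishes, so a
     * variable keeps its identity within the insertion only.
     */
    Map<String, Long> insert(String... variables) {
        Map<String, Long> scope = new HashMap<>();
        for (String variable : variables) {
            scope.computeIfAbsent(variable, v -> newBlankNode());
        }
        return scope;
    }
}

final class SessionSketchDemo {
    public static void main(String[] args) {
        SessionSketch session = new SessionSketch();
        Map<String, Long> first = session.insert("$x", "$y", "$x");
        Map<String, Long> second = session.insert("$x");
        // $x is one node within the first insertion, and a different
        // node in the second insertion of the same transaction.
        System.out.println(first.get("$x").equals(second.get("$x")));
    }
}
```

With this arrangement, reusing $x in a later insertion of the same
transaction would produce a new blank node rather than the one from the
earlier insertion.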
> I'm inclined to fix it, but I'm interested in opinions here.
I agree. I will have potentially long-running transactions with
multiple insertions, and would prefer not to have to guarantee the
uniqueness of my variable names across the entire transaction.
More than anything else, I think it is important to come up with a
consistent approach to handling the mapping of blank node labels to
internal node IDs throughout the software. It just seems kind of silly
to maintain these blank node mappings in half a dozen different places.
What this approach should be, I'm not sure, but I'm interested in more
opinions as well.
Alex