[Mulgara-dev] Blank Node Assignment in Inserts

Thu Feb 28 19:54:41 UTC 2008

Paul Gearon wrote:
> On Feb 27, 2008, at 10:33 PM, Andrae Muys wrote:
> 
>> On 28/02/2008, at 1:37 PM, Life is hard, and then you die wrote:
>>
>>> On Wed, Feb 27, 2008 at 08:20:48PM -0500, Alex Hall wrote:
> 
> <snip description="Bug #81" href="http://mulgara.org/trac/ticket/81"/>

<snip/>

> While it *is* inadvertent, it might also be desirable.  We have  
> already had questions about re-accessing blank nodes across multiple  
> insertions.  Maybe this is the way to do it?  :-)

Well, as soon as you provide for re-accessing blank nodes across multiple 
insertions, people will want to do the same thing for other types of 
operations. :-)  I know there is also a desire to re-access blank nodes 
across multiple queries, so that one can include the results from one 
query in the constraints for another one via the iTQL shell without 
having to use subqueries.  Granted, that is entirely outside the scope 
of this particular conversation.

From an offline conversation with Paul:

 >On Feb 28, 2008, at 8:28 AM, Alex Hall wrote:
 >> I'm looking at the relevant source code for bug 81 regarding the
 >> scoping of blank node variables on inserts.  I tracked down in
 >> StringPoolSession.java where you added the blank node cache.  I seem
 >> to remember this being handled at a higher level of the architecture
 >> in previous versions.  Can you give me a little bit of background on
 >> why the cache was added to StringPoolSession?  The SVN comments
 >> indicate that it had something to do with the distributed resolver,
 >> but I'm not familiar with that aspect of the system.
 >
 > The cache here is not for optimizing, but rather to remember the
 > mapping between blank node labels and the internal blank nodes.  When
 > loading a file (in this case the change was concentrating on loading
 > N3 files) then blank nodes are identified by anything with a label
 > that starts with "_:".  This results in a new blank node being
 > allocated (meaning we are given a gNode, but there is no associated
 > entry in the string pool).  We need to make sure that the next time
 > the blank node is seen in the file, that it is given the same gNode.

When loading RDF using the RDF/XML and RIO content handlers, the blank 
node label to internal blank node mapping is maintained in the content 
handler itself.  The Content interface also contains a getBlankNodeMap() 
method which returns, according to the Javadoc, a "map attached to the 
'scope' of the content object containing a mapping from ContentHandler 
specific identifiers to blank nodes from previous parses of this 
content," and is used by the MP3 content handler.  The N3 content 
handler also maintains its own internal mapping of blank node label to 
internal blank node, and furthermore the ResolverSession method that it 
uses to create blank nodes is newBlankNode() which always allocates a 
new node, bypassing the blank node cache in StringPoolSession 
altogether.  So I'm still curious why the StringPoolSession needs to 
keep the label to node ID mapping.  This isn't to suggest that it's 
incorrect, just that I don't quite understand. :-)  Perhaps it's meant 
to support foreign blank nodes?
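
To make the two allocation paths concrete, here is a rough sketch.  The 
class and the localizeBlankNode() method are made up for illustration 
and are not the actual StringPoolSession code; only the newBlankNode() 
name comes from ResolverSession, and its body here is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a resolver session; not Mulgara's real class.
class SketchSession {
    private long nextNode = 1;

    // Session-level cache: blank node label -> internal node ID (gNode).
    private final Map<String, Long> blankNodeCache = new HashMap<>();

    // Always allocates a fresh node and never consults the cache.
    long newBlankNode() {
        return nextNode++;
    }

    // Hypothetical cached path: the same label always yields the same
    // node for as long as the session-level cache lives.
    long localizeBlankNode(String label) {
        return blankNodeCache.computeIfAbsent(label, l -> newBlankNode());
    }
}
```

A handler that keeps its own label map and only calls newBlankNode() 
never touches the session-level cache, which is the case I described 
for the N3 handler above.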

 > Looking at bug 81 I can see that this might be the cause... though I'd
 > have to check.  The cache was supposed to remember the blank node
 > label to gNode mapping during a file load, but I didn't think about
 > transactioned insert statements.  It wouldn't have occurred to me to
 > think about this, since insert statements don't use blank node labels,
 > but rather use variables (as per the description in bug 81).  While I
 > could probably trace it out, it would be quicker to just debug it to
 > see if the variable names are going into this cache.

This is indeed the case -- the variable names are going into the blank 
node cache in StringPoolSession.  If we were to fix this bug, the most 
straightforward option would be to exclude variables from the blank node 
cache in StringPoolSession and always allocate a new node for them.  The 
TripleSetWrapperStatements class already maintains a mapping of 
globalized node object to local node ID, so the identity of a variable 
blank node would be maintained over the course of a single insertion.  I 
can't tell if making this change would have any other side effects. 
Another, less elegant solution would be to mangle variable names inside 
an insertion to ensure that they are unique to that operation.
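
Roughly, the two options look like this.  This is only a sketch with 
invented names (InsertScopeSketch, insert, beginInsert, localizeMangled); 
it does not reflect how TripleSetWrapperStatements or StringPoolSession 
are actually written.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Mulgara source.
class InsertScopeSketch {
    private long nextNode = 1;
    private long nextInsertId = 1;

    // Session-wide cache, as in the current behaviour.
    private final Map<String, Long> sessionCache = new HashMap<>();

    long newBlankNode() {
        return nextNode++;
    }

    // Option 1: variables bypass the session cache.  A map that lives
    // for one insertion keeps a variable's identity within it only.
    Map<String, Long> insert(String... variables) {
        Map<String, Long> scope = new HashMap<>();
        for (String v : variables) {
            scope.computeIfAbsent(v, k -> newBlankNode());
        }
        return scope;
    }

    // Option 2: keep the session cache, but mangle each variable name
    // with a per-insertion ID so names cannot collide across insertions.
    long beginInsert() {
        return nextInsertId++;
    }

    long localizeMangled(long insertId, String variable) {
        String key = insertId + ":" + variable;
        return sessionCache.computeIfAbsent(key, k -> newBlankNode());
    }
}
```

With either option, a variable such as $x used in two separate 
insertions of the same transaction would end up as two distinct blank 
nodes, while repeated uses within one insertion share a node.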

> I'm inclined to fix it, but I'm interested in opinions here.

I agree.  I will have potentially long-running transactions with 
multiple insertions, and would prefer not to have to guarantee the 
uniqueness of my variable names across the entire transaction.

More than anything else, I think it is important to come up with a 
consistent approach to handling the mapping of blank node labels to 
internal node IDs throughout the software.  It just seems kind of silly 
to maintain these blank node mappings in half a dozen different places. 
What this approach should be, I'm not sure, but I'm interested in more 
opinions as well.

Alex