[Mulgara-dev] ModelName URN/URL fix (MGR-58)

Thu Jun 28 08:13:39 UTC 2007

Well, I spent a fair amount of last week preparing to implement this.
The design is relatively straightforward, although some of the
constraints it must satisfy are subtle.

The primary problem this is intended to address is the current  
conflation of a model's location with its name.  Specifically we  
currently use the same URI to refer to a model in a query that we use  
to refer to a model in the rdf statements describing the model.  To  
make this more concrete consider a model rmi://localhost/server#test.

When the model was created it was assigned a model type.  This type
identifies which resolver should be used to answer queries referring
to the model, and the type is stored as a statement in the
system-model.  In the case of a normal model the statement stored is:

rmi://localhost/server#test rdf:type mulgara:Model

On the other hand if rmi://localhost/server#test should be a view the  
statement stored is:

rmi://localhost/server#test rdf:type mulgara:ViewModel

Each resolver factory is responsible for ensuring it registers itself  
as handling the appropriate model-types, and with each query mulgara  
queries the system model to identify the type.
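
As a rough illustration of that lookup (a Python sketch, not Mulgara's
actual Java code), the following models the system model as a set of
statements and the factory registrations as a dictionary.  The names
SYSTEM_MODEL, FACTORIES, register and factory_for, the factory labels
and the #view model are all hypothetical.

```python
# Hypothetical sketch; Mulgara's real implementation is structured differently.
RDF_TYPE = "rdf:type"

# System model: a set of (subject, predicate, object) statements.
# The '#view' model is invented here purely for illustration.
SYSTEM_MODEL = {
    ("rmi://localhost/server#test", RDF_TYPE, "mulgara:Model"),
    ("rmi://localhost/server#view", RDF_TYPE, "mulgara:ViewModel"),
}

# Model type -> label of the resolver factory registered for it.
FACTORIES = {}

def register(model_type, factory_label):
    """Record that factory_label handles models of model_type."""
    FACTORIES[model_type] = factory_label

def factory_for(model_uri):
    """Find the model's type in the system model, then its factory."""
    for subject, predicate, obj in SYSTEM_MODEL:
        if subject == model_uri and predicate == RDF_TYPE:
            return FACTORIES[obj]
    raise LookupError("no such model: " + model_uri)

register("mulgara:Model", "store-factory")
register("mulgara:ViewModel", "view-factory")
```

Note that the lookup is keyed on the exact URI string stored in the
system model.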

The problem is that from the perspective of a client there are a large
number of sensible 'names' that could be used as aliases for a given
model in a query - in the case of 'test' a few might be:

rmi://127.0.0.1/server#test
soap://localhost/server#test
local:server#test
rmi://my.domain.name/server#test

and given that what counts as a 'legitimate' name for a model in a
query is client defined, the server cannot know every possible option.
We would like

select $s $p $o from rmi://localhost/server#test where $s $p $o
and
select $s $p $o from soap://127.0.0.1/server#test where $s $p $o

to be equivalent in every respect, except possibly the protocol used  
to access the server.

The problem is that the mapping from model to model-type is by
necessity in terms of a specific model-name, and consequently a
naive attempt to resolve a query against an alias (soap:....) will
fail to find the model.

The natural solution to this problem is to specify that the name used  
in the system-model be some 'canonical name', and that all references  
to models received in queries and other operations be first mapped  
into a canonical namespace before use.

Our initial attempt at this involved trying to identify a suitable
c-name from the DNS system and using that, while maintaining a list of
known and configured aliases that could be used to map incoming
model-names to c-names.  This works up to a point, and that point is
when people try to migrate databases to new systems, or run mulgara on
mobile platforms that routinely move between different networks
(e.g. notebooks).  At that point the reliance on the DNS system bites us.
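
The alias-list approach can be sketched roughly as follows.  This is
an illustrative Python sketch only; CANONICAL_HOST, KNOWN_ALIASES and
canonical_model_name are hypothetical names, and the real attempt
derived the c-name from DNS rather than from a hard-coded constant.

```python
from urllib.parse import urlsplit

# Hypothetical configuration: the canonical host name chosen for this
# server, and the host names known to be aliases for it.
CANONICAL_HOST = "localhost"
KNOWN_ALIASES = {"localhost", "127.0.0.1", "my.domain.name"}

def canonical_model_name(model_url):
    """Map a client-supplied model URL to the name in the system model."""
    parts = urlsplit(model_url)
    if parts.hostname not in KNOWN_ALIASES:
        # A host we were never told about, e.g. after moving networks.
        raise LookupError("unknown host: %s" % parts.hostname)
    # Ignore the client's protocol; this sketch assumes the system
    # model stores rmi-style names.
    return "rmi://%s%s#%s" % (CANONICAL_HOST, parts.path, parts.fragment)
```

With this configuration soap://127.0.0.1/server#test maps to
rmi://localhost/server#test, but a host name that is not in the alias
list raises LookupError, which is the failure mode described above.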

A key realisation required before we could solve this problem was that
the names being used by clients and the names being used by mulgara
internally are actually distinct.  The URIs used internally really
are *names*, while the URIs used by clients are actually
*locations*.  It is trivial to maintain a one-to-one relationship
between internal names and models; however, the only guarantee we can
provide with locations is that once dereferenced, at a specific point
in time, from a specific client, a location will only ever refer to a
single model.

This distinction becomes important when the use cases become more
involved - specifically when people start using meta-models.

Consider a FOAF aggregation application where each FOAF file is  
loaded into its own model, and a separate catalogue of FOAF files is  
maintained which tracks such information as when/where each FOAF file  
was obtained.  So we have:

rmi://localhost/foafdb#foaf-file-1
... contents ...
rmi://localhost/foafdb#foaf-file-2
... contents ...

rmi://localhost/foafdb#foaf-catalogue
rmi://localhost/foafdb#foaf-file-1 foafdb:downloadedfrom http://my.domain.com/myfoaf.foaf
rmi://localhost/foafdb#foaf-file-1 foafdb:downloadedat "20070205"^^<xsd:Date>
rmi://localhost/foafdb#foaf-file-2 foafdb:downloadedfrom http://your.domain.com/yourfoaf.foaf
rmi://localhost/foafdb#foaf-file-2 foafdb:downloadedat "20070207"^^<xsd:Date>

and now we want to consider the following pseudo-code to print all  
FOAF files downloaded on a Wednesday:

answer = foafdb.query("""
    select $foaf $date
    from <rmi://localhost/foafdb#foaf-catalogue>
    where $foaf <foafdb:downloadedat> $date""")

for (foafFile, date) in answer:
    if isWednesday(date):
        # foafFile is used directly as the model to query
        foafContents = foafdb.query("""
            select $s $p $o
            from """ + foafFile + """
            where $s $p $o""")
        printFoaf(foafContents)

Now the URIs inserted into the catalogue need to be names because
they need to unambiguously identify a unique model.  However when the
client wishes to use the name in a query what it actually needs is a
location.  Provided the client has access to RMI, and uses the
same DNS mapping between name and address that the server used, this
will work.  Unfortunately neither of these can be guaranteed, and when
it does fail there is no workaround.

The goal of MGR-58 is therefore to abandon any pretense that a  
model's name can be used as a location, and to make this distinction  
explicit.

We do this by introducing a new URI scheme that will identify model
names, rdfdb.  So the test model above will become:
rdfdb://some-unique-id#test.  A ModelURLResolver would then be able to
map this URI into a suitable URL for referring to the model
externally.  So the catalogue above becomes:

rdfdb://unique-id#foaf-catalogue
rdfdb://unique-id#foaf-file-1 foafdb:downloadedfrom http://my.domain.com/myfoaf.foaf
rdfdb://unique-id#foaf-file-1 foafdb:downloadedat "20070205"^^<xsd:Date>
rdfdb://unique-id#foaf-file-2 foafdb:downloadedfrom http://your.domain.com/yourfoaf.foaf
rdfdb://unique-id#foaf-file-2 foafdb:downloadedat "20070207"^^<xsd:Date>

and the first query becomes:

answer = foafdb.query("""
    select $foafurl $date
    from <rmi://localhost/foafdb#foaf-catalogue>
    where $foafuri <foafdb:downloadedat> $date
      and $foafuri <mulgara:hasCanonicalRMIURL> $foafurl
          in <rmi://localhost/foafdb#modelURLResolver>""")

or if the application is using soap:

answer = foafdb.query("""
    select $foafurl $date
    from <soap://localhost/foafdb#foaf-catalogue>
    where $foafuri <foafdb:downloadedat> $date
      and $foafuri <mulgara:hasCanonicalSOAPURL> $foafurl
          in <soap://localhost/foafdb#modelURLResolver>""")
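
To illustrate the kind of mapping a ModelURLResolver would perform,
here is a minimal Python sketch.  It assumes the URL layout
scheme://host/server#model used in the examples in this message; the
function name model_url and its parameters are hypothetical, and how
the resolver would actually learn the host and server name is not
specified here.

```python
from urllib.parse import urlsplit

def model_url(model_name, scheme, host, server):
    """Build a client-usable URL for an rdfdb model name.

    Hypothetical mapping: only the fragment (the model's local name)
    is carried over; scheme, host and server come from the caller.
    """
    parts = urlsplit(model_name)
    if parts.scheme != "rdfdb":
        raise ValueError("not a model name: " + model_name)
    return "%s://%s/%s#%s" % (scheme, host, server, parts.fragment)
```

For example, model_url("rdfdb://unique-id#foaf-file-1", "rmi",
"localhost", "foafdb") gives rmi://localhost/foafdb#foaf-file-1, and
passing "soap" instead gives the corresponding SOAP location for the
same model.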

There is also the suggestion that the ModelURLResolver could
support deconstructing (and therefore, in a prolog-like manner,
constructing) URLs into their components, i.e.

select $foafurl $date
    from <soap://localhost/foafdb#foaf-catalogue>
    where $foafuri <foafdb:downloadedat> $date
      and { $foafurl <mulgara:refersTo> $foafuri
                   : <mulgara:scheme> "soap:"
                   : <mulgara:host> "localhost" in <soap://localhost/foafdb#modelURLResolver> }

Support for this sort of thing is not intended for the initial
release, however, as I would prefer that the work required to
implement a query-transformation based resolver not delay the release
of this fix.

The work required to implement this falls into two areas.

1. Bootstrapping the SystemModel and ServerGUID.
2. Catching all references to model URLs and mapping them to URIs
before use.

Bootstrap is relatively self-contained: the system's bootstrap code
is mostly contained in BootstrapOperation.java.  We will need to
check for an existing ServerGUID, and if one is found store it in the
DatabaseMetadataImpl class.  If it isn't found we create a new one
and store it in a local distinguished model, similar to the way
preallocated nodes are currently handled.  NB. we could just use this
local model as the system model directly, but because its URI would
have to be the same globally there would be no way to query it
externally (it would be accessible *only* by mulgara internally), and
there are too many use cases that require external read access to the
system model for that to be feasible.
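
The check-then-create step can be sketched as follows.  This is a
Python illustration only: a dict stands in for the local distinguished
model, the key "serverGUID" and the function name are hypothetical,
and generating the GUID with a random UUID is an assumption, since the
generation scheme is not specified here.

```python
import uuid

def bootstrap_server_guid(local_model):
    """Return the server's GUID, creating and storing one on first start.

    local_model is a plain dict standing in for the local
    distinguished model.
    """
    guid = local_model.get("serverGUID")
    if guid is None:
        # First start: no GUID yet, so create one and persist it.
        guid = str(uuid.uuid4())
        local_model["serverGUID"] = guid
    return guid
```

Calling the function a second time on the same model returns the GUID
stored by the first call, so the rdfdb names built from it stay stable
across restarts.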

References are made to models in 6 operations that I am aware of  
currently (I may have missed one or two in the KRule stuff).  In 5  
the mapping is trivial as the model is passed directly as a parameter:

Create
Modify
Remove
Set
Backup
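
For these five, the mapping amounts to converting the parameter before
anything else touches it.  A minimal Python sketch, using Create as
the example; SERVER_GUID is a placeholder for the ServerGUID stored at
bootstrap, and to_model_name and create_model are hypothetical names.
The sketch does not check that the URL actually refers to this server.

```python
from urllib.parse import urlsplit

SERVER_GUID = "unique-id"  # placeholder for the GUID stored at bootstrap

def to_model_name(model_ref):
    """Map a model reference (location or name) to its rdfdb name."""
    parts = urlsplit(model_ref)
    if parts.scheme == "rdfdb":
        return model_ref  # already a name, nothing to do
    # Keep only the model's local name; scheme and host are discarded.
    return "rdfdb://%s#%s" % (SERVER_GUID, parts.fragment)

def create_model(system_model, model_ref, model_type):
    """Create: map the parameter first, then record the model's type."""
    name = to_model_name(model_ref)
    system_model.add((name, "rdf:type", model_type))
    return name
```

Whichever alias the client uses, the statement recorded in the system
model is in terms of the single rdfdb name.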

The 6th of course is Query, and there we have options.

1. We can apply a query-transform that examines the from-clause and  
every in-clause and performs the mapping
2. We can catch every localization of an in/from clause as it is  
integrated into the constraint in LocalQueryResolver::resolve
3. We can catch it just before we lookup the ResolverFactory in  
DatabaseOperationContext::getCanonicalModel

The current DNS-based partial solution lives in 3.
(DOC::getCanonicalModel), so the quickest approach is to just modify
that.  The cleanest approach is either 1. or 2.  Ultimately I believe
we are going to have to go with 1.; however, we currently don't have
support for rewriting the from-clause in a query-transformation, so
this would require updates to the symbolic-transformation code.  We
probably need to change the implementation of
SymbolicTransformationContext, but for now if we stick with either 2.
or 3. we can avoid changing the Resolver-SPI, which would be nice.
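
To give a feel for what option 1. has to achieve, here is a
deliberately naive Python sketch that rewrites model references in the
query text.  It is a textual substitution, not the symbolic
transformation discussed above; it only handles clauses of the form
"from <uri>" or "in <uri>" naming a single model, and the names
MODEL_CLAUSE and rewrite_model_refs are hypothetical.

```python
import re

# Matches a from- or in-clause naming a single model, e.g. "from <uri>".
MODEL_CLAUSE = re.compile(r"\b(from|in)\s+<([^>]+)>")

def rewrite_model_refs(query, to_model_name):
    """Replace each model location in the query with its model name.

    to_model_name is a caller-supplied function mapping a location
    string to the corresponding name string.
    """
    def replace(match):
        keyword, model_ref = match.groups()
        return "%s <%s>" % (keyword, to_model_name(model_ref))
    return MODEL_CLAUSE.sub(replace, query)
```

A real implementation would work on the parsed query rather than on
its text, which is why the from-clause support in the transformation
code matters.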

Andrae

--
Andrae Muys
andrae at netymon.com
Mulgara Consultant
Netymon Pty Ltd