[Mulgara-dev] Resolvers

Tue Dec 14 00:08:34 UTC 2010

On Mon, Dec 13, 2010 at 3:50 PM, Gregg Reynolds <dev at mobileink.com> wrote:
> I've gone over what documentation I can find on Resolvers, and I have a few
> questions.
> 1)  It looks like a protocol handler is part of a resolver.  Does this mean
> that each resolver that works with e.g. http has to have its own http
> protocol handler?  My understanding is that a resolver combines protocol
> "resolution", fetching of a resource, and translation (via a content
> adapter) to a graph. It seems like protocol handlers should be
> self-contained components available for use by any resolver.

Resolvers have 2 jobs. The first is figuring out what a URI means, and
accessing the data for it. The second is making sure that data appears
as triples.

One of the resolvers is the HTTP resolver. This one recognizes URIs
that start with HTTP, and knows how to make HTTP connections to get
the data. At this point, it knows that it can't just take the raw data
and return triples, so it matches the data that it gets against the
content handlers. The content handlers can parse file streams and
return triples. So the resolver hands the data stream off to the
content handler, and gets the data back in a temporary graph. It can
then "resolve" any patterns that were requested of it against that
temporary graph. For instance, if the query needs to find all the
people, then the resolver will resolve the pattern "* rdf:type
foaf:Person".
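
As a rough illustration of that flow, here is a minimal Java sketch. All of
the names in it (ContentResolverSketch, ContentHandler, LineHandler, the
"text/x-toy-triples" content type) are made up for the example and are not
Mulgara's real classes, and the network fetch is replaced by a Reader over an
in-memory string. It only shows the shape: pick a content handler, parse the
stream into a temporary graph, then match a pattern against that graph.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical names throughout; these are not Mulgara's real classes.
class ContentResolverSketch {

  /** A triple of plain strings. */
  static final class Triple {
    final String subject, predicate, object;
    Triple(String subject, String predicate, String object) {
      this.subject = subject;
      this.predicate = predicate;
      this.object = object;
    }
  }

  /** Parses a character stream into triples. */
  interface ContentHandler {
    boolean canParse(String contentType);
    List<Triple> parse(Reader in);
  }

  /** Toy handler: one "subject predicate object" per line, whitespace separated. */
  static final class LineHandler implements ContentHandler {
    public boolean canParse(String contentType) {
      return "text/x-toy-triples".equals(contentType);
    }
    public List<Triple> parse(Reader in) {
      List<Triple> graph = new ArrayList<>();
      List<String> lines = new BufferedReader(in).lines().collect(Collectors.toList());
      for (String line : lines) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length == 3) graph.add(new Triple(parts[0], parts[1], parts[2]));
      }
      return graph;
    }
  }

  /** Finds a handler, builds a temporary graph, and matches the pattern (null means "*"). */
  static List<Triple> resolve(Reader data, String contentType, List<ContentHandler> handlers,
                              String s, String p, String o) {
    for (ContentHandler handler : handlers) {
      if (!handler.canParse(contentType)) continue;
      List<Triple> temporaryGraph = handler.parse(data);
      List<Triple> result = new ArrayList<>();
      for (Triple t : temporaryGraph) {
        if (termMatches(s, t.subject) && termMatches(p, t.predicate) && termMatches(o, t.object)) {
          result.add(t);
        }
      }
      return result;
    }
    throw new IllegalArgumentException("No content handler for " + contentType);
  }

  private static boolean termMatches(String patternTerm, String value) {
    return patternTerm == null || patternTerm.equals(value);
  }

  public static void main(String[] args) {
    String data = "ex:alice rdf:type foaf:Person\nex:alice foaf:name Alice\n";
    List<ContentHandler> handlers = Collections.singletonList(new LineHandler());
    // Resolve the pattern "* rdf:type foaf:Person" against the parsed data.
    for (Triple t : resolve(new StringReader(data), "text/x-toy-triples", handlers,
                            null, "rdf:type", "foaf:Person")) {
      System.out.println(t.subject);
    }
  }
}
```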

I appreciate the idea of separating out the protocol handling from the
resolver, but remember that a resolver is just a mechanism for mapping
a URI to triples, and providing a function for matching against those
triples. In other words, it takes a (URI, pattern) and returns a
Resolution. This seems like a reasonable level of abstraction. The
other point is that only a couple of resolvers handle network
protocols. Others work on the local data store, while others query an
SQL database.
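
To make the "(URI, pattern) in, Resolution out" idea concrete, here is a
small sketch of that contract, with resolvers chosen by the scheme of the
graph URI. The interface, the class names and the fixed in-memory triples are
assumptions made for the example; the real Mulgara interfaces carry more
than this.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of the (URI, pattern) -> Resolution contract only.
class ResolverDispatchSketch {

  /** Matching rows, each a {subject, predicate, object} array. */
  static final class Resolution {
    final List<String[]> rows;
    Resolution(List<String[]> rows) { this.rows = rows; }
  }

  interface Resolver {
    boolean handles(URI graph);
    /** Pattern terms may be null, meaning "*". */
    Resolution resolve(URI graph, String s, String p, String o);
  }

  /** A resolver for one URI scheme, backed by a fixed list of triples. */
  static final class FixedResolver implements Resolver {
    private final String scheme;
    private final List<String[]> triples;
    FixedResolver(String scheme, List<String[]> triples) {
      this.scheme = scheme;
      this.triples = triples;
    }
    public boolean handles(URI graph) {
      return scheme.equalsIgnoreCase(graph.getScheme());
    }
    public Resolution resolve(URI graph, String s, String p, String o) {
      List<String[]> rows = new ArrayList<>();
      for (String[] t : triples) {
        if ((s == null || s.equals(t[0]))
            && (p == null || p.equals(t[1]))
            && (o == null || o.equals(t[2]))) {
          rows.add(t);
        }
      }
      return new Resolution(rows);
    }
  }

  /** Asks the first resolver that claims the graph URI to resolve the pattern. */
  static Resolution resolve(List<Resolver> resolvers, URI graph, String s, String p, String o) {
    for (Resolver r : resolvers) {
      if (r.handles(graph)) return r.resolve(graph, s, p, o);
    }
    throw new IllegalArgumentException("No resolver registered for " + graph);
  }

  public static void main(String[] args) {
    List<String[]> triples = new ArrayList<>();
    triples.add(new String[] {"ex:alice", "rdf:type", "foaf:Person"});
    List<Resolver> resolvers = new ArrayList<>();
    resolvers.add(new FixedResolver("http", triples));
    Resolution r = resolve(resolvers, URI.create("http://example.com/people.rdf"),
                           null, "rdf:type", "foaf:Person");
    System.out.println(r.rows.size());
  }
}
```

The caller never sees whether the data came from the network, a local store
or an SQL database; it only sees the Resolution.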

> 2)  What is the relation between an "external" graph and an "internal"
> graph?

A couple of things. First, the internal graphs are stored in a way
that is directly accessible by the local JVM (e.g. in memory, or in
disk files owned by the database). External graphs come from somewhere
else.

The most important difference is in node allocation. Internal graphs
use locally allocated nodes, which are represented with positive Longs. External
graphs need to allocate any new nodes (including ALL of their blank
nodes) as they come in, and these are represented with negative Longs,
to ensure they never clash. Inside the query engine you'll see them
stored in the "Temporary String Pool".
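
A toy version of that allocation scheme might look like the following. The
class and method names are invented for the example, and values are reduced
to plain strings; the point is only that persistent IDs count up from 1,
temporary IDs count down from -1, and so the two ranges cannot collide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of positive (persistent) vs negative (temporary) node IDs.
class NodeAllocationSketch {
  private final Map<String, Long> persistent = new HashMap<>();
  private final Map<String, Long> temporary = new HashMap<>();
  private long nextPersistent = 1;
  private long nextTemporary = -1;

  /** Stores a value in the persistent pool, as when data goes into an internal graph. */
  long storeInternal(String value) {
    return persistent.computeIfAbsent(value, v -> nextPersistent++);
  }

  /** Maps a value from an external graph: reuse a known ID, else allocate a negative one. */
  long localizeExternal(String value) {
    Long known = persistent.get(value);
    if (known != null) return known;
    return temporary.computeIfAbsent(value, v -> nextTemporary--);
  }

  /** Blank nodes from an external graph always get a fresh temporary ID. */
  long allocateBlankNode() {
    return nextTemporary--;
  }

  public static void main(String[] args) {
    NodeAllocationSketch pool = new NodeAllocationSketch();
    System.out.println(pool.storeInternal("http://example.com/foo"));
    System.out.println(pool.localizeExternal("http://example.com/new"));
    System.out.println(pool.allocateBlankNode());
  }
}
```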

(All nodes are stored in a "string pool". The name comes from the fact
that Mulgara used to store everything as strings: URIs and Literals.
URIs are obviously strings, and Literals were stored in lexical form,
which means strings. These days binary objects are stored as well, but
the name stuck).

> My understanding is that the resolver fetches and transforms the
> external data into a graph, but from the docs it looks like the resolver is
> responsible also for processing queries against the graph.  In other words
> the resolver is a little database unto itself.  As opposed to having the
> resolver construct a graph and pass it to the primary buffer/disk manager
> somehow.  In other words, what is the relation between the query processor,
> the storage manager, and the resolver?

The query processor takes a complex query, figures out a plan to
execute it, and runs it. Running a query is the process of resolving
each of the individual parts, and joining the results. For instance,
the first names of all people with a last name of Smith is:
  SELECT ?name WHERE {
    ?person foaf:givenName ?name .
    ?person foaf:familyName "Smith"
  }

The query engine works out that this can be resolved by matching the patterns:
  *  foaf:givenName  *
  *  foaf:familyName  "Smith"
And joining those results on the first column of each. So it finds the
resolvers that can answer this (based on the graph URI - not shown)
and asks the resolvers to handle the two patterns.
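
The join step can be sketched in a few lines of Java. This is a plain hash
join written for illustration, not a description of how Mulgara's query
engine actually joins; the class name and the idea of representing each
resolution as a list of two-column String rows are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A plain hash join used for illustration; not a claim about Mulgara's join code.
class JoinSketch {

  /** Joins two resolutions on column 0 and emits column 1 of the left side per matching pair. */
  static List<String> joinOnFirstColumn(List<String[]> left, List<String[]> right) {
    // Count how many right-hand rows exist for each value of the join column.
    Map<String, Integer> matchCounts = new HashMap<>();
    for (String[] row : right) {
      matchCounts.merge(row[0], 1, Integer::sum);
    }
    List<String> names = new ArrayList<>();
    for (String[] row : left) {
      int n = matchCounts.getOrDefault(row[0], 0);
      for (int i = 0; i < n; i++) {
        names.add(row[1]);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    // Rows are {?person, value}, as if returned for each of the two patterns.
    List<String[]> givenNames = new ArrayList<>();
    givenNames.add(new String[] {"ex:p1", "John"});
    givenNames.add(new String[] {"ex:p2", "Jane"});
    List<String[]> smiths = new ArrayList<>();
    smiths.add(new String[] {"ex:p1", "Smith"});
    System.out.println(joinOnFirstColumn(givenNames, smiths));
  }
}
```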

A resolver will take the URI of the graph being queried and figure out
how to give you back the data that matches your pattern. If the graph
is a file accessed over HTTP, then it will download the file and use a
content handler to convert it into a temporary graph, and then query
the graph for the pattern. Usually, however, you'll be using the XA
(transactional) resolver, or the more recent XA 1.1 resolver.

The XA (and XA 1.1) resolver stores and reads data
on local disks. This is where we do all of our work trying to write
data as quickly and efficiently as possible, and to read it back with
as few disk seeks as possible.

> 3)  Which resolver implementation is most suitable as the basis of a
> tutorial?

Good question.

TestResolver takes TestConstraints (as opposed to normal constraints)
and returns resolutions that are predetermined. That's relatively
straightforward.

DistributedResolver is pretty simple as well. It creates a
NetworkDelegator (which looks like a resolver) and gets that to do
everything. The NetworkDelegator just creates a connection to a remote
server and forwards whatever it was asked to that connection. It then
wraps the response in the appropriate interfaces.

Both FileResolver and HttpResolver extend the abstract
ContentResolver, and are also reasonably simple. The ContentResolver
class is the one that loads a content handler and gets the triples
back from it.

If you're looking into the code at this level, then you'll need to be
aware of "local nodes" vs "global nodes". Global nodes take the form
URIReference<http://example.com/foo> or Literal<"This is a string">.
Local nodes are Long values. There is always a one-to-one mapping between
local nodes and global nodes. So for instance, the Literal "This is a
string" may have a local representation of 42. In that case 42 will
always be globalized back into the Literal "This is a string".
Anything coming off the network, going to a user, or coming from a
user, will always be global. Anything being stored on disk, or that
has just been retrieved from disk will always be local. So a lot of
resolver code is about "localizing" and "globalizing" nodes.
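
A stripped-down picture of localizing and globalizing, assuming global nodes
are reduced to their string form and using invented names (NodePoolSketch,
localize, globalize) rather than Mulgara's real API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch of the one-to-one local/global node mapping.
class NodePoolSketch {
  private final Map<String, Long> toLocal = new HashMap<>();
  private final Map<Long, String> toGlobal = new HashMap<>();
  private long next = 1;

  /** Returns the local node for a global node, allocating one on first sight. */
  long localize(String globalNode) {
    Long existing = toLocal.get(globalNode);
    if (existing != null) return existing;
    long id = next++;
    toLocal.put(globalNode, id);
    toGlobal.put(id, globalNode);
    return id;
  }

  /** Returns the global node that a local node stands for. */
  String globalize(long localNode) {
    String globalNode = toGlobal.get(localNode);
    if (globalNode == null) {
      throw new NoSuchElementException("Unknown local node " + localNode);
    }
    return globalNode;
  }

  public static void main(String[] args) {
    NodePoolSketch pool = new NodePoolSketch();
    long local = pool.localize("Literal<\"This is a string\">");
    // The same local node always globalizes back to the same global node.
    System.out.println(local + " <-> " + pool.globalize(local));
  }
}
```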

Regards,
Paul