[Mulgara-dev] SPARQL handling of POSTed file uploads

Paul Gearon gearon at ieee.org
Thu Dec 30 03:54:43 UTC 2010


Hi Gregg,

I don't know if Andy is still subscribed to this list, but if so, then
I'm sure he would have some pertinent observations to add.

On Tue, Dec 28, 2010 at 10:06 AM, Gregg Reynolds <dev at mobileink.com> wrote:
> On Sun, Dec 26, 2010 at 11:37 AM, Paul Gearon <gearon at ieee.org> wrote:
>>
>> to REST where I could. However, a real protocol is nearly here, and
>> everything in Mulgara's protocols will need to be updated to comply.
>> The preliminary document for this is at:
>>  http://www.w3.org/2009/sparql/docs/http-rdf-update/
>> It's getting close to final call, so most of this document has been
>> finalized and we should go ahead and update Mulgara to use it.
>
> First impression: lots of problems.  In fact the more I think about it the
> more I doubt whether the document is even necessary.  The semantics of the
> HTTP methods are already defined in RFC 2616.  The only need I can see (off
> the top of my head) is for RDF-specific query syntax (i.e. "graph=...").
>  But that is not a protocol issue; it's just a naming convention.

I think you're getting a little picky there. Conventions *are* a kind
of protocol. Also, if you don't follow the convention, then you can't
communicate with an endpoint that "satisfies the protocol".

> Ok, that
>  and conventions for returning information in response to successful or
> failed operations; e.g. indicating the number of statements inserted for a
> PUT or POST.  But all that is a matter of application conventions rather
> than HTTP protocol.

It's a standard way to use HTTP for these operations. If you have a
look at REST implementations out in the wild, it's very clear that
there's almost always more than one way to implement REST for almost
any kind of application (and that's just talking about those
implementations that don't make a mess of the concepts behind REST!).
This document tries to explain what should be considered a resource
when using REST on RDF, how URIs are used for referring to these
resources, and what each HTTP verb should mean when it is used.

Application conventions *are* a protocol, so I don't understand what
you mean when you try to distinguish between the two. Perhaps you were
looking at HTTP as the protocol, and this is just building over the
top of that, but it's quite common to build protocols on top of other
protocols.

> The major problem, for me anyway, is the talk about "RDF knowledge".
> "Knowledge" is not a technical term; I wish the W3C would ban it.

I disagree. It has a very specific meaning, and the W3C is trying to
be careful to use it correctly. "Knowledge" refers to the information
that the RDF is encoding, without any reference to how it is encoded.
It may be RDF/XML, JSON, N3, or one of several other formats. Some of
those formats may require significant transformation to derive the
"triples" that form the basis of RDF. Referring to "triples" also
tends to imply N3, which they are trying to avoid here.

> Sentences
> like  "However, in using a URI in this way, we are not directly identifying
> an RDF graph but rather the RDF knowledge that is represented by an RDF
> document, which serializes that graph." make my eyeballs hurt. To me it
> reads like bad amateur philosophizing - note the logical
> incoherence.

I agree that it makes for painful reading, but it's done for a purpose.

Part of the reason is that a lot of this stuff has been developed by
mathematicians. Many programming systems (e.g. data formats,
languages, etc.) are developed in an ad hoc and very practical way
that often serves adequately. However, this approach has regularly
created situations where corner cases have significant problems, or
particular expressions are simply not possible. One reason for the
mathematical foundation of all of this is to avoid that kind of
problem. Another issue is that OWL was developed by mathematicians who
were specifically trying to create a language capable of expressing a
particular set of mathematical properties. To build this on RDF meant
that RDF needed very clear and formal semantics as well. So the entire
system requires this horrible level of formalism that is very dry and
often appears redundant, particularly from the perspective of the
English language. The thing is that those apparent redundancies are
often required when writing mathematical English.

So it's not "amateur philosophizing", but rather an expression of a
formalism in English, a language that is not really suitable
for that task. Believe me, when you try to make these sentences read
more naturally, the PhDs in mathematics all complain and insist that
it be changed to read like this.

> Graphs is graphs is graphs.  SQL gets along just fine without
> mentioning model theory or knowledge, yet relational data is no different
> than graph data as far as formal semantics is concerned.  Nobody ever says
> that an SQL query URL identifies "the relational knowledge that is
> represented by an SQL table, which serializes that relation."

Part of the reason for that is because SQL grew out of the practical
programming community, and various mathematicians have analyzed it
over time. RDF has been (mostly) built the other way around, where the
mathematicians have defined it, and then the programmers have had to
come in later and implement it.

Both approaches have problems. I believe that SQL has had a couple of
problems over the decades, and formalisms that were back-ported onto
it were required to clarify how the next SQL version should proceed.
RDF/SPARQL had great formalisms, but these have sometimes forced
implementations to be impractical.

>  I've never
> understood why the W3C is so in love with model theory and "knowledge" and
> so forth, considering the rather obvious point that formal semantics is just
> that, purely formal, i.e. syntactic.

I think I've explained some of where the model theory and requirement
for formalisms have come from. However, the formalisms are NOT just
syntactic. That is simply an artifact produced by the process. From
the perspective of developers, this may be the only useful result of
the process, but modelers have the reverse perspective.

> That stuff is just there to ensure
> consistency anyway, and should be kept in the attic like any crazy uncle.

I know a lot of ontologists who would disagree. They also don't care
about the syntax at all. But they *do* care a lot about the formal
semantics.

Also, the modeling that OWL provides is *far* more expressive than
what an RDBMS can express. Of course, an RDBMS can glue a lot of
things together by using non-modeling tools like triggers and stored
procedures, but that kind of imperative approach is not modeling. It
may be more efficient, but it is also more prone to errors, cannot be
analyzed to the same extent, and does not have the inherent capacity
to be built upon. It's very much like expressing ontologies using
rules vs. using an ontology language (in fact, the ontology language
can be made more efficient).

> Another example: first sentence of section 5.1:  "A request that uses the
> HTTP PUT method SHOULD store the enclosed RDF payload as RDF knowledge."  I
> don't even know what that means, if it means anything.

It says that if you get a PUT, then store the data in your system.

The reason for the opaque description is that they're trying to avoid
requiring a particular mechanism of storage (triples, XML, etc.). Part
of the reason is that while 99% of systems might do it as triples,
there are almost always exceptions (for instance, I've heard of
storing RDF/XML in XML databases). So long as the exceptions don't
create untenable restrictions, the document will always try to include
these alternatives.
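
To see why the wording stays that abstract, picture the storage behind
a PUT as nothing more than an interface like this (entirely
hypothetical, with invented names) that a triple store, an XML
database, or even a filesystem could all sit behind:

  import java.io.InputStream;

  /** Hypothetical abstraction: the protocol cares only that the
      payload's information is stored, not how. */
  public interface RdfKnowledgeStore {
    /** Store the payload's information under the given graph URI. */
    void store(String graphUri, InputStream payload, String contentType);

    /** Serialize the stored information back out in a requested format. */
    InputStream retrieve(String graphUri, String acceptType);
  }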

> Ok, rant over.  Anyway, obscure technical standards create a market for good
> documentation, so maybe I should just write a book and make some money
> instead of complaining. ;)

That's what I'm doing (except the money part).  :-)  Which reminds
me... my co-author wants me to finish a chapter soon.  :-(

These documents try to cover every possibility, which does make things
easier if you have a question. Unfortunately, it's unbelievably
complex to cover *everything*, so things will always slip through. The
complexity of the documents (and hence, the difficulty in reading
them) only adds to this problem.

> The larger issue is that there is no "HTTP Protocol for Managing RDF
> Graphs".  HTTP is already well (?) defined; providing "an interpretation" of
> it in terms of RDF Graph database processing just muddies the waters.  First
> sentence of section 3:  "This protocol specifies the semantics of HTTP
> operations for managing a Graph Store".  In my view that is entirely
> inappropriate.  HTTP is about resources, not services; the operations of
> Graph Store management are far beyond the scope of HTTP, and the semantics
> of HTTP operations are already defined.

I sort of agree with you here. To really manage a graph store you need
to be using SPARQL Update (which is sent via a POST, and is therefore
not a part of the REST document).

However, a graph store is a collection of graphs, which *are* defined
in this document as resources. These resources can be uploaded or
replaced with PUT, updated with POST, removed with DELETE, and
retrieved with GET. These operations are all manipulating graphs,
which can be considered to be "managing" the graphs in a graph store.
So I guess they're not strictly wrong, but I don't think it's really
how you want to go about "managing" your store.
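
For illustration, here's a rough sketch of those four operations from
a client's point of view (plain Java, with an invented endpoint; the
?graph= naming convention is the subject of section 4, discussed
below):

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class GraphCrud {
    // Hypothetical graph resource: a store at company.com managing a
    // graph named http://example.com/data (indirect naming, see below).
    static final String GRAPH =
        "http://company.com/sparql?graph=http%3A%2F%2Fexample.com%2Fdata";

    public static void main(String[] args) throws Exception {
      String n3 = "<http://example.com/a> <http://example.com/b> \"c\" .";
      send("PUT", n3);      // create or replace the graph with this content
      send("POST", n3);     // merge this content into the existing graph
      send("GET", null);    // retrieve a serialization of the graph
      send("DELETE", null); // remove the graph entirely
    }

    static void send(String method, String body) throws Exception {
      HttpURLConnection c = (HttpURLConnection) new URL(GRAPH).openConnection();
      c.setRequestMethod(method);
      if (body != null) {
        c.setDoOutput(true);
        c.setRequestProperty("Content-Type", "text/rdf+n3");
        OutputStream out = c.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
      }
      System.out.println(method + " -> " + c.getResponseCode());
      c.disconnect();
    }
  }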

> HTTP request URIs identify resources; the server is free to construct such
> resources in any way it sees fit; whether it uses an RDF DBMS or not is
> explicitly beyond the scope of HTTP.  For example, section 4 gives some
> examples of GET requests glossed with language about "RDF knowledge",
> graphs, etc. etc.  The problem is that this effectively redefines HTTP.  We
> know that GET /foo/bar requests the resource at /foo/bar; that's all.  If
> the request also says "Accept: application/rdf+xml", then we know the client
> wants an RDF/XML representation of the resource; that's all.  How it happens
> is irrelevant.

OK, I've just been through this section, and I don't understand your
problem. In particular, I don't see how you come to the conclusion
that the mechanisms for graph naming lead to a redefinition of HTTP.
Can you provide an example here please?

Section 4 is about how to name graphs. In particular, you're asking a
store to deal with graphs that have a URI, but that URI may not be in
the domain of the server (e.g. your server is at
http://company.com/sparql and the graph in question has the URI
http://example.com/graph). Section 4 is about how to deal with
resources that you cannot reference directly. You end up with a new
URI that is the "resource" that you are referencing when issuing a
GET, PUT, POST, DELETE, etc. There is no redefinition of HTTP going on
at all. Also, this portion of the document doesn't talk about *how*
you're going to store or retrieve data - just how you're going to
address it.
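
Concretely, the indirect naming just percent-encodes the graph's own
URI into the service URI. A tiny sketch, using the invented URIs from
above:

  import java.net.URLEncoder;

  public class GraphNaming {
    public static void main(String[] args) throws Exception {
      String service = "http://company.com/sparql";  // where the store lives
      String graph = "http://example.com/graph";     // the graph's own URI
      // The encoded combination is the "resource" that GET, PUT, POST
      // and DELETE all act on.
      String resource = service + "?graph=" + URLEncoder.encode(graph, "UTF-8");
      System.out.println(resource);
      // prints: http://company.com/sparql?graph=http%3A%2F%2Fexample.com%2Fgraph
    }
  }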

> Or take POST.  As far as HTTP is concerned a POST requests that the origin
> server accept the entity enclosed in the request as a new subordinate of the
> resource identified by the Request-URI in the Request-Line, and the URI in a
> POST request identifies the resource that will handle the enclosed
> entity.  Period.

Your use of the word "Period" indicates that you believe that you have
made a point that contradicts what is found in the SPARQL HTTP
protocol document, yet this document meets the criteria of
http://tools.ietf.org/html/rfc2616#section-9.5 perfectly well. How do
you believe them to differ?

> It should just return 200, 201, or 204, whether it
> involves RDF data or not, whether server acceptance involves an RDF DBMS or
> a file system or whatever.  All defined by RFC 2616.

The protocol document only mentions 201 and 404, both of which are
acceptable in RFC 2616. It didn't mention the other possible return
codes and situations in which they might arise, which is something
that might be pointed out to the authors. But other than that,
everything appears to be in order.

>  If an application
> (e.g. a SPARQL endpoint) wants to standardize additional convention, such as
> returning a "X-Statements-Inserted: 123" header, that's a matter of
> application protocol, not HTTP protocol.

OK, but that's not mentioned in the SPARQL doc here.

> Now, it might make sense to
> provide some explanation of what "subordinate of the resource identified by
> the Request-URI" means when the resources involved are RDF Graphs; but that
> should be handled by an informal, informative NOTE.
> All in all, it seems to me it would be better to call the thing "SPARQL
> Application Protocol" or the like, and replace all the stuff about the
> meanings of the HTTP methods by a simple reference to RFC 2616.

Well, section 4 (the one on graph naming) is obviously required, and I
think you're in agreement there. While I do see an argument for
section 5 (describing the actions of HTTP verbs) being handled
implicitly, I believe there would be too much ambiguity if it were
done that way.

As I mentioned earlier, there are a lot of really poor REST
implementations out there where the authors just didn't understand
that they were supposed to be acting on resources, and accidentally
went off in the direction of RPC instead. Also, there may be some
alternative interpretations of some operations. For instance, because
RDF is both monotonic and has the open-world assumption, some
implementors may think it valid for PUT to simply insert the contents
of a graph additively, rather than replacing the existing graph. In
many other contexts this would be wrong, but the monotonic and open
world assumptions of RDF open this question up to debate.
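
To make the two readings concrete, here's a sketch of the server-side
difference, using Jena's Model API purely as a stand-in for whatever a
store actually does internally:

  import com.hp.hpl.jena.rdf.model.Model;

  public class PutSemantics {
    /** The reading the spec intends: PUT replaces the graph outright. */
    static void put(Model graph, Model payload) {
      graph.removeAll();   // drop whatever was there before
      graph.add(payload);  // the graph now holds exactly the payload
    }

    /** The alternative some implementors might argue for under
        monotonicity: merge the payload into the existing graph. But
        note that this is really POST semantics. */
    static void additivePut(Model graph, Model payload) {
      graph.add(payload);  // old statements survive
    }
  }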

Finally, there is a lot of precedent for redundantly explaining
concepts from foreign documents. This allows the reader to see exactly
what the author had in mind for this particular context (in case of
ambiguity) and also operates as a convenience so that the reader does
not have to try to correlate various sources in order to understand a
single document. Of course, this runs the risk of conflicting
information, but I don't believe this has happened here.

> Actually I'd
> go further and drop the reference to SPARQL, and just call it something like
> "RDF Resource Management Protocols".

I get your point, but remember that the "P" in SPARQL stands for
"Protocol". The document is about a protocol, so it gets published as
part of the SPARQL spec.

>  Maybe add an Informative section to
> the SPARQL Protocol for RDF document discussing (informally) the relation
> between SPARQL and RESTful operations.

Well, the RESTful operations do form a protocol, making them part of
SPARQL, and any query (or indeed update) operations are also part of
SPARQL as they fall into the "Query Language" section. Of course,
anything in the query language portion of the spec will (most likely)
not be a RESTful operation.

So what exactly do you think is the relation between SPARQL and
RESTful operations?

> Actually a better option might be a note discussing "REST Architectures for
> RDF Applications over HTTP" or something like that.

I'm not so sure. According to this spec, REST is used for uploading,
retrieving, deleting, and modifying graphs. An RDF application really
needs to do a lot more. In particular, it needs to be able to work on
the contents of a graph, accessing and manipulating structures built
in RDF. The documented protocol does not permit that.

> For example, regarding
> POST, the request URI identifies a resource "that will handle" the payload.
>  No constraints on what the handler is, only what it does; it doesn't have
> to be an RDF DBMS.  This is a critical point.  The protocol cannot stipulate
> that the handler must, for example, merge incoming triples with an existing
> graph.  It would be perfectly acceptable for the handler to just store
> payloads in files.  Since the action performed by the POST method might not
> result in a resource that can be identified by a URI it is not compelled to
> make such stored payloads available as resources; instead they are
> "subordinate" to the "handler" resource (a/k/a request URI) of the POST.  A
> later GET request for that resource could simply serve up the files and
> leave it to the client to do the graph merging.

If I've read you right, you've just described a hypothetical server
that handles POSTs in a particular way. If so, then sure. This is fine
and works with RFC 2616.

> That is an issue of
> application architecture, and is off-limits to a protocol definition.

It's not touched on by the HTTP definition, but it's certainly not off
limits to a protocol built on top of HTTP. Indeed, any protocol built
on top of an existing protocol must both conform to the original
definition AND constrain it further. If it doesn't conform, then it's
not building on the original protocol, and if it doesn't constrain,
then it's not providing anything in addition to the original protocol.

> It
> might be an insane way to do things, but the point is you cannot predict
> creativity - it might turn out to be the perfect solution for some problem.

The protocol document certainly imposes some restrictions that are
annoying for individual implementors, but in most cases this is
because the spec is trying to allow all implementations to work with
it. Some graph stores have drastically different approaches to things
like graph names, for instance. The spec is trying to work for all of
them (remember, it's a committee process).

> Ditto for PUT, where the request URI identifies the entity enclosed with the
> request  (the "payload") rather than a handler resource, and the request is
> that the enclosed entity be stored under the supplied
> Request-URI.  Pretending we know what "stored under" means, this effectively
> says "make it so that future (GET) requests to this request-URI serve up the
> payload in this POST request (in the appropriate format, etc.)."  Pretty
> simple, IMHO.

(You meant to say "PUT" there, right?)

> Compare that with the first sentence of section 5.1:  A
> request that uses the HTTP PUT method SHOULD store the enclosed RDF payload
> as RDF knowledge.  Which completely misses the point, even if we ignore the
> fact that "RDF knowledge" is meaningless.

I've already stated that it has a very specific meaning, so I needn't
go there again.

With the exception of the use of the word "SHOULD" (and I should try
to work out why they use that modifier) this is perfectly fine, IMO.

Setting aside the fact that content negotiation can be used to change
the format of what is retrieved in a GET (meaning that what has been
PUT may not exactly equal what is returned from a GET), there are
almost no
databases out there that are capable of loading an arbitrary RDF graph
document and extracting it back out in an identical representation. If
you get an RDF/XML document, then the resources can be in any order,
they can be deeply nested, or flattened, and they can use one of
several formats for representing structures such as containers. N3 can
be reordered with no effect, and it can use (or not use) prefixes in
various ways to represent URIs. It also has options for property
paths, and several syntaxes for blank nodes. So it is not possible to
require of a database that if a document is PUT to a particular
resource, then a subsequent GET will retrieve that exact document.

However, if a URI represents some information that is represented in
RDF, then we have a lot more flexibility. Now the exact document that
was PUT no longer matters. Only the information that is represented in
the document matters, and that is identical regardless of format (XML,
N3, JSON, etc.). When specified that way, it becomes possible to
GET exactly what was PUT in the first place. Sure, it's not serialized
identically, but since it's the "information" that is being retrieved
now, and not the specific document, then the HTTP requirements have
been met.
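
By way of illustration, this sketch (Jena again, purely for
convenience) parses the same statement from RDF/XML and from N3 and
confirms that the two graphs carry identical information, even though
the documents look nothing alike:

  import java.io.StringReader;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;

  public class SameInformation {
    public static void main(String[] args) {
      String rdfXml =
          "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'" +
          " xmlns:ex='http://example.com/'>" +
          "<rdf:Description rdf:about='http://example.com/a'>" +
          "<ex:b>c</ex:b>" +
          "</rdf:Description>" +
          "</rdf:RDF>";
      String n3 = "<http://example.com/a> <http://example.com/b> \"c\" .";

      Model fromXml = ModelFactory.createDefaultModel();
      fromXml.read(new StringReader(rdfXml), null, "RDF/XML");
      Model fromN3 = ModelFactory.createDefaultModel();
      fromN3.read(new StringReader(n3), null, "N3");

      // Different serializations, same "RDF knowledge".
      System.out.println(fromXml.isIsomorphicWith(fromN3));  // true
    }
  }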

Incidentally, I've used the word "information" here, in the hope that
it might be clearer to you. The term "RDF knowledge" was coined for
exactly this purpose.

> To summarize:  I see several distinct issues mixed up in the draft.  One is
> interpretation of HTTP methods when the resources involved are RDF
> graphs.  One is the (application) protocol for RDF graph store management.
>  One is the relation of SPARQL to the other two issues.  And there is the
> major issue of bad meta-language ("RDF knowledge" etc.)

If, after reading my responses, you still have issues with these
points, then please write to the SPARQL Working Group to air your
concerns. There are other people who can give you much better answers
on this document, and you may find them to be more satisfactory.

On the other hand, if your points are compelling, then you can have a
direct influence on what shape this document takes. The standard is
currently in the process of soliciting feedback from the community
specifically to address any problems that remain in the documents.
They need to hear from people like you. If the documents are bad, and
no one acts to fix them, then the community has no one to blame but
themselves.

>  As far as Mulgara is concerned, I guess it doesn't much matter what the W3C
> eventually publishes; my impression is that the stuff in ProtocolServlet et
> al. already has most of what will be needed, so it can be adapted relatively
> quickly once the standard is promulgated.  Needs some tweaks to the request
> handling logic, and response codes and formatting need a little polishing,
> but nothing too major, I think.  Although I haven't used all the methods
> yet.

No, I think it's pretty close. As you say, the responses need
tweaking, but it's mostly there.

The ProtocolServlet also tries to treat individual triples as
resources. There's nothing wrong with that per se, but HTTP
has far too much overhead for that to be a useful thing to do. It sort
of made sense at the time from a RESTful point of view, but I don't
think it has any practical benefit.

> With a little work it should be possible to ship Mulgara with both a
> SPARQL-compliant servlet stack and a RESTful RDF interface, such that
> developers can easily clone and modify to create application-specific
> protocols.  Which is a Good Thing, IMO.  I understand that currently the
> monolithic mulgara-x.y.z.jar is the primary product, but I'm liking Mulgara
> as an embeddable, Jetty-like product as well.

I do like that notion. My problem has been figuring out how to make
the two distributions coexist effectively. It was tricky enough
getting the whole thing to build so that both the standalone
monolithic server *and* the WAR distribution operated identically. I'm
reasonably happy with the outcome there, but extending it to work as
something a general Jetty system can load as well is really pushing it
out there. :-)
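
For what it's worth, here's the minimal shape I have in mind, using
the Jetty 6 API (the servlet construction is entirely hypothetical;
the real ProtocolServlet needs Mulgara's session machinery wired into
it somehow):

  import javax.servlet.http.HttpServlet;
  import org.mortbay.jetty.Server;
  import org.mortbay.jetty.servlet.Context;
  import org.mortbay.jetty.servlet.ServletHolder;

  public class EmbeddedMulgara {
    public static void main(String[] args) throws Exception {
      Server server = new Server(8080);
      Context context = new Context(server, "/", Context.SESSIONS);
      context.addServlet(new ServletHolder(createSparqlServlet()), "/sparql/*");
      server.start();
      server.join();
    }

    // Hypothetical factory standing in for however Mulgara's
    // ProtocolServlet actually gets constructed and configured.
    static HttpServlet createSparqlServlet() {
      throw new UnsupportedOperationException("wire in Mulgara's servlet here");
    }
  }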

Right now my priorities are on some regression bugs that crept in
recently (tests that used to pass on 2.1.8 have started to fail; some
of these are due to documents on the web changing, but I need to
confirm that's all it is), a transaction rollback bug that is
affecting 2.1.9, the SPARQL 1.1 implementation (query and update), new
features in the rules engine, and a new (and more scalable) storage
module. Deployments haven't made it onto my list for a while.  :-}

Regards,
Paul

