[Mulgara-dev] RDF/XML parse errors

Alex Hall alexhall at revelytix.com
Tue Oct 28 15:43:15 UTC 2008


All,

I think that the RDF/XML content handler might be a little trigger-happy
about throwing exceptions for problems reported by the ARP parser.  I'm
trying to use Mulgara to parse the RDF/XML output of a third-party web
service.  It turns out that the RDF/XML is not entirely valid, yet Jena
is able to import the document and ignore the invalid portions, while
Mulgara rejects the document altogether.  Both tools use the ARP parser
and see the same parse error, but only Mulgara stops processing when it
encounters the error.

Taking a closer look, the Parser class in the RDF/XML content handler
implements the SAX ErrorHandler interface, which ARP uses to report
problems back to the calling application.  The ErrorHandler interface
specifies three different levels of problem: warning, error, and fatal
error.  The particular problem that I'm seeing falls into the "error"
category.  However, the SAX documentation defines "error" as a
recoverable condition.  As used by ARP, it appears to mean that the
invalid portion of an RDF/XML document will be ignored, and parsing of
the valid portion will continue.
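
To illustrate the SAX contract described above, here is a minimal sketch.
It uses the JDK's built-in validating SAX parser as a stand-in, not ARP or
Mulgara's Parser class, and the class names (RecoverableErrorDemo,
Collector) are made up for the example.  The handler records "error"-level
problems without throwing, and the parser is expected to keep delivering
events for the rest of the document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a stand-in for the behavior discussed, not Mulgara code.
class RecoverableErrorDemo {

    // Records element names and "error"-level problems instead of throwing.
    static class Collector extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final List<String> errors = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void error(SAXParseException e) {
            // Recoverable per the SAX contract: note it and let parsing go on.
            errors.add(e.getMessage());
        }
        // fatalError() is inherited from DefaultHandler, which rethrows.
    }

    static Collector parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // DTD violations arrive via error()
        Collector collector = new Collector();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        // <c/> is not declared in the DTD, so the document is invalid
        // but still well-formed.
        String xml = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>"
                   + "<a><c/><b/></a>";
        Collector c = parse(xml);
        System.out.println("errors reported: " + c.errors.size());
        System.out.println("elements seen:   " + c.elements);
    }
}
```

If error() rethrew the exception instead, parsing would stop at the first
invalid construct, which is the behavior the content handler shows today.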

Of course, the ideal solution is to work with only valid RDF/XML
documents.  But when we're working with generated documents from
external sources, that isn't always possible, and it would be nice to
get the valid content out of such documents on a best-effort basis.  I
see two approaches to mitigate this situation:

1. Change the content handler to log reported errors and continue
parsing, to match the default Jena behavior.  Only fatal errors would
cause an exception to be thrown by the content handler.

2. Make the error handling behavior configurable -- perhaps add an
option for "strict" or "lax" parsing mode.  The quick-and-dirty way
would be to use a Java system property, but I don't like that because
those options tend to get lost over time.  It would be nice to extend
the MulgaraConfig framework to allow for passing configuration options
into system components, but I'm not sure I'm willing to bite off that
task for such a minor change :-)
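
The two options above could be sketched roughly as follows.  This is only
an illustration under stated assumptions: the class name
ConfigurableErrorHandler, the Mode enum, and the use of java.util.logging
are all hypothetical and are not taken from the Mulgara code base.  In LAX
mode it behaves like option 1 (log and continue); in STRICT mode it throws
on "error" as well as on "fatal error".

```java
import java.util.logging.Logger;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical handler; names and logging choice are assumptions for illustration.
class ConfigurableErrorHandler implements ErrorHandler {

    enum Mode { STRICT, LAX }

    private static final Logger LOG =
        Logger.getLogger(ConfigurableErrorHandler.class.getName());

    private final Mode mode;
    private int skippedErrors = 0;

    ConfigurableErrorHandler(Mode mode) {
        this.mode = mode;
    }

    @Override
    public void warning(SAXParseException e) {
        LOG.warning("RDF/XML warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) throws SAXParseException {
        if (mode == Mode.STRICT) {
            throw e; // reject the whole document
        }
        // LAX: the parser skips the invalid portion; record it and move on.
        skippedErrors++;
        LOG.warning("RDF/XML error (ignored): " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        throw e; // never recoverable, regardless of mode
    }

    int getSkippedErrors() {
        return skippedErrors;
    }
}
```

How the mode would be selected (system property, MulgaraConfig, or
something else) is left open here, since that is the part still under
discussion.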

I prefer option 1, but I'm open to option 2.  Thoughts or suggestions?

Regards,
Alex


