[Mulgara-general] mulgara eval

Paul Gearon gearon at ieee.org
Tue Nov 9 18:46:40 UTC 2010


On Mon, Nov 8, 2010 at 5:46 PM, Gregg Reynolds <dev at mobileink.com> wrote:
> On the one hand, we are starting to see lots of stuff like couchdb and
> Google datastore.  Maybe not directly competitors with RDF stores, for many
> purposes they do just as well.

The big difference is the SPARQL query language. NoSQL solutions are
very good at retrieving the data you need, so long as you were able to
figure out beforehand what you were going to need. RDF is more
general purpose, and the query language is where that generality
shows up. The trick is implementing the queries efficiently.

For instance, NoSQL stores are great at finding something like "All
people named Smith", but they are less effective at "All people named
Smith who are related to people named Jones by some kind of 'in-law'
relationship".
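
To make the second kind of question concrete, here is a rough sketch
in Python of answering it by matching triple patterns and joining the
results. Everything in it is made up for illustration: the data, the
predicate names (surname, motherInLaw, subPropertyOf, inLaw) and the
match() helper are assumptions, not anything from Mulgara or SPARQL,
and it only follows the relationship in one direction.

```python
# Hypothetical (subject, predicate, object) triples, for illustration only.
TRIPLES = {
    ("p1", "surname", "Smith"),
    ("p2", "surname", "Smith"),
    ("p3", "surname", "Jones"),
    ("p4", "surname", "Brown"),
    ("p1", "motherInLaw", "p3"),
    ("p2", "friend", "p3"),
    ("p4", "brotherInLaw", "p3"),
    ("motherInLaw", "subPropertyOf", "inLaw"),
    ("brotherInLaw", "subPropertyOf", "inLaw"),
}


def match(s=None, p=None, o=None):
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for t in TRIPLES:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


def smiths_with_jones_in_laws():
    """People named Smith linked to someone named Jones by any in-law predicate."""
    # "Some kind of in-law relationship": any predicate declared under inLaw.
    in_law_preds = {s for s, _, _ in match(p="subPropertyOf", o="inLaw")}
    smiths = {s for s, _, _ in match(p="surname", o="Smith")}
    joneses = {s for s, _, _ in match(p="surname", o="Jones")}
    # Join the three pattern results (subject-to-object direction only).
    return sorted(
        s for s, p, o in match()
        if p in in_law_preds and s in smiths and o in joneses
    )
```

None of the joins had to be planned for when the data was stored,
which is the point of a general-purpose query language. A real engine
would use indexes rather than the full scans used here.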

> On the other hand, it looks (to somebody who doesn't look very often) like
> the RDF datastore space is getting more crowded (which I take as a Good
> Thing).    It's a bit of a mess at the moment; e.g. it isn't clear which
> keywords one should use to find RDF databases in a google search, which I
> take as a sign that nobody entirely understands what these critters are.  In
> any case, I'm satisfied that mulgara works for what I need, but I would
> wonder if anybody in the mulgara world has had time to look at any of the
> newer offerings.  In particular, the open version of Virtuoso, and I'm
> particularly intrigued by Parliament.  Any comparisons?

I'm not that familiar with Virtuoso myself.

Parliament is a storage layer for use with Jena. So it's using Jena's
query engine with their own indexing and storage. The storage system
has a similar structure to an indexing system I'm halfway through
writing for Mulgara. It is designed to handle fast loading, while also
facilitating fast index lookups on any one element of a triple.
However, they have made a tradeoff whereby they are avoiding
duplicates by performing a lookup on every insertion. This greatly
reduces their load speed. I've also tried using BDB like they use for
their indexes, but I'm finding that it isn't performing as well as it
ought to, particularly for data that is entirely in memory.

Currently, Mulgara uses more traditional indexes, which means that it
also does lookups when inserting. This avoids storing duplicates, and
the lookup is not wasted: when it fails to find a triple, it has
already located the position where that triple needs to be inserted.
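
As a minimal sketch of that idea (not Mulgara's actual code), the
Python below uses a sorted in-memory list as a stand-in for a real
on-disk index. The class name SortedTripleIndex is hypothetical. A
single binary search either finds the duplicate or yields the
insertion position.

```python
import bisect


class SortedTripleIndex:
    """Toy ordered index: a sorted list standing in for an on-disk structure."""

    def __init__(self):
        self._triples = []  # always kept in sorted order

    def insert(self, triple):
        """Insert the triple unless present; return True if it was added."""
        # One lookup serves both purposes: duplicate check and position.
        pos = bisect.bisect_left(self._triples, triple)
        if pos < len(self._triples) and self._triples[pos] == triple:
            return False  # duplicate, nothing is stored
        self._triples.insert(pos, triple)
        return True

    def __len__(self):
        return len(self._triples)

    def __iter__(self):
        return iter(self._triples)
```

The cost is that every insertion pays for a search, and on disk that
search usually means seeks.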

The newer index does something completely different (well, it *will*
do something completely different, when I'm finished with it). It
naively presumes that a triple is new, and inserts it blindly. Like
Parliament, almost all writes to the indexes occur at the end of
files, allowing a lot of data to be chunked up into large write
operations, and avoiding a lot of disk seeks. The indexes are then
wrapped in a "virtual index" that covers all committed data, plus the
data in a write transaction. This combined index then handles all the
queries, so queries are a little slower, at least to start with. Once
write operations are committed, a background process merges committed
transaction data with persistent data into a single index, and when
this is complete all queries can be issued against the new unified
index.
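
The scheme described above can be sketched roughly as follows. This
is an in-memory Python toy under stated assumptions, not the design
as implemented: the names AppendOnlySegment and VirtualIndex are
hypothetical, lists and sets stand in for files, and merge() is
called directly where the real thing would run in the background.

```python
class AppendOnlySegment:
    """Data written by one transaction: appended blindly, never looked up."""

    def __init__(self):
        self.log = []

    def append(self, triple):
        self.log.append(triple)  # no duplicate check, always at the end


class VirtualIndex:
    """Presents merged data plus unmerged segments as a single index."""

    def __init__(self):
        self.merged = set()   # unified persistent index
        self.committed = []   # committed segments awaiting merge

    def begin_write(self):
        return AppendOnlySegment()

    def commit(self, segment):
        self.committed.append(segment)

    def query(self, s=None, p=None, o=None, txn=None):
        """Match a pattern (None is a wildcard) across every source."""
        sources = [self.merged] + [seg.log for seg in self.committed]
        if txn is not None:
            sources.append(txn.log)  # a writer also sees its own data
        found = set()  # a set hides duplicates from blind inserts
        for source in sources:
            for t in source:
                if ((s is None or t[0] == s)
                        and (p is None or t[1] == p)
                        and (o is None or t[2] == o)):
                    found.add(t)
        return found

    def merge(self):
        """Fold committed segments into the unified index."""
        for seg in self.committed:
            self.merged.update(seg.log)
        self.committed.clear()
```

Until merge() runs, every query has to visit each outstanding
segment, which is where the early slowdown comes from. Since each
writer appends to its own segment, nothing in the sketch stops two
segments from being open at once.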

The main tradeoff is to allow faster writes by making querying go
slower to start with. Once the index merging is dealt with in the
background, lookups in the indexes should be about as fast as they
are today. The other advantage of merging indexes this way is that it
allows us to have multiple write transactions occurring in parallel.
Currently Mulgara only supports a single writer at a time. I don't
know how Parliament deals with transactions.

The fact that Parliament is hooked into Jena means that they have many
more SPARQL 1.1 features than Mulgara has. Mulgara will get there
eventually, but at this point, it is just a whole lot of new code in a
JavaCC file that hasn't been integrated yet. The Jena query
optimizer is completely different as well.

Regards,
Paul Gearon

