[Mulgara-general] Optimizing variable-graph queries

Alex Hall alexhall at revelytix.com
Fri Nov 6 21:04:00 UTC 2009

Hi Paul,

Paul Gearon wrote:
> Hi Alex,
> On Fri, Nov 6, 2009 at 1:30 PM, Alex Hall <alexhall at revelytix.com> wrote:
>> I'm working on an application that stores many records in Mulgara, all
>> described using the same schema but organized into separate graphs to
>> track provenance information.  I need to collect records from several
>> graphs and apply ordering and a limit in order to, for example, find the
>> 50 most recent records across all graphs along with the graphs in which
>> they appear.
> This should be properly documented, I know, but did you know that if
> you're using lots of graphs then you may want to use the XA indexes?
> In other words, the config file would use:
> <PersistentResolverFactory
> type="org.mulgara.resolver.store.StatementStoreResolverFactory"
> dir="xaStatementStore"/>
> Unfortunately, XA11 was the wrong name for me to give to the new
> statement indexes. It simply represents an optimization of the common
> pattern of only using a couple of graphs. Whereas on the StringPool
> and NodePool XA11 is a real upgrade.
> All the same, I need to document the continued use of
> StatementStoreResolverFactory, since a lot of people need to work with
> multiple graphs. But I think <50 should still be OK.

Yes, I am aware of these differences. I should have mentioned that I
tried the tests with both the XA and XA11 statement indexes (using the
XA11 StringPool for both) and got almost identical results. That
suggests the problem lies in the query plan rather than in the indexes,
as you mention elsewhere.

>> On the other hand, doing everything as individual queries isn't exactly
>> the ideal solution.  For one, I'm developing with everything running
>> locally on my laptop but eventually there will be a network sitting
>> between the client and server.  Also, combining results is something
>> that the database is supposed to be good at.
>> So my question is, is there any sort of optimization that can be done,
>> either in terms of rewriting my SPARQL query or in tweaking the Mulgara
>> query engine, in order to improve the performance of this query?
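
For reference, the individual-query workaround amounts to running one
ORDER BY DESC(?timestamp) LIMIT 50 query per graph and merging on the
client. A rough sketch of the merge step follows. The function name and
the row layout (a list of (timestamp, item) tuples per graph, already
sorted newest-first) are assumptions for illustration, not anything
Mulgara provides.

```python
import heapq
from itertools import islice


def merge_recent(per_graph_rows, limit=50):
    """Merge per-graph result lists, each already sorted newest-first.

    per_graph_rows is a hypothetical dict mapping a graph URI to a list of
    (timestamp, item) tuples, one list per individual query.
    """
    # Tag every row with its graph so provenance survives the merge.
    streams = (
        [(ts, item, graph) for ts, item in rows]
        for graph, rows in per_graph_rows.items()
    )
    # heapq.merge performs a lazy k-way merge; reverse=True matches the
    # descending order the per-graph queries already return.
    merged = heapq.merge(*streams, key=lambda row: row[0], reverse=True)
    return list(islice(merged, limit))
```

Each per-graph query has to use the same limit as the final result,
since in the worst case all of the most recent records live in a single
graph.
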
> I'm curious, doesn't the following query work?
> SELECT ?graph ?item ?timestamp
> WHERE {
>   GRAPH ?graph {
>     ?item :hasTimestamp ?timestamp .
>     # other criteria to identify the records of interest
>   }
> }
> ORDER BY DESC(?timestamp) LIMIT 50
> (ie. no FROM or FROM NAMED)
> I know that I originally coded SPARQL so that this would *not* work,
> but Andy assured me that it should, so I thought I went back and did
> it correctly. If it doesn't work, then please let me know.

It probably does work, but I didn't try it. I don't want results from
*all* graphs, just a known subset of them, which is exactly the use case
for FROM NAMED.

Regardless of my particular use case, it doesn't seem like including or
omitting FROM NAMED should make a difference in the overall
performance.  Based on my reading of
the query code, it looks like both queries would generate identical
plans with the exception that FROM NAMED would cause additional
mulgara:is constraints to be added to the outer-most constraint
expression. If anything, that seems like it would improve performance.
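
To make the comparison concrete, the FROM NAMED form can be assembled
along these lines. This is only a sketch: it assumes the query has the
same shape as the one you suggested, and the helper name, the graph
URIs, and the namespace bound to the ":" prefix are all made up for
illustration.

```python
def build_query(graph_uris, limit=50):
    """Build the variable-graph query restricted to a known set of graphs."""
    # One FROM NAMED clause per graph of interest.
    from_named = "\n".join(f"FROM NAMED <{uri}>" for uri in graph_uris)
    return (
        "PREFIX : <http://example.org/schema#>\n"  # hypothetical namespace
        "SELECT ?graph ?item ?timestamp\n"
        f"{from_named}\n"
        "WHERE {\n"
        "  GRAPH ?graph {\n"
        "    ?item :hasTimestamp ?timestamp .\n"
        "  }\n"
        "}\n"
        f"ORDER BY DESC(?timestamp) LIMIT {limit}"
    )
```

Calling build_query with an empty list gives the unrestricted form, so
the only difference between the two queries is the FROM NAMED clauses.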

> Either way, you're right in that it should be faster. There are unions
> in there, which could be part of the reason, in which case the coming
> optimizations may fix things. You can check on this for me by hitting
> Ctrl-\ on the server console while waiting on a query to respond. If
> you see that you're in a "sort" method, then the next release should
> help you. Otherwise, let me know where you're spending most of the
> time, and I'll look at what's happening with the query plan.

A thoroughly unscientific sampling shows that the query does spend the
majority of its time in the TuplesOperations.sort() method.

BTW, in case anybody was wondering, Ctrl-Break is the Windows equivalent
of Ctrl-\ (thread dump) :-)

