[Mulgara-general] iTQL question regarding the relationship between query size and query speed

Wed Apr 16 15:31:29 UTC 2008

On Wed, Apr 16, 2008 at 7:52 AM, David Moll <DMoll at myaperio.com> wrote:
> I have some questions about iTQL query structure and performance.  Although
> is it technically TQL now?  In any case, I shall preface my question with a
> statement regarding our setup.
>
> We are running the queries on Mulgara 1.1.1, on a Win2k3 server with 8 GB of
> RAM and 2 quad-core Xeon processors, using a 64-bit JVM.

Any chance you can go to 1.2? It's definitely better.

> The queries are
> executed using the ItqlInterpreterBeanService, as our development
> environment is C#, so we do all of the Mulgara commands using SOAP. The
> model I am running the queries against has 521,998 triples, most of them are
> of the form:
>
>
> <Asset:num> <rdfs:type> <viewpoint:Asset>
>
> <Asset:num> <viewpoint:TimeStamp> 'Date Time Literal'
>
> <Asset:num> <viewpoint:name> 'Name Literal'

Two things to ask here.  Are you really using rdfs:type? That isn't a
defined name in the RDFS namespace; it should be rdf:type. Also, are the
URIs really of the form "asset:1234", or is asset a namespace? (This is
relevant to a suggestion I have below.)

> There are 171,000 sets of these three triples, so there are 171,000 unique
> values for "num" in the <Asset:num> identifier.  This is test data that was
> set up for performance testing, as we are attempting to retrieve the "Class
> name" for each "Asset type."  In the schema model there is a triple:
>
> <viewpoint:Asset> <viewpoint:name> 'ClassName Literal'
>
> There are multiple types (more than just Asset), and the goal is to retrieve
> the 'Class Name Literal' for each <Asset:num> URI.  To do this we first
> retrieve all the URIs that have an <rdfs:type> of <viewpoint:Asset>, or are
> a subclass of <viewpoint:Asset>.  Then that list of URIs is used to populate
> the final section of this query (the URIs that are connected with OR
> statements):

<snip query>

I don't see any need for subqueries here, and they will certainly be
slowing you down. I believe the equivalent query is:

select $s $type $classname $p $o
from <rmi://localhost/server1#testdata>
where
  $s <rdfs:type> $type and
  $type <viewpoint:name> $classname in <rmi://localhost/server1#schema> and
  $s $p $o and
  (
    $s <http://mulgara.org/mulgara#is> <asset:79593> or
    $s <http://mulgara.org/mulgara#is> <asset:71600> or
    $s <http://mulgara.org/mulgara#is> <asset:71601> or
    $s <http://mulgara.org/mulgara#is> <asset:71602> or
    $s <http://mulgara.org/mulgara#is> <asset:7992> or
    $s <http://mulgara.org/mulgara#is> <asset:71618> or
    $s <http://mulgara.org/mulgara#is> <asset:71617> or
    $s <http://mulgara.org/mulgara#is> <asset:75590> or
    $s <http://mulgara.org/mulgara#is> <asset:71616> or
    $s <http://mulgara.org/mulgara#is> <asset:71615> or
    $s <http://mulgara.org/mulgara#is> <asset:76716> or
    $s <http://mulgara.org/mulgara#is> <asset:71619> or
    $s <http://mulgara.org/mulgara#is> <asset:76696> or
    $s <http://mulgara.org/mulgara#is> <asset:75661> or
    $s <http://mulgara.org/mulgara#is> <asset:71610>
  );

Of course, every row in the answer then contains only variable bindings,
with no sub-answers.
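
If you generate this query from a list of URIs on the client side, the
flat form is easy to assemble. The following is only a sketch in Python
(your client is C#, so treat it as pseudocode for the string handling);
the function name build_flat_query is made up for illustration, and the
model URIs are the ones from the query above.

```python
MULGARA_IS = "http://mulgara.org/mulgara#is"


def build_flat_query(uris, data_model, schema_model):
    """Build one iTQL query that restricts $s to the given URIs.

    Hypothetical helper: it only assembles the query text and does not
    talk to a Mulgara server.
    """
    if not uris:
        raise ValueError("need at least one URI")
    # One mulgara#is constraint per URI, joined with OR.
    constraints = " or\n".join(
        f"    $s <{MULGARA_IS}> <{uri}>" for uri in uris
    )
    return (
        "select $s $type $classname $p $o\n"
        f"from <{data_model}>\n"
        "where\n"
        "  $s <rdfs:type> $type and\n"
        f"  $type <viewpoint:name> $classname in <{schema_model}> and\n"
        "  $s $p $o and\n"
        "  (\n"
        f"{constraints}\n"
        "  );"
    )


query = build_flat_query(
    ["asset:79593", "asset:71600", "asset:71601"],
    "rmi://localhost/server1#testdata",
    "rmi://localhost/server1#schema",
)
print(query)
```

The resulting string is what you would pass to the interpreter bean in
place of the subquery version.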

> We're trying to get this query as fast as possible, and the first batch of
> tests we performed involved modifying the size of the URI list used to
> generate this query.  Since there are 171,000 URIs with a type of "Asset" in
> the model, we obviously can't retrieve the "Class Name" for all of them in
> one query.  So we do it in chunks of X URIs at a time.  These are the
> results (in minutes:seconds) for different values of X.

OK, from the looks of it, you want all the assets at once, right? And
the only way to tell that they are assets is that the URI starts
with "asset:", correct? (You really should have an rdf:type on assets
to indicate that they are assets!) If that is the case, then you can
check the prefix of all your URIs for the "asset:" label. To do this,
you need to create a prefix model first.  Let's call it
<rmi://localhost/server1#prefixes>:

create <rmi://localhost/server1#prefixes> <mulgara:PrefixModel>;

You only have to create this once.

Now the URIs can be selected by their prefix. The above query can be
rewritten as:

select $s $type $classname $p $o
from <rmi://localhost/server1#testdata>
where
  $s <rdfs:type> $type and
  $type <viewpoint:name> $classname in <rmi://localhost/server1#schema> and
  $s $p $o and
  $s <mulgara:prefix> 'asset:' in <rmi://localhost/server1#prefixes>;

However, if "asset" actually indicates a namespace, then you will need
to use the entire namespace and not the alias. Alternatively, you can
use a URI instead of a literal, which means you can use aliasing:

select $s $type $classname $p $o
from <rmi://localhost/server1#testdata>
where
  $s <rdfs:type> $type and
  $type <viewpoint:name> $classname in <rmi://localhost/server1#schema> and
  $s $p $o and
  $s <mulgara:prefix> <asset:> in <rmi://localhost/server1#prefixes>;
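
To make the difference between the two forms concrete, here is a small
Python sketch that emits the last constraint either way. The helper name
prefix_constraint and the namespace http://example.org/asset# are
assumptions for illustration only; substitute whatever "asset" really
expands to in your data.

```python
def prefix_constraint(prefix, prefix_model, as_uri=False):
    """Return the iTQL constraint limiting $s to URIs starting with prefix.

    Hypothetical helper that only builds the constraint text.
    """
    # Literal form uses single quotes and needs the full namespace;
    # URI form uses angle brackets, so an alias can be used instead.
    term = f"<{prefix}>" if as_uri else f"'{prefix}'"
    return f"$s <mulgara:prefix> {term} in <{prefix_model}>"


PREFIX_MODEL = "rmi://localhost/server1#prefixes"
ASSET_NS = "http://example.org/asset#"  # assumed namespace, not from your data

# Literal form, with the namespace written out in full.
print(prefix_constraint(ASSET_NS, PREFIX_MODEL))
# URI form, relying on an "asset" alias being defined in the session.
print(prefix_constraint("asset:", PREFIX_MODEL, as_uri=True))
```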

<snip/>

> Can the above query be optimized?  Are the multiple OR statements negatively
> impacting the performance of Mulgara?

No.

> Would it be possible to split this up
> into multiple small queries and send them all over in the same SOAP call?

This would only make it slower.

> Any insights or comments are appreciated.

You should try to use AND (conjunctions) rather than subqueries wherever
you can. Conjunctions will be much faster.

Regards,
Paul Gearon