[Mulgara-general] iTQL question regarding the relationship between query size and query speed

Wed Apr 16 12:52:31 UTC 2008

I have some questions about iTQL query structure and performance (or is
it technically TQL now?).  In any case, I shall preface my questions
with a description of our setup.

We are running the queries on Mulgara 1.1.1, on a Win2k3 server with 8
GB of RAM and 2 quad-core Xeon processors, using a 64-bit JVM.  The
queries are executed using the ItqlInterpreterBeanService, as our
development environment is C#, so we do all of the Mulgara commands
using SOAP.  The model I am running the queries against has 521,998
triples, most of which are of the form:

<Asset:num> <rdfs:type> <viewpoint:Asset>
<Asset:num> <viewpoint:TimeStamp> 'Date Time Literal'
<Asset:num> <viewpoint:name> 'Name Literal'

There are 171,000 sets of these three triples, so there are 171,000
unique values for "num" in the <Asset:num> identifier.  This is test
data that was set up for performance testing, as we are attempting to
retrieve the "Class name" for each "Asset type."  In the schema model
there is a triple:

<viewpoint:Asset> <viewpoint:name> 'ClassName Literal'

There are multiple types (more than just Asset), and the goal is to
retrieve the 'ClassName Literal' for each <Asset:num> URI.  To do this
we first retrieve all the URIs that have an <rdfs:type> of
<viewpoint:Asset>, or are a subclass of <viewpoint:Asset>.  That list of
URIs is then used to populate the final section of this query (the
constraints that are joined with "or"):

select
subquery
( select $s
    subquery
    ( select $type
        subquery
        ( select $classname from <rmi://localhost/server1#schema>
          where $type <viewpoint:name> $classname
        )
      from <rmi://localhost/server1#testdata> where $s <rdfs:type> $type
    )
    subquery
    ( select $p $o from <rmi://localhost/server1#testdata> where $s $p $o
    )
  from <rmi://localhost/server1#testdata> where
    $s <http://mulgara.org/mulgara#is> <asset:79593> or
    $s <http://mulgara.org/mulgara#is> <asset:71600> or
    $s <http://mulgara.org/mulgara#is> <asset:71601> or
    $s <http://mulgara.org/mulgara#is> <asset:71602> or
    $s <http://mulgara.org/mulgara#is> <asset:7992> or
    $s <http://mulgara.org/mulgara#is> <asset:71618> or
    $s <http://mulgara.org/mulgara#is> <asset:71617> or
    $s <http://mulgara.org/mulgara#is> <asset:75590> or
    $s <http://mulgara.org/mulgara#is> <asset:71616> or
    $s <http://mulgara.org/mulgara#is> <asset:71615> or
    $s <http://mulgara.org/mulgara#is> <asset:76716> or
    $s <http://mulgara.org/mulgara#is> <asset:71619> or
    $s <http://mulgara.org/mulgara#is> <asset:76696> or
    $s <http://mulgara.org/mulgara#is> <asset:75661> or
    $s <http://mulgara.org/mulgara#is> <asset:71610>
)
from <rmi://localhost/server1#testdata> where $s $p $o ;
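
For readers who want to reproduce the batching, here is a minimal sketch of how the list of URIs could be split into chunks and turned into the "or" constraint block used in the query above.  It is written in Python purely for illustration (our real client is C#), and the helper names chunked and or_constraint are hypothetical, not part of Mulgara or of our code base.

```python
from typing import Iterator, List

# Predicate used in the query above to bind $s to one specific URI.
MULGARA_IS = "<http://mulgara.org/mulgara#is>"


def chunked(uris: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `uris`, each holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(uris), size):
        yield uris[start:start + size]


def or_constraint(uris: List[str], var: str = "$s") -> str:
    """Build the 'or'-joined constraint block for one chunk of URIs."""
    return " or\n".join(f"{var} {MULGARA_IS} <{uri}>" for uri in uris)


if __name__ == "__main__":
    # Hypothetical sample URIs, shaped like the ones in the query above.
    sample = ["asset:79593", "asset:71600", "asset:71601"]
    for batch in chunked(sample, 2):
        print(or_constraint(batch))
        print("--")
```

Each generated block would be placed after the "where" of the middle select, and one query would be sent per chunk.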

We're trying to get this query as fast as possible, and the first batch
of tests we performed involved modifying the size of the URI list used
to generate this query.  Since there are 171,000 URIs with a type of
"Asset" in the model, we obviously can't retrieve the "Class Name" for
all of them in one query.  So we do it in chunks of X URIs at a time.
These are the results (in minutes:seconds) for different values of X.

URIs per query   Total time (min:sec)   Separate queries
4000             3:33                   43
3000             3:53                   57
2000             2:40                   86
1000             2:30                   171
500              2:25                   342
250              2:25                   684
125              2:32                   1368
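
As a cross-check on these figures, the following sketch (Python, for illustration only) recomputes the number of queries as the ceiling of 171,000 / X and derives the average wall-clock time per query from the totals above.  The names RESULTS, to_seconds and summarize are made up for this example; the numbers are the ones reported in this message.

```python
import math
from typing import Dict, List, Tuple

TOTAL_URIS = 171_000  # URIs of type Asset in the test model

# Chunk size X -> total elapsed time (min:sec), as reported above.
RESULTS: Dict[int, str] = {
    4000: "3:33",
    3000: "3:53",
    2000: "2:40",
    1000: "2:30",
    500: "2:25",
    250: "2:25",
    125: "2:32",
}


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string into whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(results: Dict[int, str],
              total: int = TOTAL_URIS) -> List[Tuple[int, int, int, float]]:
    """Return (chunk size, query count, total seconds, seconds per query)."""
    rows = []
    for chunk, elapsed in results.items():
        queries = math.ceil(total / chunk)
        secs = to_seconds(elapsed)
        rows.append((chunk, queries, secs, secs / queries))
    return rows


if __name__ == "__main__":
    for chunk, queries, secs, per_query in summarize(RESULTS):
        print(f"X={chunk:5d}  queries={queries:5d}  "
              f"total={secs:4d}s  avg={per_query:.2f}s/query")
```

The average per query falls from roughly 5 seconds at X = 4000 to roughly 0.1 seconds at X = 125, while the total stays within about a minute and a half across all runs.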

So when X drops below 250, the overhead of performing the SOAP calls
outweighs the cost of answering the queries, but above 2000 the actual
time to process each query seems to be the limiting factor.  I'm not
sure why the time was lower for 4000 than for 3000.  I was unable to try
this with X = 5000 because the SOAP call fails.

All right, here are the actual questions.

Can the above query be optimized?  Are the multiple OR statements
negatively impacting the performance of Mulgara?  Would it be possible
to split this up into multiple small queries and send them all over in
the same SOAP call?

Any insights or comments are appreciated.

Thanks,

David Moll

Software Engineer

dmoll at myaperio.com

Viewpoint Data Management LLC
