[Mulgara-general] "Sweet spot" for model size?

Eric Freese freese.eric at gmail.com
Tue Feb 26 03:08:50 UTC 2008


Date: Mon, 25 Feb 2008 14:15:12 +1000
From: Andrae Muys <andrae at netymon.com>

>On 25/02/2008, at 9:27 AM, Eric Freese wrote:
>
>>> Date: Sat, 23 Feb 2008 18:21:55 -0800
>>> From: "Life is hard, and then you die" <ronald at innovation.ch>
>>> Subject: Re: [Mulgara-general] "Sweet spot" for model size?
>>
>>> Looks like you never got a reply - sorry about that.
>>
>> No problem
>>
>>> On Thu, Feb 14, 2008 at 10:30:46AM -0500, Eric Freese wrote:
>>>>
>>>> I'm new to Mulgara and very impressed thus far.  My main question at
>>>> this point centers around any tips the community might have as to
>>>> the
>>>> best way to organize models.
>>>>
>>>> I'm trying to load the dbpedia (dbpedia.org) RDF datasets into
>>>> mulgara.  The initial loads went pretty fast but they seem to be
>>>> getting progressively slower as more and more triples are added.
>>>> What
>>>> I'm wondering is if there is a suggested number of triples to have
>>>> within a model or a suggested strategy for how models should be
>>>> organized.
>>>
>>> It doesn't really matter how many triples per model or how many
>>> models you have, because mulgara stores everything as four-tuples
>>> with the model being the fourth element (i.e. "(s, p, o) in m" is
>>> stored as "(s, p, o, m)") and because the indexes are fully
>>> symmetric and treat all elements of the four-tuple equally.
>>>
>>> How many triples are you loading and where do you start to see a
>>> noticeable slow-down? Are you using "insert" or "load" to load the
>>> triples, and are you doing this in auto-commit mode or in separate
>>> transactions?
>>
>> Right now I have in the neighborhood of 30 million statements loaded.
>> I believe the entire main dataset is around 100 million statements.
>> There are additional components that contain 2 billion statements, but
>> I don't think I'm going to try to replicate those just yet.  Things
>> have gotten progressively slower as more statements are loaded.  I'm
loading individual files of varying sizes.  Some contain a few tens
of thousands of statements; others have 2 million or more.  I'm using
>> "load" to add each file into the model.  I'm not doing anything
>> special so I'm assuming I'm using the auto-commit.
>>
>> In reading some of the other messages, I'm guessing that the delay is
>> caused by the increased time to update the indexes as they get larger
>> and larger.  Does that sound correct?
>
> Yes, that sounds correct.  I am interested in what sort of machine you
> are running: specifically how many HDDs and in what configuration,
> and how much RAM?  Also of interest is which JVM you are using, and
> whether you are using the 32- or 64-bit runtime.

Quad-core processor
2 x 320GB hard drives - one has the source data and the other has the
database, so I'm not reading and writing to the same drive (both about
20% full)
RAM: 6GB
JVM: 1.5.0_13 64-bit, mixed mode
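
For what it's worth, I haven't been doing anything beyond plain
auto-commit loads.  If I'm reading the iTQL docs correctly, batching
several loads into one explicit transaction would look roughly like
this (the file and model URIs below are just placeholders, not my
actual setup):

```
set autocommit off;
load <file:/data/dbpedia/articles_1.nt> into <rmi://localhost/server1#dbpedia>;
load <file:/data/dbpedia/articles_2.nt> into <rmi://localhost/server1#dbpedia>;
commit;
set autocommit on;
```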

>> Something else I've started running into is files not loading and
>> getting a "javax.transaction.RollbackException: null" message on
>> larger files (1 million or more statements).  When I split them into
>> smaller files, they load just fine.  Any suggestions?  Should I
>> increase the max memory for my JVM?
>>
>> I'm wondering if I should load each file into its own model and then
>> use views to combine them.  Are there any performance issues (similar
>> to db joins) in using views?  Are there other pros/cons to this
>> strategy?  If I read the docs correctly, a model can participate in
>> more than one view, correct?
>
> I must admit there are currently some performance issues with
> performing complex queries against views.  Unfortunately the unions
> aren't as transparent to some of the join optimisations as we would
> like.  On the other hand there is absolutely no reason why they
> couldn't be.  Our optimisations and enhancements are mostly user
> driven, so if mulgara users start reporting this as a problem
> affecting them, it will get fixed.

So are you suggesting that I shouldn't load each file into its own
model?  I
don't know how complex the queries themselves would be, but the unions
and intersections might get a little complex.  Are there some
optimizations I can do locally to make things work more efficiently?
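
To make it concrete, the queries I have in mind would mostly be plain
unions across the per-file models, something like this (assuming I'm
reading the iTQL model-expression syntax right; the URIs are
placeholders):

```
select $s $p $o
from <rmi://localhost/server1#dbpedia1> or <rmi://localhost/server1#dbpedia2>
where $s $p $o;
```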

Thanks!
Eric


