[Mulgara-general] "Sweet spot" for model size?
Eric Freese
freese.eric at gmail.com
Tue Feb 26 03:08:50 UTC 2008
Date: Mon, 25 Feb 2008 14:15:12 +1000
From: Andrae Muys <andrae at netymon.com>
>On 25/02/2008, at 9:27 AM, Eric Freese wrote:
>
>>> Date: Sat, 23 Feb 2008 18:21:55 -0800
>>> From: "Life is hard, and then you die" <ronald at innovation.ch>
>>> Subject: Re: [Mulgara-general] "Sweet spot" for model size?
>>
>>> Looks like you never got a reply - sorry about that.
>>
>> No problem
>>
>>> On Thu, Feb 14, 2008 at 10:30:46AM -0500, Eric Freese wrote:
>>>>
>>>> I'm new to Mulgara and very impressed thus far. My main question
>>>> at this point centers on any tips the community might have as to
>>>> the best way to organize models.
>>>>
>>>> I'm trying to load the dbpedia (dbpedia.org) RDF datasets into
>>>> mulgara. The initial loads went pretty fast, but they seem to be
>>>> getting progressively slower as more and more triples are added.
>>>> What I'm wondering is whether there is a suggested number of
>>>> triples to have within a model, or a suggested strategy for how
>>>> models should be organized.
>>>
>>> It doesn't really matter how many triples per model or how many
>>> models you have, because mulgara stores everything as four-tuples
>>> with the model being the fourth element (i.e. "(s, p, o) in m" is
>>> stored as "(s, p, o, m)") and because the indexes are fully
>>> symmetric and treat all elements of the four-tuple equally.
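(If I'm following, that means any constraint is resolved against the
same quad indexes no matter which model it names. So, with made-up
model names, a query like

  select $s $p $o from <rmi://localhost/server1#m1>
  where $s $p $o in <rmi://localhost/server1#m2>;

hits the same index structures either way, and splitting data across
models shouldn't change the lookup cost. Correct me if I've misread
that.)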
>>>
>>> How many triples are you loading, and where do you start to see a
>>> noticeable slow-down? Are you using "insert" or "load" to load the
>>> triples, and are you doing this in auto-commit mode or in separate
>>> transactions?
>>
>> Right now I have in the neighborhood of 30 million statements loaded.
>> I believe the entire main dataset is around 100 million statements.
>> There are additional components that contain 2 billion statements, but
>> I don't think I'm going to try to replicate those just yet. Things
>> have gotten progressively slower as more statements are loaded. I'm
>> loading individual files of varying sizes. Some contain a few tens of
>> thousands of statements; others have 2 million or more. I'm using
>> "load" to add each file into the model. I'm not doing anything
>> special, so I'm assuming I'm using auto-commit.
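For reference, each load is just the plain iTQL form, along these
lines (file and model names made up):

  load <file:/data/dbpedia/infobox_part01.nt>
    into <rmi://localhost/server1#dbpedia>;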
>>
>> In reading some of the other messages, I'm guessing that the delay is
>> caused by the increased time to update the indexes as they get larger
>> and larger. Does that sound correct?
>
> Yes, that sounds correct. I am interested in what sort of machine
> you are running: specifically, how many HDDs and in what
> configuration, and how much RAM? Also of interest is which JVM you
> are using, and whether you are using the 32- or 64-bit runtime.
Quad-core processor
2 x 320GB hard drives - one holds the source data and the other holds
the database, so I'm not reading and writing to the same drive (both
about 20% full)
RAM: 6GB
JVM: 1.5.0_13, 64-bit, mixed mode
>> Something else I've started running into: some larger files (1
>> million or more statements) fail to load, and I get a
>> "javax.transaction.RollbackException: null" message. When I split
>> them into smaller files, they load just fine. Any suggestions?
>> Should I increase the max memory for my JVM?
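For what it's worth, the splitting workaround just means loading the
pieces one at a time, and I've been wondering whether grouping them
into one explicit transaction would behave any better, e.g. (file
names made up):

  set autocommit off;
  load <file:/data/dbpedia/articles_aa.nt>
    into <rmi://localhost/server1#dbpedia>;
  load <file:/data/dbpedia/articles_ab.nt>
    into <rmi://localhost/server1#dbpedia>;
  commit;
  set autocommit on;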
>>
>> I'm wondering if I should load each file into its own model and then
>> use views to combine them. Are there any performance issues (similar
>> to db joins) in using views? Are there other pros/cons to this
>> strategy? If I read the docs correctly, a model can participate in
>> more than one view, correct?
>
> I must admit there are currently some performance issues with
> complex queries against views. Unfortunately, the unions aren't as
> transparent to some of the join optimisations as we would like. On
> the other hand, there is absolutely no reason why they couldn't be.
> Our optimisations and enhancements are mostly user-driven, so if
> mulgara users start reporting this as a problem affecting them, it
> will get fixed.
So are you suggesting that I shouldn't load each file into its own
model? I don't know how complex the queries themselves would be, but
the unions and intersections might get a little complex. Are there
some optimizations I can do locally to make things work more
efficiently?
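For example, with one model per file (names made up), I assumed I
could combine the models directly in the from clause rather than
defining a view, along the lines of:

  select $s $p $o
  from <rmi://localhost/server1#articles>
    or <rmi://localhost/server1#categories>
  where $s $p $o;

with "and" in place of "or" for an intersection.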
Thanks!
Eric