[Mulgara-dev] Update Patterns

Thu Dec 14 00:37:48 UTC 2006

> > Initially, adding many batches of around 50 to 100 triples at a
> > time.  
> Roughly how many is 'many' - 1000, 10000, 1000000 ?
> and over what sort of timeframe?  seconds, minutes, hours?

It depends on the institution using our software.  For one,
at "initial load time" we're talking 2-3 million, over
a period of a couple of days.  But I would say most institutions
are on the order of 10-100k, over a few minutes to a few hours.

> That was the size I expected updates to be - again what 
> sort of volumes are you using, and how many TPS?

One user I know of has ~10k of these types of updates
occurring every couple of days.  It's always in batches,
and last I heard, they were getting around four to five
updates/second.  That number includes a lot of
non-triplestore-related processing that goes on for each
update in their system.

> I'm working on developing a persistent external-memory trie 
> datastructure for this sort of data.  So I'm very interested in 
> knowing how much of that length is shared prefix?   (as tries are 
> really good at storing shared prefixes :)

Hmm...in our case, almost all URIs are going to have at least the 
first 16 characters in common.  I can certainly get you a 
representative graph if you'd find it useful for analysis.  Just 
tell me how many triples you want and in what format.
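
As a rough sketch of how the shared-prefix figure could be checked before
sending a full graph, the snippet below measures the common prefix of a set
of URIs.  The sample URIs and the helper name shared_prefix_stats are
hypothetical, made up for illustration; real input would be the URIs pulled
from the store.

```python
import os

# Hypothetical sample URIs (assumption: not taken from any real dataset).
uris = [
    "http://example.org/objects/demo:1/title",
    "http://example.org/objects/demo:2/created",
    "http://example.org/objects/demo:31/title",
]


def shared_prefix_stats(values):
    """Return (length of the prefix shared by all values, average length)."""
    # os.path.commonprefix compares character by character, so it works on
    # arbitrary strings, not just filesystem paths.
    prefix = os.path.commonprefix(values)
    average = sum(len(v) for v in values) / len(values)
    return len(prefix), average


prefix_len, avg_len = shared_prefix_stats(uris)
print(f"shared prefix: {prefix_len} chars, average URI length: {avg_len:.1f}")
```

This only reports the prefix common to every URI; a per-namespace breakdown
would need grouping first.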

- Chris

________________________________

From: Andrae Muys [mailto:andrae at netymon.com]
Sent: Wed 12/13/2006 6:14 PM
To: Chris Wilper
Cc: mulgara-dev at mulgara.org
Subject: Re: [Mulgara-dev] Update Patterns

On 14/12/2006, at 1:19 AM, Chris Wilper wrote:

> Hi Andrae,
>
> > What sort of insert/deletes are people doing?
> > Are deletes/inserts normally paired in a single transaction?
> > How many statements are you insert/deleting in a
> > single update?  Can they be categorised?
> > If so what is the frequency of the different categories?
> Initially, adding many batches of around 50 to 100 triples at a
> time.  Most (say 75%) of the triples represent literal properties.
> Of those, probably a third are datatyped.  The majority
> (say 75%) of our datatyped literals are xsd:dateTimes.
> As for URIs used in triples, an off the cuff guess is that 75%
> of them are distinct.
Roughly how many is 'many' - 1000, 10000, 1000000 ?
and over what sort of timeframe?  seconds, minutes, hours?

> Update operations are smaller: we usually need to update only
> 5-20 triples at a time, and accomplish that via a series of deletes
> and adds as a single transaction.
That was the size I expected updates to be - again what sort of 
volumes are you using, and how many TPS?
> > What is the 'shape' of the data you insert?
> >   (ie. many mostly independent sub-graphs describing different
> > instances; or fewer instances with lots of interconnections and
> > object-reference properties?)
>
> Mostly independent sub-graphs, with diameter 2, consisting
> of a total of 50-100 triples each.  Note that there are definitely
> connections between the sub-graphs, they are just relatively
> few.

> > Is any significant % of your literals replicated?
> > What % of the data are Blank-nodes?
>
> 0% are BNodes, thankfully.  We don't do triplestore-to-triplestore
> replication right now...but BNodes would appear to complicate the
> problem.
It does depend.  The planned replication for Mulgara is done as a 
dual-space transfer, so bnode-bnode replication is easy for us.  
OTOH, I don't think it's possible without the special access we have 
to the low-level bnode representations and the ability to control 
that to permit transfer between graphs - so I agree, user-level 
replication of bnodes is going to be nasty.
> > What is the average length of a URI?
>
> Average?  Probably 60-70 characters.
I'm working on developing a persistent external-memory trie 
datastructure for this sort of data.  So I'm very interested in 
knowing how much of that length is shared prefix?   (as tries are 
really good at storing shared prefixes :)
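
To illustrate why shared prefixes matter for a trie, here is a minimal
in-memory sketch.  It is not the persistent external-memory structure
described above, only a plain character trie; the class and function names
and the example URIs are assumptions chosen for the illustration.  The point
is that keys with a common prefix share nodes, so the node count grows more
slowly than the total number of characters stored.

```python
class TrieNode:
    """One node per character; children are keyed by the next character."""

    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a stored key ends at this node


def insert(root, key):
    """Add key to the trie, reusing nodes for any prefix already present."""
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True


def contains(root, key):
    """Return True only if key was inserted as a complete key."""
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal


def count_nodes(node):
    """Count all nodes below node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical URIs sharing the prefix "http://example.org/".
uris = ["http://example.org/a", "http://example.org/b"]
root = TrieNode()
for uri in uris:
    insert(root, uri)

total_chars = sum(len(u) for u in uris)
print(f"characters stored: {total_chars}, trie nodes: {count_nodes(root)}")
```

With these two keys the shared prefix is stored once, so the trie needs
fewer nodes than there are characters in the input.
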
> > What is the average length of a Literal?
>
> About 50 characters, I would guess.
mmm. I am wondering if it might not be worth compressing the 
individual literals in the store-layer.  But ultimately the right 
answer there has to do with the access patterns for data - if we 
don't have sufficient locality given the access patterns, then we 
will tend to pay 1 IO per lookup anyway.  If we do get clustering in 
the string pool, then compression may improve the cache utilisation.  
I suppose I'll just have to measure it ;)
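
A rough sketch of the kind of measurement meant here, using Python's zlib
as a stand-in compressor.  The literals are synthetic and the "block" is
simply the literals joined together, which is only an assumption about what
clustering in the string pool might look like; none of this reflects the
actual store layer.

```python
import zlib

# Synthetic literals of roughly 50 characters each (assumption: made up
# for illustration, not drawn from real data).
literals = [
    f"Annual report of the example archive, volume {i}" for i in range(200)
]


def individually_compressed_size(values):
    """Total bytes when every literal is compressed on its own."""
    return sum(len(zlib.compress(v.encode("utf-8"))) for v in values)


def block_compressed_size(values):
    """Bytes when all literals are compressed together as one block."""
    return len(zlib.compress("\n".join(values).encode("utf-8")))


raw_size = sum(len(v.encode("utf-8")) for v in literals)
individual_size = individually_compressed_size(literals)
block_size = block_compressed_size(literals)

print(f"raw: {raw_size} bytes")
print(f"compressed one literal at a time: {individual_size} bytes")
print(f"compressed as one block: {block_size} bytes")
```

Short strings compressed one at a time often gain little, since each carries
its own header and has little redundancy to exploit, whereas a block of
similar literals can share that redundancy.  Whether that helps in practice
still depends on the access patterns, as noted above.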

Andrae

--
Andrae Muys
andrae at netymon.com
Principal Mulgara Consultant
Netymon Pty Ltd
