[Mulgara-general] size of XML-literals

thomas thomas at stray.net
Sun Jun 17 10:55:42 UTC 2007



--On 16 June 2007 21:20:57 -0500 Paul Gearon <gearon at ieee.org> wrote:

>
> On Jun 16, 2007, at 7:31 AM, thomas wrote:
>
>> --On 15 June 2007 20:31:23 -0500 Paul Gearon <gearon at ieee.org> wrote:
>> i looked into the "File based resolver" thread from mulgara-dev
>> <http://mulgara.org/pipermail/mulgara-dev/2007-February/000345.html>.
>> it's an interesting read and i want to think about it a little more.
>> i like how cautious andrae is with the rdf model.
>
> I've yet to see an argument pointing out where it's wrong.  Really.  I
> suspect that there's a communications disconnect here, where Andrae's
> concerns are based on something I haven't explained correctly.  But
> without details I'm at a loss as to where I need to clarify.

still thinking about it...

> If I really have missed something, then I still suspect that there's a
> way to tweak the approach so that it would work in all cases.  It will
> certainly work in all the use cases I have for it.

that's my status as well

>>> It wouldn't be too onerous to expand our file access.  All the
>>> underlying classes for access are based on 64 bit offsets anyway, so
>>> it should be relatively easy.  It's funny, but no one has asked for
>>> more literals before.  Most RDF just uses the short data types (i.e.
>>> <72 bytes) and we're effectively unlimited there.  The thing everyone
>>> has asked for before now is scalability in triples, not in literals.
>>> :-)
>>
>> academia! ;-)
>
> No, it's more industry.  The best (and IMHO principal) purpose for RDF is
> not as a KMS, but as a way to "link" objects on the web.  Indeed, given

the question is whether you keep RDF in the meta space - talking about
data - or use it for "real life" - talking data (itself) - as well. maybe
it's just not made for the latter, and maybe it would be more consistent to
let it stay in the meta space and concentrate on strengthening the grasp of
this meta space on the data space.
but on the other hand this is an arbitrary distinction. your meta data
is my real data and vice versa. i'm obsessed with the wish to make those
distinctions fluid, to make them explicit and thereby manipulable and
adaptable. that's why i tend to suck everything into the metastore, break
it down into its basic entities and open up the possibility of
rearrangement. that is best done if everything is not just linked but
inside the store. maybe it's just a matter of ease of access, control,
feasibility - i'm not sure. i fear that i tend to replicate everything,
though ;-)

> the flexibility of URIs, it can be used to link anything on the internet.
> This is where a lot of industrial applications are taking place.  It
> requires few literals, and most of the URIs are short enough to fit into
> the 72 character limit.
>
>> * so there seem to be 4 options:
>>
>> * 64-bit
>> a switch to 64 bit would square the storage slots, right? if that
>> can easily be done, it sounds like a good idea.
>> but what consequences would it have on the hardware/OS side? can
>> the majority of the installed base of desktop computers handle that?
>> and wouldn't it increase the standard file sizes enormously? i'm
>> already scared every time i instantiate a new server.
>
> :-)
>
> The files would only grow as you inserted data into them.  If you put a
> lot of large literals in, then you have to understand that you'll be
> using a lot of disk space!
>
> Also, note that we already use 64 bit access for many of the files, even
> on a 32 bit system.  There are 2 differences on a 64 bit system, compared
> to a 32 bit system:
> - The chance to manage the 64 bit offsets with a single word.  This
> should optimize some aspects of file access.
> - We can memory map the files.
> 64 bit access is done with read(ByteBuffer,long) and
> write(ByteBuffer,long) methods in java.nio.channels.FileChannel.  Until
> this API came out in Java 1.4, there were no methods which accepted a
> 64bit long as an offset.
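
To make the positional access described above concrete, here is a minimal sketch of the two FileChannel methods named in that paragraph. The class and method names (PositionalIo, writeAt, readAt, roundTrip) are hypothetical and are not Mulgara code; the temp file merely stands in for one of the store's data files.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not a Mulgara class.
final class PositionalIo {

    // Writes all of data starting at the absolute 64 bit offset.
    // FileChannel.write(ByteBuffer, long) does not move the channel's own position.
    static void writeAt(FileChannel channel, long offset, byte[] data) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            offset += channel.write(buf, offset);
        }
    }

    // Reads exactly length bytes starting at the absolute 64 bit offset.
    static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        while (buf.hasRemaining()) {
            int n = channel.read(buf, offset + buf.position());
            if (n < 0) {
                throw new EOFException("file ends before offset " + (offset + length));
            }
        }
        return buf.array();
    }

    // Writes text at offset in a temp file, reads it back, and returns what was read.
    static String roundTrip(long offset, String text) {
        try {
            Path file = Files.createTempFile("literal", ".dat");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
                writeAt(ch, offset, bytes);
                return new String(readAt(ch, offset, bytes.length), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the offset parameter is a long, the same calls work for positions beyond 4 GB on a 32 bit JVM, which is the point made above; whether the underlying filesystem accepts files that large is a separate question.
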
>
> As for the current size.... there are a few things that can be done in
> the XA system, and I hope to get that done in the next few months.  The
> most obvious of these is to dramatically cut the size of the files ending
> in _fl_ph.  These are done in a particularly stupid manner, and hold
> impossibly sparse data.
>
> XA2 will use significantly less space.  This alone will help with its
> speed.
>
>> * file-size
>> the number of slots for a given data size is rather arbitrary (for
>> any other size than the URI-string, that is) and doesn't necessarily
>> fit with the average usage scenario.
>> but as the average usage scenario is obviously quite arbitrary
>> as well, and a change in this part of the system - more files for
>> certain slots, or dynamically increasing files - is probably not so
>> easy to implement, fiddling with the file sizes seems like a weak
>> option.
>
> It's not arbitrary at all, but I appreciate that from a user's
> perspective it would be.

well, i saw that you didn't just roll the dice, but i put it the wrong
way. it's clear that the file sizes are developed from one end - the
size of the average URI-string. calling this decision arbitrary is
misleading since it's well founded in the most common use of the store.

> I'm not sure what you're saying in this point, so it's hard to comment.
> However, with the current string pool design we have two options:
> 1.  Bigger files for data of each given size.  (this means using 64 bit
> access into the files)
> 2.  More than one file for data of each given size.  (this lets us stick
> to 32 bit access)

that's what i meant to say :-)

> In theory, there's no real difference, except the number of file handles.
> (However, file handles are a limited resource, so this can be an issue.)
>
> In practice, it depends on the implementation of the filesystem.  Some
> file systems are going to be better at going to offsets in a single large
> file.  Others are going to be better if the file is split into chunks.  I
> suspect that larger files are better in most cases, but  I haven't kept
> up with filesystem design for some time.
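
The two options above address the same bytes in different ways. As a rough sketch: option 1 uses a 64 bit offset into one large file directly, while option 2 splits that offset into a file number plus an offset small enough for 32 bit access. The chunk size of 2^31 bytes per file and the names below are assumptions made for illustration; the real string pool layout is not described in this thread.

```java
// Hypothetical illustration only; not a Mulgara class.
final class ChunkedOffsets {

    // Assumed limit: each file holds at most 2^31 bytes, so the
    // in-file offset always fits in a non-negative int.
    static final int CHUNK_BITS = 31;
    static final long CHUNK_SIZE = 1L << CHUNK_BITS;

    // Which file of the set holds the byte at this logical 64 bit offset.
    static int fileIndex(long logicalOffset) {
        return (int) (logicalOffset >>> CHUNK_BITS);
    }

    // Where inside that file the byte lives.
    static int offsetInFile(long logicalOffset) {
        return (int) (logicalOffset & (CHUNK_SIZE - 1));
    }

    // Inverse mapping: rebuild the logical offset from the pair.
    static long logicalOffset(int fileIndex, int offsetInFile) {
        return ((long) fileIndex << CHUNK_BITS) | offsetInFile;
    }
}
```

For example, a logical offset of 5 GiB + 7 bytes maps to file 2 at in-file offset 2^30 + 7 under this assumed chunk size. The arithmetic is cheap either way; as noted above, the practical difference is the number of open file handles and how the filesystem copes with one large file versus several smaller ones.
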

and which one would be easier to implement? i personally would just hope 
for a quick solution that holds until XA2 comes along, not for something 
perfect.

>> * file-resolver
>> as i said above i found andrae's concerns well worth considering but
>> i'm quite confident that there is a semantically sound solution to
>> the problem (or has to be, since the problem is big enough to
>> justify a stretch in the semantic model ;-)
>> it seems like a good solution since there'll always be files that
>> are just too big for the data slots. eg the moment when i start to
>> manage my dvd-backups within the store.
>> practically i fear this solution because it needs to be implemented
>> first. which not only takes time but also pulls resources away from
>> the much more exciting XA2. or would you take the file resolver over
>> to XA2 anyway?
>
> Such a resolver would be completely independent of XA2.  In fact, it
> might be a good beginner project for someone wanting to learn how to code
> in Mulgara.

<snip />

>> ** if the file-resolver is planned for XA2 as well, this sounds
>> like the best solution.
>
> Yes, it would work just fine for XA2.  We just need someone who is
> prepared to give it a try.  How good are you at Java?  :-)

good question! i'm still quite bad but maybe in a few months... i'll keep 
that in mind (always dreamed of becoming a mulgara committer!)

>> ** if XA2 takes longer or the file resolver takes too much effort,
>> going for 64-bit seems like the best solution, even if it would
>> narrow down the number of machines that can handle it significantly.
>
> It wouldn't narrow down the number of machines.  But the 64 bit machines
> would be quicker.

sounds fair

>> ** if increasing the file sizes for certain slots (say between 1kb
>> and 8000kb) isn't much of a problem, that seems like a balanced
>> workaround
>
> The only way to do that is to move those files to 64 bit access.  The
> "arbitrary" sizes you mentioned earlier are based on the limit of 32 bit
> access.

sorry, i put it the wrong way. i meant: increasing the number of files for 
certain slots. but i still don't think that's the best idea.

<snip />

ciao
thomas


mailto:thomas at stray.net
http://stray.net




: accumulated wisdom
. early optimization is the root of many evil [donald e. knuth]
. if you've got a hammer every problem looks like a nail
. the difference between theory and practice is always greater
  in practice than it is in theory


