[Mulgara-general] size of XML-literals

thomas thomas at stray.net
Sat Jun 16 12:31:08 UTC 2007


--On 15. Juni 2007 20:31:23 -0500 Paul Gearon <gearon at ieee.org> wrote:

<snip />

> Andrae doesn't like it, but I think there's a place for a resolver that
> lets you make statements about file URLs.  Notice I say "URL" since it
> refers to a file that actually exists, and it's still a valid URI.  Once
> you make such a statement, the resolver then gets to take control of that
> file, and store it away where you're not allowed to modify it.  The
> resolver would be able to return the contents of such a file as a literal.

i looked into the "File based resolver" thread from mulgara-dev 
<http://mulgara.org/pipermail/mulgara-dev/2007-February/000345.html>. it's 
an interesting read and i want to think about it a little more. i like how 
andrae is cautious with the rdf model.

> So contents of the file are being handled by the filesystem, but the
> location and access to the contents can be done with Mulgara.  So long as
> you don't try to go in and modify files belonging to the resolver, then
> all access will be wrapped up in the same transactions as the rest of
> Mulgara.  That's not an onerous restriction, as every database requires
> this.  You don't go and change the contents of /var/lib/mysql, do you?
> The only difference here is that some of the files are in a format that
> you know how to edit in an external tool.

yep

> In effect, you'd get what you asked for above, but you wouldn't know that
> you're actually loading up little files into the directories controlled
> by Mulgara.  :-)  So long as we didn't have too many files in a single
> directory, this would be fine (a hashing algorithm to determine the
> subdirectory would work here - just like iPods use).  Storing each "file"
> in its own file, rather than lumping them all into one of our stringpool
> files would be much more efficient for space too.
>
>> but i just realize that the slots for textual content are exactly
>> as rare... well, i would be very glad if that could be changed! i'm
>> not even sure if externally saving them is a good enough solution:
>> could they be included into the search-index even if they are
>> outside the store?
>
> I have to check, but if you're using a Lucene model, then I'm pretty sure
> that Lucene does the storage, and not us.  So the limits are whatever
> Lucene has, and not what we say.

i just checked the docs and realized my misunderstanding. actually i would 
either have to copy the objects over to the lucene model (if they are 
inside the store) or reference them by url (if they are outside). another 
story, maybe another thread, later.

> It wouldn't be too onerous to expand our file access.  All the underlying
> classes for access are based on 64 bit offsets anyway, so it should be
> relatively easy.  It's funny, but no one has asked for more literals
> before.  Most RDF just uses the short data types (i.e. < 72 bytes) and we're
> effectively unlimited there.  The thing everyone has asked for before now
> is scalability in triples, not in literals.  :-)

academia! ;-)


* so there seem to be 4 options:

* 64-bit
a switch to 64 bit would square the number of storage slots, right? if that 
can easily be done, it sounds like a good idea.
but what consequences would it have on the hardware/OS side? can the 
majority of the installed base of desktop computers handle that?
and wouldn't it increase the standard file sizes enormously? i'm already 
scared every time i instantiate a new server.
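just to make the "square" claim concrete - a quick sketch (plain java, 
nothing mulgara-specific) showing that widening offsets from 32 to 64 bits 
squares the addressable range:

```java
// Widening file offsets from 32 to 64 bits squares the addressable range,
// since 2^64 == (2^32)^2. BigInteger avoids overflowing a long.
import java.math.BigInteger;

public class OffsetSpace {
    public static void main(String[] args) {
        BigInteger space32 = BigInteger.ONE.shiftLeft(32);  // 2^32 addressable offsets
        BigInteger space64 = BigInteger.ONE.shiftLeft(64);  // 2^64 addressable offsets
        System.out.println(space64.equals(space32.multiply(space32)));  // prints: true
    }
}
```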

* file-size
the number of slots for a given data size is rather arbitrary (for any 
size other than the URI-string, that is) and doesn't necessarily fit the 
average usage scenario.
but as the average usage scenario is obviously quite arbitrary too, and a 
change in this part of the system - more files for certain slots, or 
dynamically growing files - is probably not so easy to implement, fiddling 
with the file sizes seems like a weak option.
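to illustrate what i mean by slot sizes being arbitrary - a purely 
hypothetical sketch (this is not how mulgara actually lays out its files; 
the size classes are made up) of a literal landing in the smallest slot 
class that fits it:

```java
// Hypothetical illustration: if literal slots came in power-of-two size
// classes, a literal would go into the smallest class that fits it, and
// the gap between literal size and class size is wasted space.
public class SlotClasses {
    static int sizeClass(int bytes) {
        int c = 8;                      // assume the smallest class is 8 bytes
        while (c < bytes) c <<= 1;      // double until the literal fits
        return c;
    }

    public static void main(String[] args) {
        System.out.println(sizeClass(72));    // prints: 128
        System.out.println(sizeClass(1000));  // prints: 1024
    }
}
```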

* file-resolver
as i said above, i found andrae's concerns well worth considering, but i'm 
quite confident that there is a semantically sound solution to the problem 
(or there has to be, since the problem is big enough to justify a stretch 
in the semantic model ;-)
it seems like a good solution since there'll always be files that are just 
too big for the data slots, e.g. the moment i start to manage my 
dvd-backups within the store.
practically, i fear this solution because it needs to be implemented first, 
which not only takes time but also pulls resources away from the much more 
exciting XA2. or would you take the file resolver over to XA2 anyway?
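as a side note, the subdirectory hashing paul mentions could look roughly 
like this (the class, method names, and bucket count are all made up for 
illustration - nothing like this exists in mulgara):

```java
// Illustrative only: spread resolver-managed files over 256 subdirectories
// so no single directory grows too large. 256 buckets is an arbitrary choice.
import java.io.File;

public class FileBuckets {
    static File bucketFor(String fileId, File root) {
        int bucket = fileId.hashCode() & 0xff;       // stable bucket in 0..255
        String sub = String.format("%02x", bucket);  // subdirectory name, e.g. "a3"
        return new File(new File(root, sub), fileId);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("dvd-backup-0042.iso", new File("/data/store")));
    }
}
```

the same id always hashes to the same subdirectory, so lookup needs no 
extra index.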

* XA2
the promised land ...



** conclusion

** if XA2 materializes within a year, waiting for it would be good enough 
for me.
** if the file resolver is planned for XA2 as well, this sounds like the 
best solution.
** if XA2 takes longer or the file resolver takes too much effort, going 
for 64 bit seems like the best solution, even if it would significantly 
narrow down the number of machines that can handle it.
** if increasing the file sizes for certain slots (say, between 1kb and 
8000kb) isn't much of a problem, that seems like a balanced workaround.


just my 5 cents
thomas




mailto:thomas at stray.net
http://stray.net




: accumulated wisdom
. premature optimization is the root of all evil [donald e. knuth]
. if you've got a hammer every problem looks like a nail
. the difference between theory and practice is always greater
  in practice than it is in theory


