[Mulgara-general] size of XML-literals

Paul Gearon gearon at ieee.org
Sun Jun 17 02:20:57 UTC 2007


On Jun 16, 2007, at 7:31 AM, thomas wrote:

> --On 15. Juni 2007 20:31:23 -0500 Paul Gearon <gearon at ieee.org> wrote:
> i looked into the "File based resolver" thread from mulgara-dev
> <http://mulgara.org/pipermail/mulgara-dev/2007-February/000345.html>.
> it's an interesting read and i want to think about it a little more.
> i like the way andrae is cautious with the rdf model.

I've yet to see an argument pointing out where it's wrong.  Really.   
I suspect that there's a communications disconnect here, where  
Andrae's concerns are based on something I haven't explained  
correctly.  But without details I'm at a loss as to where I need to  
clarify.

If I really have missed something, then I still suspect that there's  
a way to tweak the approach so that it would work in all cases.  It  
will certainly work in all the use cases I have for it.

>> It wouldn't be too onerous to expand our file access.  All the  
>> underlying
>> classes for access are based on 64 bit offsets anyway, so it  
>> should be
>> relatively easy.  It's funny, but no one has asked for more literals
>> before.  Most RDF just uses the short data types (i.e. < 72 bytes) and
>> we're
>> effectively unlimited there.  The thing everyone has asked for  
>> before now
>> is scalability in triples, not in literals.  :-)
>
> academia! ;-)

No, it's more industry.  The best (and IMHO principal) purpose for  
RDF is not as a KMS, but as a way to "link" objects on the web.   
Indeed, given the flexibility of URIs, it can be used to link  
anything on the internet.  This is where a lot of the industrial
applications are appearing.  It requires few literals, and most of
the URIs are short enough to fit into the 72 character limit.

> * so there seem to be 4 options:
>
> * 64-bit
> a switch to 64 bit would square the storage slots, right? if that  
> can easily be done, it sounds like a good idea.
> but what consequences would it have on the hardware/OS-side? can
> the majority of the installed base of desktop computers handle that?
> and wouldn't it increase the standard file sizes enormously? i'm
> already scared every time i instantiate a new server.

:-)

The files would only grow as you inserted data into them.  If you put  
a lot of large literals in, then you have to understand that you'll  
be using a lot of disk space!

Also, note that we already use 64 bit access for many of the files,  
even on a 32 bit system.  There are 2 differences on a 64 bit system,  
compared to a 32 bit system:
- The ability to manage the 64 bit offsets with a single word.  This
should optimize some aspects of file access.
- We can memory map the files.
64 bit access is done with the read(ByteBuffer,long) and
write(ByteBuffer,long) methods in java.nio.channels.FileChannel.
Until this API came out in Java 1.4, there were no methods which
accepted a 64-bit long as an offset.
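
For the curious, here's a minimal sketch of what that looks like.  This
is not Mulgara code; the file name and sizes are made up for
illustration.  Positioned reads and writes take a long offset, and
memory mapping goes through the same FileChannel:

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LongOffsetExample {

  public static void main(String[] args) throws Exception {
    // Illustrative file name only.
    RandomAccessFile raf = new RandomAccessFile("example.dat", "rw");
    FileChannel channel = raf.getChannel();

    // An offset past 2GB: it only fits in a long, not an int.
    long offset = 3L * 1024 * 1024 * 1024;  // 3GB

    // Positioned write.  The long offset works even on a 32 bit JVM,
    // since it is handed straight to the OS.  (Most filesystems will
    // create this as a sparse file rather than 3GB of zeros.)
    channel.write(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}), offset);

    // Positioned read from the same offset.
    ByteBuffer in = ByteBuffer.allocate(4);
    channel.read(in, offset);

    // Memory mapping a region.  The mapping consumes address space,
    // which is why large mappings are only practical on 64 bit JVMs.
    MappedByteBuffer mapped =
        channel.map(FileChannel.MapMode.READ_ONLY, offset, 4);
    System.out.println("first mapped byte: " + mapped.get(0));

    channel.close();
    raf.close();
  }
}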

As for the current size... there are a few things that can be done
in the XA system, and I hope to get them done in the next few
months.  The most obvious of these is to dramatically cut the size of
the files ending in _fl_ph.  These are done in a particularly stupid
manner, and hold impossibly sparse data.

XA2 will use significantly less space.  This alone will help with its  
speed.

> * file-size
> the number of slots for a given data size is rather arbitrary (for
> any other size than the URI-string, that is) and doesn't necessarily
> fit with the average usage scenario.
> but as the average usage scenario is obviously quite arbitrary
> too and a change in this part of the system - more files for
> certain slots, or dynamically increasing files - probably not so
> easy to implement, fiddling with the file-sizes seems like a weak
> option.

It's not arbitrary at all, but I appreciate that from a user's
perspective it would seem so.

I'm not sure what you're saying on this point, so it's hard to
comment.  However, with the current string pool design we have two
options:
1.  Bigger files for data of each given size.  (this means using 64  
bit access into the files)
2.  More than one file for data of each given size.  (this lets us  
stick to 32 bit access)

In theory, there's no real difference, except the number of file  
handles.  (However, file handles are a limited resource, so this can  
be an issue.)

In practice, it depends on the implementation of the filesystem.   
Some file systems are going to be better at going to offsets in a  
single large file.  Others are going to be better if the file is  
split into chunks.  I suspect that larger files are better in most  
cases, but  I haven't kept up with filesystem design for some time.
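
To make the difference concrete, here's a toy sketch of how a slot
number could be turned into either a single 64 bit offset, or a file
number plus a 32 bit offset.  None of this is Mulgara code; the class,
method names and sizes are just illustrative:

public class SlotAddressing {

  static final long BLOCK_SIZE = 2L * 1024 * 1024;   // e.g. the 2MB slot file
  static final long MAX_FILE_SIZE = 1L << 31;        // 2GB, the 32 bit limit

  /** Option 1: one big file, addressed with a 64 bit offset. */
  static long offsetInSingleFile(long slot) {
    return slot * BLOCK_SIZE;                        // may exceed 2GB: needs a long
  }

  /** Option 2: many 2GB files, each addressed with a 32 bit offset. */
  static int[] fileAndOffset(long slot) {
    long slotsPerFile = MAX_FILE_SIZE / BLOCK_SIZE;  // 1024 slots per 2GB file
    int fileNumber = (int) (slot / slotsPerFile);    // which file to open
    int offset = (int) ((slot % slotsPerFile) * BLOCK_SIZE);  // fits in an int
    return new int[] { fileNumber, offset };
  }

  public static void main(String[] args) {
    long slot = 5000;                                // well past 1024 slots
    System.out.println("single file offset: " + offsetInSingleFile(slot));
    int[] fo = fileAndOffset(slot);
    System.out.println("file #" + fo[0] + ", offset " + fo[1]);
  }
}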

> * file-resolver
> as i said above i found andrae's concerns very worth considering but
> i'm quite confident that there is a semantically sound solution to
> the problem (or has to be, since the problem is big enough to
> justify a stretch in the semantic model ;-)
> it seems like a good solution since there'll always be files that
> are just too big for the data slots. e.g. the moment when i start to
> manage my dvd-backups within the store.
> practically i fear this solution because it needs to be implemented
> first. which not only takes time but also pulls resources away from
> the much more exciting XA2. or would you take the file resolver over
> to XA2 anyway?

Such a resolver would be completely independent of XA2.  In fact, it  
might be a good beginner project for someone wanting to learn how to  
code in Mulgara.

> * XA2
> the promised land ...

Yes.  It's just that the design is big and detailed.  This means lots  
of programmer hours in order to implement.  Programmers have to eat,  
send their kids to school, pay the rent..... you know.  :-)

We're working on it.

> ** conclusion
>
> ** if XA2 materialized within a year, waiting for it would be good
> enough for me.

Fingers crossed.  It's not out of the realm of possibilities.  We'll  
let people know as we find out ourselves.

> ** if the file-resolver is planned for XA2 as well, this sounds  
> like the best solution.

Yes, it would work just fine for XA2.  We just need someone who is  
prepared to give it a try.  How good are you at Java?  :-)

> ** if XA2 takes longer or the file resolver takes too much effort,  
> going for 64-bit seems like the best solution, even if it would  
> narrow down the number of machines that can handle it significantly.

It wouldn't narrow down the number of machines.  But the 64 bit
machines would be quicker.

> ** if increasing the file sizes for certain slots (say between 1kb  
> and 8000kb) isn't much of a problem, that seems like a balanced  
> workaround

The only way to do that is to move those files to 64 bit access.  The  
"arbitrary" sizes you mentioned earlier are based on the limit of 32  
bit access.

For instance, look at the 1024 slots for data of 1MB to 2MB.  This is  
because:
- The file is broken into 2MB slots.
- Anything smaller than 1MB can be stored in another file (though it  
would obviously fit in here).  Anything from 1MB to 2MB will be  
stored here (with blank space all the way to the end of the 2MB  
slot).  Anything larger than 2MB won't fit in here, so it goes up the  
list to a file that CAN store it.  Incidentally, this means that each
slot can have up to half of its space unused.  If you had lots of
data of all sizes, then you'd expect the average "fill" of a slot to
be about 75%, meaning that the file is 25% unused.  There were good
engineering reasons for this, but it does sound steep.  :-)
- The limit of 32 bit access to a file is 2GB.  That's because the
file is accessed using an "int" which is 32 bits long, and the
largest "int" value is 2147483647.  (2^31 - 1.  It's not 2^32 as the
top bit is the "sign" bit.)
- A 2GB file divided into 2MB slots gives you 1024 slots.  (The
arithmetic is worked through in the snippet below.)
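
A quick check of that arithmetic, with the worst-case "half empty" slot
thrown in.  This is a standalone snippet, not Mulgara code:

public class SlotArithmetic {

  public static void main(String[] args) {
    long fileLimit = 1L << 31;           // 2GB, the edge of 32 bit addressing
    long slotSize = 2L * 1024 * 1024;    // 2MB slots

    System.out.println("largest int offset: " + Integer.MAX_VALUE);     // 2147483647
    System.out.println("slots per 2GB file: " + (fileLimit / slotSize)); // 1024

    // Worst-case fill: a value just over 1MB sitting in a 2MB slot,
    // leaving almost half the slot unused.
    long value = 1024 * 1024 + 1;
    System.out.printf("worst-case fill:    %.2f%%%n", 100.0 * value / slotSize);
  }
}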

So there's just no way to make the file hold more slots than this
without making the file bigger.  But anything larger than 2GB
needs 64 bit offsets for access.  It's also worth noting that several
filesystems don't permit files greater than 2GB, for the same 32 bit
reasons.

Regards,
Paul Gearon


