[Mulgara-dev] Number dataset loaded on EC2

Wed Nov 19 17:28:24 UTC 2008

(CC'ed to dev)

Hi Chris,

On Wed, Nov 19, 2008 at 10:25 AM, Chris Wilper
<cwilper at fedora-commons.org> wrote:
> Hi Paul,
>
> I decided to crank up an EC2 "large" instance[1] this week and do a
> big load of the numbers dataset, using the xa11 jar you gave me while
> in Ithaca.

As soon as I've finished this Topaz work I'll have a new jar that will
perform better. But I'm happy that you've given me a benchmark to
start with, thanks.

> It took a little over a day to load about a quarter billion triples.

Really? That's the same speed as my laptop. I don't know what kind of
disk bandwidth you have in the cloud, but I would have thought it
would be better. :-(

BTW, I'd forgotten this, but it's also possible to load these files
in their .rdf.gz form. This is particularly useful for huge data
files like these.

> I only ran a couple queries.  The first, to count the triples...took a
> few minutes.

Unfortunately this is done in linear time (it adds up the sizes of the
index blocks). We can cut this to logarithmic time, but there hasn't
been much need yet. Maybe the time has come.
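
To illustrate the difference, here is a rough Python sketch. It is not
Mulgara's code: the Node class, the block_size field and the
subtree_total cache are all hypothetical names, and the tree is left
unbalanced to keep it short. The linear count visits every block; the
cached version keeps a running total in each node, so the total is
read from the root and each insert only touches the nodes on one
root-to-leaf path.

```python
class Node:
    """One index block in a (hypothetical) binary tree of blocks."""

    def __init__(self, key, block_size):
        self.key = key
        self.block_size = block_size      # number of triples in this block
        self.left = None
        self.right = None
        self.subtree_total = block_size   # cached count for this subtree


def insert(root, key, block_size):
    """Insert a block, updating cached totals along the path taken."""
    if root is None:
        return Node(key, block_size)
    if key < root.key:
        root.left = insert(root.left, key, block_size)
    else:
        root.right = insert(root.right, key, block_size)
    root.subtree_total += block_size
    return root


def count_linear(root):
    """Count by visiting every block: time grows with the block count."""
    if root is None:
        return 0
    return root.block_size + count_linear(root.left) + count_linear(root.right)


def count_cached(root):
    """Count by reading the total cached at the root."""
    return 0 if root is None else root.subtree_total
```

A real balanced tree would also have to recompute the cached totals
whenever it rotates nodes, which this sketch leaves out.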

> The second was to list distinct predicates  (select $p
> where $s $p $o).  The latter timed out at the web interface after
> running for quite a while (well over 20 mins).

Initially I thought that this was an error, but now I realize that it
needed to sort the results. Ouch.

There are several ways this can be optimized. The simplest is a little
naive. We could see that we've selected everything ($s $p $o), but
only need the predicate $p. In this case, we could select the "pos"
index instead of the "spo" index. Then we'd already have the
predicates sorted, meaning we could iterate over them trivially.
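
As a minimal sketch of that idea in Python (not Mulgara code; the
function names and the list-of-tuples "index" are assumptions made for
illustration): once the statements are held in (p, o, s) order, equal
predicates sit next to each other, so the distinct predicates fall out
of a single pass with no separate sort.

```python
def build_pos_index(triples):
    """Return the (s, p, o) triples reordered and sorted as (p, o, s)."""
    return sorted((p, o, s) for s, p, o in triples)


def distinct_predicates(pos_index):
    """Yield each predicate once, relying on the index being sorted by p."""
    previous = object()  # sentinel that never equals a real predicate
    for p, _o, _s in pos_index:
        if p != previous:
            yield p
            previous = p


triples = [
    ("ex:a", "ex:name", "Alice"),
    ("ex:b", "ex:name", "Bob"),
    ("ex:a", "ex:age", "30"),
]
predicates = list(distinct_predicates(build_pos_index(triples)))
```

A real index could go further and seek straight to the next predicate
instead of stepping over every statement, but the single pass already
avoids sorting the whole result.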

That approach is naive because it doesn't extend to the general case.
However, it may be possible in general to check which variables are
being selected, and to prefer indexes that put these near the start
whenever we have a choice of index.
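
Here is a rough sketch of what such a preference might look like, in
Python. Everything in it is an assumption for illustration: the
choose_index function, the set of index orderings, and the scoring
rule are hypothetical and are not how the query planner currently
works. An index is usable if the bound positions form its prefix;
among the usable ones, we prefer the one whose next positions are the
selected variables.

```python
# Hypothetical set of index orderings over subject, predicate, object.
INDEXES = ("spo", "pos", "osp")


def choose_index(bound, selected, indexes=INDEXES):
    """Pick an index ordering for one constraint.

    bound    -- positions holding constants, e.g. "s" or "po"
    selected -- positions whose variables appear in the select clause
    """
    bound = set(bound)
    selected = set(selected)

    def usable(order):
        # The bound positions must form the prefix of the ordering.
        return set(order[:len(bound)]) == bound

    def score(order):
        # Count selected positions that directly follow the bound prefix.
        count = 0
        for position in order[len(bound):]:
            if position not in selected:
                break
            count += 1
        return count

    candidates = [order for order in indexes if usable(order)]
    if not candidates:
        raise ValueError("no index has the bound positions as a prefix")
    return max(candidates, key=score)
```

For the query above, choose_index("", "p") returns "pos". With only
these three orderings there is rarely a real choice once something is
bound, so the rule would matter more with a larger set of orderings.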

Andrae (if you're back online after the storm): what suggestions do
you have here?

> I'm sure there are a
> lot of other interesting queries you might want to run, but rather
> than keeping the ec2 instance up for a long time, I decided to shut it
> down and .tar.gz the server1 directory, putting it on s3 for later
> inspection.
>
> If you've got a good connection, feel free to download the tgz'd
> server1/ directory[1] here:
>
> http://xa11numbers.s3.amazonaws.com/server1.tar.gz

I already have these.  :-)

> It's about 4.6 gigs compressed, 51 gigs uncompressed.

Yes, the URIs contain a lot of redundant information in them, which I
haven't attempted to remove. XA2 will be storing strings and URIs much
more efficiently.

Regards,
Paul Gearon