[Mulgara-dev] scalability
Paul Gearon
gearon at ieee.org
Thu Nov 20 22:28:43 UTC 2008
(CC'ed to our developer list)
On Thu, Nov 20, 2008 at 11:16 AM, Michael Osofsky <mosofsky at netbase.com> wrote:
> Hi,
> I recently heard about your RDF store, Mulgara. I'm looking for stores that
> can handle upwards of 100 billion triples. How close can Mulgara get and
> how is the performance under that load?
>
> Many thanks,
>
> Michael
Performance really has to be measured both for loading that data and
for querying it. You can't query the data until you have loaded it, so
I will address that first.
LOADING
At the moment we are more in the 1 billion range. On a mid-range
notebook computer, 1 billion triples takes about 4 days to load, fully
indexed. (Some other stores load faster, but then have slower query
performance until the data is fully indexed). A server-class system
with decent RAID should be able to scale much better than this.
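To put the notebook figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the load rate stays constant as the store grows, which is an assumption for illustration only and not a measured property of Mulgara; the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def load_rate(triples, days):
    """Average triples per second for a load that took `days` days."""
    return triples / (days * SECONDS_PER_DAY)

def days_to_load(triples, rate):
    """Days needed to load `triples` at a constant `rate` (triples/second)."""
    return triples / rate / SECONDS_PER_DAY

# Figure quoted above: 1 billion triples in about 4 days on a notebook.
rate = load_rate(1_000_000_000, 4)
print(f"{rate:,.0f} triples/second")

# Naive extrapolation to 100 billion triples at that same constant rate.
print(f"{days_to_load(100_000_000_000, rate):,.0f} days")
```

At a constant rate, a 100x larger data set takes 100x as long, which is why a single notebook-class machine is not a realistic target for the 100 billion case.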
We currently have a development effort going on for a new store that
is designed to handle that quantity of data, but the first release for
this is not expected until about February/March of 2009 (I don't know
of how much interest this will be to you). Preliminary performance of the
modules already implemented has shown that we should scale as we have
been expecting. Our subsequent phase of development will take us into
clustering, which we anticipate will help significantly. This is
partly because our indexing is disk-bound, and the more servers we can
move that to, the better.
QUERYING
For that quantity of data, some queries scale better than others. If
you want to do a query that will return a handful of results, then you
should not need to wait more than a second or two (it would be a bug if
you do).
For more general querying, some explanations are in order:
- We use tree indexes, and joins are evaluated lazily, so many queries
will return reasonably quickly. The exact time will vary according to
the type of query, of course.
- Counting data currently has linear performance, so you'll find that
is slow (it takes my notebook about a minute to count a billion
triples). That'll be optimized soon.
- Anything that returns data that needs to be sorted may be unusable
at that scale. There are some simple optimizations for certain queries
that can tell the engine to use a different index instead of doing a
sort, but this has not been a priority.
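The three points above can be illustrated with a small Python sketch. This is not Mulgara's implementation: in-memory sorted lists stand in for the on-disk tree indexes, and `TripleIndex`, `scan`, and `join` are hypothetical names invented for this example. It shows a join evaluated lazily (only as much work as the caller asks for), a count that has to walk every row, and an index ordering that yields results already sorted so no separate sort is needed.

```python
from bisect import bisect_left
from itertools import islice

class TripleIndex:
    """Triples kept sorted in one column ordering, e.g. (0, 1, 2) for
    subject-predicate-object or (1, 2, 0) for predicate-object-subject."""

    def __init__(self, triples, order):
        self.order = order
        self.rows = sorted(tuple(t[i] for i in order) for t in triples)

    def scan(self, *prefix):
        """Lazily yield (s, p, o) triples whose leading index columns
        equal `prefix`, in index order."""
        i = bisect_left(self.rows, prefix)  # seek to the first match
        while i < len(self.rows) and self.rows[i][:len(prefix)] == prefix:
            row = self.rows[i]
            triple = [None] * 3
            for pos, col in enumerate(self.order):
                triple[col] = row[pos]  # restore (s, p, o) column order
            yield tuple(triple)
            i += 1

def join(spo, pos, p1, p2):
    """Lazily yield (x, y, z) such that x p1 y and y p2 z."""
    for x, _, y in pos.scan(p1):
        for _, _, z in spo.scan(y, p2):
            yield (x, y, z)

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "knows", "dave"),
    ("carol", "knows", "alice"),
    ("alice", "name", "Alice"),
]
spo = TripleIndex(triples, (0, 1, 2))
pos = TripleIndex(triples, (1, 2, 0))

# Lazy join: only the first two results are ever computed.
first_two = list(islice(join(spo, pos, "knows", "knows"), 2))

# Counting walks every row, so it is linear in the number of triples.
total = sum(1 for _ in spo.scan())

# Reading the predicate-object-subject index returns objects in sorted
# order, so no separate sort step is needed.
objects_in_order = [o for _, _, o in pos.scan("knows")]
```

Asking for a handful of results touches only a few index entries, while the count and any query that needs a sort must consume the whole result.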
So, in general, we can handle that scale easily for some queries. Many
of the queries that we can't handle yet could be handled by adding a
missing optimization, i.e. it's a matter of some work, rather than the
data structures being incapable of scaling.
The new data store in development should handle queries just as well,
and probably faster.
Please let me know if I haven't covered anything here.
Regards,
Paul Gearon