[Mulgara-dev] (no subject)

Andrae Muys andrae at netymon.com
Sat Jan 27 06:45:47 UTC 2007


On 27/01/2007, at 2:10 PM, Life is hard, and then you die wrote:
> we have a question about disk usage. We have a database with around
> 1.4 million triples currently, and the disk usage looks as follows:
>
>   4.0K   lucene
>   1.7M   xaNodePool
>   6.0G   xaStatementStore
>   169M   xaStringPool
>
> (A detailed listing of file sizes is at the end). While the string
> pool looks fine, the statement store looks a bit large.
>
> Now, we have two calculations for the statement store, one from Paul
> and one from Andrae. Given N statements, Paul said basically the disk
> usage should be around N / 192 * 8292 * 6; Andrae said something like
> 12 * 32 * N. This comes to about 363MB and 538MB, respectively, i.e.
> same ballpark (I'm not interested in exact numbers). But both are an
> order of magnitude less than what we're seeing.
>
> Paul and Andrae mentioned that space is not reclaimed on deletes, but
> instead goes back into a pool. We don't have the exact numbers, but in
> our case about 1/5 of the triples got inserted (~240000), then about
> 6500 were removed, and then the rest of the triples were inserted.
> There have been some small sets of deletes since then, but nothing
> beyond a few thousand triples. So in total the deletes are < 1% of the
> inserts. Plus anything in the free pool from the deletes should've
> pretty much been used up by the following inserts. But even if not,
> this doesn't look like it could account for the discrepancy.
>
> So, I'm a bit curious: anybody have any idea why the disk usage is so
> large? Has anybody else seen this (over 4K per statement)?

I just had a conversation with David Makepeace about this, and while
he was surprised, he did offer a possible explanation.  The file
sizes listed at the end of your message have a few interesting
properties.

1. Most of the space is being consumed by the block-files (_tb), not  
the AVLTrees.
2. There is a huge discrepancy between the 012 blockfile indices and  
the 120 blockfile indices.

The theoretical 'ideal' blockfile should be 44.8e6 bytes for 1.4e6
quads (at 32 bytes per quad).
In practice the ideal blockfile operates at a lower bound of 50%
utilisation, so 89.6e6 bytes.
There are 6 indices, so the total ideal blockfile space required for
1.4e6 quads is ~538MB.
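
Purely as a back-of-the-envelope check, here is that calculation as a
small Java sketch.  It uses only the figures above (32 bytes per quad,
worst-case 50% utilisation, 6 index orderings); nothing in it is
measured from the store itself:

  public class StoreSizeEstimate {
      public static void main(String[] args) {
          long quads = 1400000L;                       // ~1.4e6 quads
          long bytesPerQuad = 32L;                     // one quad = 4 longs
          long idealPerIndex = quads * bytesPerQuad;   // 44.8e6 bytes
          long worstCasePerIndex = idealPerIndex * 2;  // 50% utilisation -> 89.6e6
          long indices = 6;
          System.out.println("ideal per index:    " + idealPerIndex);
          System.out.println("50% util per index: " + worstCasePerIndex);
          System.out.println("all 6 indices:      " + worstCasePerIndex * indices);
      }
  }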

This is quite a bit less than the ~5.6GB being used below - so the
space is not a result of 'normal' operations.  This suggests that the
space is either on the freelist, or supporting multiple phases.  If
you were doing a lot of deletes then that might suggest the freelist -
but the worst-case overhead for ~6000 deletes (each hitting a
different 4k block) is 6000 x 4096 bytes, i.e. ~24MB.  So the space
must be the result of phase-support.

Remember that Mulgara does copy-on-write to support multiversion
semantics, but it only retains the old copy if there remains a
reference to the old phase (version).  If there is no such reference,
then the old block is placed on the freelist and used to support
subsequent inserts.  But there is one quirk in the freelist behaviour
(From FreeList.java):

   * A fifo of integer items. A list of "phases" is maintained where
   * each phase represents a snapshot of the state of the free list at
   * a point in time. Items added to the free list will not be returned
   * by {@link #allocate} until all current phases have been closed
   * (and at least one new phase created).

Note the "until all current phases have been closed".  What this  
means is that one unclosed outstanding Answer may (if it retains a  
reference to a current phase) prevent deallocated blocks from being  
reallocated!
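
To make that quirk concrete, here is a deliberately simplified,
self-contained model of the documented contract.  This is NOT the real
FreeList implementation - just an illustration of why a single open
phase pins blocks that were freed after it was created:

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.HashSet;
  import java.util.Set;

  public class ToyFreeList {
      private long nextItem = 0;       // next never-before-used block id
      private long currentPhase = 0;   // id of the newest phase
      private final Set<Long> openPhases = new HashSet<Long>();
      // freed items, each tagged with the newest phase id at free time
      private final Deque<long[]> freed = new ArrayDeque<long[]>();

      public ToyFreeList() {
          openPhases.add(currentPhase);
      }

      public long newPhase() {         // e.g. a new write transaction
          currentPhase++;
          openPhases.add(currentPhase);
          return currentPhase;
      }

      public void closePhase(long phase) {   // e.g. an Answer being closed
          openPhases.remove(phase);
      }

      public void free(long item) {    // copy-on-write discards an old block
          freed.addLast(new long[] { item, currentPhase });
      }

      public long allocate() {
          long[] oldest = freed.peekFirst();
          // A freed item is reusable only once every phase that existed
          // when it was freed has been closed, and at least one newer
          // phase has been created since.
          if (oldest != null
                  && noOpenPhaseAtOrBefore(oldest[1])
                  && currentPhase > oldest[1]) {
              freed.removeFirst();
              return oldest[0];
          }
          return nextItem++;           // otherwise the file just grows
      }

      private boolean noOpenPhaseAtOrBefore(long phase) {
          for (long p : openPhases) {
              if (p <= phase) return false;
          }
          return true;
      }
  }

In this toy model a single phase that is never closed (say, the one
pinned by a forgotten Answer) keeps noOpenPhaseAtOrBefore() returning
false for everything freed after that phase was created, so allocate()
keeps growing the file instead of reusing freed blocks.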

So what I think is happening is this: somewhere you are holding an
Answer open over the course of numerous small inserts.  I suspect
from the file-size distributions that each insert is a set of
properties associated with only a few subjects.  The result is that
on the 012-sorted indices these inserts only hit one or two blocks
(leading to the 5.4x inefficiency we are currently seeing).  In the
case of the 120 indices each property hits a separate block, forcing
a duplicate which, due to the outstanding phase, is never reaped
(leading to the ~39x inefficiency we are currently seeing).  As an
aside, in practice the goal is between 1.5x and 2x inefficiency, so
the 5.4x is really ~3x, and the 39x really ~25x.  With the 201
indices a degree of aliasing between 'objects' provides some defence,
but not enough to avoid an 18x (10x) penalty.
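
For the record, those factors come straight from the listing at the
end of your message, divided by the 44.8e6-byte ideal (assuming I am
reading the digits in the _tb file names as the index orderings):

  public class IndexOverhead {
      public static void main(String[] args) {
          double ideal = 1400000L * 32.0;                        // 44.8e6 bytes
          System.out.printf("xa.g_3012_tb: %.1fx%n", 243269632L / ideal);   // ~5.4x
          System.out.printf("xa.g_1203_tb: %.1fx%n", 1744830464L / ideal);  // ~39x
          System.out.printf("xa.g_2013_tb: %.1fx%n", 838860800L / ideal);   // ~18x
      }
  }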

It would be a good idea to confirm this theory, but if it holds then
there are a few options for proceeding from here.  I have listed them
in order of preference, which is incidentally also the order of
increasing effort and Mulgara-internals experience required.

1. Remember to call close() on your Answers (see the sketch following
this list) - especially if you are using the new transaction code,
which will allow you to keep an Answer open concurrently with
initiating a new transaction from its parent Session.

2. Periodically close() your Sessions - this will force-close any
Answer objects that may have leaked.  A connection pool is a natural
place to do this.

3. Once you are confident you have no very-long-lived Answers, you  
could possibly ignore the problem.  Once the phases are released, the  
space will be available for reallocation and the store's size will  
stabilise.

4. Wrap either a new Tuples or a new Answer object around the
server-side result: one that ages the Answer and eventually
materialises it, closing the inner result and clearing the phase.

5. Modify the FreeList behaviour to start reallocating blocks that  
are not referenced by any previous phase (rather than blocks that  
were referenced by a phase newer than the oldest active phase).  This  
requires substantial new bookkeeping and is a very non-trivial change.
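
To illustrate options 1 and 2 concretely, the usual pattern is just a
try/finally around whatever produces the Answer.  The sketch below uses
the Answer/Query/Session interfaces as I recall them; the method names
drainAndClose()/retire() and the row handling are purely illustrative:

  import org.mulgara.query.Answer;
  import org.mulgara.query.Query;
  import org.mulgara.server.Session;

  public class AnswerHygiene {
      // Option 1: always close the Answer, even on failure, so its
      // phase reference is released promptly.
      void drainAndClose(Session session, Query query) throws Exception {
          Answer answer = session.query(query);  // holds a phase reference
          try {
              answer.beforeFirst();
              while (answer.next()) {
                  // ... handle the current row here ...
              }
          } finally {
              answer.close();                    // releases the phase
          }
      }

      // Option 2: when retiring a pooled connection, close the Session
      // too; this force-closes any Answers that have leaked from it.
      void retire(Session session) throws Exception {
          session.close();
      }
  }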

Can you examine your test case and see if this is indeed what is  
happening?  If so, is there any particular reason you need to hold an  
old Answer open over so many inserts?

Andrae

Note:
When estimating the space requirement above I had forgotten the AVL
tree.  There is one node per (4k) block, and each node is 93 bytes[0]
(Paul, can you confirm this?).  This means that the ideal AVL-tree
index is in practice ~2MB.

[0] AVLNode:
   Left-Node  - long    -  8
   Right-Node - long    -  8
   Balance    - byte    -  1
   Low-Quad   - 4xlong  - 32
   High-Quad  - 4xlong  - 32
   Nr-Quads   - int     -  4
   BlockId    - long    -  8
                        = 93
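
For completeness, the arithmetic behind that ~2MB, using the same
89.6e6-byte worst-case index size as above:

  public class AvlOverhead {
      public static void main(String[] args) {
          long nodes = 89600000L / 4096L;   // one node per 4k block: ~21,875
          System.out.println(nodes * 93L);  // ~2.0e6 bytes per index
      }
  }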

> P.S. here's a detailed listing of the files by size:
>
>   Files in xaStatementStore, sorted by size:
>
>     1744830464  xa.g_3120_tb
>     1744830464  xa.g_1203_tb
>      847249408  xa.g_3201_tb
>      838860800  xa.g_2013_tb
>      251658240  xa.g_0123_tb
>      243269632  xa.g_3012_tb
>      192757760  xa.g_3120
>      192757760  xa.g_1203
>      100573184  xa.g_3201
>      100573184  xa.g_2013
>       33529856  xa.g_3012
>       33529856  xa.g_0123
>       27623424  xa.g_1203_fl
>       27557888  xa.g_3120_fl
>       14680064  xa.g_2013_fl
>       13959168  xa.g_3201_fl
>        8388608  xa.g_3201_tb_fl_ph
>        8388608  xa.g_3201_fl_ph
>        8388608  xa.g_3120_tb_fl_ph
>        8388608  xa.g_3120_fl_ph
>        8388608  xa.g_3012_tb_fl_ph
>        8388608  xa.g_3012_fl_ph
>        8388608  xa.g_2013_tb_fl_ph
>        8388608  xa.g_2013_fl_ph
>        8388608  xa.g_1203_tb_fl_ph
>        8388608  xa.g_1203_fl_ph
>        8388608  xa.g_0123_tb_fl_ph
>        8388608  xa.g_0123_fl_ph
>        4194304  xa.g_0123_fl
>        4128768  xa.g_3012_fl
>        2949120  xa.g_1203_tb_fl
>        2916352  xa.g_3120_tb_fl
>        1409024  xa.g_3201_tb_fl
>        1409024  xa.g_2013_tb_fl
>         360448  xa.g_3012_tb_fl
>         360448  xa.g_0123_tb_fl
>           1088  xa.g
>
>   Files in xaStringPool, sorted by size:
>
>     125714432  xa.sp_avl
>      33554432  xa.sp_nd
>      11927552  xa.sp_avl_fl
>       8388608  xa.sp_avl_fl_ph
>       8388608  xa.sp_08_fl_ph
>       8388608  xa.sp_08
>       8388608  xa.sp_07_fl_ph
>       8388608  xa.sp_07
>       8388608  xa.sp_06_fl_ph
>       8388608  xa.sp_06
>       8388608  xa.sp_05_fl_ph
>       8388608  xa.sp_05
>       8388608  xa.sp_04_fl_ph
>       8388608  xa.sp_04
>       8388608  xa.sp_03_fl_ph
>       8388608  xa.sp_03
>       8388608  xa.sp_02_fl_ph
>       8388608  xa.sp_02
>       8388608  xa.sp_01_fl_ph
>       8388608  xa.sp_01
>       8388608  xa.sp_00_fl_ph
>       8388608  xa.sp_00
>         65536  xa.sp_19_fl
>         65536  xa.sp_18_fl
>         65536  xa.sp_17_fl
>         65536  xa.sp_16_fl
>         65536  xa.sp_15_fl
>         65536  xa.sp_14_fl
>         65536  xa.sp_13_fl
>         65536  xa.sp_12_fl
>         65536  xa.sp_11_fl
>         65536  xa.sp_10_fl
>         65536  xa.sp_09_fl
>         65536  xa.sp_08_fl
>         65536  xa.sp_07_fl
>         65536  xa.sp_06_fl
>         65536  xa.sp_05_fl
>         65536  xa.sp_04_fl
>         65536  xa.sp_03_fl
>         65536  xa.sp_02_fl
>         65536  xa.sp_01_fl
>         65536  xa.sp_00_fl
>          1408  xa.sp
>             0  xa.sp.lock
>             0  xa.sp_19_fl_ph
>             0  xa.sp_18_fl_ph
>             0  xa.sp_17_fl_ph
>             0  xa.sp_16_fl_ph
>             0  xa.sp_15_fl_ph
>             0  xa.sp_14_fl_ph
>             0  xa.sp_13_fl_ph
>             0  xa.sp_12_fl_ph
>             0  xa.sp_11_fl_ph
>             0  xa.sp_10_fl_ph
>             0  xa.sp_09_fl_ph
>             0  xa.sp_19
>             0  xa.sp_18
>             0  xa.sp_17
>             0  xa.sp_16
>             0  xa.sp_15
>             0  xa.sp_14
>             0  xa.sp_13
>             0  xa.sp_12
>             0  xa.sp_11
>             0  xa.sp_10
>             0  xa.sp_09
>
> _______________________________________________
> Mulgara-dev mailing list
> Mulgara-dev at mulgara.org
> http://mulgara.org/mailman/listinfo/mulgara-dev

-- 
Andrae Muys
andrae at netymon.com
Principal Mulgara Consultant
Netymon Pty Ltd




