[Mulgara-dev] another 2.1.9 issue...

Paul Gearon gearon at ieee.org
Thu Oct 28 15:08:10 UTC 2010


Hi David,

OK, there are a few questions here....

On Thu, Oct 28, 2010 at 10:31 AM, David Smith <DMS at viewpointusa.com> wrote:
> This one I'm having trouble reproducing in a "small" way, but....
>
> When I switch our production system to the new mulgara (2.1.9+  w/
> utils.debug and .info fix  AND with Removed truncate from rollback in
> SVN)

Eeek. The removal of truncation is actually a non-fix. I have to get
to it, but I've been swamped with work.  (I know exactly how to fix
it, but have been putting it off until there's time).

So that you know, 2.1.9 has a bug that causes an exception if you roll
back a large enough transaction. The rollback attempts to truncate the
string pool file to the beginning of the transaction, but that file is
now memory mapped. If the transaction was large enough, then that part
of the file is mapped, and can't be truncated into. The non-fix was to
just set the file position, and avoid truncation, but the file is
append-only, and the size of the file is used to determine some things
as well, so this wasn't properly thought through.
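
To make the failing condition concrete, here is a minimal Java sketch of
a rollback that refuses to truncate into the mapped part of the file.
It is not the fix mentioned above, which this message does not describe,
and the class name, method names, and the idea of passing the mapped
size in as a parameter are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; names are illustrative and not taken from Mulgara.
class RollbackSketch {
    /** True if cutting the file back to txStart would reach into the mapped bytes [0, mappedSize). */
    static boolean cutsIntoMapping(long txStart, long mappedSize) {
        return txStart < mappedSize;
    }

    /**
     * Truncates the file to txStart only when that leaves the mapped region intact.
     * Returns false, without touching the file, when the rollback point is mapped.
     */
    static boolean rollbackByTruncation(Path file, long txStart, long mappedSize) {
        if (cutsIntoMapping(txStart, mappedSize)) {
            return false; // the caller has to deal with the mapping first
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            channel.truncate(txStart);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A small transaction starts past the mapped area, so truncation is safe.
A large one that triggered a remap starts inside it, which is the case
that fails in 2.1.9.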

> things would appear to run fine, until some of our background tasks
> would kick in.
>
> After that, occasionally the parsing of results of queries would fail
> (that would run normally in 2.1.8).
> Usually with some complaint about malformed XML, like:
>
>>> Message: Failed to retrieve list of tasks from the database. There is
> an unclosed literal string. Line 2, position 20905946.    at
> System.Xml.XmlTextReaderImpl.Throw(Exception e)
>>>    at System.Xml.XmlTextReaderImpl.ParseAttributeValueSlow(Int32
> curPos, Char quoteChar, NodeData attr)
>
> I do not believe that this particular query should have generated a 20MB
> result, and suspect that there's a problem in the Answer rendering.

Hmmm. Perhaps the above bug created an entry in the string pool that
is corrupted?

OK, I'll try to fix this tonight.

> So the situation is:
>
> Multiple Threads running multiple queries...  ( we use POST REST /TQL
> queries )... occasionally returning malformed TQL/XML.
>
> This is "repeatable" in the sense that it happens every time I start the
> system under 2.1.9
> I have not been able to "reproduce" or capture the state that produces
> the error yet.
> (The exceptions have been thrown in a variety of places)
> When I switch back to 2.1.8, everything runs fine.
>
> ?Do these symptoms trigger any hints as to where to look and what's
> changed between 2.1.8 and 2.1.9 ?
>
> On another topic (why we're very interested in 2.1.9)
> Paul, can you summarize the performance implication of the changes to
> the string pool between 2.1.8 and 2.1.9?

The 2.1.8 string pool was using a small amount of state when reading
the "flat" file in the string pool. This meant that concurrent threads
had to serialize their access, which was a major problem. The other
thing was that the string pool was using standard I/O, and when I
tested memory mapping it came out to be about 30% faster.

So 2.1.9 has a new set of classes that memory map the string pool
flat file on 64-bit systems. On 32-bit systems, or if you ask for it,
the file is still read with standard I/O, though the I/O solution now
works concurrently too. The memory mapped version is read-only.

The flat file is only ever appended to (except in the case of a
rollback), and this can get beyond the end of the mapped area. So a
WeakHashMap is also kept alongside the memory map, holding objects
written past the end of the mapped area. Whenever a read operation
asks for something from the memory mapped area, the object is
retrieved straight from the mapped buffer. Otherwise, it can come from
the WeakHashMap. However, if the object isn't found in the WeakHashMap
(due to the garbage collector picking it up), then this triggers a
remapping, whereby the unmapped portion of the file gets memory
mapped, and the WeakHashMap is cleared. Keeping the WeakHashMap
reduces the need for remapping.
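
The read path described above can be sketched roughly as follows. This
is a simplified illustration and not the actual Mulgara classes: the
class name, the length-prefixed record layout, and the use of file
offsets as keys are all assumptions. It also remaps the whole file
instead of just the unmapped portion, so it only handles files under
2 GB, and it serializes access with synchronized for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch; class and method names are not Mulgara's.
class MappedFlatFileSketch {
    private final FileChannel channel;
    private MappedByteBuffer mapped;  // read-only view of bytes [0, mappedSize)
    private long mappedSize;
    // Records appended past the mapped area. Keys are weakly held, so the
    // garbage collector may drop entries at any time.
    private final Map<Long, byte[]> recent = new WeakHashMap<>();

    MappedFlatFileSketch(Path file) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            remap();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends a length-prefixed record and returns the offset it starts at. */
    synchronized long append(byte[] data) {
        try {
            long offset = channel.size();
            ByteBuffer record = ByteBuffer.allocate(4 + data.length);
            record.putInt(data.length).put(data).flip();
            while (record.hasRemaining()) {
                channel.write(record, offset + record.position());
            }
            recent.put(offset, data.clone());
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the record starting at the given offset. */
    synchronized byte[] read(long offset) {
        if (offset >= mappedSize) {
            // Past the mapped area: try the weak cache first.
            byte[] cached = recent.get(offset);
            if (cached != null) {
                return cached.clone();
            }
            // The entry was collected, so extend the mapping instead.
            try {
                remap();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        ByteBuffer view = mapped.duplicate();  // independent position
        view.position((int) offset);
        byte[] data = new byte[view.getInt()];
        view.get(data);
        return data;
    }

    /** Maps the whole file read-only and clears the weak cache. */
    private void remap() throws IOException {
        mappedSize = channel.size();
        mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, mappedSize);
        recent.clear();
    }

    synchronized void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller gets the same bytes back from read() whether the record was
served from the mapped buffer, from the WeakHashMap, or after a remap.
Only the cost differs.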

In summary, 2.1.9 has much better caching of the XA1.1 string pool
(whether relying on Java caches, or the operating system's page
caches), and also allows for concurrent reading. The current bug is
triggered if a rollback occurs after enough data was added (and
accessed) to trigger a remap. An example of this would be loading a
large invalid file.

As I said, I'll get to this tonight. I was away at a conference last
week, and have been catching up since, or else I'd have done it
already.

Paul

