[Mulgara-dev] configuration, source layout, build structure, etc.

Paul Gearon gearon at ieee.org
Fri Dec 3 18:05:10 UTC 2010


Hi Gregg,

The Mulgara build system is a frightening place. Every so often I've
had a go at cleaning up parts of it, but most of the time I just
pretend that it works so that I can go ahead and write code.

For some history....

The build system was first put together some time around 2002. Several
things have changed since then, including the capabilities of Ant, the
version of Java (we started on Java 1.4 Beta), and many of the
libraries that we use. As a result, you'll see some things that are
there purely for historical reasons, because people didn't know they
could be removed. There are also systems that have evolved over time,
resulting in something more complex than it would be if it were simply
thrown out and rewritten.

One of the reasons for the directory split in src/jar is to create
self-contained modules that can be joined together in various
configurations. The obvious examples are the resolvers and content
handlers (since these are loaded dynamically), but the approach
extends throughout the system.

For instance, a binary client needs to have a SPARQL/TQL parser so
that it can parse queries and send the binary request to the server.
The core of the database does not require a parser, but most of the
time you'll want one attached (any SPARQL endpoint needs it). So the
parser needs to be in its own module.

The Ant targets then try to pick up the modules needed (as a Jar) and
glue them together. It would be possible to pick up everything
directly from class files, but it would make the build targets vastly
more complex. However, the granular approach has led to some problems
as well (as you're seeing), and things are more granular than they
ought to be. A little while back I did some work to merge a number of
things that are *always* needed in the kernel (I used "query" for
this), but a lot more can be done there.
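
To make the gluing step concrete, here is a rough sketch of what it
amounts to in Ant. The target name, property names and jar names are
illustrative only, not the ones the real build defines (those live in
build.xml and common.xml):

  <!-- Sketch: "dist.dir", "bin.dir" and the module jar names are
       made up for illustration. -->
  <target name="dist-jar">
    <jar destfile="${dist.dir}/mulgara-x.y.z.jar">
      <!-- each src attribute unpacks an intermediate module jar
           into the distribution jar -->
      <zipfileset src="${bin.dir}/query-x.y.z.jar"/>
      <zipfileset src="${bin.dir}/util-x.y.z.jar"/>
      <zipfileset src="${bin.dir}/resolver-x.y.z.jar"/>
    </jar>
  </target>

Note that the <jar> task's "duplicate" attribute defaults to "add", so
if the same class appears in more than one of the source jars it can
end up in the output more than once, which is related to the
duplication you mention (more on that below).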

Incidentally, I did *not* create this build system. I can name the
people who did, and I can tell you how much I complained about it at
the time. However, it works, I'm used to it, and the time investment
to fix it is something I'd prefer to spend elsewhere, so most of the
time I just work with it.

Now on to your questions...

On Fri, Dec 3, 2010 at 8:35 AM, Gregg Reynolds <dev at mobileink.com> wrote:
> Trying to get a handle on the source directory structure and build logic, I
> decided to see what would happen if the code were laid out on the orthodox Java
> model.  The way the source tree is structured now is a little complicated,
> for me at least, and the Ant build files look pretty complicated, at least
> for an Ant newbie.

They're complex even for someone experienced in Ant. The root
build.xml and common.xml are the tough parts. Fortunately, the
build.xml files associated with the individual source directories
follow a pattern and are reasonably straightforward, so that helps a
little.

> So I rearranged the source tree to make the paths match the package names.
>  I used to be pretty skeptical about the wisdom of this, but it has the
> advantage that the package structure becomes visible.  The way the code is
> currently organized, it's a real pain in the neck to find all the files for
> a package (at least for some packages), since they are not stored in the
> matching dir structure.  For example, source files for org.mulgara.server
> are stored in src/jar/query/java/org/mulgara/server, and
> src/jar/server/java/org/mulgara/server.
> YMMV, but for me this arrangement is pretty opaque.  Presumably the idea is
> to distribute source files under dirs named by functionality (e.g.
> src/jar/query).  The problem of course is that if you're looking at a source
> file, the package names don't tell you where to find the source files, as
> they do under an orthodox layout.  A Real Pain, in my experience.

Yes, it's painful. Part of what you're seeing is that some classes in
a package are needed for some packaging configurations and not others,
while other classes might be needed for every configuration. Since the
packaging is done by joining intermediate Jars, and those Jars are
built from directories (instead of pulling out .class files
individually - a cleaner, but more tedious, approach), packages can
end up split across different source directories.

I split my editing time between Eclipse and VIM. Eclipse is OK, as it
will merge packages together for me, but the split is terrible to
navigate in VIM. I end up needing to use "find" a lot.

Anyway, where possible I'd like to see these multiple directories
abandoned. I'd like to see all classes for a single package found in
the same source directory (with the possible exception of Test classes
- I like the Maven approach there). I'd also like to see everything
that will always be distributed together to be found in the same
directory tree.

However, it's not always easy to see how to do this. For instance,
the kernel of the system (currently found in "query") always needs
access to resolvers, and hence will always need to see the interfaces
in the "resolver-spi" directory. However, each individual resolver is
a separate, self-contained module, and these also need to read the
interfaces in the "resolver" directory so that they can implement
them. (BTW, the system is not tied to any one resolver. While the
default is "resolver-store", you can configure it with any read-write
resolver, such as "resolver-memory".) It may be that "resolver" can be
merged in with "query", but I'd need to check that.

Also, "query" is called that because it used to just hold code
relevant to answering queries. Now that so many other things have been
merged into it, it might be better with a name like "core" or
"kernel", but that would be a big task to change cleanly.

> Anyway, I had the day off Tuesday and in a moment of petulance started
> reorganizing the code to make the standard java layout.  Turns out it's not
> that complicated; a little regex magic using emacs' Dired is all it takes.

It's possible, but it's the various distribution jars that make this
tricky. It's always harder for me, as I only use one or two
configurations myself, and I can forget about the needs of something
like mulgara-core. But other people still use these configurations, so
I can't neglect them.

>  I did have to rename some of the org.mulgara.server stuff that implements
> EmbeddedMulgara; I changed it to org.mulgara.server.embedded.  This actually
> makes sense for another reason, namely that EmbeddedMulgara can be viewed as
> an application rather than part of the Mulgara kernel.  More on that later.
> (This turned out to be a good idea if for no other reason than it forced me
> to learn Ant, and it's a very good way to become familiar with the source
> code.)
> I dumped all the ant files into a single <root>/ant directory.  Originally I
> wasn't going to bother trying to build from this source tree, but I thought
> what the heck, maybe it won't be too bad.  Well, I begin to understand why
> ant is reviled by some.  But I did eventually figure out how to get around
> cyclic dependencies (compile it all in one step), and after a long day
> managed to compile the whole thing into class files, no jars yet.

This is fine for a single big distribution Jar. I've done it myself.
However, I never got it to the point where I could create the various
distribution configurations.

> Now I'm looking at the build logic for jarring it up.  Turns out it's quite
> easy to go from a tree of class files to the jars.  The thing is, all the
> ant files in the original distrib have pretty much the same structure.  It
> seems to me it should be possible to simplify things considerably just by
> iterating over a list of names using a generic jar task.  But I haven't
> figured out how to do that; any tips would be appreciated.  I'm actually
> thinking of using awk.

With the exception of a couple of systems, a simple compile/jar
should be easy. The main differences are where you need to do some
pre-compile steps. For instance, TQL needs a SableCC task to be run,
SPARQL uses JavaCC (I'd prefer to move TQL to JavaCC as well), the
runtime configuration needs a Castor task to be run, and Axis needs to
generate stub/skeleton classes. That's all I recall off the top of my
head, but there may be other things as well.
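
As for iterating a generic jar task over a list of module names:
Ant's <macrodef> (available since Ant 1.6) is probably the cleanest
way to do that without reaching for awk. A rough sketch, where
"obj.dir", "bin.dir" and "version" are invented property names rather
than the ones the build actually defines:

  <macrodef name="module-jar">
    <attribute name="module"/>
    <sequential>
      <!-- jar one module's compiled classes into the bin directory -->
      <jar destfile="${bin.dir}/@{module}-${version}.jar"
           basedir="${obj.dir}/jar/@{module}/classes"/>
    </sequential>
  </macrodef>

  <target name="jar-all">
    <module-jar module="util"/>
    <module-jar module="query"/>
    <module-jar module="tuples"/>
  </target>

The modules with pre-compile steps (SableCC, JavaCC, Castor, Axis)
would still need their own targets, but most of the rest could
collapse into calls like these.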

> You can take a look at the results at http://bitbucket.org/gar/mulgara-src.
>  Not intended as a fork, only as a sandbox I can use to help grok the
> architecture and get the documentation right.
> On the other hand, going through this exercise does suggest some changes
> that might be worth looking at for the official code.  Some examples:
>
> There is some code duplication in some of the jar files.  Example:
>  some org/mulgara/server classes (e.g. Session.class) occur twice in
> mulgara-x.y.z.jar.  Seems strange to an old C programmer, but I'm not that
> familiar with the Java ecology.

It's happening because some classes end up in more than one
intermediate Jar file, and both of those Jar files are then included
in the final distribution Jars. Occasionally I've been able to remove
duplicates, but some can't come out given the current build system.
Yet another reason to change it.

> It looks like some of the build steps copy the source before compiling it.
>  Doesn't seem necessary.  I'm guessing this is in order to get some kind of
> macro expansion, but there are other ways to do that.  Is there a reason for
> this I'm not seeing?

The only things that come to mind are the pre-compile processes
(mentioned above) and the "macro expansion" done in the
server-configure target (found in src/jar/server/build.xml). In the
latter case, the source code is copied and then modified to include
the build label. There are other ways to do this (e.g. a resource
containing the build label could be shipped with the distro and read
at runtime), but if the code is going to include this information
directly then there aren't a lot of choices. C/C++ have a macro
expansion step, which is how this is done in a lot of other projects,
but there's no equivalent in Java, which is why Ant has utilities for
this kind of thing.
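
For reference, the usual Ant idiom for that kind of substitution looks
something like the sketch below. This is the general technique only,
not necessarily what server-configure actually does, and the token and
property names are made up:

  <!-- Copy sources into a staging area, replacing @BUILD_LABEL@
       tokens in the copied files before they get compiled. -->
  <copy todir="${obj.dir}/jar/server/java">
    <fileset dir="src/jar/server/java"/>
    <filterset>
      <filter token="BUILD_LABEL" value="${build.label}"/>
    </filterset>
  </copy>

A source file then declares something like a String constant set to
"@BUILD_LABEL@", and the copy step rewrites it before javac ever sees
the file. That's why the copy happens at all.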

> The build logic clutters the tree.  It creates directories in the <root>
> dir, e.g. bin, obj, which seems like a Bad Idea to me.

I think it was done to simplify the build for individual modules.
Trying to get straight from .java files to a Jar file can be difficult
in some circumstances. Having intermediate steps makes it easier to
work with, and to debug if things aren't working. The final result may
be a more complex arrangement, but on a package by package basis it's
simpler.

The obj directory contains the output of the compiler (all the .class
files) plus any resources that the Jars will need. The Jar files will
be built from the files and directories in there. This is a very
common sort of directory to have in any Java project.

The bin directory is definitely a little strange. This is the one that
contains all of the intermediate jar files. Basically, any directory
found in src/jar will end up with a jar file in the bin directory.
These are the Jars that are connected together to make the final
distribution Jars. Creating them may be unnecessary, but I believe
that it made the construction of each of the build targets easier.
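
Put together, the per-module flow is roughly the following (a sketch
with invented property and path names, not the actual targets):

  <!-- compile one module's sources into obj, then jar them into bin -->
  <mkdir dir="${obj.dir}/jar/util/classes"/>
  <javac srcdir="src/jar/util/java"
         destdir="${obj.dir}/jar/util/classes"
         classpathref="module.classpath"/>
  <jar destfile="${bin.dir}/util-${version}.jar"
       basedir="${obj.dir}/jar/util/classes"/>

The distribution jars are then assembled by merging the bin jars, as
in the earlier <zipfileset> sketch.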

> Some of the build targets seem to run even if the build products are up to
> date.

There are a few reasons. First, the dependency tree needs debugging.
Second, even with the dependencies set correctly, Ant sometimes fails
to see file changes, meaning that the output of the build is not
correctly up to date. As a result, a few things work around this by
calling tasks explicitly every time, when they should only need to be
listed as dependencies. Third, there seem to be occasions where Ant is
being too cautious and runs dependencies even when it doesn't need to
(this *could* just be due to complex dependency interactions that I'm
missing, but it certainly looks like some things are happening that
don't need to).
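
If you do go digging there, one way to claw some time back is to guard
the expensive targets with <uptodate> checks. Another sketch, with
invented target and property names:

  <target name="query-check">
    <!-- set the property only if the jar is newer than every source -->
    <uptodate property="query.jar.uptodate"
              targetfile="${bin.dir}/query-${version}.jar">
      <srcfiles dir="src/jar/query/java" includes="**/*.java"/>
    </uptodate>
  </target>

  <target name="query-jar" depends="query-check"
          unless="query.jar.uptodate">
    <jar destfile="${bin.dir}/query-${version}.jar"
         basedir="${obj.dir}/jar/query/classes"/>
  </target>

That doesn't fix a wrong dependency tree, but it does stop targets
from re-running when nothing has changed.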

> I haven't mastered all of the build targets in build.xml, but from what I've
> examined I don't see developer targets.  Pardon my Java ignorance, but
> wouldn't it make things go faster in the dev cycle to dispense with jarring?
>  E.g. "compile" just creates class files, and testing runs directly against
> the unjarred class files.

Despite testing being done with JUnit, the tests are evenly split
between unit tests and integration tests. The unit tests would be fine
to run from classes, but the integration tests need everything running
in place, and I don't think that's very easy from class files. This is
because some of the system configuration is looking for certain things
in the packaging structure. It might be possible to make it all run
from directories of .class files, but it may take a lot of
configuration to get it right.

> Basically I'm looking for ways to speed up the
> build - seems a little slow to me, currently.

It is, and if you can clean it up, then I'll be grateful.  :-)

> Instead of organizing the source by concept (or module, or whatever term),
> e.g. server, store, query, etc., use the build files to expose such
> structure.

(I tend to call them "modules")

I agree, and would like to do it that way. After all, the packages
already do most of that for us. The build files will need to get
bigger and more complex (particularly WRT creating build targets with
their various selections of classes), but some of what is in there now
can possibly be dropped as a result.

>  In some cases it might improve clarity to rename a package.

Perhaps. What are you thinking of here? I have no real problems with
any of the package names. The module names could be changed though
(I've already mentioned that "query" now encompasses much more than
just querying).

> Split the source into front-end stuff (e.g. parsers, query processors,
> etc.), the "kernel" (I'm not yet sure where the boundary between kernel and
> non-kernel lies, but it must be somewhere), and back-end stuff, mainly
> resolvers.  Among other benefits this would make it much easier to e.g.
> isolate resolver development for documentation and implementation.

Well, that's what the modules currently do.

Anything labelled "content-" is a content handler, and is therefore a
standalone package.

Anything labelled "resolver-" is a resolver, and is therefore a
standalone package.

The "store" modules are all related to resolvers, so they can probably
be merged into the relevant resolvers.

As for the rest:

ant-task: Used for letting Ant connect to a running Mulgara server.
client-jrdf: Client code for using the JRDF API. I think it should be removed.
config: The runtime configuration system. Implemented with Castor.
demo, demo-mp3: A tutorial on writing content handlers.
descriptor: Web module for running results through user-defined XSL.
doclet*: Defines some new JavaDoc tags that appear in the code.
driver: Client code for creating a connection factory. Nearly empty
these days, but I'm not sure where to merge it.
dtd: I suspect this is related to descriptors. Last updated in 2001!
jrdf: Represents a stored graph as a JRDF graph. Should be mergeable
into "query". May not be needed anyway.
krule: The rules engine. I planned on making it a module, but really
it's tightly integrated now, and should merge into "query".
query: The query engine. This is the core to the system, and a
destination for most things that can be merged.
querylang: Contains all the TQL and SPARQL code. This includes both
the parsers AND the protocol code.
rdql: The old Jena query language. Completely deprecated now.
server: All the standalone server code. I don't think the other
distribution configuration uses this module.
server-*: Client code for connecting to a server.
swrl: Implements a RuleLoader (found in "query"). Should probably be
merged. OTOH, we should possibly throw it out and implement a RIF
parser instead (RIF is my day job right now - it's also the day job of
the guy who wrote this swrl module).
tag: Servlet tags for accessing Mulgara from a Java servlet.
tuples: The representation of data inside the query engine. Should be
merged with "query".
tuples-hybrid: A specific implementation of tuples that uses memory,
and seamlessly falls back to disk as it expands.
util: Utilities used by everything in the system.
util-xa: Utilities used by the storage resolvers.
web: The Web User Interface. Most servers want this, with the
exception of embedded servers. Also, some systems prefer to have a
SPARQL endpoint only.

> I've also been learning Jetty, which suggests an alternative packaging
> scheme for mulgara.  The Jetty start.jar design strikes me as a Most
> Excellent thing.  It should work with any java code, so it could probably be
> easily adapted to Mulgara.  Call it mulgara.jar; then startup could look
> something like this:
>
> $ java -jar mulgara.jar OPTIONS=kernel,memory,file,lucene,mp3
>
> or the like.  Naturally this would entail rethinking the jar structure.

As well as rethinking the entire startup of the standalone system.
Make sure you're familiar with this before trying to tweak Jetty. Note
that a few things are done in a roundabout fashion so that the same
services can be run from either Jetty or an application server like
Tomcat.

The standalone startup is in EmbeddedMulgaraServer. Since HTTP won't
be set up for every configuration, all the Jetty code is isolated into
HttpServices. (This avoids ClassNotFoundExceptions if HTTP isn't being
used and the Jetty jars aren't there - yes, this configuration is used
in some places).

> I'm also looking into exposing the architecture of Mulgara "applications" -
> mulgara-x.y.z.jar is effectively an application that uses the Mulgara
> kernel, and I'm working on splitting out the structure more clearly, as well
> as implementing the basic set of app structures, such as a servlet running
> in tomcat and talking to mulgara, etc., and the variants of doing this with
> Jetty, such as using an XML config file, etc.  In principle all this should
> be pretty simple; or at least result in some pretty simple examples that
> will help document things.

My experience is that this has been tough to set up cleanly for
multiple environments as you propose. Maybe it was just hard because I
was trying to modify what exists, rather than rebuilding it from
scratch.

> The basic thrust of all this is just to get a clean view of the architecture
> and how the parts fit together.  To reflect the architecture, at least as I
> now understand it, I propose the following doc sets:
>
> Mulgara User's Guide (already started
> at http://bitbucket.org/gar/mulgara-doc)
> Mulgara Administrator's Guide
> Mulgara Application Development Guide
> Mulgara Kernel Development Guide (pending delineation of just what "kernel"
> means)
> Mulgara Resolver Development Guide
> Mulgara API Manual
> Others?  Does the above make sense?

These would all be great to have, and some of the material is already
out there. The main problem is how to get it written. I can contribute
but don't have time, and I think that goes for many of the other
people who put time into Mulgara.

> That's all for now.  Feedback warmly welcomed.

Regards,
Paul

