mr-edd.co.uk :: horsing around with the C++ programming language

The problem with DAG-driven builds

[21st July 2010]

slam has served me reasonably well for a while now, but as soon as it came off the conveyor I knew there were things that I'd never be able to do with it.

In fact, it was only a short while after I released it that I started thinking about a successor. I started out with the notion of creating something like SCons, but easier to use[1]. The idea would be to provide an API in a powerful scripting language such as Python or Lua that allowed the construction of a directed acyclic graph (DAG) representing the build.

However, after creating a prototype and thinking about it some more, I'm gradually coming to the conclusion that the DAG-driven abstraction underlying systems such as SCons and make is inadequate for describing some types of build that exist in the real world.

The traditional DAG build model

Let's first examine how such systems work by looking at a simple build for a C library. A graph similar to the following might be created:

static library DAG

SCons won't actually create the nodes in the graph for the headers upfront (it calls them implicit dependencies), but make will (once you've bolted on other tools such as sed, makedepend, g++ -M, etc. to your build process). Regardless, the graph is representative of the eventual knowledge that each system has about dependencies.

To perform the build, the system starts at the nodes at the top and works its way down, performing something like this on each node:

# pseudo-python
def build(node):
    # bring this node up to date first...
    if not node.up_to_date():
        node.run_build_task()
    # ...then move on to the nodes that depend on it
    for child_node in node.children():
        build(child_node)

The actual implementation details differ — one system might do a topological sort in order to deduce groups of nodes that can be built in parallel first, another might use some kind of continuation-style system to imbue the build with concurrency — but the essence is again the same.
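As an illustration of the topological-sort approach, here's a rough sketch in the same pseudo-python spirit as above (the dependencies() method is an invented name, not any particular tool's API). It groups nodes into levels, where every node in a level can be built in parallel once the previous levels are done:

# pseudo-python sketch: group nodes into parallelisable levels
def parallel_levels(nodes):
    remaining = {node: set(node.dependencies()) for node in nodes}
    levels = []
    while remaining:
        # nodes whose dependencies have all been scheduled already
        ready = [n for n, deps in remaining.items() if not deps]
        if not ready:
            raise RuntimeError('dependency cycle detected')
        levels.append(ready)
        for n in ready:
            del remaining[n]
        for deps in remaining.values():
            deps.difference_update(ready)
    # levels[0] needs nothing, levels[1] needs only levels[0], and so on
    return levels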

Processes yielding multiple build products

It all looks fine so far. But now let's change the example slightly. Instead of building a static library with the GNU C++ tool-chain, we'll build a dynamic library with Visual C++. Here's the graph:

dynamic library dag

Note that Visual C++'s linker is typically used to produce both a .lib stub library and a .dll that contains the actual machine code, and a single invocation produces both files.

So if we were really to create this graph in a makefile or SConscript and allow it to be built in a concurrent manner (e.g. pass -j2 to make), we could end up calling link.exe twice. At best this is redundant work and at worst it's a dangerous race condition leading to file corruption.

But make and SCons have some tricks that allow you to express the fact that a single invocation of a build step will update multiple files. So if we're sufficiently skilled with our build tool, our graph should actually look like this:

dynamic library dag
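In SCons, for instance, the trick amounts to listing both files as targets of a single action, so that the tool knows one invocation of the linker updates the pair. Here's a minimal sketch with made-up file names; the flags are ordinary link.exe options, but don't treat this as a complete recipe:

# SConstruct sketch: a single Command node with two targets
env = Environment()
env.Command(
    ['mylib.dll', 'mylib.lib'],     # both build products of the one link step
    ['foo.obj', 'bar.obj'],
    'link.exe /DLL /OUT:${TARGETS[0]} /IMPLIB:${TARGETS[1]} $SOURCES')

In make, the usual equivalent is a pattern rule with multiple targets, which make treats as coming from a single invocation.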

Until now, the processes used to update the targets in the build have been implicit nodes, each with a single edge connecting it to its build target. But since we've now found a case where multiple targets need to share a single process instance, we can see that processes should probably be first-class citizens in the graph.

Now, the APIs of any given DAG-driven build system might not expose the process nodes explicitly, but if two target nodes both refer to the same process instance, then in effect such a graph really does exist within the system.

The build() function we saw earlier now needs some modification in order to do the right thing, but that's not a particularly big deal. What I do think matters here is that the traditional idea of the DAG is starting to crack.
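For what it's worth, the change might look something like this (pseudo-python again; process() and run() are invented names standing in for whatever the real system exposes):

# pseudo-python sketch: a shared process instance is only ever run once
def build(node, finished_processes=None):
    if finished_processes is None:
        finished_processes = set()
    process = node.process()          # the process instance that updates this node
    if process not in finished_processes:
        if not node.up_to_date():
            process.run()             # may update several targets in one go
        finished_processes.add(process)
    for child_node in node.children():
        build(child_node, finished_processes)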

Processes yielding unpredictable build products

Now let's really cause some trouble! Suppose we have a simple Java project. We have a bunch of .java source files, which we want to compile into .class files and package up in a .jar file.

As anyone who has tried to create a nice make-driven build system for Java will attest, this is damn near impossible, for the following reasons:

  1. Dependencies between Java source files can be devilishly difficult to work out (they can even be circular!). Since makefiles need to know dependencies in advance, this renders makefile construction very difficult, especially in the absence of something like makedepend for Java sources.
  2. Sun's (now Oracle's) javac likes to be helpful by compiling related .java files for you. For example, if A.java imports B, the compiler will look at B.java and compile it for you if it thinks it's warranted (this behaviour becomes even harder to predict when implicit imports are considered for source files belonging to the same package).
  3. Even if a given .java file has no external dependencies, compiling it might produce multiple .class files if you have inner and/or anonymous classes. For example, compiling foo.java might yield foo.class, foo$1.class and foo$bar.class.

So even if we ignore the dependencies issue, point number 2 implies that we can end up with race conditions in a parallel build; if A.java and B.java both import C and two compilers are spawned to compile A and B, a race is introduced to compile C.

SCons, admirably, copes rather well with the first two issues, but despite its best efforts it comes unstuck on the third. In order to determine which .class files will be created by compiling a given .java file, you not only have to parse it, you also have to understand it to some degree. In other words, in order to be able to build Java, you have to be able to compile Java[2]!

Even in relatively simple Java builds, SCons' pseudo-compilation fails to determine which .class files will be created. This means it can end up creating nodes in the DAG for .class files that will never be built, and failing to create nodes for .class files that will, which can ultimately cause the build to fail.

So perhaps we could modify the DAG again to have some kind of lazy/placeholder node?

java DAG

Of course, there's still the problem of just how many such nodes we would need in any given graph. Perhaps a placeholder node could represent multiple files-to-be? But now we're really twisting the original DAG abstraction out of shape.

Before I go on, I'll just note that this kind of problem isn't unique to Java. One can imagine builds where a code generation step takes an XML file as input and produces multiple C source files as output. In this case, to determine the names of the generated files, you'd need an XML parser at the very least (and probably something like an XPath implementation, too).

While such build steps probably do have deterministic output(s), they can often be impractical to deduce upfront.
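To make that concrete, here's roughly what a scanner for such a generator would have to do. The <message> element and the name-to-file convention are invented purely for illustration; the point is that you need a real XML parser just to predict the output names:

# sketch: predict the outputs of a hypothetical XML-driven code generator,
# where each <message name="X"> element yields X.c and X.h
import xml.etree.ElementTree as ET

def predicted_outputs(xml_path):
    root = ET.parse(xml_path).getroot()
    outputs = []
    for message in root.iter('message'):
        name = message.get('name')
        outputs.extend([name + '.c', name + '.h'])
    return outputs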

An alternative build model

But what if there's another way? Why don't we just get rid of all the target nodes completely, leaving only the process nodes behind?

different java DAG

We've still got a DAG here, but it's almost an inverse of what we had originally. Now the edges of the graph are more like UNIX pipes, directing data between the process nodes.

The nice thing about a system like this is that each process, once complete, can look to see what it's produced and send those files on to the next step. For example, a Java compiler node might scan the output directory for foo(\$.+)?\.class when compiling foo.java. Things are more complicated when multiple .java sources are involved, but I think we're in a better position to handle that now, as all .java files can be looked at by a single build process at once.
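Here's a sketch of what such a compiler process might do. The only real tools involved are javac and the file system; everything else, the next_process object in particular, is invented for illustration:

# pseudo-python sketch: a compiler process that discovers its own outputs
import os
import re
import subprocess

def compile_and_forward(java_sources, out_dir, next_process):
    subprocess.check_call(['javac', '-d', out_dir] + java_sources)
    for src in java_sources:
        stem = os.path.splitext(os.path.basename(src))[0]
        pattern = re.compile(re.escape(stem) + r'(\$.+)?\.class$')
        produced = [os.path.join(out_dir, f)
                    for f in os.listdir(out_dir)   # ignoring package sub-directories for brevity
                    if pattern.match(f)]
        next_process.send(produced)                # e.g. the jar-packaging process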

Of course, a C compiler process might also be able to deduce the output(s) rather simply by analysing the inputs, but that's also fine. A typical C build involving a couple of libraries and an executable might look like this:

different static library DAG

Here, I'm imagining that the header scan process adds a list of headers as meta-data to the file objects being passed in from the file system. This will allow the compiler to decide which source files really do need compiling. The header scan could be rolled into the compiler process, but in my prototyping I've found that a regular-expression-based scanner that employs some judicious caching can be much faster than something like g++ -M. So I think it would be a good idea to keep the header-scanning process modular.
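As an illustration, the kind of scanner I have in mind looks something like this. It's deliberately simplified: it only handles quoted includes, doesn't recurse into the headers it finds, and caches on modification time:

# sketch of a regex-based #include scanner with a simple mtime-keyed cache
import os
import re

_include_re = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)
_cache = {}   # path -> (mtime, [headers])

def scan_headers(path):
    mtime = os.path.getmtime(path)
    cached = _cache.get(path)
    if cached and cached[0] == mtime:
        return cached[1]
    with open(path) as f:
        headers = _include_re.findall(f.read())
    _cache[path] = (mtime, headers)
    return headers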

Blemishes

Despite this system's ability to handle unpredictably named build products, I still haven't convinced myself that this is a silver bullet for builds.

One thing that's needed is a caching mechanism that remembers what's been built previously and can pass on files that are already up to date. For example, if foo.obj was built last time around, we want to avoid building it again if its source file and #included headers haven't changed, but we'll still need to pass it on to the librarian or linker. However, I don't think that's insurmountable and it's definitely the case that any non-trivial build system will need some kind of caching mechanism to maintain a reasonable level of performance. The cache could also be used to provide a list of files and folders to remove when a 'clean' is wanted.
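Something along these lines is what I have in mind. The class and its methods are invented names, and signing inputs by re-reading whole files is naive, but it shows the shape of the thing:

# sketch of a naive build cache: remember the signature of the inputs that
# produced each output, so unchanged products can be passed on untouched
import hashlib
import json
import os

class BuildCache(object):
    def __init__(self, path='.build_cache.json'):
        self.path = path
        self.entries = json.load(open(path)) if os.path.exists(path) else {}

    def _signature(self, inputs):
        digest = hashlib.sha1()
        for name in sorted(inputs):
            with open(name, 'rb') as f:
                digest.update(f.read())
        return digest.hexdigest()

    def up_to_date(self, output, inputs):
        return (os.path.exists(output) and
                self.entries.get(output) == self._signature(inputs))

    def record(self, output, inputs):
        self.entries[output] = self._signature(inputs)
        with open(self.path, 'w') as f:
            json.dump(self.entries, f)

    def outputs(self):
        # everything the cache knows about: candidates for removal on 'clean'
        return list(self.entries)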

I'm also conscious of the fact that I might be falling victim to the second-system effect after having used the useful-but-seriously-flawed slam for a while now. Perhaps the lazy/placeholder node idea is actually just fine and I shouldn't bend over backwards to accommodate strange build requirements? Java developers seem happy enough with IDE-driven and/or non-incremental builds, after all.

Footnotes
  1. The reasons I have shied away from SCons are probably a subject for another post.
  2. Little wonder that programmers writing Java tend to lean on IDEs quite heavily.

Comments

Christopher Lord

[21/07/2010 at 23:35:09]

It's not really the DAG that's causing some of the issues you're mentioning. Some are due to the convention of putting the arrows from the sources to the products. By just flipping the direction so that products point at their requirements, a whole family of annoying problems goes away. To test this theory, I ported a large source base (500kloc) to use the tup system (http://gittup.org/tup/) and we got faster builds and much more predictable results. Unfortunately, we couldn't use it in our live system due to integration issues, but damn impressive.

Cameron Foale

[22/07/2010 at 01:25:02]

For the dll/lib example, could you possibly have an abstract "library" node, which has both of your .obj files as dependencies, and which is the dependency of both the dll and the lib? I.e. put a node where the arrows cross over. Is this what you meant by a placeholder node?

@Christopher, Edd's arrows are going the same way as tup's, just his diagrams are up the other way.

Edd

[22/07/2010 at 18:28:28]

You probably shouldn't pay too much attention to the direction of the arrows here. I arbitrarily chose A -> B to mean B depends on A (in the traditional DAGs). By the time we reach the later graphs, the edges mean something else entirely, anyway.

@Christopher: I like the sound of tup and I'm sure it's great for a large subset of builds, but I don't see how it would cope with something like Java, where the names and number of nodes aren't known in advance.

@Cameron: putting a node at the point where the edges cross would indeed be a workaround in the lib/dll case. But then rather than representing dependency information, edges would instead represent a kind of "happens before" relationship (in the general case). It was by heading down this road that I arrived at the alternative scheme.

By "placeholder" nodes, I mean nodes that represent files whose names/locations you don't know in advance of running the build.

For example, compiling a given .java source file might produce 5 .class files. Without partially compiling foo.java, I can't predict the paths of those files and thus can't put corresponding nodes in a DAG in advance. It might be possible to have a kind of node that's anonymous until the .class file has been generated -- a placeholder.

However, since it's not only hard to predict the names of the .class files, but also how many will be generated when compiling a given .java file, placeholder nodes aren't a sufficient enhancement in themselves to be able to represent Java builds in a traditional DAG-driven build system; how many placeholder nodes should one add for a .java file?

Also, I suspect it's probably not a good idea to add nodes to a graph that's potentially being processed concurrently. It could be possible though, I haven't given that much thought.

Nadav

[24/07/2010 at 05:11:38]

I enjoyed reading this article. I think that the example with link.exe, which creates two output files, is the exception. Most tools actually create one file. I would think that there should be a way to get around it. In gcc-world, this is never the case. Every tool in the build process creates exactly one output file.
BTW, any thoughts about CMake? I don't think that it is fundamentally different from the tools you described (make, SCons, etc.)

Edd

[24/07/2010 at 10:16:42]

Hey Nadav. The dll/lib example is probably the exception in the C and C++ world, but in general this kind of thing perhaps isn't that uncommon.

Java is the obvious case mentioned in the post, but the same is true for many (all?) of the languages that target the JVM. For example, I know the Groovy compiler can produce multiple .class files per source file when closures and suchlike are present.

Haskell is another language that I've been playing with a little recently. GHC will output a .hi file and a .o file for each Haskell source file.

There are indeed workarounds when using a traditional DAG-driven build (such as the one Cameron suggested) and that's how we'd see them in the C and C++ world -- as workarounds -- because these cases are rather rare, as you say. But when using some other languages, these cases become the norm. Employing a workaround would be akin to making the whole build a hack.

But I will say this: I *think* it's the case that build systems such as Maven and Ant, which are commonly used in the Java community, aren't clever enough to do minimal rebuilds anyway. From what I can gather they seem to have the .jar file depend directly on the .java sources. So whenever any .java file is changed, all the Java sources are recompiled into however many .class files, which are then re-archived into a .jar file. There's no reason that couldn't be done in a DAG-style system. So maybe I don't need to stray too far from a standard DAG after all? It would be no worse than the existing tools for Java builds.

As for CMake, I've heard it's pretty powerful. There are some things that put me off, such as the fact that it doesn't seem to have a fully-fledged programming language underpinning it. Increasingly, I think that's a very important requirement. Also, I really don't like the fact that it produces makefiles/projects/whatever, as I've really grown to dislike makefiles. Qt's qmake also does a similar thing with respect to makefile generation, and when you/it get the dependencies wrong, it's a pain to debug.

However, I can't say I've played with CMake enough to present a fair opinion, so take these comments with a grain of salt. Perhaps I'll give it a go this week.

Brendan Miller

[27/07/2010 at 17:49:14]

As for SCons, I think it's kind of misleading for them to claim Java support. A couple of years back, working in a mixed Java/C++ code base, I tried using SCons for my Java as well. SCons kept blowing up, and it took me a long time to find out that they just didn't know how to handle inner classes. I wish they could be bothered to put a warning label on their Java docs.

I think the long term solution is for programming language designers to design languages so they don't need a separate build process. If you look at Python, the problem is solved out of the box, because the runtime does compilation for you.

Really, I'd like to do away with build scripts, project files, IDEs, etc.
