generalized makefiles

May 18th, 2009

Build systems are probably not the most beloved pieces of machinery in this world, but hey we need them. If your compiler doesn't resolve dependencies, you need a build system. You may also want one for any repeated task that involves generating targets from sources as the sources change over time (building dist packages, xml -> html, latex -> pdf etc).

[diagram: generalized_makefiles_single]

Fittingly, there are quite a few of them. I haven't done an exhaustive review, but I've mentioned ant and scons in the past. They have their strengths, but the biggest problem, as always, is portability. If you're shipping java then having ant is a reasonable assumption. But if not... The same goes for python, especially if you're using scons as the build system for something that generally gets installed before "luxury packages" like python. Besides, scons isn't popular. I also had a look at cmake, which is disgustingly verbose.

Make is the lowest common denominator and thus the safest option by far. So over the years I've tried as best I can to cope with it. Fortunately, I tend to have fairly simple builds. There's also autotools, but for a latex document it seems like overkill, to put it politely.

One to one, single instance

So what's the problem here anyway? Let's use a simple example, the randomwalks code. The green file is the source. The orange file is the target. And you have to go through all the yellow ones on the way. The problem is that make only knows about the green one. That's the only one that exists.

So the simplest thing you can do is state these dependencies explicitly, pointing each successive file at the previous one. Then it will say "randomwalks.s? That doesn't exist, but I know how to produce it." And so on.

targets := randomwalks

all: $(targets)

randomwalks : randomwalks.o
	cc -o randomwalks randomwalks.o

randomwalks.o : randomwalks.s
	as -o randomwalks.o randomwalks.s

randomwalks.s : randomwalks.c
	cc -S -o randomwalks.s randomwalks.c

clean:
	rm -f *.o *.s $(targets)

Is this what we want? No, not really. Unfortunately, it's what most make tutorials (yes, I'm looking at you, interwebs) teach you. It sucks for maintainability. Say you rename that file. Have fun renaming every occurrence of it in the makefile! Say you add a second file to be compiled with the same sequence. Copy and paste? Shameful.

One to one, multiple instances

[diagram: generalized_makefiles_multiple]

It's one thing if the dependency graph really is complicated. Then the makefile will be too, that's unavoidable. But if it's dead obvious like here, which it often is, then the build instructions should mirror that. I run into a lot of cases where I have the same build sequence for several files. No interdependencies, no multiple sources, precisely as shown in the picture. Then I want a makefile that requires no changes as I add/remove files.

I've tried and failed to get this to work several times. The trick is that you can't use variables, you have to use patterns. Otherwise you break the "foreach" logic that runs the same command on one file at a time. But patterns are tricky to combine with other rules. For instance, you can't put a pattern as a dependency of all.

At long last, I came up with a working makefile. Use a wildcard and substitution to manufacture a list of the target files. Then use patterns to state the actual dependencies. It's also helpful to unset .SUFFIXES so that the default patterns don't get in the way.

targets := $(patsubst %.c,%,$(wildcard *.c))

all: $(targets)

% : %.o
	cc -o $@ $<

%.o : %.s
	as -o $@ $<

%.s : %.c
	cc -S -o $@ $<

clean:
	rm -f *.o *.s $(targets)

.SUFFIXES:

Many to one

[diagram: generalized_makefiles_manytoone]

What if it gets more complicated? Latex documents are often split up into chapters. You only compile the master document file, but all the imports are dependencies. Well, you could still use patterns if you were willing to use article.tex as the main document and stash all the imports in article/.

This works as expected, $< gets bound to article.tex, while the *.tex files in article/ correctly function as dependencies. Now add another document story.tex with chapters in story/ and watch it scale. :cap:

targets := $(patsubst %.tex,%.pdf,$(wildcard *.tex))
 
all: $(targets)
 
%.pdf : %.tex %/*.tex
	pdflatex $<

clean:
	rm -f *.aux *.log *.pdf

Many to many

Latex documents don't often have interdependencies. Code does. And besides, I doubt you want to force this structure of subdirectories onto your codebase anyway. So I guess you'll have to bite the bullet and put some filenames in your makefile, but you should still be able to abstract away a lot of cruft with patterns. Make also has a filter-out function, so you could state your targets explicitly, then wildcard on all source files and filter out the ones corresponding to targets, and use the resulting list as dependencies. Obviously, you'd have to be willing to use all non-targets as dependencies to every target, which yields some unnecessary builds. But at this point the only alternative is to maintain the makefile manually, so I'd still go for it on a small codebase.
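As a sketch of that last idea (the target names here are made up, and this assumes a flat directory of .c files), the filter-out approach could look something like this:

```make
# Hypothetical sketch: targets listed by hand, every other .c file
# treated as a dependency of (and compiled into) each target.
targets := prog1 prog2

# All source files minus the ones corresponding to targets.
deps := $(filter-out $(addsuffix .c,$(targets)),$(wildcard *.c))

all: $(targets)

% : %.c $(deps)
	cc -o $@ $< $(deps)

clean:
	rm -f $(targets)

.SUFFIXES:
```

As noted, this links every non-target source into every target and rebuilds all targets when any shared file changes, which is exactly the unnecessary-builds tradeoff described above.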

PS. This was the first time I used kivio to draw the diagrams. It works quite well and is decent on functionality, even if the user interface is a bit awkward. The rendering clearly leaves something to be desired, though.

love the Rybak story, don't like Rybak

May 17th, 2009

It has come to pass.

I would say it's hard to believe that it's gone this far, but truly it isn't. Eurovision has always been garbage so I'm not particularly shocked at this outcome. The problem with Eurovision isn't the music, it's the voting. Every time there are some good acts in it, but the votes fall pretty much as the wind blows; it's by no means meritocratic. (That is, even looking past the blatant political voting and back room deals between neighboring countries.) So if you watch it you'll only be pissed that the best songs never get recognized.

But then they picked Rybak. Two-three months ago I'd never heard of him. I was visiting Norway and was told that the biggest celebrity in the country right now is a guy of Belarusian origin who plays the violin. Sweet! What's not to like about that?

It's a feel good story, almost makes me feel like it could have been me. I'd like nothing better than to see someone like this get recognized. Norway is a great place, but as all immigration stories go, there are people who look down on those from certain geographic origins; it's pure prejudice. And he plays violin to boot, an eccentric interest to say the least. Which I did too once! And this is the guy the Norwegian public loves to death.

All was rosy. And then I heard his music. :/

Awful stuff. The melody must be derived from some sort of folk music, as evidenced by the lame dancing. I hate folk music. Then there is the violin, which he doesn't really do much with, except play the theme. And even if you liked it up to that point, then he starts singing. Borderline off key. It doesn't even meet the standard of an average Eurovision entry.

Needless to say, pop music is not about music, it's about image. And he certainly has that going for him. Maybe if he gets a good producer behind him he can crank out something actually worth playing on the radio. Having come this far, he can't be completely devoid of talent, can he? Just talent misdirected, that's my guess.

The Rybak story is a great story, but it sorely needs an injection of credibility. Maybe one day he could be celebrated on musical merit?

assembly primer

May 14th, 2009

So, assembly.. :nervous: Not exactly second nature for someone who lives in the dynamic language world. Even c seems like a million miles away. But even though I'm not yearning to work in assembly, it would be nice to know something about it. If for no other reason than the fact that all my [compiled] code eventually ends up in assembly (or bytecode, which is not too distant from assembly).

No doubt one way to learn assembly would be to dive in at the deep end with the right manuals. But I've already read about all the opcodes once and it didn't tell me anything, it was all too foreign. So I think a much better way is to see how assembly gets to be the way it is, and maybe it's possible to make sense of that. For this we'll need a super simple example to dissect, because assembly code is much longer than c.

Random walks

The idea of a random walk is that you start out at position=0, and then you flip a coin. If it's heads, take one step to the right (position++). If it's tails take one step to the left (position--). Repeat this for as long as it amuses you. Eventually you terminate and check where you've ended up. It could be position==0, which means you had just as many heads and tails. Or it could be position==-2, meaning you had more tails, etc.

We'll see this algorithm in c, and then compile it to assembly. It's obviously very simple; the only complication is the need for a random number generator. Now, there is a rand() function in stdlib.h, but you still have to seed it with *something* random, so we'll just read bytes from /dev/urandom instead. Each byte is a number in [0,255], so we'll divide the range in two and see which side of 127 the number falls on.

Here it is in c:

#include <stdio.h>
#include <stdlib.h>

void walk() {
    int pos = 0;
    int steps = 100;

    FILE *rand_stream = fopen("/dev/urandom", "r");

    for (int i=0; i<steps; i++) {
        int x = fgetc(rand_stream);
        pos += x < 127 ? -1 : 1;
    }

    fclose(rand_stream);

    printf("Steps taken: %d\n", steps);
    printf("End position: %d\n", pos);
}

int main() {
    walk();
}

Compile with:

gcc -std=c99 -o walks randomwalks.c

But we also want the assembly code:

gcc -std=c99 -S randomwalks.c

The first WTF in assembly is that, regardless of how your c code is ordered, it doesn't come out in that order. So the first thing to figure out is where the heck execution starts. There should be a function called main somewhere..

main:
    leal    4(%esp), %ecx       /* execution begins */
    andl    $-16, %esp
    pushl   -4(%ecx)
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %ecx
    subl    $4, %esp
    call    walk                /* call walk with no arguments */
    movl    $0, %eax            /* exit code := 0 */
    addl    $4, %esp
    popl    %ecx
    popl    %ebp
    leal    -4(%ecx), %esp
    ret                         /* execution ends */

The only thing that happens here that we can recognize from the c program is the call to walk. The other stuff is all bookkeeping that has to do with how programs start and end. I have no idea what this stuff is for.

Should we just go to walk then? Well, first it seems prudent to mention that all our string literals (which we'll be needing soon) are stored separately from code.

.LC0:
    .string "r"
.LC1:
    .string "/dev/urandom"
.LC2:
    .string "Steps taken: %d\n"
.LC3:
    .string "End position: %d\n"

Let's go to walk now.

Again, the first few instructions have to do with bookkeeping and are not commented. But soon we reach statements from the c program. The first interesting event is the literal 0, written $0, stored on the stack at the location -4(%ebp), which is an offset calculated from the base pointer %ebp. The base pointer has something or other to do with where functions keep their local variables, making sure a successive call doesn't clobber the environment of the caller.

But anyway, -4(%ebp) is the address where literal value 0 is. And this represents the integer variable pos. We know that because this is the first thing that happens in the c program, so it's also the first thing that happens here. Until something new gets put into -4(%ebp) we know that pos==-4(%ebp).

The same thing happens with steps, stored one slot over. The addresses increase by multiples of 4 because this is a 32bit machine and 32bits/8bits/byte = 4 bytes.

Next, two string constants (their addresses, actually) are put on the stack. They have to be set up in this order to call a c function, namely fopen. The result of this call is found in %eax, and represents the variable *rand_stream. This value is then stored on the stack, at address -12(%ebp).

Finally, we've reached the for loop. The next instruction assigns i=0, and then we jump to the loop condition.

walk:                           /** walk function body **/
    pushl   %ebp
    movl    %esp, %ebp
    subl    $56, %esp
    movl    $0, -4(%ebp)        /* (pos = 0)        -> -4(%ebp) */
    movl    $100, -8(%ebp)      /* (steps = 100)    -> -8(%ebp) */
    movl    $.LC0, 4(%esp)      /* "r"              -> stack[1] */
    movl    $.LC1, (%esp)       /* "/dev/urandom"   -> stack[0] */
    call    fopen               /* Call fopen with 2 args from stack */
    movl    %eax, -12(%ebp)     /* *rand_stream     -> -12(%ebp) */
    movl    $0, -16(%ebp)       /* (i = 0)          -> -16(%ebp) */
    jmp     .L2                 /* goto loop condition */

In case you're wondering, .L2 is not a function declaration. It's a label, ie. an address you can jump to. Assembly makes no distinction between a label that represents a function and one that doesn't.

What we have here is the loop condition check, followed by the rest of the walk function (excluding the loop body). In other words, suppose the condition were never true, this is how we would execute walk.

We load the value of the variable i into %eax and compare it to the value of steps. If i is still less than steps (ie. the body should be executed) we jump to the loop body.

Subsequently, we line up the argument *rand_stream. The extra step of putting it into %eax first seems silly; perhaps it could be done in one instruction. We then call fclose. The same happens with the calls to printf.

The last two instructions again have to do with bookkeeping and don't correspond to any statements in our c program.

.L2:                            /** loop condition check **/
    movl    -16(%ebp), %eax     /* i                -> %eax */
    cmpl    -8(%ebp), %eax      /* compare i to steps (sets flags) */
    jl      .L5                 /* if (i < steps) goto loop body */

                                /** After for loop **/
    movl    -12(%ebp), %eax     /* *rand_stream                 */
    movl    %eax, (%esp)        /*                  -> stack[0] */
    call    fclose              /* call fclose with 1 arg from stack */
    movl    -8(%ebp), %eax      /* steps                        */
    movl    %eax, 4(%esp)       /*                  -> stack[1] */
    movl    $.LC2, (%esp)       /* "Steps taken.."  -> stack[0] */
    call    printf              /* call printf with 2 args from stack */
    movl    -4(%ebp), %eax      /* pos                          */
    movl    %eax, 4(%esp)       /*                  -> stack[1] */
    movl    $.LC3, (%esp)       /* "End position.." -> stack[0] */
    call    printf              /* call printf with 2 args from stack */
    leave
    ret                         /* return from walk function */

What we have left, then, is the loop body. It looks messy because we have some branching here. But it's not too bad.

The first thing being done is to get ahold of *rand_stream and call fgetc. The result is found in %eax and represents the variable x. We compare this value to the literal $126 (the c program's 127; the compiler rewrote x < 127 as the opposite test x > 126), to see if we should produce the increment 1 or -1.

If x turns out to be below 127 we store the value -1 in -36(%ebp). This value is part of an expression and does not represent any variable in the c program. We then jump over the next assignment. Alternatively, we run the assignment of 1 to -36(%ebp) and skip the first assignment.

Ending up in .L4, we add the value in -36(%ebp), which is either 1 or -1, to pos. Then we increment the loop counter i.

In the full assembly file, .L2 follows after this, which means we re-evaluate the loop condition and possibly execute the loop again. So everything checks out. :)

.L5:                            /** loop body **/
    movl    -12(%ebp), %eax     /* *rand_stream                 */
    movl    %eax, (%esp)        /*                  -> stack[0] */
    call    fgetc               /* call fgetc with 1 arg from stack */
    movl    %eax, -20(%ebp)     /* x                -> -20(%ebp) */
    cmpl    $126, -20(%ebp)     /* compare x to 126 (sets flags) */
    jg      .L3                 /* if (x > 126) goto assign 1 */

                                /** x is <127  => assign -1 **/
    movl    $-1, -36(%ebp)      /* -1               -> -36(%ebp) */

    jmp     .L4                 /* goto pos += ... */

.L3:                            /** x is >=127 => assign 1 **/
    movl    $1, -36(%ebp)       /* 1                -> -36(%ebp) */

.L4:                            /** pos += ... **/
    movl    -36(%ebp), %eax     /* add(1 or -1,              */
    addl    %eax, -4(%ebp)      /*              pos) -> pos  */
    addl    $1, -16(%ebp)       /* add(1, i)         -> i */

So that's a pretty non trivial chunk of code for a trivial c program. Needless to say, it really helps to have the source code when you're looking at assembly. It gets worse when programs are optimized, because the instructions will be pruned and rearranged, so it's gonna be harder to understand.

ruby compiler series: annotated git history

May 13th, 2009

I've been reading along with Vidar Hokstad's rather excellent Writing a compiler in Ruby bottom up. It's a 20 part (so far) series documenting his effort to hack together a Ruby hosted compiler that in the end will compile a language similar to Ruby into x86 assembly.

Compilers are complicated beasts that take a lot of planning to build. Now I'm not saying Vidar didn't do all the planning, but what makes this series especially palatable is the fact that he's writing it literally bottom up, through what you might call evidence based hacking. That is, compile the very simplest thing you can (starting with an empty ELF binary), and then see what gcc produces. From there on, add print "Hello World" and see how the code changes and so forth, adding new constructs. This means you can read along even if you don't know any assembly (like yours truly) and take it in small steps without first having to absorb the complexity of a whole compiler.

It's a great learning opportunity, seeing as how each step is a working compiler one iteration up. You can read along Vidar's blog with the git diff side by side and see how the assembly is changing. To make this a bit clearer I've forked his repo and annotated the early commits with tags (where they were missing) and made sure the customary make run / make clean work as expected. I've also added some commit messages that tell you exactly what the iteration achieves at each particular step, so you can browse the history and figure out say "where do I look to see how to do a while construct".

I've annotated the first 15 steps (the rest were already tagged):

compiler-in-ruby

Given how git works, where every commit is hashed from the sum total of the previous ones, the only way I could do this was by rewriting the git history, which is not ideal. So Vidar's commit objects won't match mine, but all I've done is cherry pick them off his branch and add some annotations. It's all still there. :)
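As a toy illustration of why the hashes can't match (a throwaway repo, nothing to do with the actual compiler history): cherry-picking recreates the same change as a brand new commit object, so the commit hash changes even though the content is identical.

```shell
# Toy demonstration: a cherry-picked commit gets a new hash even though
# it introduces the exact same change. Everything happens in a temp dir.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo

echo one > file; git add file; git commit -qm "step 1"
git branch annotated                  # fork point, like the original history
echo two >> file; git commit -qam "step 2"
orig=$(git rev-parse HEAD)

git checkout -q annotated
git cherry-pick -n "$orig"            # apply the change without committing
git commit -qm "step 2 (annotated)"   # same change, new message, new object

test "$(git rev-parse HEAD)" != "$orig" && echo "different commit objects"
test "$(git rev-parse 'HEAD^{tree}')" = "$(git rev-parse "$orig^{tree}")" \
    && echo "identical trees"
```

The final two checks show the point: the commit objects differ, but the trees they snapshot are byte for byte the same.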

repeatable one off builds with 'emerge'

May 4th, 2009

I have to say I do actually use quite a lot of software. Not all the time, but I like to try things. I'm not one of those "email and web" people that you hear about. And because I like to be able to run some application that's not widespread without it causing me any extra hassle, I'm really only interested in the distros that have the largest repos, such as Gentoo and Ubuntu.

Even so, there are those package requirements that fall through the cracks. It's hard to get a hold of bleeding edge builds sometimes. Even if there are some dedicated packagers for a particular project working for a particular distro, these ad-hoc efforts don't cover everything. Not to mention the fact that unofficial builds on Ubuntu are not necessarily easily reversible. I spent some time recently working on a project in Mono, and there's just no real option for me to get close-to-svn builds. Even Gentoo is quite far behind and I never have any luck with the absolute latest ebuilds.

Hence the svn builds.

In any case, there will always be times when custom builds are necessary. After all that's what developers do: take on the scary world of unstable so that in the end we can make sure we ship something users can run both on the current version of their distro, and (hopefully) for some time to come.

To that end I wrote a messy perl script to do mono svn builds. That was over a month ago. I was looking to add a feature just the other day and it hit me just how awful perl is. Every time I come back to it after a period of absence I feel like a stranger in the valley of death, with the auto expanding arrays and other land mines. It's not the first time I've written one of these build scripts, I need them from time to time while I'm on some project. And I often end up with a couple bash scripts so I can reproduce these installs later on.

But do I really want to be writing these ad-hoc installers every time? That's the thing about one off builds, they're not really one off. Often you want to be able to reproduce it.

Nothing whatsoever to do with Gentoo portage!

I know I should have picked a different name, but the parallels are so clear that I just couldn't resist. emerge is a simple package installer. Emphasis on installer, it does not manage packages, it does not even keep track of what has been installed. No notion of versions, no safety checks, just the user supplied shell commands wrapped in a nicer package.

  • Phases: fetch, configure, build, install
  • Other actions: list, search, pretend
  • Switches: nodeps, revision
  • Fetch support: git, svn, archive (tar,gzip,bzip2,zip)

Here's the idea:

  1. Write a build file containing a number of packages.
  2. Record dependencies between them.
  3. Use emerge to run any combination of actions on them.
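The dependency handling in step 3 boils down to a topological sort over the build file's packages. A minimal sketch of that idea (not emerge's actual code, and the package names are just for illustration):

```python
# Hypothetical sketch of emerge-style dependency ordering: a depth-first
# topological sort over a dict of package -> {"deps": [...]}.
def build_order(packages, wanted):
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in packages[name].get("deps", []):
            visit(dep)        # deps build first; a cycle would loop forever
        order.append(name)

    for name in wanted:
        visit(name)
    return order

packages = {
    "libgdiplus": {},
    "mcs": {"deps": ["libgdiplus"]},
    "mono": {"deps": ["libgdiplus", "mcs"]},
}
print(build_order(packages, ["mono"]))  # ['libgdiplus', 'mcs', 'mono']
```

Note there's no cycle detection here, which matches the tool's attitude: put a cycle in the depgraph and weird things will happen.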

It's really that simple. To use the running mono example, here's a taste:

[screenshot: emerge_search]

The build file is an optional argument if there's only one in the current directory. Most of the command line options are identical to portage:

[screenshot: emerge_opts]

And if you want to just see the depgraph that works too. Just don't put cycles in it or weird things will happen.

[screenshot: emerge_pretend]

And if you need to set certain environment variables during building/running (so that the package doesn't meddle with your system) it supports that too.

Build files

Build files are basically just a listing of the shell commands necessary to build the package. I workshopped the idea of writing them in YAML, since it's a nice and compact format. But I decided against it, because it's not really buying me anything, and since YAML's grammar is rather quirky the parse errors are harder to understand than Python's. So the build files are just regular Python modules. Here's an excerpt from mono.py:

src_path = "/ex/mono-sources"
ins_path = "/ex/mono"
svnbase = "svn://anonsvn.mono-project.com/source/trunk"

conf = "./autogen.sh --prefix=%s" % ins_path
build = "make"
install = "make install"

env = os.environ.get

for v in ("libgdiplus", "mcs", "olive", "mono", "debugger", "mono-addins",
          "mono-tools", "gtk-sharp", "gnome-sharp", "monodoc-widgets",
          "monodevelop", "paint-net"):
    exec("%s='%s'" % (v.replace("-", "_"), v))

project = {
    "src_path": src_path,
    "ins_path": ins_path,
    
    "environment": {
        "DYLD_LIBRARY_PATH": "%s/lib:%s" % (ins_path, env("DYLD_LIBRARY_PATH","")),
        "LD_LIBRARY_PATH": "%s/lib:%s" % (ins_path, env("LD_LIBRARY_PATH","")),
        "C_INCLUDE_PATH": "%s/include:%s" % (ins_path, env("C_INCLUDE_PATH","")),
        "ACLOCAL_PATH": "%s/share/aclocal" % ins_path,
        "PKG_CONFIG_PATH": "%s/lib/pkgconfig" % ins_path,
        "XDG_DATA_HOME": "%s/share:%s" % (ins_path, env("(XDG_DATA_HOME","")),
        "XDG_DATA_DIRS": "%s/share:%s" % (ins_path, env("XDG_DATA_DIRS","")),
        "PATH": "%s/bin:%s:%s" % (ins_path, ins_path, env("PATH","")),
        "PS1": "[mono] \\w \$? @ ",
    },

    "packages": {
        libgdiplus: {
            "svnurl": "%s/%s" % (svnbase, libgdiplus),
            "configure": conf,
            "build": build,
            "install": install,
        },
        mcs: {
            "svnurl": "%s/%s" % (svnbase, mcs),
            "deps": [libgdiplus],
        },
        olive: {
            "svnurl": "%s/%s" % (svnbase, olive),
            "deps": [libgdiplus],
        },
        mono: {
            "svnurl": "%s/%s" % (svnbase, mono),
            "configure": conf,
            "build": "make get-monolite-latest && %s" % build,
            "install": install,
            "deps": [libgdiplus, mcs, olive],
        },

The only thing that gets read is the dict in the global scope called project. It has an optional member called environment, and the packages declared under packages. It should be pretty easy to figure out. If src_path isn't set, fetching defaults to /tmp.
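Since build files are plain Python modules, loading one could be as simple as the following sketch (this is the idea, not emerge's actual loader; `load_build_file` is a hypothetical name):

```python
# Hypothetical sketch: execute a build file and pull out its `project`
# dict, the only global name that actually gets read.
import os

def load_build_file(path):
    scope = {"os": os}          # build files use os.environ, so provide os
    with open(path) as f:
        exec(f.read(), scope)   # run the module's top level
    project = scope["project"]
    project.setdefault("src_path", "/tmp")  # fetching defaults to /tmp
    return project
```

Running the file with exec rather than importing it keeps the build file out of sys.modules and lets it live anywhere on disk.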

Anyway, at about 500 lines it's decent bang for buck I think. It's a classic 80/20 effort, getting 80% payoff for 20% effort.

Love it or hate it, here's the goodies: