Archive for 2013

luna learns to crawl

October 12th, 2013

So, writing an interpreter. Where do we even start?

Why, with the essentials. While an interpreter is nothing more than a program that reads your program and runs it, that is a pretty complicated undertaking. You could conceivably write an interpreter that reads a program character by character and tries to execute it, but you would soon run into all kinds of problems, like: I need to know the value of this variable, but that's defined somewhere else in the program, and: I'm executing a function definition, which doesn't have any effect, but I need to remember that I've seen it in case anyone is calling the function later in the program.

In fact, this approach is not wrong, it's basically what your machine really does. But your machine is running machine code, and you're not, and therein lies the difference. People don't write programs in machine code anymore. It's too hard. We have high level languages that provide niceties like variables - define once, reference as often as you like. Like functions. Like modules. All of these abstractions amount to a program that is not a linear sequence of instructions, but a network of blocks of instruction sequences and various language specific structures like functions and classes that reference each other.

And yet, since many languages are compiled down to machine code, there must be a way to get from here to there? Well, compiler writers have decided that doing so in one step is needlessly complicated. And instead we do what we always do in software engineering when the problem is too hard: we split it into pieces. A compiler will run in multiple phases and break down the molecule into ever simpler compounds until it's completely flat and in the end you can execute it sequentially.

Interpreters start out the same way, but the goal is not machine code, it's execution of the program. So they bottom out in an analogous language to machine code: byte code. Byte code can be executed sequentially given a virtual machine (language specific) that runs on top of a physical machine (language agnostic) that knows a few extra things about the language it's running.

Phases

Parsing - Read the program text and build a structured representation (abstract syntax tree) out of it.

Compilation - Compile the AST to bytecode instructions that can be executed on a stack machine.

Interpretation - Pretend to be a machine and execute the bytecode.

The parser is the thing that will read the program text, but it needs to know the syntax of the programming language to recognize what it's reading. This is supplied in the form of a grammar. The parser also needs to know the structure of your abstract syntax for it to build those trees for you. And this is where it gets tricky, because the AST is one of the central data structures in an interpreter, and the way you design the AST will have a big impact on how awkward or convenient it will be to do anything meaningful with it. Even so, the AST much match the grammar very closely, otherwise it's going to be very hard to write that parser. So the optimal would be to design the AST first, then write a grammar that matches it to the letter (even better: you could generate the grammar from the AST). Well, there are decades of research that explain why this is hard. Long story short: the more powerful the parser the more leeway you have in the grammar and the easier it is to get the AST you want. Chances are you will not write the parser from scratch, you'll use a library.

The compiler is the thing that traverses an AST and spits out bytecode. It's constrained by two things: the structure of the AST, and the architecture of the vm (including the opcodes that make up your bytecode). Compilers really aren't generic - they need to know all the specifics of the machine they are compiling for. For virtual machines this means: is it a stack machine or a register machine? What are the opcodes? How does the vm supply constants and names at interpretation time? How does it implement scope and modules?

Finally, the interpreter is the thing that emulates a machine and executes the bytecode. It will read the bytecode sequentially and know what to do with each instruction. It's really the only part of the entire program that interacts with its environment extensively. It's the thing that produces output, opens sockets, creates file descriptors, executes processes etc, all on behalf of your program. In fact, it would be more correct to say that these things are contained in the standard libraries of the language, but the interpreter is what orders them to happen.

luna 0.0.1

It's out. An interpreter for basic blocks. All your code runs in a single stack frame, so you can only run expressions, assignments and function calls. There is an object model for lua values. There is a pretty barebones standard library. And there are tests (py.test is making life quite good so far).

writing a lua interpreter

October 8th, 2013

I've been doing a bunch of reading around languages and interpreters lately, but I want to convert that insight into the kind of knowledge that only comes from the experience of implementation. I feel woefully unprepared for this, but on the other hand you can never prepare sufficiently for anything in life with research alone. I expect to hit many stumbling blocks, but that's kind of the point - without hitting them I wouldn't learn much.

Why lua?

Lua as a language has some interesting and unusual properties:

  • Designed to be small.
  • Designed to be portable.
  • Designed to be embeddable.

This is not just aspirational, lua really delivers on these promises.

How small?

  • the grammar has 21 keywords and a small set of productions,
  • 8 types: nil, boolean, number (double precision float), string, function, userdata, thread, and table,
  • a single kind of data structure: tables (associative arrays),
  • no type declarations (dynamically typed),
  • no manual memory management (garbage collected),
  • no object system,
  • no exceptions.

How portable? The lua interpreter is written in ANSI C. The implementors have consistently snubbed platform-specific services so that it builds just about anywhere, and from a single Makefile at that. That's right, no autotools. Try that with gcc.

How embeddable? In a word: very. The interpreter is only 15k sloc. The binary takes all of 6 seconds to build on my platform (linux x86, lua 5.2.2) and is 211kb. Lua is used more in embedded form that standalone and it's embedded in an enormous amount of software. Being small and portable makes it a great language to embed, because it's easy to learn and easy to use.

What's luna?

It's a little known fact that Lua, who is Brazilian, has an Italian sister called Luna. It's not hard to see why that would be - Luna has very little in common with her sister. She doesn't like to travel (not portable) and she doesn't like crowds (not embeddable). Luna has none of the qualities that make Lua so popular.

Not only that, Luna is very careless. When she's doing a task she doesn't care how long she takes, because life is good, so why hurry?

Of course, Python is no language to write an interpreter in. You should be using RPython or C. But it's the perfect language to prototype designs in when you're not really sure what you're doing yet.

Roadmap (tentative)

  • 0.0.1 - An interpreter that can execute a basic block. At this stage a program can consist of: assignment, expressions, function calls.
    • A parser and AST that cover all expressions and a small subset of statements.
    • A bytecode compiler.
    • A single stack frame virtual machine (stack based) with a single global environment.
  • 0.0.2 - Blocks and control flow: if, for, while, repeat.
    • A vm with multiple environments that understands lexical scope.
  • 0.0.3 - Function definitions.
    • A vm with multiple stack frames.

References

  1. Lua 5.1 Reference Manual
  2. The Implementation of Lua 5.0
  3. Where Lua Is Used

python bytecode: loops

August 18th, 2013

New to the series? The previous entry was part 4.

We have seen quite a bit of bytecode already - we've even seen the structure of a bytecode module. But we haven't seen loops yet.

A while loop

We'll start with the simplest kind of loop. It's not really a typical loop given that it breaks after a single execution, but it's enough to give us a look at the looping machinery.

def loop(x):
    while x > 3:
        print x
        break

Disassembly of loop:
  2           0 SETUP_LOOP              22 (to 25)
        >>    3 LOAD_FAST                0 (x)
              6 LOAD_CONST               1 (3)
              9 COMPARE_OP               4 (>)
             12 POP_JUMP_IF_FALSE       24

  3          15 LOAD_FAST                0 (x)
             18 PRINT_ITEM          
             19 PRINT_NEWLINE       

  4          20 BREAK_LOOP          
             21 JUMP_ABSOLUTE            3
        >>   24 POP_BLOCK           
        >>   25 LOAD_CONST               0 (None)
             28 RETURN_VALUE

The control flow is a bit convoluted here. We start off with a SETUP_LOOP which pushes a block onto the stack. It's not really clear to me why that's necessary, given that Python does not use blocks for scopes. But it might be that the interpreter needs to know which level of looping it's at.

We then load the variable x and the constant 3 and run COMPARE_OP. This opcode actually takes a parameter to tell it which operation to perform (greater than in this case). The result of that will be a boolean value on the stack.

Now we need to know whether we're going to execute the loop body or jump past the loop, so that's POP_JUMP_IF_FALSE, which may jump to location 24 where the loop ends.

Assuming we are in the loop body, we simply load the variable x and print it. Interestingly, the print statement requires two opcodes PRINT_ITEM and then PRINT_NEWLINE, which seems a bit over the top.

We now have a BREAK_LOOP instruction. Notice that if we were to ignore and execute the JUMP_ABSOLUTE just behind it that would return us to the loop predicate, and we might continue looping. But that's not supposed to happen after a break: a break ends the loop even if the loop predicate is still true. So this must mean that we won't reach JUMP_ABSOLUTE.

After this we execute POP_BLOCK which will officially end the loop by taking the block off the stack again.

A for loop

A for loop, then, is not very different. The main difference is that we are not looping on a boolean condition - we are looping over an iterable.

def loop(x):
    for i in range(x):
        print(i)

Disassembly of loop:
  2           0 SETUP_LOOP              25 (to 28)
              3 LOAD_GLOBAL              0 (range)
              6 LOAD_FAST                0 (x)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                11 (to 27)
             16 STORE_FAST               1 (i)

  3          19 LOAD_FAST                1 (i)
             22 PRINT_ITEM          
             23 PRINT_NEWLINE       
             24 JUMP_ABSOLUTE           13
        >>   27 POP_BLOCK           
        >>   28 LOAD_CONST               0 (None)
             31 RETURN_VALUE

To do that we have LOAD_GLOBAL to load the range function on the stack. This is an opcode we haven't seeb before, and it simply means this name comes from somewhere outside this module (the __builtin__ module in this case). We then load x and call the function. This produces a list.

Now, since Python uses iterators so heavily, the loop will use this method to move through the list. It means you could also loop over any other iterable object (tuple, dict, string, your own custom iterators etc). In fact, GET_ITER amounts to calling the iter function on the list (which returns an iterator object). And FOR_ITER calls the iterator's next method to get the next item.

We now have the first int in the list, and we bind it to the name i with STORE_FAST. From there on, we may use i in the loop body.

You will notice that there is something odd about the way i is manipulated. At location 16 the int is sitting on the stack, and gets bound to a name with STORE_FAST. This consumes it on the stack. We then immediately push it on the stack again with LOAD_FAST. These two instructions cancel each other out: we could remove them without changing the meaning of the program.

So why do we have to store and load? Well, imagine i were used again in the loop body - it would have be bound, right? So it could be optimized away in this case, but not in the general case.

python bytecode: object model

August 12th, 2013

New to the series? The previous entry was part 3.

Okay, so we've seen a fair amount of bytecode already. We know how modules, classes and functions work and today we'll study an example where it all comes together. It's more about consolidating our insights rather than learning new material.

A simple module

Right as you see this code you'll be thinking in bytecode already. We have three top level bindings:

  • pencils is bound to the integer 13
  • give_back is bound to a function object
  • Person is bound to a class object

pencils = 13

def give_back(x):
    return x

class Person(object):
    default_age = 7

    def __init__(self, age):
        self.age = age or self.default_age

    def had_birthday(self):
        self.age += 1

    def how_old(self):
        return self.age

Let's disassemble this module so we can complete the picture. Ned Batchelder was kind enough to provide some example code for this purpose.

We won't go through all of the bytecode, as it's a bit long. But this diagram illustrates some of the key components of this module. On display we have:

  • The module mod.py's code object. Two of its constants are code objects too:
    • The function give_back's code object.
    • The class Person's code object. it has three constants that are code objects, one of which is:
      • The method how_old's code object.

So yes, a .pyc file simply contains a code object (of type types.CodeType), which has a bunch of attributes. Not all of those attributes will be set - it depends on what kind of functional object (module, class, function, ...) the code operates on.

A function that uses local variables will store them in .co_varnames.

A class or a function will have a .co_name value.

Many code objects will have .co_filename value that tells you where its source code comes from.

And in many cases a code object will have other code objects in its .co_consts tuple, and those will be code objects representing classes or functions or what have you.

python bytecode: classes

August 11th, 2013

New to the series? The previous entry was part 2.

Today we're going to see how classes work. Classes are a bit tricky as they provide the glue that makes functions into methods. They are also created dynamically, as we saw last time, so they must have some compiled bytecode that gets executed when the class actually gets created.

We'll do this in several steps, because unlike functions classes don't come with code objects that we can inspect directly. So we'll use some trickery to see what happens step by step.

Executing a class body

Before we look at a class in constructed form, let's poke around its namespace.

We'll use this simple class as a running example.

import dis

class Person(object):
    default_age = 7

    def __init__(self, age):
        self.age = age or self.default_age

    def had_birthday(self):
        self.age += 1

    def how_old(self):
        return self.age

    # throws NameError if these are not bound
    default_age, __init__, had_birthday, how_old

    # We'll see what this function looks like
    dis.dis(how_old)

# disassemble
 13           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (age)
              6 RETURN_VALUE

What happens here is that we let all the class members get defined. This is basically like making bindings at the module level. In this scope there is nothing to suggest that we are inside a class body. (Well, except that the name __module__ is also bound to "__main__", which would not be the case at module level.)

We disassemble one of the functions to show that there is nothing special about it - it doesn't contain any special sauce that would reveal it to be a method instead of a function. self is a local variable like any other that just happens to be called self.

The class code

As mentioned there is no way to get at the code given a class object, but what we can do is put the class definition in a function body, retrieve the function's code object - which we know contains all the constants and variables used in that function. One of those is bound to be the code object of the class-to-be-constructed.

import dis

def build_class():
    class Person(object):
        default_age = 7

        def __init__(self, age):
            self.age = age or self.default_age

        def had_birthday(self):
            self.age += 1

        def how_old(self):
            return self.age

cls_code = build_class.func_code.co_consts[2]
dis.disassemble(cls_code)

# disassemble
  4           0 LOAD_NAME                0 (__name__)
              3 STORE_NAME               1 (__module__)

  5           6 LOAD_CONST               0 (7)
              9 STORE_NAME               2 (default_age)

  7          12 LOAD_CONST               1 (<code object __init__ at 0xb7533ba8, file "funcs.py", line 7>)
             15 MAKE_FUNCTION            0
             18 STORE_NAME               3 (__init__)

 10          21 LOAD_CONST               2 (<code object had_birthday at 0xb7533c80, file "funcs.py", line 10>)
             24 MAKE_FUNCTION            0
             27 STORE_NAME               4 (had_birthday)

 13          30 LOAD_CONST               3 (<code object how_old at 0xb7533e30, file "funcs.py", line 13>)
             33 MAKE_FUNCTION            0
             36 STORE_NAME               5 (how_old)
             39 LOAD_LOCALS         
             40 RETURN_VALUE

So we reach into the function's code, into it's co_consts tuple and grab the code object. It happens to be at index 2, because index 0 is None and index 1 is the string "Person".

So what does the class code do? Just like at module level, it binds all the names in its namespace, and it also binds the name __module__, because a class is supposed to know the module it's defined in.

And then? Once all those bindings have been made, it actually just returns them. So basically the class code builds a dict and returns it.

This helps complete the picture from last time. To recap, at module level the code first a) calls the class as if it were a function with CALL_FUNCTION (to which this "function" returns a dict, as we've just seen), and then b) BUILD_CLASS on that return value (ie. on the dict), which wires everything together and produces an actual class object.

Methods

Okay, now let's find out something else. We know that functions and methods are not the same type of thing. What about their code? We saw before how a function defined in a class body has no signs of being a method. Has it changed during the construction of the class? A function object replaced by a method object perhaps?

import dis

class Person(object):
    default_age = 7

    def __init__(self, age):
        self.age = age or self.default_age

    def had_birthday(self):
        self.age += 1

    def how_old(self):
        return self.age

dis.disassemble(Person.how_old.func_code)

# disassemble
 13           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (age)
              6 RETURN_VALUE

The answer to that is no. The object is unchanged. In fact, the object stored in the class is a function. That's right, a function. Try reaching into Person.__dict__ to get it and what you get is a function object. It isn't until you do an attribute access on the class object (Person.how_old) that the method object appears, so the method is like a view on the function, it's not "native".

How does that work? You already know: descriptors.

func = Person.__dict__['how_old']
print func
# <function how_old at 0xb749e994>

print Person.how_old
# <unbound method Person.how_old>
print func.__get__(None, Person)
# <unbound method Person.how_old>

person = Person(4)
print person.how_old
# <bound method Person.how_old of <__main__.Person object at 0xb73f5eec>>
print func.__get__(person, Person)
# <bound method Person.how_old of <__main__.Person object at 0xb73f5eec>>

Function objects implement the descriptor protocol. Getting the function through the class (ie. getting the method) is equivalent to calling the function object's __get__ method with that class object as the type. This returns an unbound method (meaning bound to a class, but not to an instance of that class).

If you also give it an instance of the class you get a bound method.

So there you have it, classes and methods. Simple, right? Well, ish. One last thing: is the bound/unbound method object created on the fly? As in: does Python perform an object allocation every time you access a method? Because that would be... bad. Well, it doesn't. At least as far as the user can tell, it's always the same object with the same memory address.