PL/0 compiler in Python
Pan Xingzhi
http://www.panxingzhi.net

Note: Pan is my family name, and literally my *first* name.
      If you use Vim but not folds, type zr to expand all.
      I appreciate any feedback, comments, bug reports, etc.

0. Introduction and overview {{{

This work is an effort to study compiler and interpreter construction and
related techniques.

For a novice programmer, compilers and interpreters are mysterious beings.
Code is fed in and everything just happens as desired. However, a real hacker
can hardly bear this obscurity and will eventually want to find out what's
under the hood. After all, they are just programs devised by human beings, so
they must be fairly understandable and reasonable.

At least, this was my motivation when my interest in this stuff began. I
bought almost every compiler book available on the market - be it "dragon" or
"tiger". Despite the low print quality, in China one can get many famous
computer books in English gravure reprints at quite low prices. (For those
biased: they're NOT pirated.)

To my disappointment, none of them served me well. I could understand the
algorithms for sure, but I didn't get the whole picture; I saw the trees and
even the leaves, but I never got to see them as a forest. I found myself in a
paradox: the idea lurks behind the text, and the key to reading the text is
understanding the idea itself.

This situation lasted for a few years: I always wanted to write a compiler, but
I didn't know exactly how (everything right after the parser's work was vague
to me) or why (well, in Beijing it's hard to find a related job to justify this
interest, if possible at all). However, my enthusiasm for programming languages
didn't fade. Over these years I've been writing code in different languages:
Python, Scheme, Erlang, Bash, JavaScript, C, SQL, LZX, Java ... While doing
this I couldn't help comparing different languages, different language
features/paradigms, and different ways to complete programming tasks. I read
all kinds of material on the Internet to get ideas about how and why people do
things the way they do. This was an enlightening process: every day I became a
better programmer than the day before. Then one day I found myself in good
shape to bite the bullet.

It didn't come all of a sudden. The highest-level language I could use back
then was Java, which, despite the great IDE support (I was very fast on Eclipse
JDT, but after using Vim I never wanted to go back), is monotonous (OO only, no
first-class functions, etc.) and thus verbose. Now I am an expert Python
programmer, and Python is so high level that it transforms my thoughts into
executable programs with minimal code and time. I'm also good at Scheme and the
functional programming paradigm. I have a deep understanding of programming
languages now and just want to take one step further. I know how a Scheme
interpreter works, thanks to SICP, the wizard book. Besides, I'm a proficient
Linux user who has every needed tool ready at hand, especially Vim, on which
I'm writing this now.

After some warm-up reading I sat down and kicked off coding. The result is a
compilation system, with a compiler front end generating the so-called abstract
syntax tree, which can be executed directly by an interpreter, or fed to a
compiler back end (or shall we say another compiler?) to further generate
machine code. At the time of this writing, x86 assembly is supported and C-- is
on the schedule.

My aim is to learn and play around rather than to develop a full-fledged
system, which is not worth the effort without being paid =). So the source
language is PL/0, a very simple Pascal-like language designed by Niklaus Wirth
for educational purposes. The implementation language is Python, chosen because
its clear syntax is ideal for demonstrating algorithms and its string and
sequence processing abilities are highly valuable for writing almost any part
of a compiler. Its string templates (the % operator) are excellent for
low-level code generation when code chunks don't correspond nicely, as we shall
see in the section on generating x86 assembly. Lastly, Python code is widely
recognized as easy to read, so if this documentation isn't clear enough, the
code and its inline comments speak for themselves.

The diagram below shows the components of the compilation system:

+------------------------------------------------------------------------------+
|                                                                              |
|                     scanner*/parser*                                         |
|  PL/0 source code -------------------> concrete parse tree*                  |
|                                                 |                            |
|                                                 | AST generator*             |
|                                                 |                            |
|                                                 V                            |
|            interpreter* <............. abstract syntax tree*                 |
|                                                 |                            |
|                                                 | x86 assembly generator*    |
|                                                 |                            |
|                                                 V                            |
|                                            x86 assembly                      |
|                                                 |                            |
|                                                 | NASM assembler             |
|                                                 |                            |
|                                                 V                            |
|                                            object file                       |
|                                                 |                            |
|                                                 | GCC (serves as a linker)   |
|                                                 |                            |
|                                                 V                            |
|        operating system <.............   native executable                   |
|                                                                              |
|                                                                              |
| *    implemented in Python                                                   |
| -->  code transformation                                                     |
| ..>  executed by                                                             |
|                                                                              |
+------------------------------------------------------------------------------+

See Appendix I for code organization and usage.

During this joyous development I realized one important idea to keep in mind:

    A compiler is just an effort to transform/translate code, which is nothing
    magical but plain text with rules called the grammar, and with meanings
    called the semantics.

Much effort in building compilers is devoted to the translation process - we're
trying to reshape text into a form expressed in terms of the target machine.
It's a pity that I seldom, if ever, saw this point explicitly stated in
compiler books. However, whenever I strayed from the point, this idea always
pulled me back on the road - the whole thing is like translating Chinese to
English, only much easier. Easier because the grammar and the semantics of a
programming language are much simpler than those found in human languages.

Speaking of difficulty, the "gap" between the source and the target language
matters, a lot. We can think of our abstract syntax tree as a language, and
that's no problem - we even developed an interpreter for it. The gap between
PL/0 source code and its AST is tiny - so small that we can jump over it with
little effort. On the other hand, the target language, x86 assembly, requires
more sophisticated means of transformation, as we shall see in the following
sections.

We can think of this from another angle: transforming to the AST is easy, but
to execute it we need an interpreter. An interpreter written in a high-level
language saves us considerable effort since it understands a relatively
"high-level" translation. However, if we make the effort to transform to x86
assembly (and then to native code), the executor can be low level - hardware
with an instruction set. We can think of our hardware as an interpreter of the
machine code; we can also think of our interpreter as some hardware that
understands the AST.

That being said, we can write a "machine" that understands whatever level of
code we want. Had we written software mimicking a "machine" that understands
Intel(TM) CPU instructions (and executes them correctly), we would have a
virtual CPU (I don't know if VMWare(TM) works this way but I can imagine so).

Besides this central idea, code organization is also vital. We strive to
separate the phases of the compiler into individual modules - they are just
boxes taking in input and producing (intermediate) output. This way we can test
and observe every phase independently - we can even print out the intermediate
output. This is an excellent way to learn about compilers.

All modules are prefixed with "pypl0_" to avoid name conflicts since names like
"parser", "ast" and "utils" are very common.

I decided to handcraft a recursive descent parser instead of using any parser
generator. The simple grammar allows me to do so, and it's great fun!

}}}

1. The grammar {{{

The grammar used is a modified version of this one:
http://en.wikipedia.org/wiki/PL/0. The symbol "!" is added so we have an
output mechanism. I also capitalized all the keywords and added angle brackets
to ID and NUM, which are regex tokens (instead of simple string keywords).
<EOF> is a token representing the end of file. Here's the finished version:

PL/0 TOKENS in regex
<ID> = ["a"-"z""A"-"Z"]["a"-"z""A"-"Z""0-9"]*
<NUM> = ["0-9"]+

PL/0 GRAMMAR in EBNF
program = block "." <EOF>

block = [ "CONST" <ID> "=" <NUM> {"," <ID> "=" <NUM>} ";"]
        [ "VAR" <ID> {"," <ID>} ";"]
        { "PROCEDURE" <ID> ";" block ";" } statement

statement = [ <ID> ":=" expression | "CALL" <ID> |
            "BEGIN" statement {";" statement } "END" |
            "IF" condition "THEN" statement |
            "WHILE" condition "DO" statement |
            "!" expression ]

condition = "ODD" expression |
            expression ("="|"#"|"<"|"<="|">"|">=") expression

expression = [ "+"|"-"] term { ("+"|"-") term}

term = factor {("*"|"/") factor}

factor = <ID> | <NUM> | "(" expression ")"
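
For illustration, here is a small program that this grammar derives. It is an
illustrative example in the spirit of the ones under pl0/ (see Appendix I),
not copied from the repository:

    VAR x, squ;

    PROCEDURE square;
    BEGIN
        squ := x * x
    END;

    BEGIN
        x := 1;
        WHILE x <= 10 DO
        BEGIN
            CALL square;
            ! squ;
            x := x + 1
        END
    END.
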
}}}

2. The scanner - pypl0_scanner.py {{{

The scanner is invoked by the parser but lives in a different module
(pypl0_scanner). All the parser needs from the scanner are three things:

    Token - the class definition
    nexttoken - the function retrieving the next (one) token
    lookahead - the function looking ahead at the next n tokens without
                actually consuming them

The implementation of the scanner is extremely straightforward (a minimal
sketch follows below):

    - it keeps a buffer of tokens and returns one or more of them on demand
      (nexttoken or lookahead)
    - when tokens are needed, it first checks whether the buffer has enough of
      them; if not, it reads in lines of source code, tokenizes them and puts
      them into the buffer until the demand is satisfied. By reading in lines,
      the line number of each token is kept right in that token, so it's easy
      to report errors
    - it moves the token out of the buffer when nexttoken gets called
    - it does NOT move tokens out of the buffer when lookahead gets called
    - the tokenizing algorithm is self-documented in the code

Maybe in the future we can use some existing tokenizer to further shrink our
code.
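
To make the buffer mechanism concrete, here is a minimal sketch of such a
scanner. It is NOT the actual pypl0_scanner module - the real one classifies
token kinds and handles end-of-file errors properly - but the behavior of
nexttoken/lookahead is the same idea:

    import re

    # crude token pattern, just enough to illustrate the buffering
    TOKEN_RE = re.compile(r'[A-Za-z][A-Za-z0-9]*|\d+|:=|<=|>=|[-+*/=#<>(),;.!]')

    class Token(object):
        def __init__(self, value, lineno):
            self.value, self.lineno = value, lineno

    class Scanner(object):
        def __init__(self, lines):
            self.lines = iter(enumerate(lines, 1))    # (lineno, text) pairs
            self.buffer = []                          # read but not consumed

        def _fill(self, n):
            # tokenize more source lines until the buffer holds n tokens
            while len(self.buffer) < n:
                lineno, text = next(self.lines)       # StopIteration == EOF
                self.buffer.extend(Token(t, lineno)
                                   for t in TOKEN_RE.findall(text))

        def nexttoken(self):
            self._fill(1)
            return self.buffer.pop(0)  # consuming: the token leaves the buffer

        def lookahead(self, n=1):
            self._fill(n)
            return self.buffer[:n]     # non-consuming: the buffer is untouched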

One thing worth noticing is the evil magic I cast:
# Take the comma-separated names listed between the parentheses of Token's
# docstring and set each one as an attribute on the Token class, with the
# name itself as the value.
(lambda doc:
        map(lambda t: setattr(Token, t, t),
            [ t.rstrip(',') for t in doc[doc.find('(')+1:doc.find(')')].split() ]
        )
)(Token.__doc__)

This saved me from writing:

Token.ID, Token.NUM, Token.DOT, \
Token.CONST, Token.EQUAL, Token.COMMA, \
.
.
= 'id', 'num', 'dot', \
'const', 'equal', 'comma', \
.
.
}}}

3. The parser - pypl0_parser.py {{{

It's a classical recursive descent parser with lookahead - 1 token of
lookahead is sufficient for the PL/0 grammar.

Note that lookahead happens when "|" (alternative), "[ ]" (0 or 1 of the
enclosed), or "{ }" (0 or more of the enclosed) are seen in the EBNF grammar.
Moreover, note how we "mirror" these in an almost mechanical way:

|       : if/elif/else statement
[ ... ] : if statement
{ ... } : while statement

These control structures all imply the idea of *conditions*.
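
To make this concrete, here is a sketch of how two of the rules could turn
into parsing functions. The helpers (lookahead, nexttoken, expect,
parse_expression) and the token attribute names follow the conventions
described in this document, but the bodies are illustrative rather than
copied from pypl0_parser.py:

    # term = factor {("*"|"/") factor}
    def parse_term():
        children = [parse_factor()]
        # { ... } in EBNF  ->  a while loop guarded by the lookahead token
        while lookahead(1)[0].value in ('*', '/'):
            children.append(nexttoken())          # the operator itself
            children.append(parse_factor())
        return ['<TERM>'] + children

    # factor = <ID> | <NUM> | "(" expression ")"
    def parse_factor():
        # | in EBNF  ->  an if/elif/else on the lookahead token
        tok = lookahead(1)[0]
        if tok.type in (Token.ID, Token.NUM):
            return ['<FACTOR>', nexttoken()]
        elif tok.value == '(':
            return ['<FACTOR>', nexttoken(), parse_expression(), expect(')')]
        else:
            raise SyntaxError('line %d: unexpected %r' % (tok.lineno, tok.value))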

Look at the actual code in pypl0_parser.py: it is indeed mechanical and could
be generated automatically - think about how we could write a parser generator.
(That may be harder than it looks here, though, since the PL/0 grammar is very
simple.)

Analysis of the nullables and the FIRST/FOLLOW sets (the capital U denotes
set union):

nullables: block, statement

FIRST(program)
= { "." } U FIRST(block)
= { ".", "CONST", "VAR", "PROCEDURE", <ID>, "CALL", "BEGIN", "IF", "WHILE", "!" }
FOLLOW(program)
= { <EOF> }
FIRST(block)
= { "CONST", "VAR", "PROCEDURE" } U FIRST(statement)
= { "CONST", "VAR", "PROCEDURE", <ID>, "CALL", "BEGIN", "IF", "WHILE", "!" }
FOLLOW(block)
= { ".", ";" }
FIRST(statement)
= { <ID>, "CALL", "BEGIN", "IF", "WHILE", "!" }
FOLLOW(statement)
= { ";", "END" } U FOLLOW(block)
= { ";", "END", "." }
FIRST(condition)
= { "ODD" } U FIRST(expression)
= { "ODD", "+", "-", <ID>, <NUM>, "(" }
FOLLOW(condition)
= { "THEN", "DO" }
FIRST(expression)
= { "+", "-" } U FIRST(term)
= { "+", "-", <ID>, <NUM>, "(" }
FOLLOW(expression)
= { ")", "=", "#", "<", "<=", ">", ">=" } U FOLLOW(statement) U FOLLOW(condition)
= { ")", "=", "#", "<", "<=", ">", ">=", ";", "END", ".", "THEN", "DO" }
FIRST(term)
= FIRST(factor)
= { <ID>, <NUM>, "(" }
FOLLOW(term)
= { "+", "-" } U FOLLOW(expression)
= { "+", "-", ")", "=", "#", "<", "<=", ">", ">=", ";", "END", ".", "THEN", "DO" }
FIRST(factor)
= { <ID>, <NUM>, "(" }
FOLLOW(factor)
= { "*", "/" } U FOLLOW(term)
= { "*", "/", "+", "-", ")", "=", "#", "<", "<=", ">", ">=", ";", "END", ".", "THEN", "DO" }

TODO: a predictive parsing table (will be very big!!!)

The data structure we use to represent the parse tree is simply a tree, whose
nodes are tagged lists. For example:

    ['<PROGRAM>', ['<BLOCK>' ... ], DOT("."), EOF("")]

We can print it out with the prettyPrintTree function in the pypl0_utils module.
We can even write some code to mock up the tree directly, if we wish to skip
the whole parsing phase in order to test the following phases.
}}}

4. The abstract syntax tree - pypl0_astnodes.py, pypl0_astgen.py {{{

There is much stuff in the parse tree that we don't need for the next steps -
whether we want to write an interpreter to "run the program" or to generate
target code. E.g., our parse tree looks like this:

    ['<PROGRAM>', ['<BLOCK>' ... ], DOT("."), EOF("")]

The '<PROGRAM>' tag is a lame way to represent the tree node - we still have to
retrieve its useful "block" subnode as the 2nd element in the list, according
to the *grammar*; moreover, the 3rd and 4th elements, DOT(".") and EOF(""), are
of no use at all. We don't need them from now on because they have nothing to
do with the semantics (meaning/behavior) of the program - they have served as
delimiters of the source code to facilitate our parsing, and they have served
well. But once we obtain the correct structure of the program, their job is
done.

The (concrete) parse tree mirrors the source code in every detail. But what is
needed now is the semantics, for interpreting and for transformation (to target
code). What we need is a data structure conveying the meaning of the program,
independent of the concrete syntax or grammar - that's why it's called an
abstract syntax tree (AST). You are going to see how we'll use it in the
subsequent phases.

There are different ways to construct the AST. We could have done it in the
parser - that's pretty much the standard way and it's totally justified. We
would still use "lookahead" and "expect" to see whether the source code
conforms to the grammar, but instead of keeping all the tokens we would drop
useless ones like DOT and EOF above, and construct AST nodes on the fly instead
of lists of lists and tokens.

The way we chose to do it, however, is to generate the parse tree first and
walk it later, at any time we want, to build a corresponding AST. This means
another pass. We do this because 1) well, the parser was already written as is,
without considering the abstract syntax stuff; and 2) we want to make every
step clear because we're learning.

We have also boldly assumed that the input of the tree walker is always valid,
since it's the parser's responsibility to guarantee that the syntax rules are
obeyed, and the parser already does a lot of error checking; so only minimal
error checking is done when generating the AST.

We use ASTNode and its subclasses as the abstraction of the AST. To utilize the
prettyPrintTree function, these classes mimic list behaviors. This is possible
thanks to Python's duck typing. Every ASTNode instance has a children property,
which holds its child nodes in the tree.
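
The idea, sketched (these are not the actual pypl0_astnodes definitions, just
an illustration of the children property and the list mimicry):

    class ASTNode(object):
        def __init__(self, *children):
            self.children = list(children)
        def __iter__(self):             # mimic list behavior ...
            return iter(self.children)
        def __getitem__(self, i):       # ... so a generic tree printer such as
            return self.children[i]     # prettyPrintTree can walk the node
        def __len__(self):
            return len(self.children)
        def __repr__(self):
            return self.__class__.__name__

    class While(ASTNode):
        """children: [condition, body]"""

    class Assign(ASTNode):
        """children: [name, expression]"""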

Note that although procedure definitions are not non-terminals in the grammar,
we abstract them as ASTNodes nonetheless for our own convenience.

The AST is just another form of the source code - a step further towards the
machine code. And it's close enough for us to write an interpreter for it.

}}}

5. The interpreter - pypl0_interp.py {{{

The fact that PL/0 is a very simple language makes the interpreter very easy
to write once the code is in a proper form to be manipulated (the AST).

The interpreter simply traverses the whole AST and executes it, with certain
semantic checks (a simplified sketch follows after the lists below):

    identifiers must be declared and initialized before use
    the left-hand side of an assignment must be a variable (instead of a
        constant or a procedure)
    identifiers in an expression must be variables or constants
    every called identifier must be a subroutine


Other possible but not yet implemented checks:
    an identifier can be declared no more than once in the same scope
    type checking (though we only have one type for now)

Possible "sanity checks":
    is every variable both written and read?
    is every constant read?
    is every subroutine called?
    is the program non-empty?
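
As a taste of how the traversal and the first check above can look, here is a
simplified sketch. For brevity it works on plain tuples/strings/numbers
instead of the ASTNode classes, and it is not the actual pypl0_interp code:

    # "env" maps declared names to values; a declared-but-unassigned variable
    # maps to None, which is how use-before-initialization is caught.

    def eval_expression(node, env):
        if isinstance(node, int):                 # a <NUM> literal
            return node
        if isinstance(node, str):                 # an <ID> reference
            if node not in env:
                raise NameError('undeclared identifier: %s' % node)
            if env[node] is None:
                raise NameError('%s used before initialization' % node)
            return env[node]
        op, left, right = node                    # e.g. ('+', 'x', 1)
        l, r = eval_expression(left, env), eval_expression(right, env)
        if op == '+': return l + r
        if op == '-': return l - r
        if op == '*': return l * r
        if op == '/': return l // r

    def exec_assign(name, expr, env):
        if name not in env:
            raise NameError('assignment to undeclared identifier: %s' % name)
        env[name] = eval_expression(expr, env)

    # e.g. env = {'x': None}; exec_assign('x', ('+', 2, 3), env)  # env['x'] == 5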

}}}

6. Generating x86 assembly - pypl0_x86codegen.py, pypl0_x86codegentpl.py {{{

Translating the AST to x86 assembly is less straightforward than writing an
interpreter. For example, the while loop

    while( condition ) {
        body of loop
    }

would correspond to

    while:
        ; code to set FLAGS based on condition
        jxx endwhile ; select xx so that we jump out when the condition is not met
        ; body of loop
        jmp while
    endwhile:

Work needs to be done so that the condition and the body of the loop are
generated first and then "injected" into the proper locations; only then can
the translation of the while statement be completed.

To do this we use Python's string formatting (%) facility:

"""
; while label
%s:
; while condition
%s
; while statement
%s
; while next turn
        jmp     %s
; while endlabel
%s:
"""

All the "%s" will be replaced with proper assembly once they're generated.

Other language constructs are handled in a similar way.

We target the NASM assembler (http://nasm.us). An excellent introduction can be
found at http://www.drpaulcarter.com/pcasm/.

}}}

7. Compile to native executables {{{

We use NASM (http://nasm.us) to assemble the x86 assembly code into object
files (just another transformation). Once we have the object files we can link
them with our I/O library (asm_io) to produce real executables!

Related commands (these can also be found in the makefiles):
For Cygwin:
nasm -f win32 -o lib/cygobj/asm_io.o lib/x86asm/asm_io.asm
nasm -f win32 -o build/cygobj/filename.o build/x86asm/filename.asm
gcc -o build/cygbin/filename build/cygobj/filename.o lib/cygobj/asm_io.o

For Linux/ELF:
nasm -f elf -d ELF_TYPE -o lib/elfobj/asm_io.o lib/x86asm/asm_io.asm
nasm -f elf -d ELF_TYPE -o build/elfobj/filename.o build/x86asm/filename.asm
gcc -o build/elfbin/filename build/elfobj/filename.o lib/elfobj/asm_io.o

}}}

Appendix I. To use the code {{{

Directory layout:

build/                          ... generated stuff
    cygbin/                     ... executables for Cygwin from build/cygobj
    cygobj/                     ... object files for Cygwin from build/x86asm
    elfbin/                     ... executables for ELF/Linux from build/elfobj
    elfobj/                     ... object files for ELF/Linux from build/x86asm
    x86asm/                     ... x86 assembly from PL/0 source code under pl0
lib/                            ... I/O routine
    cygobj/                     ... asm_io.o, linked with build/cygobj
    elfobj/                     ... asm_io.o, linked with build/elfobj
    x86asm/
        asm_io.asm              ... included by build/x86asm
        asm_io.inc
pl0/                            ... PL/0 example code
    factorial.pl0               ... factorial using recursion
    gcd.pl0                     ... greatest common divisor
    minimal.pl0                 ... minimal legal PL/0 program
    square.pl0
    uninitialized.pl0           ... an example with legal syntax but wrong semantics
src/                            ... Python implementation
    pypl0_astgen.py             ... AST generator
    pypl0_astnodes.py           ... AST nodes definition
    pypl0_interp.py             ... AST interpreter
    pypl0_main.py               ... *ENTRY POINT* of the system
    pypl0_parser.py
    pypl0_scanner.py
    pypl0_utils.py              ... pretty print the parse tree and the AST
    pypl0_x86codegen.py         ... x86 assembly generator
    pypl0_x86codegentpl.py      ... string templates used by pypl0_x86codegen
unittest/                       ... UT code, to be added...
Makefile.cygwin                 ... Makefile for Cygwin
Makefile.elf                    ... Makefile for ELF/Linux
Makefile.in                     ... included Makefile
README                          ... this documentation

Just grab everything with:

    svn checkout http://pypl0.googlecode.com/svn/trunk/ pypl0-read-only

Main functionality is centralized in src/pypl0_main.py. Run

    python src/pypl0_main.py -h

for help. Python 2.5 or higher is needed.

The makefiles are used to facilitate building native executables from PL/0
source code. Currently Cygwin and Linux/ELF formats are supported. Just rename
the right makefile to "Makefile" or use the "-f" option to designate the right
one (see below), and type "make" on the command line in the top directory.
Note that you need Python 2.5 or higher, NASM 2.07 (http://nasm.us; NASM 2.00
should suffice but is not tested) and GCC 4.3 for the makefiles to work.

The PL/0 source code can be either interpreted, or compiled to native code to
be executed. For example, on a Linux box

    python src/pypl0_main.py interp pl0/factorial.pl0

and

    make -f Makefile.elf
    build/elfbin/factorial.pl0

yield the same result. 

}}}

Appendix II. pypl0_utils.py {{{
}}}

Appendix III. Known issues and possible improvements {{{

We haven't written any unit tests at this point - they're important because
the input can be diverse. We need broad test case coverage. And it's highly
desirable that pieces of the whole program can be tested (thus debugged and
maintained) separately.

Error reporting & recovery in the parser - the error message doesn't list all
the "possibilities", because it doesn't take constructs like '[ ]' into
account; instead it simply reports the next expected token after these.

The x86 assembly code generator should check variable references to see
whether they're used before being initialized (this check is already done in
the interpreter).

Extending PL/0 itself - functions with parameters, first-class functions, an
else branch, a string type, a boolean type (distinguished from the integer
type), user-defined types, etc.

}}}

vim:fdm=marker:
