speezepearson/astprogramming

Motivation

Almost every piece of syntax in a language describes a pattern which is expressible in terms of simpler constructs. For example, while (test) {body} is equivalent to

startloop:
  if (!test) {goto endloop;}
  body;
  goto startloop;
endloop:

This particular pattern of ifs and gotos was so common and so useful that at some point, languages started including a keyword called while. while made code easier to write (you were already thinking "while x, do y", and now you could just type that instead of taking the time to encode the thought as a pile of gotos) and easier to read (you no longer had to decode the pile of gotos to realize "oh! That's just 'while x do y'").

And you don't have to go back that far to find examples of new syntax making people's lives better. To scratch the surface:

  • C++11's for (x : y) and lambdas and smarter >>,
  • Ruby 2's keyword arguments,
  • Python 3's star-unpacking and yield from,
  • Python 3.5's async and @ operator,
  • Java 7's strings-in-switches and type inference, and
  • ECMAScript 6's generators and modules and arrow functions and spread and for-of and destructuring.

These are all things that we knew we needed for years before they came out. But we muddled along without them until the language maintainers put them in, because adding new syntax to a language is hard:

  • you must adhere to the hard-to-reason-about limitations of your parser;
  • you probably shouldn't break backwards-compatibility; and
  • you really need to make sure you do it the best possible way on your first try, because once you've dedicated some syntax to the new feature, you can never introduce any similar syntax (because your parser has to be able to distinguish the two).

All this difficulty arises from the fact that your program is a text file that tries to be both human-readable and unambiguously translatable into machine instructions. What if we ditched that? What if we separated the model (the logical layout of the program) from the view (the pattern of characters that appears on your screen when you look at the program)? Then we could store the program in an extensible, easily computer-readable format, let UI programs deal with the human-readability aspect, and never again have to worry about whether a new piece of syntax would make a language's grammar non-LR(1)-parseable.

To skip ahead a little: I'm gonna suggest that programs be stored as abstract syntax trees, since that's both

  • an easily computer-readable format; and
  • closely related to the logical structure of the program in your head.

Without the difficulties of parsing text, defining new "syntax" (if that's even the right word anymore) would be easy enough that anybody could do it: if you find a particular useful programming pattern syntactically cumbersome, you could define a new kind of AST node, define what it means in terms of standard node types (e.g. describe how to turn while into ifs and gotos), and define how it should be displayed in your program-editor application.

...Seriously?

You are currently wondering, "Is this idea a good idea?" Let's see. As with most everything, there are good aspects and bad aspects.

The Bad

  • Retraining is hard: you're already very familiar with how to manipulate text. You don't even have to think: you just know how to move your fingers to change the program in the way you want, because you've been doing this for decades.

    Now I'm suggesting you ditch all that training and learn how to manipulate some weird new data structure. How long will it take to reach your old efficiency? Months? Years? I really don't know.

  • Collaboration with non-adopters is hard: if you adopt this idea, your programs will be in some funky AST format; your friends who didn't adopt won't be able to edit them. You could certainly generate, say, Python source code for your AST, and they could modify that source code, and the source could be turned back into an AST... but if you used fancy custom AST node types, the generated source code would be ugly autogenerated stuff, and it might be hard to reconstruct those custom AST nodes from the modified standard Python source code.

  • Proliferation of node types is bad: some programmer will define a new node type for a trivial dumb thing. And then he'll post that new dumb type, and your friend will download it and use it and now you need to install the dumb node type on your computer to even read your friend's program.

    Hopefully, there can be a stigma against making frivolous custom node types, just like there's a stigma against naming variables asdf.

The Good

  • Encapsulation of programming patterns: patterns that are verbose or syntactically painful can be encapsulated in new node types. If you find yourself doing a lot of regex-matching like

    match = re.match(ADDRESS_PATTERN, s)
    if match:
      return Address(number=match.group('number'), street=match.group('street'), city=match.group('city'),
                     state=match.group('state'), zip=match.group('zip'))
    else:
      raise ValueError('malformed address', s)

    maybe it'd be worth your time to start using a custom AST node type defined by someone on the Internet, which lets you instead write

    regex-extract (number, street, city, state, zip) from s using ADDRESS_PATTERN:
      return Address(number=number, street=street, city=city, state=state, zip=zip)
    elif no match:
      raise ValueError('malformed address', s)

    which you find a little easier to read.

  • Customizable view: you could customize how code appears to you:

    • define where to show whitespace and how much
    • make math appear in fancy typesetting
    • show or hide comments
    • (for Rubyists) define when parentheses should appear around method calls
    • display regular expressions in a more human-friendly way
  • Easy IDE plugin development: pass
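
The customizable-view bullet above can be sketched with Python's standard ast module. This is only a rough illustration, not a design: render and its spaces and power options are names I made up, and it only handles simple arithmetic. The point is that one stored tree can be shown two different ways without touching the tree itself.

```python
import ast

def render(node, spaces=True, power="**"):
    """Render a small arithmetic tree as text. Hypothetical helper: the
    keyword arguments stand in for a user's personal display preferences."""
    if isinstance(node, ast.Expression):
        return render(node.body, spaces, power)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp):
        symbols = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*",
                   ast.Div: "/", ast.Pow: power}
        sep = " " if spaces else ""
        left = render(node.left, spaces, power)
        right = render(node.right, spaces, power)
        # Always parenthesize, so the view never depends on precedence rules.
        return f"({left}{sep}{symbols[type(node.op)]}{sep}{right})"
    raise NotImplementedError(type(node).__name__)

# The "model": one tree, parsed once.
tree = ast.parse("x**2 + 3*y", mode="eval")

# Two "views" of the same model.
roomy = render(tree)
compact = render(tree, spaces=False, power="^")
print(roomy)    # ((x ** 2) + (3 * y))
print(compact)  # ((x^2)+(3*y))
```

A real editor would presumably do much more (fancy math typesetting, comment folding, and so on), but the shape is the same: the preferences live in the viewer, not in the stored program.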

How?

In this section, I derive the way things should work.

Data Structure

How do we naturally think about programs? What kind of data structure would be well-suited to a program's representation?

Well: we think about programs as having a treelike structure, where nodes represent things like "a for loop" or "a class definition," and a node's children are what parametrize it. For example:

  • an addition is parametrized by two expressions (the left and right operands);
  • an attribute access is parametrized by an expression (the object) and a string (the attribute name);
  • a class definition is parametrized by a string (the name), zero or more expressions (the parent classes), and zero or more statements (the class body).

Hold on -- I'm just describing abstract syntax trees. Great! A program's natural representation is as an AST.
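
You can see this parametrization directly in Python's own ast module. Here's a small sketch; the snippets being parsed (a + b, obj.name, class Dog) are just made-up examples, and the variable names are mine.

```python
import ast

# Parse three tiny programs and pull out the node of interest from each.
addition = ast.parse("a + b", mode="eval").body
attribute = ast.parse("obj.name", mode="eval").body
class_def = ast.parse("class Dog(Animal):\n    legs = 4\n").body[0]

# An addition is parametrized by two expressions (left and right operands).
print(ast.dump(addition.left), ast.dump(addition.right))

# An attribute access is parametrized by an expression (the object)
# and a plain string (the attribute name).
print(ast.dump(attribute.value), repr(attribute.attr))

# A class definition is parametrized by a string (the name), a list of
# expressions (the parent classes), and a list of statements (the body).
print(class_def.name, len(class_def.bases), len(class_def.body))
```

Python's real node types carry a few extra fields beyond the ones listed above (for example, the operator of a BinOp is itself a child node), but the overall picture matches.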

Extensibility

Nowadays, programs are stored as text. Introducing a new piece of syntax means:

  • defining a class for a new AST node type, describing what kind of node it is and what parametrizes it (e.g. "an AttributeAccess is an expression with one child (an expression, the object) and one attribute (a string, the attrname)");
  • telling your code generator how to create bytecode/assembly for the new node type; and
  • modifying the lexer/parser to instantiate the new node type when a certain pattern is recognized in the source code (e.g. making a new grammar rule in Bison).

How would you introduce a new piece of syntax in a world where programs are stored as ASTs?

  • defining a class for a new AST node type, same as before;
  • since mere mortals can't modify the code generator, instead telling some kind of precompiler how to translate an instance of the new node type into nodes of types defined in the base language (e.g. "to simplify a ForLoop(init, inc, test, body), move init before the ForLoop node, then replace the ForLoop with WhileLoop(test, (body; inc))"); and
  • describing how to present a node of the new type in your favorite program-editing application (e.g. telling it how to generate pseudocode).
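
Here's a minimal sketch of those three steps for the ForLoop example, in plain Python. Everything in it is an assumption made for illustration: the class names (Stmt, Block, WhileLoop, ForLoop), the simplify and render functions, and the choice to store tests and simple statements as strings. It also ignores wrinkles like continue, which would skip inc after the rewrite.

```python
from dataclasses import dataclass

@dataclass
class Stmt:
    """Stand-in for any base-language statement, stored as display text."""
    text: str

@dataclass
class Block:
    statements: list

@dataclass
class WhileLoop:
    """A base-language node: the code generator already understands it."""
    test: str
    body: Block

@dataclass
class ForLoop:
    """Step 1: the new node type and what parametrizes it."""
    init: Stmt
    inc: Stmt
    test: str
    body: Block

def simplify(node):
    """Step 2: the 'precompiler' rule, rewriting ForLoop into base nodes."""
    if isinstance(node, ForLoop):
        # Move init before the loop; run inc at the end of each iteration.
        loop_body = Block(simplify(node.body).statements + [node.inc])
        return Block([node.init, WhileLoop(node.test, loop_body)])
    if isinstance(node, WhileLoop):
        return WhileLoop(node.test, simplify(node.body))
    if isinstance(node, Block):
        return Block([simplify(s) for s in node.statements])
    return node

def render(node, indent=0):
    """Step 3: how an editor might display each node as pseudocode."""
    pad = "  " * indent
    if isinstance(node, Block):
        return "\n".join(render(s, indent) for s in node.statements)
    if isinstance(node, WhileLoop):
        return f"{pad}while {node.test}:\n{render(node.body, indent + 1)}"
    if isinstance(node, ForLoop):
        header = f"{pad}for {node.init.text}; {node.test}; {node.inc.text}:"
        return f"{header}\n{render(node.body, indent + 1)}"
    return pad + node.text

loop = ForLoop(Stmt("i = 0"), Stmt("i += 1"), "i < 3", Block([Stmt("print(i)")]))
as_written = render(loop)
as_simplified = render(simplify(loop))
print(as_written)
print(as_simplified)
```

The first print shows the loop the way its author sees it; the second shows what the base language would receive: the init statement, then a while loop whose body ends with the increment.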
