Almost every piece of syntax in a language describes a pattern which is expressible in terms of
simpler constructs. For example, `while (test) {body}`
is equivalent to
startloop:
if (!test) {goto endloop;}
body;
goto startloop;
endloop:
This particular pattern of `if`s and `goto`s was so common and so useful that at some point,
languages started including a keyword called `while`. `while` made code easier to write
(because you were already thinking "while x, do y", and now you can just type that,
you don't have to take the time to encode that thought as a pile of `goto`s)
and easier to read (because you don't have to decode the pile of `goto`s in order to realize
"oh! That's just 'while x do y'.").
And you don't have to go back that far to find examples of new syntax making people's lives better. To scratch the surface:
- C++11's `for (x : y)` and lambdas and smarter `>>`,
- Ruby 2's keyword arguments,
- Python 3's star-unpacking and `yield from`,
- Python 3.5's `async` and `@` operator,
- Java 7's strings-in-switches and type inference, and
- ECMAScript 6's generators and modules and arrow functions and spread and for-of and destructuring.
These are all things that we knew we needed for years before they came out. But we muddled along without them until the language maintainers put them in, because adding new syntax to a language is hard:
- you must adhere to the hard-to-reason-about limitations of your parser;
- you probably shouldn't break backwards-compatibility; and
- you really need to make sure you do it the best possible way on your first try, because once you've dedicated some syntax to the new feature, you can never introduce any similar syntax (because your parser has to be able to distinguish the two).
All this difficulty arises from the fact that your program is a text file that tries to be both human-readable and unambiguously translatable into machine instructions. What if we ditched that? What if we separated the model (the logical layout of the program) from the view (the pattern of characters that appears on your screen when you look at the program)? Then we could store the program in an extensible, easily computer-readable format, let UI programs deal with the human-readability aspect, and never again have to worry about whether a new piece of syntax would make a language's grammar non-LR(1)-parseable.
To skip ahead a little: I'm gonna suggest that programs be stored as abstract syntax trees, since that's both
- an easily computer-readable format; and
- closely related to the logical structure of the program in your head.
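To make the model/view split a bit more concrete, here is a minimal sketch in Python. Every name in it (`Num`, `BinOp`, `ViewPrefs`, `render`) is made up for illustration and doesn't come from any existing tool. The tree is the stored program; the preferences only change how it gets drawn on screen.

```python
from dataclasses import dataclass

# The model: a tiny expression tree (hypothetical node types).
@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# The view: per-reader display preferences (hypothetical options).
@dataclass
class ViewPrefs:
    spaces_around_operators: bool = True
    explicit_parentheses: bool = False

PRECEDENCE = {"+": 1, "*": 2}

def render(node, prefs, parent_op=None):
    """Turn a tree into text according to the reader's preferences."""
    if isinstance(node, Num):
        return str(node.value)
    sep = " " if prefs.spaces_around_operators else ""
    text = sep.join([
        render(node.left, prefs, node.op),
        node.op,
        render(node.right, prefs, node.op),
    ])
    # Parenthesize a nested operation when required by precedence,
    # or whenever the reader asked for explicit parentheses.
    needs_parens = parent_op is not None and (
        prefs.explicit_parentheses
        or PRECEDENCE[node.op] < PRECEDENCE[parent_op]
    )
    return f"({text})" if needs_parens else text

tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))
print(render(tree, ViewPrefs()))                               # 1 + 2 * 3
print(render(tree, ViewPrefs(spaces_around_operators=False)))  # 1+2*3
print(render(tree, ViewPrefs(explicit_parentheses=True)))      # 1 + (2 * 3)
```

The same stored tree produces all three renderings, so two people could read one program with different whitespace and parenthesization habits without ever touching the file.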
Without the difficulties of parsing text, defining new "syntax" (if that's even the right word anymore)
would be easy enough that anybody could do it: if you find a particular useful programming pattern
syntactically cumbersome, you could define a new kind of AST node, define what it means in terms of
standard node types (e.g. describe how to turn `while` into `if`s and `goto`s), and define how it should
be displayed in your program-editor application.
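Here is a rough sketch of those three steps for `while`, assuming a toy base language that only has statements, labels, `goto`s and `if`s. All the class and function names are hypothetical, and expressions are kept as plain strings to keep it short.

```python
from dataclasses import dataclass

# Base-language node types (hypothetical).
@dataclass
class Stmt:
    text: str

@dataclass
class Label:
    name: str

@dataclass
class Goto:
    target: str

@dataclass
class If:
    test: str
    body: list

# Step 1: the new node type and what parametrizes it.
@dataclass
class While:
    test: str
    body: list

# Step 2: what it means in terms of standard node types.
def desugar_while(node, start="startloop", end="endloop"):
    return [
        Label(start),
        If(f"!({node.test})", [Goto(end)]),
        *node.body,
        Goto(start),
        Label(end),
    ]

# Step 3: how an editor might display each node (one possible view).
def show(node, depth=0):
    pad = "    " * depth
    if isinstance(node, Stmt):
        return [f"{pad}{node.text};"]
    if isinstance(node, Label):
        return [f"{pad}{node.name}:"]
    if isinstance(node, Goto):
        return [f"{pad}goto {node.target};"]
    if isinstance(node, (If, While)):
        keyword = "if" if isinstance(node, If) else "while"
        lines = [f"{pad}{keyword} ({node.test}) {{"]
        for child in node.body:
            lines += show(child, depth + 1)
        return lines + [f"{pad}}}"]
    raise TypeError(f"no display rule for {type(node).__name__}")

loop = While("i < 3", [Stmt("i = i + 1")])
lowered = desugar_while(loop)
print("\n".join(show(loop)))       # what you would see in the editor
for node in lowered:               # what the compiler would see
    print("\n".join(show(node)))
```

This sketch doesn't recurse into nested loops or generate unique label names; a real version would need both.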
You are currently wondering, "Is this idea a good idea?" Let's see. As with most everything, there are good aspects and bad aspects.
- Retraining is hard: you're already very familiar with how to manipulate text. You don't even have to think: you just know how to move your fingers to change the program in the way you want, because you've been doing this for decades.
  Now I'm suggesting you ditch all that training and learn how to manipulate some weird new data structure. How long will it take to reach your old efficiency? Months? Years? I really don't know.
- Collaboration with non-adopters is hard: if you adopt this idea, your programs will be in some funky AST format; your friends who didn't adopt won't be able to edit them. You could certainly generate, say, Python source code for your AST, and they could modify that source code, and the source could be turned back into an AST... but if you used fancy custom AST node types, the generated source code would be ugly autogenerated stuff, and it might be hard to reconstruct those custom AST nodes from the modified standard Python source code.
- Proliferation of node types is bad: some programmer will define a new node type for a trivial dumb thing. And then he'll post that new dumb type, and your friend will download it and use it and now you need to install the dumb node type on your computer to even read your friend's program.
  Hopefully, there can be a stigma against making frivolous custom node types, just like there's a stigma against naming variables `asdf`.
- Encapsulation of programming patterns: patterns that are verbose or syntactically painful can be encapsulated in new node types. If you find yourself doing a lot of regex-matching like

      match = re.match(ADDRESS_PATTERN, s)
      if match:
          return Address(number=match.group('number'),
                         street=match.group('street'),
                         city=match.group('city'),
                         state=match.group('state'),
                         zip=match.group('zip'))
      else:
          raise ValueError('malformed address', s)

  maybe it'd be worth your time to start using a custom AST node type defined by someone on the Internet, which lets you instead write

      regex-extract (number, street, city, state, zip) from s using ADDRESS_PATTERN:
          return Address(number=number, street=street, city=city, state=state, zip=zip)
      elif no match:
          raise ValueError('malformed address', s)

  which you find a little easier to read.
- Customizable view: you could customize how code appears to you:
  - define where to show whitespace and how much
  - make math appear in fancy typesetting
  - show or hide comments
  - (for Rubyists) define when parentheses should appear around method calls
  - display regular expressions in a more human-friendly way
- Easy IDE plugin development: pass
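Returning to the regex-extract node from the list above: here is one way such a node might be expanded into ordinary Python, sketched with the standard `ast` module (`ast.unparse` needs Python 3.9 or later). The function `expand_regex_extract`, its parameters, and the temporary name `_match` are all assumptions made for this example; the two-group pattern is a cut-down stand-in for `ADDRESS_PATTERN`, and the match branch assigns to a variable instead of returning so the result can run at module level.

```python
import ast
import re

def expand_regex_extract(names, subject, pattern, on_match, on_no_match):
    """Build standard Python AST nodes for a hypothetical regex-extract node."""
    match_var = "_match"  # hypothetical temporary; a real tool would pick a fresh name
    assign = ast.parse(f"{match_var} = re.match({pattern}, {subject})").body[0]
    # One `name = _match.group('name')` statement per extracted group.
    bindings = [
        ast.parse(f"{name} = {match_var}.group({name!r})").body[0]
        for name in names
    ]
    branch = ast.If(
        test=ast.Name(id=match_var, ctx=ast.Load()),
        body=bindings + ast.parse(on_match).body,
        orelse=ast.parse(on_no_match).body,
    )
    return ast.Module(body=[assign, branch], type_ignores=[])

expanded = expand_regex_extract(
    names=["number", "street"],
    subject="s",
    pattern="ADDRESS_PATTERN",
    on_match="address = (number, street)",
    on_no_match="raise ValueError('malformed address', s)",
)
expanded_source = ast.unparse(expanded)
print(expanded_source)  # the plain if/else form a non-adopter could read

# Run the expansion against sample data to check that it behaves as intended.
namespace = {
    "re": re,
    "ADDRESS_PATTERN": r"(?P<number>\d+) (?P<street>.+)",
    "s": "12 Main St",
}
exec(expanded_source, namespace)
print(namespace["address"])
```

The expansion goes from the custom node to standard Python; as noted in the collaboration point above, recovering the custom node from hand-edited output is the harder direction, and this sketch doesn't attempt it.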
In this section, I derive the way things should work.
How do we naturally think about programs? What kind of data structure would be well-suited to a program's representation?
Well: we think about programs as having a treelike structure, where nodes represent things like "a for loop" or "a class definition," and a node's children are what parametrize it. For example:
- an addition is parametrized by two expressions (the left and right operands);
- an attribute access is parametrized by an expression (the object) and a string (the attribute name);
- a class definition is parametrized by a string (the name), zero or more expressions (the parent classes), and zero or more statements (the class body).
Hold on -- I'm just describing abstract syntax trees. Great! A program's natural representation is as an AST.
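Python's standard `ast` module happens to expose exactly this kind of structure, so the three examples above can be checked directly. The source snippets and variable names here are just illustrations.

```python
import ast

# An addition is parametrized by two expressions (left and right operands).
addition = ast.parse("account.balance + fee", mode="eval").body
print(type(addition).__name__)        # BinOp
print(type(addition.op).__name__)     # Add

# An attribute access is parametrized by an expression and a string.
access = addition.left
print(type(access).__name__)          # Attribute
print(access.value.id, access.attr)   # account balance

# A class definition is parametrized by a name, parent classes, and a body.
class_def = ast.parse("class Savings(Account):\n    rate = 0.02\n").body[0]
print(class_def.name)                          # Savings
print([base.id for base in class_def.bases])   # ['Account']
print(len(class_def.body))                     # 1
```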
Nowadays, programs are stored as text. Introducing a new piece of syntax means:
- defining a class for a new AST node type, describing what kind of node it is
  and what parametrizes it (e.g. "an `AttributeAccess` is an expression with one child (an expression, the `object`) and one attribute (a string, the `attrname`)");
- telling your code generator how to create bytecode/assembly for the new node type; and
- modifying the lexer/parser to instantiate the new node type when a certain pattern is recognized in the source code (e.g. making a new grammar rule in Bison).
How would you introduce a new piece of syntax in a world where programs are stored as ASTs?
- defining a class for a new AST node type, same as before;
- since mere mortals can't modify the code generator, instead telling some kind of precompiler
  how to translate an instance of the new node type into nodes of types defined in the base language
  (e.g. "to simplify a `ForLoop(init, inc, test, body)`, move `init` before the `ForLoop` node, then replace the `ForLoop` with `WhileLoop(test, (body; inc))`"); and
- describing how to present a node of the new type in your favorite program-editing application (e.g. telling it how to generate pseudocode).
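The three steps for `ForLoop` might look something like this in Python. This is a sketch under assumptions: the registries `EXPANDERS` and `PRESENTERS`, the `precompile` function, and the choice of `Stmt` and `WhileLoop` as the base language are all invented for the example.

```python
from dataclasses import dataclass

# Base-language nodes (hypothetical); tests and statements are plain strings.
@dataclass
class Stmt:
    text: str

@dataclass
class WhileLoop:
    test: str
    body: list

EXPANDERS = {}   # node type -> function returning simpler replacement nodes
PRESENTERS = {}  # node type -> function returning pseudocode for the editor

# Step 1: the new node type.
@dataclass
class ForLoop:
    init: Stmt
    inc: Stmt
    test: str
    body: list

# Step 2: tell the precompiler how to translate it:
# move init before the loop, then loop over (body; inc).
def expand_for(node):
    return [node.init, WhileLoop(node.test, [*node.body, node.inc])]

# Step 3: tell the editor how to present it.
def present_for(node):
    return f"for ({node.init.text}; {node.test}; {node.inc.text})"

EXPANDERS[ForLoop] = expand_for
PRESENTERS[ForLoop] = present_for

def precompile(stmts):
    """Rewrite a statement list until only base-language nodes remain."""
    out = []
    for node in stmts:
        if type(node) in EXPANDERS:
            out.extend(precompile(EXPANDERS[type(node)](node)))
        elif isinstance(node, WhileLoop):
            out.append(WhileLoop(node.test, precompile(node.body)))
        else:
            out.append(node)
    return out

program = [ForLoop(Stmt("i = 0"), Stmt("i += 1"), "i < 3", [Stmt("print(i)")])]
lowered_program = precompile(program)
print(PRESENTERS[ForLoop](program[0]))
print(lowered_program)
```

Because the precompiler only consults the registries, installing someone else's node type would amount to adding one entry to each, without touching the code generator.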