Making a toy programming language in Lua, part 3
This is part three of my series on writing a toy programming language using LPeg. You should start with part one here.
Last time, we added variables and arrays to the language. Assigning to arrays was starting to strain our design some, so this time we’ll refactor it a lot, and add two features that would be impossible without that refactoring: conditional statements and loops.
Abstract syntax trees
Right now, our parser evaluates expressions as soon as it can, while it’s still in the process of parsing them. This was fine while we were just parsing mathematical expressions, and it’s even fine when we start to include variables (as long as assignment statements can’t appear inside expressions), but conditionals and loops break that: we need to parse the entire “if” statement, but we only want the body to happen after we evaluate the condition. With our current parser, we would evaluate the body (and cause the side-effects of any assignment statements it contains) as soon as we parse it, so those statements would run whether the condition was true or false.
In order to fix this, we first need to do a major overhaul of our language. I’m going to part with The Unix Programming Environment a little here: they jumped right to compilation and code generation, whereas I’m going to do that, but not immediately: it’s more common to do what I’m going to do, and build an abstract syntax tree first.
So, the first thing we need to do is decide how to represent an AST. Let’s have each node be a table with index 1 being the type of node, and the rest being the captures. So, the AST for “2+3” would be:
{ "expr", { "term", 2 }, "+", { "term", 3 } }
We can construct this easily enough. Actually much more easily than if we were using Yacc. We’ll use two more LPeg functions, Cc
and Ct
. Ct
is a table capture, which wraps up all the captures for a pattern into a table, and Cc
is a constant capture, which consumes none of the input but captures its argument. By taking a nonterminal like EXPR, which looks like this:
EXPR = V("TERM") * ( expr_op * V("TERM") )^0 / eval,
…Removing the function captures (because we don’t want to evaluate it yet), wrapping it up in Ct
, and prefacing it with Cc
to tag it:
EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),
…we can make our parser generate an AST. The way LPeg can build this directly into the grammar is really quite neat, I think.
When we do this to the entire grammar, here’s what we’re left with:
stmt = spc * P{ "STMT"; STMT = Ct( Cc("assign") * V("REF") * "=" * spc * V("VAL") ) + V("EXPR"), EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ), TERM = Ct( Cc("term") * V("FACT") * ( term_op * V("FACT") )^0 ), REF = Ct( Cc("ref") * name * (lbrack * V("EXPR") * rbrack)^0 ), FACT = number + lparen * V("EXPR") * rparen + V("REF"), ARRAY = Ct( Cc("array") * lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack ), VAL_LIST = V("VAL") * (comma * V("VAL"))^0, VAL = V("EXPR") + V("ARRAY") }
As you can see, we’ve only changed a few of them, because some things are simple enough not to need their own nodes: numbers and REFs don’t really need to be wrapped in FACT nodes, because a FACT will always just evaluate to its only child, for example. Same with STMT, we only really care if it’s an assignment or an EXPR, because a STMT that contains an EXPR just evaluates that EXPR.
So, now, trying to match “2+3” yields the exact tree that we’d expect:
-- stmt:match("2+3") gives: { "expr", { "term", 2 }, "+", { "term", 3 } }
Evaluating an AST
Obviously we’ll need to change the eval
function to handle these. In fact, let’s split it up into several functions. TERMs and EXPRs can share a function that’s a lot like the current eval
:
function eval_expr(expr) local accum = eval(expr[2]) -- because 1 is "expr" for i = 3, #expr, 2 do local operator = expr[i] local num2 = eval(expr[i+1]) if operator == '+' then accum = accum + num2 elseif operator == '-' then accum = accum - num2 elseif operator == '*' then accum = accum * num2 elseif operator == '/' then accum = accum / num2 end end return accum end
We’re still doing the same basic thing, with the loop and accumulator. The main differences are that we start at index 2 (because index 1 is the name of the node, either “expr” or “term”), and that all the subvalues, num1
and num2
, get sent to eval
before being used. This recursion lets us handle EXPRs that have nested EXPRs or TERMs in them.
The eval
function itself becomes fairly simple as well. There aren’t many things we can possibly send it:
function eval(ast) if type(ast) == 'number' then return ast elseif ast[1] == 'expr' or ast[1] == 'term' then return eval_expr(ast) elseif ast[1] == 'array' then local new = {} for _, el in ipairs(ast[2]) do table.insert(new, eval(el)) end return new elseif ast[1] == 'ref' then return lookup(ast) elseif ast[1] == 'assign' then return assign(ast[2], eval(ast[3])) end end
Numbers are returned untouched, expressions and terms get sent to eval_expr
, and arrays are recursively evaluated and returned in a table. References and assignments are a little more complicated.
Variables in an AST
To evaluate a REF, we need to follow a chain of indices: the first element in the chain is a variable name, but the variable could be an array, in which case there will be more elements, until finally you reach a number. So, the lookup
function to do this, which has a lot in common with makeref
from last time:
function lookup(ref) local current = VARS for i = 2, #ref do local next_index = ref[i] if type(next_index) == 'table' then next_index = eval(next_index) end current = current[next_index] end return current end
At each step, the next index can be a variable name, a number, or an expression that evaluates to a number. So, if it’s a table, it must be an expression, and we eval
it.
An assignment is pretty much the same: we take a ref and a value, follow down to one before the end of the chain of indices, and then set the value:
function assign(ref, value) local current = VARS for i = 2, #ref do local next_index = ref[i] if type(next_index) == 'table' then next_index = eval(next_index) end if i == #ref then -- last one, set the value current[next_index] = value return value else -- not the last, keep following the chain current = current[next_index] end end end
So, with these changes, and one minor tweak in our test
function (add stmt = stmt / eval
to the top of it), all our tests will pass again. Now we’re ready to start adding functionality to use the AST.
Conditionals
First, let’s modify the parser to handle conditionals. We’ll need to add three new concepts: a statement list (that will go inside the conditional), a boolean expression (that will be the predicate of the conditional), and the conditional itself. Booleans seem easy enough to start with. Match the possible operators:
boolean = C( S("<>") + "<=" + ">=" + "!=" + "==" ) * spc
Then add another nonterminal to the parser:
BOOL = Ct( Cc("bool") * V("EXPR") * boolean * V("EXPR") )
Statement lists are easy enough, and they’ll become our new starting nonterminal, replacing STMT:
LIST = V("STMT") + Ct( Cc("list") * lcurly * V("STMT") * ( ";" * spc * V("STMT") )^0 * rcurly ),
So a LIST is either a single STMT, or a series of them inside braces and separated by semicolons (lcurly
and rcurly
match left and right curly braces). Finally, here’s how we’ll define a conditional:
IF = Ct( C("if") * spc * lparen * V("BOOL") * rparen * V("LIST") )
All together, here’s our parser now:
stmt = spc * P{ "LIST"; LIST = V("STMT") + Ct( Cc("list") * lcurly * V("STMT") * ( ";" * spc * V("STMT") )^0 * rcurly ), STMT = Ct( Cc("assign") * V("REF") * "=" * spc * V("VAL") ) + V("EXPR") + V("IF"), EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ), TERM = Ct( Cc("term") * V("FACT") * ( term_op * V("FACT") )^0 ), REF = Ct( Cc("ref") * name * (lbrack * V("EXPR") * rbrack)^0 ), FACT = number + lparen * V("EXPR") * rparen + V("REF"), ARRAY = Ct( Cc("array") * lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack ), VAL_LIST = V("VAL") * (comma * V("VAL"))^0, VAL = V("EXPR") + V("ARRAY"), BOOL = Ct( Cc("bool") * V("EXPR") * boolean * V("EXPR") ), IF = Ct( C("if") * spc * lparen * V("BOOL") * rparen * V("LIST") ) }
There’s one final tweak to make. An “if” statement will actually be parsed as a name, and eval
will try to look up the variable called “if.” We need to tell the name
pattern to not recognize “if” as a name; this is the first keyword in our language (there will be more):
name = C( letter * (digit+letter+"_")^0 ) * spc keywords = P("if") * spc name = name - keywords
Parsing an example “if” statement like “if(a > 0) { 2+3; a=4*5 }” now gives this parse tree:
{ "if", { "bool", { "expr", { "term", { "ref", "a" } } }, ">", { "expr", { "term", 0 } } }, { "list", { "expr", { "term", 2 }, "+", { "term", 3 } }, { "assign", { "ref", "a" }, { "expr", { "term", 4, "*", 5 } } } } }
Evaluating conditionals
This is all obviously going to mean some changes to the eval
function. We’ll have to add a function like eval_expr
to handle boolean expressions, and two new options in eval
to handle statement lists and “if” statements. Here’s how we do booleans:
function eval_bool(expr) local num1 = eval(expr[2]) local operator = expr[3] local num2 = eval(expr[4]) if operator == '<' then return num1 < num2 elseif operator == '<=' then return num1 <= num2 elseif operator == '>' then return num1 > num2 elseif operator == '>=' then return num1 >= num2 elseif operator == '==' then return num1 == num2 elseif operator == '!=' then return num1 ~= num2 end end
This is a lot simpler than eval_expr
because we know exactly how many arguments there will be: we can’t chain together booleans like we can EXPRs and TERMs.
The two new paths through eval
differ from the rest in one way: they’re not guaranteed to return anything. Statement lists and conditionals can’t be nested inside of other value-returning constructs (like EXPRs) so we won’t ever have to return a value from those to eval
itself. Here’s the new eval
:
function eval(ast) if type(ast) == 'number' then return ast elseif ast[1] == 'expr' or ast[1] == 'term' then return eval_expr(ast) elseif ast[1] == 'array' then local new = {} for _, el in ipairs(ast[2]) do table.insert(new, eval(el)) end return new elseif ast[1] == 'ref' then return lookup(ast) elseif ast[1] == 'assign' then return assign(ast[2], eval(ast[3])) elseif ast[1] == 'list' then for i = 2, #ast do eval(ast[i]) end elseif ast[1] == 'if' then if eval_bool(ast[2]) then return eval(ast[3]) end end end
Evaluating a list just means evaluating everything in it, and evaluating an “if” means eval_bool
-ing the condition, then evaluating the body only if it’s true. This means we can now stick this in the bottom of test
and it will pass:
stmt:match("if(1 < 0) b = 5"); assert(VARS.b ~= 5)
Because 1 is not less than 0, the part of the AST that does the assignment is never run, and nothing gets assigned to “b”.
Loops
We can implement loops almost the same way. First we modify the parser to parse loops, which is already mostly done, since it’s the same general thing as conditionals:
WHILE = Ct( C("while") * spc * lparen * V("BOOL") * rparen * V("LIST") )
Add “while” to the list of keywords:
keywords = (P("if")+P("while")) * spc
And we should be able to parse loops:
while(n<10) { n = n+1 }
turns into
{ "while", { "bool", { "expr", { "term", { "ref", "n" } } }, "<", { "expr", { "term", 10 } } }, { "list", { "assign", { "ref", "n" }, { "expr", { "term", { "ref", "n" } }, "+", { "term", 1 } } } } }
Now we just need the eval
function to actually run the loops. We'll do this in much the same way as the conditionals. Add a case to eval
to handle "while" nodes:
function eval(ast) -- stuff elided -- elseif ast[1] == 'while' then while eval_bool(ast[2]) do eval(ast[3]) end end end
And now we can write a unit test for this:
VARS.n=0; VARS.x=1 stmt:match("while(n < 8) { x = x * 2; n = n + 1 }") assert(VARS.x == 256)
Next steps
So, now that we've separated parsing our language from running programs we've parsed, we can easily add new features like control structures. Most new features you might want to add follow this basic pattern: modify the parser to recognize it, then modify the evaluator to run the parse tree. And that's mostly what we're going to do next time too: we'll modify the parser to recognize function definitions, and we'll modify the evaluator to store and call them.
As always, the completed code for this chapter is available here.