Making a toy programming language in Lua, part 2
This is part two of my series on writing a toy programming language using LPeg. You should first read part one here.
Last time, we made an interpreter for a simple calculator: it would read in mathematical expressions and evaluate them, printing the result. This time, we’re going to add two new features: variables and arrays. First though, we’ll refactor our parser some.
Cleaning up the parser
Here’s the grammar we ended up with last time:
spc = lpeg.S(" \t\n")^0 number = spc * lpeg.C( lpeg.P('-')^-1 * lpeg.R('09')^0 * ( lpeg.P('.') * lpeg.R('09')^0 )^-1 ) * spc / tonumber expr = lpeg.P{ "EXPR"; EXPR = ( lpeg.V("TERM") * lpeg.C( lpeg.S('+-') ) * lpeg.V("EXPR") + lpeg.V("TERM") ) / eval, TERM = ( lpeg.V("FACT") * lpeg.C( lpeg.S('/*') ) * lpeg.V("TERM") + lpeg.V("FACT") ) / eval, FACT = ( spc * "(" * lpeg.V("EXPR") * ")" * spc + number ) / eval }
The first annoying thing here is that “lpeg” is repeated everywhere. If we add the lpeg
table to our environment, we won’t have to do that:
setmetatable(_ENV, { __index=lpeg })
Next is whitespace being matched everywhere. Let’s try to remove a few of those: we’ll say that most patterns can be followed by spc
but that whitespace before them should be matched by whatever token preceded them (in most cases). We’ll also factor out a few helper patterns like digit
.
Finally, let’s change where we call eval
. Right now each nonterminal is sent to eval
, instead let’s specify that each production of a nonterminal will be sent there. That will make it easier to call different functions for different productions later: if we want to send EXPRs that are just one term to a different function than EXPRs that recurse, say.
With all those changes, the new grammar looks like this:
spc = S(" \t\n")^0 digit = R('09') number = C( P('-')^-1 * digit^0 * ( P('.') * digit^0 )^-1 ) / tonumber * spc lparen = "(" * spc rparen = ")" * spc expr_op = C( S('+-') ) * spc term_op = C( S('*/') ) * spc expr = spc * P{ "EXPR"; EXPR = V("TERM") * expr_op * V("EXPR") / eval + V("TERM") / eval, TERM = V("FACT") * term_op * V("TERM") / eval + V("FACT") / eval, FACT = lparen * V("EXPR") * rparen / eval + number / eval }
With all these changes it’s easy to accidentally break one part of the parser while fixing another, so we’ll also write a quick function to ensure we didn’t create a regression:
function test(expr) assert(expr:match(" 1 + 2 ") == 3) assert(expr:match("1+2+3+4+5") == 15) assert(expr:match("2*3*4 + 5*6*7") == 234) assert(expr:match(" 1 * 2 + 3") == 5) assert(expr:match("( 2 +2) *6") == 24) end
Eliminating recursion
Finally, we can now make the parser even simpler by eliminating some of the recursive rules. Instead of saying an EXPR is a TERM followed by an EXPR, let’s say an EXPR is a list of repeated TERMs:
expr = spc * P{ "EXPR"; EXPR = V("TERM") * ( expr_op * V("TERM") )^0 / eval, TERM = V("FACT") * ( term_op * V("FACT") )^0 / eval, FACT = lparen * V("EXPR") * rparen / eval + number / eval }
This will also require some changes in eval
, since it’s going to get a variable-length argument list, instead of rolling up the expression one operation at a time. Instead of eval(2, '+', eval(4, '+', 3))
, we’re now effectively calling eval(2, '+', 4, '+', 3)
. So, here’s the new eval
:
function eval(...) local args = {...} local accum = args[1] for i = 2, #args, 2 do local operator = args[i] local num2 = args[i+1] if operator == '+' then accum = accum + num2 elseif operator == '-' then accum = accum - num2 elseif operator == '*' then accum = accum * num2 elseif operator == '/' then accum = accum / num2 end end return accum end
Now we have a much nicer and cleaner-looking grammar.
Variables
So let’s add in variables. Here’s what we’d like to be able to do: we should first of all be able to assign values to variables with =
:
a = 5
Next, we’d like to be able to use variables in expressions:
b = (a+3) / 2
So to start off with, let’s think about how this will affect the grammar. The most obvious thing is that we’ll need a way to match a variable name:
letter = R('AZ','az') name = C( letter * (digit+letter+"_")^0 ) * spc
This makes variables be some sequence of letters, digits, and underscores, that starts with a letter.
Next is, right now, we’re matching EXPRs only; we’ll need a new type of statement that can be either an assignment or an EXPR. Let’s call it a STMT:
stmt = spc * P{ "STMT"; STMT = name * "=" * spc * V("EXPR") / assign + V("EXPR"), EXPR = V("TERM") * ( expr_op * V("TERM") )^0 / eval, TERM = V("FACT") * ( term_op * V("FACT") )^0 / eval, FACT = lparen * V("EXPR") * rparen / eval + number / eval }
STMT is now the first thing we match, and STMTs can’t be part of EXPRs (though EXPRs are part of STMTs). Also, we don’t send STMTs to eval
, we send them to a new function, assign
:
VARS = {} function assign(name, value) VARS[name] = value return value end
We’re storing all the variables in a VARS
table. We return the value on the right-hand side of the assign not because we need it for anything, but because if we don’t return something then LPeg will just produce the next string index (just as if we hadn’t captured anything).
Now we need a way to reference variables. A variable can be used anywhere a number can, so let’s make them another production of FACT:
FACT = name / VARS + lparen * V("EXPR") * rparen / eval + number / eval
Notice that what we’re passing name
s to isn’t a function, but our table of variables: looking up names in a table is a common-enough task that LPeg lets you use a table as a capture handler instead of a function.
Also, the order of these is somewhat tricky: it only works if name
goes before number
. This is because number
will try to match anything: if it starts with a minus sign or a digit it works, otherwise it fails and fails the entire FACT match because LPeg does no backtracking. If you want, you can either use number-name
(“match number as long as name doesn’t match”) in place of number
, or change number
to this, to make the order irrelevant:
number = C( ('-' + digit) * digit^0 * ( P('.') * digit^0 )^-1 ) / tonumber * spc
In general you should avoid having a pattern begin with an optional element (any element raised to a negative power), to avoid tricky backtracking issues like this.
New we can add a few lines into our test
function to make sure variables work:
stmt:match("a=3"); assert(VARS.a == 3) assert(stmt:match("a") == 3) assert(stmt:match("a * 5") == 15); VARS.a=nil
Arrays
Having single variables is nice, but we should also allow for arrays. Here’s the syntax we’ll use:
- An array can be specified literally with brackets, like “[1, 2, 3]”
- Arrays can be subscripted with brackets, like “arr[2]”
- Arrays can be nested inside of arrays
Let’s add some productions to our grammar. An ARRAY looks like this:
ARRAY = lbrack * V("VAL_LIST")^-1 * rbrack, VAL_LIST = V("VAL") * (comma * V("VAL"))^0
(lbrack
, rbrack
, and comma
are simple patterns matching one punctuation character followed by spc
, just like lparen
and rparen
were).
We’re adding a new nonterminal VAL that represents a value. These can be either EXPRs or an ARRAY:
VAL = V("EXPR") + V("ARRAY")
And where we had EXPR as the right-hand side of an assignment statement, now it should be VAL:
STMT = name * "=" * spc * V("VAL") / assign + V("EXPR"),
So, now we can parse things like “a=[1, 2, 3]”. We’ll need to actually capture things though. LPeg has a function, Ct
, that packs all the captures of a pattern into a table, and treats it as one capture. We can use that on the VAL_LIST instead of bothering to write a function to capture ARRAYs:
ARRAY = lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack
Now when we parse an assignment to an array, it will actually assign something, and we can test that:
stmt:match("a = [4, 5, 6]"); assert(VARS.a[1] == 4) assert(VARS.a[2] == 5) assert(VARS.a[3] == 6) VARS.a=nil stmt:match("b = []"); assert(VARS.b[1] == nil) VARS.b=nil
As well as, of course, nesting arrays:
stmt:match("c = [[1,2], [3,4]]") assert(VARS.c[1][1] == 1) assert(VARS.c[1][2] == 2) assert(VARS.c[2][1] == 3) assert(VARS.c[2][2] == 4) VARS.c=nil
Array subscripting
Now to work on how to get values out of arrays. Taking a value out of an array is a lot like reading the value of a variable in general. Let’s make a new nonterminal, “REF,” that will hold a reference that we can read or write to. A REF will be a variable name followed by a series of subscripts in brackets. FACT will match a REF, and read its value:
REF = name * (lbrack * V("EXPR") * rbrack)^0 / makeref, FACT = number / eval + lparen * V("EXPR") * rparen / eval + V("REF") / lookup,
The functions to make a REF and read its value look like this:
function makeref(...) local indices = {...} local tbl = VARS for i = 1, #indices-1 do tbl = tbl[indices[i]] end return {scope=tbl, index=indices[#indices]} end function lookup(ref) return ref.scope[ref.index] end
A REF consists of a table with scope
and index
keys. The scope is the table in which the variable resides (all variables that aren’t array elements reside in VARS
, so it’s simple enough to start there) and the index is the place in that table where you can find it. When we parse a REF we call makeref
to create this object, and when we parse a FACT that is a REF we call lookup
to read its current value.
This is the first thing our parser has generated that evaluates to an intermediate object, instead of a number. EXPRs, TERMs, and FACTs all get turned into the numbers they evaluate to when we parse them; a REF has no meaning until it’s evaluated by lookup
. We’re going to expand on this general idea of an intermediate representation a lot in the next post.
Anyway, for now, it’s easy to write a test that shows that we can look up array elements:
assert(stmt:match("c[4/2][1]") == 3)
Assigning to arrays
You may be wondering, why not combine makeref
and lookup
? Why not have makeref
go one step further down the chain of indices and return the final value?
It’s because we also want to support assignments; REFs as lvalues. Lua doesn’t let us take a reference to a number, numbers are always passed by value, so we have to store the table and the index to write to instead. By making one function to create that, we can use it in both FACT (with lookup
) and in assignments, with what we’re about to write.
Let’s change the grammar for STMTs:
STMT = V("REF") * "=" * spc * V("VAL") / assign + V("EXPR")
Now, change the assign
function to expect its first parameter to be a REF instead of a name:
function assign(ref, value) ref.scope[ref.index] = value return value end
With this one change, now we can assign to elements of arrays:
stmt:match("c[3] = 5") assert(VARS.c[3] == 5)
Next steps
Our little calculator from part 1 now understands variables, arrays, and even nested arrays. It’s starting to become a more complete language. However, we’re still missing a lot of concepts: there are no conditional statements or loops, both of which are needed to make variables and arrays useful (what good is a variable if it can’t affect program flow; what good is an array if you can’t loop over it?). We are also reaching (and edging past) the limits of the evaluate-as-you-parse model. We’ve already seen how it doesn’t work for assignments to arrays.
So, next week, we’re going to do a major overhaul of how this language gets run, and then use that new design to implement some features that would be impossible with this one: conditionals and looping.
The complete code to this post is available here.