A Toy Regular-Expression Matcher
There’s a book I have, Beautiful Code, which is a bunch of essays about programming by various famous programmers. It’s a great book. The chapters are on all different topics; the only thing they have in common is a piece of code that makes you think “wow, that’s really clever”. This post is about chapter one of that book, which is about a regular expression matcher.
Regular expressions are a domain-specific language for describing finite state machines. Specifically, finite state machines where the state transitions are successive characters in a string: you can use them to check if input strings match certain formats, like, a phone number: three digits followed by a dash followed by four digits. Kernighan’s matcher, which I’m going to translate to Lua and then extend a little bit, is really compact and beautiful. It’s about thirty lines, and recognizes these tokens:
- A dot (“.”) matches any single character
- A caret (“^”) matches the beginning of the string
- A dollar sign (“$”) matches the end of the string
- A star (“*”) matches zero or more repetitions of the previous token
- Any other character (like “a”) matches itself.
My plan is first to duplicate Kernighan’s matcher in Lua and then extend it to also handle character sets, like “[abc]” would match a single “a”, “b”, or “c”. Because of this, I’m not going to port his code across directly, but rather make some changes for extensibility.
Basic matcher
Lua’s string metatable already has a match function, so we’re gonna, uh, destroy it. If you like yours feel free to call this my_match
or something:
function string.match(str, pattern) if pattern:at(1) == "^" then return matchhere(str, pattern:sub(2)) else repeat if matchhere(str, pattern) then return true end str = str:sub(2) until str == "" end end
This is pretty straightforward, although you can see it calls a couple utility functions we’ll have to write. matchhere
tries to find a match starting at the beginning of the string, so, if our pattern starts with a caret, we cut it off and try to match our pattern once and return. If it doesn’t, we enter a loop where we attempt to find a match starting at every character, until we find one or the string is empty. The code for matchhere
is a little more complicated but not much:
function matchhere(str, pattern) -- Everything matches an empty pattern if pattern == "" then return true end -- Try to match end-of-string if pattern == "$" then return str == "" end local first_token, modifier, after = pattern:at(1), pattern:at(2), 2 if modifier == "*" then after = 3 end if modifier == "*" then return matchstar(str, first_token, pattern:sub(after)) elseif matches(str:at(1), first_token) then return matchhere(str:sub(2), pattern:sub(after)) end end
We first handle the two degenerate cases, matching against an empty pattern or an empty string. After that we want to find the token we’re going to match against (a single letter or a dot) and its modifier (a star or, well, not a star). The after
variable is used to try to find the portion of the pattern left over after this token/modifier tuple. This is a difference between what I wrote and Kernighan’s code: he assumes that the tokens are one character long (either a char or a dot) and I plan to add character classes, so I can’t.
After that we either call another helper function (matchstar
) or recurse. Since we directly return the result of the recursive call, this is actually not horrible: Lua supports tail-call elimination so this won’t blow out the stack. More on that later though.
The matches
function is really simple so I won’t bother writing this first version, instead let’s look at matchstar
:
function matchstar(str, token, rest) while true do if matchhere(str, rest) then return true end if not matches(str:at(1), token) or str == "" then break end str = str:sub(2) -- Next char end end
We take a string, a token that’s allowed to repeat, and a tail of a pattern. In a loop, we try to match the string to the tail. If we can’t, and the string starts with the allowed-repeatable token, then that’s probably why, so we strip that off and try again.
So that’s everything you need. It can use up a few stack frames, but because it’s tail-recursive it only uses one extra one per star in the pattern, and there usually aren’t many of those. So now let’s add character classes to it.
Character classes
A character class is a set of characters inside brackets, like “[abc]”. The intent is for a character class to act like one token, but match any letter in the class, so “[abc]” would match “a” but not “d”, and “[01]*” would match “101010” but not “01210”. In order to add these, we have to modify two functions, matches
(which we haven’t seen yet) and matchhere
(where it finds first_token
and after
). Let’s do matches
first:
function matches(ch, token) if token.chars then return token.chars:find(ch) else return token == "." or token == ch end end
We’re going to say, now, that a token is just a string containing a single character, or a table {chars = "..."}
containing a set of characters (this is to remove the ambiguity between “.” and “[.]”; we want the latter to only match a literal dot). Now we just need a way to pull the first token, whatever it is, off a pattern:
function grabtoken(pattern) local token, rest if pattern:at(1) == "[" then -- A char class! local close = pattern:find("]") token = {chars = pattern:sub(2, close-1)} rest = pattern:sub(close+1) else -- Not a class token = pattern:at(1) rest = pattern:sub(2) end local modifier = rest:at(1) if modifier == "*" then rest = rest:sub(2) end return token, modifier, rest end
We look at the first character to tell if this is a class (which will end in a close brace) or a normal character. Once we pull off the token part, we determine what the rest of the pattern is after it, which we may cut a character (a star) off of later. Then we return the token, the char right after the token (in case it’s a star), and the rest of the pattern.
We can pretty much plug this directly into matchhere
:
local first_token, modifier, rest = grabtoken(pattern)
It returns exactly what we’d expect. I did change matchhere
to want the rest of the pattern instead of the index of the rest of the pattern, just to make it a little cleaner.
Demonstration
It works pretty much the same way a subset of Lua’s built-in matcher works. Here’s an example of using it to determine if a string of binary digits is even:
assert(("10010110100"):match("^[01]*0$"))
The real payoff here is in seeing how easy it is to add more features to it. Adding named character classes (“\d” for all digits, say) is easy; more useful would be to add escapes, so we can match a literal dot. More useful and a lot harder would be adding captures. What have you always wished regular expressions would do? Play around with it, see what you can add. The code (with unit tests using lt) is here.