CPS-RE
A tiny regex engine in continuation-passing style 2024-12-15
Project Files • GitHub Repo • cps-re.c
Classic backtracking regex engines walk regular expression strings using the call stack and maintain a separate “backtrack stack” data structure for backtracking. CPS-RE turns this on its head: it walks regular expressions in continuation-passing style, freeing up the call stack to be used as the backtrack stack. See the project’s readme for more on this.
CPS-RE is 200 lines of C99 and its feature set is purposefully small—but not minimal, so as to remain interesting. In addition to the usual greedy quantifiers and alternation, CPS-RE supports lazy and possessive quantifiers as well as regular expression intersection and complementation.
Possessive quantifiers need to break the normal backtrack flow and jump to a sort of “backtrack checkpoint”. With CPS, this means longjmp
ing somewhere up the stack. Managing setjmp
s and longjmp
s across nested possessive quantifiers was getting unwieldy, so I wrote some macros to abstract away the details. Looking back, I think I accidentally implemented exceptions—which checks out, because this matches their semantics.
In CPS, a backtrack operation becomes a simple return
statement. And a backtrace from the found_match
function gives an executive summary of how the regex matched the input. This is neat and all, but there’s a price: match against large enough a haystack and quickly the call stack overflows. Depending on the application, this may be a deal breaker.
Here’s a mind-bender: represent as a graph the flow of continuations needed for walking some regular expression, making sure to merge nodes that correspond to identical continuations, and the result is an NFA for that regular expression. I think I was assuming that fewer IRs meant simpler code, which is why CPS-RE runs matches straight from the regular expression string. In the end, though, CPS-RE does have an IR; it’s just harder to see because it’s hidden in the continuation flow. The engine would’ve been much simpler without the middleman: scrap continuations and instead generate straightforward NFA bytecode to be sent to the recursive matcher.