LTRE NFA
Frozen version of LTRE on the NFA engine 2024-06-10
This is an archive of LTRE before abandoning the NFA engine in favor of Brzozowski derivatives.
LTRE
Finite automaton regular expression engine 2024-06-10
Project Readme • Project Files • GitHub Repo
LTRE is a linear-time regular expression library written in C99 that I built while dabbling in automata theory. For more information, see the project’s readme.
The regex grammar LTRE implements is unfamiliar, though is mostly backwards-compatible with POSIX ERE. The main addition is the concept of symsets, which represent sets of acceptable characters, and which can be unioned with circumfix []
, intersected with circumfix <>
and negated with prefix ^
. For example, <a-z^[aeiou]>
is a symset which matches a lowercase consonant. Symsets increase expressiveness at no cost to compilation performance and follow naturally from the engine’s design.
LTRE supports alternation with the usual infix |
, but also supports intersection with infix &
and complementation with prefix ~
. People on the internet love to babble about how compilation of regexes with intersections and complements would be too inefficient, and about the problem of worst-case exponential state blowout, but nobody seems to have sat down and actually worked through an implementation. The truth is that none of it matters in practice: LTRE compiles any sensible regular expression to a minimal DFA in under a millisecond.
One caveat is that LTRE doesn’t support submatch extraction, because the determinization and optimization of tagged finite automata—which appears to be the proper way forward from here—frankly sounds like more trouble than it’s worth. ripgrep and Google’s RE2 report capture group offsets for known matches by falling back to (either typically or asymptotically) slower engines, and I’m not willing to make that compromise.
As a real-world stress test for the engine I wrote LTREP, a small command-line search tool. I benchmarked it against ripgrep and GNU grep and, to my surprise, it performed competitively. LTREP gets left in the dust when ripgrep and GNU grep perform literal optimizations, but, when they run out of tricks and fall back to their core engines, LTREP takes the lead.
Writing a regex engine based on finite automata has shifted my outlook on regular expressions. Regular expressions are not intrinsically hacky; when used for what they’re designed for (describing regular languages) and run on a good engine (not on a backtracking engine), they’re expressive, precise and predictable.