LTRE
Finite automaton regular expression engine
#todo write writeup
#todo date
derivatives good cs:
- simpler impl & often faster than NFA
- trivial to add compl, and thus so to add inter
- derivatives easy to reason about (eg. repeat count in IR)
One might try forgoing the DFA, and only computing derivatives for the strings that we wish to match, but regex derivatives also blow up badly. This is a pity, as it suggests in general we may need NFAs. --- Haskell - Regex Calculus.pdf and The Big Brzozowski _ weaselhat.pdf --- https://crypto.stanford.edu/~blynn/haskell/re.html and https://www.weaselhat.com/2020/05/08/the-big-brzozowski/. pinky is telling me that structural sharing would prevent blowup. prove? see also his impl at https://github.com/Pomona-College-CS181-SP2020/regularity/blob/master/executable/Main.hs and https://github.com/Pomona-College-CS181-SP2020/regularity/blob/master/library/Regularity/Regex.hs
maybe not so much?:
test("0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?", "", true);
test("(00)?(00)?(00)?(00)?(00)?(00)?(00)?(00)?", "", true);
realization: Brzozowski derivatives with structural sharing is shockingly similar to the lock-step algorithm for NFAs. the lock-step algorithm for NFAs is an obtuse special-casing of Brzozowski derivatives. other:
- easier to reason about a derivative than about an NFA lock step, prb cs one less IR to think about. eg trivial to derive differentiation for
{m,n}
quants, so no need to splat them out like people do in NFA engines - people think that Brzozowski derivatives have exponential blowout, but they don't use structural sharing, and almost certain that using structural sharing would eliminate exponential blowout
turns out you can normal-form |&~ by just getting rid of either dual; in this case, got rid of &
breakthroughs:
- main contribution: implementation of Brzozowski derivatives in an imperative language
- smart constructors so we can use structural equality for similarity (some as Owen's)
- compute nullability in smart constructors because no reason not to. especially good cs nullability is a common question
- cache derivatives and the set of chars for which they hold. good cs get all the perf boost of Owen's char classes without needing a bulky hash set impl. good to cache at every level cs when a new symbol invalidates the derivative of a parent regex, it might not invalidate the derivative of a child regex --- derivatives rewrite/derivatives rewrite.md
dual concatenation & Kleene star: The dual of concatenation by Alexander Okhotin --- doi_10.1016_j.tcs.2005.07.019.pdf --- https://www.sciencedirect.com/science/article/pii/S0304397505004056
dup with ltre.c
regex
es are persistent data structure with acyclic structural sharing
#todo read about complementation --- https://langdev.stackexchange.com/questions/3421/why-dont-popular-regex-engines-support-complement-and-intersection:
- https://www.microsoft.com/en-us/research/wp-content/uploads/2020/08/pldi21-SBFA-final.pdf
- https://www.microsoft.com/en-us/research/wp-content/uploads/2019/02/SRM_tacas19.pdf
- https://link.springer.com/chapter/10.1007/978-3-540-71389-0_24
#todo read about lookarounds --- https://langdev.stackexchange.com/questions/3421/why-dont-popular-regex-engines-support-complement-and-intersection: