Great write up, particularly building up towards parsing simple expressions.
One benefit - hinted at, but not explicitly said - between this approach and parser generators is that all of the user's parsing code lives in the same place. An ANTLR grammar that solved this problem would take about the same amount of lines - but that wouldn't be enough. You would still need to write the Java code that takes actions on the grammar rules, and that code will live in a separate place. Add in the complexity of setting up your build to deal with generated source code, and you won't want to go the ANTLR (or other parser generator) approach unless you really have to.
I only compare against parser generators because the other solutions are, to me, in different classes. Regular expressions are not arbitrarily flexible, so they're not a full solution. And rolling your own recursive descent parser is not using an existing library, but going the "hammer and tongs" route (as my compilers professor used to say).
The last free episode of Swift Talks by objc.io was about parsing techniques (highly recommended as an introduction to parsing) and it had also a small section at the end covering parser combinators.
There are at least two parser combinator implementations in Swift on Github (one is from the first talk above, the other from the objc.io book 'Functional Swift'):
Hi lihaoyi, what advantages does your library have over `scala.util.parsing.combinator` and parboiled? I have only used the former, and while I loved it there were a couple of things that I felt could be improved.
Nope; you need to left-factor them unfortunately. Other parser combinators may have different implementation algorithms, but this one is naive PEG/recursive-descent so you need to help it out manually otherwise it'll just infinite-recurse
Sure! I didn't mention them in the post because it was already getting a bit long, but here are a few:
- Unlimited backtracking means it's really easy to accidentally end up with `O(2^n)` parsing behavior. Most real-life languages you parse won't be like that, but there's nothing in-the-framework that's stopping you from becoming exponential. In this way, it's just like recursive descent.
- Performance isn't as good as code-gen or hand-written; it's basically a grammar-interpreter rather than a grammar-compiler. We're looking at (depending on grammar) 4-10x slower than the most optimized hand-written parser, e.g. in the same benchmark, the FastParse JSON parser runs at ~2 million lines of JSON a second, vs ~8 million lines/second for "industry standard" parsers on the same platform (JVM) like Jackson. Whether this is a problem for you is an empirical question.
- They take some initialization time at-runtime. A hand-rolled or code-gen parser is ready almost immediately, whereas a parser-combinator parser takes a bit of time instantiating `Parser[T]` objects and plumbing them around. This is a one time cost, since the parsers are immutable and can be re-used, but it's still a cost nonetheless if you're trying to shave milliseconds off your startup time.
>Unlimited backtracking means it's really easy to accidentally end up with `O(2^n)` parsing behavior.
Does this apply to all parser combinators? In Bennu (http://bennu-js.com/), which I believe is based on Parsec, you only get backtracking at specific points where you explicitly choose to have it. Are most parser combinator libraries different from Bennu and always let you backtrack to any point implicitly? If you use that sparingly or stay away from it, does that mean you've avoided O(2^n) behavior, or is there some other kind of unavoidable backtracking behavior inherent to parser combinators?
It's worth mentioning that, except for the initialization time, these are not inherent downsides of parser combinators. Any parsing strategy can be implemented using combinators, for example the "earley" package for haskell produces Earley parsers using combinators. Writing combinators for an algorithm which needs to analyze the whole grammar does require explicit fixpoints (`A -> aA` has to be encoded as something like `fix(lambda A. aA)`, or in Haskell with `MonadFix` syntax), and unsafe type casts.
Yes, you are right. These are limitations of the PEG/recursive-descent style of parser-combinators, which FastParse is a member of. If you use a set of parser combinators that use a different execution model (e.g. https://github.com/djspiewak/gll-combinators) then the API will be similar, but the performance and runtime characteristics will be totally different.
> there's nothing in-the-framework that's stopping you from becoming exponential.
Recursive-descent parsing is executing a logic program, so couldn't you prevent backtracking by adding a `cut` operator? That's essentially what LL(1) parsing gives you: an implicit `cut` after successfully selecting a branch.
One benefit - hinted at, but not explicitly said - between this approach and parser generators is that all of the user's parsing code lives in the same place. An ANTLR grammar that solved this problem would take about the same amount of lines - but that wouldn't be enough. You would still need to write the Java code that takes actions on the grammar rules, and that code will live in a separate place. Add in the complexity of setting up your build to deal with generated source code, and you won't want to go the ANTLR (or other parser generator) approach unless you really have to.
I only compare against parser generators because the other solutions are, to me, in different classes. Regular expressions are not arbitrarily flexible, so they're not a full solution. And rolling your own recursive descent parser is not using an existing library, but going the "hammer and tongs" route (as my compilers professor used to say).