Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Easy Parsing with Parser Combinators (lihaoyi.com)
92 points by lihaoyi on Sept 8, 2016 | hide | past | favorite | 25 comments


Great write up, particularly building up towards parsing simple expressions.

One benefit - hinted at, but not explicitly said - between this approach and parser generators is that all of the user's parsing code lives in the same place. An ANTLR grammar that solved this problem would take about the same amount of lines - but that wouldn't be enough. You would still need to write the Java code that takes actions on the grammar rules, and that code will live in a separate place. Add in the complexity of setting up your build to deal with generated source code, and you won't want to go the ANTLR (or other parser generator) approach unless you really have to.

I only compare against parser generators because the other solutions are, to me, in different classes. Regular expressions are not arbitrarily flexible, so they're not a full solution. And rolling your own recursive descent parser is not using an existing library, but going the "hammer and tongs" route (as my compilers professor used to say).


For Swift fans, there is a parser combinators (with implementation) talk by Yasuhiro Inami:

https://realm.io/news/tryswift-yasuhiro-inami-parser-combina...

The last free episode of Swift Talks by objc.io was about parsing techniques (highly recommended as an introduction to parsing) and it had also a small section at the end covering parser combinators.

https://talk.objc.io/episodes/S01E13-parsing-techniques

There are at least two parser combinator implementations in Swift on Github (one is from the first talk above, the other from the objc.io book 'Functional Swift'):

https://github.com/tryswift/TryParsec

https://github.com/kareman/FootlessParser


Meanwhile, for Ruby fans...

http://kschiess.github.io/parslet/ is a parser combinator library for Ruby.


Meanwhile, for Rust fans...

https://github.com/Geal/nom is a parser combinator library for Rust.


Hi lihaoyi, what advantages does your library have over `scala.util.parsing.combinator` and parboiled? I have only used the former, and while I loved it there were a couple of things that I felt could be improved.


There is a short write-up in the main doc-site http://www.lihaoyi.com/fastparse/#Comparisons


Thanks, that's exactly what I wanted to know.


And for anyone interested in taking a peek at the Scala standard library, it now exists as a separate library here:

https://github.com/scala/scala-parser-combinators


Thanks for the article.

What about Error Recovery capabilities? I did not find any reference in the docs regarding that.

Is it something you would want to add to FastParse? Do you think it is even possible without a separate scanning phase?


Does this parser combinator implementation handle left-recursive grammars?


Nope; you need to left-factor them unfortunately. Other parser combinators may have different implementation algorithms, but this one is naive PEG/recursive-descent so you need to help it out manually otherwise it'll just infinite-recurse


Yeah, it's much easier to write a parser combinator implementation that doesn't support it. Thanks for answering.

For those interested, here is an article about a parser combinator implementation in Racket that supports left-recursive grammars:

https://epsil.github.io/gll/


Is it possible to detect and warn about the left recursion? Assuming a data structure representing the grammar is built at runtime to be interpreted.

Won't it be possible to detect the left recursion instead of going into an infinite loop?


Do parser combinators have any obvious downsides?


Sure! I didn't mention them in the post because it was already getting a bit long, but here are a few:

- Unlimited backtracking means it's really easy to accidentally end up with `O(2^n)` parsing behavior. Most real-life languages you parse won't be like that, but there's nothing in-the-framework that's stopping you from becoming exponential. In this way, it's just like recursive descent.

- Performance isn't as good as code-gen or hand-written; it's basically a grammar-interpreter rather than a grammar-compiler. We're looking at (depending on grammar) 4-10x slower than the most optimized hand-written parser, e.g. in the same benchmark, the FastParse JSON parser runs at ~2 million lines of JSON a second, vs ~8 million lines/second for "industry standard" parsers on the same platform (JVM) like Jackson. Whether this is a problem for you is an empirical question.

- They take some initialization time at-runtime. A hand-rolled or code-gen parser is ready almost immediately, whereas a parser-combinator parser takes a bit of time instantiating `Parser[T]` objects and plumbing them around. This is a one time cost, since the parsers are immutable and can be re-used, but it's still a cost nonetheless if you're trying to shave milliseconds off your startup time.


>Unlimited backtracking means it's really easy to accidentally end up with `O(2^n)` parsing behavior.

Does this apply to all parser combinators? In Bennu (http://bennu-js.com/), which I believe is based on Parsec, you only get backtracking at specific points where you explicitly choose to have it. Are most parser combinator libraries different from Bennu and always let you backtrack to any point implicitly? If you use that sparingly or stay away from it, does that mean you've avoided O(2^n) behavior, or is there some other kind of unavoidable backtracking behavior inherent to parser combinators?


It's worth mentioning that, except for the initialization time, these are not inherent downsides of parser combinators. Any parsing strategy can be implemented using combinators, for example the "earley" package for haskell produces Earley parsers using combinators. Writing combinators for an algorithm which needs to analyze the whole grammar does require explicit fixpoints (`A -> aA` has to be encoded as something like `fix(lambda A. aA)`, or in Haskell with `MonadFix` syntax), and unsafe type casts.


Yes, you are right. These are limitations of the PEG/recursive-descent style of parser-combinators, which FastParse is a member of. If you use a set of parser combinators that use a different execution model (e.g. https://github.com/djspiewak/gll-combinators) then the API will be similar, but the performance and runtime characteristics will be totally different.


> there's nothing in-the-framework that's stopping you from becoming exponential.

Recursive-descent parsing is executing a logic program, so couldn't you prevent backtracking by adding a `cut` operator? That's essentially what LL(1) parsing gives you: an implicit `cut` after successfully selecting a branch.


Yes, there is a `cut` operator, and it does exactly what you expect. It's heavily documented here http://www.lihaoyi.com/fastparse/#Cuts

Just because there's a `cut` operator doesn't mean people will use it though =P


Oh, very nice. :-)


The article does not touch the issue of error reporting. Can parser combinators produce sensible messages without much efforts?


The main doc-site has a section on debugging and error reporting http://www.lihaoyi.com/fastparse/#DebuggingParsers


It would help to have a list of similar libraries for languages other than Scala.


I'm not very deep into parsing stuff, but I believe attoparsec [1] is popular in Haskell world.

[1] https://hackage.haskell.org/package/attoparsec




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: