Parser Combinators in Elixir

dnautics · on April 6, 2021

A few protips for nimble_parsec:

- ninja in a struct for your context, just generally a better choice due to the stronger typing.

- decide if your parser is a transformer or an analyzer. If it only analyzes text, then throw away all of the text after you use it and use the text stream as a contextual buffer. If it transforms, be consistent about the tokens you return in the stream. Use structs if you can.

- build intermediate defparsecs which create testable functions that you can test. If you like, fence them in `if Mix.env == :test` blocks to keep them from being compiled except in test.

Example: https://github.com/ityonemo/zigler/blob/master/lib/zig/parse...

brightball · on April 6, 2021

This is a solid intro. I hadn't seen a good write up on NimbleParsec yet either. Kudos to the author.

ahelwer · on April 6, 2021

Can anyone give ballpark effort costs for writing a parser with a context-free grammar parser generator vs. parser combinators vs. by hand? I know good error messages are the main benefit of writing parsers by hand; can you also get those with parser combinators? Do parser combinators handle error recovery and stuff like that, or are you responsible for it?

CornCobs · on April 7, 2021

I've used ANTLR4 and parser combinators (haskell). I would say ANTLR4 was really fast to get going, but it hides a lot of execution semantics from the user. It was pretty difficult for me to debug precedence or even figure out when certain rules were being triggered or why (do not have the antlr4 IDE, maybe that helps).

On the other hand writing parser combinators is really reaallly easy. In my experience it's almost just as declarative (you build up chunks of your grammar from smaller chunks), and yet you maintain good control over the control flow of your parser - when it backtracks, what errors you report, etc. And since its in the language you are using, debugging uses regular debugging tools, and custom logic is expressed in regular haskell

Now ANTLR4 also allows writing predicates in the target language, but how and when they get executed is pretty difficult to figure out

Overall, I would say if the thing you need to parse is pretty simple (clearly specced, straightforward grammar) a parser generator could be worth it. But if theres anything more complex (optional semicolons, context-sensitive ignoring whitespace, etc) I would much prefer the comfort and understandability of parser combinators

auggierose · on April 7, 2021

Now that's a pretty big fail for a parser generator, as it should be at its best ESPECIALLY when the parse gets complicated.

yellowapple · on April 7, 2021

I guess I can chime in on the "by hand" front, since that's how I ended up going about the first non-trivial parser I wrote[1]: https://github.com/otpcl/otpcl/blob/master/src/otpcl_parse.e...

I'd say the difficulty was moderately high, but that was with no real prior experience with parsers. With that water under the bridge, I'd now rate it at around moderate effort. And the result was gaining a clear and precise understanding of the implicit state machine transitions, and being able to control exactly where and how those transitions happen, such that I didn't really need much of a lexer (the "lexer" just tags each character with its position, so that I didn't have to track that separately in the actual parser code itself).

That said, the result is a bit of a tangled mess; it didn't start that way, but eventually the parsing logic got complex enough that I needed to resort to Erlang's preprocessor macros, and while the end result is manageable through some judicious organization, in hindsight I probably could've done the same with functions, and in a more reusable and maintainable way. If I ever get around to another parser rewrite, I might try using parser combinators or some approximation thereof instead.

----

[1]: Technically the second or third, since I rewrote it a couple times as one can see from the commit history - although said history is a bit hard to pin down across all the renames of the relevant file.

bradrn · on April 7, 2021

> Can anyone give ballpark effort costs for writing a parser with a context-free grammar parser generator vs. parser combinators vs. by hand?

The few times I’ve attempted to use a parser generator, I could never figure them out. By contrast, when I started using parser combinators (in Haskell), I was able to start writing parsers almost immediately.

> I know good error messages are the main benefit of writing parsers by hand; can you also get those with parser combinators?

It depends on the library, of course, but they can get pretty good if you annotate your parsers correctly. For comparison, here’s the CSV parser from the article translated into Haskell (using ‘megaparsec’, https://hackage.haskell.org/package/megaparsec):

    -- stack --resolver lts-17.9 script --package megaparsec
    module Main where
    
    import Data.Bifunctor
    import Data.Void
    
    import Text.Megaparsec
    import Text.Megaparsec.Char
    
    type Parser = Parsec Void String
    
    quotedChar :: Parser String
    quotedChar = label "string character" $ string "\"\"" <|> (pure <$> anySingleBut '\"')
    
    regularVal :: Parser String
    regularVal = takeWhileP (Just "a non-escaped character") $ not . (`elem` ['\r', '\n', ','])
    
    escapedVal :: Parser String
    escapedVal = char '\"' *> (concat <$> many quotedChar) <* char '\"'
    
    value :: Parser String
    value = escapedVal <|> regularVal
    
    line :: Parser [String]
    line = sepBy1 value (char ',') <* newline
    
    parseCSVFile :: String -> Either String [[String]]
    parseCSVFile input = first errorBundlePretty $ parse (many line) "" input
    
    main :: IO ()
    main = do
        putStrLn "INPUT:\n-----"
        putStrLn input
        putStrLn "\nOUTPUT:\n------"
        case parseCSVFile input of
            Left err -> putStrLn err
            Right r  -> print r
      where
        input =
            "this,is,a,test\n\
            \1,\"an \"\"escaped,string\",2,3\n\
            \whoops,\"there is,an unpaired,quote"

And the output:

    INPUT:
    -----
    this,is,a,test
    1,"an ""escaped,string",2,3
    whoops,"there is,an unpaired,quote
    
    OUTPUT:
    ------
    3:35:
      |
    3 | whoops,"there is,an unpaired,quote
      |                                   ^
    unexpected end of input
    expecting '"' or string character

> Do parser combinators handle error recovery and stuff like that, or are you responsible for it?

It’s dependent on the implementation, but with ‘megaparsec’ it’s a combination of both. It’s often OK by default — e.g. my example above has no explicit error recovery — but in more complex parsers, you’ll need to wrap some components with ‘try’, to indicate that it needs to backtrack if an error was found. (There’s other error-recovery combinators as well, such as ‘withRecovery’ and ‘observing’, but in practise you mostly end up using ‘try’.)

sandbags · on April 7, 2021

This is a good article. Especially in that it talks a little bit about error handling which does not feel like a strength of NimbleParsec.

Another good resource is this talk by Saša Jurić on Parsing from First Principles (https://www.youtube.com/watch?v=xNzoerDljjo) where he talks about parser combinators and live-codes a simple Elixir PC library.

That inspired me to build my own alternative to NimbleParsec . Partly because my use case makes error handling very important and I couldn't get a grip on how to do that with NP and partly as a decent problem to get to grips with Elixir. I'm currently figuring out lookahead.

This is a great area to play in that has a lot of useful applications.