Parsing is Difficult


Blog Theme

https://news.ycombinator.com/item?id=13821829 -- are real parsers hand-written, or do they use code generators and meta-languages?

https://news.ycombinator.com/item?id=13041646 -- feedback on metalanguages in response to the old META II paper. The Ruby parser is complex, and the JRuby parser is an example of the same language being parsed by two hand-written parsers.

https://news.ycombinator.com/item?id=13630686 -- my own metalanguage

https://news.ycombinator.com/item?id=15713825 -- someone who thinks parsing languages is simpler than it actually is (in the context of the Language Server Protocol)

https://news.ycombinator.com/item?id=14379114 -- Walter Bright overstating the applicability of CFGs

Language Design and Theory of Computation -- CFGs aren't powerful enough. Java has two grammars.

Note about error messages: are parse errors really hard? They all seem similar as long as you have location info. Python doesn't seem to do any special handling of parse errors; it uses a grammar.

I think type-checking error messages are hard, like Clang diagnostics. And runtime errors to an extent. But parse errors may be easy if the language is straightforward enough.

The one place I might want an advanced error message is for $((. You want to know where the parser started treating the input as arithmetic: the place you fix is different from the place where the error occurred.
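
A minimal sketch of the idea (my own illustration, not Oil's code): record the position where the $(( begins and attach it to any error raised while parsing what's inside, so the message can point at both locations.

class ParseError(Exception):
    def __init__(self, msg, err_pos, ctx_pos, ctx_msg):
        self.err_pos = err_pos  # where the parse failed
        self.ctx_pos = ctx_pos  # where the user probably needs to fix it
        super().__init__('%s at offset %d\n  note: %s at offset %d'
                         % (msg, err_pos, ctx_msg, ctx_pos))

def parse_arith_sub(src, pos):
    """Consume a $(( ... )) span; only paren balancing is checked here."""
    start = pos                          # location of the '$((' token
    assert src.startswith('$((', pos)
    depth, pos = 2, pos + 3
    while pos < len(src) and depth:
        if src[pos] == '(':
            depth += 1
        elif src[pos] == ')':
            depth -= 1
        pos += 1
    if depth:
        raise ParseError('unexpected end of input', pos, start,
                         'arithmetic substitution started here')
    return pos

print(parse_arith_sub('echo $((1 + (2*3)))', 5))  # ok: index past the final ))
try:
    parse_arith_sub('echo $((1 + (2*3))', 5)      # missing the final )
except ParseError as e:
    print(e)  # the note points back at offset 5, where $(( began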

  • My reply to someone who says parsing is easy, with points from this page: Parsing Is Not Solved
    • Another point: Microsoft and JetBrains have parsing technology beyond the state of the art in open source, and it's still advancing.

Real Languages have at least Two Parsers

  • Python: CPython parser, lib2to3, redbaron

  • Ruby: MRI parser, JRuby parser

  • Go: the original C parser in gc (before the compiler was translated to Go), and the go/parser package written in Go

  • Scala: seems like scala-meta might have another parser

  • JavaScript: many different parsers

  • Unified parsers:

    • Clang
    • Roslyn

From coverage.py: the Python stdlib tokenize module isn't good enough because it's not lossless. It's also a second tokenizer, since the C tokenizer inside CPython isn't exposed to Python code.

def phys_tokens(toks):
    """Return all physical tokens, even line continuations.

    tokenize.generate_tokens() doesn't return a token for the backslash that
    continues lines.  This wrapper provides those tokens so that we can
    re-create a faithful representation of the original source.

    Returns the same values as generate_tokens().
    """
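
A quick way to see the loss, using only the stdlib (this snippet is mine, not from coverage.py):

import io
import tokenize

src = 'x = 1 + \\\n    2\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok.type, repr(tok.string), tok.start, tok.end)
# No token's string contains the backslash: '+' ends on line 1 and '2'
# starts on line 2, so the physical line continuation has to be inferred
# from the position gap.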

Parsing is Slow

Another takeaway: people ship a lot of JavaScript over the wire that they never use!

Blog Outline

  • The state of the art is hand-written parsers

  • The state of the art is two hand-written parsers (possibly written to the same spec, or not)

  • The state of the art is two grammars? (TODO: Look at Java). Grammars are no longer declarative.

  • State of the art:

    • v8 and Clang for performance
    • TypeScript / C# for generality
    • v8 for safety? any others?

Gaps Between Theory and Practice

Parsing Requires Little Tricks

Contrast with the paper Pure and Declarative Syntax Definition: Paradise Lost and Regained.

Idea for Parsing Tool

  • Stateful lexer -- a generalization of a regular lexer (see the first sketch after this list).
    • it's meant to handle \n in strings, 1.0e100 in numbers, etc.
  • LL(k) for efficiency
  • Pratt parsing (or shunting yard) for efficiency and a compact description of expression grammars (see the second sketch after this list)
  • LR(k) for C, Awk, maybe Ruby, R, etc.
    • TODO: research the ways in which LL(k) is not sufficient for those languages?
    • we know about C function declarations vs. definitions: distinguishing them is LR(k), but requires unbounded lookahead for LL(k), because you can't tell which one you have until after the parameter list.
    • JavaScript arrow functions: in (a, b) => a + b, the parameter list looks like a parenthesized expression until the => appears.
  • Zephyr ASDL for abstract syntax and Lossless Syntax Tree Pattern
  • what language for semantic actions? Maybe reuse the OVM data structures and library? OVM could be based on algebraic data types?
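
First sketch: the stateful-lexer idea in miniature (my own illustration; the modes and token names are not from Oil's actual lexer). Each mode has its own rules, and certain tokens switch modes, so a newline inside a double-quoted string lexes differently from a newline outside one.

import re

# Each mode has an ordered rule list: (pattern, token type, next mode).
# A next mode of None means "stay in the current mode".
LEXER_MODES = {
    'OUTER': [
        (re.compile(r'"'),                  'LEFT_DQUOTE',  'DQ'),
        (re.compile(r'\d+(\.\d+)?(e\d+)?'), 'NUMBER',       None),
        (re.compile(r'[a-zA-Z_]\w*'),       'NAME',         None),
        (re.compile(r'[=+\-*/]'),           'OP',           None),
        (re.compile(r'\s+'),                'WS',           None),
    ],
    'DQ': [
        (re.compile(r'"'),                  'RIGHT_DQUOTE', 'OUTER'),
        (re.compile(r'\\.'),                'STR_ESCAPE',   None),
        (re.compile(r'[^"\\]+'),            'STR_CHARS',    None),  # may span newlines
    ],
}

def lex(s):
    mode, pos = 'OUTER', 0
    while pos < len(s):
        for pattern, tok_type, next_mode in LEXER_MODES[mode]:
            m = pattern.match(s, pos)
            if m:
                yield tok_type, m.group(0)
                pos = m.end()
                if next_mode:
                    mode = next_mode
                break
        else:
            raise SyntaxError('no rule matches at %d in mode %s' % (pos, mode))

for tok in lex('x = "a\nb" + 1.0e100'):
    print(tok)  # the \n inside the string comes out as STR_CHARS, not WS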
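
Second sketch: a tiny Pratt parser for expressions (again my own illustration). A table of binding powers replaces one grammar rule per precedence level, which is what makes the description compact.

BINDING_POWER = {'+': 10, '-': 10, '*': 20, '/': 20}

class ExprParser:
    def __init__(self, tokens):
        self.tokens = tokens + ['<eof>']
        self.i = 0

    def next(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def peek(self):
        return self.tokens[self.i]

    def parse(self, min_bp=0):
        tok = self.next()
        if tok == '(':
            left = self.parse(0)
            assert self.next() == ')', 'expected )'
        else:
            left = int(tok)
        # Keep consuming operators that bind tighter than our caller's.
        while BINDING_POWER.get(self.peek(), 0) > min_bp:
            op = self.next()
            left = (op, left, self.parse(BINDING_POWER[op]))
        return left

print(ExprParser(['1', '+', '2', '*', '3']).parse())
# ('+', 1, ('*', 2, 3)) -- precedence comes from the table, not the grammar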

Papers

  • A Simple, Possibly Correct LR Parser for C11. They use ocamllex, Menhir, and a Context object to do the "lexical feedback". The Related Work section at the end is very good.
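
The shape of "lexical feedback", sketched in Python rather than the paper's OCaml (the names here are mine): the parser records typedef names in a shared context, and the lexer consults it to decide whether an identifier is a type name -- the classic C ambiguity in T * x;.

class Context:
    """Shared between the parser and the lexer."""
    def __init__(self):
        self.typedef_names = set()

    def declare_typedef(self, name):  # the parser calls this
        self.typedef_names.add(name)

def classify(name, ctx):  # the lexer calls this for each identifier
    if name in ctx.typedef_names:
        return ('TYPEDEF_NAME', name)
    return ('IDENTIFIER', name)

ctx = Context()
print(classify('T', ctx))  # ('IDENTIFIER', 'T')   -> T * x is a multiplication
ctx.declare_typedef('T')   # the parser just saw: typedef int T;
print(classify('T', ctx))  # ('TYPEDEF_NAME', 'T') -> T * x is a declaration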
