OMeta really changed my view on how to prototype languages. TL;DR: OMeta lets you map patterns in your language to patterns in another language (say Python), and thereby you can implement a whole language roughly as easily as writing a BNF.
I looked into OMeta a while back (it sounded cool, although somewhat overblown) and wrote my own toy backtracking PEG implementation (somewhat based on LPeg), which isn't too difficult. I read most of the VPRI papers because I'm interested in compact software and DSLs.
But then I had occasion to implement a "real" language, and I just went with a hand-coded parser instead, despite a desire to avoid that.
I think these kinds of tools are good for prototyping toy languages, as you say, but unfortunately they're still TWO steps away from production-quality implementations of real languages. They're not good even for prototyping "real" languages (think shell, C++, Perl, Make), or for production-quality implementations of "toy" languages.
Possible reasons for this:
1) Developer experience... you end up having to reason about control flow and performance for "real" languages, breaking out of the declarative abstraction.
2) Most parsing tools are frameworks rather than libraries -- they kind of force you into a single lexer -> parser -> tree architecture (ANTLR being a prime example, re2c being a great counterexample for lexing). But this architecture doesn't work for many languages.
3) They make other unwarranted assumptions, e.g. about whitespace, which Bourne shell violates.
There are a bunch of other reasons I can list if anyone is interested. A good research topic would be to bridge this gap.
I wrote a very complete parser for shell in Python and sort of "discovered" what the language is... I would love to now port it to a meta-language and generate C/C++ code, rather than port it to C++ by hand.
But I think it's impossible. I was thinking of putting out a "meta-language bounty" to help me solve this problem.
I think what's more likely is that I will end up writing a custom code generator to port my subset of Python to C++. But I would love to be proven wrong. The Python parser is 3500 lines and the AST is 2000 lines. I would like to describe this with 200-500 lines of code in a meta-language (factor of 10 reduction). Although I think some things like good error messages are sort of "irreducible" so maybe that's impossible.
Basically the goal is: describe bash with a meta-language. I have described it in ~5K lines of Python, which is a pretty good compression over the actual bash source code, where the parser is between 10K and 20K lines, spread out over many files and many stages.
I wonder what Alan Kay would say about bash. I think he would probably say "don't use such a complex tool". But I think that's just an indication of the gap between research and practice. More people have used bash than any of his languages, and it consumes more CPU cycles worldwide as well. Although his ideas were influential, they had to be modified to actually work in practice.
I think it's possible if you make your meta grammar programmable, i.e. instead of generating code to compile, go into higher-order programming and return structures that contain functions to execute. This will let you insert programmable hooks into nodes on different stages and various specific behaviors, including proper errors, specific whitespaces, etc. Even switch parsers depending on context.
What I mean is something like this in pseudo code:
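A rough Python sketch of what such a programmable grammar could look like (all names here are hypothetical, and this is a toy illustration, not a real OMeta-style implementation): rules are first-class objects carrying hooks, rather than generated code, so you can attach custom behavior (errors, rewrites, context-dependent dispatch) to any node.

```python
# Hypothetical sketch: grammar rules as first-class objects that carry
# hooks, rather than generated code. All names here are invented.

class Rule:
    def __init__(self, parse_fn, on_match=None):
        self.parse_fn = parse_fn   # (text, pos) -> (node, new_pos) or None
        self.on_match = on_match   # optional hook run on every match

    def __call__(self, text, pos):
        result = self.parse_fn(text, pos)
        if result and self.on_match:
            result = (self.on_match(result[0]), result[1])
        return result

def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return Rule(parse)

def seq(*rules):
    def parse(text, pos):
        nodes = []
        for r in rules:
            result = r(text, pos)
            if result is None:
                return None
            node, pos = result
            nodes.append(node)
        return nodes, pos
    return Rule(parse)

# A hook can rewrite a node, raise a tailored error, or even dispatch
# to a completely different parser depending on context.
greeting = seq(literal("hello"), literal(" "), literal("world"))
greeting.on_match = lambda node: ("greeting", node)

print(greeting("hello world", 0))  # (('greeting', ['hello', ' ', 'world']), 11)
```

Because the grammar is just data plus functions, nothing stops a hook from switching lexer modes or calling out to an entirely different parser mid-parse.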
How is that different than what something like Yacc or ANTLR already provide?
Inserting arbitrary code is definitely necessary for real languages, but this breaks the model, any guarantees you might have gotten (about performance), and is often awkward. bash uses yacc, and there are a bunch of globals mutated in the semantic actions, which is not easy to reason about. They are REALLY breaking out of the model.
That is why people prefer hand-written parsers, because you can develop your own model, specific to the language you are parsing.
As I wrote in that blog post, my parser only requires two tokens of lookahead, so there are no fancy algorithms involved. The hard part isn't really deriving the parser from the grammar; it's everything else (parsing here docs, etc.)
I would question whether this saves any effort or lines of code over the hand-written approach. I think you would just end up with something almost as long -- and if you count the parser generator itself, it will be longer.
I guess the only option to keep clean grammars is to extend the meta language itself enough to support language-specific features. And if you implement it yourself it's not actually that hard to do. Parsing is not really a big deal after all.
If I remember correctly the idea behind OMeta is to invent new languages to fit better into your problem domain so you could save a ton of code, not to parse existing ones.
Yeah, I agree that they are more suited to designing new languages rather than parsing existing ones.
Parsing toy languages is not a big deal -- but then I have to make the observation that metalanguages are solving precisely the problem that's not a big deal.
But parsing real languages is a big deal. How many C, C++, Perl, Ruby, or shell parsers are there? How many text editors or lint tools contain poor implementations of such parsers?
Those are languages people actually use. I think the lesson is that humans can understand more complex languages than "standard algorithms" can, and they PREFER those languages (as shown by voting with their feet).
Metalanguages should be designed for the languages that people actually use rather than constraining them to an artificial model.
This is my experience as well, both from previous projects and from my long-running self-flagellation exercise of writing a Ruby compiler.
My parser is a mix between a straight-forward recursive descent parser relying on its own small library, and an operator precedence parser, because I like the elegance of operator precedence parsers... That is, I like the elegance of them in languages that are sufficiently regular.
Ruby quickly turned out to be a frustrating exercise in taking my nice, clean (at the outset) operator precedence parser and mangling it to handle all kinds of exceptions. You can see the reasons for every rule if you squint hard enough - they make it "nicer" to write Ruby most of the time (except when they don't...). But they make the parsing an absolutely awful mess [1].
I fantasise about finding nice abstractions for all of those exceptions one day, but I'm not holding my breath.
And this is with a "library" approach rather than a framework/generator.
I'm sure I can, and eventually will, extract some decently reusable components from it, but those will have a layer of crap piled on top of them when they're used to parse Ruby, and that layer of crap will make up a substantial proportion of the line count...
[1] Here's an example of the kind of fun stuff you find when parsing Ruby:
- If foo is a method, then "foo [1]" is "foo([1])" (calling the method "foo" with an Array containing the element 1)
- If foo is a local variable, then "foo [1]" is "foo.[](1)" (calling the method "[]" on the object in "foo" with "1" as the argument; if foo is a local variable, it will always shadow a method with the same name)
I checked out your project, very interesting! My parser for bash is also recursive descent + operator precedence, along with 13 different lexer modes (maybe 14 or 15 when I'm done!). This seems to be the "standard" now for production quality parsers; AFAIK that's essentially Clang's strategy for parsing C++.
Python is the exception in that it actually uses a custom parser generator (although the speed is not great, hence the need for .pyc files).
I don't know Ruby very well, but I've heard that people do like its syntax, but it's quite hard to parse. I think that speaks to my point that humans are better at recognizing weird syntax than "standard algorithms" are. Humans like to use the part of their brain that makes subtle distinctions with languages.
I'm also dreaming about finding some kind of abstraction to describe my shell -- a proper meta-language. Like you, I'm starting from the goal and working backwards toward the tool, not working forward within the limitations of an arbitrary tool.
Your blog looks similar to my blog, although mine isn't quite step by step. I wrote about an interesting finding about operator precedence parsing: there are two different names floating around for the same algorithm. And also I think the way Crockford explained it in JavaScript had some "accidental" influence.
Yeah, Ruby's grammar is very much a result of piling features into the canonical implementation ("MRI" - Matz Ruby Interpreter - it's what you'll find as "ruby" most places) to make it nice for developers, without formalizing the grammar in any way whatsoever.
There's been some work in formalizing it later - there's an ISO standard for a subset of Ruby - and people have tried to document it, but the language in general is so convoluted that e.g. a number of years back I wrote an article on the Ruby object model that covered eigenclasses by instrumenting the then-current version of MRI.
Then I got "corrected" - because apparently the documented inheritance hierarchy was different. Except the interpreter hadn't followed the docs for several years, and nobody had noticed...
The "canonical" documentation of the Ruby parser is basically a 6000+ line Bison file where most of the linecount is taken up with handling exceptions... It's awful.
You mentioned here-docs I think - in Ruby you can do this:
foo(<<HERE1, <<HERE2)
Part of HERE1
HERE1
Part of HERE2
HERE2
I find here-docs that actually start right then and there ugly enough, but that's taking it a bit too far for me...
Another favourite abomination: "% 42 " parses as the string "42", while "x % 42 " parses as "x modulo 42". Basically, if % occurs as a prefix operator, it starts a quoted string where the following character will be the quote character, with some exceptions (e.g. left braces, brackets, and parentheses indicate it'll end with the matching opposite) - and even space and linefeed work...
One of those things that probably seemed useful at the time, and was added without any restrictions, and it's just become an accepted part of the language (if I ever see anyone using it, I might become homicidal).
In terms of naming of parsers, my operator precedence parser is specifically bottom-up, so a variation on Dijkstra's shunting yard algorithm, just to add another variant. I might have to look at the top-down variants more too.
Yup, the multiple here docs on a line thing is also in the POSIX spec for sh. I had no idea that was possible before implementing the shell!
And I'm pretty sure it does NOT appear in the million lines of shell code that I've parsed. They use here docs in weird ways, but they don't have multiple here docs on a line.
This led me to find the algorithm for parsing here docs, which surprisingly all shells I've tested support, even though almost no programs use it.
The algorithm is: process the here docs in a post-order traversal of the AST.
There is an example in my blog with 3 here docs, with one of them being in the while CONDITION (which is itself a command):
while cat <<EOF1; read line; do echo " -> read '$line'"; cat <<EOF2; done <<EOF3
...
In shell at least, you can put the ENTIRE program on one line. Change all the newlines to semicolons. You could have 1000 here docs in a program. So you could put 1000 here doc beginnings on one line, and then you process the ends in that specific order.
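A simplified Python sketch of the idea, assuming one physical line containing here-doc operators followed by their bodies (real shells do this as a post-order traversal of the partially built AST, and also handle quoting, tab stripping, and so on, which this ignores):

```python
# Simplified sketch of the here-doc algorithm described above: while
# parsing one physical line, collect pending here-doc delimiters in the
# order their '<<' operators are encountered; when the newline is
# reached, consume one body per pending delimiter, in that same order.

def parse_line_with_heredocs(lines):
    """lines: list of physical lines; returns (words, heredoc_bodies)."""
    first, rest = lines[0], lines[1:]
    pending = []                      # delimiters in order of appearance
    words = []
    for tok in first.split():
        if tok.startswith("<<"):
            pending.append(tok[2:])
        else:
            words.append(tok)
    bodies = {}
    i = 0
    for delim in pending:             # fill bodies in appearance order
        body = []
        while i < len(rest) and rest[i] != delim:
            body.append(rest[i])
            i += 1
        i += 1                        # skip the delimiter line itself
        bodies[delim] = "\n".join(body)
    return words, bodies

words, bodies = parse_line_with_heredocs([
    "cat <<EOF1 && cat <<EOF2",
    "first body",
    "EOF1",
    "second body",
    "EOF2",
])
print(bodies)  # {'EOF1': 'first body', 'EOF2': 'second body'}
```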
Ruby must have borrowed this from shell... I guess Matz knows shell REALLY WELL.
I did not know that about sh. And I agree with the sibling comment that it's probably been inherited via Perl (which I also didn't realise had it, but then that fits totally into every stereotype about Perl...) - Ruby has a whole lot of Perl-ism in it, most of which we try not to talk about and pretend aren't there ;)
In terms of influences, Ruby is basically what happens if Perl and Smalltalk had a prettier baby.
Ruby borrowed them from Perl, which borrowed them from shell. I used multiple heredocs once or twice in both Perl and shell myself, but they break syntax highlighting, so it's hard to keep using them.
At what point does one abandon the lexer model? If you have 13 different lexer modes, why not just encode those as part of the recursive descent parser and act directly on characters?
I'm not a parser/PL person, but my view of lexing is that it only really buys you something if your lexer is sufficiently separate from your parser; once you have to tokenize things differently depending on where you are in the parse tree, it is just an extra layer of abstraction that fails to clarify the code.
You certainly can write scanner/lexer-less parsers. Most of Niklaus Wirth's languages for example are well suited for that as they tend to require only one character lookahead and lack ambiguity to the point where you can write a totally straight-forward recursive descent parser that calls straight into routines for parsing specific types of tokens.
In practice that still means you have pretty much the same code boundaries, just that you don't have a general "next token" function to discriminate between the different token types.
My Ruby parser does have a generic tokenizer, but uses a mostly lexer-free model for the high level structure and then calls the generic tokenizer for the lower level bits that I handle with an operator precedence parser.
I agree in principle, though - most of my parsers have never maintained a very strict separation, and I've written parser generators too that didn't have a lexer/parser distinction at all (you could draw an artificial line and say that leaves in the BNF represented lexer tokens, but there was absolutely no code to treat them differently).
That claim about the Wirth languages surprises me, because AFAIK all of his compilers used a lexer (called like get_next_token()) and you'd need to look at multiple characters to e.g. tell an identifier apart from a keyword. Maybe you're thinking of how his grammars only need one token of lookahead -- they're LL(1)?
I guess I'm quibbling. I do agree that they seem easy to parse with a scannerless parser too.
I think you're right that Wirth's own compilers usually used a lexer.
And you're right, you do need one token lookahead, not just one character, in a few places if you insist on entirely unambiguous symbols immediately.
The point is that it's not necessary: in the few places where there is ambiguity when looking at only a single character, it's sufficiently unambiguous that it doesn't matter until after you've parsed enough to be able to discriminate.
All of the above except "assignment" and "ProcedureCall" start with a keyword. ProcedureCall and assignment both start with productions that ultimately start with an "ident".
So you can parse it like this (Ruby-ish pseudocode):

    ident = parse_ident
    raise "expected identifier" if !ident

    case ident
    when :if
      return parse_if
    when :case
      return parse_case
    # ... and so on ...
    else
      return parse_assignment_or_procedurecall(ident)
    end
So you know that you are dealing with a valid program as long as you find an "ident".
It is still often cleaner to simply use a lexer vs. doing something like the above example of passing "ident" to a subroutine, or adding backtracking ("unget") support to the parser, so I'm not suggesting it's always a good choice to forgo a separate lexer, but the languages are designed so that you can.
In particular, if you don't want to add backtracking, even if it'll rarely be many characters, you'll need to "unroll" the grammar more, like with the "parse_assignment_or_procedurecall" example, and that can get messy, but it works fine.
So "parse_assignment_or_procedurecall(ident)" would for the most part be relatively simple, and the one complication is unaffected by whether you use a lexer or not, and actually probably the most tricky production I've seen in a Wirth grammar to the point where merging the parsing of the two is probably simpler - ActualParameters is a strict superset of "(" qualident ")" I believe, so as long as you keep seeing "(", you expect either just a qualident or a list of expressions; once you see something that isn't just a qualident followed by ")", you're at the end of a ProcedureCall or have an error.
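A rough Python sketch of that "pass the already-consumed ident down" pattern (the names and the token format are invented for illustration; this is not Wirth's actual compiler structure):

```python
# Scannerless-style dispatch: every statement starts with an ident, and
# once we've consumed it, the non-keyword cases get handed the ident so
# no backtracking/unget support is needed.

KEYWORDS = {"if", "while", "case"}

def parse_statement(toks):
    ident = toks.pop(0)               # every statement starts with an ident
    if ident in KEYWORDS:
        return parse_keyword_statement(ident, toks)
    # Not a keyword: must be an assignment or a procedure call, and we
    # can't tell which until we see the next token -- so hand the ident
    # we already consumed down to the combined routine.
    return parse_assignment_or_call(ident, toks)

def parse_keyword_statement(kw, toks):
    return (kw, toks)                 # placeholder for if/while/case parsing

def parse_assignment_or_call(ident, toks):
    if toks and toks[0] == ":=":
        toks.pop(0)
        return ("assign", ident, toks.pop(0))
    if toks and toks[0] == "(":
        toks.pop(0)
        args = []
        while toks[0] != ")":
            args.append(toks.pop(0))
        toks.pop(0)
        return ("call", ident, args)
    return ("call", ident, [])        # parameterless procedure call

print(parse_statement(["x", ":=", "1"]))          # ('assign', 'x', '1')
print(parse_statement(["Write", "(", "x", ")"]))  # ('call', 'Write', ['x'])
```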
Well, the practical answer is that all languages I know of have separate lexical and grammatical structure. It's just easier to write if you separate them. Though it's baroque, even bash has a clear lexical structure.
In my code, the lexer is the foundation and doing a lot of useful work. It's a map from the 13 modes to a list of regexes, which is a lot of logic that is kept out of the parser.
The parser just calls lexer.Read(lex_mode), which hides the complexity of all these regexes.
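A minimal sketch of that shape in Python (the modes, regexes, and token names here are made up for illustration, not the actual ones from the shell parser):

```python
import re

# Minimal sketch of the "map from lexer mode to regex list" design:
# the parser picks the mode, and the lexer tries each regex for that
# mode at the current position. Modes and token names are invented.

LEXER_MODES = {
    "OUTER": [
        (r"[a-zA-Z_][a-zA-Z0-9_]*", "WORD"),
        (r"\$\{", "VAR_BEGIN"),
        (r"\s+", "SPACE"),
    ],
    "DQUOTE": [
        (r'[^"$\\]+', "LITERAL"),
        (r"\$\{", "VAR_BEGIN"),
        (r'"', "DQUOTE_END"),
    ],
}

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read(self, mode):
        """Try each regex for the given mode at the current position."""
        for pattern, tok_type in LEXER_MODES[mode]:
            m = re.compile(pattern).match(self.text, self.pos)
            if m:
                self.pos = m.end()
                return tok_type, m.group()
        raise SyntaxError("no token matched in mode %s at %d" % (mode, self.pos))

lexer = Lexer('echo "hi"')
print(lexer.read("OUTER"))  # ('WORD', 'echo')
print(lexer.read("OUTER"))  # ('SPACE', ' ')
```

The point of the table is that all the mode-specific regex logic lives in one place, and the parser only ever chooses which mode to read in.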
(The lexer might look weird, but I wrote it in a style to port to C, and in particular using re2c. This is an awesome tool which I've mentioned in my blog a couple times, but not written about yet. It combines regular expressions and turns them into a bunch of gotos in C.)
For a more theoretical answer: I've had this discussion before, and I would claim that lexing and parsing should be separate because they differ in both computational power and running time.
Lexing is fast but not that powerful -- you are structuring the input stream with (at most) regular expressions. It can be done in linear time, without backtracking. On the other hand, parsing is slow but more powerful, e.g. CFGs or PEGs can recognize more than regular expressions, but they require backtracking/memoization/lookahead/etc.
In other words, why not do the easy work with the fast algorithm, and then do the harder work with the more complicated, expensive algorithm -- on a shorter input?
I came to the conclusion several years ago that PEGs only combined lexing and parsing for reasons of academic presentation and proof. I don't think it actually helps you in practice.
Hope that answered the question! I think my shell lexer is somewhat interesting so I plan to blog about it. Honestly re2c was almost the thing that inspired me to write the shell -- it would have been intractable otherwise.
Thanks for that. Indeed it makes sense. I would enjoy reading blog posts about your lexer.
What threw me off is that the BNF presented in the POSIX standard treats words as individual tokens, and that's obviously not something you want to do only in a lexer.
Yes, so bash is unique in that there is a two-level structure. Tokens in turn are parsed into words. And then words are used as "tokens" for the "command sublanguage", the grammar for which is given in POSIX.
I mention that in a few places -- there are 4 sublanguages, arith/bool/word/command, but POSIX only covers the command sublanguage.
This is one of the reasons parser generators as "frameworks" don't work for shell.
I have a diagram in mind for the parser architecture which I think will be illuminating. The command language uses the words as "tokens", but they're also mutually recursive, because an entire subprogram can fit inside a word/string, like a command substitution $(...) inside a double-quoted string.
That's really cool. I wrote a parser (and implementation for much of the semantics) for just sh; bash is much much bigger.
Just figuring out how to do things like alias expansion (which actually requires relexing as the grammar is defined in the POSIX spec) and here documents in a formal grammar is such a pain that one quickly reaches for recursive descent parsing. I used what was ostensibly a PEG tool (esrap), but ended up using an undocumented escape hatch that let you call arbitrary functions for parsing subexpressions.
Cool, I'd be curious to see it! I actually wanted to bootstrap a shell in Lisp, inspired by the es shell [1] and so forth, but I decided against it after some experimentation.
It was never intended to be actually used. I mainly wrote this to learn more about sh. I realized that the Bourne family of shells was not in my top 10 languages in terms of knowledge, but was typically in my top 3 languages in terms of use, and thought implementing just the minimal sh stuff would be a good way to learn.
Very interesting blog.
Trying not to go too much off-topic here, I read in one of your posts that "Awk and shell are similar in an interesting way: they lack garbage collection. This leads to arrays that can neither be nested nor returned from functions."
Would you be willing to explain why that is the case? I'm a CS enthusiast without any formal CS education.
Another way to say it is that bash and awk have no references, and hence no aliasing (and no possibility of cycles). Consider:
    d = [1,2,3]
    e = {key1: d}
Now you have two names for the array [1,2,3] -- d and e['key1']. They are aliases.
Now consider something like this:
    f() {
      d = [1,2,3]
      g(d)  # the interpreter has no idea what this does, without runtime inspection
      e = {key1: d}
      return e
    }
How do you figure out how to free memory? There's no way to do it without garbage collection, which is an algorithm that runs alongside your program at runtime. For contrast, Rust eliminates the runtime bookkeeping with compile-time annotations like borrowing.
So basically if you have references and aliasing, and you can move them across function boundaries, then you must have garbage collection if you want to free memory automatically. (Function boundaries are important, because they use the idea of a stack. A stack is a (limited) way to automatically manage memory.)
But bash and awk both get around this by not having references. This eliminates the possibility of nested compound data structures, or returning values by reference. But in return you don't have to implement a garbage collector. Compound data structures are flat and confined within a stack frame.
(You could still copy the whole array to another function's stack frame if you wanted, but they don't do that.)
So you can't express that program in either bash or awk. They have "flat" hash tables and arrays, where you can only COPY primitive values like ints and strings into them. You can't reference other arrays like {key1: d}.
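To see the aliasing in question, here's the same situation in Python, which does have references (and therefore needs a garbage collector):

```python
# Aliasing via references, demonstrated in Python:

d = [1, 2, 3]
e = {"key1": d}          # e["key1"] is another name for the SAME array

d.append(4)
print(e["key1"])         # [1, 2, 3, 4] -- the change shows through the alias

# A bash/awk-style language instead copies the value, so there is no
# second reference to track and stack discipline suffices:
e_copy = {"key1": list(d)}   # explicit copy, mimicking bash/awk semantics
d.append(5)
print(e_copy["key1"])        # [1, 2, 3, 4] -- unaffected by later changes
```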
Hopefully that helps. I hacked on awk a bit here [1] and that is when I noticed the limitations in the language. I thought it was sort of like JavaScript -- regexes and hash tables. But it's actually much more limited, due to its lack of garbage collection.
Not the author, but I'll have a shot: I think the problem is with responsibility for memory allocation and freeing.
To dispose of a nested array's memory, you would need to follow all of its pointers to an arbitrary depth, or, alternatively, keep track of every reference in the program (and garbage collect "dead" arrays).
Arrays can't be returned from a function if they're allocated on the heap; or, to put it more abstractly, if you returned an array you would lose the ability to straightforwardly limit its lifetime by tying it to the function's scope.
(Conventionally, functions free all their "local" storage on returning.)
So, in the absence of both manual memory management (e.g. malloc and free) and garbage collection, you can't allow an array to be returned from a function, or cope with nested arrays without special support from the language.
Hi Andrea -- https://github.com/darius/parson is in Python. It's probably not very similar because I haven't got around to studying Ohm, but it shares the basic goal of not writing semantic actions inline in some particular language. (I wrote it in my first Recurse Center stint back in 2012.)
Joe Armstrong Interviews Alan Kay [video] (https://www.youtube.com/watch?v=fhOHn9TClXY)
Discussion: https://news.ycombinator.com/item?id=13033299