shok / implementation

The shok reference compiler has been designed with a UNIX philosophy in mind: orchestrate a complicated system by text-stream inter-process communication (IPC) between discrete single-minded processes. The shok lexer, parser, and "evaluator" (type checking, code execution) are independent programs that solve independent tasks, and they could be split up even more. The shell is a simple program that handles user input by streaming it through these other components and printing the result.

This structure has a few nice properties:

  • The design "feels" clean and modular
  • It's how compiler design is often described, if never implemented
  • The parts are independently useful
  • The parts are independently testable / verifiable
  • The parts can employ entirely separate implementations, perhaps written in different programming languages

The ease-of-use is important, and has been very helpful during the language's development. This is the lexer:

shok:~ : ./shok_lexer
(input)   {x += 2*3}
(output)  1 1:LBRACE 2:ID:'x' 3:WS 4:PLUSEQUALS 6:WS 7:INT:'2' 8:STAR 9:INT:'3' 10:RBRACE

We entered a line of text, and got back a line of tokens. We can pipe the tokens to the parser:

shok:~ : ./shok_lexer | ./shok_parser
(input)   {x += 2*3}
(output)  [{(assign (var ID:'x') PLUSEQUALS (exp INT:'2' STAR INT:'3'))}]

And we get back a partial-AST of the code snippet. (That would have been nice when I was studying for my compilers exam.)

Breaking the compiler into parts is useful for debugging, and for testing the language. Lots of tools could use the individual components directly: syntax highlighters, lint programs, static analysis tools and "bugfinders", IDEs, code diff/search/source-control tools, etc. Even if the nature of unidirectional stream processing limits some use-cases (e.g. an IDE may prefer an in-memory in-place AST update), there's also a place for the simplicity and convenience of UNIX-style text streams.

The architecture has some significant problems:

  • Code complexity and duplication of effort — each component needs to serialize and de-serialize its thoughts into text
  • The computational overhead due to the above, plus the cost of IPC and running separate processes
  • Platform-dependence regarding the IPC mechanism, perhaps with security implications
  • Components may need to be synchronized to recover from errors
  • Lack of support for streaming compiler development (tooling, insight, documentation)
  • Cannot support random-access / in-place code updates; less flexible than, say, Papa Carlo, which is probably the right way to do things

The reference compiler prioritizes language flexibility and correctness over performance, so some of these are not important now. When the language matures we can write a "normal" compiler. But let's consider some of these setbacks in more depth.

Compiler complexity and inefficiency:

Is this necessarily the case? I wonder if the design could be made relatively efficient. Maybe we could wrap I/O over an IPC library (something like Cap'n Proto or ØMQ) that could route streaming text I/O over shared memory instead of kernel-buffered pipes, if/when support for that is declared by any two programs. I'm not sure this could provide the speed returns I'm looking for, but even a library (or language! *ahem*) that makes it easier to build well-behaved UNIX-like streaming-I/O programs would be nice to experiment with.

The current implementation has a lot of unnecessary duplication of effort between components; primarily the code for serializing/de-serializing tree structures into ASCII. This will be important to fix especially if we decide to split the "evaluator" into more pieces (e.g. operator precedence reordering, type checking, bytecode generation, bytecode execution). Some of these pieces would be doing relatively light updates to the AST, so a lot of work could be shared, in code if not also in memory at runtime. And this code could be drastically improved.

I expect the real performance bottleneck will always be the text-serialization of data structures out of their in-memory representations, and I'm not sure that can be solved without moving away from some of the design goals.

Lack of support for streaming compiler development:

This was a big, surprising hurdle. I'm not yet a compiler guru, and getting streaming partial-ASTs out of the parser has deep implications relating to:

  • The language's syntax and grammar
  • The type of parser that can be used (LL works, LR not so much)
  • The parser's implementation, especially regarding I/O
  • The delegation of responsibilities between the parser and other components
  • The syntax of what gets communicated between components

I had to find these all out for myself, because I couldn't find any information about how to write a streaming parser.

An underlying trade-off is that giving flexibility to the parser fights against a desire for it to emit partial-ASTs as early as possible along the input token stream. It's not a very good stream processor if it buffers large swaths of tokens before it can emit anything. And the downstream components would be blocked waiting for more input before they can perform other types of analysis (type checks, scoped-variable lookups). This means "bottom-up" parsing strategies are inappropriate, and even top-down strategies need to make up their minds relatively quickly without backtracking.

I/O was also a real challenge. Most parsers / parser-generator frameworks are built with an implicit "parser-first" mentality called "pull parsing", where deep within some nested logic they might demand a new token (blocking if necessary) to see if it gives them what they want at that moment. An alternative is to make an input-driven "push parser": I feed the parser a token, I want it to advance its state as much as it can and output whatever it has decided thus far. There might even be certain tokens (e.g. newlines) at which time it is essential for the parser to have made some clear decisions.

For the sake of interpreter usability, the parser should be able to detect errors early along the input. If an error is knowable after any newline, the user should not have to finish the statement or function in order for the interpreter to notice the problem.

All these ideas suggest a re-thinking of the parser's design and mannerisms. We need an input-driven parser that can emit partial-ASTs and detect errors on demand. That is, we need it to parse ASAP: As Soon As Parseable!

An upcoming article on ASAP Parsing will describe how the shok parser is being built to meet these requirements.

Last updated: 2013-12-28