Lua's dispatch wasn't chosen because it was the fastest. If you read their Lua5 ...

haberman · on Nov 30, 2012

If you're giving up portability, you might as well go all the way and create a JIT. Techniques like what the paper calls "inline-threaded dispatch" seem of limited usefulness, since they give neither the speed of a JIT nor the portability of a bytecode VM.

My main point was that switch()-based dispatch isn't that bad. I'm surprised that this recent literature distinguishes between switch dispatch and token-threaded dispatch, since as Mike Pall notes, "Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point," so the goto-spaghetti of token-threaded dispatch is likely not even worth it. (http://article.gmane.org/gmane.comp.lang.lua.general/75426)

> Indirect threading is the fastest simple technique

I'm confused; the paper itself says: "However, because of a level of indirection, indirect-threaded dispatch is not be as efficient as direct-threaded dispatch."

Even direct-threaded dispatch still results in an indirect branch for every VM instruction, it just tries to save a table lookup over token-threading.

Ultimately the most common and practical dispatch techniques seem to boil down to either a single indirect branch or replicated indirect branches.

sb · on Nov 30, 2012

You're not really "giving up" portability, only with ANSI C. For all architectures where GCC is available, its labels-as-values support will happily compile and hence give you a fully portable instruction dispatch.

Regarding the threaded code: the literature seems to be horribly inconsistent in this regard. Token threaded code might actually refer to indirect-threaded code depending on what the author has read. I was actually quite surprised to read what "indirect threaded code" originally meant when I read Debaere and van Campenhout's 1990 book "Interpretation and Instruction Path co-processing".

Regarding the effect of CSE & tail-merging: I have only seen GCC to do this once, by disabling basic block reordering (-fno-reorder-blocks, due to a so-called "software trace cache"), otherwise this has never happened to me. Since I have nothing but the highest respect for Mike Pall, I guess that he might have experienced this for Lua, which has--to the best of my knowledge--less than 40 instructions. I will try to verify this on the weekend.

haberman · on Nov 30, 2012

> You're not really "giving up" portability

I should have been more clear; that comment was in reference to some of the techniques the paper discusses which do give up portability, namely "inline-threaded dispatch" and "context-threaded dispatch."

My overall points are: switch() isn't that bad, and most practical dispatch techniques ultimately boil down to either a single indirect branch or replicated indirect branches.

> Regarding the effect of CSE & tail-merging: I have only seen GCC to do this once

I've never seen it happen with my own eyes, but it appears the Python guys experienced this also at one point and had to tweak the flags -fno-crossjumping and -fno-gcse. http://bugs.python.org/issue4753

> I will try to verify this on the weekend.

Do you have a blog? I'd love to hear what you discover.

pbiggar · on Nov 30, 2012

> Do you have a blog? I'd love to hear what you discover.

I second this. I'd be very interested to know who you (sb) are in real life.

pbiggar · on Nov 30, 2012

I'm not sure where you're going with this, but you seem to have an emotional attachment to switch statements, which is otherwise unsupported by reality :)

> If you're giving up portability, you might as well go all the way and create a JIT.

Well no. A JIT is a massive massive amount of work, an interpreter is not. Don't throw the baby out with the bathwater. Besides, you can make a damn portable token-threaded dispatch for supported platforms (which means anything using GCC, LLVM or Visual Studio, which means basically anything that exists today), falling back to switch statements for unsupported platforms.

> Techniques like what the paper calls "inline-threaded dispatch" seem of limited usefulness, since they give neither the speed of a JIT nor the portability of a bytecode VM.

I forget the numbers, but lets say its 20%. It gives a 20% speed boost, for a few dozen lines of code. Hardly "limited usefulnes".

> My main point was that switch()-based dispatch isn't that bad.

Its portable and easy to implement. It is not the fastest dispatch mechanism. It will generally be the bottleneck in an otherwise-efficient interpreter. It does not compare well to most other dispatch techniques, and is only useful if you demand that absolutely only ANSI C will do.

> I'm surprised that this recent literature distinguishes between switch dispatch and token-threaded dispatch, since as Mike Pall notes, "Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point," so the goto-spaghetti of token-threaded dispatch is likely not even worth it. (http://article.gmane.org/gmane.comp.lang.lua.general/75426)

Firstly, the work on interpreters was done in the 80s and 90s. Really awesome optimizing compilers post-date that considerably.

Secondly, Pall does not say those those dispatching techniques are equivalent. What he says is that interpreters in C are difficult, and you have to fight the optimizer. If you wrote them in assembly, as Pall did, you would not find that the techniques were equivalent.

> Ultimately the most common and practical dispatch techniques seem to boil down to either a single indirect branch or replicated indirect branches.

Its certainly common to use a single indirect branch. However, it is suboptimal. People like speed.

Finally, note that my original point is the switch-based dispatch is slow, but that better techniques in PHP are not worth it, because the rest of the interpreter is so much slower.

haberman · on Nov 30, 2012

Wow, you read a way more argumentative tone into my comment than I intended. It's all good, I'm trying to expand my understanding. :) Though I'm not sure I agree with your characterization of my points as "nonsense."

> [Inline-threaded dispatch] gives a 20% speed boost, for a few dozen lines of code. Hardly "limited usefulness".

Unless I'm missing something, I think you would be very hard-pressed to generate basic blocks of native code in only a few dozen lines of platform-dependent code. I'm not seeing how you could do it without implementing your own code generator (which would take far more than a dozen lines). Is there some clever way of leveraging the C compiler's output that I'm missing?

Also, once you are generating basic blocks of native code, I would consider it a JIT at that point, even if it is not as sophisticated as a JIT that's doing register allocation and optimization. Writing this sort of JIT that uses a static register allocation and a fixed instruction sequence for each bytecode op doesn't seem that much more work than writing an interpreter in assembly language (especially if you use Mike Pall's excellent framework DynASM: http://luajit.org/dynasm.html).

> Its certainly common to use a single indirect branch. However, it is suboptimal.

I don't believe that a single indirect branch is a priori slower; while replicated dispatch makes better use of the branch predictor, it also results in less optimal i-cache usage (the LuaJIT code notes this trade-off, but observes that replicated dispatch is 20-30% faster in practice: http://repo.or.cz/w/luajit-2.0.git/blob/2ad9834df6fe47c7150d...).

I'm not saying switch is better, what I'm saying is that you can't reliably replicate dispatch in C (at least according to Mike's message that I cited earlier), which makes switch() a totally reasonable choice for interpreters written in C.

By the way, I'm not just making stuff up, I have written a JIT of my own: https://github.com/haberman/upb/blob/master/upb/pb/decoder_x...

pbiggar · on Nov 30, 2012

> Wow, you read a way more argumentative tone into my comment than I intended. It's all good, I'm trying to expand my understanding. :) Though I'm not sure I agree with your characterization of my points as "nonsense."

Indeed, that was much too harsh and I retracted it. I thought I might have retracted it before anyone noticed, but I guess not. My bad, sorry about that.

> Unless I'm missing something, I think you would be very hard-pressed to generate basic blocks of native code in only a few dozen lines of platform-dependent code.

Whoops, I thought we were still talking about token-threading. I believe people implement token threading by fiddling with the compiler output. John Aycock had a paper about doing this for Perl or Python as I recall (also, since you're into JITs, you might enjoy his "a brief history of Just-in-time").

> Also, once you are generating basic blocks of native code, I would consider it a JIT at that point,

Yes, people often consider this to be a JIT. However, its much much easier to write than a "real" JIT. The complexity of writing a method-jit or trace-jit is very very high. This could probably be done in an afternoon by a great hacker.

Don't underestimate the work of writing the assembly language interpreter like Mike Pall did. Most incredibly hardcore compiler guys with whom I have discussed it were simply blown away by it. It might be one of the most hardcore feats in VM implementation history. He manually register allocated across dynamic flow control boundaries for gods sake!!

> reasonable choice for interpreters written in C

But not a fast one.

haberman · on Nov 30, 2012

> also, since you're into JITs, you might enjoy his "a brief history of Just-in-time"

Indeed, I've bookmarked it, thanks! I couldn't find the other paper you mentioned about fiddling with compiler output; would be very interested in seeing that.

LuaJIT2 was the first interpreter and JIT that I ever studied deeply, so while I am very impressed by it, I don't have a great understanding of how it differs from previous work. How is its interpreter substantially different or more impressive than previous interpreters written in assembly language? Don't all assembly language interpreters need a fixed register allocation?

I've been a big fan of Mike's work for a long time and helped him get some Google funding: http://google-opensource.blogspot.com/2010/01/love-for-luaji...

By the way, when I was searching for the John Aycock paper I came across your dissertation which I've also bookmarked. I see that you went to Trinity College Dublin -- I visited Dublin a few years ago and what a beautiful university that is! Looks like there's a lot of interesting JIT-related research coming from there lately.

pbiggar · on Nov 30, 2012

> How is its interpreter substantially different or more impressive than previous interpreters written in assembly language?

Traditionally, there were two implementation strategies: compilers and interpreters. JITs didnt become mainstream until Sun bought Hotspot and put it into the 2nd version of Java.

So you had to choose between compilers and interpreters. Compilers led to optimization, fast object code, but were complex to implement (especially multiplatform). Interpreters were simple to implement, very portable, but were very very slow.

Obviously there is more to both, but until recently, basically this is how people thought about language implementation. Considerations like a REPL, slow compilation speed, ease of prototyping, etc, were a complete side show (perhaps people cared, but you'd rarely see it discussed).

When all of the dynamic languages were implemented, they all used interpreters, written in C. They all used a simple dispatch loop to opcode implementation, and let the C compiler do the heavy lifting. All the research into fast compilers (the David Gregg/Anton Ertl stuff for example), looked at instruction set (registers/stack) and dispatch type. So when making interpreters fast, there were only 4 strategies:

- make a compiler,

- use better dispatch,

- rewrite using a register interpreter set,

- make a JIT.

Making a JIT is lunacy of course, because JITs are ridiculously hard, they're not portable. So that Pall was making a 1-man JIT (LuaJIT1) was incredible.

But that he made an interpreter that was as fast as a JIT was even more insane. In Trinity, all of us language/JIT/scripting language people were in one room, and when we heard about this we were just amazed. Nobody had even thought about the stuff - it was all brand new and novel in a field that barely had anything novel in decades! Until that point, basically all interpreters were one big while loop.

> How is its interpreter substantially different or more impressive than previous interpreters written in assembly language?

I wouldnt know, since I've not heard of any mainstream interpreters in assembly. I can only imagine that they were exactly the same as C interpreters: essentially a while loop with a switch statement, just written in assembly.

I find it amusing that you started at LuaJIT2. I would liken it to studying modern war, then wondering "why didnt they just use drone strikes at the Somme" :) Looking back from LuaJIT2, interpreters must seem really really primitive.

monopede · on Nov 30, 2012

I don't think assemblers written in assembly were that bad. LuaJIT 2 uses direct threading (not new at all), register-based bytecode (relatively new), and manually optimised register assignment (perhaps new). AFAICT, the key innovations are that he did not use Lua 5.1's register-based bytecode format, but simplified it even further so it can be decoded efficiently on x86. The second key component is that he pre-decodes the following instruction in order to take advantage of out-of-order execution. This technique also required fixing some register's roles.

Don't get me wrong, I think LuaJIT2's interpreter is great, but interpreters before LuaJIT2 weren't complete crap, either. Many emulators, for example, have very good interpreters written in assembly (some aim to be cycle-accurate).

pbiggar · on Nov 30, 2012

I was trying to describe how it looked from an academic standpoint. Direct threading and register bytecode was well known (register stuff is actually very old, but the jury was out until about 2003), but everything else Pall did was basically new to programming language researchers and implementers.

pbiggar · on Nov 30, 2012

Aycock paper is available here: http://pages.cpsc.ucalgary.ca/~aycock/

title: "Converting Python Virtual Machine Code to C."