How to speed up the Rust compiler (blog.mozilla.org)
185 points by nnethercote on Oct 14, 2016 | hide | past | favorite | 123 comments


I've never used rust but seeing it's so oriented towards correctness and type checking, I'm not surprised it's a little slow to compile!

From someone stuck with dynamic languages, mark my words, it's best for your compiler to be a bit slow, than your brain being the compiler! :-)


Why? You can have both fast compilers and safe languages.

Take Delphi, for example: statically typed, with strong type correctness; you have to go out of your way to circumvent type safety. And it compiles a million lines of code in 4 seconds (including linking).

* https://www.youtube.com/watch?v=Yq2mNUzcpE0 (1 million lines, 4 seconds)

* https://www.youtube.com/watch?v=kvuxeaMDmMc (277,000 lines of much more complex code (the most complex, compiler-stressing library Delphi has) in 16 seconds, counting linking)


Delphi didn't have anywhere near the level of optimization that modern backends like LLVM have.


Still, we shouldn't need a fully optimized build in the development loop.


I think the correctness guarantees that Rust provides are a little stronger than what Delphi provides.


Perhaps more significantly, the checking that Delphi provides is both explicit and local. Rust does a lot more inferential checks and at the crate level rather than at the line level.


Rust's inference and checks are deliberately limited to the function level, not the crate level: function signatures are not inferred, generics are type-checked before instantiation, the borrow checker works on individual functions, etc.


I don't know about the latest versions, but I remember looking at Delphi compiler output and it seemed like it did not optimise at all. That would certainly explain its speed.


Debug builds shouldn't need to be fully optimized release builds every time something gets compiled in the edit-compile-debug loop.


Depends on your domain. For video games for example, unoptimized builds are useless if they can't run at acceptable framerates.


I am aware of that, but it should be up to the developers to choose.

And I am not making this up, it is how Eiffel, Lisp, JVM, .NET and Swift (from what I could gather in WWDC presentations) toolchains work.

Also how long-gone C++ toolchains like Lucid C++ and IBM Visual Age for C++ (only v4.0) used to work.

Even Microsoft is going this route for C++ in the upcoming VS "15", with the help of a new linker and the experimental modules support.

https://blogs.msdn.microsoft.com/vcblog/2016/10/05/faster-c-...


Delphi is not much safer than C; it's not surprising it compiles quickly.

It's also not competitive speed-wise with either clang/gcc OR rust.


> Delphi is not much safer than C; it's not surprising it compiles quickly.

That is a funny one....


At work, I am stuck doing maintenance on an in-house application our accounting department uses, which is written in Delphi. It's small (~10k lines of code), but it compiles so fast I sometimes wonder if I clicked the wrong button / hit the wrong key.


I'm not familiar enough with the different feature sets and the different kinds of work each compiler does to compare their speeds, but WOW, that's an insanely quick build.

Was compilation speed always an explicit goal of theirs? Is there any documentation of the specific things they did to achieve that build speed?


I used to use Delphi a lot in the 1990s. It's a few things.

A compiler designed for compilation speed above all else. Delphi never attempted to compete with other languages/toolchains on the raw runtime performance of the generated code... I don't recall ever seeing Borland/Inprise boast that a new version generated more optimal code. This is because the code the compiler produced was fast enough for Delphi's target market (desktop business apps): it was still a lot faster than Visual Basic, and a lot more compact than Java, which were its primary competitors. Delphi sold new versions to developers by giving them new productivity tools; for desktop apps that spend all their time idling, raw code performance was rarely an issue.

A single-pass compiler. Structuring Delphi apps could be a nightmare: you could not have circular unit dependencies, and breaking such dependencies could often be irritating as hell. Functions had to be declared in the right order.

No preprocessor or include based model. "Units" defined their header and implementation in the same file, so the compiler could rapidly load a binary representation of what a unit exported whilst it was compiling.

No generics. Thus, no explosion of code to generate at compile time, as with C++ templates.

Grammar that was very easy to parse.

A compiler codebase that at its core was designed for very old computers, so it is/was very tightly written.

Probably more. That's just what I recall.


Sweet. Thanks for the info. That provides a lot more context.

I never wrote Delphi myself, but I always appreciated it, because it was used to develop the first IDE I used for my first programming class: Bloodshed's Dev-C++. Quite an impressive IDE, so I imagine Delphi was (still is? No idea) a great platform for desktop app development.


It is, but since Borland lost their vision of how to sell software, and the tools division changed hands a few times, they lost most of their customers.

Nowadays most of them are enterprise customers with big maintenance contracts.

Also, for shops focusing on Windows-only customers, Delphi cannot compete with .NET.


It's a goal of Pascal (famous from the start for having a one-pass compiler). Delphi gets it from that.


It is also a goal of Go. Glacial C++ builds at Google were a prime motivator in the creation and design of Go.


That compilation speed tends to be common in the Wirth family of languages, since the Algol days.

Hence why many of us Wirth fans enjoy Go's compilation speed, but aren't that impressed by its novelty.

Still, it is good that they pursue this goal, so that newer generations don't assume C and C++ compilation speeds are normal and to be expected.

Also, C++ compilation can improve by moving away from its #include model. When VC++ "15" gets released, we will get an improved linker with DB support and experimental module support.


If you forego optimizations, you can compile the Linux kernel in seconds (see tcc).


Start using lots of generics and see how fast Delphi compiles.


Never underestimate the edit-compile-test loop. It has a huge impact on productivity.


Well with stronger type guarantees, a lot of the "test" part of that loop gets pushed into "compile".

Indeed the core loop becomes edit-typecheck, and while the compiler can be slow, the type checker tends to be a lot faster than the test suite.


The biggest win for that is going to be incremental compilation. I can't wait for that to stabilize.


Can we have the best of both worlds?


AKA, REPL: Read-Eval-Print Loop.


No, a REPL is an interactive program, whereas the ECT loop belongs to the workflow of programming.


So how would you describe the experience of languages that offer both a REPL and also have integrated toolchains for AOT compilation to native code?


OCaml offers a great REPL-experience in the form of `coretop`. It's got auto-complete that makes exploration awesome/easy. It feels very much like IPython.

OCaml installations tend to come with both a native compiler and a byte-code compiler/runtime, so you really get the best of both worlds there - minus some occasional hiccups in the build process.


OCaml was one of the languages I had in mind when I made the remark. :)


Nice. I would describe the experience (and the languages) as nice. And respectful of my time.*

* Waiting patiently for my JDK9 JShell


There used to be a program called BeanShell that tried to provide a REPL for Java, but it was not as nice as the REPLs offered by, e.g., Python or Ruby (let alone Lisps, where the debugger is integrated into the REPL). OTOH, I looked at it years ago, so maybe it has gotten better (or stagnated completely).


A REPL is just an ECT loop sped up enough to actually be good.


> From someone stuck with dynamic languages, mark my words, it's best for your compiler to be a bit slow, than your brain being the compiler!

lol, I'm on a project right now that uses ES2016 (dynamic) and I'm still stuck on a slow compiler.


Switch to TypeScript or Flow. Then at least it's worth the wait (in terms of correctness).


Can't switch in that project, but I will try TypeScript in the next


If you're only using standard ES6 features, you may be able to switch to Buble. I got a pretty significant speedup after switching one project to the Buble ES6 compiler.

https://buble.surge.sh/guide/


Buble is subtly broken in a number of ways. For example, if any of your method names mirror something you need in the global namespace, your method will clobber that global object. For example:

    class A {
      constructor() {
        this.handle = setInterval(/* blah */, 1000);
      }
      clearInterval() {
         clearInterval(this.handle);
      }
    }
Becomes:

    function A() {
      this.handle = setInterval(/* blah */, 1000);
    }
    A.prototype.clearInterval = function clearInterval() {
      clearInterval(this.handle);
    };
And if you're switching from babel to buble, you'll be confused as to why you're suddenly getting a bunch of stack overflow errors.

I have a fairly large project and don't see a significant difference in compile times between the two. Slowness in my build process usually means I've accidentally touched the node_modules directory in my gulp.src expression. It's not enough to use a negation! You have to avoid testing it completely.


How does that clobber the global object? Are you talking about `function clearInterval()`? That doesn't go into the global scope.

    > let foo = function bar() {}
    undefined
    > bar
    ReferenceError: bar is not defined


It doesn't overwrite the function on the global scope, but it does shadow the `clearInterval` name within the function definition, which I don't believe it should do according to the ES6 spec. Compare:

https://buble.surge.sh/#class%20A%20%7B%0A%20%20clearInterva...

with

https://babeljs.io/repl/#?babili=false&evaluate=true&lineWra...


If you're not, can you use Babel to use Buble?


If you're targeting modern browsers you may want to look into disabling all plugins except for modules. The latest version of every major browser supports ES2015 natively (as well as much of ES2016 and ES2017). There are various Babel presets (eg. babel-preset-modern) available on NPM that do this.

Even if you're not targeting modern browsers, you might consider developing on Chrome with a very minimal, modules-only Babel config, and then running the full transform only when you're ready to QA & deploy.


Still, it can be improved, and it is nice to see work being done there.

For me, the sweet spot would be the compilation speed of Delphi, .NET Native and similar languages.


I still think they should investigate doing the LISP solution to the problem where you combine several options:

1. Interpreter (eg REPL) that executes it instantly with dynamic checks for rapid prototyping of logic.

2. Fast, non-optimizing compile that at least does the safety checks.

3. Fully-optimizing compile that can be done in background or overnight while programmer works on other things like tests or documentation.

4. Incremental, per-function version of 2 or 3 for added boost. Rust is already adding something like that.

This combo maximizes mental flow of programmer, lets them tinker around with algorithms, quickly produces executables for test deployments, and will often do full compiles without interrupting flow with wait times. It's the workflow available in a good LISP. It doesn't have to be exclusive to LISP. Other languages should copy it. I'll go further to say they should even design the language and compiler to make it easy to pull that off.


If you are only interested in whether your Rust program typechecks, you can use cargo check. It is quite fast, because it does not generate any code.


In CLs, anyway. In Schemes, we usually only have 1 and 3. Maybe 2 and 4, but only sometimes.


Chicken can be configured for the different levels of optimization it will do; now, I wouldn't call it fast...


Like I said: it has 1 and 3.


I have high hopes for the next major release; it's shaping up to be a significant improvement across the board.


I am subbed to the mailing list. I must have missed that. What's coming?

I know I was really excited about the debugger (feather) in the previous release.


The Scheme I was thinking of using to re-learn LISP and Scheme was Racket. Its environment + How to Design Programs looked like a strong combo. Do you know how it fares on this list?


Not really. I'm pretty sure Racket is bytecode-compiled, but it's not native-code compiled. It has a realtime interpreter, of course.

I'm more of a Chicken fan myself, and Chicken has a REPL/interpreter and an optimizing compiler. Its compiler isn't especially fast, even at lower settings.

Racket really is the C++ to Scheme's C, in that it's similar to its parent (although, unlike C++, it's actually not compatible) and is more complex (I would say too complex, but unlike C++ it's fairly well designed, so it's a matter of opinion: if you like it, good on you).

Guile is also bytecode compiled. I don't know how fast compilation is.


Racket can compile a racket source file to platform independent bytecode (stored as zo-files). When the bytecode is run, it isn't interpreted, but a just-in-time compiler kicks in, generates machine code, and the code is run.

As far as I know, a new compiler is in the works.


FWIW, Racket explicitly dropped Scheme from its branding because the language designers disagreed with the direction of Scheme and wanted to go their own way; and they have, to a great extent. Racket has Scheme in its syntax, much as C++ has C.


Ahh. So probably best to think of it as its own, Scheme-like language.


Indeed: beneath some syntactic similarities, the semantics and idioms of the languages are radically different.


Darn. So HtDP and The Little Schemer keep being posted as good intros. What's your pick for the best Scheme and intro text to get started with, in a way where the foundational knowledge will keep paying off, versus a non-standard dialect?


Teach Yourself Scheme in Fixnum Days (http://ds26gte.github.io/tyscheme/), SICP, The Scheme Programming Language, 3rd edition (the fourth edition is about the much-maligned R6RS, which isn't well supported) (http://www.scheme.com/tspl3/), and the Scheme Wiki (although it's a bit out of date) (http://community.schemewiki.org) are all helpful resources. In addition, the Scheme reports are small enough that you can read them cover to cover for full coverage of the base language (save the very long R6RS), although there are some things left intentionally undefined (an error, for example, can do anything from raising an exception to crashing your program, or anything in between). The most popular standards are R5RS (http://www.schemers.org/Documents/Standards/R5RS/) and R7RS (http://trac.sacrideo.us/wg/wiki/R7RSHomePage). Pretty much all Scheme implementations are at least mostly compliant with one of them, if not entirely so.

Because Scheme is so minimal, pretty much every implementation extends the language in non-standard ways. The difference between this and what Racket does is that other implementations still honour the standard, merely extending rather than altering, and usually extend a bit less dramatically. As such, your implementation's documentation is an important resource.

The most important language extension is macros: imperative macros aren't part of any spec save R6RS. Most Schemes opt to implement er/ir macros, sc/rsc macros, or syntax-case.

er macros are the simplest and closest to standard defmacros, but you have to rename functions as well as variables, so it gets hard to see what's going on pretty fast. ir macros fix this, but are O(n) when implemented in terms of er, which they usually are. This is the system Chicken uses.

sc/rsc are probably the best system. They're based around syntactic closures, and it takes about five minutes to learn the basics. Chibi and MIT Scheme use it.

syntax-case is akin to Racket's macro system, and unifies pattern matching and procedural macros. Some love it, others think it's an overcomplicated mess. Guile and other R6RS (or partially R6RS) compliant Schemes use this.

There are some widely ported code libraries that you may wish to use, in addition to those specific to your implementation. SLib, SXML, the descendants of TinyCLOS (there are many), irregex, matchable, the various loop macros (there are several), and the awk macro are all useful tools.

Of special note are the SRFIs: Scheme Requests for Implementation. They're optional standards for extensions to the Scheme language, and also libraries: Scheme's equivalent of RFCs. Typically, implementations will include a selection of the more popular SRFIs in their stdlib, and sometimes more in their package repositories, if they have one. SRFIs are required to provide portable implementations if at all possible, which it usually is. However, it is sometimes possible to write better-optimized code using implementation-specific features, so if there's an implementation-specific implementation of a SRFI, choose it over the standard one. SRFIs are usually referred to by number, which confuses everybody. You can find them all on http://srfi.schemers.org. The most important ones for new users to know are probably SRFI-1 (list manipulation functions you'd expect to be in the core language but aren't), SRFI-13 (string manipulation), and SRFIs 69 and 125 (hash tables: 69 is older and more rudimentary, but extremely widely supported; 125 is new, will become part of R7RS-large when it's finished, and is fully backwards compatible with 69, but provides additional convenience functions).

Finally, there's choosing an implementation. Here's a summary:

-MIT Scheme: it's frozen, and essentially only on life support. Use it for doing SICP, if at all.

-Guile: bytecode compiled. One of the best Schemes for getting Real Work done: a bit lacking in libraries, but has a good FFI, and the libraries it does have are of excellent quality. Very well documented, with friendly developers (the lead, Andy Wingo, is a great guy, and his blog is worth a read no matter what Scheme you pick) and a good community. It's R5RS, with some R6RS support. The only Scheme I know of to support native threads. Also good for dynamic linking into applications as an extension language.

-Chicken: My personal favorite. R5RS, working toward R7RS; relatively conservative, extension-wise. Also an excellent choice for Real Work. Compiles to C, quite fast. It has a large package repository (for a Scheme): probably the largest of the Schemes on this list, including a web framework (Guile has one as well), as well as countless other useful things. Because it's compiled via C, the FFI is dead simple. Also uses Cheney-on-the-MTA compilation, so continuations are fast enough to be practical in real work, whereas they're mostly toys in other Schemes. Notably, it's not (necessarily) re-entrant, and doesn't support OS-level threads.

-Gambit: Also compiles to C. Probably the fastest of the Schemes in the general case. Has some libraries (SchemeSpheres, among others), but not many, so be prepared to write a lot of FFI code, and code that would otherwise be provided by libraries. Has coroutines like Chicken, and also has a message-passing library/system akin to Erlang, called Termite. R5RS.

-Chibi: Written by Alex Shinn, the author of many scheme packages, including irregex, the most common regex package on the Schemes. The reference implementation of R7RS. Has a package repository through Snow: better than Gambit, not as good as Guile or Chicken. Very small, bytecode compiled. Has a small stdlib with some neat stuff in it. Has a decent FFI, but not as good as Chicken or Gambit. Built for static linking into an application.

Welcome to Scheme. Sorry for the infodump, but I hope it helped.


Cyclone is a new scheme that uses Cheney-on-the-MTA _and_ has native threads.

https://justinethier.github.io/cyclone/index


...which is cool, but it doesn't have the library support Chicken does, yet. I honestly forgot about Cyclone. It's not one of The Big 3, or anything, so it just kinda slipped my mind. Thanks for bringing it up.


It's definitely very much a toy.


The linked 90-minute compiler might be useful combined with the paper I have on incrementally building one in N steps. They both use Scheme.


Appreciate the detailed response. :)


Ooh! Almost forgot: The Little Schemer. It's actually (mostly) pure R5RS, but if you want to do Real Work with Scheme, it won't be super helpful, because it's primarily trying to teach recursion. However, it and its sequels (also mostly R5RS) are all worthwhile reading if you've got the time, and I'd highly recommend them.


"it won't be super helpful, because it's primarily trying to teach recursion."

I was literally just rolling my eyes as I read that in the book itself. ;) Table of contents on Fixnum looks good. I still have SICP somewhere but couldn't remember if it was easy or if their style was still relevant. I'll check it out again.


Easy? No. Relevant? Yes. It's really about learning CS, but it'll teach you a lot of Scheme along the way. Although it notably skips some parts of the language, namely macros and, I think, some of the I/O stuff. Also dynamic-wind. Most are not hard to grasp, but important in Real Scheme programming. Except that you also have to understand why dynamic-wind has to deal with call/cc, and the resultant problems. But that's not too hard to grasp.


As far as I know, the Rust compiler is written in Rust? That should mean that, if the features of the language are used, a lot of "small" allocations shouldn't happen at all; instead, the language is supposed to allow short-lived variables (and objects) to exist only on the stack, costing effectively nothing to allocate.

Moreover, compilers in general have the advantage of knowing that most objects are needed only during specific phases, allowing optimized allocation and bundled deallocation. I see "a lot" of "arena" allocators being mentioned, so it seems there is something like that, just maybe too fine-grained?

In short, there's no need for the Rust compiler to have any measurable overhead due to the allocations.

Then only changes in the compilation algorithms and the underlying data structures (how often is what traversed and why; is something kept or recalculated; sometimes it's even more expensive to produce too much data to be kept in memory; and sometimes the data structures aren't optimal for most of their uses) should have the potential to change the compilation speed.


Don't know why you are being downvoted. Even if you are incorrect, it would be interesting to hear why. Anyone care to comment?


Compilers spend a lot of time with hash tables, which live on the heap. Rust isn't some magical language that makes the heap an unnecessary entity. One still has objects whose lifetimes do not fit the stack model or whose size is dynamic.

Their comment went on to claim that, given their hypothesis, allocation optimization shouldn't be impactful. Yet this post demonstrates otherwise, perhaps showing that the hypothesis is false (if a then b; b is false; therefore a is false).

One of the post's optimizations was to use Cow<str> over String, though, which is an example of reducing allocations with borrows.

Additionally, it's often mentioned that a weak point in Rust's compile time is LLVM. This has to do with various things, some of which are Rust's fault (it could generate less IR), but it points out that there's a lot of time being spent in C++ code. LLVM's great, but its guidelines essentially suggest "if you want to make LLVM play nice, generate IR like clang does".


> Their comment went on to claim that given their hypothesis, allocation optimization shouldn't be impactful.

And what was the hypothesis? That the code for the compiler itself should use the features imagined to be used by the language, just like your example:

> One of the post's optimizations was to use Cow<str> over String

Then:

> allocation optimization shouldn't be impactful. Yet this post demonstrates otherwise

"Not impactful": it looks to me like this: 36.526s vs 32.373s for the longest compile tested. It is measurable, but not something that I, as the programmer, will be able to notice. My definition of impactful for compilation times is "at least a factor of 2", not 1.1.

And the claim applies once the language's possibilities are actually used. If the code is written differently, of course not much is to be expected. The result, a moderate speedup, exactly shows that the language's possibilities weren't always used in a way that minimizes allocations, but the difference is still not, to my perception, significant (10% is hard to recognize unless measured, or unless the actual compilations last hours and not seconds). But I still claim that much bigger potential gains lie in actual changes to the algorithms and the data structures. The data structures especially can hugely change the result if they are made really cache-friendly, and together with changes in the algorithms, such modifications have the potential to make order-of-magnitude speedups, if they are actually possible.


Because the parent poster's thesis

>>Moreover, the compilers in general have an advantage to know that most of the objects are needed only during specific times, allowing optimized allocation and bundled deallocation

Is flawed.

Compilers can know this. Most of the time they don't. In Turing-complete languages, you can't reason about how much space/time a program will need to run.

When you do know that an object, or objects, will be deterministically allocated/deallocated together, then you need to know how large they'll grow over their lifetimes. Which again may hit a Turing tarpit, and you lose this ability.

Here is an incredibly simple example:

        fn read_file(path: &str) -> String;
        fn get_tokens<'a>(s: &'a str, r: &Regex) -> HashMap<&'a str, &'a str>;
        fn main() {
            let my_regex = //something
            let file_data = read_file(&args[1]);
            let tokens = get_tokens(&file_data, &my_regex);
        }
One can see that the String that will contain the file's data and the HashMap can be allocated together. But one cannot reason ahead of time about how large either will be, to pre-size them on the heap.

Lastly, address randomization is a good thing. Knowing ahead of time which allocations will deterministically sit in relation to each other is a security vulnerability.


My claim about "the compilers in general have an advantage to know that most of the objects are needed only during specific times" is directed at the construction of the compiler, like the Rust compiler written in Rust; not that there's any magic. That means the writer of a compiler can write it knowing much more about the lifetimes of the objects used during compilation than the writers of most other software.

When you write a server, you don't know how much of which pieces of the database should be in memory. When you write a compiler, you know that everything goes away after the compilation is done. That allows extremely speedy compilers (considering the number of their malloc calls), which seldom allocate memory blocks and seldom deallocate.

That means that even for the stuff that you allocate "on the heap" you don't have to allocate objects one-by-one, but allocate big chunks of memory and then just deliver the small pieces of it as the memory for "objects." The same way, for the deallocation, instead of deallocating many small objects, you can deallocate a lot at once.

I've personally written a compiler that made only a few malloc calls for the whole compilation process, even though it internally "created" tens of thousands of "objects." I did it in C, and Rust is made to allow the speed of C with higher safety, so equivalent strategies should be possible. The mention of "arenas" suggests to me that the writers actually think in that direction.


> That means that even for the stuff that you allocate "on the heap" you don't have to allocate objects one-by-one, but allocate big chunks of memory and then just deliver the small pieces of it as the memory for "objects." The same way, for the deallocation, instead of deallocating many small objects, you can deallocate a lot at once.

IIRC jemalloc does this anyway?

Regardless, the Rust compiler uses arenas a lot, but that doesn't make them a one-size-fits-all solution. Arenas are a great solution when allocations are used to be able to let data escape the stack frame (i.e. when you aren't able to just use a stack variable like your original comment said). But that's not the only reason allocation is used. For growable collections like vectors and maps a typed arena isn't necessarily the best solution. An untyped arena might help, but again, that's sort of one of the things jemalloc gets you.

Even with arenas, of course allocations will get you measurable overhead. With arenas individual allocations cause no overhead, but there's a tipping point (i.e. when a block of the arena fills up) when it starts to matter.


>>I've created a similar system

Cool, then submit a patch to the Rust project.


> ...if the features of the language are used...

The Rust compiler uses Rust's features terribly, just because Rust's features were a moving target for the majority of the compiler's lifetime.


In conclusion, allocating memory is expensive, regardless how it is done. :)


> In conclusion, allocating memory is expensive, regardless how it is done. :)

My experience runs contrary to that statement. How allocation is done makes all the difference. Between what the language implementation's memory model is good or bad at, and what allocation/usage patterns the CPU is good or bad at, there are worlds of possibilities and trade-offs. In particular, the "increased allocation equals decreased performance" myth doesn't hold true. See http://mr.gy/blog/maxpc.html#section-3-2 for my own ragtag exploration of this phenomenon.


It sounds like a lot of the allocations in the Rust compiler follow the generational hypothesis, even though Rust isn't GCd. If the Rust compiler was using a parallel generational GC then it's possible it'd be a lot faster, as the hard work of the allocation would be done on separate cores and all those tiny unused TypedArenas would die young which makes them rather cheap.


Rust used to be GCd and the compiler made extensive use of this. It was slowly migrated off this, but I suspect the code is still structured that way.


I'm not sure that this follows from the analysis. Why would a GC remove allocations from the hot path? I could see that it may take deallocations off and may be faster if it's a very good GC.

But you'll also get into the territory of what you're GCing. I take it from the fact that there are a lot of arenas that there are an awful lot of small allocations. Unless your GC lets you choose the level of allocation (i.e., it doesn't necessarily allocate at the object level), your GC'd language is likely to be much, much slower. There may be GCs like this, but I think they are relatively unusual (I'm not aware of any, but I'm no expert).

It seems to me that the issue here is actually the same as you get when you try to improve performance of GC'd languages i.e. you want to try to avoid any allocation wherever possible.

I must say I was a little surprised by how much allocation is going on. Does it imply that idiomatic rust may lend itself to somewhat excessive memory use? Or maybe the rust code in the compiler is sufficiently old to not be considered idiomatic. Would be interested in the views of those more knowledgeable than me.


Allocation with a modern GC is just a single subtraction in most cases: it's very fast.

Deallocation is obviously a lot more involved, but it can be done in parallel on cores that might otherwise sit around underutilised.

Compilers can normally max out all the cores you've got during a full build, so perhaps in this case it doesn't matter: enhancing the parallelism doesn't help you. On the other hand if you're rebuilding a single file due to incremental compilation, it's probably hard to parallelise that entirely by hand, so shoving some of the memory management work onto your spare cores can help.


> Why would a GC remove allocations from the hot path?

It wouldn't reduce allocation count, it would reduce allocation cost to a simple pointer bump which is virtually guaranteed to already be in cache (unlike malloc tables).

The added cost is only tracing, but if, like the OP said, allocations conform to the generational hypothesis, then frequent minor collections are quite fast.
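To make the "simple pointer bump" concrete, here's a minimal sketch (hypothetical, not rustc's or any GC's actual code) of bump allocation: the fast path is one bounds check plus one addition on state that's almost certainly already in cache, and releasing the whole region at once is a single store.

```rust
// Hypothetical bump allocator: allocation is an offset increment into a
// pre-reserved chunk, which is why GC nurseries and arenas are so cheap.
struct Bump {
    chunk: Vec<u8>,
    offset: usize,
}

impl Bump {
    fn new(capacity: usize) -> Self {
        Bump { chunk: vec![0u8; capacity], offset: 0 }
    }

    // Hand out `size` bytes as an offset into the chunk. A real allocator
    // would also align the result and chain on a fresh chunk when full.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        if self.chunk.len() - self.offset < size {
            return None; // exhausted: a generational GC would do a minor collection here
        }
        let start = self.offset;
        self.offset += size; // the entire cost of the allocation
        Some(start)
    }

    // "Freeing" everything at once is a single store; under the generational
    // hypothesis most objects die before this point and cost nothing to reclaim.
    fn reset(&mut self) {
        self.offset = 0;
    }
}

fn main() {
    let mut bump = Bump::new(4096);
    assert_eq!(bump.alloc(16), Some(0));
    assert_eq!(bump.alloc(32), Some(16));
    bump.reset();
    assert_eq!(bump.alloc(8), Some(0));
    println!("ok");
}
```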


rustc is far from idiomatic.

Rust used to be GCd, so the compiler uses a lot of things designed in a GC-esque way but tweaked so that they no longer need a GC.

In general the language has changed a lot, and the compiler source takes some time to catch up completely.


> There may be GCs like this but I think they are relatively unusual (I'm not aware of them but I'm no expert).

Yes, there are.

GC enabled system programming languages like Mesa/Cedar, Modula-3, Active Oberon, D, Sing#, System C#...

All of them allow you to control what goes into the GC heap, stack, global memory or is eventually managed manually in unsafe code sections/modules.

In any case my remark still stands: it doesn't matter how one acquires OS resources, there is always a performance penalty when it happens too often.


The allocation fixes are only scratching the surface of the compiler. Something more fundamental must underlie the long compilation times.


One of the main slow bits is just LLVM. Rust produces LLVM IR which LLVM takes a long time to optimize.

Now that MIR has landed, IIRC the next step here is work on MIR optimizations and LLVM IR simplifications.


Rustc is of course responsible for what it's feeding LLVM, and being smarter about that sounds exciting.


Apart from the borrow checker there is nothing fundamentally different from other compilers. I doubt that borrow checking is fundamentally more expensive than other type checks.

My guess is: Large compilation units and LLVM optimizations.


Not sure how much is due to this, but LLVM has gotten much slower in recent releases, probably due to improved optimizations.

I would not be surprised if down the line, the Rust devs decided to write their own Rust backend rather than relying on LLVM.


That would be a good way to remove the C++ dependency from their toolchain, but it would take decades to achieve parity with the quality of LLVM's optimisers and supported ISAs (which are themselves a subset of GCC's).

C and C++ optimisers weren't that good in the '80s and '90s, even against junior Assembly developers; it's the 40 years of compiler research into optimizing C and C++ code that made them what they are nowadays.


While I certainly agree that it would take time, I'm not so sure 'decades' would be needed; I believe the scope of a Rust-specific backend would be quite different from that of an everything-but-the-kitchen-sink backend like LLVM.

Also not sure that reaching optimization parity with LLVM is an absolute necessity, Clang/LLVM is not as good at optimizing as GCC but that hasn't rendered it unpopular.

Looking at Go as a benchmark, the compiler seems to be improving its optimizations at a good pace and supports X86, ARM, PPC, MIPS and S390, despite Go being a relatively young language that also recently had its compiler rewritten.


> Also not sure that reaching optimization parity with LLVM is an absolute necessity, Clang/LLVM is not as good at optimizing as GCC but that hasn't rendered it unpopular.

Clang/LLVM is essentially as good as GCC. It depends on the benchmark you're looking at. Getting up to LLVM parity would take years and years.

> Looking at Go as a benchmark, the compiler seems to be improving the optimization at a good pace and supports X86, ARM, PPC, MIPS, S390 despite being relatively young language and also having recently rewritten the compiler.

It will take years and years for Go to achieve LLVM's level of optimization (assuming that they want to; I suspect that they won't when it starts regressing compile times).


>It will take years and years for Go to achieve LLVM's level of optimization

Looking at Rust vs Go over at 'benchmarksgame' (granted these are micro-benchmarks), Go seems to compare very well for tests that are mainly computational and thus avoid involving the GC very much (even beating Rust on a couple).

http://benchmarksgame.alioth.debian.org/u64q/compare.php?lan...

And let's not forget that Go hasn't really had any focus on optimizations up until the last release where SSA was introduced.

Anyway, maybe I'm projecting as I would like to see a full Rust toolchain in Rust, and my (very possibly entirely wrong) impression is that Rust devs have been struggling with getting the full performance potential of Rust through LLVM.


The benchmarks game is one thing. Doing well on real code that isn't ridiculously microoptimized is quite another.

It makes zero sense to write a new backend for Rust from scratch because Go does well on a few microbenchmarks. Go's compiler infrastructure is across the board worse than that of LLVM (in compile-time performance too; some of Go's algorithms have worse asymptotics than the more sophisticated ones in LLVM). LLVM doesn't do things like SROA and alias analysis just for fun.

> my (very possibly entirely wrong) impression is that Rust devs have been struggling with getting the full performance potential of Rust through LLVM.

I'm a Rust compiler dev and I have no idea what you're talking about. LLVM is working wonderfully for us. If we had done what Go did, we'd quite possibly have failed.

I have a different proposal. Instead of reinventing the world for no reason, let's treat bugs in LLVM/rustc as bugs, and fix them.


Go still doesn't quite match what most Java native compilers are capable of.

For that you need to use gccgo.

The main problem with this kind of effort is mostly political, as many developers don't distinguish between language and implementation, thus making it harder for a language like Rust to win the hearts of its target audience, e.g. C developers who micro-optimize code as they write it.


Are there benchmarks for those compilers? I ask because Go is quite competitive with the Oracle JDK.

http://benchmarksgame.alioth.debian.org/u64q/go.html


Yes, but usually Java compilers still tend to win when the benchmark is done in such a way that the 2nd level JIT is allowed to do its work.

Go usually wins in those benchmarks, where Java code is either interpreted or still compiled with the 1st level JIT.

And as mentioned, this is only relevant with the reference compiler, gccgo is much better.


> allowed to do its work <

It's not like the programs are run with -Xint :-)

And it's not a panacea either -- https://arxiv.org/abs/1602.00602


>gccgo is much better.

Actually I did a benchmark run not that long ago between Gccgo and Go with the latter winning most of the benchmarks, I believe this mainly has to do with Gccgo lacking escape analysis.


At least for development a custom fast backend would be great. For production code you still want all the optimizations you can get, which means LLVM.


Writing different backends for each supported target would either be taking on a lot of work or shedding some targets.


>Writing different backends for each supported target would either be taking on a lot of work or shedding some targets.

Does anyone but statistical outliers care about anything other than amd64 and ARM?


For a browser?

1/3 or more of desktop Windows users are on 32-bit Windows, last I checked. So you definitely care at least about ix86 of some form.


What's a browser got to do with it?


In the specific context of Rust, being able to build a browser is a large part of the point. That means supporting platforms where people use browsers.


I guess the parent thinks or implies that since Rust was created to help Mozilla build a next-gen rendering engine, platforms with browsers is the only domain they care about supporting.


No, I mean that platforms with browsers are a domain that Rust _must_ support for the use case Mozilla is developing it for. That doesn't preclude desiring support for other platforms, but is a minimum bar.


> platforms with browsers is the only domain they care about supporting.

What? No. It means that platforms with browsers are part of the domain Rust cares about supporting. Not the only domain. You asked for non-outliers, bz gave an example of browsers.

There's plenty of work to make Rust work on embedded platforms where no browsers operate, and some of this is being done by Rust contributors employed by Mozilla. No need to assume that Rust is only focused to the needs of browsers.


"Statistical outliers" are more important than their frequency suggests. Architecture lock-in shifts political power away from developers and to CPU vendors.


OpenPower and Xeon Phi, for one.

Even within the amd64 and ARM families there are enough variations (AArch vs Qualcomm) and evolutions (SSE3 to SSE4) to make developing a new, performant backend from scratch very resource-consuming.

Also, while most of the non-amd64/ARM platforms might not be significant taken individually, combined they still represent a big share of the pie, doubly so in the embedded world which Rust (tries?) to target.


Xeon yeah, haven't thought of that.

For OpenPower, let IBM do it :)


Very unlikely. The Rust party line is that optimization is critical and compile speed desirable.


Allocating memory can be as cheap as adding to a number. Even if you want a concurrent allocator, atomic increments are pretty fast compared to any typical malloc. It's allocating and then freeing that's slow.

A copying-compacting GC has very fast allocation at the cost of very slow freeing (you have to walk memory).

DMD has very fast allocation and never frees memory.


> malloc and free are still the two hottest functions in most benchmarks. Avoiding heap allocations can be a win.

This is a non-sequitur. If you see a hot function in a profile, that is an indication that you should optimize that function if you can. But it does not mean that you should try to reduce calls to that function elsewhere (unless you can cut out a lot in one go). As a trivial example, if you have a super-fast function that cannot be optimized any further, but it's called millions of times, that function may appear hot, but any individual invocation of the function might be taking just a few microseconds. Getting rid of a handful of calls to this function (or even a few hundred) isn't going to make an appreciable difference. Of course, if you can cut the number of calls in half, then that would be good.

I guess another way of saying this is, if you're trying to optimize a hot function by reducing the number of times it's called, you need to reduce it by a significant percentage of the total calls rather than reducing it by a specific number of times. No matter how many times it's called, if you reduce the number of calls by 50%, the overall time taken by the function drops by 50%. But if you reduce the number of calls by 50, then whether that's meaningful or not depends entirely on how long each individual call to the function takes.
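The arithmetic can be made concrete with toy numbers (all hypothetical): even for a function that dominates a profile, removing a fixed number of calls saves time proportional to the per-call cost, while removing a fraction of calls saves that fraction of the total.

```rust
// Illustrative only: the win from eliminating calls to a hot-but-
// individually-cheap function scales with the fraction removed, not the count.
fn main() {
    let calls: u64 = 1_000_000;
    let ns_per_call: u64 = 2_000; // assume ~2 µs per call

    let total_ns = calls * ns_per_call;   // 2 s spent in the function overall
    let drop_50_calls = 50 * ns_per_call; // removing 50 calls saves 100 µs: noise
    let drop_half = total_ns / 2;         // removing 50% of calls saves 1 s: a real win

    println!("total: {} ms", total_ns / 1_000_000);          // 2000 ms
    println!("50 fewer calls saves {} us", drop_50_calls / 1_000); // 100 us
    println!("50% fewer calls saves {} ms", drop_half / 1_000_000); // 1000 ms
}
```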

Edit: Why am I being downvoted for this? I'm speaking a factually correct statement. It's not like this is some sort of controversial opinion. If I'm wrong, please tell me why.


The original statement isn't a non-sequitur. If the compiler is spending a lot of time in malloc, it makes sense to replace those heap allocations with stack allocations where possible.

You are also correct that removing mallocs might not be much of a low-hanging fruit, since it might entail changing a lot of malloc calls rather than just a few, but that doesn't mean the original statement was wrong.


Fair enough, but it's a good bet that most of those mallocs cannot be replaced with stack allocations.


So grab a fixed-size pool allocator. Malloc is very generic, and faster alternatives aren't limited to stack allocators.

EDIT: Or do what the article does, eliminate allocations that were straight up unnecessary (because they were unused, or over-eager deep copies)
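For illustration, here's a minimal sketch (hypothetical, not from rustc) of why a fixed-size pool beats a general-purpose malloc: because every slot is the same size, alloc and free each reduce to a push or pop on a free list.

```rust
// Hypothetical fixed-size pool allocator: O(1) alloc and free via a free
// list of slot indices, with no size classes or coalescing to worry about.
struct Pool<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>, // indices of unused slots
}

impl<T> Pool<T> {
    fn with_capacity(n: usize) -> Self {
        Pool {
            slots: (0..n).map(|_| None).collect(),
            free: (0..n).rev().collect(), // hand out low indices first
        }
    }

    // Pop a free slot index, or None if the pool is exhausted.
    fn alloc(&mut self, value: T) -> Option<usize> {
        let idx = self.free.pop()?;
        self.slots[idx] = Some(value);
        Some(idx)
    }

    // Return the value and push the slot back onto the free list.
    fn dealloc(&mut self, idx: usize) -> Option<T> {
        let value = self.slots[idx].take()?;
        self.free.push(idx);
        Some(value)
    }
}

fn main() {
    let mut pool: Pool<&str> = Pool::with_capacity(2);
    let a = pool.alloc("node").unwrap();
    let _b = pool.alloc("expr").unwrap();
    assert_eq!(pool.alloc("overflow"), None); // pool exhausted
    assert_eq!(pool.dealloc(a), Some("node"));
    assert!(pool.alloc("reused").is_some()); // slot recycled
    println!("ok");
}
```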


My post indicates that I've already removed a large chunk of the existing mallocs. The remaining ones are harder, but I expect I'll be able to still remove a decent fraction.


> Edit: Why am I being downvoted for this? I'm speaking a factually correct statement. It's not like this is some sort of controversial opinion. If I'm wrong, please tell me why.

Didn't downvote you, and considering you're currently on the top of the replies list, I wouldn't worry about potential misclicks. But I'll bite.

> This is a non-sequitur. If you see a hot function in a profile, that is an indication that you should optimize that function if you can.

This is just as much of a "non-sequitur" - maybe you can optimize that function further, but if it's reached the point of diminishing returns it may not be worth it even if it's technically possible. In-context, there has already been a lot of optimization work put into malloc and free, lots of low hanging fruit is gone.

Further, malloc and free are very generic allocation strategies that generally do a lot of work to try and handle many different cases with OK efficiency. I challenge you to make them as fast as, say, stack allocation - go ahead, I'll wait. Sure, you can't replace all use of a heap with a stack, but most programs I've seen have at least some cases where you could.

And there is actually a way to make malloc and free as fast as a stack - turn your malloc implementation into a pointer increment, and free into a noop. Not appropriate for all tools, but a short lived compiler pass might be able to get away with it! Although it'd possibly cause problems for large 32-bit builds. Regardless - this is also technically no longer a heap.

As such, "Avoiding heap allocations can be a win" seems to be a very true and straightforward reply. Confusingly, you seem to agree with this in your next paragraph despite labeling it a "non-sequitur"? Perhaps you're interpreting this as claiming that avoiding a single heap allocation call site is going to be a significant win? Because that's reading far too much into the statement, IMO.

> Getting rid of a handful of calls to this function (or even a few hundred) isn't going to make an appreciable difference. Of course, if you can cut the number of calls in half, then that would be good.

An important note here:

Cutting the number of calls in half could, in theory, involve eliminating one call site (because e.g. the calling code itself is very hot) although this probably isn't the case in the Rust codebase or it'd have been optimized already. Even if we pessimistically assume that we need to eliminate N% of call sites to eliminate N% of calls, the first step to eliminating N% of call sites is to eliminate one call site. The second step is to eliminate a second call site.


> Didn't downvote you, and considering you're currently on the top of the replies list, I wouldn't worry about potential misclicks. But I'll bite.

Thank you for your reply. At the time I added that edit, I was at -2 points. And I'm only at 1 point right now (and not showing up at the top of the stack when I look at the comments page). In any case, the edit did seem to do what I wanted, which was to get replies.

> As such, "Avoiding heap allocations can be a win" seems to be a very true and straightforward reply. Confusingly, you seem to agree with this in your next paragraph despite labeling it a "non-sequitur"? Perhaps you're interpreting this as claiming that avoiding a single heap allocation call site is going to be a significant win? Because that's reading far too much into the statement, IMO.

The way I read it was "malloc is a hot function, therefore you should try to avoid using it". And that's a non-sequitur because the conclusion does not follow from the premise. Whether or not it's a hot function has no bearing on what the cost of an individual call is. And when you identify a hot function in a profile, that doesn't mean you should remove calls to it, that means you should optimize it. If you can't optimize it, then you should charge it to its parent frame. If you take all your un-optimizable hot functions and charge them to their parent frame, then you can see if any new hot functions pop up that can be optimized. e.g. you can't optimize malloc, but if you find one function is calling malloc hundreds or thousands of times, and it doesn't have to, then you can optimize that function (or if it does need all those allocations, you can investigate using alternatives to malloc).

> Even if we pessimistically assume that we need to eliminate N% of call sites to eliminate N% of calls, the first step to eliminating N% of call sites is to eliminate one call site. The second step is to eliminate a second call site.

When talking about a function with many thousands of individual call sites, eliminating them one at a time isn't practical. You need to come up with some way to systematically eliminate a lot of them. Of course, if you do the above and charge malloc to its callers then maybe you can identify a small number of call sites that are responsible for a high percentage of calls to malloc.


> The way I read it was "malloc is a hot function, therefore you should try to avoid using it".

We agree that's a non-sequitur in the more general case at least ("X is a hot function ...") - but even in the more general case, it's not an error to state that it may be worth avoiding, which would be a more accurate reading IMO. It's an avenue to investigate.

> When talking about a function with many thousands of individual call sites, eliminating them one at a time isn't practical.

I've done it before (refactoring logging macros, static_asserts due to macros conflicting with builtins on new compilers, even fixing up allocation interfaces to better trace call sites, etc.)

It's painful and laborious, but sometimes worthwhile.



