Garbage collection is quite similar to view maintenance in databases (or more specifically reason maintenance / truth maintenance). In particular, the fact that they found a least-fixpoint formulation of the problem hints at a strong similarity with computing and maintaining fixpoints in deductive databases - that is, computing and maintaining materialized views from some base facts and rules (a distinction that doesn't really need to be made; base facts can be seen as a special kind of rule).
In view maintenance, various counting algorithms have been known for around 30 years - for example the PF (Propagate and Filter) algorithm, or the DRed (Delete and Rederive) algorithm. One difference is that in garbage collection you always have the whole graph, edges and vertices, while in view maintenance you typically have only the vertices. That gives rise to the "proof traces" method of view/reason maintenance, where you build up the whole graph and keep the edges even though they are useful only for maintenance (or for explanation).
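To make the analogy concrete, here's a toy sketch (entirely my own illustration with made-up names, not PF or DRed verbatim) of counting-based maintenance: each derived fact keeps a count of its derivations, and deleting a base fact cascades through the counts, the same way reference counting cascades frees:

```python
from collections import defaultdict

class CountingView:
    """Toy counting-based view maintenance: every fact stores how many
    derivations currently support it - structurally the same
    bookkeeping as a reference count."""

    def __init__(self):
        self.count = defaultdict(int)     # fact -> number of derivations
        self.derives = defaultdict(list)  # fact -> facts it helps derive

    def insert(self, fact, derived=()):
        self.count[fact] += 1
        for d in derived:
            self.derives[fact].append(d)
            self.insert(d)                # the new fact adds one derivation of d

    def delete(self, fact):
        self.count[fact] -= 1
        if self.count[fact] == 0:         # last derivation gone: cascade,
            for d in self.derives.pop(fact, []):
                self.delete(d)            # just like RC freeing dead children

v = CountingView()
v.insert("edge(a,b)", derived=["path(a,b)"])
v.insert("edge(b,c)", derived=["path(b,c)"])
v.delete("edge(a,b)")          # path(a,b) loses its only derivation
print(v.count["path(a,b)"])    # 0 - retracted along with its base fact
print(v.count["path(b,c)"])    # 1 - still derivable
```

(The real algorithms handle recursion and multiple rules, but the cascade is the shared core.)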
The distinction between tracing garbage collection and reference-counting garbage collection could also be seen as roughly analogous to the difference between forward chaining and backward chaining in Datalog, or in rule systems in general.
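A rough illustration of that chaining difference (my own toy code, not anyone's real implementation): the same reachability rule evaluated forward from the facts vs backward from a query:

```python
# Toy rule set: reach(y) :- reach(x), edge(x, y).
edges = {("a", "b"), ("b", "c"), ("d", "e")}

def forward_reach(start):
    """Forward chaining: saturate from what is known."""
    reach = {start}
    changed = True
    while changed:
        changed = False
        for x, y in edges:
            if x in reach and y not in reach:
                reach.add(y)
                changed = True
    return reach

def backward_reach(start, goal, seen=frozenset()):
    """Backward chaining: work from the goal back toward known facts."""
    if start == goal:
        return True
    return any(backward_reach(start, x, seen | {y})
               for x, y in edges
               if y == goal and y not in seen)

print(sorted(forward_reach("a")))  # ['a', 'b', 'c']
print(backward_reach("a", "c"))    # True
```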
It seems like another case where greater awareness of results from different but related fields could bear fruit much faster. This paper in particular seems like something that could have been inspired by the old view maintenance research. Maybe there are still some opportunities left for transferring results from one field to the other.
Conceptually, the key insight is that both GC and RC involve tracing. In GC you trace from live roots to find survivors. In RC, you trace from newly dead objects to find others that should also become dead. This has interesting implications when you combine the algorithms in a generational collector.
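A minimal sketch of that duality (hypothetical code of mine): both collectors run the same graph traversal, just from opposite starting sets. (Real RC only kills an object once its count actually hits zero; this toy heap has no external references into the dead subgraph, so plain traversal suffices.)

```python
def trace(start, edges):
    """Generic graph traversal shared by both collectors."""
    seen, stack = set(), list(start)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges.get(n, []))
    return seen

heap = {"root": ["a"], "a": ["b"], "b": [], "c": ["d"], "d": []}

live = trace(["root"], heap)  # tracing GC: walk out from live roots
dead = trace(["c"], heap)     # RC: walk out from a newly dead object
print(sorted(live))  # ['a', 'b', 'root']
print(sorted(dead))  # ['c', 'd']
```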
Imagine the following steady-state situation. You're allocating at X MB/sec, with a young-generation survival rate of 10%, so 0.1X MB/sec is promoted into the old generation. Since we're in a steady state, 0.1X MB/sec is also the death rate in the old generation. If you use tracing GC in the young generation and RC in the old generation, you'll trace 0.2X MB per second: 0.1X tracing young-generation survivors plus 0.1X tracing newly dead old-generation objects. This holds true even for an extremely large old-generation live size. You'll never have to trace the whole heap during an old-generation collection, just the objects that die during that interval. And if you're in a steady state, that work is proportional to your allocation rate, not your heap size.
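Restating that arithmetic in code, with X = 100 MB/sec plugged in as an example:

```python
alloc_rate = 100.0                  # X: MB/sec allocated in the young gen
survival = 0.10                     # fraction surviving a young-gen collection
promoted = alloc_rate * survival    # 0.1X MB/sec into the old generation
old_death_rate = promoted           # steady state: promotion rate == death rate
traced = promoted + old_death_rate  # young-gen survivors + dying old objects
print(traced / alloc_rate)          # -> 0.2, i.e. 0.2X, independent of old-gen size
```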
Now, ideally, a tracing GC can get its work down to being proportional to your allocation rate, but that requires, even in the ideal case, a heap sized at 2x the maximum live size. With an RC old generation, you can get away with a heap not much bigger than your live size.
There are also certain advantages in terms of concurrent collection. The nice thing about tracing dead objects is that they are never mutated, so you can do it concurrently with the application without worrying about mutations of the object graph, the way a concurrent mark-sweep collector has to. The downside is that the write barrier is more complicated. See: https://www.google.com/url?sa=t&source=web&rct=j&ei=wEnDU_23...
still of the opinion that this is not a problem to solve with algorithms, but by just doing your god damned job and accepting that some of it is mindless chore work that requires a basic level of attention to detail and skill.
my software doesn't leak, it doesn't take forever to develop, and it's easy to debug memory problems because the architecture is trivial.
Are you familiar with the oft-quoted phrase in programmer circles, "It's a poor craftsman who complains about their tools"? It's often badly misunderstood to mean that a craftsman should never complain about their tools and should just make do with whatever they have, a perversion of the quote I've only ever seen in the programmer community.
What it really means is that it is a poor craftsman who acquires or puts up with bad tools.
It's an even poorer craftsman who celebrates their poor tools.
Some important programming paradigms are extremely hard or near-impossible without GC - see functional programming, or concurrent lockless programming. Sure, for many imperative programs you can get away with simple memory management schemes like RAII.
I'd have made the same observation. Like the parent, I'm proud of my imperative code and never used to consider memory management a particularly onerous task. I was brought up on it, y'know - just a part of writing good code.
But then I started Lisp in earnest, found GC useful (necessary!) there, and more recently have gone functional with Clojure. Couldn't do without GC now.
One of these days perhaps we'll have concurrent collectors a la Azul, running on a specialized core/MMU on some multiprocessor SoC, and GC will fade into the background and nobody will care so much about performance.
i do a lot of lockless programming and avoid these problems almost entirely... although i will admit to having written a custom allocator or two for that scenario.
i don't really see why being functional adds some extra requirement where stuff can't be freed automatically or by hand... without resorting to gc
I don't think this is quite the right attitude but I still agree with the basic concept.
If you are building a high-end video game (my field), garbage collection simply makes the user experience poorer than it would be otherwise. It is not some invisible implementation factor, but in fact affects the output in the form of stuttering frame times all over the place.
Good luck trying to do a VR game with a reasonable amount of stuff on screen with a GC running (you need to hit 75-90fps, rendered separately for each eye. And every time you miss the frame deadline in VR, it's not just that the game feels worse - it disorients people and gives them headaches and nausea.)
Every time I pick up an Android phone and try to use it, I want to throw it out a window, because it is all stuttery and janky. I presume at least part of this is because of GC.
I think this is another occurrence of people improperly weighing obvious benefits versus hidden costs. The obvious benefit of GC is you don't have to manage your memory so much, maybe. (I say maybe because when you start saying that GC performance is not good, the answer is always thinking about memory management in some way in order to reduce load on the GC. "It's a really great car as long as you don't drive it very much.") The obvious benefit is there, but there is also a hidden cost in terms of performance and solid-feeling-ness, and that cost is really pretty large actually. The theory has always been "pretty soon now GCs will get better and this will go away", but this has never happened. I went to college in 1989-94 and used to design GC'd languages as a hobby, so I have witnessed a couple of decades of this.
As a productive working programmer who writes a lot of code that does complicated things, I do not find the memory management to be a large part of what takes my time. If I were to pull a not-scientifically-derived number out of a hat, I would say it takes less than 2% of my time. To get a 2% improvement in productivity, but to pay for it so heavily ("well, there's a class of results that are simply impossible for me to achieve now because the GC might go off at any time"), is just a really bad tradeoff.
I am sympathetic to the idea that some paradigms of programming (functional, etc) are harder to do without GC. Exploring those ways of programming is a good reason to like GC, but given that functional programming is not really quite here yet for most classes of large and demanding problems, well, it's just a very different world from the one I need to build software in today.
actually GCs aren't that bad. considerate apps on android will not hitch from the GC unless the device is quite old
my bigger problem is debugging these things. whenever i am forced into this environment i usually end up debugging someone else's memory problem and having a really hard time solving it, because of the sloppy practice GC encourages of not thinking about memory - you end up with some weird distant or circular reference keeping an object alive when you know better, and the real problem is the coding style that allows that reference.
new and delete force you to think a tiny bit, just enough to not shoot yourself in the foot like that.
i'm always extra bitter because i have a track record of writing software where the only problematic leaks are from gc or ref counted stuff in opaque libraries... if i never had to touch this stuff there are 3 or 4 apps that i wrote which would have had zero leaks.
It is only "your god damned job" because you can't let go of the fact you're programming on some machine. You should be trying to abstract from that so your programs are more portable and less implementation-dependent. Will you also be obsessing about the internals of the next computing paradigm? I'd rather solve real-world problems in a general, machine-independent way, instead of focusing on the moving parts of a machine that is simply incidental to my job.