Memo On Garbage Collection

This memo is drawn from original notes to Chris Dollin at HP Labs and has five sections :-

  1. Desiderata, my thoughts on the ideal garbage collector
  2. Amortize Costs by Pre-Allocating Free-lists, a trick that should be employed more often in Poplog/Pepper/Spice, I now believe.
  3. Almost Depth First Copying, a minor variation on a Cheney-type collector that copies in (almost) depth first order. The hypothesis, on which I am neutral, is that this pattern of copying will improve caching.
  4. Critique of Description of Mark and Sweep Costs in Jones & Lins, I take issue with Jones & Lins over their enthusiasm for mark and lazy sweep.
  5. Personal Opinion, I offer my views on garbage collection and contrast them with Jones & Lins.

The following notes have been prompted by a detailed reading of
Garbage Collection: Algorithms for Automatic Dynamic Memory Management
by Jones & Lins. Although this is a superb reference work, I think
that it must be read with a degree of scepticism in several places as
sometimes the authors' enthusiasm for a technique leads them to
conclusions that are not properly supported by the text.

Desiderata

In no particular order, here are the criteria which I would like my
ideal garbage collector to satisfy. If the garbage collector is a
copying collector then certain specific restraints against copying
are entailed. I think these criteria more or less demand a two-level
allocator with a hybrid garbage collection strategy.

I note that Poplog has a hybrid GC. It has a copying collector for
speed and an in-place compactor when the heap cannot be expanded.

  • Must always be able to allocate a cell of size N if free > N. In other words, it must be possible to consolidate free space which, in effect, implies a compactifying GC algorithm.
  • It is essential that the GC is able to cope with variable sized cells. It is strongly desired that cells up to the maximum process size are permitted - although tagging schemes may reduce this theoretical maximum without being completely ruled out. For example, I consider that it is reasonable that the length of any array +1 should be a fixnum (or Small, in Spice-speak). This puts an upper limit of 2^29 - 2 using a two-bit tagging scheme. However, a packed array of bits of that size would only occupy 2^29 bits / 8 = 2^26 bytes = 64 MB. I think this is an acceptable tradeoff for a 32-bit architecture.

  • Self-tuning: it must be possible for the GC to adjust its operating parameters (especially heap size) to meet an overall %cost goal. It must be possible to fix upper and lower limits on memory and, ideally, other operating system resources.

  • The GC must cooperate with explicit/manual deallocation of store. I do not think this is any restraint in practice - so the fact it is so rare is all the more puzzling.
  • The GC must support replacement pointers and destroy actions without excessive cost. I am inclined to wonder how Poplog avoids a nasty performance impact here.

  • The heap layout must provide explicit support for immutable and mutable data structures. In particular, there is an expectation that a programmer can dynamically assert that a data structure is no longer mutable. The GC is expected to take advantage of this information at appropriate intervals.

  • Similarly, the heap layout must also support full and non-full cells (or large atoms, to use Jones and Lins' terminology) and the GC is expected to make very effective use of this information.

  • Also, the heap layout must provide support for non-relocatable objects. In particular, there is an expectation that a programmer can dynamically assert an object as non-relocatable or as relocatable. This is an enhancement over Poplog's approach, which requires the programmer to allocate fixed (non-relocatable) store in a distinct fashion.

  • Furthermore, large or ancient objects should not be relocated until there is a clear advantage to doing so, e.g. the attempt to allocate a large object fails because of store fragmentation. This can be thought of as a weakly-fixed cell.

  • The asymptotic cost of GC must be zero. In other words, the %cost of the inactive data can be driven to zero by increasing memory (or other resources).

  • The cost of the GC for a program that never invokes it must be zero or extremely close to zero. In other words, write barriers are unacceptable and so are heap layouts that require linear arrays to be represented as trees and hence turn O(1) operations into O(log N).

  • Cells that can never be reclaimed should be relocated at most once.

  • It must be possible to run the GC either incrementally or speculatively. It is vital that halting or aborting such an invocation of the GC is immediate in terms of interaction with a human being. A practical test is whether the mouse can be smoothly tracked despite such incremental/speculative action.

Amortize Costs by Pre-Allocating Free-lists

One trick that should be exploited more fully in Poplog/Pepper/Spice
is the use of allocation from a free-list. Firstly, the cost of
end-of-heap checking can be amortized by allocating a bunch of
similar cells together and it also makes in-lining constructors
plausible. Secondly, this is a likely mechanism for cooperating with
a manual deallocator.
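
To make the trick concrete, here is a minimal sketch in C. Nothing
here is taken from Poplog, Pepper or Spice; heap_bump, REFILL and the
cell layout are invented for illustration. The point is simply that
the end-of-heap check (and any GC it might trigger) is paid once per
batch rather than once per cell, that the fast path is small enough to
in-line into a constructor, and that a manual deallocator co-operates
by pushing cells back onto the same list.

/* Sketch of batched free-list allocation.  heap_bump() stands in for
   the underlying bump allocator; the end-of-heap check is paid once
   per REFILL cells rather than once per cell. */

#include <stddef.h>
#include <stdlib.h>

#define REFILL 64                 /* cells fetched per heap check      */

typedef struct cell { struct cell *next; /* payload follows */ } cell;

typedef struct {
    cell  *free;                  /* head of the pre-allocated list    */
    size_t cell_size;             /* fixed size of this kind of cell   */
} freelist;

/* Stand-in for the heap's bump allocator: in a real system this is
   the one place where the limit check and a possible GC happen. */
static char *heap_bump(size_t nbytes) { return malloc(nbytes); }

/* One bump allocation (one limit check) yields REFILL cells. */
static void refill(freelist *fl) {
    char *block = heap_bump(REFILL * fl->cell_size);
    for (int i = 0; i < REFILL; i++) {
        cell *c = (cell *)(block + i * fl->cell_size);
        c->next = fl->free;
        fl->free = c;
    }
}

/* The fast path is a pop: small enough to in-line into constructors. */
static void *fl_alloc(freelist *fl) {
    if (fl->free == NULL) refill(fl);
    cell *c = fl->free;
    fl->free = c->next;
    return c;
}

/* Manual deallocation co-operates naturally: it is just a push. */
static void fl_free(freelist *fl, void *p) {
    cell *c = p;
    c->next = fl->free;
    fl->free = c;
}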

Almost Depth First Copying

One of the issues for copying collectors is the fact that they all
reorder store. The iterative Cheney algorithm works breadth first.
The obvious implementation of depth-first copying requires
potentially unbounded store because it is recursive. Whether
breadth-first or depth-first is better is not answered satisfactorily,
in my opinion.

However, there is a straightforward algorithm which operates in
constant space, will typically give depth-first copying, and uses
breadth-first as a fallback. It is a simple and obvious adaptation of
Knuth's roll-around stack. This algorithm is, as far as I can tell,
both simpler and likely to be much more effective than the ones
reported in Jones & Lins! [See Section 6.6, Stackless recursive
copying collection and Approximately depth-first copying.]

In order to perform depth-first copying, we begin with an ordinary
recursive algorithm. We make the algorithm iterative by introducing an
explicit recursion stack:

type stack = Array( cell:address * offset:int * size:int )

Each element of the stack encodes which children of the cell have been
forwarded and which remain to be forwarded. In effect we are recoding
the forwarding of the children as a continuation stack.

We can operate in constant space by limiting the size of this stack.
We handle stack overflow by rolling the stack around as if it were
circular. When we go to overwrite the old elements we fall back to a
breadth-first strategy by forwarding the remaining unforwarded
children - having conveniently retained enough information in the
stack to do this. [N.B. We choose this rollaround strategy because it
preserves depth-first order at the leaves of trees in preference to
the root. The inverse effect requires a trivial modification. However,
the former is almost certainly the effect people are looking for.]
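
The following C sketch shows how I would render the scheme. The object
model is a deliberately toy-like assumption of mine (every object is a
vector of words whose first word is its size, every other word is a
pointer or NULL, and every object is at least two words); none of the
names come from Jones & Lins. An ordinary Cheney scan pointer is
retained as the breadth-first safety net, so the bounded roll-around
stack merely steers the order in which children are copied.

/* Sketch of the almost-depth-first copier.  from_lo, from_hi and
   to_space are assumed to be set up by the surrounding runtime. */

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t word;

#define STACK_SIZE 64          /* fixed: this is the constant space      */
#define FORWARDED  ((word)1)   /* header tag meaning "already copied"    */

typedef struct { word *cell; size_t offset; size_t size; } frame;

static frame stack[STACK_SIZE];
static size_t bottom, top;     /* circular: live frames are [bottom,top) */

static word *from_lo, *from_hi;          /* from-space bounds            */
static word *to_space, *scan, *free_p;   /* to-space and Cheney pointers */

static int in_from_space(word *p) { return p >= from_lo && p < from_hi; }

/* Copy one object into to-space, leaving a forwarding pointer behind. */
static word *copy(word *obj) {
    if (obj[0] == FORWARDED) return (word *)obj[1];
    size_t size = (size_t)obj[0];
    word *to_obj = free_p;
    memcpy(to_obj, obj, size * sizeof(word));
    free_p += size;
    obj[0] = FORWARDED;        /* relies on objects being >= 2 words */
    obj[1] = (word)to_obj;
    return to_obj;
}

static void evict_oldest(void);

/* Push a freshly copied object so that its children are taken
   depth-first.  On overflow we roll around, evicting the frame nearest
   the root, so depth-first order is preserved at the leaves. */
static void push(word *to_obj) {
    if (top - bottom == STACK_SIZE) evict_oldest();
    frame *f = &stack[top % STACK_SIZE];
    f->cell = to_obj; f->offset = 1; f->size = (size_t)to_obj[0];
    top++;
}

/* Forward one field: copy its target if necessary and push the copy. */
static void forward(word *field) {
    word *child = (word *)*field;
    if (child == NULL || !in_from_space(child)) return;
    word *to_obj = copy(child);
    *field = (word)to_obj;
    push(to_obj);
}

/* Breadth-first fallback for an evicted frame: copy its remaining
   children now but do not push them - the Cheney scan finishes them. */
static void evict_oldest(void) {
    frame *f = &stack[bottom % STACK_SIZE];
    for (size_t i = f->offset; i < f->size; i++) {
        word *child = (word *)f->cell[i];
        if (child != NULL && in_from_space(child))
            f->cell[i] = (word)copy(child);
    }
    bottom++;
}

/* Main loop: prefer the stack (depth-first), fall back to scanning. */
static void collect(word **roots, size_t nroots) {
    scan = free_p = to_space;
    bottom = top = 0;
    for (size_t i = 0; i < nroots; i++) forward((word *)&roots[i]);
    for (;;) {
        if (top > bottom) {
            frame *f = &stack[(top - 1) % STACK_SIZE];
            if (f->offset < f->size) forward(&f->cell[f->offset++]);
            else top--;
        } else if (scan < free_p) {
            size_t size = (size_t)scan[0];
            for (size_t i = 1; i < size; i++) forward(&scan[i]);
            scan += size;
        } else break;
    }
}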

General reports are that the hierarchies observed are typically bushy
but shallow. This means that, with a little bit of measurement, this
algorithm will give perfect depth-first copying for the typical
cases. As to whether or not one would see any improvement in locality
as a result I would not like to guess, to be honest.

Critique of Description of Mark and Sweep Costs in Jones & Lins

In section 4.5, Jones & Lins discuss Zorn's lazy sweeper. This is a
mark and sweep algorithm which relies on (a) a bitmap used to store
the mark bits, (b) the sectioning of the heap into cages of
identically sized cells and (c) a cache vector of n objects for each
common object size. This work is cited as the basis for the statement
that "the cost of sweeping can be made small [relative to the cost of
pointer-bumping]". In other words, that it is competitive with the
allocation costs of a copying collector. This claim is referred to
repeatedly throughout the text.

Note on terminology: a cage is a fixed-size "large" block of store
organised into equal-sized cells. The cells need not be of the same
type. Cage-type is the phrase I have chosen for all cages of a given
cell-size.

This is one of those instances where a very sceptical approach should
be taken. The big question is whether or not the organisation into
cages has any consequences. Intuitively, it is a dangerous game to
play - so can you get away with it? As far as I can tell, Jones & Lins
do not address this vital issue.

The lazy sweep gets its excellent efficiency from being able to sweep
the bitmap and process 32 mark bits in one go - using the cache as a
buffer. It is especially important that the cage layout means that
low-level fragmentation issues do not arise, nor do their attendant
costs. This is a two-level allocation strategy that initially appears
very effective at eliminating the search costs associated with store
fragmentation.
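
The flavour of the refill step, as I understand it, is sketched below
in C. The cage and cache layouts are illustrative guesses rather than
Zorn's actual data structures; the point is that a whole 32-bit word
of the mark bitmap is examined in one go and that the allocation fast
path is a pop from the cache.

#include <stdint.h>
#include <stddef.h>

#define CACHE_N    256                    /* free cells buffered per type */
#define CAGE_CELLS 4096                   /* cells per cage (say)         */

typedef struct cage {
    struct cage *next;                    /* next cage of the same type   */
    uint32_t     marks[CAGE_CELLS / 32];  /* one mark bit per cell        */
    char        *cells;                   /* CAGE_CELLS * cell_size bytes */
} cage;

typedef struct {
    size_t cell_size;
    cage  *sweep_cage;                    /* where the lazy sweep resumes */
    size_t sweep_index;                   /* cell index within that cage  */
    void  *cache[CACHE_N];
    size_t cached;
} cage_type;

/* Refill the allocation cache by sweeping 32 mark bits at a time. */
static void lazy_sweep_refill(cage_type *t) {
    while (t->cached < CACHE_N && t->sweep_cage != NULL) {
        cage *c = t->sweep_cage;
        while (t->sweep_index < CAGE_CELLS && t->cached < CACHE_N) {
            uint32_t w = c->marks[t->sweep_index / 32];
            int b = (int)(t->sweep_index % 32);
            if (b == 0 && w == 0xFFFFFFFFu) {   /* fully live: skip 32 cells */
                t->sweep_index += 32;
                continue;
            }
            for (; b < 32 && t->cached < CACHE_N; b++, t->sweep_index++)
                if (!(w & (1u << b)))           /* unmarked cell => free      */
                    t->cache[t->cached++] =
                        c->cells + t->sweep_index * t->cell_size;
        }
        if (t->sweep_index >= CAGE_CELLS) {     /* end of cage: relink to the */
            t->sweep_cage = c->next;            /* next cage of this type     */
            t->sweep_index = 0;
        }
    }
}

/* Allocation is then a pop from the cache, refilling when it runs dry;
   a real allocator would trigger a collection when nothing is left. */
static void *cage_alloc(cage_type *t) {
    if (t->cached == 0) lazy_sweep_refill(t);
    return t->cached ? t->cache[--t->cached] : NULL;
}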

However, it is well-known and obvious that cage layouts simply push
the fragmentation problem into the next level up. Firstly, let us
observe that the performance claims for this algorithm critically
depend on the ability to predict how many cages of each type are
needed. Since garbage collection is triggered by the exhaustion of
any cage-type, it is vital that all cage types are exhausted more or
less at the same time. This algorithm will be pathologically bad if
one cage type is exhausted long before any other cage type has a
chance to fill.

Obviously we can always balance cage-types correctly if we grow the
heap without limit. However, if we come back to practical reality we
will have to work within a fixed limit. And this means that when we
have reached a static (or near-static) population of cages our
ability to reassign cage-types is dependent on the emptying of entire
cages. Detecting empty cages introduces a definite overhead on
either the mark or sweep stage that is proportional to the size of
the heap (not merely live store). This is the first unaccounted
overhead.
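
To make this first overhead concrete: deciding which cages can be
reassigned means a pass over the mark bitmap of every cage of every
type, whether or not anything in it is live. A sketch, reusing the
illustrative cage record from above:

#include <stdint.h>
#include <stddef.h>

#define CAGE_CELLS 4096

typedef struct cage {
    struct cage *next;
    uint32_t     marks[CAGE_CELLS / 32];   /* one mark bit per cell */
    char        *cells;
} cage;

/* After marking, count the cages of one type that are completely empty
   and so could be reassigned to another cage-type.  The cost is one
   pass over every cage's bitmap - proportional to the heap, not to the
   live data. */
static size_t count_reassignable(cage *cages) {
    size_t empties = 0;
    for (cage *c = cages; c != NULL; c = c->next) {
        int empty = 1;
        for (size_t i = 0; i < CAGE_CELLS / 32; i++)
            if (c->marks[i] != 0) { empty = 0; break; }
        empties += empty;
    }
    return empties;
}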

Next, relying on cage emptying is a poor strategy that makes the
algorithm fragile. So we are also going to have to relocate store
from time to time to make better use of our cages. This is another
unaccounted overhead, and one that is likely to depend on the size of
the heap.

When we attempt to satisfy an allocation request for a cell of size
K, we must determine which of our available cage-types it fits into.
This means that the allocation routine for vector types (e.g.
strings) must include a code sequence of the form :-

cage_type := if K <= tiny then K else ceiling( log2( K ) ) endif;

This overhead on allocation was not described by Jones & Lins. It is
our third unaccounted overhead.
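
Rendered in C, one plausible form of this code sequence is shown
below. The value of tiny, the offset past the direct-mapped types and
the loop in place of a bit-scan instruction are all illustrative
choices of mine, but even in the best case there is a compare and a
branch, and for larger sizes a log2, to pay on every vector
allocation.

#include <stddef.h>

#define TINY 16   /* illustrative threshold for direct-mapped sizes */

/* One plausible rendering of the mapping from requested size K to
   cage-type: small sizes map directly, larger ones to ceiling(log2(K))
   offset past the direct-mapped types. */
static unsigned cage_type_for(size_t k) {
    if (k <= TINY) return (unsigned)k;
    unsigned t = 0;
    size_t p = 1;
    while (p < k) { p <<= 1; t++; }    /* t = ceiling(log2(k)) */
    return TINY + t;
}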

Lastly, as the cage->size mapping is dynamically changing, there is
another unaccounted lookup overhead. Once we have determined
cage_type we must discover where to resume the sweep. It is not
possible to wire this into a constructor because it changes
dynamically. Each cage-type maintains its own sweep pointer and we
have to locate this. This is a small overhead but when, as Jones &
Lins do, we start counting instructions and cycles, it is highly
misleading to omit even such a small extra cost.

Without a relocator, we cannot keep all cages of a given type
together. As the application program's needs change, so the
distribution of types becomes fragmented. This means that there is a
final overhead of relinking the sweep pointer when it reaches the end
of the current cage to the start of the next cage. Assuming that
cages are relatively large, this overhead boils down to the
conditional test for end-of-cage.

There is also a locality issue here. We know that objects tend to
live and die in groups. So it is beneficial if, when we allocate a
group of objects, they are allocated close together. This benefit is
lost by a caged layout.

We must also recognise that for vector types, such as strings or
arrays, we will not have cages dedicated to every possible size. If we
adopt the usual strategy of powers of 2 (which is not particularly
good, I have to admit) then the average wastage will be 25%. This is
in addition to the
wastage of 3% entailed by the bitmap. This wastage increases the
frequency of the next garbage collection when we are operating at our
maximum headroom. (This argument has to be applied to the copying
collector too. It transpires there is a modest difference between
the 50% wastage of a copying collector and the wastage due to a caged
layout. See below.)

Finally, we will never perfectly predict the required ratios of
cage-types. This imposes a final wastage penalty. I know of no
statistics that would help us assess this but I suggest 10% would be
a plausible number. This means that the wastage due to the bitmap
and caged layout will be roughly 3% + 25% + 10% = 38% for many common
applications. This wastage can either be thought of as a store
penalty or, because of the intimate relationship in this algorithm
between store and speed, as a speed penalty.

To round off this analysis, we have seen seven deficits of this
mark-and-sweep algorithm not described by Jones & Lins. The critical
issue of dynamically balancing cage types needs a good algorithm and
we do not have one. Failure to provide a good self-tuning algorithm
could easily result in the garbage collector requesting extremely
large amounts of store and thrashing.

In fairness, Jones & Lins merely restrict themselves to what is
reported. The points I have made above involve a lot of guesswork on
my part and would not be part of such a reference work. But I think
the overall issue of caged layout should have been emphasized and
omitting it is very unhelpful.

Personal Notes Regarding Jones & Lins' Book

Overall, I am inclined to think that Jones & Lins are too trusting in
other authors' claims. For example, in numerous places there is
mention made of New Jersey ML, of which I have had considerable
practical experience. The famous claim is that the garbage collection
of call-frames was faster than cutting back the stack. We found that,
although on small benchmarks it was indeed astoundingly fast, it was
unusably slow in practice - even with purely functional
code. We never proved the reason for this vast disparity but I am
inclined to suspect two factors. Firstly, many promising garbage
collection algorithms are fragile. When an application program
strays outside of a particular pattern of store use, their behaviour
degrades catastrophically. Secondly, complex garbage collection
algorithms often require tuning of a set of operating parameters.
Without a self-tuning mechanism, these algorithms are often used
ineptly with very bad results.

One of the reasons I am keen on the simple copying collectors is
that, despite the enormous 50% wastage, they are robust and there is
a simple and effective self-tuning mechanism (grow/shrink the heap to
maintain a mutator/collector CPU ratio). And although generational
garbage collectors sound wonderful, I have become very cautious
regarding their effectiveness.
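
For what it is worth, the self-tuning rule I have in mind is no more
complicated than the sketch below; the proportional adjustment, the
damping limits and all the names are my own illustrative choices, not
anything taken from Poplog or from Jones & Lins.

#include <stddef.h>

typedef struct {
    size_t heap_size;             /* current heap size in bytes          */
    size_t min_heap, max_heap;    /* fixed operating limits              */
    double target_ratio;          /* e.g. 0.05: spend <= 5% of CPU in GC */
} gc_policy;

/* Called after each collection with the CPU time spent in the mutator
   since the last collection and the time spent in this collection. */
static void retune_heap(gc_policy *p, double mutator_secs, double gc_secs) {
    if (mutator_secs + gc_secs <= 0.0) return;
    double ratio = gc_secs / (mutator_secs + gc_secs);
    /* Proportional rule: too much GC means grow, too little means shrink. */
    double factor = ratio / p->target_ratio;
    if (factor > 1.25) factor = 1.25;       /* damp the adjustment          */
    if (factor < 0.8)  factor = 0.8;
    size_t wanted = (size_t)((double)p->heap_size * factor);
    if (wanted < p->min_heap) wanted = p->min_heap;
    if (wanted > p->max_heap) wanted = p->max_heap;
    p->heap_size = wanted;
}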

Aside from New Jersey ML, I recall with some horror the comment from
the technical support of AI in Watford. We were finding that on our
Symbolics, the Edwin program would trigger a major GC and effectively
disappear indefinitely. So we contacted them to ask whether this was
correct. They responded that they didn’t know because whenever they
triggered a major GC they simply powered off and rebooted the
machine.

I read an interesting paper, many years ago, which talked about the
garbage collector of CMU-Lisp. It sounded great. As it happened,
Dipanker Gupta [a mutual colleague] had used the system for real.
He said that it was universally hated within the group in which he worked,
largely because of the fragility of the garbage collector. Although
this is all lost in the mists of time, I believe this was another
generational GC.

This is not to say that I have no hope that a generational garbage
collector can perform well. I have simply come to the conclusion that
it is vital that the damn thing tunes itself and does not rely on
being shipped with a software engineer attached.