Memory management
Steel uses a combination of reference counting and a mark and sweep garbage collector in order to manage memory. Important notes regarding this:
-
All immutable values are reference counted, this includes built in data structures such as:
- Lists
- Immutable vectors
- Hash maps
- Hash sets
- Strings
- Pairs
- Big integers
- Symbols (although, most are interned)
-
All primitive types are unboxed, these include:
- Characters
- booleans
- floats (f64)
- integers (isize)
- the
#<void>
type
-
Mutable values are transformed into
boxes
- which are mutable memory locations managed by the garbage collector. -
Mutable vectos are a special form handled by the garbage collector as well.
-
Mutable structs, are immutable vectors with boxes in each slot for each mutable memory location.
-
Heap allocated boxes are weak references to a reference counted pointer living in the heap. Thus, even memory managed by the garbage collector is also reference counted. This means that unboxing a mutable memory location requires two pointer jumps - first the weak pointer to the heap, then the strong pointer to the underlying value.
- A nice consequence of this is that we can perform fast collections for unreachable values managed by the heap without having to perform a mark and sweep.
Reference counting optimizations
One consequence of using reference counted variables is that there will be a non trivial amount of time spent performing reference count operations on values coming and going from the stack. The steel compiler and vm performs a few optimizations to reduce the reference counting thrash, and also to improve the usage of the functional data structures built in.
Consider the following:
(define (reverse input output)
(if (null? input)
output
(reverse (cdr input) (cons (car input) output))))
(reverse (list 10 20 30 40) '()) ;; => (list 40 30 20 10)
This is a simple function that takes an input list and reverses it. It does so recursively, calling reverse
in tail position, meaning we'll get a tail call optimization and this gets converted into a JUMP
in the VM.
Lists in Steel aren't your classic cons cells - they're implemented more like unrolled linked lists or vlists - meaning they're more like chunks of large contiguous vectors strung together. Copying those on each write to it would be silly, so the compiler analyzes a function and attempts to find the last usage of variable. For every last usage of a variable, the VM doesn't just copy the value from the stack, it actually moves it off of the VM stack - meaning the reference count has the potential to be 1 by the time it hits the last usage of a variable.
When the reference count is 1, we can perform an in-place mutation of the resulting list - which gives us a nice performance win! So in the above snippet, we see the following bytecode:
0 SDEF : 0 reverse
1 PUREFUNC : 21 lambda
2 PASS : 0
3 PASS : 256
4 READLOCAL0 : 0 input
5 CALLGLOBAL : 155 null?
6 FUNC : 1 null?
7 IF : 6
8 READLOCAL1 : 1 output
9 JMP : 17
10 READLOCAL0 : 0 input
11 CALLGLOBAL : 86 cdr
12 FUNC : 1 cdr
13 MOVEREADLOCAL0 : 0 input ;; <- Last usage of input
14 CALLGLOBAL : 92 car
15 FUNC : 1 car
16 MOVEREADLOCAL1 : 1 output ;; <- last usage of output
17 CALLGLOBAL : 94 cons
18 FUNC : 2 cons
19 TCOJMP : 2 reverse
20 PASS : 0
21 POPPURE : 2
22 ECLOSURE : 2
23 EDEF : 0
24 BIND : 975 reverse
25 VOID : 0
26 POPPURE : 0
Which corresponds to these call sites:
(define (reverse input output)
(if (null? input)
output
(reverse (cdr input) (cons (car input) output))))
^^^^^ ^^^^^^
In this case, the mutation of output
is great - since we can cons
onto the list and this under the hood is just a push onto the underlying vector. However, for input it doesn't mean much; car
is a reference operation and has no optimizations associated with it. The compiler isn't quite smart enough to do this yet, but rewriting this slightly, we can get the optimization we want:
(define (reverse input output)
(if (null? input)
output
(let ([first-element (car input)])
(reverse (cdr input) (cons first-element output)))))
Which results in this bytecode:
0 SDEF : 0 reverse
1 PUREFUNC : 24 lambda
2 PASS : 0
3 PASS : 256
4 READLOCAL0 : 0 input
5 CALLGLOBAL : 156 null?
6 FUNC : 1 null?
7 IF : 6
8 READLOCAL1 : 1 output
9 JMP : 20
10 BEGINSCOPE : 0
11 READLOCAL0 : 0 input
12 CALLGLOBAL : 98 car
13 FUNC : 1 car
14 MOVEREADLOCAL0 : 0 input ;; <- Here
15 CALLGLOBAL : 97 cdr
16 FUNC : 1 cdr
17 READLOCAL2 : 2 first-element
18 MOVEREADLOCAL1 : 1 output ;; <- Here
19 CALLGLOBAL : 86 cons
20 FUNC : 2 cons
21 TCOJMP : 2 reverse
22 PASS : 0
23 LETENDSCOPE : 2
24 POPPURE : 2
25 ECLOSURE : 2
26 EDEF : 0
27 BIND : 975 reverse
28 VOID : 0
29 POPPURE : 0
cdr
when coupled with a unique list will just move the view of the underlying storage over one element - the element is still there - but we don't see it, and it means we don't have to allocate a new list (since in this new world, cdr
can allocate). But - we get some nice cache locality for this!
The first example takes 123 ms to reverse a list of 100000 - whereas the second example takes only 23 ms! In racket, the equivalent code takes somewhere between 2 and 5 ms.
Another toy example of this optimization can be seen here, comparing Steel to Racket:
(define values-to-insert
(map (lambda (n)
(cons (number->string n) n)) (range 0 10000)))
(define (hash-insert-loop-test hmap values)
(if (null? values)
hmap
(let ([p (car values)])
(hash-insert-loop-test (hash-insert hmap (car p) (cdr p)) (cdr values)))))
(hash-insert-loop-test (hash) values-to-insert)
This snippet takes about 10-20 ms on my machine, and the equivalent body of code in Racket (replacing hash-insert
with hash-set
) takes about 8 ms. So this is a nice win. Without this optimization for hash maps, the same code took 5 seconds!