Hybrid Register/Stack VM for Faster Function Calls

I’m building another programming language - purple-garden - this time less lispy, but still s-expr based. It’s not inherently functional, but has pure functions. I wanted to make a fast programming language that’s fun to implement and fun (for me) to program in. So far, this is the result.

Some virtual machines are stack based and implement function calls via the stack, others use registers for all function arguments. I decided to go with a hybrid approach: registers for function calls with a single argument and a stack for calls with n>1 arguments. This will hopefully allow me to keep the vm implementation simple, optimize for the common case of a single argument and use the least amount of instructions possible.

VM Architecture, Bytecode and Disassembly

Since some consider programming languages, and implementing one, to be modern day magic, I’ll introduce some basic concepts while going over the architecture of purple garden’s interpretation pipeline.

An input like (@println "Hello World") or (* 3.1415 (* 1.5 1.5)) would, for a bytecode interpreter, generally pass through the following stages:

  1. Tokenisation
  2. Parsing
  3. Compiling to Bytecode
  4. Executing Bytecode

For instance (* 3.1415 (* 1.5 1.5)):

1. Tokenisation

TEXT
[T_DELIMITOR_LEFT]
[T_ASTERISKS]
[T_DOUBLE](3.1415)
[T_DELIMITOR_LEFT]
[T_ASTERISKS]
[T_DOUBLE](1.5)
[T_DOUBLE](1.5)
[T_DELIMITOR_RIGHT]
[T_DELIMITOR_RIGHT]
[T_EOF]

Tokenisation refers to - as the name implies - splitting the input up into tokens with meaning. In purple garden a token is simply a tagged value container. Generally a token would also hold data about its source location, like a line and a column.

C
typedef enum {
  // (
  T_DELIMITOR_LEFT = 1,
  // assigned OP numbers because we directly map these in the compiler, see
  // vm.h#VM_OP
  T_PLUS = 2,
  T_MINUS = 3,
  T_ASTERISKS = 4,
  T_SLASH = 5,
  // =
  T_EQUAL = 6,
  // )
  T_DELIMITOR_RIGHT,
  // [
  T_BRAKET_LEFT,
  // ]
  T_BRAKET_RIGHT,
  // anything between ""
  T_STRING,
  T_TRUE,
  T_FALSE,
  // floating point numbers
  T_DOUBLE,
  // whole numbers
  T_INTEGER,
  // builtins in the format @<builtin>
  T_BUILTIN,
  // any identifier
  T_IDENT,
  // end marker
  T_EOF,
} TokenType;

typedef struct {
  TokenType type;
  union {
    // filled when .type=T_STRING|T_IDENT
    Str string;
    // filled when .type=T_DOUBLE
    double floating;
    // filled when .type=T_INTEGER
    int64_t integer;
  };
} Token;
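
To make that concrete, here is a rough sketch of a single tokenizer step; the next_token name and the pos cursor are illustrative, not the actual lexer API in purple garden:

C
#include <stddef.h>
#include <stdlib.h>

// sketch: produce the next token from `input`, advancing the `pos` cursor;
// the real lexer also tracks source locations, strings, identifiers, etc.
Token next_token(const char *input, size_t *pos) {
  char c = input[*pos];
  switch (c) {
  case '(':
    (*pos)++;
    return (Token){.type = T_DELIMITOR_LEFT};
  case ')':
    (*pos)++;
    return (Token){.type = T_DELIMITOR_RIGHT};
  case '*':
    (*pos)++;
    return (Token){.type = T_ASTERISKS};
  case '\0':
    return (Token){.type = T_EOF};
  default:
    if (c >= '0' && c <= '9') {
      // parse 3.1415 or 1.5 into a T_DOUBLE token
      char *end;
      double d = strtod(&input[*pos], &end);
      *pos = (size_t)(end - input);
      return (Token){.type = T_DOUBLE, .floating = d};
    }
    // strings, builtins, identifiers, ... omitted in this sketch
    (*pos)++;
    return (Token){.type = T_EOF};
  }
}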

2. Parsing

The parser uses the tokens produced by the tokenizer to build an abstract syntax tree (AST):

TEXT
N_BIN[T_ASTERISKS](
 N_ATOM[T_DOUBLE](3.1415),
 N_BIN[T_ASTERISKS](
  N_ATOM[T_DOUBLE](1.5),
  N_ATOM[T_DOUBLE](1.5)
 )
)

Now we know:

  • the precedence of the s-expr: 1.5*1.5 needs to be executed before multiplying the result with 3.1415
  • how many elements to apply which s-expr to
  • what element belongs to what s-expr
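
A node of that tree can be modeled as a tagged struct holding its token and children. A minimal sketch in C; the kinds mirror the N_ATOM and N_BIN labels above, the actual Node layout in purple garden may differ:

C
#include <stddef.h>

typedef enum {
  // a literal, e.g. N_ATOM[T_DOUBLE](1.5)
  N_ATOM,
  // a binary operation, e.g. N_BIN[T_ASTERISKS](lhs, rhs)
  N_BIN,
} NodeType;

typedef struct Node {
  NodeType type;
  // the token this node was built from (operator or literal)
  Token token;
  // children of this node, e.g. lhs and rhs for N_BIN
  struct Node **children;
  size_t children_len;
} Node;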

3. Compiling to Bytecode

Bytecode is generally defined as a list of operators and their respective arguments encoded as bytes.

Since this vm is register based, the bytecode operates on the accumulator register (r0) and a single operand. The compiler emits bytecode for the nodes of the AST from bottom to top:

Atoms

  N_ATOM[T_DOUBLE](1.5)

Compiles to:

ASM
; load value at index 3 of global pool into r0
LOAD 3

Where the argument 3 refers to the index in the global pool at which Double(1.5) is stored.

“Binary” operators

TEXT
 N_BIN[T_ASTERISKS](
  N_ATOM[T_DOUBLE](1.5),
  N_ATOM[T_DOUBLE](1.5)
 )

The atoms will be compiled as explained above; the only change is moving the temporary value needed for the multiplication into another register:

ASM
; load value at index 3 of global pool into r0
LOAD  3
; move value in r0 into r1
STORE 1
; load value at index 4 of global pool into r0
LOAD  4

The multiplication itself is done by MUL. The lhs is assumed to be in r0, the rhs is in the register passed as the argument to the opcode:

ASM
; load value at index 3 of global pool into r0
LOAD  3
; move value in r0 into r1
STORE 1
; load value at index 4 of global pool into r0
LOAD  4
; multiply r0 and r1
MUL   1

This means the inner multiplication s-expr (* 1.5 1.5) compiles to the above and leaves its result in the accumulator register r0. Applying the same scheme to the root s-expr, we get:

ASM
; (* 1.5 1.5)
    ; 1.5
    LOAD  3
    ; r0->r1
    STORE 1
    ; 1.5
    LOAD  4
    ; r0*r1
    MUL   1

; (* 3.1415 r0)
    ; r0->r1
    STORE 1
    ; 3.1415
    LOAD  5
    ; r0*r1
    MUL   1
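
A hedged sketch of how a compiler could walk such a node bottom-up and emit this pattern. The emit and pool_add_double helpers and the OP_* names are illustrative, not the actual purple garden API, and the sketch only covers the shape shown above (at most the rhs is a nested s-expr, which is compiled first):

C
#include <stddef.h>
#include <stdint.h>

// illustrative opcode names; the real values live in vm.h#VM_OP
typedef enum { OP_LOAD, OP_STORE, OP_MUL } VM_OP;

static uint8_t bytecode[1024];
static size_t bc_len = 0;
static double global_pool[256];
static size_t gp_len = 0;

// append one OP/ARG tuple to the bytecode buffer
static void emit(uint8_t op, uint8_t arg) {
  bytecode[bc_len++] = op;
  bytecode[bc_len++] = arg;
}

// store a constant in the global pool, returning its index
static size_t pool_add_double(double d) {
  global_pool[gp_len] = d;
  return gp_len++;
}

// compile an AST node (see the Node sketch above), leaving its result in r0
static void compile(const Node *n) {
  switch (n->type) {
  case N_ATOM:
    // LOAD <global pool idx>
    emit(OP_LOAD, (uint8_t)pool_add_double(n->token.floating));
    break;
  case N_BIN:
    // compile the (possibly nested) rhs first, its result lands in r0 ...
    compile(n->children[1]);
    // ... then park it in r1
    emit(OP_STORE, 1);
    // lhs into r0
    compile(n->children[0]);
    // r0 = r0 * r1 (only T_ASTERISKS handled in this sketch)
    emit(OP_MUL, 1);
    break;
  }
}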

4. Executing Bytecode

At a high level, the bytecode virtual machine loops over the following steps (a minimal sketch follows the list):

  1. index into the bytecode array at vm->pc (program counter) for the current opcode
  2. index into the bytecode array at vm->pc+1 for the current argument
  3. execute the corresponding logic for the opcode, manipulating registers or other internal state
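
Put into code, a minimal sketch of such a dispatch loop could look like the following. The Vm struct is heavily simplified (plain doubles instead of purple garden's tagged Values, no stack, no call handling) and reuses the illustrative OP_* names from the compile sketch above:

C
#include <stddef.h>
#include <stdint.h>

typedef struct {
  size_t pc;               // program counter
  const uint8_t *bytecode; // OP/ARG tuples
  size_t bytecode_len;
  double registers[8];     // r0 is the accumulator
  const double *globals;   // global pool
} Vm;

void vm_run(Vm *vm) {
  while (vm->pc < vm->bytecode_len) {
    // 1. fetch the opcode at pc
    uint8_t op = vm->bytecode[vm->pc];
    // 2. fetch its argument at pc+1
    uint8_t arg = vm->bytecode[vm->pc + 1];
    // 3. execute the corresponding logic
    switch (op) {
    case OP_LOAD:
      // load global at index <arg> into r0
      vm->registers[0] = vm->globals[arg];
      break;
    case OP_STORE:
      // move r0 into r<arg>
      vm->registers[arg] = vm->registers[0];
      break;
    case OP_MUL:
      // r0 = r0 * r<arg>
      vm->registers[0] *= vm->registers[arg];
      break;
    }
    vm->pc += 2;
  }
}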

Bytecode overview and some decisions

The whole pipeline uses a two byte tuple: OP and ARG. This keeps fetching, decoding and emitting lean and fast.

Since some instruction meanings may not be obvious, here’s a table:

Opcode / Instruction      Description
STORE <reg>               Moves the contents of r0 into <reg>
ARGS <amount>             Instructs the vm to expect <amount> arguments for the following call or builtin
LEAVE                     Leaves a scope
LOAD <global pool idx>    Loads a global into r0
BUILTIN <hash>            Calls a builtin by the hash of its name
CALL <hash>               Calls a function by the hash of its name
VAR <reg>                 Creates a variable with its name in r0 and its value in <reg>
LOADV <hash>              Loads a variable by its hash from the variable table into r0
POP                       Pops a value from the stack into r0
PUSH                      Pushes r0 onto the stack
PUSHG <global pool idx>   Pushes a global onto the stack

PUSHG is an optimisation to merge LOAD and PUSH, which is a common pattern in function calling setup…
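
A hedged sketch of how such a fusion could happen at emit time, building on the illustrative bytecode buffer from the compile sketch above (with OP_PUSH and OP_PUSHG added to the sketch's opcode set); the actual compiler may do this differently:

C
// emit a PUSH; if the previous tuple was a LOAD, rewrite it into a single
// PUSHG <global pool idx> instead of appending a second tuple (sketch)
static void emit_push(void) {
  if (bc_len >= 2 && bytecode[bc_len - 2] == OP_LOAD) {
    // LOAD <idx>; PUSH  ->  PUSHG <idx>, the ARG byte stays the pool index
    bytecode[bc_len - 2] = OP_PUSHG;
    return;
  }
  emit(OP_PUSH, 0); // PUSH takes no argument
}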

Calling and defining Functions

Since every good language needs a way to structure and reuse code, purple garden (pg) needs functions.

SCHEME
; (@function <name> [<list of arguments>] <s-expr body>)
(@function no-args [] (@None))
(@function one-arg [a] (@Some a))
(@function two-args [a b] (@Some a))
(@function three-args [a b c] (@Some a))
(@function four-args [a b c d] (@Some a))
; ...

And we just call them:

SCHEME
(no-args)
(one-arg 1)
(two-args 1 2)
(three-args 1 2 3)
(four-args 1 2 3 4)

Generating Definition Bytecode

Once a function is defined, the compiler emits bytecode for said function, including its body and the setup for all parameters. This setup differs for 0, 1 and n arguments:

No argument

SCHEME
(@function no-args [] (@None))

No arguments are passed, thus we can omit any preparation for argument passing:

ASM
__globals:
        False; {idx=0}
        True; {idx=1}
        Option(None); {idx=2}

__entry:
__0x000000[01EC]: no-args
        JMP 6
        ARGS 0
        BUILTIN 1019: <@None>
        LEAVE

Single Argument

SCHEME
(@function one-arg [a] (@Some a))

A single argument is passed as a pointer to its value in r0. This makes calling functions with a single argument way faster than the stack based method used for n arguments:

ASM
__0x000008[02B6]: one-arg
        JMP 20
        ; function head / parameter creation
        STORE 1 ; 1. move &Value from r0 to r1
        LOAD 4: Str(`a`) ; 2. load name of variable into r0
        VAR 1   ; 3. create an entry in variable table with the hash of its name
                ; (computed at compile time) and its &Value in r1

        ; function body
        LOADV 796972: $a
        BUILTIN 145: <@Some>
        LEAVE
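
The VAR / LOADV pair can be pictured as a small hash table mapping the name hash (computed at compile time) to the &Value. A rough sketch using open addressing and an opaque Value type, purely illustrative and not the actual purple garden implementation:

C
#include <stddef.h>
#include <stdint.h>

typedef struct Value Value; // opaque here; purple garden's tagged value type

#define VAR_TABLE_SIZE 256

typedef struct {
  uint64_t hashes[VAR_TABLE_SIZE]; // 0 marks an empty slot in this sketch
  Value *values[VAR_TABLE_SIZE];
} VarTable;

// VAR <reg>: register a binding from the name hash to the &Value in <reg>
void var_table_set(VarTable *t, uint64_t hash, Value *value) {
  size_t i = hash % VAR_TABLE_SIZE;
  // linear probing until a free or matching slot is found
  while (t->hashes[i] != 0 && t->hashes[i] != hash) {
    i = (i + 1) % VAR_TABLE_SIZE;
  }
  t->hashes[i] = hash;
  t->values[i] = value;
}

// LOADV <hash>: look the binding up again and place the &Value in r0
Value *var_table_get(const VarTable *t, uint64_t hash) {
  size_t i = hash % VAR_TABLE_SIZE;
  while (t->hashes[i] != 0) {
    if (t->hashes[i] == hash) {
      return t->values[i];
    }
    i = (i + 1) % VAR_TABLE_SIZE;
  }
  return NULL; // unknown variable
}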

n Arguments

SCHEME
(@function three-args [a b c] (@Some a))

This is where the mix of registers and stack comes in: the value for the last argument is always in r0, any further arguments are stored on the stack:

TEXT
args: a b c
vals: 1 2 3

stack: 1 2
r0: 3

In the disassembled bytecode this results in one POP per additional argument before commencing with the already familiar parameter setup:

ASM
__entry:
__0x000000[02CB]: three-args
        JMP 28

        ; function head

        ; create variable 'c' pointing to &Value
        STORE 1
        LOAD 3: Str(`c`)
        VAR 1

        ; pop &Value for variable 'b' from the stack top
        POP
        STORE 1
        LOAD 4: Str(`b`)
        VAR 1

        ; pop &Value for variable 'a' from the stack top
        POP
        STORE 1
        LOAD 5: Str(`a`)
        VAR 1

        ; function body
        LOADV 796972: $a
        BUILTIN 145: <@Some>
        LEAVE

Generating Call site Bytecode

No argument

SCHEME
(@function no-args [] (@None))
(no-args)

Since we don’t pass any arguments, the call is just OP_CALL with the bytecode index of the function definition, here 0:

ASM
__entry:
        ; ...
        CALL 0: <no-args>

Single Argument

SCHEME
(@function one-arg [a] (@Some a))
(one-arg 1)

As introduced before, a single argument can be used directly from r0, thus the callsite just loads the global and invokes the call; no stack interaction is necessary:

ASM
__entry:
        ; ...

        LOAD 4: Int(1)
        CALL 0: <one-arg>

n Arguments

SCHEME
(@function two-args [a b] (@Some a))
(@function three-args [a b c] (@Some a))
(@function four-args [a b c d] (@Some a))
(two-args 1 2)
(three-args 1 2 3)
(four-args 1 2 3 4)

Like above, the last argument is always in r0, thus all other arguments need to be on the stack. This is implemented by PUSH (or PUSHG if LOAD and PUSH are emitted consecutively) at the callsite. For each argument except the last, a PUSH is required. The instruction before CALL or BUILTIN is ARGS, instructing the vm to pop the right amount of arguments off the stack.

ASM
__entry:
        ; ...

        PUSHG 7: Int(1)
        LOAD 8: Int(2)
        ARGS 2
        CALL 0: <two-args>

        PUSHG 9: Int(1)
        PUSHG 10: Int(2)
        LOAD 11: Int(3)
        ARGS 3
        CALL 22: <three-args>

        PUSHG 12: Int(1)
        PUSHG 13: Int(2)
        PUSHG 14: Int(3)
        LOAD 15: Int(4)
        ARGS 4
        CALL 52: <four-args>
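
A hedged sketch of the callsite emission this implies, reusing the illustrative Node, emit, emit_push and pool_add_double helpers from the earlier sketches (with OP_ARGS and OP_CALL added to the sketch's opcode set); the real compiler will differ:

C
// compile a call whose arguments are call->children[0..n-1] (sketch);
// func_offset is the bytecode index of the function definition
static void compile_call(const Node *call, uint8_t func_offset) {
  size_t n = call->children_len;
  // every argument except the last goes onto the stack
  for (size_t i = 0; i + 1 < n; i++) {
    const Node *arg = call->children[i];
    if (arg->type == N_ATOM) {
      // LOAD + PUSH fused into PUSHG for plain globals
      emit(OP_PUSHG, (uint8_t)pool_add_double(arg->token.floating));
    } else {
      compile(arg); // result ends up in r0 ...
      emit_push();  // ... and is pushed onto the stack
    }
  }
  // the last argument always stays in r0
  if (n > 0) {
    compile(call->children[n - 1]);
  }
  // only calls with more than one argument need ARGS
  if (n > 1) {
    emit(OP_ARGS, (uint8_t)n);
  }
  emit(OP_CALL, func_offset);
}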

Benchmark

Warning

This is a microbenchmark done on a noisy, heavily loaded laptop system, so take everything here with a grain of salt. Either way, here are the specs:

TEXT
System:
  Kernel: 6.11.0-24-generic arch: x86_64 bits: 64
  Desktop: i3 v: 4.23 Distro: Ubuntu 24.04.2 LTS (Noble Numbat)
Machine:
  Type: Laptop System: LENOVO product: 21F8002TGE v: ThinkPad T14s Gen 4
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64
    type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 1950 min/max: 400/5132 cores: 1: 1996 2: 1997 3: 1996
    4: 1996 5: 1996 6: 1852 7: 1848 8: 1852 9: 1852 10: 1853 11: 1996 12: 1996
    13: 1996 14: 1996 15: 1996 16: 1996

TLDR: 25k function calls, each of which also invokes a builtin (~50k calls total), execute in ~4.2ms of vm runtime, which is ~15.5% of the total runtime. It also scales well: 250k function calls (500k calls total) run in ~13ms of vm runtime.

Terminal
$ head -n10 examples/bench.garden
(@function no-args [] (@None))
(@function one-arg [a] (@Some a))
(@function two-args [a b] (@Some a))
(@function three-args [a b c] (@Some a))
(@function four-args [a b c d] (@Some a))
(no-args)
(one-arg 1)
(two-args 1 2)
(three-args 1 2 3)
(four-args 1 2 3 4)
Terminal
$ wc -l examples/bench.garden
25010 examples/bench.garden
Terminal
$ make bench PG=examples/bench.garden
// [...]
[ 0.0000ms] main::Args_parse: Parsed arguments
[ 0.0360ms] io::IO_read_file_to_string: mmaped input of size=380261B
[ 0.0080ms] mem::init: Allocated memory block of size=612368384B
[ 3.8690ms] lexer::Lexer_all: lexed tokens count=125085
[ 5.1560ms] parser::Parser_next created AST with node_count=25010
[ 3.4340ms] cc::cc: Flattened AST to byte code/global pool length=180148/50017
[ 4.1530ms] vm::Vm_run: executed byte code
[ 3.7390ms] mem::Allocator::destroy: Deallocated memory space
[ 0.0000ms] vm::Vm_destroy: teared vm down
[ 0.0000ms] munmap: unmapped input
Terminal
$ make release
$ hyperfine "./purple_garden examples/bench.garden"
Benchmark 1: ./purple_garden examples/bench.garden
  Time (mean ± σ):      27.2 ms ±   2.6 ms    [User: 9.1 ms, System: 15.6 ms]
  Range (min … max):    19.7 ms …  32.8 ms    93 runs