I’m building another programming language - purple-garden - this time less lispy, but still s-expr based. It’s not inherently functional, but has pure functions. I wanted to make a fast programming language that’s fun to implement and fun (for me) to program in. So far, this is the result.
Some virtual machines are stack based and implement function calls via this approach, others use registers for all function arguments. I decided to go with a hybrid approach: registers for function calls with a single argument and a stack for functions with n>1 arguments. This will hopefully allow me to keep the vm implementation simple, optimize for the common case of a single argument and use the least amount of instructions possible.
VM Architecture, Bytecode and Disassembly
Since some consider programming languages, and implementing one, modern day magic, I’ll introduce some basic concepts while going over the architecture of purple garden’s interpretation pipeline.
A program like (@println "Hello World") or (* 3.1415 (* 1.5 1.5)) generally passes through the following stages in a bytecode interpreter:
- Tokenisation
- Parsing
- Compiling to Bytecode
- Executing Bytecode
For instance, (* 3.1415 (* 1.5 1.5)):
1. Tokenisation
[T_DELIMITOR_LEFT]
[T_ASTERISKS]
[T_DOUBLE](3.1415)
[T_DELIMITOR_LEFT]
[T_ASTERISKS]
[T_DOUBLE](1.5)
[T_DOUBLE](1.5)
[T_DELIMITOR_RIGHT]
[T_DELIMITOR_RIGHT]
[T_EOF]
Tokenisation refers to - as the name implies - splitting the input up into tokens with meaning. In purple garden a token is simply a tagged value container. Generally a token would also hold data related to its source location, like a line and a column.
typedef enum {
  // (
  T_DELIMITOR_LEFT = 1,
  // assigned OP numbers because we directly map these in the compiler, see
  // vm.h#VM_OP
  T_PLUS = 2,
  T_MINUS = 3,
  T_ASTERISKS = 4,
  T_SLASH = 5,
  // =
  T_EQUAL = 6,
  // )
  T_DELIMITOR_RIGHT,
  // [
  T_BRAKET_LEFT,
  // ]
  T_BRAKET_RIGHT,
  // anything between ""
  T_STRING,
  T_TRUE,
  T_FALSE,
  // floating point numbers
  T_DOUBLE,
  // whole numbers
  T_INTEGER,
  // builtins in the format @<builtin>
  T_BUILTIN,
  // any identifier
  T_IDENT,
  // end marker
  T_EOF,
} TokenType;

typedef struct {
  TokenType type;
  union {
    // filled when .type=T_STRING|T_IDENT
    Str string;
    // filled when .type=T_DOUBLE
    double floating;
    // filled when .type=T_INTEGER
    int64_t integer;
  };
} Token;
2. Parsing
The parser uses the tokens produced by the tokenizer to build an abstract syntax tree (AST):
N_BIN[T_ASTERISKS](
  N_ATOM[T_DOUBLE](3.1415),
  N_BIN[T_ASTERISKS](
    N_ATOM[T_DOUBLE](1.5),
    N_ATOM[T_DOUBLE](1.5)
  )
)
Now we know:
- the precedence of the s-expr: 1.5*1.5 needs to be executed before multiplying the result with 3.1415
- how many elements to apply which s-expr to
- what element belongs to what s-expr
3. Compiling to Bytecode
Bytecode is generally defined as a list of operators and their respective arguments encoded as bytes. Since this vm is register based, the bytecode operates on the accumulator register (r0) and its operand. The compiler emits bytecode for nodes of the AST from bottom to top:
Atoms
N_ATOM[T_DOUBLE](1.5)
Compiles to:
; load value at index 3 of global pool into r0
LOAD 3
Where the argument 3 refers to the index into the global pool at which Double(1.5) is stored.
“Binary” operators
N_BIN[T_ASTERISKS](
  N_ATOM[T_DOUBLE](1.5),
  N_ATOM[T_DOUBLE](1.5)
)
The atoms will be compiled as explained above; the only change is moving the temporary value needed for the multiplication into another register:
; load value at index 3 of global pool into r0
LOAD 3
; move value in r0 into r1
STORE 1
; load value at index 4 of global pool into r0
LOAD 4
The multiplication itself is done by MUL: the lhs is assumed to be in r0, the rhs is in the register passed as the argument to the opcode:
; load value at index 3 of global pool into r0
LOAD 3
; move value in r0 into r1
STORE 1
; load value at index 4 of global pool into r0
LOAD 4
; multiply r0 and r1
MUL 1
This means the first multiplication s-expr (* 1.5 1.5) compiles to the above and leaves the result in the accumulator register r0. Applying this to the root s-expr, we get:
; (* 1.5 1.5)
  ; 1.5
  LOAD 3
  ; r0->r1
  STORE 1
  ; 1.5
  LOAD 4
  ; r0*r1
  MUL 1

; (* 3.1415 r0)
  ; r0->r1
  STORE 1
  ; 3.1415
  LOAD 5
  ; r0*r1
  MUL 1
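Stepping through this by hand, the register contents after each instruction are:

```
LOAD 3   ; r0 = 1.5
STORE 1  ; r1 = 1.5
LOAD 4   ; r0 = 1.5
MUL 1    ; r0 = r0 * r1 = 2.25
STORE 1  ; r1 = 2.25
LOAD 5   ; r0 = 3.1415
MUL 1    ; r0 = r0 * r1 = 7.068375
```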
4. Executing Bytecode
At a high level the bytecode virtual machine:
- indexes into the bytecode array at vm->pc (program counter) for the current opcode
- indexes into the bytecode array at vm->pc+1 for the current argument
- executes the corresponding logic for the opcode, manipulating registers or other internal state
Bytecode overview and some decisions
The whole pipeline uses a two-byte tuple: OP and ARG; this keeps fetching, decoding and emitting lean and fast.
Since some instruction meanings may not be obvious, here’s a table:
| Opcode / Instruction | Description |
|---|---|
| STORE <reg> | Moves the contents of r0 to reg |
| ARGS <amount> | Instructs the vm to expect amount arguments to calls and builtins |
| LEAVE | Leaves a scope |
| LOAD <global pool idx> | Loads a global into r0 |
| BUILTIN <hash> | Calls a builtin by the hash of its name |
| CALL <hash> | Calls a function by the hash of its name |
| VAR <reg> | Creates a variable with its name in r0 and its value in reg |
| LOADV <hash> | Loads a variable by its hash from the variable table into r0 |
| POP | Pops a value from the stack into r0 |
| PUSH | Pushes r0 to the stack |
| PUSHG <global pool idx> | Pushes a global to the stack |
PUSHG is an optimisation merging LOAD and PUSH, which is a common pattern in function calling setup…
Calling and defining Functions
Since every good language needs a way to structure and reuse code, purple garden (pg) needs functions.
; (@function <name> [<list of arguments>] <s-expr body>)
(@function no-args [] (@None))
(@function one-arg [a] (@Some a))
(@function two-args [a b] (@Some a))
(@function three-args [a b c] (@Some a))
(@function four-args [a b c d] (@Some a))
; ...
And we just call them:
(no-args)
(one-arg 1)
(two-args 1 2)
(three-args 1 2 3)
(four-args 1 2 3 4)
Generating Definition Bytecode
Once a function is defined, the compiler emits bytecode for said function, including its body and the setup for all params; this differs for 0, 1 and n arguments:
No argument
(@function no-args [] (@None))
No argument, thus we can omit prep for argument passing:
__globals:
  False; {idx=0}
  True; {idx=1}
  Option(None); {idx=2}

__entry:
__0x000000[01EC]: no-args
  JMP 6
  ARGS 0
  BUILTIN 1019: <@None>
  LEAVE
Single Argument
(@function one-arg [a] (@Some a))
A singular argument is passed by storing a pointer to its value in r0; this makes calling functions with a single argument way faster than using the stack based method for enabling n arguments:
__0x000008[02B6]: one-arg
  JMP 20
  ; function head / parameter creation
  STORE 1          ; 1. move &Value from r0 to r1
  LOAD 4: Str(`a`) ; 2. load name of variable into r0
  VAR 1            ; 3. create an entry in variable table with the hash of its name
                   ;    (computed at compile time) and its &Value in r1

  ; function body
  LOADV 796972: $a
  BUILTIN 145: <@Some>
  LEAVE
n Arguments
(@function three-args [a b c] (@Some a))
This is where the mix of registers and stack is used: since the value for the last argument is always in r0, any further arguments are stored on the stack:
args: a b c
vals: 1 2 3

stack: 1 2
r0: 3
In the disassembled bytecode this results in simply a POP before commencing with the already known argument setup:
__entry:
__0x000000[02CB]: three-args
  JMP 28

  ; function head

  ; create variable 'c' pointing to &Value
  STORE 1
  LOAD 3: Str(`c`)
  VAR 1

  ; pop &Value for variable 'b' from the stack top
  POP
  STORE 1
  LOAD 4: Str(`b`)
  VAR 1

  ; pop &Value for variable 'a' from the stack top
  POP
  STORE 1
  LOAD 5: Str(`a`)
  VAR 1

  ; function body
  LOADV 796972: $a
  BUILTIN 145: <@Some>
  LEAVE
Generating Call site Bytecode
No argument
(@function no-args [] (@None))
(no-args)
Since we don’t pass any arguments, the call is just OP_CALL and the bytecode index of the function definition, here 0:
__entry:
  ; ...
  CALL 0: <no-args>
Single Argument
(@function one-arg [a] (@Some a))
(one-arg 1)
As introduced before, a single argument can just be used from r0, thus the callsite loads the global and invokes the call, no stack interaction necessary:
__entry:
  ; ...

  LOAD 4: Int(1)
  CALL 0: <one-arg>
n Arguments
(@function two-args [a b] (@Some a))
(@function three-args [a b c] (@Some a))
(@function four-args [a b c d] (@Some a))
(two-args 1 2)
(three-args 1 2 3)
(four-args 1 2 3 4)
Like above, the last argument is always in r0, thus all other arguments will need to be on the stack. This is implemented by PUSH (or PUSHG if LOAD and PUSH are emitted consecutively) at the callsite. For each argument except the last, a PUSH is required. The instruction before CALL or BUILTIN is ARGS, instructing the vm to pop the right amount of arguments off the stack.
__entry:
  ; ...

  PUSHG 7: Int(1)
  LOAD 8: Int(2)
  ARGS 2
  CALL 0: <two-args>

  PUSHG 9: Int(1)
  PUSHG 10: Int(2)
  LOAD 11: Int(3)
  ARGS 3
  CALL 22: <three-args>

  PUSHG 12: Int(1)
  PUSHG 13: Int(2)
  PUSHG 14: Int(3)
  LOAD 15: Int(4)
  ARGS 4
  CALL 52: <four-args>
Benchmark
Warning: This is a microbenchmark done on a high-noise & high-load laptop system, so take everything here with a grain of salt. Either way, here are the specs:
System:
  Kernel: 6.11.0-24-generic arch: x86_64 bits: 64
  Desktop: i3 v: 4.23 Distro: Ubuntu 24.04.2 LTS (Noble Numbat)
Machine:
  Type: Laptop System: LENOVO product: 21F8002TGE v: ThinkPad T14s Gen 4
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64
  type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 1950 min/max: 400/5132 cores: 1: 1996 2: 1997 3: 1996
    4: 1996 5: 1996 6: 1852 7: 1848 8: 1852 9: 1852 10: 1853 11: 1996 12: 1996
    13: 1996 14: 1996 15: 1996 16: 1996
TLDR: 25k function calls with builtins in each of them (~50k calls) execute in ~4.2ms of vm runtime, which is ~15.5% of the total runtime. It also scales very well: 250k function calls (500k total calls) run in ~13ms of vm runtime.
$ head -n10 examples/bench.garden
(@function no-args [] (@None))
(@function one-arg [a] (@Some a))
(@function two-args [a b] (@Some a))
(@function three-args [a b c] (@Some a))
(@function four-args [a b c d] (@Some a))
(no-args)
(one-arg 1)
(two-args 1 2)
(three-args 1 2 3)
(four-args 1 2 3 4)
$ wc -l examples/bench.garden
25010 examples/bench.garden
$ make bench PG=examples/bench.garden
// [...]
[ 0.0000ms] main::Args_parse: Parsed arguments
[ 0.0360ms] io::IO_read_file_to_string: mmaped input of size=380261B
[ 0.0080ms] mem::init: Allocated memory block of size=612368384B
[ 3.8690ms] lexer::Lexer_all: lexed tokens count=125085
[ 5.1560ms] parser::Parser_next created AST with node_count=25010
[ 3.4340ms] cc::cc: Flattened AST to byte code/global pool length=180148/50017
[ 4.1530ms] vm::Vm_run: executed byte code
[ 3.7390ms] mem::Allocator::destroy: Deallocated memory space
[ 0.0000ms] vm::Vm_destroy: teared vm down
[ 0.0000ms] munmap: unmapped input
$ make release
$ hyperfine "./purple_garden examples/bench.garden"
Benchmark 1: ./purple_garden examples/bench.garden
Time (mean ± σ): 27.2 ms ± 2.6 ms [User: 9.1 ms, System: 15.6 ms]
Range (min … max): 19.7 ms … 32.8 ms 93 runs