This article highlights the first optimizations I made while redesigning and semi-porting the purple-garden runtime from C to Rust. These aren’t benchmarked or verified yet, since the virtual machine is still under construction and will probably be finished this week.
At a high level, purple-garden currently works like this, with `2+3*4-1` as an exemplary input:
```
.
+- Tokenizer
|
]: Token(2) Token(+) Token(3)
]: Token(*)
]: Token(4) Token(-) Token(1)
|
 \
  +- Parsing (Tokens -> Abstract Syntax Tree)
  |
  ]: (Asteriks
  ]:  (Plus
  ]:   Integer("2")
  ]:   Integer("3")
  ]:  )
  ]:  (Minus
  ]:   Integer("4")
  ]:   Integer("1")
  ]:  )
  ]: )
  |
  |
  <planned section start>
   \
    +- Planned IR and Optimisation Boundary
    |
   / \
   |  +- JIT Compiler (IR -> x86/ARM)
   |  ^
   |   \
   |    \
   |     \
   |      \
   |       \
  <planned section end>                   |Calls
   |                                      |JIT'ed
   \                                      |functions
    +- Compiler (AST/IR -> bytecode)      |
    |                                     /
    ]: __entry:                          /
    ]:   load_imm r0, #2                 |
    ]:   load_imm r1, #3                 |
    ]:   add r2, r0, r1                  |
    ]:   load_imm r1, #4                 |
    ]:   load_imm r0, #1                 |
    ]:   sub r3, r1, r0                  |
    ]:   mul r0, r2, r3                  |
    |                                    |
    \                                    |
     +- Peephole Optimiser               |
     |                                   |
     ]: __entry:                         |
     ]:   load_imm r2, #5                |
     ]:   load_imm r3, #3                |
     ]:   mul r0, r2, r3                 |
     |                                   /
     \                                  /
      +- Baseline interpreter ---------+
      |
      ] [vm][0000] LoadImm { dst: 2, value: 5 }
      ] [vm][0001] LoadImm { dst: 3, value: 3 }
      ] [vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
      |
```

Peephole Optimizations
Peephole optimisations are, as the name implies, optimisations performed on a small section of a larger input. For a virtual machine like purple-garden, this means sliding a window of size 3 over the bytecode (anything larger is no longer local[^1] and is the subject of IR optimisation, not peephole optimisation, which will already have happened by the time the peephole pass is reached in the compilation pipeline) and merging operators or rewriting redundant and useless operations.
So to summarize, peephole optimisations:
- are done on the bytecode in a local (non-global) setting
- are single-pass
- act as a fallback, catching what the IR optimisation rewriting missed
Purple garden implementation
For purple-garden specifics and questions regarding the runtime, do consult the project repository.
Peephole optimisations in purple-garden are intentionally kept single-pass to keep startup cost as low as possible and to move heavy optimisation into the IR. A single pass cannot re-optimise the results of a previous rewrite (recursive optimisation); this is acceptable because the peephole pass is only a fallback for the IR optimisation pipeline that runs before it.
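The snippets below operate on a bytecode instruction enum. As a point of reference, here is a minimal sketch of such an `Op` type that matches the field names used throughout this article; this is an assumption for illustration, not purple-garden's actual definition:

```rust
/// Minimal sketch of a bytecode instruction set matching the snippets
/// below; dst/src/lhs/rhs are register indices, value is an immediate.
/// Assumed for illustration, not purple-garden's real enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    /// placeholder for instructions that were optimised away
    Nop,
    /// dst <- value
    LoadImm { dst: usize, value: i64 },
    /// dst <- src
    Mov { dst: usize, src: usize },
    /// dst <- lhs + rhs (analogous for Sub, Mul, Div)
    Add { dst: usize, lhs: usize, rhs: usize },
    Sub { dst: usize, lhs: usize, rhs: usize },
    Mul { dst: usize, lhs: usize, rhs: usize },
    Div { dst: usize, lhs: usize, rhs: usize },
}
```

The optimisation driver iterates windows of three instructions and applies each rewrite: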
```rust
const WINDOW_SIZE: usize = 3;

/// Peephole optimisations
///
/// See:
/// - [Peephole optimization - Wikipedia]
///   (https://en.wikipedia.org/wiki/Peephole_optimization)
/// - [W. M. McKeeman "Peephole Optimization"]
///   (https://dl.acm.org/doi/epdf/10.1145/364995.365000)
pub fn bc(bc: &mut Vec<Op>) {
    // guard against slicing out of bounds for buffers
    // smaller than a single window
    if bc.len() < WINDOW_SIZE {
        return;
    }
    for i in 0..=bc.len() - WINDOW_SIZE {
        let window = &mut bc[i..i + WINDOW_SIZE];
        bc::const_binary(window);
        bc::self_move(window);
    }

    bc.retain(|op| !matches!(op, Op::Nop))
}
```

Since optimisations can rewrite, replace, and remove instructions, Op::Nop encodes removal: it replaces instructions that were optimised away, and all Nops are stripped from the bytecode list after the optimisations have been applied.
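A quick usage sketch (with the assumed Op type from above and the opt module path used later in main): running the pass over the constant add from the running example fuses it and drops the Nop placeholders:

```rust
let mut bytecode = vec![
    Op::LoadImm { dst: 0, value: 2 },
    Op::LoadImm { dst: 1, value: 3 },
    Op::Add { dst: 2, lhs: 0, rhs: 1 },
];
opt::bc(&mut bytecode);
// const_binary rewrote the window to [LoadImm, Nop, Nop],
// retain then dropped the two Nops
assert_eq!(bytecode, vec![Op::LoadImm { dst: 2, value: 5 }]);
```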
self_move
A “self move” is a mov instruction whose dst and src are equivalent:
```
__entry:
    load_imm r0 #5
    load_imm r1 #5
    mov r1 r1 ; <-- basically NOP
    add r1 r0
```

If this pattern is encountered, the VM would waste processing power on running a NOP instruction; to combat this, such movs are removed:
```
__entry:
    load_imm r0 #5
    load_imm r1 #5
    add r1 r0
```

This is achieved by iterating the window and replacing self movs with Op::Nop:
```rust
/// self_move removes patterns conforming to
///
/// Mov { dst: x, src: x },
///
/// where dst == src
pub fn self_move(window: &mut [Op]) {
    for op in window.iter_mut() {
        if let Op::Mov { dst, src } = op {
            if dst == src {
                *op = Op::Nop;
                opt_trace!("self_move", "removed self_moving Mov");
            }
        }
    }
}
```
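A small usage sketch, again with the assumed Op type:

```rust
let mut window = [
    Op::LoadImm { dst: 0, value: 5 },
    Op::Mov { dst: 1, src: 1 }, // self move, effectively a NOP
    Op::Add { dst: 1, lhs: 0, rhs: 1 },
];
self_move(&mut window);
// only the self move was replaced; the Nop sweep in bc() removes it later
assert_eq!(window[1], Op::Nop);
```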
const_binary

This optimisation targets a binary instruction whose lhs and rhs are both created via LoadImm directly beforehand within the window:
```
__entry:
    load_imm r0, #2
    load_imm r1, #3
    add r2, r0, r1 ; <-- compile time known result := 2+3=5
    load_imm r1, #4
    load_imm r0, #1
    sub r3, r1, r0 ; <-- compile time known result := 4-1=3
    mul r0, r2, r3
```

In this example add and sub are directly constant, and mul becomes constant once they are folded. The optimisation itself applies to all arithmetic instructions (Add, Sub, Mul, Div). Thus, after optimising:
```
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3
```

This looks like the optimiser pass missed the load_imm, load_imm, mul pattern, but this implementation is deliberately non-recursive: recursive constant folding is the subject of IR optimisation, which runs before peephole optimisation, so this case will no longer occur once the IR and the IR optimiser are done. Overflow is currently handled silently using wrapping arithmetic; this could and probably will be changed to trigger a compile-time error in the future.
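For illustration only: a fixpoint wrapper that reruns the single pass until the bytecode stops changing would also catch the remaining pattern. purple-garden deliberately avoids this extra work and leaves recursive folding to the planned IR optimiser; this is a hypothetical sketch, assuming the Op type from above:

```rust
// Hypothetical: rerun the single-pass optimiser until a fixpoint is
// reached. NOT part of purple-garden; recursive folding is meant to
// happen in the planned IR optimiser instead.
pub fn bc_fixpoint(bytecode: &mut Vec<Op>) {
    loop {
        let before = bytecode.clone();
        opt::bc(bytecode);
        if *bytecode == before {
            break;
        }
    }
}
```

The actual single-pass implementation: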
```rust
/// const_binary fuses
///
/// LoadImm{ dst: a, value: x },
/// LoadImm{ dst: b, value: y },
/// bin { dst, lhs: a, rhs: b }
///
/// into
///
/// LoadImm { dst, value: x bin y }
///
/// where bin := Add | Sub | Mul | Div
pub fn const_binary(window: &mut [Op]) {
    let [
        Op::LoadImm { dst: a, value: x },
        Op::LoadImm { dst: b, value: y },
        op,
    ] = window
    else {
        return;
    };

    let (dst, result) = match *op {
        Op::Add { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_add(*y)),
        Op::Sub { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_sub(*y)),
        Op::Mul { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_mul(*y)),
        Op::Div { dst, lhs, rhs }
            if lhs == *a && rhs == *b && *y != 0 => (dst, x.wrapping_div(*y)),
        _ => return,
    };

    window[0] = Op::LoadImm { dst, value: result };
    window[1] = Op::Nop;
    window[2] = Op::Nop;

    opt_trace!("const_binary", "fused a constant binary op");
}
```

Note that the Div arm additionally requires *y != 0, so a constant division by zero is simply not folded.

Observability
```rust
macro_rules! opt_trace {
    ($optimisation:literal, $text:literal) => {
        #[cfg(feature = "trace")]
        println!("[opt::{}]: {}", $optimisation, $text);
    };
}
```

The opt_trace! macro is used throughout the optimisation pipeline to enable insights into decision making and performed optimisations. For example:

```rust
opt_trace!("const_binary", "fused a constant binary op");
```

results in `[opt::const_binary]: fused a constant binary op` when the runtime is compiled with --features=trace.
Integration and flag guard
Since startup time is one of the most important parts of a fast runtime (for
me!), peephole optimisation is guarded behind -O1 at this time:
```rust
// all error handling is hidden
fn main() {
    // [...] tokenizing, parsing, further setup

    let mut cc = cc::Cc::new();
    cc.compile(&ast);

    if args.opt >= 1 {
        opt::bc(&mut cc.buf);
    }

    let mut vm = cc.finalize();

    // [...] other flags and running the VM
}
```

Running something like 2+3*4-1 through the full pipeline:
```sh
cargo run \
    # compile with tracing prints included
    --features trace \
    -- \
    # args for purple-garden

    # print the abstract syntax tree
    --ast \
    # disassemble the produced bytecode
    --disassemble \
    # set optimisation level
    -O1 \
    # specify input, if not set just
    # pass a file to execute in as an argument
    -r="2+3*4-1"
```

```
(Asteriks
  (Plus
    Integer("2")
    Integer("3")
  )
  (Minus
    Integer("4")
    Integer("1")
  )
)
Cc::cc(Asteriks)
Cc::cc(Plus)
Cc::cc(Integer("2"))
RegisterAllocator::alloc(r0)
Cc::cc(Integer("3"))
RegisterAllocator::alloc(r1)
RegisterAllocator::alloc(r2)
RegisterAllocator::free(r0)
RegisterAllocator::free(r1)
Cc::cc(Minus)
Cc::cc(Integer("4"))
RegisterAllocator::alloc(r1)
Cc::cc(Integer("1"))
RegisterAllocator::alloc(r0)
RegisterAllocator::alloc(r3)
RegisterAllocator::free(r1)
RegisterAllocator::free(r0)
RegisterAllocator::alloc(r0)
RegisterAllocator::free(r2)
RegisterAllocator::free(r3)
[opt::const_binary]: fused a constant binary op
[opt::const_binary]: fused a constant binary op
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3
[vm][0000] LoadImm { dst: 2, value: 5 }
[vm][0001] LoadImm { dst: 3, value: 3 }
[vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
```

Testing
To ensure correctness of the expected optimisation cases, tests are necessary. These tests validate pattern rewriting, not full program semantic equivalence; a possible equivalence check is sketched after the tests.
```rust
#[cfg(test)]
mod bc {
    use crate::op::Op;

    #[test]
    fn self_move() {
        let mut bc = vec![
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
        ];
        crate::opt::bc::self_move(&mut bc);
        assert_eq!(bc, vec![Op::Nop, Op::Nop, Op::Nop])
    }

    #[test]
    fn const_binary() {
        let mut bc = vec![
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Add { dst: 0, lhs: 0, rhs: 1 },
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Sub { dst: 0, lhs: 0, rhs: 1 },
            Op::LoadImm { dst: 0, value: 3 },
            Op::LoadImm { dst: 1, value: 5 },
            Op::Mul { dst: 0, lhs: 0, rhs: 1 },
            Op::LoadImm { dst: 0, value: 64 },
            Op::LoadImm { dst: 1, value: 8 },
            Op::Div { dst: 0, lhs: 0, rhs: 1 },
        ];

        for i in 0..=bc.len().saturating_sub(3) {
            crate::opt::bc::const_binary(&mut bc[i..i + 3]);
        }

        bc.retain(|op| *op != Op::Nop);
        assert_eq!(
            bc,
            vec![
                Op::LoadImm { dst: 0, value: 3 },
                Op::LoadImm { dst: 0, value: -1 },
                Op::LoadImm { dst: 0, value: 15 },
                Op::LoadImm { dst: 0, value: 8 },
            ]
        )
    }
}
```

[^1]: fight me on this one, I make the rules, if WINDOW_SIZE > 3, I say that’s no longer local :^)
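As a possible extension of the testing story, a differential test could cover semantic equivalence by executing the same program with and without the peephole pass and comparing the results. This is a hypothetical sketch: `compile` and `Vm` stand in for purple-garden's real entry points, which this article does not show:

```rust
// Hypothetical differential test, NOT part of purple-garden: run the
// same program with and without the peephole pass, results must match.
#[test]
fn optimised_matches_unoptimised() {
    let plain = compile("2+3*4-1"); // assumed helper producing Vec<Op>
    let mut optimised = plain.clone();
    crate::opt::bc(&mut optimised);
    assert_eq!(Vm::new(plain).run(), Vm::new(optimised).run());
}
```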