Poking holes into bytecode with peephole optimisations

This article highlights the first optimisations I made while redesigning and semi-porting the runtime from C to Rust. These aren’t benchmarked or verified yet, since the virtual machine is still under construction and will probably be finished this week.

At a high level, purple-garden currently works like this, with 2+3*4-1 as an example input:

TEXT
.
+- Tokenizer
|
]: Token(2) Token(+) Token(3)
]: Token(*)
]: Token(4) Token(-) Token(1)
|
 \
  +- Parsing (Tokens -> Abstract Syntax Tree)
  |
  ]: (Asteriks
  ]:   (Plus
  ]:     Integer("2")
  ]:     Integer("3")
  ]:   )
  ]:   (Minus
  ]:     Integer("4")
  ]:     Integer("1")
  ]:   )
  ]: )
  |
  |
<planned section start>
  \
   +- Planned IR and Optimisation Boundary
   |
  / \
  |  +- JIT Compiler (IR -> x86/ARM)
  |                           ^
  |                            \
  |                             \
  |                              \
  |                               \
  |                                \
<planned section end>               |Calls
  |                                 |JIT'ed
  \                                 |functions
   +- Compiler (AST/IR -> bytecode) |
   |                                /
   ]:  __entry:                    /
   ]:          load_imm r0, #2    |
   ]:          load_imm r1, #3    |
   ]:          add r2, r0, r1     |
   ]:          load_imm r1, #4    |
   ]:          load_imm r0, #1    |
   ]:          sub r3, r1, r0     |
   ]:          mul r0, r2, r3     |
   |                              |
   \                              |
    +- Peephole Optimiser         |
    |                             |
    ]:  __entry:                  |
    ]:          load_imm r2, #5   |
    ]:          load_imm r3, #3   |
    ]:          mul r0, r2, r3    |
    |                            /
    \                           /
     +- Baseline interpreter --+
     |
     ] [vm][0000] LoadImm { dst: 2, value: 5 }
     ] [vm][0001] LoadImm { dst: 3, value: 3 }
     ] [vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
     |
     '

Peephole Optimisations

Peephole optimisations are, as the name implies, optimisations performed on a small section of a larger input. For a virtual machine like purple-garden this means using a window of size 3 (anything larger is no longer local¹ and is subject to IR optimisation rather than peephole optimisation, so it will already have happened before the peephole pass is reached in the compilation pipeline): merging operations and rewriting or removing redundant and useless ones.

So to summarize:

  • peephole optimisations are done on the bytecode in a local setting (non-global)
  • in this case they are fallback optimisations for things the IR optimisation pass missed
  • these are local, single-pass, and meant only to catch what the IR opt didn’t

purple-garden implementation

For purple-garden specifics and questions regarding the runtime, do consult:

Peephole optimisations in purple-garden are intentionally kept single-pass to keep the startup time cost as low as possible and to move heavy optimisation into the IR.

This introduces a problem: patterns that only emerge as the result of a previous optimisation are not picked up again. This is mitigated by peephole optimisations acting as a fallback for the optimisation stages that run before them.
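
The snippets below operate on the runtime's bytecode Op enum. As a rough sketch of the shape the following code assumes (the register and value types are guesses here, and the real purple-garden enum has more variants than shown):

RUST
/// Minimal sketch of the Op layout assumed below; not the actual
/// purple-garden definition
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Nop,
    Mov { dst: usize, src: usize },
    LoadImm { dst: usize, value: i64 },
    Add { dst: usize, lhs: usize, rhs: usize },
    Sub { dst: usize, lhs: usize, rhs: usize },
    Mul { dst: usize, lhs: usize, rhs: usize },
    Div { dst: usize, lhs: usize, rhs: usize },
}

The entry point of the peephole pass then walks the bytecode in overlapping windows: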

RUST
const WINDOW_SIZE: usize = 3;

/// Peephole optimisations
///
/// See:
/// - [Peephole optimization - Wikipedia]
/// (https://en.wikipedia.org/wiki/Peephole_optimization)
/// - [W. M. McKeeman "Peephole Optimization"]
/// (https://dl.acm.org/doi/epdf/10.1145/364995.365000)
pub fn bc(bc: &mut Vec<Op>) {
    // bail out on inputs shorter than the window, otherwise the
    // slice below would go out of bounds
    if bc.len() < WINDOW_SIZE {
        return;
    }

    for i in 0..=bc.len().saturating_sub(WINDOW_SIZE) {
        let window = &mut bc[i..i + WINDOW_SIZE];
        bc::const_binary(window);
        bc::self_move(window);
    }

    bc.retain(|op| !matches!(op, Op::Nop))
}

Since optimisations can rewrite, replace, and remove instructions, Op::Nop encodes removal: it replaces instructions that were optimised away. These Nops are then removed from the bytecode list after all optimisations have been applied.
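
In isolation the removal step is just a retain over the op list; a tiny illustration (the values are made up):

RUST
let mut bc = vec![
    Op::LoadImm { dst: 2, value: 5 },
    Op::Nop, // left behind by an optimisation
    Op::Nop,
    Op::Mul { dst: 0, lhs: 2, rhs: 3 },
];
bc.retain(|op| !matches!(op, Op::Nop));
// bc is now [LoadImm { dst: 2, value: 5 }, Mul { dst: 0, lhs: 2, rhs: 3 }]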

self_move

A “self move” is a mov instruction whose dst and src are the same register:

ASM
__entry:
    load_imm r0 #5
    load_imm r1 #5
    mov r1 r1 ; <-- basically NOP
    add r1 r0

If this pattern is encountered, the VM would waste processing power on running what is effectively a NOP instruction; to combat this, such moves are removed:

ASM
__entry:
    load_imm r0 #5
    load_imm r1 #5
    add r1 r0

This is achieved by iterating over the window and replacing self moves:

RUST
/// self_move removes patterns conforming to
///
///     Mov { dst: x, src: x },
///
/// where dst == src
pub fn self_move(window: &mut [Op]) {
    for op in window.iter_mut() {
        if let Op::Mov { dst, src } = op {
            if dst == src {
                *op = Op::Nop;
                opt_trace!("self_move", "removed self_moving Mov");
            }
        }
    }
}
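
Applied to a window containing the pattern from above (register numbers are illustrative):

RUST
let mut window = [
    Op::LoadImm { dst: 1, value: 5 },
    Op::Mov { dst: 1, src: 1 }, // the self move
    Op::Add { dst: 1, lhs: 1, rhs: 0 },
];
self_move(&mut window);
// window is now [LoadImm { .. }, Nop, Add { .. }];
// the Nop is dropped later by the retain in opt::bc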

const_binary

This optimisation targets a binary instruction whose lhs and rhs are both created via LoadImm directly before it in the window:

ASM
__entry:
        load_imm r0, #2
        load_imm r1, #3
        add r2, r0, r1 ; <-- this is a compile time known result := 2+3=5
        load_imm r1, #4
        load_imm r0, #1
        sub r3, r1, r0 ; <-- this is a compile time known result := 4-1=3
        mul r0, r2, r3

In this example the operands of add and sub are compile-time known (and, transitively, so is the result of mul). The optimisation itself applies to all arithmetic instructions. Thus after optimising:

ASM
__entry:
        load_imm r2, #5
        load_imm r3, #3
        mul r0, r2, r3

This looks like the optimiser pass missed the load_imm, load_imm, mul pattern, but this implementation is intentionally non-recursive: recursive constant folding is subject to IR optimisation, which comes before peephole optimisation, so this case will no longer occur once the IR and the IR optimiser are done. Overflow is currently handled silently using wrapping arithmetic; this could and probably will be changed to trigger a compile-time error in the future.
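
To make the non-recursive part concrete: feeding the already optimised bytecode through the pass a second time (purple-garden does not do this, it is shown purely as an illustration) would fold the remaining window as well:

RUST
// the output of the first pass, as shown above
let mut bc = vec![
    Op::LoadImm { dst: 2, value: 5 },
    Op::LoadImm { dst: 3, value: 3 },
    Op::Mul { dst: 0, lhs: 2, rhs: 3 },
];
// hypothetical second pass; in purple-garden this folding is left
// to the planned IR optimiser instead
opt::bc(&mut bc);
// bc is now [LoadImm { dst: 0, value: 15 }]

The implementation: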

RUST
/// const_binary fuses
///
///     LoadImm{ dst: a, value: x },
///     LoadImm{ dst: b, value: y },
///     bin { dst, lhs: a, rhs: b }
///
/// into
///
///     LoadImm { dst, value: x bin y }
///
/// where bin := Add | Sub | Mul | Div
pub fn const_binary(window: &mut [Op]) {
    let [
        Op::LoadImm { dst: a, value: x },
        Op::LoadImm { dst: b, value: y },
        op,
    ] = window
    else {
        return;
    };

    let (dst, result) = match *op {
        Op::Add { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_add(*y)),
        Op::Sub { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_sub(*y)),
        Op::Mul { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_mul(*y)),
        Op::Div { dst, lhs, rhs }
            if lhs == *a &&
                rhs == *b && *y != 0 => (dst, x.wrapping_div(*y)),
        _ => return,
    };

    window[0] = Op::LoadImm { dst, value: result };
    window[1] = Op::Nop;
    window[2] = Op::Nop;

    opt_trace!("const_binary", "fused a constant binary op");
}
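
A quick illustrative call on a single window, using the values from the running example:

RUST
let mut window = [
    Op::LoadImm { dst: 0, value: 2 },
    Op::LoadImm { dst: 1, value: 3 },
    Op::Add { dst: 2, lhs: 0, rhs: 1 },
];
const_binary(&mut window);
// window is now [LoadImm { dst: 2, value: 5 }, Nop, Nop]

Note the *y != 0 guard on Div: a constant division by zero is not folded, so deciding how to handle it stays with the VM at runtime.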

Observability

RUST
macro_rules! opt_trace {
    ($optimisation:literal, $text:literal) => {
        #[cfg(feature = "trace")]
        println!("[opt::{}]: {}", $optimisation, $text);
    };
}

The opt_trace! macro is used throughout the optimisation pipeline to provide insight into decision making and the optimisations that were performed.

RUST
opt_trace!("const_binary", "fused a constant binary op");

This results in [opt::const_binary]: fused a constant binary op when the runtime is compiled with --features=trace.
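
Because the guard is a cfg attribute, the whole println! statement is compiled out when the feature is disabled, so tracing costs nothing in a normal build. Roughly, the invocation above expands to:

RUST
#[cfg(feature = "trace")]
println!("[opt::{}]: {}", "const_binary", "fused a constant binary op");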

Integration and flag guard

Since startup time is one of the most important properties of a fast runtime (for me!), peephole optimisation is guarded behind -O1 at this time:

RUST
// all error handling is hidden
fn main() {
    // [...] tokenizing, parsing, further setup

    let mut cc = cc::Cc::new();
    cc.compile(&ast);

    if args.opt >= 1 {
        opt::bc(&mut cc.buf);
    }

    let mut vm = cc.finalize();

    // [...] other flags and running the VM
}

Running something like 2+3*4-1 through the full pipeline:

SHELL
# compile with tracing prints included (--features trace);
# args after -- go to purple-garden:
#   --ast          print the abstract syntax tree
#   --disassemble  disassemble the produced bytecode
#   -O1            set optimisation level
#   -r             specify input; if not set, just pass a file
#                  to execute as an argument instead
cargo run \
    --features trace \
    -- \
    --ast \
    --disassemble \
    -O1 \
    -r="2+3*4-1"
TEXT
(Asteriks
  (Plus
    Integer("2")
    Integer("3")
  )
  (Minus
    Integer("4")
    Integer("1")
  )
)
Cc::cc(Asteriks)
Cc::cc(Plus)
Cc::cc(Integer("2"))
RegisterAllocator::alloc(r0)
Cc::cc(Integer("3"))
RegisterAllocator::alloc(r1)
RegisterAllocator::alloc(r2)
RegisterAllocator::free(r0)
RegisterAllocator::free(r1)
Cc::cc(Minus)
Cc::cc(Integer("4"))
RegisterAllocator::alloc(r1)
Cc::cc(Integer("1"))
RegisterAllocator::alloc(r0)
RegisterAllocator::alloc(r3)
RegisterAllocator::free(r1)
RegisterAllocator::free(r0)
RegisterAllocator::alloc(r0)
RegisterAllocator::free(r2)
RegisterAllocator::free(r3)
[opt::const_binary]: fused a constant binary op
[opt::const_binary]: fused a constant binary op
__entry:
        load_imm r2, #5
        load_imm r3, #3
        mul r0, r2, r3
[vm][0000] LoadImm { dst: 2, value: 5 }
[vm][0001] LoadImm { dst: 3, value: 3 }
[vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }

Testing

To ensure correctness for the expected optimisation cases, tests are necessary. These tests validate pattern rewriting, not full program semantic equivalence.

RUST
#[cfg(test)]
mod bc {
    use crate::op::Op;

    #[test]
    fn self_move() {
        let mut bc = vec![
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
        ];
        crate::opt::bc::self_move(&mut bc);
        assert_eq!(bc, vec![Op::Nop, Op::Nop, Op::Nop])
    }

    #[test]
    fn const_binary() {
        let mut bc = vec![
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Add {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Sub {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
            Op::LoadImm { dst: 0, value: 3 },
            Op::LoadImm { dst: 1, value: 5 },
            Op::Mul {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
            Op::LoadImm { dst: 0, value: 64 },
            Op::LoadImm { dst: 1, value: 8 },
            Op::Div {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
        ];

        for i in 0..=bc.len().saturating_sub(3) {
            crate::opt::bc::const_binary(&mut bc[i..i + 3]);
        }

        bc.retain(|op| *op != Op::Nop);
        assert_eq!(
            bc,
            vec![
                Op::LoadImm { dst: 0, value: 3 },
                Op::LoadImm { dst: 0, value: -1 },
                Op::LoadImm { dst: 0, value: 15 },
                Op::LoadImm { dst: 0, value: 8 },
            ]
        )
    }
}
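
A sketch of an additional end-to-end test (not part of the suite above) that runs the opt::bc entry point and exercises both rewrites plus the Nop cleanup could look like this, placed inside the same module:

RUST
#[test]
fn bc_entry_point() {
    let mut bc = vec![
        Op::LoadImm { dst: 0, value: 2 },
        Op::LoadImm { dst: 1, value: 3 },
        Op::Add { dst: 2, lhs: 0, rhs: 1 },
        Op::Mov { dst: 2, src: 2 },
    ];
    // const_binary folds the first window, self_move drops the
    // redundant Mov, the retain strips the remaining Nops
    crate::opt::bc(&mut bc);
    assert_eq!(bc, vec![Op::LoadImm { dst: 2, value: 5 }])
}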

  1. fight me on this one, I make the rules, if WINDOW_SIZE > 3, I say that’s no longer local :^) ↩︎