Strategies for very fast Lexers

In this blog post I’ll explain strategies I used to make the purple garden lexer really fast.

purple-garden is an s-expr based language I am currently developing for myself. It's my attempt at building a language I like, with a batteries-included approach, while designing it with performance in mind.

Not every approach will be feasible for your use case, architecture, and design. I tried to bring receipts for my performance claims, so watch out for these blocks at the end of chapters:

Info - Benchmark

Introduction to Lexing

A lexer (often also referred to as a tokeniser) is the easiest part of any compilation and language pipeline. The idea is to convert a list of characters into a list of tokens in which each token conveys some meaning. This list of tokens can then be used by the parser to generate an abstract syntax tree (AST), which the compiler consumes, converting it to bytecode, which the virtual machine executes.

A compilation pipeline

As an overview:

┌───────────────────┐
│                   │
│       Lexer       │ <- we are here
│                   │
└─────────┬─────────┘
          │
          │ Tokens <- and here
          │
┌─────────▼─────────┐
│                   │
│       Parser      │
│                   │
└─────────┬─────────┘
          │
          │ AST
          │
┌─────────▼─────────┐
│                   │
│      Compiler     │
│                   │
└─────────┬─────────┘
          │
          │ Bytecode
          │
┌─────────▼─────────┐
│                   │
│  Virtual Machine  │
│                   │
└───────────────────┘

For a list of characters, let's say (@std.fmt.println "my pi is: " 3.1415):

  1. Input to the lexer:

    TEXT
    (@std.fmt.println "my pi is: " 3.1415)
  2. As characters:

    TEXT
    (
    @
    s
    t
    d
    .
    f
    m
    t
    // ....
    )
  3. As pseudo tokens:

    TEXT
    (
    @std
    .
    fmt
    .
    println
    "my pi is: "
    3.1415
    )

The above is just an example; I'll go into more detail below.

Defining purple garden’s Tokens

A token is not only the set of characters it maps to; it also holds:

  • A token type, to easily distinguish between tokens
  • Positional information:
    • start point
    • end or length
    • line number

Purple garden keeps it as minimal as possible and just stores a token type and a Str:

C
 1typedef enum {
 2  // (
 3  T_DELIMITOR_LEFT = 1,
 4  // +
 5  T_PLUS = 2,
 6  // -
 7  T_MINUS = 3,
 8  // *
 9  T_ASTERISKS = 4,
10  // /
11  T_SLASH = 5,
12  // =
13  T_EQUAL = 6,
14  // )
15  T_DELIMITOR_RIGHT,
16  // [
17  T_BRAKET_LEFT,
18  // ]
19  T_BRAKET_RIGHT,
20  // anything between ""
21  T_STRING,
22  // true
23  T_TRUE,
24  // false
25  T_FALSE,
26  // floating point numbers
27  T_DOUBLE,
28  // whole numbers
29  T_INTEGER,
30  // builtins in the format @<builtin>
31  T_BUILTIN,
32  // any identifier
33  T_IDENT,
34  // end marker
35  T_EOF,
36} TokenType;
37
38typedef struct {
39  TokenType type;
40  // stores all values for T_STRING,T_IDENT,T_INTEGER and T_DOUBLE
41  Str string;
42} Token;

Architecting a minimal Lexer

As explained in A compilation pipeline, the lexer iterates over the characters in the input and attempts to match found characters to sets of characters, like keywords, numbers and symbols. For this we will need to keep some state, even though it's not much:

C
1typedef struct {
2  Str input; // <- String view over the input
3  size_t pos; // <- current position in the input
4} Lexer;

Str as an abstraction is introduced in Operating on zero copy, zero alloc string windows.

A naive approach would be:

C
 1#define SINGLE_TOK(t) ((Token){.type = t})
 2
 3Lexer Lexer_new(Str str) {
 4    return (Lexer){
 5        .input=str,
 6        .pos=0
 7    };
 8}
 9
10Token Lexer_next(Lexer *l) {
11    if (l->input.len == 0 || l->pos >= l->input.len) {
12        return SINGLE_TOK(T_EOF);
13    }
14
15    Token t;
16    switch(Str_get(l->input, l->pos)) {
17        case '+':
18            t = SINGLE_TOK(T_PLUS);
19            break;
20        case '-':
21            t = SINGLE_TOK(T_MINUS);
22            break;
23        case '*':
24            t = SINGLE_TOK(T_ASTERISKS);
25            break;
26        case '/':
27            t = SINGLE_TOK(T_SLASH);
28            break;
29        // ... things like strings, quoted symbols, numbers, comments, whitespace :^)
30        default:
31            t = SINGLE_TOK(T_EOF);
32            break;
33    }
34    l->pos++;
35    return t;
36}

And one could consume the API as follows (even with a little type-to-string lookup for debugging):

C
 1const char* TOKEN_TYPE_MAP[] = {
 2    [T_PLUS] = "T_PLUS",
 3    [T_MINUS] = "T_MINUS",
 4    [T_ASTERISKS] = "T_ASTERISKS",
 5    [T_SLASH] = "T_SLASH",
 6    [T_EOF] = "T_EOF"
 7};
 8
 9Str s = STRING("+-*/");
10Lexer l = Lexer_new(s);
11while(l.pos < l.input.len) {
12    Token t = Lexer_next(&l);
13    puts(TOKEN_TYPE_MAP[t.type]);
14}

Tip - Designated initializers

Designated initializers ([<int>] = <value>) are allowed by ISO C99, while designated initializers with ranges ([<int> ... <int>] = <value>) are a GNU extension, see 6.2.11 Designated Initializers.
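
For illustration, here is a minimal, self-contained sketch of both forms (the plain designators compile as ISO C99, the range designator needs gcc or clang), similar to what the jump tables later in this post rely on:

C
#include <stdio.h>

// ISO C99: plain designated initializers
static const char *names[4] = {
    [0] = "zero",
    [3] = "three",
};

// GNU extension: range designators
static const int is_digit_table[256] = {
    ['0' ... '9'] = 1,
};

int main(void) {
  // prints: zero three 1
  printf("%s %s %d\n", names[0], names[3], is_digit_table['5']);
  return 0;
}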

Making it fast

Let's get into some optimisations and strategies for creating a clean, minimal and especially fast lexer.

Computed gotos, or: Threaded Lexing

Threaded Lexing is a reference to Threaded code, see Wikipedia. The idea is to replace the switch statement with a jump to a block inside of the lexer. The easiest way I could think of to implement this was to:

  1. Define a jump table mapping lexeme start characters to their respective blocks

    C
     1static void *jump_table[256] = {
     2  // first bind all possible bytes to the unknown label, so there are no out
     3  // of bounds reads
     4  [0 ... 255] = &&unknown,
     5
     6  // replace all known bytes with correct jump labels
     7  ['('] = &&delimitor_left,
     8  [')'] = &&delimitor_right,
     9  ['['] = &&braket_left,
    10  [']'] = &&braket_right,
    11  ['@'] = &&builtin,
    12  ['+'] = &&plus,
    13  ['-'] = &&minus,
    14  ['/'] = &&slash,
    15  ['*'] = &&asterisks,
    16  ['='] = &&equal,
    17  [' '] = &&whitespace,
    18  ['\t'] = &&whitespace,
    19  ['\n'] = &&whitespace,
    20  [';'] = &&comment,
    21  ['.'] = &&number,
    22  ['0' ... '9'] = &&number,
    23  ['a' ... 'z'] = &&ident,
    24  ['A' ... 'Z'] = &&ident,
    25  ['\''] = &&quoted,
    26  ['_'] = &&ident,
    27  ['"'] = &&string,
    28
    29  // handle the edgecase of being at the end of the input
    30  [0] = &&end,
    31};
  2. Create a macro for jumping to the label

    C
    1#define JUMP_TARGET goto *jump_table[(uint8_t)l->input.p[l->pos]]
  3. At the start of the lexer, call the macro

    C
    1JUMP_TARGET

Putting it all together lets us build the following skeleton of a threaded lexer (the full implementation with actual token handling is at the end of this post):

C
 1size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
 2  ASSERT(out != NULL, "Failed to allocate token list");
 3
 4  // empty input
 5  if (l->input.len == 0) {
 6    return 1;
 7  }
 8
 9  static void *jump_table[256] = {
10      [0 ... 255] = &&unknown,
11      ['('] = &&delimitor_left,
12      [')'] = &&delimitor_right,
13      ['['] = &&braket_left,
14      [']'] = &&braket_right,
15      ['@'] = &&builtin,
16      ['+'] = &&plus,
17      ['-'] = &&minus,
18      ['/'] = &&slash,
19      ['*'] = &&asterisks,
20      ['='] = &&equal,
21      [' '] = &&whitespace,
22      ['\t'] = &&whitespace,
23      ['\n'] = &&whitespace,
24      [';'] = &&comment,
25      ['.'] = &&number,
26      ['0' ... '9'] = &&number,
27      ['a' ... 'z'] = &&ident,
28      ['A' ... 'Z'] = &&ident,
29      ['\''] = &&quoted,
30      ['_'] = &&ident,
31      ['"'] = &&string,
32      [0] = &&end,
33  };
34
35#define JUMP_TARGET goto *jump_table[(uint8_t)l->input.p[l->pos]]
36
37  JUMP_TARGET;
38
39delimitor_left:
40  JUMP_TARGET;
41
42delimitor_right:
43  JUMP_TARGET;
44
45braket_left:
46  JUMP_TARGET;
47
48braket_right:
49  JUMP_TARGET;
50
51builtin:
52  JUMP_TARGET;
53
54plus:
55  JUMP_TARGET;
56
57minus:
58  JUMP_TARGET;
59
60slash:
61  JUMP_TARGET;
62
63equal:
64  JUMP_TARGET;
65
66asterisks:
67  JUMP_TARGET;
68
69number:
70  JUMP_TARGET;
71
72ident:
73  JUMP_TARGET;
74
75quoted:
76  JUMP_TARGET;
77
78string:
79  JUMP_TARGET;
80
81comment:
82  JUMP_TARGET;
83
84whitespace:
85  JUMP_TARGET;
86
87unknown:
88  return 1;
89
90end:
91  return 0;
92}

This replaces the switch with an array index and a jump. The latter is significantly faster than the former due to:

  • Better code density
  • Fewer cache misses, better branch prediction

Warning

There are two downsides to this approach:

  1. It is not supported by the MSVC toolchain (clang and gcc only)
  2. It makes reading and debugging the lexer far more complicated

Abstracting allocations via the Allocator interface

I want to allow the user of purple-garden to choose between allocation strategies like my garbage collectors and a bump allocator. For this I need an interface to marry several backing implementations into a single API:

C
 1#ifndef MEM_H
 2#define MEM_H
 3
 4#include <stddef.h>
 5
 6#ifdef DEBUG
 7#if DEBUG
 8#define VERBOSE_ALLOCATOR 1
 9#endif
10#else
11#define VERBOSE_ALLOCATOR 0
12#endif
13
14// 1MB
15#define GC_MIN_HEAP 1024 * 1024
16
17typedef struct {
18  size_t current;
19  size_t allocated;
20} Stats;
21
22// CALL is used to emulate method calls by calling a METHOD on SELF with
23// SELF->ctx and __VA_ARGS__, this is useful for interface interaction, such as
24// Allocator, which reduces alloc_bump.request(alloc_bump.ctx, 64); to
25// CALL(alloc_bump, request, 64), removing the need for passing the context in
26// manually
27#if VERBOSE_ALLOCATOR
28#include <stdio.h>
29#define CALL(SELF, METHOD, ...)                                                \
30  (fprintf(stderr, "[ALLOCATOR] %s@%s::%d: %s->%s(%s)\n", __FILE__, __func__,  \
31           __LINE__, #SELF, #METHOD, #__VA_ARGS__),                            \
32   (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__))
33#else
34#define CALL(SELF, METHOD, ...) (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__)
35#endif
36
37// Allocator defines an interface abstracting different allocators, so the
38// runtime of the virtual machine does not need to know about implementation
39// details, can be used like this:
40//
41//
42//  #define ALLOC_HEAP_SIZE = 1024
43//  Allocator alloc_bump = bump_init(ALLOC_HEAP_SIZE);
44//
45//  size_t some_block_size = 16;
46//  void *some_block = alloc_bump.request(alloc_bump.ctx, some_block_size);
47//
48//  alloc_bump.destroy(alloc_bump.ctx);
49//
50typedef struct {
51  // Allocator::ctx refers to an internal allocator state and owned memory
52  // areas, for instance, a bump allocator would attach its meta data (current
53  // position, cap, etc) here
54  void *ctx;
55
56  // Allocator::stats is expected to return the current statistics of the
57  // underlying allocator
58  Stats (*stats)(void *ctx);
59  // Allocator::request returns a handle to a block of memory of size `size`
60  void *(*request)(void *ctx, size_t size);
61  // Allocator::destroy cleans state up and deallocates any owned memory areas
62  void (*destroy)(void *ctx);
63} Allocator;
64
65Allocator *bump_init(size_t size);
66Allocator *xcgc_init(size_t size, void *vm);
67
68#endif

For instance, this allocator is then passed to the virtual machine. The VM uses Allocator->request(Allocator->ctx, sizeof(Value)) to allocate a new runtime value.

I included a CALL macro to clean up this duplicated context passing:

C
1Allocator *a = bump_init(1024);
2
3// before:
4void *block = a->request(a->ctx, 512);
5a->destroy(a->ctx)
6
7// after:
8void *block = CALL(a, request, 512);
9CALL(a, destroy)

This macro also enables verbose allocation insights for debugging purposes:

TEXT
1[ALLOCATOR] vm.c@freelist_preallocate::59: fl->alloc->request(sizeof(Frame))
2[ALLOCATOR] main.c@main::245: vm.alloc->stats()
3[ALLOCATOR] main.c@main::255: pipeline_allocator->destroy()
4[ALLOCATOR] vm.c@Vm_destroy::283: vm->alloc->destroy()

The implementation of said interface is straightforward:

C
 1// Bump allocator header
 2typedef struct {
 3  // points to the start of the allocated block from which Allocator::request
 4  // will hand out aligned chunks
 5  void *block;
 6  // the size of said allocated block
 7  size_t len;
 8  // the current amount of bytes in use
 9  size_t pos;
10} BumpCtx;
11
12void *bump_request(void *ctx, size_t size) {
13  BumpCtx *b_ctx = (BumpCtx *)ctx;
14  size_t align = sizeof(void *);
15  b_ctx->pos = (b_ctx->pos + align - 1) & ~(align - 1);
16  ASSERT(b_ctx->pos + size <= b_ctx->len, "OOM :( with %zu", b_ctx->len);
17  void *block_entry = (char *)b_ctx->block + b_ctx->pos;
18  b_ctx->pos += size;
19  return block_entry;
20}
21
22void bump_destroy(void *ctx) {
23  ASSERT(ctx != NULL, "bump_destroy on already destroyed allocator");
24  BumpCtx *b_ctx = (BumpCtx *)ctx;
25  // madvise(2):
26  // The  application  no  longer requires the pages in the range
27  // specified by addr and len. The kernel can thus  free  these
28  // pages,  but  the freeing could be delayed until memory pres‐
29  // sure occurs.
30  madvise(b_ctx->block, b_ctx->len, MADV_FREE);
31  int res = munmap(b_ctx->block, b_ctx->len);
32  ASSERT(res == 0, "munmap failed");
33  free(ctx);
34}
35
36Stats bump_stats(void *ctx) {
37  BumpCtx *b_ctx = (BumpCtx *)ctx;
38  return (Stats){.allocated = b_ctx->len, .current = b_ctx->pos};
39}
40
41Allocator *bump_init(size_t size) {
42  long page_size = sysconf(_SC_PAGESIZE);
43  size = (size + page_size - 1) & ~(page_size - 1);
44
45  void *b = mmap(NULL, size, PROT_READ | PROT_WRITE,
46                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
47  ASSERT(b != MAP_FAILED, "failed to mmap allocator buffer");
48
49  BumpCtx *ctx = malloc(sizeof(BumpCtx));
50  ASSERT(ctx != NULL, "failed to alloc bump allocator context");
51  ctx->len = size;
52  ctx->pos = 0;
53  ctx->block = b;
54
55  Allocator *a = malloc(sizeof(Allocator));
56  ASSERT(a != NULL, "failed to alloc bump allocator");
57  a->ctx = (void *)ctx;
58  a->destroy = bump_destroy;
59  a->request = bump_request;
60  a->stats = bump_stats;
61
62  return a;
63}
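
A short usage sketch tying the allocator pieces together (the size here is arbitrary; bump_init rounds it up to the page size internally):

C
Allocator *a = bump_init(1 << 20); // 1 MiB

// hand out a Token-sized chunk from the bump region
Token *t = CALL(a, request, sizeof(Token));
t->type = T_EOF;

// Stats::current = bytes handed out so far, Stats::allocated = mapped size
Stats s = CALL(a, stats);

// unmap the region and free the context
CALL(a, destroy);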

Info - Benchmarks

This reduced the total runtime by around 24ms, making it 1.58x faster, see 4251b6b.

The missing separate parsing stage in the new numbers was due to me experimenting with attaching the parser to the compiler, but this made the whole thing a bit slower than just finishing parsing before compiling.

TEXT
 1mem+main+cc: replace dynamic alloc with bump allocator (~1.6x faster)
 2- preallocated 4MB for bytecode and 256KB for the global pool to improve memory management.
 3- cc now takes an allocator to omit inline memory allocations for global
 4  pool entries and bytecode creation
 5- with previous changes shaving off 24ms
 6
 7Old perf (before b59737a)
 8
 9    [    0.0070ms] main::Args_parse: Parsed arguments
10    [    0.0080ms] io::IO_read_file_to_string: mmaped input
11    [   41.2460ms] parser::Parser_run: Transformed source to AST
12    [   20.7330ms] cc::cc: Flattened AST to byte code
13    [    0.9640ms] mem::Allocator::destroy: Deallocated AST memory space
14    [    2.1270ms] vm::Vm_run: Walked and executed byte code
15    [    0.5560ms] vm::Vm_destroy: Deallocated global pool and bytecode list
16
17New:
18
19    [    0.0050ms] main::Args_parse: Parsed arguments
20    [    0.0120ms] io::IO_read_file_to_string: mmaped input
21    [   37.8600ms] cc::cc: Flattened AST to byte code
22    [    2.3540ms] vm::Vm_run: Walked and executed byte code
23    [    0.7380ms] mem::Allocator::destroy: Deallocated AST memory space
24
25    $ hyperfine "./purple_garden examples/bench.garden" "../purple_garden_old/purple_garden examples/bench.garden"
26    Benchmark 1: ./purple_garden examples/bench.garden
27      Time (mean ± σ):      42.2 ms ±   0.7 ms    [User: 28.0 ms, System: 13.8 ms]
28      Range (min … max):    41.3 ms …  45.8 ms    70 runs
29
30    Benchmark 2: ../purple_garden_old/purple_garden examples/bench.garden
31      Time (mean ± σ):      66.6 ms ±   1.1 ms    [User: 45.9 ms, System: 20.2 ms]
32      Range (min … max):    64.8 ms …  69.8 ms    43 runs
33
34    Summary
35      ./purple_garden examples/bench.garden ran
36        1.58 ± 0.04 times faster than ../purple_garden_old/purple_garden examples/bench.garden

Before that, I also made the parser use the bump allocator, see 0324051:

TEXT
 1parser: allocate with bump alloc (2x faster parse, 26x faster clean)
 2I replaced the inline allocation logic for the growing of Node.children with
 3the bump allocator from the allocator abstraction in mem::Allocator
 4(mem::bump_*). This results in 2x faster parsing and 27x faster AST cleanup:
 5
 6- [   89.4830ms/41.7840ms=02.1417x faster] parser::Parser_run: Transformed
 7  source to AST
 8- [   22.3900ms/00.8440ms=26.5284x faster] parser::Node_destroy: Deallocated
 9  AST Nodes renamed to mem::Allocator::destroy
10
11Old:
12
13    $ make bench PG=examples/bench.garden
14    [    0.0050ms] main::Args_parse: Parsed arguments
15    [    0.0110ms] io::IO_read_file_to_string: mmaped input
16    [   89.4830ms] parser::Parser_run: Transformed source to AST
17    [   18.8010ms] cc::cc: Flattened AST to byte code
18    [   22.3900ms] parser::Node_destroy: Deallocated AST Nodes
19    [    2.3200ms] vm::Vm_run: Walked and executed byte code
20    [    0.4670ms] vm::Vm_destroy: Deallocated global pool and bytecode list
21
22New:
23
24    $ make bench PG=examples/bench.garden
25    [    0.0050ms] main::Args_parse: Parsed arguments
26    [    0.0100ms] io::IO_read_file_to_string: mmaped input
27    [   41.7840ms] parser::Parser_run: Transformed source to AST
28    [   21.2160ms] cc::cc: Flattened AST to byte code
29    [    0.8440ms] mem::Allocator::destroy: Deallocated AST memory space
30    [    2.2510ms] vm::Vm_run: Walked and executed byte code
31    [    0.7590ms] vm::Vm_destroy: Deallocated global pool and bytecode list

Operating on zero copy, zero alloc string windows

Our lexer doesn't require mutable strings, C does not come with a string abstraction out of the box, and I wanted to attach metadata to strings, so I came up with a small string abstraction:

C
 1// str is a simple stack allocated wrapper around C char arrays, providing
 2// constant time length access and zero allocation+copy interactions for all
 3// methods except Str_to
 4typedef struct {
 5  // store the pointer to the underlying char
 6  const uint8_t *p;
 7  // hash of the input, do not expect it to be filled, has to be computed via
 8  // Str_hash or inline in the lexer
 9  uint64_t hash;
10  // length of the input without a zero terminator
11  size_t len;
12} Str;
C
1#define STRING(str) ((Str){.len = sizeof(str) - 1, .p = (const uint8_t *)str})
2#define STRING_EMPTY ((Str){.len = 0, .p = NULL})

Creating a Str from a C-style const char* can be done by passing it into the STRING macro; gcc can evaluate all operations inside of it at compile time. Since the view doesn't own its underlying data, it's cheap to copy and to create slices by just pointing new views into the underlying buffer.

I use this struct throughout the whole codebase, but specifically inside of the lexer to create views over the original input that point to the contents of some tokens:

C
 1// example inside of handling strings in the lexer:
 2
 3// dummy values for start and hash
 4size_t start = 3;
 5size_t hash = 0xAFFEDEAD;
 6
 7Token *t = &(Token){};
 8t->type = T_STRING;
 9t->string = (Str){
10    .p = l->input.p + start,
11    .len = l->pos - start,
12    .hash = hash,
13};

Due to the window nature of this struct I had to reimplement some things myself, such as slicing, concatenation, equality checking, hashing and converting to int64_t and double:

C
1char Str_get(const Str *str, size_t index);
2Str Str_from(const char *s);
3Str Str_slice(const Str *str, size_t start, size_t end);
4Str Str_concat(const Str *a, const Str *b, Allocator *alloc);
5bool Str_eq(const Str *a, const Str *b);
6void Str_debug(const Str *str);
7size_t Str_hash(const Str *str);
8int64_t Str_to_int64_t(const Str *str);
9double Str_to_double(const Str *str);

Let's quickly go over the interesting ones, specifically slicing and converting to other data types. Slicing is super easy, since we just move the start and the end to the new slice's start and end.

C
1Str Str_slice(const Str *str, size_t start, size_t end) {
2  ASSERT(end >= start, "Str_slice: Invalid slice range: end must be >= start");
3  ASSERT(end <= str->len, "Str_slice: Slice range exceeds string length");
4
5  return (Str){
6      .p = str->p + start,
7      .len = end - start,
8  };
9}
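
As a quick usage sketch (assuming the Str definitions above), a slice is just another view into the same backing buffer:

C
Str greeting = STRING("hello garden");
// window over "garden": no allocation, no copy
Str word = Str_slice(&greeting, 6, 12);
// word.p == greeting.p + 6 and word.len == 6
Str_debug(&word); // prints the window: garden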

Converting to int64_t is also fairly uncomplicated, since the lexer is expected to make sure all characters are in the integer set. The algorithm converts each character of the integer into its numeric value by subtracting '0' from it. The resulting digit is added to the previous result multiplied by 10, since we are working in the decimal system.

C
 1int64_t Str_to_int64_t(const Str *str) {
 2  int64_t r = 0;
 3  ASSERT(str->len > 0, "Cant convert empty string into int64_t");
 4
 5  for (size_t i = 0; i < str->len; i++) {
 6    int digit = str->p[i] - '0';
 7    ASSERT(r < (INT64_MAX - digit) / 10,
 8           "int64_t number space overflow: `%.*s`", (int)str->len, str->p)
 9    r = r * 10 + digit;
10  }
11
12  return r;
13}

Doubles are represented differently, specifically by their mantissa and exponent, requiring a slightly more sophisticated conversion algorithm. In the same vein as Str_to_int64_t, validation is already done by the lexer, to the extent of only allowing characters from .1234567890.

C
 1double Str_to_double(const Str *str) {
 2  ASSERT(str->len > 0, "Can't convert empty string into double");
 3
 4  const char *p = (const char *)str->p;
 5  size_t len = str->len;
 6
 7  uint64_t mantissa = 0;
 8  int exponent = 0;
 9  bool seen_dot = false;
10  bool has_digits = false;
11
12  // we don't check that all chars are numbers here, since the lexer already does
13  // that
14  for (size_t i = 0; i < len; i++) {
15    char c = p[i];
16
17    if (c == '.') {
18      seen_dot = true;
19      continue;
20    }
21
22    has_digits = true;
23    short digit = c - '0';
24    ASSERT(mantissa <= (UINT64_MAX - digit) / 10, "Mantissa overflow");
25    mantissa = mantissa * 10 + digit;
26    if (seen_dot) {
27      exponent -= 1;
28    }
29  }
30
31  // if there were no digits after the '.'
32  ASSERT(has_digits, "Can't parse `%.*s` into a double", (int)len, p);
33
34  double result = (double)mantissa;
35  // skip exponent computation for <mantissa>.0, since these are just the
36  // mantissa
37  if (exponent != 0) {
38    result *= pow(10.0, exponent);
39  }
40
41  return result;
42}
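
A tiny worked example of the mantissa/exponent bookkeeping above, just to make the loops concrete:

C
// "3.14" walks as: mantissa 3 -> 31 -> 314, exponent 0 -> -1 -> -2
// result: 314 * 10^-2 == 3.14
double almost_pi = Str_to_double(&STRING("3.14"));

// integers work the same way, just without the exponent:
// "1234" walks as 1 -> 12 -> 123 -> 1234
int64_t n = Str_to_int64_t(&STRING("1234"));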

Info - Benchmarks

A 1.4x speedup by using the above abstraction in a non-allocating way, see b19088a:

TEXT
 1common: make String methods 0 alloc (~1.4x faster)
 2- String_slice no longer requires a malloc, now just returns a window into its
 3  argument
 4- String_to no longer returns just the underlying pointer (String.p) but
 5  allocates a new char*
 6- new String_debug method to print the window
 7- lexer::num no longer allocates and frees for its string window, instead uses
 8  a stack buffer
 9- parser::Node_destroy no longer calls Token_destroy
10- Token_destroy no longer needed, since the lexer no longer allocates
11
12Main improvements in runtime while parsing (less allocations and frees for
13ident, string and double handling) and in cleanup (no more deallocations for
14tokens)
15
16With 250k loc of "hello world":
17
18- Parsing went from 20.9ms to 13.8ms => 1.51x faster
19- Cleanup went from 3.6ms to 0.83ms => 4.34x faster
20
21Old:
22
23    [BENCH] (T-0.0060ms): parsed arguments
24    [BENCH] (T-0.0420ms): read file to String
25    [BENCH] (T-20.8780ms): parsed input
26    [BENCH] (T-6.8080ms): compiled input
27    [BENCH] (bc=500002|globals=250001)
28    [BENCH] (T-0.3440ms): ran vm
29    [BENCH] (T-3.5960ms): destroyed Nodes, vm and input
30
31New:
32
33    [BENCH] (T-0.0060ms): parsed arguments
34    [BENCH] (T-0.0410ms): read file to String
35    [BENCH] (T-13.8280ms): parsed input
36    [BENCH] (T-7.9410ms): compiled input
37    [BENCH] (bc=500002|globals=250001)
38    [BENCH] (T-0.3490ms): ran vm
39    [BENCH] (T-0.8280ms): destroyed Nodes, vm and input

Hashing everything

I want to distinguish atoms (strings, numbers, idents) from other atoms for interning purposes and faster comparisons in the pipeline. This can be done via hashes, especially since we already “visit” each byte of said atoms while converting them to tokens. Hashing them is therefore just a matter of a few extra computations while walking the atom's underlying bytes.

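Str_hash follows the exact same pattern the lexer uses inline; a minimal FNV-1a sketch over the view's bytes (I am assuming the standard 64-bit FNV-1a constants here, the actual values live in the repo):

C
#define FNV_OFFSET_BASIS 0xcbf29ce484222325UL
#define FNV_PRIME 0x100000001b3UL

size_t Str_hash(const Str *str) {
  size_t hash = FNV_OFFSET_BASIS;
  for (size_t i = 0; i < str->len; i++) {
    // xor the byte in first, then multiply: that ordering is what makes it FNV-1a
    hash ^= str->p[i];
    hash *= FNV_PRIME;
  }
  return hash;
}
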
For instance strings: before hashing was added, the lexer just advanced until it hit the closing delimiter or EOF:

C
 1size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
 2
 3    // ...
 4
 5string: {
 6  // skip "
 7  l->pos++;
 8  size_t start = l->pos;
 9  for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {}
10
11  if (UNLIKELY(cur(l) != '"')) {
12    Str slice = Str_slice(&l->input, l->pos, l->input.len);
13    fprintf(stderr, "lex: Unterminated string near: '%.*s'", (int)slice.len,
14            slice.p);
15    out[count++] = INTERN_EOF;
16  } else {
17    Token *t = CALL(a, request, sizeof(Token));
18    t->type = T_STRING;
19    t->string = (Str){
20        .p = l->input.p + start,
21        .len = l->pos - start,
22    };
23    out[count++] = t;
24    // skip "
25    l->pos++;
26  }
27  JUMP_TARGET;
28}
29
30    // ...
31
32}

Adding hashing to this is fairly easy:

DIFF
 1diff --git a/lexer.c b/lexer.c
 2index 316a494..2280a6b 100644
 3--- a/lexer.c
 4+++ b/lexer.c
 5@@ -286,10 +286,7 @@ string: {
 6   // skip "
 7   l->pos++;
 8   size_t start = l->pos;
 9+  size_t hash = FNV_OFFSET_BASIS;
10   for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {
11+    hash ^= cc;
12+    hash *= FNV_PRIME;
13   }
14
15   if (UNLIKELY(cur(l) != '"')) {
16@@ -303,7 +300,6 @@ string: {
17     t->string = (Str){
18         .p = l->input.p + start,
19         .len = l->pos - start,
20+        .hash = hash,
21     };
22     out[count++] = t;
23     // skip "

Numbers, identifiers and builtins are also hashed; I omitted this here for clarity, and we will revisit the topic later in this article.

Interning Tokens

Most of the tokens we will encounter are constant; we know:

  • their size
  • their type

We can use this knowledge to statically allocate these tokens and hand out pointers to them, reducing memory pressure:

C
 1// we can "intern" these, since all of them are the same, regardless of position
 2Token *INTERN_DELIMITOR_LEFT = &SINGLE_TOK(T_DELIMITOR_LEFT);
 3Token *INTERN_DELIMITOR_RIGHT = &SINGLE_TOK(T_DELIMITOR_RIGHT);
 4Token *INTERN_BRAKET_LEFT = &SINGLE_TOK(T_BRAKET_LEFT);
 5Token *INTERN_BRAKET_RIGHT = &SINGLE_TOK(T_BRAKET_RIGHT);
 6Token *INTERN_MINUS = &SINGLE_TOK(T_MINUS);
 7Token *INTERN_PLUS = &SINGLE_TOK(T_PLUS);
 8Token *INTERN_ASTERISKS = &SINGLE_TOK(T_ASTERISKS);
 9Token *INTERN_SLASH = &SINGLE_TOK(T_SLASH);
10Token *INTERN_FALSE = &SINGLE_TOK(T_FALSE);
11Token *INTERN_TRUE = &SINGLE_TOK(T_TRUE);
12Token *INTERN_EQUAL = &SINGLE_TOK(T_EQUAL);
13Token *INTERN_EOF = &SINGLE_TOK(T_EOF);
14
15// size_t Lexer_all(Lexer *l, Allocator *a, Token **out)

SINGLE_TOK is just:

C
1#define SINGLE_TOK(t) ((Token){.type = t})
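
A nice side effect (sketched here, not necessarily how the parser in the repo does it): since interned tokens are singletons, later stages can compare them by pointer identity, assuming out is the Token** filled by Lexer_all:

C
Token *t = out[i]; // i is just some index into the lexed tokens
// pointer comparison is enough, no need to look at t->type
if (t == INTERN_DELIMITOR_LEFT) {
  // start of an s-expression
}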

Prehashing keywords for comparisons

As introduced in the previous chapters, all identifiers are hashed; thus we can also hash the known keywords at startup and make comparing them very fast.

Let's take a look at true and false: both are known keywords, and we will need to compare found identifiers to them.

C
 1ident: {
 2  size_t start = l->pos;
 3  size_t hash = FNV_OFFSET_BASIS;
 4  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
 5    hash ^= cc;
 6    hash *= FNV_PRIME;
 7  }
 8
 9  size_t len = l->pos - start;
10  Token *t;
11
12  // comparing to the keywords is now just a number comparison
13  if (hash == true_hash) {
14    t = INTERN_TRUE;
15  } else if (hash == false_hash) {
16    t = INTERN_FALSE;
17  } else {
18    t = CALL(a, request, sizeof(Token));
19    t->type = T_IDENT;
20    t->string = (Str){
21        .p = l->input.p + start,
22        .len = len,
23        .hash = hash,
24    };
25  }
26  out[count++] = t;
27  JUMP_TARGET;
28}

Both true_hash and false_hash are computed at startup of Lexer_all:

C
 1size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
 2  ASSERT(out != NULL, "Failed to allocate token list");
 3
 4  // empty input
 5  if (l->input.len == 0) {
 6    out[0] = INTERN_EOF;
 7    return 1;
 8  }
 9
10  size_t true_hash = Str_hash(&STRING("true"));
11  size_t false_hash = Str_hash(&STRING("false"));
12
13  // [...]
14}

Tip - is_alphanum performance deep dive

I know this function shouldn’t be called is_alphanum because it allows [A-Za-z0-9_-]

A naive check of is_alphanum can be:

C
1bool is_alphanum(char cc) {
2    return (cc >= 'a' && cc <= 'z') || (cc >= 'A' && cc <= 'Z')
3        || (cc >= '0' && cc <= '9') || cc == '_' || cc == '-';
4}

We know we can omit the uppercase check by converting the character to its lowercase representation, so let's fold the character's case, since ASCII upper- and lowercase characters only differ by a single bit:

C
1bool is_alphanum(char cc) {
2  uint8_t lower = cc | 0x20;
3  bool is_alpha = (lower >= 'a' && lower <= 'z');
4  bool is_digit = (cc >= '0' && cc <= '9');
5  return is_alpha || is_digit || cc == '_' || cc == '-';
6}

In benchmarks I was able to measure that inlining and a uint8_t parameter type have a reproducible impact, reducing the runtime by 1-5% for identifier-heavy inputs, so I marked the function as “private” static inline:

C
1__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
2  uint8_t lower = cc | 0x20;
3  bool is_alpha = (lower >= 'a' && lower <= 'z');
4  bool is_digit = (cc >= '0' && cc <= '9');
5  return is_alpha || is_digit || cc == '_' || cc == '-';
6}

There are some other ways that could be more efficient, but I haven’t benchmarked these:

  1. statically allocated lookup table like:

    C
     1static const bool is_alphanum_lookup[128] = {
     2    ['0' ... '9'] = true,
     3    ['A' ... 'Z'] = true,
     4    ['a' ... 'z'] = true,
     5    ['_'] = true,
     6    ['-'] = true,
     7};
     8__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
     9    return cc < 128 && is_alphanum_lookup[cc];
    10}
  2. weird bit sets:

    I don’t fully understand this one and it sucks to read, so no thanks

    C
    1static const uint64_t table1 = 0x03ff000000000000;
    2static const uint64_t table2 = 0x07fffffe07fffffe;
    3
    4 __attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
    5    if (cc >= 128) return false;
    6    return cc < 64 ? (table1 >> cc) & 1 : (table2 >> (cc - 64)) & 1;
    7}

On demand double and int64_t parsing

Let's revisit hashing numbers: as introduced before, all atoms are hashed, and I can use this hash while interning. This way the compiler converts each set of duplicated integers and doubles into their numerical representation only once.

The lexer therefore only needs to store the window of the atom's input, while verifying that all characters in this window are valid arguments for Str_to_double and Str_to_int64_t. In a later chapter I'll show the compiler doing on demand parsing with the token we created here in the lexer.

C
 1number: {
 2  size_t start = l->pos;
 3  size_t i = start;
 4  bool is_double = false;
 5  size_t hash = FNV_OFFSET_BASIS;
 6  for (; i < l->input.len; i++) {
 7    char cc = l->input.p[i];
 8    hash ^= cc;
 9    hash *= FNV_PRIME;
10    if (cc >= '0' && cc <= '9')
11      continue;
12    if (cc == '.') {
13      ASSERT(!is_double, "Two dots in double");
14      is_double = true;
15      continue;
16    }
17    break;
18  }
19
20  l->pos = i;
21  Token *n = CALL(a, request, sizeof(Token));
22  n->string = (Str){
23      .p = l->input.p + start,
24      .len = i - start,
25      .hash = hash,
26  };
27  if (is_double) {
28    n->type = T_DOUBLE;
29  } else {
30    n->type = T_INTEGER;
31  }
32
33  out[count++] = n;
34  JUMP_TARGET;
35}

Tip - Distinguishing between integers and doubles

At first, purple garden's runtime represented all numbers as doubles. After benchmarking, I found out that integer math is a lot faster, so it makes a lot more sense to store all non-floating-point numbers as integers. As general advice: always benchmark.

Extra: memory mapping the input

Normally, one would consume a file in C by:

  1. open file descriptor: fopen
  2. fread the buffer into a malloced space
  3. zero terminate
  4. close file: fclose
C
 1// https://stackoverflow.com/questions/14002954/c-how-to-read-an-entire-file-into-a-buffer
 2#include <stdio.h>
 3#include <stdlib.h>
 4
 5int main(void) {
 6    FILE *f = fopen("textfile.txt", "rb");
 7    fseek(f, 0, SEEK_END);
 8    long fsize = ftell(f);
 9    fseek(f, 0, SEEK_SET);
10
11    char *string = malloc(fsize + 1);
12    fread(string, fsize, 1, f);
13    fclose(f);
14
15    string[fsize] = 0;
16    free(string);
17    return EXIT_SUCCESS;
18}

However, you can also do the whole thing a lot faster by instructing the kernel to map the whole file into our virtual memory via mmap (just not walking the file twice is already faster).

After opening the file (opening a file with O_RDONLY and mapping it with PROT_READ can be faster than making it mutable), we need its type (we don't want to open or map a directory) and its size (the API wants a mapping block size). fstat helps us by filling a struct with metadata containing exactly the info we need:

C
 1// taken from https://www.commandlinux.com/man-page/man2/fstat.2.html
 2struct stat {
 3    dev_t     st_dev;     /* ID of device containing file */
 4    ino_t     st_ino;     /* inode number */
 5    mode_t    st_mode;    /* protection */
 6    nlink_t   st_nlink;   /* number of hard links */
 7    uid_t     st_uid;     /* user ID of owner */
 8    gid_t     st_gid;     /* group ID of owner */
 9    dev_t     st_rdev;    /* device ID (if special file) */
10    off_t     st_size;    /* total size, in bytes */
11    blksize_t st_blksize; /* blocksize for file system I/O */
12    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
13    time_t    st_atime;   /* time of last access */
14    time_t    st_mtime;   /* time of last modification */
15    time_t    st_ctime;   /* time of last status change */
16};

I use S_ISREG to check if the handle refers to a regular file. After mmapping with the size stored in stat.st_size, I do some bookkeeping; see the combined snippet below:

C
 1#include <fcntl.h>
 2#include <sys/mman.h>
 3#include <sys/stat.h>
 4#include <unistd.h>
 5
 6#include "common.h"
 7#include "io.h"
 8
 9Str IO_read_file_to_string(char *path) {
10  ASSERT(path != NULL, "path was NULL");
11
12  int fd = open(path, O_RDONLY);
13  ASSERT(fd != -1, "failed to open input file");
14
15  struct stat s;
16  fstat(fd, &s);
17  ASSERT(S_ISREG(s.st_mode), "path is not a file");
18
19  long length = s.st_size;
20  if (length < 0) {
21    close(fd);
22    ASSERT(length > 0, "input is empty")
23  }
24
25  char *buffer = 0;
26  if (length != 0) {
27    buffer = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
28  }
29
30  ASSERT(close(fd) == 0, "failed to close file");
31  ASSERT(buffer != MAP_FAILED, "failed to mmap input")
32  return (Str){.len = length, .p = (const uint8_t *)buffer};
33}

Info - Benchmarks

In my benchmark this made the stage before even starting lexing 6x-35x faster, see a2cae88:

Please excuse the ugly debug info, I have reworked it since then. Also, lexing and parsing were a single stage back then.

TEXT
 1io: use mmap to make IO_read_file_to_string 6x-35x faster
 2For 5k lines with 4 atoms each (20k atoms), the initial string read was
 3reduced from 0.4390ms to 0.0730ms (6x faster):
 4
 5Old:
 6    [BENCH] (T-0.1620ms): parsed arguments
 7    [BENCH] (T-0.4390ms): read file to String
 8    [BENCH] (T-10.2020ms): parsed input
 9    [BENCH] (T-1.2820ms): compiled input
10    [BENCH] (bc=40008|globals=20004)
11    [BENCH] (T-0.1620ms): ran vm
12    [BENCH] (T-0.6190ms): destroyed Nodes, vm and input
13
14New:
15    [BENCH] (T-0.1510ms): parsed arguments
16    [BENCH] (T-0.0730ms): read file to String
17    [BENCH] (T-10.1350ms): parsed input
18    [BENCH] (T-1.3210ms): compiled input
19    [BENCH] (bc=40008|globals=20004)
20    [BENCH] (T-0.1710ms): ran vm
21    [BENCH] (T-0.6460ms): destroyed Nodes, vm and input
22
23For larger files, such as 250k lines with 4 atoms each (1mio atoms), the
24initial string read was reduced from 3.472ms to 0.0980ms (35x faster):
25
26Old:
27    [BENCH] (T-0.1430ms): parsed arguments
28    [BENCH] (T-3.4720ms): read file to String
29    [BENCH] (T-434.8770ms): parsed input
30    [BENCH] (T-30.7538ms): compiled input
31    [BENCH] (bc=2040408|globals=1020204)
32    [BENCH] (T-7.5610ms): ran vm
33    [BENCH] (T-37.2170ms): destroyed Nodes, vm and input
34
35New:
36    [BENCH] (T-0.1490ms): parsed arguments
37    [BENCH] (T-0.0980ms): read file to String
38    [BENCH] (T-437.4770ms): parsed input
39    [BENCH] (T-30.8820ms): compiled input
40    [BENCH] (bc=2040408|globals=1020204)
41    [BENCH] (T-7.4540ms): ran vm
42    [BENCH] (T-36.9500ms): destroyed Nodes, vm and input

Consuming numeric tokens in the compiler

As introduced in On demand double and int64_t parsing, the lexer does not perform string-to-number conversions, but rather stores a hash and a window of said string. The compiler converts any token with a given hash only once and points all duplicates at the global pool index of that number.

The compiler itself will probably be the topic of a future blog article, but I kept it simple at this time.

token_to_value is called for all unique (not encountered before and thus not interned) atoms:

C
 1// token_to_value converts tokens, such as strings, idents and numbers to
 2// runtime values
 3inline static Value *token_to_value(Token *t, Allocator *a) {
 4  Value *v = CALL(a, request, sizeof(Value));
 5  switch (t->type) {
 6  case T_STRING:
 7  case T_IDENT:
 8    v->type = V_STR;
 9    v->string = t->string;
10    break;
11  case T_INTEGER:
12    v->type = V_INT;
13    v->integer = Str_to_int64_t(&t->string);
14    break;
15  case T_DOUBLE:
16    v->type = V_DOUBLE;
17    v->floating = Str_to_double(&t->string);
18    break;
19  default:
20    ASSERT(0, "Unsupported value for token_to_value");
21    break;
22  }
23  return v;
24}

Note the missing cases for T_FALSE and T_TRUE? They are omitted because there are hard-coded entries 0 and 1 in the global pool (@None is handled the same way, it's bound to index 2).
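
To make the "convert only once" idea concrete, here is a self-contained toy sketch of hash-based interning, separate from the actual compiler code (the bucket count, strtod, and the fixed-size table are all simplifications for the example):

C
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// toy interning table: maps an atom's hash to a global pool slot, parsing the
// string representation only on the first occurrence (open addressing)
#define BUCKETS 64

typedef struct {
  uint64_t hash;
  int used;
  size_t pool_idx;
} Bucket;

static Bucket table[BUCKETS];
static double pool[BUCKETS];
static size_t pool_len;

static size_t intern_double(uint64_t hash, const char *repr) {
  size_t i = hash % BUCKETS;
  while (table[i].used && table[i].hash != hash)
    i = (i + 1) % BUCKETS;
  if (!table[i].used) {
    // first occurrence: do the (comparatively expensive) conversion once
    pool[pool_len] = strtod(repr, NULL);
    table[i] = (Bucket){.hash = hash, .used = 1, .pool_idx = pool_len++};
  }
  // duplicates only cost the lookup above
  return table[i].pool_idx;
}

int main(void) {
  // the duplicated atom maps to the same pool entry without being reparsed
  printf("%zu %zu %zu\n", intern_double(42, "3.1415"),
         intern_double(7, "2.71"), intern_double(42, "3.1415"));
  return 0;
}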

Info - Benchmarks

This resulted in a crazy 15ms (64%) faster total runtime for number- and duplicate-heavy test inputs, see a55a190.

TEXT
 1lexer+cc: move Str_to_(double|int64_t) parsing from lexer to cc
 2This change omits all integer and number parsing from the pipeline but
 3the first occurence of each unique integer or number by storing a hash
 4of the string representation of said values. At the interning stage in
 5the compiler only the first occurence of any hash of a double or integer
 6is parsed via Str_to_int64_t or Str_to_double, which reduces the
 7theoretically workload for any duplicated number of integers and doubles
 8from N to 1.
 9
10For a double and integer heavy benchmark (250k loc with 250k duplicated
11doubles and integers) results in:
12
13    - 15ms faster
14    - 64% faster
15    - ~2.8x faster
16
17Prev commit:
18    ./build/bench +V examples/bench.garden
19    [    0.0000ms] main::Args_parse: Parsed arguments
20    [    0.0120ms] io::IO_read_file_to_string: mmaped input of size=2500090B
21    [    0.0050ms] mem::init: Allocated memory block of size=153092096B
22    [   23.8300ms] lexer::Lexer_all: lexed tokens count=1000033
23    [   12.5190ms] parser::Parser_next created AST with node_count=250003
24    [    9.2090ms] cc::cc: Flattened AST to byte code/global pool length=1500048/4
25    [   36.3060ms] vm::Vm_run: executed byte code
26    [    0.3730ms] mem::Allocator::destroy: Deallocated memory space
27    [    0.0410ms] vm::Vm_destroy: teared vm down
28    [    0.0000ms] munmap: unmapped input
29
30New:
31    ./build/bench +V examples/bench.garden
32    [    0.0000ms] main::Args_parse: Parsed arguments
33    [    0.0170ms] io::IO_read_file_to_string: mmaped input of size=2500090B
34    [    0.0060ms] mem::init: Allocated memory block of size=153092096B
35    [    8.5270ms] lexer::Lexer_all: lexed tokens count=1000033
36    [   12.2070ms] parser::Parser_next created AST with node_count=250003
37    [    9.4020ms] cc::cc: Flattened AST to byte code/global pool length=1500048/4
38    [   36.9900ms] vm::Vm_run: executed byte code
39    [    0.3960ms] mem::Allocator::destroy: Deallocated memory space
40    [    0.0480ms] vm::Vm_destroy: teared vm down
41    [    0.0010ms] munmap: unmapped input

Benchmark

So I created what I would consider a fairly heavy lexer benchmark:

SCHEME
 1(@Some (@Some (@Some (@None))))
 2true false true false
 33.1415 22222222222 .12345
 4"string me this, string me that"
 5'quoted-strings-is-a-must-do
 6(@let unquoted-strings-are-just-idents (@None))
 7unquoted-strings-are-just-idents
 8(@None) (+) (-) (*) (/) (=)
 9;; COMMENT COMMENT COMMENT
10;; COMMENT COMMENT COMMENT
11;; COMMENT COMMENT COMMENT with whitespace for 3 lines
12
13
14
15;; whitespace end

And I typed VggyG66666p to fill 1 million lines (1,000,005).

On a Laptop

Terminal
$ inxi -CMD
System:
  Host: ************* Kernel: 6.11.0-28-generic arch: x86_64 bits: 64
  Desktop: i3 v: 4.23 Distro: Ubuntu 24.04.2 LTS (Noble Numbat)
Machine:
  Type: Laptop System: LENOVO product: 21F8002TGE v: ThinkPad T14s Gen 4
    serial: <superuser required>
  Mobo: LENOVO model: 21F8002TGE v: SDK0T76530 WIN
    serial: <superuser required> UEFI: LENOVO v: R2EET41W (1.22 )
    date: 09/22/2024
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64
    type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 883 min/max: 400/5132 cores: 1: 1388 2: 400 3: 1396
    4: 400 5: 400 6: 400 7: 1374 8: 400 9: 1331 10: 400 11: 1357 12: 400
    13: 1357 14: 1346 15: 1393 16: 400

With the above components.

Terminal
$ make bench
./build/bench +V examples/bench.garden
[    0.0000ms] main::Args_parse: Parsed arguments
[    0.0150ms] io::IO_read_file_to_string: mmaped input of size=25466794B
[    0.0060ms] mem::init: Allocated memory block of size=929537981B
[   43.9190ms] lexer::Lexer_all: lexed tokens count=3133350
[   48.8460ms] parser::Parser_next created AST with node_count=1200006
[   18.2070ms] cc::cc: Flattened AST to byte code/global pool length=2666680/8
[    8.9970ms] vm::Vm_run: executed byte code
[   26.7470ms] mem::Allocator::destroy: Deallocated memory space
[    1.0180ms] vm::Vm_destroy: teared vm down
[    0.0000ms] munmap: unmapped input

I can confidently say the lexer does a million lines, or 25,466,794 bytes, in 44ms. Let's do some math:

$$ \begin{align} 44 \textrm{ms} &\triangleq 25,466,794 \mathrm{B} \\ 1 \textrm{ms} &\triangleq 578,790.77 \mathrm{B} \\ 1000 \textrm{ms} &\triangleq 578,790,773 \mathrm{B} \\ &= \underline{578.79 \mathrm{MB}/\textrm{s}} \end{align} $$

In tokens:

$$ \begin{align} 44 \textrm{ms} &\triangleq 3,133,350 \mathrm{T} \\ 1 \textrm{ms} &\triangleq 71212.5 \mathrm{T} \\ 1000 \textrm{ms} &\triangleq 71,212,500 \mathrm{T} \\ &= \underline{71,212,500 \mathrm{T}/\textrm{s}} \end{align} $$

That’s pretty fast, but SIMD can probably do a lot better at this point. However, I haven’t started that experiment yet.

On a Tower

Terminal
$ inxi -CMD
System:
  Host: comfyputer Kernel: 6.15.4-arch2-1 arch: x86_64
    bits: 64
  Desktop: i3 v: 4.24 Distro: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME B450-PLUS
    v: Rev X.0x serial: <superuser required>
    UEFI: American Megatrends v: 2008 date: 12/06/2019
CPU:
  Info: 8-core model: AMD Ryzen 7 3700X bits: 64
    type: MT MCP cache: L2: 4 MiB
  Speed (MHz): avg: 4052 min/max: 2200/4979 cores:
    1: 4052 2: 4052 3: 4052 4: 4052 5: 4052 6: 4052
    7: 4052 8: 4052 9: 4052 10: 4052 11: 4052 12: 4052
    13: 4052 14: 4052 15: 4052 16: 4052

So we are around 14ms faster on my tower.

TEXT
 1./build/bench +V examples/bench.garden
 2[    0.0000ms] main::Args_parse: Parsed arguments
 3[    0.0070ms] io::IO_read_file_to_string: mmaped input of size=25400127B
 4[    0.0030ms] mem::init: Allocated memory block of size=927104635B
 5[   30.9930ms] lexer::Lexer_all: lexed tokens count=3133350
 6[   22.6340ms] parser::Parser_next created AST with node_count=1200006
 7[   10.1480ms] cc::cc: Flattened AST to byte code/global pool length=2666680/8
 8[    7.4800ms] vm::Vm_run: executed byte code
 9[    0.7520ms] mem::Allocator::destroy: Deallocated memory space
10[    0.0620ms] vm::Vm_destroy: teared vm down
11[    0.0000ms] munmap: unmapped input

The same math as above, just with 30ms instead of 44ms:

$$ \begin{align} 30 \textrm{ms} &\triangleq 25,466,749 \mathrm{B} \\ 1 \textrm{ms} &\triangleq 848,891.633333 \mathrm{B} \\ 1000 \textrm{ms} &\triangleq 848,891,633.333 \mathrm{B} \\ &= \underline{848.89 \mathrm{MB}/\textrm{s}} \end{align} $$

In tokens:

$$ \begin{align} 30 \textrm{ms} &\triangleq 3,133,350 \mathrm{T} \\ 1 \textrm{ms} &\triangleq 104,445 \mathrm{T} \\ 1000 \textrm{ms} &\triangleq 104,445,000 \mathrm{T} \\ &= \underline{104,445,000 \mathrm{T}/\textrm{s}} \end{align} $$

Benchmark contexts

For a C input of 7.5 million lines of code, which is of course more complex to tokenize than my language, see Some Strategies For Fast Lexical Analysis when Parsing Programming Languages. The following numbers are available there, and I added the performance of the purple-garden lexer on a 7.5-million-line lexer-heavy benchmark input.

lexer                         performance
flex (default)                13.56 s
stb_lex (w/symbol hashing)     4.84 s
stb_lex                        4.23 s
flex -F (fast)                 3.07 s
flex -f (full)                 2.92 s
handcoded                      2.45 s
handcoded mmap                 2.14 s
wc                             1.73 s
purple-garden (laptop)         0.308 s
purple-garden (tower)          0.150 s

What next

A summary of what I implemented in this article:

  • Jump table for direct threading
  • 0 copy and window based tokens
  • interned and stateless tokens
  • bump allocator for unique tokens
  • inline hashing for atoms that need it (strings, idents, numeric)
  • fast paths for true and false

While 580-848 MB/s is already pretty fast, I want to go further. Some things I have planned:

  • use the absurd bit set based is_alphanum checks
  • use SIMD for comments and whitespace
  • use SIMD as a preprocessing step to find markers for tokens 16 bytes at a time
  • replace FNV-1a with a faster hashing algorithm, something like xxHash
  • prefetch some amount of bytes to reduce L1 & L2 latency
  • mmap larger inputs with MAP_HUGETLB
  • maybe align mmap to 64 byte boundaries for SIMD

Putting it all together

C
  1#include "lexer.h"
  2#include "common.h"
  3#include "mem.h"
  4#include "strings.h"
  5#include <stddef.h>
  6#include <stdint.h>
  7#include <stdio.h>
  8#include <stdlib.h>
  9
 10#define SINGLE_TOK(t) ((Token){.type = t})
 11
 12Str TOKEN_TYPE_MAP[] = {[T_DELIMITOR_LEFT] = STRING("T_DELIMITOR_LEFT"),
 13                        [T_DELIMITOR_RIGHT] = STRING("T_DELIMITOR_RIGHT"),
 14                        [T_BRAKET_LEFT] = STRING("T_BRAKET_LEFT"),
 15                        [T_BRAKET_RIGHT] = STRING("T_BRAKET_RIGHT"),
 16                        [T_STRING] = STRING("T_STRING"),
 17                        [T_TRUE] = STRING("T_TRUE"),
 18                        [T_FALSE] = STRING("T_FALSE"),
 19                        [T_DOUBLE] = STRING("T_DOUBLE"),
 20                        [T_INTEGER] = STRING("T_INTEGER"),
 21                        [T_BUILTIN] = STRING("T_BUILTIN"),
 22                        [T_IDENT] = STRING("T_IDENT"),
 23                        [T_PLUS] = STRING("T_PLUS"),
 24                        [T_MINUS] = STRING("T_MINUS"),
 25                        [T_ASTERISKS] = STRING("T_ASTERISKS"),
 26                        [T_SLASH] = STRING("T_SLASH"),
 27                        [T_EQUAL] = STRING("T_EQUAL"),
 28                        [T_EOF] = STRING("T_EOF")};
 29
 30Lexer Lexer_new(Str input) {
 31  return (Lexer){
 32      .input = input,
 33      .pos = 0,
 34  };
 35}
 36
 37#define cur(L) (L->input.p[L->pos])
 38
 39__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
 40  uint8_t lower = cc | 0x20;
 41  bool is_alpha = (lower >= 'a' && lower <= 'z');
 42  bool is_digit = (cc >= '0' && cc <= '9');
 43  return is_alpha || is_digit || cc == '_' || cc == '-';
 44}
 45
 46// we can "intern" these, since all of them are the same, regardless of position
 47Token *INTERN_DELIMITOR_LEFT = &SINGLE_TOK(T_DELIMITOR_LEFT);
 48Token *INTERN_DELIMITOR_RIGHT = &SINGLE_TOK(T_DELIMITOR_RIGHT);
 49Token *INTERN_BRAKET_LEFT = &SINGLE_TOK(T_BRAKET_LEFT);
 50Token *INTERN_BRAKET_RIGHT = &SINGLE_TOK(T_BRAKET_RIGHT);
 51Token *INTERN_MINUS = &SINGLE_TOK(T_MINUS);
 52Token *INTERN_PLUS = &SINGLE_TOK(T_PLUS);
 53Token *INTERN_ASTERISKS = &SINGLE_TOK(T_ASTERISKS);
 54Token *INTERN_SLASH = &SINGLE_TOK(T_SLASH);
 55Token *INTERN_FALSE = &SINGLE_TOK(T_FALSE);
 56Token *INTERN_TRUE = &SINGLE_TOK(T_TRUE);
 57Token *INTERN_EQUAL = &SINGLE_TOK(T_EQUAL);
 58Token *INTERN_EOF = &SINGLE_TOK(T_EOF);
 59
 60size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
 61  ASSERT(out != NULL, "Failed to allocate token list");
 62
 63  // empty input
 64  if (l->input.len == 0) {
 65    out[0] = INTERN_EOF;
 66    return 1;
 67  }
 68
 69  size_t true_hash = Str_hash(&STRING("true"));
 70  size_t false_hash = Str_hash(&STRING("false"));
 71
 72  size_t count = 0;
 73  static void *jump_table[256] = {
 74      [0 ... 255] = &&unknown,
 75      [' '] = &&whitespace,
 76      ['\t'] = &&whitespace,
 77      ['\n'] = &&whitespace,
 78      [';'] = &&comment,
 79      ['('] = &&delimitor_left,
 80      [')'] = &&delimitor_right,
 81      ['@'] = &&builtin,
 82      ['.'] = &&number,
 83      ['0' ... '9'] = &&number,
 84      ['a' ... 'z'] = &&ident,
 85      ['A' ... 'Z'] = &&ident,
 86      ['_'] = &&ident,
 87      ['\''] = &&quoted,
 88      ['"'] = &&string,
 89      ['+'] = &&plus,
 90      ['-'] = &&minus,
 91      ['/'] = &&slash,
 92      ['*'] = &&asterisks,
 93      ['='] = &&equal,
 94      ['['] = &&braket_left,
 95      [']'] = &&braket_right,
 96      [0] = &&end,
 97  };
 98
 99#define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]
100
101  JUMP_TARGET;
102
103delimitor_left:
104  out[count++] = INTERN_DELIMITOR_LEFT;
105  l->pos++;
106  JUMP_TARGET;
107
108delimitor_right:
109  out[count++] = INTERN_DELIMITOR_RIGHT;
110  l->pos++;
111  JUMP_TARGET;
112
113braket_left:
114  out[count++] = INTERN_BRAKET_LEFT;
115  l->pos++;
116  JUMP_TARGET;
117
118braket_right:
119  out[count++] = INTERN_BRAKET_RIGHT;
120  l->pos++;
121  JUMP_TARGET;
122
123builtin: {
124  l->pos++;
125  // not an ident after @, this is shit
126  if (!is_alphanum(cur(l))) {
127    out[count++] = INTERN_EOF;
128  }
129  size_t start = l->pos;
130  size_t hash = FNV_OFFSET_BASIS;
131  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
132    hash ^= cc;
133    hash *= FNV_PRIME;
134  }
135
136  size_t len = l->pos - start;
137  Str s = (Str){
138      .p = l->input.p + start,
139      .len = len,
140      .hash = hash,
141  };
142  Token *b = CALL(a, request, sizeof(Token));
143  b->string = s;
144  b->type = T_BUILTIN;
145  out[count++] = b;
146  JUMP_TARGET;
147}
148
149plus:
150  out[count++] = INTERN_PLUS;
151  l->pos++;
152  JUMP_TARGET;
153
154minus:
155  out[count++] = INTERN_MINUS;
156  l->pos++;
157  JUMP_TARGET;
158
159slash:
160  out[count++] = INTERN_SLASH;
161  l->pos++;
162  JUMP_TARGET;
163
164equal:
165  out[count++] = INTERN_EQUAL;
166  l->pos++;
167  JUMP_TARGET;
168
169asterisks:
170  out[count++] = INTERN_ASTERISKS;
171  l->pos++;
172  JUMP_TARGET;
173
174number: {
175  size_t start = l->pos;
176  size_t i = start;
177  bool is_double = false;
178  size_t hash = FNV_OFFSET_BASIS;
179  for (; i < l->input.len; i++) {
180    char cc = l->input.p[i];
181    hash ^= cc;
182    hash *= FNV_PRIME;
183    if (cc >= '0' && cc <= '9')
184      continue;
185    if (cc == '.') {
186      ASSERT(!is_double, "Two dots in double");
187      is_double = true;
188      continue;
189    }
190    break;
191  }
192
193  l->pos = i;
194  Token *n = CALL(a, request, sizeof(Token));
195  n->string = (Str){
196      .p = l->input.p + start,
197      .len = i - start,
198      .hash = hash,
199  };
200  if (is_double) {
201    n->type = T_DOUBLE;
202  } else {
203    n->type = T_INTEGER;
204  }
205
206  out[count++] = n;
207  JUMP_TARGET;
208}
209
210ident: {
211  size_t start = l->pos;
212  size_t hash = FNV_OFFSET_BASIS;
213  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
214    hash ^= cc;
215    hash *= FNV_PRIME;
216  }
217
218  size_t len = l->pos - start;
219  Token *t;
220  if (hash == true_hash) {
221    t = INTERN_TRUE;
222  } else if (hash == false_hash) {
223    t = INTERN_FALSE;
224  } else {
225    t = CALL(a, request, sizeof(Token));
226    t->type = T_IDENT;
227    t->string = (Str){
228        .p = l->input.p + start,
229        .len = len,
230        .hash = hash,
231    };
232  }
233  out[count++] = t;
234  JUMP_TARGET;
235}
236
237// same as string but only with leading '
238quoted: {
239  // skip '
240  l->pos++;
241  size_t start = l->pos;
242  size_t hash = FNV_OFFSET_BASIS;
243  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
244    hash ^= cc;
245    hash *= FNV_PRIME;
246  }
247
248  size_t len = l->pos - start;
249  Token *t;
250  t = CALL(a, request, sizeof(Token));
251  t->type = T_STRING;
252  t->string = (Str){
253      .p = l->input.p + start,
254      .len = len,
255      .hash = hash,
256  };
257  out[count++] = t;
258  JUMP_TARGET;
259}
260
261string: {
262  // skip "
263  l->pos++;
264  size_t start = l->pos;
265  size_t hash = FNV_OFFSET_BASIS;
266  for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {
267    hash ^= cc;
268    hash *= FNV_PRIME;
269  }
270
271  if (UNLIKELY(cur(l) != '"')) {
272    Str slice = Str_slice(&l->input, l->pos, l->input.len);
273    fprintf(stderr, "lex: Unterminated string near: '%.*s'", (int)slice.len,
274            slice.p);
275    out[count++] = INTERN_EOF;
276  } else {
277    Token *t = CALL(a, request, sizeof(Token));
278    t->type = T_STRING;
279    t->string = (Str){
280        .p = l->input.p + start,
281        .len = l->pos - start,
282        .hash = hash,
283    };
284    out[count++] = t;
285    // skip "
286    l->pos++;
287  }
288  JUMP_TARGET;
289}
290
291comment:
292  for (char cc = cur(l); cc > 0 && cc != '\n'; l->pos++, cc = cur(l)) {
293  }
294  JUMP_TARGET;
295
296whitespace:
297  l->pos++;
298  JUMP_TARGET;
299
300unknown: {
301  uint8_t c = cur(l);
302  ASSERT(0, "Unexpected byte '%c' (0x%X) in input", c, c)
303}
304
305end:
306  out[count++] = INTERN_EOF;
307  return count;
308}
309
310#undef SINGLE_TOK

Thank you for reading this far. If you have any suggestions or feedback, feel free to send me an email at contact@xnacly.me.