Libjson is a JSON parsing/interaction library for which I attempted to go the not-so-gopher way, disregarding interoperability and backwards compatibility:

- libjson does not support writing JSON into struct fields via reflection
- libjson does not support accessing JSON via a struct, because: see point 1
- libjson instead uses a JavaScript-like object access syntax

The three points above result in the following API:
```go
package main

import (
	"fmt"

	"github.com/xnacly/libjson"
)

func main() {
	input := `{ "hello": {"world": ["hi"] } }`
	jsonObj, _ := libjson.New(input) // or libjson.NewReader(r io.Reader)

	// accessing values
	fmt.Println(libjson.Get[string](jsonObj, ".hello.world.0")) // hi
}
```
Libjson is fast, but I want to make it even faster, so I established a baseline of the current results via `./bench.sh | rg "faster"` - it already is about 2x faster than `encoding/json`:

```text
2.10 ± 0.11 times faster than ./test -libjson=false ./1MB.json
2.33 ± 0.07 times faster than ./test -libjson=false ./5MB.json
2.37 ± 0.03 times faster than ./test -libjson=false ./10MB.json
```
## Benchmark setup

I use:

- Python to generate large JSON files, to avoid microbenchmarking
- hyperfine to compare the execution times of the two programs
- bash to put everything together
## Generating input

View the source for the benchmarking code at xnacly/libjson/test

I decided on the following JSON object to simulate real-world usage:
```json
[
  {
    "key1": "value",
    "array": [],
    "obj": {},
    "atomArray": [11201, 1e112, true, false, null, "str"]
  }
]
```
I then wrote a small Python script to generate three files (1 MB, 5 MB and 10 MB), repeating the object often enough to arrive at roughly those sizes:
```python
from os.path import exists
import math

sizes = [1, 5, 10]

line = """\t{
        "key1": "value",
        "array": [],
        "obj": {},
        "atomArray": [11201,1e112,true,false,null,"str"]
    }"""

def write_data(size: int):
    name = f"{size}MB.json"
    if not exists(name):
        with open(name, mode="w", encoding="utf8") as f:
            f.write("[\n")
            count = math.floor((size * 1000000) / len(line))
            f.write(",\n".join([line for _ in range(count)]))
            f.write("\n]")

[write_data(size) for size in sizes]
```
## Creating a program
To call a program with hyperfine, we need to write one:
```go
package main

import (
	"encoding/json"
	"flag"
	"log"
	"os"

	"github.com/xnacly/libjson"
)

func main() {
	lj := flag.Bool("libjson", true, "benchmark libjson or gojson")
	flag.Parse()
	args := flag.Args()
	if len(args) == 0 {
		log.Fatalln("Wanted a file as first argument, got nothing, exiting")
	}
	file, err := os.Open(args[0])
	if err != nil {
		log.Fatalln(err)
	}
	if *lj {
		_, err := libjson.NewReader(file)
		if err != nil {
			log.Fatalln(err)
		}
	} else {
		v := []struct {
			Key1      string
			Array     []any
			Obj       any
			AtomArray []any
		}{}
		d := json.NewDecoder(file)
		err := d.Decode(&v)
		if err != nil {
			log.Fatalln(err)
		}
	}
}
```
This program takes a flag (`-libjson`) to control whether it should use libjson or `encoding/json` as its backend.
## Calling hyperfine

Putting the above together, I came up with the following shell script:
```bash
#!/bin/bash
echo "generating example data"
python3 gen.py

echo "building executable"
rm -f ./test
go build ./test.go

hyperfine "./test ./1MB.json" "./test -libjson=false ./1MB.json"
hyperfine "./test ./5MB.json" "./test -libjson=false ./5MB.json"
hyperfine "./test ./10MB.json" "./test -libjson=false ./10MB.json"
```
## More precise benchmarks

Benchmarking via a whole program serves as a good comparison between different implementations and their total runtimes, but it is very imprecise regarding the allocations and the time spent in the functions we are interested in. To create a more precise and relevant benchmark, I will also include the results of the following benchmarks, which use the `testing.B` tooling:
```go
package libjson

import (
	"encoding/json"
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
)

const amount = 50_000

func BenchmarkLibJson(b *testing.B) {
	data := strings.Repeat(`{"key1": "value","array": [],"obj": {},"atomArray": [11201,1e112,true,false,null,"str"]},`, amount)
	d := []byte("[" + data[:len(data)-1] + "]")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_, err := New(d)
		assert.NoError(b, err)
	}
	b.ReportAllocs()
}

func BenchmarkEncodingJson(b *testing.B) {
	data := strings.Repeat(`{"key1": "value","array": [],"obj": {},"atomArray": [11201,1e112,true,false,null,"str"]},`, amount)
	d := []byte("[" + data[:len(data)-1] + "]")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		v := []struct {
			Key1      string
			Array     []any
			Obj       any
			AtomArray []any
		}{}
		err := json.Unmarshal(d, &v)
		assert.NoError(b, err)
	}
	b.ReportAllocs()
}
```
Taking both benchmarks into account, the total baseline is as follows:
```text
$ cd test && ./bench.sh
generating example data
building executable
Benchmark 1: ./test ./1MB.json
  Time (mean ± σ):      12.4 ms ±  0.6 ms    [User: 16.0 ms, System: 4.8 ms]
  Range (min … max):    11.1 ms … 14.3 ms    217 runs

Benchmark 2: ./test -libjson=false ./1MB.json
  Time (mean ± σ):      24.9 ms ±  0.6 ms    [User: 25.2 ms, System: 3.5 ms]
  Range (min … max):    23.7 ms … 26.4 ms    117 runs

Summary
  ./test ./1MB.json ran
    2.01 ± 0.11 times faster than ./test -libjson=false ./1MB.json
Benchmark 1: ./test ./5MB.json
  Time (mean ± σ):      51.8 ms ±  1.5 ms    [User: 77.5 ms, System: 12.4 ms]
  Range (min … max):    49.4 ms … 54.8 ms    57 runs

Benchmark 2: ./test -libjson=false ./5MB.json
  Time (mean ± σ):     116.7 ms ±  1.2 ms    [User: 133.1 ms, System: 9.8 ms]
  Range (min … max):   114.1 ms … 119.6 ms    25 runs

Summary
  ./test ./5MB.json ran
    2.25 ± 0.07 times faster than ./test -libjson=false ./5MB.json
Benchmark 1: ./test ./10MB.json
  Time (mean ± σ):      97.9 ms ±  1.2 ms    [User: 142.4 ms, System: 22.7 ms]
  Range (min … max):    95.0 ms … 100.4 ms    29 runs

Benchmark 2: ./test -libjson=false ./10MB.json
  Time (mean ± σ):     228.8 ms ±  1.7 ms    [User: 257.9 ms, System: 17.7 ms]
  Range (min … max):   226.2 ms … 231.4 ms    13 runs

Summary
  ./test ./10MB.json ran
    2.34 ± 0.03 times faster than ./test -libjson=false ./10MB.json


$ go test ./... -bench=.
goos: linux
goarch: amd64
pkg: github.com/xnacly/libjson
cpu: AMD Ryzen 7 3700X 8-Core Processor
BenchmarkLibJson-16           21  53156910 ns/op  34744407 B/op  500024 allocs/op
BenchmarkEncodingJson-16       8 134535103 ns/op  39723750 B/op  650033 allocs/op
PASS
ok      github.com/xnacly/libjson  2.404s
?       github.com/xnacly/libjson/cmd   [no test files]
?       github.com/xnacly/libjson/test  [no test files]
```
Or as tables:
| backend | 1 MB | 5 MB | 10 MB |
|---|---|---|---|
| encoding/json | 24.9ms | 116.7ms | 225.9ms |
| libjson | 12.4ms (-12.5ms) | 51.8ms (-64.9ms) | 97.9ms (-128ms) |
| backend | ns/op | B/op | allocs/op |
|---|---|---|---|
| encoding/json | 134_535_103 | 39_723_750 | 650_033 |
| libjson | 53_156_910 (-81_378_193) | 34_744_407 (-4_979_343) | 500_024 (-150_009) |
## Previous improvements

Since I made most of these changes before thinking about this blog article, I will just list them and their relative impact (-ms for an improvement, +ms for a regression) for each input size:
| change | impact 1 MB | impact 5 MB | impact 10 MB |
|---|---|---|---|
| disabled unique key checking for objects, duplicates just overwrite the previous value | not tested | not tested | not tested |
| lex on demand, instead of before starting the parser | -4ms | -14ms | not tested |
| use `*(*string)(unsafe.Pointer(&l.buf))` instead of `string()` to skip an allocation | -0.4ms | -2ms | -4ms |
| move allocations and string/number parsing to the parser | -1ms | -2ms | -4ms |
| fast paths for `true`, `false` and `null` & byte slices instead of strings for string/number tokens | -6ms | -25ms | -60ms |
| total | -11.4ms | -43ms | -68ms |
## Replacing byte slices with offsets and lengths

Currently, a `token` structure is defined as:
```go
type token struct {
	Type t_json
	Val  []byte
}
```
`t_json` defines all possible types of tokens a JSON document can contain, while `token.Val` is only filled for `t_number` and `t_string`:
```go
const (
	t_string       t_json = iota // anything between ""
	t_number                     // floating point, hex, etc
	t_true                       // true
	t_false                      // false
	t_null                       // null
	t_left_curly                 // {
	t_right_curly                // }
	t_left_braket                // [
	t_right_braket               // ]
	t_comma                      // ,
	t_colon                      // :
	t_eof                        // for any non structure characters outside of strings and numbers
)
```
Replacing `token.Val` with `token.Start` and `token.End` does not exactly remove an allocation (see [Go Slices: usage and internals](https://go.dev/blog/slices-intro)), but it reduces the size the `token` struct occupies in memory. The new `token` structure:
```go
type token struct {
	Type t_json
	// only populated for number and string
	Start int
	End   int
}
```
Switching the lexer and the parser from byte slices to offsets wasn't that hard and results in the following runtimes:
| backend | 1 MB | 5 MB | 10 MB |
|---|---|---|---|
| encoding/json | 24.9ms | 116.7ms | 225.9ms |
| libjson before | 12.4ms (-12.5ms) | 51.8ms (-64.9ms) | 97.9ms (-128ms) |
| libjson | 12ms (-0.4ms) | 49.8ms (-2ms) | 93.8ms (-4.1ms) |
And the following `Benchmark*` output:
| backend | ns/op | B/op | allocs/op |
|---|---|---|---|
| encoding/json | 134_535_103 | 39_723_750 | 650_033 |
| libjson before | 53_156_910 (-81_378_193) | 34_744_407 (-4_979_343) | 500_024 (-150_009) |
| libjson | 50_504_689 (-2_652_221) | 34_744_392 (-15) | 500_024 |
## Manually inlining `parser::expect`

Currently the parser uses the `parser::expect` method to throw an error if the current `token.Type` does not match the expected `t_json`:
```go
func (p *parser) expect(t t_json) error {
	if p.cur_tok.Type != t {
		return fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t])
	}
	return p.advance()
}
```
It is, for example, used in `parser::array`:
```go
func (p *parser) array() ([]any, error) {
	err := p.expect(t_left_braket)
	if err != nil {
		return nil, err
	}

	if p.cur_tok.Type == t_right_braket {
		err = p.advance()
		return []any{}, err
	}

	a := make([]any, 0, 8)

	for p.cur_tok.Type != t_eof && p.cur_tok.Type != t_right_braket {
		if len(a) > 0 {
			err := p.expect(t_comma)
			if err != nil {
				return nil, err
			}
		}
		node, err := p.expression()
		if err != nil {
			return nil, err
		}
		a = append(a, node)
	}

	return a, p.expect(t_right_braket)
}
```
This change replaces every `parser::expect` invocation with the body of the function:
```go
func (p *parser) array() ([]any, error) {
	if p.cur_tok.Type != t_left_braket {
		return nil, fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t_left_braket])
	}
	err := p.advance()
	if err != nil {
		return nil, err
	}

	if p.cur_tok.Type == t_right_braket {
		return []any{}, p.advance()
	}

	a := make([]any, 0, 8)

	for p.cur_tok.Type != t_eof && p.cur_tok.Type != t_right_braket {
		if len(a) > 0 {
			if p.cur_tok.Type != t_comma {
				return nil, fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t_comma])
			}
			err := p.advance()
			if err != nil {
				return nil, err
			}
		}
		node, err := p.expression()
		if err != nil {
			return nil, err
		}
		a = append(a, node)
	}

	if p.cur_tok.Type != t_right_braket {
		return nil, fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t_right_braket])
	}

	return a, p.advance()
}
```
The results of this change:
| backend | 1 MB | 5 MB | 10 MB |
|---|---|---|---|
| encoding/json | 24.9ms | 116.7ms | 225.9ms |
| libjson before | 12ms (-12.9ms) | 49.8ms (-66.9ms) | 93.8ms (-134.1ms) |
| libjson | 11.5ms (-0.5ms) | 48.5ms (-1.3ms) | 91ms (-2.3ms) |
| backend | ns/op | B/op | allocs/op |
|---|---|---|---|
| encoding/json | 134_535_103 | 39_723_750 | 650_033 |
| libjson before | 50_504_689 (-84_030_414) | 34_744_392 (-4_979_358) | 500_024 (-150_009) |
| libjson | 49_262_503 (-1_242_186) | 34_744_160 (-232) | 500_024 |
## Parallelising `lexer::next` and `parser::parse`

Moving from lexing on demand to prelexing in parallel did not result in a decrease in runtime, but instead regressed performance towards the `encoding/json` results. I tested buffered channel sizes from 2, 8, 16 and 32 up to 50k (because that's the number of objects I use in the benchmarks), and none of these yielded any runtime improvements.
Currently, said on-demand lexing is attached to the parser as follows:
```go
type parser struct {
	l       lexer
	cur_tok token
}

func (p *parser) advance() error {
	t, err := p.l.next()
	p.cur_tok = t
	if p.cur_tok.Type == t_eof && err != nil {
		return err
	}
	return nil
}
```
Replacing this with a channel and moving the tokenisation to a goroutine:
```go
type parser struct {
	l       lexer
	c       <-chan token
	cur_tok token
}

func (p *parser) advance() {
	p.cur_tok = <-p.c
}
```
Previously, the lexer and parser invocation worked as shown below:
```go
func New(data []byte) (JSON, error) {
	p := parser{l: lexer{data: data}}
	obj, err := p.parse()
	if err != nil {
		return JSON{}, err
	}
	return JSON{obj}, nil
}
```
Now we have to manage error handling, channel creation, etc.:
```go
const chanSize = 64

func New(data []byte) (JSON, error) {
	l := lexer{data: data}
	c := make(chan token, chanSize)
	var lexerErr error
	go func() {
		for {
			if tok, err := l.next(); err == nil {
				if tok.Type == t_eof {
					break
				}
				c <- tok
			} else {
				lexerErr = err
				break
			}
		}
		close(c)
		c = nil
	}()
	p := parser{l: l, c: c}
	obj, err := p.parse()
	if lexerErr != nil {
		return JSON{}, lexerErr
	}
	if err != nil {
		return JSON{}, err
	}
	return JSON{obj}, nil
}
```
The changes are kept in the `paralell-lexer-and-parser` branch. Benchmarking this sadly did not satisfy my search for better performance, but I wanted to include it anyway:
| backend | 1 MB | 5 MB | 10 MB |
|---|---|---|---|
| encoding/json | 24.9ms | 116.7ms | 225.9ms |
| libjson before | 11.5ms (-13.4ms) | 48.5ms (-68.2ms) | 91ms (-136.4ms) |
| libjson | 24.3ms (+12.8ms) | 112.5ms (+64ms) | 218.6ms (+127.6ms) |
| backend | ns/op | B/op | allocs/op |
|---|---|---|---|
| encoding/json | 134_535_103 | 39_723_750 | 650_033 |
| libjson before | 50_504_689 (-84_030_414) | 34_744_392 (-4_979_358) | 500_024 (-150_009) |
| libjson | 125_533_704 (+75_029_015) | 35_949_392 (+1_205_000) | 500_031 (+7) |
## Results

Again, these results stem from design choices, benchmarking, profiling and the exploration of many approaches similar to the parallelisation attempt above.
| backend | 1 MB | 5 MB | 10 MB |
|---|---|---|---|
| encoding/json | 24.9ms | 116.7ms | 225.9ms |
| libjson | 11.5ms (-13.4ms) | 48.5ms (-68.2ms) | 91ms (-136.4ms) |
| backend | ns/op | B/op | allocs/op |
|---|---|---|---|
| encoding/json | 134_535_103 | 39_723_750 | 650_033 |
| libjson | 50_504_689 (-84_030_414) | 34_744_392 (-4_979_358) | 500_024 (-150_009) |
Summing up, libjson is currently around 2.5x faster than `encoding/json`. For 50k objects in a JSON array, it uses 5 MB less memory and makes 150k fewer allocations - pretty good.