Optimizing libjson, or: beating encoding/json by at least 2.5x


Libjson is a JSON parsing/interaction library for which I attempted to go the not-so-gopher way, disregarding interoperability and backwards compatibility:

  1. libjson does not support writing JSON into structure fields via reflection
  2. libjson does not support accessing JSON via a structure, because: see point 1.
  3. libjson instead uses a JavaScript-like object access syntax

The three points above result in the following API:

GO
package main

import (
	"fmt"

	"github.com/xnacly/libjson"
)

func main() {
	input := `{ "hello": {"world": ["hi"] } }`
	jsonObj, _ := libjson.New([]byte(input)) // or libjson.NewReader(r io.Reader)

	// accessing values
	fmt.Println(libjson.Get[string](jsonObj, ".hello.world.0")) // hi
}

Libjson is fast, but I want to make it even faster, thus I established a baseline of the current results via ./bench.sh | rg "faster" - it is already about 2x faster than encoding/json.

TEXT
2.10 ± 0.11 times faster than ./test -libjson=false ./1MB.json
2.33 ± 0.07 times faster than ./test -libjson=false ./5MB.json
2.37 ± 0.03 times faster than ./test -libjson=false ./10MB.json

Benchmark setup

I use:

  • Python to generate large JSON files, to avoid microbenchmarks
  • hyperfine to compare the execution times of the resulting programs
  • bash to glue everything together

Generating input

View the source for the benchmarking code at xnacly/libjson/test

I decided on the following JSON object to simulate real-world usage:

JSON
[
  {
    "key1": "value",
    "array": [],
    "obj": {},
    "atomArray": [11201, 1e112, true, false, null, "str"]
  }
]

I then wrote a small Python script to generate three files (1MB, 5MB and 10MB), each containing however many objects are needed to arrive at roughly that size:

PYTHON
from os.path import exists
import math

# target file sizes in megabytes
sizes = [1, 5, 10]

line = """\t{
        "key1": "value",
        "array": [],
        "obj": {},
        "atomArray": [11201,1e112,true,false,null,"str"]
    }"""

def write_data(size: int):
    name = f"{size}MB.json"
    if not exists(name):
        with open(name, mode="w", encoding="utf8") as f:
            f.write("[\n")
            # number of objects needed to reach roughly `size` megabytes
            count = math.floor((size * 1000000) / len(line))
            f.write(",\n".join([line for _ in range(count)]))
            f.write("\n]")

[write_data(size) for size in sizes]

Creating a program

To call a program with hyperfine, we need to write one:

GO
package main

import (
	"encoding/json"
	"flag"
	"log"
	"os"

	"github.com/xnacly/libjson"
)

func main() {
	lj := flag.Bool("libjson", true, "benchmark libjson or gojson")
	flag.Parse()
	args := flag.Args()
	if len(args) == 0 {
		log.Fatalln("Wanted a file as first argument, got nothing, exiting")
	}
	file, err := os.Open(args[0])
	if err != nil {
		log.Fatalln(err)
	}
	if *lj {
		_, err := libjson.NewReader(file)
		if err != nil {
			log.Fatalln(err)
		}
	} else {
		v := []struct {
			Key1      string
			Array     []any
			Obj       any
			AtomArray []any
		}{}
		d := json.NewDecoder(file)
		err := d.Decode(&v)
		if err != nil {
			log.Fatalln(err)
		}
	}
}

This program takes a flag (-libjson) to control whether it should use libjson or encoding/json as a backend.

Calling hyperfine

Putting the above together, I came up with the following shell script:

SH
#!/bin/bash
echo "generating example data"
python3 gen.py

echo "building executable"
rm ./test
go build ./test.go

hyperfine "./test ./1MB.json" "./test -libjson=false ./1MB.json"
hyperfine "./test ./5MB.json" "./test -libjson=false ./5MB.json"
hyperfine "./test ./10MB.json" "./test -libjson=false ./10MB.json"

More precise benchmarks

Benchmarking via a whole program serves as a good comparison between different implementations and their total runtimes; however, it is imprecise regarding allocations and the time spent in the functions we are actually interested in. To create a more precise and relevant benchmark, I will also include the results of the following benchmarks, which use the testing.B tooling:

GO
package libjson

import (
	"encoding/json"
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
)

const amount = 50_000

func BenchmarkLibJson(b *testing.B) {
	data := strings.Repeat(`{"key1": "value","array": [],"obj": {},"atomArray": [11201,1e112,true,false,null,"str"]},`, amount)
	d := []byte("[" + data[:len(data)-1] + "]")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_, err := New(d)
		assert.NoError(b, err)
	}
	b.ReportAllocs()
}

func BenchmarkEncodingJson(b *testing.B) {
	data := strings.Repeat(`{"key1": "value","array": [],"obj": {},"atomArray": [11201,1e112,true,false,null,"str"]},`, amount)
	d := []byte("[" + data[:len(data)-1] + "]")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		v := []struct {
			Key1      string
			Array     []any
			Obj       any
			AtomArray []any
		}{}
		err := json.Unmarshal(d, &v)
		assert.NoError(b, err)
	}
	b.ReportAllocs()
}

Taking both benchmarks into account, the total baseline is as follows:

TEXT
$ cd test && ./bench.sh
generating example data
building executable
Benchmark 1: ./test ./1MB.json
  Time (mean ± σ):      12.4 ms ±   0.6 ms    [User: 16.0 ms, System: 4.8 ms]
  Range (min … max):    11.1 ms …  14.3 ms    217 runs

Benchmark 2: ./test -libjson=false ./1MB.json
  Time (mean ± σ):      24.9 ms ±   0.6 ms    [User: 25.2 ms, System: 3.5 ms]
  Range (min … max):    23.7 ms …  26.4 ms    117 runs

Summary
  ./test ./1MB.json ran
    2.01 ± 0.11 times faster than ./test -libjson=false ./1MB.json
Benchmark 1: ./test ./5MB.json
  Time (mean ± σ):      51.8 ms ±   1.5 ms    [User: 77.5 ms, System: 12.4 ms]
  Range (min … max):    49.4 ms …  54.8 ms    57 runs

Benchmark 2: ./test -libjson=false ./5MB.json
  Time (mean ± σ):     116.7 ms ±   1.2 ms    [User: 133.1 ms, System: 9.8 ms]
  Range (min … max):   114.1 ms … 119.6 ms    25 runs

Summary
  ./test ./5MB.json ran
    2.25 ± 0.07 times faster than ./test -libjson=false ./5MB.json
Benchmark 1: ./test ./10MB.json
  Time (mean ± σ):      97.9 ms ±   1.2 ms    [User: 142.4 ms, System: 22.7 ms]
  Range (min … max):    95.0 ms … 100.4 ms    29 runs

Benchmark 2: ./test -libjson=false ./10MB.json
  Time (mean ± σ):     228.8 ms ±   1.7 ms    [User: 257.9 ms, System: 17.7 ms]
  Range (min … max):   226.2 ms … 231.4 ms    13 runs

Summary
  ./test ./10MB.json ran
    2.34 ± 0.03 times faster than ./test -libjson=false ./10MB.json


$ go test ./... -bench=.
goos: linux
goarch: amd64
pkg: github.com/xnacly/libjson
cpu: AMD Ryzen 7 3700X 8-Core Processor
BenchmarkLibJson-16                   21          53156910 ns/op        34744407 B/op     500024 allocs/op
BenchmarkEncodingJson-16               8         134535103 ns/op        39723750 B/op     650033 allocs/op
PASS
ok      github.com/xnacly/libjson       2.404s
?       github.com/xnacly/libjson/cmd   [no test files]
?       github.com/xnacly/libjson/test  [no test files]

Or as tables:

backend       | 1MB              | 5MB              | 10MB
------------- | ---------------- | ---------------- | ----------------
encoding/json | 24.9ms           | 116.7ms          | 225.9ms
libjson       | 12.4ms (-12.5ms) | 51.8ms (-64.9ms) | 97.9ms (-128ms)

backend       | ns/op                    | B/op                    | allocs/op
------------- | ------------------------ | ----------------------- | -------------------
encoding/json | 134_535_103              | 39_723_750              | 650_033
libjson       | 53_156_910 (-81_378_193) | 34_744_407 (-4_979_343) | 500_024 (-150_009)

Previous improvements

Since I made most of these changes before thinking about this blog article, I will just mention them and their relative impact (-ms for improvement, +ms for regression) as a tuple over the input sizes:

change                                                                                        | impact 1MB | impact 5MB | impact 10MB
--------------------------------------------------------------------------------------------- | ---------- | ---------- | -----------
disabled unique key checking for objects, duplicates just overwrite the previous value         | not tested | not tested | not tested
lex on demand, instead of before starting the parser                                           | -4ms       | -14ms      | not tested
use *(*string)(unsafe.Pointer(&l.buf)) instead of string() to skip an allocation               | -0.4ms     | -2ms       | -4ms
move allocations and string/number parsing to the parser                                       | -1ms       | -2ms       | -4ms
fast paths for true, false and null & byte slices instead of strings for string/number tokens  | -6ms       | -25ms      | -60ms
=                                                                                              | -11.4ms    | -43ms      | -68ms
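
The unsafe.Pointer trick from the table above is worth a quick illustration: it reinterprets a byte slice as a string without copying it, skipping the allocation a string() conversion performs. Below is a minimal, self-contained sketch; the helper name is mine, not libjson's:

GO
package main

import (
	"fmt"
	"unsafe"
)

// zeroCopyString reinterprets buf as a string without copying the
// underlying bytes. This is only safe if buf is never mutated
// afterwards, since Go assumes strings to be immutable.
func zeroCopyString(buf []byte) string {
	return *(*string)(unsafe.Pointer(&buf))
}

func main() {
	buf := []byte("1e112")
	fmt.Println(zeroCopyString(buf)) // 1e112
}

On Go 1.20 and newer, unsafe.String(&buf[0], len(buf)) expresses the same idea; the pointer cast shown here also works on older toolchains.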

Replacing byte slices with offsets and lengths

Currently, the token structure is defined as:

GO
type token struct {
	Type t_json
	Val  []byte
}

Here t_json defines all possible types of tokens a JSON document can contain, and token.Val is only filled for t_json::t_number and t_json::t_string:

GO
const (
	t_string       t_json = iota // anything between ""
	t_number                     // floating point, hex, etc
	t_true                       // true
	t_false                      // false
	t_null                       // null
	t_left_curly                 // {
	t_right_curly                // }
	t_left_braket                // [
	t_right_braket               // ]
	t_comma                      // ,
	t_colon                      // :
	t_eof                        // for any non structure characters outside of strings and numbers
)

Replacing token.Val with token.Start and token.End does not exactly remove an allocation (see Go Slices: usage and internals), but it reduces the size the token struct occupies in memory. The new token structure:

GO
type token struct {
	Type t_json
	// only populated for number and string
	Start int
	End   int
}
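
With offsets, the parser can recover a token's text on demand by slicing the original input, which allocates nothing because the slice aliases the input buffer. A hypothetical accessor (the actual name and layout in libjson may differ) would look like this:

GO
// tokenBytes resolves a token to the bytes it spans in the shared
// input buffer, instead of keeping a copy inside every token.
func (l *lexer) tokenBytes(t token) []byte {
	return l.data[t.Start:t.End]
}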

Switching the lexer and the parser from byte slices to offsets wasn’t that hard and results in the following runtimes:

backend        | 1MB              | 5MB              | 10MB
-------------- | ---------------- | ---------------- | ----------------
encoding/json  | 24.9ms           | 116.7ms          | 225.9ms
libjson before | 12.4ms (-12.5ms) | 51.8ms (-64.9ms) | 97.9ms (-128ms)
libjson        | 12ms (-0.4ms)    | 49.8ms (-2ms)    | 93.8ms (-4.1ms)

And the following Benchmark* output:

backend        | ns/op                    | B/op                    | allocs/op
-------------- | ------------------------ | ----------------------- | -------------------
encoding/json  | 134_535_103              | 39_723_750              | 650_033
libjson before | 53_156_910 (-81_378_193) | 34_744_407 (-4_979_343) | 500_024 (-150_009)
libjson        | 50_504_689 (-2_652_221)  | 34_744_392 (-15)        | 500_024

Manually inlining parser::expect

Currently the parser uses the parser::expect method to return an error if the current token.Type does not match the expected t_json:

GO
func (p *parser) expect(t t_json) error {
	if p.cur_tok.Type != t {
		return fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t])
	}
	return p.advance()
}

It is, for example, used in parser::array:

GO
func (p *parser) array() ([]any, error) {
	err := p.expect(t_left_braket)
	if err != nil {
		return nil, err
	}

	if p.cur_tok.Type == t_right_braket {
		err = p.advance()
		return []any{}, err
	}

	a := make([]any, 0, 8)

	for p.cur_tok.Type != t_eof && p.cur_tok.Type != t_right_braket {
		if len(a) > 0 {
			err := p.expect(t_comma)
			if err != nil {
				return nil, err
			}
		}
		node, err := p.expression()
		if err != nil {
			return nil, err
		}
		a = append(a, node)
	}

	return a, p.expect(t_right_braket)
}

This change replaces every parser::expect invocation with the body of the function, trading some readability for fewer function calls on the hot path:

GO
func (p *parser) array() ([]any, error) {
	if p.cur_tok.Type != t_left_braket {
		return nil, fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t_left_braket])
	}
	err := p.advance()
	if err != nil {
		return nil, err
	}

	if p.cur_tok.Type == t_right_braket {
		return []any{}, p.advance()
	}

	a := make([]any, 0, 8)

	for p.cur_tok.Type != t_eof && p.cur_tok.Type != t_right_braket {
		if len(a) > 0 {
			if p.cur_tok.Type != t_comma {
				return nil, fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t_comma])
			}
			err := p.advance()
			if err != nil {
				return nil, err
			}
		}
		node, err := p.expression()
		if err != nil {
			return nil, err
		}
		a = append(a, node)
	}

	if p.cur_tok.Type != t_right_braket {
		return nil, fmt.Errorf("Unexpected %q at this position, expected %q", tokennames[p.cur_tok.Type], tokennames[t_right_braket])
	}

	return a, p.advance()
}

The results of this change:

backend        | 1MB             | 5MB              | 10MB
-------------- | --------------- | ---------------- | ------------------
encoding/json  | 24.9ms          | 116.7ms          | 225.9ms
libjson before | 12ms (-12.9ms)  | 49.8ms (-66.9ms) | 93.8ms (-134.1ms)
libjson        | 11.5ms (-0.5ms) | 48.5ms (-1.3ms)  | 91ms (-2.3ms)

backend        | ns/op                    | B/op                    | allocs/op
-------------- | ------------------------ | ----------------------- | -------------------
encoding/json  | 134_535_103              | 39_723_750              | 650_033
libjson before | 50_504_689 (-84_030_414) | 34_744_392 (-4_979_358) | 500_024 (-150_009)
libjson        | 49_262_503 (-1_242_186)  | 34_744_160 (-232)       | 500_024

Parallelising lexer::next and parser::parse

Moving from lexing on demand to prelexing in parallel did not result in a decrease in runtime; instead, it regressed performance towards the encoding/json results, presumably because the per-token channel synchronisation costs more than the lexing work it overlaps.

I tested buffered channel sizes of 2, 8, 16 and 32, up to 50k (because that is the number of objects I use for the benchmarks), and none of these yielded any runtime improvement.

Currently, said on-demand lexing is attached to the parser as follows:

GO
type parser struct {
	l       lexer
	cur_tok token
}

func (p *parser) advance() error {
	t, err := p.l.next()
	p.cur_tok = t
	if p.cur_tok.Type == t_eof && err != nil {
		return err
	}
	return nil
}

Replacing this with a channel and moving the tokenisation into a goroutine:

GO
type parser struct {
	l       lexer
	c       <-chan token
	cur_tok token
}

func (p *parser) advance() {
	p.cur_tok = <-p.c
}

Previously, the lexer and parser invocation worked as shown below:

GO
func New(data []byte) (JSON, error) {
	p := parser{l: lexer{data: data}}
	obj, err := p.parse()
	if err != nil {
		return JSON{}, err
	}
	return JSON{obj}, nil
}

Now we also have to manage error handling, channel creation, etc.:

GO
const chanSize = 64

func New(data []byte) (JSON, error) {
	l := lexer{data: data}
	c := make(chan token, chanSize)
	var lexerErr error
	go func() {
		for {
			if tok, err := l.next(); err == nil {
				if tok.Type == t_eof {
					break
				}
				c <- tok
			} else {
				lexerErr = err
				break
			}
		}
		close(c)
		c = nil
	}()
	p := parser{l: l, c: c}
	obj, err := p.parse()
	if lexerErr != nil {
		return JSON{}, lexerErr
	}
	if err != nil {
		return JSON{}, err
	}
	return JSON{obj}, nil
}

The changes are kept in the paralell-lexer-and-parser branch.

Benchmarking this sadly did not satisfy my search for better performance, but I wanted to include it anyway:

backend        | 1MB              | 5MB              | 10MB
-------------- | ---------------- | ---------------- | ------------------
encoding/json  | 24.9ms           | 116.7ms          | 225.9ms
libjson before | 11.5ms (-13.4ms) | 48.5ms (-68.2ms) | 91ms (-136.4ms)
libjson        | 24.3ms (+12.8ms) | 112.5ms (+64ms)  | 218.6ms (+127.6ms)

backend        | ns/op                     | B/op                    | allocs/op
-------------- | ------------------------- | ----------------------- | -------------------
encoding/json  | 134_535_103               | 39_723_750              | 650_033
libjson before | 50_504_689 (-84_030_414)  | 34_744_392 (-4_979_358) | 500_024 (-150_009)
libjson        | 125_533_704 (+75_029_015) | 35_949_392 (+1_205_000) | 500_031 (+7)

Results

Again, these results stem from design choices, benchmarking, profiling and the exploration of many ideas similar to the parallelisation attempt above.

backend       | 1MB              | 5MB              | 10MB
------------- | ---------------- | ---------------- | ----------------
encoding/json | 24.9ms           | 116.7ms          | 225.9ms
libjson       | 11.5ms (-13.4ms) | 48.5ms (-68.2ms) | 91ms (-136.4ms)

backend       | ns/op                    | B/op                    | allocs/op
------------- | ------------------------ | ----------------------- | -------------------
encoding/json | 134_535_103              | 39_723_750              | 650_033
libjson       | 50_504_689 (-84_030_414) | 34_744_392 (-4_979_358) | 500_024 (-150_009)

Summing up, libjson is currently around 2.5x faster than encoding/json. For 50k objects in a JSON array, it uses around 5MB less memory and makes 150k fewer allocations - pretty good.