Revisiting and Optimising go-iso8601-duration

Tags:

While looking through my repos, an issue in go-iso8601-duration, which I previously missed, popped up, informing me of a missing license in the last release, resulting in missing documentation for the v1.1.0 release (Since the gopkg website doesnt render documentation for non licensed projects).

Since I had to do a new release either way, I took the opportunity and benchmarked the current state.

Benchmarks and a Baseline

I reused the cases previously employed for checking the FSM to establish some micro benchmarks spanning ALL branches the FSM can take. Since I also run the tests normally before benchmarking, I dont have to validate results in the benchmark.

For an iso8601 FSM deepdive and a rant about the ISO org, see my previous article: Handrolling ISO8601 Duration Support for Go

GO
 1var testcases = []struct {
 2	str string
 3	dur Duration
 4}{
 5	{"P0D", Duration{}},
 6	{"PT15H", Duration{hour: 15}},
 7	{"P1W", Duration{week: 1}},
 8	{"P15W", Duration{week: 15}},
 9	{"P1Y15W", Duration{year: 1, week: 15}},
10	{"P15Y", Duration{year: 15}},
11	{"P15Y3M", Duration{year: 15, month: 3}},
12	{"P15Y3M41D", Duration{year: 15, month: 3, day: 41}},
13	{"PT15M", Duration{minute: 15}},
14	{"PT15M10S", Duration{minute: 15, second: 10}},
15	{
16		"P3Y6M4DT12H30M5S",
17		Duration{
18			year:   3,
19			month:  6,
20			day:    4,
21			hour:   12,
22			minute: 30,
23			second: 5,
24		},
25	},
26}
27
28func BenchmarkDuration(b *testing.B) {
29	for _, i := range testcases {
30		b.Run(i.str, func(b *testing.B) {
31			for n := 0; n < b.N; n++ {
32				_, _ = From(i.str)
33			}
34		})
35	}
36}

Benchmarks use Go’s testing package on:

TEXT
1goos: linux
2goarch: amd64
3pkg: github.com/xnacly/go-iso8601-duration
4cpu: AMD Ryzen 7 3700X 8-Core Processor

Executed with go test ./... -bench=. -benchmem

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                24741985                48.10 ns/op            8 B/op          1 allocs/op
 6BenchmarkDuration/PT15H-16              18533790                63.26 ns/op            8 B/op          1 allocs/op
 7BenchmarkDuration/P1W-16                24164163                48.64 ns/op            8 B/op          1 allocs/op
 8BenchmarkDuration/P15W-16               20793126                58.09 ns/op            8 B/op          1 allocs/op
 9BenchmarkDuration/P1Y15W-16             13904265                84.21 ns/op            8 B/op          1 allocs/op
10BenchmarkDuration/P15Y-16               19943671                59.35 ns/op            8 B/op          1 allocs/op
11BenchmarkDuration/P15Y3M-16             14227842                83.24 ns/op            8 B/op          1 allocs/op
12BenchmarkDuration/P15Y3M41D-16          10262575                115.6 ns/op            8 B/op          1 allocs/op
13BenchmarkDuration/PT15M-16              18228138                63.90 ns/op            8 B/op          1 allocs/op
14BenchmarkDuration/PT15M10S-16           12289743                95.84 ns/op            8 B/op          1 allocs/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   6758479                 177.3 ns/op            8 B/op          1 allocs/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   13.952s

Warning - Note

At the end of each section I will show the updated output of the benchsmarks with an average improvement in percent, this will be relative to the previous change, not to the original results.

Replacing the numBuffer with a running int64

The first weird detail I noticed, was previous me using bytes.Buffer to store the single characters making up the numbers for parsing the numeric value of each designator, for instance 593 in P593W, when a running int64 would suffice.

GO
 1func From(s string) (Duration, error) {
 2	var duration Duration
 3    curState := stateStart
 4    var col uint8
 5    numBuf := *bytes.NewBuffer(make([]byte, 0, 8))
 6
 7	for {
 8        // ...
 9        switch curState {
10        // ...
11		case stateTNumber:
12			if unicode.IsDigit(b) {
13				numBuf.WriteRune(b)
14				curState = stateTNumber
15			} else if strings.ContainsRune(timeDesignators, b) {
16				if numBuf.Len() == 0 {
17					return duration, wrapErr(MissingNumber, col)
18				}
19				num, err := numBufferToNumber(numBuf)
20				if err != nil {
21					return duration, err
22				}
23                // using the number
24				numBuf.Reset()
25				curState = stateTDesignator
26			} else {
27				return duration, wrapErr(UnknownDesignator, col)
28			}
29        }
30    }
31}

Especially since the bytes.Buffer interactions require multiple function calls for writing, a call for reading and allocations for the underlying buffer, while the final number parsing itself in numBufferToNum forming a second pass of the integer characters.

GO
 1func numBufferToNumber(buf bytes.Buffer) (int64, error) {
 2	var i int64
 3	for _, n := range buf.Bytes() {
 4		digit := int64(n - '0')
 5		if i > (math.MaxInt64-digit)/10 {
 6			return 0, DesignatorNumberTooLarge
 7		}
 8		i = (i * 10) + digit
 9	}
10
11	return i, nil
12}

For this problem, all of this is totally unnecessary and can be fused into a running int64 temporary value, updating value on seeing numeric bytes and then using the integer when creating the designators field of the goiso8601duration.Duration struct:

DIFF
  1diff --git a/duration.go b/duration.go
  2index 80a3b4c..4506457 100644
  3--- a/duration.go
  4+++ b/duration.go
  5@@ -128,7 +128,8 @@ func From(s string) (Duration, error) {
  6
  7 	curState := stateStart
  8 	var col uint8
  9-	numBuf := *bytes.NewBuffer(make([]byte, 0, 8))
 10+	var num int64
 11+	var hasNum bool
 12 	r := strings.NewReader(s)
 13
 14 	for {
 15@@ -171,23 +172,25 @@ func From(s string) (Duration, error) {
 16 			if b == 'T' {
 17 				curState = stateT
 18 			} else if unicode.IsDigit(b) {
 19-				numBuf.WriteRune(b)
 20+				num = (num * 10) + int64(b-'0')
 21+				hasNum = true
 22 				curState = stateNumber
 23 			} else {
 24 				return duration, wrapErr(MissingNumber, col)
 25 			}
 26 		case stateNumber:
 27 			if unicode.IsDigit(b) {
 28-				numBuf.WriteRune(b)
 29+				digit := int64(b - '0')
 30+				if num > (math.MaxInt64-digit)/10 {
 31+					return duration, DesignatorNumberTooLarge
 32+				}
 33+				num = (num * 10) + digit
 34+				hasNum = true
 35 				curState = stateNumber
 36 			} else if strings.ContainsRune(defaultDesignators, b) {
 37-				if numBuf.Len() == 0 {
 38+				if !hasNum {
 39 					return duration, wrapErr(MissingNumber, col)
 40 				}
 41-				num, err := numBufferToNumber(numBuf)
 42-				if err != nil {
 43-					return duration, err
 44-				}
 45 				switch b {
 46 				case 'Y':
 47 					if duration.year != 0 {
 48@@ -210,30 +213,36 @@ func From(s string) (Duration, error) {
 49 					}
 50 					duration.day = num
 51 				}
 52-				numBuf.Reset()
 53+				num = 0
 54 				curState = stateDesignator
 55 			} else {
 56 				return duration, wrapErr(UnknownDesignator, col)
 57 			}
 58 		case stateT, stateTDesignator:
 59 			if unicode.IsDigit(b) {
 60-				numBuf.WriteRune(b)
 61+				digit := int64(b - '0')
 62+				if num > (math.MaxInt64-digit)/10 {
 63+					return duration, DesignatorNumberTooLarge
 64+				}
 65+				num = (num * 10) + digit
 66+				hasNum = true
 67 				curState = stateTNumber
 68 			} else {
 69 				return duration, wrapErr(MissingNumber, col)
 70 			}
 71 		case stateTNumber:
 72 			if unicode.IsDigit(b) {
 73-				numBuf.WriteRune(b)
 74+				digit := int64(b - '0')
 75+				if num > (math.MaxInt64-digit)/10 {
 76+					return duration, DesignatorNumberTooLarge
 77+				}
 78+				num = (num * 10) + digit
 79+				hasNum = true
 80 				curState = stateTNumber
 81 			} else if strings.ContainsRune(timeDesignators, b) {
 82-				if numBuf.Len() == 0 {
 83+				if !hasNum {
 84 					return duration, wrapErr(MissingNumber, col)
 85 				}
 86-				num, err := numBufferToNumber(numBuf)
 87-				if err != nil {
 88-					return duration, err
 89-				}
 90 				switch b {
 91 				case 'H':
 92 					if duration.hour != 0 {
 93@@ -251,7 +260,7 @@ func From(s string) (Duration, error) {
 94 					}
 95 					duration.second = num
 96 				}
 97-				numBuf.Reset()
 98+				num = 0
 99 				curState = stateTDesignator
100 			} else {
101 				return duration, wrapErr(UnknownDesignator, col)

The above diff shows:

  1. Parsing number via num while iterating (no more second pass numeric parsing)
  2. Overflow checks, analog to the previous version, returing DesignatorNumberTooLarge on occurrence
  3. No more calls to numBuf.{WriteRune,Reset,Bytes} in the hotpath, or anywhere
  4. No allocations anymore

Benchmarks show an average runtime reduction of ~45%:

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                45560650                25.55 ns/op            0 B/op          0 allocs/op
 6BenchmarkDuration/PT15H-16              34413621                35.08 ns/op            0 B/op          0 allocs/op
 7BenchmarkDuration/P1W-16                46468417                25.84 ns/op            0 B/op          0 allocs/op
 8BenchmarkDuration/P15W-16               37708790                31.77 ns/op            0 B/op          0 allocs/op
 9BenchmarkDuration/P1Y15W-16             26145358                45.38 ns/op            0 B/op          0 allocs/op
10BenchmarkDuration/P15Y-16               37434289                31.84 ns/op            0 B/op          0 allocs/op
11BenchmarkDuration/P15Y3M-16             26251488                45.27 ns/op            0 B/op          0 allocs/op
12BenchmarkDuration/P15Y3M41D-16          18727663                63.59 ns/op            0 B/op          0 allocs/op
13BenchmarkDuration/PT15M-16              33159258                35.16 ns/op            0 B/op          0 allocs/op
14BenchmarkDuration/PT15M10S-16           21954288                53.80 ns/op            0 B/op          0 allocs/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   10706078                111.9 ns/op            0 B/op          0 allocs/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   13.605s

Throwing unicode.IsDigit away for ‘0’ <= b && b <= ‘9’

The least impactful change is inlining unicode.IsDigit, but only the ascii (<= MaxLatin1) portion, since the filter out the non-ascii inputs.

GO
1package unicode
2
3// IsDigit reports whether the rune is a decimal digit.
4func IsDigit(r rune) bool {
5	if r <= MaxLatin1 {
6		return '0' <= r && r <= '9'
7	}
8	return isExcludingLatin(Digit, r)
9}

This change results in ~6% less runtime:

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                47473572                24.55 ns/op            0 B/op          0 allocs/op
 6BenchmarkDuration/PT15H-16              35477791                32.96 ns/op            0 B/op          0 allocs/op
 7BenchmarkDuration/P1W-16                48368948                24.96 ns/op            0 B/op          0 allocs/op
 8BenchmarkDuration/P15W-16               40997600                29.37 ns/op            0 B/op          0 allocs/op
 9BenchmarkDuration/P1Y15W-16             28936015                41.42 ns/op            0 B/op          0 allocs/op
10BenchmarkDuration/P15Y-16               39383282                30.27 ns/op            0 B/op          0 allocs/op
11BenchmarkDuration/P15Y3M-16             27719198                42.48 ns/op            0 B/op          0 allocs/op
12BenchmarkDuration/P15Y3M41D-16          20594792                58.53 ns/op            0 B/op          0 allocs/op
13BenchmarkDuration/PT15M-16              34655841                33.56 ns/op            0 B/op          0 allocs/op
14BenchmarkDuration/PT15M10S-16           24108157                50.88 ns/op            0 B/op          0 allocs/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   11649555               102.50 ns/op             0 B/op          0 allocs/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   13.601s

Replacing strings.Reader with for-range rune iteration

TEXT
1go test ./... -bench=. -cpuprofile=cpu
2go tool pprof -http :8080 cpu

flamegraph1

While the next section targets strings.ContainsRune, this section focuses on strings.(*Reader).ReadRune. Specifically requesting a new rune.

GO
 1for {
 2    b, size, err := r.ReadRune()
 3    if err != nil {
 4        if err != io.EOF {
 5            return duration, wrapErr(UnexpectedReaderError, col)
 6        } else {
 7            // other error handling here
 8        }
 9    }
10    if size > 1 {
11        return duration, wrapErr(UnexpectedNonAsciiRune, col)
12    }
13    col++
14
15    switch curState {
16     // ...
17    }
18}

Each iteration requires a ReadRune call, multiple branches for error handling and ascii checking, since i want to error on non ascii inputs.

This is simply not necessary, since I can iterate over the runes of the input string using a for-range loop, check for ascii validity with r <= unicode.MaxASCII:

DIFF
 1diff --git a/duration.go b/duration.go
 2index e4083bb..52eec76 100644
 3--- a/duration.go
 4+++ b/duration.go
 5@@ -2,11 +2,11 @@ package goiso8601duration
 6
 7 import (
 8 	"encoding/json"
 9-	"io"
10 	"math"
11 	"strconv"
12 	"strings"
13 	"time"
14+	"unicode"
15 )
16
17 // constants for roundtripping between time.Duration and Duration
18@@ -112,33 +112,13 @@ func From(s string) (Duration, error) {
19 	}
20
21 	curState := stateStart
22-	var col uint8
23 	var num int64
24 	var hasNum bool
25-	r := strings.NewReader(s)
26-
27-	for {
28-		b, size, err := r.ReadRune()
29-		if err != nil {
30-			if err != io.EOF {
31-				return duration, wrapErr(UnexpectedReaderError, col)
32-			} else if curState == stateP {
33-				// being in stateP at the end (io.EOF) means we havent
34-				// encountered anything after the P, so there were no numbers
35-				// or states
36-				return duration, wrapErr(UnexpectedEof, col)
37-			} else if curState == stateNumber || curState == stateTNumber {
38-				// if we are in the state of Number or TNumber we had a number
39-				// but no designator at the end
40-				return duration, wrapErr(MissingDesignator, col)
41-			} else {
42-				curState = stateFin
43-			}
44-		}
45-		if size > 1 {
46+
47+	for col, b := range s {
48+		if b > unicode.MaxASCII {
49 			return duration, wrapErr(UnexpectedNonAsciiRune, col)
50 		}
51-		col++
52
53 		switch curState {
54 		case stateStart:
55@@ -254,6 +234,19 @@ func From(s string) (Duration, error) {
56 			return duration, nil
57 		}
58 	}
59+
60+	if curState == stateP {
61+		// being in stateP at the end (io.EOF) means we havent
62+		// encountered anything after the P, so there were no numbers
63+		// or states
64+		return duration, wrapErr(UnexpectedEof, len(s))
65+	} else if curState == stateNumber || curState == stateTNumber {
66+		// if we are in the state of Number or TNumber we had a number
67+		// but no designator at the end
68+		return duration, wrapErr(MissingDesignator, len(s))
69+	} else {
70+		return duration, nil
71+	}
72 }
73
74 func (i Duration) Apply(t time.Time) time.Time {

Thus reducing the hot path complexity by multiple function calls, branches and the bullshit strings.(*Reader).ReadRune does:

GO
 1package strings
 2
 3// ReadRune implements the [io.RuneReader] interface.
 4func (r *Reader) ReadRune() (ch rune, size int, err error) {
 5	if r.i >= int64(len(r.s)) {
 6		r.prevRune = -1
 7		return 0, 0, io.EOF
 8	}
 9	r.prevRune = int(r.i)
10	ch, size = utf8.DecodeRuneInString(r.s[r.i:])
11	r.i += int64(size)
12	return
13}

These changes result in ~38% runtime reduction:

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                71080273                16.55 ns/op            0 B/op          0 allocs/op
 6BenchmarkDuration/PT15H-16              59493799                19.94 ns/op            0 B/op          0 allocs/op
 7BenchmarkDuration/P1W-16                70588683                16.73 ns/op            0 B/op          0 allocs/op
 8BenchmarkDuration/P15W-16               62168469                18.98 ns/op            0 B/op          0 allocs/op
 9BenchmarkDuration/P1Y15W-16             43747168                28.85 ns/op            0 B/op          0 allocs/op
10BenchmarkDuration/P15Y-16               63284880                19.99 ns/op            0 B/op          0 allocs/op
11BenchmarkDuration/P15Y3M-16             38357061                29.42 ns/op            0 B/op          0 allocs/op
12BenchmarkDuration/P15Y3M41D-16          28748721                41.85 ns/op            0 B/op          0 allocs/op
13BenchmarkDuration/PT15M-16              53639156                20.78 ns/op            0 B/op          0 allocs/op
14BenchmarkDuration/PT15M10S-16           37592148                31.97 ns/op            0 B/op          0 allocs/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   16793139                71.17 ns/op            0 B/op          0 allocs/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   13.434s

Removing duplicate Branching in Designator validitiy checks

Lets look at the flamegraph again:

flamegraph2

The offender strings.ContainsRune. The previous implementation employed a duplicate check for making sure non numeric chars match the expected characters at a given FSM state.

GO
1// In representations of duration,
2// the following designators are used as part of the expression,
3// see the doc comment of the function
4//
5// [Y] [M] [W] [D] [H] [M] [S]
6const (
7	defaultDesignators = "YMWD"
8	timeDesignators    = "MHS"
9)

For instance in stateTNumber, there is a branch for a numeric character, and a conditional call to strings.ContainsRune before switching on each designator in timeDesignators again:

GO
 1case stateTNumber:
 2    if '0' <= b && b <= '9' {
 3        digit := int64(b - '0')
 4        if num > (math.MaxInt64-digit)/10 {
 5            return duration, DesignatorNumberTooLarge
 6        }
 7        num = (num * 10) + digit
 8        hasNum = true
 9        curState = stateTNumber
10    } else if strings.ContainsRune(timeDesignators, b) {
11        if !hasNum {
12            return duration, wrapErr(MissingNumber, col)
13        }
14        switch b {
15        case 'H':
16            if duration.hour != 0 {
17                return duration, wrapErr(DuplicateDesignator, col)
18            }
19            duration.hour = num
20        case 'M':
21            if duration.minute != 0 {
22                return duration, wrapErr(DuplicateDesignator, col)
23            }
24            duration.minute = num
25        case 'S':
26            if duration.second != 0 {
27                return duration, wrapErr(DuplicateDesignator, col)
28            }
29            duration.second = num
30        }
31        num = 0
32        curState = stateTDesignator
33    } else {
34        return duration, wrapErr(UnknownDesignator, col)
35    }

The same approach is used for stateNumber, only with defaultDesignators instead of timeDesignators.

While strings.ContainsRune is fairly optimised by having a assembly fast path for utf8.RuneSelf (characters below RuneSelf are represented as themselves in a single byte), it still has to iterate the haystack for the needle we want to check for. This is fully unnecessary for duration parsing, since the haystack is known at compile time and can therefore be represented as a switch block instead, which we already do, just after checking for validity.

Therefore these paths can be merged into a single one, moving the error on invalid designator from the condition to the default case of the switch block:

DIFF
 1@@ -204,7 +195,7 @@ func From(s string) (Duration, error) {
 2          num = (num * 10) + digit
 3          hasNum = true
 4          curState = stateTNumber
 5- } else if strings.ContainsRune(timeDesignators, b) {
 6+ } else {
 7          if !hasNum {
 8                  return duration, wrapErr(MissingNumber, col)
 9          }
10@@ -224,11 +215,12 @@ func From(s string) (Duration, error) {
11                          return duration, wrapErr(DuplicateDesignator, col)
12                  }
13                  duration.second = num
14+         default:
15+                 return duration, wrapErr(UnknownDesignator, col)
16          }
17          num = 0
18          curState = stateTDesignator
19- } else {
20-         return duration, wrapErr(UnknownDesignator, col)
21  }

This change reduces the runtime by 28%:

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                100000000               10.51 ns/op            0 B/op          0 allocs/op
 6BenchmarkDuration/PT15H-16              80446293                14.58 ns/op            0 B/op          0 allocs/op
 7BenchmarkDuration/P1W-16                100000000               10.33 ns/op            0 B/op          0 allocs/op
 8BenchmarkDuration/P15W-16               93517448                12.75 ns/op            0 B/op          0 allocs/op
 9BenchmarkDuration/P1Y15W-16             70472916                17.19 ns/op            0 B/op          0 allocs/op
10BenchmarkDuration/P15Y-16               89461000                12.75 ns/op            0 B/op          0 allocs/op
11BenchmarkDuration/P15Y3M-16             61222558                18.65 ns/op            0 B/op          0 allocs/op
12BenchmarkDuration/P15Y3M41D-16          46391493                25.45 ns/op            0 B/op          0 allocs/op
13BenchmarkDuration/PT15M-16              80630947                14.98 ns/op            0 B/op          0 allocs/op
14BenchmarkDuration/PT15M10S-16           55008163                22.14 ns/op            0 B/op          0 allocs/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   24965916                46.99 ns/op            0 B/op          0 allocs/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   12.956s

Replace rune based for-range with []byte

Since this isnt a function call, we cant view its affect in the flamegraph directly, but we can hone in on go-iso8601-duration.From in the graph and then switch to the source-view, showing us a total time taken in the rune loop and the MaxASCII check:

Before

TEXT
1108        1.65s      1.65s           	for col, b := range s {
2109        320ms      320ms           		if b > unicode.MaxASCII {

Replacing this with a []byte cast and thus omitting the MaxASCII check:

DIFF
 1diff --git a/duration.go b/duration.go
 2index ba295d7..440e71a 100644
 3--- a/duration.go
 4+++ b/duration.go
 5@@ -6,7 +6,6 @@ import (
 6        "strconv"
 7        "strings"
 8        "time"
 9-       "unicode"
10 )
11
12 // constants for roundtripping between time.Duration and Duration
13@@ -105,11 +104,7 @@ func From(s string) (Duration, error) {
14        var num int64
15        var hasNum bool
16
17-       for col, b := range s {
18-               if b > unicode.MaxASCII {
19-                       return duration, wrapErr(UnexpectedNonAsciiRune, col)
20-               }
21-
22+       for col, b := range []byte(s) {
23                switch curState {
24                case stateStart:
25                        switch b {

Results in 200ms less time spent in the loop:

TEXT
1107        1.47s      1.47s           	for col, b := range []byte(s) {

Running the whole suite results in ~8.6% less runtime:

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                120725466                9.931 ns/op           0 B/o       0 allocs/op
 6BenchmarkDuration/PT15H-16              92520681                11.67 ns/op            0 B/o       0 allocs/op
 7BenchmarkDuration/P1W-16                100000000               10.15 ns/op            0 B/o       0 allocs/op
 8BenchmarkDuration/P15W-16               100000000               10.26 ns/op            0 B/o       0 allocs/op
 9BenchmarkDuration/P1Y15W-16             82810779                14.83 ns/op            0 B/o       0 allocs/op
10BenchmarkDuration/P15Y-16               120902158               10.29 ns/op            0 B/o       0 allocs/op
11BenchmarkDuration/P15Y3M-16             75393909                15.68 ns/op            0 B/o       0 allocs/op
12BenchmarkDuration/P15Y3M41D-16          56214748                22.31 ns/op            0 B/o       0 allocs/op
13BenchmarkDuration/PT15M-16              91787005                12.85 ns/op            0 B/o       0 allocs/op
14BenchmarkDuration/PT15M10S-16           53683030                19.45 ns/op            0 B/o       0 allocs/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   28636116                42.47 ns/op            0 B/op      0 allocs/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   14.860s

This is easily explainable when looking at the assembly output. While for col, r := range s compiles to:

for col, r := range []byte(s) compiles to a third of the instructions:

Most noticeably, the latter lacks the call to runtime.decoderune for each iteration:

ASM
1        PCDATA  $1, $0
2        CALL    runtime.decoderune(SB)

So this change not only shaves off 2/3 of the instructions, but it also removes the per iteration utf8 decoding, bounds checks and multi-byte branching. This is possible because the FSM only needs to operate on ASCII inputs (ISO8601 duration is defined on an ASCII subset), allowing the loop to treat the string as raw bytes rather than decoding runes.

Hoisting overflow checks from each numeric change to field creation

Previously the FSM validates the running int64 value wouldnt exceed the valid representation on each numeric character occurence:

GO
 1case stateT, stateTDesignator:
 2    if '0' <= b && b <= '9' {
 3        digit := int64(b - '0')
 4        if num > (math.MaxInt64-digit)/10 {
 5            return duration, DesignatorNumberTooLarge
 6        }
 7        num = (num * 10) + digit
 8        hasNum = true
 9        curState = stateTNumber
10    } else {
11        return duration, wrapErr(MissingNumber, col)
12    }

To reduce branches in the hot path I decided to test moving this overflow checking to the location where the duration fields are created and the value is actually used, however this requires widening the running int64 to uint64:

DIFF
 1- var num int64
 2+ var num uint64
 3
 4  case stateT, stateTDesignator:
 5          if '0' <= b && b <= '9' {
 6-                 digit := int64(b - '0')
 7-                 if num > (math.MaxInt64-digit)/10 {
 8-                         return duration, DesignatorNumberTooLarge
 9-                 }
10-                 num = (num * 10) + digit
11+                 num = (num * 10) + uint64(b-'0')
12                  hasNum = true
13                  curState = stateTNumber
14          } else {

And finally re-adding the check to the usage site:

DIFF
1+ if num > math.MaxInt64 {
2+         return duration, DesignatorNumberTooLarge
3+ }

This was done to have less branches in the hot path for numbers, thus has little to no impact on shorter inputs, still a average ~5% reduction, tho:

TEXT
 1goos: linux
 2goarch: amd64
 3pkg: github.com/xnacly/go-iso8601-duration
 4cpu: AMD Ryzen 7 3700X 8-Core Processor
 5BenchmarkDuration/P0D-16                100000000               10.01 ns/op
 6BenchmarkDuration/PT15H-16              100000000               11.81 ns/op
 7BenchmarkDuration/P1W-16                100000000               10.19 ns/op
 8BenchmarkDuration/P15W-16               120588932               9.779 ns/op
 9BenchmarkDuration/P1Y15W-16             86265106                13.76 ns/op
10BenchmarkDuration/P15Y-16               122859478               9.910 ns/op
11BenchmarkDuration/P15Y3M-16             77137617                14.32 ns/op
12BenchmarkDuration/P15Y3M41D-16          60263616                20.74 ns/op
13BenchmarkDuration/PT15M-16              96869085                12.30 ns/op
14BenchmarkDuration/PT15M10S-16           63139969                18.33 ns/op
15BenchmarkDuration/P3Y6M4DT12H30M5S-16   29586862                39.14 ns/op
16PASS
17ok      github.com/xnacly/go-iso8601-duration   14.812s

This change results in a contract change compared to v.1.1.0, since previously the overflow check fired on each number, now it only fires on exceeding the valid int64 size on usage. Overflowing uint64 is not handled.

Results

The base idea of these improvements is to eliminate abstractions hiding:

  • allocations
  • loops
  • generic logic

When all of these can be omitted by knowing the problem domain / what durations in the iso8601 sense are made up of, specifically:

  • only ascii input
  • no decimal points
  • compile time known, non dynamic designators
  • number and designator pairs are parsable in a single pass

When applied result in a final total runtime performance improvement compared to the original results shown in Benchmarks and a Baseline:

  • Geometric Speedup Mean: 6.38x
  • Range: 4.82x - 7.60x
  • Median: 6.49x
BenchmarkBaseline (ns)Improved (ns)Runtime reduction (%)Speedup (x)Delta (ns)
P0D64.2610.0184.436.42x54.25
PT15H76.6911.8184.606.49x64.88
P1W64.1010.1984.116.29x53.91
P15W74.359.7886.807.60x64.57
P1Y15W95.3913.7685.566.93x81.63
P15Y74.449.9186.657.51x64.53
P15Y3M96.3314.3285.156.73x82.01
P15Y3M41D123.7020.7483.255.96x102.96
PT15M76.6112.3083.966.23x64.31
PT15M10S104.8018.3382.505.72x86.47
P3Y6M4DT12H30M5S188.7039.1479.274.82x149.56