Introduction
This is the first post of a four part series around implementing and understanding the steps for interpreting arithmetic expressions. The series is meant for explaining key concepts such as lexical analysis, parsing / building the ast, walking the ast / flatting it to byte code, bytecode virtual machines and TDD centered around compilers and interpreters.
Info
The corresponding GitHub repository can be found here.This first article contains the introduction to our problem domain, the setup of our project, the basics of TDD and the first steps towards interpreting arithmetic expressions: tokenizing our input / performing lexical analysis
The second article will be centered around converting the list of tokens we created in the first article to an abstract syntax tree, short ast
The third article will be focused on walking the abstract syntax tree and converting nodes into a list of instructions for our virtual machine
The fourth and last article will be consisting of implementing the bytecode virtual machine and executing expressions compiled to bytecode
All posts in this series will include an extensive amount of resources for the readers deeper understanding of the matter. The following books and articles are commonly referenced in the compiler and interpreter space:
- Writing An Interpreter in Go by Thorsten Ball (go) - TDD, scratches the surface
- Crafting Interpreters by Robert Nystrom (java & c) - very detailed, includes data structures and in depth topics, such as grammars, hashing, etc
- Compilers: Principles, Techniques, and Tools (java) - very theoretic
We will be using the go programming language in this series. All concepts can be applied to any programming language. Some level of proficiency with go is assumed, I will not explain syntax.
Problem domain
The expression we want to be able to execute with our interpreter is the smallest subset of a programming language i could imagine thus our problem domain is defined by a subset of arithmetic expressions:
- addition
- subtraction
- multiplication
- division
We will also support braces that can be used to indicate precedence, which we will talk about in the second post of this series.
Expressions
Some examples of expression we will accept:
1# comments
2
3# addition
41029+0.129
5# =1029.129
6
7# subtraction
85_120_129-5_120_128
9# =1
10
11# multiplication
125e12*3
13# =150000
14
15# division
161/2
17# =0.5
18
19# braces
20(1+1)/2
21# =1
Interpreter design
There are several well established ways of interpreting programming languages. Lets take a look at the stages an interpreter commonly performs before the passed in source code is interpreted.
Interpreter Stages
Lexical analysis -> the process of recognizing structures such as numbers, identifiers or symbols in the source code , converts recognized structures into a list of token, often referred to as scanning, lexing or tokenizing.
Parsing -> refers to the process of detecting precedence and checking whether the input conforms to the defined grammar. In our case the parser analyses the output of our previous stage and builds an abstract syntax tree while taking operator precedence into account.
Evaluating -> commonly means walking the tree starting from the leafs, computing a value for each node and exiting on computing the value of the root, see tree walk interpreter. While this is the simplest way to implement an interpreter, it introduces performance issues due to requiring recursion and the overhead of traversing many pointers. There are multiple ways of implementing this step, each one with different pros and cons:
- Tree walk / Abstract syntax tree interpreters: too slow, too much overhead
- Just-in-time compiler: too complex / too hard to implement
- bytecode interpreter: medium hard to implement, faster than tree walk interpreter
For our interpreter we decide to use the idea of bytecode interpreters, thus splitting the third step into two sub steps::
- Compiling to bytecode -> Walking the tree and compiling each node into bytecode
- Executing bytecode -> Iterating over the list of bytecode instructions and executing them
Example
Consider the following statement and lets visualize the stages using the example:
11.025*3+1
Lexical analysis
Performing the first stage converts the input from a character stream into a list of structures:
1token := []Token{
2 Token{Type: NUMBER, Raw: "1.025"},
3 Token{Type: ASTERIKS, Raw: "*"},
4 Token{Type: NUMBER, Raw: "3"},
5 Token{Type: PLUS, Raw: "+"},
6 Token{Type: NUMBER, Raw: "3"},
7}
Parsing
We now build an abstract syntax tree out of the list of token we get from the previous stage:
1ast := Addition{
2 Token: Token{Type: PLUS, Raw: "+"},
3 Left: Multiplication{
4 Token: Token{Type: ASTERIKS, Raw: "*"},
5 Left: Number{
6 Token: Token{Type: NUMBER, Raw: "1.025"},
7 },
8 Right: Number{
9 Token: Token{Type: NUMBER, Raw: "3"}
10 },
11 },
12 Right: Number{Token: NUMBER, Raw: "1"},
13}
Notice the depth of the tree, the deeper the node sits the earlier it is compiled to bytecode, thus considering operator precedence, see below for a visual explanation:
Initial ast
TEXT1+ 2│ 3├─ 1 4│ 5└─ * 6 │ 7 ├─ 1.025 8 │ 9 └─ 3
Multiplication
evaluated:TEXT1+ 2│ 3├─ 1 4│ 5└─ 3.075
Addition
evaluated:TEXT14.075
Compiling to bytecode
We use the AST we got from the previous step to compile each node to a list of bytecode instructions. The bottom most node, commonly referred to as leafs are all numbers, thus we will start there.
The bytecode VM we want to implement has a list of registers, comparable to the CPU registers on a real machine. We can load and manipulate values in these registers. In the third and fourth part of this series, we will go into great depth on registers, bytecode and virtual machines. For now simply know there are registers, we can manipulate them, our VM accepts an instruction and an argument.
Lets now take a look at the bytecode our previous example compiles to:
1;; multiplication
2 ;; loading 1.025 into register 0
3 OP_LOAD :: 1.025000
4 ;; moving 1.025 from register 0 to register 1
5 OP_STORE :: 1.000000
6
7 ;; loading 3 into register 0
8 OP_LOAD :: 3.000000
9 ;; multiplying the value of register 0
10 ;; with the value of register 1
11 OP_MULTIPY :: 1.000000
12
13 ;; storing the result of the
14 ;; multiplication in register 1
15 OP_STORE :: 1.000000
16
17;; addition
18 ;; loading 1 into register 0
19 OP_LOAD :: 1.000000
20 ;; adding the value of register 1
21 ;; to the value of register 0
22 OP_ADD :: 1.000000
The left hand side of each line is the operation the virtual machine is
executing, the right hand side is the argument of the operation, sides are
separated with ::
.
This should suffice as a high level overview over the steps we want to take until we reach the integer result of our expression, starting from the source code, tokenizing, parsing, compiling and executing bytecode.
Project setup
Tip
All code snippets used in this series start with a comment specifying the file
it points to. Code not relevant to the current topic is omitted but still
notated using [...]
.
1// main.go
2package main
3
4// [...]
5
6func main() { }
If a new file should be created it will be explicitly stated.
Code snippets starting with a $
must be executed in a shell:
1$ echo "This is a shell"
Creating a directory for our project:
TEXT1$ mkdir calc
Entry point
Using go we can start with the bare minimum a project requires:
GO1// main.go 2package main 3 4import "fmt" 5 6func main() { 7 fmt.Println("Hello World!") 8}
Running the above via
go run .
requires the creation of the projectsgo.mod
file:Initialising the project
TEXT1$ go mod init calc
Building and running the source code
TEXT1$ go run . 2Hello World!
All of our source code will live in the root of the project. Currently we have
main.go
and go.mod
in the root of our project.
Test driven development
Info
Test driven development refers to the process of defining a problem domain for a function, creating the corresponding test, preferably with as much edge cases as possible and incrementing the implementation of the function to satisfy all test cases.As we are implementing an interpreter both the input to our function and the
output of our function is known and therefore easily representable with tests
which screams we should use TDD and iterate until all tests are passing. We
will create our tests once we defined the different kinds a token can represent
and the Token
structure.
Tokenising
Leaving the above behind, lets now get to the gist of this part of the series: the tokeniser. Our main goal is to step through the source code we input and convert it to different tokens and afterwards spitting out this list of tokens.
Lets get started, create a new file in the projects directory, beside main.go
and go.sum
called lexer.go
:
1// lexer.go
2package main
3
4import (
5 "bufio"
6 "io"
7 "log"
8 "strings"
9)
For now this will be enough, we will fill this file with content in the following sections.
Token and Types of Token
In the classical sense a lexical token refers to a list of characters with an assigned meaning, see lexical token and lexical tokenisation and remember the first step of the example.
To define the meaning we attach to this list of characters we will define a list of possible meanings we want to support in our interpreter, remember our problem domain.
1// lexer.go
2package main
3
4const (
5 TOKEN_UNKNOWN = iota + 1
6
7 TOKEN_NUMBER
8
9 // symbols
10 TOKEN_PLUS
11 TOKEN_MINUS
12 TOKEN_ASTERISK
13 TOKEN_SLASH
14
15 // structure
16 TOKEN_BRACE_LEFT
17 TOKEN_BRACE_RIGHT
18
19 TOKEN_EOF
20)
Iota - the Go way of doing enums
Go does not have enums, it hasiota
, see the go spec
here. Its not a particularly good system but
its good enough for our use case - it basically increments all consts by 1 in a
const block starting from 0. Therefore TOKEN_UNKNOWN
is set to 1 and
TOKEN_EOF
to 9.We will now define the structure holding the detected structure, its type and its raw value:
1// lexer.go
2package main
3
4// [...] token kind definition
5
6type Token struct {
7 Type int
8 Raw string
9}
We defined the structure to hold tokens we found in the source code and their types.
Tests
Now lets get started with writing tests. Per go
convention we create a new file postfixed with
_test.go
. This lexer_test.go
file contains all tests for the tokeniser and
exists beside all previously created files in the root of the directory.
So lets create the foundation for our tests - we will make use of an idea called table driven tests:
1// lexer_test.go
2package main
3
4import (
5 "testing"
6 "strings"
7
8 "github.com/stretchr/testify/assert"
9)
10
11func TestLexer(t *testing.T) {
12 tests := []struct{
13 Name string
14 In string
15 Out []Token
16 }{}
17 for _, test := range tests {
18 t.Run(test.Name, func(t *testing.T) {
19 in := strings.NewReader(test.In)
20 out := NewLexer(in).Lex()
21 assert.EqualValues(t, test.Out, out)
22 })
23 }
24}
We use the assert.EqualValues
function to compare our expected and the actual
resulting arrays.
Tip
Including external packages in our project is a simple as running the following command:
1$ go get <repo>
For our example we want the github.com/stretchr/testify/assert
package, thus we execute:
1$ go get github.com/stretchr/testify/assert
Lets add our first test - an edge case - specifically the case of an empty
input for which we expect only one structure Token
with Token.Type: TOKEN_EOF
in the resulting token list.
1// lexer_test.go
2package main
3
4import (
5 "testing"
6 "strings"
7
8 "github.com/stretchr/testify/assert"
9)
10
11func TestLexer(t *testing.T) {
12 tests := []struct{
13 Name string
14 In string
15 Out []Token
16 }{
17 {
18 Name: "empty input",
19 In: "",
20 Out: []Token{
21 {Type: TOKEN_EOF, Raw: "TOKEN_EOF"},
22 },
23 },
24 }
25 for _, test := range tests {
26 t.Run(test.Name, func(t *testing.T) {
27 in := strings.NewReader(test.In)
28 out := NewLexer(in).Lex()
29 assert.EqualValues(t, test.Out, out)
30 })
31 }
32}
Running our tests with go test ./... -v
will result in an error simply
because we have not yet defined our Lexer:
1$ go test ./... -v
2# calc [calc.test]
3./lexer_test.go:35:11: undefined: NewLexer
4FAIL calc [build failed]
5FAIL
Debugging
If we try to print our Token
structure we will see the Token.Type
as an
integer, for example:
1package main
2
3func main() {
4 t := Token{Type: TOKEN_NUMBER, Raw: "12"}
5 fmt.Printf("Token{Type: %d, Raw: %s}\n", t.Type, t.Raw)
6}
This would of course not result in the output we want, due to the enum defining token types as integers:
1$ go run .
2Token{Type: 2, Raw: 12}
Therefore we add the TOKEN_LOOKUP
hash map:
1// lexer.go
2package main
3
4// [...] imports
5
6// [...] token types generation
7
8var TOKEN_LOOKUP = map[int]string{
9 TOKEN_UNKNOWN: "UNKNOWN",
10 TOKEN_NUMBER: "TOKEN_NUMBER",
11 TOKEN_PLUS: "TOKEN_PLUS",
12 TOKEN_MINUS: "TOKEN_MINUS",
13 TOKEN_ASTERISK: "TOKEN_ASTERISK",
14 TOKEN_SLASH: "TOKEN_SLASH",
15 TOKEN_BRACE_LEFT: "TOKEN_BRACE_LEFT",
16 TOKEN_BRACE_RIGHT: "TOKEN_BRACE_RIGHT",
17 TOKEN_EOF: "EOF",
18}
Tip
With vim the above is extremely easy to generate, simply copy the before defined types of tokens, paste them into the map, remove= iota +1
, white space and
comments. Afterwards mark them again with Shift+v
. Now regex all over the
place by typing :'<,'>s/\([A-Z_]\+\)/\1: "\1",
, this creates a capture group
for all upper case characters found one or more times, this group is reused in
the substitute replace part of the command (second part of the command,
infix split by /
) and replaces all \1
with the captured name, thus
filling the map.If we were to now update our previous example to using the new TOKEN_LOOKUP
map we notice it now works correctly:
1package main
2
3import "fmt"
4
5func main() {
6 t := Token{Type: TOKEN_NUMBER, Raw: "12"}
7 fmt.Printf("Token{Type: %s, Raw: %s}\n", TOKEN_LOOKUP[t.Type], t.Raw)
8}
1$ go run .
2Token{Type: TOKEN_NUMBER, Raw: 12}
Lexer overview
After establishing our debug capabilities we now can move on to creating the
Lexer
and defining our tokenisers API:
1// lexer.go
2package main
3
4// [...] Imports, token types, token struct, TOKEN_LOOKUP map
5
6type Lexer struct {
7 scanner bufio.Reader
8 cur rune
9}
10
11func NewLexer(reader io.Reader) *Lexer { }
12
13func (l *Lexer) Lex() []Token { }
14
15func (l *Lexer) number() Token { }
16
17func (l *Lexer) advance() { }
The Lexer
structure holds a scanner we will create in the NewLexer
function
this function accepts an unbuffered reader which we will wrap into a buffered
reader for stepping trough the source in an optimized fashion. The function
returns a Lexer structure. The cur
field holds the current character.
The heart of the tokeniser is the Lexer.Lex
method. It iterates over all
characters in the buffered reader and tries to recognize structures.
The Lexer.number
method is called when an number is detected, it then
iterates until the current character is no longer a part of a number and
returns a Token
structure.
Lexer.advance
requests the next character from the buffered scanner and sets
Lexer.cur
to the resulting character.
Tip
Definitions
Method vs function -> I refer to standalone functions as functions - to functions attached to structures as methods Thus
NewLexer
is a function andLexer.Lex
is a method. But to each their own - I don’t really care 😼.Number vs integer vs digit -> Here I define a number as 1 or more characters between 0 and 9, I extend this definition with
e
,.
and_
in between the first number and all following numbers. Thus I consider the following numbers as valid for this interpreter:1e5
12.5
0.5
5_000_000
Creating the lexer
As introduced before the NewLexer
function creates the lexer:
1// lexer.go
2package main
3
4// [...] Imports, token types, token struct, TOKEN_LOOKUP map
5
6type Lexer struct {
7 scanner bufio.Reader
8 cur rune
9}
10
11func NewLexer(reader io.Reader) *Lexer {
12 l := &Lexer{
13 scanner: *bufio.NewReader(reader),
14 }
15 l.advance()
16 return l
17}
This function accepts a reader, creates a new Lexer
structure, converts the
reader to a buffered Reader
, assigns it to
the Lexer
structure and afterwards invokes the Lexer.advance
function we
will discuss in the next chapter.
Advancing in the Input
Stepping through the source code is as easy as requesting a new character from
our buffered reader via the bufio.Reader.ReadRune()
method:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) advance() {
7 r, _, err := l.scanner.ReadRune()
8 if err != nil {
9 l.cur = 0
10 } else {
11 l.cur = r
12 }
13}
The ReadRune
function returns an error once the end of the file is hit, to
indicate this to our Lexer.Lex
function we will set the Lexer.cur
field to
0
.
Tip
End of file is often referred to asEOF
.We will now focus on the heart of the tokeniser: Lexer.Lex()
:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) Lex() []Token {
7 t := make([]Token, 0)
8 for l.cur != 0 {
9 l.advance()
10 }
11 t = append(t, Token{
12 Type: TOKEN_EOF,
13 Raw: "TOKEN_EOF",
14 })
15 return t
16}
We firstly create a new slice of type []Token
, we will fill with tokens we
find while stepping through the source code. The while loop iterates until we
hit the EOF
by calling *Lexer.advance()
. To indicate the ending of our
token list we append a token of type TOKEN_EOF
to the slice t
.
After defining the NewLexer
and the *Lexer.Lex
we can try running our tests
again:
1$ go test ./... -v
2=== RUN TestLexer
3=== RUN TestLexer/empty_input
4--- PASS: TestLexer (0.00s)
5 --- PASS: TestLexer/empty_input (0.00s)
6PASS
7ok calc 0.002s
Thus we know our lexer works correctly for empty inputs.
Ignoring white space
Every good programming language ignores white space and so do we (looking at
you Python). White space is
commonly defined as a new line: '\n'
/ '\r'
, a tab '\t'
or a space ' '
.
Lets add a new test case called whitespace
to our white space tests:
1// lexer_test.go
2package main
3
4// [...]
5
6func TestLexer(t *testing.T) {
7 tests := []struct {
8 Name string
9 In string
10 Out []Token
11 }{
12 // [...]
13 {
14 Name: "whitespace",
15 In: "\r\n\t ",
16 Out: []Token{
17 {TOKEN_EOF, "TOKEN_EOF"},
18 },
19 },
20 }
21 // [...]
22}
Having defined what we want as the output, lets get started with ignoring white space:
To check if the current character matches any of the above we introduce a switch case statement:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) Lex() []Token {
7 t := make([]Token, 0)
8 for l.cur != 0 {
9 switch l.cur {
10 case ' ', '\n', '\t', '\r':
11 l.advance()
12 continue
13 }
14
15 l.advance()
16 }
17 t = append(t, Token{
18 Type: TOKEN_EOF,
19 Raw: "TOKEN_EOF",
20 })
21 return t
22}
Lets run our tests and check if everything worked out the way we wanted it to:
1$ go test ./... -v
2=== RUN TestLexer
3=== RUN TestLexer/empty_input
4=== RUN TestLexer/whitespace
5--- PASS: TestLexer (0.00s)
6 --- PASS: TestLexer/empty_input (0.00s)
7 --- PASS: TestLexer/whitespace (0.00s)
8PASS
9ok calc 0.001s
Seems like we ignored whitespace the right way 😼.
Support for comments
Lets add a very similar test as we added in the previous chapter to check if we ignore comments correctly:
1// lexer_test.go
2package main
3
4// [...]
5
6func TestLexer(t *testing.T) {
7 tests := []struct {
8 Name string
9 In string
10 Out []Token
11 }{
12 // [...]
13 {
14 Name: "comment",
15 In: "# this is a comment\n# this is a comment without a newline at the end",
16 Out: []Token{
17 {TOKEN_EOF, "TOKEN_EOF"},
18 },
19 },
20 }
21 // [...]
22}
To ignore comments, we add a new case to our switch statement:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) Lex() []Token {
7 // [...]
8 for l.cur != 0 {
9 switch l.cur {
10 case '#':
11 for l.cur != '\n' && l.cur != 0 {
12 l.advance()
13 }
14 continue
15 case ' ', '\n', '\t', '\r':
16 l.advance()
17 continue
18 }
19 l.advance()
20 }
21 // [...]
22}
We want our comments to start with #
, therefore we enter the case if the
current character is a #
. Once in the case we call *Lexer.advance()
until
we either hit a newline or EOF
- both causing the loop to stop.
Lets again run our tests:
1$ go test ./... -v
2=== RUN TestLexer
3=== RUN TestLexer/empty_input
4=== RUN TestLexer/whitespace
5=== RUN TestLexer/comment
6--- PASS: TestLexer (0.00s)
7 --- PASS: TestLexer/empty_input (0.00s)
8 --- PASS: TestLexer/whitespace (0.00s)
9 --- PASS: TestLexer/comment (0.00s)
10PASS
11ok calc 0.001s
Detecting special symbols
Having added tests for empty input, ignoring white space and comments, we will now add a new test for the symbols we want to recognize in out input:
1// lexer_test.go
2package main
3
4
5// [...]
6
7func TestLexer(t *testing.T) {
8 tests := []struct {
9 Name string
10 In string
11 Out []Token
12 }{
13 // [...]
14 {
15 Name: "symbols",
16 In: "+-/*()",
17 Out: []Token{
18 {TOKEN_PLUS, "+"},
19 {TOKEN_MINUS, "-"},
20 {TOKEN_SLASH, "/"},
21 {TOKEN_ASTERISK, "*"},
22 {TOKEN_BRACE_LEFT, "("},
23 {TOKEN_BRACE_RIGHT, ")"},
24 {TOKEN_EOF, "TOKEN_EOF"},
25 },
26 },
27 }
28 // [...]
29}
Running our tests including the above at the current state of out
implementation will result in the following assert
Error:
1$ go test ./...
2--- FAIL: TestLexer (0.00s)
3 --- FAIL: TestLexer/symbols (0.00s)
4 lexer_test.go:56:
5 Error Trace: ./lexer_test.go:56
6 Error: Not equal:
7 expected: []main.Token{main.Token{Type:3, Raw:"+"}, main.Token{Type:4, Raw:"-"}, main.Token{Type:6, Raw:"/"}, main.Token{Type:5, Raw:"*"}, main.Token{Type:7, Raw:"("}, main.Token{Type:8, Raw:")"}, main.Token{Type:9, Raw:"TOKEN_EOF"}}
8 actual : []main.Token{main.Token{Type:9, Raw:"TOKEN_EOF"}}
9// [...]
10 Test: TestLexer/symbols
11FAIL
12FAIL calc 0.004s
13FAIL
Implementing support for the symbols we want should fix this issue.
Our first step towards this goal is to define a new variable called ttype
holding the type of token we recognized:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) Lex() []Token {
7 // [...]
8 for l.cur != 0 {
9 // [...]
10 ttype := TOKEN_UNKNOWN
11 // [...]
12 l.advance()
13 }
14 // [...]
15}
We use this variable to insert detected tokens into our t
array, if the value
of ttype
didn’t change and is still TOKEN_UNKNOWN
we display an error and exit:
1// lexer.go
2package main
3
4import (
5 // [...]
6 "log"
7)
8// [...]
9
10func (l *Lexer) Lex() []Token {
11 // [...]
12 for l.cur != 0 {
13 // [...]
14 ttype := TOKEN_UNKNOWN
15 // [...]
16 if ttype != TOKEN_UNKNOWN {
17 t = append(t, Token{
18 Type: ttype,
19 Raw: string(l.cur),
20 })
21 } else {
22 log.Fatalf("unknown %q in input", l.cur)
23 }
24
25 l.advance()
26 }
27 // [...]
28}
For now this concludes our error handling, not great - i know. Our next step is to add cases to our switch to react to differing characters:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) Lex() []Token {
7 // [...]
8 for l.cur != 0 {
9 // [...]
10 switch l.cur {
11 // [...]
12 case '+':
13 ttype = TOKEN_PLUS
14 case '-':
15 ttype = TOKEN_MINUS
16 case '/':
17 ttype = TOKEN_SLASH
18 case '*':
19 ttype = TOKEN_ASTERISK
20 case '(':
21 ttype = TOKEN_BRACE_LEFT
22 case ')':
23 ttype = TOKEN_BRACE_RIGHT
24 }
25 // [...]
26 l.advance()
27 }
28 // [...]
29}
We can now once again run our tests:
1$ go test ./... -v
2calc master M :: go test ./... -v
3=== RUN TestLexer
4=== RUN TestLexer/empty_input
5=== RUN TestLexer/whitespace
6=== RUN TestLexer/comment
7=== RUN TestLexer/symbols
8--- PASS: TestLexer (0.00s)
9 --- PASS: TestLexer/empty_input (0.00s)
10 --- PASS: TestLexer/whitespace (0.00s)
11 --- PASS: TestLexer/comment (0.00s)
12 --- PASS: TestLexer/symbols (0.00s)
13PASS
14ok calc 0.003s
And we pass our tests, the only feature missing from our tokeniser is detecting numbers.
Support for integers and floating point numbers
As introduced before i want to support numbers with several infixes, such as _
, e
and .
.
Go ahead and add some tests for these cases:
1// lexer_test.go
2package main
3
4
5// [...]
6
7func TestLexer(t *testing.T) {
8 tests := []struct {
9 Name string
10 In string
11 Out []Token
12 }{
13 // [...]
14 {
15 Name: "number",
16 In: "123",
17 Out: []Token{
18 {TOKEN_NUMBER, "123"},
19 {TOKEN_EOF, "TOKEN_EOF"},
20 },
21 },
22 {
23 Name: "number with underscore",
24 In: "10_000",
25 Out: []Token{
26 {TOKEN_NUMBER, "10_000"},
27 {TOKEN_EOF, "TOKEN_EOF"},
28 },
29 },
30 {
31 Name: "number with e",
32 In: "10e5",
33 Out: []Token{
34 {TOKEN_NUMBER, "10e5"},
35 {TOKEN_EOF, "TOKEN_EOF"},
36 },
37 },
38 {
39 Name: "number with .",
40 In: "0.005",
41 Out: []Token{
42 {TOKEN_NUMBER, "0.005"},
43 {TOKEN_EOF, "TOKEN_EOF"},
44 },
45 },
46 }
47 // [...]
48}
Lets add a default
-case to our switch statement:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) Lex() []Token {
7 // [...]
8 for l.cur != 0 {
9 // [...]
10 switch l.cur {
11 // [...]
12 default:
13 if (l.cur >= '0' && l.cur <= '9') || l.cur == '.' {
14 t = append(t, l.number())
15 continue
16 }
17 }
18 // [...]
19 }
20 // [...]
21}
As one should notice we have yet to define the *Lexer.number
function:
1// lexer.go
2package main
3
4// [...]
5
6func (l *Lexer) number() Token {
7 b := strings.Builder{}
8 for (l.cur >= '0' && l.cur <= '9') || l.cur == '.' || l.cur == '_' || l.cur == 'e' {
9 b.WriteRune(l.cur)
10 l.advance()
11 }
12 return Token{
13 Raw: b.String(),
14 Type: TOKEN_NUMBER,
15 }
16}
The function makes use of the strings.Builder
structure. This is used to omit
copying the string which we would have to do if we simply used string+string
.
We iterate while our character matches what we want and write to the
strings.Builder
structure. Upon hitting a character we do not accept the loop
stops and the function returns a Token
-Structure with the result of the
strings.Builder
we defined and wrote to previously.
Combining the previously added default
-case and our new *Lexer.number()
function we added support for numbers starting with 0-9
or .
. We support
infixes such as _
, .
, _
and e
- exactly matching our test cases, thus
we can now once again check if our tests pass:
1=== RUN TestLexer
2=== RUN TestLexer/empty_input
3=== RUN TestLexer/whitespace
4=== RUN TestLexer/comment
5=== RUN TestLexer/symbols
6=== RUN TestLexer/number
7=== RUN TestLexer/number_with_underscore
8=== RUN TestLexer/number_with_e
9=== RUN TestLexer/number_with_.
10--- PASS: TestLexer (0.00s)
11 --- PASS: TestLexer/empty_input (0.00s)
12 --- PASS: TestLexer/whitespace (0.00s)
13 --- PASS: TestLexer/comment (0.00s)
14 --- PASS: TestLexer/symbols (0.00s)
15 --- PASS: TestLexer/number (0.00s)
16 --- PASS: TestLexer/number_with_underscore (0.00s)
17 --- PASS: TestLexer/number_with_e (0.00s)
18 --- PASS: TestLexer/number_with_. (0.00s)
19PASS
20ok calc 0.003s
Calling our Tokeniser
Out tests pass - we can finally move on to my favorite part of every
programming project: passing input via the command line to our program and
seeing the output. Doing so requires some packages. We need os
to access the
command line arguments our program was called with, we need strings
to create
a io.Reader
for the parameter our tokeniser requires. Furthermore we include
the log
package and promptly disable all prefixes, timestamps, etc. by
invoking log.SetFlags
with 0 as the argument.
1package main
2
3import (
4 "log"
5 "os"
6 "strings"
7)
8
9func main() {
10 log.SetFlags(0)
11 if len(os.Args) != 2 {
12 log.Fatalln("missing input")
13 }
14
15 input := os.Args[1]
16
17 token := NewLexer(strings.NewReader(input)).Lex()
18}
Tip
When an executable build with go is started it can access the arguments passed to it via the os.Args
slice:
1// main.go
2package main
3
4import (
5 "fmt"
6 "os"
7)
8
9func main() {
10 fmt.Println(os.Args)
11}
1$ go build .
2$ ./main arg1 arg2 arg3
3[./main arg1 arg2 arg3]
The 0 index is always the name of the executable.
We got our tokens but we haven’t printed them yet, so we create a helper method
called debugToken
- we first print the header of our table and afterwards
iterate through our list of Token
structures, printing them one by one.
1// main.go
2package main
3
4// [...]
5
6func debugToken(token []Token) {
7 log.Printf("%5s | %20s | %15s \n\n", "index", "type", "raw")
8 for i, t := range token {
9 log.Printf("%5d | %20s | %15s \n", i, TOKEN_LOOKUP[t.Type], t.Raw)
10 }
11}
12
13func main() {
14 log.SetFlags(0)
15 if len(os.Args) != 2 {
16 log.Fatalln("missing input")
17 }
18
19 input := os.Args[1]
20
21 token := NewLexer(strings.NewReader(input)).Lex()
22 debugToken(token)
23}
Running our program with an expression of our choice results in a table of lexemes we recognized
1$ go run . "100_000+.5*(42-3.1415)/12"
2index | type | raw
3
4 0 | TOKEN_NUMBER | 100_000
5 1 | TOKEN_PLUS | +
6 2 | TOKEN_NUMBER | .5
7 3 | TOKEN_ASTERISK | *
8 4 | TOKEN_BRACE_LEFT | (
9 5 | TOKEN_NUMBER | 42
10 6 | TOKEN_MINUS | -
11 7 | TOKEN_NUMBER | 3.1415
12 8 | TOKEN_BRACE_RIGHT | )
13 9 | TOKEN_SLASH | /
14 10 | TOKEN_NUMBER | 12
15 11 | EOF | TOKEN_EOF
Closing Words
We did it, we read our input, recognized lexemes in it, stored theses in structures and attached meaning we require for further processing to them. All while employing TDD and tests the way the go god fathers intended us to.
The next part will probably take a while longer to make, this one already took me a week to write. Keep an eye on my RSS feed - I’m very excited to finally fully understand precedence parsing and writing the next blog article.