Tokenizing Arithmetic expressions - calculator p.1

Table of Contents

Tags:

Introduction

This is the first post of a four part series around implementing and understanding the steps for interpreting arithmetic expressions. The series is meant for explaining key concepts such as lexical analysis, parsing / building the ast, walking the ast / flatting it to byte code, bytecode virtual machines and TDD centered around compilers and interpreters.

Info

The corresponding GitHub repository can be found here.
  1. This first article contains the introduction to our problem domain, the setup of our project, the basics of TDD and the first steps towards interpreting arithmetic expressions: tokenizing our input / performing lexical analysis

  2. The second article will be centered around converting the list of tokens we created in the first article to an abstract syntax tree, short ast

  3. The third article will be focused on walking the abstract syntax tree and converting nodes into a list of instructions for our virtual machine

  4. The fourth and last article will be consisting of implementing the bytecode virtual machine and executing expressions compiled to bytecode

All posts in this series will include an extensive amount of resources for the readers deeper understanding of the matter. The following books and articles are commonly referenced in the compiler and interpreter space:

We will be using the go programming language in this series. All concepts can be applied to any programming language. Some level of proficiency with go is assumed, I will not explain syntax.

Problem domain

The expression we want to be able to execute with our interpreter is the smallest subset of a programming language i could imagine thus our problem domain is defined by a subset of arithmetic expressions:

  • addition
  • subtraction
  • multiplication
  • division

We will also support braces that can be used to indicate precedence, which we will talk about in the second post of this series.

Expressions

Some examples of expression we will accept:

PYTHON
 1# comments
 2
 3# addition
 41029+0.129
 5# =1029.129
 6
 7# subtraction
 85_120_129-5_120_128
 9# =1
10
11# multiplication
125e12*3
13# =150000
14
15# division
161/2
17# =0.5
18
19# braces
20(1+1)/2
21# =1

Interpreter design

There are several well established ways of interpreting programming languages. Lets take a look at the stages an interpreter commonly performs before the passed in source code is interpreted.

Interpreter Stages

  1. Lexical analysis -> the process of recognizing structures such as numbers, identifiers or symbols in the source code , converts recognized structures into a list of token, often referred to as scanning, lexing or tokenizing.

  2. Parsing -> refers to the process of detecting precedence and checking whether the input conforms to the defined grammar. In our case the parser analyses the output of our previous stage and builds an abstract syntax tree while taking operator precedence into account.

  3. Evaluating -> commonly means walking the tree starting from the leafs, computing a value for each node and exiting on computing the value of the root, see tree walk interpreter. While this is the simplest way to implement an interpreter, it introduces performance issues due to requiring recursion and the overhead of traversing many pointers. There are multiple ways of implementing this step, each one with different pros and cons:

For our interpreter we decide to use the idea of bytecode interpreters, thus splitting the third step into two sub steps::

  1. Compiling to bytecode -> Walking the tree and compiling each node into bytecode
  2. Executing bytecode -> Iterating over the list of bytecode instructions and executing them

Example

Consider the following statement and lets visualize the stages using the example:

PYTHON
11.025*3+1

Lexical analysis

Performing the first stage converts the input from a character stream into a list of structures:

GO
1token := []Token{
2    Token{Type: NUMBER, Raw: "1.025"},
3    Token{Type: ASTERIKS, Raw: "*"},
4    Token{Type: NUMBER, Raw: "3"},
5    Token{Type: PLUS, Raw: "+"},
6    Token{Type: NUMBER, Raw: "3"},
7}

Parsing

We now build an abstract syntax tree out of the list of token we get from the previous stage:

GO
 1ast := Addition{
 2    Token: Token{Type: PLUS, Raw: "+"},
 3    Left: Multiplication{
 4        Token: Token{Type: ASTERIKS, Raw: "*"},
 5        Left: Number{
 6            Token: Token{Type: NUMBER, Raw: "1.025"},
 7        },
 8        Right: Number{
 9            Token: Token{Type: NUMBER, Raw: "3"}
10        },
11    },
12    Right: Number{Token: NUMBER, Raw: "1"},
13}

Notice the depth of the tree, the deeper the node sits the earlier it is compiled to bytecode, thus considering operator precedence, see below for a visual explanation:

  1. Initial ast

    TEXT
    1+
    23├─ 1
    45└─ *
    67   ├─ 1.025
    89   └─ 3
  2. Multiplication evaluated:

    TEXT
    1+
    23├─ 1
    45└─ 3.075
  3. Addition evaluated:

    TEXT
    14.075

Compiling to bytecode

We use the AST we got from the previous step to compile each node to a list of bytecode instructions. The bottom most node, commonly referred to as leafs are all numbers, thus we will start there.

The bytecode VM we want to implement has a list of registers, comparable to the CPU registers on a real machine. We can load and manipulate values in these registers. In the third and fourth part of this series, we will go into great depth on registers, bytecode and virtual machines. For now simply know there are registers, we can manipulate them, our VM accepts an instruction and an argument.

Lets now take a look at the bytecode our previous example compiles to:

ASM
 1;; multiplication
 2    ;; loading 1.025 into register 0
 3    OP_LOAD    :: 1.025000
 4    ;; moving 1.025 from register 0 to register 1
 5    OP_STORE   :: 1.000000
 6
 7    ;; loading 3 into register 0
 8    OP_LOAD    :: 3.000000
 9    ;; multiplying the value of register 0
10    ;; with the value of register 1
11    OP_MULTIPY :: 1.000000
12
13    ;; storing the result of the
14    ;; multiplication in register 1
15    OP_STORE   :: 1.000000
16
17;; addition
18    ;; loading 1 into register 0
19    OP_LOAD    :: 1.000000
20    ;; adding the value of register 1
21    ;; to the value of register 0
22    OP_ADD     :: 1.000000

The left hand side of each line is the operation the virtual machine is executing, the right hand side is the argument of the operation, sides are separated with ::.

This should suffice as a high level overview over the steps we want to take until we reach the integer result of our expression, starting from the source code, tokenizing, parsing, compiling and executing bytecode.

Project setup

Tip

All code snippets used in this series start with a comment specifying the file it points to. Code not relevant to the current topic is omitted but still notated using [...].

GO
1// main.go
2package main
3
4// [...]
5
6func main() { }

If a new file should be created it will be explicitly stated.

Code snippets starting with a $ must be executed in a shell:

TEXT
1$ echo "This is a shell"
  1. Creating a directory for our project:

    TEXT
    1$ mkdir calc
  2. Entry point

    Using go we can start with the bare minimum a project requires:

    GO
    1// main.go
    2package main
    3
    4import "fmt"
    5
    6func main() {
    7    fmt.Println("Hello World!")
    8}

    Running the above via go run . requires the creation of the projects go.mod file:

  3. Initialising the project

    TEXT
    1$ go mod init calc
  4. Building and running the source code

    TEXT
    1$ go run .
    2Hello World!

All of our source code will live in the root of the project. Currently we have main.go and go.mod in the root of our project.

Test driven development

Info

Test driven development refers to the process of defining a problem domain for a function, creating the corresponding test, preferably with as much edge cases as possible and incrementing the implementation of the function to satisfy all test cases.

As we are implementing an interpreter both the input to our function and the output of our function is known and therefore easily representable with tests which screams we should use TDD and iterate until all tests are passing. We will create our tests once we defined the different kinds a token can represent and the Token structure.

Tokenising

Leaving the above behind, lets now get to the gist of this part of the series: the tokeniser. Our main goal is to step through the source code we input and convert it to different tokens and afterwards spitting out this list of tokens.

Lets get started, create a new file in the projects directory, beside main.go and go.sum called lexer.go:

GO
1// lexer.go
2package main
3
4import (
5	"bufio"
6	"io"
7	"log"
8	"strings"
9)

For now this will be enough, we will fill this file with content in the following sections.

Token and Types of Token

In the classical sense a lexical token refers to a list of characters with an assigned meaning, see lexical token and lexical tokenisation and remember the first step of the example.

To define the meaning we attach to this list of characters we will define a list of possible meanings we want to support in our interpreter, remember our problem domain.

GO
 1// lexer.go
 2package main
 3
 4const (
 5    TOKEN_UNKNOWN = iota + 1
 6
 7    TOKEN_NUMBER
 8
 9    // symbols
10	TOKEN_PLUS
11	TOKEN_MINUS
12	TOKEN_ASTERISK
13	TOKEN_SLASH
14
15    // structure
16	TOKEN_BRACE_LEFT
17	TOKEN_BRACE_RIGHT
18
19	TOKEN_EOF
20)

Iota - the Go way of doing enums

Go does not have enums, it has iota, see the go spec here. Its not a particularly good system but its good enough for our use case - it basically increments all consts by 1 in a const block starting from 0. Therefore TOKEN_UNKNOWN is set to 1 and TOKEN_EOF to 9.

We will now define the structure holding the detected structure, its type and its raw value:

GO
1// lexer.go
2package main
3
4// [...] token kind definition
5
6type Token struct {
7    Type int
8    Raw  string
9}

We defined the structure to hold tokens we found in the source code and their types.

Tests

Now lets get started with writing tests. Per go convention we create a new file postfixed with _test.go. This lexer_test.go file contains all tests for the tokeniser and exists beside all previously created files in the root of the directory.

So lets create the foundation for our tests - we will make use of an idea called table driven tests:

GO
 1// lexer_test.go
 2package main
 3
 4import (
 5    "testing"
 6    "strings"
 7
 8    "github.com/stretchr/testify/assert"
 9)
10
11func TestLexer(t *testing.T) {
12    tests := []struct{
13        Name string
14        In   string
15        Out  []Token
16    }{}
17    for _, test := range tests {
18        t.Run(test.Name, func(t *testing.T) {
19            in := strings.NewReader(test.In)
20			out := NewLexer(in).Lex()
21            assert.EqualValues(t, test.Out, out)
22        })
23    }
24}

We use the assert.EqualValues function to compare our expected and the actual resulting arrays.

Tip

Including external packages in our project is a simple as running the following command:

TEXT
1$ go get <repo>

For our example we want the github.com/stretchr/testify/assert package, thus we execute:

TEXT
1$ go get github.com/stretchr/testify/assert

Lets add our first test - an edge case - specifically the case of an empty input for which we expect only one structure Token with Token.Type: TOKEN_EOF in the resulting token list.

GO
 1// lexer_test.go
 2package main
 3
 4import (
 5    "testing"
 6    "strings"
 7
 8    "github.com/stretchr/testify/assert"
 9)
10
11func TestLexer(t *testing.T) {
12    tests := []struct{
13        Name string
14        In   string
15        Out  []Token
16    }{
17        {
18            Name: "empty input",
19            In: "",
20            Out: []Token{
21                {Type: TOKEN_EOF, Raw: "TOKEN_EOF"},
22            },
23        },
24    }
25    for _, test := range tests {
26        t.Run(test.Name, func(t *testing.T) {
27            in := strings.NewReader(test.In)
28			out := NewLexer(in).Lex()
29            assert.EqualValues(t, test.Out, out)
30        })
31    }
32}

Running our tests with go test ./... -v will result in an error simply because we have not yet defined our Lexer:

TEXT
1$ go test ./... -v
2# calc [calc.test]
3./lexer_test.go:35:11: undefined: NewLexer
4FAIL    calc [build failed]
5FAIL

Debugging

If we try to print our Token structure we will see the Token.Type as an integer, for example:

GO
1package main
2
3func main() {
4    t := Token{Type: TOKEN_NUMBER, Raw: "12"}
5    fmt.Printf("Token{Type: %d, Raw: %s}\n", t.Type, t.Raw)
6}

This would of course not result in the output we want, due to the enum defining token types as integers:

TEXT
1$ go run .
2Token{Type: 2, Raw: 12}

Therefore we add the TOKEN_LOOKUP hash map:

GO
 1// lexer.go
 2package main
 3
 4// [...] imports
 5
 6// [...] token types generation
 7
 8var TOKEN_LOOKUP = map[int]string{
 9	TOKEN_UNKNOWN:     "UNKNOWN",
10	TOKEN_NUMBER:      "TOKEN_NUMBER",
11	TOKEN_PLUS:        "TOKEN_PLUS",
12	TOKEN_MINUS:       "TOKEN_MINUS",
13	TOKEN_ASTERISK:    "TOKEN_ASTERISK",
14	TOKEN_SLASH:       "TOKEN_SLASH",
15	TOKEN_BRACE_LEFT:  "TOKEN_BRACE_LEFT",
16	TOKEN_BRACE_RIGHT: "TOKEN_BRACE_RIGHT",
17	TOKEN_EOF:         "EOF",
18}

Tip

With vim the above is extremely easy to generate, simply copy the before defined types of tokens, paste them into the map, remove = iota +1, white space and comments. Afterwards mark them again with Shift+v. Now regex all over the place by typing :'<,'>s/\([A-Z_]\+\)/\1: "\1",, this creates a capture group for all upper case characters found one or more times, this group is reused in the substitute replace part of the command (second part of the command, infix split by /) and replaces all \1 with the captured name, thus filling the map.

If we were to now update our previous example to using the new TOKEN_LOOKUP map we notice it now works correctly:

GO
1package main
2
3import "fmt"
4
5func main() {
6	t := Token{Type: TOKEN_NUMBER, Raw: "12"}
7	fmt.Printf("Token{Type: %s, Raw: %s}\n", TOKEN_LOOKUP[t.Type], t.Raw)
8}
TEXT
1$ go run .
2Token{Type: TOKEN_NUMBER, Raw: 12}

Lexer overview

After establishing our debug capabilities we now can move on to creating the Lexer and defining our tokenisers API:

GO
 1// lexer.go
 2package main
 3
 4// [...] Imports, token types, token struct, TOKEN_LOOKUP map
 5
 6type Lexer struct {
 7	scanner bufio.Reader
 8	cur     rune
 9}
10
11func NewLexer(reader io.Reader) *Lexer { }
12
13func (l *Lexer) Lex() []Token { }
14
15func (l *Lexer) number() Token { }
16
17func (l *Lexer) advance() { }

The Lexer structure holds a scanner we will create in the NewLexer function this function accepts an unbuffered reader which we will wrap into a buffered reader for stepping trough the source in an optimized fashion. The function returns a Lexer structure. The cur field holds the current character.

The heart of the tokeniser is the Lexer.Lex method. It iterates over all characters in the buffered reader and tries to recognize structures.

The Lexer.number method is called when an number is detected, it then iterates until the current character is no longer a part of a number and returns a Token structure.

Lexer.advance requests the next character from the buffered scanner and sets Lexer.cur to the resulting character.

Tip

Definitions

  • Method vs function -> I refer to standalone functions as functions - to functions attached to structures as methods Thus NewLexer is a function and Lexer.Lex is a method. But to each their own - I don’t really care 😼.

  • Number vs integer vs digit -> Here I define a number as 1 or more characters between 0 and 9, I extend this definition with e, . and _ in between the first number and all following numbers. Thus I consider the following numbers as valid for this interpreter:

    • 1e5
    • 12.5
    • 0.5
    • 5_000_000

Creating the lexer

As introduced before the NewLexer function creates the lexer:

GO
 1// lexer.go
 2package main
 3
 4// [...] Imports, token types, token struct, TOKEN_LOOKUP map
 5
 6type Lexer struct {
 7	scanner bufio.Reader
 8	cur     rune
 9}
10
11func NewLexer(reader io.Reader) *Lexer {
12	l := &Lexer{
13		scanner: *bufio.NewReader(reader),
14	}
15	l.advance()
16	return l
17}

This function accepts a reader, creates a new Lexer structure, converts the reader to a buffered Reader, assigns it to the Lexer structure and afterwards invokes the Lexer.advance function we will discuss in the next chapter.

Advancing in the Input

Stepping through the source code is as easy as requesting a new character from our buffered reader via the bufio.Reader.ReadRune() method:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) advance() {
 7	r, _, err := l.scanner.ReadRune()
 8	if err != nil {
 9        l.cur = 0
10    } else {
11        l.cur = r
12    }
13}

The ReadRune function returns an error once the end of the file is hit, to indicate this to our Lexer.Lex function we will set the Lexer.cur field to 0.

Tip

End of file is often referred to as EOF.

We will now focus on the heart of the tokeniser: Lexer.Lex():

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) Lex() []Token {
 7	t := make([]Token, 0)
 8	for l.cur != 0 {
 9        l.advance()
10	}
11	t = append(t, Token{
12		Type: TOKEN_EOF,
13		Raw:  "TOKEN_EOF",
14	})
15	return t
16}

We firstly create a new slice of type []Token, we will fill with tokens we find while stepping through the source code. The while loop iterates until we hit the EOF by calling *Lexer.advance(). To indicate the ending of our token list we append a token of type TOKEN_EOF to the slice t.

After defining the NewLexer and the *Lexer.Lex we can try running our tests again:

TEXT
1$ go test ./... -v
2=== RUN   TestLexer
3=== RUN   TestLexer/empty_input
4--- PASS: TestLexer (0.00s)
5    --- PASS: TestLexer/empty_input (0.00s)
6PASS
7ok      calc    0.002s

Thus we know our lexer works correctly for empty inputs.

Ignoring white space

Every good programming language ignores white space and so do we (looking at you Python). White space is commonly defined as a new line: '\n' / '\r', a tab '\t' or a space ' '.

Lets add a new test case called whitespace to our white space tests:

GO
 1// lexer_test.go
 2package main
 3
 4// [...]
 5
 6func TestLexer(t *testing.T) {
 7	tests := []struct {
 8		Name string
 9		In   string
10		Out  []Token
11	}{
12        // [...]
13		{
14			Name: "whitespace",
15			In:   "\r\n\t             ",
16			Out: []Token{
17				{TOKEN_EOF, "TOKEN_EOF"},
18			},
19		},
20	}
21    // [...]
22}

Having defined what we want as the output, lets get started with ignoring white space:

To check if the current character matches any of the above we introduce a switch case statement:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) Lex() []Token {
 7	t := make([]Token, 0)
 8	for l.cur != 0 {
 9		switch l.cur {
10		case ' ', '\n', '\t', '\r':
11			l.advance()
12			continue
13        }
14
15        l.advance()
16	}
17	t = append(t, Token{
18		Type: TOKEN_EOF,
19		Raw:  "TOKEN_EOF",
20	})
21	return t
22}

Lets run our tests and check if everything worked out the way we wanted it to:

TEXT
1$ go test ./... -v
2=== RUN   TestLexer
3=== RUN   TestLexer/empty_input
4=== RUN   TestLexer/whitespace
5--- PASS: TestLexer (0.00s)
6    --- PASS: TestLexer/empty_input (0.00s)
7    --- PASS: TestLexer/whitespace (0.00s)
8PASS
9ok      calc    0.001s

Seems like we ignored whitespace the right way 😼.

Support for comments

Lets add a very similar test as we added in the previous chapter to check if we ignore comments correctly:

GO
 1// lexer_test.go
 2package main
 3
 4// [...]
 5
 6func TestLexer(t *testing.T) {
 7	tests := []struct {
 8		Name string
 9		In   string
10		Out  []Token
11	}{
12        // [...]
13		{
14			Name: "comment",
15			In:   "# this is a comment\n# this is a comment without a newline at the end",
16			Out: []Token{
17				{TOKEN_EOF, "TOKEN_EOF"},
18			},
19		},
20	}
21    // [...]
22}

To ignore comments, we add a new case to our switch statement:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) Lex() []Token {
 7    // [...]
 8	for l.cur != 0 {
 9		switch l.cur {
10        case '#':
11			for l.cur != '\n' && l.cur != 0 {
12				l.advance()
13			}
14			continue
15		case ' ', '\n', '\t', '\r':
16			l.advance()
17			continue
18		}
19		l.advance()
20	}
21    // [...]
22}

We want our comments to start with #, therefore we enter the case if the current character is a #. Once in the case we call *Lexer.advance() until we either hit a newline or EOF - both causing the loop to stop.

Lets again run our tests:

TEXT
 1$ go test ./... -v
 2=== RUN   TestLexer
 3=== RUN   TestLexer/empty_input
 4=== RUN   TestLexer/whitespace
 5=== RUN   TestLexer/comment
 6--- PASS: TestLexer (0.00s)
 7    --- PASS: TestLexer/empty_input (0.00s)
 8    --- PASS: TestLexer/whitespace (0.00s)
 9    --- PASS: TestLexer/comment (0.00s)
10PASS
11ok      calc    0.001s

Detecting special symbols

Having added tests for empty input, ignoring white space and comments, we will now add a new test for the symbols we want to recognize in out input:

GO
 1// lexer_test.go
 2package main
 3
 4
 5// [...]
 6
 7func TestLexer(t *testing.T) {
 8	tests := []struct {
 9		Name string
10		In   string
11		Out  []Token
12	}{
13        // [...]
14        {
15			Name: "symbols",
16			In:   "+-/*()",
17			Out: []Token{
18				{TOKEN_PLUS, "+"},
19				{TOKEN_MINUS, "-"},
20				{TOKEN_SLASH, "/"},
21				{TOKEN_ASTERISK, "*"},
22				{TOKEN_BRACE_LEFT, "("},
23				{TOKEN_BRACE_RIGHT, ")"},
24				{TOKEN_EOF, "TOKEN_EOF"},
25			},
26		},
27	}
28    // [...]
29}

Running our tests including the above at the current state of out implementation will result in the following assert Error:

TEXT
 1$ go test ./...
 2--- FAIL: TestLexer (0.00s)
 3    --- FAIL: TestLexer/symbols (0.00s)
 4        lexer_test.go:56:
 5                Error Trace:    ./lexer_test.go:56
 6                Error:          Not equal:
 7                                expected: []main.Token{main.Token{Type:3, Raw:"+"}, main.Token{Type:4, Raw:"-"}, main.Token{Type:6, Raw:"/"}, main.Token{Type:5, Raw:"*"}, main.Token{Type:7, Raw:"("}, main.Token{Type:8, Raw:")"}, main.Token{Type:9, Raw:"TOKEN_EOF"}}
 8                                actual  : []main.Token{main.Token{Type:9, Raw:"TOKEN_EOF"}}
 9// [...]
10                Test:           TestLexer/symbols
11FAIL
12FAIL    calc    0.004s
13FAIL

Implementing support for the symbols we want should fix this issue.

Our first step towards this goal is to define a new variable called ttype holding the type of token we recognized:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) Lex() []Token {
 7    // [...]
 8	for l.cur != 0 {
 9        // [...]
10        ttype := TOKEN_UNKNOWN
11        // [...]
12		l.advance()
13	}
14    // [...]
15}

We use this variable to insert detected tokens into our t array, if the value of ttype didn’t change and is still TOKEN_UNKNOWN we display an error and exit:

GO
 1// lexer.go
 2package main
 3
 4import (
 5    // [...]
 6    "log"
 7)
 8// [...]
 9
10func (l *Lexer) Lex() []Token {
11    // [...]
12	for l.cur != 0 {
13        // [...]
14        ttype := TOKEN_UNKNOWN
15        // [...]
16		if ttype != TOKEN_UNKNOWN {
17			t = append(t, Token{
18				Type: ttype,
19				Raw:  string(l.cur),
20			})
21		} else {
22			log.Fatalf("unknown %q in input", l.cur)
23		}
24
25		l.advance()
26	}
27    // [...]
28}

For now this concludes our error handling, not great - i know. Our next step is to add cases to our switch to react to differing characters:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) Lex() []Token {
 7    // [...]
 8	for l.cur != 0 {
 9        // [...]
10		switch l.cur {
11            // [...]
12        case '+':
13			ttype = TOKEN_PLUS
14		case '-':
15			ttype = TOKEN_MINUS
16		case '/':
17			ttype = TOKEN_SLASH
18		case '*':
19			ttype = TOKEN_ASTERISK
20		case '(':
21			ttype = TOKEN_BRACE_LEFT
22		case ')':
23			ttype = TOKEN_BRACE_RIGHT
24        }
25        // [...]
26		l.advance()
27	}
28    // [...]
29}

We can now once again run our tests:

TEXT
 1$ go test ./... -v
 2calc master M :: go test ./... -v
 3=== RUN   TestLexer
 4=== RUN   TestLexer/empty_input
 5=== RUN   TestLexer/whitespace
 6=== RUN   TestLexer/comment
 7=== RUN   TestLexer/symbols
 8--- PASS: TestLexer (0.00s)
 9    --- PASS: TestLexer/empty_input (0.00s)
10    --- PASS: TestLexer/whitespace (0.00s)
11    --- PASS: TestLexer/comment (0.00s)
12    --- PASS: TestLexer/symbols (0.00s)
13PASS
14ok      calc    0.003s

And we pass our tests, the only feature missing from our tokeniser is detecting numbers.

Support for integers and floating point numbers

As introduced before i want to support numbers with several infixes, such as _, e and ..

Go ahead and add some tests for these cases:

GO
 1// lexer_test.go
 2package main
 3
 4
 5// [...]
 6
 7func TestLexer(t *testing.T) {
 8	tests := []struct {
 9		Name string
10		In   string
11		Out  []Token
12	}{
13        // [...]
14        {
15			Name: "number",
16			In:   "123",
17			Out: []Token{
18				{TOKEN_NUMBER, "123"},
19				{TOKEN_EOF, "TOKEN_EOF"},
20			},
21		},
22		{
23			Name: "number with underscore",
24			In:   "10_000",
25			Out: []Token{
26				{TOKEN_NUMBER, "10_000"},
27				{TOKEN_EOF, "TOKEN_EOF"},
28			},
29		},
30		{
31			Name: "number with e",
32			In:   "10e5",
33			Out: []Token{
34				{TOKEN_NUMBER, "10e5"},
35				{TOKEN_EOF, "TOKEN_EOF"},
36			},
37		},
38		{
39			Name: "number with .",
40			In:   "0.005",
41			Out: []Token{
42				{TOKEN_NUMBER, "0.005"},
43				{TOKEN_EOF, "TOKEN_EOF"},
44			},
45		},
46	}
47    // [...]
48}

Lets add a default-case to our switch statement:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) Lex() []Token {
 7    // [...]
 8	for l.cur != 0 {
 9        // [...]
10		switch l.cur {
11            // [...]
12        default:
13			if (l.cur >= '0' && l.cur <= '9') || l.cur == '.' {
14				t = append(t, l.number())
15				continue
16			}
17        }
18        // [...]
19	}
20    // [...]
21}

As one should notice we have yet to define the *Lexer.number function:

GO
 1// lexer.go
 2package main
 3
 4// [...]
 5
 6func (l *Lexer) number() Token {
 7	b := strings.Builder{}
 8	for (l.cur >= '0' && l.cur <= '9') || l.cur == '.' || l.cur == '_' || l.cur == 'e' {
 9		b.WriteRune(l.cur)
10		l.advance()
11	}
12	return Token{
13		Raw:  b.String(),
14		Type: TOKEN_NUMBER,
15	}
16}

The function makes use of the strings.Builder structure. This is used to omit copying the string which we would have to do if we simply used string+string. We iterate while our character matches what we want and write to the strings.Builder structure. Upon hitting a character we do not accept the loop stops and the function returns a Token-Structure with the result of the strings.Builder we defined and wrote to previously.

Combining the previously added default-case and our new *Lexer.number() function we added support for numbers starting with 0-9 or .. We support infixes such as _, ., _ and e - exactly matching our test cases, thus we can now once again check if our tests pass:

TEXT
 1=== RUN   TestLexer
 2=== RUN   TestLexer/empty_input
 3=== RUN   TestLexer/whitespace
 4=== RUN   TestLexer/comment
 5=== RUN   TestLexer/symbols
 6=== RUN   TestLexer/number
 7=== RUN   TestLexer/number_with_underscore
 8=== RUN   TestLexer/number_with_e
 9=== RUN   TestLexer/number_with_.
10--- PASS: TestLexer (0.00s)
11    --- PASS: TestLexer/empty_input (0.00s)
12    --- PASS: TestLexer/whitespace (0.00s)
13    --- PASS: TestLexer/comment (0.00s)
14    --- PASS: TestLexer/symbols (0.00s)
15    --- PASS: TestLexer/number (0.00s)
16    --- PASS: TestLexer/number_with_underscore (0.00s)
17    --- PASS: TestLexer/number_with_e (0.00s)
18    --- PASS: TestLexer/number_with_. (0.00s)
19PASS
20ok      calc    0.003s

Calling our Tokeniser

Out tests pass - we can finally move on to my favorite part of every programming project: passing input via the command line to our program and seeing the output. Doing so requires some packages. We need os to access the command line arguments our program was called with, we need strings to create a io.Reader for the parameter our tokeniser requires. Furthermore we include the log package and promptly disable all prefixes, timestamps, etc. by invoking log.SetFlags with 0 as the argument.

GO
 1package main
 2
 3import (
 4	"log"
 5	"os"
 6	"strings"
 7)
 8
 9func main() {
10    log.SetFlags(0)
11	if len(os.Args) != 2 {
12		log.Fatalln("missing input")
13	}
14
15	input := os.Args[1]
16
17	token := NewLexer(strings.NewReader(input)).Lex()
18}

Tip

When an executable build with go is started it can access the arguments passed to it via the os.Args slice:

GO
 1// main.go
 2package main
 3
 4import (
 5    "fmt"
 6    "os"
 7)
 8
 9func main() {
10    fmt.Println(os.Args)
11}
TEXT
1$ go build .
2$ ./main arg1 arg2 arg3
3[./main arg1 arg2 arg3]

The 0 index is always the name of the executable.

We got our tokens but we haven’t printed them yet, so we create a helper method called debugToken - we first print the header of our table and afterwards iterate through our list of Token structures, printing them one by one.

GO
 1// main.go
 2package main
 3
 4// [...]
 5
 6func debugToken(token []Token) {
 7	log.Printf("%5s | %20s | %15s \n\n", "index", "type", "raw")
 8	for i, t := range token {
 9		log.Printf("%5d | %20s | %15s \n", i, TOKEN_LOOKUP[t.Type], t.Raw)
10	}
11}
12
13func main() {
14    log.SetFlags(0)
15	if len(os.Args) != 2 {
16		log.Fatalln("missing input")
17	}
18
19	input := os.Args[1]
20
21	token := NewLexer(strings.NewReader(input)).Lex()
22	debugToken(token)
23}

Running our program with an expression of our choice results in a table of lexemes we recognized

TEXT
 1$ go run . "100_000+.5*(42-3.1415)/12"
 2index |                 type |             raw
 3
 4    0 |         TOKEN_NUMBER |         100_000
 5    1 |           TOKEN_PLUS |               +
 6    2 |         TOKEN_NUMBER |              .5
 7    3 |       TOKEN_ASTERISK |               *
 8    4 |     TOKEN_BRACE_LEFT |               (
 9    5 |         TOKEN_NUMBER |              42
10    6 |          TOKEN_MINUS |               -
11    7 |         TOKEN_NUMBER |          3.1415
12    8 |    TOKEN_BRACE_RIGHT |               )
13    9 |          TOKEN_SLASH |               /
14   10 |         TOKEN_NUMBER |              12
15   11 |                  EOF |       TOKEN_EOF

Closing Words

We did it, we read our input, recognized lexemes in it, stored theses in structures and attached meaning we require for further processing to them. All while employing TDD and tests the way the go god fathers intended us to.

The next part will probably take a while longer to make, this one already took me a week to write. Keep an eye on my RSS feed - I’m very excited to finally fully understand precedence parsing and writing the next blog article.