Info
This is a partial write-up of the lexer I implemented while working on my Markdown to HTML converter. The repository can be found here.
Today's goal is to implement a scanner / lexer simple enough to understand the basics of converting a stream of characters without any meaning into a stream of tokens with a lot of meaning. 😉
Tokens
Our output consists of a lot of tokens, which are strings with an identified meaning, in contrast to our input stream, which has no meaning whatsoever.
Defining a Markdown subset
To keep the scanner small and simple we restrict ourselves to a subset of Markdown.
Headings
# h1

## h2

### h3

#### h4

##### h5

###### h6
Lists
- [ ] list
- [x] list
- list
Links
[cool website](https://xnacly.me)
Images
![very cool avatar](https://xnacly/images/avatar.webp)
Emphasis
**this is bold**
Italics
_this is italic_
Horizontal rulers
---
Quotes
> Opinion is the medium between knowledge and ignorance.
> ~ Plato
Code blocks
```javascript
console.log("test");
```
Inline code
`this is inline code`
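Putting a few of these constructs together, a small input document for our scanner could look like this (an illustrative sample, combining the subset above):

## Reading list

- [x] **Plato** - _The Republic_
- [ ] [Go spec](https://go.dev/ref/spec)

> Opinion is the medium between knowledge and ignorance.

`this is inline code`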
Defining Tokens
The parser will do the job of putting our tokens together in a well structured way (AST). Our job is to tell the parser exactly which token it's currently looking at. To do this we need to create a structure for each token which holds information such as its position, value, kind, etc.
In Go, the above looks something like the following:
// scanner/tokens.go
package scanner

type Token struct {
    Pos   uint
    Kind  uint
    Line  uint
    Value string
}
The position, kind and line fields are all of type unsigned integer, simply because it doesn't make sense for them to be negative.
To indicate where the scanner encountered the token, the Token struct contains the Line and the Position on the line on which the token was found. To store text we find while lexing, we use the Value field.
For example, let's take a look at the following:
## Tokens

### Defining Tokens
Imagine we are at the second position of the third line; our fictional cursor (indicated using ^) is therefore located under the second hash (#) in the heading Defining Tokens:
## Tokens

### Defining Tokens
 ^
 |
 cursor, located at: [line: 2, position: 1]
Tip
Remember, arrays start at index 0, therefore we are located at line 2, even though the numbers on the left indicate we are in line 3 of the file.
If we were to lex this character and store it somewhere, we would create the following Go structure:
// main.go
package main

import "github.com/xnacly/fleck/scanner"

func main() {
    t := scanner.Token{
        Pos:   1,
        Kind:  0, // ? We don't know yet
        Line:  2,
        Value: "",
    }
    _ = t // keep the compiler happy, we only look at the value here
}
Categorizing Tokens
To distinguish tokens from one another, we define a bucket load of constants:
// scanner/tokens.go
package scanner

// ...

const (
    TEXT = iota + 1    // everything else
    HASH               // #
    UNDERSCORE         // _
    STAR               // *
    NEWLINE            // \n
    DASH               // -
    STRAIGHTBRACEOPEN  // [
    STRAIGHTBRACECLOSE // ]
    PARENOPEN          // (
    PARENCLOSE         // )
    BACKTICK           // `
    GREATERTHAN        // >
    BANG               // !
)
iota is used to auto-increment constants and therefore provides something like an enum in Go. (source)
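As a quick sanity check (a minimal sketch, assuming the constant block above), the constants receive consecutive values starting at 1:

fmt.Println(TEXT, HASH, UNDERSCORE) // prints: 1 2 3
fmt.Println(BANG)                   // prints: 13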
We can now extend our example using our newly created enum:
// main.go
package main

import "github.com/xnacly/fleck/scanner"

func main() {
    t := scanner.Token{
        Pos:   1,
        Kind:  scanner.HASH,
        Line:  2,
        Value: "",
    }
    _ = t // keep the compiler happy, we only look at the value here
}
Remember, the character under the cursor is a hash (#). Therefore we set the Kind field of the resulting structure to the HASH constant from above.
Scanner
A scanner / lexer / tokeniser takes an input without any meaning, such as a stream of characters, and converts it into a stream of tokens with information attached to them, such as their position, their kind or their value.
For example, take a look at the following markdown:
### Heading 3
After throwing it into our scanner, the scanner internally represents this input as a stream of characters, i.e. an array of runes, like so:
input := []rune{
    '#',
    '#',
    '#',
    ' ',
    'H',
    'e',
    'a',
    'd',
    'i',
    'n',
    'g',
    ' ',
    '3',
    '\n',
}
Tip
Notice the '\n' line break after the 3.
A line in a file always ends with a line break on Linux; this means a line is composed of all the characters between the previous \n and the next \n.
After the scanner did its job it should spit out a stream of tokens, again representable as an array, only this time of objects:
result := []Token{
    {
        Pos:   0,
        Kind:  HASH,
        Line:  0,
        Value: "",
    },
    {
        Pos:   1,
        Kind:  HASH,
        Line:  0,
        Value: "",
    },
    {
        Pos:   2,
        Kind:  HASH,
        Line:  0,
        Value: "",
    },
    {
        Pos:   3,
        Kind:  TEXT,
        Line:  0,
        Value: " Heading 3",
    },
    {
        Pos:   13,
        Kind:  NEWLINE,
        Line:  0,
        Value: "",
    },
}
Notice how Heading 3 is completely encapsulated in the Token object of kind TEXT.
Scanner setup
Our first step is to create a new Scanner structure:
// scanner/scanner.go
package scanner

// imports used by the scanner functions implemented throughout this post
import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"
)

type Scanner struct {
    scan    *bufio.Scanner
    isAtEnd bool
    curLine []rune
    curChar rune
    linePos uint
    line    uint
    tokens  []Token
}
The fields in the structure can be explained as follows:
Field | Usage
---|---
scan | contains the buffered scanner which is used to iterate over the file we want to lex
isAtEnd | indicates whether the scanner lexed the whole file and is at its end (EOF)
curLine | array of runes containing the characters of the current line
curChar | the character at the current index on the current line
linePos | indicates the scanner's position on the current line
line | holds the line position of the scanner in the file
tokens | contains the lexed token objects
Instantiation
To instantiate it we define a new exported function (a function of a package is exported if its name starts with an uppercase character) in the package:
// scanner/scanner.go
package scanner

// ...

func New(fileName string) Scanner {
    file, err := os.Open(fileName)
    if err != nil {
        log.Fatalln("couldn't open file", err)
    }

    scan := bufio.NewScanner(file)
    // scan.Scan reads the first line
    // of the given file into the scanner
    scan.Scan()

    // scan.Text returns
    // the string value of the previously read line
    firstLine := []rune(scan.Text())

    return Scanner{
        scan:    scan,
        curLine: firstLine,
        curChar: firstLine[0],
        line:    0,
        linePos: 0,
    }
}
The scanner.New function returns an instance of the Scanner structure we defined before.
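A hypothetical call (a sketch, the file name is made up) then looks like this:

s := scanner.New("test.md") // opens test.md and reads its first line into the scanner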
Adding a new Token
To write less repetitive code, we create the scanner.Scanner.addToken function:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) addToken(kind uint, value string) {
 7      pos := s.linePos
 8      if len(value) != 0 {
 9          pos = s.linePos - uint(len(value))
10      }
11      s.tokens = append(s.tokens, Token{
12          Pos:   pos,
13          Kind:  kind,
14          Value: value,
15          Line:  s.line,
16      })
17  }
This function allows us to add a token after scanning it simply by calling the function like so:
s.addToken(HASH, "")
Info
Lines 7 through 10 are a bit unintuitive, so let's go into detail and explore their use:
pos := s.linePos

Line 7 is simply assigning the value of the current position the lexer is at to the pos variable.
pos := s.linePos
if len(value) != 0 {
    pos = s.linePos - uint(len(value))
}
The if statement in line 8 checks whether the function parameter value of type string contains anything; this is done by comparing its length against zero.
pos := s.linePos
if len(value) != 0 {
    pos = s.linePos - uint(len(value))
}
The magic happens in the 9th line of the snippet. This line calculates the position of the token by subtracting the length of the function parameter value from the cursor's position on the current line.
This is very dry and hard to understand without an example, so let's go back to the example we used before. Consider we have the following line to lex:

### Heading 3
As we explored before, we scan each hash as its own token and would add it using the addToken function. We will call the function like so:

s.addToken(HASH, "")
This means the addToken function gets an empty string (length 0) as the content of the parameter value, therefore the if statement in its body is not executed.
However, if we lexed a text and want to add it using the addToken function, we would call it like this:

s.addToken(TEXT, " Heading 3")
As one can see, we supplied the function call with the value " Heading 3", which is in fact not of length 0. This means the if statement in the function body is executed and calculates a different position for the TEXT token than for the HASH token.
To explain this behaviour we need to think about how we would lex a token of type TEXT.
When encountering a # we can simply call the s.addToken(HASH, "") function, advance to the next character and lex again.
We can't do this when we encounter non format specific characters (everything not in #_*\n-[]()>!).
To store everything else in a neat way, we iterate over the rest of the line until we hit any of the before mentioned special characters and then add the text with its content using the addToken(TEXT, " Heading 3") function.
This however means the scanner's position is at the end of the text we want to scan:

             scanner is positioned here (pos: 13)
             ↓↓
### Heading 3\n
   ↑
   but the text starts here (pos: 3)
To compensate for this offset, we subtract the length of the text we lexed from the position the scanner is currently at:

end_of_text - length_of_text = beginning_of_the_text

In our example that is 13 - 10 = 3, which is exactly the position where the text " Heading 3" begins.
Moving through the line
After lexing a character we want to take a look at the next one. Therefore we need to somehow advance to the next character and position on the line; the following function does exactly that:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) advance() {
    if s.linePos+1 >= uint(len(s.curLine)) {
        s.curChar = '\n'
        s.linePos++
        return
    }

    s.linePos++
    s.curChar = s.curLine[s.linePos]
}
The function increments the linePos field of the Scanner struct and sets the curChar field to the next character, as long as the scanner wouldn't move past the end of the line by incrementing.
Due to the fact that the bufio.Scanner removes line breaks from the read lines, we need to set the current character to a line break ourselves once we are at the end of the line. (Source)
This enables us to check whether we are at the end of the line while iterating through the current line.
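To make this concrete, here is a hypothetical trace of advance (a sketch, assuming curLine holds only the two runes 'a' and 'b'):

// curLine = []rune{'a', 'b'}, linePos = 0, curChar = 'a'
s.advance() // linePos = 1, curChar = 'b'
s.advance() // linePos = 2, curChar = '\n' (we ran past the end of the line)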
Moving through the file
After hitting a line break (\n) we of course need to advance to the next line. The following function allows us to do exactly this:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) advanceLine() {
    ok := s.scan.Scan()

    if s.scan.Err() != nil || !ok {
        s.isAtEnd = true
        return
    }

    s.curLine = []rune(s.scan.Text())
    s.line++
    s.linePos = 0
    for len(s.curLine) == 0 {
        s.scan.Scan()
        s.curLine = []rune(s.scan.Text())
        s.line++
    }
    s.curChar = s.curLine[s.linePos]
}
The above function calls the scan field of the Scanner structure to get the next line of the file; if this fails or the bufio.Scanner has an error, the isAtEnd flag on the Scanner structure is set.
If the scanner is still in the bounds of the file, it sets the current line to the string result of the line the bufio.Scanner just scanned.
After that it increments the line field of the Scanner structure and sets the field linePos to 0.
If the current line is of length 0, i.e. simply empty, the function loops until it finds a line which isn't empty.
At the end the function assigns the first character of the new line to the curChar field of the Scanner structure.
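As an illustration, here is a hypothetical trace for a file whose second line is empty (a sketch, assuming the scanner just finished the first line):

// file contents:
//   line 0: "### Heading 3"
//   line 1: ""              (empty)
//   line 2: "## Tokens"
s.advanceLine() // reads line 1, which is empty, so the loop scans ahead
                // afterwards: s.line = 2, s.linePos = 0, s.curChar = '#'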
Pretty Printing the Tokens
To debug and take a look at our processed tokens, we implement the scanner.Scanner.PrintTokens function:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) PrintTokens() {
 7      for _, token := range s.tokens {
 8          fmt.Printf("[ '%s' | %d | %d | '%s' ]\n",
 9              TOKEN_LOOKUP_MAP[token.Kind],
10              token.Pos,
11              token.Line,
12              token.Value,
13          )
14      }
15  }
The function simply loops over every token in the tokens array field of the Scanner structure and formats the token values according to our format string.
An attentive reader has already noticed the TOKEN_LOOKUP_MAP in line 9 of the snippet.
This map simply binds the enums we created as our token kinds to their names, because who likes looking at numbers if they can instead look at names 😴.
This map is defined in the scanner/tokens.go file and looks like this:
// scanner/tokens.go
package scanner

// ...

var TOKEN_LOOKUP_MAP = map[uint]string{
    HASH:               "HASH",
    UNDERSCORE:         "UNDERSCORE",
    STAR:               "STAR",
    NEWLINE:            "NEWLINE",
    DASH:               "DASH",
    STRAIGHTBRACEOPEN:  "STRAIGHTBRACEOPEN",
    STRAIGHTBRACECLOSE: "STRAIGHTBRACECLOSE",
    PARENOPEN:          "PARENOPEN",
    PARENCLOSE:         "PARENCLOSE",
    GREATERTHAN:        "GREATERTHAN",
    BACKTICK:           "BACKTICK",
    TEXT:               "TEXT",
    BANG:               "BANG",
}
This means whenever we lex a hash (#) and want to check with our own eyes that it really is a hash, we can simply do the following:

fmt.Println(TOKEN_LOOKUP_MAP[HASH])
Running the scanner.Scanner.PrintTokens function for our example ### Heading 3 results in the following output:
[ 'HASH' | 0 | 0 | '' ]
[ 'HASH' | 1 | 0 | '' ]
[ 'HASH' | 2 | 0 | '' ]
[ 'TEXT' | 3 | 0 | ' Heading 3' ]
[ 'NEWLINE' | 13 | 0 | '' ]
Getting our scanned Tokens
To access the tokens we scanned, we create a new function called scanner.Scanner.Tokens, which returns the scanner.Scanner.tokens structure field:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Tokens() []Token {
    return s.tokens
}
Single character tokens
A single character token in Markdown can be any of:
 # (Hash)
 _ (Underscore)
 * (Star)
\n (New Line)
 - (Dash)
 [ (Open Square Bracket)
 ] (Close Square Bracket)
 ( (Open Paren)
 ) (Close Paren)
 > (Greater than)
 ! (Bang / exclamation mark)
 ` (Backtick)
To parse these we simply check if the value of the curChar field is equal to one of them, and call addToken with the token's kind in the scanner.Scanner.Lex function.
Doing this requires a for loop over all the characters in the current line while we aren't at the end of the file:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {

    }
}
Tip
Remember, the isAtEnd field of the Scanner struct indicates that we left the last line of the file behind us and are therefore done with the file.
To pave the way for matching characters, we create a switch statement over the curChar field of the Scanner structure and a variable tokenKind:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {

        }
    }
}
We use the tokenKind variable to store the kind of token we found.
This allows us to write the call to s.addToken once after the switch statement instead of in each case:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {

        }
        s.addToken(tokenKind, "")
    }
}
After adding the token the current character matches, we need to move on to the next character - remember, we implemented scanner.Scanner.advance exactly for this use case:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {

        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
I am now going to show you how to match the hash in detail and afterwards we are going to fast forward the process for the remaining symbols.
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {
        case '#':
            tokenKind = HASH
        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
Everything we had to do was add the case to match # and assign the kind of token (HASH) to the tokenKind variable.
Now the other symbols:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {
        case '#':
            tokenKind = HASH
        case '!':
            tokenKind = BANG
        case '>':
            tokenKind = GREATERTHAN
        case '_':
            tokenKind = UNDERSCORE
        case '*':
            tokenKind = STAR
        case '-':
            tokenKind = DASH
        case '[':
            tokenKind = STRAIGHTBRACEOPEN
        case ']':
            tokenKind = STRAIGHTBRACECLOSE
        case '(':
            tokenKind = PARENOPEN
        case ')':
            tokenKind = PARENCLOSE
        case '`':
            tokenKind = BACKTICK
        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
Notice the missing line break (\n)?
As I mentioned while discussing the scanner.Scanner.advance function, we need to add the line break at the end of the line ourselves to keep track of whether we are at the end of the current line.
To move to the next line, we need to call the scanner.Scanner.advanceLine function if the scanner encounters the line break symbol:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {
        case '#':
            tokenKind = HASH
        case '!':
            tokenKind = BANG
        case '>':
            tokenKind = GREATERTHAN
        case '_':
            tokenKind = UNDERSCORE
        case '*':
            tokenKind = STAR
        case '\n':
            s.addToken(NEWLINE, "")
            s.advanceLine()
            continue
        case '-':
            tokenKind = DASH
        case '[':
            tokenKind = STRAIGHTBRACEOPEN
        case ']':
            tokenKind = STRAIGHTBRACECLOSE
        case '(':
            tokenKind = PARENOPEN
        case ')':
            tokenKind = PARENCLOSE
        case '`':
            tokenKind = BACKTICK
        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
As one can see, we add a new token with the NEWLINE kind, then advance to the next line and continue onward to the next iteration of the loop.
If we again look at our example, we can see our lexer can already tokenize four of our five tokens - the three hashes and the line break at the end of the line.
Scanning the rest of the paragraph as a token of kind TEXT is going to be slightly more complex.
Multi character tokens
Let's talk about lexing text. We want our scanner to categorize everything between the special symbols we are already scanning for as a token of kind TEXT.
To accomplish this we will need to add a default case to our switch statement:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) Lex() {
 7      for !s.isAtEnd {
 8          var tokenKind uint
 9          switch s.curChar {
10          case '#':
11              tokenKind = HASH
12          case '!':
13              tokenKind = BANG
14          case '>':
15              tokenKind = GREATERTHAN
16          case '_':
17              tokenKind = UNDERSCORE
18          case '*':
19              tokenKind = STAR
20          case '\n':
21              s.addToken(NEWLINE, "")
22              s.advanceLine()
23              continue
24          case '-':
25              tokenKind = DASH
26          case '[':
27              tokenKind = STRAIGHTBRACEOPEN
28          case ']':
29              tokenKind = STRAIGHTBRACECLOSE
30          case '(':
31              tokenKind = PARENOPEN
32          case ')':
33              tokenKind = PARENCLOSE
34          case '`':
35              tokenKind = BACKTICK
36          default:
37              var res strings.Builder
38
39          out:
40              for {
41                  switch s.curChar {
42                  case '\n', '!', '#', '_', '*', '-', '[', ']', '(', ')', '`', '>':
43                      break out
44                  }
45
46                  res.WriteRune(s.curChar)
47                  s.advance()
48              }
49          }
50          s.addToken(tokenKind, "")
51          s.advance()
52      }
53  }
In this default case we create a new variable of type strings.Builder, to which we will append the characters of the text.
This is considerably faster than just concatenating two strings using the += operator.
The next line defines a label for the following loop (out:).
We will use this label to break the outer loop if we find a character which matches any special character.
In this loop we check if the current character matches a special character (lines 41-44); if it does, we break at the out label (line 43). This stops the loop labeled as out.
If it isn't a special character, the scanner writes the current character to the strings.Builder and advances to the next character.
We also don't want to add empty text as a token, therefore we add the following snippet at the end of the default case:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) Lex() {
 7      for !s.isAtEnd {
 8          var tokenKind uint
 9          switch s.curChar {
10          case '#':
11              tokenKind = HASH
12          case '!':
13              tokenKind = BANG
14          case '>':
15              tokenKind = GREATERTHAN
16          case '_':
17              tokenKind = UNDERSCORE
18          case '*':
19              tokenKind = STAR
20          case '\n':
21              s.addToken(NEWLINE, "")
22              s.advanceLine()
23              continue
24          case '-':
25              tokenKind = DASH
26          case '[':
27              tokenKind = STRAIGHTBRACEOPEN
28          case ']':
29              tokenKind = STRAIGHTBRACECLOSE
30          case '(':
31              tokenKind = PARENOPEN
32          case ')':
33              tokenKind = PARENCLOSE
34          case '`':
35              tokenKind = BACKTICK
36          default:
37              var res strings.Builder
38
39          out:
40              for {
41                  switch s.curChar {
42                  case '\n', '!', '#', '_', '*', '-', '[', ']', '(', ')', '`', '>':
43                      break out
44                  }
45
46                  res.WriteRune(s.curChar)
47                  s.advance()
48              }
49
50              if res.Len() != 0 {
51                  s.addToken(TEXT, res.String())
52              }
53
54              continue
55          }
56          s.addToken(tokenKind, "")
57          s.advance()
58      }
59  }
This change skips all scanned text tokens which have a string of length 0.
The continue at the end skips adding the token again in line 56, due to the fact that we already added the token in line 51 if it wasn't empty.
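At this point the scanner is usable end to end. A hypothetical main.go wiring everything together (a sketch, the file name is made up, module path as used earlier) could look like this:

// main.go
package main

import "github.com/xnacly/fleck/scanner"

func main() {
    s := scanner.New("test.md")
    s.Lex()
    s.PrintTokens()
    _ = s.Tokens() // hand the token slice to a parser later on
}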
Tests
While writing the lexer we need to check that we get the output right. Go supports testing code out of the box with its built-in testing package and the go test command.
To test, simply create a new file in the scanner directory, named scanner_test.go:
// scanner/scanner_test.go
package scanner

import "testing"

func TestExample(t *testing.T) {
    s := New("./markdown.md")
    s.Lex()
    tokens := s.Tokens()
    _ = tokens // we will make use of the tokens in the next step
}
The _test suffix is important, otherwise the go test command won't recognize the file and therefore the tests in it.
For this test to work, the markdown.md file needs to be created in the scanner/ directory. Fill it with the example we already used before:
### Heading 3
Now we can append the following to our TestExample function:
// scanner/scanner_test.go
package scanner

import "testing"

func TestExample(t *testing.T) {
    s := New("./markdown.md")
    s.Lex()
    tokens := s.Tokens()

    expectedTokens := []uint{
        HASH,
        HASH,
        HASH,
        TEXT,
        NEWLINE,
    }

    if len(tokens) != len(expectedTokens) {
        t.Errorf("expected %d tokens, got: %d", len(expectedTokens), len(tokens))
    }
}
We define the token kinds we expect in the expectedTokens array; after that we compare the length of the tokens we got after scanning with the length of the array we just defined. If they don't match, we did something wrong.
To check if all tokens we matched are correct, we need to loop over them and check if the token we expect matches the token we scanned:
// scanner/scanner_test.go
package scanner

import "testing"

func TestExample(t *testing.T) {
    s := New("./markdown.md")
    s.Lex()
    tokens := s.Tokens()

    expectedTokens := []uint{
        HASH,
        HASH,
        HASH,
        TEXT,
        NEWLINE,
    }

    if len(tokens) != len(expectedTokens) {
        t.Errorf("expected %d tokens, got: %d", len(expectedTokens), len(tokens))
    }

    for i, token := range tokens {
        if expectedTokens[i] != token.Kind {
            t.Errorf("expected %d [%s], got %d [%s] for token %d",
                expectedTokens[i], TOKEN_LOOKUP_MAP[expectedTokens[i]], token.Kind, TOKEN_LOOKUP_MAP[token.Kind], i,
            )
        }
    }
}
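With the test file and markdown.md in place, the tests can be run from the root of the module using Go's built-in tooling; adding the -v flag prints the name and result of each test function:

go test ./scanner -v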
So that's it - now you are able to write a lexer for Markdown and test whether everything worked out correctly.