Info
This is a partial write-up of the lexer I implemented while working on my Markdown to HTML converter. The repository can be found here.
Today's goal is to implement a scanner / lexer simple enough to understand the basics of converting a stream of characters without any meaning into a stream of tokens with a lot of meaning. 😉
Tokens
Our output consists of a lot of tokens, which are strings with an identified meaning, in contrast to our input stream, which has no meaning whatsoever.
Defining a Markdown subset
To keep the scanner small and simple we restrict ourselves to a subset of Markdown.
Headings
# h1

## h2

### h3

#### h4

##### h5

###### h6
Lists
- [ ] list
- [x] list
- list
Links
[cool website](https://xnacly.me)
Images
![very cool avatar](https://xnacly/images/avatar.webp)
Emphasis
**this is bold**
Italics
_this is italic_
Horizontal rulers
---
Quotes
> Opinion is the medium between knowledge and ignorance.
> ~ Plato
Code blocks
```javascript
console.log("test");
```
Inline code
`this is inline code`
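Putting a few of these constructs together, a small input document for our scanner could look like this (an illustrative sample, combining the subset above):

## Reading list

- [x] **Plato** - _The Republic_
- [ ] [Go spec](https://go.dev/ref/spec)

> Opinion is the medium between knowledge and ignorance.

`this is inline code`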
Defining Tokens
The parser will do the job of putting our tokens together in a well structured way (AST). Our job is to tell the parser exactly which token it's currently looking at. To do this we need to create a structure for each token which holds information such as its position, value, kind, etc.
In Go, the above looks something like the following:
// scanner/tokens.go
package scanner

type Token struct {
    Pos   uint
    Kind  uint
    Line  uint
    Value string
}
The position, kind and line fields are all of type unsigned integer, simply because it doesn't make sense for them to be negative.
To indicate where the scanner encountered the token, the Token struct contains the Line and the Position on the line on which the token was found. To store text we find while lexing, we use the Value field.
For example, let's take a look at the following:
## Tokens

### Defining Tokens
Imagine we are at the second position of the third line; our fictional cursor (indicated using ^) is therefore located under the second hash (#) in the heading Defining Tokens:
## Tokens

### Defining Tokens
 ^
 |
 cursor, located at: [line: 2, position: 1]
Tip
Remember, arrays start at index 0, therefore we are located at line 2, even though the numbers on the left indicate we are in line 3 of the file.
If we were to lex this character and store it somewhere, we would create the following Go structure:
// main.go
package main

import "github.com/xnacly/fleck/scanner"

func main() {
    t := scanner.Token{
        Pos:   1,
        Kind:  0, // ? We don't know yet
        Line:  2,
        Value: "",
    }
    _ = t // keep the compiler happy, we only look at the value here
}
Categorizing Tokens
To distinguish tokens from one another, we define a bucket load of constants:
// scanner/tokens.go
package scanner

// ...

const (
    TEXT = iota + 1    // everything else
    HASH               // #
    UNDERSCORE         // _
    STAR               // *
    NEWLINE            // \n
    DASH               // -
    STRAIGHTBRACEOPEN  // [
    STRAIGHTBRACECLOSE // ]
    PARENOPEN          // (
    PARENCLOSE         // )
    BACKTICK           // `
    GREATERTHAN        // >
    BANG               // !
)
iota is used to auto-increment constants and therefore provides something like an enum in Go. (source)
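As a quick sanity check (a minimal sketch, assuming the constant block above), the constants receive consecutive values starting at 1:

fmt.Println(TEXT, HASH, UNDERSCORE) // prints: 1 2 3
fmt.Println(BANG)                   // prints: 13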
We can now extend our example using our newly created enum:
// main.go
package main

import "github.com/xnacly/fleck/scanner"

func main() {
    t := scanner.Token{
        Pos:   1,
        Kind:  scanner.HASH,
        Line:  2,
        Value: "",
    }
    _ = t // keep the compiler happy, we only look at the value here
}
Remember, the character under the cursor is a hash (#). Therefore we set the Kind field of the resulting structure to the HASH constant from above.
Scanner
A scanner / lexer / tokeniser takes an input without any meaning, such as a stream of characters, and converts it into a stream of tokens with information attached to them, such as their position, their kind or their value.
For example, take a look at the following markdown:
### Heading 3
After throwing it into our scanner, the scanner internally represents this input as a stream of characters, i.e. an array of runes, like so:
input := []rune{
    '#',
    '#',
    '#',
    ' ',
    'H',
    'e',
    'a',
    'd',
    'i',
    'n',
    'g',
    ' ',
    '3',
    '\n',
}
Tip
Notice the '\n' line break after the 3.
A line in a file always ends with a line break on Linux; this means a line is composed of all the characters between the previous \n and the next \n.
After the scanner did its job it should spit out a stream of tokens, again representable as an array, only this time of objects:
result := []Token{
    {
        Pos:   0,
        Kind:  HASH,
        Line:  0,
        Value: "",
    },
    {
        Pos:   1,
        Kind:  HASH,
        Line:  0,
        Value: "",
    },
    {
        Pos:   2,
        Kind:  HASH,
        Line:  0,
        Value: "",
    },
    {
        Pos:   3,
        Kind:  TEXT,
        Line:  0,
        Value: " Heading 3",
    },
    {
        Pos:   13,
        Kind:  NEWLINE,
        Line:  0,
        Value: "",
    },
}
Notice how Heading 3 is completely encapsulated in the Token object of kind TEXT.
Scanner setup
Our first step is to create a new Scanner structure:
// scanner/scanner.go
package scanner

// imports used by the scanner functions implemented throughout this post
import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"
)

type Scanner struct {
    scan    *bufio.Scanner
    isAtEnd bool
    curLine []rune
    curChar rune
    linePos uint
    line    uint
    tokens  []Token
}
The fields in the structure can be explained as follows:
Field | Usage
---|---
scan | contains the buffered scanner which is used to iterate over the file we want to lex
isAtEnd | indicates whether the scanner lexed the whole file and is at its end (EOF)
curLine | array of runes containing the characters of the current line
curChar | the character at the current index on the current line
linePos | indicates the scanner's position on the current line
line | holds the line position of the scanner in the file
tokens | contains the lexed token objects
Instantiation
To instantiate it we define a new exported function (a function of a package is exported if its name starts with an uppercase character) in the package:
// scanner/scanner.go
package scanner

// ...

func New(fileName string) Scanner {
    file, err := os.Open(fileName)
    if err != nil {
        log.Fatalln("couldn't open file", err)
    }

    scan := bufio.NewScanner(file)
    // scan.Scan reads the first line
    // of the given file into the scanner
    scan.Scan()

    // scan.Text returns
    // the string value of the previously read line
    firstLine := []rune(scan.Text())

    return Scanner{
        scan:    scan,
        curLine: firstLine,
        curChar: firstLine[0],
        line:    0,
        linePos: 0,
    }
}
The scanner.New function returns an instance of the Scanner structure we defined before.
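A hypothetical call (a sketch, the file name is made up) then looks like this:

s := scanner.New("test.md") // opens test.md and reads its first line into the scanner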
Adding a new Token
To write less repetitive code, we create the scanner.Scanner.addToken function:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) addToken(kind uint, value string) {
 7      pos := s.linePos
 8      if len(value) != 0 {
 9          pos = s.linePos - uint(len(value))
10      }
11      s.tokens = append(s.tokens, Token{
12          Pos:   pos,
13          Kind:  kind,
14          Value: value,
15          Line:  s.line,
16      })
17  }
This function allows us to add a token after scanning it simply by calling the function like so:
s.addToken(HASH, "")
Info
Lines 7 through 10 are a bit unintuitive, so let's go into detail and explore their use:
pos := s.linePos

Line 7 is simply assigning the value of the current position the lexer is at to the pos variable.
pos := s.linePos
if len(value) != 0 {
    pos = s.linePos - uint(len(value))
}
The if statement in line 8 checks whether the function parameter value of type string contains anything; this is done by comparing its length against zero.
pos := s.linePos
if len(value) != 0 {
    pos = s.linePos - uint(len(value))
}
The magic happens in the 9th line of the snippet. This line calculates the position of the token by subtracting the length of the function parameter value from the cursor's position on the current line.
This is very dry and hard to understand without an example, so let's go back to the example we used before. Consider we have the following line to lex:

### Heading 3
As we explored before, we scan each hash as its own token and would add it using the addToken function. We will call the function like so:

s.addToken(HASH, "")
This means the addToken function gets an empty string (length 0) as the content of the parameter value, therefore the if statement in its body is not executed.
However, if we lexed a text and want to add it using the addToken function, we would call it like this:

s.addToken(TEXT, " Heading 3")
As one can see, we supplied the function call with the value " Heading 3", which is in fact not of length 0. This means the if statement in the function body is executed and calculates a different position for the TEXT token than for the HASH token.
To explain this behaviour we need to think about how we would lex a token of type TEXT.
When encountering a # we can simply call the s.addToken(HASH, "") function, advance to the next character and lex again.
We can't do this when we encounter non format specific characters (everything not in #_*\n-[]()>!).
To store everything else in a neat way, we iterate over the rest of the line until we hit any of the before mentioned special characters and then add the text with its content using the addToken(TEXT, " Heading 3") function.
This however means the scanner's position is at the end of the text we want to scan:

             scanner is positioned here (pos: 13)
             ↓↓
### Heading 3\n
   ↑
   but the text starts here (pos: 3)
To compensate for this offset, we subtract the length of the text we lexed from the position the scanner is currently at:

end_of_text - length_of_text = beginning_of_the_text

In our example that is 13 - 10 = 3, which is exactly the position where the text " Heading 3" begins.
Moving through the line
After lexing a character we want to take a look at the next one. Therefore we need to somehow advance to the next character and position on the line; the following function does exactly that:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) advance() {
    if s.linePos+1 >= uint(len(s.curLine)) {
        s.curChar = '\n'
        s.linePos++
        return
    }

    s.linePos++
    s.curChar = s.curLine[s.linePos]
}
The function increments the linePos field of the Scanner struct and sets the curChar field to the next character, as long as the scanner wouldn't move past the end of the line by incrementing.
Due to the fact that the bufio.Scanner removes line breaks from the read lines, we need to set the current character to a line break ourselves once we are at the end of the line. (Source)
This enables us to check whether we are at the end of the line while iterating through the current line.
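To make this concrete, here is a hypothetical trace of advance (a sketch, assuming curLine holds only the two runes 'a' and 'b'):

// curLine = []rune{'a', 'b'}, linePos = 0, curChar = 'a'
s.advance() // linePos = 1, curChar = 'b'
s.advance() // linePos = 2, curChar = '\n' (we ran past the end of the line)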
Moving through the file
After hitting a line break (\n) we of course need to advance to the next line. The following function allows us to do exactly this:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) advanceLine() {
    ok := s.scan.Scan()

    if s.scan.Err() != nil || !ok {
        s.isAtEnd = true
        return
    }

    s.curLine = []rune(s.scan.Text())
    s.line++
    s.linePos = 0
    for len(s.curLine) == 0 {
        s.scan.Scan()
        s.curLine = []rune(s.scan.Text())
        s.line++
    }
    s.curChar = s.curLine[s.linePos]
}
The above function calls the scan field of the Scanner structure to get the next line of the file; if this fails or the bufio.Scanner has an error, the isAtEnd flag on the Scanner structure is set.
If the scanner is still in the bounds of the file, it sets the current line to the string result of the line the bufio.Scanner just scanned.
After that it increments the line field of the Scanner structure and sets the field linePos to 0.
If the current line is of length 0, i.e. simply empty, the function loops until it finds a line which isn't empty.
At the end the function assigns the first character of the new line to the curChar field of the Scanner structure.
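As an illustration, here is a hypothetical trace for a file whose second line is empty (a sketch, assuming the scanner just finished the first line):

// file contents:
//   line 0: "### Heading 3"
//   line 1: ""              (empty)
//   line 2: "## Tokens"
s.advanceLine() // reads line 1, which is empty, so the loop scans ahead
                // afterwards: s.line = 2, s.linePos = 0, s.curChar = '#'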
Pretty Printing the Tokens
To debug and take a look at our processed tokens, we implement the scanner.Scanner.PrintTokens function:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) PrintTokens() {
 7      for _, token := range s.tokens {
 8          fmt.Printf("[ '%s' | %d | %d | '%s' ]\n",
 9              TOKEN_LOOKUP_MAP[token.Kind],
10              token.Pos,
11              token.Line,
12              token.Value,
13          )
14      }
15  }
The function simply loops over every token in the tokens array field of the Scanner structure and formats the token values according to our format string.
An attentive reader has already noticed the TOKEN_LOOKUP_MAP in line 9 of the snippet.
This map simply binds the enums we created as our token kinds to their names, because who likes looking at numbers if they can instead look at names 😴.
This map is defined in the scanner/tokens.go file and looks like this:
// scanner/tokens.go
package scanner

// ...

var TOKEN_LOOKUP_MAP = map[uint]string{
    HASH:               "HASH",
    UNDERSCORE:         "UNDERSCORE",
    STAR:               "STAR",
    NEWLINE:            "NEWLINE",
    DASH:               "DASH",
    STRAIGHTBRACEOPEN:  "STRAIGHTBRACEOPEN",
    STRAIGHTBRACECLOSE: "STRAIGHTBRACECLOSE",
    PARENOPEN:          "PARENOPEN",
    PARENCLOSE:         "PARENCLOSE",
    GREATERTHAN:        "GREATERTHAN",
    BACKTICK:           "BACKTICK",
    TEXT:               "TEXT",
    BANG:               "BANG",
}
This means whenever we lex a hash (#) and want to check with our own eyes that it really is a hash, we can simply do the following:

fmt.Println(TOKEN_LOOKUP_MAP[HASH])
Running the scanner.Scanner.PrintTokens function for our example ### Heading 3 results in the following output:
[ 'HASH' | 0 | 0 | '' ]
[ 'HASH' | 1 | 0 | '' ]
[ 'HASH' | 2 | 0 | '' ]
[ 'TEXT' | 3 | 0 | ' Heading 3' ]
[ 'NEWLINE' | 13 | 0 | '' ]
Getting our scanned Tokens
To access the tokens we scanned, we create a new function called scanner.Scanner.Tokens, which returns the scanner.Scanner.tokens structure field:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Tokens() []Token {
    return s.tokens
}
Single character tokens
A single character token in Markdown can be any of:
 # (Hash)
 _ (Underscore)
 * (Star)
\n (New Line)
 - (Dash)
 [ (Open Square Bracket)
 ] (Close Square Bracket)
 ( (Open Paren)
 ) (Close Paren)
 > (Greater than)
 ! (Bang / exclamation mark)
 ` (Backtick)
To parse these we simply check if the value of the curChar field is equal to one of them, and call addToken with the token's kind in the scanner.Scanner.Lex function.
Doing this requires a for loop over all the characters in the current line while we aren't at the end of the file:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {

    }
}
Tip
Remember, the isAtEnd field of the Scanner struct indicates that we left the last line of the file behind us and are therefore done with the file.
To pave the way for matching characters, we create a switch statement over the curChar field of the Scanner structure and a variable tokenKind:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {

        }
    }
}
We use the tokenKind variable to store the kind of token we found.
This allows us to write the call to s.addToken once after the switch statement instead of in each case:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {

        }
        s.addToken(tokenKind, "")
    }
}
After adding the token the current character matches, we need to move on to the next character - remember, we implemented scanner.Scanner.advance exactly for this use case:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {

        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
I am now going to show you how to match the hash in detail and afterwards we are going to fast forward the process for the remaining symbols.
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {
        case '#':
            tokenKind = HASH
        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
Everything we had to do was add the case to match # and assign the kind of token (HASH) to the tokenKind variable.
Now the other symbols:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {
        case '#':
            tokenKind = HASH
        case '!':
            tokenKind = BANG
        case '>':
            tokenKind = GREATERTHAN
        case '_':
            tokenKind = UNDERSCORE
        case '*':
            tokenKind = STAR
        case '-':
            tokenKind = DASH
        case '[':
            tokenKind = STRAIGHTBRACEOPEN
        case ']':
            tokenKind = STRAIGHTBRACECLOSE
        case '(':
            tokenKind = PARENOPEN
        case ')':
            tokenKind = PARENCLOSE
        case '`':
            tokenKind = BACKTICK
        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
Notice the missing line break (\n)?
As I mentioned while discussing the scanner.Scanner.advance function, we need to add the line break at the end of the line ourselves to keep track of whether we are at the end of the current line.
To move to the next line, we need to call the scanner.Scanner.advanceLine function if the scanner encounters the line break symbol:
// scanner/scanner.go
package scanner

// ...

func (s *Scanner) Lex() {
    for !s.isAtEnd {
        var tokenKind uint
        switch s.curChar {
        case '#':
            tokenKind = HASH
        case '!':
            tokenKind = BANG
        case '>':
            tokenKind = GREATERTHAN
        case '_':
            tokenKind = UNDERSCORE
        case '*':
            tokenKind = STAR
        case '\n':
            s.addToken(NEWLINE, "")
            s.advanceLine()
            continue
        case '-':
            tokenKind = DASH
        case '[':
            tokenKind = STRAIGHTBRACEOPEN
        case ']':
            tokenKind = STRAIGHTBRACECLOSE
        case '(':
            tokenKind = PARENOPEN
        case ')':
            tokenKind = PARENCLOSE
        case '`':
            tokenKind = BACKTICK
        }
        s.addToken(tokenKind, "")
        s.advance()
    }
}
As one can see, we add a new token with the NEWLINE kind, then advance to the next line and continue onward to the next iteration of the loop.
If we again look at our example, we can see our lexer can already tokenize four of our five tokens - the three hashes and the line break at the end of the line.
Scanning the rest of the paragraph as a token of kind TEXT is going to be slightly more complex.
Multi character tokens
Let's talk about lexing text. We want our scanner to categorize everything between the special symbols we are already scanning for as a token of kind TEXT.
To accomplish this we will need to add a default case to our switch statement:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) Lex() {
 7      for !s.isAtEnd {
 8          var tokenKind uint
 9          switch s.curChar {
10          case '#':
11              tokenKind = HASH
12          case '!':
13              tokenKind = BANG
14          case '>':
15              tokenKind = GREATERTHAN
16          case '_':
17              tokenKind = UNDERSCORE
18          case '*':
19              tokenKind = STAR
20          case '\n':
21              s.addToken(NEWLINE, "")
22              s.advanceLine()
23              continue
24          case '-':
25              tokenKind = DASH
26          case '[':
27              tokenKind = STRAIGHTBRACEOPEN
28          case ']':
29              tokenKind = STRAIGHTBRACECLOSE
30          case '(':
31              tokenKind = PARENOPEN
32          case ')':
33              tokenKind = PARENCLOSE
34          case '`':
35              tokenKind = BACKTICK
36          default:
37              var res strings.Builder
38
39          out:
40              for {
41                  switch s.curChar {
42                  case '\n', '!', '#', '_', '*', '-', '[', ']', '(', ')', '`', '>':
43                      break out
44                  }
45
46                  res.WriteRune(s.curChar)
47                  s.advance()
48              }
49          }
50          s.addToken(tokenKind, "")
51          s.advance()
52      }
53  }
In this default case we create a new variable of type strings.Builder, to which we will append the characters of the text.
This is considerably faster than just concatenating two strings using the += operator.
The next line defines a label for the following loop (out:).
We will use this label to break the outer loop if we find a character which matches any special character.
In this loop we check if the current character matches a special character (lines 41-44); if it does, we break at the out label (line 43). This stops the loop labeled as out.
If it isn't a special character, the scanner writes the current character to the strings.Builder and advances to the next character.
We also don't want to add empty text as a token, therefore we add the following snippet at the end of the default case:
 1  // scanner/scanner.go
 2  package scanner
 3
 4  // ...
 5
 6  func (s *Scanner) Lex() {
 7      for !s.isAtEnd {
 8          var tokenKind uint
 9          switch s.curChar {
10          case '#':
11              tokenKind = HASH
12          case '!':
13              tokenKind = BANG
14          case '>':
15              tokenKind = GREATERTHAN
16          case '_':
17              tokenKind = UNDERSCORE
18          case '*':
19              tokenKind = STAR
20          case '\n':
21              s.addToken(NEWLINE, "")
22              s.advanceLine()
23              continue
24          case '-':
25              tokenKind = DASH
26          case '[':
27              tokenKind = STRAIGHTBRACEOPEN
28          case ']':
29              tokenKind = STRAIGHTBRACECLOSE
30          case '(':
31              tokenKind = PARENOPEN
32          case ')':
33              tokenKind = PARENCLOSE
34          case '`':
35              tokenKind = BACKTICK
36          default:
37              var res strings.Builder
38
39          out:
40              for {
41                  switch s.curChar {
42                  case '\n', '!', '#', '_', '*', '-', '[', ']', '(', ')', '`', '>':
43                      break out
44                  }
45
46                  res.WriteRune(s.curChar)
47                  s.advance()
48              }
49
50              if res.Len() != 0 {
51                  s.addToken(TEXT, res.String())
52              }
53
54              continue
55          }
56          s.addToken(tokenKind, "")
57          s.advance()
58      }
59  }
This change skips all scanned text tokens which have a string of length 0.
The continue at the end skips adding the token again in line 56, due to the fact that we already added the token in line 51 if it wasn't empty.
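At this point the scanner is usable end to end. A hypothetical main.go wiring everything together (a sketch, the file name is made up, module path as used earlier) could look like this:

// main.go
package main

import "github.com/xnacly/fleck/scanner"

func main() {
    s := scanner.New("test.md")
    s.Lex()
    s.PrintTokens()
    _ = s.Tokens() // hand the token slice to a parser later on
}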
Tests
While writing the lexer we need to check that we get the output right. Go supports testing code out of the box with its built-in testing package and the go test command.
To test, simply create a new file in the scanner directory, named scanner_test.go:
// scanner/scanner_test.go
package scanner

import "testing"

func TestExample(t *testing.T) {
    s := New("./markdown.md")
    s.Lex()
    tokens := s.Tokens()
    _ = tokens // we will make use of the tokens in the next step
}
The _test suffix is important, otherwise the go test command won't recognize the file and therefore the tests in it.
For this test to work, the markdown.md file needs to be created in the scanner/ directory. Fill it with the example we already used before:
### Heading 3
Now we can append the following to our TestExample function:
// scanner/scanner_test.go
package scanner

import "testing"

func TestExample(t *testing.T) {
    s := New("./markdown.md")
    s.Lex()
    tokens := s.Tokens()

    expectedTokens := []uint{
        HASH,
        HASH,
        HASH,
        TEXT,
        NEWLINE,
    }

    if len(tokens) != len(expectedTokens) {
        t.Errorf("expected %d tokens, got: %d", len(expectedTokens), len(tokens))
    }
}
We define the token kinds we expect in the expectedTokens array; after that we compare the length of the tokens we got after scanning with the length of the array we just defined. If they don't match, we did something wrong.
To check if all tokens we matched are correct, we need to loop over them and check if the token we expect matches the token we scanned:
// scanner/scanner_test.go
package scanner

import "testing"

func TestExample(t *testing.T) {
    s := New("./markdown.md")
    s.Lex()
    tokens := s.Tokens()

    expectedTokens := []uint{
        HASH,
        HASH,
        HASH,
        TEXT,
        NEWLINE,
    }

    if len(tokens) != len(expectedTokens) {
        t.Errorf("expected %d tokens, got: %d", len(expectedTokens), len(tokens))
    }

    for i, token := range tokens {
        if expectedTokens[i] != token.Kind {
            t.Errorf("expected %d [%s], got %d [%s] for token %d",
                expectedTokens[i], TOKEN_LOOKUP_MAP[expectedTokens[i]], token.Kind, TOKEN_LOOKUP_MAP[token.Kind], i,
            )
        }
    }
}
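With the test file and markdown.md in place, the tests can be run from the root of the module using Go's built-in tooling; adding the -v flag prints the name and result of each test function:

go test ./scanner -v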
So that's it - now you are able to write a lexer for Markdown and test whether everything worked out correctly.