Scope of this article
I am currently in the process of writing a local-first search engine: it extracts the sitemap of a given webpage and indexes all listed sites. The index scope is currently limited to the following metadata:

- page title (contained in `<title>...</title>`)
- description of the page (encoded in a `<meta>` tag)
- favicon of the webpage (`<link type="image/x-icon">`)
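This article focuses on the metadata extractor, but for context, the sitemap step mentioned above could look roughly like the following sketch. It is a hypothetical illustration, not the engine's actual code: the sitemap URL is assumed, and real sitemaps may be nested sitemap index files, which this sketch ignores.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

// urlSet models a flat sitemap.xml with <url><loc>...</loc></url> entries
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

func main() {
	// assumption: the site exposes a standard Hugo-style sitemap.xml
	resp, err := http.Get("https://xnacly.me/sitemap.xml")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var set urlSet
	if err := xml.NewDecoder(resp.Body).Decode(&set); err != nil {
		panic(err)
	}
	// every <loc> is a candidate page to fetch and index
	for _, u := range set.URLs {
		fmt.Println(u.Loc)
	}
}
```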
Thinking about tests
The first task is to establish test data we can use for extracting the aforementioned tags. I am going to use a recent post of mine (https://xnacly.me/posts/2023/sophia-lang-weekly02/) to throw against the extractor we are going to write:
```go
package main

import (
	"net/http"
	"testing"
	"time"
)

func TestExtractor(t *testing.T) {
	client := http.Client{
		Timeout: time.Millisecond * 250,
	}
	resp, err := client.Get("https://xnacly.me/posts/2023/sophia-lang-weekly02/")
	if err != nil {
		// stop here, resp is nil and accessing it below would panic
		t.Fatal(err)
	} else if resp.StatusCode > http.StatusPermanentRedirect {
		t.Fatalf("expected a non-error status code, got %d", resp.StatusCode)
	}

	site := Extract(resp.Body)
	if err := resp.Body.Close(); err != nil {
		t.Error(err)
	}

	if site.Title != "Sophia Lang Weekly - 02 | xnacly - blog" {
		t.Errorf("Title doesn't match, got %q", site.Title)
	}

	if site.Description != "Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas" {
		t.Errorf("Description doesn't match, got %q", site.Description)
	}

	if site.IconUrl != "https://xnacly.me/images/favicon.ico" {
		t.Errorf("IconUrl doesn't match, got %q", site.IconUrl)
	}
}
```
This test case is straightforward: we make a request, pass the body to the `Extract` function (which we are going to write in the next section) and afterwards check that the resulting data structure contains the correct values.
Setup, Types and Packages
Let's take a look at the signature of the `Extract` function:
```go
package main

import "io"

type Site struct {
	Title       string
	Description string
	IconUrl     string
}

func Extract(r io.Reader) Site {
	site := Site{}
	return site
}
```
Let's now add the dependency to our imports and get started with creating a new tokenizer:
Tip

We are using the x/net/html package; you can add it to your project's go.mod via

```
$ go get golang.org/x/net/html
```
```go
package main

import (
	"io"

	"golang.org/x/net/html"
)

type Site struct {
	Title       string
	Description string
	IconUrl     string
}

func Extract(r io.Reader) Site {
	site := Site{}
	lexer := html.NewTokenizer(r)
	_ = lexer // silences the unused-variable error until we use it below
	return site
}
```
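To get a feel for what the tokenizer emits before wiring it into `Extract`, here is a small, self-contained sketch; the HTML snippet is made up for illustration:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	doc := `<head><title>hello</title></head>`
	lexer := html.NewTokenizer(strings.NewReader(doc))
	for {
		tokenType := lexer.Next()
		if tokenType == html.ErrorToken {
			// the tokenizer signals the end of input (io.EOF) as an ErrorToken
			break
		}
		tok := lexer.Token()
		fmt.Printf("%-10v %q\n", tokenType, tok.Data)
	}
}
```

Running this prints StartTag events for `head` and `title`, a Text event containing `hello` and the matching EndTag events - exactly the event stream our extractor will dispatch on.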
Let's think content (using our heads)
All our extraction targets are contained in the `<head></head>` tag of an HTML document. Since we are going to test against my recent blog article, I will use the same document to highlight the structure of the data we want to extract:
```html
<!DOCTYPE html>
<html lang="en-us" data-lt-installed="true">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <title>Sophia Lang Weekly - 02 | xnacly - blog</title>
    <meta charset="utf-8" />
    <meta http-equiv="x-ua-compatible" content="IE=edge,chrome=1" />
    <meta name="viewport" content="width=device-width,minimum-scale=1" />
    <meta property="og:type" content="article" />
    <meta property="og:title" content="Sophia Lang Weekly - 02" />
    <meta
      property="og:description"
      content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
    />
    <meta
      name="description"
      content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
    />
    <meta property="article:author" content="xnacly" />
    <meta
      property="article:published_time"
      content="2023-12-24 00:00:00 +0000 UTC"
    />
    <meta name="generator" content="Hugo 0.119.0" />
    <link
      rel="shortcut icon"
      href="https://xnacly.me/images/favicon.ico"
      type="image/x-icon"
    />
  </head>
  <!-- body -->
</html>
```
We can safely assume our content is contained in the `head` tag, thus we will first create a bit of context in our extractor and get our hands on the current token and its type:
```go
package main

import (
	"io"

	"golang.org/x/net/html"
)

type Site struct {
	Title       string
	Description string
	IconUrl     string
}

func Extract(r io.Reader) Site {
	site := Site{}
	lexer := html.NewTokenizer(r)

	var inHead bool
	for {
		tokenType := lexer.Next()
		tok := lexer.Token()
		switch tokenType {
		case html.ErrorToken:
			// io.EOF or malformed input; without this case the loop
			// would spin forever on documents missing </head>
			return site
		case html.EndTagToken: // </head>
			if tok.Data == "head" {
				return site
			}
		case html.StartTagToken: // <head>
			if tok.Data == "head" {
				inHead = true
			}

			// keep lexing if not in head of document
			if !inHead {
				continue
			}
		}
	}
}
```
Tip

`html.Token` exports the `Data` field that is either filled with the name of the current tag or with its content (for text tokens).

We use this snippet to detect whether we are currently in the `head` tag; if not, we simply skip the current token. Once we hit the `</head>` we do not care about the rest of the document, thus we exit the function.
The title and its text contents
Let's add the logic for detecting and storing the `title` tag content:
```go
package main

import (
	"io"

	"golang.org/x/net/html"
)

type Site struct {
	Title       string
	Description string
	IconUrl     string
}

func Extract(r io.Reader) Site {
	site := Site{}
	lexer := html.NewTokenizer(r)

	var inHead bool
	var inTitle bool
	for {
		tokenType := lexer.Next()
		tok := lexer.Token()
		switch tokenType {
		case html.ErrorToken: // io.EOF or malformed input, stop here
			return site
		case html.EndTagToken: // </head>
			if tok.Data == "head" {
				return site
			}
		case html.StartTagToken: // <head>
			if tok.Data == "head" {
				inHead = true
			}

			// keep lexing if not in head of document
			if !inHead {
				continue
			}

			if tok.Data == "title" {
				inTitle = true
			}
		case html.TextToken:
			if inTitle {
				site.Title = tok.Data
				inTitle = false
			}
		}
	}
}
```
This enables us to check whether we are between a `<title>` and a `</title>` tag; if we are, we write the content of the `html.TextToken` to `site.Title`.
The problematic part of the problem
We want to extract the `<meta>` tags carrying the description, marked either with `property="og:description"` or `name="description"`. This requires us to check if a given tag is of type `meta`, whether it has one of these attribute-value pairs, and if so, write its `content` to our `site` structure.
```html
<meta
  property="og:description"
  content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
/>
<meta
  name="description"
  content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
/>
```
The issue is that the token exposes its attributes as a slice of `html.Attribute`, which is not particularly efficient to search multiple times for multiple pages to index. Therefore we need a helper function for converting this slice into a hash table mapping keys to values:
```go
// [...]

// attrToMap converts a slice of attributes into a key-to-value map
func attrToMap(attr []html.Attribute) map[string]string {
	r := make(map[string]string, len(attr))
	for _, a := range attr {
		r[a.Key] = a.Val
	}
	return r
}
```
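As a quick sanity check, here is a hypothetical call on hand-built attributes, assuming it lives in the same package as `attrToMap`; the values are made up:

```go
package main

import (
	"fmt"

	"golang.org/x/net/html"
)

func main() {
	// attributes as the tokenizer would produce them for
	// <meta name="description" content="some description">
	attrs := []html.Attribute{
		{Key: "name", Val: "description"},
		{Key: "content", Val: "some description"},
	}
	m := attrToMap(attrs)
	fmt.Println(m["content"]) // prints "some description"
}
```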
We will use this function in combination with `hasKeyWithValue` - a function we will implement in a second:
```go
// [...]
func Extract(r io.Reader) Site {
	site := Site{}
	lexer := html.NewTokenizer(r)

	var inHead bool
	var inTitle bool
	for {
		tokenType := lexer.Next()
		tok := lexer.Token()
		switch tokenType {
		case html.ErrorToken: // io.EOF or malformed input, stop here
			return site
		case html.EndTagToken: // </head>
			if tok.Data == "head" {
				return site
			}
		case html.StartTagToken: // <head>
			if tok.Data == "head" {
				inHead = true
			}

			// keep lexing if not in head of document
			if !inHead {
				continue
			}

			if tok.Data == "meta" {
				attrMap := attrToMap(tok.Attr)
				if hasKeyWithValue(attrMap, "property", "og:description") || hasKeyWithValue(attrMap, "name", "description") {
					// we have the check above, thus we skip the ok-check here
					site.Description = attrMap["content"]
				}
			}

			if tok.Data == "title" {
				inTitle = true
			}
		case html.TextToken:
			if inTitle {
				site.Title = tok.Data
				inTitle = false
			}
		}
	}
}
```
The usage of the new `hasKeyWithValue` function may not be intuitive, therefore let's first take a look at the implementation:
```go
// [...]
func hasKeyWithValue(attributes map[string]string, key, value string) bool {
	if val, ok := attributes[key]; ok {
		if val == value {
			return true
		}
	}
	return false
}
```
The function simply checks whether the key is contained in the map and whether the passed-in value matches the value found in the attribute map.
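Since a lookup on a missing key yields the zero value `""`, the same logic could be collapsed into a one-liner - a minor stylistic alternative, assuming we never need to match an empty value:

```go
func hasKeyWithValue(attributes map[string]string, key, value string) bool {
	// a missing key returns "", which never equals a non-empty value,
	// so the explicit ok-check can be dropped
	return attributes[key] == value
}
```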
My favorite icon - a favicon
Let's employ the previously introduced helper functions for extracting the favicon:
```go
// [...]
func Extract(r io.Reader) Site {
	site := Site{}
	lexer := html.NewTokenizer(r)

	var inHead bool
	var inTitle bool
	for {
		tokenType := lexer.Next()
		tok := lexer.Token()
		switch tokenType {
		case html.ErrorToken: // io.EOF or malformed input, stop here
			return site
		case html.EndTagToken: // </head>
			if tok.Data == "head" {
				return site
			}
		case html.StartTagToken: // <head>
			if tok.Data == "head" {
				inHead = true
			}

			// keep lexing if not in head of document
			if !inHead {
				continue
			}

			if tok.Data == "meta" {
				attrMap := attrToMap(tok.Attr)
				if hasKeyWithValue(attrMap, "property", "og:description") || hasKeyWithValue(attrMap, "name", "description") {
					// we have the check above, thus we skip the ok-check here
					site.Description = attrMap["content"]
				}
			} else if tok.Data == "link" {
				attrMap := attrToMap(tok.Attr)
				if hasKeyWithValue(attrMap, "type", "image/x-icon") {
					site.IconUrl = attrMap["href"]
				}
			}

			if tok.Data == "title" {
				inTitle = true
			}
		case html.TextToken:
			if inTitle {
				site.Title = tok.Data
				inTitle = false
			}
		}
	}
}
```
Here we check if a `link` tag contains the attribute-value pair `type="image/x-icon"` and store the value of its `href` attribute.
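One caveat: many pages declare their favicon without a `type` attribute, e.g. via `<link rel="icon" href="...">`. A possible relaxation of the check above (not part of the extractor as written, and untested against such pages) could also accept the `rel` variants:

```go
// hypothetical extension: accept rel="icon" and rel="shortcut icon"
// in addition to type="image/x-icon"
if hasKeyWithValue(attrMap, "type", "image/x-icon") ||
	hasKeyWithValue(attrMap, "rel", "icon") ||
	hasKeyWithValue(attrMap, "rel", "shortcut icon") {
	site.IconUrl = attrMap["href"]
}
```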
Running our tests
If we run our very cool test now, we should get no errors and everything should be fine:
```
tmp.nSnzQqKFL3 :: go test ./... -v
=== RUN   TestExtractor
--- PASS: TestExtractor (0.09s)
PASS
ok      test    0.093s
```
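For completeness, a minimal driver that uses `Extract` outside of the test suite might look like this sketch, reusing the example post from the test and keeping error handling deliberately short:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	resp, err := http.Get("https://xnacly.me/posts/2023/sophia-lang-weekly02/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Extract consumes the body and returns the collected metadata
	site := Extract(resp.Body)
	fmt.Printf("title:       %s\n", site.Title)
	fmt.Printf("description: %s\n", site.Description)
	fmt.Printf("icon:        %s\n", site.IconUrl)
}
```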
Conclusion?
This may not be the fastest, cleanest or most efficient way, but it works and it's robust.