Extract Metadata From HTML with Go

Scope of this article ##

I am currently in the process of writing a local-first search engine that extracts the sitemap of a given website and indexes all listed pages. The index scope is currently limited to the following metadata:

  • page title (contained in <title>...</title>)
  • description of the page (encoded in <meta>)
  • favicon of the webpage (<link type="image/x-icon">)

Thinking about tests ##

The first task is to establish test data we can use for extracting the aforementioned tags. I am going to use a recent post of mine (https://xnacly.me/posts/2023/sophia-lang-weekly02/) to throw against the extractor we are going to write:

package main

import (
    "net/http"
    "testing"
    "time"
)

func TestExtractor(t *testing.T) {
    client := http.Client{
        Timeout: time.Millisecond * 250,
    }
    resp, err := client.Get("https://xnacly.me/posts/2023/sophia-lang-weekly02/")
    if err != nil {
        t.Fatal(err)
    } else if resp.StatusCode > http.StatusPermanentRedirect {
        t.Fatalf("unexpected status code %d", resp.StatusCode)
    }

    site := Extract(resp.Body)
    if err := resp.Body.Close(); err != nil {
        t.Error(err)
    }

    if site.Title != "Sophia Lang Weekly - 02 | xnacly - blog" {
        t.Errorf("Title doesn't match, got %q", site.Title)
    }

    if site.Description != "Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas" {
        t.Errorf("Description doesn't match, got %q", site.Description)
    }

    if site.IconUrl != "https://xnacly.me/images/favicon.ico" {
        t.Errorf("IconUrl doesn't match, got %q", site.IconUrl)
    }
}

This test case is straightforward: we make a request, pass the body to the Extract function (which we are going to write in the next section), and afterwards check whether the resulting data structure contains the correct values. If the request itself fails, we abort right away with t.Fatal, since there is no body to pass on.
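As an aside, this test depends on the network and on my blog post staying exactly as it is. A more self-contained setup could serve a fixture via net/http/httptest instead; a minimal sketch (the fixture HTML is made up for illustration, and the httptest import is assumed):

// inside the test, instead of hitting the live blog:
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte(`<html><head><title>fixture</title></head><body></body></html>`))
}))
defer srv.Close()
resp, err := client.Get(srv.URL) // then assert against the fixture's values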

Setup, Types and Packages ##

Let's take a look at the signature of the Extract function:

package main

import "io"

type Site struct {
    Title       string
    Description string
    IconUrl     string
}

func Extract(r io.Reader) Site {
    site := Site{}
    return site
}

Let's now add the dependency to our imports and get started with creating a new tokenizer:

Tip

We are using the x/net/html package; you can add it to your project's go.mod via

$ go get golang.org/x/net/html

package main

import (
    "io"

    "golang.org/x/net/html"
)

type Site struct {
    Title       string
    Description string
    IconUrl     string
}

func Extract(r io.Reader) Site {
    site := Site{}
    lexer := html.NewTokenizer(r)
    _ = lexer // silence the "declared and not used" error until we use the tokenizer below
    return site
}

Let's think content (using our heads) ###

All our extraction targets are contained in the <head></head> tag of an HTML document. Since we are testing against my recent blog article, I will use it to highlight the structure of the data we want to extract:

<!DOCTYPE html>
<html lang="en-us" data-lt-installed="true">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <title>Sophia Lang Weekly - 02 | xnacly - blog</title>
    <meta charset="utf-8" />
    <meta http-equiv="x-ua-compatible" content="IE=edge,chrome=1" />
    <meta name="viewport" content="width=device-width,minimum-scale=1" />
    <meta property="og:type" content="article" />
    <meta property="og:title" content="Sophia Lang Weekly - 02" />
    <meta
      property="og:description"
      content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
    />
    <meta
      name="description"
      content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
    />
    <meta property="article:author" content="xnacly" />
    <meta
      property="article:published_time"
      content="2023-12-24 00:00:00 +0000 UTC"
    />
    <meta name="generator" content="Hugo 0.119.0" />
    <link
      rel="shortcut icon"
      href="https://xnacly.me/images/favicon.ico"
      type="image/x-icon"
    />
  </head>
  <!-- body -->
</html>

Since we can safely assume our content is contained in the head tag, we will first create a bit of context in our extractor and get our hands on the current token and its type:

package main

import (
    "io"

    "golang.org/x/net/html"
)

type Site struct {
    Title       string
    Description string
    IconUrl     string
}

func Extract(r io.Reader) Site {
    site := Site{}
    lexer := html.NewTokenizer(r)

    var inHead bool
    for {
        tokenType := lexer.Next()
        tok := lexer.Token()
        switch tokenType {
        case html.ErrorToken: // EOF or a read error, stop lexing
            return site
        case html.EndTagToken: // </head>
            if tok.Data == "head" {
                return site
            }
        case html.StartTagToken: // <head>
            if tok.Data == "head" {
                inHead = true
            }

            // keep lexing if not in head of document
            if !inHead {
                continue
            }
        }
    }
}

Tip

html.Token exports the Data field, which holds either the name of the current tag or, for text tokens, its content.

We use this snippet to detect whether we are currently inside the head tag; if not, we simply skip the current token. Once we hit </head> we do not care about the rest of the document, so we exit the function. We also return on html.ErrorToken, which the tokenizer emits at the end of input or on a read error; without this case the loop would never terminate for documents that lack a closing </head>.
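To make this control flow more tangible, here is a tiny self-contained sketch of what the tokenizer emits for a minimal document; the expected output is shown as comments:

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    lexer := html.NewTokenizer(strings.NewReader("<head><title>hi</title></head>"))
    for {
        tokenType := lexer.Next()
        if tokenType == html.ErrorToken {
            break // the reader is exhausted
        }
        fmt.Printf("%v %q\n", tokenType, lexer.Token().Data)
    }
    // StartTag "head"
    // StartTag "title"
    // Text "hi"
    // EndTag "title"
    // EndTag "head"
}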

The title and its text contents ###

Let's add the logic for detecting and storing the title tag content:

package main

import (
    "io"

    "golang.org/x/net/html"
)

type Site struct {
    Title       string
    Description string
    IconUrl     string
}

func Extract(r io.Reader) Site {
    site := Site{}
    lexer := html.NewTokenizer(r)

    var inHead bool
    var inTitle bool
    for {
        tokenType := lexer.Next()
        tok := lexer.Token()
        switch tokenType {
        case html.ErrorToken: // EOF or a read error, stop lexing
            return site
        case html.EndTagToken: // </head>
            if tok.Data == "head" {
                return site
            }
        case html.StartTagToken: // <head>
            if tok.Data == "head" {
                inHead = true
            }

            // keep lexing if not in head of document
            if !inHead {
                continue
            }

            if tok.Data == "title" {
                inTitle = true
            }
        case html.TextToken:
            if inTitle {
                site.Title = tok.Data
                inTitle = false
            }
        }
    }
}

This enables us to check whether we are between a <title> and a </title> tag; if we are, we write the content of the html.TextToken to site.Title.
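At this stage we can already exercise the title logic without any network access. A quick, self-contained check (assuming "fmt" and "strings" are imported):

// feed Extract an in-memory document instead of an HTTP response body
doc := `<html><head><title>Hello</title></head><body></body></html>`
site := Extract(strings.NewReader(doc))
fmt.Println(site.Title) // Hello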

The problematic part of the problem ###

We want to extract the <meta> tags carrying either property="og:description" or name="description". This requires us to check whether a given tag is of type meta, whether it has the matching attribute, and if so to write its content to our Site structure. One subtlety: these tags are often written self-closing (<meta ... />), as in the document above, in which case the tokenizer emits html.SelfClosingTagToken instead of html.StartTagToken, so our case will have to match both.

<meta
  property="og:description"
  content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
/>
<meta
  name="description"
  content="Again, new object/array index + function + loop syntax, built-in functions, a License and lambdas"
/>

The issue: the tokenizer hands us the attributes as a slice of html.Attribute, which is not particularly efficient to search over and over for the many pages we want to index. Therefore we need a helper function that converts this slice into a hash map, mapping attribute keys to values:

// [...]
func attrToMap(attr []html.Attribute) map[string]string {
    r := make(map[string]string, len(attr))
    for _, a := range attr {
        r[a.Key] = a.Val
    }
    return r
}

We will use this function in combination with hasKeyWithValue - a function we will implement in a second:

// [...]
func Extract(r io.Reader) Site {
    site := Site{}
    lexer := html.NewTokenizer(r)

    var inHead bool
    var inTitle bool
    for {
        tokenType := lexer.Next()
        tok := lexer.Token()
        switch tokenType {
        case html.ErrorToken: // EOF or a read error, stop lexing
            return site
        case html.EndTagToken: // </head>
            if tok.Data == "head" {
                return site
            }
        case html.StartTagToken, html.SelfClosingTagToken: // <head>, <meta ... />
            if tok.Data == "head" {
                inHead = true
            }

            // keep lexing if not in head of document
            if !inHead {
                continue
            }

            if tok.Data == "meta" {
                attrMap := attrToMap(tok.Attr)
                if hasKeyWithValue(attrMap, "property", "og:description") || hasKeyWithValue(attrMap, "name", "description") {
                    // a missing content attribute simply yields an empty string
                    site.Description = attrMap["content"]
                }
            }

            if tok.Data == "title" {
                inTitle = true
            }
        case html.TextToken:
            if inTitle {
                site.Title = tok.Data
                inTitle = false
            }
        }
    }
}

The usage of the new hasKeyWithValue function may not be intuitive, so let's first take a look at the implementation:

// [...]
func hasKeyWithValue(attributes map[string]string, key, value string) bool {
    if val, ok := attributes[key]; ok {
        return val == value
    }
    return false
}

The function simply checks whether the key is contained in the map and whether the passed-in value matches the one found there.
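A quick usage example of the two helpers working together:

attrs := attrToMap([]html.Attribute{
    {Key: "name", Val: "description"},
    {Key: "content", Val: "hello world"},
})
hasKeyWithValue(attrs, "name", "description") // true
hasKeyWithValue(attrs, "name", "keywords")    // false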

My favorite icon - a favicon ###

Let's employ the previously introduced helper functions to extract the favicon:

// [...]
func Extract(r io.Reader) Site {
    site := Site{}
    lexer := html.NewTokenizer(r)

    var inHead bool
    var inTitle bool
    for {
        tokenType := lexer.Next()
        tok := lexer.Token()
        switch tokenType {
        case html.ErrorToken: // EOF or a read error, stop lexing
            return site
        case html.EndTagToken: // </head>
            if tok.Data == "head" {
                return site
            }
        case html.StartTagToken, html.SelfClosingTagToken: // <head>, <meta ... />, <link ... />
            if tok.Data == "head" {
                inHead = true
            }

            // keep lexing if not in head of document
            if !inHead {
                continue
            }

            if tok.Data == "meta" {
                attrMap := attrToMap(tok.Attr)
                if hasKeyWithValue(attrMap, "property", "og:description") || hasKeyWithValue(attrMap, "name", "description") {
                    // a missing content attribute simply yields an empty string
                    site.Description = attrMap["content"]
                }
            } else if tok.Data == "link" {
                attrMap := attrToMap(tok.Attr)
                if hasKeyWithValue(attrMap, "type", "image/x-icon") {
                    site.IconUrl = attrMap["href"]
                }
            }

            if tok.Data == "title" {
                inTitle = true
            }
        case html.TextToken:
            if inTitle {
                site.Title = tok.Data
                inTitle = false
            }
        }
    }
}

Here we check whether a link tag contains the attribute-value pair type="image/x-icon" and, if so, store the value of its href attribute.
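One thing to keep in mind: the test page happens to use an absolute href, but many sites reference their favicon relatively (for example href="/favicon.ico"). A minimal sketch of how one could normalize such values with net/url; resolveIconUrl is a hypothetical helper, not part of the extractor above, and it assumes the caller knows the URL of the page the tag was found on:

// hypothetical helper: resolve a possibly relative favicon href
// against the URL of the page it was found on
func resolveIconUrl(pageUrl, href string) (string, error) {
    base, err := url.Parse(pageUrl)
    if err != nil {
        return "", err
    }
    ref, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    return base.ResolveReference(ref).String(), nil
}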

Running our tests ##

If we run our very cool test now, we should get no errors and everything should be fine:

tmp.nSnzQqKFL3 :: go test ./... -v
=== RUN   TestExtractor
--- PASS: TestExtractor (0.09s)
PASS
ok  	test	0.093s

Conclusion? ##

This may not be the fastest, cleanest, or most efficient way, but it works and it's robust.