Go Web Scraper: Build and Optimize HTML Parsers


Mar 3, 2025 - 17:41

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Installation and Usage of Goquery

Installation

Execute:

go get github.com/PuerkitoBio/goquery

Import

import "github.com/PuerkitoBio/goquery"

Load the Page

Take the IMDb Popular Movies page as an example:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        // res.Status already contains the numeric code, e.g. "404 Not Found".
        log.Fatalf("status code error: %s", res.Status)
    }

Get the Document Object

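    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // Print the page title to confirm that the document was parsed.
    fmt.Println(doc.Find("title").Text())
}

Other creation methods:

    // From any io.Reader, for example an in-memory string:
    // doc, err := goquery.NewDocumentFromReader(strings.NewReader("Example content"))
    // Directly from a URL (deprecated in recent goquery releases; prefer net/http plus NewDocumentFromReader):
    // doc, err := goquery.NewDocument(url)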

Select Elements

Element Selector

Select based on basic HTML elements. For example, dom.Find("p") matches all p tags. It supports chained calls:

ele.Find("h2").Find("a")

Attribute Selector

Filter elements by element attributes and values, with multiple matching methods:

Find("div[my]")        // Filter div elements with the my attribute
Find("div[my=zh]")     // Filter div elements whose my attribute is zh
Find("div[my!=zh]")    // Filter div elements whose my attribute is not zh (divs without a my attribute also match)
Find("div[my|=zh]")    // Filter div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]")    // Filter div elements whose my attribute contains the string zh
Find("div[my~=zh]")    // Filter div elements whose my attribute contains the word zh
Find("div[my$=zh]")    // Filter div elements whose my attribute ends with zh
Find("div[my^=zh]")    // Filter div elements whose my attribute starts with zh

parent > child Selector

Filter the direct child elements of a certain element. For example, dom.Find("div>p") filters the p tags that are immediate children of a div tag; p tags nested more deeply are not matched.

element + next Adjacent Selector

Use it when the target elements themselves have no distinguishing pattern, but the element just before them does. For example, dom.Find("p[my=a]+p") filters the p tag that immediately follows a p tag whose my attribute is a.

element~next Sibling Selector

Filter all following siblings under the same parent element, whether adjacent or not. For example, dom.Find("p[my=a]~p") filters every p tag that comes after a p tag whose my attribute is a and shares its parent.

ID Selector

It starts with # and precisely matches the element. For example, dom.Find("#title") matches the content with id=title, and you can specify the tag dom.Find("p#title").

ele.Find("#title")

Class Selector

It starts with . and filters the elements with the specified class name. For example, dom.Find(".content1"), and you can specify the tag dom.Find("div.content1").

ele.Find(".title")

Selector OR (,) Operation

Combine multiple selectors, separated by commas. An element is selected if it matches any one of them. For example, Find("div,span").

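func main() {
    html := `<body>
        <div>DIV1</div>
        <span><div>DIV5</div></span>
    </body>`

    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Every element that is a div or a span is visited once, in document order.
    dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
        inner, _ := selection.Html() // Html returns (string, error)
        fmt.Println(inner)
    })
}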

Filters

:contains Filter

Filter elements whose text, including the text of their descendants, contains the specified string. For example, dom.Find("p:contains(a)") filters the p tags whose text contains a.

dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})

:has(selector)

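Filter elements that have at least one descendant matching the given selector. For example, dom.Find("div:has(p)") filters the div tags that contain a p tag.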

:empty

Filter elements that have no children at all: no child elements and no text.

:first-child and :first-of-type Filters

Find("p:first-child") filters the p tags that are the first child of their parent; Find("p:first-of-type") filters the first p tag among its siblings, even if other kinds of elements come before it.

:last-child and :last-of-type Filters

The counterparts of :first-child and :first-of-type: they match the last child of the parent and the last element of its type.

:nth-child(n) and :nth-of-type(n) Filters

:nth-child(n) filters the element that is the nth child of its parent; :nth-of-type(n) filters the nth element among siblings of the same type. Counting starts at 1.

:nth-last-child(n) and :nth-last-of-type(n) Filters

These count in reverse order, so n = 1 refers to the last element.

:only-child and :only-of-type Filters

Find(":only-child") filters the only child element in the parent element; Find(":only-of-type") filters the only element of the same type.

Get Content

ele.Html()
ele.Text()

Traversal

Use the Each method to traverse the selected elements:

ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
    href, _ := elA.Attr("href")
    fmt.Println(href)
})

Built-in Functions

Array Positioning Functions

Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection

Extended Functions

Add...()
AndSelf()
Union()

Filtering Functions

End()
Filter...()
Has...()
Intersection()
Not...()

Loop Traversal Functions

Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)

Document Modification Functions

After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()

Attribute Manipulation Functions

Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()

Node Search Functions

Contains()
Is...()

Document Tree Traversal Functions

Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()

Type Definitions

Document
Selection
Matcher

Helper Functions

NodeName
OuterHtml

Examples

Getting Started Example

func main() {
    html := `<body>
        <h1>O Captain! My Captain!</h1>
        <p>O Captain! my Captain! our fearful trip is done,
        The ship has weather’d every rack, the prize we sought is won,
        The port is near, the bells I hear, the people all exulting,
        While follow eyes the steady keel, the vessel grim and daring;</p>
    </body>`

    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Print the text of every p tag; the h1 heading is not matched.
    dom.Find("p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}

Example of Crawling IMDb Popular Movie Information

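package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // goquery.NewDocument(url) is deprecated, so fetch the page with net/http
    // and check the response before parsing it.
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %s", res.Status)
    }

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // Each matched anchor holds a movie title and a site-relative link.
    doc.Find(".titleColumn a").Each(func(i int, selection *goquery.Selection) {
        title := selection.Text()
        href, _ := selection.Attr("href")
        fmt.Printf("Movie Name: %s, Link: https://www.imdb.com%s\n", title, href)
    })
}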

The example above extracts the movie names and links from the IMDb popular movies page. In actual use, adjust the selectors and processing logic to your needs, and check that the .titleColumn selector still matches the live page, since a site's markup can change at any time.

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Finally, I would like to recommend the best platform for deploying Go services: Leapcell


1. Multi-Language Support

  • Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

  • Pay only for usage: no requests, no charges.

3. Unbeatable Cost Efficiency

  • Pay-as-you-go with no idle charges.
  • Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

  • Intuitive UI for effortless setup.
  • Fully automated CI/CD pipelines and GitOps integration.
  • Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

  • Auto-scaling to handle high concurrency with ease.
  • Zero operational overhead — just focus on building.


Explore more in the documentation!

Leapcell Twitter: https://x.com/LeapcellHQ