Go: parsing an HTML document

Parsing an HTML document is a common task in web development. In Go, the programming language created at Google, several libraries let you parse an HTML document easily and efficiently. In this article, I'll walk you through the steps required to parse an HTML document using golang.org/x/net/html, a package maintained by the Go team that lives outside the standard library.

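To get started, you'll need to import three packages: fmt for printing results, strings for wrapping the HTML string in a reader, and "golang.org/x/net/html" for parsing HTML. Because the last one is not part of the standard library, add it to your module first by running go get golang.org/x/net/html. Then add the following lines of code to the top of your Go file:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )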

Next, we'll define a function called parseHTML that takes a string containing the HTML to parse as its argument:


    func parseHTML(htmlString string) {
        // Parse the HTML
        doc, err := html.Parse(strings.NewReader(htmlString))
        if err != nil {
            fmt.Println("Errore durante il parsing dell'HTML:", err)
            return
        }
    
        // Walk the DOM tree
        traverseHTML(doc)
    }

The parseHTML function calls html.Parse, which returns a pointer to a Node representing the root of the HTML document's DOM tree, together with an error value. If parsing fails, parseHTML prints an error message and returns; otherwise it passes the tree to traverseHTML.

Now that we have the HTML document parsed, we can traverse the DOM tree to get the information we want. We define a function called traverseHTML to do this:


    func traverseHTML(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "a" {
            // Example: print the text and URL of every <a> tag
            if node.FirstChild != nil {
                // Guard against empty links, which have no child node
                fmt.Println("Testo:", node.FirstChild.Data)
            }
            for _, attr := range node.Attr {
                if attr.Key == "href" {
                    fmt.Println("URL:", attr.Val)
                    break
                }
            }
        }
    
        // Visit the child nodes
        for child := node.FirstChild; child != nil; child = child.NextSibling {
            traverseHTML(child)
        }
    }

In the "traverseHTML" function, we check whether the current node is an <a> element using the condition node.Type == html.ElementNode && node.Data == "a". For element nodes, node.Data holds the tag name. In this example, we are looking for all <a> tags in the HTML document. You can adjust this condition according to your needs.

Once we find an <a> tag, we print the text contained within it using "node.FirstChild.Data", which holds that text when the link's first child is a text node. Next, we iterate through the tag's attributes using "node.Attr" and find the corresponding URL by looking for the "href" attribute.

Finally, we recursively call the "traverseHTML" function to traverse all child nodes of the current node.

Now that we've defined the parsing function and the traversal function, we can use them in our main code:


    func main() {
        // Sample HTML document
        htmlString := `
            <html>
                <body>
                    <h1>Titolo</h1>
                    <p>Un paragrafo di testo.</p>
                    <a href="https://www.example.com">Link a example.com</a>
                </body>
            </html>
        `
    
        parseHTML(htmlString)
    }

In the example above, we declare an htmlString variable containing a sample HTML document, then call the parseHTML function with that string as its argument.

Upon execution, the program's output will be:


    Testo: Link a example.com
    URL: https://www.example.com

As you can see, we have printed the text of the <a> tag and the corresponding URL.

Conclusion

In this article, you learned how to parse an HTML document using the Go programming language. We used the "golang.org/x/net/html" package, which is maintained by the Go team outside the standard library, to parse the document, and we traversed the DOM tree to get the desired information. You can use this knowledge as a foundation for more complex operations such as extracting structured data from HTML documents or manipulating content.