In this article we will see how to parse a Google sitemap with Go.
A Google sitemap is an XML document whose root element is the urlset element. This element contains the url elements. Each url element holds information about one URL of the website, enclosed in the following child elements:
loc: the URL of the web page
lastmod: the date of the last modification
changefreq: how frequently the page is expected to change
priority: a number representing the priority to assign to the URL
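All of these values arrive as plain text. As a minimal sketch of how they could be converted to typed values in Go, the helpers below (parseLastmod and parsePriority, both hypothetical names) handle only two common lastmod forms, a full RFC 3339 timestamp and a date-only value, and assume the sitemap protocol's 0.0 to 1.0 priority range with 0.5 as the default. The sample values are invented for illustration.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"time"
)

// parseLastmod converts a <lastmod> value to a time.Time.
// Assumption: only two common forms are handled, a full RFC 3339
// timestamp and a date-only value (YYYY-MM-DD).
func parseLastmod(v string) (time.Time, error) {
	for _, layout := range []string{time.RFC3339, "2006-01-02"} {
		if t, err := time.Parse(layout, v); err == nil {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("unsupported lastmod value %q", v)
}

// parsePriority converts a <priority> value to a float64.
// An empty value falls back to 0.5, the protocol's default priority.
func parsePriority(v string) (float64, error) {
	if v == "" {
		return 0.5, nil
	}
	p, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return 0, err
	}
	// Written this way so that NaN is rejected as well.
	if !(p >= 0 && p <= 1) {
		return 0, fmt.Errorf("priority %q is outside the range 0.0 to 1.0", v)
	}
	return p, nil
}

func main() {
	// Hypothetical sample values.
	t, err := parseLastmod("2024-01-15")
	if err != nil {
		log.Fatal(err)
	}
	p, err := parsePriority("0.8")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(t.Format("2006-01-02"), p)
}
```

Keeping the fields as strings in the structs and converting them afterwards means that one malformed value does not make the whole document fail to parse.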
As with JSON, in Go you need to create structs that represent the structure of the XML document. The encoding/xml package then maps the document onto these structs. We define them as follows:
package main

import "encoding/xml"

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	XMLName    xml.Name `xml:"url"`
	Loc        string   `xml:"loc"`
	Lastmod    string   `xml:"lastmod"`
	Changefreq string   `xml:"changefreq"`
	Priority   string   `xml:"priority"`
}
The XMLName field, of type xml.Name, records the name of the XML element that the struct represents. As with JSON, tags are needed here to match each field of the struct to an element of the XML document. For example, the xml:"loc" tag matches the Loc field to the <loc> element in the document.
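To see the tag mapping in isolation, here is a minimal sketch that unmarshals an in-memory document instead of a downloaded one. The sample sitemap and the parseSitemap helper are assumptions made for illustration. The second url entry deliberately omits the optional elements to show that the corresponding fields are simply left empty.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	XMLName    xml.Name `xml:"url"`
	Loc        string   `xml:"loc"`
	Lastmod    string   `xml:"lastmod"`
	Changefreq string   `xml:"changefreq"`
	Priority   string   `xml:"priority"`
}

// sample is a hypothetical two-entry sitemap used only for this sketch.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://site.tld/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://site.tld/about</loc>
  </url>
</urlset>`

// parseSitemap is a hypothetical helper that wraps xml.Unmarshal.
func parseSitemap(data []byte) (Sitemap, error) {
	var s Sitemap
	err := xml.Unmarshal(data, &s)
	return s, err
}

func main() {
	s, err := parseSitemap([]byte(sample))
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range s.URLs {
		// %q makes empty fields visible as "".
		fmt.Printf("loc=%q lastmod=%q priority=%q\n", u.Loc, u.Lastmod, u.Priority)
	}
}
```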
We now want to fetch a sitemap via an HTTP GET request and parse it. We can write the following code:
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"net/http"
)

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	XMLName    xml.Name `xml:"url"`
	Loc        string   `xml:"loc"`
	Lastmod    string   `xml:"lastmod"`
	Changefreq string   `xml:"changefreq"`
	Priority   string   `xml:"priority"`
}

func main() {
	sitemapURL := "https://site.tld/sitemap.xml"

	res, err := http.Get(sitemapURL)
	if err != nil {
		log.Fatal(err)
	}
	// Register the close as soon as we know the response is valid.
	defer res.Body.Close()

	// Treat any status code above 299 as a failure.
	if res.StatusCode > 299 {
		log.Fatalf("unexpected status code: %d", res.StatusCode)
	}

	body, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Populate the structs from the XML document.
	var s Sitemap
	if err := xml.Unmarshal(body, &s); err != nil {
		log.Fatal(err)
	}

	for _, u := range s.URLs {
		fmt.Println(u.Loc)
	}
}
xml.Unmarshal accepts as input the bytes of the XML document retrieved via the GET request and uses the structs we defined earlier to populate the list of SitemapURL elements contained in the main struct Sitemap.
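As an alternative to reading the whole body into memory first, encoding/xml can also decode directly from an io.Reader through xml.NewDecoder. The sketch below is one possible variant, not a replacement for the code above. The decodeSitemap and decodeString helpers and the sample document are assumptions, and the in-memory reader stands in for res.Body so that the example runs without network access.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	XMLName    xml.Name `xml:"url"`
	Loc        string   `xml:"loc"`
	Lastmod    string   `xml:"lastmod"`
	Changefreq string   `xml:"changefreq"`
	Priority   string   `xml:"priority"`
}

// decodeSitemap (hypothetical helper) decodes a sitemap from any
// io.Reader, for example res.Body from an HTTP response.
func decodeSitemap(r io.Reader) (Sitemap, error) {
	var s Sitemap
	if err := xml.NewDecoder(r).Decode(&s); err != nil {
		return Sitemap{}, fmt.Errorf("decoding sitemap: %w", err)
	}
	return s, nil
}

// decodeString is a convenience wrapper for trying the decoder
// on in-memory input.
func decodeString(doc string) (Sitemap, error) {
	return decodeSitemap(strings.NewReader(doc))
}

// sample is a hypothetical sitemap used only for this sketch.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.tld/</loc></url>
  <url><loc>https://site.tld/contact</loc></url>
</urlset>`

func main() {
	s, err := decodeString(sample)
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range s.URLs {
		fmt.Println(u.Loc)
	}
}
```

In the HTTP version, the call would become decodeSitemap(res.Body), which removes the need for io.ReadAll.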
In conclusion, parsing a Google sitemap with Go turns out to be relatively simple and allows us to design, for example, command-line applications that take advantage of this feature.