Go: parsing a Google sitemap

In this article we will see how to parse a Google sitemap with Go.

A Google sitemap is an XML document whose root element is urlset. This element contains url elements, one per page of the website, and each url element describes its page through the following child elements:

  • loc: the URL of the web page
  • lastmod: the date of the last modification
  • changefreq: how frequently the page is likely to change
  • priority: a number representing the priority of the URL relative to the other URLs on the site.

As with JSON, in Go you create structs that mirror the structure of the XML document. The encoding/xml package then uses these structs to decode it:

package main

import "encoding/xml"

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	XMLName    xml.Name `xml:"url"`
	Loc        string   `xml:"loc"`
	Lastmod    string   `xml:"lastmod"`
	Changefreq string   `xml:"changefreq"`
	Priority   string   `xml:"priority"`
}

The XMLName field, of type xml.Name, ties a struct to the XML element it represents: its tag gives the element name that the struct must match. As with JSON, tags map each struct field to a part of the XML document. For example, the xml:"loc" tag maps the Loc field to the <loc> element, while xml:"xmlns,attr" maps the Xmlns field to the xmlns attribute of <urlset>.
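As a quick check of how the tags map elements to fields, the following sketch decodes a small sitemap held in a string. The sample document, the example.com URLs, and the helper name parseSample are illustrative assumptions, not part of the original program.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// Sitemap mirrors the <urlset> root element.
type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

// SitemapURL mirrors a single <url> entry.
type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod"`
	Changefreq string `xml:"changefreq"`
	Priority   string `xml:"priority"`
}

// sample is a hypothetical two-entry sitemap used only for illustration.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>`

// parseSample (hypothetical helper) decodes the sample document.
func parseSample() (Sitemap, error) {
	var s Sitemap
	err := xml.Unmarshal([]byte(sample), &s)
	return s, err
}

func main() {
	s, err := parseSample()
	if err != nil {
		log.Fatal(err)
	}
	// Elements missing from an entry leave the matching field empty.
	for _, u := range s.URLs {
		fmt.Printf("%s (lastmod=%q)\n", u.Loc, u.Lastmod)
	}
}
```

Note that the second entry has no lastmod, changefreq, or priority element, so those fields keep their zero value (the empty string) instead of causing an error.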

We now want to fetch a sitemap with an HTTP GET request and parse it. We can write the following code:

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"net/http"
)

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	Xmlns   string       `xml:"xmlns,attr"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	XMLName    xml.Name `xml:"url"`
	Loc        string   `xml:"loc"`
	Lastmod    string   `xml:"lastmod"`
	Changefreq string   `xml:"changefreq"`
	Priority   string   `xml:"priority"`
}

func main() {
	sitemapURL := "https://site.tld/sitemap.xml"

	// Fetch the sitemap.
	res, err := http.Get(sitemapURL)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	// Stop on any status code above the 2xx range before parsing.
	if res.StatusCode > 299 {
		log.Fatalf("unexpected status: %s", res.Status)
	}

	// Read the whole XML document into memory.
	body, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Decode the document into the structs defined above.
	var s Sitemap
	if err := xml.Unmarshal(body, &s); err != nil {
		log.Fatal(err)
	}

	// Print the location of every URL listed in the sitemap.
	for _, u := range s.URLs {
		fmt.Println(u.Loc)
	}
}

xml.Unmarshal takes the bytes of the XML document retrieved by the GET request and uses the structs defined earlier to populate the slice of SitemapURL values held in the main Sitemap struct. It returns an error if the document cannot be decoded into those structs.
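For large sitemaps, one alternative to buffering the raw bytes with io.ReadAll is to decode straight from a stream with xml.NewDecoder. The sketch below assumes two hypothetical helpers, decodeSitemap and locs, that are not part of the original program; in the program above, res.Body could be passed to decodeSitemap. A strings.Reader stands in for the HTTP response so the example runs offline.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

type Sitemap struct {
	XMLName xml.Name     `xml:"urlset"`
	URLs    []SitemapURL `xml:"url"`
}

type SitemapURL struct {
	Loc string `xml:"loc"`
}

// decodeSitemap (hypothetical helper) decodes a sitemap from any reader,
// such as an HTTP response body or an open file.
func decodeSitemap(r io.Reader) (Sitemap, error) {
	var s Sitemap
	if err := xml.NewDecoder(r).Decode(&s); err != nil {
		return Sitemap{}, fmt.Errorf("decoding sitemap: %w", err)
	}
	return s, nil
}

// locs (hypothetical helper) decodes a sitemap held in a string and
// returns its <loc> values.
func locs(doc string) ([]string, error) {
	s, err := decodeSitemap(strings.NewReader(doc))
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(s.URLs))
	for _, u := range s.URLs {
		out = append(out, u.Loc)
	}
	return out, nil
}

func main() {
	// Illustrative document; the URL is a placeholder.
	doc := `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
</urlset>`

	urls, err := locs(doc)
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Because the XMLName tag names urlset, a document with a different root element (an HTML error page, for instance) makes Decode return an error instead of silently producing an empty list.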

In conclusion, parsing a Google sitemap with Go is relatively simple, and it lets us build, for example, command-line applications that work with the URLs a site publishes.