Node.js: extract all internal links from a remote page

In this tutorial we will see how to extract links from an HTML page of a website with Node.js.

We will use the cheerio and got NPM modules for HTML parsing and HTTP requests, respectively. We will use got in version 11, because later versions drop the traditional CommonJS way of including modules (require) in favor of ES modules only, and for this reason they are not compatible with all Node.js installations.
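To make the version constraint explicit, you can install the modules with npm install got@11 cheerio. As a quick reminder of the difference, got 11 can still be loaded with require, while got 12 and later are published as ES modules only:

// got 11.x can be loaded with the traditional CommonJS syntax
const got = require('got');
const cheerio = require('cheerio');

// got 12 and later are ESM-only, so they would have to be
// imported like this instead (in a project configured for ES modules):
// import got from 'got';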

The task basically consists of identifying all a elements whose href attribute refers to internal pages of the site chosen as the base URL. We must also make sure that the returned array does not contain duplicate URLs; since URLs are primitive values (strings), we can remove duplicates by combining the spread operator with the Set object.
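As a quick illustration of this deduplication technique (the sample URLs below are made up):

const urls = [
    'https://example.com/about/',
    'https://example.com/blog/',
    'https://example.com/about/'
];

// A Set only stores unique values, so spreading it back into
// an array removes the duplicate strings
const unique = [...new Set(urls)];

console.log(unique);
// [ 'https://example.com/about/', 'https://example.com/blog/' ]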

Our implementation could be as follows:

'use strict';

const got = require('got');
const cheerio = require('cheerio');

class Crawler {
    constructor(url) {
        // Normalize the base URL so that it always ends with a trailing slash
        this.url = url.endsWith('/') ? url : url + '/';
        this.links = [];
    }

    async fetch() {
        try {
            await this.getLinks();
            // Remove duplicate URLs: strings are primitive values,
            // so a Set combined with the spread operator is enough
            this.links = [...new Set(this.links)];
        } catch(err) {
            throw err;
        }
    }

    async getLinks() {
        try {
            const response = await this.getURL();
            const $ = cheerio.load(response);
            // Select only the anchors whose href starts with the base URL,
            // i.e. the internal links of the site
            const links = $('a[href^="' + this.url + '"]');
            const hrefs = this.parseLinks($, links);

            if(hrefs.length > 0) {
                hrefs.forEach(href => {
                    this.links.push(href);
                });
            }
        } catch(err) {
            throw new Error('Unable to get links.');
        }
    }

    async getURL() {
        try {
            // Fetch the remote page and return its HTML body
            const response = await got(this.url);
            return response.body;
        } catch(err) {
            throw err;
        }
    }

    parseLinks($dom, $links) {
        const hrefs = [];

        // Collect the value of the href attribute of each matched element
        $links.each(function() {
            hrefs.push($dom(this).attr('href'));
        });

        return hrefs;
    }
}

(async () => {
    try {
        const crawler = new Crawler('https://gabrieleromanato.com');
        await crawler.fetch();

        console.log(crawler.links);

    } catch(err) {
        console.log(err);
    }
})();

The important thing to remember is that error handling is a must: newer versions of Node.js immediately terminate the running process if a Promise rejection is not handled.
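If you want an additional safety net for rejections that escape your try/catch blocks, you can also register a global handler. The following is only a minimal sketch, and the logging and exit logic are illustrative assumptions:

// Catch Promise rejections that were not handled anywhere else,
// log them and terminate with a non-zero exit code (illustrative only)
process.on('unhandledRejection', (reason) => {
    console.error('Unhandled rejection:', reason);
    process.exit(1);
});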