Python can be a great help when building web apps, especially for retrieving the sample images needed during development.
Suppose we want to offer our customers a prototype that is as close as possible to the final result, and to do so we need many sample images.
Downloading images is a task that can be automated with Python by following a precise procedure:
- We find a page of a website that offers copyright-free images. In this tutorial, for purely educational purposes, we will avoid using the API of a service like Unsplash.
- We examine the HTML source of the page and check how the images are inserted. Many sites show a preview of each image on the summary page and the actual image on a detail page.
- We fetch the main page with the requests module and extract the URLs of the detail pages with the BeautifulSoup module.
- After studying the source code of a detail page, we make a GET request to each URL and extract the value of the src attribute of the main image, i.e. the URL of the image we want to download.
- We then download the main image, preserving its original name.
The first function we're going to define is the one that extracts the filename contained in a URL.
import os
from urllib.parse import urlsplit

def get_filename_from_url(url=None):
    if url is None:
        return None
    urlpath = urlsplit(url).path
    return os.path.basename(urlpath)
The urlsplit() function splits an absolute URL into its components; its path attribute is then passed to the os.path.basename() function to get just the filename, including the extension.
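As a quick sanity check, the function can be tried on a sample URL (site.tld is just a placeholder domain; the snippet repeats the definition so it is self-contained):

```python
import os
from urllib.parse import urlsplit

def get_filename_from_url(url=None):
    if url is None:
        return None
    # urlsplit() separates scheme, netloc, path, query and fragment
    urlpath = urlsplit(url).path
    return os.path.basename(urlpath)

# The query string is not part of the path,
# so it does not end up in the filename.
print(get_filename_from_url('https://site.tld/images/photo-01.jpg?w=800'))  # photo-01.jpg
```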
Now we can create the function that downloads a web page and returns its HTML code as a string.
import requests

def get_page(url):
    try:
        r = requests.get(url, allow_redirects=True)
        # Treat HTTP error codes (4xx/5xx) as failures as well
        r.raise_for_status()
        return r.text
    except requests.exceptions.RequestException:
        return ''
If the HTTP request fails, an empty string is returned. Now we can define the function that downloads the image and saves it to the filesystem.
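A minimal sketch of how the error handling behaves: a request to an unreachable host (the .invalid TLD is reserved and never resolves) raises a RequestException, so the function returns an empty string. The definition is repeated here to keep the snippet self-contained:

```python
import requests

def get_page(url):
    try:
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()
        return r.text
    except requests.exceptions.RequestException:
        return ''

# DNS resolution fails for the reserved .invalid TLD,
# so get_page() falls into the except branch.
print(repr(get_page('http://nonexistent.invalid/')))  # ''
```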
import requests

def download_image(url, destination_path):
    if url is None:
        return False
    try:
        res = requests.get(url, allow_redirects=True)
        res.raise_for_status()
        data = res.content
        # Images are binary data, so the file is opened in 'wb' mode
        with open(destination_path, 'wb') as img:
            img.write(data)
        return True
    except requests.exceptions.RequestException:
        return False
The content property of the response holds the raw bytes, which are saved to the filesystem in binary write mode, since these are images. The function returns a boolean value indicating the outcome of the operation.
Now we need to extract the page elements using their HTML attributes.
from bs4 import BeautifulSoup

def get_page_elements(html, attrs):
    soup = BeautifulSoup(html, 'html.parser')
    elements = soup.find_all(attrs=attrs)
    return elements
Since we are searching for multiple elements in the document, we use the find_all() method, passing the attrs dictionary as a parameter; it lets us specify the attribute names and values to match.
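To see the function in action without fetching a real page, we can feed it a tiny HTML fragment mimicking a summary page (the class names 'detail' and 'pic' are just example values):

```python
from bs4 import BeautifulSoup

def get_page_elements(html, attrs):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all(attrs=attrs)

# A small HTML fragment standing in for a downloaded summary page
html = '''
<a class="detail" href="https://site.tld/images/1/">first</a>
<a class="detail" href="https://site.tld/images/2/">second</a>
<img class="pic" src="https://site.tld/images/photo-01.jpg">
'''

# Only the elements carrying class="detail" are returned
links = get_page_elements(html, {'class': 'detail'})
print([a.get('href') for a in links])
# ['https://site.tld/images/1/', 'https://site.tld/images/2/']
```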
We can then define the main function:
from urllib.parse import urljoin

def download_images(start_url, link_class, img_class):
    start_page = get_page(start_url)
    if not start_page:
        return []
    detail_links = get_page_elements(start_page, {'class': link_class})
    downloaded_images = []
    for dlink in detail_links:
        href = dlink.get('href')
        if not href:
            continue
        # urljoin() resolves relative links against the start page URL
        dpage_url = urljoin(start_url, href)
        lnk_page = get_page(dpage_url)
        if not lnk_page:
            continue
        imgs = get_page_elements(lnk_page, {'class': img_class})
        for img in imgs:
            src = img.get('src')
            if not src:
                continue
            # Resolve relative src values against the detail page URL
            src = urljoin(dpage_url, src)
            name = get_filename_from_url(src)
            # The ./static directory must already exist
            if download_image(src, f'./static/{name}'):
                downloaded_images.append(name)
    return downloaded_images
This function returns a list containing the names of the downloaded images or an empty list in case of an error. We can use this function as follows:
def main():
    images = download_images('https://site.tld/images/', 'detail', 'pic')
    print(images)

if __name__ == '__main__':
    main()
In conclusion, Python proves to be an excellent tool for speeding up the acquisition of the assets we need to build prototypes of our web apps.