Python can be a great help when building web apps, especially for retrieving the sample images needed during development.
Suppose we want to offer our customers a prototype that is as close as possible to the final result, and to do so we need many sample images.
Downloading images is a task that can be automated with Python by following a precise procedure:
- We find a page of a website that offers copyright-free images. In this tutorial, for purely educational purposes, we will avoid using the API of a service like Unsplash.
- We examine the HTML source of the page and check how the images are inserted. Many sites show a preview of each image on the summary page and the actual image on a detail page.
- We fetch the main page with the requests module and extract the URLs of the detail pages with the BeautifulSoup module.
- After studying the source code of a detail page, we make a GET request to each URL and extract the value of the src attribute of the main image, i.e. the URL of the image we want to download.
- We then download the main image, preserving its original name.
The first function we're going to define is the one that extracts the filename contained in a URL.
import os
from urllib.parse import urlsplit

def get_filename_from_url(url=None):
    if url is None:
        return None
    urlpath = urlsplit(url).path
    return os.path.basename(urlpath)
The urlsplit() function splits an absolute URL into its components; its path attribute is then passed to the os.path.basename() function to get just the filename, including the extension.
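As a quick sanity check, the function can be tried on a sample URL (site.tld is just a placeholder domain; the snippet repeats the definition so it is self-contained):

```python
import os
from urllib.parse import urlsplit

def get_filename_from_url(url=None):
    if url is None:
        return None
    # urlsplit() separates scheme, netloc, path, query and fragment
    urlpath = urlsplit(url).path
    return os.path.basename(urlpath)

# The query string is not part of the path,
# so it does not end up in the filename.
print(get_filename_from_url('https://site.tld/images/photo-01.jpg?w=800'))  # photo-01.jpg
```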
Now we can create the function that downloads a web page and returns its HTML code as a string.
import requests

def get_page(url):
    try:
        r = requests.get(url, allow_redirects=True)
        # Treat HTTP error codes (4xx/5xx) as failures as well
        r.raise_for_status()
        return r.text
    except requests.exceptions.RequestException:
        return ''
If the HTTP request fails, an empty string is returned. Now we can define the function that downloads the image and saves it to the filesystem.
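A minimal sketch of how the error handling behaves: a request to an unreachable host (the .invalid TLD is reserved and never resolves) raises a RequestException, so the function returns an empty string. The definition is repeated here to keep the snippet self-contained:

```python
import requests

def get_page(url):
    try:
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()
        return r.text
    except requests.exceptions.RequestException:
        return ''

# DNS resolution fails for the reserved .invalid TLD,
# so get_page() falls into the except branch.
print(repr(get_page('http://nonexistent.invalid/')))  # ''
```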
import requests

def download_image(url, destination_path):
    if url is None:
        return False
    try:
        res = requests.get(url, allow_redirects=True)
        res.raise_for_status()
        data = res.content
        # Images are binary data, so the file is opened in 'wb' mode
        with open(destination_path, 'wb') as img:
            img.write(data)
        return True
    except requests.exceptions.RequestException:
        return False
The content property of the response holds the raw bytes, which are saved to the filesystem in binary write mode, since these are images. The function returns a boolean value indicating the outcome of the operation.
Now we need to extract the page elements using their HTML attributes.
from bs4 import BeautifulSoup

def get_page_elements(html, attrs):
    soup = BeautifulSoup(html, 'html.parser')
    elements = soup.find_all(attrs=attrs)
    return elements
Since we are searching for multiple elements in the document, we use the find_all() method, passing the attrs dictionary as a parameter; it lets us specify the attribute names and values to match.
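To see the function in action without fetching a real page, we can feed it a tiny HTML fragment mimicking a summary page (the class names 'detail' and 'pic' are just example values):

```python
from bs4 import BeautifulSoup

def get_page_elements(html, attrs):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all(attrs=attrs)

# A small HTML fragment standing in for a downloaded summary page
html = '''
<a class="detail" href="https://site.tld/images/1/">first</a>
<a class="detail" href="https://site.tld/images/2/">second</a>
<img class="pic" src="https://site.tld/images/photo-01.jpg">
'''

# Only the elements carrying class="detail" are returned
links = get_page_elements(html, {'class': 'detail'})
print([a.get('href') for a in links])
# ['https://site.tld/images/1/', 'https://site.tld/images/2/']
```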
We can then define the main function:
from urllib.parse import urljoin

def download_images(start_url, link_class, img_class):
    start_page = get_page(start_url)
    if not start_page:
        return []
    detail_links = get_page_elements(start_page, {'class': link_class})
    downloaded_images = []
    for dlink in detail_links:
        href = dlink.get('href')
        if not href:
            continue
        # urljoin() resolves relative links against the start page URL
        dpage_url = urljoin(start_url, href)
        lnk_page = get_page(dpage_url)
        if not lnk_page:
            continue
        imgs = get_page_elements(lnk_page, {'class': img_class})
        for img in imgs:
            src = img.get('src')
            if not src:
                continue
            # Resolve relative src values against the detail page URL
            src = urljoin(dpage_url, src)
            name = get_filename_from_url(src)
            # The ./static directory must already exist
            if download_image(src, f'./static/{name}'):
                downloaded_images.append(name)
    return downloaded_images
This function returns a list containing the names of the downloaded images or an empty list in case of an error. We can use this function as follows:
def main():
    images = download_images('https://site.tld/images/', 'detail', 'pic')
    print(images)

if __name__ == '__main__':
    main()
In conclusion, Python proves to be an excellent tool for speeding up the acquisition of the assets we need to build prototypes of our web apps.