In this article we will see how to extract the URLs of all the links in a web page with Python, using the BeautifulSoup module.
It's a matter of finding all the <a> elements and getting the value of their href attribute.
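    from bs4 import BeautifulSoup

    def get_page_urls(html):
        # Parse the document with Python's built-in HTML parser
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a')
        urls = []
        for link in links:
            # Fall back to an empty string when the element has no href
            href = link.get('href', '')
            # Keep only absolute http(s) URLs
            if href.startswith('http'):
                urls.append(href)
        return urls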
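Our function returns a list of the URLs of the links found. Because of the startswith('http') check, only absolute http and https URLs are kept; relative links such as /about and other schemes such as mailto: are skipped.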
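To see the function in action, here is a small sketch that runs it on a hand-written HTML snippet. The sample markup, the `BASE_URL` value and the `get_page_urls_resolved` helper are assumptions made for illustration and are not part of the original function. The helper shows one possible way to also collect relative links, by resolving them against the page address with `urljoin` from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def get_page_urls(html):
    """The article's function: keep only hrefs that start with 'http'."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        href = link.get('href', '')
        if href.startswith('http'):
            urls.append(href)
    return urls


def get_page_urls_resolved(html, base_url):
    """Hypothetical variant: resolve relative hrefs against base_url first."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    # href=True restricts the search to <a> elements that have an href attribute
    for link in soup.find_all('a', href=True):
        absolute = urljoin(base_url, link['href'])
        # Still drop non-web schemes such as mailto:
        if absolute.startswith('http'):
            urls.append(absolute)
    return urls


# Made-up sample page, used only for this demonstration
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/docs">Docs</a>
  <a href="/about">About</a>
  <a href="mailto:team@example.com">Mail</a>
  <a name="top">No href</a>
</body></html>
"""
BASE_URL = "https://example.com/blog/post.html"

print(get_page_urls(SAMPLE_HTML))
# ['https://example.com/docs']
print(get_page_urls_resolved(SAMPLE_HTML, BASE_URL))
# ['https://example.com/docs', 'https://example.com/about']
```

Which version fits depends on the goal: the original is enough when only external, fully qualified links matter, while the resolved variant may be more useful when the page's internal links should be listed too.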