These notes are inspired by the book Web Scraping with Python by Ryan Mitchell and reflect my own understanding as I studied it. DO NOT RUN THIS CODE DIRECTLY IN A PRODUCTION ENVIRONMENT!

Opening a URL

example:

from urllib.request import urlopen

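# input() already returns a str, so no extra conversion is needed
html = urlopen(input('Please input the URL: '))
# read() returns the raw response body as bytes
print(html.read())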
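
Because read() returns bytes, the output above is printed as a b'...' literal. Here is a minimal sketch of decoding the body into text. It assumes the server declares its charset in the Content-Type header and falls back to UTF-8 otherwise. The helper name fetch_text and the data: URL used for the offline demo are my own illustrations, not taken from the book.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url):
    """Return the response body as str (hypothetical helper)."""
    # the with-block closes the response once the body has been read
    with urlopen(url) as response:
        raw = response.read()  # bytes
        # charset declared in the Content-Type header, or None if absent
        charset = response.headers.get_content_charset()
    # assumption: fall back to UTF-8 when no charset is declared
    return raw.decode(charset or 'utf-8', errors='replace')


# offline demo: urllib also understands data: URLs, so no network is needed
demo_url = 'data:text/html;charset=utf-8,' + quote('<h1>Hello</h1>')
print(fetch_text(demo_url))
```

For a real page, pass an http:// or https:// URL instead of demo_url.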

Running BeautifulSoup

Pass 'html.parser' as the second argument to select Python's built-in parser, then use bs.tagName (for example bs.h1) to get the first occurrence of that tag.

example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

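html = urlopen(input('Please input the URL: '))
# parse the raw bytes with Python's built-in HTML parser
bs = BeautifulSoup(html.read(), 'html.parser')
# bs.h1 is the first <h1> tag on the page, or None if there is none
print(bs.h1)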
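
To try tag access without a network connection, BeautifulSoup can also parse a plain string. The sketch below uses a made-up page (the variable page and its contents are only for illustration) to show that bs.h1 returns just the first match, and that the same tag can be reached through its parents or with find().

```python
from bs4 import BeautifulSoup

# hypothetical page used only for illustration
page = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>First heading</h1>
    <h1>Second heading</h1>
  </body>
</html>
"""

bs = BeautifulSoup(page, 'html.parser')

# bs.h1 only returns the first <h1>, even though the page has two
print(bs.h1)
print(bs.h1.get_text())

# the same tag can be reached through its parents or with find()
print(bs.html.body.h1 == bs.h1)
print(bs.find('h1') == bs.h1)

# find_all() returns every match instead of just the first one
print(len(bs.find_all('h1')))
```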

Other Parsers

lxml: faster and more forgiving than 'html.parser', but it must be installed separately and depends on third-party C libraries.

html5lib: more forgiving than 'html.parser', but slower than both lxml and 'html.parser'. It also has to be installed separately.
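
Since lxml and html5lib are optional installs, a script can try them in order of preference and fall back to the built-in parser. This is only a sketch: the helper make_soup and its default preference order are my own assumptions. BeautifulSoup raises bs4.FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first available parser (hypothetical helper).

    Returns a (soup, parser_name) tuple.
    """
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # this parser is not installed, so try the next one
            continue
    raise RuntimeError('none of the requested parsers is available')


soup, used = make_soup('<h1>Hello</h1>')
print(used)
print(soup.h1.get_text())
```

'html.parser' ships with Python, so the last entry in the default tuple should always be available.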

Handling Exceptions

  1. If you try to access a tag that does not exist, BeautifulSoup returns None. Accessing a child tag on that None then raises an AttributeError.
  2. Wrap the calls in a try/except block.
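
The two cases in the list above can be reproduced offline with a small string. In this sketch the page contents and the helper first_link_in_nav are made up for illustration.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<html><body><h1>Hi</h1></body></html>', 'html.parser')

# case 1: a tag that does not exist simply comes back as None
print(bs.nav)

# case 2: asking None for a child tag raises AttributeError
try:
    bs.nav.a
except AttributeError as error:
    print('AttributeError:', error)


def first_link_in_nav(soup):
    """Return the first <a> inside <nav>, or None (hypothetical helper)."""
    nav = soup.nav
    # checking for None explicitly avoids the exception altogether
    if nav is None:
        return None
    return nav.a


print(first_link_in_nav(bs))
```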

example:

from urllib.error import HTTPError
from urllib.error import URLError
from urllib.request import urlopen
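try:
    html = urlopen(input('Please input the URL: '))
except HTTPError as e:
    # the server responded with an error status such as 404 or 500;
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e)
except URLError:
    print('The server could not be found')
except AttributeError:
    # only relevant once the try block also accesses tags on a parsed page
    print('Tag was not found')
else:
    print('It works')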
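
One more case is worth knowing when the URL comes from input(): a string without a scheme, such as example.com, makes urlopen raise ValueError rather than URLError. The sketch below wraps the checks in a helper so they can be tried without a network connection. The name check_url, the returned strings, and the foo:// test scheme are my own illustrations.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import urlopen


def check_url(url):
    """Return a short status string for url (hypothetical helper)."""
    try:
        with urlopen(url) as response:
            response.read()
    except HTTPError as error:
        # must come before URLError because HTTPError is its subclass
        return f'HTTP error {error.code}'
    except URLError as error:
        # server unreachable, or a scheme urllib does not know how to open
        return f'URL error: {error.reason}'
    except ValueError:
        # raised when the string has no scheme, e.g. "example.com"
        return 'invalid URL'
    return 'ok'


print(issubclass(HTTPError, URLError))
print(check_url('example.com'))
print(check_url('foo://example.com'))
print(check_url('data:text/plain,' + quote('hello')))
```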

Wrap the previous code in a function and handle all of the exceptions above:

from urllib.error import HTTPError
from urllib.error import URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError:
        print('Could not find the requested content on the server')
        return None
    except URLError:
        print('Could not find the server')
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        # deliberately chain two non-existent tags to trigger the exception
        title = bs.bjkgkhg.sggyfr
    except AttributeError:
        print('You chose a wrong or non-existent tag')
        return None
    return title


title = get_title(input('Please input the URL: '))
if title is None:
    print('\nTitle cannot be found')
else:
    print(title)
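
The function above uses a made-up tag on purpose, so it always ends in the AttributeError branch. A possible variant that looks up the real <title> tag is sketched below. The name get_page_title, the explicit None check in place of try/except, and the data: URL used for the offline demo are my own choices, not the book's.

```python
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_page_title(url):
    """Return the text of the <title> tag, or None (hypothetical variant)."""
    try:
        html = urlopen(url)
    except (URLError, ValueError):
        # URLError also covers HTTPError; ValueError means a malformed URL
        return None
    bs = BeautifulSoup(html.read(), 'html.parser')
    # bs.title is None when the page has no <title> tag
    if bs.title is None:
        return None
    return bs.title.get_text(strip=True)


# offline demo with a data: URL so that no network is needed
page = '<html><head><title>Demo</title></head><body></body></html>'
print(get_page_title('data:text/html;charset=utf-8,' + quote(page)))
```

Returning None for every failure keeps the calling code identical to the if/else check used above.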