These notes are inspired by the book Web Scraping with Python by Ryan Mitchell and reflect my own understanding as I studied it. DO NOT USE THIS CODE DIRECTLY IN A PRODUCTION ENVIRONMENT!

Some Useful Functions

find_all(name, attrs, recursive, text, limit, **kwargs): traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. RETURNS A LIST.

tag.get_text(): strips the tags from the content and returns only the text as a Unicode string, with no hyperlinks, paragraphs, or other tags.

find(name, attrs, recursive, text, **kwargs): instead of finding all the matching objects, it finds only the first one. It's like imposing a limit of 1 on the result set and then extracting the single result from the list. If nothing matches, it returns None.
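To make the difference between the three functions concrete, here is a minimal sketch. The inline HTML string is made up for illustration (it is not from the book), so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML used only for this illustration.
html = '<div><p class="a">one</p><p class="a">two <b>bold</b></p></div>'
bs = BeautifulSoup(html, 'html.parser')

all_paragraphs = bs.find_all('p')   # list-like result holding every match
first_paragraph = bs.find('p')      # only the first match

print(len(all_paragraphs))            # 2
print(first_paragraph.get_text())     # one
print(all_paragraphs[1].get_text())   # two bold (the <b> tag itself is stripped)
print(bs.find('table'))               # None, because nothing matches
```

Because find returns None when nothing matches, it is worth checking the result before calling get_text() on it.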

Arguments

See the Beautiful Soup documentation for full details.

name: restricts the set of tags by name. It accepts a tag name, a regular expression, a list of names, True (which matches every tag), or a callable object.

attrs: a dictionary of attributes and the values a tag must have in order to match.

recursive: boolean value. If True, it looks into children, and children's children, for tags that match your parameters. If False, it looks only at the direct children of the tag.

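text: matches on the text content of tags instead of on the tags' attributes, and returns the matching strings in a list. Since Beautiful Soup 4.4.0 this argument is called string.

limit: retrieves only the first x matches.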

keyword: selects tags that have a particular attribute or set of attributes. The same filtering can be done with attrs or with a regular expression.
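The arguments above can be tried on a small inline document. This is only a sketch: the HTML is made up for illustration, and it uses string, the newer name of the text argument (Beautiful Soup 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML used only for this illustration.
html = ('<html><body><h1>Title</h1>'
        '<div><h2>Sub</h2><p>first</p></div>'
        '<p>second</p><p>third</p></body></html>')
bs = BeautifulSoup(html, 'html.parser')

# name: a list of tag names, or a regular expression matched against names
by_list = bs.find_all(['h1', 'h2'])
by_regex = bs.find_all(re.compile('^h[1-6]$'))
print([tag.name for tag in by_list])    # ['h1', 'h2']
print([tag.name for tag in by_regex])   # ['h1', 'h2']

# recursive=False: only direct children of body, so the p inside div is skipped
top_level_p = bs.body.find_all('p', recursive=False)
print([tag.get_text() for tag in top_level_p])   # ['second', 'third']

# limit: stop after the first two matches
first_two_p = bs.find_all('p', limit=2)
print([tag.get_text() for tag in first_two_p])   # ['first', 'second']

# string (called text in older versions): match on text content, get strings back
matching_text = bs.find_all(string='second')
print(matching_text)                             # ['second']
```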

To filter on the class attribute with a keyword argument, use class_ instead, or pass 'class' in quotes inside a dictionary (as the example below does):

HOW TO USE KEYWORD ARGUMENTS (TWO WAYS TO EXPRESS THE SAME FILTER):

  1. bs.find_all(id='text')
  2. bs.find_all(attrs={'id': 'text'})

To filter on class the first way, use class_ (with a trailing underscore), since class is a reserved word in Python. The second way takes 'class' as a quoted dictionary key.

example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

nameList = bs.find_all('span', {'class':'green'})
for name in nameList:
    print(name.get_text())
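The keyword form and the dictionary form should select the same tags. The sketch below checks that on a made-up inline snippet (the HTML and the variable names are assumptions for illustration, not taken from the book's page).

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML used only for this illustration.
html = ('<p><span class="green" id="text">Anna</span>'
        '<span class="red">said</span>'
        '<span class="green">Pierre</span></p>')
bs = BeautifulSoup(html, 'html.parser')

# Filtering on id: keyword argument versus attrs dictionary
by_keyword = bs.find_all(id='text')
by_attrs = bs.find_all(attrs={'id': 'text'})
print(by_keyword == by_attrs)                          # True

# Filtering on class: class_ keyword versus quoted 'class' key
by_class_keyword = bs.find_all('span', class_='green')
by_class_dict = bs.find_all('span', {'class': 'green'})
print([tag.get_text() for tag in by_class_keyword])    # ['Anna', 'Pierre']
print([tag.get_text() for tag in by_class_dict])       # ['Anna', 'Pierre']
```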

BeautifulSoup Objects

Comment: represents an HTML comment

<!--like this one-->

NavigableString: represents the text within tags rather than the tags themselves
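A small sketch of how these two object types show up while walking a tree, assuming a made-up one-line document. Comment is a subclass of NavigableString, so the Comment check has to come first.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical inline HTML used only for this illustration.
html = '<p>visible text<!--like this one--></p>'
bs = BeautifulSoup(html, 'html.parser')

kinds = []
for node in bs.p.children:
    if isinstance(node, Comment):            # check the subclass first
        kinds.append(('Comment', str(node)))
    elif isinstance(node, NavigableString):  # plain text inside the tag
        kinds.append(('NavigableString', str(node)))

print(kinds)
# [('NavigableString', 'visible text'), ('Comment', 'like this one')]
```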

Some Terms

children, descendants

example:
bs.body.h1 selects the first h1 tag that is a descendant of the body tag.

Similarly, bs.div.find_all('img') finds the first div tag and then retrieves a list of all img tags that are descendants of that div.

If you want only the direct children, use the .children attribute.

example:

  1. Print only the direct children, each printed once together with everything it contains (use .children)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

for child in bs.find('table', {'id': 'giftList'}).children:
    print(child)

  2. Print every level separately, children and then their own children (use .descendants)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

for child in bs.find('table', {'id': 'giftList'}).descendants:
    print(child)
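To see how the two loops above differ without a network request, here is a sketch on a cut-down table. The HTML is a made-up stand-in for the real giftList table, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the giftList table.
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Vegetable Basket</td><td>15.00</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

children = list(table.children)        # only the two tr tags
descendants = list(table.descendants)  # every tag and string at every depth

print(len(children))      # 2
print(len(descendants))   # 10: 2 tr + 2 th + 2 td + 4 text strings
```

Every child is also a descendant, but not the other way around, which is why the .descendants loop prints the same content several times at different levels.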

  3. Print only the siblings of an object; an object is never its own sibling, so the first row itself is skipped (use .next_siblings)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)

NOTE FOR 3: besides next_siblings there are also previous_siblings, next_sibling, and previous_sibling. The plural forms iterate over all siblings; the singular forms return a single node.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

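# next_sibling and previous_sibling each return a single node, so this chain
# steps forward one node and back again, landing on the first row itself.
first_row = bs.find('table', {'id': 'giftList'}).tr.next_sibling.previous_sibling
# Iterating over a Tag yields its children, not its siblings.
for child in first_row:
    print(child)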
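The four sibling attributes can be compared side by side on a tiny table. This is a sketch: the rows and their id values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML used only for this illustration.
html = ('<table id="giftList">'
        '<tr id="header"><th>Item</th></tr>'
        '<tr id="gift1"><td>Basket</td></tr>'
        '<tr id="gift2"><td>Doll</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})
first_row = table.tr
last_row = table.find('tr', {'id': 'gift2'})

# Plural forms iterate over every sibling in that direction.
after_first = [row['id'] for row in first_row.next_siblings]
before_last = [row['id'] for row in last_row.previous_siblings]
print(after_first)    # ['gift1', 'gift2']
print(before_last)    # ['gift1', 'header'] (nearest sibling first)

# Singular forms return one node, or None at the edge.
print(first_row.next_sibling['id'])    # gift1
print(first_row.previous_sibling)      # None
```

On real pages there is often whitespace between tags, and that whitespace is a sibling too, so the loop over next_siblings may also yield text nodes.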

  4. Print the parent using .parent
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

print(bs.find('img',{'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())

WHY DOES IT RETURN 15.00?

The relevant HTML is as follows:

<td>
    15.00
</td>
<td>
    <img src="../img/gifts/img1.jpg">
</td>

The parent of the img tag is its td, the previous sibling of that td is the preceding td, and the text inside that td is 15.00.
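The same walk can be reproduced offline. In this sketch the HTML is a made-up reduction of the fragment above, written in two variants: one with the cells directly adjacent and one with line breaks between them. With line breaks, the whitespace between the cells is itself a sibling, so find_previous_sibling('td') is used there to skip over it.

```python
from bs4 import BeautifulSoup

# Hypothetical reductions of the fragment above, for illustration only.
compact = ('<table><tr><td>15.00</td>'
           '<td><img src="../img/gifts/img1.jpg"></td></tr></table>')
spaced = ('<table><tr><td>\n15.00\n</td>\n'
          '<td>\n<img src="../img/gifts/img1.jpg">\n</td></tr></table>')

# Adjacent cells: previous_sibling of the image's td is the price td.
img = BeautifulSoup(compact, 'html.parser').find('img')
price = img.parent.previous_sibling.get_text()
print(price)                                  # 15.00

# Cells separated by whitespace: previous_sibling is the '\n' text node,
# so ask explicitly for the previous td instead.
img_spaced = BeautifulSoup(spaced, 'html.parser').find('img')
print(repr(img_spaced.parent.previous_sibling))   # '\n'
price_spaced = img_spaced.parent.find_previous_sibling('td').get_text().strip()
print(price_spaced)                           # 15.00
```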

The cover photo was created by Natosha Benning on Unsplash.