Web scraping

Extracting specific content from a web page is called "web scraping". Web pages can be quite complicated to read and understand in their raw HTML form.

To understand the structure of a web page, we can use a tool such as the web browser's inspect tool.

In Firefox or Chrome, you can right-click any item of interest and choose Inspect (Right Click + Q). The item will be highlighted on the page, and the source code for the selected item will be highlighted in a source-code view (DOM and Style Inspector: Ctrl+Shift+C). By reading the corresponding source code, you can identify the tags, classes, or IDs associated with the item you are interested in.

BeautifulSoup4

To navigate and search an HTML document using Python, you can use a library called BeautifulSoup. You can read about the library and its functions here:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

BeautifulSoup takes in an HTML web page and provides Python methods to navigate and access the individual tags within it.

Let us try some example uses:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# The BeautifulSoup object represents the document as a nested data structure.
# The prettify() method formats it so that it is easy to read.
# print(soup.prettify())

# Here are some ways to navigate the data structure.

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.title.name)
# title

print(soup.title.string)
# The Dormouse's story

print(soup.title.parent.name)
# head

print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

print(soup.p['class'])
# ['title']  (class is multi-valued, so it comes back as a list)

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

links = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
for link in links:
    print(link)

print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Let's list the children of the <p> tag's parent, i.e. the <p> tag and its siblings
children = soup.p.parent.children
for child in children:
    print('Child:', child)

# Or you can move sideways; the first next_sibling is the whitespace between the tags
print('Sibling:', soup.p.next_sibling.next_sibling)

# For many other methods, please visit:
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Output:

`<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Child:

Child: <p class="title"><b>The Dormouse's story</b></p>
Child:


Child: <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Child:


Child: <p class="story">...</p>
Child:

Sibling: <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>`

There are many more techniques which you can use to locate any specific content easily.
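As a rough illustration of a few of those techniques, the sketch below filters tags by class, reads attribute values, and uses CSS selectors. The `sample_html` string is a shortened stand-in written for this example, not a real page; `find_all(..., class_=...)`, `get()`, `get_text()` and `select_one()` are BeautifulSoup methods described in the documentation linked above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the Dormouse example for illustration.
sample_html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Filter by CSS class; `class_` avoids clashing with the Python keyword `class`.
sisters = soup.find_all("a", class_="sister")
names = [a.get_text() for a in sisters]

# Read an attribute; .get() returns None instead of raising if it is missing.
urls = [a.get("href") for a in sisters]

# CSS selectors, written the same way as in a stylesheet.
first_link = soup.select_one("p.story > a#link1")
title_text = soup.select_one("p.title b").get_text()

print(names)
print(urls)
print(first_link["href"])
print(title_text)
```

The class name or ID you pass to these calls is typically the one you spotted with the browser's inspect tool.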


Exercise: Extract the title and the link of each Google search result.
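One possible starting point for the exercise is sketched below. It does not fetch or parse a real Google page: `results_html`, the `result` class, and the `extract_results` helper are all hypothetical, chosen only to show the pattern of looping over result blocks and pulling out a heading and a link. For the real page, you would use the inspect tool to find the actual tags and class names and substitute them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a saved results page; real class names differ.
results_html = """
<div class="result"><a href="https://example.com/one"><h3>First result</h3></a></div>
<div class="result"><a href="https://example.com/two"><h3>Second result</h3></a></div>
<div class="ad"><a href="https://example.com/ad"><h3>Sponsored</h3></a></div>
"""


def extract_results(html):
    """Return a list of (title, link) pairs, one per result block."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for block in soup.find_all("div", class_="result"):
        anchor = block.find("a", href=True)  # first link that has an href
        heading = block.find("h3")           # assumed to hold the title
        if anchor is None or heading is None:
            continue  # skip blocks that do not match the expected shape
        pairs.append((heading.get_text(strip=True), anchor["href"]))
    return pairs


for title, link in extract_results(results_html):
    print(title, "->", link)
```

Keeping the parsing in a function that takes an HTML string means you can test it on a saved copy of the page before pointing it at live content.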