HTML Traversal

Da-Wei Chiang

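Outline

  • Basic child traversal
  • Child (downward) traversal
  • Parent (upward) traversal
  • Sibling (sideways) traversal
  • Traversing the previous and next elements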

Basic Child Traversal

parent_tag.child_tag.child_tag.....
In [ ]:
## Example - try to guess the output

import requests
from bs4 import BeautifulSoup

re = requests.get('https://tw.news.yahoo.com/%E9%A6%99%E6%B8%AF%E9%9B%BB%E5%BD%B1%E9%87%91%E5%83%8F%E7%8D%8E%EF%BC%8F%E5%90%B3%E6%85%B7%E4%BB%81%E3%80%8C%E5%85%A8%E7%B2%B5%E8%AA%9E%E5%8F%97%E8%A8%AA%E3%80%8D%E7%B6%B2%E8%AA%87%E8%AE%9A-%E8%A8%B1%E5%85%89%E6%BC%A2%E5%B8%A5%E7%82%B8%E7%B4%85%E6%AF%AF-104634573.html')
bts = BeautifulSoup(re.text, 'lxml')
print(bts.head.title.string)
print(bts.head.meta['content'])
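
The dotted navigation above needs a live page, so here is a minimal offline sketch of the same idea. The markup is a hypothetical page made up for illustration, and `html.parser` is used instead of `lxml` only to avoid an extra dependency. It also shows one thing to watch for: each attribute access returns only the first matching tag, so `head.meta` may not be the `<meta>` you expect.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration
html_string = """
<html>
  <head>
    <meta charset="utf-8">
    <meta name="description" content="demo page">
    <title>Demo title</title>
  </head>
  <body><p>first</p><p>second</p></body>
</html>
"""
soup = BeautifulSoup(html_string, "html.parser")

# Each attribute access returns the FIRST matching tag below the current one
title_text = soup.head.title.string
first_meta = soup.head.meta
first_p = soup.body.p.string

print(title_text)                 # Demo title
print(first_meta.get("content"))  # None: the first <meta> has no content attribute
print(first_p)                    # first
```

Using `.get("content")` rather than `['content']` avoids a `KeyError` when the first tag lacks that attribute.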

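Child (Downward) Traversal

  • Built-in BeautifulSoup attributes
contents: a list of all direct child nodes
children: an iterator over all direct child nodes
descendants: an iterator over all nested nodes (children, grandchildren, and so on)
In [1]:
## Example - contents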
from bs4 import BeautifulSoup

html_string = """
<div id='d1'>
    <span>sub iterator</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""
bts = BeautifulSoup(html_string, 'lxml')
tag_d2 = bts.select('#d2')
tag_sub_ul = tag_d2[0].ul
for child in tag_sub_ul.contents:
    # Skip whitespace text nodes; keep only Tag objects such as <li>
    if isinstance(child, type(tag_sub_ul)):
        print(child.string)
one
two
three
In [57]:
## Example - children
from bs4 import BeautifulSoup

html_string = """
<div id='d1'>
    <span>sub iterator</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""
bts = BeautifulSoup(html_string, 'lxml')
tag_d2 = bts.select('#d2')
tag_sub_ul = tag_d2[0].ul
for child in tag_sub_ul.children:
    # Skip whitespace text nodes; keep only Tag objects such as <li>
    if isinstance(child, type(tag_sub_ul)):
        print(child.string)
one
two
three

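Difference Between contents and children

  • contents and children yield the same child nodes, so looping over either one gives the same result
  • Difference: contents returns a list, whereas children returns an iterator that can only be consumed once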
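
To make the difference concrete, here is a small sketch. The two-item list is made up for illustration, and `html.parser` is used only to avoid depending on `lxml`. It checks the type of `contents` and shows that `children` is empty after one pass.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

contents = ul.contents   # a real list: supports len() and indexing
children = ul.children   # an iterator: consumed as it is looped over

is_list = isinstance(contents, list)
first_item = str(contents[0].string)
from_iterator = [str(child.string) for child in children]
leftover = list(children)   # the iterator is now exhausted

print(is_list, first_item)
print(from_iterator)
print(leftover)
```

If you need indexing, `len()`, or several passes over the children, `contents` is the more convenient choice.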
In [106]:
## Example - descendants
from bs4 import BeautifulSoup
from bs4.element import NavigableString
html_string = """
<div id='d1'>
    <span>sub iterator</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""
bts = BeautifulSoup(html_string, 'lxml')
tag_d2 = bts.select('#d2')

for node in tag_d2[0].descendants:
    # Keep only text nodes that contain something other than whitespace
    if isinstance(node, NavigableString) and node.strip():
        print(node)
sub iterator
one
two
three
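
As a side-by-side comparison, the sketch below lists the tag names reached through `children` and through `descendants` on the same element. The markup is a shortened, hypothetical version of the example above, and `html.parser` is an assumption made to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<div><ul><li><span>one</span></li><li>two</li></ul></div>", "html.parser"
)
div = soup.div

# Direct children only: the single <ul>
child_tags = [node.name for node in div.children if isinstance(node, Tag)]
# Every nested level, in document order
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_tags)        # ['ul']
print(descendant_tags)   # ['ul', 'li', 'span', 'li']
```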

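Before You Practice

  • Some stricter websites block web crawlers by rejecting requests that look automated, so you may see a 403 Forbidden error (as in the example below)
  • Because the complete page structure cannot be retrieved, the page cannot be parsed
In [2]:
## Example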
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.imdb.com/chart/top/')
print(response.text)
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>

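Working Around 403 Forbidden

  • Step 1: In the browser's developer tools, go to Network -> Headers -> User-Agent
  • Step 2: Pass the User-Agent you found to requests as a header.
For example:
requests.get(url, headers={"user-agent": the User-Agent you found})
In [ ]:
## Example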
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.imdb.com/chart/top/', headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"})
print(response.text)
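
A slightly more defensive variant is sketched below. The helper name `fetch_html`, the timeout value, and the placeholder User-Agent string are assumptions for illustration and are not part of the original notebook. The helper raises an error on a 4xx/5xx response instead of returning an error page, and the last lines build the request without sending it so you can inspect the outgoing header offline.

```python
import requests

# Placeholder values for illustration; copy the real User-Agent from your own browser
URL = "https://www.imdb.com/chart/top/"
HEADERS = {"User-Agent": "Mozilla/5.0 (example placeholder)"}

def fetch_html(url, headers, timeout=10):
    """Return the page text, raising requests.HTTPError on 4xx/5xx responses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()   # fail loudly instead of parsing an error page
    return response.text

# Build the request without sending it, to inspect the headers that would go out
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()
sent_agent = prepared.headers["user-agent"]   # header names are case-insensitive
print(sent_agent)
```

Calling `fetch_html(URL, HEADERS)` performs the real request; whether a site accepts a given User-Agent can change over time.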

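Exercise

  • Use downward traversal
  • Extract the titles of all the movies, for example:
1. The Shawshank Redemption
2. The Godfather
3. The Godfather Part II
4. The Dark Knight
5. 12 Angry Men
6. Schindler's List
....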

Parent (Upward) Traversal

parent  # attribute
find_parent()  # method
In [132]:
## Example - try to guess the output
from bs4 import BeautifulSoup

html_string = """
<div id='d1'>
    <span>sub iterator 1</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator 2</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""

bts = BeautifulSoup(html_string, 'lxml')
tag_li = bts.li
print(tag_li.parent.parent.span.string)
print(tag_li.find_parent().find_parent().span.string)
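
Besides chaining `parent` one level at a time, `find_parent()` accepts a tag name and jumps straight to the nearest matching ancestor. The sketch below uses different, hypothetical markup so it does not give away the answer above. It also uses the `parents` attribute, which this notebook does not otherwise cover, and `html.parser` as a dependency-free assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration
soup = BeautifulSoup(
    "<section id='s1'><div><p><b>deep</b></p></div></section>", "html.parser"
)
tag_b = soup.b

direct_parent = tag_b.parent.name                 # immediate parent
nearest_section = tag_b.find_parent("section")    # skips <p> and <div>
all_ancestors = [tag.name for tag in tag_b.parents]

print(direct_parent)           # p
print(nearest_section["id"])   # s1
print(all_ancestors)
```

The last entry of `all_ancestors` is the `BeautifulSoup` object itself, which sits at the top of the tree.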

Sibling (Sideways) Traversal

  • Built-in BeautifulSoup attributes and methods
# Move to the next sibling
next_sibling  # attribute
find_next_sibling()  # method
In [15]:
## Example
from bs4 import BeautifulSoup

html_string = """
<div id='d1'>
    <span>sub iterator 1</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator 2</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""

bts = BeautifulSoup(html_string, 'lxml')
sele_ul = bts.ul
#print(sele_ul)
sele_ul_li = sele_ul.li
print(sele_ul_li.next_sibling.next_sibling)
print(sele_ul_li.find_next_sibling())
<li>two</li>
<li>two</li>
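
The example above calls `next_sibling` twice because the line break between two `<li>` tags is itself a text node. A minimal sketch of that behaviour follows; the short list is made up for illustration and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html_string = """<ul>
<li>one</li>
<li>two</li>
</ul>"""
soup = BeautifulSoup(html_string, "html.parser")
first_li = soup.li

raw_next = first_li.next_sibling           # the "\n" text node between the tags
tag_next = first_li.find_next_sibling()    # skips text nodes, returns a Tag

print(repr(raw_next), isinstance(raw_next, NavigableString))
print(tag_next)
```

In practice `find_next_sibling()` is the safer choice, since it does not depend on how the source HTML happens to be indented.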

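Sibling (Sideways) Traversal

  • Built-in BeautifulSoup attributes and methods
# Move to the previous sibling
previous_sibling  # attribute
find_previous_sibling()  # method
In [25]:
## Example - try to guess the output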
from bs4 import BeautifulSoup

html_string = """
<div id='d1'>
    <span>sub iterator 1</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator 2</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""

bts = BeautifulSoup(html_string, 'lxml')
select_d2 = bts.select('#d2')[0]
#print(select_d2)
select_d2_li = select_d2.li.find_next_sibling()
#print(select_d2_li)
print(select_d2_li.find_previous_sibling())
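
When you need every sibling rather than just the nearest one, the plural methods `find_previous_siblings()` and `find_next_siblings()` return lists. These are not covered elsewhere in this notebook, so treat the sketch below as supplementary; the markup is hypothetical and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>", "html.parser"
)
third_li = soup.find_all("li")[2]

# Plural forms return a list of every matching sibling Tag
before = [str(li.string) for li in third_li.find_previous_siblings("li")]
after = [str(li.string) for li in third_li.find_next_siblings("li")]

print(before)   # ['b', 'a']  (nearest sibling first)
print(after)    # ['d']
```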

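Exercise

  • Use sibling traversal
  • Extract the title and rating of every movie, for example:
    1 : The Shawshank Redemption Rating: 9.2
    2 : The Godfather Rating: 9.1
    3 : The Godfather Part II Rating: 9.0
    4 : The Dark Knight Rating: 9.0
    5 : 12 Angry Men Rating: 8.9
    ...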

Traversing the Previous and Next Elements

next_element   # attribute
previous_element  # attribute
In [53]:
## Example
from bs4 import BeautifulSoup

html_string = """
<div id='d1'>
    <span>sub iterator 1</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator 2</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""
bts = BeautifulSoup(html_string, 'lxml')
tag_div = bts.div
print(tag_div.next_element.next_element)
print(tag_div.next_element.next_element.find_next_sibling().next_element.next_element)   # guess the output
<span>sub iterator 1</span>
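
It can help to contrast `next_element` with `next_sibling`: the former follows the order in which the document was parsed, so it steps into a tag's own content first, while the latter stays on the same level. A minimal sketch on made-up markup, with `html.parser` as an assumption:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b>tail</p>", "html.parser")
tag_b = soup.b

# next_element follows parse order, so it steps INTO <b> first
into_child = tag_b.next_element
# next_sibling stays on the same level and skips what is inside <b>
same_level = tag_b.next_sibling

print(into_child)   # bold
print(same_level)   # tail
```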

All Next Elements, All Previous Elements

next_elements   # attribute
previous_elements  # attribute
In [72]:
## Example
from bs4 import BeautifulSoup
from bs4.element import NavigableString
html_string = """
<div id='d1'>
    <span>sub iterator 1</span>
    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
<div id='d2'>
    <span>sub iterator 2</span>
    <ul>
        <li><span>one</span></li>
        <li>two</li>
        <li>three</li>
    </ul>
</div>
"""
bts = BeautifulSoup(html_string, 'lxml')
tag_d1 = bts.select('#d1')
#print(tag_d1[0])
for node in tag_d1[0].next_elements:
    # Keep only text nodes that contain something other than whitespace
    if isinstance(node, NavigableString) and node.strip():
        print(node)
sub iterator 1
one
two
three
sub iterator 2
one
two
three

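Exercise

  • Build your own HTML string, practice using previous_elements, and observe how it behaves

Final Exercise

  • Scrape the TV show rankings and save the title, rating, and year to a CSV file
  • Then answer the following question
    • What is the average rating of the top 100 shows by rating?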
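
For the storage and averaging part of this exercise only, one possible approach with the standard library is sketched below. The rows are made-up placeholders rather than real rankings, the column names are assumptions, and the scraping step is left to you.

```python
import csv
import io
from statistics import mean

# Made-up placeholder rows for illustration only, NOT real rankings
rows = [
    {"name": "Show A", "rating": 9.0, "year": 2001},
    {"name": "Show B", "rating": 8.0, "year": 2010},
    {"name": "Show C", "rating": 7.0, "year": 2020},
]

# Write to an in-memory buffer; for a real file, open it with newline="" instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "rating", "year"])
writer.writeheader()
writer.writerows(rows)

# Read the data back and average the top N ratings (N=2 here; the exercise asks for 100)
buffer.seek(0)
ratings = sorted((float(r["rating"]) for r in csv.DictReader(buffer)), reverse=True)
top_n_average = mean(ratings[:2])
print(top_n_average)   # 8.5
```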