Selenium

Da-Wei Chiang

Outline

  • Why do we need Selenium?
  • Introduction to Selenium
  • Installing Selenium
  • Selenium basics
  • Selenium exceptions
  • Selenium interaction methods

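Why do we need Selenium?

  • The approach we have used so far can only scrape static pages, not dynamic ones
    • Static pages
      • The page content is already fixed before we scrape it
    • Dynamic pages
      • We do not know in advance where the data is; the page returned depends on what the user does, so the scraped data differs as well
Example:
    Use this site (https://www.imdb.com/) to scrape information about the movie Iron Man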
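
The difference can be seen without opening a browser. The sketch below uses only Python's standard library and two made-up HTML snippets (both are illustrative assumptions, not real IMDb markup). In the static snippet the title is part of the markup; in the dynamic one the markup holds an empty element that a script would fill in later, so a parser that does not run JavaScript finds nothing.

```python
from html.parser import HTMLParser

# Hypothetical markup: the data is already in the HTML.
STATIC_HTML = '<html><body><h1 id="title">Iron Man</h1></body></html>'

# Hypothetical markup: the data only appears after a script runs in a browser.
DYNAMIC_HTML = (
    '<html><body><h1 id="title"></h1>'
    '<script>document.getElementById("title").textContent = "Iron Man";</script>'
    '</body></html>'
)


class TitleParser(HTMLParser):
    """Collects the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Only keep text that appears between <h1> and </h1>.
        if self.in_h1 and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    """Parse `html` without executing scripts and return the <h1> texts."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


print(extract_titles(STATIC_HTML))   # the title is present in the markup
print(extract_titles(DYNAMIC_HTML))  # nothing to read until JavaScript runs
```

A real browser driven by Selenium would run the script first, which is why the rendered page can contain data that the raw HTML does not.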

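Introduction to Selenium

  • Selenium IDE
    • An integrated development environment for Selenium tests; it can record and edit Selenium tests
  • Selenium API
    • Selenium tests can be written in C#, Java, Python, and other languages; they use the Selenium API to communicate with WebDriver
  • Selenium WebDriver
    • Receives the messages sent by the Selenium API and controls the web browser. Supported browsers include Chrome, Firefox, IE, and Edge, among others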

Installing Selenium

  • Step 1:
    • Install the Selenium API
  • Step 2:
    • Download the WebDriver

Step 1

# Install
pip install selenium
or
conda install selenium

# Import
from selenium import webdriver

Step 2

After downloading, unzip the driver and place it in your Selenium working directory
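
Before launching the browser, it can help to confirm that the driver file really is where the script expects it. A minimal sketch, assuming the file is named `chromedriver` (or `chromedriver.exe` on Windows) and sits in the working directory; the helper name `locate_driver` is hypothetical.

```python
from pathlib import Path


def locate_driver(directory=".", name="chromedriver"):
    """Return the path to the driver file in `directory`, or None if it is missing."""
    folder = Path(directory)
    # On Windows the downloaded file carries an .exe suffix.
    for candidate in (folder / name, folder / f"{name}.exe"):
        if candidate.is_file():
            return candidate.resolve()
    return None


driver_path = locate_driver()
if driver_path is None:
    print("chromedriver not found in the working directory")
else:
    print(f"Using driver at {driver_path}")
```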

Using Selenium

from selenium.webdriver.chrome.service import Service

webdriver_variable = webdriver.Chrome(service=Service('driver path'))  # start the WebDriver
webdriver_variable.quit()  # close the WebDriver
In [3]:
## Check that Selenium is installed correctly

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver'))
time.sleep(10)  # keep the browser open for 10 seconds
driver.quit()

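Selenium basics

Getting the HTML content

driver_variable.implicitly_wait(integer)   # implicit wait: maximum seconds to wait when locating elements
driver_variable.get('web link')   # open the given URL
driver_variable.title  # title of the page
driver_variable.page_source  # HTML source of the page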
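
These calls are often wrapped so that the browser is closed even when something fails halfway. A minimal sketch, assuming `driver` is an already created WebDriver object; the function name `fetch_page` and its `wait_seconds` parameter are illustrative, not part of Selenium.

```python
def fetch_page(driver, url, wait_seconds=5):
    """Open `url`, return (title, html), and always close the browser."""
    try:
        driver.implicitly_wait(wait_seconds)  # upper bound for element lookups
        driver.get(url)                       # navigate to the page
        return driver.title, driver.page_source
    finally:
        driver.quit()                         # runs even if get() raises
```

With a real browser this would be called as `title, html = fetch_page(driver, 'https://www.imdb.com/')`.
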
In [4]:
# Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get('https://www.imdb.com/title/tt0371746/?ref_=nv_sr_srsg_0')
print(driver.title)
#print(driver.page_source)
driver.quit()
鋼鐵人 (2008) - IMDb

Exercise

  • Use Selenium to scrape the title and the full cast list of Batman (store the cast in a list)

Locating elements with Selenium

# Import
from selenium.webdriver.common.by import By
# Find a single element
driver_variable.find_element(By.ID, "id")
driver_variable.find_element(By.NAME, "name")
driver_variable.find_element(By.XPATH, "xpath")
driver_variable.find_element(By.TAG_NAME, "tag name")
driver_variable.find_element(By.CLASS_NAME, "class name")
driver_variable.find_element(By.CSS_SELECTOR, "css selector")
# Find multiple elements
driver_variable.find_elements(By.ID, "id")
driver_variable.find_elements(By.NAME, "name")
driver_variable.find_elements(By.XPATH, "xpath")
driver_variable.find_elements(By.TAG_NAME, "tag name")
driver_variable.find_elements(By.CLASS_NAME, "class name")
driver_variable.find_elements(By.CSS_SELECTOR, "css selector")
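
`find_elements` returns a list of element objects, so turning a match into plain Python data is a list comprehension over each element's `.text`. A small sketch; the helper name `element_texts` is hypothetical and `driver` is assumed to be an open WebDriver.

```python
def element_texts(driver, by, value):
    """Return the non-empty visible text of every element matching the locator."""
    elements = driver.find_elements(by, value)
    # Skip elements whose text is empty or whitespace only.
    return [el.text.strip() for el in elements if el.text.strip()]
```

For example, `element_texts(driver, By.TAG_NAME, "h1")` would give the page's `<h1>` headings as a list of strings.
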
In [7]:
## Example - tag name
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
import time
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get('https://www.digitimes.com.tw/col/article.asp?id=1300&cf=AI1')
time.sleep(10)  # wait for the page to finish loading
tag_h1 = driver.find_element(By.TAG_NAME, "h1")  # first <h1> on the page

print(tag_h1.text)

driver.quit()
半導體設備供應鏈的台灣角色
In [20]:
## Example - css selector

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get('https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm')
time.sleep(10)  # wait for the page to finish loading
cast_selector = driver.find_elements(By.CSS_SELECTOR, '.primary_photo+ td a')  # note the selector syntax
for i in cast_selector:
    print(i.text)
driver.quit()
Robert Pattinson
Zoë Kravitz
Jeffrey Wright
Colin Farrell
Paul Dano
John Turturro
Andy Serkis
Peter Sarsgaard
Barry Keoghan
Jayme Lawson
Gil Perez-Abraham
Peter McDonald
Con O'Neill
Alex Ferns
Rupert Penry-Jones
Kosha Engler
Archie Barnes
Janine Harouni
Hana Hrzic
Joseph Walker
Luke Roberts
Oscar Novak
Stella Stocker
Sandra Dickinson
Jack Bennett
Andre Nightingale
Richard James-Neale
Lorraine Tai
Joseph Balderrama
James Eeles
Angela Yeoh
Leemore Marrett Jr.
Ezra Elliott
Itoya Osagiede
Stewart Alexander
Adam Rojko Vega
Heider Ali
Marcus Onilude
Elena Saurel
Ed Kear
Sid Sagar
Amanda Blake
Todd Boyce
Brandon Bassir
Will Austin
Chabris Napier-Lawrence
Douglas Russell
Charlie Carver
Max Carver
Phil Aizlewood
Mark Killeen
Philip Shaun McGuinness
Lorna Brown
Elliot Warren
Jay Lycurgo
Stefan Race
Elijah Baker
Craige Middleburg
Akie Kotabe
Spike Fearn
Urielle Klein-Mekongo
Bronson Webb
Madeleine Gray
Ste Johnston
Arthur Lee
Parry Glasspool
Jordan Coulson
Hadas Gold
Pat Battle
Bobby Cuza
Dean Meminger
Roma Torre
Mike Capozzola
Amanda Hurwitz
Joshua Eldridge-Smith
Daniel Rainford
Nathalie Armin
Jose Palma
Kazeem Tosin Amore
Jonathan Addis
Adaeze Cornelia Anane
Rodrig Andrisan
Eduardo Arrufat-Reboso
Kiran Asahan
Diego Barraza
Amy Clare Beales
Nicholas Benjamin
Scott Bennett
Charlie Bentley
Douglas Bunn
Phil Campbell
Tony Christian
Ruth Clarson
Bern Collaço
Andreea Helen David
Nick Davison
Obie Dean
Adria Dinev
Viliyan Donchev
Craig Douglas
Evan A. Dunn
Daniel Eghan
Hayden Ellingworth
Darcie Ellson
Paul Fitchford
Joseph L Geist
Albert Giannitelli
Susan Gillias
Callum Gore
Tamara Gough
George Graham
Rachel Handshaw
Juke Hardy
Metin Hassan
Christopher James Healy
Sarah Hussain
Shenel Hussein
Simon Jago
Yasmin J. James
Tobias James-Samuels
Adnan Kundi
Erran Lake
Sophie Lamont
Stuart D. Latham
Mickey Lewis
Eugene Lin
Annishia Camilla Lunette
Teresa Mahoney
Ben Mansfield
Tiago Martins
Obie Matthew
Nichola Jean Mazur
Kenny-Lee Mbanefo
Tony McCarthy
Tremayne Miller
Bharat Mistri
Christopher Moore
Sri Moorthy
Ayse Muge
Clément Osty
Nick Owenford
Andrew Paxton-Gray
Richard Price
Zoltan Rencsar
Paul Riddell
Will Rowlands
Iana Saliuk
Bernardo Santos
Kemal Shah
Eugene Shawn
Sam Shoubber
Amber Sienna
Dave Simon
James Snelling
Gareth Snow
Richard Stanley
Jimmy Star
Alfredo Tavares
Michelle Thomas
James Travis
Peter Trevor
Sahil Vaid
Vic Waghorn
Stuart Whelan
Paul Whelligan
Daniel Joseph Woolf
In [21]:
## Example - reading a tag attribute

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By  # needed for By.TAG_NAME below

driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get('https://www.imdb.com/title/tt1877830/?ref_=nv_sr_srsg_0')
tag_a = driver.find_element(By.TAG_NAME, 'a')  # first <a> on the page
print(tag_a.get_attribute('href'))
driver.quit()
https://www.imdb.com/?ref_=nv_home

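Replacing By constants with strings

By.ID can be replaced with the string "id"
By.NAME can be replaced with the string "name"
By.XPATH can be replaced with the string "xpath"
By.TAG_NAME can be replaced with the string "tag name"
By.CLASS_NAME can be replaced with the string "class name"
By.CSS_SELECTOR can be replaced with the string "css selector"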
In [22]:
## Example - css selector

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

import time
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.implicitly_wait(2)
driver.get('https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm')
time.sleep(10)  # wait for the page to finish loading
cast_selector = driver.find_elements("css selector", '.primary_photo+ td a')  # note the selector syntax
for i in cast_selector:
    print(i.text)
driver.quit()
Robert Pattinson
Zoë Kravitz
Jeffrey Wright
Colin Farrell
Paul Dano
John Turturro
Andy Serkis
Peter Sarsgaard
Barry Keoghan
Jayme Lawson
Gil Perez-Abraham
Peter McDonald
Con O'Neill
Alex Ferns
Rupert Penry-Jones
Kosha Engler
Archie Barnes
Janine Harouni
Hana Hrzic
Joseph Walker
Luke Roberts
Oscar Novak
Stella Stocker
Sandra Dickinson
Jack Bennett
Andre Nightingale
Richard James-Neale
Lorraine Tai
Joseph Balderrama
James Eeles
Angela Yeoh
Leemore Marrett Jr.
Ezra Elliott
Itoya Osagiede
Stewart Alexander
Adam Rojko Vega
Heider Ali
Marcus Onilude
Elena Saurel
Ed Kear
Sid Sagar
Amanda Blake
Todd Boyce
Brandon Bassir
Will Austin
Chabris Napier-Lawrence
Douglas Russell
Charlie Carver
Max Carver
Phil Aizlewood
Mark Killeen
Philip Shaun McGuinness
Lorna Brown
Elliot Warren
Jay Lycurgo
Stefan Race
Elijah Baker
Craige Middleburg
Akie Kotabe
Spike Fearn
Urielle Klein-Mekongo
Bronson Webb
Madeleine Gray
Ste Johnston
Arthur Lee
Parry Glasspool
Jordan Coulson
Hadas Gold
Pat Battle
Bobby Cuza
Dean Meminger
Roma Torre
Mike Capozzola
Amanda Hurwitz
Joshua Eldridge-Smith
Daniel Rainford
Nathalie Armin
Jose Palma
Kazeem Tosin Amore
Jonathan Addis
Adaeze Cornelia Anane
Rodrig Andrisan
Eduardo Arrufat-Reboso
Kiran Asahan
Diego Barraza
Amy Clare Beales
Nicholas Benjamin
Scott Bennett
Charlie Bentley
Douglas Bunn
Phil Campbell
Tony Christian
Ruth Clarson
Bern Collaço
Andreea Helen David
Nick Davison
Obie Dean
Adria Dinev
Viliyan Donchev
Craig Douglas
Evan A. Dunn
Daniel Eghan
Hayden Ellingworth
Darcie Ellson
Paul Fitchford
Joseph L Geist
Albert Giannitelli
Susan Gillias
Callum Gore
Tamara Gough
George Graham
Rachel Handshaw
Juke Hardy
Metin Hassan
Christopher James Healy
Sarah Hussain
Shenel Hussein
Simon Jago
Yasmin J. James
Tobias James-Samuels
Adnan Kundi
Erran Lake
Sophie Lamont
Stuart D. Latham
Mickey Lewis
Eugene Lin
Annishia Camilla Lunette
Teresa Mahoney
Ben Mansfield
Tiago Martins
Obie Matthew
Nichola Jean Mazur
Kenny-Lee Mbanefo
Tony McCarthy
Tremayne Miller
Bharat Mistri
Christopher Moore
Sri Moorthy
Ayse Muge
Clément Osty
Nick Owenford
Andrew Paxton-Gray
Richard Price
Zoltan Rencsar
Paul Riddell
Will Rowlands
Iana Saliuk
Bernardo Santos
Kemal Shah
Eugene Shawn
Sam Shoubber
Amber Sienna
Dave Simon
James Snelling
Gareth Snow
Richard Stanley
Jimmy Star
Alfredo Tavares
Michelle Thomas
James Travis
Peter Trevor
Sahil Vaid
Vic Waghorn
Stuart Whelan
Paul Whelligan
Daniel Joseph Woolf

Exercise

  • Use Selenium to scrape the full text of the news article
  • Also scrape the author information
    e.g.:
      楊燿宏  台灣應材美國總公司AGS核心工程部資深協理
      簡介:
      現任台灣應材美國總公司AGS核心工程部資深協理,曾任Veeco Instruments專案及工程總監,2018起開始擔任北美台灣工程師協會矽谷分會(NATEA-SV)會長,亦為僑委會僑務促進委員。
In [15]:
# Exercise - the previous approach (requests + BeautifulSoup)

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.digitimes.com.tw/col/article.asp?id=1300&cf=AI1')

bts = BeautifulSoup(response.text, 'lxml')

# article body paragraphs
main = bts.find_all('p', class_='main_p')
for i in main:
    print(i.get_text())
print("- - - - - - ")
# author name and job title
caption_tag = bts.find(class_='caption')
print(caption_tag.get_text(' ').replace('\t', '').replace('\n', ''))
# author bio
caption_desc = bts.find(class_='thumbnail_desc')
print("簡介:")  # "簡介" means "Bio"
print(caption_desc.string)
						半導體,不同於其他產業,是一個非常技術密集且資本密集的產業。
在技術密集方面,半導體廠傾盡全力開發更先進的1奈米製程,也持續投入更大的資本在半導體設備上。在台積電及其他半導體IC大廠競相爭奪半導體市佔的同時,其實比較少被提到的半導體設備商,也扮演著產業鏈舉足輕重的角色。
在前AIoT的時代,90%所收集到的資料都已被分析利用,在AIoT時代,更巨大的資料量將被產生,而目前統計結果告訴我們只有50%的資料是能夠直接被機器使用的。 但不論是資料的生成以及分析都需要使用到大量的晶片。最近新聞報導不斷出現美國車廠對晶片供應短缺的擔憂,就是一個很好的例子。
在這個對自動化以及資料分析更重視的時代,IC晶片的需求的成長正朝著幾何級數方向發展。任何的產業最終都須達到供需的平衡,IC設計的進步以及製造製程朝向3奈米以及1奈米的發展確實讓技術以及製程上追上了市場應用對晶片規格的要求,但在供應鏈上,大量的資本投入加上對半導體設備的大力投資,則更進一步的幫助市場達到供需的平衡。
整體半導體的供應鏈極為分工精密,大致上可分為兩大區塊。第一個區塊是半導體的生產供應鏈,包含晶片設計、封裝、測試;第二個區塊是半導體設備的供應鏈,這部分其實就是提供設備以及原物料支援前述第一區塊的生產、測試、封裝各流程。相信大部分的讀者對半導體IC的生產供應鏈已經相當熟悉,也了解台灣重要的地位。但是在半導體設備的供應鏈上,台灣其實也佔了世界上舉足輕重的地位,但是這部分卻比較少在報章媒體的報導上被彰顯。
以半導體設備的產業趨勢而言,相對於之前30年聚焦於半導體設備的採購價格,最近3到5年,半導體設備的每單位晶圓成本(cost per wafer)以及總體擁有成本(cost of ownership)已經成為生產到場在談論交易時兩個重要指標。同時在半導體製造上,效能也不再是唯一,單位區域價格(cost per area)以及能耗(power consumption)都是半導體製程進步的非常重要指標。
防疫期間全球供應鏈的移轉以及不穩定性,考驗著半導體設備商在強勁的半導體設備需求上穩定供貨的能力。如同傳統產業如在數位轉型上的超前部署,則能在疫情期間穩定度過甚至得到的超預期的成果。這幾年產業對半導體需求的急速上升,如果沒有全球半導體設備廠和台灣的關鍵零組件供應商對全球供應鏈預先的超前部署以及台灣這一年來的穩定防疫成果,在COVID-19(新冠肺炎)期間我們可能早已看到全球高科技市場有更大的波動以及不穩定性。
至於台灣在未來半導體上的發展,筆者認為產官學界應持續關注自由貿易協定(FTA),如區域全面經濟夥伴協定(RCEP)、泛太平洋貿易協定(CPTPP),並全力推動跨太平洋夥伴全面進步協定(CPTPP)合作及海峽兩岸經濟合作架構協議(ECFA),才能在對半導體供應鏈能持續保有重要的地位,以及對各種風險能有更好的因應。另外每單位晶圓成本(cost per wafer)以及總體擁有成本(cost of ownership)的考慮也是台灣的廠商應該更加注意的。而關於這部分,下次專欄我們會再有更深入的討論。
							
- - - - - - 
楊燿宏  台灣應材美國總公司AGS核心工程部資深總監   
簡介:
楊燿宏為應用材料美國總公司工程部資深總監、前矽谷美臺高科技論壇(UTHF)大會召集人、2020美臺產業科技論壇單元主持人、北美台灣工程師協會NATEA矽谷分會2018會長及現任顧問、舊金山灣區台灣商會TCCSFBA任期顧問,亦為僑委會僑務促進委員,擁有30多項美國與國際專利。

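Selenium exceptions

ElementNotVisibleException  # the element exists but is not visible
ErrorInResponseException    # the server responded with an error
NoSuchElementException      # the selected element does not exist
TimeoutException            # the operation exceeded its time limit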
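
Catching `NoSuchElementException` is one way to cope with a missing element. Another, sketched here, relies on the fact that `find_elements` returns an empty list instead of raising when nothing matches. The helper name `first_text_or_default` is hypothetical, and `driver` is assumed to be an open WebDriver.

```python
def first_text_or_default(driver, by, value, default=None):
    """Return the text of the first matching element, or `default` if none match."""
    matches = driver.find_elements(by, value)  # empty list when nothing matches
    if not matches:
        return default
    return matches[0].text
```

This keeps the calling code free of `try`/`except` when a missing element is an expected case rather than an error.
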
In [20]:
# Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.implicitly_wait(2)
driver.get('https://www.imdb.com/title/tt1877830/?ref_=nv_sr_srsg_0')
try:
    main_tag = driver.find_element("tag name", 'test')  # the page has no <test> tag
    print(main_tag.text)
except NoSuchElementException:
    print("No such tag")
driver.quit()
No such tag

Exercise

  • Practice handling Selenium exceptions on your own

Selenium interaction methods

Main methods

# Import
from selenium.webdriver.common.keys import Keys
# Usage
send_keys() # type text into an element

Keys.ENTER  # the Enter key
In [21]:
## Example
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.implicitly_wait(5)  # set once; applies to every element lookup
driver.get('https://www.google.com.tw/')
input_tag = driver.find_element("css selector", 'textarea.gLFyf')
input_tag.send_keys('聯成電腦')
input_tag.send_keys(Keys.ENTER)
time.sleep(5)  # pause so the results stay visible; implicitly_wait() does not pause
driver.quit()

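Selenium Action Chain

# Import
from selenium.webdriver.common.action_chains import ActionChains
# Usage
click()  # click
double_click()  # double-click
move_to_element()  # move the mouse over an element
key_down()   # press and hold a key
key_up()     # release a key
perform()    # execute all queued actions
send_keys()  # send keystrokes to the currently focused element
release()    # release the held mouse button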
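
The point that is easiest to miss in this list is that the action methods only queue work; nothing reaches the browser until `perform()` is called. The toy class below is not Selenium's implementation and says nothing about its internals. It is only a standard-library model of that queue-then-run pattern, and `ToyActionChain` and its methods are made up for illustration.

```python
class ToyActionChain:
    """Illustrative stand-in that mimics how an action chain defers work."""

    def __init__(self):
        self._queue = []   # actions waiting to run
        self.log = []      # actions that have actually been executed

    def click(self, target):
        self._queue.append(f"click {target}")
        return self        # returning self allows method chaining

    def send_keys(self, text):
        self._queue.append(f"type {text!r}")
        return self

    def perform(self):
        # Run every queued action in order (this toy then empties its own queue).
        self.log.extend(self._queue)
        self._queue.clear()


chain = ToyActionChain().click("search box").send_keys("batman")
print(chain.log)   # still empty: nothing has run yet
chain.perform()
print(chain.log)   # both actions, in the order they were queued
```
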
In [42]:
## Example - scrape the cast of The Shawshank Redemption (刺激1995)

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
import time
link = "https://m.imdb.com/chart/top/"

driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get(link)

# first title link in the chart
movie1995tag = driver.find_element("css selector", '.ipc-title-link-wrapper')
#print(movie1995tag.text)

actions = ActionChains(driver)
actions.click(movie1995tag)  # queue the click
actions.perform()            # run the queued actions
print('actions ok')
print('- - - - - -')
cast = driver.find_elements('css selector', 'a[data-testid="title-cast-item__actor"]')
for i in cast:
    print(i.text)

time.sleep(10)
driver.quit()
actions ok
- - - - - -
Tim Robbins
Morgan Freeman
Bob Gunton
William Sadler
Clancy Brown
Gil Bellows
Mark Rolston
James Whitmore
Jeffrey DeMunn
Larry Brandenburg
Neil Giuntoli
Brian Libby
David Proval
Joseph Ragno
Jude Ciccolella
Paul McCrane
Renee Blaine
Scott Mann

Exercise

  • Scrape the top 60 titles from the manga ranking and display them in rank order
Example:
  1. 鬼灭之刃
  2. 斗破苍穹
  3. 一拳超人
  4. 斗罗大陆
  5. ONE PIECE航海王
     ...
     ...
     ...
  60. 月光下的异世界之旅
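
Once the titles have been scraped into a list, the numbered output is a matter of `enumerate`. A minimal sketch, assuming the scraped titles are already in page order; `format_ranking` and the three-item sample list are illustrative only.

```python
def format_ranking(titles, limit=60):
    """Return lines such as '1. title' for the first `limit` titles."""
    return [f"{rank}. {title}" for rank, title in enumerate(titles[:limit], start=1)]


sample = ["鬼灭之刃", "斗破苍穹", "一拳超人"]
for line in format_ranking(sample):
    print(line)
```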