A Brief Introduction to Regular Expressions¶

Da-Wei Chiang¶

Outline¶

What is a regular expression?
Character sets
Metacharacters
Regular expressions in Python
Searching web pages with regular expressions

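What Is a Regular Expression?¶

Regular expressions
In plain terms: a regular expression uses codes and symbols to describe a pattern of text. It is a common way to process strings in all kinds of programming languages.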

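Character Sets¶

Character sets are the main building blocks of a regular expression.
Each character set states the condition that one character of the target string must satisfy.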

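Character Sets - Examples¶

Wrap a group of characters in [] to define the set of characters to match.

[abc] : matches any one of a, b, or c
[^abc] : matches any character other than a, b, or c
[a-z] : matches any lowercase letter
[A-Z] : matches any uppercase letter
[a-zA-Z] : matches any letter, lowercase or uppercase
[0-9] : matches any digit from 0 to 9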
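The bracket sets above can be tried out directly in Python. The following is a minimal sketch; the sample string `text` is made up for illustration and is not part of the original slides.

```python
import re

# Hypothetical sample string, chosen only to exercise each set.
text = "Cab 42 xyz"

# Each pattern is a single character set, so findall returns one character per match.
in_abc = re.findall(r"[abc]", text)
not_abc = re.findall(r"[^abc]", text)
lower = re.findall(r"[a-z]", text)
upper = re.findall(r"[A-Z]", text)
letters = re.findall(r"[a-zA-Z]", text)
digits = re.findall(r"[0-9]", text)

print(in_abc)   # ['a', 'b']
print(not_abc)  # ['C', ' ', '4', '2', ' ', 'x', 'y', 'z']
print(lower)    # ['a', 'b', 'x', 'y', 'z']
print(upper)    # ['C']
print(letters)  # ['C', 'a', 'b', 'x', 'y', 'z']
print(digits)   # ['4', '2']
```

Note that the sets are case-sensitive: the uppercase C is not matched by [abc].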

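Character Sets - Examples¶

Predefined character sets start with \

\w : any word character (letter, digit, or underscore), i.e. [a-zA-Z0-9_] for ASCII text
\W : any character that \w does not match, i.e. [^a-zA-Z0-9_]
\d : any digit, i.e. [0-9] for ASCII text
\D : any character that is not a digit, i.e. [^0-9]
\s : any whitespace character (space, tab, newline, and so on), i.e. [ \t\r\n\f\v]
\S : any character that is not whitespace, i.e. [^ \t\r\n\f\v]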
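A short sketch of the predefined sets in Python follows. The sample string `text` is hypothetical and pure ASCII, so the bracket equivalents listed above apply.

```python
import re

# Hypothetical sample: letters, an underscore, a digit, a space, a tab, and punctuation.
text = "id_7 ok\t!"

word_chars = re.findall(r"\w", text)
non_word_chars = re.findall(r"\W", text)
digit_chars = re.findall(r"\d", text)
space_chars = re.findall(r"\s", text)
non_space_chars = re.findall(r"\S", text)

print(word_chars)       # ['i', 'd', '_', '7', 'o', 'k']
print(non_word_chars)   # [' ', '\t', '!']
print(digit_chars)      # ['7']
print(space_chars)      # [' ', '\t']
print(non_space_chars)  # ['i', 'd', '_', '7', 'o', 'k', '!']
```

Each uppercase set is the complement of its lowercase counterpart, so together the two always cover every character in the string.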

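Metacharacters¶

^ : anchors the match at the start of the string, i.e. matching begins at the first character
$ : anchors the match at the end of the string, i.e. the pattern must match up to the last character
. : any single character (except a newline, by default)
| : or; matches either the expression before it or the expression after it
? : the preceding item occurs 0 or 1 times
* : the preceding item occurs 0 or more times
+ : the preceding item occurs 1 or more times
{n} : the preceding item occurs exactly n times
{n,m} : the preceding item occurs n to m times
{n,} : the preceding item occurs at least n times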
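The effect of each quantifier is easiest to see by running several patterns over the same text. This is a minimal sketch; the sample string `text` is invented for the comparison.

```python
import re

# Hypothetical sample: 'a' and 'c' with zero, one, and two 'b' characters in between.
text = "ac abc abbc"

optional = re.findall(r"ab?c", text)          # b occurs 0 or 1 times
zero_or_more = re.findall(r"ab*c", text)      # b occurs 0 or more times
one_or_more = re.findall(r"ab+c", text)       # b occurs 1 or more times
exactly_two = re.findall(r"ab{2}c", text)     # b occurs exactly 2 times
one_to_two = re.findall(r"ab{1,2}c", text)    # b occurs 1 to 2 times
any_char = re.findall(r"a.c", text)           # exactly one character between a and c
either = re.findall(r"ac|abbc", text)         # one alternative or the other

print(optional)      # ['ac', 'abc']
print(zero_or_more)  # ['ac', 'abc', 'abbc']
print(one_or_more)   # ['abc', 'abbc']
print(exactly_two)   # ['abbc']
print(one_to_two)    # ['abc', 'abbc']
print(any_char)      # ['abc']
print(either)        # ['ac', 'abbc']

# Anchors: ^ ties the pattern to the start of the string, $ to the end.
starts_with_ac = re.search(r"^ac", text) is not None   # True
ends_with_bc = re.search(r"bc$", text) is not None     # True
starts_with_bc = re.search(r"^bc", text) is not None   # False
```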

What Do the Following Expressions Mean?¶

xyz
^abc
.a$
a?bc
x+yz
ap{2}
ab{1,2}c
[a-z]{1,}
\w{1,}
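One way to check a guess is to run each expression against a few candidate strings and see which ones match. The helper `matching` and the `samples` list below are hypothetical names and data introduced only for this sketch.

```python
import re

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [s for s in candidates if re.search(pattern, s)]

# The expressions from the list above.
patterns = [r"xyz", r"^abc", r".a$", r"a?bc", r"x+yz",
            r"ap{2}", r"ab{1,2}c", r"[a-z]{1,}", r"\w{1,}"]

# Hypothetical test strings; add your own to probe the edge cases.
samples = ["abc", "xabc", "banana", "bc", "xxyz", "apple", "abbc", "123"]

for pattern in patterns:
    print(pattern, "->", matching(pattern, samples))
```

re.search looks for the pattern anywhere in the string, so only the expressions containing ^ or $ are tied to a particular position.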

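Regular Expressions in `python`¶

Use the re module

# Installation
# Not required: re is part of the Python standard library.
# (pip install regex / conda install regex installs a separate third-party package, which is imported as regex, not re.)

# Import
import re

# Usage
RegExp_Variable = re.findall(r'pattern', 'string to search')  # for testing; returns a list of all matches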

In [68]:

# Example

import re
content = 'This is Crawler Class. The RegExp Example. This year 2021/01/01.'
# Raw string (r'...') so that \d reaches the regex engine unchanged
result = re.findall(r'\d+/\d+/\d+', content)
print(type(result))
print(result)

<class 'list'>
['2021/01/01']

Exercise¶

Given the following string
- This is Crawler Class. The RegExp Example. This year 2021/01/01.
Print the text Crawler Class
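One possible approach, offered as a sketch rather than the intended answer: match the text literally, or describe it as a capitalized word followed by the word Class. The variable names are made up for this example.

```python
import re

content = 'This is Crawler Class. The RegExp Example. This year 2021/01/01.'

# Literal match: the pattern is simply the text we are looking for.
literal = re.findall(r"Crawler Class", content)

# More general: a capital letter, more word characters, a space, then "Class".
general = re.findall(r"[A-Z]\w+ Class", content)

print(literal)  # ['Crawler Class']
print(general)  # ['Crawler Class']
```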

Exercise¶

Given the following string
- mike:(02)27192342 nike:0978487487 jamie:0962553266 home:8867423129 company:(03)666666

Expected output

name: mike nike jamie home company 
cell phone: 0978487487 0962553266 
home phone: 2719234 7423129 666666(Wrong Number)   # for landlines, strip the area code and check whether 7 digits remain
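A possible solution sketch follows. It assumes that cell phone numbers are ten digits starting with 09, and that a landline is prefixed either by a parenthesized two-digit area code such as (02) or by the country code 886; these rules are inferred from the sample data, not stated in the exercise. Under a strict seven-digit check, mike's number keeps eight digits after the area code, so this sketch flags it as well, which differs from the expected output listed above.

```python
import re

data = ("mike:(02)27192342 nike:0978487487 jamie:0962553266 "
        "home:8867423129 company:(03)666666")

# Names: the lowercase letters in front of each colon.
names = re.findall(r"([a-z]+):", data)

# Cell phones (assumption): exactly ten digits beginning with 09.
cell_phones = re.findall(r"\b09\d{8}\b", data)

# Landlines (assumption): drop a "(0x)" area code or a leading 886, keep the rest.
landlines = re.findall(r"(?:\(0\d\)|\b886)(\d+)", data)

# Flag every landline that does not have exactly 7 digits.
checked = [n if len(n) == 7 else f"{n}(Wrong Number)" for n in landlines]

print("name:", " ".join(names))
print("cell phone:", " ".join(cell_phones))
print("home phone:", " ".join(checked))
```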

Searching Web Pages with Regular Expressions¶

Variable = re.compile(r'pattern')
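re.compile turns a pattern into a reusable pattern object, which can then be applied to many strings or handed to other libraries. A minimal sketch, with made-up sample strings:

```python
import re

# Compile once, reuse many times.
date_re = re.compile(r"\d+/\d+/\d+")

dates = date_re.findall("from 2021/01/01 to 2021/05/01")
no_match = date_re.search("no date here")

print(dates)     # ['2021/01/01', '2021/05/01']
print(no_match)  # None
```

The compiled object offers the same operations as the module-level functions (findall, search, and so on), without passing the pattern each time.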

In [2]:

## Example

import re
from bs4 import BeautifulSoup
html_string = """
<ol><li>nick</li><li>2021/01/01</li><li>0978487487</li></ol>
<ol><li>jay</li><li>2021/05/01</li><li>0962567890</li></ol>

"""
# Raw string so that \d reaches the regex engine unchanged
regexp = re.compile(r'\d+/\d+/\d+')

bts = BeautifulSoup(html_string, 'lxml')
# string= is the current name of the older text= argument
tag_list = bts.find_all(string=regexp)
print(tag_list)

['2021/01/01', '2021/05/01']

Exercise¶

Given the following HTML

<div>
    <h1>nick</h1>
    <span>0978487487</span>
    <div>xyz888@gmail.com</div>
</div>
<div>
    <h1>nancy</h1>
    <span>0962587777</span>
    <label>0234124@yahoo.com.tw</label>
</div>
<div>
    <h1>nick</h1>
    <span>0978487487</span>
    <h2>a_bc@hotmail.com</h2>
</div>

Try to extract all the email addresses.
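One possible sketch uses only the re module on the raw HTML text, so it runs without BeautifulSoup. The pattern is a deliberately simple assumption (word characters or dots on both sides of @), not a complete email validator, and the HTML is condensed onto three lines here. The same compiled pattern could also be passed to find_all, as in the example above.

```python
import re

html_string = """
<div><h1>nick</h1><span>0978487487</span><div>xyz888@gmail.com</div></div>
<div><h1>nancy</h1><span>0962587777</span><label>0234124@yahoo.com.tw</label></div>
<div><h1>nick</h1><span>0978487487</span><h2>a_bc@hotmail.com</h2></div>
"""

# Simplified assumption: letters, digits, underscores, or dots around a single @.
mail_re = re.compile(r"[\w.]+@[\w.]+")

mails = mail_re.findall(html_string)
print(mails)  # ['xyz888@gmail.com', '0234124@yahoo.com.tw', 'a_bc@hotmail.com']
```

Because < and > are not in the character set, each match stops at the surrounding tags regardless of which element holds the address.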
