Python Web Scraping, Lesson 2

BeautifulSoup
- Installing the package
- Extracting data (find(), find_all(), Tag objects)

BeautifulSoup

Installing the package

Windows: pip install beautifulsoup4

Mac: pip3 install beautifulsoup4

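bs_object = BeautifulSoup(text_to_parse, 'parser')

The parentheses take two arguments. The first (argument 0) is the text to be parsed. Note that it should be a string, such as the text of a response, not the response object itself.

The second (argument 1) specifies the parser. We will use html.parser, which is part of Python's standard library. (It is not the only parser, but it is a relatively simple one.)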

Example:

# (source: 風變程式設計)
import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 library
rs = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = rs.text  # the response body as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
print(soup)

rs.text: <class 'str'>

soup = BeautifulSoup(html, 'html.parser'): <class 'bs4.BeautifulSoup'>

rs.text is a string, whereas soup is a parsed BeautifulSoup object. The two look the same when printed because printing a BeautifulSoup object calls that object's __str__ method, so what print(soup) displays is the string returned by __str__.
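The difference between the two types can be checked without a network request. The sketch below is only an illustration: the inline HTML string is a made-up stand-in for rs.text, and the variable as_text is introduced here for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for rs.text, so no request is needed
html = "<html><body><div>hello</div></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(type(html))  # <class 'str'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>

# print(soup) displays the same text that str(soup) returns
as_text = str(soup)
print(as_text)
```

Although the printed text looks like the original string, only soup offers methods such as find() and find_all().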

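Extracting data (find(), find_all(), Tag objects)

find() and find_all() are two methods of a BeautifulSoup object. They match HTML tags and attributes and extract the matching data from the BeautifulSoup object.

The two are used in almost the same way. The difference is that find() extracts only the first match, while find_all() extracts every match.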

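Method	Purpose	Usage	Example
find()	Extract the first match	BeautifulSoup_object.find(tag, attribute)	soup.find('div', class_='books')
find_all()	Extract all matches	BeautifulSoup_object.find_all(tag, attribute)	soup.find_all('div', class_='books')

*Note the trailing underscore in class_. It is there because class is a reserved keyword in Python (used to define classes), so it cannot be used as an argument name. Besides class, you can also match on other attributes, such as style.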
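To make attribute matching concrete, here is a small sketch on a made-up HTML snippet (the class names, id, and style values are invented for illustration). It uses class_, another keyword argument (id, style), and the attrs dictionary form, all of which find() and find_all() accept.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<div class="books" id="first">Book A</div>
<div class="books" style="color: red">Book B</div>
<div class="notes">A note</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_book = soup.find("div", class_="books")      # first match only
all_books = soup.find_all("div", class_="books")   # every match
by_id = soup.find("div", id="first")               # match on the id attribute
by_style = soup.find("div", style="color: red")    # match on the style attribute
notes = soup.find_all("div", attrs={"class": "notes"})  # dictionary form

print(first_book.text)  # Book A
print(len(all_books))   # 2
print(by_style.text)    # Book B
```

When nothing matches, find() returns None and find_all() returns an empty result, so it is worth checking the result before using it.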

find() example:

# (source: 風變程式設計)
import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)  # 200 means the request succeeded
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # extract the first <div> element and store it in item
print(type(item))  # print the type of item
print(item)        # print item

Output:

200
<class 'bs4.element.Tag'>
<div>大家好，我是一個塊</div>

As you can see, its type is Tag.

find_all() example:

# (source: 風變程式設計)
import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)  # 200 means the request succeeded
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # extract every matching element and store them in items
print(type(items))  # print the type of items
print(items)        # print items

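Output:

200
<class 'bs4.element.ResultSet'>
[<div>大家好，我是一個塊</div>, <div>我也是一個塊</div>, <div>我還是一個塊</div>]

The result is the three <div> elements, which together form a list-like structure. Printing the type of items shows <class 'bs4.element.ResultSet'>, that is, a ResultSet object. It is really a collection of Tag objects stored in a list structure, so you can handle it like a list.

Each element you get by iterating over it has the type <class 'bs4.element.Tag'>, the same type that find() returns.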
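The list-like behaviour of the result can be sketched as follows. The HTML string is a made-up stand-in for the page above, and the names first and count are introduced only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup standing in for the three <div> elements of the page
html = "<div>one</div><div>two</div><div>three</div>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("div")

# Iterate over the result like an ordinary list; each element is a Tag
for item in items:
    print(type(item), item.text)

first = items[0]    # indexing works as on a list
count = len(items)  # so does len()
print(isinstance(first, Tag), count)
```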

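Common attributes and methods of Tag objects

Attribute or method	Purpose
Tag.find() and Tag.find_all()	Extract Tags nested inside a Tag
Tag.text	Extract the text inside a Tag
Tag['attribute_name']	Takes an attribute name and returns the value of that attribute on the Tag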
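The three entries in the table can be combined in one short sketch. The markup and the example.com links are hypothetical and exist only to give the calls something to work on.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URLs are placeholders
html = """
<div class="books">
  <h2>Chapter list</h2>
  <a href="https://example.com/ch1">Chapter 1</a>
  <a href="https://example.com/ch2">Chapter 2</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

book = soup.find("div", class_="books")  # a Tag
title = book.find("h2")                  # Tag.find(): a Tag inside the Tag
links = book.find_all("a")               # Tag.find_all(): all nested matches

print(title.text)                        # Tag.text: the text inside the Tag
for link in links:
    print(link.text, link["href"])       # Tag['href']: value of the href attribute
```

A typical pattern is to narrow down to a container Tag with find() first, then call find_all() on that Tag and read .text or an attribute from each result.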


The end.

P.S. The revolution is not yet complete; comrades, keep up the effort!!!
