This time, let's try scraping CSDN's hot articles.
Without further ado, let's head straight to the official CSDN site.
(Actually, it's because Alibaba's anti-crawling measures have worn me down and I'm out of patience...)
I. URL Analysis
Type in "Python" and click Search:
This returns all the hot blog posts about "Python", including their titles, URLs, read counts, and so on. Our task is to scrape these posts.
Analyzing the underlined URL in the screenshot above, it's easy to see that p is the page number and q is the keyword.
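To make the p/q observation concrete, here is a small sketch that rebuilds the search URL with urllib.parse.urlencode (the helper name build_search_url is mine, and the extra empty parameters from the original URL are omitted here):

```python
from urllib.parse import urlencode

def build_search_url(keyword, page):
    # Only p and q vary between searches; t=blog restricts results to blog posts.
    params = {"p": page, "q": keyword, "t": "blog"}
    return "https://so.csdn.net/so/search/s.do?" + urlencode(params)

# urlencode also percent-encodes non-ASCII keywords, which the plain
# "%s" % keyword formatting used later in this post does not do.
print(build_search_url("Python", 1))
# → https://so.csdn.net/so/search/s.do?p=1&q=Python&t=blog
```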
II. XPath Paths
Open developer mode and match the tags that hold the information we need:
- //dd[@class='author-time']/span[@class='link']/a/@href matches each blog post's URL;
- //h1[@class='title-article']/text() matches each blog post's title.
Note:
If you have any questions about XPath paths, you can review "Learning XPath Syntax and Using the lxml Module".
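As a quick sanity check of the two paths above, here is a self-contained lxml sketch run against a tiny hand-written fragment (the fragment only mimics the attributes the paths match; the real CSDN markup is far more complex):

```python
import lxml.etree as le

# A minimal fragment with the same class attributes the two XPath paths target.
html = b"""
<html><body>
  <dd class="author-time">
    <span class="link"><a href="https://blog.csdn.net/xxx/article/details/1">post</a></span>
  </dd>
  <h1 class="title-article">Demo Title</h1>
</body></html>
"""

tree = le.HTML(html)
hrefs = tree.xpath("//dd[@class='author-time']/span[@class='link']/a/@href")
titles = tree.xpath("//h1[@class='title-article']/text()")
print(hrefs)   # list of matched URLs
print(titles)  # list of matched titles
```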
III. Code Implementation
1. Set up the input prompts:
keyword = input("Enter a keyword: ")
pn_start = int(input("Start page: "))
pn_end = int(input("End page: "))
2. Determine the URL:
# Note the +1 (range excludes the end value)
for pn in range(pn_start, pn_end + 1):
    url = "https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&viparticle=&domain=&o=&s=&u=&l=&f=&rbg=0" % (pn, keyword)
3. Build the request object:
# Return a request object
def getRequest(url):
    return ur.Request(
        url=url,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
        }
    )
for pn in range(pn_start, pn_end + 1):
    url = "https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&viparticle=&domain=&o=&s=&u=&l=&f=&rbg=0" % (pn, keyword)
    # Build the request object
    request = getRequest(url)
4. Scrape the blog site:
# Open the request object
response = ur.urlopen(request).read()
# response is bytes; le.HTML can parse it directly into an XML tree
href_s = le.HTML(response).xpath("//dd[@class='author-time']/span[@class='link']/a/@href")
print(href_s)
The output is as follows:
5.周遊href_s清單,輸出網址和标題:
for href in href_s:
print(href)
response_blog = ur.urlopen(getRequest(href)).read()
title = le.HTML(response_blog).xpath("//h1[@class='title-article']/text()")[0]
print(title)
The results did come out, but what on earth is this "list index out of range"???
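For reference, the error comes from indexing [0] into the empty list that xpath() returns when nothing matches on the page. A small defensive sketch (the helper name first_or_none is mine):

```python
# xpath() always returns a list; indexing [0] on an empty result raises
# IndexError("list index out of range"). A guard avoids the crash:
def first_or_none(items):
    return items[0] if items else None

print(first_or_none([]))        # None instead of an exception
print(first_or_none(["Demo"]))  # Demo
```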
6. Print the result of urlopen to check:
for href in href_s:
    try:
        print(href)
        response_blog = ur.urlopen(getRequest(href)).read()
        print(response_blog)
    except Exception as e:
        print(e)
It did scrape something, but that doesn't look like the page content we wanted. Staring at that garbled \x1f\x8b\x08\x00, the first thing that came to mind was an "encoding" problem.
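Incidentally, \x1f\x8b is the gzip magic number, so a body that starts with those bytes is gzip-compressed data rather than mis-encoded text. A sketch of detecting and inflating such a body (whether this alone would have fixed the response here is a separate question, as the next steps show):

```python
import gzip

# \x1f\x8b is the two-byte gzip magic number; if a response body begins
# with it, the body is compressed and must be inflated before parsing.
def maybe_decompress(body: bytes) -> bytes:
    if body[:2] == b"\x1f\x8b":
        return gzip.decompress(body)
    return body

compressed = gzip.compress("hello".encode("utf-8"))
print(maybe_decompress(compressed).decode("utf-8"))  # hello
print(maybe_decompress(b"plain").decode("utf-8"))    # plain
```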
7. The eighty-one tribulations:
- encode and decode;
- UTF-8 and GBK;
- bytes and str;
- urlencode and unquote;
- dumps and loads;
- request and requests;
- even accept-encoding: gzip, deflate;
- ······
- and so on and so on;
And the result! None! Of! Them! Worked!!! Aaaahhh~ ~ ~ ~
Oh well, the bug still has to be fixed _(:3」∠❀)_
8. A flash of inspiration! Could it be the cookie???
# Return a request object
def getRequest(url):
    return ur.Request(
        url=url,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': 'acw_tc=2760825115771713670314014ebaddd9cff4024b9ed3255873ddb28d85e269; acw_sc__v2=5e01b9a7ea60f5bf87d21658f23db9678e320a82; uuid_tt_dd=10_19029403540-1577171367226-613100; dc_session_id=10_1577171367226.599226; dc_tos=q3097q; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1577171368; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1577171368; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_19029403540-1577171367226-613100; c-login-auto=1; firstDie=1; announcement=%257B%2522isLogin%2522%253Afalse%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblog.csdn.net%252Fblogdevteam%252Farticle%252Fdetails%252F103603408%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D',
        }
    )
Success!!
Only then did I realize I had been going in the wrong direction all along; it was Alibaba's anti-crawling mechanism playing tricks. I hate you. ╭(╯^╰)╮
9. Write the pages to local files:
with open('blog/%s.html' % title, 'wb') as f:
    f.write(response_blog)
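One caveat with this step: blog titles can contain characters that are illegal in file names (a "/" in a title, for instance, would make the open call fail). A minimal sanitizer sketch (the helper name safe_filename and the exact character set are my choice):

```python
import re

# Characters such as / \ : * ? " < > | are illegal or risky in file names;
# replace each with "_" before using the title as a file name.
def safe_filename(title: str) -> str:
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('A/B: what?'))  # A_B_ what_
```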
10. Add try-except exception handling:
try:
    ······
    ······
except Exception as e:
    print(e)
That's all for this post. If you feel like it, you can polish the code further yourself: for example, write exceptions to a TXT file for later analysis, or filter the scraped results to make the data more targeted, and so on.
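As a starting point for the first suggestion, here is a sketch that appends each exception to a text file instead of only printing it (the file name errors.txt and the helper name log_exception are my choice):

```python
import traceback
from datetime import datetime

# Append each exception, with a timestamp and full traceback, to a text
# file so failures can be reviewed after the crawl finishes.
def log_exception(e, path='errors.txt'):
    with open(path, 'a', encoding='utf-8') as f:
        f.write('%s\t%r\n' % (datetime.now().isoformat(), e))
        f.write(traceback.format_exc())

try:
    [][0]  # provoke an IndexError for demonstration
except Exception as e:
    log_exception(e)
```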
Full code:
import urllib.request as ur
import user_agent
import lxml.etree as le

keyword = input("Enter a keyword: ")
pn_start = int(input("Start page: "))
pn_end = int(input("End page: "))

# Return a request object
def getRequest(url):
    return ur.Request(
        url=url,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': 'uuid_tt_dd=10_7175678810-1573897791171-515870; dc_session_id=10_1573897791171.631189; __gads=Test; UserName=WoLykos; UserInfo=b90874fc47d447b8a78866db1bde5770; UserToken=b90874fc47d447b8a78866db1bde5770; UserNick=WoLykos; AU=A57; UN=WoLykos; BT=1575250270592; p_uid=U000000; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_7175678810-1573897791171-515870!5744*1*WoLykos; Hm_lvt_e5ef47b9f471504959267fd614d579cd=1575356971; Hm_ct_e5ef47b9f471504959267fd614d579cd=5744*1*WoLykos!6525*1*10_7175678810-1573897791171-515870; __yadk_uid=qo27C9PZzNLSwM0hXjha0zVMAtGzJ4sX; Hm_lvt_70e69f006e81d6a5cf9fa5725096dd7a=1575425024; Hm_ct_70e69f006e81d6a5cf9fa5725096dd7a=5744*1*WoLykos!6525*1*10_7175678810-1573897791171-515870; acw_tc=2760824315766522959534770e8d26ee9946cc510917f981c1d79aec141232; UM_distinctid=16f1e154f3e6fc-06a15022d8e2cb-7711a3e-e1000-16f1e154f3f80c; searchHistoryArray=%255B%2522python%2522%252C%2522Python%2522%255D; firstDie=1; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1577156990,1577157028,1577167164,1577184133; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblog.csdn.net%252Fblogdevteam%252Farticle%252Fdetails%252F103603408%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D; TY_SESSION_ID=6121abc2-f0d2-404d-973b-ebf71a77c098; acw_sc__v2=5e01ee7889eda6ecf4690eab3dfd334e8301d2f6; acw_sc__v3=5e01ee7ca3cb12d33dcb15a19cdc2fe3d7735b49; dc_tos=q30jqt; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1577185014',
        }
    )

# Note the +1 (range excludes the end value)
for pn in range(pn_start, pn_end + 1):
    url = "https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&viparticle=&domain=&o=&s=&u=&l=&f=&rbg=0" % (pn, keyword)
    # Build the request object
    request = getRequest(url)
    try:
        # Open the request object
        response = ur.urlopen(request).read()
        # response is bytes; le.HTML can parse it directly into an XML tree
        href_s = le.HTML(response).xpath("//dd[@class='author-time']/span[@class='link']/a/@href")
        # print(href_s)
        for href in href_s:
            try:
                print(href)
                response_blog = ur.urlopen(getRequest(href)).read()
                # print(response_blog)
                title = le.HTML(response_blog).xpath("//h1[@class='title-article']/text()")[0]
                print(title)
                with open('blog/%s.html' % title, 'wb') as f:
                    f.write(response_blog)
            except Exception as e:
                print(e)
    except:
        pass
Full code (with proxy IPs):
import urllib.request as ur
import lxml.etree as le
import user_agent

keyword = input('Enter a keyword: ')
pn_start = int(input('Start page: '))
pn_end = int(input('End page: '))

def getRequest(url):
    return ur.Request(
        url=url,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
        }
    )

def getProxyOpener():
    proxy_address = ur.urlopen('http://api.ip.data5u.com/dynamic/get.html?order=d314e5e5e19b0dfd19762f98308114ba&sep=4').read().decode('utf-8').strip()
    proxy_handler = ur.ProxyHandler(
        {
            'http': proxy_address
        }
    )
    return ur.build_opener(proxy_handler)

for pn in range(pn_start, pn_end + 1):
    request = getRequest(
        'https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&domain=&o=&s=&u=&l=&f=&rbg=0' % (pn, keyword)
    )
    try:
        response = getProxyOpener().open(request).read()
        href_s = le.HTML(response).xpath('//span[@class="down fr"]/../span[@class="link"]/a/@href')
        for href in href_s:
            try:
                response_blog = getProxyOpener().open(
                    getRequest(href)
                ).read()
                title = le.HTML(response_blog).xpath('//h1[@class="title-article"]/text()')[0]
                print(title)
                with open('blog/%s.html' % title, 'wb') as f:
                    f.write(response_blog)
            except Exception as e:
                print(e)
    except:
        pass
For the girl I love~~
A laid-back blogger who writes whatever comes to mind, whenever the mood strikes (っ•̀ω•́)っ✎⁾⁾