1. Sending web requests
1.1 requests
Use the requests library's get() method to send GET requests; a "user-agent" request header and a login "cookie" are commonly added as parameters.
1.1.1 user-agent
Log in to the site and copy the "user-agent" value into a text file.
1.1.2 cookie
Log in to the site and copy the "cookie" value into a text file.
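The two values saved above can then be read back into a headers dict at run time. A minimal sketch, assuming the user-agent and cookie were each saved on a single line in `ua.txt` and `cookie.txt` (both file names are illustrative, not from the original):

```python
# Build a requests-style headers dict from the two saved text files.
# 'ua.txt' and 'cookie.txt' are assumed file names for this sketch.
def load_headers(ua_path, cookie_path):
    with open(ua_path, encoding='utf-8') as f:
        ua = f.read().strip()
    with open(cookie_path, encoding='utf-8') as f:
        cookie = f.read().strip()
    return {'user-agent': ua, 'cookie': cookie}

# Write sample files so the sketch runs on its own.
with open('ua.txt', 'w', encoding='utf-8') as f:
    f.write('Mozilla/5.0 (Windows NT 10.0; WOW64) Example/1.0\n')
with open('cookie.txt', 'w', encoding='utf-8') as f:
    f.write('_zap=abc; d_c0=def\n')

headers = load_headers('ua.txt', 'cookie.txt')
print(headers)
```

The resulting dict can be passed directly as the `headers=` argument of `requests.get()`.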
1.1.3 Test code
import requests
from requests.exceptions import RequestException

headers = {
    'cookie': '',  # replace with your own cookie
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('request succeeded')
            return html.text
        else:  # this else branch is optional
            return None
    except RequestException:
        print('request failed')

if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    get_page(input_url)
1.2 selenium
Most sites can detect a Selenium crawler through the value of window.navigator.webdriver, so the first task is to keep the site from recognizing the Selenium-driven browser. As with requests, Selenium sessions also usually need a "user-agent" request header and a login "cookie".
1.2.1 Removing the window.navigator.webdriver value in Selenium
Add the following code to your program (this applies to older versions of Chrome):
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
1.2.2 user-agent
Log in to the site, copy the "user-agent" value into a text file, then run the following code to add the request header:
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')
1.2.3 cookie
Selenium requires each cookie to carry "name" and "value" keys with their corresponding values, so a cookie copied straight off the site as a single string does not satisfy Selenium's format. Instead, use Selenium's get_cookies() method to obtain the login cookies:
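Alternatively, if you only have the raw cookie string copied from the browser's request headers, it can be converted into the name/value dicts Selenium expects. A minimal sketch, assuming the usual 'k1=v1; k2=v2' header format:

```python
# Convert a raw 'k1=v1; k2=v2' cookie header string into the
# [{'name': ..., 'value': ...}, ...] form required by selenium's add_cookie().
def cookie_string_to_dicts(raw):
    cookies = []
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue
        name, value = pair.split('=', 1)  # split only on the first '='
        cookies.append({'name': name, 'value': value})
    return cookies

print(cookie_string_to_dicts('_zap=abc123; d_c0="AHBq=|158"'))
```

Each resulting dict can then be passed to driver.add_cookie() after the first driver.get() on the target domain.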
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
import json
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')
driver = Chrome(options=option)
time.sleep(10)
driver.get('https://www.zhihu.com/signin?next=%2F')
time.sleep(30)
driver.get('https://www.zhihu.com/')
cookies = driver.get_cookies()
jsonCookies = json.dumps(cookies)
with open('cookies.txt', 'a') as f:  # choose your own file name and path
    f.write(jsonCookies)
    f.write('\n')
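The JSON lines written above can later be read back and re-applied. A minimal sketch of the loading half (the file name is illustrative, and the driver part is shown as comments since it needs a live browser session):

```python
import json

def load_cookies(path):
    # Read back the first JSON line written by the saving code above.
    with open(path, encoding='utf-8') as f:
        return json.loads(f.readline())

# Write a sample line so the sketch runs without a browser.
sample = [{'name': '_zap', 'value': 'abc123', 'domain': '.zhihu.com'}]
with open('cookies_demo.txt', 'w', encoding='utf-8') as f:
    f.write(json.dumps(sample))
    f.write('\n')

cookies = load_cookies('cookies_demo.txt')
print(cookies)

# With a live driver, the cookies would be re-applied roughly like this:
# driver.get('https://www.zhihu.com')
# for c in cookies:
#     driver.add_cookie({'name': c['name'], 'value': c['value']})
```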
1.2.4 Test code example
Copy the cookies obtained above into the program below and it is ready to run:
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
driver.get('https://www.zhihu.com')
time.sleep(10)
driver.delete_all_cookies()  # clear the cookies set so far
time.sleep(2)
cookie = {}  # replace with your own cookie
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)
2. HTML parsing (element locating)
To scrape the target data you first have to locate the elements that hold it; both BeautifulSoup and selenium make it straightforward to traverse HTML elements.
2.1 Locating elements with BeautifulSoup
In the code below, BeautifulSoup first locates the "h2" tags whose class is "HotItem-title", then reads each string value through the .text attribute.
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

headers = {
    'cookie': '',  # replace with your own cookie
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

def get_page(url):
    try:
        html = requests.get(url, headers=headers, timeout=5)
        if html.status_code == 200:
            print('request succeeded')
            return html.text
        else:  # this else branch is optional
            return None
    except RequestException:
        print('request failed')

def parse_page(html):
    html = BeautifulSoup(html, "html.parser")
    titles = html.find_all("h2", {'class': 'HotItem-title'})[:10]
    for title in titles:
        print(title.text)  # .text is an attribute, not a method

if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    html = get_page(input_url)
    if html:  # only parse when the request succeeded
        parse_page(html)
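The same find_all/.text pattern can be checked offline against a small inline fragment that mimics the hot-list markup (the HTML below is an illustrative stand-in, not Zhihu's real page):

```python
from bs4 import BeautifulSoup

# A stand-in fragment using the same class name the parser above targets.
sample_html = '''
<section><h2 class="HotItem-title">First question</h2></section>
<section><h2 class="HotItem-title">Second question</h2></section>
<h2 class="Other-title">Not a hot item</h2>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
# Only the h2 tags with class "HotItem-title" should match.
titles = [tag.text for tag in soup.find_all('h2', {'class': 'HotItem-title'})]
print(titles)
```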
2.2 Locating elements with selenium
Selenium's locator syntax differs somewhat from BeautifulSoup's. The code below (the 1.2.4 test code example) uses a hierarchical CSS selector, 'div[itemprop="zhihu:question"] > a', which I find more dependable.
Selenium reads an element's text through the .text attribute; note that this is an attribute, not a .text() method call.
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)
driver.get('https://www.zhihu.com')
time.sleep(10)
driver.delete_all_cookies()  # clear the cookies set so far
time.sleep(2)
cookie = {}  # replace with your own cookie
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):
    print(i.text)