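Overview
Downloader Middleware
As shown at points 4 and 5 in the figure above, downloader middleware is a hook framework for processing Scrapy's requests and responses. It lets you change things globally, such as the proxy IP or the headers.
To use a downloader middleware you must activate it by adding it to the DOWNLOADER_MIDDLEWARES dict in settings.py, in a format like the following: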
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.Custom_A_DownloaderMiddleware': 543,
    'myproject.middlewares.Custom_B_DownloaderMiddleware': 643,
    # A value of None disables the middleware.
    'myproject.middlewares.Custom_C_DownloaderMiddleware': None,
}
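The smaller the number, the closer the middleware sits to the engine; the larger the number, the closer it sits to the downloader. Middlewares with smaller numbers therefore run process_request() first, and middlewares with larger numbers run process_response() first. To disable a middleware, set its value to None.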
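To make the ordering rule concrete, here is a minimal sketch in plain Python. It is a simplified model and not Scrapy's own code: the helper name `call_order` and the shortened middleware paths are made up for illustration, and Scrapy's built-in middlewares (which are merged in by number as well) are left out.

```python
def call_order(middlewares):
    """Model the call order implied by a DOWNLOADER_MIDDLEWARES-style dict.

    Returns (request_order, response_order) as lists of middleware paths.
    """
    # Entries set to None are disabled and take no part in the chain.
    enabled = {path: number for path, number in middlewares.items() if number is not None}
    # process_request(): ascending numbers, from the engine towards the downloader.
    request_order = sorted(enabled, key=enabled.get)
    # process_response(): the same chain walked back, from the downloader to the engine.
    response_order = list(reversed(request_order))
    return request_order, response_order


settings = {
    "myproject.middlewares.A": 543,
    "myproject.middlewares.B": 643,
    "myproject.middlewares.C": None,  # disabled
}
request_order, response_order = call_order(settings)
# request_order  -> A, then B
# response_order -> B, then A
```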
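**Custom downloader middleware**
Sometimes we need to write our own downloader middleware, for example to use a proxy or to change the user-agent. A middleware that handles requests implements process_request(request, spider); one that handles responses implements process_response(request, response, spider); and exception handling goes in process_exception(request, exception, spider).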
process_request(request, spider)
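This method is called for every request that Scrapy sends through the downloader middleware. It can do one of the following:
1. Return None
2. Return a Response object
3. Return a Request object
4. Raise an IgnoreRequest exception
Returning None is the most common case; Scrapy then carries on processing the request. For the other cases, see the official Scrapy documentation.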
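The difference between returning None and returning something else can be sketched with a toy dispatcher. This is only an illustration in plain Python under simplifying assumptions: `FakeRequest`, `FakeResponse`, `PassThrough`, `CacheHit`, and `run_process_request` are hypothetical stand-ins, not Scrapy classes, and the "download" is faked.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    url: str
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    url: str
    body: str


class PassThrough:
    def process_request(self, request, spider):
        request.meta["seen"] = True
        return None  # None: carry on with the next middleware


class CacheHit:
    cache = {"http://example.com/cached": "cached body"}

    def process_request(self, request, spider):
        body = self.cache.get(request.url)
        if body is not None:
            # A response object short-circuits the chain: no download happens.
            return FakeResponse(request.url, body)
        return None


def run_process_request(middlewares, request, spider=None):
    """Toy model: call process_request() in order until one returns non-None."""
    for middleware in middlewares:
        result = middleware.process_request(request, spider)
        if result is not None:
            return result
    # Every middleware returned None, so the (fake) downloader runs.
    return FakeResponse(request.url, "downloaded body")


chain = [PassThrough(), CacheHit()]
cached = run_process_request(chain, FakeRequest("http://example.com/cached"))
fresh = run_process_request(chain, FakeRequest("http://example.com/other"))
```

Here `cached` comes straight from the middleware, while `fresh` falls through to the stand-in downloader because every middleware returned None.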
The two examples below are downloader middlewares that change the user-agent and the proxy IP.
user-agent middleware
from faker import Faker


class UserAgent_Middleware():
    def process_request(self, request, spider):
        f = Faker()
        agent = f.firefox()  # random Firefox user-agent string
        request.headers['User-Agent'] = agent
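If pulling in Faker is not an option, a similar effect can be had by picking from a fixed list. The following is a rough sketch under that assumption: the class name `RandomUserAgentMiddleware` and the sample strings in `USER_AGENTS` are illustrative and not part of the original article, and the request is a stand-in object rather than a real Scrapy Request.

```python
import random
from types import SimpleNamespace

# Sample values for illustration only; use strings that suit your target site.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
)


class RandomUserAgentMiddleware:
    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let the request continue down the chain


# Quick check with a stand-in request that only has a headers mapping.
fake_request = SimpleNamespace(headers={})
result = RandomUserAgentMiddleware().process_request(fake_request, spider=None)
```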
Proxy IP middleware
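import requests


class Proxy_Middleware():
    def process_request(self, request, spider):
        try:
            # XDAILI_URL (the proxy provider's API endpoint) comes from the project settings
            xdaili_url = spider.settings.get('XDAILI_URL')
            r = requests.get(xdaili_url)
            proxy_ip_port = r.text.strip()  # drop any trailing newline
            request.meta['proxy'] = 'https://' + proxy_ip_port
        except requests.exceptions.RequestException:
            print('Failed to get a proxy IP from Xdaili!')
            spider.logger.error('Failed to get a proxy IP from Xdaili!')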
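The proxy middleware above calls the proxy API once per request, and that call is blocking. One possible refinement, sketched below, is to reuse a fetched proxy for a short time. The `CachedProxy` class, its `ttl` parameter, and the fake fetcher are assumptions made for this example; they do not come from the original article or from Scrapy.

```python
import time


class CachedProxy:
    """Cache the result of a proxy lookup for `ttl` seconds."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning an "ip:port" string
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Demo with a fake fetcher and a fake clock, so no network access is needed.
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return "10.0.0.%d:8080" % len(calls)


proxy = CachedProxy(fake_fetch, ttl=60.0, clock=lambda: fake_now[0])
first = proxy.get()    # triggers a fetch
second = proxy.get()   # served from the cache
fake_now[0] = 61.0
third = proxy.get()    # cache expired, fetches again
```

Inside process_request() the middleware would then set request.meta['proxy'] from `proxy.get()`, with `fetch` wrapping the requests.get() call shown earlier.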
Integrating Selenium with Scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from gp.configs import *


class ChromeDownloaderMiddleware(object):
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # run without a UI
        if CHROME_PATH:
            options.binary_location = CHROME_PATH
        # Selenium 3 style arguments; Selenium 4 takes options= and a Service object instead.
        if CHROME_DRIVER_PATH:
            self.driver = webdriver.Chrome(chrome_options=options, executable_path=CHROME_DRIVER_PATH)  # initialize the Chrome driver
        else:
            self.driver = webdriver.Chrome(chrome_options=options)  # initialize the Chrome driver

    def __del__(self):
        self.driver.quit()  # quit() also shuts down the driver process

    def process_request(self, request, spider):
        try:
            print('Chrome driver begin...')
            self.driver.get(request.url)  # load the page
            return HtmlResponse(url=request.url, body=self.driver.page_source, request=request, encoding='utf-8',
                                status=200)  # return the rendered HTML
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            print('Chrome driver end...')
process_response(request, response, spider)
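This method is called when the response to a request comes back through the middleware. It must do one of the following:
1. Return a Response object, which is then handled by process_response() of the next middleware in the chain
2. Return a Request object, which stops the middleware chain; the returned Request is rescheduled for download
3. Raise IgnoreRequest, in which case the request's errback (Request.errback) is called; if nothing handles the exception, it is ignored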
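As an illustration of the first two cases, here is a hedged sketch of a middleware that reschedules a request when the response looks like a ban and passes everything else through. The class name, the status codes, the retry limit, and the `ban_retries` meta key are all assumptions for this example. `FakeRequest` and `FakeResponse` are stand-ins that mimic only the attributes used here; with real Scrapy objects, `request.replace(dont_filter=True)` is the call that produces the copy.

```python
class RetryOnBanMiddleware:
    BAN_STATUSES = {403, 429}   # assumed "we got blocked" status codes
    MAX_RETRIES = 2             # assumed retry limit

    def process_response(self, request, response, spider):
        retries = request.meta.get("ban_retries", 0)
        if response.status in self.BAN_STATUSES and retries < self.MAX_RETRIES:
            # Returning a Request stops the chain and reschedules the download.
            retry = request.replace(dont_filter=True)
            retry.meta["ban_retries"] = retries + 1
            return retry
        # Returning the Response hands it to the next middleware.
        return response


class FakeRequest:
    """Stand-in with just enough of the Request interface for the demo."""

    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = dict(meta or {})
        self.dont_filter = dont_filter

    def replace(self, **kwargs):
        params = {"url": self.url, "meta": self.meta, "dont_filter": self.dont_filter}
        params.update(kwargs)
        return FakeRequest(**params)


class FakeResponse:
    def __init__(self, status):
        self.status = status


middleware = RetryOnBanMiddleware()
request = FakeRequest("http://example.com")
ok, banned = FakeResponse(200), FakeResponse(403)
passed_through = middleware.process_response(request, ok, spider=None)
rescheduled = middleware.process_response(request, banned, spider=None)
```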
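process_exception(request, exception, spider)
This method is called when a download handler or process_request() raises an exception (including an IgnoreRequest exception).
It usually returns None, in which case Scrapy keeps processing the exception by calling process_exception() on the remaining middlewares.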
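A small sketch of the "return None" case: the middleware below only counts exceptions by type and lets Scrapy continue its normal handling. The class name `ExceptionStatsMiddleware` and the counting idea are illustrative assumptions, and the demo calls the method directly instead of running a crawl.

```python
from collections import Counter


class ExceptionStatsMiddleware:
    def __init__(self):
        self.counts = Counter()

    def process_exception(self, request, exception, spider):
        # Record what went wrong, keyed by exception class name.
        self.counts[type(exception).__name__] += 1
        # None: do not handle the exception here, let processing continue.
        return None


# Direct calls with stand-in arguments, just to show the bookkeeping.
stats = ExceptionStatsMiddleware()
outcomes = [
    stats.process_exception(None, TimeoutError("slow"), None),
    stats.process_exception(None, TimeoutError("slower"), None),
    stats.process_exception(None, ConnectionError("refused"), None),
]
```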
from_crawler(cls, crawler)
This class method is usually the entry point for accessing settings and signals.
@classmethod
def from_crawler(cls, crawler):
    return cls(
        mysql_host=crawler.settings.get('MYSQL_HOST'),
        mysql_db=crawler.settings.get('MYSQL_DB'),
        mysql_user=crawler.settings.get('MYSQL_USER'),
        mysql_pw=crawler.settings.get('MYSQL_PW')
    )
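The from_crawler() snippet only works if the class has a matching __init__. A minimal sketch of the whole class might look like this; the class name `MySQLSettingsMiddleware` is hypothetical, and the fake crawler uses a plain dict for `settings`, which is enough here because only `.get()` is called.

```python
from types import SimpleNamespace


class MySQLSettingsMiddleware:
    def __init__(self, mysql_host, mysql_db, mysql_user, mysql_pw):
        # Keep the values read from settings for later use.
        self.mysql_host = mysql_host
        self.mysql_db = mysql_db
        self.mysql_user = mysql_user
        self.mysql_pw = mysql_pw

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware and passes in the crawler.
        return cls(
            mysql_host=crawler.settings.get('MYSQL_HOST'),
            mysql_db=crawler.settings.get('MYSQL_DB'),
            mysql_user=crawler.settings.get('MYSQL_USER'),
            mysql_pw=crawler.settings.get('MYSQL_PW'),
        )


# Stand-in crawler for the demo; the values are placeholders.
fake_crawler = SimpleNamespace(settings={
    'MYSQL_HOST': 'localhost',
    'MYSQL_DB': 'scrapy_demo',
    'MYSQL_USER': 'demo_user',
    'MYSQL_PW': 'demo_password',
})
middleware = MySQLSettingsMiddleware.from_crawler(fake_crawler)
```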
Scrapy's built-in downloader middlewares
The following downloader middlewares are enabled in Scrapy by default:
{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
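For details on Scrapy's built-in middlewares, see the official Scrapy documentation.
Spider Middleware
As shown in the first figure of this article, spider middleware processes responses as well as the items and Requests generated by spiders.
To enable a spider middleware, first turn it on in settings: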
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
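The smaller the number, the closer the middleware is to the engine, and the earlier its process_spider_input() runs; the larger the number, the closer it is to the spider, and the earlier its process_spider_output() runs. Use None to disable a middleware.
Writing a custom spider middleware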
process_spider_input(response, spider)
This method is called for each response that passes through the spider middleware on its way to the spider. It should return None.
process_spider_output(response, result, spider)
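This method is called with the result that the spider returns after processing a response. It must return an iterable of Request or Item objects; usually it just returns result.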
process_spider_exception(response, exception, spider)
This method is called when the spider or a spider middleware raises an exception. It should return None or an iterable of Request, dict, or Item objects.
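Putting the three hooks together, a spider middleware could be sketched roughly as follows. This is an illustrative example only: the class name `DropEmptyTitleMiddleware` and the rule of dropping dict items without a `title` are assumptions, and the demo feeds the method a plain list instead of real spider output.

```python
class DropEmptyTitleMiddleware:
    def process_spider_input(self, response, spider):
        # Nothing to check before the spider sees the response.
        return None

    def process_spider_output(self, response, result, spider):
        # `result` is an iterable of Requests and items; yield what we keep.
        for element in result:
            if isinstance(element, dict) and not element.get("title"):
                continue  # drop dict items that have no title
            yield element

    def process_spider_exception(self, response, exception, spider):
        # None: leave the exception to other middlewares.
        return None


# Demo with stand-in spider output; the string plays the role of a Request.
middleware = DropEmptyTitleMiddleware()
spider_output = [{"title": "first"}, {"title": ""}, "a-request-stand-in", {"url": "x"}]
kept = list(middleware.process_spider_output(None, spider_output, None))
```

Only the dict items with a non-empty title and the non-dict element survive, so `kept` holds two of the four entries.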
Original publication date: 2018-08-27
Author: Zarten
This article comes from "Python中文社群" (Python Chinese Community), a partner of the Yunqi Community (雲栖社群).