Using Scrapy Middleware

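Downloader middleware (MiddleproDownloaderMiddleware)

  • Position: between the engine and the downloader
  • Purpose: intercept, in bulk, every request and response in the project
  • Intercepting requests:
    • User-Agent (UA) spoofing
    • IP proxying
  • Intercepting responses:
    • Tampering with the response data or replacing the response object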
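
None of this interception takes place until the middleware is enabled in the project's settings.py. A minimal sketch, assuming the project package is named middlePro (inferred from the class name, not stated in the article) and using 543, the priority that Scrapy's project template suggests:

```python
# settings.py (sketch): register the downloader middleware.
# "middlePro" is an assumed package name; replace it with the real project name.
DOWNLOADER_MIDDLEWARES = {
    "middlePro.middlewares.MiddleproDownloaderMiddleware": 543,
}
```

Middlewares with lower numbers sit closer to the engine, so their process_request() runs earlier and their process_response() runs later.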
[middlewares.py] The MiddleproDownloaderMiddleware class has three important methods
import random
from fake_useragent import UserAgent

class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055'
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508'
    ]
           
  • process_request(): intercepts requests
    1. Use a UA pool (not recommended)
      def process_request(self, request, spider):
          # Called for each request that goes through the downloader
          # middleware.

          # Must either:
          # - return None: continue processing this request
          # - or return a Response object
          # - or return a Request object
          # - or raise IgnoreRequest: process_exception() methods of
          #   installed downloader middleware will be called
          """
          Intercept each outgoing request.
          :param request: the Request about to be downloaded
          :param spider: the spider that issued the request
          :return: None, so the request continues down the chain
          """
          # UA spoofing: pick a random User-Agent from the pool
          request.headers['User-Agent'] = random.choice(self.USER_AGENT_LIST)

          return None
                 
    2. Use the fake-useragent module (recommended)

      Install the module:

      pip install fake-useragent

      def process_request(self, request, spider):
          # Called for each request that goes through the downloader
          # middleware.

          # Must either:
          # - return None: continue processing this request
          # - or return a Response object
          # - or return a Request object
          # - or raise IgnoreRequest: process_exception() methods of
          #   installed downloader middleware will be called
          """
          Intercept each outgoing request.
          :param request: the Request about to be downloaded
          :param spider: the spider that issued the request
          :return: None, so the request continues down the chain
          """
          # UA spoofing: fake-useragent supplies a random User-Agent string
          request.headers['User-Agent'] = UserAgent().random

          return None
                 
  • process_response(): intercepts all responses
    • Crawling NetEase News serves as the example here
  • process_exception(): intercepts requests that raised an exception
    • Proxy IPs
      PROXY_http = [
          '153.180.102.104:80',
          '195.208.131.189:56055'
      ]
      PROXY_https = [
          '120.83.49.90:9000',
          '95.189.112.214:35508'
      ]
        
      def process_exception(self, request, exception, spider):
          # Called when a download handler or a process_request()
          # (from other downloader middleware) raises an exception.

          # Must either:
          # - return None: continue processing this exception
          # - return a Response object: stops process_exception() chain
          # - return a Request object: stops process_exception() chain
          """
          Intercept a request whose download raised an exception.
          :param request: the Request that failed
          :param exception: the exception that was raised
          :param spider: the spider that issued the request
          :return: the same request, now routed through a proxy
          """
          # Proxy IP: choose a proxy pool that matches the URL scheme
          if request.url.split(':')[0] == 'http':
              request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
          else:
              request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)

          # Return the corrected request object so that it is sent again
          return request
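
As written, process_exception() hands the request back on every failure, so a URL that keeps failing could be retried without limit. The following is a minimal sketch, not the author's code, of one way to cap the retries. The class name, MAX_PROXY_RETRIES, and the proxy_retries meta key are assumptions made for illustration. The sketch also sets dont_filter=True on the copied request, as Scrapy's built-in RetryMiddleware does for retried requests, so that the scheduler's duplicate filter does not discard it.

```python
import random


class ProxyRetrySketchMiddleware:
    """Hypothetical variant of process_exception() with a retry cap."""

    PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
    PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']
    MAX_PROXY_RETRIES = 3  # assumed limit, not from the article

    def process_exception(self, request, exception, spider):
        # 'proxy_retries' is a hypothetical meta key used only by this sketch.
        retries = request.meta.get('proxy_retries', 0)
        if retries >= self.MAX_PROXY_RETRIES:
            # Give up: returning None lets the remaining handlers see the exception.
            return None

        # Same scheme-to-pool mapping as the article's code.
        if request.url.split(':')[0] == 'http':
            proxy = 'http://' + random.choice(self.PROXY_http)
        else:
            proxy = 'https://' + random.choice(self.PROXY_https)

        # Copy the request and exempt the copy from the duplicate filter.
        retry = request.replace(dont_filter=True)
        retry.meta['proxy'] = proxy
        retry.meta['proxy_retries'] = retries + 1
        return retry
```

Once the cap is reached the method returns None, so the exception continues through the remaining middleware and can reach the request's errback, if one is set.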