[Scrapy Learning Notes] Adding IP Proxies
Disclaimer: this post is for technical exchange only. Do not use it for illegal purposes; this blog bears no responsibility for any losses caused by such misuse.
Adding an IP proxy boils down to setting the value of each request's proxy attribute.
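Concretely, that attribute lives in the request's meta dict; a minimal illustration (the address here is a made-up placeholder, not a real proxy):

    # set inside a downloader middleware or when building a request;
    # the address below is a placeholder, not a working proxy
    request.meta['proxy'] = 'https://1.2.3.4:8080'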
The free proxy IPs I use here first have to be fetched from the 66 Free Proxy site (66ip.cn). The site is genuinely easy to use: a single request returns as many IPs as you ask for (the API URL appears in the code below). All that needs to change is the middlewares.py middleware file in the scrapy project. Without further ado, here is the code:
from scrapy import signals
import requests
import parsel  # parsing library, similar to BeautifulSoup but considerably more powerful

class ipdailiDownloaderMiddleware(object):
    def __init__(self):
        self.proxy = self.get_ip  # fetch and store the list of proxy IPs
        self.flag = True          # whether to move on to a new IP
        self.w = 0                # index into self.proxy

    # the @property decorator lets this method be read like a plain attribute
    @property
    def get_ip(self):
        # the getnum parameter in the URL is the number of IPs to fetch
        url = 'http://www.66ip.cn/nmtq.php?getnum=200&isp=0&anonymoustype=3&area=0&proxytype=1&api=66ip'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
        r = requests.get(url, headers=headers)
        response = parsel.Selector(r.text)
        ips = []
        hehe = response.xpath('.//script/../text()').getall()
        for ip in hehe:
            temp = ip.strip()
            if temp != '':
                tem = 'https://' + temp
                ips.append(tem)
        return ips

    # called for every request before it is sent
    def process_request(self, request, spider):
        if self.flag and (self.w < len(self.proxy)):
            # still looking for a working IP: hand out the next one in the list
            proxy = self.proxy[self.w]
            print('this ip is :', proxy)
            request.meta['proxy'] = proxy
            self.w += 1
        else:
            # a working IP has been found: keep reusing the last one handed out
            if self.w == 0:
                proxy = self.proxy[self.w]
            else:
                proxy = self.proxy[self.w - 1]
            print('this ip is :', proxy)
            request.meta['proxy'] = proxy

    # called for every successfully returned response before it reaches the spider
    def process_response(self, request, response, spider):
        if response.status == 200:
            self.flag = False  # the current IP works, stop rotating
            return response    # a successful response is passed straight through
        else:
            # bad status code: set flag back to True, fetch a fresh batch
            # of IPs, and start rotating again
            self.flag = True
            self.proxy = self.get_ip
            self.w = 0
            return request     # returning the request re-schedules it
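One gap worth pointing out: the middleware above only rotates when a bad status code comes back, but a dead proxy usually fails with a connection error, in which case process_response never runs. A minimal sketch of a process_exception hook that could be added inside the same class to cover that case (my addition, not part of the original post):

    # sketch of an extra method for ipdailiDownloaderMiddleware (my addition,
    # not in the original post): called when the download raises an exception,
    # e.g. a connection error caused by a dead proxy
    def process_exception(self, request, exception, spider):
        self.flag = True  # make process_request move on to the next IP
        if self.w >= len(self.proxy):
            # the whole batch has failed: fetch a fresh list and start over
            self.proxy = self.get_ip
            self.w = 0
        return request    # returning the request re-schedules it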
Next, define a spider file ip.py to verify that the free proxy IPs can actually be used for requests. It simply prints each response's status code and then requests the Baidu homepage again in a loop (dont_filter=True keeps Scrapy's duplicate filter from dropping the repeated requests). The code:
import scrapy

class IpSpider(scrapy.Spider):
    name = 'ip'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        print(response.status)
        # request the same page again; dont_filter=True bypasses the duplicate filter
        yield scrapy.Request(
            self.start_urls[0],
            dont_filter=True
        )
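(Not from the original post: if you want to sanity-check a fetched proxy outside of Scrapy first, a quick sketch with requests works too; the address is a placeholder.)

    import requests

    proxy = 'https://1.2.3.4:8080'  # placeholder; substitute an address returned by the 66ip API
    try:
        r = requests.get('https://www.baidu.com', proxies={'https': proxy}, timeout=5)
        print(proxy, '->', r.status_code)
    except requests.RequestException as e:
        print(proxy, 'failed:', e)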
Finally, don't forget to add the following to the project's settings.py file to enable the IP proxy middleware defined above:
DOWNLOADER_MIDDLEWARES = {
    'hehe.middlewares.ipdailiDownloaderMiddleware': 543,
}
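(A side note of mine, not in the original post: 543 is the middleware's priority. Scrapy's built-in HttpProxyMiddleware, which is what actually reads request.meta['proxy'] and applies the proxy, is enabled by default at priority 750, so a middleware that sets the proxy should sit below 750 the way 543 does.)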
Finally, open cmd, change into the project's root directory (the one containing the scrapy.cfg file), and run scrapy crawl ip. The output looks like this:
this ip is : https://182.35.87.118:9999
this ip is : https://120.25.252.232:8118
this ip is : https://117.95.162.79:9999
this ip is : https://175.42.123.108:9999
this ip is : https://113.120.38.141:9999
this ip is : https://120.25.252.18:8118
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
Process finished with exit code -1
As you can see, once the rotation reaches the IP https://49.51.193.134:1080, the status code 200 comes back, which means requests to the Baidu homepage are now going through the added IP proxy. Every request after that uses this same IP: once a 200 arrives, the middleware sets flag to False and process_request keeps handing out the last working proxy.
Closing thoughts
Well, that's it for this post, and I'm not sure when the next one will come. I will keep learning, though probably not more Scrapy, because I feel I've covered most of it by now. About the only piece left is Scrapy-redis, and even that isn't completely untouched: I've watched tutorial videos online and roughly know how to use it, I just haven't applied it in a real project, mostly because I have no use for it at the moment. Knowing how it works and that it exists is enough; if I ever really need it, I'll learn it properly then. I've also written crawlers for plenty of other sites, with lots of examples and the code still around, but I don't feel like re-analyzing those pages: they date from before I learned Scrapy, when I was using requests and selenium for everything. Maybe I'll write those up some other time.