【Scrapy Learning Notes】Adding IP Proxies
Disclaimer: for technical discussion only. Do not use this for anything illegal; this blog bears no responsibility for losses caused by illegal use.
Adding an IP proxy simply means setting the value of the proxy attribute, i.e. request.meta['proxy'], on each outgoing request. As a minimal sketch of the mechanism (the class name and proxy address below are hypothetical placeholders, not part of the real middleware we build later):
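    # Minimal sketch: Scrapy's downloader routes the request through
    # whatever address request.meta['proxy'] holds.
    class FixedProxyMiddleware(object):  # hypothetical name, illustration only
        def process_request(self, request, spider):
            request.meta['proxy'] = 'https://1.2.3.4:8080'  # made-up address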
The free IPs I use here first have to be fetched from the 66ip free proxy site (www.66ip.cn). The site is genuinely handy: a single request returns as many IPs as you ask for (the exact API URL appears in the middleware code below).
All that's needed is to modify the middlewares.py middleware file in the Scrapy project. Enough talk; here's the code:
    from scrapy import signals  # kept from the default middleware template
    import requests
    import parsel  # a parsing library similar to BeautifulSoup, but much more powerful

    class ipdailiDownloaderMiddleware(object):
        def __init__(self):
            self.proxy = self.get_ip  # the fetched proxy list
            self.flag = True          # whether we still need to try a new ip
            self.w = 0                # index into the proxy list

        # @property turns the method below into an attribute-style accessor
        @property
        def get_ip(self):
            # the getnum parameter in the url is the number of ips to fetch
            url = 'http://www.66ip.cn/nmtq.php?getnum=200&isp=0&anonymoustype=3&area=0&proxytype=1&api=66ip'
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
            r = requests.get(url, headers=headers)
            response = parsel.Selector(r.text)
            ips = []
            hehe = response.xpath('.//script/../text()').getall()
            for ip in hehe:
                temp = ip.strip()
                if temp != '':
                    ips.append('https://' + temp)
            return ips

        # called before every request is sent
        def process_request(self, request, spider):
            if self.flag and (self.w < len(self.proxy)):
                # still searching: try the next proxy in the list
                proxy = self.proxy[self.w]
                print('this ip is :', proxy)
                request.meta['proxy'] = proxy
                self.w += 1
            else:
                # a working proxy was found (or the list ran out): reuse the last one tried
                proxy = self.proxy[self.w] if self.w == 0 else self.proxy[self.w - 1]
                print('this ip is :', proxy)
                request.meta['proxy'] = proxy

        # called on every response before it is handed to the spider
        def process_response(self, request, response, spider):
            if response.status == 200:
                self.flag = False  # this proxy works, keep using it
                return response    # if data came back successfully, just return it
            else:
                # status is not 200: return the request so it gets retried,
                # set flag back to True, and fetch a fresh batch of ips
                self.flag = True
                self.proxy = self.get_ip
                self.w = 0
                return request
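As a side note, if you want to sanity-check one of the fetched proxies outside Scrapy first, a small standalone script with requests does the job (the proxy address below is a made-up placeholder):

    import requests

    # Standalone check of a single proxy (hypothetical address)
    # before trusting it inside the middleware.
    proxy = 'https://1.2.3.4:8080'
    try:
        r = requests.get('https://www.baidu.com',
                         proxies={'http': proxy, 'https': proxy},
                         timeout=5)
        print(proxy, '->', r.status_code)
    except requests.RequestException as e:
        print(proxy, 'failed:', e)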
Next, define a spider file ip.py to verify that the free proxy IPs actually work for requests. This one needs little explanation; the code is as follows:
    import scrapy

    class IpSpider(scrapy.Spider):
        name = 'ip'
        allowed_domains = ['baidu.com']
        start_urls = ['https://www.baidu.com']

        def parse(self, response):
            print(response.status)
            # request the same URL again; dont_filter=True stops Scrapy's
            # duplicate filter from dropping the repeated request
            yield scrapy.Request(
                self.start_urls[0],
                dont_filter=True
            )
Finally, don't forget to add the following to the settings.py file so the IP proxy middleware defined above is enabled:
    DOWNLOADER_MIDDLEWARES = {
        'hehe.middlewares.ipdailiDownloaderMiddleware': 543,
    }
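Optionally (my own addition, not something the middleware above requires), Scrapy's built-in retry and timeout settings pair well with flaky free proxies:

    # Optional extras for unreliable free proxies; these are standard
    # Scrapy settings, and the values here are only examples.
    RETRY_ENABLED = True
    RETRY_TIMES = 5          # retry each failing request up to 5 times
    DOWNLOAD_TIMEOUT = 10    # give up on a dead proxy after 10 seconds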
Finally, open cmd, cd into the project's root directory (the one containing the scrapy.cfg file), and run scrapy crawl ip.
The output looks like this:
this ip is : https://182.35.87.118:9999
this ip is : https://120.25.252.232:8118
this ip is : https://117.95.162.79:9999
this ip is : https://175.42.123.108:9999
this ip is : https://113.120.38.141:9999
this ip is : https://120.25.252.18:8118
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
200
this ip is : https://49.51.193.134:1080
Process finished with exit code -1
You can see that once the run reached the IP https://49.51.193.134:1080, status code 200 came back, which means requests to the Baidu homepage are now going through an added IP proxy. Every subsequent request reuses that same IP: process_response sets flag to False after the first success, so process_request keeps handing out the last working proxy instead of moving down the list.
Final words
Well, that's it for this post; I don't know when the next one will be. I will keep learning, though probably not more Scrapy, since I feel I've covered most of it by now. About all that's left is Scrapy-redis, and even that isn't completely untouched: I've watched tutorial videos online and roughly know how to use it, I just haven't used it in a real project, because right now I have no need for it. Knowing how it works and that it exists is enough; if I ever really need it, I'll learn it properly then. I've also scraped plenty of other sites, with lots of examples and the code still around, but I don't feel like re-analyzing those pages, since they were all written before I learned Scrapy, using requests and selenium. I'll write those up some other time.