Requirements
- Python 2.7, 3.4 or 3.5
- Redis >= 2.8
- Scrapy >= 1.1
- redis-py >= 2.10
1. Install scrapy-redis first
sudo pip3 install scrapy-redis
2. Install redis
Edit the configuration:
$ sudo vi /etc/redis/redis.conf
...
# bind 127.0.0.1
bind 0.0.0.0 # accept connections from any IP
Restart redis:
sudo service redis-server restart
3. Install Redis Desktop Manager, a GUI tool for Redis
Download link: https://pan.baidu.com/s/1miRPuOC?fid=489763908155827
4. Rewrite the spider
# file: wb.py
import scrapy
from datetime import datetime
from ..items import QuestionItem, AnswerItem
from scrapy_redis.spiders import RedisSpider
import re
class WbSpider(RedisSpider):
    name = 'wb'
    allowed_domains = ['58che.com']
    # start_urls = ['https://bbs.58che.com/cate-1.html']
    redis_key = "wbSpider:start_urls"
First, change the spider to inherit from RedisSpider and add a redis_key based on the spider name, while commenting out start_urls. The start_requests method is also removed: every spider instance now fetches its URLs from Redis, so there is no fixed starting address. Instead, push the starting URL onto that Redis list with a Redis command:
redis-cli
lpush wbSpider:start_urls https://bbs.58che.com/cate-1.html
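The same seeding step can be scripted with redis-py instead of redis-cli. A minimal sketch (the helper name `seed_start_urls` and the host/port are assumptions; the key must match the spider's redis_key):

```python
# Push start URLs onto the list that RedisSpider polls.
# Assumes redis-py is installed; host/port below are placeholders for your server.
try:
    import redis
except ImportError:  # redis-py not installed; seed_start_urls still works
    redis = None     # with any client object exposing lpush()

REDIS_KEY = "wbSpider:start_urls"  # must match WbSpider.redis_key

def seed_start_urls(client, urls, key=REDIS_KEY):
    """LPUSH each URL onto the start list; returns the list length afterwards."""
    return client.lpush(key, *urls)

# Example usage (against a live server):
#   client = redis.Redis(host="127.0.0.1", port=6379, db=0)
#   seed_start_urls(client, ["https://bbs.58che.com/cate-1.html"])
```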
5. Update the settings
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

Scraped data is stored in Redis; comment out any other storage/pipeline settings.
# Store scraped items in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
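If a crawl is interrupted, scrapy-redis can also keep the request queue and dupefilter in Redis across runs so the crawl resumes where it left off; a settings sketch:

```python
# Don't clear the redis queues when the spider closes,
# allowing crawls to be paused and resumed.
SCHEDULER_PERSIST = True
```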
6. Deploy to different machines and start the spider
scrapy crawl wb
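Once several machines are crawling, progress can be checked from any node by inspecting the shared keys. By default scrapy-redis keeps pending requests in a `<spider>:requests` sorted set and, with RedisPipeline, scraped items in a `<spider>:items` list; a small sketch (the helper name `crawl_progress` is an assumption):

```python
# Report how much work is queued and how many items have been scraped.
# Works with any client exposing zcard()/llen(), e.g. redis.Redis.
def crawl_progress(client, spider_name="wb"):
    return {
        "pending_requests": client.zcard("%s:requests" % spider_name),
        "scraped_items": client.llen("%s:items" % spider_name),
    }

# Example usage (against a live server):
#   import redis
#   print(crawl_progress(redis.Redis(host="127.0.0.1", port=6379)))
```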