Web Crawling: Saving Data Scraped with Scrapy to a MongoDB Database

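Requirement: taking every page of the Chinese and Western medicine category on 1藥網 (111.com.cn) as the target, scrape each product's unit price, name, and number of comments.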

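The previous post covered the basics of Scrapy and how to configure each of its files. Compared with the example in that post, the only difference here is how the data is handled in pipelines.py.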

Create the spider file

scrapy genspider yiyaowang yiyaowang.com

In yiyaowang.py, write the callback function first and scrape a single page of data

# -*- coding: utf-8 -*-
import scrapy

class YaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yaowang.com']
    start_urls = ['https://www.111.com.cn/categories/953710']

    def parse(self, response):

        # Extract the data: one <li> per product
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for good in good_list:
            # Get the unit price
            good_price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Get the title
            # good_title = good.xpath('.//p[@class="titleBox"]//a/text()').get()
            # Problem found:
            # this does not return None; it returns only whitespace
            # Analysis: whitespace rather than None means the XPath itself is fine;
            # the first element of the returned list is probably an empty string
            # Fix: fetch everything with getall(), then pick the entry we need
            good_title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Get the number of comments
            good_comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()
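
The comments in the spider explain why `get()` returned whitespace for the title. As a hedged alternative to hard-coding index `[1]`, the sketch below picks the first text node that still has content after stripping. It assumes the input is a plain list of strings, like the one `getall()` returns; the helper name `first_non_empty` and the sample strings are made up for illustration.

```python
def first_non_empty(texts, default=None):
    """Return the first string that still has content after stripping."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return default


# Hypothetical text nodes, shaped like what getall() might return for a title link
sample_nodes = ["\n        ", "Example product name\n    ", "  "]
title = first_non_empty(sample_nodes)
```

Inside `parse`, this could be called as `first_non_empty(good.xpath(...).getall())`. If the page layout changes, it returns the default instead of raising an `IndexError`.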

Find the pattern in each page's URL and loop over all the pages

Page 1: https://www.111.com.cn/categories/953710-j1.html
Page 2: https://www.111.com.cn/categories/953710-j2.html
...
Last page: https://www.111.com.cn/categories/953710-j50.html

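Summary: there are 50 pages in total, the only part of the URL that changes is the number after j, and that number matches the page number.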
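
The URL pattern can be checked on its own before it goes into the spider. This is a minimal sketch that assumes the page count stays at 50; the function name `build_page_urls` is made up for illustration.

```python
BASE_URL = "https://www.111.com.cn/categories/953710-j{}.html"


def build_page_urls(last_page, base_url=BASE_URL):
    """Build the listing URLs for pages 1..last_page (inclusive)."""
    return [base_url.format(page) for page in range(1, last_page + 1)]


page_urls = build_page_urls(50)
```

The resulting list has the same shape as the `start_urls` list a Scrapy spider expects.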

Extend the existing code

# -*- coding: utf-8 -*-
import scrapy

class YaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yaowang.com']
    # -------------------------------------------------------------------------
    start_urls = []
    base_url = "https://www.111.com.cn/categories/953710-j{}.html"
    # Build the URL of every page (pages 1 to 50)
    for i in range(1, 51):
        start_urls.append(base_url.format(i))
    # -------------------------------------------------------------------------

    def parse(self, response):

        # Extract the data: one <li> per product
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for good in good_list:
            # Get the unit price
            good_price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Get the title
            good_title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Get the number of comments
            good_comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()

At this point the data has been scraped; the next step is to process it.

Write the corresponding class in items.py

import scrapy

class YiYaoWang(scrapy.Item):
    # Product title
    title = scrapy.Field()
    # Unit price
    price = scrapy.Field()
    # Number of comments
    comment = scrapy.Field()

Put the data into an item so the pipeline can use it

import scrapy

# Import the item class
# ------------------------------------------------------------
from ..items import YiYaoWang
# ------------------------------------------------------------

class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yiyaowang.com']
    start_urls = []
    base_url = "https://www.111.com.cn/categories/953710-j{}.html"
    # Build the URL of every page (pages 1 to 50)
    for i in range(1, 51):
        start_urls.append(base_url.format(i))

    def parse(self, response):
        """Extract the data from the response."""
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')

        # Get the data
        for good in good_list:
            # Instantiate a new item for each product, so that
            # yielded items do not share one mutable object
            # ----------------------------------------------------------
            item = YiYaoWang()
            # ----------------------------------------------------------

            # Get the unit price
            price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Get the title
            # good_title = good.xpath('.//p[@class="titleBox"]//a/text()').get()
            # Problem found:
            # this does not return None; it returns only whitespace
            # Analysis: whitespace rather than None means the XPath itself is fine;
            # the first element of the returned list is probably an empty string
            # Fix: fetch everything with getall(), then pick the entry we need
            title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Get the number of comments
            comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()

            # ---------------------------------------------------------------
            # Fill in the item
            item["title"] = title
            item["price"] = price
            item["comment"] = comment
            # ---------------------------------------------------------------

            yield item
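
A note on where the item is instantiated: a `scrapy.Item` is a mutable, dict-like object, so reusing one instance for every product means anything that keeps a reference to earlier items (for example, a pipeline that buffers them) sees only the last values written. The sketch below illustrates the effect with plain dicts standing in for items. The row data and function names are hypothetical.

```python
def collect_shared(rows):
    """Reuse a single dict for every row (the risky pattern)."""
    item = {}
    collected = []
    for title, price in rows:
        item["title"] = title
        item["price"] = price
        collected.append(item)  # every entry is the same object
    return collected


def collect_fresh(rows):
    """Create a new dict for each row (the safe pattern)."""
    collected = []
    for title, price in rows:
        item = {"title": title, "price": price}
        collected.append(item)
    return collected


# Hypothetical sample rows: (title, price)
rows = [("Product A", "10.00"), ("Product B", "25.50")]
shared = collect_shared(rows)
fresh = collect_fresh(rows)
```

With the shared dict, both collected entries end up describing "Product B". With a fresh dict per row, each entry keeps its own values.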

Write the data-storage class in the pipelines.py pipeline file

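import pymongo

class YiYaoWangPipeline:
    def open_spider(self, spider):
        # Create the connection
        self.client = pymongo.MongoClient(host="127.0.0.1", port=27017)
        # Select the database
        self.db = self.client["first_text"]
        # Select the collection
        self.col = self.db["yiyaowang"]

    def process_item(self, item, spider):
        # Insert one document; insert_one() replaces the deprecated insert(),
        # which was removed in PyMongo 4.
        # Keys: 标題 = title, 單價 = unit price, 評論數 = number of comments
        self.col.insert_one({"标題": item["title"], "單價": item["price"], "評論數": item["comment"]})
        return item

    def close_spider(self, spider):
        # Close the connection when the spider finishes
        self.client.close()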
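
To try the pipeline logic without a running MongoDB server, one option is to make the client injectable. The sketch below is an illustration under assumptions: the `client_factory` parameter and the class name `MongoPipelineSketch` are hypothetical and not part of Scrapy or PyMongo. It also stores the item under its English field names rather than the Chinese keys used above.

```python
class MongoPipelineSketch:
    """Pipeline variant whose MongoDB client can be swapped out for testing."""

    def __init__(self, client_factory=None):
        # client_factory is a hypothetical hook: a zero-argument callable
        # returning an object that behaves like pymongo.MongoClient.
        self.client_factory = client_factory

    def open_spider(self, spider):
        if self.client_factory is None:
            # Imported lazily so the class can be exercised without PyMongo.
            import pymongo
            self.client_factory = lambda: pymongo.MongoClient(
                host="127.0.0.1", port=27017
            )
        self.client = self.client_factory()
        self.col = self.client["first_text"]["yiyaowang"]

    def process_item(self, item, spider):
        # dict(item) makes a copy, so the "_id" that insert_one() adds
        # to the stored document does not leak back into the item.
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

Because `client_factory` defaults to `None`, Scrapy can still construct the class with no arguments, and in that case it connects to the same local server as the pipeline above.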

Add the finished pipeline to the settings.py configuration file

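The pipeline entries for the earlier spiders must be commented out. Otherwise those pipelines also run on the current spider's items, and they raise errors whenever the fields they try to store differ from the ones this spider provides.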

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'reptile.pipelines.ReptilePipeline': 300,
    # 'reptile.pipelines.HuPuPipeline': 300,
    'reptile.pipelines.YiYaoWangPipeline': 300,
}
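
As an alternative to commenting entries in and out, a pipeline can check which spider produced the item and pass everything else through untouched. This is a minimal sketch of that idea: the class name `SpiderScopedPipeline`, the `handled` list (a stand-in for the real storage step), and the spider name "hupu" are assumptions for illustration. Scrapy also lets a spider override `ITEM_PIPELINES` through its `custom_settings` class attribute, which may be a cleaner fit.

```python
from types import SimpleNamespace


class SpiderScopedPipeline:
    """Only handle items that come from the spider named in target_spider."""

    target_spider = "yiyaowang"

    def __init__(self):
        self.handled = []

    def process_item(self, item, spider):
        if spider.name != self.target_spider:
            return item  # not ours: pass it along unchanged
        self.handled.append(dict(item))  # stand-in for the real storage step
        return item


# SimpleNamespace objects stand in for real spider instances here.
pipeline = SpiderScopedPipeline()
pipeline.process_item({"title": "a"}, SimpleNamespace(name="yiyaowang"))
pipeline.process_item({"title": "b"}, SimpleNamespace(name="hupu"))
```

With a guard like this, several pipelines can stay enabled in settings.py at once, each acting only on its own spider's items.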

Run the spider
```
scrapy crawl yiyaowang
```

Open the MongoDB shell in a command window to check whether the data was saved to the database

MongoDB Enterprise > show tables
yiyaowang
