11. Web Crawler Tutorial 2: The Scrapy Framework, Using Scrapy

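XPath expressions

　　//x selects x elements at any depth below the current node; for example, //div finds every div element

　　/x selects x elements exactly one level down (direct children)

　　/@x returns the value of attribute x, for example /@id or /@src

　　[@attr="value"] keeps only elements whose attribute equals the given value; predicates can be chained, for example to find elements whose class equals a given name

　　/text() returns the text content of an element

　　[x] picks a single element from the result set by index (inside an XPath expression positions start at 1; indexing the returned selector list in Python starts at 0)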
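The rules above can be tried without Scrapy. The following is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath: it has no /text() or /@x step, so those are read from the element instead. The page snippet, ids, and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """
<html><body>
  <div class="showlist">
    <li><a id="first" href="/item/1">Book one</a></li>
    <li><a id="second" href="/item/2">Book two</a></li>
  </div>
  <div class="footer"><a href="/about">About</a></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# .//x : every x at any depth (the "//x" rule).
all_links = root.findall(".//a")

# [@attr="value"] keeps only matching elements,
# then /x steps down through direct children.
show_links = root.findall('.//div[@class="showlist"]/li/a')

# ElementTree has no /text() or /@x step, so read them from the element.
titles = [a.text for a in show_links]          # what /text() would return
hrefs = [a.get("href") for a in show_links]    # what /@href would return

# [x] inside the path is 1-based; indexing the Python list is 0-based.
first_in_path = root.find('.//div[@class="showlist"]/li[1]/a')
first_in_python = show_links[0]

print(len(all_links), titles, hrefs)
print(first_in_path.get("id"), first_in_python.get("id"))
```

Both of the last two lookups land on the same element, which is the point of the 1-based versus 0-based distinction.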

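1. Run a regular expression over the result of an XPath expression and keep only what the regex captures

Append .re('pattern') to the end of the selector chain

xpath('//div[@class="showlist"]/li//img')[0].re(r'alt="(\w+)')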
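As a rough plain-Python approximation of what this step does, the sketch below applies a pattern with one capture group to a list of strings and collects the captures into one flat list. The strings and the helper name apply_regex are hypothetical; this is not Scrapy's implementation.

```python
import re

# Hypothetical strings, shaped like what .extract() might return for //img nodes.
extracted = [
    '<img src="/a.jpg" alt="python_book">',
    '<img src="/b.jpg" alt="scrapy_guide">',
    '<img src="/c.jpg">',                      # no alt attribute: contributes nothing
]

def apply_regex(strings, pattern):
    """Return every capture of `pattern` across `strings`, as one flat list."""
    compiled = re.compile(pattern)
    results = []
    for text in strings:
        results.extend(compiled.findall(text))
    return results

alts = apply_regex(extracted, r'alt="(\w+)')
print(alts)
```

Nodes that do not match simply add nothing to the result, so the output list can be shorter than the input list.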

2. Apply a regular expression inside the selector rule itself

[re:test(@attr, "pattern")]

xpath('//div[re:test(@class, "showlist")]').extract()
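The difference between an exact attribute match and a regex test is easy to miss. The sketch below emulates it in plain Python, assuming re:test performs an unanchored search (which is how the EXSLT regular-expression test function is usually described). The helper name and the page snippet are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical snippet with two similar class names.
PAGE = """
<body>
  <div class="showlist">A</div>
  <div class="showlist-hot">B</div>
  <div class="footer">C</div>
</body>
"""

def elements_with_attr_matching(root, tag, attr, pattern):
    """Keep `tag` elements whose `attr` value contains a match for `pattern`."""
    compiled = re.compile(pattern)
    return [
        el for el in root.iter(tag)
        if compiled.search(el.get(attr, ""))
    ]

root = ET.fromstring(PAGE)

exact = root.findall('.//div[@class="showlist"]')                              # equality test
by_regex = elements_with_attr_matching(root, "div", "class", r"showlist")      # substring match
anchored = elements_with_attr_matching(root, "div", "class", r"^showlist$")    # anchored regex

print([d.text for d in exact], [d.text for d in by_regex], [d.text for d in anchored])
```

Under that assumption, a bare pattern such as "showlist" also matches "showlist-hot"; anchoring the pattern brings it back to the exact-match behaviour.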

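Hands-on: using Scrapy to collect product titles, product links, and review counts from an e-commerce site

Analyze the page source

Step 1: write the items.py container file

We already know what we want to collect: the product title, the product link, and the review count.

In items.py, create a container to receive the data the spider collects.

The container class for the scraped data must inherit from scrapy.Item.

Each field is declared as a class attribute assigned scrapy.Field(); that attribute then receives the corresponding piece of scraped data.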

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# items.py exists to receive the data the spider collects; it acts as the container file

class AdcItem(scrapy.Item):    # container class for the scraped data
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # receives the scraped product titles
    link = scrapy.Field()       # receives the scraped product links
    comment = scrapy.Field()    # receives the scraped review counts
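Recent Scrapy releases (2.2 and later, as far as I know) also accept dataclass objects as items. The sketch below shows what a one-product-per-item container could look like in that style. ProductItem and the example values are hypothetical and not part of the original project.

```python
from dataclasses import dataclass, asdict

@dataclass
class ProductItem:
    """One product per item (hypothetical alternative to AdcItem)."""
    title: str
    link: str
    comment: str = ""   # review-count text; empty when the page shows none

product = ProductItem(
    title="Example book",
    link="http://product.example.com/1.html",
    comment="12 reviews",
)
print(asdict(product))
```

The tutorial's AdcItem stores whole lists in each field instead; either shape works, but one object per product is usually easier to export.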

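Step 2: write the pach.py spider file

Define the spider class; it must inherit from scrapy.Spider

name sets the spider's name

allowed_domains sets the domains the spider may crawl

start_urls sets the URLs the crawl starts from

parse(response) is the spider's callback; it receives response, an object holding the downloaded HTML

xpath() is the filter; its argument is an XPath expression

extract() pulls the matched data out of the selector as a list of strings

yield item hands the populated container object over to pipelines.py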

# -*- coding: utf-8 -*-
import scrapy
from adc.items import AdcItem  # import the AdcItem container class from items.py

class PachSpider(scrapy.Spider):                 # spider class; must inherit from scrapy.Spider
    name = 'pach'                                # spider name
    allowed_domains = ['dangdang.com']           # parent domain, so category.dangdang.com and search.dangdang.com both match
    start_urls = ['http://category.dangdang.com/pg1-cid4008149.html']     # URL to start crawling from

    def parse(self, response):                   # parse callback
        item = AdcItem()                         # instantiate the container object
        item['title'] = response.xpath('//p[@class="name"]/a/text()').extract()  # list of every product title on the page, stored in the title field
        # print(item['title'])
        item['link'] = response.xpath('//p[@class="name"]/a/@href').extract()    # list of every product link, stored in the link field
        # print(item['link'])
        item['comment'] = response.xpath('//p[@class="star"]//a/text()').extract() # list of every review-count text, stored in the comment field
        # print(item['comment'])
        yield item   # hand the populated container object over to pipelines.py
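The spider above puts three parallel lists into a single item. If one record per product is wanted, the lists can be paired up row by row. The following is a minimal sketch of that step in plain Python; pair_fields and the sample values are hypothetical.

```python
def pair_fields(titles, links, comments):
    """Turn three parallel lists into one dict per product.

    zip() stops at the shortest list, so a product missing a field
    misaligns or drops later rows; selecting per product node avoids that.
    """
    for title, link, comment in zip(titles, links, comments):
        yield {
            "title": title.strip(),
            "link": link,
            "comment": comment.strip(),
        }

# Hypothetical values shaped like the lists the spider stores in the item.
titles = [" Book one ", "Book two"]
links = ["http://product.example.com/1.html", "http://product.example.com/2.html"]
comments = ["12 reviews", "3 reviews"]

products = list(pair_fields(titles, links, comments))
print(products[0])
```

This relies on every product exposing all three fields in the same order, which should be checked against the real page before trusting the pairing.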

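The robots protocol

Note: if the target site's robots.txt disallows crawling, the spider will fetch nothing, because a project generated by scrapy startproject obeys the robots exclusion protocol by default (its settings.py contains ROBOTSTXT_OBEY = True). To stop obeying it, change the setting in settings.py.

In settings.py, find the ROBOTSTXT_OBEY variable: False means robots.txt is ignored, True means it is obeyed.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # do not obey robots.txt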
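To see what kind of check is involved, the standard library's urllib.robotparser can evaluate a URL against a set of robots.txt rules. Scrapy performs its own check internally; this sketch only illustrates the idea, and the rules and URLs below are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawl would download the site's own file.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in (
    "http://shop.example.com/pg1-cid1.html",
    "http://shop.example.com/private/orders.html",
):
    # can_fetch() answers: may this user agent request this URL?
    print(url, parser.can_fetch("*", url))
```

Running a check like this against the target site's real robots.txt shows in advance which start URLs would be skipped while ROBOTSTXT_OBEY is True.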

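Step 3: write the pipelines.py data-processing file

For the processing class in pipelines.py to run, it must be registered in the ITEM_PIPELINES variable in settings.py

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'adc.pipelines.AdcPipeline': 300,  # register adc.pipelines.AdcPipeline; the number sets the order, and pipelines with lower values run first (conventionally 0 to 1000)
}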
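The ordering rule is easiest to see with two entries. In the sketch below, 'adc.pipelines.CleanPipeline' is a hypothetical second pipeline added only for illustration; sorting by the numeric value in ascending order gives the sequence in which each item passes through them.

```python
# 'adc.pipelines.CleanPipeline' is a hypothetical second pipeline added for illustration.
ITEM_PIPELINES = {
    'adc.pipelines.AdcPipeline': 300,
    'adc.pipelines.CleanPipeline': 100,
}

# Sort by the numeric value, ascending: the lowest number sees each item first.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

So a cleaning step that must run before printing or storing gets the smaller number.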

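Once registered, the processing class in pipelines.py takes effect

Define the processing class; it is a plain Python class and needs no Scrapy base class (the explicit object base is optional in Python 3)

process_item(self, item, spider) is the processing method; item is the object the spider yielded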

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class AdcPipeline(object):                      # processing class; a plain Python class
    def process_item(self, item, spider):       # called once for every item the spider yields
        for i in range(0,len(item['title'])):   # item['field name'] returns the list stored in that field
            title = item['title'][i]
            print(title)
            link = item['link'][i]
            print(link)
            comment = item['comment'][i]
            print(comment)
        return item
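Printing is fine for a first run, but a pipeline is also the natural place to store results. The following is a minimal sketch of a pipeline that writes one JSON object per product to a file, using the open_spider and close_spider hooks that Scrapy calls on pipelines when a spider starts and finishes. The class name JsonLinesPipeline and the default file name products.jl are assumptions, not part of the original project.

```python
import json

class JsonLinesPipeline:
    """Hypothetical pipeline: writes one JSON object per product to a file."""

    def __init__(self, path="products.jl"):
        self.path = path      # output file name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # The spider stores parallel lists, so pair them up row by row.
        rows = zip(item["title"], item["link"], item["comment"])
        for title, link, comment in rows:
            record = {"title": title, "link": link, "comment": comment}
            self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item   # pass the item on to any later pipeline
```

To try it, the class would be added to pipelines.py and registered in ITEM_PIPELINES alongside, or instead of, AdcPipeline.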

Finally, run it

Run the spider with scrapy crawl pach --nolog

The output shows that the data we wanted has been collected

[Reposted from: http://www.lqkweb.com]
