Crawling News with Scrapy (0): Scrapy Study Notes

2023-05-20 12:56:02

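Scrapy Study Notes

Goal

Use the Scrapy crawling framework to scrape news data from a website.

Target site: http://tech.163.com

Fields to extract:

url: address of the news article

source: news source

title: news headline

editor: news editor

time: publication time

content: body text of the article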
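The project's items.py is not reproduced in these notes, so the following is only a rough stand-in for the record shape, written as a plain TypedDict (Scrapy also accepts plain dicts as items). The class name NewsRecord and the sample values are hypothetical. Each field is typed as a list of strings on the assumption that values come from extract(), which returns a list.

```python
from typing import List, TypedDict


class NewsRecord(TypedDict, total=False):
    """Shape of one scraped article; every value is a list of strings."""
    url: List[str]
    source: List[str]
    title: List[str]
    editor: List[str]
    time: List[str]
    content: List[str]


# Field names in declaration order, handy for building file or table columns.
FIELD_NAMES = tuple(NewsRecord.__annotations__)

# Hypothetical sample values, only to show the shape of a record.
sample = NewsRecord(
    url=["http://tech.163.com/16/example.html"],
    title=["Example headline"],
)
```

At runtime a TypedDict instance is an ordinary dict, so item['url'] style access works the same way as with a Scrapy Item.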

Storage options:

File

Database
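As a minimal sketch of the two storage options, the classes below follow the Scrapy item pipeline interface (open_spider, close_spider, process_item) using only the standard library. They are assumptions for illustration, not the project's actual pipelines.py: the class names, the JSON-lines file format, the SQLite backend, the news.db file name and the news table are all hypothetical.

```python
import json
import sqlite3

FIELDS = ("url", "source", "title", "editor", "time", "content")


class JsonLinesFilePipeline:
    """Append each item to a text file, one JSON object per line."""

    def __init__(self, path="news.txt"):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


class SqlitePipeline:
    """Store each item as one row of a SQLite table (hypothetical schema)."""

    def __init__(self, path="news.db"):
        self.path = path

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.path)
        columns = ", ".join('"%s" TEXT' % name for name in FIELDS)
        self.conn.execute("CREATE TABLE IF NOT EXISTS news (%s)" % columns)

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Each field holds a list of strings; join it into a single text value.
        row = ["\n".join(item.get(name, [])) for name in FIELDS]
        placeholders = ", ".join("?" for _ in FIELDS)
        self.conn.execute("INSERT INTO news VALUES (%s)" % placeholders, row)
        return item
```

Pipelines are switched on through the ITEM_PIPELINES dictionary in settings.py, which maps each class path to an integer that sets the order in which the pipelines run.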

Code

Scrapy project files:

Resulting directory tree:

├── NewsSpiderMan
│   ├── DmozSpider // custom directory holding the spider class for the DMOZ site
│   │   ├── __init__.py
│   │   └── dmoz_spider.py
│   ├── NewsSpider // custom directory holding the spider class for news
│   │   ├── NewsSpider.py
│   │   ├── NewsSpider.pyc
│   │   ├── __init__.py
│   │   └── __init__.pyc
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py // defines the fields the spider extracts
│   ├── items.pyc
│   ├── pipelines.py // item pipeline that filters, processes and stores the scraped data
│   ├── pipelines.pyc
│   ├── settings.py // project-level configuration
│   └── settings.pyc
├── README.md
├── news.txt // crawl results
└── scrapy.cfg

GitHub: https://github.com/chenxilinsidney/ScrapyNews

How to run: scrapy crawl news163spider

Spider class

#!/usr/bin/env python
# -*-encoding:UTF-8-*-

from NewsSpiderMan.items import NewsItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsSpider(CrawlSpider):
    name = "news163spider"
    allowed_domains = ["tech.163.com"]
    start_urls = ["http://tech.163.com"]

    rules = [
        # Raw string so that \. reaches the regex engine unchanged.
        Rule(LinkExtractor(allow=r'tech.163.com/16/.*\.html'),
             follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        """Extract the structured fields from one article page."""
        item = NewsItem()
        # extract() returns a list of strings, so the URL is wrapped in a
        # list to keep every field the same type.
        item['url'] = [response.url]
        item['source'] = response.xpath(
            '//a[@id="ne_article_source"]/text()').extract()
        item['title'] = response.xpath(
            '//div[@class="post_content_main"]/h1/text()').extract()
        item['editor'] = response.xpath(
            '//span[@class="ep-editor"]/text()').extract()
        item['time'] = response.xpath(
            '//div[@class="post_time_source"]/text()').extract()
        item['content'] = response.xpath(
            '//div[@class="post_text"]/p/text()').extract()
        # Log every extracted value at debug level.
        for key in item:
            for data in item[key]:
                self.logger.debug("item %s value %s", key, data)
        return item

    # def parse_start_url(self, response):
    #    log.start()
    #    log.msg(str(response.xpath('//a/@href')))
    #    return response.xpath('//a/@href')
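Because extract() returns a list of strings, and text() nodes often carry leading or trailing whitespace, the raw fields usually need tidying before they are stored. The helper below is a hedged sketch of one way to do that. The function name clean_item, the sample values and the choice to join content paragraphs with newlines are assumptions, not part of the original project.

```python
def clean_item(item):
    """Collapse each list of extracted strings into one trimmed string."""
    cleaned = {}
    for key, values in dict(item).items():
        # Drop whitespace-only fragments and trim the rest.
        parts = [value.strip() for value in values if value.strip()]
        # Keep paragraph breaks in the body; join other fields with a space.
        separator = "\n" if key == "content" else " "
        cleaned[key] = separator.join(parts)
    return cleaned


# Hypothetical raw values in the shape parse_item produces.
raw = {
    "title": ["\n  Example headline  "],
    "time": ["\n    2016-03-01 08:00:00\n"],
    "content": ["First paragraph. ", "  ", "Second paragraph."],
}
cleaned = clean_item(raw)
```

A function like this could be called at the end of parse_item or from an item pipeline, depending on where the cleanup logic should live.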

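Key points:

1. name: identifies the spider; the framework uses it to invoke the spider (for example, the command scrapy crawl news163spider).

2. start_urls: the initial URLs the crawl starts from.

3. Rule: the crawl rules, which decide which links are followed and which callback handles the matching pages.

4. parse_item: implements the extraction of structured content.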
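To see which links the Rule in point 3 selects, the allow pattern can be tried outside Scrapy with the re module, since LinkExtractor matches allow patterns against absolute URLs with a regular expression search. This is only an illustrative sketch: the helper name is_article_url and the sample URLs are made up, and the /16/ path segment is presumably a two-digit year.

```python
import re

# Same pattern as in the spider's Rule, written as a raw string.
ALLOW = re.compile(r'tech.163.com/16/.*\.html')


def is_article_url(url):
    """Return True if the allow pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, used only to exercise the pattern.
candidates = [
    "http://tech.163.com/16/0301/08/EXAMPLE.html",
    "http://tech.163.com/",
    "http://tech.163.com/16/0301/08/",
    "http://tech.163.com/special/example.html",
]
matches = [url for url in candidates if is_article_url(url)]
```

Only URLs that contain tech.163.com/16/ and later a .html suffix pass the filter, so the front page and section pages are followed for links but not handed to parse_item.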

Results
