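Python in Practice: Scraping Product Data from Dangdang

Author: 韦玮 (Wei Wei)

Please credit the source when reposting.

Web crawlers are now used in many areas, including search engines, big-data analysis, and customer prospecting. In this post, Wei Wei uses a Dangdang crawler as an example to show how to write a crawler that automatically collects product data from Dangdang's listing pages.

First, create a Scrapy project named dangdang: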

d:\python35\myweb>scrapy startproject dangdang
New Scrapy project 'dangdang', using template directory 'd:\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    d:\python35\myweb\dangdang

You can start your first spider with:
    cd dangdang
    scrapy genspider example example.com

With the project created, change into its directory and create a spider inside it:

d:\python35\myweb>cd .\dangdang\

d:\python35\myweb\dangdang>scrapy genspider -t basic dangspd dangdang.com
Created spider 'dangspd' using template 'basic' in module:
  dangdang.spiders.dangspd

A spider is not the same thing as a project: one project can contain one or more spider files.

Next, edit items.py to define the fields to scrape. Change items.py to the following:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See the Scrapy documentation on items for details.

import scrapy


class DangdangItem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    # Product titles found on one listing page
    title = scrapy.Field()
    # Comment-count strings found on the same page
    num = scrapy.Field()

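Next, edit pipelines.py. This file normally holds the code that processes data after it has been scraped. Here we simply print each scraped record to the screen (you could equally write it to a file or a database). Change pipelines.py to the following: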

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting

class DangdangPipeline(object):
    def process_item(self, item, spider):
        # item["title"] and item["num"] are parallel lists for one page;
        # stop at the shorter one so a missing comment count cannot raise IndexError
        for j in range(0, min(len(item["title"]), len(item["num"]))):
            print(j)
            title = item["title"][j]
            num = item["num"][j]
            print("Product name: " + title)
            print("Comment count: " + num)
            print("--------")
        return item
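The text above mentions that the scraped data could be written to a file instead of printed. The following is a minimal sketch of that idea, not part of the original project: the class name `CsvExportPipeline` and the default file name `dangdang_items.csv` are assumptions made for illustration. It relies on the `open_spider` and `close_spider` hooks that Scrapy calls on pipeline objects, and it pairs titles with comment counts using `zip`.

```python
import csv


class CsvExportPipeline(object):
    """Hypothetical pipeline that writes each title/comment pair to a CSV file."""

    def __init__(self, path="dangdang_items.csv"):
        # The output path is an assumption made for this sketch.
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write a header.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["title", "num"])

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # item["title"] and item["num"] are parallel lists for one page;
        # zip() pairs them and stops at the shorter list.
        for title, num in zip(item["title"], item["num"]):
            self.writer.writerow([title.strip(), num])
        return item
```

To try it, the class would be placed in pipelines.py and registered in the ITEM_PIPELINES setting in the same way as the printing pipeline.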

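Next, edit the configuration file settings.py. There are two reasons to change it:

1) Enable the pipeline we just wrote, because pipelines are disabled by default.

2) Stop the crawler from obeying the robots protocol. That protocol restricts what our crawler may fetch, and obeying it may leave us with no results.

Change the robots section of settings.py as shown below. Setting the value to False tells the crawler not to obey Dangdang's robots.txt. Needless to say, do not use these techniques for anything illegal.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False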
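To make concrete what kind of check is being switched off, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt rules and the `example.com` URLs below are invented for illustration and are not Dangdang's actual rules; the helper name `is_allowed` is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration only;
# they are not Dangdang's actual rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path outside /private/ versus a path inside it.
print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "http://example.com/pg1-cid4002644.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "http://example.com/private/data.html"))
```

A crawler that obeys the protocol performs a check of this kind before each request and skips any URL that is disallowed.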

Then set the pipelines section of settings.py as follows to enable the pipeline:

# Configure item pipelines
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

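Next, we analyze the structure of Dangdang's pages to work out the extraction rules and the pattern behind automatic page traversal.

Open any channel page; the URLs of its successive pages look like this: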

http://category.dangdang.com/pg1-cid4002644.html

http://category.dangdang.com/pg2-cid4002644.html

http://category.dangdang.com/pg3-cid4002644.html

...

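Given this pattern, we can treat the page number as a variable and use a for loop to construct the URL of every listing page in the channel, which is how the crawl is automated.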
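The URL pattern can be sketched in plain Python as follows. The template comes from the channel URLs listed above; the names `URL_TEMPLATE` and `build_page_urls` are introduced here only for illustration.

```python
# Pattern taken from the channel URLs listed above;
# {page} is the only part that changes from page to page.
URL_TEMPLATE = "http://category.dangdang.com/pg{page}-cid4002644.html"


def build_page_urls(first_page, last_page):
    """Return the listing URLs for pages first_page..last_page inclusive."""
    return [URL_TEMPLATE.format(page=page)
            for page in range(first_page, last_page + 1)]


# Print the first three page URLs.
for url in build_page_urls(1, 3):
    print(url)
```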

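Next, we work out how to extract the product information.

We want every product title and comment count on the page, and nothing else. To find the rule, view the page source, analyze the first product as an example, and then generalize to all products: right-click, choose View Source, then press Ctrl+F to locate that product's markup and copy it out for inspection.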

From that markup we obtain the XPath expressions for the product title and the comment count:

# Extract product titles
"//a[@class='pic']/@title"

# Extract comment counts
"//a[@name='p_pl']/text()"

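There is not enough room here to cover XPath fundamentals, which are outside the scope of this post. Readers new to XPath can consult the author's book or look the topic up on Baidu.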
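Without going into XPath itself, a small illustration of what the two expressions select may still help. The sketch below uses the standard library's `xml.etree.ElementTree` rather than Scrapy, and the markup in `SAMPLE_HTML` is hypothetical and heavily simplified, not the real Dangdang page. ElementTree supports only a subset of XPath, so the trailing `/@title` and `/text()` steps are performed in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for two products; the real page is more complex.
SAMPLE_HTML = """
<ul>
  <li>
    <a class="pic" title="Sample product A" href="#">img</a>
    <a name="p_pl" href="#">255条评论</a>
  </li>
  <li>
    <a class="pic" title="Sample product B" href="#">img</a>
    <a name="p_pl" href="#">0条评论</a>
  </li>
</ul>
"""

root = ET.fromstring(SAMPLE_HTML)

# Select the <a> elements by attribute, then read the title attribute
# or the text content in Python (ElementTree has no /@title or /text()).
titles = [a.get("title") for a in root.findall(".//a[@class='pic']")]
nums = [a.text for a in root.findall(".//a[@name='p_pl']")]

print(titles)
print(nums)
```

Both lists come back in document order, which is why the two lists can later be matched up by position.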

With the XPath expressions worked out, we can now write the spider file dangspd.py created at the start. Change it to the following:

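# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from dangdang.items import DangdangItem


class DangspdSpider(scrapy.Spider):
    name = "dangspd"
    allowed_domains = ["dangdang.com"]
    # Page 1 of the channel analyzed earlier
    start_urls = (
        'http://category.dangdang.com/pg1-cid4002644.html',
    )

    def parse(self, response):
        item = DangdangItem()
        # Each field holds a list with one entry per product on the page
        item["title"] = response.xpath("//a[@class='pic']/@title").extract()
        item["num"] = response.xpath("//a[@name='p_pl']/text()").extract()
        yield item
        # Queue pages 2 to 100 using the URL pattern found earlier;
        # Scrapy's duplicate filter drops requests it has already seen
        for i in range(2, 101):
            url = "http://category.dangdang.com/pg" + str(i) + "-cid4002644.html"
            yield Request(url, callback=self.parse)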

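That completes the spider.

Now we can move on to debugging and running it.

Open a cmd window and run the spider. The output looks like the following; it is long, so part of it is omitted: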

d:\python35\myweb\dangdang>scrapy crawl dangspd --nolog
Product name:  wis水润面膜套装24片祛痘控油补水保湿淡痘印收缩毛孔面膜贴男女
Comment count: 255条评论
--------
Product name: 欧诗漫水活奇迹系列【水活奇迹珍珠水(清润型)+珍珠水活奇迹保湿凝乳】
Comment count: 0条评论
Product name: 【法国进口】雅漾（avene）活泉恒润保湿精华乳30ml 0064
Product name: 【法国进口】avene雅漾敏感肌肤护理净柔洁面摩丝150ml温和泡沫洁面乳洗面奶0655
Product name: 珍视明中老年护眼贴2盒装 30对60贴针对中老年用眼问题缓解眼疲劳
Comment count: 226条评论
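The comment counts arrive as strings such as 255条评论 ("255 comments"). If a numeric value is wanted for later analysis, one possible approach is a small regular-expression helper. This is a sketch only; the function name `parse_comment_count` is hypothetical and is not part of the original project.

```python
import re


def parse_comment_count(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Values taken from the sample output above.
print(parse_comment_count("255条评论"))
print(parse_comment_count("0条评论"))
```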

As you can see, the run printed a little over 19,000 lines in total, close to 20,000.

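Each record takes 4 lines, so roughly 19210/4 ≈ 4,800 records were scraped. This is only an estimate, because a small number of records may fail to be scraped, which is normal. That corresponds to nearly 100 pages of data, and the spider is configured to crawl 100 pages, so the result is as expected.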
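The estimate can be reproduced with a few lines of arithmetic. The line count and the four lines per record come from the text above; the per-page average is only an implied figure, assuming all 100 pages contributed equally.

```python
TOTAL_LINES = 19210      # line count reported for this run
LINES_PER_RECORD = 4     # index, product name, comment count, separator
PAGES = 100              # page 1 plus pages 2 to 100 requested by the spider

records = TOTAL_LINES / LINES_PER_RECORD
per_page = records / PAGES

print(records)             # estimated number of records
print(round(per_page, 1))  # implied average records per page
```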
