前言

本文介绍了使用aiohttp库实现异步爬虫爬取网站图书的评论，及其代码的详细解释（案例来源于python3网络爬虫开发实战，本人对其进行了改编）

一、观察需要爬取的页面的数据包

1、通过观察发现页面的数据包是ajax数据包，而且我们需要的组成详情页的url的id也在ajax数据包的响应当中

python3网络爬虫aiohttp实战案例前言一、观察需要爬取的页面的数据包二、基本配置三、生成不同页面的url，并获取响应数据四、获得评论信息和数据储存五、主函数调用六、运行结果

2、进入详情页中，页面的数据还是一个ajax数据包，我们想要的评论数据也在ajax数据包的响应当中。

这里注意一个问题：详情页的url和ajax数据包的url是不一样的

二、基本配置

1、引入需要使用的库

import aiohttp
import asyncio
import nest_asyncio
import time
import json
import logging##这个是日志文件的库，可以记录程序运行的过程

2、定义必要的信息

nest_asyncio.apply()##jupyter notebook上不加此句话会报错
logging.basicConfig(level=logging.INFO,format='%(asctime)s-%(levelname)s:%(message)s')
index_url='https://spa5.scrape.center/api/book/?limit=18&offset={offset}'#索引页的url
detail_url= 'https://spa5.scrape.center/api/book/{id}'#详情页的url
page_size=18#一页有18本图书
page_number=2#设置爬取的页数是2页
concurrency=5#设置最大的并大数量是5

三、生成不同页面的url，并获取响应数据

1、抓取ajax数据包中的响应数据

async def scrape_api(url):
    async with semaphore:
        try:
            logging.info('scraping %s',url)
            async with session.get(url) as response:
                return await response.json()
        except aiohttp.ClientError:
            logging.error('error occured while scraping %s',url,exc_info=True)

2、由于需要爬取不同页面的图书信息，所以url在参数上会有不同，不同页面的url也不同

##得到不同页面的url
async def scrape_index(page):
    url=index_url.format(offset=page_size*(page-1))##拼接成完整的详情页的url
    return await scrape_api(url)

四、获得评论信息和数据储存

#处理获取到的响应数据
async def save_file(data):
    list_content=data.get('comments')
    for item in list_content:
         with open('comment.txt',mode='a',encoding='utf-8') as fp:
            fp.write(item.get('content'))
            fp.write('\r\n')
#获取详情页的评论信息   
async def scrape_detail(id):
    url=detail_url.format(id=id)
    data=await scrape_api(url)
    await save_file(data)

五、主函数调用

async def main():
    global session
    session=aiohttp.ClientSession()#创建一个ClientSession对象，用于获得网页响应，类似于使用get
    scrape_index_tasks=[asyncio.ensure_future(scrape_index(page)) for page in range(1,page_number+1)]#获取爬取页面的任务
    results=await asyncio.gather(*scrape_index_tasks)##将任务进行聚合，相当于放到一个队列中
    logging.info('results %s',json.dumps(results,ensure_ascii=False,indent=2))
    ids=[]
    for index_data in results:
        if not index_data:
            continue
        for item in index_data.get('results'):#获取数据包的响应数据是一个字典，字典的值是一个列表
            ids.append(item.get('id'))#列表中又是一个字典，找到id
    scrape_detail_tasks=[asyncio.ensure_future(scrape_detail(id)) for id in ids]##创建获取详情页的任务
    await asyncio.wait(scrape_detail_tasks)##这里是io操作比较密集的地方，需要挂起
    await session.close()
    
if __name__=='__main__':
    asyncio.run(main())

python3网络爬虫aiohttp实战案例前言一、观察需要爬取的页面的数据包二、基本配置三、生成不同页面的url，并获取响应数据四、获得评论信息和数据储存五、主函数调用六、运行结果

文章目录

前言

一、观察需要爬取的页面的数据包

二、基本配置

三、生成不同页面的url，并获取响应数据

四、获得评论信息和数据储存

五、主函数调用

六、运行结果

继续阅读

无法解析的外部符号 wmain，该符号在函数 "void cdecl mainCRTStartupHelper(struct HINSTANCE *,unsigned short con......

TestLink导出用例转换工具(XML2Excel)

YAML简介和PyYAML安全操作YAML支持的类型YAML的优点：yaml的基本语法python操作

Small tricks

libsvm for python 安装

学习软件测试基础测试第七天

Zeppelin 配置访问 REST APIApache Zeppelin Configuration REST API

【Torch】最简洁logging使用指南

27. Remove Element(列表)题目代码

sort()函数到底是怎样进行数字排序的

Cloud Studio初体验

使用 ctypes 进行 Python 和 C 的混合编程

【python】【数据处理】画多维数据分布图

【python】netconf协议对接管理设备

「Python 网络自动化」NETCONF —— Python 使用 NETCONF 管理配置 H3C 网络设备

在python中创建excel并写入