Ajax

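On some websites, the content we scrape differs from what we see in the browser. This is because we fetch the raw HTML document, whereas the page shown in the browser has already been processed by JavaScript. The data may be loaded through Ajax, may be contained in the HTML document, or may be computed by JavaScript with some specific algorithm.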
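
One quick way to tell whether a page depends on Ajax is to check whether text you can see in the browser is present in the raw HTML at all. The sketch below illustrates the idea; the helper name `missing_from_html` and the sample HTML are hypothetical and not part of the original walkthrough.

```python
def missing_from_html(html, visible_texts):
    """Return the texts that are visible in the browser but absent from the raw HTML."""
    return [text for text in visible_texts if text not in html]


# Hypothetical raw HTML: the page shell is present, but the result list
# would only be filled in later by JavaScript.
raw_html = '<html><body><h1>Bird pictures</h1><div id="results"></div></body></html>'
print(missing_from_html(raw_html, ['Bird pictures', 'sparrow.jpg']))
```

Anything this returns is a candidate for data that arrives through a separate request.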

Ajax analysis method

Viewing requests

Open a website in the browser. We use this one as an example:

https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=鸟

Right-click on the page and choose Inspect. The developer tools will open.

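At this point you can see the page source, but it is not the content we want. Switch to the Network panel and scroll down the page. Many entries appear: these are the record of every request sent and every response received between the browser and the server while the page loads. Click the XHR filter; the requests listed below it are the Ajax requests.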

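Click one of the requests to see its request link below. The links of all these requests are nearly identical; only some of the trailing parameters differ.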
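
To find out exactly which parameters change between two request links, they can be compared programmatically instead of by eye. This is a minimal sketch using the standard library; the function name `differing_params` and the two example.com links are hypothetical stand-ins for links copied from the XHR list.

```python
from urllib.parse import urlsplit, parse_qsl


def differing_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ."""
    params_a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    params_b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    names = sorted(set(params_a) | set(params_b))
    return {
        name: (params_a.get(name), params_b.get(name))
        for name in names
        if params_a.get(name) != params_b.get(name)
    }


# Hypothetical request links standing in for two entries from the XHR list.
first = 'https://example.com/api/list?word=bird&offset=30&size=30'
second = 'https://example.com/api/list?word=bird&offset=60&size=30'
print(differing_params(first, second))
```

`keep_blank_values=True` matters here because request links often carry many empty parameters that would otherwise be dropped from the comparison.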

We use this link to fetch the content we need.

import requests

# Note: the tail of this request link is truncated in the source text.
url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8957187432310752416&ipn=rj&ct=201326592&is=&fp=result&queryWord=鸟&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=鸟&s=&se=&tab=&width=&height=&'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on an HTTP error
json = response.json()  # parse the JSON body into Python objects
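
Rather than pasting the whole link as one string, the query string can be assembled from a dict with `urllib.parse.urlencode`, which also percent-encodes the non-ASCII keyword. The sketch below is an illustration only: `build_url` is a hypothetical helper, it lists just a subset of the parameters visible in the link above, and whether the server accepts this shorter query is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://image.baidu.com/search/acjson'


def build_url(keyword):
    """Build the request link from a dict of query parameters."""
    # Only a subset of the parameters from the link above is listed here;
    # that the server accepts this shorter query is an assumption.
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'ct': '201326592',
        'fp': 'result',
        'queryWord': keyword,
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'word': keyword,
    }
    return BASE_URL + '?' + urlencode(params)


url = build_url('鸟')
print(url)
```

Keeping the parameters in a dict makes it easy to change a single value, such as the keyword, without editing a long string.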

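The fetched content has been parsed from JSON and stored in the json variable. The next step is to get the URL of each image from it. Click the Response tab: the content we need is the thumbURL field of each entry in data.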
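
The extraction step can be sketched on its own, without any network access. The function name `extract_thumb_urls` and the sample response are hypothetical; the sample only mimics the structure described above (a `data` list whose entries carry `thumbURL`, possibly with empty entries).

```python
def extract_thumb_urls(payload):
    """Collect the thumbURL of every non-empty entry in payload['data']."""
    urls = []
    for item in payload.get('data') or []:
        # Skip empty entries and entries without a thumbURL.
        if item and item.get('thumbURL'):
            urls.append(item['thumbURL'])
    return urls


# Hypothetical response shaped like the structure described above.
sample = {
    'data': [
        {'thumbURL': 'https://example.com/1.jpg'},
        {'thumbURL': 'https://example.com/2.jpg'},
        {},
    ]
}
print(extract_thumb_urls(sample))
```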

Once we have the URLs, we download each image and save it in the folder a on drive E.

import requests

# Note: the tail of this request link is truncated in the source text.
url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8957187432310752416&ipn=rj&ct=201326592&is=&fp=result&queryWord=鸟&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=鸟&s=&se=&tab=&width=&height=&'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
response = requests.get(url, headers=headers)
json = response.json()
items = json.get('data') or []
i = 1  # running number used as the file name
for item in items:
    # The data list can contain empty entries, so skip them.
    if item and item.get('thumbURL'):
        image_url = item.get('thumbURL')  # image URL
        image = requests.get(image_url)
        file_path = 'E:/a/' + str(i) + '.jpg'  # the folder E:/a/ must already exist
        with open(file_path, 'wb') as f:
            f.write(image.content)
        i += 1
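
The saving step can also be wrapped in a small helper that creates the target folder when it is missing, so the script does not fail on a fresh machine. This is a sketch under assumptions: `save_image` is a hypothetical name, and the demonstration writes placeholder bytes into a temporary folder rather than into E:/a/.

```python
from pathlib import Path
import tempfile


def save_image(content, directory, index):
    """Write raw image bytes to <directory>/<index>.jpg and return the path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    file_path = folder / f'{index}.jpg'
    file_path.write_bytes(content)
    return file_path


# Demonstration with placeholder bytes in a temporary folder instead of E:/a/.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b'not a real image', Path(tmp) / 'a', 1)
    print(saved.name, saved.stat().st_size)
```

In the download loop, `image.content` would be passed as `content` and the running counter as `index`.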

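A for loop lets us download many more images. The request links differ in one parameter, which increases in steps of 30, so we can set that parameter to 30*j inside the loop.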

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
i = 1  # running number used as the file name
for j in range(1, 31):
    # Note: the link is truncated in the source text just before the offset value,
    # so the name of the parameter that receives 30*j is missing here.
    # Copy it from the request link shown in the developer tools.
    url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8957187432310752416&ipn=rj&ct=201326592&is=&fp=result&queryWord=鸟&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word=鸟&s=&se=&tab=&width=&height=&' + str(30*j) + '&rn=30&gsm=1e&1622545007096='
    response = requests.get(url, headers=headers)
    json = response.json()
    items = json.get('data') or []
    for item in items:
        # The data list can contain empty entries, so skip them.
        if item and item.get('thumbURL'):
            image_url = item.get('thumbURL')  # image URL
            image = requests.get(image_url)
            file_path = 'E:/a/' + str(i) + '.jpg'  # the folder E:/a/ must already exist
            with open(file_path, 'wb') as f:
                f.write(image.content)
            i += 1
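
The paging logic can be separated from the network calls, which makes it testable offline. In this sketch `crawl_pages` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the `requests.get(...).json()` call and returns made-up data, while the offsets follow the same 30*j stepping as the loop above.

```python
def crawl_pages(fetch_page, pages, page_size=30):
    """Call fetch_page(offset) for each page and collect all thumbURL values."""
    urls = []
    for j in range(1, pages + 1):
        offset = page_size * j  # same 30*j stepping as in the loop above
        payload = fetch_page(offset)
        for item in payload.get('data') or []:
            if item and item.get('thumbURL'):
                urls.append(item['thumbURL'])
    return urls


def fake_fetch(offset):
    """Hypothetical stand-in for the real request: two entries plus an empty one."""
    entries = [{'thumbURL': f'https://example.com/{offset}-{k}.jpg'} for k in range(2)]
    return {'data': entries + [{}]}


print(crawl_pages(fake_fetch, pages=2))
```

In a real run, `fetch_page` would build the request link for the given offset, send it with `requests.get`, and return `response.json()`.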
