# 批量爬取贴吧页面数据
# 网页抓取汉字转码、多个参数拼接
# 第1页: https://tieba.baidu.com/f?kw=%E6%97%85%E8%A1%8C%E9%9D%92%E8%9B%99&ie=utf-8&pn=0
# 第2页:https://tieba.baidu.com/f?kw=%E6%97%85%E8%A1%8C%E9%9D%92%E8%9B%99&ie=utf-8&pn=50
# 第3页 https://tieba.baidu.com/f?kw=%E6%97%85%E8%A1%8C%E9%9D%92%E8%9B%99&ie=utf-8&pn=100
# 第4页 pn=150
# 及格水平---单页爬取
# base_url = "https://tieba.baidu.com/f?kw=%E6%97%85%E8%A1%8C%E9%9D%92%E8%9B%99&ie=utf-8&pn="
# for page in range(10):
# new_url = base_url + str(page*50)
# print(new_url)
# 进阶水平--单页爬取
# 从键盘去输入贴吧名称和页数,然后爬取指定页面的内容
# Imports belong at the top of the logic (PEP 8), before any executable statement.
from urllib import request, parse

# Base search URL; the query string (kw / pn) is appended per page below.
base_url = 'https://tieba.baidu.com/f?'
name = input("请输入贴吧名称:")  # tieba (forum) name, percent-encoded later by urlencode
page = input("请输入贴吧页数:")  # number of pages to fetch; input() always returns a str
# qs={'kw':name,
# 'pn':(int(page)-1)*50}
#
# qs_data=parse.urlencode(qs)
# url=base_url+qs_data
# print(url)
#
# headers={
# 'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
#
# }
# req=request.Request(url,headers=headers)
# response=request.urlopen(req)
# html=response.read()
# html=html.decode('utf-8')
#
# with open(name+'第'+page+'页'+'.html','w',encoding='utf-8') as f:
# f.write(html)
# 进阶水平----批量爬取
# 从键盘去输入贴吧名称和页数,然后爬取指定页面的内容
# Batch crawl: fetch `page` pages of the given tieba and save each as an HTML file.
# The headers dict is loop-invariant, so build it once instead of on every iteration.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}
for i in range(int(page)):
    # Tieba paginates 50 posts per page, so page i starts at pn = i * 50.
    qs_data = parse.urlencode({'kw': name, 'pn': i * 50})
    url = base_url + qs_data
    print(url)
    req = request.Request(url, headers=headers)
    # Context manager guarantees the HTTP response is closed even if read/decode fails
    # (the original leaked the connection by never calling response.close()).
    with request.urlopen(req) as response:
        html = response.read().decode('utf-8')
    # One output file per page, e.g. "旅行青蛙第1页.html".
    with open(name + '第' + str(i + 1) + '页' + '.html', 'w', encoding='utf-8') as f:
        f.write(html)
# Sample run (console output kept for reference; commented out because raw
# pasted output is not valid Python and made the file fail to parse):
# C:\Users\Apple\PycharmProjects\spider\venv\Scripts\python.exe C:/Users/Apple/PycharmProjects/spider/04tieba.py
# 请输入贴吧名称:旅行青蛙
# 请输入贴吧页数:2
# https://tieba.baidu.com/f?kw=%E6%97%85%E8%A1%8C%E9%9D%92%E8%9B%99&pn=0
# https://tieba.baidu.com/f?kw=%E6%97%85%E8%A1%8C%E9%9D%92%E8%9B%99&pn=50
# Process finished with exit code 0