How to fetch the Weibo hot search list with requests and regular expressions

2023-05-17 05:36:20

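This is actually quite simple, and there are plenty of tutorials online. Although the Weibo hot search list is dynamic data, the entries are present in the page's HTML, so they can be extracted directly from the HTML of this page: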

https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6

Note that Weibo updates the list every minute, so the data from one minute to the next may not be exactly the same.
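Because the list changes between fetches, it can be useful to see what moved. The following is a minimal sketch, not part of the original script: `diff_snapshots` is a hypothetical helper, and the sample entries are made up. It only assumes that each snapshot is a list of dicts with a `title` key, the same shape the parser in this post yields.

```python
def diff_snapshots(previous, current):
    """Return (entered, left): titles that appeared in or dropped out of the list."""
    prev_titles = {item["title"] for item in previous}
    curr_titles = {item["title"] for item in current}
    entered = sorted(curr_titles - prev_titles)
    left = sorted(prev_titles - curr_titles)
    return entered, left


# Made-up sample data standing in for two fetches one minute apart.
minute_1 = [
    {"top": "1", "title": "topic A", "pop_nums": "300"},
    {"top": "2", "title": "topic B", "pop_nums": "200"},
]
minute_2 = [
    {"top": "1", "title": "topic A", "pop_nums": "350"},
    {"top": "2", "title": "topic C", "pop_nums": "220"},
]

entered, left = diff_snapshots(minute_1, minute_2)
print(entered)  # ['topic C']
print(left)     # ['topic B']
```

Comparing by title rather than by rank means an entry that only changes position or popularity is not reported as new.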

import json
import re
import time

import requests
from requests.exceptions import RequestException

# Browser-like User-Agent header.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'
}

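def get_one_page(url):
    try:
        # Earlier, on a company network without direct internet access, I had to set a proxy:
        # response = requests.get(url, proxies=proxy, headers=headers, verify=False)
        # Without verify=False the request raised an error because there was no security certificate.
        # If you run into anti-scraping measures, slow the crawl down, e.g. time.sleep(3).
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None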

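def parse_one_page(html):
    # re.S lets '.' match newlines, so one pattern can span a whole table row.
    # The published pattern lost its '*' characters and its HTML tags; the tags
    # restored here (</td>, </a>, <span>...</span>) are an assumption about the page markup.
    pattern = re.compile(r'<tr.*?<td.*?ranktop">(\d+)</td>.*?_blank">(.*?)</a>.*?<span>(\d+)</span>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'top': item[0],
            'title': item[1],
            'pop_nums': item[2]
        }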

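def write_to_file(content):
    # One output file per day, named after the current date.
    path = 'E:/test001/weibo%s.txt' % time.strftime('%Y_%m_%d')
    # Append mode, so each item adds a line instead of overwriting the previous one.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')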

def main():
    url = 'https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6'
    html = get_one_page(url)
    if html is None:
        # The request failed or returned a non-200 status; nothing to parse.
        print('Request failed')
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    main()
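The parsing step can be tried offline, without touching the network. The sketch below is an illustration under stated assumptions rather than the exact code of this post: `SAMPLE_HTML` is a made-up fragment that imitates a ranking table, and the tag names in `PATTERN` are assumed to match it. It shows how non-greedy `.*?` groups combined with `re.S` pull the rank, title, and popularity number out of each row.

```python
import re

# Made-up HTML imitating two rows of a ranking table (an assumption, not real Weibo markup).
SAMPLE_HTML = """
<tr class="">
  <td class="td-01 ranktop">1</td>
  <td class="td-02">
    <a href="/weibo?q=example" target="_blank">Example topic one</a>
    <span>123456</span>
  </td>
</tr>
<tr class="">
  <td class="td-01 ranktop">2</td>
  <td class="td-02">
    <a href="/weibo?q=another" target="_blank">Example topic two</a>
    <span>98765</span>
  </td>
</tr>
"""

# re.S makes '.' match newlines; '.*?' is non-greedy, so each match stops at the
# first closing tag it finds instead of running on to the last row.
PATTERN = re.compile(
    r'<tr.*?<td.*?ranktop">(\d+)</td>.*?_blank">(.*?)</a>.*?<span>(\d+)</span>',
    re.S,
)


def parse_rows(html):
    """Return one dict per table row matched by PATTERN."""
    return [
        {"top": rank, "title": title, "pop_nums": pop}
        for rank, title, pop in PATTERN.findall(html)
    ]


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

If the real page markup differs from the sample, only `PATTERN` needs adjusting; testing it against a saved copy of the page first avoids sending repeated requests while tuning the expression.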
