Scraping Baidu's ICP Filing Information with Python

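First, install the two required libraries with pip install requests and pip install bs4. (Note: lxml may not be installed on your machine. If the script fails with an error, try pip install lxml; lxml is the library that parses the HTML.)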

The IDE I use here is Spyder, but you can also run the code directly in IDLE, the IDE that ships with Python.

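A crawler comes down to four steps:

1. Spoof the request headers

2. Get the address of the target site

3. Find the DOM location of the content you want to scrape

4. Loop over the matched elements and assemble the output (scraping filing information is simple, so no elaborate extraction is needed)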
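
Steps 1, 2 and 4 can be sketched without touching the network. This is a minimal illustration, not the article's own code: the helper names build_headers, build_lookup_url and pair_cells are hypothetical, and it assumes the lookup URL for other domains follows the same pattern as the www.baidu.com address used in this post.

```python
from urllib.parse import quote

# Base address taken from the article; everything else here is illustrative.
ICP_LOOKUP_BASE = "https://icp.aizhan.com/"


def build_headers():
    """Step 1: return headers carrying a browser-like User-Agent."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/76.0.3809.100 Safari/537.36"
        )
    }


def build_lookup_url(domain):
    """Step 2: build the lookup address for a domain (assumed URL pattern)."""
    return ICP_LOOKUP_BASE + quote(domain.strip().strip("/")) + "/"


def pair_cells(cells):
    """Step 4: pair alternating label/value cell texts, dropping a trailing odd cell."""
    cleaned = [cell.strip() for cell in cells]
    return list(zip(cleaned[0::2], cleaned[1::2]))


if __name__ == "__main__":
    print(build_lookup_url("www.baidu.com"))
    for label, value in pair_cells(["Label A ", " value 1", "Label B", "value 2"]):
        print(label + ":" + value)
```

Keeping the pairing logic in its own function makes it easy to check against a hand-written list of cell texts before pointing the crawler at the live page.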

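# Complete code with explanations
import requests
from bs4 import BeautifulSoup

# Spoof a browser User-Agent so the server is less likely to trigger its anti-scraping checks
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.100 Safari/537.36'
}

try:
    # Address of the target page; the request sits inside the try block so failures are caught
    res = requests.get('https://icp.aizhan.com/www.baidu.com/', headers=headers, timeout=10)
    # BeautifulSoup parses the returned HTML text
    soup = BeautifulSoup(res.text, 'lxml')
    # Locate the DOM node that holds the content to scrape
    div = soup.find('div', attrs={'id': 'icp-table'})
    td_list = div.find_all('td') if div else []
    # :nth-of-type(n) matches the n-th child of the given element type under its parent.
    # Selector copied from the browser for https://icp.aizhan.com/www.baidu.com/:
    # #icp-table > table > tbody > tr:nth-child(3) > td:nth-child(2) > span
    icp = soup.select('#icp-table > table > tbody > tr:nth-of-type(3) > td:nth-of-type(2) > span')
    if icp:
        print(icp[0].get_text())
    # Walk the cells in label/value pairs and print each one
    for i in range(0, len(td_list) - 1, 2):
        info = td_list[i].text + ":" + td_list[i + 1].text
        print(info)
        print("-" * 20)
except requests.exceptions.RequestException:
    # requests raises its own exception types, not the built-in ConnectionError
    print("Failed to connect to the site")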
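
One thing to watch with selectors copied from the browser: developer tools show a tbody element even when the raw HTML has none, and neither lxml nor Python's built-in html.parser adds one. If the selector containing tbody returns nothing, it is worth trying the same selector without it. The sketch below shows the difference on a made-up table; SAMPLE_HTML and its cell values are placeholders, not the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; note that it has no <tbody>.
SAMPLE_HTML = """
<div id="icp-table">
  <table>
    <tr><td>Label A</td><td>value 1</td></tr>
    <tr><td>Label B</td><td>value 2</td></tr>
    <tr><td>Label C</td><td><span>ICP-000000</span></td></tr>
  </table>
</div>
"""

# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector as copied from the browser, including tbody.
with_tbody = soup.select(
    "#icp-table > table > tbody > tr:nth-of-type(3) > td:nth-of-type(2) > span"
)

# Same selector with the tbody step removed.
without_tbody = soup.select(
    "#icp-table > table > tr:nth-of-type(3) > td:nth-of-type(2) > span"
)

print(len(with_tbody))
print(without_tbody[0].get_text())
```

Whether the live page contains a tbody in its source is not something this sketch can tell you; check the page source (not the rendered DOM) to decide which form of the selector applies.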
