Understanding Web Crawlers: Three Ways to Extract HTML Elements with beautifulsoup4

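The page downloader from the previous step gives us the page source as one long string. The next step is to extract information from that source with a page parser. beautifulsoup4 is a third-party library that can parse both HTML and XML. It turns the HTML string fetched by the downloader into a BeautifulSoup object, and we then pull the content we need out of that object by HTML tag, attribute, and similar criteria.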

1. Prepare the page source fetched by the downloader

# Start with the page source that the downloader has already fetched.
# Here we simply reuse the example from the official documentation.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

2. Import beautifulsoup4 and create the parse object

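# Import the beautifulsoup4 library, which does the parsing
from bs4 import BeautifulSoup

'''
Create the BeautifulSoup object. html_doc is the string to parse and
'html.parser' names the parser to use. Other parsers are also available,
such as html5lib and lxml, and each has its own strengths.
'''
beau_soup = BeautifulSoup(html_doc, 'html.parser')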
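
Since the parser is just a name passed to the constructor, one way to choose between parsers is to try a preferred one and fall back to the built-in `html.parser` when it is not installed. The sketch below is only an illustration: `make_soup` and `sample_html` are hypothetical names that do not appear in the tutorial, and preferring lxml is an assumption, not a recommendation from the original text.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only for this illustration.
sample_html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(sample_html)
page_title = soup.title.string
print(page_title)
```

bs4 raises `FeatureNotFound` when the requested parser is unavailable, so the fallback only triggers in that case and other errors still surface normally.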

3. Get elements and attributes by navigating the document structure

'''
Get elements or attributes through the document structure
'''
# Get the title element, that is, the <title> tag
print(beau_soup.title)
# <title>The Dormouse's story</title>

# Get the first p element
print(beau_soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# Get the class attribute of the first p element
print(beau_soup.p['class'])
# ['title']

# Get the b element inside the first p element
print(beau_soup.p.b)
# <b>The Dormouse's story</b>

# Get the source of the first p element's parent node
print(beau_soup.p.parent)
'''
Output (blank lines omitted):
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
'''
# Get the name of the first p element's parent node
print(beau_soup.p.parent.name)
# body
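
Attribute-style navigation only ever returns the first matching tag, and it yields `None` when no such tag exists. A minimal sketch of that behaviour follows, using a small hypothetical snippet (`html` and the variable names are made up for this illustration, not taken from the tutorial):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    '<html><body><p class="title"><b>Hello</b></p>'
    '<p class="story">World</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                    # only the first <p> is returned
first_p_classes = first_p["class"]  # class is multi-valued, so this is a list
bold_text = soup.p.b.string         # the text inside <b>
parent_name = first_p.parent.name   # name of the enclosing tag

missing_tag = soup.table            # no <table> in the document -> None
missing_attr = first_p.get("id")    # .get() avoids a KeyError for absent attributes

print(first_p_classes, bold_text, parent_name, missing_tag, missing_attr)
```

Because a missing tag comes back as `None`, chaining such as `soup.table.tr` would raise an `AttributeError`, so it can be worth checking intermediate results on pages whose structure is not guaranteed.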

4. Get elements and attributes by searching

'''
When an element is hard to reach by navigating the structure, you can
instead search the source and filter by tag, attribute, and so on.
The find() and find_all() functions search the tags in the source and
accept several conditions at once.
'''
'''
find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs)
Returns a list of results
(newer bs4 releases name the text argument string)
'''
import re

# Search for all p elements and return them as a list
print(beau_soup.find_all('p'))
# Search for all a elements and return them as a list
links = beau_soup.find_all('a')
for link in links:
    print('Link not yet crawled:', link['href'])

# Multiple conditions: p elements whose class attribute is "title"
print(beau_soup.find_all('p', class_='title'))
'''
 find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs)
Returns a single result; if several elements match, the first one is
returned. Compared with find_all(), it has no limit parameter.
'''

# Search by id
print(beau_soup.find(id='link3'))

# Multiple conditions: the first p element whose class attribute is "title"
print(beau_soup.find('p', class_='title'))

# Multiple conditions: the first a element whose href attribute contains "lacie"
print(beau_soup.find('a', href=re.compile(r"lacie")))
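
The search parameters described above can be combined in a few more ways. The following is a small sketch on a cut-down, hypothetical version of the sample markup (`html` and the result variable names are assumptions for illustration); it shows `limit`, an `attrs` dictionary, a regular expression filter, and what `find()` returns when nothing matches.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the sample document.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect every href from the matching <a> tags.
hrefs = [a["href"] for a in soup.find_all("a")]

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# attrs takes a dict, which helps when several attributes must match.
by_attrs = soup.find_all("a", attrs={"class": "sister", "id": "link2"})

# A compiled regular expression is matched against the attribute value.
regex_match = soup.find("a", href=re.compile(r"lacie"))

# find() returns None when nothing matches; find_all() returns an empty list.
not_found = soup.find("a", id="link9")

print(hrefs)
print(len(first_two), by_attrs[0].get_text(), regex_match["id"], not_found)
```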

5. Get elements and attributes with CSS selectors

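'''
Besides structural navigation and element/attribute search, bs4 also
provides the select() function, which finds elements with CSS selectors.
This function also returns a list.
'''
print(beau_soup.select('html head title'))
# In CSS, "html head title" means the title tag under the head tag under the html tag

print(beau_soup.select('#link3'))
# "#link3" selects the element whose id is link3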
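
A few more selector forms may be useful alongside `select()`. This is a sketch on hypothetical markup (the `html` string and variable names are made up here); it also uses `select_one()`, which returns the first match or `None` instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the sample document.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# "p.story > a.sister": <a class="sister"> tags that are direct children of <p class="story">.
story_links = soup.select("p.story > a.sister")
link_ids = [a["id"] for a in story_links]

# Attribute selector: href values ending in "tillie".
tillie = soup.select_one('a[href$="tillie"]')

# select_one() returns None when the selector matches nothing.
missing = soup.select_one("#link9")

print(link_ids, tillie.get_text(), missing)
```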
