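I. Node operations
1 Key point
pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can make extracting information much easier.
2 addClass and removeClass
2.1 Code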
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# Select the third li node (the one with both item-0 and active)
li = doc('.item-0.active')
print(li)
# Remove the active class
li.removeClass('active')
print(li)
# Add the active class back
li.addClass('active')
print(li)
2.2 Result
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
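2.3 Explanation
The code first selects the third li node, then calls removeClass() to remove the active class from it, and then calls addClass() to add the class back. The current content of the li node is printed after each step.
There are three outputs in total. In the second, the active class has been removed from the li node; in the third, it has been added back.
In short, addClass() and removeClass() change a node's class attribute dynamically.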
3 attr, text and html
3.1 Key point
The attr() method operates on attributes. In addition, the text() and html() methods change the content inside a node.
3.2 Code
html = '''
<ul class="list">
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# First, select the li node
li = doc('.item-0.active')
print(li)
# Then call attr() to modify an attribute: the first argument is the attribute name, the second is its value
li.attr('name', 'link')
print(li)
# Next, call text() and html() to change the content inside the node
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)
3.3 Result
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link">changed item</li>
<li class="item-0 active" name="link"><span>changed item</span></li>
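3.4 Explanation
After attr() is called, the li node gains a name attribute that did not exist before, with the value link. After text() is called with a string, everything inside the li node is replaced by that string. Finally, after html() is called with HTML text, the content inside the li node becomes that HTML.
So, if attr() is given only the attribute name, it returns that attribute's value; if a second argument is given, it sets the value. Called without arguments, text() and html() return the node's plain text and inner HTML respectively; called with an argument, they assign new content.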
4 remove()
4.1 Code
html = '''
<div class="wrap">
Hello, World
<p>This is a paragraph.</p>
</div>
'''
# Goal: extract only "Hello, World"
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
# Use remove() to delete the p node
wrap.find('p').remove()
print(wrap.text())
4.2 Result
Hello, World
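4.3 Explanation
Calling wrap.text() directly would also return the text inside the p node, because text() gathers the text of all descendant nodes. The code therefore selects the p node first and calls remove() to delete it, so that only Hello, World is left inside wrap, and then extracts it with text().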
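II. Pseudo-class selectors
1 Key point
Another important reason CSS selectors are so powerful is that they support a wide variety of pseudo-class selectors, for example selecting the first node, the last node, odd- or even-numbered nodes, or nodes that contain a given text.
2 Code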
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# The first li node
li = doc('li:first-child')
print(li)
# The last li node
li = doc('li:last-child')
print(li)
# The second li node
li = doc('li:nth-child(2)')
print(li)
# The li nodes after the third one (index greater than 2, counting from 0)
li = doc('li:gt(2)')
print(li)
# The li nodes at even positions
li = doc('li:nth-child(2n)')
print(li)
3 Result
<li class="item-0">first item</li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
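4 Explanation
Here we used pseudo-class selectors to select, in turn, the first li node, the last li node, the second li node, the li nodes after the third li, and the li nodes at even positions. Note that :gt(2) is a jQuery-style extension supported by pyquery rather than a standard CSS3 pseudo-class, and it counts from 0, whereas :nth-child() counts from 1.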