天天看点

爬虫pyquery高级篇

一 节点操作

1 点睛

pyquery提供了一系列方法来对节点进行动态修改,比如为某个节点添加一个class,移除某个节点等,这些操作有时候会为提取信息带来极大的便利。

2 addClass和removeClass

2.1 代码

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >second item</a></li>
             <li class="item-0 active"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fourth item</a></li>
             <li class="item-0"><a href="link5.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fifth item</a></li>
         </ul>
     </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)
           

2.2 结果

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0 active"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>
             
<li class="item-0"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>
             
<li class="item-0 active"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>
           

2.3 说明

首先选中了第三个li节点,然后调用removeClass()方法,将li节点的active这个class移除,后来又调用addClass()方法,将class添加回来。每执行一次操作,就打印输出当前li节点的内容。

可以看到,一共输出了3次。第二次输出时,li节点的active这个class被移除了,第三次class又添加回来了。

所以说,addClass()和removeClass()这些方法可以动态改变节点的class属性。

3 attr、text和html

3.1 点睛

可以用attr()方法对属性进行操作。此外,还可以用text()和html()方法来改变节点内部的内容。

3.2 代码

html = '''
<ul class="list">
     <li class="item-0 active"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>
</ul>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# 首先选中li节点   
li = doc('.item-0.active')
print(li)
# 然后调用attr()方法来修改属性,其中该方法的第一个参数为属性名,第二个参数为属性值。
li.attr('name', 'link')
print(li)
# 接着,调用text()和html()方法来改变节点内部的内容。
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)
           

3.3 结果

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0 active"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link">changed item</li>

<li class="item-0 active" name="link"><span>changed item</span></li>
           

3.4 说明

可以发现,调用attr()方法后,li节点多了一个原本不存在的属性name,其值为link。接着调用text()方法,传入文本之后,li节点内部的文本全被改为传入的字符串文本了。最后,调用html()方法传入HTML文本后,li节点内部又变为传入的HTML文本了。

所以说,如果attr()方法只传入第一个参数的属性名,则是获取这个属性值;如果传入第二个参数,可以用来修改属性值。text()和html()方法如果不传参数,则是获取节点内纯文本和HTML文本;如果传入参数,则进行赋值。

4 remove()

4.1 代码

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
# 需求:提取Hello, World
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
# remove去掉p节点
wrap.find('p').remove()
print(wrap.text())
           

4.2 结果

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
Hello, World
           

4.3 说明

首先选中p节点,然后调用了remove()方法将其移除,然后这时wrap内部就只剩下Hello, World这句话了,然后再利用text()方法提取即可。

二 伪类选择器

1 点睛

CSS选择器之所以强大,还有一个很重要的原因,那就是它支持多种多样的伪类选择器,例如选择第一个节点、最后一个节点、奇偶数节点、包含某一文本的节点等。

2 代码

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >second item</a></li>
             <li class="item-0 active"><a href="link3.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" ><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fourth item</a></li>
             <li class="item-0"><a href="link5.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fifth item</a></li>
         </ul>
     </div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
           

3 结果

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0">first item</li>
             
<li class="item-0"><a href="link5.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fifth item</a></li>
         
<li class="item-1"><a href="link2.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >second item</a></li>
             
<li class="item-1 active"><a href="link4.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fourth item</a></li>
             <li class="item-0"><a href="link5.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fifth item</a></li>
         
<li class="item-1"><a href="link2.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >second item</a></li>
             <li class="item-1 active"><a href="link4.html" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >fourth item</a></li>
           

4 说明

这里我们使用了CSS3的伪类选择器,依次选择了第一个li节点、最后一个li节点、第二个li节点、第三个li之后的li节点、偶数位置的li节点。