Web Scraping: Advanced pyquery

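I Node Operations

1 Key Point

pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can sometimes make extracting information much easier.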

2 addClass and removeClass

2.1 Code

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
# Select the third li node (the one with both the item-0 and active classes)
li = doc('.item-0.active')
print(li)
# Remove the active class from the node in place
li.removeClass('active')
print(li)
# Add the active class back
li.addClass('active')
print(li)

2.2 Result

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

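2.3 Explanation

The code first selects the third li node, then calls removeClass() to remove the active class from it, and finally calls addClass() to add the class back. After each step it prints the current li node.

The node is printed three times. In the second output the active class is gone from the li node, and in the third output it is back.

In short, addClass() and removeClass() change a node's class attribute dynamically.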

3 attr, text, and html

3.1 Key Point

The attr() method operates on attributes. In addition, the text() and html() methods change the content inside a node.

3.2 Code

from pyquery import PyQuery as pq

html = '''
<ul class="list">
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
doc = pq(html)
# First, select the li node
li = doc('.item-0.active')
print(li)
# Then call attr() to modify an attribute: the first argument is the
# attribute name, the second is the attribute value
li.attr('name', 'link')
print(li)
# Next, call text() and html() to change the content inside the node
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)

3.3 Result

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link">changed item</li>

<li class="item-0 active" name="link"><span>changed item</span></li>

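3.4 Explanation

After the attr() call, the li node gains a name attribute that did not exist before, with the value link. The text() call then replaces everything inside the li node with the string passed in. Finally, the html() call replaces the content of the li node with the HTML passed in.

In other words, attr() called with only an attribute name reads that attribute's value, and called with a second argument it sets the value. text() and html() called without arguments return the node's plain text and inner HTML respectively, and called with an argument they assign new content.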

4 remove()

4.1 Code

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
# Goal: extract only the text "Hello, World"
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
# Remove the p node so that only the wrapper's own text remains
wrap.find('p').remove()
print(wrap.text())
           

4.2 Result

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
Hello, World
           

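4.3 Explanation

The code first selects the p node and calls remove() to delete it. At that point wrap contains only the text Hello, World, which text() then extracts. Without the remove() call, wrap.text() would also include the text of the paragraph.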

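II Pseudo-class Selectors

1 Key Point

A major reason CSS selectors are so powerful is their support for a wide variety of pseudo-class selectors, for example selecting the first node, the last node, odd or even nodes, or nodes that contain a given text.

2 Code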

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
# First li node
li = doc('li:first-child')
print(li)
# Last li node
li = doc('li:last-child')
print(li)
# Second li node
li = doc('li:nth-child(2)')
print(li)
# li nodes whose zero-based index is greater than 2
li = doc('li:gt(2)')
print(li)
# li nodes at even positions
li = doc('li:nth-child(2n)')
print(li)

3 Result

E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0">first item</li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>

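4 Explanation

Here we use pseudo-class selectors to select, in order: the first li node, the last li node, the second li node, the li nodes after the third one, and the li nodes at even positions. Note that :first-child, :last-child, and :nth-child() are standard CSS3 pseudo-classes, whereas :gt(2) is a jQuery-style extension supported by pyquery. It uses a zero-based index, so it matches the fourth and fifth li nodes.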