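I Node Operations
1 Key points
pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can make information extraction considerably easier.
2 addClass and removeClass
2.1 Code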
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)
2.2 Result
E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
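2.3 Explanation
The code first selects the third li node, then calls removeClass() to remove the active class from it, and finally calls addClass() to add the class back. After each step it prints the current li node.
There are three outputs in total. In the second output the active class has been removed from the li node; in the third it has been added back.
In short, addClass() and removeClass() change a node's class attribute dynamically.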
3 attr, text and html
3.1 Key points
The attr() method operates on attributes. In addition, the text() and html() methods change the content inside a node.
3.2 Code
html = '''
<ul class="list">
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
from pyquery import PyQuery as pq
doc = pq(html)
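# First, select the li node.
li = doc('.item-0.active')
print(li)
# Then call attr() to set an attribute: the first argument is the attribute name, the second is its value.
li.attr('name', 'link')
print(li)
# Next, call text() and html() to change the content inside the node.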
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)
3.3 Result
E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link">changed item</li>
<li class="item-0 active" name="link"><span>changed item</span></li>
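3.4 Explanation
After the call to attr(), the li node has a new attribute, name, with the value link, which did not exist before. After the call to text() with a string argument, everything inside the li node is replaced by that string. Finally, after the call to html() with an HTML string, the content inside the li node becomes that HTML.
In short, attr() called with only the attribute name returns the attribute's value; called with a second argument, it sets the value. text() and html() called without arguments return the node's plain text and inner HTML respectively; called with an argument, they assign new content.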
4 remove()
4.1 Code
html = '''
<div class="wrap">
Hello, World
<p>This is a paragraph.</p>
</div>
'''
# Goal: extract only the text "Hello, World".
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
# Use remove() to delete the p node.
wrap.find('p').remove()
print(wrap.text())
4.2 Result
E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
Hello, World
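4.3 Explanation
The code first selects the p node and calls remove() to delete it. At that point only the text Hello, World is left inside wrap, so calling text() returns exactly that text.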
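II Pseudo-class Selectors
1 Key points
Another important reason CSS selectors are so powerful is their support for a wide range of pseudo-class selectors, for example selecting the first node, the last node, odd- or even-numbered nodes, or nodes that contain a given text.
2 Code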
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
3 Result
E:\WebSpider\venv\Scripts\python.exe E:/WebSpider/4_3.py
<li class="item-0">first item</li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
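4 Explanation
Here we use pseudo-class selectors to select, in order, the first li node, the last li node, the second li node, the li nodes after the third li, and the li nodes at even positions. Most of these are CSS3 pseudo-classes; :gt() is a jQuery-style extension supported by pyquery, and its index is zero-based, so :gt(2) matches the fourth li onward.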