
Python Web Scraping: Common XPath Usage Examples with Scrapy Selectors

# -*- coding: utf-8 -*-
from scrapy.selector import Selector

html = """
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
"""

sel = Selector(text=html)

print("================title===============")

title_by_xpath = sel.xpath("//title//text()").extract_first()
print(title_by_xpath)

title_by_css = sel.css("title::text").extract_first()
print(title_by_css)
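
# Note: on recent Scrapy/parsel versions, .get() and .getall() are the
# preferred spellings of extract_first() and extract(); the older names
# used above still work. A minimal sketch:
title_get = sel.xpath("//title/text()").get()
print(title_get)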


print("================href===============")

hrefs = sel.xpath("//a/@href").extract()
print(hrefs)

hrefs_by_css = sel.css("a::attr(href)").extract()
print(hrefs_by_css)
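
# Assumption: a reasonably recent parsel/Scrapy where .attrib is available;
# it exposes the attributes of the first matched element as a dict.
first_href = sel.css("a").attrib["href"]
print(first_href)  # image1.html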

print("================img===============")

imgs = sel.xpath("//a[contains(@href, 'image')]/@href").extract()
print(imgs)

imgs_by_css = sel.css("a[href*=image]::attr(href)").extract()
print(imgs_by_css)
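
# An XPath alternative to contains(): starts-with() keeps only hrefs that
# begin with "image" (equivalent here, since every href starts with it).
imgs_sw = sel.xpath("//a[starts-with(@href, 'image')]/@href").extract()
print(imgs_sw)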

print("================src===============")

src = sel.xpath("//a[contains(@href, 'image')]/img/@src").extract()
print(src)

src_by_css = sel.css("a[href*=image] img::attr(src)").extract()
print(src_by_css)
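
# Iterating over the matched <a> selectors pairs each href with its
# thumbnail src (a small sketch, not part of the original example).
for a_sel in sel.css("a[href*=image]"):
    print(a_sel.xpath("@href").extract_first(),
          a_sel.xpath("img/@src").extract_first())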

print("================ re ===============")

text_by_re = sel.css("a[href*=image]::text").re(r"Name:\s*(.*)")
print(text_by_re)
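
# re_first() returns only the first regex match instead of a list.
first_name = sel.css("a[href*=image]::text").re_first(r"Name:\s*(.*)")
print(first_name)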

print("================ xpath ===============")

div = sel.xpath("//div")  # select the <div> node; the result is a list of selectors
print(div)

a = div.xpath(".//a").extract()  # relative XPath: all <a> elements under the current node
print(a)
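
# Caution: inside a nested selector, an XPath that starts with // still
# searches the whole document, not just the current node.
a_doc = div.xpath("//a").extract()   # searches the entire document
a_rel = div.xpath(".//a").extract()  # searches only under this <div>
print(len(a_doc), len(a_rel))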

print("================ text ===============")

text = '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
sel1 = Selector(text=text)

# text nodes directly under <a>
a = sel1.xpath("//a/text()").extract()
print(a)

# all text under <a>, including the text inside <strong>
a = sel1.xpath("//a//text()").extract()
print(a)

# string() concatenates all the text content into a single string
a = sel1.xpath("string(//a)").extract()
print(a)

a = sel1.xpath("string(.)").extract()
print(a)
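
# string(.) yields one concatenated string; joining the individual text
# nodes produces the same result.
joined = "".join(sel1.xpath("//a//text()").extract())
print(joined)  # Click here to go to the Next Page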

# shorthand helper (recommended)
xp = lambda x: sel.xpath(x).extract()

all_a = xp("//a/text()")
print(all_a)
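
# The same kind of shorthand works for CSS selectors (the helper name
# "cs" is made up here).
cs = lambda x: sel.css(x).extract()
print(cs("a::text"))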