Summary of problems encountered when processing web pages with XPath
<div class="name">
<div class="title">
<div class="price">
<span>
<a href="" target="_blank" rel="external nofollow">網頁欣賞</a>
</span>
</div>
</div>
</div>
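- When the document is nested several levels deep, the content cannot be matched
- text() does not return the content as one string: the result is a list of text nodes, and the first entries are only whitespace such as '\n '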
data.xpath('//div[@class="name"]//text()')
- xpath('string(.)') matches correctly: it returns all the text under the node as a single string
data.xpath('//div[@class="name"]')[0].xpath('string(.)')
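The difference between text() and string(.) can be sketched as follows. This is a minimal example, assuming the sample markup above is parsed from an inline string; the `SAMPLE` constant and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical inline copy of the sample page, so the sketch is self-contained.
SAMPLE = """
<div class="name">
  <div class="title">
    <div class="price">
      <span>
        <a href="" target="_blank" rel="external nofollow">網頁欣賞</a>
      </span>
    </div>
  </div>
</div>
"""

data = etree.HTML(SAMPLE)
div = data.xpath('//div[@class="name"]')[0]

# text() returns only the text nodes that are direct children of the div;
# here that is just the whitespace between the tags.
direct_text = div.xpath('text()')

# .//text() returns every descendant text node as a list, so the link text
# is in there, surrounded by whitespace-only entries.
all_text = div.xpath('.//text()')

# string(.) concatenates all descendant text into a single string.
joined = div.xpath('string(.)').strip()

print(direct_text)
print([t for t in all_text if t.strip()])
print(joined)
```

string(.) is convenient when one string per node is enough; the filtered .//text() list is the better fit when the individual pieces are needed separately.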
- Calling xpath again on a div object that was itself returned by xpath goes wrong
- The div object is a <class 'lxml.etree._Element'>; matching its contents with a second xpath call fails
from lxml import etree

html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.HTML(etree.tounicode(html))
# print(strings)
pp = strings.xpath('//div[@class="name"]')[0]
print(type(strings))  # <class 'lxml.etree._Element'>
print(type(pp))  # <class 'lxml.etree._Element'>
print(pp.xpath('/div[@class="title"]'))  # []
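- Why does the match fail?
- print(pp.xpath('/div[@class="title"]')) returns [] because the expression starts with '/' instead of './' or './/'. A leading '/' is an absolute path evaluated from the document root, not from pp, so the child div is never found. Note that a leading '//' would also ignore pp and search the whole document.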
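A small sketch of how the path prefix changes the result when xpath is called on an element. The inline `SAMPLE` page and the variable names are assumptions made for illustration; the markup is a compact version of the sample above.

```python
from lxml import etree

# Hypothetical compact version of the sample page.
SAMPLE = """<html><body>
<div class="name"><div class="title"><div class="price">
<span><a href="">網頁欣賞</a></span>
</div></div></div>
</body></html>"""

root = etree.HTML(SAMPLE)
pp = root.xpath('//div[@class="name"]')[0]

# Leading '/' is absolute: it starts at the document root, whose only
# child element is <html>, so no <div> can match.
absolute = pp.xpath('/div[@class="title"]')

# './' (or no prefix at all) is relative to pp: direct children only.
child = pp.xpath('./div[@class="title"]')

# './/' is relative to pp and searches all of its descendants.
descendant = pp.xpath('.//div[@class="price"]')

# '//' ignores pp and searches the whole document again,
# which is why it can even find pp itself.
anywhere = pp.xpath('//div[@class="name"]')

print(len(absolute), len(child), len(descendant), len(anywhere))  # 0 1 1 1
```

When working from an element that was already selected, starting the expression with '.' keeps the search inside that element.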
- When an HTML file is opened with lxml and serialized again, the Chinese text may not come out as readable characters
from lxml import etree

html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.tostring(html)
print(strings)
# get page like this
'''
b'<!DOCTYPE html>\n<html > \n<head> \n <meta charset="UTF-8"/> \n <title>Title</title> \n</head> \n<body> \n<div class="name"> \n <div class="title"> \n <div class="price"> \n <span> \n <a href="" target="_blank" rel="external nofollow">&#32593;&#39029;&#27427;&#36175;</a> \n </span> \n </div> \n </div> \n</div> \n</body> \n</html>'
'''

html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.tostring(html).decode()
print(strings)
'''
<!DOCTYPE html>
<html >
<head>
  <meta charset="UTF-8"/>
  <title>Title</title>
</head>
<body>
<div class="name">
  <div class="title">
    <div class="price">
      <span>
        <a href="" target="_blank" rel="external nofollow">&#32593;&#39029;&#27427;&#36175;</a>
      </span>
    </div>
  </div>
</div>
</body>
</html>
'''
But this is still not right: the Chinese text has not been decoded. By default etree.tostring() serializes to ASCII, so non-ASCII characters are written as numeric character references, and decode() only turns the bytes into a str without resolving them.
- With etree.tostring() (the two attempts above), the Chinese text is not readable
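- With etree.tounicode(), the text is serialized correctly, and readable HTML comes back without calling decode() (lxml marks tounicode() as deprecated in favor of tostring(..., encoding='unicode'))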
html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.tounicode(html)
print(strings)
'''
<!DOCTYPE html>
<html >
<head>
  <meta charset="UTF-8"/>
  <title>Title</title>
</head>
<body>
<div class="name">
  <div class="title">
    <div class="price">
      <span>
        <a href="" target="_blank" rel="external nofollow">網頁欣賞</a>
      </span>
    </div>
  </div>
</div>
</body>
</html>
'''
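The serialization options can be compared side by side. This is a sketch under the assumption that a short in-memory page stands in for ./test_xpath.html; `PAGE` and the variable names are hypothetical.

```python
from lxml import etree

# Hypothetical in-memory page standing in for ./test_xpath.html.
PAGE = '<html><body><a href="">網頁欣賞</a></body></html>'
tree = etree.HTML(PAGE)

# Default: bytes in ASCII, with non-ASCII characters written as
# numeric character references.
as_ascii = etree.tostring(tree)

# encoding='unicode' returns a str with the characters intact.
as_text = etree.tostring(tree, encoding='unicode')

# An explicit byte encoding, decoded with the same codec, also works.
as_utf8 = etree.tostring(tree, encoding='utf-8').decode('utf-8')

print(as_ascii)
print(as_text)
print(as_utf8)
```

Passing encoding='unicode' to tostring() gives the same kind of result as tounicode() and is the form the lxml documentation recommends.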
movie_div = strings.xpath("//div[contains(@class,'doulist-item')]//text()")
'\n '
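When //text() returns mostly whitespace like this, the useful pieces can be filtered out afterwards. A minimal sketch: the markup below is invented for illustration, and only the 'doulist-item' class name comes from the query above.

```python
from lxml import etree

# Hypothetical markup; only the 'doulist-item' class name is taken
# from the query above.
PAGE = """
<div class="doulist-item">
  <div class="title">
    <a href="">Example title</a>
  </div>
  <div class="rating">9.0</div>
</div>
"""

doc = etree.HTML(PAGE)
raw = doc.xpath("//div[contains(@class,'doulist-item')]//text()")

# Drop whitespace-only text nodes and trim the remaining ones.
texts = [t.strip() for t in raw if t.strip()]

# normalize-space() does the trimming inside XPath; given a node set,
# it uses the string value of the first matching node.
title = doc.xpath("normalize-space(//div[contains(@class,'doulist-item')]//a)")

print(texts)
print(title)
```

The list comprehension keeps every piece of text, while normalize-space() is handy when a single cleaned-up value is wanted.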