Python Scrapy: Extracting Data with Selector, XPath, and CSS Selectors

The core technique for extracting data from a page is parsing its HTML text. In Python, the following modules are commonly used for this:

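  • BeautifulSoup: a very popular HTML parsing library with a simple, easy-to-use API, but relatively slow parsing.
  • lxml: an XML parsing library written in C (built on libxml2); it parses faster, but its API is comparatively complex.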

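Scrapy combines the strengths of both in its Selector class, which is built on the lxml library and exposes a simplified API. In Scrapy, data is extracted from a page with Selector objects: first select the data you need with an XPath or CSS selector, then extract it. The rest of this section shows how to use the Selector object.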

1. Selector objects

1.1 Creating the object

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Scrapy Study</title>
</head>
<body>
    <h1>Hello World</h1>
    <h2>ayouleyang</h2>
    <b>yangyou</b>
    <ul>
        <li>Python</li>
        <li>Scrapy</li>
        <li>html</li>
    </ul>
    <!-- closing tags omitted -->
'''
           

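Construct a Selector object from a Response object by passing it to the response parameter of the Selector constructor. The first argument of HtmlResponse is the page URL (a placeholder here); the HTML text goes in body:

>>> result = HtmlResponse(url='http://www.example.com', body=html, encoding='utf-8')
>>> selector = Selector(response = result)
>>> print(selector)
<Selector xpath=None data='<html lang="en">\n<head>\n    <title>Scrap'>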
>>> 
           

1.2 Selecting data

Call the xpath or css method of a Selector object to select one or more parts of the document:

>>> selector_h1 = selector.xpath('//h1')
>>> print (selector_h1)
[<Selector xpath='//h1' data='<h1>Hello World</h1>'>]
           
>>> selector_li = selector.xpath('//li')
>>> print (selector_li)
[<Selector xpath='//li' data='<li>Python</li>'>, 
<Selector xpath='//li' data='<li>Scrapy</li>'>, 
<Selector xpath='//li' data='<li>html</li>'>]
>>> 
           

The xpath and css methods return a SelectorList object. SelectorList supports the list interface, so its items can be visited with a for loop:

>>> for li in selector_li:
		print (li.xpath('./text()'))

	
[<Selector xpath='./text()' data='Python'>]
[<Selector xpath='./text()' data='Scrapy'>]
[<Selector xpath='./text()' data='html'>]
>>> 
           

SelectorList objects also have xpath and css methods, so calls can be chained:

>>> lis = selector.xpath('.//ul').css('li').xpath('./text()')
>>> print (lis)
[<Selector xpath='./text()' data='Python'>, 
<Selector xpath='./text()' data='Scrapy'>, 
<Selector xpath='./text()' data='html'>]
>>> 
           

1.3 Extracting data

Call the following methods on a Selector or SelectorList object to extract the selected content:

  • extract()
  • re()
  • extract_first() (SelectorList only)
  • re_first() (SelectorList only)

The extract method

>>> selector_li = selector.xpath('//li')
>>> print (selector_li)
[<Selector xpath='//li' data='<li>Python</li>'>, 
<Selector xpath='//li' data='<li>Scrapy</li>'>, 
<Selector xpath='//li' data='<li>html</li>'>]
>>> 
>>> print (selector_li[0].extract())
<li>Python</li>
>>> 


>>> li = selector.xpath('.//li/text()')
>>> print (li)
[<Selector xpath='.//li/text()' data='Python'>, 
<Selector xpath='.//li/text()' data='Scrapy'>,
<Selector xpath='.//li/text()' data='html'>]
>>> 
>>> print (li.extract())
['Python', 'Scrapy', 'html']
>>> 
>>> print (li[0].extract())
Python
>>> 
>>> print (li[1].extract())
Scrapy
           

Extracting the title text:

>>> title = selector.xpath('.//title/text()')
>>> print (title)
[<Selector xpath='.//title/text()' data='Scrapy Study'>]

>>> print (title.extract())
['Scrapy Study']

>>> print (title[0].extract())
Scrapy Study
>>> 
           

Extracting specific content from ul > li:

>>> html = '''
    <ul>
        <li>Python编程<b>价格:32.00元</b></li>
        <li>精通Scrapy<b>价格:12.00元</b></li>
        <li>html知识<b>价格:52.00元</b></li>
    </ul>
'''

>>> selector  = Selector(text=html)
>>> li = selector.xpath('.//ul/li/text()')
>>> print (li)
[<Selector xpath='.//ul/li/text()' data='Python编程'>,
 <Selector xpath='.//ul/li/text()' data='精通Scrapy'>,
 <Selector xpath='.//ul/li/text()' data='html知识'>]

>>> li = selector.xpath('.//ul/li/text()').extract()
>>> print (li)
['Python编程', '精通Scrapy', 'html知识']

>>> li = selector.xpath('.//ul/li/b/text()').extract()
>>> print (li)
['价格:32.00元', '价格:12.00元', '价格:52.00元']

>>> li = selector.xpath('.//ul/li/b/text()').re(r'\d+\.\d+')	# extract only the numbers
>>> print (li)
['32.00', '12.00', '52.00']
>>> 
>>> li = selector.xpath('.//ul/li/b/text()').re_first(r'\d+\.\d+')
>>> print(li)
32.00
>>> 
>>> li = selector.xpath('.//ul/li[2]/b/text()').re(r'\d+\.\d+')	# li[2] selects the second li element
>>> print (li[0])	# [0] takes the first element of the list
12.00
12.00
>>> 
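The values returned by re() and re_first() are still strings. The sketch below uses only the standard re module to show what the pattern matches and how the result might be converted to a number; the price_texts list stands in for the extract() output above, and parse_price is a hypothetical helper, not part of Scrapy.

```python
import re

# Stand-in for the result of .extract() on the <b> text nodes above.
price_texts = ['价格:32.00元', '价格:12.00元', '价格:52.00元']

# Same pattern as in the re() calls: digits, a dot, digits.
PRICE_PATTERN = re.compile(r'\d+\.\d+')

def parse_price(text):
    """Return the first decimal number found in text as a float, or None."""
    match = PRICE_PATTERN.search(text)
    return float(match.group()) if match else None

prices = [parse_price(text) for text in price_texts]
print(prices)
```

Using a raw string (the r prefix) for the pattern avoids Python treating \d as a string escape.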
           

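2. XPath

XPath (XML Path Language) is a language for addressing parts of an XML document. An XML document is a tree made up of nodes, and an HTML document can be treated as the same kind of tree. For sample XML documents, see the XPath syntax tutorial on 菜鸟教程 (Runoob).

2.1 Basic syntax

First create an HTML document; the examples that follow use it to show what XPath can do.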

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Xpath study</title>
</head>
<body>
<div id="images">
    <a href="image1.html">Name:图片1<br><img src="image1.jpg"></a>
    <a href="image2.html">Name:图片2<br><img src="image2.jpg"></a>
    <a href="image3.html">Name:图片3<br><img src="image3.jpg"></a>
    <a href="image4.html">Name:图片4<br><img src="image4.jpg"></a>
    <a href="image5.html">Name:图片5<br><img src="image5.jpg"></a>
</div>
</body>
</html>
'''
>>> response = HtmlResponse(url='http://www.example.com', body=html, encoding='utf-8')
           
  • /: describes an absolute path starting from the root node
>>> response.xpath('/html')
[<Selector xpath='/html' data='<html lang="en">\n<head>\n    <title>Xpath'>]
>>> 
           
  • E1/E2: selects all E2 among the child nodes of E1
>>> response.xpath('/html/body/div/a')
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name:图片1<br><img s'>,
 <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name:图片2<br><img s'>, 
 <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name:图片3<br><img s'>,
  <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name:图片4<br><img s'>, 
  <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name:图片5<br><img s'>]
>>> 
           
  • //E: selects every E in the document, wherever it is
>>> name = response.xpath('.//a/text()')
>>> name
[<Selector xpath='.//a/text()' data='Name:图片1'>, 
<Selector xpath='.//a/text()' data='Name:图片2'>, 
<Selector xpath='.//a/text()' data='Name:图片3'>, 
<Selector xpath='.//a/text()' data='Name:图片4'>, 
<Selector xpath='.//a/text()' data='Name:图片5'>]
>>> 
>>> name.extract()
['Name:图片1', 'Name:图片2', 'Name:图片3', 'Name:图片4', 'Name:图片5']
>>> 
           
  • E/text(): selects the text child nodes of E
>>> name = response.xpath('.//a/text()')
>>> name
[<Selector xpath='.//a/text()' data='Name:图片1'>, 
<Selector xpath='.//a/text()' data='Name:图片2'>, 
<Selector xpath='.//a/text()' data='Name:图片3'>,
 <Selector xpath='.//a/text()' data='Name:图片4'>, 
 <Selector xpath='.//a/text()' data='Name:图片5'>]
>>> 
>>> name.extract()
['Name:图片1', 'Name:图片2', 'Name:图片3', 'Name:图片4', 'Name:图片5']
>>> 
           
  • E/*: selects all element child nodes of E
>>> response.xpath('/html/body/div/a/*')
[<Selector xpath='/html/body/div/a/*' data='<br>'>, <Selector xpath='/html/body/div/a/*' data='<img src="image1.jpg">'>, <Selector xpath='/html/body/div/a/*' data='<br>'>, <Selector xpath='/html/body/div/a/*' data='<img src="image2.jpg">'>, <Selector xpath='/html/body/div/a/*' data='<br>'>, <Selector xpath='/html/body/div/a/*' data='<img src="image3.jpg">'>, <Selector xpath='/html/body/div/a/*' data='<br>'>, <Selector xpath='/html/body/div/a/*' data='<img src="image4.jpg">'>, <Selector xpath='/html/body/div/a/*' data='<br>'>, <Selector xpath='/html/body/div/a/*' data='<img src="image5.jpg">'>]
           
  • */E: selects all E among the grandchild nodes
# select all img among the grandchildren of div
>>> response.xpath('//div/*/img')
[<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>, 
<Selector xpath='//div/*/img' data='<img src="image2.jpg">'>, 
<Selector xpath='//div/*/img' data='<img src="image3.jpg">'>, 
<Selector xpath='//div/*/img' data='<img src="image4.jpg">'>, 
<Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]
>>> 
           
  • E/@ATTR: selects the ATTR attribute of E
# select the src attribute of every img
>>> response.xpath('//img/@src')
[<Selector xpath='//img/@src' data='image1.jpg'>,
 <Selector xpath='//img/@src' data='image2.jpg'>, 
 <Selector xpath='//img/@src' data='image3.jpg'>,
 <Selector xpath='//img/@src' data='image4.jpg'>,
 <Selector xpath='//img/@src' data='image5.jpg'>]
>>> 
>>> response.xpath('//img/@src').extract()
['image1.jpg', 'image2.jpg', 'image3.jpg', 'image4.jpg', 'image5.jpg']
>>> 
           
  • //@ATTR: selects every ATTR attribute in the document
# select every href attribute
>>> response.xpath('//@href')
[<Selector xpath='//@href' data='image1.html'>, 
<Selector xpath='//@href' data='image2.html'>, 
<Selector xpath='//@href' data='image3.html'>, 
<Selector xpath='//@href' data='image4.html'>, 
<Selector xpath='//@href' data='image5.html'>]
>>> 
           
  • E/@*: selects all attributes of E
# get all attributes of the img under the first a (here there is only src)
>>> response.xpath('//a[1]/img/@*')
[<Selector xpath='//a[1]/img/@*' data='image1.jpg'>]
>>> 
           
  • .: selects the current node; used to describe relative paths
# get the selector object for the first a
>>> img = response.xpath('//a')[0]
>>> img
<Selector xpath='//a' data='<a href="image1.html">Name:图片1<br><img s'>
>>> 
>>> 
# The intent is to find the img inside the first a, but this returns every img: // is an absolute path and searches from the document root
>>> img.xpath('//img')
[<Selector xpath='//img' data='<img src="image1.jpg">'>, 
<Selector xpath='//img' data='<img src="image2.jpg">'>,
<Selector xpath='//img' data='<img src="image3.jpg">'>, 
<Selector xpath='//img' data='<img src="image4.jpg">'>, 
<Selector xpath='//img' data='<img src="image5.jpg">'>]
>>> 
>>> 
# use .// to describe all img among the descendants of the current node
>>> img.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]
>>> 
           
  • ..: selects the parent of the current node
>>> response.xpath('//img/..')
[<Selector xpath='//img/..' data='<a href="image1.html">Name:图片1<br><img s'>,
 <Selector xpath='//img/..' data='<a href="image2.html">Name:图片2<br><img s'>,
 <Selector xpath='//img/..' data='<a href="image3.html">Name:图片3<br><img s'>,
 <Selector xpath='//img/..' data='<a href="image4.html">Name:图片4<br><img s'>,
 <Selector xpath='//img/..' data='<a href="image5.html">Name:图片5<br><img s'>] 
           
  • node[predicate]: a predicate picks out a specific node, or nodes that contain a specific value
# select the 3rd a element
>>> response.xpath('//a[3]')
[<Selector xpath='//a[3]' data='<a href="image3.html">Name:图片3<br><img s'>]
>>> 
>>> 
# use the last() function to select the last one
>>> response.xpath('//a[last()]')
[<Selector xpath='//a[last()]' data='<a href="image5.html">Name:图片5<br><img s'>]
>>> 
>>>
# use the position() function to select the first 3
>>> response.xpath('//a[position()<=3]')
[<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name:图片1<br><img s'>,
<Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name:图片2<br><img s'>, 
<Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name:图片3<br><img s'>]
>>> 
>>> 
# select every div that has an id attribute
>>> response.xpath('//div[@id]')
[<Selector xpath='//div[@id]' data='<div id="images">\n    <a href="image1.ht'>]
>>> 
>>> 
# select every div whose id attribute equals images
>>> response.xpath('//div[@id="images"]')
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n    <a href="image1.ht'>]
           

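2.2 Common functions

XPath also provides many functions, covering numbers, strings, dates and times, aggregation, and more.

  • string(arg): returns the string value of the argument.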
>>> from scrapy.selector import Selector
>>> html = '<a href="https://blog.csdn.net/ayouleyang/"><b>阿优乐扬</b>的博客</a>'

>>> sel = Selector(text=html)
>>> sel
<Selector xpath=None data='<html><body><a href="https://blog.csdn.n'>

>>> sel.xpath('/html/body/a/text()')
[<Selector xpath='/html/body/a/text()' data='的博客'>]

>>> sel.xpath('/html/body/a/b/text()')
[<Selector xpath='/html/body/a/b/text()' data='阿优乐扬'>]

# To get the whole string inside a (阿优乐扬的博客) in one piece, text() alone is not enough
>>> sel.xpath('/html/body/a//text()').extract()
['阿优乐扬', '的博客']
>>> 
# in this case the string() function can be used
>>> sel.xpath('string(/html/body/a)').extract()
['阿优乐扬的博客']
>>> 
           
  • contains(str1, str2): tests whether str1 contains str2 and returns a boolean
>>> from scrapy.selector import Selector
>>> html = '''
<div>
	<p class="Nic name">阿优乐扬</p>
	<p class="English name">Youle</p>
</div>
'''
>>> sel = Selector(text=html)

# select p elements whose class attribute contains Nic
>>> sel.xpath('//p[contains(@class,"Nic")]')
[<Selector xpath='//p[contains(@class,"Nic")]' data='<p class="Nic name">阿优乐扬</p>'>]

# select p elements whose class attribute contains name
>>> sel.xpath('//p[contains(@class,"name")]')
[<Selector xpath='//p[contains(@class,"name")]' data='<p class="Nic name">阿优乐扬</p>'>, 
<Selector xpath='//p[contains(@class,"name")]' data='<p class="English name">Youle</p>'>]
>>> 

           

3. CSS selectors

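CSS stands for Cascading Style Sheets; its selectors are a language for locating parts of an HTML document. CSS selector syntax is somewhat simpler than XPath, but less powerful. In fact, when the css method of a Selector object is called, the CSS selector expression is translated internally into an XPath expression by the Python library cssselect, and the Selector object's xpath method is then called.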

First create an HTML document and construct an HtmlResponse object from it.

from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <title>CSS选择器 study</title>
</head>
<body>
<div id="images1" style="width:512px;" >
    <a href="image1.html">Name:图片1<br><img src="image1.jpg"></a>
    <a href="image2.html">Name:图片2<br><img src="image2.jpg"></a>
    <a href="image3.html">Name:图片3<br><img src="image3.jpg"></a>
</div>
<div id="images2" class="pictrue" >
    <a href="image4.html">Name:图片4<br><img src="image4.jpg"></a>
    <a href="image5.html">Name:图片5<br><img src="image5.jpg"></a>
</div>
</body>
</html>
'''
response = HtmlResponse(url='http://www.example.com', body=html, encoding='utf-8')
           
  • E: selects E elements
# select every img
>>> response.css('img')
[<Selector xpath='descendant-or-self::img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image2.jpg">'>, 
 <Selector xpath='descendant-or-self::img' data='<img src="image3.jpg">'>, 
 <Selector xpath='descendant-or-self::img' data='<img src="image4.jpg">'>, 
 <Selector xpath='descendant-or-self::img' data='<img src="image5.jpg">'>]
>>> 
           
  • E1,E2: selects both E1 and E2 elements
# select every title and div
>>> response.css('title,div')
[<Selector xpath='descendant-or-self::title | descendant-or-self::div' data='<title>CSS选择器 study</title>'>, 
<Selector xpath='descendant-or-self::title | descendant-or-self::div' data='<div id="images1" style="width:512px;">\n'>, 
<Selector xpath='descendant-or-self::title | descendant-or-self::div' data='<div id="images2" class="pictrue">\n    <'>]
>>> 
           
  • E1 E2: selects E2 elements among the descendants of E1
# img among the descendants of div
>>> response.css('div img')
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image1.jpg">'>, 
<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image2.jpg">'>, 
<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image3.jpg">'>, 
<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image4.jpg">'>, 
<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image5.jpg">'>]
>>> 
           
  • E1>E2: selects E2 elements among the children of E1
>>> response.css('body>div')
[<Selector xpath='descendant-or-self::body/div' data='<div id="images1" style="width:512px;">\n'>, 
<Selector xpath='descendant-or-self::body/div' data='<div id="images2" class="pictrue">\n    <'>]
           
  • [ATTR]: selects elements that have an ATTR attribute
>>> response.css('[style]')
[<Selector xpath='descendant-or-self::*[@style]' data='<div id="images1" style="width:512px;">\n'>]
>>> 
           
  • [ATTR=VALUE]: selects elements that have an ATTR attribute whose value is VALUE
>>> response.css('[id="images1"]')
[<Selector xpath="descendant-or-self::*[@id = 'images1']" data='<div id="images1" style="width:512px;">\n'>]
>>> 
           
  • E:nth-child(n): selects E elements that are the n-th child of their parent
# select the first a in each div
>>> response.css('div>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a href="image1.html">Name:图片1<br><img s'>, 
<Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name:图片4<br><img s'>]
>>> 


# select the first a in the second div
>>> response.css('div:nth-child(2)>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 1]/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name:图片4<br><img s'>]
>>> 
           
  • E:first-child: selects E elements that are the first child of their parent
  • E:last-child: selects E elements that are the last child of their parent
# select the last a in the first div
>>> response.css('div:first-child>a:last-child')
[<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 0]/a[count(following-sibling::*) = 0]' data='<a href="image3.html">Name:图片3<br><img s'>]
>>> 
           
  • E::text: selects the text nodes of E elements
# select the text of every a

>>> response.css('a::text')
[<Selector xpath='descendant-or-self::a/text()' data='Name:图片1'>, 
<Selector xpath='descendant-or-self::a/text()' data='Name:图片2'>, 
<Selector xpath='descendant-or-self::a/text()' data='Name:图片3'>, 
<Selector xpath='descendant-or-self::a/text()' data='Name:图片4'>, 
<Selector xpath='descendant-or-self::a/text()' data='Name:图片5'>]

>>> response.css('a::text').extract()
['Name:图片1', 'Name:图片2', 'Name:图片3', 'Name:图片4', 'Name:图片5']
>>> 
           
The material above draws on the book 《精通Scrapy网络爬虫》 by 刘硕.