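Crawler Basics: Data Extraction Methods

Data extraction

Structured data: JSON, XML

Unstructured data: HTML, strings

Ways to process structured data: jsonpath, xpath, conversion to Python types, bs4

Ways to process unstructured data: regular expressions, xpath, bs4

1. Extracting data with the json module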


Understanding file-like objects:

Any object that has a read() or write() method is a file-like object. For example, after f = open("a.txt", "r"), f is a file-like object.

Usage:

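import json

mydict = {
        "name": "孫威",
        "age": 16
    }
# json.dumps converts a Python object into a JSON string
# indent adds line breaks and indentation
# ensure_ascii=False keeps Chinese characters as-is instead of \uXXXX escapes
json_str = json.dumps(mydict, indent=2, ensure_ascii=False)

# json.loads converts a JSON string into Python data types
my_dict = json.loads(json_str)

# json.dump writes a Python object to a file-like object
# encoding="utf-8" is set explicitly because ensure_ascii=False writes non-ASCII text
with open("temp.txt", "w", encoding="utf-8") as f:
    json.dump(mydict, f, ensure_ascii=False, indent=2)

# json.load parses the JSON text in a file-like object into Python types
with open("temp.txt", "r", encoding="utf-8") as f:
    my_dict = json.load(f)
# Or: my_dict = json.load(open("temp.txt", "r", encoding="utf-8")), though this leaves the file unclosed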
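
The two points above (what ensure_ascii changes, and what counts as a file-like object) can be checked with a short sketch. It assumes nothing beyond the standard library; io.StringIO stands in for a real file here only because it has read() and write(), and the variable names are illustrative.

```python
import io
import json

record = {"name": "孫威", "age": 16}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(record)
# ensure_ascii=False keeps the characters readable
readable = json.dumps(record, ensure_ascii=False)

# io.StringIO has write() and read(), so json.dump/json.load accept it like a file
buffer = io.StringIO()
json.dump(record, buffer, ensure_ascii=False, indent=2)
buffer.seek(0)  # rewind so json.load reads from the start
restored = json.load(buffer)

print(escaped)
print(readable)
print(restored == record)
```

Both strings decode back to the same dictionary; ensure_ascii only affects how the text looks on disk or on screen.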

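Extracting data with the jsonpath module

jsonpath is used to parse multi-level nested JSON data. JsonPath is an information-extraction library: a tool for pulling specified information out of a JSON document, with implementations in several languages, including JavaScript, Python, PHP, and Java.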

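jsonpath	Description
$	Root node
@	Current node
. or []	Child node
..	Recursive descent: selects all matching nodes regardless of position
*	Matches all element nodes
[]	Subscript (iterator) operator; supports simple operations such as array indexes or selecting by content
[,]	Selects multiple items inside the subscript
?()	Applies a filter expression
()	Evaluates a script expression
-	Selecting the parent node is not supported
-	Attribute access is not supported
-	Grouping is not supported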

# Syntax usage example
data = { "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}
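# data is already a Python dict, so json.loads is not needed here
# (json.loads only accepts a JSON string and would raise TypeError on a dict)
import jsonpath  # third-party package, installed separately

# Parse the data; the result is a list of every author
result_list = jsonpath.jsonpath(data, '$..author')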

Syntax usage examples:

jsonpath	Result
$.store.book[*].author	The authors of all books in store
$..author	All authors
$.store.*	All elements under store
$.store..price	The price of everything in store
$..book[2]	The third book
$..book[(@.length-1)] or $..book[-1:]	The last book
$..book[0,1] or $..book[:2]	The first two books
$..book[?(@.isbn)]	All books that have an isbn
$..book[?(@.price<10)]	All books with a price below 10
$..*	All data
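
To make the recursive descent operator (..) and the filter operator (?()) less abstract, here is a minimal pure-Python sketch of what they compute on the store data. It is not how the jsonpath package is implemented; find_all_values is a hypothetical helper written for this illustration, and the filters are expressed as ordinary list comprehensions.

```python
def find_all_values(node, key):
    """Yield every value stored under `key` at any depth, similar to `$..key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # keep descending, since nested containers may hold the key too
            yield from find_all_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_all_values(item, key)


data = {"store": {
    "book": [
        {"category": "reference", "author": "Nigel Rees",
         "title": "Sayings of the Century", "price": 8.95},
        {"category": "fiction", "author": "Evelyn Waugh",
         "title": "Sword of Honour", "price": 12.99},
        {"category": "fiction", "author": "Herman Melville",
         "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99},
        {"category": "fiction", "author": "J. R. R. Tolkien",
         "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99},
    ],
    "bicycle": {"color": "red", "price": 19.95},
}}

authors = list(find_all_values(data, "author"))         # like $..author
prices = list(find_all_values(data["store"], "price"))  # like $.store..price
books = data["store"]["book"]
cheap = [b["title"] for b in books if b["price"] < 10]  # like $..book[?(@.price<10)]
with_isbn = [b["title"] for b in books if "isbn" in b]  # like $..book[?(@.isbn)]

print(authors)
print(prices)
print(cheap)
print(with_isbn)
```

Note that the price query also picks up the bicycle, because recursive descent ignores which branch a key sits in.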

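2. Extracting data with regular expressions

3. XPath syntax

XPath (XML Path Language) is a language for finding information in HTML/XML documents; it can be used to traverse the elements and attributes of an HTML/XML document.

XPath syntax

Expression	Description
nodename	Selects the element with that name.
/	Selects from the root node, or separates one element from the next in a path.
//	Selects nodes anywhere in the document below the current node, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
text()	Selects text.

Example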

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

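Path expression	Result
bookstore	Selects the bookstore element.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, wherever they sit beneath bookstore.
//book/title/@lang	Selects the value of the lang attribute of every title under book.
//book/title/text()	Selects the text of every title under book.
//title[@lang="eng"]	Selects all title elements whose lang attribute equals eng.
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()>1]	Selects the book elements under bookstore, starting from the second one.
//book/title[text()='Harry Potter']	Selects the title elements under book whose text is exactly Harry Potter.
/bookstore/book[price>35.00]/title	Selects the title elements of every book in bookstore whose price element has a value greater than 35.00.
/bookstore/*	Selects all child elements of bookstore.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.

Note: in XPath the first element is at position 1, the last element is at position last(), and the second-to-last is at last()-1.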
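
A few rows of the table can be tried against the bookstore document without installing anything. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (positions, last(), and attribute tests, but not text(), @attr selection, or numeric comparisons), so the price filter is done in plain Python instead. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# Positions start at 1, and last() is the final position
first_title = root.find("./book[1]/title").text
last_title = root.find("./book[last()]/title").text

# Attribute test, similar to //title[@lang="eng"]
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# ElementTree cannot evaluate book[price>35.00], so compare in Python
expensive = [
    book.find("title").text
    for book in root.findall("book")
    if float(book.find("price").text) > 35.00
]

print(first_title, last_title, eng_titles, expensive)
```

With a full XPath 1.0 engine such as lxml, the last step could be written as a single expression, as in the table.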

4. The lxml module

lxml is a high-performance Python HTML/XML parser. With it we can use XPath to locate specific elements quickly and read node information.

text = ''' <div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''
# Import the etree module from lxml
from lxml import etree
# etree.HTML turns the string into an Element object, which has an xpath method
html = etree.HTML(text) 
# Get the list of href values and the list of link texts
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

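When an expression selects attributes or text, xpath returns a list of strings. When it selects nodes, it returns a list of Element objects, and each of those can call xpath again.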
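
One common use of that behaviour is to select the container nodes first and then run a relative XPath on each one, so that every href stays paired with its own text. The following is a minimal sketch of that pattern, assuming lxml is installed; the sample HTML and the names rows and items are made up for illustration.

```python
from lxml import etree

TEXT = """<div><ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul></div>"""

html = etree.HTML(TEXT)

# Step 1: selecting nodes returns Element objects
rows = html.xpath("//li[@class='item-1']")

# Step 2: each Element runs its own relative XPath (note the leading ./)
items = []
for li in rows:
    hrefs = li.xpath("./a/@href")
    texts = li.xpath("./a/text()")
    items.append({
        "href": hrefs[0] if hrefs else None,  # guard against missing values
        "text": texts[0] if texts else None,
    })

print(items)
```

Grouping by node avoids the misalignment that can occur when two separate flat lists are zipped together and one element is missing a value.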

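5. BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser. lxml only traverses locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the full DOM tree, so its time and memory overhead are much higher and its performance is lower than lxml's.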

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Import re and the bs4 library
import re
from bs4 import BeautifulSoup
# Create the Beautiful Soup object; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html, "html.parser")

# Pass a string (a tag name)
print(soup.find_all('a'))
# Prints a list with the three <a class="sister"> tags (link1, link2, link3)

# Pass a regular expression (matches tag names that start with "b", such as body and b)
soup.find_all(re.compile("^b"))

# Pass a list (matches any tag name in the list)
soup.find_all(["a", "b"])

# Keyword arguments; class is a reserved word in Python, so bs4 uses class_
soup.find_all(class_="sister")

# string argument (named text in older bs4 versions)
soup.find_all(string="Elsie")

# (1) Find by tag selector
print(soup.select('title'))

# (2) Find by class selector
print(soup.select('.sister'))

# (3) Find by id selector
print(soup.select('#link1'))

# (4) Find by descendant (hierarchy) selector
print(soup.select('p #link1'))

# (5) Find by attribute selector
print(soup.select('a[class="sister"]'))

# (6) Get the text content with get_text()
for title in soup.select('title'):
    print(title.get_text())

# (7) Get an attribute with get('attribute name')
print(soup.select('a')[0].get('href'))

find_all(name, attrs, recursive, text, **kwargs):

Finds all tags whose name is name.

CSS selectors:

Find by tag selector.
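
The find_all parameters and the CSS-selector route can be compared side by side. This is a minimal sketch assuming bs4 is installed and using the built-in "html.parser"; the sample HTML is shortened from the example above and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

HTML = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<b>bold</b>
</p>"""

soup = BeautifulSoup(HTML, "html.parser")

# name: match by tag name
by_name = [a["id"] for a in soup.find_all("a")]

# attrs: match by an attribute dictionary
by_attrs = [a["id"] for a in soup.find_all("a", attrs={"id": "link2"})]

# limit: stop after the first N matches
limited = [a["id"] for a in soup.find_all("a", limit=1)]

# recursive=False: only direct children are checked; the <a> tags sit inside <p>
direct_only = soup.find_all("a", recursive=False)

# select(): the same lookup expressed as a CSS selector
by_css = [a["id"] for a in soup.select("p.story > a.sister")]

print(by_name, by_attrs, limited, direct_only, by_css)
```

find_all is convenient for filtering by name and attributes in Python terms, while select is often shorter when the target is easiest to describe by its position in the document hierarchy.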
