
Python Crawler Project: Scraping Rental Listings from Ziroom

Knowledge points used in this Ziroom scraping project:

1. GET requests with requests

2. Parsing HTML with lxml

3. XPath

4. Storage in MongoDB

Main text

1. Analyze the target site

1. URL: http://hz.ziroom.com/z/nl/z3.html?p=2 (the p parameter controls pagination)

2. The page is retrieved with a GET request
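
As a small illustration of the pagination rule above, page URLs can be generated by changing only the `p` parameter. This is a sketch, not part of the original project: `BASE_URL` and `build_page_url` are names made up for the example, and the base path is simply the one quoted in the text.

```python
from urllib.parse import urlencode

# Base listing URL quoted in the text; the helper name below is hypothetical.
BASE_URL = "http://hz.ziroom.com/z/nl/z3.html"


def build_page_url(page):
    """Return the listing URL for a 1-based page number."""
    return BASE_URL + "?" + urlencode({"p": page})


# URLs for the first three pages
urls = [build_page_url(page) for page in range(1, 4)]
```

The same effect can be had by passing `params={"p": page}` to `requests.get`, which builds the query string itself.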

2. Fetch the source of a single page

    # -*- coding: utf-8 -*-
    import time

    import requests
    from requests.exceptions import RequestException


    def get_one_page(page):
        try:
            url = "http://hz.ziroom.com/z/nl/z2.html?p=" + str(page)
            headers = {
                'Referer': 'http://hz.ziroom.com/',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
            }
            res = requests.get(url, headers=headers)
            if res.status_code == 200:
                print(res.text)
        except RequestException:
            return None


    def main():
        page = 1
        get_one_page(page)


    if __name__ == '__main__':
        main()
        time.sleep(1)

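3. Parse the source of a single page

1. Parse an HTML document (purpose: to test the XPath expressions)

Save the fetched source to "result.html" in the current folder, then use XPath to extract the relevant content from it. You can of course also use an online XPath tool.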
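
The article does not show how the source is written to disk, so the following is only one possible sketch. `save_html` is a hypothetical helper; it writes the text with an explicit UTF-8 encoding so the Chinese content does not depend on the platform default.

```python
from pathlib import Path


def save_html(source, path="result.html"):
    """Write page source to disk as UTF-8 and return the path written."""
    target = Path(path)
    target.write_text(source, encoding="utf-8")
    return target
```

Inside `get_one_page`, this could be called as `save_html(res.text)` in place of printing the response.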

    from lxml import etree

    # Parse the HTML document saved in the previous step
    html = etree.parse("./result.html", etree.HTMLParser())
    results = html.xpath('//ul[@id="houseList"]/li')
    for result in results[1:]:
        title = result.xpath("./div/h3/a/text()")[0][5:] if len(result.xpath("./div/h3/a/text()")[0]) > 5 else ""
        location = result.xpath("./div/h4/a/text()")[0].replace("[", "").replace("]", '')
        # join() concatenates the list items with " " as the separator
        area = " ".join(result.xpath("./div/div/p[1]/span/text()")).replace(" ", "", 1)
        nearby = result.xpath("./div/div/p[2]/span/text()")[0]
        print(title)
        print(location)
        print(area)
        print(nearby)

2. Parse the page source

    from lxml import etree


    def parse_one_page(sourcehtml):
        '''Parse the source of a single page'''
        contentTree = etree.HTML(sourcehtml)  # parse the source
        results = contentTree.xpath('//ul[@id="houseList"]/li')  # extract the content with XPath
        for result in results[1:]:
            title = result.xpath("./div/h3/a/text()")[0][5:] if len(result.xpath("./div/h3/a/text()")[0]) > 5 else ""
            location = result.xpath("./div/h4/a/text()")[0].replace("[", "").replace("]", '')
            # join() concatenates the list items with " " as the separator
            area = " ".join(result.xpath("./div/div/p[1]/span/text()")).replace(" ", "", 1)
            nearby = result.xpath("./div/div/p[2]/span/text()")[0]
            yield {
                "title": title,
                "location": location,
                "area": area,
                "nearby": nearby
            }


    def main():
        page = 1
        html = get_one_page(page)
        print(type(html))
        for item in parse_one_page(html):
            print(item)


    if __name__ == '__main__':
        main()
        time.sleep(1)

4. Fetch multiple pages

    def parse_one_page(sourcehtml):
        '''Parse the source of a single page'''
        contentTree = etree.HTML(sourcehtml)  # parse the source
        results = contentTree.xpath('//ul[@id="houseList"]/li')  # extract the content with XPath
        for result in results[1:]:
            title = result.xpath("./div/h3/a/text()")[0][5:] if len(result.xpath("./div/h3/a/text()")[0]) > 5 else ""
            location = result.xpath("./div/h4/a/text()")[0].replace("[", "").replace("]", '')
            # join() concatenates the list items with " " as the separator
            area = " ".join(result.xpath("./div/div/p[1]/span/text()")).replace(" ", "", 1)
            # nearby = result.xpath("./div/div/p[2]/span/text()")[0].strip()
            # The line above needs a guard, so it is rewritten as follows
            nearby = result.xpath("./div/div/p[2]/span/text()")[0].strip() if len(result.xpath("./div/div/p[2]/span/text()")) > 0 else ""
            yield {
                "title": title,
                "location": location,
                "area": area,
                "nearby": nearby
            }
            print(nearby)
            # yield {"pages": pages}


    def get_pages():
        """Get the total number of pages"""
        page = 1
        html = get_one_page(page)
        contentTree = etree.HTML(html)
        # The strip() call implies pager text of the form "共N页"; the characters
        # must match the page's own (Simplified Chinese) text for strip() to work
        pages = int(contentTree.xpath('//div[@class="pages"]/span[2]/text()')[0].strip("共页"))
        return pages


    def main():
        pages = get_pages()
        print(pages)
        for page in range(1, pages + 1):
            html = get_one_page(page)
            for item in parse_one_page(html):
                print(item)


    if __name__ == '__main__':
        main()
        time.sleep(1)
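
`get_pages` depends on stripping fixed characters from both ends of the pager text. A slightly more tolerant sketch is to pull out the digits with a regular expression. This assumes the label contains a single number (for example a string of the form "共50页"); `parse_total_pages` is a hypothetical helper, not part of the original code.

```python
import re


def parse_total_pages(label):
    """Extract the first run of digits from a pager label such as '共50页'."""
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError("no page count found in %r" % label)
    return int(match.group())
```

This variant keeps working if the surrounding characters or whitespace change, and it fails with a clear error when no number is present.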

5. Store the data in MongoDB

Make sure the MongoDB service is running; otherwise every write will fail.

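    import pymongo


    def save_to_mongodb(result):
        """Store a record in MongoDB"""
        # Create the connection object, i.e. connect to the local server
        client = pymongo.MongoClient(host="localhost")
        # Select the database (named iroomz in this code)
        db = client.iroomz
        # Select the collection, here roominfo
        db_table = db.roominfo
        try:
            # Write to the database; insert() was removed in PyMongo 4,
            # so insert_one() is used instead
            if db_table.insert_one(result):
                print("---Saved to the database---", result)
        except Exception:
            print("---Failed to save to the database---", result)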

6. Complete code


    # -*- coding: utf-8 -*-
    import time

    import pymongo
    import requests
    from lxml import etree
    from requests.exceptions import RequestException


    def get_one_page(page):
        '''Fetch the source of a single page'''
        try:
            url = "http://hz.ziroom.com/z/nl/z2.html?p=" + str(page)
            headers = {
                'Referer': 'http://hz.ziroom.com/',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
            }
            res = requests.get(url, headers=headers)
            if res.status_code == 200:
                return res.text
            return None
        except RequestException:
            return None


    def parse_one_page(sourcehtml):
        '''Parse the source of a single page'''
        contentTree = etree.HTML(sourcehtml)  # parse the source
        results = contentTree.xpath('//ul[@id="houseList"]/li')  # extract the content with XPath
        for result in results[1:]:
            title = result.xpath("./div/h3/a/text()")[0][5:] if len(result.xpath("./div/h3/a/text()")[0]) > 5 else ""
            location = result.xpath("./div/h4/a/text()")[0].replace("[", "").replace("]", '')
            # join() concatenates the list items with " " as the separator
            area = " ".join(result.xpath("./div/div/p[1]/span/text()")).replace(" ", "", 1)
            # Guard against listings where this XPath query returns an empty list
            nearby = result.xpath("./div/div/p[2]/span/text()")[0].strip() if len(result.xpath("./div/div/p[2]/span/text()")) > 0 else ""
            data = {
                "title": title,
                "location": location,
                "area": area,
                "nearby": nearby
            }
            save_to_mongodb(data)


    def get_pages():
        """Get the total number of pages"""
        page = 1
        html = get_one_page(page)
        contentTree = etree.HTML(html)
        # The strip() call implies pager text of the form "共N页"; the characters
        # must match the page's own (Simplified Chinese) text for strip() to work
        pages = int(contentTree.xpath('//div[@class="pages"]/span[2]/text()')[0].strip("共页"))
        return pages


    def save_to_mongodb(result):
        """Store a record in MongoDB"""
        # Create the connection object, i.e. connect to the local server
        client = pymongo.MongoClient(host="localhost")
        # Select the database (named iroomz in this code)
        db = client.iroomz
        # Select the collection, here roominfo
        db_table = db.roominfo
        try:
            # insert() was removed in PyMongo 4, so insert_one() is used instead
            if db_table.insert_one(result):
                print("---Saved to the database---", result)
        except Exception:
            print("---Failed to save to the database---", result)


    def main():
        pages = get_pages()
        print(pages)
        for page in range(1, pages + 1):
            html = get_one_page(page)
            if html:  # get_one_page returns None when the request fails
                parse_one_page(html)


    if __name__ == '__main__':
        main()
        time.sleep(1)
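
In the listing, `time.sleep(1)` runs only once, after every page has been fetched. If the intent is to pause between requests, one possible arrangement is sketched below. `crawl_pages` and its parameters are hypothetical names; the fetch function is passed in so that `get_one_page` (or a stub) can be used, and pages for which it returns `None` are skipped.

```python
import time


def crawl_pages(fetch, total_pages, delay_seconds=1.0, sleep=time.sleep):
    """Fetch pages 1..total_pages, pausing between consecutive requests.

    `fetch` is any callable that takes a page number and returns the HTML
    text, or None on failure. Returns a list of (page, html) tuples.
    """
    pages = []
    for page in range(1, total_pages + 1):
        html = fetch(page)
        if html is not None:
            pages.append((page, html))
        if page < total_pages:
            sleep(delay_seconds)  # no pause needed after the last page
    return pages
```

With the functions from the complete code, a call might look like `crawl_pages(get_one_page, get_pages())`, followed by parsing each returned page.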


7. Final result

(Screenshot of the final result; the image is not preserved here.)

Summary

1. A note on using XPath in step 3

title = result.xpath("./div/h3/a/text()")

The leading dot '.' must not be omitted here: it denotes the current node. Without it, '/' means that selection starts from the root node.
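
The difference can be seen on a small, made-up HTML fragment (the markup below is an assumption for illustration only and is much simpler than the real page). With lxml, the relative path finds the link inside each `li`, while the absolute path is evaluated from the document root and matches nothing.

```python
from lxml import etree

# Hypothetical markup, loosely modelled on the list structure used in the article
SAMPLE = (
    '<html><body><ul id="houseList">'
    '<li><div><h3><a>Room A</a></h3></div></li>'
    '<li><div><h3><a>Room B</a></h3></div></li>'
    '</ul></body></html>'
)

tree = etree.HTML(SAMPLE)
items = tree.xpath('//ul[@id="houseList"]/li')

# './' starts at the current <li> element
relative = [li.xpath("./div/h3/a/text()") for li in items]

# '/' starts at the document root (<html>), which has no <div> child
absolute = [li.xpath("/div/h3/a/text()") for li in items]
```

Here `relative` holds one title per item, whereas `absolute` holds only empty lists.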

2. An index-out-of-range error when fetching multiple pages in step 4

nearby = result.xpath("./div/div/p[2]/span/text()")[0].strip()

IndexError: list index out of range

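This error can arise in two situations:

1) The index lies beyond the end of a non-empty list

2) The list is empty, so even index 0 does not exist

Because the index used for nearby is 0, the first situation is ruled out. The XPath query must therefore have returned an empty list for some listings, and adding an if check solves the problem.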

nearby = result.xpath("./div/div/p[2]/span/text()")[0].strip()
# After the rewrite:

nearby = result.xpath("./div/div/p[2]/span/text()")[0].strip() if len(result.xpath("./div/div/p[2]/span/text()"))>0 else ""
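
The same guard can be factored into a small helper so that the XPath query runs only once per field. This is a sketch; `first_or_default` is a hypothetical name and does not appear in the original code.

```python
def first_or_default(values, default=""):
    """Return the stripped first item of a list, or `default` if it is empty."""
    return values[0].strip() if values else default
```

It could then be used as `nearby = first_or_default(result.xpath("./div/div/p[2]/span/text()"))`.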

The above is mainly a summary of what I learned while building this crawler. If anything is wrong, corrections are welcome. Thank you!