Scraping Movie and Food Data with Python: A Practical Walkthrough

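This article uses requests plus regular expressions to match page content, and a process pool (multiprocessing.Pool) to speed up the larger crawls. It covers three cases: the Maoyan Movies TOP100 board, the films currently showing on Taopiaopiao, and restaurant data from Meituan. The approach is much the same in all three.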

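1. Choose the site you want to scrape.

2. Decide which modules to use: requests, json and re. To speed up the crawl, add a Pool.

3. Request the page to get the full HTML. Send a headers dict with the request, otherwise the site may block it.

4. Parse the page: match it against a regular-expression pattern to pick out the content we need.

5. Collect the data with findall, then hand it back with yield. yield is a keyword similar to return: each time iteration reaches a yield, the function hands back the value to the right of yield and pauses there until the next iteration.

6. Iterate over the extracted data.

7. Save it to the corresponding file.

8. Close the file.

9. Report that the data was saved successfully.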
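The steps above can be tried offline, without any network request. The snippet below is a minimal sketch, assuming a made-up HTML fragment (`SAMPLE_HTML`) and hypothetical helper names (`parse_items`, `save_items`); it does not reflect the markup of any real site.

```python
import json
import re

# Hypothetical markup, invented for illustration only.
SAMPLE_HTML = """
<li><span class="rank">1</span><a class="name">Movie A</a></li>
<li><span class="rank">2</span><a class="name">Movie B</a></li>
"""

# re.S lets "." match newlines, so one pattern can span several lines of HTML.
PATTERN = re.compile(r'class="rank">(\d+)</span>.*?class="name">(.*?)</a>', re.S)


def parse_items(html):
    """Yield one dict per match; the caller pulls items lazily."""
    for rank, name in PATTERN.findall(html):
        yield {"rank": rank, "name": name}


def save_items(items, path):
    """Append each item to `path` as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:  # the with-block closes the file
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


items = list(parse_items(SAMPLE_HTML))
```

Calling `save_items(parse_items(SAMPLE_HTML), "result.txt")` would append one JSON line per film. Because the file is opened in a `with` block, no explicit close is needed.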

I. Scraping the Maoyan Movies Top 100 board

import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException

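def get_one_page(url):
    """Fetch a page and return its HTML, or None on any failure."""
    try:
        # Send a browser-like User-Agent; without it the site may block the request.
        headers = {
            "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # Network error, timeout, etc.: treat it the same as a failed page.
        return None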

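def parse_one_page(html):
    """Yield one dict per film found in a board page."""
    # The closing tags (</i>, </p>) that end each lazy group are assumed from
    # Maoyan's board markup; without a terminator each (.*?) would match nothing.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character leading label
            'time': item[4].strip()[5:],   # drop the 5-character leading label
            'score': item[5] + item[6]     # integer part + fractional part
        }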

def write_to_file(content):
    # Append one JSON object per line; the with-block closes the file itself.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed or was blocked; nothing to parse
        return
    for item in parse_one_page(html):
        # print(item)
        write_to_file(item)

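if __name__ == '__main__':
    # Sequential alternative, which keeps the output in ranking order:
    # for i in range(10):
    #     main(i * 10)
    with Pool() as pool:
        # One task per page: offsets 0, 10, ..., 90.
        pool.map(main, [i * 10 for i in range(10)])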

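Result: the scraped data is stored in a text file.

Because I scrape with a process pool here, the records are sometimes not stored in order. With the sequential (non-pooled) version they are stored in ranking order.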
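If the output order matters, one option is to let the workers return their items and do all the writing in the parent, since `map` returns results in the order of its inputs. The sketch below uses `concurrent.futures.ThreadPoolExecutor` (a thread pool, not the `multiprocessing.Pool` used above) and a stand-in `fake_scrape` function instead of a real request; both are assumptions made for illustration.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scrape(offset):
    """Stand-in for fetching and parsing one page (no network involved)."""
    time.sleep(random.uniform(0, 0.05))  # simulate pages finishing out of order
    return [{"index": offset + i} for i in range(1, 4)]


offsets = [i * 10 for i in range(10)]

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in input order, regardless of finish order.
    pages = list(executor.map(fake_scrape, offsets))

# Flatten page by page; writing from here keeps the file in ranking order.
ordered = [item for page in pages for item in page]
```

The trade-off is that nothing is written until every page has returned, which is acceptable for a crawl of this size.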

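II. Scraping the films now showing on Taopiaopiao

After inspecting the page structure, I again used regular-expression matching to locate the fields:

The code is as follows: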

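import json
import re

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch a page and return its HTML, or None on any failure."""
    try:
        headers = {
            "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None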

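def parse_one_page(html):
    """Yield one dict per film card on the page."""
    # NOTE: the HTML tags that separated the capture groups are missing from
    # this pattern as published. As written, the bare lazy groups match empty
    # strings, so the delimiters must be restored from the live page markup.
    pattern = re.compile('<div class="movie-card-poster">.*?data-src="(.*?)".*?(.*?).*?(.*?).*?<div class="movie-card-list">.*?(.*?)'
                         + '.*?(.*?).*?(.*?).*?(.*?).*?(.*?).*?(.*?)', re.S)
    items = re.findall(pattern, html)
    for item in items:
        # [3:] drops the 3-character leading label from each field.
        yield {
            'image': item[0],
            'title': item[1],
            'score': item[2],
            'director': item[3].strip()[3:],
            'actor': item[4].strip()[3:],
            'type': item[5].strip()[3:],
            'area': item[6].strip()[3:],
            'language': item[7].strip()[3:],
            'time': item[8].strip()[3:]
        }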

def write_to_file(content):
    # Append one JSON object per line; the with-block closes the file itself.
    with open('movie-hot.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main():
    url = 'https://www.taopiaopiao.com/showList.htm'
    html = get_one_page(url)
    if html is None:  # request failed or was blocked; nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    main()

Result:

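III. Scraping Meituan (Shenzhen) restaurant listings rated 4.0 or higher

As a foodie, I wanted to know about the restaurants in my city, so I scraped the details of the higher-rated shops:

What sets this Meituan page apart is that it is rendered entirely by JavaScript. So after fetching the page, I located the data inside the embedded JS and matched it with regular expressions.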
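To make the idea concrete, here is a minimal sketch that pulls fields out of a script-embedded JSON fragment with a regular expression and keeps only entries rated 4.0 or higher. The `SAMPLE_PAGE` string, its variable names and its values are invented for illustration, the field names are borrowed from the pattern used in this section, and `high_rated` is a hypothetical helper. Real Meituan pages may look different.

```python
import re

# Invented sample: a page whose data lives inside a <script> block.
SAMPLE_PAGE = (
    '<script>window._appState = {"poiInfos":['
    '{"poiId":101,"title":"Shop A","avgScore":4.6,"avgPrice":58},'
    '{"poiId":102,"title":"Shop B","avgScore":3.9,"avgPrice":35},'
    '{"poiId":103,"title":"Shop C","avgScore":4.0,"avgPrice":72}'
    ']};</script>'
)

# Match the raw text of the JS object; no JavaScript is executed.
PATTERN = re.compile(
    r'"poiId":(\d+),"title":"(.*?)","avgScore":([\d.]+),"avgPrice":(\d+)'
)


def high_rated(page, threshold=4.0):
    """Yield shops whose average score is at least `threshold`."""
    for poi_id, title, score, price in PATTERN.findall(page):
        if float(score) >= threshold:
            yield {"poiId": poi_id, "title": title,
                   "avgScore": float(score), "avgPrice": int(price)}


shops = list(high_rated(SAMPLE_PAGE))
```

Matching the raw text works only while the field order in the embedded data stays fixed. If the site reorders the keys, the pattern has to be updated.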

"""

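author: 朱培

title: Scrape Meituan (Shenzhen) restaurant listings rated 4.0 or higher
"""
import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        headers = {
            "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # The data sits in JS embedded in the page, so match the raw JSON text.
    pattern = re.compile('"poiId":(.*?),"frontImg":"(.*?)","title":"(.*?)","avgScore":(.*?),"allCommentNum":(.*?)'
                         + ',"address":"(.*?)","avgPrice":(.*?),', re.S)
    for item in re.findall(pattern, html):
        if float(item[3]) >= 4.0:  # keep only shops rated 4.0 or higher
            yield {
                'poiId': item[0],
                'frontImg': item[1],
                'title': item[2],
                'avgScore': item[3],
                'allCommentNum': item[4],
                'address': item[5],
                'avgPrice': item[6]
            }


def write_to_file(content):
    with open('food-meituan.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(n):
    url = 'http://sz.meituan.com/meishi/pn' + str(n) + '/'
    html = get_one_page(url)
    if html is None:
        return
    for item in parse_one_page(html):
        write_to_file(item)


if __name__ == '__main__':
    # for i in range(1, 33):
    #     main(i)
    with Pool() as pool:
        # Pages pn1..pn32 (assumed 1-based); the original list repeated page 1.
        pool.map(main, range(1, 33))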

Result:

As a later step, the data could be persisted to a database. MongoDB and MySQL are common choices.
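As a rough illustration of that last step, the sketch below loads JSON-lines records into a database. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for MongoDB or MySQL. The `shops` table, its columns, and the sample records are assumptions made up for this example.

```python
import json
import sqlite3

# Two invented records in the same one-JSON-object-per-line format
# that write_to_file produces.
LINES = [
    '{"poiId": "101", "title": "Shop A", "avgScore": "4.6"}',
    '{"poiId": "103", "title": "Shop C", "avgScore": "4.0"}',
]

conn = sqlite3.connect(":memory:")  # swap for a file path to persist
conn.execute(
    "CREATE TABLE shops (poi_id TEXT PRIMARY KEY, title TEXT, avg_score REAL)"
)

rows = []
for line in LINES:
    record = json.loads(line)
    rows.append((record["poiId"], record["title"], float(record["avgScore"])))

# Parameterised inserts avoid building SQL by string concatenation;
# INSERT OR REPLACE keeps one row per poi_id if a scrape is re-run.
conn.executemany("INSERT OR REPLACE INTO shops VALUES (?, ?, ?)", rows)
conn.commit()

best = conn.execute(
    "SELECT title, avg_score FROM shops ORDER BY avg_score DESC"
).fetchall()
```

With the data in a table, questions such as "which shops score highest" become a single query instead of another pass over the text file.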
