Scraping the Maoyan Movies TOP 100

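The crawler project walked through in this article is a basic, entry-level exercise, implemented in Python 3.5.

Goal of the project: scrape the title, ranking, poster, lead actors, release date and score of every film on the Maoyan Movies TOP 100 board.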

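How a crawler works, and the steps involved

A crawler extracts the content you need from web pages, such as text, images or video. To do that we request the page, obtain its HTML source, match the source against regular expressions, and save the matched information to a file or database. That is the basic principle of a crawler.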

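Steps:

1. Decide which fields to scrape (ranking, poster, title, lead actors, release date, score)

2. Analyze the page's HTML tag structure and locate the data

3. Choose the implementation approach and where to store the data (here, a MySQL database)

4. Write the code (requests + re + pymysql)

5. Debug the code

Target URL to scrape: http://maoyan.com/board/4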

1. Import the libraries/modules

import re
import requests
import pymysql
from requests.exceptions import RequestException  # raised when a request fails


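2. Request headers: open the page in the browser, inspect the request headers, and copy the User-Agent value

Request a single page and get its HTML. Define a function that builds the headers and sends the request; a status code of 200 means the request succeeded, and anything else is treated as a failure, in which case the function returns None.

def get_one_page(url):
    # Build the headers; the User-Agent is copied from a real browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    try:
        # timeout keeps a stalled connection from hanging the script
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None  # any other status code counts as a failure
    except RequestException:
        return None  # network error, timeout, etc.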


3. Parse the HTML with a regular expression, using non-greedy .*? matching

def parse_one_page(html):
    # Compile the regular expression once.
    # re.S lets the metacharacter . match newlines as well.
    # Raw strings (r'...') keep \d from being treated as a string escape.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
                         r'<a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = pattern.findall(html)
    # print(items)
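To see why the pattern leans on non-greedy .*? and on re.S, here is a small standalone sketch. The HTML fragment is made up for illustration and is not taken from Maoyan.

```python
import re

# Hypothetical fragment: two paragraphs, the second containing a line break.
html = '<p>first</p>\n<p>second\nline</p>'

# Greedy .* runs on to the LAST </p>, swallowing both paragraphs in one match.
greedy = re.findall(r'<p>(.*)</p>', html, re.S)

# Non-greedy .*? stops at the NEAREST </p>, giving one match per paragraph.
lazy = re.findall(r'<p>(.*?)</p>', html, re.S)

# Without re.S the dot cannot cross a newline, so the second paragraph is lost.
lazy_no_dotall = re.findall(r'<p>(.*?)</p>', html)

print(greedy)
print(lazy)
print(lazy_no_dotall)
```

Because each film sits in its own multi-line <dd> block, the scraper needs both behaviours: .*? to stop at the nearest delimiter, and re.S to let a match span several lines.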


Run the match against the first page.

The raw, unprocessed result looks like this:
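As a rough illustration of what the raw match looks like, the sketch below applies the same pattern to a hand-written fragment. The fragment, its URL and its film data are hypothetical; it only imitates the tag structure the pattern expects, and the real page may differ.

```python
import re

# Hypothetical <dd> block imitating the structure the pattern targets.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/0000" class="image-link">
    <img data-src="https://example.com/poster.jpg" alt="Example Movie" />
  </a>
  <p class="name"><a href="/films/0000">Example Movie</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Same pattern as in parse_one_page.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
    r'<a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)

# findall returns one 7-tuple per film: one string per capture group.
items = PATTERN.findall(SAMPLE_HTML)
print(items[0])
```

The actor field comes back with its surrounding newlines and label text still attached, and the score arrives in two pieces, which is what the next step cleans up.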

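4. Process the data: tidy the format, lay the fields out in order, and strip unnecessary whitespace

    # (continues the body of parse_one_page)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # strip() trims surrounding whitespace; [3:] drops the 3-character "主演：" label
            'time': item[4].strip()[5:],   # [5:] drops the 5-character "上映时间：" label
            'score': item[5] + item[6]     # integer part + fractional part, e.g. '9.' + '5'
        }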
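Slicing by a fixed length works only while the labels keep their exact length. A slightly more defensive variant, sketched below, removes a label only when it is actually present. The helper names strip_label and clean_item are hypothetical, and the sample tuple is invented for illustration.

```python
def strip_label(text, label):
    """Trim whitespace, then drop `label` from the front if it is present."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):]
    return text


def clean_item(raw):
    """Turn one 7-tuple from re.findall into a dict with tidy values."""
    index, image, title, star, release, integer, fraction = raw
    return {
        'index': index,
        'image': image,
        'title': title.strip(),
        'actor': strip_label(star, '主演：'),
        'time': strip_label(release, '上映时间：'),
        'score': integer + fraction,
    }


# Invented raw tuple in the shape the regular expression produces.
raw = ('1', 'https://example.com/poster.jpg', 'Example Movie',
       '\n    主演：Actor A,Actor B\n  ', '上映时间：1993-01-01', '9.', '5')
item = clean_item(raw)
print(item)
```

For well-formed input this gives the same values as the fixed slices, but a field that arrives without its label is left intact instead of losing its first characters.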


5. Create the MySQL database: name the database movie1 and the table maoyan, and add columns for the six fields we scrape
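The article does not show the table definition, so the following is only one possible layout. The column names and types are assumptions; what matters is that there are six columns in the same order as the values inserted later. The helper create_table is hypothetical and accepts any DB-API connection.

```python
# Assumed schema: six columns, in the order the scraper inserts its values.
# Everything is stored as scraped text except the ranking.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS maoyan (
    ranking      INT,
    image        VARCHAR(255),
    title        VARCHAR(100),
    actor        VARCHAR(255),
    release_time VARCHAR(100),
    score        VARCHAR(10)
)
"""


def create_table(conn):
    """Run the DDL on an open DB-API connection and commit."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
        conn.commit()
    finally:
        cursor.close()
```

With PyMySQL this could be called as create_table(pymysql.connect(host='localhost', user='root', password='123456', database='movie1', charset='utf8')), once the movie1 database itself has been created, for example with CREATE DATABASE movie1. The names ranking and release_time are used in place of index and time to stay clear of SQL keywords.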

6. Open a database connection in Python and store the scraped data in MySQL

def write_to_mysql(content):
    conn = pymysql.connect(host='localhost', user='root', password='123456',
                           database='movie1', charset='utf8')
    try:
        with conn.cursor() as cursor:
            # Parameterized query: pymysql escapes the values for us
            sql = 'insert into maoyan values(%s,%s,%s,%s,%s,%s)'
            parm = (content['index'], content['image'], content['title'],
                    content['actor'], content['time'], content['score'])
            cursor.execute(sql, parm)
        conn.commit()
    finally:
        conn.close()  # close the connection even if the insert fails
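The function above opens and closes a connection for every single film. A possible alternative, sketched here, is to collect the items first and insert them in one batch through a connection the caller already holds. The names item_to_row and write_many are hypothetical; cursor.executemany is the standard DB-API batch call that PyMySQL also provides.

```python
INSERT_SQL = 'INSERT INTO maoyan VALUES (%s, %s, %s, %s, %s, %s)'

# Dict keys in the order of the table's six columns.
FIELDS = ('index', 'image', 'title', 'actor', 'time', 'score')


def item_to_row(item):
    """Order one parsed item dict to match the table's column order."""
    return tuple(item[field] for field in FIELDS)


def write_many(conn, items):
    """Insert all items through one open connection and commit once."""
    rows = [item_to_row(item) for item in items]
    cursor = conn.cursor()
    try:
        cursor.executemany(INSERT_SQL, rows)
        conn.commit()
    finally:
        cursor.close()
    return len(rows)
```

A caller could then run write_many(conn, parse_one_page(html)) for each page and close the connection once at the end of the crawl.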


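Call the main function; running it gives the following result:

The above fetches only one page, which holds just the top 10 films. To get the full TOP 100 we need to build the URL for each of the remaining pages.

URL of the first page: http://maoyan.com/board/4

URL of the second page: http://maoyan.com/board/4?offset=10

URL of the third page: http://maoyan.com/board/4?offset=20

The offset grows by 10 with each page, so the URL can be built as:

url='http://maoyan.com/board/4?offset='+str(offset)
Looping 10 times therefore yields the top 100 films, which are then written to the database.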
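The offset rule above can be checked with a few lines of Python. The helper name page_urls is hypothetical; the URL pattern is the one given in the text.

```python
BASE_URL = 'http://maoyan.com/board/4'


def page_urls(pages=10, page_size=10):
    """Build one URL per page; the offset grows by page_size each time."""
    return ['{}?offset={}'.format(BASE_URL, page * page_size)
            for page in range(pages)]


urls = page_urls()
for url in urls:
    print(url)
```

The first page is requested with offset=0 here, which is also what the main loop in this article does.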


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed; skip this page
        return
    for item in parse_one_page(html):
        print(item)
        write_to_mysql(item)

if __name__ == '__main__':
    for i in range(0, 10):
        main(i * 10)
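The loop above requests all ten pages back to back. One optional refinement, not part of the original code, is to pause briefly between requests and to count how many items were stored. In the sketch below the function crawl and its parameters are hypothetical; the fetch, parse and save steps are passed in as arguments so the loop can be tried without a network or a database.

```python
import time


def crawl(fetch, parse, save, pages=10, page_size=10, delay=1.0):
    """Fetch, parse and save each page in turn, pausing between requests.

    fetch(url) returns HTML or None, parse(html) yields item dicts,
    and save(item) stores one item. Returns the number of items saved.
    """
    saved = 0
    for page in range(pages):
        url = 'http://maoyan.com/board/4?offset=' + str(page * page_size)
        html = fetch(url)
        if html is None:
            print('skipping', url)  # request failed; move on to the next page
        else:
            for item in parse(html):
                save(item)
                saved += 1
        if page < pages - 1:
            time.sleep(delay)  # short pause before the next request
    return saved
```

With the functions from this article it could be invoked as crawl(get_one_page, parse_one_page, write_to_mysql). The one-second default delay is an arbitrary choice.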


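After the run, log in to MySQL to check the data that was written.

That is the complete code for scraping the Maoyan TOP 100. Corrections are welcome.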