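Python Crawler Beginner Tutorial 29-100: Scraping Mobile App Data with pyspider

1. Mobile App Data ---- Preface

We keep practicing with pyspider. I recently looked up some usage tips for this framework and found its documentation surprisingly hard to follow, although so far I have hit no obstacles in actual use. I expect to write roughly five more tutorials about it. Today's tutorial adds image handling, which is worth studying closely.

2. Mobile App Data ---- Page Analysis

The site we are going to crawl is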

http://www.liqucn.com/rj/new/

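I took a look at this site: it has roughly 20,000 pages with 9 items per page, so about 180,000 records in total. That is a crawlable amount, and the data can be used later for analysis or for practicing database optimization.

The site has essentially no anti-crawling measures, so you can crawl it directly. Do throttle the concurrency a little, though, so you don't put too much load on someone else's server.

Analyzing the page shows that pagination is URL-based, which keeps things simple: we first read the total page count from the first page, then generate all the page URLs in bulk.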

http://www.liqucn.com/rj/new/?page=1
http://www.liqucn.com/rj/new/?page=2
http://www.liqucn.com/rj/new/?page=3
http://www.liqucn.com/rj/new/?page=4
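
The URL pattern above can be sketched in plain Python before wiring it into pyspider. This is only an illustration: `BASE_URL` and `build_page_urls` are names made up for this sketch, and the page count passed in is assumed to come from the first page.

```python
# Listing URL template taken from the examples above.
BASE_URL = "http://www.liqucn.com/rj/new/?page={}"


def build_page_urls(total_pages):
    """Return the listing URL for every page from 1 to total_pages inclusive."""
    return [BASE_URL.format(page) for page in range(1, total_pages + 1)]


urls = build_page_urls(4)
for url in urls:
    print(url)
```

With a real page count of about 20,000 the same call would produce the full list of listing pages to queue.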

Code to get the total page count

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.liqucn.com/rj/new/?page=1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Read the page number of the last page
        total = int(response.doc(".current").text())
        # Queue one crawl task per listing page
        for page in range(1, total + 1):
            self.crawl('http://www.liqucn.com/rj/new/?page={}'.format(page), callback=self.detail_page)

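Next comes a passage copied from the Chinese translation of the official documentation, as a standing reminder to myself.

A quick walkthrough of the code:

def on_start(self) is the entry point. It runs when you click the run button in the web console.

self.crawl(url, callback=self.index_page) calls the API to create a new crawl task, which is added to the queue of tasks waiting to be fetched.

def index_page(self, response) receives a Response object. response.doc is an extension that exposes the page as a pyquery object. pyquery is a jQuery-like element selector.

def detail_page(self, response) returns a result object. By default this result is added to the resultdb database (sqlite is used if no database was specified at startup). You can also override on_result(self, result) to choose where results are stored.

More details:
@every(minutes=24*60, seconds=0) tells the scheduler to run on_start once a day.
@config(age=10 * 24 * 60 * 60) tells the scheduler that this request expires after 10 days, so the same request is ignored if it comes up again within 10 days. This parameter can also be set in self.crawl(url, age=10*24*60*60) and in crawl_config.
@config(priority=2) sets the priority. Higher numbers run first.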
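
To make the `age` setting a bit more concrete, here is a simplified model of the check it describes. It is not pyspider's actual scheduler code; `should_fetch` and its parameters are hypothetical names used only for this illustration.

```python
# Same value as the age used in @config above: 10 days in seconds.
AGE_SECONDS = 10 * 24 * 60 * 60


def should_fetch(last_crawled_at, now, age=AGE_SECONDS):
    """Return True if the URL was never fetched or its last fetch is older than age.

    last_crawled_at and now are Unix timestamps in seconds;
    last_crawled_at is None for a URL that has never been fetched.
    """
    if last_crawled_at is None:
        return True
    return now - last_crawled_at > age


day = 24 * 60 * 60
print(should_fetch(None, 100))      # never fetched before
print(should_fetch(0, 5 * day))     # fetched 5 days ago, still within age
print(should_fetch(0, 11 * day))    # fetched 11 days ago, older than age
```

In this model a request seen again within the 10-day window is skipped, which matches the behaviour described above.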

The paginated URLs are now in the crawl queue. Next we parse the fetched pages, which is done in the detail_page function.

    @config(priority=2)
    def detail_page(self, response):
        docs = response.doc(".tip_blist li").items()
        dicts = []
        for item in docs:
            title = item(".tip_list>span>a").text()
            pubdate = item(".tip_list>i:eq(0)").text()
            info = item(".tip_list>i:eq(1)").text()
            # Category: the text after the full-width colon
            category = info.split("：")[1]
            # Size: the part after "/", or "0MB" when it is missing
            size = info.split("/")
            if len(size) == 2:
                size = size[1]
            else:
                size = "0MB"
            app_type = item("p").text()
            mobile_type = item("h3>a").text()

            # Queue a download task for the app's logo image
            img_url = item(".tip_list>a>img").attr("src")
            # Take the file name from the last path segment of the URL
            filename = img_url[img_url.rindex("/")+1:]
            self.crawl(img_url, callback=self.save_img, save={"filename": filename}, validate_cert=False)
            # Collect the record for this app
            dicts.append({
                "title": title,
                "pubdate": pubdate,
                "category": category,
                "size": size,
                "app_type": app_type,
                "mobile_type": mobile_type
            })
        return dicts
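
The string splitting in detail_page raises an IndexError when the colon is missing. Below is a slightly more defensive sketch of the same idea. The exact format of the info text is an assumption here (a label, a full-width colon, the category, then an optional "/size"), the sample strings are invented, and `parse_info` is a hypothetical helper, not part of the original code.

```python
def parse_info(info):
    """Split an info string into (category, size), falling back to defaults.

    Assumed format: "<label>：<category>/<size>", where "/<size>" may be absent.
    """
    head, _, size = info.partition("/")
    # Missing or empty size falls back to the same default the tutorial uses.
    size = size.strip() or "0MB"
    _, colon, category = head.partition("：")
    # No full-width colon means no category could be found.
    category = category.strip() if colon else ""
    return category, size


print(parse_info("Type：Tools/12.5MB"))
print(parse_info("Type：Tools"))
print(parse_info(""))
```

Unlike the original split, this version returns an empty category instead of failing when a list item has an unexpected layout.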

All the records are now returned together. We override on_result to save them to mongodb. Before writing that, let's first finish the code that connects to mongodb.

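import os

import pymongo
import pandas as pd
import numpy as np
import time
import json

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT)
db = client[DATABASE_NAME]
# Note: Database.authenticate() was removed in PyMongo 4.0; on newer versions
# pass username and password to MongoClient instead.
db.authenticate("dba", "dba")
collection = db.liqu  # collection the records will be inserted into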

Storing the data

    def on_result(self, result):
        # Only store non-empty results
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        df = pd.DataFrame(result)
        # print(df)
        # Convert the DataFrame rows back into a list of plain dicts
        content = list(json.loads(df.T.to_json()).values())
        if collection.insert_many(content):
            print('Saved to MongoDB')
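
Since detail_page already returns a list of dicts, the pandas round trip is not strictly needed. As an alternative sketch (not the method used in this tutorial), the result can be normalised with plain Python before it is handed to insert_many. `to_documents` is a hypothetical helper name.

```python
def to_documents(result):
    """Normalise a handler result into a list of dict documents.

    Accepts None, a single dict, or an iterable of dicts. Each dict is copied,
    so the caller's objects are not modified when the database adds an _id.
    """
    if not result:
        return []
    if isinstance(result, dict):
        return [dict(result)]
    return [dict(item) for item in result]


records = [{"title": "demo app", "size": "0MB"}]
docs = to_documents(records)
print(docs)
```

The returned list could then be passed to collection.insert_many(docs) in place of the DataFrame conversion, provided it is not empty.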

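The captured data is shown in the table below. At this point most of the work is done. All that remains is to finish the image download, and then we can wrap up!

3. Mobile App Data ---- Image Storage

Downloading an image simply means saving the remote image to a local path.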

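    def save_img(self, response):
        # DIR_PATH (the target folder) is not defined in this tutorial;
        # set it at module level before running
        content = response.content
        file_name = response.save["filename"]
        # Create the folder if it does not exist
        if not os.path.exists(DIR_PATH):
            os.makedirs(DIR_PATH)

        file_path = DIR_PATH + "/" + file_name

        with open(file_path, "wb") as f:
            f.write(content)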
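
The same save step can be tried outside pyspider with the standard library only. This is a minimal sketch: `filename_from_url` and `save_bytes` are hypothetical helpers, the example URL is made up, and the target directory is whatever you pass in.

```python
import os
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def save_bytes(content, directory, filename):
    """Write content to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, filename)
    with open(file_path, "wb") as f:
        f.write(content)
    return file_path


print(filename_from_url("http://example.com/img/logo.png?v=2"))
```

Using os.makedirs with exist_ok=True avoids the separate existence check, and os.path.join keeps the path separator correct on each platform.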

That completes the task. After saving, adjust the crawl rate, click run, and let the data roll in.
