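Preface

After scraping data from the web with a crawler, we need to store it. There are many ways to do this. The simplest is to save the data directly to a file, such as TXT, JSON, CSV, or Excel. The data can also be saved to a database, such as the widely used relational database MySQL or the non-relational database MongoDB. Using one concrete scraping case, this article walks through how to implement each of these storage methods.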

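Case Introduction

When we want to learn a topic, we often look for courses on online course sites. Take NetEase Cloud Classroom (study.163.com) as an example: type the keyword python into the search box, click search, and many Python courses appear. We want to save the information for these courses.

In Google Chrome, right-click the page and choose "Inspect". Analyzing the network requests shows that the course data on the page is loaded through an Ajax endpoint, so requesting that endpoint directly returns the information we want.

The data is in JSON format. The code is as follows: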

import requests


def get_json(index):
  """Request one page of search results; return the parsed JSON, or None on failure."""
  url = 'https://study.163.com/p/search/studycourse.json'
  payload = {
    'activityId': 0,
    'advertiseSearchUuid': "0c2689fb-db3c-4e76-b413-6dae72725b0d",
    'keyword': "python",
    'orderType': 50,
    'pageIndex': index,
    'pageSize': 50,
    'priceType': -1,
    'qualityType': 0,
    'relativeOffset': 150,
    'searchTimeType': -1,
    'searchType': 10
  }

  headers = {
    'accept': 'application/json',
    'content-type': 'application/json',
    'origin': 'https://study.163.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
  }
  try:
    # The timeout keeps a stalled connection from hanging the crawler
    response = requests.post(url, json=payload, headers=headers, timeout=10)
    content_json = response.json()
    # A code of 0 marks a successful response
    if content_json and content_json['code'] == 0:
      return content_json
    return None
  except Exception as e:
    print('Request failed')
    print(e)
    return None


def get_content(content_json):
  """Return the course list from a response, or an empty list if there is none."""
  if content_json and 'result' in content_json:
    return content_json['result']['list'] or []
  return []
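
To make the response structure that the functions above rely on easier to see, here is a minimal offline sketch. The key names (`code`, `result`, `query`, `totlePageCount`, `list`, `productId`, and so on) are taken from the code in this article; every value in `sample_response` is made up for illustration, and `extract_courses` is a hypothetical helper, not part of the site's API.

```python
# Hypothetical sample payload: the key names follow the code in this article,
# but all values are invented for illustration.
sample_response = {
    "code": 0,
    "result": {
        "query": {"totlePageCount": 2},
        "list": [
            {"productId": 1001, "productName": "Python Basics",
             "lectorName": "Example School", "score": 4.8},
            {"productId": 1002, "productName": "Web Scraping with Python",
             "lectorName": "Example Academy", "score": 4.6},
        ],
    },
}


def extract_courses(response):
    """Return the course list, or an empty list if the payload is unusable."""
    if not response or response.get("code") != 0:
        return []
    return (response.get("result") or {}).get("list") or []


courses = extract_courses(sample_response)
page_count = sample_response["result"]["query"]["totlePageCount"]
print(page_count, [course["productName"] for course in courses])
```

Returning an empty list instead of None lets the caller loop over the result without a separate check when a page fails to load.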

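TXT Text Storage

Saving data to a txt file is very simple, and plain text is compatible with almost every platform. The drawback is that it is hard to search. If you do not need to query the data much and mainly want convenience, txt storage is a reasonable choice. The code is as follows: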

if __name__ == '__main__':
  total_page_count = get_json(1)['result']['query']['totlePageCount']  # total number of pages
  # The with block closes the file even if something fails partway through
  with open('python_courses.txt', 'w', encoding='utf-8') as file:
    for index in range(1, total_page_count + 1):
      content = get_content(get_json(index)) or []  # tolerate a page with no course list
      for item in content:
        file.write(f"Product ID: {item['productId']}\n")
        file.write(f"Product name: {item['productName']}\n")
        file.write(f"Institution name: {item['lectorName']}\n")
        file.write(f"Rating: {item['score']}\n")
        file.write(f"{'='*50}\n")  # separator line between records

The saved data is shown in the figure below:

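JSON File Storage

JSON stands for JavaScript Object Notation. It represents data through combinations of objects and arrays. Its syntax is compact yet highly structured, which makes it a lightweight data-interchange format.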
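
As a small illustration of how objects and arrays combine, the sketch below serializes a nested record with the standard `json` module and reads it back. The `course` record, including its `tags` and `provider` fields, is a made-up example rather than real scraped data.

```python
import json

# Made-up record: an object that contains an array and a nested object
course = {
    "productId": 1001,
    "productName": "Python 入門",
    "tags": ["python", "beginner"],
    "provider": {"lectorName": "Example School", "score": 4.8},
}

# ensure_ascii=False keeps non-ASCII characters readable in the output
text = json.dumps(course, ensure_ascii=False, indent=2)
print(text)

# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes
escaped = json.dumps(course)

# Parsing the text gives back an equal Python structure
restored = json.loads(text)
print(restored == course)
```

Both forms parse to the same data; `ensure_ascii=False` only affects how readable the saved file is.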

The code is as follows:

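import json
import os


def save_data(item):
  """Save one course as its own JSON file named after the course."""
  # Read the name before the try block so the error message can always use it
  name = item.get('productName', 'unknown')
  try:
    os.makedirs('../results', exist_ok=True)  # make sure the output directory exists
    data_path = f'../results/{name}.json'
    # A name containing characters such as '/' makes open() fail; that case is reported below
    with open(data_path, 'w', encoding='utf-8') as f:
      json.dump(item, f, ensure_ascii=False, indent=2)
  except Exception as e:
    print(f'{name}: failed to save ({e})')


if __name__ == '__main__':
  total_page_count = get_json(1)['result']['query']['totlePageCount']  # total number of pages
  for index in range(1, total_page_count + 1):
    content = get_content(get_json(index)) or []
    for item in content:
      save_data(item)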

The saved data is as follows:

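CSV File Storage

CSV stands for Comma-Separated Values (also called character-separated values, since the delimiter does not have to be a comma). A CSV file stores tabular data as plain text. It is a sequence of characters made up of any number of records, with the records separated by line breaks and the fields within a record separated by the delimiter.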
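
The sketch below shows what those records look like as raw text, using the standard `csv` module and an in-memory buffer instead of a file. The rows are made-up sample data; two of the names deliberately contain a comma and double quotes to show how the writer escapes them.

```python
import csv
import io

# Made-up rows: a header followed by two records
rows = [
    ["productId", "productName", "lectorName", "score"],
    [1001, "Python Basics, Part 1", "Example School", 4.8],
    [1002, 'The "Zen" of Python', "Example Academy", 4.6],
]

# newline="" leaves line endings to the csv module
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields one list of strings per record
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed[1])
```

Fields that contain the delimiter or quotes are wrapped in double quotes on output, and every value comes back as a string when read, so numbers need to be converted explicitly.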

The code is as follows:

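import csv

if __name__ == '__main__':
  total_page_count = get_json(1)['result']['query']['totlePageCount']  # total number of pages
  # newline='' prevents blank lines between rows on Windows;
  # utf-8-sig adds a BOM so Excel recognizes the file as UTF-8
  with open('python_courses.csv', 'w', newline='', encoding='utf-8-sig') as file:
    head = ['Product ID', 'Product name', 'Institution name', 'Rating']
    writer = csv.writer(file, delimiter=',')
    writer.writerow(head)
    for index in range(1, total_page_count + 1):
      content = get_content(get_json(index)) or []
      for item in content:
        # Named row rather than list, to avoid shadowing the built-in
        row = [item['productId'], item['productName'], item['lectorName'], item['score']]
        writer.writerow(row)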

The saved data is as follows:

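Excel File Storage

Excel is a spreadsheet application we use all the time, and it presents and analyzes data in a very intuitive way. Excel does limit how much data it can hold, however: a worksheet in an xls file can store at most 65,536 rows, while a worksheet in an xlsx file can store at most 1,048,576 rows, which is enough for the vast majority of storage needs.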
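
If a crawl ever produced more rows than one worksheet can hold, one option is to split the rows across several sheets. The sketch below shows only the splitting step in plain Python; `split_for_sheets` is a hypothetical helper, and the tiny `max_rows=3` in the demo is just there to make the result easy to inspect.

```python
XLS_MAX_ROWS = 65536      # rows per worksheet in the xls format
XLSX_MAX_ROWS = 1048576   # rows per worksheet in the xlsx format


def split_for_sheets(rows, header, max_rows=XLSX_MAX_ROWS):
    """Yield one list of rows per worksheet, each starting with the header."""
    if max_rows < 2:
        raise ValueError("max_rows must leave room for the header and one data row")
    chunk_size = max_rows - 1  # one row per sheet is reserved for the header
    for start in range(0, len(rows), chunk_size):
        yield [header] + rows[start:start + chunk_size]


# Demo with made-up rows and a deliberately tiny limit
header = ["productId", "productName"]
rows = [[i, f"course-{i}"] for i in range(5)]
sheets = list(split_for_sheets(rows, header, max_rows=3))
print([len(sheet_rows) for sheet_rows in sheets])
```

Each chunk can then be written to its own worksheet, one row at a time.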

The code is as follows:

import openpyxl


def save_excel(sheet, index):
  """Append every course on page `index` to the worksheet."""
  content = get_content(get_json(index)) or []
  for item in content:
    row = [item['productId'], item['productName'], item['lectorName'], item['score']]
    sheet.append(row)


if __name__ == '__main__':
  print('Starting')
  wb_name = 'python_courses.xlsx'
  wb = openpyxl.Workbook()
  # Workbook() already contains one empty sheet; reuse it instead of adding a second
  sheet = wb.active
  sheet.title = 'first_sheet'
  excel_head = ['Product ID', 'Product name', 'Institution name', 'Rating']
  sheet.append(excel_head)
  total_page_count = get_json(1)['result']['query']['totlePageCount']  # total number of pages
  for index in range(1, total_page_count + 1):
    save_excel(sheet, index)
  wb.save(wb_name)

The saved data is as follows:

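MySQL Storage

MySQL is a relational database. A relational database is built on the relational model and stores data in two-dimensional tables: each column represents a field and each row represents a record. A table can be thought of as a collection of entities of one kind.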
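
To make the table, column, and row idea concrete without needing a MySQL server, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in. It is not MySQL: the column names and sample rows are assumptions for illustration, and SQLite uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL server here
conn = sqlite3.connect(":memory:")

# Each column is a field; the id column is filled in automatically
conn.execute(
    """
    CREATE TABLE course (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id INTEGER,
        product_name TEXT,
        lector_name TEXT,
        score REAL
    )
    """
)

# Each tuple is one record (made-up sample data)
rows = [
    (1001, "Python Basics", "Example School", 4.8),
    (1002, "Web Scraping with Python", "Example Academy", 4.6),
]
conn.executemany(
    "INSERT INTO course (product_id, product_name, lector_name, score) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

# Unlike a text file, the table can be filtered and sorted with a query
top = conn.execute(
    "SELECT product_name, score FROM course WHERE score > ? ORDER BY score DESC",
    (4.7,),
).fetchall()
count = conn.execute("SELECT COUNT(*) FROM course").fetchone()[0]
conn.close()
print(count, top)
```

Passing values through placeholders, rather than formatting them into the SQL string, lets the driver handle quoting and escaping.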

The code is as follows:

import time

import pymysql


conn = pymysql.connect(
  host='localhost',
  port=3306,
  user='root',
  password='root',
  database='flask',
  charset='utf8mb4'  # full Unicode, including 4-byte characters
)
cur = conn.cursor()


def save_to_mysql(course_list):
    # The leading 0 fills the first column; this assumes that column is an auto-increment id
    course_data = [
        (0, item["productId"], item["productName"], item["lectorName"], item["score"])
        for item in course_list
    ]
    placeholders = ','.join(['%s'] * 5)  # one %s placeholder per column
    sql_course = f"insert into course values ({placeholders})"
    cur.executemany(sql_course, course_data)


def main(index):
    content = get_json(index)  # fetch the JSON data
    course_list = get_content(content)  # up to 50 records on page `index`
    if course_list:
        save_to_mysql(course_list)


if __name__ == "__main__":
    print("Starting")
    start = time.time()
    total_page_count = get_json(1)["result"]["query"]["totlePageCount"]  # total number of pages
    for index in range(1, total_page_count + 1):
        main(index)
    conn.commit()  # commit the inserts before releasing the cursor and connection
    cur.close()
    conn.close()
    end = time.time()
    print(f"Finished, elapsed time: {end-start} seconds")

The saved data is as follows:

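MongoDB Document Storage

MongoDB is a non-relational (NoSQL) database. NoSQL stands for Not Only SQL. NoSQL stores are built around key-value pairs (MongoDB stores them as JSON-like documents), do not go through an SQL parsing layer, and do not couple records to one another, which generally gives them high performance. For crawler data, a record may be missing some fields because extraction failed, and the shape of the data may change at any time. Records may also contain nested structures. Storing such data in a relational database means creating tables in advance and, where the data is nested, serializing it before it can be stored, which is inconvenient. A non-relational database avoids these problems and is simpler and more efficient for this kind of data.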
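
The following sketch illustrates the serialization step described above, using only the standard library. The `documents` list, the `provider` field, and the `to_row` helper are made-up examples: one record is nested and the other is missing fields, which is the situation a fixed relational table handles awkwardly.

```python
import json

# Made-up scraped records: the first is nested, the second lacks some fields
documents = [
    {"productId": 1001, "productName": "Python Basics",
     "provider": {"lectorName": "Example School", "tags": ["python", "beginner"]},
     "score": 4.8},
    {"productId": 1002, "productName": "Web Scraping with Python"},
]

# A relational table needs a fixed list of columns decided up front
COLUMNS = ("productId", "productName", "provider", "score")


def to_row(doc):
    """Flatten one document into a tuple that fits the fixed columns."""
    row = []
    for column in COLUMNS:
        value = doc.get(column)  # missing fields become None (NULL)
        if isinstance(value, (dict, list)):
            # Nested values must be serialized to text before they fit in a cell
            value = json.dumps(value, ensure_ascii=False)
        row.append(value)
    return tuple(row)


rows = [to_row(doc) for doc in documents]
for row in rows:
    print(row)
```

A document store skips this step: dictionaries like those in `documents` can be stored as they are (for example with pymongo's `insert_many`), and each document keeps its own set of fields.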

The code is as follows:

import pymongo


MONGO_CONNECTION_STRING = 'mongodb://localhost:27017'
MONGO_DB_NAME = 'course'
MONGO_COLLECTION_NAME = 'course'
client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client[MONGO_DB_NAME]
collection = db[MONGO_COLLECTION_NAME]


def save_data(item):
    # Upsert keyed on the course name: update the matching document,
    # or insert a new one if none exists, so reruns do not create duplicates
    collection.update_one({
        'name': item['productName']
    }, {
        '$set': item
    }, upsert=True)


if __name__ == '__main__':
  total_page_count = get_json(1)["result"]["query"]["totlePageCount"]  # total number of pages
  for page in range(1, total_page_count + 1):
    content = get_content(get_json(page)) or []
    for item in content:
      save_data(item)

The saved data is as follows:

