Scraping Zhihu with Scrapy

Today I'll show you how to scrape the "Hot" section on Zhihu and extract the title, content, author, comments, and upvote count of each entry. With this data you can pull out whatever you need for analysis and visualization, or even build your own site to display it.

1. Install the software: scrapy + selenium + chrome (see my previous article for details, so I won't repeat them here)

2. Next, let's go straight to the code, with explanations along the way

1) To scrape Zhihu we first need to simulate a login, grab the cookies, and save them to a MongoDB database, as follows:

# -*- coding: utf-8 -*-

import sys
import time
import pymongo

reload(sys)
sys.setdefaultencoding("utf-8")

if __name__ == '__main__':
    # Connect to MongoDB and create the database/collection that will hold the cookies
    client = pymongo.MongoClient(host="mongodb://192.168.98.5:27017")
    dbs = client["zhihu"]
    table = dbs["cookies"]
    from selenium import webdriver
    # browser = webdriver.Chrome()
    # ChromeOptions is a handy class for controlling the flags Chrome starts with
    option = webdriver.ChromeOptions()
    # Start in headless mode
    option.add_argument("--headless")
    # The Chrome docs recommend this flag to work around a headless-mode bug
    option.add_argument("--disable-gpu")
    # Disable the sandbox
    option.add_argument("--no-sandbox")
    # Run as a single process
    option.add_argument("--single-process")
    # Set the window size
    option.add_argument("--window-size=414,736")
    # Set the user agent
    option.add_argument("user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'")
    browser = webdriver.Chrome(chrome_options=option)
    try:
        browser.get("https://www.zhihu.com/signin")
        browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
            "your Zhihu account")
        time.sleep(1)
        browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
            "your Zhihu password")
        time.sleep(2)
        browser.find_element_by_css_selector(
            ".Button.SignFlow-submitButton").click()
        time.sleep(3)
        zhihu_cookies = browser.get_cookies()
        cookie_dict = {}
        for cookie in zhihu_cookies:
            cookie_dict[cookie['name']] = cookie['value']
        table.insert(cookie_dict)
        print "插入成功"
        browser.close()
    except Exception, e:
        zhihu_cookies = browser.get_cookies()
        cookie_dict = {}
        for cookie in zhihu_cookies:
            cookie_dict[cookie['name']] = cookie['value']
        print cookie_dict
        browser.close()
        print e
           

ChromeOptions is a class provided by Selenium's Chrome driver. Its job is to configure Chrome at launch time: adding command-line arguments, blocking image loading, disabling JavaScript execution, and so on. Selenium's ChromeOptions is what lets us do all of this.
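
To illustrate the image-blocking and JavaScript-blocking options mentioned above, here is a minimal sketch that uses Chrome preferences via add_experimental_option. This is not part of the login script (Zhihu's sign-in page itself needs JavaScript), so treat it as an optional optimization for pages that render without scripts; the preference keys are standard Chrome content-settings keys.

from selenium import webdriver

option = webdriver.ChromeOptions()
# For Chrome content settings, 2 means "block" and 1 means "allow"
prefs = {
    "profile.managed_default_content_settings.images": 2,      # do not load images
    "profile.managed_default_content_settings.javascript": 2,  # do not execute JavaScript
}
option.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(chrome_options=option)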

To sum up, the login script connects to MongoDB and creates the database and collection, then uses Selenium to drive Chrome like a real browser: it opens the Zhihu sign-in page, finds the input boxes, enters the account and password, and reads the cookies once logged in. The cookies are extracted and stored in the specified MongoDB collection so the spider can fetch them later and attach them to its requests.

2) Configure settings: disable robots.txt, set the log level, the download delay, and the MongoDB connection

ROBOTSTXT_OBEY = False
LOG_LEVEL = "WARNING"
MONGO_URI = 'mongodb://xxx.xxx.xx.x:27017'
MONGODB_DBNAME = 'zhihu'
MONGODB_DBTABLE = 'zh_data'
MONGODB_COOKIE = 'cookies'
DOWNLOAD_DELAY = 0.8
           

3) The spider: a walkthrough of the core code

# -*- coding: utf-8 -*-
import json
import re
import sys

import pymongo
import scrapy
from copy import deepcopy

from zhihu import settings

reload(sys)
sys.setdefaultencoding('utf8')

class ZhihuSpider(scrapy.Spider):
    name = "zh"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/']
    # Entry point: Zhihu's "Hot" page
    start_url = "https://www.zhihu.com/hot"
    # API URL template for the answers under a question (used for the first page and for paging)
    question_detail_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset={1}&sort_by=default"
    # API URL template for the comments under an article
    q_detail_url = "https://www.zhihu.com/api/v4/articles/{0}/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhizhu.com",
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/68.0.3440.106 Safari/537.36"
    }

    def __init__(self, param=None, *args, **kwargs):
        super(ZhihuSpider, self).__init__(*args, **kwargs)
        client = pymongo.MongoClient(host=settings.MONGO_URI)
        dbs = client[settings.MONGODB_DBNAME]
        self.table = dbs[settings.MONGODB_DBTABLE]
        # Load the cookies that the Selenium login script stored in MongoDB
        self.cookie_dict = dbs[settings.MONGODB_COOKIE].find_one() or {}
        self.cookie_dict.pop("_id", None)

    def parse(self, response):
        section_list = response.xpath("//section[@class='HotItem']")
        for section in section_list:
            url = section.xpath(".//div[@class='HotItem-content']/a/@href").extract_first()
            title = section.xpath(".//div[@class='HotItem-content']/a/@title").extract_first()
            question_id = url.split("/")[-1]
            if "question" in url:
                detail_url = self.question_detail_url.format(question_id, 5)
                yield scrapy.Request(
                    detail_url,
                    callback=self.parse_detail,
                    meta={"meta": deepcopy(url)}
                )
            else:
                detail_url = self.q_detail_url.format(question_id)
                yield scrapy.Request(
                    detail_url,
                    callback=self.q_parse_detail,
                    meta={"url": deepcopy(url), "title": title}
                )

    def parse_detail(self, response):
        question_url = response.meta["meta"]
        detail_url = response.url
        all_dict = json.loads(response.text)
        data_dict = all_dict["data"]

        for data in data_dict:
            item1 = {}
            item1["question_url"] = question_url
            item1["title"] = data["question"]["title"]
            item1["content"] = data["content"]
            item1["comment_count"] = data["comment_count"]
            item1["voteup_count"] = data["voteup_count"]
            # \u4e00-\u9fa5 is the Unicode range of Chinese characters;
            # strip everything else (HTML tags, punctuation, ...) from the answer body
            p2 = re.compile(u'[^\u4e00-\u9fa5]')
            item1["content"] = p2.sub(r'', item1["content"])
            print "===========>question_url:{0}".format(question_url)
            print "===========>title:{0}".format(item1["title"])
            print "===========>upvotes:{0}".format(item1["voteup_count"])
            print "===========>comments:{0}".format(item1["comment_count"])
            print "===========>content:{0}".format(item1["content"])
            #self.table.insert(item1)
        paging = all_dict["paging"]
        if not paging["is_end"]:
            next_url = paging["next"]
            yield scrapy.Request(
                next_url,
                self.parse_detail,
                meta={"meta": deepcopy(question_url)}
            )

    def q_parse_detail(self, response):
        question_url = response.meta["url"]
        title = response.meta["title"]
        detail_url = response.url
        all_dict = json.loads(response.text)
        data_dict = all_dict["data"]

        for data in data_dict:
            content = data["content"]
            comment_count = 0
            vote_count = data["vote_count"]
            # Keep only the Chinese characters, same as in parse_detail
            p2 = re.compile(u'[^\u4e00-\u9fa5]')
            content = p2.sub(r'', content)
            item2 = {}
            item2["question_url"] = question_url
            item2["title"] = title
            item2["voteup_count"] = vote_count
            item2["comment_count"] = comment_count
            item2["content"] = content
            print "===========>question_url:{0}".format(question_url)
            print "===========>title:{0}".format(title)
            print "===========>upvotes:{0}".format(vote_count)
            print "===========>comments:{0}".format(comment_count)
            print "===========>content:{0}".format(content)
            #self.table.insert(item2)

        paging = all_dict["paging"]
        if not paging["is_end"]:
            next_url = paging["next"]
            yield scrapy.Request(
                next_url,
                self.q_parse_detail,
                meta={"url": deepcopy(question_url), "title": deepcopy(title)}
            )

    def start_requests(self):
        # Attach the cookies loaded from MongoDB so the first request is authenticated
        return [scrapy.Request(url=self.start_url, dont_filter=True, cookies=self.cookie_dict, headers=self.headers)]
           

The code above is straightforward. Scrapy's engine takes the first request from the spider (here start_requests yields a request for https://www.zhihu.com/hot) and hands it to the scheduler to be queued; the scheduler returns the queued request to the engine. The engine then passes it to the downloader, and in the downloader middlewares you can attach cookies, a user agent, a proxy, and so on to the request. The downloaded response is handed back to the spider's parse callback, which yields new URLs, and the whole cycle repeats.
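
For example, if you would rather attach the cookies and user agent in a downloader middleware instead of in start_requests, a minimal sketch could look like this (the class name, file location, and the priority value 543 are my own choices, not part of the original project):

# middlewares.py -- a sketch of a downloader middleware that attaches the stored cookies
import pymongo
from zhihu import settings

class ZhihuCookieMiddleware(object):
    def __init__(self):
        client = pymongo.MongoClient(host=settings.MONGO_URI)
        dbs = client[settings.MONGODB_DBNAME]
        cookie_doc = dbs[settings.MONGODB_COOKIE].find_one() or {}
        cookie_doc.pop("_id", None)  # drop MongoDB's internal id field
        self.cookies = cookie_doc

    def process_request(self, request, spider):
        # Merge the stored cookies into every outgoing request and set a fallback user agent
        request.cookies.update(self.cookies)
        request.headers.setdefault(
            "User-Agent",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
        )
        return None  # returning None lets Scrapy continue processing the request

To enable it, add something like DOWNLOADER_MIDDLEWARES = {"zhihu.middlewares.ZhihuCookieMiddleware": 543} to settings.py.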

The crawl starts from the Hot section at https://www.zhihu.com/hot. Look at the parse function: it first collects the links under the Hot section, then extracts the question or article id from each link, which is used later to fetch that entry's comments, upvotes, and so on. Note that there are two kinds of links here: one points to a question, the other to an article describing the hot topic. Each kind is handled separately before its API URL is requested. The two class attributes question_detail_url and q_detail_url are the initial JSON URL templates for those two cases.

Under each link the data is loaded as JSON: as you scroll down, more content keeps loading. If you capture this with Chrome's developer tools, you will see requests like the following under Network > XHR:

[Screenshot: the XHR requests shown in Chrome DevTools' Network panel]

Once you know this URL, requesting it returns a response that tells you the next URL to visit; you just keep extracting and requesting. Here is a sample of the JSON it returns:

{"paging": {"is_end": true, "is_start": true, "next": "https://www.zhihu.com/answers/516374549/concerned_upvoters?limit=10\u0026offset=10", "previous": "https://www.zhihu.com/answers/516374549/concerned_upvoters?limit=10\u0026offset=0", "totals": 0}, "data": []}

This JSON tells us the URLs of the next and previous pages and whether we are on the last page. We follow the next field to request the following page, and stop when is_end is true, which means all the data has been fetched. With the extraction rules in place, you can save the data to the database of your choice, and the crawler is done. Thanks for reading, and feel free to ask questions!
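
If you would rather let Scrapy handle the saving instead of calling self.table.insert inside the spider, a minimal item pipeline sketch is shown below. It reuses the MONGO_URI / MONGODB_DBNAME / MONGODB_DBTABLE settings from step 2; the class name and the priority value 300 are my own choices, and the spider would need to yield item1 / item2 instead of only printing them for the pipeline to receive the data.

# pipelines.py -- a minimal MongoDB pipeline sketch
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, db_name, table_name):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.table_name = table_name

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            db_name=crawler.settings.get("MONGODB_DBNAME"),
            table_name=crawler.settings.get("MONGODB_DBTABLE"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_uri)
        self.table = self.client[self.db_name][self.table_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # The items in this spider are plain dicts, so they can be inserted directly
        self.table.insert_one(dict(item))
        return item

To enable it, add ITEM_PIPELINES = {"zhihu.pipelines.MongoPipeline": 300} to settings.py.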