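Python web scraping: collecting the danmaku of every video on a Bilibili search page

This post is dedicated, with thanks, to everyone working on the front line!

As a humble programmer who can only contribute by staying at home during this unusual period, I am deeply grateful to those wonderful angels in white. I happened to learn a few small Python tricks over the last few days, so I am using code to express my thanks!

What it does:

Use Python to scrape the danmaku (bullet comments) of every video on a Bilibili search results page and build a word cloud from them.

1. The page to scrape: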

[Image: screenshot of the Bilibili search results page]

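2. Approach

The idea is to collect the links to all the videos on the page above. For example, the link to the first video is: https://www.bilibili.com/video/av85319370?from=search&seid=4971102771316254686

Next I extract the video number from the link, which for this video is av85319370, and then look through all the network requests made by the video page for the ones that carry the danmaku. They are the following two requests: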

[Image: screenshot of the two danmaku-related requests]

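These two requests show that the second response holds all of the video's danmaku, but the oid value that its URL needs is stored in the first response, a JSON payload (the code reads it from the cid field). So the first response has to be fetched before the second one.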
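The two-step lookup described above can be sketched offline as follows. This is a minimal sketch, not the article's exact code: the helper names (extract_aid, cid_from_pagelist, danmaku_url) are hypothetical, the two endpoint URLs are the ones used in the code later in this post, and the sample JSON is a made-up stand-in that only mirrors the data[0].cid shape the code reads.

```python
import json
import re

# Endpoints as they appear in the article's code further down.
PAGELIST_API = "https://api.bilibili.com/x/player/pagelist?aid={aid}&jsonp=jsonp"
DANMAKU_API = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"


def extract_aid(video_url):
    """Return the digits of the av number in a video link, or None."""
    match = re.search(r"av(\d+)", video_url)
    return match.group(1) if match else None


def cid_from_pagelist(json_text):
    """Read the cid of the first entry from a pagelist response body."""
    payload = json.loads(json_text)
    return str(payload["data"][0]["cid"])


def danmaku_url(cid):
    """Build the danmaku XML URL; the cid is passed as the oid parameter."""
    return DANMAKU_API.format(cid=cid)


link = "https://www.bilibili.com/video/av85319370?from=search&seid=4971102771316254686"
aid = extract_aid(link)

# Made-up response body, used here instead of a real network call.
sample = '{"code": 0, "data": [{"cid": 123456, "page": 1}]}'

print(PAGELIST_API.format(aid=aid))
print(danmaku_url(cid_from_pagelist(sample)))
```

In a real run, the sample string would be replaced by the body fetched from the pagelist URL.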

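Once the danmaku of all the videos have been collected, the jieba library is used to segment them into words. The words are then counted, and the 200 most frequent ones are used to draw the word cloud.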
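The counting step can be illustrated with a small sketch. The function name top_words and its cut and min_len parameters are hypothetical; when no tokenizer is passed in, it falls back to jieba.cut, which is what the article uses. The demo call below uses str.split as a stand-in tokenizer so the example runs without jieba installed.

```python
from collections import Counter


def top_words(text, n=200, cut=None, min_len=2):
    """Return the n most frequent tokens of at least min_len characters."""
    if cut is None:
        import jieba  # third-party; only needed when no tokenizer is supplied
        cut = jieba.cut
    words = [w for w in cut(text) if len(w) >= min_len]
    return Counter(words).most_common(n)


# Stand-in tokenizer (whitespace split) for demonstration only.
demo = top_words("aa bb aa c", n=2, cut=str.split)
print(demo)
```

Each item in the result is a (word, frequency) pair, so the words can be read directly from the pairs without converting them to strings first.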

Implementation:

1. Required libraries

For scraping the pages: requests, BeautifulSoup, lxml, json

For Chinese word segmentation: jieba

For building the word cloud: wordcloud, numpy, PIL

2. Code

# coding=utf-8
import requests,lxml,jieba,codecs,re,json
from bs4 import BeautifulSoup
from wordcloud import WordCloud
from collections import Counter
import numpy as np
from PIL import Image

barragepath = 'C:/Users/11037/Desktop/allbarrages.txt'  # local file that stores the raw danmaku
maskpicture = 'C:/Users/11037/Desktop/china.jpeg'  # mask image for the word cloud
wordsource = 'C:/Users/11037/Desktop/wordsource.txt'  # file that stores the words used to build the word cloud

barrages = []  # all raw danmaku
barragespages = []  # URLs of the danmaku XML documents

def getpage(url):  # fetch a page and return it as text
    html = requests.get(url).content.decode('utf-8')
    return html

def getbarrage(html):  # collect every danmaku on a danmaku page
    bs = BeautifulSoup(html,'lxml')
    dlist = bs.find_all('d')
    for d in dlist:
        if d.string:  # skip empty <d> elements, whose .string is None
            barrages.append(d.string)

def jsongetaid(url):  # read the cid value from the JSON page (despite its name, this returns the cid, not the aid)
    html = getpage(url)
    data = json.loads(html)
    return str(data['data'][0]['cid'])

def getAllLink(html):  # build the list of danmaku XML page URLs
    global barragespages
    jsonlink = []  # JSON page links
    url = 'https://api.bilibili.com/x/player/pagelist?aid='
    end = '&jsonp=jsonp'
    pattern = re.compile(r'av(\d+)')  # raw string; requires at least one digit
    bs = BeautifulSoup(html,'lxml')
    alist = bs.find('ul',class_="video-list clearfix").find_all('a')
    for al in alist:
        match = pattern.search(al.get('href', ''))
        if match is None:
            continue  # not a video link; skip it rather than removing it from the list being iterated
        jsonlink.append(url + match.group(1) + end)
    jsonlink = list(dict.fromkeys(jsonlink))  # drop duplicates, keep first-seen order
    newurl = 'https://api.bilibili.com/x/v1/dm/list.so?oid='
    for link in jsonlink:
        string = newurl + jsongetaid(link)
        barragespages.append(string)

def splitbarrage(barragelist):  # write out the 200 most frequent words in the danmaku list
    with open(barragepath, 'w', encoding='utf-8') as file:
        for b in barragelist:
            file.write(b + '\n')  # one danmaku per line so neighbours do not run together
    # the file is closed (and flushed) above before it is read back
    with open(barragepath, 'r', encoding='utf-8') as file:
        text = file.read()
    words = [x for x in jieba.cut(text) if len(x) >= 2]
    count = Counter(words).most_common(200)
    with open(wordsource, 'w', encoding='utf-8') as wordfile:
        for word, _ in count:  # each item is a (word, frequency) pair
            wordfile.write(word + ' ')

def createWold(url):  # build the word cloud from the word file
    with codecs.open(url,'r','utf-8') as file:
        text = file.read()
    mask = np.array(Image.open(maskpicture))
    print(text)
    wordcloud = WordCloud(
        background_color="White",
        font_path='C:/Windows/Fonts/msyh.ttc',
        width=2000,height=2000,
        mask=mask).generate(text)
    img = wordcloud.to_image()
    img.save('C:/Users/11037/Desktop/cloudpicture.jpg')
    img.show()

if __name__ == '__main__':
    # URL of the search page
    searchurl = 'https://search.bilibili.com/all?keyword=%E7%99%BD%E8%A1%A3%E5%A4%A9%E4%BD%BF&from_source=nav_search_new'
    html = getpage(searchurl)
    getAllLink(html)
    for pageurl in barragespages:  # go through every video from the search page and fetch its danmaku
        html = getpage(pageurl)
        getbarrage(html)
        print('Fetched danmaku from', str(pageurl))
    splitbarrage(barrages)
    createWold(wordsource)
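To show what the danmaku extraction step does without touching the network, here is a minimal sketch that parses a made-up XML document. It swaps in the standard library's xml.etree.ElementTree for the BeautifulSoup and lxml combination used above. The function name parse_danmaku is hypothetical, and the sample only assumes what the code above relies on: each danmaku is the text of a <d> element. The attribute values are placeholders.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the <d> elements matter to the parser below.
SAMPLE_XML = """<i>
  <d p="placeholder-1">Thank you, doctors</d>
  <d p="placeholder-2">Stay safe</d>
  <d p="placeholder-3"></d>
</i>
"""


def parse_danmaku(xml_text):
    """Return the non-empty text of every <d> element in the document."""
    root = ET.fromstring(xml_text)
    return [d.text.strip() for d in root.iter("d") if d.text and d.text.strip()]


for line in parse_danmaku(SAMPLE_XML):
    print(line)
```

The empty third element is skipped, which is the same case the None check in getbarrage guards against.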

The generated image:

[Image: the generated word cloud]

Finally, I wish all medical workers smooth work and good health, and I wish our motherland prosperity!
