Python crawler examples (Baidu Images and website images)

Basic crawler workflow

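Send the request: use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
Get the response: if the server responds normally, you receive a Response whose content is the page you asked for. It may be HTML, a JSON string, or binary data (such as an image or a video).
Parse the content: HTML can be parsed with regular expressions or a page-parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.
Save the data: the result can be stored in many forms, for example as plain text, in a database, or as a file in a specific format.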
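
The four steps above can be sketched with the standard library alone. This is only a minimal illustration, not the code used later in this post: the function names fetch, parse and save, the default User-Agent value and the 10-second timeout are all assumptions made for the example.

```python
import json
import re
from pathlib import Path
from urllib.request import Request, urlopen


def fetch(url, user_agent="Mozilla/5.0"):
    """Steps 1 and 2: send a request with an extra header and return the raw body."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read()


def parse(body, content_type):
    """Step 3: choose a parsing strategy based on the kind of content received."""
    if "json" in content_type:
        return json.loads(body.decode("utf-8"))
    if "html" in content_type:
        # A regular expression is enough for this sketch; a parsing library also works.
        return re.findall(r'<img[^>]*src="([^"]+)"', body.decode("utf-8"))
    return body  # binary data such as an image is passed through unchanged


def save(data, path):
    """Step 4: store bytes as a binary file and anything else as JSON text."""
    path = Path(path)
    if isinstance(data, bytes):
        path.write_bytes(data)
    else:
        path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return path
```

A typical call chain would be save(parse(fetch(url), "text/html"), "images.json"); in a real crawler the content type would come from the response headers rather than being passed in by hand.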

1. Baidu Images crawler

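Baidu Images shows its results in an endlessly scrolling page: to see more images you have to scroll the mouse wheel so that the page loads more of them.

An ordinary crawler that requests the Baidu Images URL therefore only receives what the page shows before any scrolling.

The trick is to look at the "index" segment of the search URL (outlined in red in the original screenshot) and replace "index" with "flip", which switches the results to a paginated view. The paginated images are in fact all present in a single page, so the crawler does not need to handle pagination at all.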
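
As a small illustration of the index/flip swap, the sketch below builds both forms of the search URL from a keyword. The helper name build_search_url is hypothetical; the query parameters are the ones that appear in the script in this post, and urlencode is used so that non-ASCII keywords are percent-encoded.

```python
from urllib.parse import urlencode


def build_search_url(word, paged=True):
    """Return the Baidu Images search URL, using "flip" for the paginated view."""
    endpoint = "flip" if paged else "index"
    query = urlencode({
        "tn": "baiduimage",
        "ps": 1,
        "ct": 201326592,
        "lm": -1,
        "cl": 2,
        "nc": 1,
        "ie": "utf-8",
        "word": word,  # the search keyword, percent-encoded by urlencode
    })
    return "https://image.baidu.com/search/" + endpoint + "?" + query


print(build_search_url("cat"))
print(build_search_url("cat", paged=False))
```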

The implementation is as follows:

import os
import re
import requests

# 1. Build the search URL from the keyword
word = input('What images would you like to see? Enter a keyword: ')

if not os.path.exists(word):
    os.mkdir(word)
url = "https://image.baidu.com/search/flip?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=" + word
head = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"}

# 2. Fetch the page source
r = requests.get(url, headers=head)  # <Response [200]>: status code 200 means the request succeeded
# ret = r.text
ret = r.content.decode('utf-8')  # ret now holds the page source

# 3. Extract every image URL from the "objURL" fields
result = re.findall('"objURL":"(.*?)",', ret)

# 4. Save the images
max_images = 50  # how many images to download; change this as needed
for i in result[:max_images]:
    try:
        r = requests.get(i, timeout=3)
    except Exception as e:
        print(e)
        continue
    # Use the last 50 characters of the URL as the file name,
    # dropping characters that are not allowed in file names
    path = re.sub(r'[\\/:*?"<>|]', '', i)[-50:]

    # Append .jpg if the name does not end with an image extension
    end = re.search(r'\.jpg$|\.jpeg$|\.gif$|\.png$', path)
    if end is None:
        path = path + '.jpg'
    print(path)
    with open(word + '/' + path, 'wb') as f:
        f.write(r.content)

You can enter any keyword you like and change the image-count parameter in the code to download the corresponding images.
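
The two parts of the script that are easiest to test in isolation are the "objURL" extraction and the file naming. The sketch below pulls them into two helpers and runs them on a made-up page fragment; the names extract_image_urls and safe_filename, the limit parameter and the sample string are assumptions for illustration, not part of Baidu's actual response.

```python
import re


def extract_image_urls(page_source, limit=50):
    """Return at most `limit` image URLs found in "objURL" fields."""
    urls = re.findall(r'"objURL":"(.*?)",', page_source)
    return urls[:limit]


def safe_filename(url, max_length=50):
    """Turn a URL into a file name that ends with an image extension."""
    # Drop characters that are not allowed in file names, then keep the tail.
    name = re.sub(r'[\\/:*?"<>|]', "", url)[-max_length:]
    if not re.search(r"\.(jpg|jpeg|gif|png)$", name):
        name += ".jpg"
    return name


# Made-up fragment that imitates the structure the regular expression expects.
sample = '{"objURL":"http://a.com/1.jpg","x":1},{"objURL":"http://b.com/2","x":2}'
for image_url in extract_image_urls(sample, limit=2):
    print(image_url, "->", safe_filename(image_url))
```

Keeping the count as a named parameter makes it clear where to change the number of images to download.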

2. Website images

The implementation is as follows:

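'''
The site has 35 listing pages in total, for example:
http://ailuotuku.com/page_1.html
'''
import os
import re  # regular-expression matching
import urllib.request, urllib.error  # build and send URL requests

import requests
from bs4 import BeautifulSoup  # HTML parsing

# Compiled regular expressions that describe what to extract
# Link to an image detail page
findlink = re.compile(r'<a href="(.*?)" target="_blank">')
# Image title
findtitle = re.compile(r'<h1>(.*?)</h1>')
# Image source URL
findimg = re.compile(r'<img.*?src="(.*?)".*?>', re.S)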

# Crawl the data
def getdata():
    head = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.93 Safari/537.36"}
    # 1. Fetch each of the 35 listing pages
    for i in range(1, 36):
        url = "http://ailuotuku.com/page_%s.html" % i
        html1 = askURL(url)
        # 2. Parse the listing page
        soup1 = BeautifulSoup(html1, "html.parser")
        for item1 in soup1.find_all('div', class_="update_area_content"):
            item1 = str(item1)
            # Find every detail-page link with the regular expression
            link = re.findall(findlink, item1)
            for j in link:
                html2 = askURL(j)
                soup2 = BeautifulSoup(html2, "html.parser")
                title = []
                data_img = []
                for item2 in soup2.find_all('div', class_="main_left single_mian"):
                    title = re.findall(findtitle, str(item2))
                for item3 in soup2.find_all('div', class_='content_left'):
                    data_img.extend(re.findall(findimg, str(item3)))
                print(title)
                print(data_img)
                # Save the images of this detail page, one folder per title
                for filename in title:
                    folder = 'data/%s' % filename
                    # Also creates the parent "data" folder if it is missing
                    os.makedirs(folder, exist_ok=True)
                    for img_url in data_img:
                        file = img_url.split('/')[-1]
                        response = requests.get(img_url, headers=head)
                        with open(os.path.join(folder, file), 'wb') as f:
                            f.write(response.content)

def askURL(url):
    # Send a browser-like User-Agent so the request looks like a normal visit
    head = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.93 Safari/537.36"}

    request = urllib.request.Request(url, headers=head)
    html = ""

    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # Print the HTTP status code and the failure reason when available
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


def main():
    getdata()


if __name__ == "__main__":
    main()
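
To see what the three regular expressions in this script capture without touching the network, they can be run on small hand-written fragments. The HTML below is made up for illustration and only imitates the structure the patterns expect; it is not copied from the real site.

```python
import re

# The same three patterns as in the script above.
findlink = re.compile(r'<a href="(.*?)" target="_blank">')
findtitle = re.compile(r'<h1>(.*?)</h1>')
findimg = re.compile(r'<img.*?src="(.*?)".*?>', re.S)

# Made-up fragments standing in for a listing page and a detail page.
listing = ('<div class="update_area_content">'
           '<a href="http://example.com/post_1.html" target="_blank">A</a></div>')
detail = ('<div><h1>Sample title</h1>'
          '<img alt="x"\n src="http://example.com/a.jpg" /></div>')

links = findlink.findall(listing)   # detail-page URLs found on the listing page
titles = findtitle.findall(detail)  # text of the <h1> heading
images = findimg.findall(detail)    # image URLs; re.S lets the tag span lines

print(links)
print(titles)
print(images)
print(images[0].split('/')[-1])     # file name used when saving the image
```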
