Writing a Python Web Crawler from Scratch

Python is widely used for writing crawlers that fetch the information we need from the web.

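Writing a crawler in Python relies on a few packages:

requests

urllib (part of the Python standard library)

BeautifulSoup

and so on. For an overview of the available tools, see:

A List of Python Crawler Tools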

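Today we will build a simple crawler. Trading meme images (鬥圖) is popular in online chat, and I happened to come across the site www.doutula.com. The images there are pretty funny and worth saving for our own use.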

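Let's scrape the "latest meme images" section.

The list is paginated at the bottom, so first look at the format of the page URLs.

It is easy to see that each page has a URL of the form https://www.doutula.com/photo/list/?page=x, where x is the page number.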
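The pagination pattern can be turned into a list of page URLs with a few lines. This is only a minimal sketch: the helper name `build_page_urls` is made up for illustration, and the page range in the loop is an arbitrary choice.

```python
BASE_URL = 'https://www.doutula.com/photo/list/?page='


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive).

    Hypothetical helper, not part of any library.
    """
    return [BASE_URL + str(page) for page in range(first_page, last_page + 1)]


for page_url in build_page_urls(1, 3):
    print(page_url)
```

Building the list up front keeps the URL logic separate from the downloading code that follows.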

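Let's take it step by step.

Start by grabbing just the images on the first page.

Each downloaded image is renamed and stored in the images directory under the project root.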
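The renaming step boils down to keeping the file extension from the image URL and replacing the name with a running counter. A small sketch of that idea follows; `local_image_path` is a hypothetical helper, and the sample URL is the one that appears as a comment in the code later in this post.

```python
import os


def local_image_path(src, index, folder='images'):
    """Build the local path for the index-th image, keeping the extension of src.

    Hypothetical helper for illustration only.
    """
    filename = src.split('/')[-1]        # last path segment, e.g. 'abc.gif'
    extension = filename.split('.')[-1]  # text after the last dot, e.g. 'gif'
    return os.path.join(folder, str(index) + '.' + extension)


sample_src = '//ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif'
print(local_image_path(sample_src, 0))
```

Note that `str(index)` is required: concatenating an int and a str directly raises a TypeError.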

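Next, look at how the img tags on the page are written. The code below selects them by the class img-responsive lazy image_dta and reads the real image address from the data-original attribute.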
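To make the tag format concrete, here is a sketch that pulls the image address out of one sample tag. Two things are assumptions: the sample markup is reconstructed from the class and attribute names used in this post rather than copied from the site, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Assumed sample markup, modelled on the names used in this post.
SAMPLE_HTML = (
    '<img class="img-responsive lazy image_dta" '
    'data-original="//ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif">'
)


class ImgCollector(HTMLParser):
    """Collect the data-original value of every <img> carrying the image_dta class."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        if tag == 'img' and 'image_dta' in classes:
            src = attributes.get('data-original')
            if src:
                # Protocol-relative URLs ('//host/path') need a scheme added.
                self.sources.append(src if src.startswith('http') else 'http:' + src)


collector = ImgCollector()
collector.feed(SAMPLE_HTML)
print(collector.sources)
```

The real image address sits in `data-original` rather than `src`, and it may start with `//`, which is why a scheme is prepended.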

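Now we can start writing the program. Create a project in the PyCharm IDE.

We will use requests and urllib. urllib ships with Python, but requests has to be installed first: open File -> Default Settings.

Click the green + on the right.

Install BeautifulSoup (the package is named beautifulsoup4 and is imported as bs4) and lxml the same way.

With the packages installed, we can import them and start coding: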


import os
# Import the submodule explicitly instead of relying on another package to load it
import urllib.request

import requests
from bs4 import BeautifulSoup

url = 'https://www.doutula.com/photo/list/?page=1'
# Browser-like headers for the image downloads.
# Accept-Encoding is left out because urllib does not decompress responses.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}
os.makedirs('images', exist_ok=True)  # make sure the target folder exists

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})
i = 0
for img in img_list:
    src = img['data-original']
    print(src)
    # src looks like '//ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif'
    if not src.startswith('http'):
        src = 'http:' + src
    filename = src.split('/').pop()
    fileextra = filename.split('.').pop()
    filestring = str(i) + '.' + fileextra  # i is an int, so convert it before concatenating
    path = os.path.join('images', filestring)
    # Download the image
    req = urllib.request.Request(url=src, headers=headers)
    cont = urllib.request.urlopen(req).read()
    with open(path, 'wb') as f:  # the with block closes the file for us
        f.write(cont)
    i += 1


Note:

1. Send headers with the image request so that it looks like it comes from a browser; most sites do not allow scraping.
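As a sketch of what "adding headers" means with urllib, the snippet below builds a request object carrying a browser-like User-Agent without sending anything over the network. `build_image_request` is a hypothetical helper, and the URL is just the sample image address that appears in this post.

```python
import urllib.request

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/43.0.235',
}


def build_image_request(src):
    """Return a Request for src with browser-like headers; nothing is sent yet."""
    return urllib.request.Request(url=src, headers=HEADERS)


req = build_image_request('http://ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif')
# urllib stores header names capitalized, hence 'User-agent' when reading it back.
print(req.get_header('User-agent'))
```

Passing the resulting object to `urllib.request.urlopen` performs the actual download.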

With one page done, let's try several pages. Here we download the images from pages 1 and 2:


import datetime
import os
# Import the submodule explicitly instead of relying on another package to load it
import urllib.request

import requests
from bs4 import BeautifulSoup

# begin
print(datetime.datetime.now())
base_url = 'https://www.doutula.com/photo/list/?page='
URL_LIST = []
for x in range(1, 3):  # pages 1 and 2
    URL_LIST.append(base_url + str(x))
# Browser-like headers for the image downloads.
# Accept-Encoding is left out because urllib does not decompress responses.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}
os.makedirs('images', exist_ok=True)  # make sure the target folder exists
i = 0
for page_url in URL_LIST:
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, 'lxml')
    img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})
    for img in img_list:  # the images on one page
        src = img['data-original']
        print(src)
        if not src.startswith('http'):
            src = 'http:' + src
        filename = src.split('/').pop()
        fileextra = filename.split('.').pop()
        filestring = str(i) + '.' + fileextra
        path = os.path.join('images', filestring)
        # Download the image
        req = urllib.request.Request(url=src, headers=headers)
        cont = urllib.request.urlopen(req).read()
        with open(path, 'wb') as f:  # the with block closes the file for us
            f.write(cont)
        i += 1
# end
print(datetime.datetime.now())

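That completes multi-page scraping, but it is rather slow. Downloading every page this way would take a while.

Python supports multithreading, which we can use to speed things up:

Let's analyze the task. We store all the page URLs in one list and all the image URLs in another, then take the image URLs in order and download them one after another.

This is an ordered multithreaded process, in other words the producer-consumer pattern, with a lock-protected list serving as the queue (FIFO: first in, first out).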
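The standard library's `queue.Queue` packages the "list plus lock" idea into a ready-made thread-safe FIFO queue. The sketch below swaps it in for the hand-rolled list and uses a made-up `fake_download` function and example.com URLs instead of real network access, so it only illustrates the shape of the pattern, not the crawler in this post.

```python
import queue
import threading

NUM_CONSUMERS = 3
STOP = None  # sentinel value telling a consumer to exit

url_queue = queue.Queue()  # thread-safe FIFO queue
results = []
results_lock = threading.Lock()


def fake_download(url):
    """Stand-in for a real download (assumption): returns the file name in the URL."""
    return url.split('/')[-1]


def producer(urls):
    """Put every URL on the queue."""
    for url in urls:
        url_queue.put(url)


def consumer():
    """Take URLs off the queue until the STOP sentinel arrives."""
    while True:
        url = url_queue.get()  # blocks until an item is available
        if url is STOP:
            break
        name = fake_download(url)
        with results_lock:
            results.append(name)


urls = ['http://example.com/%d.gif' % n for n in range(10)]
consumers = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for t in consumers:
    t.start()
producer_thread = threading.Thread(target=producer, args=(urls,))
producer_thread.start()
producer_thread.join()
for _ in consumers:
    url_queue.put(STOP)  # one sentinel per consumer, queued after all real work
for t in consumers:
    t.join()
print(sorted(results))
```

Because the sentinels are queued after all the real URLs, each consumer keeps working until the queue is drained and then exits cleanly.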

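Recall how a queue behaves: items join at the back and leave from the front, so the first one in is the first one out.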
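A tiny illustration of the difference between queue order and stack order, using `collections.deque` for the queue. The file names are placeholders.

```python
from collections import deque

pending = deque()
for name in ['a.gif', 'b.gif', 'c.gif']:
    pending.append(name)  # enqueue at the back

# Dequeue from the front: first in, first out.
fifo_order = [pending.popleft() for _ in range(3)]

stack = ['a.gif', 'b.gif', 'c.gif']
# list.pop() with no argument removes the LAST item: last in, first out.
lifo_order = [stack.pop() for _ in range(3)]

print(fifo_order)  # ['a.gif', 'b.gif', 'c.gif']
print(lifo_order)  # ['c.gif', 'b.gif', 'a.gif']
```

A plain list only behaves like a FIFO queue when items are taken with `pop(0)`; a bare `pop()` gives stack order instead.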

Now for the code. We download the images from page 1 through page 99:


import datetime
import os
import threading
import time
# Import the submodule explicitly instead of relying on another package to load it
import urllib.request

import requests
from bs4 import BeautifulSoup

i = 0
FACE_URL_LIST = []
URL_LIST = []
base_url = 'https://www.doutula.com/photo/list/?page='
for x in range(1, 100):
    URL_LIST.append(base_url + str(x))
# Initialize the lock
gLock = threading.Lock()
# Set once every producer has finished, so the consumers know when to stop
producers_done = threading.Event()
# Browser-like headers for the image downloads.
# Accept-Encoding is left out because urllib does not decompress responses.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}


# Producer: extracts the meme image URLs from each list page
class producer(threading.Thread):
    def run(self):
        while True:
            # Check and pop under the same lock so two producers cannot race for the last URL
            with gLock:
                if not URL_LIST:
                    break
                cur_url = URL_LIST.pop(0)  # pop(0) takes the oldest entry (FIFO)
            # The lock is released before the slow network request
            response = requests.get(cur_url)
            soup = BeautifulSoup(response.content, 'lxml')
            img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})
            with gLock:
                for img in img_list:  # the images on one page
                    src = img['data-original']
                    print(src)
                    if not src.startswith('http'):
                        src = 'http:' + src
                    FACE_URL_LIST.append(src)
            time.sleep(0.5)


# Consumer: takes URLs out of FACE_URL_LIST and downloads the images
class consumer(threading.Thread):
    def run(self):
        global i
        print('%s is running' % threading.current_thread().name)
        while True:
            # Read the flag BEFORE looking at the list: if it was already set and the
            # list is empty, no producer can add anything afterwards
            done = producers_done.is_set()
            with gLock:
                face_url = FACE_URL_LIST.pop(0) if FACE_URL_LIST else None
            if face_url is None:
                if done:
                    break
                time.sleep(0.1)  # nothing to do yet; wait for the producers
                continue
            # Keep the original file name so that threads never write to the same path
            filename = face_url.split('/').pop()
            path = os.path.join('images', filename)
            # Download the image
            req = urllib.request.Request(url=face_url, headers=headers)
            cont = urllib.request.urlopen(req).read()
            with open(path, 'wb') as f:  # the with block closes the file for us
                f.write(cont)
            with gLock:  # the counter is shared between threads
                print(i)
                i += 1


if __name__ == '__main__':  # only when run as a script
    # begin
    print(datetime.datetime.now())
    os.makedirs('images', exist_ok=True)  # make sure the target folder exists
    # 2 producer threads collect the image links from the pages
    producers = [producer() for x in range(2)]
    # 5 consumer threads take links from FACE_URL_LIST and download them
    consumers = [consumer() for x in range(5)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    producers_done.set()
    for t in consumers:
        t.join()
    # end: printed only after all threads have finished
    print(datetime.datetime.now())

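Take a look in the images folder: it is now full of pictures. No more running out of memes!

OK, that's it for this post. Finally, a little plug for Python.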
