Python Crawler Tutorial for Beginners 57-100: Advanced Python Crawling Techniques, CAPTCHA Part 3: Slider CAPTCHA Recognition

Introduction to the slider CAPTCHA

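The CAPTCHA covered in this post is a slider CAPTCHA. Unlike GeeTest, it is somewhat easier: you only need to drag the slider to the right end of the rectangular track to pass.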


This type of CAPTCHA is no longer very common. The official introduction page is:

https://promotion.aliyun.com/ntms/act/captchaIntroAndDemo.html

It is quite secure in practice, and its bot detection is not easy to get past.

How to identify the CAPTCHA type

This CAPTCHA is usually easy to recognize: if the page source loads a script named nc.js, you can be fairly sure it is the Alibaba Cloud CAPTCHA.

<script type="text/javascript" src="//g.alicdn.com/sd/ncpc/nc.js?t=1552906749855"></script>
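The check described above can be automated before deciding how to handle a page. The following is a minimal sketch, not an official detection method: the function name `has_aliyun_nc_script` and the regular expression are my own assumptions, and it only looks for a script tag whose src ends in /nc.js.

```python
import re

# Matches a <script> tag whose src path ends in "/nc.js", with an optional
# query string such as "?t=1552906749855". This pattern is an assumption
# based on the single example tag shown in this post.
NC_SCRIPT_RE = re.compile(
    r'<script[^>]+src=["\'][^"\']*/nc\.js(?:\?[^"\']*)?["\']',
    re.IGNORECASE,
)


def has_aliyun_nc_script(html):
    """Return True if the page source appears to load nc.js."""
    return NC_SCRIPT_RE.search(html) is not None


if __name__ == '__main__':
    sample = ('<script type="text/javascript" '
              'src="//g.alicdn.com/sd/ncpc/nc.js?t=1552906749855"></script>')
    print(has_aliyun_nc_script(sample))
```

A match is only a hint; a site could serve the same script under a different file name.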

Approach

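As of March 18, 2019, this CAPTCHA has added a large number of checks for selenium-specific keywords, so a plain simulated drag is quite likely to be blocked. Crawling techniques are also time-sensitive, so there is no guarantee that this approach will still work after a while.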

Import the selenium modules and methods we need

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains

import time
import random

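Before starting selenium, set up a local proxy to do some basic anti-anti-crawling work. Many crawlers favor selenium when they have to deal with browser fingerprinting, because driving a real browser can get around client-side JS encryption, crawler detection, and signature mechanisms.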

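However, more and more sites now block selenium, because a selenium session exposes some predefined JavaScript variables (feature strings). One example is "window.navigator.webdriver": outside selenium its value is undefined, while under selenium its value is true.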

(Figure: the value printed in the Chrome console when Chrome is driven by selenium.)
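To see the same value from Python instead of the console, you can ask the browser for it with `execute_script`, which selenium's WebDriver provides. This is a small sketch under that assumption; the helper names `read_webdriver_flag` and `looks_automated` are hypothetical and not part of selenium.

```python
def read_webdriver_flag(driver):
    """Return navigator.webdriver as page scripts would see it.

    `driver` is expected to be a selenium WebDriver (anything with an
    execute_script method works). An undefined value comes back as None.
    """
    return driver.execute_script("return navigator.webdriver")


def looks_automated(driver):
    """True only when the page can read navigator.webdriver === true."""
    return read_webdriver_flag(driver) is True


# Usage with a real browser (requires selenium and chromedriver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# print(looks_automated(driver))
```

Running it once without the proxy and once with it is a quick way to check whether the masking described below took effect.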

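A thorough way around this would probably need a post of its own. Here I only mask the property mentioned above, using mitmdump (covered in an earlier post) as the proxy.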

https://docs.mitmproxy.org/stable/concepts-certificates/

Proxying with mitmdump

Technical reference:

https://zhuanlan.zhihu.com/p/43581988

For the basics of this module, see my earlier posts. The core code used here is as follows:

indject_js_proxy.py

from mitmproxy import ctx
injected_javascript = '''
// overwrite the `languages` property to use a custom getter
Object.defineProperty(navigator, "languages", {
  get: function() {
    return ["zh-CN","zh","zh-TW","en-US","en"];
  }
});
// Overwrite the `plugins` property to use a custom getter.
Object.defineProperty(navigator, 'plugins', {
  get: () => [1, 2, 3, 4, 5],
});
// Pass the Webdriver test
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});
// Pass the Chrome Test.
// We can mock this in as much depth as we need for the test.
window.navigator.chrome = {
  runtime: {},
  // etc.
};
// Pass the Permissions Test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
  parameters.name === 'notifications' ?
    Promise.resolve({ state: Notification.permission }) :
    originalQuery(parameters)
);
'''
 
def response(flow):
    # Only process 200 responses of HTML content.
    if flow.response.status_code != 200:
        return
    if 'text/html' not in flow.response.headers.get('content-type', ''):
        return

    # Inject a script tag containing the JavaScript.
    html = flow.response.text
    html = html.replace('<head>', '<head><script>%s</script>' % injected_javascript)
    flow.response.text = str(html)
    ctx.log.info('>>>> JavaScript injected successfully <<<<')

    # If the URL starts with target, replace the page content with the current URL
    # target = 'https://target-url.com'
    # if flow.request.url.startswith(target):
    #     flow.response.text = flow.request.url
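One limitation of the plain string replace is that it only matches a literal `<head>`, so a page with `<HEAD>` or `<head lang="en">` gets no injected script. The sketch below shows one possible, more tolerant version of that step. It is my own illustration rather than part of the referenced article, and the name `inject_script` is hypothetical.

```python
import re

# Opening <head> tag, any case, with optional attributes.
# The \s requirement keeps tags such as <header> from matching.
HEAD_OPEN_RE = re.compile(r'<head(\s[^>]*)?>', re.IGNORECASE)


def inject_script(html, javascript):
    """Insert a <script> block right after the first opening <head> tag.

    Returns (new_html, injected), where injected is False when the page
    has no <head> tag and the html is returned unchanged.
    """
    tag = '<script>%s</script>' % javascript
    # A function replacement avoids treating backslashes in the
    # JavaScript source as regex escape sequences.
    new_html, count = HEAD_OPEN_RE.subn(
        lambda match: match.group(0) + tag, html, count=1)
    return new_html, count == 1


if __name__ == '__main__':
    page = '<html><head lang="en"><title>demo</title></head></html>'
    print(inject_script(page, 'var injected = true;'))
```

Inside the mitmdump script this could take the place of the `html.replace(...)` line, logging success only when the second return value is True.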

Place the script above in any directory, then start mitmdump with it:

C:\user>mitmdump -s indject_js_proxy.py   
Loading script indject_js_proxy.py
Proxy server listening at http://*:8080

Once it is running, visit the test site through webdriver:

https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html

If the WebDriver entry shows green, the proxy is working.

Crawling with selenium

Next, use selenium to simulate the user's actions. This part of the code is fairly simple; the comments explain each step.

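# Create an options object
chrome_options = Options()
# Add the startup argument that routes traffic through the local mitmdump proxy
chrome_options.add_argument('--proxy-server=127.0.0.1:8080')
# Pass the options object to Chrome to start a browser that uses the proxy
driver = webdriver.Chrome(options=chrome_options)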

Key function

def move_to_gap(tracks):

    driver.get("https://passport.zcool.com.cn/regPhone.do?appId=1006&cback=https://my.zcool.com.cn/focus/activity")

    # Locate the slider span ("xpath" is the value of By.XPATH)
    need_move_span = driver.find_element("xpath", '//*[@id="nc_1_n1t"]/span')
    # Press and hold the left mouse button on the slider
    ActionChains(driver).click_and_hold(need_move_span).perform()
    for x in tracks:  # Replay a human-like drag track
        print(x)
        ActionChains(driver).move_by_offset(xoffset=x,yoffset=random.randint(1,3)).perform()
    time.sleep(1)
    ActionChains(driver).release().perform()  # Release the left button
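One detail of the loop above is worth a second look: every step adds a positive yoffset of 1 to 3 pixels, so over a few hundred steps the pointer keeps drifting downward. Whether that matters to the detector is something I have not verified, but a bounded jitter is easy to sketch. The helper `with_vertical_jitter` and its `max_drift` parameter are hypothetical names introduced only for this example.

```python
import random


def with_vertical_jitter(tracks, max_drift=3, rng=None):
    """Pair each x offset with a small y offset whose running sum stays bounded.

    Returns a list of (dx, dy) tuples. The cumulative vertical position
    never leaves the range [-max_drift, max_drift].
    """
    rng = rng or random.Random()
    y = 0
    pairs = []
    for dx in tracks:
        dy = rng.randint(-1, 1)
        if abs(y + dy) > max_drift:
            dy = -dy  # bounce back instead of leaving the allowed band
        y += dy
        pairs.append((dx, dy))
    return pairs


if __name__ == '__main__':
    for dx, dy in with_vertical_jitter([0, 1, 2, 3, 2, 1, -2, -1]):
        print(dx, dy)
```

In the drag loop, each pair would be passed as `move_by_offset(xoffset=dx, yoffset=dy)` in place of the random 1 to 3 value.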

Note the key element in the code above: tracks, the list of drag distances.

if __name__ == '__main__':
    move_to_gap(get_track(295))

For this part, you can simply borrow a solution found online.

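def get_track(distance):
    '''
    Build a movement track that imitates a human drag:
    uniform acceleration first, then uniform deceleration.
    Basic equations of uniformly accelerated motion:
    ① v = v0 + at
    ② s = v0t + (1/2)at²
    ③ v² - v0² = 2as

    :param distance: distance to move
    :return: list of displacements, one per time step t
    '''
    # Initial velocity
    v=0
    # Time step in seconds; each track entry is the displacement within one step
    t=0.1
    # List of displacements; each element is the displacement during one time step
    tracks=[]
    # Current position
    current=0
    # Start decelerating once mid is reached
    mid=distance * 4/5

    distance += 10  # Overshoot a little first, then slide back at the end

    while current < distance:
        if current < mid:
            # The smaller the acceleration, the smaller the displacement per step,
            # so the track has more, finer steps
            a = 2  # Accelerating
        else:
            a = -3 # Decelerating

        # Velocity at the start of this step
        v0 = v
        # Displacement during this time step
        s = v0*t+0.5*a*(t**2)
        # Update the current position
        current += s
        # Append to the track list
        tracks.append(round(s))

        # Velocity at the end of this step, used as the next initial velocity
        v= v0+a*t

    # Slide back to roughly the correct position
    for i in range(3):
       tracks.append(-2)
    for i in range(4):
       tracks.append(-1)
    return tracks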

The code is already commented; read through it and type it out once and it should be clear.
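Because get_track rounds each step on its own, the sum of the track can differ from the requested distance by a few pixels (the early steps, for example, all round to 0). The variant below is a sketch of one way to avoid that by rounding the running position instead of each step. It is my own adaptation, not the borrowed solution; `get_track_exact` and its parameters are hypothetical names, and `distance` is assumed to be a non-negative integer.

```python
def get_track_exact(distance, overshoot=10, t=0.1, accel=2.0, decel=-3.0):
    """Return integer x offsets that overshoot, slide back, and sum to distance.

    distance and overshoot are assumed to be non-negative integers (pixels).
    """
    target = distance + overshoot
    mid = distance * 4 / 5   # switch from accelerating to decelerating here
    v = 0.0
    current = 0.0            # exact floating-point position
    emitted = 0              # integer position already handed out as steps
    tracks = []
    while current < target:
        a = accel if current < mid else decel
        s = v * t + 0.5 * a * t * t
        v += a * t
        if s <= 0:
            break            # ran out of speed before reaching the target
        current = min(current + s, target)
        step = round(current) - emitted  # carries the rounding remainder forward
        tracks.append(step)
        emitted += step
    if emitted < target:
        tracks.append(target - emitted)  # cover what is left after a stall
        emitted = target
    # Slide back over the overshoot in steps of at most 2 pixels
    back = emitted - distance
    while back > 0:
        step = min(2, back)
        tracks.append(-step)
        back -= step
    return tracks


if __name__ == '__main__':
    track = get_track_exact(295)
    print(len(track), sum(track))
```

It can be passed to the drag function the same way, for example `move_to_gap(get_track_exact(295))`. The stall branch matters for short distances, where the deceleration phase would otherwise run out of speed before the overshoot point.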

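Finally, it was time to try it out. In testing, the slider could be dragged automatically, but the session was still identified as a bot in the end. I revised and tuned the code many times and eventually concluded that solving this purely in code is fairly complicated. So I changed strategy and looked for a modified build of chromedriver.exe with some of the selenium keywords removed. I was lucky and found one.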

Currently only Windows 10 and Linux 16.04 builds are available.

Gitee address:

https://gitee.com/bobozhangyx/java-crawler/tree/master/file/%E7%BC%96%E8%AF%91%E5%90%8E%E7%9A%84chromedriver

After downloading it, replace your own chromedriver.exe and run the script again. This time the verification passes.

