Web Scraping from Beginner to Expert (2) | Using the requests Module

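Contents

I. requests module basics
- 1. What requests is for
- 2. Installation
- 3. Parameters
- 4. The returned response object
- 5. Checking which request method a page uses
II. Three common cases for GET requests in requests
- 1. No request parameters (Baidu products)
- 2. With request parameters (Sina News)
- 3. Handling pagination
III. Using POST requests in requests
- 1. The json module
- 2. Common form of a POST request
- 3. Uploading files
IV. Hook functions in requests
V. Common requests errors
- 1. Connection timeout
- 2. Connect and read timeouts
- 3. Unknown server
- 4. Proxy unreachable
- 5. Proxy connection timeout
- 6. Proxy read timeout
- 7. Network environment problems
- 8. Notes from the official documentation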

Reference blog: https://blog.csdn.net/shanzhizi/article/details/50903748

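I. requests module basics

1. What requests is for

The requests library implements most of the HTTP protocol. Its features include keep-alive, connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and sessions. For a long time it supported both Python 2 and Python 3 (current releases require Python 3), and it is one of the most-starred Python projects on GitHub.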

2. Installation

pip install requests

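3. Parameters

3.1 Common parameters

import requests

base_url = 'https://httpbin.org/get'  # placeholder URL for illustration

requests.get(
    url=base_url,   # URL to request
    headers={},     # request headers, e.g. {'user-agent': 'xxx'}
    params={},      # query-string parameters, e.g. {'a': 123}
    proxies={},     # proxies, e.g. {'https': '168.168.16.16:9000'}
    timeout=3,      # timeout in seconds
    verify=False,   # skip SSL certificate verification
)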
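
To see what params and headers actually do to a request, one option is to prepare the request without sending it. The sketch below is only an illustration and not part of the original article: the helper name build_get, the httpbin.org URL, and the header value are assumptions chosen for the example. It relies on requests.Request(...).prepare(), which builds the final URL and headers locally, so no network access is needed.

```python
import requests


def build_get(url, params=None, headers=None):
    """Prepare (but do not send) a GET request so the result can be inspected."""
    request = requests.Request("GET", url, params=params, headers=headers)
    return request.prepare()


prepared = build_get(
    "https://httpbin.org/get",                    # assumed example URL
    params={"a": 123, "q": "requests"},           # becomes the query string
    headers={"user-agent": "example-agent/0.1"},  # assumed header value
)

print(prepared.method)                 # GET
print(prepared.url)                    # https://httpbin.org/get?a=123&q=requests
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive
```

Passing the same params and headers to requests.get() sends exactly this prepared request over the network.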

3.2 Supported request methods

requests.get('https://github.com/timeline.json')  # GET request
requests.post('http://httpbin.org/post')          # POST request
requests.put('http://httpbin.org/put')            # PUT request
requests.delete('http://httpbin.org/delete')      # DELETE request
requests.head('http://httpbin.org/get')           # HEAD request
requests.options('http://httpbin.org/get')        # OPTIONS request

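4. The returned response object

import requests
r=requests.get(.....)

4.1 Attributes and methods

Code	Meaning
r.status_code	Response status code
r.raw	The raw response body, i.e. the underlying urllib3 response object; pass stream=True to the request and read it with r.raw.read()
r.content	Response body as bytes; gzip and deflate compression is decoded automatically
r.text	Response body as a string, decoded using the character encoding from the response headers
r.headers	Server response headers stored in a dict-like object whose keys are case-insensitive. r.headers.get('name') returns None when the key is missing. Cookies set by the server arrive in response.headers['Set-Cookie'] and are also available as r.cookies
r.json()	Requests' built-in JSON decoder
r.raise_for_status()	Raises an exception for a failed request (4xx or 5xx status code)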
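
The attributes in the table can be tried out without touching the network by filling in a Response object by hand. This is a demo trick rather than normal usage: the helper fake_response and the example.invalid URL are assumptions made for this sketch, and _content is a private attribute that real code should not set.

```python
import requests
from requests.structures import CaseInsensitiveDict


def fake_response(status, body):
    """Build a Response locally; hypothetical helper for demonstration only."""
    r = requests.Response()
    r.status_code = status
    r._content = body  # private attribute, set here only to avoid a real request
    r.headers = CaseInsensitiveDict({"Content-Type": "application/json; charset=utf-8"})
    r.encoding = "utf-8"
    r.url = "https://example.invalid/demo"
    return r


ok = fake_response(200, b'{"name": "requests"}')
print(ok.status_code)               # 200
print(type(ok.content))             # <class 'bytes'>
print(type(ok.text))                # <class 'str'>
print(ok.json())                    # {'name': 'requests'}
print(ok.headers["content-type"])   # keys are case-insensitive
print(ok.headers.get("X-Missing"))  # None

missing = fake_response(404, b"{}")
raised = False
try:
    missing.raise_for_status()      # 4xx and 5xx codes raise HTTPError
except requests.exceptions.HTTPError as error:
    raised = True
    print("HTTPError:", error)
```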

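4.2 Garbled text in response.text

When we read the response body as a string through response.text, the text sometimes comes out garbled. The cause is that response.encoding, the encoding requests assumed for the response, is wrong.

Solution:

response.encoding = 'utf-8'
print(response.text)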
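
A small offline sketch of why this works: response.text is decoded from response.content each time it is read, using whatever response.encoding currently holds. The Response below is filled in by hand (through the private _content attribute) purely for illustration, and the sample string and the GBK encoding are assumptions for the example.

```python
import requests

original = "中文网页"

r = requests.Response()
r.status_code = 200
r._content = original.encode("gbk")  # the server sent GBK bytes (demo only)

r.encoding = "ISO-8859-1"            # a wrong guess, as can happen with missing charset headers
garbled = r.text

r.encoding = "gbk"                   # the correct encoding for these bytes
fixed = r.text

print(repr(garbled))
print(fixed)
```

When the right encoding is not known in advance, response.apparent_encoding offers a guess based on the body, which can then be assigned to response.encoding.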

5. Checking which request method a page uses

II. Three common cases for GET requests in requests

1. No request parameters (Baidu products)

import requests

base_url = 'https://www.baidu.com/more/'
response = requests.get(base_url)
response.encoding = 'utf-8'

print(response.status_code)    # HTTP status code
print(response.headers)        # response headers
print(type(response.text))     # <class 'str'>
print(type(response.content))  # <class 'bytes'>

2. With request parameters (Sina News)

import requests

# 1. Decide on the URL
base_url = 'https://search.sina.com.cn/'  # Sina News search

# 2. Set up the headers dictionary
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
}

# 3. Set up the request parameters
key = '孫悟空'  # search keyword
params = {
    'q': key,
    'c': 'news',
    'from': 'channel',
    'ie': 'utf-8',
}

# 4. Send the request
response = requests.get(base_url, headers=headers, params=params)
response.encoding = 'gbk'
print(response.text)

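3. Handling pagination

Steps for paginated requests:
- Step 1: work out the pattern of the pagination parameter
- Step 2: build the headers and params dictionaries
- Step 3: loop over the pages with a for loop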
# -------------------- Crawl the first ten pages of a Baidu Tieba forum
import os

import requests

base_url = 'https://tieba.baidu.com/f?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
}

# Create the output directory
dirname = './tieba/woman/'
if not os.path.exists(dirname):
    os.makedirs(dirname)

# Build the parameters and send one request per page
for i in range(0, 10):
    params = {
        'ie': 'utf-8',
        'kw': '美女',        # name of the forum to search for
        'pn': str(i * 50),   # offset of the first post: 50 posts per page
    }

    response = requests.get(base_url, headers=headers, params=params)

    # Write each page to its own HTML file, named by page number
    with open(dirname + '美女第%s頁.html' % (i + 1), 'w', encoding='utf-8') as file:
        file.write(response.content.decode('utf-8'))
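
The pagination pattern in the loop above (an offset that grows by 50 per page) can be separated from the network calls, which makes it easy to check before crawling. The following is a sketch under assumptions: the generator name page_params, the page_size default, and the keyword "python" are made up for the example, and the requests are only prepared, never sent.

```python
import requests


def page_params(keyword, page_count, page_size=50):
    """Yield one params dict per page; 'pn' is the offset of the first post."""
    for page in range(page_count):
        yield {"ie": "utf-8", "kw": keyword, "pn": str(page * page_size)}


# Prepare (without sending) the first three page requests and list their URLs
urls = [
    requests.Request("GET", "https://tieba.baidu.com/f", params=params).prepare().url
    for params in page_params("python", 3)
]

for url in urls:
    print(url)
```

Once the URLs look right, each params dict can be passed to requests.get() inside the loop exactly as in the article's code.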

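III. Using POST requests in requests

1. The json module

json.dumps(Python list or dict) ----> (returns) ----> JSON string

json.loads(JSON string) ----> (returns) ----> Python list or dict

The response to a POST request is often JSON data.
JSON data is handled with the json module.
JSON data is, in essence, just a string.

response.json()
# Converts the JSON string in the response body directly into a Python list or dict, much like json.loads(response.text)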
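
The two directions described above can be checked with a quick round trip through the standard-library json module. The sample dictionary is invented for the example.

```python
import json

data = {"name": "requests", "tags": ["http", "python"], "page": 1}

text = json.dumps(data)      # Python dict -> JSON string
restored = json.loads(text)  # JSON string -> Python dict

print(type(text), text)
print(type(restored), restored == data)
```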

2. Common form of a POST request

response = requests.post(
    url,
    headers={},
    data={},  # dictionary of request data, sent form-encoded
)
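
Besides data=, requests.post() also accepts a json= argument, which the article does not cover; it is mentioned here because many APIs expect a JSON body rather than a form. The sketch below prepares both variants without sending them so the difference in body and Content-Type is visible. The payload and the httpbin.org URL are assumptions for the example.

```python
import json

import requests

payload = {"user": "alice", "page": 1}
url = "https://httpbin.org/post"  # assumed example URL

# data= encodes the dict as an HTML form body
form = requests.Request("POST", url, data=payload).prepare()
# json= serializes the dict to JSON and sets the JSON content type
as_json = requests.Request("POST", url, json=payload).prepare()

print(form.headers["Content-Type"], form.body)
print(as_json.headers["Content-Type"], json.loads(as_json.body))
```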

3. Uploading files

import requests

url = 'http://127.0.0.1:5000/upload'

# Open the file in binary mode; the with block closes it after the upload
with open('/home/lyb/sjzl.mpg', 'rb') as f:
    files = {'file': f}
    # files = {'file': ('report.jpg', f)}  # set the file name explicitly
    r = requests.post(url, files=files)

print(r.text)
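
What files= produces can be inspected without a running upload server by preparing the request and looking at the multipart body. In this sketch the in-memory file, its name report.txt, and its content are assumptions for the example; the (filename, file object, content type) tuple is the explicit form of the files value.

```python
import io

import requests

# An in-memory stand-in for a file opened with open(path, 'rb')
fake_file = io.BytesIO(b"hello upload")
files = {"file": ("report.txt", fake_file, "text/plain")}

prepared = requests.Request(
    "POST", "http://127.0.0.1:5000/upload", files=files
).prepare()

print(prepared.headers["Content-Type"])  # multipart/form-data with a generated boundary
print(prepared.body.decode("utf-8"))     # the encoded form part, including the file name
```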

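IV. Hook functions in requests

A hook can modify attributes of the response, or simply print a message.

import requests


def change_url(response, *args, **kwargs):
    """Callback that runs once the response arrives."""
    response.url = '123'


# Pass a dictionary hooks=dict(response=change_url); requests hands the response to the callback, which can alter the returned result
response = requests.get('https://www.baidu.com', hooks=dict(response=change_url))
print(response.url)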
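
A hook does not have to modify the response; a common use is to record something about every response. The sketch below shows that variant and runs offline by mounting a stand-in transport adapter on a session. CannedAdapter, record, and the example.invalid URL are hypothetical names invented for this example, and the adapter sets the private _content attribute only so that no real request is made.

```python
import requests
from requests.adapters import BaseAdapter


class CannedAdapter(BaseAdapter):
    """Hypothetical adapter that answers every request locally with a 200."""

    def send(self, request, **kwargs):
        response = requests.Response()
        response.status_code = 200
        response._content = b"ok"  # private attribute, set for the demo only
        response.encoding = "utf-8"
        response.url = request.url
        response.request = request
        return response

    def close(self):
        pass


seen = []


def record(response, *args, **kwargs):
    """Response hook: note status and URL, leave the response unchanged."""
    seen.append((response.status_code, response.url))


session = requests.Session()
session.mount("https://", CannedAdapter())  # replace the real HTTPS transport
resp = session.get("https://example.invalid/", hooks={"response": record})

print(seen)
print(resp.text)
```

With a normal session (no mounted adapter) the same hooks= argument works against a real server.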

V. Common requests errors

1. Connection timeout

The server does not answer within the given time, so requests.exceptions.ConnectTimeout is raised.

requests.get('http://github.com', timeout=0.001)

# Raises:
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1b16da75f8>, 'Connection to github.com timed out. (connect timeout=0.001)'))

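2. Connect and read timeouts

The connect and read timeouts can be set separately with timeout=([connect timeout], [read timeout]). If the connection is not established within the connect timeout, requests.exceptions.ConnectTimeout is raised; if the server does not send data within the read timeout, requests.exceptions.ReadTimeout is raised.

Connect: the time the client waits to establish a connection to the server.
Read: the time the client waits for the server to send the first byte of the response.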

requests.get('http://github.com', timeout=(6.05, 0.01))

# Raises:
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='github.com', port=80): Read timed out. (read timeout=0.01)
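
The read timeout can be reproduced reliably against a deliberately slow local server instead of a remote site. The sketch below is an illustration under assumptions: SlowHandler, the one-second delay, and the timeout values are invented for the demo, and the server is the standard-library http.server bound to 127.0.0.1.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Local test handler that waits one second before answering."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test
try:
    session.get(url, timeout=(3.05, 0.2))  # (connect timeout, read timeout)
    outcome = "ok"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timeout"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"
finally:
    server.shutdown()
    server.server_close()

print(outcome)  # read timeout
```

The connection to the local server is established well within 3.05 seconds, so only the 0.2-second read timeout can fire.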

3. Unknown server

The host name cannot be resolved, so requests.exceptions.ConnectionError is raised.

requests.get('http://github.comasf', timeout=(6.05, 27.05))

# Raises:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.comasf', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f75826665f8>: Failed to establish a new connection: [Errno -2] Name or service not known',))

4. Proxy unreachable

The proxy server refuses the connection, for example because the port rejects connections or is not open, so requests.exceptions.ProxyError is raised.

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "192.168.10.1:800"})

# Raises:
requests.exceptions.ProxyError: HTTPConnectionPool(host='192.168.10.1', port=800): Max retries exceeded with url: http://github.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce3438c6d8>: Failed to establish a new connection: [Errno 111] Connection refused',)))

5. Proxy connection timeout

The proxy server does not respond, so requests.exceptions.ConnectTimeout is raised.

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "10.200.123.123:800"})

# Raises:
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.200.123.123', port=800): Max retries exceeded with url: http://github.com/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fa8896cc6d8>, 'Connection to 10.200.123.123 timed out. (connect timeout=6.05)'))

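6. Proxy read timeout

Here the connection to the proxy succeeded and the proxy forwarded the request to the target site, but the proxy timed out reading from the target site.

Even when the proxy itself responds quickly, a timeout between the proxy and the target site is still reported against the proxy server.

Assuming the proxy is usable, timeout covers only connecting to and reading from the proxy server; whether the proxy in turn connects to and reads from the target successfully is not something the client needs to track.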

requests.get('http://github.com', timeout=(2, 0.01), proxies={"http": "192.168.10.1:800"})

# Raises (the host, port, and timeout in this captured message differ from the call above):
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.10.1:800', port=1080): Read timed out. (read timeout=0.5)

7. Network environment problems

Possibly caused by a dropped network connection; requests.exceptions.ConnectionError is raised.

requests.get('http://github.com', timeout=(6.05, 27.05))

# Raises:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc8c17675f8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

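8. Notes from the official documentation

You can tell requests to stop waiting for a response after the number of seconds given by the timeout parameter. Nearly all production code should use this parameter; without it, your program may hang indefinitely: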

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

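timeout is not a limit on the whole response download. Rather, an exception is raised if the server has not answered within timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).

- On a network problem (such as a DNS failure or a refused connection), Requests raises a requests.exceptions.ConnectionError exception.
- If the HTTP request returns an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.
- If the request times out, a Timeout exception is raised.
- If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.
- All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.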
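
Because these exception classes overlap (ConnectTimeout, for instance, inherits from both ConnectionError and Timeout), the order of the checks matters when telling them apart. The sketch below is one possible arrangement, not an official recipe: classify and safe_get are hypothetical helper names, the labels are arbitrary, and the default timeout simply reuses the values from the examples above.

```python
import requests
from requests import exceptions as exc


def classify(error):
    """Map a requests exception to a short label, most specific class first."""
    if isinstance(error, exc.ProxyError):
        return "proxy error"
    if isinstance(error, exc.ConnectTimeout):
        return "connect timeout"
    if isinstance(error, exc.ReadTimeout):
        return "read timeout"
    if isinstance(error, exc.ConnectionError):
        return "connection error"
    if isinstance(error, exc.HTTPError):
        return "bad status"
    if isinstance(error, exc.TooManyRedirects):
        return "too many redirects"
    if isinstance(error, exc.RequestException):
        return "other requests error"
    return "not a requests error"


def safe_get(url, **kwargs):
    """Return (response, None) on success or (None, label) on failure."""
    kwargs.setdefault("timeout", (6.05, 27.05))  # always pass a timeout
    try:
        response = requests.get(url, **kwargs)
        response.raise_for_status()
        return response, None
    except exc.RequestException as error:
        return None, classify(error)


# The labels can be checked without any network access
print(classify(exc.ConnectTimeout()))  # connect timeout
print(classify(exc.ProxyError()))      # proxy error
```

Catching RequestException last in safe_get works because every exception Requests raises explicitly derives from it.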
