Python 3 HTTP requests: urllib and requests

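1: The urllib module
- 1.1 urlopen()
- 1.2 User-Agent
- 1.3 The Request class
- 1.4 The urllib.parse module
- 1.5 Request methods (method)
- 1.6 Handling JSON data
- 1.7 Ignoring HTTPS certificate verification
2: The urllib3 library
3: The requests library
- 3.1 Sending a GET request
- 3.2 Sending a POST request
- 3.3 Using a proxy IP
- 3.4 session and cookie
- 3.5 A simple wrapper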

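1: The urllib module

urllib is a package in Python's standard library for sending network requests. Python 2 split this functionality between two libraries, urllib and urllib2. Python 3 no longer has urllib2: the two have been merged into urllib.

urllib is part of the standard library. It is a package that bundles the following modules for working with URLs:

urllib.request opens and reads URLs

urllib.error contains the exceptions raised by urllib.request

urllib.parse parses URLs

urllib.robotparser parses robots.txt files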
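
As a rough illustration of how urllib.request and urllib.error fit together, the sketch below wraps urlopen in a helper that reports failures instead of raising. The helper name fetch and its return shape are assumptions made for this example and are not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=5):
    """Hypothetical helper: return (status, body); status is None when no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return exc.code, exc.read()
    except URLError as exc:
        # No usable response: DNS failure, refused connection, missing file, ...
        return None, str(exc.reason).encode()


# HTTPError is a subclass of URLError, so it has to be caught first.
print(issubclass(HTTPError, URLError))
```

Catching HTTPError before URLError matters because the more general handler would otherwise swallow the status code.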

1.1 urlopen()

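The url parameter can be a string or a Request object.

data must be a bytes object holding the data to send to the server, or None.

Currently only HTTP requests use data. When data is supplied the request is a POST; without it, the request is a GET. Before use, data has to be encoded with urllib.parse.urlencode() and then converted to bytes, for example with .encode().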

from urllib.request import urlopen
url = 'https://www.bing.com'
response = urlopen(url, timeout = 5)
print(response.closed)
with response:
    print(type(response))           # from http.client import HTTPResponse
    print(response.status, response.reason)
    print(response._method)
    print(response.read())          # the page content
    print(response.info())          # the response headers
    print(response.geturl())        # the final URL (some URLs are redirected with 301 or 302)
print(response.closed)

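urllib.request.urlopen issues an HTTP GET request here, and the web server returns the page content. The response data is wrapped in a file-like object, so it can be read with read(), readline() and readlines(). status and reason give the status code and its reason phrase, and info() returns the response headers.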
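
The file-like behaviour can be tried without a network connection. In the sketch below a data: URL stands in for a real page; that substitution is an assumption for the sake of a self-contained example, and the object returned for a data: URL is not an HTTPResponse, although it offers the same reading methods.

```python
from urllib.request import urlopen

# Assumption: a data: URL replaces a real web page so the example runs offline.
url = 'data:text/plain;charset=utf-8,first%20line%0Asecond%20line%0Athird%20line%0A'

with urlopen(url) as response:
    first = response.readline()    # one line as bytes, newline included
    rest = response.readlines()    # the remaining lines as a list of bytes
    tail = response.read()         # nothing is left to read at this point

print(first)
print(rest)
print(tail)
```

Each call continues where the previous one stopped, which is why read() returns an empty bytes object at the end.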

1.2 User-Agent

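urlopen issues an HTTP request from a URL string and data. To modify the HTTP headers, for example the User-Agent, another mechanism is needed.

The default User-Agent is built in the urllib.request source as follows: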

# from urllib.request import OpenerDirector
class OpenerDirector:
    def __init__(self):
        client_version = "Python-urllib/%s" % __version__
        self.addheaders = [('User-agent', client_version)]

Building custom request headers:

from urllib.request import urlopen, Request
url = 'https://www.bing.com'
user_agent = {"User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"}
req = Request(url, headers = user_agent)    # the User-Agent can also be added with the line below
# req.add_header('User-Agent', user_agent['User-Agent'])
response = urlopen(req, timeout = 5)	# the url argument is a Request object
print(response.closed)
with response:
    print(type(response))           # from http.client import HTTPResponse
    print(response.status, response.reason)
    print(response._method)
    print(response.read())          # the page content
    print(response.info())          # the response headers
    print(response.geturl())        # the final URL (some URLs are redirected with 301 or 302)
print(response.closed)

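1.3 The Request class

Request(url, data=None, headers={})
# Initializer: builds a request object; a dict of headers can be passed in
# The data parameter decides between GET and POST (None means GET; any data means POST)
# add_header(key, val) adds one key/value pair to the headers.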

import random
from urllib.request import urlopen, Request
url = 'http://www.bing.com'
user_agent_list = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
]
user_agent = random.choice(user_agent_list)		# pick a User-Agent at random
request = Request(url)
request.add_header('User-Agent', user_agent)
response = urlopen(request, timeout = 20)
print(type(response))
with response:
    print(1, response.status, response.getcode(), response.reason)    # status; getcode() simply returns status
    print(2, response.geturl())     # the URL actually requested
    print(3, response.info())       # the response headers
    print(4, response.read())       # read the returned content
print(5, request.get_header('User-agent'))  # the User-agent value from the request headers
print(6, request.header_items())            # all the request headers
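
The example above reads the header back as 'User-agent' even though it was added as 'User-Agent'. A small offline sketch of why: Request stores header names capitalized (first letter upper case, the rest lower case), and get_header looks the name up exactly. The header value demo-agent/1.0 is made up for illustration, and no request is sent.

```python
from urllib.request import Request

request = Request('http://www.bing.com')
request.add_header('User-Agent', 'demo-agent/1.0')   # made-up value

print(request.header_items())               # the key is stored as 'User-agent'
print(request.get_header('User-agent'))     # matches the stored key
print(request.get_header('User-Agent'))     # no match: the lookup is case-sensitive
print(request.has_header('User-agent'))
```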

1.4 The urllib.parse module

This module handles the encoding and decoding of URLs.

from urllib import parse
d = dict(
    id = 1,
    name = '張三',
    hobby = 'football'
)
u = parse.urlencode(d)
print(u)
# id=1&name=%E5%BC%B5%E4%B8%89&hobby=football

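urlencode percent-encodes the keys and values: characters such as the colon, slash, &, equals sign and question mark are encoded when they appear inside a key or value, and what follows each % is the hexadecimal value of a single byte.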

query_param = parse.urlencode({'wd': '中'})
url = 'https://www.baidu.com?{}'.format(query_param)
print(url)                          # https://www.baidu.com?wd=%E4%B8%AD
print('中'.encode('utf-8'))          # b'\xe4\xb8\xad'
print(parse.unquote(query_param))   # decoded: wd=中
print(parse.unquote(url))           # https://www.baidu.com?wd=中

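The path part of a URL generally does not need non-ASCII text. The parameters are different: with either GET or POST, the submitted data may contain characters such as the slash. There they stand for data, not metacharacters, and if they are sent to the server unchanged the receiver cannot tell which characters are metacharacters and which are data. To be safe, the data portion is usually URL-encoded, which removes the ambiguity.

Chinese text can be sent in the same way and is encoded too: the string is first converted to a byte sequence in the required character encoding, and each byte is then written as its two-digit hexadecimal value prefixed with a percent sign.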
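
The byte-by-byte rule can be reproduced by hand and compared with urllib.parse.quote. The helper percent_encode below is a hypothetical name introduced only for this sketch; unlike quote, it encodes every byte, including plain ASCII letters.

```python
from urllib.parse import quote, unquote


def percent_encode(text, encoding='utf-8'):
    """Hypothetical helper: write every byte of text as %XX."""
    return ''.join('%{:02X}'.format(byte) for byte in text.encode(encoding))


manual = percent_encode('中')
print(manual)                          # the three UTF-8 bytes, each prefixed with %
print(manual == quote('中'))           # quote applies the same rule to non-ASCII text
print(quote('a/b&c=d?', safe=''))      # reserved characters are encoded, letters are kept
print(unquote(manual))                 # decoding restores the original character
```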

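1.5 Request methods (method)

The most commonly used HTTP methods for exchanging data are GET and POST.

With GET, the data is passed in the URL, which means it travels in the header section of the HTTP message (in the request line).

With POST, the data is placed in the body of the HTTP message. The submitted data takes the form of key/value pairs, with multiple parameters joined by the & symbol.

The GET method:

Open the Bing search engine site and obtain a search URL: http://cn.bing.com/search?q=如何學好Python

Write a program that runs a Bing search for a keyword and saves the returned result to an HTML file.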

from urllib import parse
from urllib.request import urlopen, Request
query_param = dict(q = '如何學好Python')
base_url = 'https://cn.bing.com/search'
url = '{}?{}'.format(base_url, parse.urlencode(query_param))
# the header value only, without a "User-Agent: " prefix
user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
req = Request(url, headers = {'User-agent': user_agent})
result = urlopen(req)
print(type(result))
with result:
    with open('bing.html', 'wb') as f:      # write the raw response bytes
        f.write(result.read())

The POST method:

import json
from urllib.parse import urlencode
from urllib.request import urlopen, Request
url = 'http://httpbin.org/post'
data = urlencode({'name': '張三,@=/$*', 'age': 22})
request = Request(url, headers = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'})
with urlopen(request, data = data.encode()) as result:   # passing data makes this a POST
    text = result.read()
    d = json.loads(text)
    print(d)
    print(type(d))
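
Where the data ends up in each case can be inspected on Request objects without sending anything. This is only a sketch: the parameters are made up, and the http://httpbin.org/put address is an assumption used to show the explicit method argument.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

params = {'q': 'python', 'page': 2}       # sample parameters, made up
encoded = urlencode(params)

get_req = Request('https://cn.bing.com/search?' + encoded)
post_req = Request('http://httpbin.org/post', data=encoded.encode('utf-8'))
put_req = Request('http://httpbin.org/put', data=encoded.encode('utf-8'), method='PUT')

print(get_req.get_method(), get_req.selector, get_req.data)     # data rides in the URL
print(post_req.get_method(), post_req.selector, post_req.data)  # data rides in the body
print(put_req.get_method())                                     # method= overrides the default
print(parse_qs(urlsplit(get_req.full_url).query))               # the pairs recovered from the URL
```

Nothing is transmitted here; the Request objects only describe what urlopen would send.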

1.6 Handling JSON data

What matters here is not the code but locating the JSON request in the web page. Douban Movies is used as the example:


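The page fetches its listing through a JSON request, which shows up among the page's network requests. Once the URL of that request is known, the code is the same as above.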

import json
from urllib.parse import urlencode
from urllib.request import urlopen, Request
jurl = 'https://movie.douban.com/j/search_subjects'
data = dict(
    type= 'movie',
    tag = '熱門',
    page_limit=10,
    page_start=10
)
user_agent = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'}
req = Request('{}?{}'.format(jurl, urlencode(data)), headers = user_agent)
with urlopen(req) as result:
    subjects = json.loads(result.read())
    print(len(subjects['subjects']))
    print(subjects)
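
The decoding step can be isolated from the network. The sketch below assumes a payload shaped like the one above, with a top-level 'subjects' list; the sample bytes and the helper name parse_subjects are made up for illustration.

```python
import json

# Made-up sample standing in for the bytes returned by result.read().
raw = '{"subjects": [{"title": "熱門電影", "rate": "8.5"}]}'.encode('utf-8')


def parse_subjects(body):
    """Hypothetical helper: return the 'subjects' list, or [] if the body is not usable JSON."""
    try:
        payload = json.loads(body)          # json.loads accepts bytes as well as str
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get('subjects', [])


print(parse_subjects(raw))
print(parse_subjects(b'<html>blocked</html>'))   # a non-JSON body yields an empty list
```

Guarding against a non-JSON body is useful when a site answers a crawler with an HTML error page instead of data.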

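1.7 Ignoring HTTPS certificate verification

HTTPS uses the SSL/TLS protocol to encrypt the data carried over the transport connection. HTTPS requires a certificate, and the certificate has to be certified by a CA.

A CA (Certificate Authority) is an organization that issues, manages and revokes digital certificates.

A CA is a trusted third party, and certificates issued by a CA are considered trustworthy. If a user suffers a loss from trusting a certificate issued by a CA, the CA can be held legally responsible.

CAs form a hierarchy: a lower-level CA trusts its higher-level CA, and the higher-level CA issues and certifies the lower-level CA's certificate.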

from urllib.request import urlopen, Request
import ssl
request = Request('https://www.12306.cn/mormhweb/')
print(request)
ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3486.0 Safari/537.36'
request.add_header('User-agent', ua)
context = ssl._create_unverified_context()  # a context that skips verification, so untrusted certificates are accepted
res = urlopen(request, context=context)
with res:
    print(res._method)
    print(res.geturl())
    print(res.read().decode())
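
ssl._create_unverified_context is a private function, as its leading underscore indicates. A roughly equivalent context can be assembled from public ssl attributes, as sketched below. This is an alternative under that assumption, not the approach the code above takes, and it likewise disables all certificate checks.

```python
import ssl

context = ssl.create_default_context()
context.check_hostname = False          # must be switched off before verify_mode is relaxed
context.verify_mode = ssl.CERT_NONE     # accept any certificate, trusted or not

print(context.check_hostname, context.verify_mode)
# urlopen(request, context=context) would then skip certificate verification.
```

The order of the two assignments matters: setting verify_mode to CERT_NONE while check_hostname is still enabled raises a ValueError.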

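2: The urllib3 library

The standard-library urllib lacks some key features, such as connection pool management, which the third-party library urllib3 provides.

Official documentation: https://urllib3.readthedocs.io/en/latest/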

from urllib.parse import urlencode
from urllib3.response import HTTPResponse
import urllib3
jurl = 'https://movie.douban.com/j/search_subjects'
data = dict(
    type= 'movie',
    tag = '熱門',
    page_limit=10,
    page_start=10
)
user_agent = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'}
# connection pool manager
with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(data)), headers = user_agent)
    print(type(response))
    # response: HTTPResponse = HTTPResponse()
    print(response.status)
    print(response.data)
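
The pool manager accepts settings for pool size, retries and timeouts. The sketch below only constructs a configured manager and inspects it, so it runs without a network; the specific numbers are arbitrary assumptions, not recommended values.

```python
import urllib3

retries = urllib3.Retry(total=3, backoff_factor=0.5)    # retry up to 3 times, with backoff
timeout = urllib3.Timeout(connect=2.0, read=5.0)        # seconds
http = urllib3.PoolManager(num_pools=5, maxsize=4, retries=retries, timeout=timeout)

# Options other than num_pools are kept and handed to each per-host connection pool.
print(http.connection_pool_kw['maxsize'])
print(http.connection_pool_kw['retries'].total)
http.clear()    # close any pooled connections
```

Requests made through http.request('GET', url) would then reuse connections per host within these limits.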

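3: The requests library

requests is built on urllib3 but offers a friendlier API, and is the recommended choice. Under the hood it is implemented on top of urllib3, not the standard-library urllib module.

# Official documentation
https://requests.readthedocs.io/en/master/

# Install the requests module
pip install requests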

3.1 Sending a GET request

import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
data = {'wd': 'python文檔'}
response = requests.get(url = 'https://www.baidu.com/s', params = data, headers = headers)
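# The url may or may not end with ?; requests joins url and params automatically and encodes non-ASCII text such as Chinese
# https://www.baidu.com/s?wd=python%E6%96%87%E6%AA%94
print(response.status_code)         # status code
print(response.request.headers)     # request headers
print(response.headers)             # response headers
print(response.request.url)         # the URL that was requested
print(response.url)                 # the URL of the response (after a redirect, the URL that actually returned the data)
print(response.text)                # response body as a str
print(response.content.decode())    # response body as bytes, decoded to a str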
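
How url and params are joined can be checked without sending a request, by preparing the request and reading its final URL. The parameter values below are made up for illustration.

```python
import requests

prepared = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': '中', 'pn': 10},    # sample values, made up
).prepare()

# The query string is appended and the non-ASCII value is percent-encoded as UTF-8.
print(prepared.url)
```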

The difference between response.text and response.content:

url = 'http://www.baidu.com'
response = requests.get(url)
print(response)
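print(response.encoding)		# the text encoding requests inferred from the HTTP headers
# response.encoding = 'utf-8'	# override the encoding; once it is set to utf-8 the Chinese text displays correctly
print(response.text)			# printed as-is, the Chinese text comes out garbled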

url = 'http://www.baidu.com'
response = requests.get(url)
print(response.content.decode())	# content is bytes; decode() with no argument defaults to utf-8
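
The garbling comes from decoding UTF-8 bytes with the wrong codec. The requests documentation states that when a text response names no charset in its headers, ISO-8859-1 is assumed. The effect can be reproduced with plain strings; the sample text below is made up.

```python
raw = '百度一下'.encode('utf-8')             # the bytes a server might send
garbled = raw.decode('iso-8859-1')           # what a wrong encoding guess produces
correct = raw.decode('utf-8')                # what decode() with the right codec produces

print(garbled)
print(correct)
# The bytes themselves were never damaged, so the mistake is reversible.
print(garbled.encode('iso-8859-1').decode('utf-8') == correct)
```

This is why setting response.encoding before reading response.text, or decoding response.content directly, both give readable output.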

Requesting and saving an image:

response = requests.get('https://www.baidu.com/img/dong_f6764cd1911fae7d460b25e31c7e342c.gif')
with open('demo.gif', 'wb') as f:
    f.write(response.content)

3.2 Sending a POST request

By default requests encodes POST data as application/x-www-form-urlencoded:

headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
data={'name':"tian"}
r=requests.post('http://www.httpbin.org/post',data=data,headers=headers)
print(r.text)


To send JSON data, pass it directly through the json parameter:

headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
data={'name':"tian"}
r=requests.post('http://www.httpbin.org/post',json=data,headers=headers)
# passing a dict via json=data serializes it to JSON automatically
print(r.text)
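
The difference between data= and json= is visible in the prepared request, without contacting the server. A minimal sketch using the same payload as above:

```python
import json
import requests

payload = {'name': 'tian'}
url = 'http://www.httpbin.org/post'

form = requests.Request('POST', url, data=payload).prepare()
as_json = requests.Request('POST', url, json=payload).prepare()

print(form.headers['Content-Type'], form.body)          # form-encoded key=value pairs
print(as_json.headers['Content-Type'], as_json.body)    # a serialized JSON document
```

Only the body encoding and the Content-Type header differ; the URL and method are the same in both cases.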


Uploading files needs a more complex encoding, but requests reduces it to the files parameter:

>>> upload_files = {'file': open('report.xls', 'rb')}
>>> r = requests.post(url, files=upload_files)
# When reading the file, be sure to open it in 'rb' (binary) mode so that the length of the bytes read equals the length of the file

3.3 Using a proxy IP

proxies = {
    'http': 'http://140.255.145.83:25330'
}
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(url = 'http://www.baidu.com', proxies = proxies, headers = headers)
print(r.status_code)

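3.4 session and cookie

Differences between cookies and sessions:

Cookie data is stored in the client's browser; session data is stored on the server
Cookies are not very secure: someone can analyze the cookies stored locally and use them for cookie spoofing
A session is kept on the server for a certain period; as traffic grows, sessions consume server resources
A single cookie cannot hold more than 4 KB of data, and many browsers limit a site to at most 20 cookies

Benefit of sending cookies and a session: pages that require login can be requested

Drawback of sending cookies and a session: one set of cookies and one session usually corresponds to one user, so requesting too fast or too often makes it easy for the server to identify a crawler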

session = requests.session()
url = 'http://www.renren.com/PLogin.do'
data = {'email': '[email protected]', 'password': 'xxxxxx'}
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# send the POST through the session; the cookies are stored in it
session.post(url = url, data = data, headers = headers)
# then use the session to request a page that is only reachable after login
r = session.get('http://www.renren.com/875198389/profile', headers = headers)
print(r.status_code)
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(r.content.decode())

Alternatively, log in to the site manually, copy the cookie information, and send it directly with the request:

session = requests.session()
url = 'http://www.renren.com/PLogin.do'
data = {'email': '[email protected]', 'password': 'xxxxxx'}
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
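    'Cookie': 'the cookie string obtained after logging in to the site goes here'
}
# send the POST through the session; the cookies are stored in it
# session.post(url = url, data = data, headers = headers)
# then use the session to request a page that is only reachable after login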
r = session.get('http://www.renren.com/875198389/profile', headers = headers)
print(r.status_code)
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(r.content.decode())

Cookies can also be passed as a separate parameter, but cookies must then be a dict

session = requests.session()
url = 'http://www.renren.com/PLogin.do'
data = {'email': '[email protected]', 'password': 'xxxxxx'}
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
cookies = 'anonymid=xxxxxx-lhsd6u; depovince=GW; jebecookies=192dd350-389f-4002-9ad4-955822ef9e78|||||; _r01_=1; taihe_bi_sdk_uid=xxxxxxxx; taihe_bi_sdk_session=xxxxxx; ick_login=a49c573b-532d-4898-914e-e9ecdf1fd003; _de=F0FA10CCF09C5140CA6F896A1DF2C9CE; p=269e670c6edffcdf98110263a89728d79; first_login_flag=1; [email protected]; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; t=d600e408c953ce49e3a8f886e6ee3fc29; societyguester=d600e408c953ce49e3a8f886e6ee3fc29; id=875198389; xnsid=21554bf8; ver=7.0; loginfrom=null; JSESSIONID=abcP8p0hNRpNcdxsRQEhx; wp_fold=0'
cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ') if '=' in i}	# convert the cookie string above into a dict
r = session.get('http://www.renren.com/875198389/profile', headers = headers, cookies = cookies)
print(r.status_code)
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(r.content.decode())
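
The string-to-dict conversion is easy to get wrong when a cookie value itself contains an equals sign or an item has no value at all. A small sketch of a more forgiving parser follows; the function name parse_cookie_header and the sample string are assumptions for illustration.

```python
def parse_cookie_header(cookie_string):
    """Hypothetical helper: turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for item in cookie_string.split('; '):
        name, sep, value = item.partition('=')   # split on the first '=' only
        if sep:                                  # skip items that have no '=' at all
            cookies[name.strip()] = value
    return cookies


sample = 'id=875198389; ver=7.0; token=abc=def; malformed'
print(parse_cookie_header(sample))
```

The resulting dict can be passed as the cookies argument of session.get or requests.get.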

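Use a Session object to keep session information such as cookies across multiple exchanges with the server. The module-level functions such as requests.get() create a new session for every call, so nothing carries over between requests.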

urls = ['https://www.baidu.com/s?wd=Python', 'https://www.baidu.com/s?wd=Java']
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers = headers)
        print(type(response))
        with response:
            print(response.text[:50])       # HTML content
            print(response.headers)         # response headers
            print(response.request.headers) # request headers
            print('-'*30)
            print(response.cookies)         # print the cookie information
            print('-'*30)
        print('=' * 100)

The first request is sent without cookies; the second request carries the cookie information.
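
That a session attaches its stored cookies to later requests can be observed offline by preparing a request through the session. In this sketch the cookie, the User-Agent value and the example.com address are all made up, and the cookie is placed in the jar by hand instead of arriving from a server.

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'demo-agent/1.0'})
# Stand-in for a cookie that a server would normally set on the first response.
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

prepared = session.prepare_request(requests.Request('GET', 'http://example.com/profile'))
print(prepared.headers.get('Cookie'))        # the stored cookie is attached automatically
print(prepared.headers.get('User-Agent'))    # session-level headers are merged in as well
```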

3.5 A simple wrapper

class HTTP:
    @staticmethod
    def get(url, return_json = True):
        r = requests.get(url)
        if r.status_code != 200:
            return {} if return_json else ''
        return r.json() if return_json else r.text
