一、The requests module
(一) GET requests
1. Steps
- Import the package
import requests
- Determine the request URL
- Send the request and get the response
response = requests.get(
    url = base_url,
    headers = {},  # request-header dict
    params = {},   # request-parameter dict
)
2. The response object
This object exposes the following:
- Status code
response.status_code
- Response headers
response.headers['Cookie']
- Response body
Get the response body as a string: response.text
Get the response body as bytes: response.content
The encoding used to decode the body string: response.encoding
Mojibake in the response body:
When we use response.text to get the body as a string, it sometimes comes out garbled.
Cause: the default encoding guessed via response.encoding is wrong.
Fixes:
- Set response.encoding manually
- Decode response.content yourself
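Both fixes can be sketched offline, without a real request; the byte string below stands in for what response.content would hold:

```python
# Offline sketch of the mojibake problem and its fix; the byte string
# simulates response.content from a utf-8 page.
raw = '中文内容'.encode('utf-8')

# Mojibake: decoding utf-8 bytes with the wrong codec (requests falls
# back to ISO-8859-1 when the server sends no charset header).
garbled = raw.decode('iso-8859-1')
print(garbled != '中文内容')  # True

# Fix 1 (response.encoding = 'utf-8') and fix 2 (decode the bytes
# yourself) both amount to decoding with the correct codec:
fixed = raw.decode('utf-8')
print(fixed)  # 中文内容
```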
3. Types of GET-request projects
1. No request parameters: we only need a request-header dict that carries the User-Agent header.
For example, the Baidu and Baidu Products projects.
Baidu Products
# 1. Import the package
import requests
# 2. Determine the url
base_url = 'https://www.baidu.com/more/'
# 3. Send the request and get the response
response = requests.get(base_url)
# Inspect the page content
# print(response.text)
# print(response.encoding)
print(response.status_code)  # status code
print(response.headers)  # response headers
print(type(response.text))  # <class 'str'>
print(type(response.content))  # <class 'bytes'>
'''
Fixing the mojibake
'''
# Fix 1: set the encoding manually
# response.encoding = 'utf-8'
# print(response.text)
# with open('index.html','w',encoding='utf-8') as fp:
#     fp.write(response.text)
# Fix 2: decode the bytes yourself
with open('index.html','w',encoding='utf-8') as fp:
    fp.write(response.content.decode('utf-8'))
Baidu
import requests
base_url = 'https://www.baidu.com'
# Build the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
with requests.get(base_url, headers=headers) as response:
    if response.status_code == 200:
        with open('baidu.html','w',encoding='utf-8') as fp:
            fp.write(response.content.decode('utf-8'))
    else:
        print('error')
2. With request parameters: the base url is everything up to and including the '?'; we need both a request-header dict and a request-parameter dict.
For example, the Sina News project.
import requests
# GET request with parameters; the base url is the part up to and including the question mark
base_url = 'https://search.sina.com.cn/?'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
q = input('Enter the query: ')
params = {
    'q': q,
    'c': 'news',
    'from': 'channel',
    'ie': 'utf-8'
}
response = requests.get(base_url, headers=headers, params=params)
# This page is gbk-encoded, so decode and save it as gbk
with open('sina_news.html','w',encoding='gbk') as fp:
    fp.write(response.content.decode('gbk'))
3. Pagination
Method:
- Find the pagination pattern; it is usually controlled by one of the params parameters
- Find how that parameter changes from page to page
- Use a for loop to request every page
For example, the Baidu Tieba project.
import requests
import os
base_url = 'http://tieba.baidu.com/f?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
kw = 'java'
dirname = './tieba/' + kw + '/'
if not os.path.exists(dirname):
    os.makedirs(dirname)
for i in range(10):
    params = {
        'kw': kw,
        'ie': 'utf-8',
        'pn': str(i * 50)  # page offset: page i starts at i*50
    }
    response = requests.get(base_url, headers=headers, params=params)
    with open(dirname + kw + '%s.html' % (i + 1), 'w', encoding='utf-8') as fp:
        fp.write(response.content.decode('utf-8'))
(二) POST requests
Sending the request:
response = requests.post(
    url,
    headers = {},  # request-header dict
    data = {}      # request-data dict
)
A POST request usually returns json data as the response body.
json data is essentially a string; the module for handling it is the json module.
Common json-module methods:
json.dumps(a Python list or dict)  # returns a json string
json.loads(a json string)  # returns a Python list or dict
response.json() converts a json response body directly into a Python list or dict.
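The json round-trip described above can be demonstrated without any request:

```python
import json

# dict -> json string -> dict round-trip
payload = {'kw': 'origin', 'pages': [1, 2, 3]}
json_str = json.dumps(payload)   # a plain str
restored = json.loads(json_str)  # back to a dict
print(type(json_str).__name__)   # str
print(restored == payload)       # True
```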
1. A basic POST request
The Baidu Translate project
import requests
url = 'https://fanyi.baidu.com/sug'
kw = 'origin'
data = {
    'kw': kw,
}
headers = {
    'content-length': str(len(str(data))),
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'origin': 'https://fanyi.baidu.com',
    'referer': 'https://fanyi.baidu.com/?aldtype=16047',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}
response = requests.post(url, headers=headers, data=data)
json_str = response.text
json_data = response.json()
# print(type(json_str))  # <class 'str'>
# print(json_str)
# print(type(json_data))  # <class 'dict'>
# print(json_data)
result = ''
for item in json_data['data']:
    result += item['k'] + ':\n' + item['v'] + '\n'
print(result)
2. A problem
When the POST request's parameters change, the request stops working.
We then need to handle the changing request parameters.
Approach:
- Compare. Diff the data dicts of two requests and find the parameters that differ
- Work out how those parameters are generated, and generate them by hand
Places these parameters may come from:
- The page itself; such parameters are hard-coded
- Generated dynamically in js
- Obtained via ajax
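The comparison step can be automated. The two dicts below are made-up captures (the values are hypothetical) just to show the diff:

```python
# Two hypothetical data dicts captured from successive requests; only the
# keys whose values differ need to be reverse-engineered.
first  = {'i': 'word', 'client': 'fanyideskweb', 'salt': '15722400000001', 'sign': 'aaa'}
second = {'i': 'word', 'client': 'fanyideskweb', 'salt': '15722400000117', 'sign': 'bbb'}

changed = sorted(k for k in first if first[k] != second.get(k))
print(changed)  # ['salt', 'sign']
```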
3. The Youdao Dictionary project (parameters generated dynamically in js)
import time, random, hashlib
import requests
url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
def get_md5(value):
    # md5-hash a string and return its hex digest
    md5 = hashlib.md5()
    md5.update(value.encode())
    return md5.hexdigest()
i = 'word'
salt = str(int(time.time()*1000)) + str(random.randint(0,10))
sign = get_md5("fanyideskweb" + i + salt + "n%A-rKaT5fb[Gy?;[email protected]")
ts = str(int(time.time()*1000))
data = {
    'i': i,
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': salt,
    'sign': sign,
    'ts': ts,
    'bv': 'f0325f69e46de1422e85dedc4bd3c11f',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_REALTlME'
}
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    'Content-Length': str(len(str(data))),
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': '_ntes_nnid=a34e4f9a3febc07732be65da730cc12c,1571726549684; OUTFOX_SEARCH_USER_ID_NCOO=1297033622.5178812; OUTFOX_SEARCH_USER_ID="[email protected]"; _ga=GA1.2.1132435346.1572169065; _gid=GA1.2.1708709462.1572169065; JSESSIONID=aaaUk11nUld4J6hzmyr4w; ___rl__test__cookies=1572249660634',
    'Host': 'fanyi.youdao.com',
    'Origin': 'http://fanyi.youdao.com',
    'Referer': 'http://fanyi.youdao.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
response = requests.post(url, headers=headers, data=data)
json_data = response.json()
for mean in json_data['smartResult']['entries']:
    print(mean)