Web Scraping from Beginner to Expert (2) | Using the requests Module

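Contents

I. requests module basics
- 1. What requests is for
- 2. Installation
- 3. Parameters
- 4. The response object
- 5. Checking which request method a page uses
II. Three common cases for GET requests
- 1. No request parameters (Baidu products)
- 2. With request parameters (Sina News)
- 3. Handling pagination
III. Using POST requests
- 1. The json module
- 2. Common POST format
- 3. Uploading files
IV. Hook functions in requests
V. Common requests errors
- 1. Connect timeout
- 2. Connect and read timeouts
- 3. Unknown server
- 4. Proxy cannot be reached
- 5. Proxy connect timeout
- 6. Proxy read timeout
- 7. Network environment failure
- 8. Notes from the official documentation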

Reference blog: https://blog.csdn.net/shanzhizi/article/details/50903748

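I. requests module basics

1. What requests is for

The requests library covers most of what the HTTP protocol offers. Its features include keep-alive, connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and sessions. Older releases ran on both Python 2 and Python 3 (current releases support Python 3 only), and it is one of the most-starred Python projects on GitHub.

2. Installation

pip install requests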

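3. Parameters

3.1 Common parameters

import requests

base_url = 'http://httpbin.org/get'  # placeholder URL; replace with the page you want to fetch

requests.get(
    url=base_url,   # URL to request
    headers={},     # request headers, e.g. {'user-agent': 'xxx'}
    params={},      # query-string parameters as a dict, e.g. {'a': 123}
    proxies={},     # proxies, e.g. {'https': '168.168.16.16:9000'}
    timeout=3,      # timeout in seconds
    verify=False,   # skip SSL certificate verification
)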

3.2 Supported request methods

requests.get('https://github.com/timeline.json')  # GET request
requests.post('http://httpbin.org/post')          # POST request
requests.put('http://httpbin.org/put')            # PUT request
requests.delete('http://httpbin.org/delete')      # DELETE request
requests.head('http://httpbin.org/get')           # HEAD request
requests.options('http://httpbin.org/get')        # OPTIONS request

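4. The response object

import requests
r = requests.get(...)  # arguments omitted

4.1 Attributes and methods

Code	Meaning
r.status_code	HTTP status code of the response
r.raw	The raw response body, i.e. the underlying urllib3 response object; pass stream=True to the request and read it with r.raw.read()
r.content	Response body as bytes; gzip and deflate compression is decoded automatically
r.text	Response body as a string, decoded automatically using the character encoding taken from the response headers
r.headers	Server response headers stored in a dictionary-like object whose keys are case-insensitive; r.headers.get(key) returns None when the key is missing. For example, cookies set by the server are in r.headers['Set-Cookie']
r.json()	The JSON decoder built into Requests
r.raise_for_status()	Raises an exception for a failed request (4xx or 5xx status code)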
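
The case-insensitive behaviour of r.headers can be tried without sending a request. The sketch below uses CaseInsensitiveDict from requests.structures as a stand-in for r.headers; the header values are made up for illustration.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for r.headers; these header values are invented for the example.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=utf-8',
    'Set-Cookie': 'sessionid=abc123; Path=/',
})

content_type = headers['content-type']   # lookup ignores the key's case
cookie = headers.get('SET-COOKIE')        # .get() ignores case as well
missing = headers.get('X-Not-There')      # missing key: .get() returns None

print(content_type)
print(cookie)
print(missing)
```

Indexing a missing key with square brackets still raises KeyError, so .get() is the safer choice for optional headers.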

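4.2 Garbled text in response.text

When we read the response body as a string through response.text, the text sometimes comes out garbled. The cause is that response.encoding, the encoding requests chose for decoding, is wrong for the page.

Fix: set the correct encoding before reading response.text.

response.encoding = 'utf-8'
print(response.text)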
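
Why a wrong encoding garbles the text can be seen without any network access. The sketch below decodes the same UTF-8 bytes in two ways. ISO-8859-1 serves as the wrong guess because requests falls back to it for text responses whose Content-Type header carries no charset. The sample string is only an illustration.

```python
raw = '孙悟空'.encode('utf-8')        # bytes as a server would send them

garbled = raw.decode('iso-8859-1')    # wrong guess: every byte becomes one character
correct = raw.decode('utf-8')         # right encoding restores the original text

print(repr(garbled))
print(correct)
```

With a real response, response.apparent_encoding (detected from the body) is often a reasonable value to assign to response.encoding when the header-based guess is wrong.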

5. Checking which request method a page uses

II. Three common cases for GET requests

1. No request parameters (Baidu products)

import requests

base_url = 'https://www.baidu.com/more/'
response = requests.get(base_url)
response.encoding = 'utf-8'

print(response.status_code)      # status code
print(response.headers)          # response headers
print(type(response.text))       # <class 'str'>
print(type(response.content))    # <class 'bytes'>

2. With request parameters (Sina News)

import requests

# 1. Decide on the URL
base_url = 'https://search.sina.com.cn/'  # Sina News search

# 2. Set up the headers dict
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
}

# 3. Set up the request parameters
key = '孙悟空'  # search term
params = {
    'q': key,
    'c': 'news',
    'from': 'channel',
    'ie': 'utf-8',
}

# 4. Send the request
response = requests.get(base_url, headers=headers, params=params)
response.encoding = 'gbk'
print(response.text)
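
To see what the params dict turns into, the request can be prepared without being sent. This is a minimal sketch that reuses the URL and parameters from the example above; requests.Request(...).prepare() builds the final URL locally, and urllib.parse.quote is used only to show the percent-encoded form of the search term.

```python
from urllib.parse import quote

import requests

base_url = 'https://search.sina.com.cn/'
params = {'q': '孙悟空', 'c': 'news', 'from': 'channel', 'ie': 'utf-8'}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request('GET', base_url, params=params).prepare()
encoded_key = quote('孙悟空')   # UTF-8 percent-encoding of the search term

print(prepared.url)
print(encoded_key)
```

The non-ASCII search term appears in the URL as percent-encoded UTF-8 bytes, which is what the 'ie': 'utf-8' parameter tells the server to expect.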

3. Handling pagination

Steps for paginated pages
- Step 1: work out the pattern in the pagination parameter
- Step 2: build the headers and params dicts
- Step 3: loop over the pages with a for loop

# Crawl the first ten pages of one forum from Baidu Tieba search
import os

import requests

base_url = 'https://tieba.baidu.com/f?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
}

# Create the output directory
dirname = './tieba/woman/'
if not os.path.exists(dirname):
    os.makedirs(dirname)


# Build the parameters and send one request per page
for i in range(0, 10):
    params = {
        'ie': 'utf-8',
        'kw': '美女',         # forum name to search for
        'pn': str(i * 50),    # offset: grows by 50 per page
    }

    response = requests.get(base_url, headers=headers, params=params)

    # Write each page to its own HTML file, numbered from 1
    with open(dirname + '美女第%s页.html' % (i + 1), 'w', encoding='utf-8') as file:
        file.write(response.content.decode('utf-8'))
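
The pagination pattern itself can be separated from the network code. Below is a small sketch; page_params is a hypothetical helper (not part of requests), and it assumes, as in the example above, that the pn offset advances by 50 per page.

```python
def page_params(keyword, pages, page_size=50):
    """Yield one params dict per page; 'pn' is the offset of that page."""
    for page in range(pages):
        yield {'ie': 'utf-8', 'kw': keyword, 'pn': str(page * page_size)}


all_params = list(page_params('python', pages=3))
for p in all_params:
    print(p)
```

Each dict can be passed straight to requests.get(base_url, headers=headers, params=p) inside the loop.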

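III. Using POST requests

1. The json module

json.dumps(Python list or dict) ----> (returns) ----> JSON string

json.loads(JSON string) ----> (returns) ----> Python list or dict

The response to a POST request is usually JSON data.
JSON data is handled with the json module.
JSON data is, in essence, just a string.

response.json()
# Parses the JSON string in the response body directly into a Python list or dict, much like json.loads(response.text)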
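
The two directions of the json module can be checked with a quick round trip. The payload below is invented for illustration.

```python
import json

payload = {'name': 'requests', 'tags': ['http', 'python'], 'stars': None}

text = json.dumps(payload)     # Python dict -> JSON string
restored = json.loads(text)    # JSON string -> Python dict

print(type(text).__name__, text)
print(restored == payload)
```

Note that Python's None is written as null in the JSON string and comes back as None after json.loads.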

2. Common POST format

response = requests.post(
    url,
    headers={},
    data={},  # form data as a dict
)
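
Besides data=, requests.post also accepts json= for APIs that expect a JSON body. The sketch below prepares the same payload both ways without sending anything, so the difference in body and Content-Type is visible. The field names are made up, and the httpbin URL is the one used earlier in this article.

```python
import json

import requests

url = 'http://httpbin.org/post'
fields = {'user': 'alice', 'page': 1}

# Prepare (but do not send) the same payload in both styles.
as_form = requests.Request('POST', url, data=fields).prepare()
as_json = requests.Request('POST', url, json=fields).prepare()

print(as_form.headers['Content-Type'], as_form.body)
print(as_json.headers['Content-Type'], as_json.body)
```

Which one to use depends on what the server expects: HTML-form style endpoints usually want data=, JSON APIs usually want json=.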

3. Uploading files

import requests

url = 'http://127.0.0.1:5000/upload'

# Open the file in binary mode; the with block closes it once the upload is done
with open('/home/lyb/sjzl.mpg', 'rb') as f:
    files = {'file': f}
    # files = {'file': ('report.jpg', f)}  # set the file name explicitly
    r = requests.post(url, files=files)

print(r.text)
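
What files= actually sends can be inspected without a running server. This sketch prepares the upload locally and uses an in-memory io.BytesIO object as a stand-in for a real file; the file name and contents are made up.

```python
import io

import requests

url = 'http://127.0.0.1:5000/upload'
fake_file = io.BytesIO(b'hello upload')       # in-memory stand-in for a real file
files = {'file': ('report.txt', fake_file)}   # (file name, file object)

# Build the multipart body locally; nothing is sent.
prepared = requests.Request('POST', url, files=files).prepare()

print(prepared.headers['Content-Type'])
print(b'filename="report.txt"' in prepared.body)
```

The request body is encoded as multipart/form-data, and the explicit file name from the tuple ends up in the part's Content-Disposition header.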

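IV. Hook functions in requests

A hook can tamper with attributes of the response, or simply print a message.

import requests


def change_url(response, *args, **kwargs):
    """Callback: called with each response."""
    response.url = '123'


# Register the hook with hooks=dict(response=change_url): a dict that maps the 'response' event to the callback, which can then modify the returned result
response = requests.get('https://www.baidu.com', hooks=dict(response=change_url))
print(response.url)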
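
Hooks are more commonly used to observe responses than to rewrite them. The sketch below registers a recording hook on a Session so that it applies to every request made through that session. record_response and seen are hypothetical names, and the hook is exercised by hand on a blank requests.Response object so the example runs without network access.

```python
import requests

seen = []  # (status_code, url) pairs collected by the hook


def record_response(response, *args, **kwargs):
    """Hook that records each response instead of modifying it."""
    seen.append((response.status_code, response.url))


# Attach the hook to a session so every request made through it is recorded.
session = requests.Session()
session.hooks['response'].append(record_response)

# Exercise the hook by hand with a blank Response object (no network needed).
fake = requests.Response()
fake.status_code = 200
fake.url = 'https://www.baidu.com/'
record_response(fake)

print(seen)
```

With the hook attached, a call such as session.get(url) would append one entry to seen for each response received.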

V. Common requests errors

1. Connect timeout

The server does not answer within the given time; requests.exceptions.ConnectTimeout is raised.

requests.get('http://github.com', timeout=0.001)

# Error raised
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1b16da75f8>, 'Connection to github.com timed out. (connect timeout=0.001)'))

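2. Connect and read timeouts

The connect and read timeouts can be set separately with timeout=(connect timeout, read timeout). If the server does not answer in time, requests.exceptions.ConnectTimeout or requests.exceptions.ReadTimeout is raised, depending on which phase timed out.

Connect: the time the client takes to establish a connection to the server.
Read: the time the client waits, after the request has been sent, for the server to send data (in practice, the wait before the first byte arrives).

requests.get('http://github.com', timeout=(6.05, 0.01))

# Error raised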
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='github.com', port=80): Read timed out. (read timeout=0.01)

3. Unknown server

requests.get('http://github.comasf', timeout=(6.05, 27.05))

# Error raised
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.comasf', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f75826665f8>: Failed to establish a new connection: [Errno -2] Name or service not known',))

4. Proxy cannot be reached

The proxy server refuses the connection (the port rejects connections or is not open); requests.exceptions.ProxyError is raised.

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "192.168.10.1:800"})

# Error raised
requests.exceptions.ProxyError: HTTPConnectionPool(host='192.168.10.1', port=800): Max retries exceeded with url: http://github.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce3438c6d8>: Failed to establish a new connection: [Errno 111] Connection refused',)))

5. Proxy connect timeout

The proxy server does not respond; requests.exceptions.ConnectTimeout is raised.

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "10.200.123.123:800"})

# Error raised
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.200.123.123', port=800): Max retries exceeded with url: http://github.com/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fa8896cc6d8>, 'Connection to 10.200.123.123 timed out. (connect timeout=6.05)'))

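6. Proxy read timeout

The connection to the proxy succeeded and the proxy forwarded the request to the target site, but the proxy timed out while reading from the target site.

Even if the proxy itself responds quickly, a slow target site still shows up as a timeout on the proxy.

Assuming the proxy is usable, timeout is the limit on connecting to and reading from the proxy server; whether the proxy in turn connects to and reads from the target successfully is not something the client needs to track.

requests.get('http://github.com', timeout=(2, 0.01), proxies={"http": "192.168.10.1:800"})

# Error raised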
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.10.1:800', port=1080): Read timed out. (read timeout=0.5)

7. Network environment failure

Possibly caused by a dropped network connection; requests.exceptions.ConnectionError is raised.

requests.get('http://github.com', timeout=(6.05, 27.05))

# Error raised
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc8c17675f8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

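8. Notes from the official documentation

You can tell requests to stop waiting for a response after the number of seconds given by the timeout parameter. Nearly all production code should use this parameter; without it, your program may hang indefinitely: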

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

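timeout is not a limit on the entire response download. Rather, an exception is raised if the server has not answered within timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).


- On a network problem (e.g. DNS lookup failure, refused connection), Requests raises a requests.exceptions.ConnectionError exception.
- If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.
- If the request times out, a Timeout exception is raised.
- If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.
- All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.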
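
The exception classes listed above can be handled in one place. The following is a minimal sketch, not a definitive pattern: classify and fetch are hypothetical helpers, and the labels are arbitrary. More specific classes are checked first because ConnectTimeout and ProxyError are both subclasses of ConnectionError.

```python
import requests
from requests import exceptions as exc


def classify(error):
    """Map a requests exception to a short label, most specific class first."""
    if isinstance(error, exc.ProxyError):
        return 'proxy unreachable'
    if isinstance(error, exc.ConnectTimeout):
        return 'connect timeout'
    if isinstance(error, exc.ReadTimeout):
        return 'read timeout'
    if isinstance(error, exc.ConnectionError):
        return 'network problem'
    if isinstance(error, exc.HTTPError):
        return 'bad status code'
    if isinstance(error, exc.TooManyRedirects):
        return 'too many redirects'
    return 'other requests error'


def fetch(url, **kwargs):
    """Return (response, None) on success or (None, label) on failure."""
    kwargs.setdefault('timeout', (6.05, 27.05))  # same values as the examples above
    try:
        response = requests.get(url, **kwargs)
        response.raise_for_status()
        return response, None
    except exc.RequestException as error:
        return None, classify(error)


print(classify(exc.ConnectTimeout()))
print(classify(exc.ReadTimeout()))
```

A caller would use it as response, problem = fetch('http://github.com') and branch on problem to decide whether to retry, switch proxy, or give up.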
