Getting Started with Python Web Scraping: Creating a Scraper

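urllib is part of Python's standard library. It provides functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent (https://docs.python.org/3/library/urllib.html).

If you have used the urllib2 library in Python 2.x, you may notice that urllib2 and urllib differ somewhat. In Python 3.x, urllib2 was renamed urllib and split into several submodules, including urllib.request, urllib.parse, and urllib.error.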
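As a minimal sketch of how these submodules divide the work, the snippet below builds a query string with urllib.parse and attaches a custom user agent to a urllib.request.Request object. Nothing is sent over the network. The URL https://example.com/search and the agent string my-scraper/0.1 are placeholders chosen for illustration, not values from this article.

```python
from urllib.parse import urlencode, urlparse
from urllib.request import Request

# Placeholder URL, used only for illustration
base_url = "https://example.com/search"

# urllib.parse: encode query parameters and take a URL apart again
query = urlencode({"q": "python scraping", "page": 2})
full_url = base_url + "?" + query
parts = urlparse(full_url)

# urllib.request: describe a request with a custom User-Agent header
# (the request is only constructed here, it is never sent)
req = Request(full_url, headers={"User-Agent": "my-scraper/0.1"})

print(parts.netloc, parts.path, parts.query)
# Request stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

Passing the Request object to urlopen would send it with that header. urllib.error comes into play once a request is actually made and fails.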

# Import the library (documentation is available on my blog)
from urllib.request import urlopen
'''
Using urllib.request

'''
html = urlopen("https://blog.csdn.net/qq_35706045")
print(html.read())

Run the script:

$ python pa.py

or, if python points to Python 2 on your system:

$ python3 pa.py
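Note that read() on the object returned by urlopen gives you bytes, not text, which is why the printed output starts with b'...'. The sketch below decodes the body using the charset declared in the response headers and falls back to UTF-8. The helper name read_text is hypothetical, and a data: URL stands in for a real page so the example runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def read_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str."""
    with urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header when one is declared
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


# Assumption: a data: URL is used instead of a real site so this runs offline
sample_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")
page = read_text(sample_url)
print(page)
```

The same function can be called with an http or https URL such as the blog address used above.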

# Third-party library (recommended)
"""
pip install requests

"""
import requests
'''
response = requests.get('http://www.baidu.com')
print(response.status_code)      # print the status code
print(response.url)              # print the request URL
print(response.headers)          # print the response headers
print(response.cookies)          # print the cookies
print(response.text)             # print the page source as text
print(response.content)          # print the page source as bytes

'''
r = requests.get('https://blog.csdn.net/qq_35706045')
print(r.text)

The most commonly used object in the BeautifulSoup library is, fittingly, the BeautifulSoup object. For example:

from urllib.request import urlopen
from bs4 import BeautifulSoup
'''
Basic usage of the bs4 library

'''
html = urlopen("https://blog.csdn.net/qq_35706045")
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)

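The examples here use the newer BeautifulSoup 4 release (also called BS4).

Complete installation instructions for BeautifulSoup 4 are at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

On Linux, the basic installation is:

$ sudo apt-get install python-bs4

(On current Debian and Ubuntu releases, the Python 3 package is named python3-bs4.)

On a Mac, first install pip, Python's package manager, with

$ sudo easy_install pip

and then run

$ pip install beautifulsoup4

To check the installation, start the Python interpreter and import the library:

$ python
>>> from bs4 import BeautifulSoup

If no error appears, the import succeeded.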
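The same check can be scripted. The sketch below uses importlib.util.find_spec from the standard library to ask whether a top-level module can be imported by the current interpreter, without actually importing it. The helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


for name in ("bs4", "requests"):
    status = "available" if is_installed(name) else "missing"
    print(name, "is", status)
```

A "missing" result usually means the package was installed for a different Python interpreter than the one running the script.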

There is also an .exe installer for pip on Windows (https://pypi.python.org/pypi/setuptools). Once it is installed, you can easily install and manage packages:

> pip install beautifulsoup4

A basic scraper with exception handling

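from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    # Fetch the page; an HTTP error status (e.g. 404 or 500) raises HTTPError
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    # Parse the page; if there is no <body>, bsObj.body is None
    # and accessing .h1 on it raises AttributeError
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title

title = getTitle("https://blog.csdn.net/qq_35706045")
# title is also None when the page has a <body> but no <h1>
if title is None:
    print("Title could not be found")
else:
    print(title)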
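The scraper above catches HTTPError only, so it does not cover the case where the server cannot be reached at all. In that case urlopen raises URLError, of which HTTPError is a subclass. The following is a sketch of one way to handle both, not part of the original script. The function fetch_bytes and its opener parameter are hypothetical names, and the fake openers exist only so the behavior can be tried without a network connection.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_bytes(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # The server could not be reached (bad host name, refused connection)
        print("Could not reach server:", e.reason)
        return None
    return response.read()


# Stand-in openers (assumptions for illustration) that simulate each outcome
def fake_missing_page(url):
    raise HTTPError(url, 404, "Not Found", None, None)


def fake_unreachable(url):
    raise URLError("server unreachable")


def fake_ok(url):
    return io.BytesIO(b"<h1>ok</h1>")


print(fetch_bytes("http://example.com/a", opener=fake_missing_page))
print(fetch_bytes("http://example.com/b", opener=fake_unreachable))
print(fetch_bytes("http://example.com/c", opener=fake_ok))
```

HTTPError must be listed before URLError. Because it is the subclass, the more general handler would otherwise catch it first.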

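Because several Python versions may be installed side by side, make sure that the pip you run belongs to the Python version you intend to use. A mismatch can lead to puzzling errors.

Alternatively, as the book Python网络数据采集 (Web Scraping with Python) describes, you can use virtual environments to work with different Python versions, mainly Python 2 and Python 3.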
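One common way to keep pip and the interpreter in step is to run pip as a module of the interpreter itself, that is, python -m pip install instead of a bare pip install. The sketch below only builds and prints such a command for whichever interpreter is running it. The helper name pip_install_command is hypothetical.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


cmd = pip_install_command("beautifulsoup4")
print("Python version:", sys.version_info[:3])
print(*cmd)
```

Running the printed command installs the package for exactly that interpreter, whichever pip happens to be first on the PATH.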

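Keeping libraries straight with virtual environments

If you work on several Python projects at once, want an easy way to bundle a project with its associated libraries, or worry about conflicts between installed libraries, you can install a Python virtual environment to keep everything separate.

When you install a Python library without a virtual environment, you are installing it globally.

This usually requires administrator privileges or installing as root, and the library then exists for every user and every project on the machine. Fortunately, creating a virtual environment is easy:

$ virtualenv scrapingEnv

This creates a new environment called scrapingEnv, which you must activate before using:

$ cd scrapingEnv/
$ source bin/activate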
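The commands above use the third-party virtualenv tool. Python 3 also ships a similar facility in its standard library, the venv module. The sketch below uses it in place of virtualenv to create an environment from Python code. The helper create_env is a hypothetical name, and the environment is created in a temporary directory so the example cleans up after itself.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir, name="scrapingEnv"):
    """Create a virtual environment under parent_dir and return its path."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps the sketch fast and avoids needing ensurepip;
    # pass with_pip=True to be able to run pip inside the new environment
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    # Every environment created by venv contains a pyvenv.cfg file
    has_config = (env_dir / "pyvenv.cfg").exists()
    print("environment created:", has_config)
```

From the shell, the equivalent is python3 -m venv scrapingEnv, after which the environment is activated the same way as shown above.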

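After you activate the environment, its name appears in front of the command-line prompt, reminding you that you are currently working inside a virtual environment.

Any libraries you install and any programs you run from then on live in that environment.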
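Besides looking at the prompt, a script can check for itself whether it is running inside a virtual environment. The sketch below relies on two documented conventions: environments created by venv or by recent virtualenv releases make sys.prefix differ from sys.base_prefix, and older virtualenv releases set sys.real_prefix. The function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True if the current interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases: prefix differs from base_prefix
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("inside a virtual environment:", in_virtualenv())
print("libraries are installed under:", sys.prefix)
```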

In the newly created scrapingEnv environment, you can install and use BeautifulSoup:

(scrapingEnv)ryan$ pip install beautifulsoup4

(scrapingEnv)ryan$ python
>>> from bs4 import BeautifulSoup
>>>

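When you no longer need the libraries in the virtual environment, leave it with the deactivate command:

(scrapingEnv)ryan$ deactivate
ryan$ python
>>> from bs4 import BeautifulSoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'bs4'

Keeping all of a project's libraries in a single virtual environment also makes it easy to package the whole environment and send it to someone else. As long as they have the same Python version installed, your code will run from the virtual environment without any further library installation.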
