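Scraping data from web pages with Python is one of the most common forms of crawling. Web pages are generally written in HTML and consist of a large number of tags, and the content we want sits inside those tags. So besides knowing basic Python syntax, you also need a basic understanding of HTML structure and of how to select tags. Below, the example of scraping the fl novel site serves as an introduction to the world of crawlers.

I. Implementation Steps

1.1 Importing dependencies

Dependency for fetching page content

import requests. If the package is not installed yet, run pip install requests in a terminal and pip will download and install it.

Dependencies for parsing content

Commonly used options include BeautifulSoup, parsel, and re (re is part of the standard library and needs no installation).

As in the step above, install any missing package from the terminal.

Install BeautifulSoup

pip install bs4

Install parsel

pip install parsel

Import the packages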

from bs4 import BeautifulSoup
import requests
import parsel
import re
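Before running the scraper, it can help to confirm that the third-party packages are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of any library.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found for import."""
    # find_spec returns None for a top-level module that is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules used in this article; bs4 is the import name of BeautifulSoup
required = ['requests', 'parsel', 'bs4', 're']
to_install = missing_packages(required)
```

Any name left in to_install still needs a pip install from the terminal.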

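1.2 Fetching data

The simplest case: request a page and read its text

response = requests.get(url).text

Many sites require a logged-in user or reject clients that do not look like a browser. In that case, disguise the request with headers; the values can be copied from the browser's developer tools (F12).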

headers = {
    'Cookie': 'placeholder, not a real cookie',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

headers = {
    'Host': 'www.qidian.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate'
}

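Fetch the page data with the disguised headers

response = requests.get(url=url, headers=headers).text

Some setups also run into SSL certificate issues or must go through a proxy, in which case proxies needs to be set as well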

proxies = {
    'http': 'http://127.0.0.1:9000',
    'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text
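The calls in this section can be gathered into one small helper. This is only a sketch: the name fetch_text, the 10-second default timeout, and the choice to raise on HTTP error codes are assumptions added here, not part of the original example.

```python
import requests


def fetch_text(url, headers=None, proxies=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors or timeouts."""
    # Without a timeout, requests.get can wait indefinitely on a stalled server
    response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of saving an error page
    response.raise_for_status()
    return response.text
```

It would be called as fetch_text(url, headers=headers), or with proxies=proxies when a proxy is required.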

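1.3 Parsing data

There are several ways to parse the data, such as XPath, CSS selectors, and regular expressions (re).

CSS, as the name suggests, selects content by HTML tag, class, and attribute using CSS selector syntax.

re parses the text with regular expressions.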

1.4 Writing to a file

with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)

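open opens the file for I/O, and the with statement closes it automatically when the block ends, so you do not have to close it by hand. This is similar to Java's try-with-resources, where the stream is declared inside try(...).

The first argument is the file name, mode is the open mode ('w' means write, replacing any existing content), and encoding is the text encoding. open accepts more parameters, which you can explore on your own.

write writes the content to the file.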
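The behaviour described above can be checked with a short experiment. This sketch writes to a temporary directory so it leaves the working directory untouched; the file name chapter.txt and the sample strings are placeholders.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'chapter.txt')

# mode='w' creates the file, or empties it first if it already exists
with open(path, mode='w', encoding='utf-8') as f:
    f.write('first version\n')
with open(path, mode='w', encoding='utf-8') as f:
    f.write('second version\n')

# mode='a' keeps the existing content and appends to the end
with open(path, mode='a', encoding='utf-8') as f:
    f.write('appended line\n')

# Leaving the with block has already closed the file
closed_after_with = f.closed

with open(path, mode='r', encoding='utf-8') as f:
    saved = f.read()
```

After running it, saved contains only the second version plus the appended line, which shows that 'w' overwrites while 'a' appends.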

II. Complete Example

import requests
import parsel


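# Index page of the novel; the real address is withheld for legal reasons
link = 'starting URL of the novel (withheld for legal reasons)'
# Browser-like headers, as described in section 1.2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}
link_data = requests.get(url=link, headers=headers).text
link_selector = parsel.Selector(link_data)
# Collect the href of every chapter link on the index page
href = link_selector.css('.DivTr a::attr(href)').getall()
for index in href:
    # The hrefs are assumed to be protocol-relative (//host/path), so prepend the scheme
    url = f'https:{index}'
    print(url)
    # headers must be passed by keyword: the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)

    html_data = response.text
    selector = parsel.Selector(html_data)
    # Chapter title and the text of each paragraph
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    # One text file per chapter, named after its title
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)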
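Two spots in the loop above are fragile: the chapter URL is built by string concatenation, and the scraped title is used directly as a file name even though it may be missing or contain characters that file systems reject. The sketch below shows one possible way to harden both using only the standard library; the helper names absolute_url and safe_filename and the 'untitled' fallback are hypothetical.

```python
import re
from urllib.parse import urljoin


def absolute_url(base, href):
    """Resolve `href` against `base`; handles //host/path, /path and full URLs."""
    return urljoin(base, href)


def safe_filename(title, fallback='untitled'):
    """Turn a scraped title into a file name usable on common file systems."""
    # selector.css(...).get() returns None when the title element is missing
    if not title:
        return fallback
    # Replace path separators, control whitespace and characters Windows forbids
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    return cleaned or fallback
```

In the loop, url = absolute_url(link, index) and open(safe_filename(title) + '.txt', ...) would take the place of the corresponding lines.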

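The example above can fetch the free chapters of the fl novel site. What about the paid chapters?

Paid chapters are served as images. In principle one would locate the images and run them through a text-recognition (OCR) service such as Baidu Cloud to extract the text. Scraping paid content is illegal, so that part of the code is not provided.

Author: 天道佩恩

Link: https://juejin.cn/post/7385350484411056154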
