Using Python to crawl data from web pages is a common approach. Web pages are generally written in HTML, which contains a large number of tags, and the content we need lives inside those tags. Besides a grasp of basic Python syntax, we also need a basic understanding of HTML structure and how to select tags.
1. Implementation Steps
1.1 Importing Dependencies
Dependency for fetching web page content
import requests — if the dependency is not installed, run pip install requests in the terminal to install it
Dependencies for parsing the content
Commonly used ones include BeautifulSoup, parsel, re, and so on.
As before, if a dependency is missing, install it from the terminal
Install the BeautifulSoup dependency
pip install bs4
Install the parsel dependency
pip install parsel
Using the dependencies
from bs4 import BeautifulSoup
import requests
import parsel
import re
1.2 Access to Data
Simplest case: fetch the web page and get its text
response = requests.get(url).text
Many websites require a user to be logged in; disguise the request with headers, which can be copied from the browser's developer tools (F12)
headers = {
'Cookie': 'cookie value, not a real one',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}
headers = {
'Host': 'www.qidian.com',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate'
}
Get the web page data after disguising the request
response = requests.get(url=url, headers=headers).text
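To check what a disguised request will actually send, you can build and prepare it locally without touching the network. This is a small sketch using the requests library's Request/PreparedRequest API; the URL and User-Agent here are just placeholders:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build and prepare the request locally; no network traffic happens here
req = requests.Request('GET', 'https://www.qidian.com', headers=headers)
prepped = req.prepare()

print(prepped.method)                 # GET
print(prepped.headers['User-Agent'])  # the disguised UA string
```

Inspecting the prepared request this way is an easy sanity check that your headers dict is shaped correctly before you start sending real traffic.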
Some sites may additionally require the request to go through a proxy (for example, when there are SSL certificate problems)
proxies = {
'http': 'http://127.0.0.1:9000',
'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text
1.3 Parsing Data
There are several ways to parse the data, such as XPath, CSS selectors, and re.
As the name suggests, the CSS approach selects HTML tags by tag name, class, or id.
re parses with regular expressions.
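As a minimal offline illustration of the re approach, here is a made-up HTML fragment standing in for a downloaded page; the snippet and patterns are for demonstration only:

```python
import re

# A made-up HTML fragment standing in for a downloaded page
html = '<div class="c_l_title"><h1>Chapter 1</h1></div><p>Hello</p><p>World</p>'

# Extract the chapter title and every paragraph with regular expressions
title = re.search(r'<h1>(.*?)</h1>', html).group(1)
paragraphs = re.findall(r'<p>(.*?)</p>', html)

print(title)       # Chapter 1
print(paragraphs)  # ['Hello', 'World']
```

Note the non-greedy `(.*?)` groups: without the `?`, a pattern like `<p>(.*)</p>` would swallow everything between the first `<p>` and the last `</p>`.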
1.4 Writing Files
with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)
The open function opens a file IO stream; the with statement saves you from closing the stream manually, similar to try-with-resources in Java.
The first argument is the file name, mode is the write mode, encoding is the character encoding; there are many more parameters, which you can study on your own.
write writes the content to the file.
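One practical detail when the file name comes from a scraped title: it may contain characters that are illegal in file names. A small sketch of cleaning it first; the safe_name helper is a hypothetical addition, not part of the original code:

```python
import os
import re
import tempfile

def safe_name(title):
    # Replace characters that are illegal in Windows/Unix file names (hypothetical helper)
    return re.sub(r'[\\/:*?"<>|]', '_', title)

title = 'Chapter 1: "The Beginning"?'
content = 'some chapter text'

# Write to the system temp directory just for this demonstration
path = os.path.join(tempfile.gettempdir(), safe_name(title) + '.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write(content)
```

Without a step like this, a title containing `?` or `:` would raise an OSError on Windows when the file is created.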
2. Complete case
import requests
import parsel

# The disguised headers from section 1.2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

link = 'starting URL of the novel; not given here for legal reasons'
link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
href = link_selector.css('.DivTr a::attr(href)').getall()
for index in href:
    url = f'https:{index}'
    print(url)
    response = requests.get(url=url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)
The case above can fetch the free chapters of the FL Novel Network, but what about paid chapters?
Paid chapters are served as images; you would have to find the image and then use an OCR service such as Baidu's cloud API to extract the text. Crawling paid content is illegal, so that part of the code cannot be provided.
Author: Heavenly Dao Payne
Link: https://juejin.cn/post/7385350484411056154