Using Python to crawl data from web pages is a common approach. Web pages are generally written in HTML, which contains a large number of tags, and the content we need lives inside those tags. Besides a grasp of basic Python syntax, we also need a basic understanding of HTML structure and how to select tags.
1. Implementation Steps
1.1 Importing Dependencies
Dependency for fetching web page content
import requests — if the dependency is not installed, run pip install requests in the terminal to install it
Dependencies for parsing the content
Commonly used ones include BeautifulSoup, parsel, re, and so on.
As before, if a dependency is missing, install it from the terminal
Install the BeautifulSoup dependency
pip install bs4
Install the parsel dependency
pip install parsel
Using the dependencies
from bs4 import BeautifulSoup
import requests
import parsel
import re
1.2 Access to Data
Simplest case: fetch the web page and get its text
response = requests.get(url).text
Many websites require a user to be logged in; disguise the request with headers, which can be copied from the browser's developer tools (F12)
headers = {
'Cookie': 'cookie value, not a real one',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}
headers = {
'Host': 'www.qidian.com',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate'
}
Get the web page data after disguising the request
response = requests.get(url=url, headers=headers).text
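To check what a disguised request will actually send, you can build and prepare it locally without touching the network. This is a small sketch using the requests library's Request/PreparedRequest API; the URL and User-Agent here are just placeholders:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build and prepare the request locally; no network traffic happens here
req = requests.Request('GET', 'https://www.qidian.com', headers=headers)
prepped = req.prepare()

print(prepped.method)                 # GET
print(prepped.headers['User-Agent'])  # the disguised UA string
```

Inspecting the prepared request this way is an easy sanity check that your headers dict is shaped correctly before you start sending real traffic.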
Some sites may additionally require the request to go through a proxy (for example, when there are SSL certificate problems)
proxies = {
'http': 'http://127.0.0.1:9000',
'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text
1.3 Parsing Data
There are several ways to parse the data, such as XPath, CSS selectors, and re.
As the name suggests, the CSS approach selects HTML tags by tag name, class, or id.
re parses with regular expressions.
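As a minimal offline illustration of the re approach, here is a made-up HTML fragment standing in for a downloaded page; the snippet and patterns are for demonstration only:

```python
import re

# A made-up HTML fragment standing in for a downloaded page
html = '<div class="c_l_title"><h1>Chapter 1</h1></div><p>Hello</p><p>World</p>'

# Extract the chapter title and every paragraph with regular expressions
title = re.search(r'<h1>(.*?)</h1>', html).group(1)
paragraphs = re.findall(r'<p>(.*?)</p>', html)

print(title)       # Chapter 1
print(paragraphs)  # ['Hello', 'World']
```

Note the non-greedy `(.*?)` groups: without the `?`, a pattern like `<p>(.*)</p>` would swallow everything between the first `<p>` and the last `</p>`.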
1.4 Writing Files
with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)
The open function opens a file IO stream; the with statement saves you from closing the stream manually, similar to try-with-resources in Java.
The first argument is the file name, mode is the write mode, encoding is the character encoding; there are many more parameters, which you can study on your own.
write writes the content to the file.
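One practical detail when the file name comes from a scraped title: it may contain characters that are illegal in file names. A small sketch of cleaning it first; the safe_name helper is a hypothetical addition, not part of the original code:

```python
import os
import re
import tempfile

def safe_name(title):
    # Replace characters that are illegal in Windows/Unix file names (hypothetical helper)
    return re.sub(r'[\\/:*?"<>|]', '_', title)

title = 'Chapter 1: "The Beginning"?'
content = 'some chapter text'

# Write to the system temp directory just for this demonstration
path = os.path.join(tempfile.gettempdir(), safe_name(title) + '.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write(content)
```

Without a step like this, a title containing `?` or `:` would raise an OSError on Windows when the file is created.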
2. Complete case
import requests
import parsel

# The disguised headers from section 1.2
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

link = 'starting URL of the novel; not given here for legal reasons'
link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
href = link_selector.css('.DivTr a::attr(href)').getall()
for index in href:
    url = f'https:{index}'
    print(url)
    response = requests.get(url=url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)
The case above can fetch the free chapters of the FL Novel Network, but what about paid chapters?
Paid chapters are served as images; you would have to find the image and then use an OCR service such as Baidu's cloud API to extract the text. Crawling paid content is illegal, so that part of the code cannot be provided.
Author: Heavenly Dao Payne
Link: https://juejin.cn/post/7385350484411056154