
Python simple crawler case

Author: Senior Internet Architect

Using Python to crawl data from web pages is one of the most common forms of crawling. Web pages are generally written in HTML, which contains a large number of tags, and the content we need lives inside those tags. So in addition to a grasp of basic Python syntax, we also need a basic understanding of HTML structure and how to select tags.


1. Implementation Steps

1.1 Importing Dependencies

Dependency for fetching web page content

Use import requests; if the dependency is not installed yet, type pip install requests in the terminal and it will be downloaded automatically.
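
For example, in the terminal:

pip install requests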

Dependencies for parsing the content

Commonly used ones include BeautifulSoup, parsel, re, and so on.

As in the previous step, if the dependencies are missing, install them from the terminal.

Install the BeautifulSoup dependency

pip install bs4           

Install the parsel dependency

pip install parsel           

Import the dependencies in code

from bs4 import BeautifulSoup
import requests
import parsel
import re           

1.2 Fetching Data

The simplest case: fetch a web page and get its text as a string

response = requests.get(url).text           
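
A minimal runnable sketch, using a placeholder URL (the real target site is not given here):

import requests

url = 'https://example.com'   # placeholder URL, only for illustration
response = requests.get(url)
print(response.status_code)   # 200 means the request succeeded
print(response.text[:200])    # first 200 characters of the returned HTML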

Many websites require the user to be logged in; disguise the request with headers, which can be copied from the browser's developer tools (F12).

headers = {
    'Cookie': 'cookie value (placeholder, not real)',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

A more complete set of headers, copied from the browser's request, looks like this:

headers = {
    'Host': 'www.qidian.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate'
}           

Fetch the web page data with the disguised headers

response = requests.get(url=url, headers=headers).text

Some sites even have SSL certificate or network restrictions and need to be accessed through a proxy

proxies = {
    'http': 'http://127.0.0.1:9000',
    'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text
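
If the problem is only certificate verification rather than an actual proxy, requests can also skip SSL verification; a sketch, not recommended for production use:

# disable SSL certificate verification (use with caution)
response = requests.get(url=url, headers=headers, verify=False).text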

1.3 Parsing Data

There are several ways to parse the data, for example XPath, CSS selectors, and re.

css, as the name suggests, selects content by the HTML tags and their classes; xpath addresses elements by their path in the document tree.

re parses the text with regular expressions.
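
A minimal sketch of the three approaches, plus BeautifulSoup which was installed above; the HTML snippet and the class name are placeholders, not the real site's markup:

import re
import parsel
from bs4 import BeautifulSoup

# placeholder HTML, only for illustration
html = '<div class="c_l_title"><h1>Chapter 1</h1></div>'
selector = parsel.Selector(text=html)

# css: select by tag/class; ::text extracts the text node
title_css = selector.css('.c_l_title h1::text').get()

# xpath: the same element addressed by its path in the document tree
title_xpath = selector.xpath('//div[@class="c_l_title"]/h1/text()').get()

# re: a regular expression applied to the raw HTML string
title_re = re.search(r'<h1>(.*?)</h1>', html).group(1)

# BeautifulSoup works in a similar spirit
title_bs = BeautifulSoup(html, 'html.parser').find('h1').text

print(title_css, title_xpath, title_re, title_bs)   # all print "Chapter 1"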

1.4 Writing Files

with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)           

The open function opens a file IO stream; the with statement saves you from closing the stream manually, similar to try-with-resources (try(...)) in Java.

The first argument is the file name, mode is the write mode, and encoding is the character encoding; there are many more parameters that you can explore on your own.

write() writes the content to the file.
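
Roughly what the with statement does for you, written out by hand (titleName and content are the placeholders from the snippet above):

f = open(titleName + '.txt', mode='w', encoding='utf-8')
try:
    f.write(content)   # write the chapter content
finally:
    f.close()          # with open(...) as f: closes the file automatically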

2. Complete Case

import requests
import parsel


# starting URL of the novel's chapter list (not given here for legal reasons)
link = 'novel start page URL, not provided for legal reasons'
# headers as shown in section 1.2 (at minimum a User-Agent)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
# collect the href of every chapter link on the list page
href = link_selector.css('.DivTr a::attr(href)').getall()
for index in href:
    url = f'https:{index}'
    print(url)
    response = requests.get(url, headers=headers)

    html_data = response.text
    selector = parsel.Selector(html_data)
    # chapter title and the paragraphs of the chapter body
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    # write each chapter to its own .txt file named after the title
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)

The case above can fetch the free chapters of the FL novel site. What about paid chapters?

Paid chapters are served as images; you would have to find the image and then use Baidu's cloud OCR to extract the text from it. Crawling paid content is illegal, so that part of the code is not provided.

Author: Heavenly Dao Payne

Link: https://juejin.cn/post/7385350484411056154