Scraping Today's Postings from a Web Page with Python (Using 高校人才网 as an Example)

2023-05-30 17:08:51

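If you follow 高校人才网 (gaoxiaojob.com), you have probably run into the same problems: there are too many postings, they are poorly organized, and it is hard to keep up with them. Wouldn't it be handy to have a tool that fetches the job postings in your areas of interest every day? Let's use Python to collect the information you need.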

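First, look at the site's URLs. The listing pages follow a regular pattern of section, then area.

For example:

Universities in Wuhan: http://www.gaoxiaojob.com/zhaopin/gaoxiao/wuhan/

Primary and secondary schools in Wuhan: http://www.gaoxiaojob.com/zhaopin/zhongxiaoxuejiaoshi/wuhan/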
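
The pattern above can be sketched as a small helper that combines every section with every area. The names BASE_URL and build_urls are illustrative and do not appear in the original script; the base path is taken from the two example URLs.

```python
from itertools import product

# Listing root taken from the example URLs above.
BASE_URL = "http://www.gaoxiaojob.com/zhaopin"


def build_urls(sections, areas, base_url=BASE_URL):
    """Return one listing URL per (section, area) pair."""
    return [
        f"{base_url}/{section}/{area}/"
        for section, area in product(sections, areas)
    ]


urls = build_urls(["gaoxiao", "zhongxiaoxuejiaoshi"], ["wuhan"])
for url in urls:
    print(url)
```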

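To make it easy to change which sections are scraped later, keep this information in a configuration file together with the database settings. Create an .ini file (the script below reads it as gxrcw.ini) with the following contents: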

[db]
host=localhost
schema=your_database_name
user=root
psw=root
[setting]
# sections of the site to scrape
sections=gaoxiao,zhongxiaoxuejiaoshi
# areas (cities)
areas=wuhan,guangzhou
# keywords
keywords=研究生,web
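
To check how these values come out on the Python side, here is a minimal sketch that parses the same settings from an inline string. The helper read_list and the constant SAMPLE_INI are hypothetical names introduced only for this illustration; the real script reads the file from disk instead.

```python
import configparser

# Inline copy of the settings, so the sketch runs without a file on disk.
SAMPLE_INI = """
[db]
host=localhost
schema=your_database_name
user=root
psw=root
[setting]
sections=gaoxiao,zhongxiaoxuejiaoshi
areas=wuhan,guangzhou
keywords=研究生,web
"""


def read_list(parser, section, option):
    """Split a comma-separated option into a list, dropping blank items."""
    raw = parser.get(section, option)
    return [item.strip() for item in raw.split(",") if item.strip()]


parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)

sections = read_list(parser, "setting", "sections")
areas = read_list(parser, "setting", "areas")
keywords = read_list(parser, "setting", "keywords")
print(sections, areas, keywords)
```

Stripping each item means a value such as `wuhan, guangzhou` (with a space after the comma) still produces clean names.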

Next, create a .py file, import the required packages, and read the configuration file:

#coding=utf-8
import requests
import logging
import configparser
import traceback
import MySQLdb  # provided by the mysqlclient package
import datetime
from bs4 import BeautifulSoup

# write log messages to log.txt
logging.basicConfig(filename='log.txt', level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s')
# load the database and scraping settings
cf = configparser.ConfigParser()
cf.read("gxrcw.ini",encoding='UTF-8')

Loop over the configured values to build the URLs of the pages to scrape:

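def postUrl(url):
    # url is the listing root, e.g. http://www.gaoxiaojob.com/zhaopin
    sections = cf.get("setting", "sections").split(',')
    areas = cf.get("setting", "areas").split(',')
    for section in sections:
        for area in areas:
            # e.g. .../zhaopin/gaoxiao/wuhan/
            src = url.rstrip('/') + '/' + section + '/' + area + '/'
            gettitle(src)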

After inspecting the page structure, scrape the listing:

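def gettitle(url):
    res = requests.get(url)   # fetch the listing page for one section and area
    soup = BeautifulSoup(res.text, 'html.parser')  # parse the page with BeautifulSoup
    div = soup.find('div', class_='list_zpqz1_vip')
    if div is None:
        # listing container not found, so there is nothing to scrape here
        logging.debug('no listing found at %s', url)
        return
    db = MySQLdb.connect(cf.get("db", "host"), cf.get("db", "user"), cf.get("db", "psw"), cf.get("db", "schema"), charset='utf8')
    ocursor = db.cursor()
    datenow = datetime.datetime.now()
    for link in div.find_all('div', class_='style2'):
        name = link.find('a', href=True)
        issue = link.find('span', class_='ltitle')
        issueday = issue.get_text(strip=True)
        # characters 1-5 hold the month and day as MM.DD; prepend the
        # current year instead of a hard-coded one
        issueday = str(datenow.year) + '-' + issueday[1:6].replace('.', '-')
        issueday = datetime.datetime.strptime(issueday, "%Y-%m-%d")
        # issueday is midnight of the issue date, so days=2 keeps posts from
        # today and yesterday; use days=1 to keep only today's posts
        if issueday + datetime.timedelta(days=2) >= datenow:
            href = name['href']
            getdetail(ocursor, href, datenow)
            db.commit()

    ocursor.close()
    db.close()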
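
The date handling is the most fragile step, so here is a standalone sketch of it. It assumes the listing label looks like "[05.30]" (one leading character followed by MM.DD), which is inferred from the slice used above rather than confirmed against the live page. The function names parse_issue_date and is_recent are hypothetical.

```python
import datetime


def parse_issue_date(text, today):
    """Turn a label such as '[05.30]' into a date, guessing the year."""
    month, day = (int(part) for part in text[1:6].split("."))
    issued = datetime.date(today.year, month, day)
    # The label carries no year. A date later than today is taken to be
    # from the previous year (e.g. '[12.31]' seen on January 2).
    if issued > today:
        issued = issued.replace(year=today.year - 1)
    return issued


def is_recent(issued, today, max_age_days=0):
    """True if the post is at most max_age_days old (0 means today only)."""
    age = today - issued
    return datetime.timedelta(0) <= age <= datetime.timedelta(days=max_age_days)


today = datetime.date(2023, 5, 30)
print(parse_issue_date("[05.30]", today), is_recent(parse_issue_date("[05.30]", today), today))
```

Comparing dates rather than datetimes avoids the off-by-a-few-hours effect of comparing a midnight timestamp against the current time.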

Scrape the content of each posting and store it in MySQL:

def getdetail(ocursor,url,datenow):
    res = requests.get(url)
    res.encoding = 'gb18030'  # decode the detail page as GB18030
    soup = BeautifulSoup(res.text, 'html.parser')   # parse the page with BeautifulSoup

    try:
        title = soup.find('h1', class_='title-a').get_text(strip=True)
        content = soup.find('div', class_='article_body').get_text(strip=True)

        # parameterized insert; the driver escapes the values
        sql = "insert into requestnews (TITLE,CONTENT) values (%s,%s)"
        param = (title, content)
        ocursor.execute(sql, param)
    except Exception:
        # log the traceback and carry on with the next posting
        logging.error(traceback.format_exc())
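
The post does not show the definition of the requestnews table. Because the script runs every day and looks back more than one day, the same posting can be inserted twice. One possible way around that is sketched below, under clearly stated assumptions: the URL column, the save_post helper, and the use of sqlite3 as a stand-in for MySQL (so the sketch runs without a server) are all additions for illustration, not part of the original script.

```python
import sqlite3


def save_post(conn, url, title, content):
    """Insert a posting once; return True if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO requestnews (URL, TITLE, CONTENT) VALUES (?, ?, ?)",
        (url, title, content),
    )
    conn.commit()
    return cur.rowcount == 1


# In-memory database; the URL column (an assumption) acts as the unique key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requestnews (URL TEXT PRIMARY KEY, TITLE TEXT, CONTENT TEXT)"
)

first = save_post(conn, "http://example.com/post/1", "title", "body")
second = save_post(conn, "http://example.com/post/1", "title", "body")
print(first, second)
```

With MySQL the same idea would use a unique key on the URL column and `INSERT IGNORE`, with `%s` placeholders instead of `?`.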

Some of the scraped results are shown below:

[Screenshot of the scraped results]

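Use the pyinstaller command to build an .exe, then add a scheduled task in the operating system that runs it every day, and the information is collected automatically. Later, you could add a keyword check during scraping, or search the stored data in the database directly, push notifications, and so on.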
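
The keyword check could look something like the sketch below. The function name matches_keywords is hypothetical, and the keyword list is assumed to come from the `keywords` option in the configuration file, which the original script defines but never reads.

```python
def matches_keywords(title, content, keywords):
    """Return True if any keyword occurs in the title or content.

    Matching is a case-insensitive substring test; empty keywords are skipped.
    """
    haystack = f"{title}\n{content}".lower()
    return any(keyword.lower() in haystack for keyword in keywords if keyword)


keywords = ["研究生", "web"]
print(matches_keywords("Web developer wanted", "", keywords))
print(matches_keywords("招聘教师", "要求本科及以上学历", keywords))
```

Calling this before the insert in getdetail would keep only postings that mention at least one keyword.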
