Scraping the day's postings from a web page with Python (using 高校人才網 as an example)

2023-05-30 17:08:51

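Anyone who follows 高校人才網 (gaoxiaojob.com) has run into the same problems: there are too many postings, they are poorly organized, and new ones are easy to miss. Wouldn't it be convenient to have a tool that fetches the job postings in your areas of interest every day? Below, we use Python to pull the information you need as it is posted.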

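First, look at the URLs on 高校人才網: the path for each section follows a regular pattern.

For example:

Wuhan universities: http://www.gaoxiaojob.com/zhaopin/gaoxiao/wuhan/

Wuhan primary and secondary schools: http://www.gaoxiaojob.com/zhaopin/zhongxiaoxuejiaoshi/wuhan/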

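To make it easy to change the scraped sections later, we can keep this information in a configuration file together with the database settings. Create an .ini file with the following content: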

[db]
host=localhost
schema=your_database_name
user=root
psw=root
[setting]
# sections of the site to scrape
sections=gaoxiao,zhongxiaoxuejiaoshi
# areas (cities)
areas=wuhan,guangzhou
# keywords
keywords=研究所學生,web

Then create a .py file, import the required packages, and read the configuration file:

#coding=utf-8
import requests
import logging
import configparser
import traceback
import MySQLdb
import datetime
from bs4 import BeautifulSoup

logging.basicConfig(filename='log.txt', level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s')
cf = configparser.ConfigParser()
cf.read("gxrcw.ini",encoding='UTF-8')
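As a quick check of how the comma-separated settings are read, here is a minimal sketch that parses the same kind of configuration from an in-memory string instead of gxrcw.ini. The helper name `get_list` and the sample text are assumptions for illustration and are not part of the original script.

```python
import configparser

# Sample configuration mirroring the [setting] section of gxrcw.ini (assumed content)
SAMPLE_INI = """
[setting]
# sections
sections=gaoxiao,zhongxiaoxuejiaoshi
# areas
areas=wuhan, guangzhou
"""

def get_list(parser, section, option):
    """Return a comma-separated option as a list of stripped, non-empty strings."""
    raw = parser.get(section, option)
    return [item.strip() for item in raw.split(',') if item.strip()]

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)
sections = get_list(parser, "setting", "sections")
areas = get_list(parser, "setting", "areas")
print(sections, areas)
```

Stripping each item means a stray space after a comma in the .ini file does not end up inside a URL.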

Iterate over the configured values to generate the URLs of the pages to scrape:

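def postUrl(url):
    # Read both lists once; each is a comma-separated value under [setting]
    sections=cf.get("setting", "sections").split(',')
    areas=cf.get("setting", "areas").split(',')
    for section in sections:
        for area in areas:
            # e.g. http://www.gaoxiaojob.com/zhaopin/gaoxiao/wuhan/
            src=url.rstrip('/')+'/'+section+'/'+area+'/'
            gettitle(src)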
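The URL pattern can be exercised without touching the network. The sketch below is one possible way to write it, assuming the base address is http://www.gaoxiaojob.com/zhaopin/ (taken from the example URLs earlier); the function name `build_urls` is hypothetical.

```python
def build_urls(base, sections, areas):
    """Yield one listing URL per (section, area) pair, e.g. <base>/gaoxiao/wuhan/."""
    root = base.rstrip('/')  # avoid a double slash when base already ends with '/'
    for section in sections:
        for area in areas:
            yield f"{root}/{section}/{area}/"

urls = list(build_urls("http://www.gaoxiaojob.com/zhaopin/",
                       ["gaoxiao", "zhongxiaoxuejiaoshi"],
                       ["wuhan", "guangzhou"]))
for u in urls:
    print(u)
```

Separating URL generation from fetching makes it easy to inspect the list before any request is sent.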

Inspect the page structure and start scraping the listing:

def gettitle(url):
    res = requests.get(url)   # fetch the listing page for one section and area
    soup = BeautifulSoup(res.text, 'html.parser')  # parse the fetched page with BeautifulSoup
    div = soup.find('div', class_='list_zpqz1_vip')
    if div is None:
        # the layout changed or the page failed to load; nothing to scrape
        logging.warning('listing container not found: %s', url)
        return
    db = MySQLdb.connect(cf.get("db", "host"), cf.get("db", "user"), cf.get("db", "psw"), cf.get("db", "schema"), charset='utf8')
    ocursor = db.cursor()
    datenow=datetime.datetime.now()
    for link in div.find_all('div', class_='style2'):
        name = link.find('a',href=True)
        issue = link.find('span', class_='ltitle')
        issueday=issue.get_text(strip=True)
        # the label carries only month and day, so assume the current year
        issueday=str(datenow.year)+'-'+issueday[1:6].replace('.','-')
        issueday=datetime.datetime.strptime(issueday,"%Y-%m-%d")
        # keep only recent postings: with this window, those issued today or yesterday
        if(issueday+datetime.timedelta(days=2)>=datenow):
            href=name['href']
            getdetail(ocursor,href,datenow)
            db.commit()

    ocursor.close()
    db.close()
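The date check is the part most likely to go wrong, because the listing label has no year. The following sketch isolates that logic so it can be tested on its own. It assumes the label looks like `[05.30]` (inferred from the `[1:6]` slice and the `.` replacement in the code above); `parse_issue_date` and `is_recent` are hypothetical helper names.

```python
import datetime

def parse_issue_date(text, today):
    """Turn a listing label such as '[05.30]' into a date, guessing the year.

    The label carries no year, so the current year is assumed. If that would
    place the date in the future (e.g. '[12.31]' read on 1 January), the
    previous year is used instead.
    """
    month, day = (int(part) for part in text[1:6].split('.'))
    issued = datetime.date(today.year, month, day)
    if issued > today:
        issued = datetime.date(today.year - 1, month, day)
    return issued

def is_recent(issued, today, max_age_days=1):
    """True when the posting is at most max_age_days old (0 means today only)."""
    age = today - issued
    return datetime.timedelta(0) <= age <= datetime.timedelta(days=max_age_days)

today = datetime.date(2023, 5, 30)
print(is_recent(parse_issue_date('[05.30]', today), today))
print(is_recent(parse_issue_date('[05.27]', today), today))
```

Passing `max_age_days=0` restricts the result to postings issued on the current day, while the default of 1 also accepts yesterday's postings.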

Scrape the content of each posting and store it in the MySQL database:

def getdetail(ocursor,url,datenow):
    res = requests.get(url)
    res.encoding = 'gb18030'  # decode the detail page as GB18030
    soup = BeautifulSoup(res.text, 'html.parser')   # parse the fetched page with BeautifulSoup

    try:
        title=soup.find('h1', class_='title-a').get_text(strip=True)
        content= soup.find('div', class_='article_body').get_text(strip=True)

        # parameterized insert; the driver escapes title and content
        sql="insert into requestnews (TITLE,CONTENT) values (%s,%s)"
        param=(title,content)
        ocursor.execute(sql,param)
    except Exception:
        # a missing element or a failed insert is logged and the posting is skipped
        logging.error(traceback.format_exc())

Part of the scraped results is shown below:

[Image: sample of the scraped results]

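Use the pyinstaller command to build an .exe, then add a scheduled task in the operating system that runs it every day, and new postings will be collected automatically. As a follow-up, you can add keyword matching during scraping, or work directly on the data in the database to search it, push notifications, and so on.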
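The keyword check mentioned as a follow-up could look roughly like this. It is only a sketch: the function name `matches_keywords` is hypothetical, and the keyword string is copied from the `keywords` entry of the configuration file.

```python
def matches_keywords(text, keywords):
    """Return True if any non-empty keyword occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(kw.strip().lower() in lowered for kw in keywords if kw.strip())

# Same comma-separated format as the keywords entry under [setting]
keywords = "研究所學生,web".split(',')

print(matches_keywords("Web 前端開發工程師", keywords))
print(matches_keywords("行政助理", keywords))
```

Calling such a function on the title or content before the insert would keep unrelated postings out of the database.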
