"""蚌埠學院官網綜合新聞條目抓取.

Scraper for the "綜合新聞" (comprehensive news) list on the Bengbu
University website.  Originally published on 天天看點; the bare header
lines from that page were breaking the module at import time and have
been folded into this docstring.
"""

import json
import re

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

def get_one_page(url, timeout=10):
    """Download *url* and return the response body as text.

    Args:
        url: Page URL to fetch.
        timeout: Seconds to wait for the server before giving up.  The
            original code passed no timeout, so a stalled server could
            hang the whole crawl forever.

    Returns:
        The decoded HTML on HTTP 200; ``None`` on any other status code
        or on any network error.
    """
    headers = {
        # Pretend to be a desktop browser; some sites reject the default
        # python-requests User-Agent.
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    try:
        # Keep the try body minimal: only the request itself can raise
        # RequestException.
        response = requests.get(url, headers=headers, timeout=timeout)
    except RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None

def page_parser(html):
    """Yield one dict per news entry found in *html*.

    Each yielded dict has three keys:
        'href'     -- absolute URL of the news item,
        'title'    -- headline text,
        'postTime' -- publication date string.
    """
    base_url = 'http://www.bbc.edu.cn'
    soup = BeautifulSoup(html, 'lxml')
    # Every news row lives in a <td height="24"> wrapper cell.
    for row in soup.find_all(name='td', attrs={'height': 24}):
        # The inner <td> holds the link; look it up once and reuse it
        # for both the href and the title.
        inner = row.find(name='td')
        link = inner.a
        yield {
            'href': base_url + link.attrs['href'],
            'title': link.font.string,
            'postTime': row.find(class_='postTime').string,
        }

def get_pages(url):
    """Return the total number of list pages (as a string), or ``None``.

    The count is read from the '進入尾頁' ("go to last page") link, whose
    href looks like ``/s/21/t/267/p/22/i/<N>/list.htm``.

    Args:
        url: URL of any page of the news list.

    Returns:
        The page count as a decimal string, or ``None`` when the page
        could not be downloaded or the link/number is not found.
    """
    html = get_one_page(url)
    if html is None:
        # Download failed; the original code would have crashed inside
        # BeautifulSoup(None, 'lxml').
        return None
    soup = BeautifulSoup(html, 'lxml')
    tail_link = soup.find(name='a', attrs={'title': '進入尾頁'})
    if tail_link is None:
        # Site layout changed or page is malformed.
        return None
    # Extract the number from the '/i/<N>/' path segment.  A regex is
    # far less brittle than the original split('/')[8], which broke as
    # soon as the path gained or lost a segment.
    match = re.search(r'/i/(\d+)/', tail_link.attrs['href'])
    return match.group(1) if match else None

def write_to_file(content):
    """Append *content* to ``result.txt`` as a single JSON line.

    ``ensure_ascii=False`` keeps the CJK text human-readable in the
    output file.
    """
    line = json.dumps(content, ensure_ascii=False)
    with open('result.txt', 'a', encoding='utf-8') as out:
        # print() appends exactly one '\n', matching the original output.
        print(line, file=out)

def main(num=0):
    """Crawl the news list and print every entry found.

    Args:
        num: When non-zero, crawl exactly *num* pages instead of the
            page count reported by the site.
    """
    pages = get_pages('http://www.bbc.edu.cn/s/21/t/267/p/22/i/1/list.htm')
    # get_pages may return None (download/parse failure); int(None)
    # crashed the original script.
    pages = int(pages) if pages else 0
    if num:
        pages = num
    if not pages:
        return
    # Off-by-one fix: range(1, pages) skipped the last page even though
    # the summary line below claimed all of them were crawled.
    for page in range(1, pages + 1):
        url = 'http://www.bbc.edu.cn/s/21/t/267/p/22/i/' + str(page) + '/list.htm'
        html = get_one_page(url)
        if html is None:
            # Best-effort: skip pages that failed to download instead of
            # crashing in page_parser(None).
            continue
        for item in page_parser(html):
            print(item)
    print('抓取了: ' + str(pages) + '頁綜合新聞')

if __name__ == '__main__':
    # Crawl only the first 20 pages when run as a script.
    main(20)